[LRUG] Sphinx Data Sources
Tom Stuart
tom at experthuman.com
Thu May 28 02:10:25 PDT 2009
On 28 May 2009, at 09:50, Andrew Stewart wrote:
> Does anybody know of Sphinx drivers for PDF or Word documents?
> Something using Sphinx's xmlpipe2 data source would be fine.
I might be stating the bloody obvious here, but what you want to do is
turn your PDF or Word document into plain text, then build an XML
document with that plain text stuck into some element (<content> or
whatever) inside <sphinx:document>, then feed that to xmlpipe2.
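For what it's worth, here is a rough sketch of that wrapping step in Python. The build_docset name and the "content" field name are just placeholders I've picked for illustration; the sphinx:docset / sphinx:schema / sphinx:document layout is the xmlpipe2 format as I understand it from the Sphinx docs. The text is XML-escaped, and characters that aren't legal in XML 1.0 (pdftotext emits form feeds between pages, for instance) are replaced with spaces.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 character range (e.g. form feeds) would make
# the stream ill-formed, so it is replaced with a space.
_INVALID_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def build_docset(docs):
    """Return an xmlpipe2 stream for an iterable of (doc_id, plain_text).

    Hypothetical helper: the single full-text field is called "content",
    which is an arbitrary choice.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, text in docs:
        clean = _INVALID_XML.sub(" ", text)
        parts.append('<sphinx:document id="%d">' % doc_id)
        # escape() handles &, < and > so the text can't break the markup.
        parts.append("<content>%s</content>" % escape(clean))
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"
```

A script that prints build_docset(...) to stdout should then be usable as the xmlpipe_command for an xmlpipe2 source, assuming every document gets a unique positive integer id.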
So all you need here is something that'll munge the source document
into vaguely-acceptable plain text. pdftotext works pretty well for
PDFs; wvText works pretty well for Word. You don't really care whether
the output looks nice or not, as long as it contains all the words
from the original document -- Sphinx only keeps an index of the words,
and the text itself gets thrown away once indexing is done.
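The munging step could look something like the Python sketch below. It assumes pdftotext and wvText are installed and on the PATH, picks the converter from the file extension, and has both tools write to a temporary file that is then read back. The extract_command and extract_text names are made up for this example, and forcing UTF-8 output with pdftotext's -enc option is my own choice rather than anything required.

```python
import os
import subprocess
import tempfile


def extract_command(src, dest):
    """Build the converter command line for src, writing plain text to dest."""
    ext = os.path.splitext(src)[1].lower()
    if ext == ".pdf":
        # pdftotext [options] PDF-file text-file
        return ["pdftotext", "-enc", "UTF-8", src, dest]
    if ext == ".doc":
        # wvText input.doc output.txt
        return ["wvText", src, dest]
    raise ValueError("no converter for %r" % src)


def extract_text(src):
    """Run the converter and return whatever plain text it produced."""
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "out.txt")
        subprocess.run(extract_command(src, dest), check=True)
        # Ugly output is fine; undecodable bytes are replaced, not fatal.
        with open(dest, encoding="utf-8", errors="replace") as fh:
            return fh.read()
```

Decoding with errors="replace" is deliberate: since only the words matter, a few mangled characters are a better outcome than a failed indexing run.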
Cheers,
-Tom