[LRUG] Sphinx Data Sources

Tom Stuart tom at experthuman.com
Thu May 28 02:10:25 PDT 2009


On 28 May 2009, at 09:50, Andrew Stewart wrote:
> Does anybody know of Sphinx drivers for PDF or Word documents?   
> Something using Sphinx's xmlpipe2 data source would be fine.

I might be stating the bloody obvious here, but what you want to do is  
turn your PDF or Word document into plain text, then build an XML  
document with that plain text stuck into some element (<content> or  
whatever) inside <sphinx:document>, then feed that to xmlpipe2.
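
As a rough sketch of what that XML could look like when generated from a script (Python here purely for illustration; the function name xmlpipe2_docset and the field name "content" are my own choices, not anything Sphinx requires), assuming you already have the plain text for each document and a unique non-zero integer id to go with it:

```python
import re
from xml.sax.saxutils import escape

# Control characters that XML 1.0 does not allow (converters such as
# pdftotext can emit form feeds between pages, for instance).
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xmlpipe2_docset(docs):
    """Build an xmlpipe2 stream from an iterable of (doc_id, text) pairs.

    doc_id is assumed to be a unique non-zero integer; the single
    full-text field is called "content" here, which is an arbitrary name.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, text in docs:
        # Replace illegal control characters, then escape &, < and >.
        clean = _INVALID_XML.sub(" ", text)
        parts.append('<sphinx:document id="%d">' % doc_id)
        parts.append("<content>%s</content>" % escape(clean))
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"
```

A script that prints this string to stdout is the sort of thing you would then point the source's xmlpipe_command at in sphinx.conf.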

So all you need here is something that'll munge the source document  
into vaguely-acceptable plain text. pdftotext works pretty well for  
PDFs; wvText works pretty well for Word. You don't really care whether
the output looks nice or not, as long as it contains all the words
from the original document -- Sphinx only builds its index from the
text and doesn't keep the content itself.
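
The conversion step might be wrapped up something like this. It's only a sketch: the helper names extract_command and extract_text are made up, it picks the converter purely by file extension, it assumes pdftotext and wvText are on your PATH and both accept "input output" as arguments, and it assumes the converter writes UTF-8 (undecodable bytes are replaced rather than raising).

```python
import os
import subprocess
import tempfile


def extract_command(path, out_path):
    """Return the argv that converts `path` to plain text at `out_path`."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, out_path]
    if ext == ".doc":
        return ["wvText", path, out_path]
    raise ValueError("no converter for %r" % path)


def extract_text(path):
    """Run the external converter and return whatever text it produced."""
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(extract_command(path, out_path), check=True)
        # Assumption: output is UTF-8; bad bytes are replaced, since
        # only the words matter for indexing.
        with open(out_path, encoding="utf-8", errors="replace") as f:
            return f.read()
    finally:
        os.remove(out_path)
```

The output of extract_text is what would go inside the element in each <sphinx:document>, after XML-escaping.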

Cheers,
-Tom


