[LRUG] Using Ruby for semantic analysis and categorising documents - first steps?

Frederick Cheung frederick.cheung at gmail.com
Mon Aug 20 07:15:42 PDT 2012


On 20 Aug 2012, at 15:05, Chris Adams <mail at chrisadams.me.uk> wrote:

> Hi all,
> 
> I'm trying to spec out a feature at work, to sift through a load of text in case studies or similar articles, categorise them according to some pre-determined criteria, and present them later to users of an app we're building, to help them discover useful steps their business can take to reduce emissions.
> 
> So far, we've been looking through case studies manually to get an idea of the shape of the data and to work out how we might retrieve it later on. Right now we're relying on people to understand the content and categorise it, which feels like something screaming out to be automated, if we can get half-decent results from semantic analysis tools.
> 

If you're trying to classify a set of documents into a known set of classes then liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) has a bunch of linear classifiers that can be applied to this sort of thing. You will need a decent amount of pre-classified data to train it on though. I spent a while trying to get the 'official' ruby bindings for it to do what I wanted before getting frustrated with SWIG and writing my own (https://rubygems.org/gems/ruby_linear). You may want to be a little careful about your choice of features: sometimes using n-grams of words (i.e. pairs, triplets, quartets etc. of consecutive words) can work a lot better than just using individual words. I've never used any of the term extraction APIs though.
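
To make the n-gram point concrete, here is a minimal sketch in plain Ruby of turning a document into sparse word n-gram counts. The helper names (tokenize, ngrams, Vocabulary, features) are made up for illustration and are not part of ruby_linear or liblinear; the 1-based integer feature indices follow liblinear's sparse index:value data format, and whether a given binding wants exactly this hash shape is an assumption to check against its docs.

```ruby
# Illustrative sketch only: every name below is hypothetical and
# none of it comes from ruby_linear or liblinear.

# Split text into lowercase word tokens, dropping punctuation.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Build word n-grams for each size in `sizes`
# (1 = single words, 2 = pairs of consecutive words, ...).
def ngrams(tokens, sizes = [1, 2])
  sizes.flat_map do |n|
    tokens.each_cons(n).map { |gram| gram.join(' ') }
  end
end

# Assigns each distinct n-gram a stable integer index, starting at 1.
class Vocabulary
  def initialize
    @index = {}
  end

  def id_for(term)
    @index[term] ||= @index.size + 1
  end

  def size
    @index.size
  end
end

# Sparse feature vector for one document: { feature index => count }.
def features(text, vocabulary, sizes = [1, 2])
  counts = Hash.new(0)
  ngrams(tokenize(text), sizes).each do |term|
    counts[vocabulary.id_for(term)] += 1
  end
  counts
end

vocabulary = Vocabulary.new
sample = features('Reduce emissions, reduce costs', vocabulary)
```

The same Vocabulary instance would need to be shared between training and prediction so that a given n-gram always maps to the same index. Adding bigrams grows the feature space quickly, so it is worth comparing results with and without them on the pre-classified data.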

Fred


> On the surface this sounds like a job for some OpenCalais-based service, or a gem like SemExtractor[1], as a first pass, with manual clean-up afterwards, but I'm not really familiar enough with semantic analysis tools to know yet whether what I'm doing is a fool's errand.
> 
> Before I lose a few days trying to learn about the quirks of various term-extraction APIs, I wanted to ask: if you've done this already, what tips do you wish you'd been given before you spent a couple of days on this problem?
> 
> We're working with Rails 3, and totally open to using something like Solr or Elasticsearch for parts of this, if it helps.
> 
> Thanks, and apologies in advance for the somewhat open ended question.
> 
> C
> 
> 
> 
> 
> [1]: https://github.com/apneadiving/SemExtractor
> 
> 
> -- 
> Chris Adams
> mobile: 07974 368 229
> twitter: @mrchrisadams
> www: chrisadams.me.uk
> 
