[LRUG] Using Ruby for semantic analysis and categorising documents - first steps?

Jan Szumiec jan.szumiec at gmail.com
Mon Aug 20 07:24:00 PDT 2012


You may want to look at LDA (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and its implementation in Vowpal Wabbit (http://hunch.net/~vw/).
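
To make that concrete, here is a rough sketch of driving VW's online LDA from Ruby: write the documents out in VW's input format, then shell out to the vw binary. The file names, topic count and documents array are just placeholders, and the flags can vary between VW versions, so check vw --help on whatever you have installed:

    # Assumes the vw binary is on your PATH and that documents is an array of
    # plain-text strings - both are stand-ins for your own setup.
    documents = ["first case study text ...", "second case study text ..."]

    # Write one unlabelled VW example per document: "| word:count word:count ..."
    File.open("docs.vw", "w") do |f|
      documents.each do |doc|
        counts = Hash.new(0)
        doc.downcase.scan(/[a-z]+/) { |w| counts[w] += 1 }
        f.puts "| " + counts.map { |word, n| "#{word}:#{n}" }.join(" ")
      end
    end

    # Train a 20-topic LDA model and dump it in human-readable form.
    system("vw --lda 20 -d docs.vw -b 16 --passes 10 " \
           "--cache_file docs.cache --readable_model topics.txt")

The readable model dump should give you per-word topic weights to eyeball; whether the topics it finds line up with your pre-determined categories is still something you'd have to judge by hand.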

--  
Jan Szumiec
+44 756 367 1812


On Monday, August 20, 2012 at 3:15 PM, Frederick Cheung wrote:

>  
> On 20 Aug 2012, at 15:05, Chris Adams <mail at chrisadams.me.uk> wrote:
>  
> > Hi all,
> >  
> > I'm trying to spec out a feature at work to sift through a load of text in case studies and similar articles, categorise them according to some pre-determined criteria, and present them later to users of an app we're building, to help them discover useful steps their business can take to reduce emissions.
> >  
> > So far we've been looking through case studies manually to get an idea of the shape of the data and to work out how we might retrieve it later on. Right now we're relying on people to understand the content and categorise it, which feels like something screaming out to be automated, if we can get half-decent results from semantic analysis tools.
>  
> If you're trying to classify a set of documents into a known set of classes, then liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) has a bunch of classifiers that can be applied to this sort of thing, though you will need a decent amount of pre-classified data to train on. I spent a while trying to get the 'official' Ruby bindings for it to do what I wanted before getting frustrated with SWIG and writing my own (https://rubygems.org/gems/ruby_linear). You may also want to be a little careful about your choice of features - sometimes using n-grams (i.e. pairs, triplets, quartets etc.) of words can work a lot better than just using individual words. I've never used any of the term extraction APIs, though.
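> 
> For instance, pulling word n-grams out of a document is only a few lines of plain Ruby. This sketch covers just the feature-extraction side; the ruby_linear wiring is left out:
> 
>     # Count word n-grams (unigrams, bigrams, trigrams) in a document.
>     # The tokenisation is deliberately naive; swap in whatever suits your text.
>     def ngram_features(text, max_n = 3)
>       words = text.downcase.scan(/[a-z']+/)
>       features = Hash.new(0)
>       (1..max_n).each do |n|
>         words.each_cons(n) { |gram| features[gram.join("_")] += 1 }
>       end
>       features
>     end
> 
>     ngram_features("reduce emissions by insulating the office")
>     # => {"reduce"=>1, ..., "reduce_emissions"=>1, "reduce_emissions_by"=>1, ...}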
>  
> Fred
>  
>  
> > On the surface this sounds like something I could tackle with an OpenCalais-based service, or a gem like SemExtractor[1], as a first pass, and then clean up manually, but I'm not really familiar enough with semantic analysis tools to know whether this is a fool's errand or not yet.
> >  
> > Before I lose a few days learning the quirks of various term-extraction APIs, I wanted to ask: if you've done this already, what tips do you wish you'd been given before you spent a couple of days on this problem?
> >  
> > We're working with Rails 3, and totally open to using something like SOLR or Elastic Search for parts of this, if it helps.
> >  
> > Thanks, and apologies in advance for the somewhat open-ended question.
> >  
> > C
> >  
> > [1]: https://github.com/apneadiving/SemExtractor
> >  
> >  
> > --  
> > Chris Adams
> > mobile: 07974 368 229
> > twitter: @mrchrisadams
> > www: chrisadams.me.uk
> >  





