[LRUG] Using Ruby for semantic analysis and categorising documents - first steps?

Mon Aug 20 07:33:45 PDT 2012

On 20 August 2012 15:05, Chris Adams <mail at chrisadams.me.uk> wrote:
> Hi all,
>
> I'm trying to spec out a feature at work, to sift through a load of text in
> case studies or similar articles, and categorise them according to some
> pre-determined criteria, and present them later to users of an app we're
> building, to help them discover useful steps their business can take to reduce
> emissions.

Sounds like a textbook classification problem to me! As Frederick
mentions, you should be able to use bindings to liblinear [1] or
libsvm [2] to implement a classifier based on a Support Vector Machine
[3], and train it with the data you've already manually classified.

You'll also need to transform your text into a form that provides a
decent feature set for the SVM model to learn from. This would
typically involve tokenizing [4] each document and stemming [5] the
resulting tokens, then using a weighting scheme such as tf-idf [6] to
assign a numeric score to each word in each document. There are
libraries ([7], [8]) for C Ruby that handle the stemming and
tokenizing for you.

You might also get some mileage out of using the Apache Mahout [9] and
Lucene [10] libraries with JRuby to do some of this work -
<gratuitous_plug>I did a talk at last year's Ruby Manor about
precisely this</gratuitous_plug> [11].

I'd also recommend reading up on how and why you should divide your
manually classified data into training, validation and test sets [12]
when working on supervised machine-learning problems, if you're not
already familiar with the concept!
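
To make the feature-extraction and data-splitting steps a bit more
concrete, here is a minimal sketch in plain Ruby with no gems. It is
only an illustration under some loud assumptions: the method names
(tokenize, tfidf, three_way_split) are made up for this example, the
tokenizer is a naive regex rather than the tokenizer gem, stemming is
skipped entirely, the tf-idf variant is the simple "relative frequency
times log(N/df)" one, and the 60/20/20 split is an arbitrary default.

```ruby
# Hypothetical helper: lowercase the text and pull out runs of letters.
# A real pipeline would use a proper tokenizer and a stemmer here.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Returns one Hash per document, mapping each term to its tf-idf weight.
#   tf  = occurrences of the term in the document / tokens in the document
#   idf = log(number of documents / number of documents containing the term)
def tfidf(documents)
  tokenized = documents.map { |doc| tokenize(doc) }

  # Count how many documents each term appears in.
  doc_freq = Hash.new(0)
  tokenized.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }

  n = documents.size.to_f
  tokenized.map do |tokens|
    counts = Hash.new(0)
    tokens.each { |term| counts[term] += 1 }
    counts.each_with_object({}) do |(term, count), weights|
      weights[term] = (count / tokens.size.to_f) * Math.log(n / doc_freq[term])
    end
  end
end

# Shuffles labelled examples (with a fixed seed, so the split is
# repeatable) and cuts them into training, validation and test sets.
# The fractions are assumptions, not recommendations.
def three_way_split(examples, train: 0.6, validation: 0.2, seed: 42)
  shuffled = examples.shuffle(random: Random.new(seed))
  train_end = (shuffled.size * train).round
  validation_end = train_end + (shuffled.size * validation).round
  {
    training: shuffled[0...train_end],
    validation: shuffled[train_end...validation_end],
    test: shuffled[validation_end..-1]
  }
end
```

Each Hash returned by tfidf would still need mapping onto a fixed
vocabulary index before it could be handed to liblinear or libsvm as a
sparse feature vector. Note that a term appearing in every document
gets a weight of zero under this scheme, which is the point: it tells
the classifier nothing about which category a document belongs to.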

Hope this is useful, and good luck! It sounds like a pretty fun
problem to work on.

Tim

REFERENCES
--------------
[1] http://www.csie.ntu.edu.tw/~cjlin/liblinear/
[2] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[3] http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
[4] http://en.wikipedia.org/wiki/Tokenization
[5] http://en.wikipedia.org/wiki/Stemming
[6] http://en.wikipedia.org/wiki/Tf*idf
[7] https://github.com/aurelian/ruby-stemmer
[8] http://rubygems.org/gems/tokenizer
[9] http://mahout.apache.org/
[10] http://lucene.apache.org/core/
[11] http://timcowlishaw.co.uk/post/16004059224/jruby-on-elephants
[12] http://stackoverflow.com/questions/2976452/whats-the-diference-between-train-validation-and-test-set-in-neural-networks