[LRUG] Multi-threading, Ruby & Rails

Tim Cowlishaw tim at timcowlishaw.co.uk
Mon Sep 17 15:30:58 PDT 2012


On 17 September 2012 22:38, Roland Swingler <roland.swingler at gmail.com> wrote:

>> b) Throw it out to cloud-like infrastructures like Hadoop/MapReduce, but the problem needs direct SQL access and that can get messy
>
> I've not tried it and I don't know whether you need SQL or "SQL-like"
> but there are things like hive http://hive.apache.org/ built on top of
> hadoop that may be of some use?
>

If I recall correctly, Hive provides a SQL-like querying layer on top
of data that's stored in a Hadoop (HDFS) cluster, rather than
providing integration with a SQL db. However, there's a DB input
format [1] for Hadoop that lets you use the rows returned by a DB
query as the input to a MapReduce job, which might be helpful in this
case. It depends a little on the complexity of the query: in my
fairly limited experience, doing complex joins can get rather messy,
although there are patterns for writing MR jobs that alleviate this.
Nathan Marz's 'Big Data' book [2], which is in Manning's early access
programme at the moment, is still in its infancy, but it looks like
it's going to become a good reference for this sort of stuff when
it's published, as is Manning's Hadoop book [3].

Of course, using Hadoop would mean embracing some Java-ish
infrastructure to a greater or lesser extent. You could use MRI Ruby
to run your jobs with Hadoop streaming, but Hadoop itself is still a
Java tool. Alternatively, you can use JRuby to access the Java APIs
directly, and if you're going down this road then:

> JRuby threads are Java threads, so you get their benefits - i.e.
> proper use of all cores, no global interpreter lock.

...which might give you the performance increase you need without the
extra overhead of setting up and maintaining a Hadoop cluster. If you
go down this route but are keen to use something higher-level than
threads, locks, mutexes etc., then you might want to take a look at
Akka [4], an Erlang-ish library for actor-based concurrency on the
JVM. It's written and maintained by Typesafe, the Scala guys, but is
usable from any other JVM language too (and it looks like people have
had some success using it with JRuby [5]), so it might prove fruitful
if you decide that JRuby's the way you want to go.
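
Going back to plain threads for a moment, here's a rough sketch of
what "split the work across threads" might look like. It's only an
illustration: `expensive` and `parallel_sum` are names I've made up,
and `expensive` is a stand-in for whatever your real per-row work is.
On JRuby each Thread is backed by a JVM thread, so the slices can run
on separate cores; the same code runs on MRI too, but the global lock
means CPU-bound slices take turns rather than running in parallel.

```ruby
# Hypothetical stand-in for the real CPU-bound work done per input.
def expensive(n)
  (1..n).inject(0) { |acc, i| acc + i * i }
end

# Split `inputs` into roughly `thread_count` slices, process each slice
# on its own thread, then combine the per-slice results.
def parallel_sum(inputs, thread_count)
  slice_size = [(inputs.size / thread_count.to_f).ceil, 1].max

  threads = inputs.each_slice(slice_size).map do |slice|
    # Pass the slice in as a block argument so each thread works on
    # its own local reference.
    Thread.new(slice) do |rows|
      rows.map { |n| expensive(n) }.inject(0, :+)
    end
  end

  # Thread#value waits for the thread to finish and returns the value
  # of its block, so no explicit join or shared accumulator is needed.
  threads.map(&:value).inject(0, :+)
end

puts parallel_sum((1..200).to_a, 4)
```

Each thread only touches its own slice and hands back a result, so
there's no shared mutable state and no need for a mutex in this
particular case.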

Hope this helps!

Tim

REFERENCES
------------------

[1] http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
[2] http://www.manning.com/marz/
[3] http://www.manning.com/lam/
[4] http://akka.io/
[5] http://metaphysicaldeveloper.wordpress.com/2010/12/16/high-level-concurrency-with-jruby-and-akka-actors/


