[LRUG] Linked Data & Ruby

Tom Morris tom at tommorris.org
Tue Jun 30 13:10:22 PDT 2009


On Tue, Jun 30, 2009 at 17:22, Anthony Green <Anthony.Green at bbc.co.uk> wrote:
>>   People just don't, for better or worse, seem to have much of a lust for it.
>
> The slides implied that Ruby people don't, but that (some part of?) the
> Python community has, if not a lust for it, made not inconsiderable efforts
> into producing tools for working with RDF.
>
> A quick Google of RDF and Python does seem to indicate that to be the case.
>

Well, yes, there's rdflib in Python, which is quite mature (the few
rough edges I've found are SPARQL-related), and there are the Python
bindings for Redland. There's also cwm, a useful command-line hacking
tool for RDF geeks, which is written in Python (if you ever want to
feel giddy, download the complete CVS of cwm - there are over 100
megabytes of tests in there, while the tarball of cwm-1.2.1 is 608 K).
I don't think this means that Python is heavily invested in RDF:
there's even more code in Java, C and Perl for RDF processing. Python
has been popular with some in the RDF community because it's pretty
much become the hacking language of choice at the W3C and among some
of the people who work in RDF land.

For Ruby, you can use the Ruby bindings for Redland, though it's not
the most user-friendly library you'll find. If you are using JRuby,
just download Jena and the various Jena plugin jars and use those.

A while back I started work on Reddy, a native Ruby RDF library -
<http://github.com/tommorris/reddy/>. The RDF/XML parsing depends on
Addressable and LibXML-Ruby, while the Notation3 parsing is done using
Treetop. Version 0.0.1 (codename: Abject Pessimism) is on Rubyforge,
so sudo gem install reddy if you want to admire my deep-alpha work.

It's not particularly useful though, and needs a lot of work. I'm too
busy to work on it at the moment, but please fork, hack and patch. I'm
busy with academic work until at least September. After that, I'll be
able to get back to work on Reddy and put a less shitty release out.

My plans for when I get back into working on Reddy are:
1. Change the implementation of literals so they are based on either
ActiveSupport::Multibyte, java.lang.String (in JRuby) or something
else that's UTF-8 aware.
2. Rewrite the RDF/XML parser to sit on top of Nokogiri (partly
because Nokogiri is a lot less hassle than libxml-ruby and also
because JRuby 1.3.0 works with Nokogiri brilliantly).
3. Write a serializer back into RDF/XML.
4. Write a Nokogiri-based RDFa parser.
5. Write an RDF-EASE parser.
6. Perhaps implement a subset of SPARQL.
7. More and better unit test coverage.
8. Maybe add OWL and OWL 2 based features - perhaps following the
OntoModel approach of Chris Bizer's RDF API for PHP (RAP).
9. SPARQL backend compatibility.

Contributing to Reddy is pretty easy: the code isn't particularly
complex (although my first go at the RDF/XML parser sucks), and the
W3C provide a pretty comprehensive test suite for the RDF/XML, RDFa
and SPARQL specifications.

Despite Reddy's crapness, Patrick Sinclair at the BBC has successfully
written an adapter so you can use it with ActiveRDF in Rails -
<http://github.com/metade/activerdf_reddy>

That's the state of consuming RDF in Ruby. As for producing it:
RDF/XML is XML, so you can use XML Builder in Rails. RDFa should be
pretty easy to add to ERb. HAML-heads can have fun with that.
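For instance, here's a minimal sketch of producing RDF/XML with
Ruby's standard REXML library (Builder's block syntax reads much the
same); the FOAF person and URI here are made up for illustration:

```ruby
require 'rexml/document'

# Build a tiny RDF/XML document describing one person (made-up data).
doc = REXML::Document.new
doc << REXML::XMLDecl.new('1.0', 'UTF-8')

rdf = doc.add_element('rdf:RDF',
  'xmlns:rdf'  => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
  'xmlns:foaf' => 'http://xmlns.com/foaf/0.1/')

person = rdf.add_element('foaf:Person',
  'rdf:about' => 'http://example.org/people/jane')
person.add_element('foaf:name').text = 'Jane'

output = String.new
doc.write(output, 2)  # pretty-print with 2-space indent
puts output
```

The same structure in Builder is just nested blocks instead of
add_element calls.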

As for why Rubyists should care about linked data? Because (a) APIs
are boring and (b) writing wrapper code around APIs is also boring.
RDF and SPARQL together provide a really nice alternative to SQL - the
way I think about it is that RDF/SPARQL are to SQL what dynamic typing
is to static typing. A triplestore basically lets you chuck any RDF
data into what is essentially a bucket, and then sift through it. You
don't have to carefully map out a data schema ahead of time. Once
you've got all the junk in the bucket, you write your query and get
back XML or JSON (for SELECT queries), a boolean value (for ASK
queries), or RDF (for DESCRIBE and CONSTRUCT queries).

Here's an example - dbpedia is basically Wikipedia as linked data.
What dbpedia does is parse all of the templates, infoboxes,
categories and other common patterns on Wikipedia and dump all that
back into a triplestore. The parsing code is all written in PHP5, and
an HTTP endpoint to the data is provided by an open source server
called Virtuoso.

Here is a query written in SPARQL that runs on dbpedia:

SELECT ?url ?name
WHERE {
  ?url <http://dbpedia.org/property/wikiPageUsesTemplate>
       <http://dbpedia.org/resource/Template:beauty_pageant> .
  ?url <http://dbpedia.org/property/winner> ?name .
}

Basically, this finds any two nodes that match both of the triple
patterns. It gives you back an XML, JSON or HTML representation of a
table with two columns: one with the dbpedia URI for a beauty contest
and the other with either a dbpedia URI or a string representing the
winner of that contest. (I apologise for the tackiness: it was just a
serendipitous discovery amid the sheer volume of stuff on Wikipedia
and dbpedia.)
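Running a query like that from Ruby needs nothing fancy: SELECT
results come back in a standard JSON shape (head/vars plus
results/bindings) that the stdlib can unpack. A hedged sketch, with a
canned sample response standing in for a live HTTP call to the
dbpedia endpoint (the winner's name below is invented):

```ruby
require 'json'

# Turn a SPARQL SELECT result (the standard JSON serialisation) into
# an array of { var => value } hashes. The sample payload below is
# made up, but follows the head/vars + results/bindings layout.
def parse_sparql_select(json_text)
  data = JSON.parse(json_text)
  data['results']['bindings'].map do |row|
    row.each_with_object({}) { |(var, cell), h| h[var] = cell['value'] }
  end
end

sample = <<JSON
{ "head": { "vars": ["url", "name"] },
  "results": { "bindings": [
    { "url":  { "type": "uri",     "value": "http://dbpedia.org/resource/Miss_Universe" },
      "name": { "type": "literal", "value": "Example Winner" } }
  ] } }
JSON

rows = parse_sparql_select(sample)
puts rows.inspect
```

In real use you'd POST the query to the endpoint with Net::HTTP and
an Accept header of application/sparql-results+json, then feed the
body to the same parser.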

RDF makes merging disparate data easy. If I gave you two sets of rows
from databases or two XML documents or JSON structures and told you to
merge them, the strategy you use would be dependent on the semantics
of the data. Let's say you had:
1. { name = Jane, gender = female }
2. { name = Jane, age = 30 }
How do you merge that? In RDF, the solution is always clear:
3. { name = Jane, age = 30, gender = female }
That's because the data model is additive, following some fairly easy
rules about identity and datatypes. But in any other data model, how
do you do it? Perhaps in a SQL database, Jane has actually removed one
of those properties and it is now null. XML might have specific
ordering or cardinality rules about elements. Obviously, where this
comes in useful is merging data relating objects together.
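If you model each graph as a set of subject/predicate/object triples,
the Jane merge above really is just set union; a rough sketch (the
node URI is invented):

```ruby
require 'set'

# Two RDF graphs as sets of [subject, predicate, object] triples.
jane = 'http://example.org/people/jane'  # made-up node URI

graph_a = Set[[jane, 'name', 'Jane'], [jane, 'gender', 'female']]
graph_b = Set[[jane, 'name', 'Jane'], [jane, 'age', '30']]

# Merging RDF graphs is set union: statements accumulate, duplicates
# collapse, and nothing is ever contradicted or nulled out.
merged = graph_a | graph_b
puts merged.size  # 3 distinct statements
```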

I'm planning to rebuild my personal site soon in Rails (from PHP5) and
it will use a lot of linked data, a lot of RDFa and a lot of SPARQL.
I'm planning to have no database at all. Just a triple-store and
SPARQL. There will still be model classes, but they'll just abstract
away the triple store. I'm thinking that method_missing on the
triple-store-based classes will basically run something roughly like:

SELECT DISTINCT ?data
WHERE { <#{self.uri}> ?p ?data .
FILTER(REGEX(str(?p), "#{method_name}$", "i")) }

So, if you call User.username, it'd call to the database and search
for everything about the user, then filter out only those with a
string matching "username". I may actually keep a SQL store just for
login details and other sensitive stuff, at least until I figure out
how to do permissions properly on whatever triple store I use.
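A rough sketch of that method_missing idea - query generation only,
no actual endpoint; the class name and example URI are invented:

```ruby
# A model class that turns unknown method calls into SPARQL queries
# against a triple store. Only the query-building half is sketched
# here; run_query would POST to whatever SPARQL endpoint you're using.
class TripleBackedModel
  attr_reader :uri

  def initialize(uri)
    @uri = uri
  end

  def method_missing(method_name, *args)
    query = <<~SPARQL
      SELECT DISTINCT ?data
      WHERE { <#{uri}> ?p ?data .
      FILTER(REGEX(str(?p), "#{method_name}$", "i")) }
    SPARQL
    query  # in real code: run_query(query)
  end
end

user = TripleBackedModel.new('http://example.org/users/jane')
puts user.username
```

Calling user.username builds a query that fetches every property of
the user's node and keeps only those whose predicate URI ends in
"username" - exactly the filter-by-suffix trick described above.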

If this stuff is interesting to people, Mark Birbeck is giving a talk
about RDFa at Skills Matter on July 13:
<http://skillsmatter.com/event/ajax-ria/the-possibilities-of-rdfa-and-the-semantic-web>

Rubyists taking their first steps into the strange world of the
Semantic Web can always consult the nice people on irc.freenode.net
#swig

Over and out,

-- 
Tom Morris
http://tommorris.org/
