[LRUG] Parsing text

Paul Robinson paul at iconoplex.co.uk
Wed May 14 02:44:26 PDT 2014


On 14 May 2014 08:49, Andrew Stewart <boss at airbladesoftware.com> wrote:

> I'm sure this is trivial with the correct approach.



Hint: it confuses the heck out of most postgrads in Computer Science
departments, so, yeah, not quite...



> - In 2006 before SASS etc existed, I wrote a Rails plugin for nested CSS.
>  It read a nested stylesheet and flattened it into normal CSS.  Back then I
> wasn't sure how to parse a nested stylesheet...and I still don't know how.
>  (Stop laughing at the back!)
>


That use case sounds almost perfect for using a PEG as a starting point (as
already pointed out). Start here:

http://www.rubyinside.com/writing-parsers-in-ruby-using-treetop-3911.html
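
To give a feel for the shape of it without installing anything, here is a rough hand-written sketch. It is not Treetop and not a real PEG library, just a small recursive-descent parser built on Ruby's standard StringScanner, laid out the way a PEG grammar would be (one method per rule, ordered choice between a declaration and a nested rule). The class name NestedCssFlattener and the tiny grammar are my own assumptions for illustration; it ignores comments, @media blocks, strings containing braces and so on.

```ruby
require "strscan"

# Hypothetical illustration only: flattens nested rules into plain CSS.
# Grammar, PEG-style:
#   stylesheet  <- rule*
#   rule        <- selector '{' (declaration / rule)* '}'
#   declaration <- [^{};]+ ';'
class NestedCssFlattener
  def initialize(source)
    @s = StringScanner.new(source)
    @rules = [] # pairs of [full_selector, declarations]
  end

  def flatten
    skip_ws
    until @s.eos?
      parse_rule(nil)
      skip_ws
    end
    @rules.reject { |_, decls| decls.empty? }
          .map { |selector, decls| "#{selector} { #{decls.join(' ')} }" }
          .join("\n")
  end

  private

  def skip_ws
    @s.skip(/\s+/)
  end

  def parse_rule(parent)
    selector = @s.scan(/[^{};]+/) or raise "expected selector at #{@s.pos}"
    @s.skip(/\{/) or raise "expected '{' at #{@s.pos}"
    # Prefix the child selector with its parent's full selector.
    full = [parent, selector.strip].compact.join(" ")
    decls = []
    @rules << [full, decls]
    skip_ws
    until @s.skip(/\}/)
      raise "unclosed block for #{full}" if @s.eos?
      # Ordered choice: try a declaration first, else recurse into a nested rule.
      if (decl = @s.scan(/[^{};]+;/))
        decls << decl.strip
      else
        parse_rule(full)
      end
      skip_ws
    end
  end
end

flat = NestedCssFlattener.new("a { color: red; b { color: blue; } }").flatten
puts flat
```

The recursion is what replaces the state variable: each nested block is just another call to parse_rule with the parent's selector passed down. A generated PEG parser does essentially the same thing, with the grammar written declaratively instead of by hand.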



> - A few months ago I needed to convert hospital admissions records from a
> PDF to CSV.  Each record had fields like id, name, various dates, clinical
> history, attending doctor, etc.  The fields weren't always in the same
> order due to the layout of text in the PDF, and some fields were optional.
>  Sometimes there were several fields on a line, and a field could be spread
> over several lines.  I did my usual thing of looping over each line,
> matching field names with regular expressions, and trying to keep track of
> where I was with a state variable.  Its sole virtue was that it (sort of)
> worked; otherwise it was horrible: hard to understand, hard to modify, hard
> to extend, and very hard to debug.
>


Short version: go and read *Metaprogramming Ruby* and you will
immediately see how to do this using #send and #method_missing in a much
more elegant way.

Long version: I actually think every Ruby coder should read that book
because then you understand what your code is actually doing, rather than
just saying "well, this is how I used to do it in
[perl/python/bash/php/java/whatever]" and producing huge procedural blobs
of unmaintainable mess. It's my favourite Ruby book by some margin.

In this use case, let's assume you have a PatientRecord object with some
accessors defined:

class PatientRecord
  attr_accessor :id, :name, :date # [...]
end

Assuming @current_page.attributes returns an array of attribute hashes
that look like {:name => "id", :value => 12345}, I can loop over them and
just write this:

@record = PatientRecord.new
@current_page.attributes.each { |attribute|
  @record.send("#{attribute[:name]}=", attribute[:value])
}
puts @record.inspect
# => #<PatientRecord:0x... @id=12345, @name="John Smith", @date=2013-12-25 [...]>

No state. No ifs or case statements. The entire parsing (excluding getting
the data into a sort-of-consistent format) is one line.
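
If you want something you can paste into irb, here is a self-contained sketch of the same idea. The attributes array stands in for whatever @current_page.attributes would return, and the sample values are made up. The respond_to? guard is my own addition here, on the assumption that you'd rather skip an unknown field than blow up with NoMethodError.

```ruby
class PatientRecord
  attr_accessor :id, :name, :date
end

# Hypothetical stand-in for @current_page.attributes (sample data only).
attributes = [
  { :name => "id",      :value => 12345 },
  { :name => "name",    :value => "John Smith" },
  { :name => "date",    :value => "2013-12-25" },
  { :name => "unknown", :value => "ignored" }
]

record = PatientRecord.new
attributes.each do |attribute|
  setter = "#{attribute[:name]}="
  # Only call setters the record actually defines; unknown fields are skipped.
  record.send(setter, attribute[:value]) if record.respond_to?(setter)
end

puts record.inspect
```

Order doesn't matter and optional fields just never get set, which covers two of the problems in the original description.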

That's the tip of the iceberg of what you can do with a smidge more
knowledge of how to metacode. The next step would be to allow
method_missing to do some work if needed, and to be able to create related
objects (like Doctors and so on). YMMV, but to my mind it's worth learning,
especially for this sort of coding problem.
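
As a rough idea of that next step, here is one way method_missing could pick up fields you haven't declared. This is only a sketch under my own assumptions: FlexiblePatientRecord and the extras hash are invented names, and a real version would probably want to whitelist field names rather than accept any setter.

```ruby
class FlexiblePatientRecord
  attr_accessor :id, :name

  # Holds any field that has no declared accessor.
  def extras
    @extras ||= {}
  end

  # Catch unknown setters (e.g. "attending_doctor=") and store the value;
  # anything else falls through to the normal NoMethodError.
  def method_missing(method_name, *args)
    name = method_name.to_s
    if name.end_with?("=") && args.length == 1
      extras[name.chomp("=").to_sym] = args.first
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.end_with?("=") || super
  end
end

record = FlexiblePatientRecord.new
record.send("id=", 12345)
record.send("attending_doctor=", "Dr. Jones")
puts record.extras.inspect
```

From there, the branch that fills extras is the natural place to build a related object (a Doctor, say) instead of storing a bare string.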