[LRUG] Parsing text

Tim Cowlishaw tim at timcowlishaw.co.uk
Wed May 14 01:11:24 PDT 2014


It sounds like you need Parsing Expression Grammars, a nice, declarative
way of solving exactly this sort of problem.

Treetop is probably the most-used Ruby implementation:
http://treetop.rubyforge.org/

However, the canonical use case for this sort of thing is writing parsers
for programming languages, and I've been unable to find documentation or
examples for the use cases you describe. Still, the principles should be the
same.
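
To give a feel for those principles, here is a rough sketch of the nested
CSS case. It is a hand-written parser in the PEG style (one method per
grammar rule, ordered choice, backtrack on failure) using only StringScanner
from Ruby's standard library. It is not Treetop's API. The class
NestedCssParser, the method flatten_rules, and the toy grammar in the
comments are all made up for illustration, and the grammar ignores comments,
strings, at-rules and the like.

```ruby
require "strscan"

# Hypothetical sketch, not Treetop: a hand-rolled parser in the PEG style.
# Informal grammar (an assumption, far smaller than real CSS):
#   stylesheet  <- rule* end-of-input
#   rule        <- selector "{" (declaration / rule)* "}"
#   declaration <- property ":" value ";"
class NestedCssParser
  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # Returns an array of [selector, declarations, child_rules] triples.
  def parse
    rules = []
    skip_space
    while (rule = parse_rule)
      rules << rule
    end
    raise ArgumentError, "unexpected input at offset #{@scanner.pos}" unless @scanner.eos?
    rules
  end

  private

  def skip_space
    @scanner.skip(/\s*/)
  end

  # rule <- selector "{" (declaration / rule)* "}"
  def parse_rule
    start = @scanner.pos
    selector = @scanner.scan(/[^{};]+/)
    if selector && @scanner.skip(/\{/)
      skip_space
      declarations = []
      children = []
      loop do
        # Ordered choice: try a declaration first, then a nested rule.
        if (declaration = parse_declaration)
          declarations << declaration
        elsif (child = parse_rule)
          children << child
        else
          break
        end
      end
      if @scanner.skip(/\}/)
        skip_space
        return [selector.strip, declarations, children]
      end
    end
    @scanner.pos = start # backtrack: this was not a rule
    nil
  end

  # declaration <- property ":" value ";"
  def parse_declaration
    start = @scanner.pos
    property = @scanner.scan(/[\w-]+/)
    if property && @scanner.skip(/\s*:\s*/)
      value = @scanner.scan(/[^;{}]+/)
      if value && @scanner.skip(/;/)
        skip_space
        return "#{property}: #{value.strip}"
      end
    end
    @scanner.pos = start # backtrack, e.g. "a:hover {" is a selector
    nil
  end
end

# Walks the parsed tree and emits one flat CSS rule per nested selector.
def flatten_rules(rules, prefix = nil)
  rules.flat_map do |selector, declarations, children|
    full = [prefix, selector].compact.join(" ")
    own = declarations.empty? ? [] : ["#{full} { #{declarations.join('; ')}; }"]
    own + flatten_rules(children, full)
  end
end

nested = <<~CSS
  .post {
    color: black;
    .title { font-weight: bold; }
    a:hover { color: red; }
  }
CSS

puts flatten_rules(NestedCssParser.new(nested).parse)
```

The point is that each grammar rule either matches and consumes input, or
fails and puts the scanner back where it started, so there is no state
variable to keep track of. A PEG library lets you write just the grammar
and generates this kind of code for you.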

Hope this helps!

Cheers,

Tim



On 14 May 2014 08:49, Andrew Stewart <boss at airbladesoftware.com> wrote:

> Hello El Rug!
>
> From time to time I encounter a situation where I would like to parse
> (semi-)structured text.  I'm sure this is trivial with the correct
> approach.  Regrettably I don't know anything about parsers/compilers/etc
> and I end up hand-rolling fragile, line-based state machines which are soon
> impossible to reason about.
>
> I'd like to know how to do this properly but I don't know where to begin.
>
> Here are a couple of specific examples:
>
> - In 2006 before SASS etc existed, I wrote a Rails plugin for nested CSS.
>  It read a nested stylesheet and flattened it into normal CSS.  Back then I
> wasn't sure how to parse a nested stylesheet...and I still don't know how.
>  (Stop laughing at the back!)
>
> - A few months ago I needed to convert hospital admissions records from a
> PDF to CSV.  Each record had fields like id, name, various dates, clinical
> history, attending doctor, etc.  The fields weren't always in the same
> order due to the layout of text in the PDF, and some fields were optional.
>  Sometimes there were several fields on a line, and a field could be spread
> over several lines.  I did my usual thing of looping over each line,
> matching field names with regular expressions, and trying to keep track of
> where I was with a state variable.  Its sole virtue was that it (sort of)
> worked; otherwise it was horrible: hard to understand, hard to modify, hard
> to extend, and very hard to debug.
>
> Please could someone enlighten me?
>
> Cheers,
> Andy Stewart
> _______________________________________________
> Chat mailing list
> Chat at lists.lrug.org
> http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>