[LRUG] Parsing text

Andrew Stewart boss at airbladesoftware.com
Wed May 14 00:49:46 PDT 2014


Hello El Rug!

From time to time I encounter a situation where I would like to parse (semi-)structured text.  I'm sure this is trivial with the correct approach.  Regrettably I don't know anything about parsers, compilers, etc., and I end up hand-rolling fragile, line-based state machines which soon become impossible to reason about.

I'd like to know how to do this properly but I don't know where to begin.

Here are a couple of specific examples:

- In 2006, before Sass etc. existed, I wrote a Rails plugin for nested CSS.  It read a nested stylesheet and flattened it into normal CSS.  Back then I wasn't sure how to parse a nested stylesheet...and I still don't know how.  (Stop laughing at the back!)

- A few months ago I needed to convert hospital admissions records from a PDF to CSV.  Each record had fields like id, name, various dates, clinical history, attending doctor, etc.  The fields weren't always in the same order due to the layout of text in the PDF, and some fields were optional.  Sometimes there were several fields on a line, and a field could be spread over several lines.  I did my usual thing of looping over each line, matching field names with regular expressions, and trying to keep track of where I was with a state variable.  Its sole virtue was that it (sort of) worked; otherwise it was horrible: hard to understand, hard to modify, hard to extend, and very hard to debug.
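
To make the second example concrete, here is a rough sketch of the kind of line-based state machine described there.  It is a reconstruction for illustration only, not the original code: the field names (ID, Name, Admitted, History, Doctor), the "an ID starts a new record" rule, and the sample text are all assumptions.

```ruby
# Hypothetical sketch of a line-based state machine for field extraction.
# Field names and the record-start rule are invented for illustration.
FIELD = /\b(ID|Name|Admitted|History|Doctor):\s*/

def parse_records(text)
  records = []
  record  = nil
  current = nil # state variable: the field we are currently "inside"

  text.each_line do |line|
    line = line.strip
    next if line.empty?

    # Splitting on a regexp with a capture group keeps the field names,
    # giving [leading_text, name, value, name, value, ...].
    parts   = line.split(FIELD)
    leading = parts.shift

    # Text before the first field name continues the previous field.
    if record && current && !leading.empty?
      record[current] = [record[current], leading].reject(&:empty?).join(" ")
    end

    parts.each_slice(2) do |name, value|
      key = name.downcase.to_sym
      if key == :id # assumption: an ID field starts a new record
        record = {}
        records << record
      end
      next unless record
      record[key] = value.to_s.strip
      current = key
    end
  end

  records
end

sample = <<~TEXT
  ID: 42  Name: Jane Doe
  History: chest pain,
    two previous admissions
  Doctor: Dr Smith
  ID: 43  Doctor: Dr Jones
  Name: John Roe
TEXT

parse_records(sample).each { |r| p r }
```

Even at this size the weaknesses show: the meaning of a line depends on `current`, a field name appearing inside free text would be misread as a new field, and every new layout quirk means another special case.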

Please could someone enlighten me?

Cheers,
Andy Stewart

