[LRUG] Parsing text
Roland Swingler
roland.swingler at gmail.com
Wed May 14 02:36:25 PDT 2014
I think your two examples are quite different. Nested CSS is a formal
language, which you could definitely handle with a parser, whereas the
hospital records sound a lot messier. I'm not sure you're going to be able
to do better than hackily stringing together regexps in that case.
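For the messy case, here is a rough sketch of what "stringing together regexps" might look like if you match labels across the whole record instead of looping line by line. The field labels and the `parse_record` helper are made up for illustration; the real labels would come from the PDF text, and this assumes each field is introduced by a known label followed by a colon.

```ruby
# Hypothetical field labels (assumption: the real ones come from the PDF).
LABELS = ["ID", "Name", "Admitted", "Doctor", "History"].freeze

# One pattern matching any known label followed by a colon.
# The capture group makes String#split keep the label in its result.
LABEL_RE = /\b(#{Regexp.union(LABELS).source}):/

# Split a whole record on the labels rather than on lines, so field order,
# several fields on one line, and fields spread over several lines all
# fall out without a state variable.
def parse_record(text)
  parts = text.split(LABEL_RE)  # ["", "ID", " 42 ", "Name", " Jane Doe\n", ...]
  parts.shift                   # drop whatever came before the first label
  parts.each_slice(2).map do |label, value|
    # Collapse newlines and runs of spaces inside the value.
    [label, value.to_s.split.join(" ")]
  end.to_h
end
```

Fields missing from a record are simply absent from the hash. It is still hacky: a label word followed by a colon inside a free-text field would be mistaken for a new field.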
I did a presentation on treetop back in the dark days of LRUG 2009 if
you're interested:
http://www.slideshare.net/knaveofdiamonds/treetop-id-rather-have-one-problem?type=presentation-
Also, if you're interested in treetop, have a look at parslet:
http://kschiess.github.io/parslet/
parslet has much nicer error reporting, and the benefit of being real Ruby
rather than odd sort-of-Ruby.
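To give a feel for what a grammar looks like before reaching for either library, here is a minimal hand-written recursive-descent sketch for the nested CSS example, using only StringScanner from the standard library. It is not how treetop or parslet work internally, and the grammar in the comment is a simplification I am assuming (no comments, no "&", no comma-separated selectors, every declaration ends with ";"). The `NestedCss` class name is invented for this sketch.

```ruby
require "strscan"

# Grammar assumed by this sketch (a simplification, not real Sass or CSS):
#   body        := (declaration | rule)*
#   rule        := selector "{" body "}"
#   declaration := property ":" value ";"
class NestedCss
  def initialize(source)
    @s = StringScanner.new(source)
    @out = Hash.new { |h, k| h[k] = [] }  # flat selector => declarations
  end

  # Parse the whole input and return flat CSS, one rule per line.
  def flatten
    body(nil)
    raise "unexpected #{@s.rest[0, 10].inspect}" unless @s.eos?
    @out.map { |sel, decls| "#{sel} { #{decls.join(' ')} }" }.join("\n")
  end

  private

  # body := (declaration | rule)*, stopping at "}" or end of input.
  def body(parent)
    loop do
      @s.skip(/\s*/)
      break if @s.eos? || @s.check(/\}/)
      # Text that reaches "{" before any ";" is a nested rule.
      if @s.check(/[^{};]+\{/)
        rule(parent)
      else
        declaration(parent)
      end
    end
  end

  # rule := selector "{" body "}"; one method per grammar rule.
  def rule(parent)
    selector = @s.scan(/[^{};]+/).strip
    @s.skip(/\{/)
    body([parent, selector].compact.join(" "))  # prefix with parent selector
    @s.skip(/\}/) or raise "expected } at offset #{@s.pos}"
  end

  # declaration := property ":" value ";"
  def declaration(parent)
    decl = @s.scan(/[^{};]+;/) or raise "expected declaration at offset #{@s.pos}"
    raise "declaration outside a rule" unless parent
    @out[parent] << decl.strip
  end
end
```

Calling `NestedCss.new("#nav {\n  color: red;\n  a { color: blue; }\n}").flatten` returns `"#nav { color: red; }\n#nav a { color: blue; }"`. Because each method mirrors one grammar rule and calls itself for nested blocks, there is no state variable to keep track of.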
Cheers,
Roland
On Wed, May 14, 2014 at 8:49 AM, Andrew Stewart
<boss at airbladesoftware.com> wrote:
> Hello El Rug!
>
> From time to time I encounter a situation where I would like to parse
> (semi-)structured text. I'm sure this is trivial with the correct
> approach. Regrettably I don't know anything about parsers/compilers/etc
> and I end up hand-rolling fragile, line-based state machines which are soon
> impossible to reason about.
>
> I'd like to know how to do this properly but I don't know where to begin.
>
> Here are a couple of specific examples:
>
> - In 2006 before SASS etc existed, I wrote a Rails plugin for nested CSS.
> It read a nested stylesheet and flattened it into normal CSS. Back then I
> wasn't sure how to parse a nested stylesheet...and I still don't know how.
> (Stop laughing at the back!)
>
> - A few months ago I needed to convert hospital admissions records from a
> PDF to CSV. Each record had fields like id, name, various dates, clinical
> history, attending doctor, etc. The fields weren't always in the same
> order due to the layout of text in the PDF, and some fields were optional.
> Sometimes there were several fields on a line, and a field could be spread
> over several lines. I did my usual thing of looping over each line,
> matching field names with regular expressions, and trying to keep track of
> where I was with a state variable. Its sole virtue was that it (sort of)
> worked; otherwise it was horrible: hard to understand, hard to modify, hard
> to extend, and very hard to debug.
>
> Please could someone enlighten me?
>
> Cheers,
> Andy Stewart