Aug 10, 2012
 

In order to solve problems in my work, I frequently need to extract information from structured text.  Often it’s in a report or log file generated by software I don’t control.  Other times it’s inside source code in one of a multitude of languages I need to deal with (some of which are domain-specific languages without readily available grammar definitions).  Sometimes it’s even to answer complex questions about the structure or behavior of code I need to maintain.

If I’m lucky, I can tackle this with some simple regular expression matching.  That kind of text matching has been built into many languages (e.g. Perl, Python, Ruby) for a long time.  (C++ has even recently joined the crowd.)  Unfortunately, I’m not always so lucky.
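For instance, a quick match in Python might look like this (the log format here is invented, but representative of the lucky cases):

import re

# The happy case: one regular expression per line extracts everything I need.
line = "2012-08-10 14:23:01 ERROR disk=/dev/sda1 free=512MB"
m = re.match(r'(\S+ \S+) (\w+) disk=(\S+) free=(\S+)', line)
if m:
    timestamp, level, disk, free = m.groups()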

If I need to extract information from something that’s a context-free language, I don’t have any great options for getting the job done quickly.  As an example of what I mean, consider a fairly simple format made up of a list of items, each of which can be either a number or a nested list.  In EBNF (omitting the whitespace rules for brevity; whitespace should be allowed before and after items):

List ::= Item (',' Item)*
Item ::= Number | NestedList
NestedList ::= '(' List ')'
Number ::= ('-'|'+')? Digit+
Digit ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'

It’s hard to get much simpler than that.  Now take a moment and, using whatever language you like, write a parser that will recognize that format and give it back to you as a nested list structure (e.g. a Python list of ints and lists, a Perl array ref of scalars and array refs, etc.).  Don’t forget to handle whitespace, particularly lists spanning multiple lines.  Not so easy, right?
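To give a sense of the effort involved, here’s a minimal recursive-descent sketch in Python (the names and error handling are my own choices, not the only way to do it):

import re

# Hand-rolled recursive-descent parser for the grammar above.  Each function
# returns the parsed value plus the position where parsing stopped.
def parse_list(text, pos=0):
    items = []
    while True:
        pos = skip_ws(text, pos)
        item, pos = parse_item(text, pos)
        items.append(item)
        pos = skip_ws(text, pos)
        if pos >= len(text) or text[pos] != ',':
            return items, pos
        pos += 1  # consume the ','

def parse_item(text, pos):
    if pos < len(text) and text[pos] == '(':
        items, pos = parse_list(text, pos + 1)
        pos = skip_ws(text, pos)
        if pos >= len(text) or text[pos] != ')':
            raise ValueError("expected ')' at position %d" % pos)
        return items, pos + 1
    m = re.match(r'[-+]?\d+', text[pos:])
    if not m:
        raise ValueError("expected a number at position %d" % pos)
    return int(m.group()), pos + m.end()

def skip_ws(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

# parse_list("1, (2,\n    3), -4")[0]  ->  [1, [2, 3], -4]

It works, but it’s a decent chunk of careful code for such a trivial grammar, and none of it is reusable for the next format I run into.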

The most common mode of text processing today is line-by-line, splitting or matching each line.  That works fine in a lot of simple cases, but it’s wholly inadequate for even fairly simple languages like this.
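A quick demonstration with the nested-list format above: once a single item spans lines, per-line splitting yields only useless fragments.

text = "1, (2,\n    3), 4"
for line in text.splitlines():
    print(line.split(','))   # ['1', ' (2', ''] and then ['    3)', ' 4']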

Of course there are plenty of tools and libraries out there to automate parsing (bison, antlr, elkhound … it’s a long list).  The problem is that they’re primarily geared towards writing a compiler, an interpreter, or a similar tool that needs a very complete and precise representation of the structure in the original text.  I’ve used some of those tools before to do exactly that, and I wouldn’t hesitate to use them again for writing similar software.  However, for these common, short-lived, small-scope problem-solving tasks (many of which would best be described as debugging, or even just understanding a complex situation), the effort required to use such tools is far too great.

I want something fast, something that I can use to help me solve a problem in an afternoon.  I want a parsing system I can feel comfortable using for throw-away code.  I want it to be easy to ignore parts of the input I don’t care about, so I can quickly get at the parts that are relevant to what I’m trying to get done.  I want to be sloppy and do a less than perfect job of parsing the input, which is the opposite of what you want when writing a compiler or interpreter.  If I had a tool like this I would frequently be able to get answers faster with less effort.

I suspect there are at least a couple reasons why it’s hard to find something like this:

  1. There may not be much demand for it.  The ready availability of regular expression matching makes it easy to extract information from a lot of simple formats, and possible (if painful) to hand-write crude parsers for somewhat more complex ones.  I’m sure most people solve such problems through searching and manual inspection (which I resort to far more often than I would like).  Perhaps most people don’t even see such problems as parsing problems.
  2. Parsing is inherently hard.  If you start to get into anything complicated (say, parsing program source code) you can get into trouble with ambiguities (the dangling-else problem, sketched below; type name vs. variable name; etc.), left recursion, and many other traps for the unwary.  There are well-known ways to deal with some of the common complexities of parsing, but there’s no denying that it can get quite tricky.
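To make the dangling-else problem concrete, here it is in the same EBNF style as before:

Stmt ::= 'if' Expr 'then' Stmt ('else' Stmt)?
       | OtherStmt

Given input like if a then if b then s1 else s2, the grammar doesn’t say whether the else belongs to the inner or the outer if; both parse trees are legitimate, and the grammar or the tool has to break the tie.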

Of course there are plenty of options out there, and I certainly haven’t tried all of them.  I’ve followed the trend of Parsing Expression Grammars (which incidentally remind me a lot of LIM), and I’ve tried a few PEG-based tools.  I am aware of Perl 6 rules, and I should probably make the time to become familiar with Rakudo and/or Pegex.

What matters most to me is how fast I can get something accomplished.  The easier it is for me to pull structured information out of a pile of text, the less time I’ll spend doing so manually.  Scripting languages with regular expression matching represented a significant leap forward in the ability to solve problems by gathering and analyzing information… 20 years ago.  Since then, I don’t think there has been much progress in this area.  (Of course I’d be thrilled if you could convince me that I’m wrong.)

Update: A cheeky friend privately suggested the following Python implementation for parsing the format above:

eval(the_list.replace('(','[').replace(')',']'))

Effective for this trivial example, although unsafe.  Even if I didn’t find it distasteful, it’s hardly generalizable.
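For the record, Python’s ast.literal_eval would at least remove the safety problem, since it refuses anything that isn’t a literal expression (the bracket wrapping is my addition, so the top level comes back as a list too):

import ast

# Same trick, but literal_eval won't execute arbitrary code in the input.
# Still not generalizable.
def parse_nested(the_list):
    return ast.literal_eval('[' + the_list.replace('(', '[').replace(')', ']') + ']')

# parse_nested("1, (2, 3), -4")  ->  [1, [2, 3], -4]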

  6 Responses to “Parsing Should be Easier”

  1. I think parser-writing is something that gets easier with practice (i.e. “Use makes master”), but we as working programmers don’t have the time to practice it. And it depends on one’s definition of “quickly” for a given task: I’ve done what you’ve done numerous times, and you’re probably talking about getting something done in an hour or two at most.

  2. Python has been my primary language for many years now, so it tends to be my first thought nowadays, and it has the pyparsing module: http://pyparsing.wikispaces.com/ It allows for fairly easy DSL parsing, though it’s not very good for things that don’t follow a very structured grammar.

    A sample from the examples looks like it would do the nested list example (with the imports and the integer/bracket helpers filled in):
    from pyparsing import Forward, Group, Optional, Suppress, Word, delimitedList, nums
    integer = Word(nums)  # bare digits; sign handling left out here
    lbrack, rbrack = Suppress('('), Suppress(')')
    listStr = Forward()
    listItem = integer | Group(listStr)
    listStr << (lbrack + Optional(delimitedList(listItem) + Optional(Suppress(","))) + rbrack)

    What I've found, though, is that when I need to parse things that don't have a structured grammar, I tend to combine command line tools to pre-parse them into something more structured before attempting to write code to handle the parsing. So, perhaps the easier solution is a generalized pre-parsing tool that turns free-form text into structured text.

  3. Have you tried Scala’s Parser Combinator library? Here’s an example of my parser for Google’s Protobuf .proto files. If you ignore all the type annotations (they are necessary for the PackratParser to work correctly), building a simple parser in Scala is really easy.

    You can also try Parboiled, a Java/Scala parsing library.

    • And by type annotations, I meant the : PackratParser[ReturnType] stuff (and you would also be using simple defs instead of lazy vals).

    • I have a feeling I ran into Parboiled in some of my searches for parsing tools, though I haven’t spent any time with it. It’s certainly trying to fill the space I’m talking about (above regular expressions, but below compiler-building tools).

      I also don’t know Scala, but from a little reading it definitely sounds like a language I could enjoy. Having packrat parsing in its standard library is impressive.