To solve problems in my work, I frequently need to extract information from structured text. Often it’s in a report or log file generated by software I don’t control. Other times it’s inside source code in one of a multitude of languages (some of them domain-specific languages without readily available grammar definitions). Sometimes it’s even to answer complex questions about the structure or behavior of code I need to maintain.
If I’m lucky, I can tackle this with some simple regular expression matching. That kind of text matching has been built into many languages (e.g. Perl, Python, Ruby) for a long time. (C++ has even recently joined the crowd.) Unfortunately, I’m not always so lucky.
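When I am lucky, the job looks something like this. A minimal sketch in Python — the log line, its format, and the field names are all invented for illustration, not taken from any particular tool:

```python
import re

# A hypothetical log line; the format and field names are assumptions
# made up for this example.
line = "2013-04-02 12:00:01 ERROR connection lost host=db01 retries=3"

# One regex pulls out the pieces I care about and ignores the rest.
match = re.search(r"(ERROR|WARN|INFO).*host=(\S+).*retries=(\d+)", line)
level, host, retries = match.groups()
```

Run that over a file line by line and you have a quick-and-dirty report in a few minutes. The trouble starts when the structure doesn’t fit on one line.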
If I need to extract information from something that’s a context-free language, I don’t have any great options for getting the job done quickly. As an example of what I mean, consider a fairly simple format made up of a list of items, each of which can be either a number or a nested list. In EBNF (whitespace, which should be allowed before and after items, is omitted for brevity):
List ::= Item ("," Item)*
Item ::= Number | NestedList
NestedList ::= '(' List ')'
Number ::= ('-'|'+')? Digit+
Digit ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
It’s hard to get much simpler than that. Now take a moment and, using whatever language you like, write a parser that will recognize that format and give it back to you as a nested list structure (e.g. a Python list of ints and lists, a Perl array ref of scalars and array refs, etc.). Don’t forget to handle whitespace, particularly lists spanning multiple lines. Not so easy, right?
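For the record, here is roughly what doing it by hand entails — a recursive-descent sketch in Python, one function per grammar rule. It’s not long, but it’s exactly the kind of careful bookkeeping (position tracking, whitespace, error positions) that makes this feel too expensive for a throw-away task:

```python
import re

def parse(text):
    """Parse the list grammar above into nested Python lists of ints,
    e.g. "1, (2, 3)" -> [1, [2, 3]]."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_item():
        nonlocal pos
        skip_ws()
        if pos < len(text) and text[pos] == '(':
            pos += 1                      # consume '('
            items = parse_list()
            skip_ws()
            if pos >= len(text) or text[pos] != ')':
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # consume ')'
            return items
        m = re.match(r'[-+]?\d+', text[pos:])
        if not m:
            raise ValueError("expected number at position %d" % pos)
        pos += m.end()
        return int(m.group())

    def parse_list():
        nonlocal pos
        items = [parse_item()]
        skip_ws()
        while pos < len(text) and text[pos] == ',':
            pos += 1                      # consume ','
            items.append(parse_item())
            skip_ws()
        return items

    result = parse_list()
    skip_ws()
    if pos != len(text):
        raise ValueError("trailing input at position %d" % pos)
    return result
```

So `parse("1, (2, -3),\n 4")` gives `[1, [2, -3], 4]`, including the multi-line case. Forty lines for a five-rule grammar; scale the grammar up and the effort scales with it.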
The most common mode of text processing today is line-by-line, splitting or matching each line. That works fine in a lot of simple cases, but it’s wholly inadequate for even fairly simple languages like this.
Of course there are plenty of tools and libraries out there to automate parsing (bison, antlr, elkhound … it’s a long list). The problem is they’re primarily geared towards writing a compiler, an interpreter, or a similar tool which needs a very complete and precise representation of the structure in the original text. I’ve used some of those tools before to do exactly that, and I wouldn’t hesitate to use them again for writing similar software. However, for these common, short-lived, small-scope problem-solving tasks (many of which would best be described as debugging, or even just understanding a complex situation), the effort required to use such tools is far too great.
I want something fast, something that I can use to help me solve a problem in an afternoon. I want a parsing system I can feel comfortable using for throw-away code. I want it to be easy to ignore parts of the input I don’t care about, so I can quickly get at the parts that are relevant to what I’m trying to get done. I want to be sloppy and do a less than perfect job of parsing the input, which is the opposite of what you want when writing a compiler or interpreter. If I had a tool like this I would frequently be able to get answers faster with less effort.
I suspect there are at least a couple reasons why it’s hard to find something like this:
- There may not be much demand for it. The ready availability of regular expression matching makes it easy to extract information from a lot of simple formats and possible (if painful) to write poor parsers by hand for some more things. I’m sure most people solve such problems through searching and manual inspection (which I resort to far more often than I would like). Perhaps most people don’t even see such problems as parsing problems.
- Parsing is inherently hard. If you start to get into anything complicated (say, parsing program source code) you can get into trouble with ambiguities (the dangling-else problem, type name vs. variable name, etc.), left recursion, and many other traps for the unwary. There are well-known ways to deal with some of the common complexities of parsing, but there’s no denying that it can get quite tricky.
Of course there are plenty of options out there, and I certainly haven’t tried all of them. I’ve followed the trend of Parsing Expression Grammars (which incidentally remind me a lot of LIM), and I’ve tried a few PEG-based tools. I am aware of Perl 6 rules, and I should probably make the time to become familiar with Rakudo and/or Pegex.
What matters most to me is how fast I can get something accomplished. The easier it is for me to pull structured information out of a pile of text, the less time I’ll spend doing so manually. Scripting languages with regular expression matching represented a significant leap forward in the ability to solve problems by gathering and analyzing information… 20 years ago. Since then, I don’t think there has been much progress in this area. (Of course I’d be thrilled if you could convince me that I’m wrong.)
Update: A cheeky friend privately suggested the following Python implementation for parsing the EBNF grammar above:
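The friend’s exact snippet isn’t reproduced here, but one unsafe one-liner in that spirit — my reconstruction, not necessarily the original — swaps the parentheses for brackets and hands the result to `eval`:

```python
def sloppy_parse(text):
    # UNSAFE: eval will happily execute arbitrary Python embedded in
    # the input. A guess at the kind of trick meant above.
    return eval('[' + text.replace('(', '[').replace(')', ']') + ']')
```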
It’s effective for this trivial example, although unsafe. Even if I didn’t find it distasteful, it hardly generalizes.