Wednesday, March 17, 2010

Russ Cox on Regular Expressions

I've long been a fan of Russ Cox's 2007 paper Regular Expressions Can Be Simple and Fast. In it he explained why many of the common regular expression libraries in languages such as Perl, Python, Java, PHP, and others can exhibit exponential behavior, while the libraries associated with the Unix tools, such as egrep and awk, have provably linear performance on all inputs. Oddly enough, the O(1) Unix regular expression tools are based on an algorithm invented by Ken Thompson in 1968 for the QED editor, while the (potentially) O(2n) implementations used by PERL et al. were based on Henry Spencer's much later Regular Expression Library.

I've just discovered that Russ wrote two other papers in the series: Regular Expression Matching: the Virtual Machine Approach (2009) and Regular Expression Matching in the Wild (2010). The first of these discusses various virtual machine implementations for regular expressions. The first paper discussed an implementation based on Thompson's ideas but it worked by converting the regular expression into an explicit nondeterministic finite automaton (NFA) and executing that. The implementations in the second paper are closer to Thompson's actual implementation, which generated IBM 7094 machine code and executed it directly. Cox shows that nevertheless the two implementations are essentially the same and that each is revealing in its own way. This is a wonderful paper—in some ways better than the first—and is really a must read.

The final paper discusses the RE2 regular expression system used at Google and now released as an open source project. RE2 is based on the ideas discussed in the first two papers but is a real implementation meant for industrial strength applications.

If you've ever wondered how regular expressions work or even have the slightest interest in them, NFAs, or just enjoy excellent software engineering, you should definitely give these papers a read. You won't be sorry.

No comments:

Post a Comment