Markup Optimisation by Probabilistic Parsing: LALR(5000000), the First Prize Winner in the 2001 ICFP Programming Contest
Chung-chieh Shan and Dylan Thurston
Our submission to the ICFP 2001 programming contest won the first prize. Our program is named “LALR(5000000)”, for the reason explained in the README file:
The problem is equivalent to finding the most likely parse (or rather, a pretty good parse) of a given string in a probabilistic context-free grammar. More precisely, the string to be parsed is the sequence of decorated characters (or rather, contiguous chunks thereof); the PCFG describes the stack transitions and costs of the markup language; each parse tree is an acceptable way to describe the given sequence of decorated characters using the markup language. Fortunately, the grammar is highly regular; unfortunately, the grammar is also highly recursive and connected.
Standard parsing algorithms exist for PCFG parsing. Our program is roughly based on one such algorithm, probabilistic CYK parsing. Seldom are parsing algorithms designed to handle multi-megabyte sentences, however…
A draft write-up that describes our entry in more detail is available, dated September 20, 2001 (06:25:33; revision 1.23):
You can also download the code we submitted.