For most of us, when it comes to manipulating or interpreting some chunk of plain text, our first instinct will be to reach for regular expressions.
That’s understandable - regular expressions are accessible, standardized and available on every conceivable language/platform.
But for certain tasks, parser combinators are a far better solution, producing code that is much easier to read and simpler to maintain. I definitely wish I had familiarised myself with them sooner.
Parser combinators are less widely known than regular expressions, but the concept is pretty simple. In fact, the name contains everything you need to know.
A ‘parser’ is simply a program that transforms unstructured plain text input into a concrete data structure.
Parser combinators are a special type of parser, created by combining a number of smaller parsers together (hence the name).
Rather, what we’re about to see is an example of when and how you might want to use an off-the-shelf parser combinator library.
If you’re impatient, you can jump straight to the code on GitHub.
Regular expressions and dialog
Say your program requires the user to repeat a sentence, but allows them to choose a certain part of their response:
“Hi, my name is [?]”
“I’m [0-99] years old”
where [?] means “accept any string”, and [0-99] mean “accept any integer between 0 and 99”.
We’ve basically created our own very basic grammar - a set of rules to determine whether a given user input is valid.
We could validate the user’s input against this grammar with a regular expression:
So “Hi, my name is” would fail, but “Hi, my name is Nick” would succeed. Pretty straightforward.
But we probably also want to validate that the expected sentences themselves are correct.
If the grammar only allows integers between 0 and 99 (excluding 100), we shouldn’t be able to write a rule like “The temperature today is [0-100] degrees”.
We could also do this with a regular expression:
That’s already pretty ugly. What’s going to happen when your grammar changes to, say, require variable names to capture user-provided values?
“Hi, my name is [?]<yourName>”
“I’m [0-99] years old<yourAge>”
Or you want to add some conditions, say to only present the user with a response if those variables satisfy a given value:
”<?if yourAge < 20>Hey fam!”
”<?if yourAge >= 20>Good afternoon!”
There are three problems here.
- Figuring out the regular expression will take a lot of effort.
- No matter what you come up with, it won’t be very human-readable.
- Inverting the expression - figuring out the grammar from the regex - will be nigh impossible.
Parser combinators in F#
Parser combinators avoid these problems by composing a number of small parsers together to form a single top-level parser.
This allows you to incrementally construct your overall grammar; you can first write a parser for the opening conditional (“<?”), then chain this to an integer range parser (“0-99”), and so on.
The go-to parser combinator library in F# is FParsec, a re-implementation of Haskell’s Parsec library.
FParsec is very well documented and has an excellent tutorial, which I definitely recommend you check out.
The grammar we’re about to implement is slightly more complicated than the above, and needs to accept:
“Plain string” // Present to user as-is, no parsing needed
“Hi <Range[0,100] myVar>” // User must respond with “Hi “ followed by some integer value between 0 and 100 which is stored in the variable “myVar”
“Hi <String[Alice|Bob] myOtherVar>” // User must respond with “Hi “ followed by “Alice” or “Bob”, which is stored in the variable “myOtherVar”
“Hi <!myOtherVar>” // Interpolation, insert the value of the myOtherVar and present to user
”<?myVar gt 1>…” // Conditional, present to user where some integer variable greater than 1
”<?myVar lt 1>…” // Conditional, present to user where some integer variable less than 1”
”<?myVar eq 1>…” // Conditional, present to user where some integer variable equal than 1”
”<?myVar ne 1>…” // Conditional, present to user where some integer variable not equal to 1”
When you think about it, our parser/grammar checker needs to represent a set of value constraints programatically - “must be in this set of strings”, “must be in within this integer range”.
We’re effectively designing a small meta-language, which means we can represent these constraints very compactly as F# types for:
- a plain text expression
- a variable declaration expression
- a conditional (boolean) expression
- an interpolation expression
We will therefore design the parser to return one or more of the following types whenever it fully consumes a valid input:
(FParsec returns a single type on success, so using a discriminated union to encapsulate these types will make it easier to manipulate later on).
Let’s go through these types one by one.
The Text type is straightforward. If the parser doesn’t find any special characters, it just returns a plain string with no abstract representation needed.
The Boolean (“<?myVar eq 10>”), VariableDeclaration (“Hi <String[Alice|Bob] myVar>”) and Interpolation (“Hi <!myVar>”) types all inherit from an abstract IExpression type.
This is because all three types are associated with a variable name (“myVar”), a provided value (“Gary”) and a permitted value (e.g. “Alice”, or for interpolations, any non-null value).
Our boolean expressions compare two values and return the boolean result. Specifically, we consider equality, inequality, less than and greater than. These are represented as four concrete types:
So “<?myVar eq 10>Say 10!” will parse to an Equality expression and a Text expression, “<?myVar gt 10>Greater than 10!” will parse to a GreaterThan expression and a Text expression, and so on.
To choose the next sentence to display to the user, the program can invoke the expression’s IsMetBy method with the context (i.e. what the user said in previous dialog turns). If IsMetBy returns true, the program displays this sentence to the user; if it fails, the sentence is discarded.
Inside each IExpression, the target value’s type is represented as a Discriminated Union:
This means we can use a match expression to perform the boolean comparison:
String equality/inequality comparisons are different from integer comparisons, hence why we discriminate between the two.
There’s some redundant code here, though - after all, we have no concept of string greater-than/less-than comparisons:
This exception is a code-smell, suggesting we could rethink the design of our types. But it will do for now!
In addition to Boolean expressions, I define Option and Range types to represent variables (or values) that we expect to capture from the user:
We’ve now created an F# type for every abstract representation in our grammar.
This is already looking fairly compact and clean. To extend our grammar with another representation, we can simply add another type.
Now, let’s see how FParsec lets us construct parsers to generate these types from our source text.
Building parsers with FParsec
Remember that FParsec lets us compose a number of small parsers together.
Each small parser is just a rule that says “return true if you see this character, or return an error if you don’t”:
pstring is a primitive FParsec parser that says “match on the provided string”. Our openAngle and closeAngle parsers will therefore return true for the strings < and > respectively.
FParsec lets you chain these together as follows:
.». is an FParsec function that says “first apply the left parser, then apply the right, and yield the results of both”.
Applying this parser to the string <> therefore yields <>. To yield only the result of the left parser, use .»; parsing <> would then return <. For the result of the right parser only, use ».; parsing <> would return >.
No matter which yield function is used, if the parser fails, nothing is returned. So providing < to a parser that expects <> will return Failure.
However, FParsec lets you backtrack on parse failure; effectively saying “if you cannot parse this as an X, backtrack to where you started and try and parse as Y”.
We can safely use this cascade approach to parsing our grammar because conditions, declarations and interpolations are all mutually exclusive. We can’t have both a condition and a declaration in a given string, so if the string isn’t a condition, it must be a declaration or an interpolation; if it isn’t a declaration, it must be an interpolation, and so on.
This isn’t a universal rule of parsing, though - just a consequence of the design of my particular grammar.
Primitive FParsec parsers
FParsec ships a number of primitive functions for common parsers.
Parsing a plain Text object, for example, should return true for any non-empty string that does not contain a <.
The many1Satisfy is a built-in FParsec primitive that returns Success for strings containing at least one character that satisfies the given predicate.
Likewise, by default, parsers return immediately upon consuming valid input. To create a parser that continues to consume until the end of a string is reached, FParsec ships with a many parser:
So that’s FParsec in a nutshell! There’s obviously a lot more going on under the hood, but I won’t go any deeper here. I highly recommend digging through the documentation for an hour or so if you want to get a deeper understanding.
Now, let’s go through the actual parser implementation for our dialog grammar.
Remember that the top-level parser is composed of lower-level parsers, so this won’t match the order in which the code itself is written - I’ll explain some things bottom-up rather than top-down.
Basic character parsers
This is self-explanatory - I just want to explicitly name these parsers for better readability.
The top-level parser
The entry point for our top-level parser looks like this:
Pretty simple - invoke the evaluate parser on the target string as many times as is necessary until the end of the string is reached.
All parsers return a union type - Failure (if the parser fails), or Success (if the parser succeeds, wrapping the parse result).
Setting up back-tracking
We’ve already seen this. If the parser cannot parse the string as a plain Text expression, it will attempt to parse as a condition (Boolean) expression, then a variable capture expression, then an interpolation expression.
If none of those match, the entire parser will return a Failure.
The text parser returns a Text expression if the string contains at least one character not matching <.
A declaration (variable capture) is a < followed by a Range (e.g. “Int[1,100]”) or an Option (e.g. “String[Bob|Alice]”).
We don’t want the parser to return the <, so note the use of ». here.
The Range parser looks for a [, skips Int and then looks for two integers separated by a comma, followed by ], then a variable name and a closing >.
The Option parser looks for a [, skips String and returns everything up to the closing ], then looks for a variable name and a closing >.
The pipe2 function here just takes the result of two parsers and pipes them into a function that accepts two arguments.
Boolean (conditional) expressions
Following roughly the same format, the next set of parsers look for an opening angle followed by ! (for interpolations) or ? (for conditions).
For conditions, first the equality parser is applied, then the inequality, the greater-than and finally the less-than parser.
This isn’t very efficient, as each parser back-tracks to the variable name if we’ve chosen the wrong path.
When a parser succeeds, the corresponding type is instantiated and returned.
That’s all there is to it - a fairly sophisticated parser in less than 150 lines of code.
The type abstractions aren’t perfect, and some of the parsing code is definitely too long, so there’s definitely room for improvement.
But all things considered, this is a huge improvement over a deep, obscure, nested regular expression.
The next time you need to parse text into a concrete data structure, I highly recommend trying out parser combinators.
It may not be the right solution for your particular problem, but you’ll definitely come away with a better understanding of what problems they are suited for.