Parser grammar
Primitives
The boundary between primitive and derived parsers is somewhat fuzzy. Nevertheless, here is a selection of the most important primitive parsers.
value(x): Always succeeds, doesn’t consume input, returns x.
fail(msg): Always fails, raises an exception with msg as text.
item: Gets a single byte from the stream.
text_literal(str): Succeeds if the next characters in the stream exactly match str.
char_pred(pred): Advances the end of the cursor if pred succeeds.
text_end_by(char): Advances the end of the cursor until char is found.
push(x): Pushes a value onto the auxiliary stack.
pop(): Pops a value from the auxiliary stack.
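To make the behaviour of the simplest primitives concrete, here is a minimal toy model. It is a sketch under assumptions, not the library's implementation: a parser is modelled as a plain function taking `(data, pos)` and returning `(result, new_pos)`, the `Failure` exception is a hypothetical name, and the real cursor (with separate begin and end positions) and the auxiliary stack are left out.

```python
class Failure(Exception):
    """Raised when a toy parser cannot match the input (hypothetical name)."""


def value(x):
    # Always succeeds, consumes nothing, returns x.
    def parser(data, pos):
        return x, pos
    return parser


def fail(msg):
    # Always fails, raising an exception that carries msg.
    def parser(data, pos):
        raise Failure(msg)
    return parser


def item(data, pos):
    # Takes exactly one byte, or fails at the end of the input.
    if pos >= len(data):
        raise Failure("unexpected end of input")
    return data[pos:pos + 1], pos + 1


def text_literal(text):
    # Succeeds only if the input at pos starts with the given text.
    expected = text.encode()

    def parser(data, pos):
        if not data.startswith(expected, pos):
            raise Failure(f"expected {text!r} at offset {pos}")
        return text, pos + len(expected)
    return parser


print(value(42)(b"abc", 0))            # (42, 0)
print(item(b"abc", 0))                 # (b'a', 1)
print(text_literal("ab")(b"abc", 0))   # ('ab', 2)
```

Note how `value` leaves the position untouched while `item` and `text_literal` advance it; this is what "doesn't consume input" means in the list above.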
We also defined some derived parsers that should be useful in most contexts.
whitespace: Matches tabs, spaces and newlines.
eol: Matches End of Line characters (i.e.: either \n or \n\r).
integer: Matches an integer value.
scientific_number: Matches a floating point number, possibly in scientific notation.
Combinators
The next question is: how can we combine our primitive parsers? We already listed the main combinators briefly; here we go into a little more detail.
choice(*p): Tries every parser p in sequence until one succeeds. If all fail, choice gathers all exceptions and composes an error message from them.
sequence(*p): Runs every parser p in sequence and only returns the result of the last one.
named_sequence(**p): Runs every parser p in sequence and stores results in a dictionary. Keys that start with an underscore are not stored.
many(p): Runs the parser p until it fails. Returns a list of parsed items.
some(p): Parses p at least one time, or fails.
The many and some combinators come in several flavours. Each has a variant, many_char and some_char respectively, that returns a string instead of a list. A further flavour, many_char_0 and some_char_0, does not flush the cursor.
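The core combinators can be sketched in a few lines each. The following is an illustrative toy model rather than the library's code: parsers are assumed to be plain functions from `(data, pos)` to `(result, new_pos)`, `Failure` and `literal` are hypothetical helpers for the demo, and the string-returning `_char` variants are omitted.

```python
class Failure(Exception):
    """Raised when a toy parser cannot match (hypothetical name)."""


def literal(expected):
    # Demo-only helper: match an exact byte string.
    def parser(data, pos):
        if not data.startswith(expected, pos):
            raise Failure(f"expected {expected!r} at offset {pos}")
        return expected, pos + len(expected)
    return parser


def choice(*parsers):
    # Try each alternative from the same position and collect the errors.
    def parser(data, pos):
        errors = []
        for p in parsers:
            try:
                return p(data, pos)
            except Failure as e:
                errors.append(str(e))
        raise Failure("all alternatives failed: " + "; ".join(errors))
    return parser


def sequence(*parsers):
    # Run the parsers one after another, keep only the last result.
    def parser(data, pos):
        result = None
        for p in parsers:
            result, pos = p(data, pos)
        return result, pos
    return parser


def many(p):
    # Repeat p until it fails; zero matches still counts as success.
    def parser(data, pos):
        items = []
        while True:
            try:
                result, pos = p(data, pos)
            except Failure:
                return items, pos
            items.append(result)
    return parser


def some(p):
    # Like many, but the first match is mandatory.
    def parser(data, pos):
        first, pos = p(data, pos)
        rest, pos = many(p)(data, pos)
        return [first] + rest, pos
    return parser


a, b = literal(b"a"), literal(b"b")
print(choice(a, b)(b"bc", 0))    # (b'b', 1)
print(sequence(a, b)(b"ab", 0))  # (b'b', 2)
print(many(a)(b"aab", 0))        # ([b'a', b'a'], 2)
```

The only difference between `many` and `some` in this sketch is that `some` lets the failure of the first attempt propagate, whereas `many` turns it into an empty list.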
Some derived combinators help us shape a little language to describe grammars.
optional(p, default=None): Parses p or gives the default value.
tokenize(p): Parses p followed by optional whitespace. This makes sure we always start at the next token.
fmap(f): Takes a function f, returns a lambda that maps an argument through f to a value parser. That sounds complicated, but it allows us to pass a parsed result through f using the >> operator. For an example, see the PPM parser at the end of this paper.
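The description of fmap is easier to follow with the >> operator spelled out. The sketch below assumes that >> acts as a bind: it passes the left parser's result to the function on its right, which must return the parser to continue with. The `Parser` wrapper class, the `Failure` exception and the `digits` parser are hypothetical stand-ins for illustration; only `value`, `fmap` and `optional` correspond to names from the text, and their real implementations may differ.

```python
class Failure(Exception):
    """Raised when a toy parser cannot match (hypothetical name)."""


class Parser:
    """Wraps a function (data, pos) -> (result, new_pos)."""

    def __init__(self, run):
        self.run = run

    def __rshift__(self, make_next):
        # Bind: feed this parser's result to make_next, which returns
        # the parser that continues from the new position.
        def run(data, pos):
            result, pos = self.run(data, pos)
            return make_next(result).run(data, pos)
        return Parser(run)


def value(x):
    # Always succeeds with x, consuming nothing.
    return Parser(lambda data, pos: (x, pos))


def fmap(f):
    # Lift a plain function so that it can sit on the right of >>.
    return lambda x: value(f(x))


def optional(p, default=None):
    # Fall back to the default, consuming nothing, if p fails.
    def run(data, pos):
        try:
            return p.run(data, pos)
        except Failure:
            return default, pos
    return Parser(run)


def _digits(data, pos):
    # Demo-only parser: one or more ASCII digits, returned as bytes.
    end = pos
    while end < len(data) and 48 <= data[end] <= 57:
        end += 1
    if end == pos:
        raise Failure(f"expected a digit at offset {pos}")
    return data[pos:end], end


digits = Parser(_digits)
number = digits >> fmap(int)

print(number.run(b"42 apples", 0))               # (42, 2)
print(optional(number, default=0).run(b"x", 0))  # (0, 0)
```

Because `fmap(int)` wraps its result in a `value` parser, the conversion consumes no further input: the position after `number` is the same as after `digits`.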
named_sequence and construct
The named_sequence combinator forms a particularly useful pair with the construct function. Used on its own, named_sequence creates a dictionary. Often, though, we want the parsed results to form an instance of some class. The construct function takes a dictionary and constructs an object by forwarding the dictionary as keyword arguments.
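from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

# Keys starting with an underscore (the punctuation) are not stored.
point = named_sequence(
    _1=tokenize(char("(")),
    x=tokenize(scientific_number),
    _2=tokenize(char(",")),
    y=tokenize(scientific_number),
    _3=tokenize(char(")"))
) >> construct(Point)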
The point parser then constructs Point objects, such that
parse_bytes(point, b"(1, 2)")
gives Point(x=1, y=2) as output.
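The two steps involved, dropping the underscore-prefixed keys and forwarding the remaining dictionary as keyword arguments, can be shown without any parsing machinery. This is a simplified sketch: `drop_private` and `construct_sketch` are hypothetical names, the `raw` dictionary is made-up sample data standing in for what the sub-parsers would produce, and the real construct presumably wraps its result in a parser so that it can be used with >>.

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


def drop_private(results):
    # Mimic named_sequence: keys starting with "_" are not stored.
    return {k: v for k, v in results.items() if not k.startswith("_")}


def construct_sketch(cls):
    # Return a function that forwards a dictionary as keyword arguments.
    return lambda kwargs: cls(**kwargs)


# Stand-in for the per-key results of parsing b"(1, 2)".
raw = {"_1": b"(", "x": 1.0, "_2": b",", "y": 2.0, "_3": b")"}

point = construct_sketch(Point)(drop_private(raw))
print(point)  # Point(x=1.0, y=2.0)
```

A consequence of forwarding keyword arguments is that the non-underscore keys of the named_sequence must match the field names of the target class exactly; a missing or misspelled key raises a TypeError when the object is constructed.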
using_config and with_config
We may use the auxiliary stack to store a config variable that can be accessed from any parser. To make this a bit more user-friendly, we define two functions: with_config() and the @using_config decorator. Functions decorated with @using_config should take the config variable as their last argument. The with_config parser sets a config dictionary to be the bottom of the auxiliary stack.
Example: We have as input a number and a string. The string is returned in upper-case if the number is 1:
@using_config
def set_case(x, config):
    config["uppercase"] = (x == 1)
    return value(None)

@using_config
def get_text(config):
    if config["uppercase"]:
        return many_char(item, lambda x: x.decode().upper())
    else:
        return many_char(item, lambda x: x.decode())

assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'0hello') == "hello"
assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'1hello') == "HELLO"
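One way such a mechanism could work is sketched below. This is a toy model built on assumptions, not the library's implementation: parsers are plain functions of `(data, pos, aux)` where `aux` is a list standing in for the auxiliary stack with its bottom at index 0, and `rest_of_input`, `then` and `set_shout` are hypothetical helpers invented for the demo.

```python
import functools


def value(x):
    # Always succeeds with x, consuming nothing.
    return lambda data, pos, aux: (x, pos)


def rest_of_input(transform):
    # Demo-only parser: consume everything that is left.
    return lambda data, pos, aux: (transform(data[pos:]), len(data))


def using_config(f):
    # Supply the bottom of the auxiliary stack as the last argument of f.
    # The lookup is deferred until the parser actually runs.
    @functools.wraps(f)
    def wrapper(*args):
        def parser(data, pos, aux):
            return f(*args, aux[0])(data, pos, aux)
        return parser
    return wrapper


def with_config(p):
    # Run p with a fresh config dictionary at the bottom of the stack.
    return lambda data, pos, aux: p(data, pos, [{}] + aux)


def then(first, second):
    # Run two parsers in order, keep the result of the second.
    def parser(data, pos, aux):
        _, pos = first(data, pos, aux)
        return second(data, pos, aux)
    return parser


@using_config
def set_shout(flag, config):
    config["shout"] = flag
    return value(None)


@using_config
def get_text(config):
    if config["shout"]:
        return rest_of_input(lambda b: b.decode().upper())
    return rest_of_input(lambda b: b.decode())


loud = with_config(then(set_shout(True), get_text()))
quiet = with_config(then(set_shout(False), get_text()))
print(loud(b"hello", 0, []))   # ('HELLO', 5)
print(quiet(b"hello", 0, []))  # ('hello', 5)
```

The important design point in this sketch is laziness: calling `get_text()` does not read the config immediately. The body of the decorated function only runs during parsing, after `set_shout` has already written to the dictionary.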