Parser grammar
Primitives
The boundary between what we consider primitives and derived parsers can be a bit vague; nevertheless, here is a selection of the most important primitive parsers.
value(x)
Always succeeds, doesn't consume input, returns x.
fail(msg)
Always fails, raises an exception with msg as text.
item
Get a single byte from the stream.
text_literal(str)
Succeeds if the next characters in the stream exactly match str.
char_pred(pred)
Advances the end of the cursor if pred succeeds.
text_end_by(char)
Advances the end of the cursor until char is found.
push(x)
Push a value on the auxiliary stack.
pop()
Pop a value from the auxiliary stack.
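To give a feel for how these primitives are used on their own, here is a minimal sketch. It assumes only the descriptions above and that parse_bytes runs a parser on a bytes object and returns its result, as in the examples later in this paper:
# value(x) never consumes input and simply yields x.
assert parse_bytes(value(42), b"") == 42
# item takes a single byte from the stream; text_literal demands an exact match.
parse_bytes(item, b"abc")                    # consumes the byte b"a"
parse_bytes(text_literal("abc"), b"abcdef")  # succeeds; mismatching input would raise
# fail(msg) always raises an exception carrying msg as its text.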
We also defined some derived parsers that should be useful in most contexts.
whitespace
Matches tabs, spaces and newlines.
eol
Matches End of Line characters (i.e. either \n or \n\r).
integer
Matches an integer value.
scientific_number
Matches a floating point number, possibly in scientific notation.
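As a small, hedged illustration of these derived parsers (assuming their return values follow the descriptions above):
assert parse_bytes(integer, b"42") == 42
assert parse_bytes(scientific_number, b"6.02e23") == 6.02e23
# whitespace and eol are mostly used inside larger parsers,
# for instance through tokenize in the combinator examples below.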
Combinators
The next question is: how can we combine our primitive parsers? We already listed the main combinators briefly; here we go into a little more detail.
choice(*p)
Tries every parser p in sequence until one succeeds. If all fail, choice gathers all exceptions and composes an error message from them.
sequence(*p)
Runs every parser p in sequence and only returns the result of the last one.
named_sequence(**p)
Runs every parser p in sequence and stores results in a dictionary. Keys that start with an underscore are not stored.
many(p)
Runs the parser p until it fails. Returns a list of parsed items.
some(p)
Parses p at least once, or fails.
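A short sketch of how these combinators compose, again assuming parse_bytes as used later in this paper:
# sequence returns only the result of its last parser.
assert parse_bytes(sequence(text_literal("x="), integer), b"x=42") == 42
# choice tries its alternatives in order until one succeeds.
parse_bytes(choice(integer, text_literal("none")), b"none")  # falls through to text_literal
# many collects results in a list until the parser fails;
# some is identical but demands at least one match.
parse_bytes(many(integer), b"7")  # should give [7]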
The many and some combinators come in several flavours. Both have a variant called many_char and some_char that return a string instead of a list. One more flavour is many_char_0 and some_char_0, which do not flush the cursor.
Some derived combinators help us shape a little language to describe grammars.
optional(p, default=None)
Parses p or gives the default value.
tokenize(p)
Parses p followed by optional whitespace. This makes sure we always start at the next token.
fmap(f)
Takes a function f and returns a lambda that maps an argument through f to a value parser. That sounds complicated, but it allows us to pass a parsed result through f using the >> operator. For an example, see the PPM parser at the end of this paper.
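To see how these read in practice, here is a hedged sketch, assuming >> chains a parsed result into the function returned by fmap as described above:
# fmap wraps a plain function so a parsed result can be piped through it.
doubled = integer >> fmap(lambda x: 2 * x)
assert parse_bytes(doubled, b"21") == 42
# optional falls back to its default instead of failing.
assert parse_bytes(optional(integer, default=0), b"") == 0
# tokenize eats trailing whitespace, so the next parser starts at a token.
assert parse_bytes(sequence(tokenize(integer), integer), b"1  2") == 2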
named_sequence and construct
The named_sequence combinator forms a particularly useful pair with the construct function. Used on its own, named_sequence creates a dictionary. Many times when we're parsing, we want our results to form some class. The construct function takes a dictionary and constructs an object by forwarding the dictionary as keyword arguments.
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

point = named_sequence(
    _1=tokenize(char("(")),
    x=tokenize(scientific_number),
    _2=tokenize(char(",")),
    y=tokenize(scientific_number),
    _3=tokenize(char(")"))
) >> construct(Point)
The point parser then constructs Point objects, such that parse_bytes(point, b"(1, 2)") gives Point(x=1, y=2) as output.
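Because parsers compose freely, the point parser can be reused directly. As a sketch, assuming many and tokenize behave as described above:
# A whitespace-separated list of points, e.g. b"(1, 2) (3, 4)".
points = many(tokenize(point))
# parse_bytes(points, b"(1, 2) (3, 4)") should give
# [Point(x=1, y=2), Point(x=3, y=4)]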
using_config and with_config
We may use the auxiliary stack to store a config variable that can be accessed from any parser. To make this usage a bit more user-friendly, we define two functions: with_config() and the @using_config decorator. Functions decorated with @using_config should take the config variable as their last argument. The with_config parser sets a config dictionary as the bottom of the auxiliary stack.
Example: We have as input a number and a string. The string is returned in upper-case if the number is 1:
@using_config
def set_case(x, config):
    config["uppercase"] = (x == 1)
    return value(None)

@using_config
def get_text(config):
    if config["uppercase"]:
        return many_char(item, lambda x: x.decode().upper())
    else:
        return many_char(item, lambda x: x.decode())
assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'0hello') == "hello"

assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'1hello') == "HELLO"