Parser grammar
Primitives
The boundary between what we consider primitives and derived parsers can be a bit vague; nevertheless, here is a selection of the most important primitive parsers.
value(x)
Always succeeds, doesn't consume input, returns x.
fail(msg)
Always fails, raises an exception with msg as text.
item
Get a single byte from the stream.
text_literal(str)
Succeeds if the next characters in the stream exactly match str.
char_pred(pred)
Advances the end of the cursor if pred succeeds.
text_end_by(char)
Advances the end of the cursor until char is found.
push(x)
Push a value on the auxiliary stack.
pop()
Pop a value from the auxiliary stack.
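To give a feel for how these primitives are used on their own, here is a minimal sketch. It assumes only the descriptions above and that parse_bytes runs a parser on a bytes object and returns its result, as in the examples later in this paper:
# value(x) never consumes input and simply yields x.
assert parse_bytes(value(42), b"") == 42
# item takes a single byte from the stream; text_literal demands an exact match.
parse_bytes(item, b"abc")                    # consumes the byte b"a"
parse_bytes(text_literal("abc"), b"abcdef")  # succeeds; mismatching input would raise
# fail(msg) always raises an exception carrying msg as its text.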
We also defined some derived parsers that should be useful in most contexts.
whitespace
Matches tabs, spaces and newlines.
eol
Matches End of Line characters (i.e. either \n or \n\r).
integer
Matches an integer value.
scientific_number
Matches a floating point number, possibly in scientific notation.
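As a small, hedged illustration of these derived parsers (assuming their return values follow the descriptions above):
assert parse_bytes(integer, b"42") == 42
assert parse_bytes(scientific_number, b"6.02e23") == 6.02e23
# whitespace and eol are mostly used inside larger parsers,
# for instance through tokenize in the combinator examples below.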
Combinators
The next question is: how can we combine our primitive parsers? We already listed the main combinators briefly; here we go into a little more detail.
choice(*p)
Tries every parser p in sequence until one succeeds. If all fail, choice gathers all exceptions and composes an error message from them.
sequence(*p)
Runs every parser p in sequence and only returns the result of the last one.
named_sequence(**p)
Runs every parser p in sequence and stores results in a dictionary. Keys that start with an underscore are not stored.
many(p)
Runs the parser p until it fails. Returns a list of parsed items.
some(p)
Parses p at least once, or fails.
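A short sketch of how these combinators compose, again assuming parse_bytes as used later in this paper:
# sequence returns only the result of its last parser.
assert parse_bytes(sequence(text_literal("x="), integer), b"x=42") == 42
# choice tries its alternatives in order until one succeeds.
parse_bytes(choice(integer, text_literal("none")), b"none")  # falls through to text_literal
# many collects results in a list until the parser fails;
# some is identical but demands at least one match.
parse_bytes(many(integer), b"7")  # should give [7]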
The many and some combinators come in several flavours. Both have a variant called many_char and some_char that return a string instead of a list. One more flavour is many_char_0 and some_char_0, which do not flush the cursor.
Some derived combinators help us shape a little language to describe grammars.
optional(p, default=None)
Parses p or gives the default value.
tokenize(p)
Parses p followed by optional whitespace. This makes sure we always start at the next token.
fmap(f)
Takes a function f and returns a lambda that maps an argument through f to a value parser. That sounds complicated, but it allows us to pass a parsed result through f using the >> operator. For an example, see the PPM parser at the end of this paper.
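To see how these read in practice, here is a hedged sketch, assuming >> chains a parsed result into the function returned by fmap as described above:
# fmap wraps a plain function so a parsed result can be piped through it.
doubled = integer >> fmap(lambda x: 2 * x)
assert parse_bytes(doubled, b"21") == 42
# optional falls back to its default instead of failing.
assert parse_bytes(optional(integer, default=0), b"") == 0
# tokenize eats trailing whitespace, so the next parser starts at a token.
assert parse_bytes(sequence(tokenize(integer), integer), b"1  2") == 2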
named_sequence and construct
The named_sequence combinator forms a particularly useful pair with the construct function. Used on its own, named_sequence creates a dictionary. Many times when we're parsing, we want our results to form some class. The construct function takes a dictionary and constructs an object by forwarding the dictionary as keyword arguments.
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

point = named_sequence(
    _1=tokenize(char("(")),
    x=tokenize(scientific_number),
    _2=tokenize(char(",")),
    y=tokenize(scientific_number),
    _3=tokenize(char(")"))
) >> construct(Point)
The point parser then constructs Point objects, such that parse_bytes(point, b"(1, 2)") gives Point(x=1, y=2) as output.
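Because parsers compose freely, the point parser can be reused directly. As a sketch, assuming many and tokenize behave as described above:
# A whitespace-separated list of points, e.g. b"(1, 2) (3, 4)".
points = many(tokenize(point))
# parse_bytes(points, b"(1, 2) (3, 4)") should give
# [Point(x=1, y=2), Point(x=3, y=4)]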
using_config and with_config
We may use the auxiliary stack to store a config variable that can be accessed from any parser. To make this usage a bit more user-friendly, we define two functions: with_config() and the @using_config decorator. Functions decorated with @using_config should take the config variable as their last argument. The with_config parser sets a config dictionary as the bottom of the auxiliary stack.
Example: We have as input a number and a string. The string is returned in upper-case if the number is 1:
@using_config
def set_case(x, config):
    config["uppercase"] = (x == 1)
    return value(None)

@using_config
def get_text(config):
    if config["uppercase"]:
        return many_char(item, lambda x: x.decode().upper())
    else:
        return many_char(item, lambda x: x.decode())
assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'0hello') == "hello"

assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'1hello') == "HELLO"