{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Parser grammar\n", "\n", "### Primitives\n", "\n", "The boundary between what we consider *primitives* and derived parsers can be a bit vague; nevertheless, here is a selection of the most important primitive parsers.\n", "\n", "`value(x)`\n", ": Always succeeds, doesn't consume input, returns `x`.\n", "\n", "`fail(msg)`\n", ": Always fails, raises an exception with `msg` as text.\n", "\n", "`item`\n", ": Gets a single byte from the stream.\n", "\n", "`text_literal(str)`\n", ": Succeeds if the next characters in the stream exactly match `str`.\n", "\n", "`char_pred(pred)`\n", ": Advances the end of the cursor if the predicate `pred` holds for the next character.\n", "\n", "`text_end_by(char)`\n", ": Advances the end of the cursor until `char` is found.\n", "\n", "`push(x)`\n", ": Pushes a value onto the auxiliary stack.\n", "\n", "`pop()`\n", ": Pops a value from the auxiliary stack.\n", "\n", "We also defined some derived parsers that should be useful in most contexts.\n", "\n", "`whitespace`\n", ": Matches tabs, spaces and newlines.\n", "\n", "`eol`\n", ": Matches End of Line characters (_i.e._, either `\\n` or `\\n\\r`).\n", "\n", "`integer`\n", ": Matches an integer value.\n", "\n", "`scientific_number`\n", ": Matches a floating point number, possibly in scientific notation.\n", "\n", "### Combinators\n", "\n", "\n", "The next question is, how can we combine our primitive parsers? We already listed the main combinators briefly; here we go into a little more detail.\n", "\n", "`choice(*p)`\n", ": Tries every parser `p` in sequence until one succeeds. If all fail, `choice` gathers all exceptions and composes an error message from them.\n", "\n", "`sequence(*p)`\n", ": Runs every parser `p` in sequence and only returns the result of the last one.\n", "\n", "`named_sequence(**p)`\n", ": Runs every parser `p` in sequence and stores results in a dictionary. 
Keys that start with an underscore are not stored.\n", "\n", "`many(p)`\n", ": Runs the parser `p` repeatedly until it fails. Returns a list of the parsed items.\n", "\n", "`some(p)`\n", ": Parses `p` at least once, or fails.\n", "\n", "The `many` and `some` combinators come in several flavours. Each has a variant, `many_char` and `some_char` respectively, that returns a string instead of a list. A further flavour, `many_char_0` and `some_char_0`, does not flush the cursor.\n", "\n", "Some derived combinators help us shape a little language to describe grammars.\n", "\n", "`optional(p, default=None)`\n", ": Parses `p`, or returns the default value if `p` fails.\n", "\n", "`tokenize(p)`\n", ": Parses `p` followed by optional whitespace. This makes sure we always start at the next token.\n", "\n", "`fmap(f)`\n", ": Takes a function `f` and returns a lambda that maps an argument through `f` to a `value` parser. That sounds complicated, but it allows us to pass a parsed result through `f` using the `>>` operator. For an example, see the PPM parser at the end of this paper.\n", "\n", "### `named_sequence` and `construct`\n", "\n", "The `named_sequence` combinator forms a particularly useful pair with the `construct` function. Used on its own, `named_sequence` creates a dictionary. Often when parsing, however, we want the result to be an instance of some class. 
The `construct` function bridges the two: given a class, it turns the dictionary into an object by forwarding its entries as keyword arguments.\n", "\n", "```python\n", "from dataclasses import dataclass\n", "\n", "@dataclass\n", "class Point:\n", "    x: float\n", "    y: float\n", "```\n", "\n", "```python\n", "point = named_sequence(\n", "    _1=tokenize(char(\"(\")),\n", "    x=tokenize(scientific_number),\n", "    _2=tokenize(char(\",\")),\n", "    y=tokenize(scientific_number),\n", "    _3=tokenize(char(\")\"))\n", ") >> construct(Point)\n", "```\n", "\n", "The `point` parser then constructs `Point` objects, such that\n", "\n", "```python\n", "parse_bytes(point, b\"(1, 2)\")\n", "```\n", "\n", "gives `Point(x=1, y=2)` as output.\n", "\n", "### `using_config` and `with_config`\n", "\n", "We may use the auxiliary stack to store a config variable that can be accessed from any parser. To make this a bit more user-friendly, we define two functions: `with_config()` and the `@using_config` decorator. Functions decorated with `@using_config` should take `config` as their last argument. The `with_config` parser places a config dictionary at the bottom of the auxiliary stack.\n", "\n", "Example: the input is a number followed by a string. The string is returned in upper case if the number is 1:\n", "\n", "```python\n", "@using_config\n", "def set_case(x, config):\n", "    config[\"uppercase\"] = (x == 1)\n", "    return value(None)\n", "\n", "@using_config\n", "def get_text(config):\n", "    if config[\"uppercase\"]:\n", "        return many_char(item, lambda x: x.decode().upper())\n", "    else:\n", "        return many_char(item, lambda x: x.decode())\n", "\n", "assert parse_bytes(\n", "    with_config(sequence(integer >> set_case, get_text())),\n", "    b'0hello') == \"hello\"\n", "assert parse_bytes(\n", "    with_config(sequence(integer >> set_case, get_text())),\n", "    b'1hello') == \"HELLO\"\n", "```" ] } ], "metadata": { "language_info": { "name": "python" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }