API Documentation

Cursors

A Cursor object contains a reference to the buffer (bytes, bytearray or mmap), together with a begin and end pointer. While parsing, usually only the end pointer is updated. Certain parsers first lex the input, i.e. find where a token begins and ends, and then use some function to convert the selection to a useable object.

An immediate example: we may use the Python built-in float function to convert a string to a floating-point number. Such a routine should first flush the cursor, so that begin and end point to the same location. After passing over a run of numeric characters, a decimal point, an exponent indicator, etc., the part that we think represents a floating-point number can be passed to the float function. This saves us the bother of coding floating-point conversion manually.
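The flush-then-select idea described above can be sketched without the library. The MiniCursor class and lex_float function below are hypothetical stand-ins written for illustration; they only mimic the begin/end bookkeeping and are not the byteparsing implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniCursor:
    """Hypothetical stand-in for a cursor: a buffer plus begin/end offsets."""
    data: bytes
    begin: int = 0
    end: int = 0

    @property
    def content(self) -> bytes:
        # Bytes currently selected between begin and end.
        return self.data[self.begin:self.end]

    def flush(self) -> "MiniCursor":
        # New cursor with begin moved up to end (empty selection).
        return MiniCursor(self.data, self.end, self.end)

    def increment(self, n: int = 1) -> "MiniCursor":
        # New cursor with end moved forward by n bytes.
        return MiniCursor(self.data, self.begin, self.end + n)


def lex_float(cursor: MiniCursor):
    """Select a run of float-like characters, then convert it with float()."""
    cursor = cursor.flush()
    while (cursor.end < len(cursor.data)
           and cursor.data[cursor.end] in b"+-0123456789.eE"):
        cursor = cursor.increment()
    # float() accepts bytes, so the selection can be converted directly.
    return float(cursor.content), cursor.flush()


number, rest = lex_float(MiniCursor(b"6.02e23 mol"))
```

Only the end offset moves while lexing; the conversion happens once, on the whole selection, and the returned cursor is flushed so the next parser starts with an empty selection.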

byteparsing.cursor.Buffer: Type for the buffer. One of: bytes, bytearray, mmap.mmap.

class byteparsing.cursor.Cursor(data: Union[bytes, bytearray, mmap], begin: int = 0, end: int = 0, encoding: str = 'utf-8')

Encapsulates a byte string and two offsets to reference the input data.

property at: Next byte (at end location).

property content: Byte content of current selection.

property content_str: Decoded string content of current selection.

find(x: bytes): Get a cursor where the end position is shifted to the next location where x is found.

flush(): Creates new cursor where begin is flushed to end location.

static from_bytes(data): Constructs a Cursor object from a byte string. Initialises begin and end fields at 0.

increment(n: int = 1): Creates new cursor where end is incremented by n.

look_ahead(n: int = 1): Get the next n bytes.

Exceptions

A parser can indicate a failure to parse by raising a Failure. Most of the time such a Failure is not fatal, since it would be part of a range of parser choices. Say we want to parse a number that is either an int or a float. First we would try to parse using the int parser. If that fails we can try for a floating point number instead.
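A minimal sketch of this try-then-fall-back pattern, using plain functions on byte strings. MiniFailure, parse_int, parse_float and first_success are hypothetical names chosen for this illustration; they are not part of byteparsing.

```python
class MiniFailure(Exception):
    """Stand-in for a non-fatal parse failure (not the library's class)."""


def parse_int(text: bytes) -> int:
    if not text.isdigit():
        raise MiniFailure(f"expected an integer, got {text!r}")
    return int(text)


def parse_float(text: bytes) -> float:
    try:
        return float(text)
    except ValueError as err:
        raise MiniFailure(f"expected a float, got {text!r}") from err


def first_success(text: bytes, *parsers):
    """Try each parser in turn; fail only if every option fails."""
    failures = []
    for p in parsers:
        try:
            return p(text)
        except MiniFailure as failure:
            failures.append(failure)  # not fatal: move on to the next option
    raise MiniFailure(f"all options failed: {failures}")


as_int = first_success(b"42", parse_int, parse_float)
as_float = first_success(b"4.2", parse_int, parse_float)
```

A failure of parse_int is simply recorded and the next option is tried; an exception only escapes when no option succeeds.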

exception byteparsing.failure.EndOfInput: Raised when parser reaches end of input.

exception byteparsing.failure.Expected(x, *irritants): Raised to indicate different expectations by a parser.

exception byteparsing.failure.Failure(description): Base class for all parser failures. Indicates a failure to parse the input by a specific parser.

exception byteparsing.failure.MultipleFailures(*x): Raised by the choice parser if all options fail.

Parsers

Some general remarks:

tokenize

The tokenize function is an important tool to deal with whitespace. A tokenized parser first parses whatever it is supposed to, and then whitespace of some kind (could also include comments). Note however that tokenize only strips trailing whitespace.
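To make the trailing-only behaviour concrete, here is a small stand-alone sketch. The function tokenized_int is a hypothetical illustration that works on a byte string and an offset; it does not use the library.

```python
def tokenized_int(data: bytes, pos: int):
    """Parse digits at pos, then skip the whitespace that follows them."""
    end = pos
    while end < len(data) and data[end:end + 1].isdigit():
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at offset {pos}")
    number = int(data[pos:end])
    # Only whitespace *after* the token is consumed.
    while end < len(data) and data[end:end + 1].isspace():
        end += 1
    return number, end


first, pos = tokenized_int(b"12  34", 0)
second, pos = tokenized_int(b"12  34", pos)
```

Because leading whitespace is not skipped, an input such as b" 12" fails at offset 0 in this sketch; any leading whitespace has to be consumed separately before the first token.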

char variants of many and some

The parsers many and some take a parser and use it many times to generate a list of objects.

Say we have a tokenized parser that parses integers called integer, then:

>>> parse_bytes(many(integer), b"1 2 3 4")
[1, 2, 3, 4]

Now, we have a class of characters, ascii_alpha, that parses any Latin letter:

>>> parse_bytes(many(ascii_alpha), b"abcd")
[b'a', b'b', b'c', b'd']

Probably this is not what you wanted. In many cases what we want is not to parse a sequence of objects but rather allow for a range of characters to repeat and then read off the entire resulting string in one go. This is why we have many_char:

>>> parse_bytes(many_char(ascii_alpha), b"abcd efgh")
b"abcd"

How this works is that many_char flushes the cursor before running, then parses ascii_alpha until that fails. At that point the cursor is again flushed and the resulting content returned.

If you use the many_char parser as part of a larger parser that uses the cursor selection, you should use many_char_0. This does the same as many_char without flushing. As a consequence, many_char_0 only moves the cursor; it doesn’t return anything.
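The difference between collecting one object per match and reading off a whole selection can be sketched as follows. Both functions are hypothetical illustrations operating on a byte string and an offset, not the library's many or many_char.

```python
def is_ascii_alpha(byte: int) -> bool:
    return 65 <= byte <= 90 or 97 <= byte <= 122


def many_items(data: bytes, pos: int = 0):
    """Analogue of many: build a list with one element per matched byte."""
    items = []
    while pos < len(data) and is_ascii_alpha(data[pos]):
        items.append(data[pos:pos + 1])
        pos += 1
    return items, pos


def many_char_span(data: bytes, begin: int = 0):
    """Analogue of many_char: only move the end offset, then slice once."""
    end = begin
    while end < len(data) and is_ascii_alpha(data[end]):
        end += 1
    return data[begin:end], end


items, _ = many_items(b"abcd efgh")
word, end = many_char_span(b"abcd efgh")
```

Both stop at the space, but the second variant returns the selected bytes in one piece instead of a list of single-byte strings.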

Auxiliary stack

An auxiliary stack variable is threaded through to keep bits of information. A parser may push values to or pop values from this stack. The most common use case for this is to retrieve values from the middle of a sequence.

If we’re parsing a delimited list say (1 2 3 4) we can use sequence:

>>> parse_bytes(
...     sequence(char('('), many(integer), char(')')),
...     b"(1 2 3 4)")
b')'

What happened is that sequence returns the value of the last parser in the list. To get at the actual juice, we can push the important value and then pop it at the end.

>>> parse_bytes(
...     sequence(char('('), many(integer) >> push, char(')'), pop()),
...     b"(1 2 3 4)")
[1, 2, 3, 4]

This is not the prettiest thing, but it works.
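How a threaded stack makes this possible can be shown with a simplified model in which a parser is a function from (data, pos, aux) to (value, pos, aux). All names below are hypothetical re-implementations for illustration; the library's own parsers have a different internal structure.

```python
def char(c: bytes):
    def run(data, pos, aux):
        if data[pos:pos + 1] != c:
            raise ValueError(f"expected {c!r} at offset {pos}")
        return c, pos + 1, aux
    return run


def digits(data, pos, aux):
    end = pos
    while end < len(data) and data[end:end + 1].isdigit():
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at offset {pos}")
    return int(data[pos:end]), end, aux


def then_push(p):
    """Run p and store its result on the auxiliary stack (a tuple)."""
    def run(data, pos, aux):
        result, pos, aux = p(data, pos, aux)
        return None, pos, aux + (result,)
    return run


def pop(data, pos, aux):
    """Take the top value off the auxiliary stack without consuming input."""
    return aux[-1], pos, aux[:-1]


def sequence(*parsers):
    """Run all parsers in order; the result is that of the last one."""
    def run(data, pos, aux):
        result = None
        for p in parsers:
            result, pos, aux = p(data, pos, aux)
        return result, pos, aux
    return run


last, _, _ = sequence(char(b"("), digits, char(b")"))(b"(42)", 0, ())
inner, _, stack = sequence(
    char(b"("), then_push(digits), char(b")"), pop)(b"(42)", 0, ())
```

In the first call the result is the closing parenthesis; in the second, the pushed integer survives the intermediate steps and is returned by pop, leaving the stack empty again.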

Config variable

We may use the auxiliary stack to store a config variable that can be accessed from any parser. To make this use a bit more user-friendly, we define two functions: with_config() and the using_config() decorator.

Example

We have as input a number and a string. The string is returned in upper-case if the number is 1:

@using_config
def set_case(x, config):
    config["uppercase"] = (x == 1)
    return value(None)

@using_config
def get_text(config):
    if config["uppercase"]:
        return many_char(item, lambda x: x.decode().upper())
    else:
        return many_char(item, lambda x: x.decode())

assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'0hello') == "hello"
assert parse_bytes(
    with_config(sequence(integer >> set_case, get_text())),
    b'1hello') == "HELLO"

Parsers

byteparsing.parsers.byte_none_of(x: bytes) → Parser: Parses none of the characters in x.

byteparsing.parsers.char(c: Union[str, int]) → Parser: Parses a single character matching c.

byteparsing.parsers.char_pred(pred: Callable[[int], bool]) → Parser: Parses a single character passing a given predicate.

byteparsing.parsers.check_size(n: int) → Callable: Raises an exception if size is not n.

byteparsing.parsers.choice(*ps: Parser) → Parser: Parses using the first parser in ps that succeeds.

byteparsing.parsers.construct(f)

Construct an object f by passing a dictionary as keyword arguments. Use this in conjunction with named_sequence.

>>> @dataclass
... class Point:
...     x: float
...     y: float

>>> point = named_sequence(
...     _1=tokenize(char("(")),
...     x=tokenize(scientific_number),
...     _2=tokenize(char(",")),
...     y=tokenize(scientific_number),
...     _3=tokenize(char(")"))
...     ) >> construct(Point)

>>> parse_bytes(point, b"(1, 2)")
Point(x=1, y=2)

byteparsing.parsers.fail(msg: str) → Parser: A parser that always fails with the given message.

byteparsing.parsers.flush(transfer=<function <lambda>>): Flush the cursor and return the underlying data. The return value can be mapped by the optional transfer function.

byteparsing.parsers.flush_decode(): Flush the cursor and return the underlying data as a decoded string.

byteparsing.parsers.fmap(f)

Maps a parsed value by a function f.

>>> parse_bytes(text_literal("hello") >> fmap(lambda x: x.upper()), b"hello")
"HELLO"

byteparsing.parsers.get_aux(): Get the entire auxiliary stack. Not commonly used.

byteparsing.parsers.ignore(p: Parser): Runs the given parser, but doesn’t mutate the stack.

byteparsing.parsers.literal(x: bytes) → Parser: Parses the exact sequence of bytes given in x.

byteparsing.parsers.many(p: Parser, init: Optional[List[Any]] = None) → Parser: Parse p any number of times.

byteparsing.parsers.many_char(p: Parser, transfer=<function <lambda>>) → Parser: Parse p zero or more times, returns the string.

byteparsing.parsers.named_sequence(**kwargs: Parser) → Parser: Similar to sequence, this parses using all of the arguments in order. The result is now a dictionary where the elements are assigned using the result of each given parser.

byteparsing.parsers.optional(p: Parser, default=None): Doesn’t fail if the given parser fails, but returns a default value.

byteparsing.parsers.parse_bytes(p: Parser, data: Union[bytes, bytearray, mmap]): Calls parser p on data and returns the result.

byteparsing.parsers.pop(transfer=<function <lambda>>): Pops a value off the auxiliary stack. The result may be transformed by a transfer function, which defaults to the identity function.

byteparsing.parsers.push(x: Any): Pushes a value to the auxiliary stack.

byteparsing.parsers.quoted_string(quote='"'): Parses a quoted string, no quote escaping implemented yet.

byteparsing.parsers.sep_by(p: Parser, sep: Parser) → Parser: Parse p separated by sep. Returns list of p.

byteparsing.parsers.sequence(first: Parser, *rest: Parser) → Parser: Parse first, then sequence(*rest). The parser result is that of the last parser in the sequence.

byteparsing.parsers.set_aux(x: Any): Replace the entire auxiliary stack. Not commonly used.

byteparsing.parsers.some(p: Parser) → Parser: Parse p one or more times.

byteparsing.parsers.some_char(p: Parser, transfer=<function <lambda>>) → Parser: Parses p one or more times.

byteparsing.parsers.some_char_0(p: Parser) → Parser: Parses one or more characters; doesn’t return a value, just moves the cursor for later flushing.

byteparsing.parsers.text_literal(x: str) → Parser: Parses the contents of x encoded by the encoding given in the cursor.

byteparsing.parsers.text_one_of(x: str) → Parser: Parses any of the characters in x.

byteparsing.parsers.tokenize(p: Parser) → Parser: Parses p, then clears trailing whitespace.

byteparsing.parsers.using_config(f): Use this decorator to pass the config as a keyword argument to a parser generator.

byteparsing.parsers.value(x) → Parser: Parses to value x without taking input.

byteparsing.parsers.with_config(p: Parser, **kwargs) → Parser: Creates a config object at the bottom of the auxiliary stack. The config will be an empty dictionary. The resulting parser should be the outer-most parser being used.

byteparsing.array.array(dtype: dtype, size: int) → Parser: Reads the next sizeof(dtype) * size bytes from the cursor and interprets them as numeric binary data.

byteparsing.array.binary_value(dtype: dtype): Parses a single binary value of the given dtype.

Trampoline

The trampoline is a pattern for running recursive algorithms on the heap. A trampolined function returns either another Trampoline object or some result. If we get a Trampoline object, the loop continues. This technique allows for the expression of tail-recursive functions without running into stack overflow errors.
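The loop described above can be sketched generically. Bounce, run_trampoline and count_down are hypothetical names for this illustration; they show the pattern only and do not reproduce the library's Trampoline or Call classes.

```python
class Bounce:
    """Hypothetical delayed call: a function plus its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def run_trampoline(step):
    """Keep invoking delayed calls until a plain result comes back."""
    while isinstance(step, Bounce):
        step = step.func(*step.args)
    return step


def count_down(n: int, acc: int = 0):
    """Tail-recursive sum of 1..n, written to return a Bounce, not recurse."""
    if n == 0:
        return acc
    return Bounce(count_down, n - 1, acc + n)


# 100000 nested calls would exceed Python's default recursion limit;
# the trampoline runs them in a flat loop instead.
total = run_trampoline(Bounce(count_down, 100_000))
```

Each step returns immediately with a description of the next call, so the Python call stack never grows beyond a couple of frames regardless of n.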

class byteparsing.trampoline.Call(p: Callable[[...], Union[Tuple[Any, Cursor, Any], Trampoline]], cursor: Cursor, aux: Any): Stores a delayed call to a parser. Part of the parser trampoline.

class byteparsing.trampoline.Parser(func: Optional[Callable[[...], Union[Tuple[Any, Cursor, Any], Trampoline]]]): Wrapper for parser functions.

class byteparsing.trampoline.Trampoline

Base class implementing the trampoline loop structure.

invoke(): Invoke the trampoline.

byteparsing.trampoline.bind(p: Parser, f: Callable[[Any], Parser]) → Parser

Call parser p and feed result to function f, which should create a new parser. The Parser class defines the >> operator as an alias for bind.

Together with value this defines a monad on Parser.

If you are unfamiliar with monads, this function can be a bit hard to grasp. However, a tutorial on monads is outside the scope of this document.

The >> operator is one of the primary ways of composing parsers (the other being choice).
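For readers who prefer code to theory, the idea behind bind can be sketched with parsers modelled as plain functions from (data, pos) to (value, pos). This is a simplified, hypothetical model: it ignores the auxiliary state and the trampoline, and the helper take is invented for the example.

```python
def value(x):
    """Parser that yields x without consuming input."""
    return lambda data, pos: (x, pos)


def bind(p, f):
    """Run p, pass its result to f, then run the parser that f returns."""
    def run(data, pos):
        result, pos = p(data, pos)
        return f(result)(data, pos)
    return run


def take(n: int):
    """Hypothetical helper: parse exactly n bytes."""
    def run(data, pos):
        if pos + n > len(data):
            raise ValueError("end of input")
        return data[pos:pos + n], pos + n
    return run


# A length-prefixed field: the first byte decides how many bytes follow.
length_prefixed = bind(take(1), lambda head: take(head[0]))
payload, pos = length_prefixed(b"\x03abcdef", 0)

# value is the neutral element: binding it just applies the function.
same, _ = bind(value(2), take)(b"xyz", 0)
```

The point of bind is that the second parser may depend on what the first one produced, as with the length byte here; a fixed sequence of parsers cannot express that.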

byteparsing.trampoline.parser(f: Callable[[Cursor, Any], Union[Tuple[Any, Cursor, Any], Trampoline]])

The parser decorator creates a parser out of a function with the following signature:

@parser
def some_parser(cursor: Cursor, aux: Any) -> tuple[T, Cursor, Any]:
    pass

A parser takes a Cursor object and returns a parsed object together with the updated cursor. The aux object is used to pass around auxiliary state.

This decorator function is an alias for Parser.__init__.