Parsers

Parsing functions

The Parser dataclass contains a single field, func, which is expected to be a function with the following signature:

\[f: (Cursor, Aux) \longrightarrow (data, Cursor, Aux)\]

Wait a second! Why is the function returning something other than the data itself?

The reason is that, besides the parsed data, it is useful to return an updated Cursor and an updated Aux. Typically, the updated Cursor points past the consumed input, so only the remaining, unparsed content is left to read. This makes it easy to chain parsers, passing the output of one as the input of the next, and so on.

Let’s see an example:

In this example we will create a very simple parser that just reads the first letter of a text string.

[1]:
from byteparsing import Cursor
[2]:
# Create some data to be parsed
data = b"Hello world!"

# Initialize the Cursor
c = Cursor(data, begin=0, end=0)

# Use an empty auxiliary variable
a = []

# Create a parsing function
def read_one(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

The snippet below shows why returning the updated Cursor is convenient: feeding it back into read_one walks through the data one character at a time.

[3]:
while c:
    x, c, a = read_one(c, a)
    print(x)
H
e
l
l
o

w
o
r
l
d
!
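
The loop above depends on three Cursor behaviours: increment extends the selection, flush moves begin up to end, and the Cursor becomes falsy once the data is exhausted. A minimal sketch of those behaviours follows. ToyCursor is a hypothetical stand-in inferred from the output shown here, not byteparsing's actual implementation, and its truthiness rule in particular is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCursor:
    """Hypothetical stand-in for byteparsing's Cursor (not the real class)."""
    data: bytes
    begin: int = 0
    end: int = 0
    encoding: str = "utf-8"

    def __bool__(self):
        # Assumed rule: truthy while there is unread data past `end`
        return self.end < len(self.data)

    @property
    def content_str(self):
        # Text currently selected between `begin` and `end`
        return self.data[self.begin:self.end].decode(self.encoding)

    def increment(self, n=1):
        # Return a new cursor whose selection is n bytes longer
        return ToyCursor(self.data, self.begin, self.end + n, self.encoding)

    def flush(self):
        # Return a new cursor with an empty selection starting at `end`
        return ToyCursor(self.data, self.end, self.end, self.encoding)


def read_one(c, a):
    c = c.increment()
    x = c.content_str
    return x, c.flush(), a


c = ToyCursor(b"Hi!")
letters = []
while c:
    x, c, _ = read_one(c, [])
    letters.append(x)
print(letters)  # ['H', 'i', '!']
```

Because every method returns a new cursor instead of modifying the old one, a parsing function can hand back the updated cursor without side effects on the caller's copy.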

We can even write a new parsing function, based on the previous one, that parses to the end of the string:

[4]:
def read_all(c, a):
    x = [] # Initialize as empty list
    while c:
        temp, c, a = read_one(c, a)
        x.append(temp)
    return x, c, a

Let’s try it:

[5]:
# Restart the Cursor
c = Cursor(data, begin=0, end=0)

x, c, a = read_all(c, a)
print(x)
print(c)
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Cursor(data=b'Hello world!', begin=12, end=12, encoding='utf-8')

Notice that the composition of two parsing functions (say, \(f\) and \(g\)) is slightly more complicated than \(f \circ g\), because the input and output spaces differ: a parsing function takes a pair (Cursor, Aux) but returns a triple (data, Cursor, Aux).
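
One way to picture such a composition is a small combinator that threads the Cursor and Aux from the first function into the second and collects both results. The sketch below is illustrative only: sequence and ToyCursor are hypothetical names introduced here, not part of byteparsing.

```python
from typing import NamedTuple


class ToyCursor(NamedTuple):
    """Hypothetical stand-in for Cursor: the data plus a read position."""
    data: bytes
    pos: int = 0


def read_one(c, a):
    # Parsing function: (cursor, aux) -> (value, cursor, aux)
    return c.data[c.pos:c.pos + 1].decode(), ToyCursor(c.data, c.pos + 1), a


def sequence(f, g):
    """Hypothetical combinator: run f, then run g on what f left behind."""
    def combined(c, a):
        x, c, a = f(c, a)   # f consumes input and updates cursor and aux
        y, c, a = g(c, a)   # g starts from f's cursor and aux, not the originals
        return (x, y), c, a
    return combined


read_two = sequence(read_one, read_one)
value, cursor, aux = read_two(ToyCursor(b"Hello world!"), [])
print(value)       # ('H', 'e')
print(cursor.pos)  # 2
```

The combined function has the same signature as its parts, so it can itself be passed to sequence again.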

For these and other reasons, it is advisable to manage parsing functions with a more flexible data structure.

For this purpose we introduce the Parser dataclass.

The Parser class

This section is a work in progress.

We’ll manage parsing functions using a Parser class. Parser is a dataclass that contains a single field, func, representing a parser function.

Let’s build a Parser instance from the read_one parsing function defined in the previous section:

[6]:
from byteparsing.trampoline import Parser, parser
[7]:
# Initialize the Cursor
c = Cursor(data, begin=0, end=0)

# Use an empty auxiliary variable
a = []

# Create a function
def read_one(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

read_one_p = Parser(read_one)

Note: the lines above are entirely equivalent to:

@parser
def read_one_p(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

Parsers are callable, but calling one only returns a Call object that describes the pending call. The parsed result is produced when that object is invoked:

[8]:
print(read_one_p(c, a)) # Without invoking
Call(p=<function read_one at 0x7fa3bb735b90>, cursor=Cursor(data=b'Hello world!', begin=0, end=0, encoding='utf-8'), aux=[])
[9]:
x, c, a = read_one_p(c, a).invoke()
print(x)
print(c)
H
Cursor(data=b'Hello world!', begin=1, end=1, encoding='utf-8')
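
The Call(...) output above suggests that calling a Parser only packages the function together with its arguments, and that invoke() does the actual work. A minimal sketch of that deferred-call pattern follows. ToyParser and ToyCall are hypothetical classes written for illustration, and the loop inside invoke is an assumption about how chained calls might be resolved, not byteparsing's own code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyCall:
    """Hypothetical deferred call: stores a function and its arguments."""
    p: Callable
    cursor: Any
    aux: Any

    def invoke(self):
        # Run the stored call; keep going while the result is another deferred call
        result = self.p(self.cursor, self.aux)
        while isinstance(result, ToyCall):
            result = result.p(result.cursor, result.aux)
        return result


@dataclass
class ToyParser:
    """Hypothetical stand-in for Parser: wraps a parsing function."""
    func: Callable

    def __call__(self, cursor, aux):
        # Calling the parser does no parsing yet; it only records the request
        return ToyCall(self.func, cursor, aux)


@ToyParser
def first_byte(cursor, aux):
    # Here the "cursor" is simply (data, position), to keep the sketch short
    data, pos = cursor
    return data[pos:pos + 1].decode(), (data, pos + 1), aux


pending = first_byte((b"Hello world!", 0), [])
print(type(pending).__name__)  # ToyCall
x, c, a = pending.invoke()
print(x, c[1])                 # H 1
```

Deferring the call this way lets a driver loop run long chains of parsers iteratively instead of through nested function calls.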