Parsers
Parsing functions
The dataclass Parser contains a single field, func. func is expected to be a function that takes a Cursor and an Aux and returns three values: the parsed data, an updated Cursor, and an updated Aux.
Wait a second! Why is the function returning something other than the data itself?
The reason is that, apart from the obvious output of data, it is very convenient to return an updated Cursor and an updated Aux. Typically, the updated Cursor will contain only the remaining, non-parsed content of the data. This is what lets us chain different parsers, passing the output of the first to the next one, and so on.
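The shape described above can be written down as a type alias. This is only a rough sketch under stated assumptions: ToyCursor, ToyAux, ParsingFunction and take_one are hypothetical names, and plain bytes stand in for the real Cursor so that the snippet runs without byteparsing installed.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical stand-ins for illustration only: plain bytes play the role of
# the Cursor and a list plays the role of Aux. The real types live in byteparsing.
ToyCursor = bytes
ToyAux = List[Any]

# A parsing function maps (cursor, aux) to (data, updated cursor, updated aux).
ParsingFunction = Callable[[ToyCursor, ToyAux], Tuple[Any, ToyCursor, ToyAux]]


def take_one(cursor: ToyCursor, aux: ToyAux) -> Tuple[str, ToyCursor, ToyAux]:
    """Return the first character, the unparsed remainder, and aux unchanged."""
    return cursor[:1].decode(), cursor[1:], aux


value, rest, aux = take_one(b"Hi", [])
```

The second returned value holds only what is left to parse, which is the property the rest of this section relies on.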
Let’s see an example:
In this example we will create a very simple parser that just reads the first character of a byte string.
[1]:
from byteparsing import Cursor
[2]:
# Create some data to be parsed
data = b"Hello world!"
# Initialize the Cursor
c = Cursor(data, begin=0, end=0)
# Use an empty auxiliary variable
a = []
# Create a parsing function
def read_one(c, a):
    c = c.increment()  # Increase end index by one
    x = c.content_str  # Read content
    c = c.flush()      # Flush (i.e.: move begin to end)
    return x, c, a
In the snippet below we see why it is convenient to use the updated Cursor.
[3]:
while c:
    x, c, a = read_one(c, a)
    print(x)
H
e
l
l
o
 
w
o
r
l
d
!
We can even write a new parsing function, based on the previous one, that parses to the end of the string:
[4]:
def read_all(c, a):
    x = []  # Initialize as empty list
    while c:
        temp, c, a = read_one(c, a)
        x.append(temp)
    return x, c, a
Let’s try it:
[5]:
# Restart the Cursor
c = Cursor(data, begin=0, end=0)
x, c, a = read_all(c, a)
print(x)
print(c)
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Cursor(data=b'Hello world!', begin=12, end=12, encoding='utf-8')
Notice that the composition of two parsing functions (say, \(f\) and \(g\)) is slightly more complicated than \(f \circ g\): a parsing function takes a Cursor and an Aux but returns the parsed data together with an updated Cursor and an updated Aux, so the output of one cannot be fed directly into the next.
For these and other reasons, it is advisable to manage parsing functions with a more flexible data structure: the Parser dataclass.
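To make the point about composition concrete, here is a minimal sketch of how two parsing functions could be run one after the other by hand. The helper sequence and the function take_one are hypothetical names used only for illustration, and plain bytes stand in for the Cursor; this is not how byteparsing itself combines parsers.

```python
def take_one(cursor, aux):
    """Toy parsing function: first character, remaining bytes, aux unchanged."""
    return cursor[:1].decode(), cursor[1:], aux


def sequence(f, g):
    """Run f, then run g on the cursor and aux that f returned."""
    def combined(cursor, aux):
        x, cursor, aux = f(cursor, aux)  # f consumes the start of the input
        y, cursor, aux = g(cursor, aux)  # g continues where f stopped
        return (x, y), cursor, aux       # both results, plus the final state
    return combined


take_two = sequence(take_one, take_one)
result = take_two(b"Hello", [])
```

The data value of f has to be set aside while its Cursor and Aux are handed to g, which is why a plain \(f \circ g\) does not type-check.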
The Parser class
This section is work in progress.
We’ll manage parsing functions using a Parser class. Parser is a dataclass that contains a single field, func, representing a parsing function.
Let’s build a Parser instance from the read_one parsing function defined in the previous section:
[6]:
from byteparsing.trampoline import Parser, parser
[7]:
# Initialize the Cursor
c = Cursor(data, begin=0, end=0)
# Use an empty auxiliary variable
a = []
# Create a function
def read_one(c, a):
    c = c.increment()  # Increase end index by one
    x = c.content_str  # Read content
    c = c.flush()      # Flush (i.e.: move begin to end)
    return x, c, a

# Wrap the function in a Parser
read_one_p = Parser(read_one)
Note: the lines above are entirely equivalent to:
@parser
def read_one_p(c, a):
    c = c.increment()  # Increase end index by one
    x = c.content_str  # Read content
    c = c.flush()      # Flush (i.e.: move begin to end)
    return x, c, a
Parsers are callable, but they don’t return anything informative until they are invoked:
[8]:
print(read_one_p(c, a))  # Without invoking
Call(p=<function read_one at 0x7fa3bb735b90>, cursor=Cursor(data=b'Hello world!', begin=0, end=0, encoding='utf-8'), aux=[])
[9]:
x, c, a = read_one_p(c, a).invoke()
print(x)
print(c)
H
Cursor(data=b'Hello world!', begin=1, end=1, encoding='utf-8')
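The deferred behaviour shown above (calling a parser yields a Call object, and only invoke() produces a result) can be imitated with a small toy. The sketch below is an assumption about the general idea suggested by the module name trampoline, not byteparsing's actual implementation; ToyParser, ToyCall and take_one are hypothetical names, and plain bytes stand in for the Cursor.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyCall:
    """A pending call: remembers what to run, but has not run it yet."""
    func: Callable
    cursor: Any
    aux: Any

    def invoke(self):
        # Run the pending call; keep going while the result is itself pending.
        result = self.func(self.cursor, self.aux)
        while isinstance(result, ToyCall):
            result = result.func(result.cursor, result.aux)
        return result


@dataclass
class ToyParser:
    """Wraps a parsing function; calling it only records the arguments."""
    func: Callable

    def __call__(self, cursor, aux):
        return ToyCall(self.func, cursor, aux)


def take_one(cursor, aux):
    """Toy parsing function: first character, remaining bytes, aux unchanged."""
    return cursor[:1].decode(), cursor[1:], aux


take_one_p = ToyParser(take_one)
pending = take_one_p(b"Hello", [])   # nothing has been parsed yet
value, rest, aux = pending.invoke()  # now the function actually runs
```

Looping inside invoke() instead of recursing is the usual motivation for a trampoline: long chains of parsers do not grow the Python call stack.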