Architecture

Cursors

The most basic element of this package is the Cursor. Roughly, a Cursor represents some data plus a selection within it.

Let’s build our first Cursor:

[1]:
from byteparsing import Cursor
[2]:
data = b"Hello world!"

c = Cursor(data, begin=0, end=3)
print(c)
Cursor(data=b'Hello world!', begin=0, end=3, encoding='utf-8')

The Cursor is implemented as a dataclass: it has fields (data, begin, end, and encoding) as well as methods.

For instance, content returns the selected data (i.e. the bytes between begin and end):

[3]:
c.content # This method is decorated as a property, so parentheses are not needed
[3]:
b'Hel'

The method increment returns a new cursor where end has been increased (by default, to end + 1).

[4]:
c = Cursor(data, begin=0, end=3)
print(c)
ci = c.increment()
print(ci)
Cursor(data=b'Hello world!', begin=0, end=3, encoding='utf-8')
Cursor(data=b'Hello world!', begin=0, end=4, encoding='utf-8')
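
To make the ideas so far concrete, here is a minimal sketch of how a dataclass with these fields, a content property, and an increment method could look. SketchCursor is a hypothetical stand-in written for this illustration, not the package's implementation; the frozen=True option and the n parameter are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchCursor:
    """Hypothetical stand-in for Cursor; not the byteparsing implementation."""
    data: bytes
    begin: int = 0
    end: int = 0
    encoding: str = "utf-8"

    @property
    def content(self) -> bytes:
        # The selection: the bytes between begin and end
        return self.data[self.begin:self.end]

    def increment(self, n: int = 1) -> "SketchCursor":
        # Return a new cursor; the original is left unchanged
        return replace(self, end=self.end + n)


c = SketchCursor(b"Hello world!", begin=0, end=3)
ci = c.increment()
print(c.content, ci.content)  # b'Hel' b'Hell'
```

Because increment builds a new object, c still selects b'Hel' after the call, which matches the two printed cursors above.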

Cursors can also be evaluated as booleans. Specifically, a Cursor is True if and only if end has not yet reached the end of the data.

[5]:
cT = Cursor(data, begin=0, end=0)
cF = Cursor(data, begin=0, end=len(data))

assert cT
assert not cF

Wait a second. Why would we want a Cursor to be True or False? Because it makes looping “to the end of the data” very convenient.

See for instance the loop below:

[6]:
c = Cursor(data, begin=0, end=0)
while c:
    c = c.increment()
    print(c.content)
b'H'
b'He'
b'Hel'
b'Hell'
b'Hello'
b'Hello '
b'Hello w'
b'Hello wo'
b'Hello wor'
b'Hello worl'
b'Hello world'
b'Hello world!'
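
In Python, this kind of truthiness is usually provided by the `__bool__` special method. The sketch below shows the idea with a hypothetical BoolCursor class that only models the rule stated above; the comparison used here (end < len(data)) is an assumption about how the rule could be expressed, not the package's code.

```python
from dataclasses import dataclass


@dataclass
class BoolCursor:
    """Hypothetical stand-in; only models the truthiness rule."""
    data: bytes
    end: int = 0

    def __bool__(self) -> bool:
        # Truthy while end has not reached the end of the data
        return self.end < len(self.data)


c = BoolCursor(b"abc")
steps = 0
while c:  # the while statement calls BoolCursor.__bool__
    c = BoolCursor(c.data, c.end + 1)
    steps += 1
print(steps)  # 3
```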

Parsers

[7]:
from byteparsing import Cursor
[8]:
# Create some data to be parsed
data = b"Hello world!"

# Initialize the Cursor
c = Cursor(data, begin=0, end=0)

# Use an empty auxiliary variable
a = []

# Create a parsing function
def read_one(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

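The function returns the parsed value x together with the updated Cursor and the auxiliary variable. The snippet below shows why returning the updated Cursor is convenient: each call resumes where the previous one stopped.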

[9]:
while c:
    x, c, a = read_one(c, a)
    print(x)
H
e
l
l
o

w
o
r
l
d
!

We can even write a new parsing function, based on the previous one, that parses to the end of the string:

[10]:
def read_all(c, a):
    x = [] # Initialize as empty list
    while c:
        temp, c, a = read_one(c, a)
        x.append(temp)
    return x, c, a

Let’s try it:

[11]:
# Restart the Cursor
c = Cursor(data, begin=0, end=0)

x, c, a = read_all(c, a)
print(x)
print(c)
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Cursor(data=b'Hello world!', begin=12, end=12, encoding='utf-8')

Notice that the composition of two parsing functions (say, \(f\) and \(g\)) is slightly more complicated than \(f \circ g\), because the input and output spaces of a parsing function differ: it takes a cursor and an auxiliary variable, but returns a result in addition to the cursor and the auxiliary variable.
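
One way to picture this is a helper that runs one parsing function and then feeds the returned cursor and auxiliary variable (but not the result) into the next. The sequence helper below is hypothetical and written only for this illustration; to keep the example runnable without the package, the cursor is stood in by a plain (data, position) tuple.

```python
def sequence(f, g):
    """Hypothetical helper: run f, then g on the cursor and aux that f returned."""
    def combined(c, a):
        x1, c, a = f(c, a)  # f's cursor and aux feed into g
        x2, c, a = g(c, a)
        return (x1, x2), c, a
    return combined


# Stand-in parsing function; the cursor here is a plain (data, position) tuple
def read_one(c, a):
    data, pos = c
    return data[pos:pos + 1].decode("utf-8"), (data, pos + 1), a


read_two = sequence(read_one, read_one)
x, c, a = read_two((b"Hello world!", 0), [])
print(x, c[1])  # ('H', 'e') 2
```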

For these and other reasons, it is advisable to manage parsing functions with a more flexible data structure.

For this purpose we introduce the Parser dataclass.

The Parser class

This section is work in progress.

We’ll manage parsing functions using a Parser class. Parser is a dataclass that contains a single field, func, which holds a parsing function.

Let’s build a Parser instance from the read_one parsing function defined in the previous section:

[12]:
from byteparsing.trampoline import Parser, parser
[13]:
# Initialize the Cursor
c = Cursor(data, begin=0, end=0)

# Use an empty auxiliary variable
a = []

# Create a function
def read_one(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

read_one_p = Parser(read_one)

Note: the lines above are entirely equivalent to:

@parser
def read_one_p(c, a):
    c = c.increment() # Increase end index by one
    x = c.content_str # Read content
    c = c.flush() # Flush (i.e.: move begin to end)
    return x, c, a

Parsers are callable, but calling one only returns a Call object that records the function and its arguments. The parsing function itself runs when that object is invoked:

[14]:
print(read_one_p(c, a)) # Without invoking
Call(p=<function read_one at 0x7f4dbf560c10>, cursor=Cursor(data=b'Hello world!', begin=0, end=0, encoding='utf-8'), aux=[])
[15]:
x, c, a = read_one_p(c, a).invoke()
print(x)
print(c)
H
Cursor(data=b'Hello world!', begin=1, end=1, encoding='utf-8')
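
The deferred behaviour seen above can be sketched with two small dataclasses. SketchParser and SketchCall are hypothetical stand-ins that mirror the field names printed in the Call output (p, cursor, aux); they only illustrate the idea of recording a call and running it later, and the package's own classes may do more. As before, the cursor is stood in by a (data, position) tuple.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SketchCall:
    """Hypothetical stand-in for the Call object printed above."""
    p: Callable
    cursor: Any
    aux: Any

    def invoke(self):
        # Only now is the parsing function actually run
        return self.p(self.cursor, self.aux)


@dataclass
class SketchParser:
    """Hypothetical stand-in for Parser: wraps a parsing function."""
    func: Callable

    def __call__(self, cursor, aux):
        # Calling the parser records the arguments instead of running func
        return SketchCall(self.func, cursor, aux)


def read_one(c, a):
    data, pos = c  # cursor stand-in: a (data, position) tuple
    return data[pos:pos + 1].decode("utf-8"), (data, pos + 1), a


pending = SketchParser(read_one)((b"Hello world!", 0), [])
x, c, a = pending.invoke()
print(x, c[1])  # H 1
```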