Examples of usage

In this section we’ll show several examples of usage of the byteparsing package.

For now, we’ll import all the parsers.

[1]:
from byteparsing.parsers import *

Simple email address parser

An email address typically contains two pieces of information:

  • Username

  • Host

This information is easy to parse with the naked eye:

[username]@[host]

A parser, of course, has no eyes. Nor common sense. So we’ll need to use some explicit instructions. What about the following?

  1. Keep in mind that not all chars are valid for an email.

  2. The first email-valid chars constitute the user field. It should contain at least one char.

  3. After the user field we expect an “@”. We check that it is there, and we ignore it.

  4. The next email-valid chars after the “@” correspond to the host field. It should contain at least one char.

In the example below, you can see the implementation of this algorithm.

[2]:
# Step 1: define which characters are acceptable in an email (email-valid chars)
email_char = choice(ascii_alpha_num, ascii_underscore, text_literal("."), text_literal("-"))

# We abstract the information contained in an email as:
# [username]@[host]
email = named_sequence(
            user=some_char(email_char), # Step 2
            _1=text_literal("@"),       # Step 3
            host=some_char(email_char)  # Step 4
        )

# Notice that we ignore the "@" by assigning it to the field "_1".
# Why not use just "_"? Because we need these fields to be unique.
# In case we had more than one ignored value, we recommend using
# _1, _2, and so on for the ignored fields.
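
The effect of the underscore-prefixed field can be pictured with plain Python. The sketch below only illustrates the observable behaviour (fields whose name starts with "_" do not appear in the result); it is not byteparsing's actual implementation, and `raw_fields` and `drop_ignored` are hypothetical names introduced here. The uniqueness requirement follows from Python itself, which rejects a call that repeats a keyword argument name.

```python
# Hypothetical stand-in for the values matched by the three sub-parsers.
raw_fields = {
    "user": b"p.rodriguez-sanchez",
    "_1": b"@",
    "host": b"esciencecenter.nl",
}


def drop_ignored(fields: dict) -> dict:
    """Keep only the fields whose name does not start with an underscore."""
    return {
        name: value
        for name, value in fields.items()
        if not name.startswith("_")
    }


print(drop_ignored(raw_fields))
```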

Let’s apply the parser to a made-up email address and see if it works:

[3]:
parsed = parse_bytes(email, b'p.rodriguez-sanchez@esciencecenter.nl')

print(parsed)
{'user': b'p.rodriguez-sanchez', 'host': b'esciencecenter.nl'}

Notice that we used the parse_bytes function to actually apply the parser. We’ll use this function very often, so it is worth pausing for a moment to look at its structure. Typically, parse_bytes takes two arguments as input:

  1. A parser, indicating the kind of data we expect.

  2. The data itself.

The output will be the parsed data.
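
For comparison, the four steps listed earlier can also be written without byteparsing. The sketch below uses the standard library `re` module as a cross-check. It assumes that `ascii_alpha_num` covers ASCII letters and digits; `EMAIL_CHARS`, `EMAIL_RE` and `parse_email` are hypothetical names introduced for this illustration.

```python
import re

# Step 1: the email-valid chars (ASCII letters, digits, underscore, dot, hyphen).
EMAIL_CHARS = rb"[A-Za-z0-9_.\-]"

# Steps 2-4: one or more valid chars, a literal "@", one or more valid chars.
EMAIL_RE = re.compile(
    rb"(?P<user>" + EMAIL_CHARS + rb"+)@(?P<host>" + EMAIL_CHARS + rb"+)"
)


def parse_email(data: bytes) -> dict:
    """Return the user and host fields, or raise ValueError if data is malformed."""
    match = EMAIL_RE.fullmatch(data)
    if match is None:
        raise ValueError(f"not a valid email address: {data!r}")
    return match.groupdict()


print(parse_email(b"p.rodriguez-sanchez@esciencecenter.nl"))
```

The regular expression is compact for a format this small, but the combinator version composes more easily once the format grows, which is what the following sections rely on.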

Fancier email address parsers

More detailed fields

The information contained in an email address can be further dissected. For instance, the host information can be split into server and country code. That is:

[username]@[server].[country]

We can create a more detailed parser that splits strings wherever it finds a dot.

To do this, we first have to redefine the set of acceptable email chars so that it no longer includes the dot.

[8]:
email_char = choice(ascii_alpha_num, ascii_underscore, text_literal("-"))

Now, we can use the parser below to dissect email components.

[9]:
email_component = sep_by(some_char(email_char, bytes.decode), text_literal("."))

Let’s build the improved parser.

[10]:
better_email = named_sequence(
                user=email_component,
                _1=text_literal("@"),
                host=email_component
                )

And try it:

[11]:
my_email = parse_bytes(better_email,
            b"pablo.rodriguez-sanchez@esciencecenter.nl")

The output is a dictionary containing the dissected parts of the email.

[13]:
print(my_email)
{'user': ['pablo', 'rodriguez-sanchez'], 'host': ['esciencecenter', 'nl']}

Pro tip: construct a data class

We can use the dictionary to create an instance of a data class. As we will see, this will allow for maximum flexibility.

First, we create a data class representing an email address.

[14]:
from dataclasses import dataclass
from typing import List

@dataclass
class Email:
    user: List[str]
    host: List[str]

    @property
    def country(self):
        """Return the country code"""
        return self.host[-1]

    def __str__(self):
        """Prints the email in a human-readable fashion"""
        return ".".join(self.user) + "@" + ".".join(self.host)

The construct function pipes the parsed output directly into the class constructor:

[15]:
even_better_email = named_sequence(
                        user=email_component,
                        _1=text_literal("@"),
                        host=email_component
                    ) >> construct(Email)

Let’s try it:

[16]:
my_email = parse_bytes(even_better_email,
            b"pablo.rodriguez-sanchez@esciencecenter.nl")

The output is an instance of the class Email.

[17]:
my_email
[17]:
Email(user=['pablo', 'rodriguez-sanchez'], host=['esciencecenter', 'nl'])

We can of course use the class’ methods:

[18]:
str(my_email)
[18]:
'pablo.rodriguez-sanchez@esciencecenter.nl'
[19]:
my_email.country
[19]:
'nl'
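
Conceptually, this step resembles unpacking the parsed dictionary into the keyword arguments of the class constructor. The sketch below shows that idea with plain Python only; it is an assumption about the spirit of `construct`, not a copy of the library's code, and the dictionary is typed in by hand rather than produced by a parser.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Email:
    user: List[str]
    host: List[str]

    @property
    def country(self) -> str:
        """Return the country code (the last host component)."""
        return self.host[-1]


# The same dictionary that better_email produced earlier, written out by hand.
parsed = {"user": ["pablo", "rodriguez-sanchez"], "host": ["esciencecenter", "nl"]}

# Keyword unpacking maps each dictionary key onto the field of the same name.
my_email = Email(**parsed)
print(my_email.country)
```

This only works because the field names of the parser (user, host) match the attribute names of the data class exactly.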

Parse a list of emails

Imagine now we want to parse several email addresses from a file containing the information below. Notice that each email address is separated by an end-of-line char.

[22]:
data = b"j.hidding@esciencecenter.nl\np.rodriguez-sanchez@esciencencenter.nl"
print(data.decode())
j.hidding@esciencecenter.nl
p.rodriguez-sanchez@esciencencenter.nl

The following parser will be helpful for dealing with end-of-line chars, because they are encoded differently depending on the OS.

[23]:
eol = choice(text_literal("\n"), text_literal("\r\n"))
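
As a quick library-free illustration of why both variants matter: Unix-like systems usually end lines with `\n`, while Windows usually uses `\r\n`. The sketch below uses the standard `bytes.splitlines` method, which accepts either convention; the two sample byte strings are made up for this example.

```python
# The same two addresses, once with Unix and once with Windows line endings.
unix_data = b"a@b.nl\nc@d.nl"
windows_data = b"a@b.nl\r\nc@d.nl"

# splitlines() treats "\n" and "\r\n" (among others) as line boundaries,
# so both inputs yield the same list of lines.
unix_lines = unix_data.splitlines()
windows_lines = windows_data.splitlines()

print(unix_lines)
print(unix_lines == windows_lines)
```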

We can create a parser for a list of emails just by:

[24]:
list_of_emails = sep_by(even_better_email, eol)

Let’s try it:

[25]:
our_emails = parse_bytes(list_of_emails, data)

It returns a list of instances of the class Email.

[26]:
our_emails
[26]:
[Email(user=['j', 'hidding'], host=['esciencecenter', 'nl']),
 Email(user=['p', 'rodriguez-sanchez'], host=['esciencencenter', 'nl'])]

And once again, we can access the class’ methods:

[27]:
for email in our_emails:
    print(email)
    print(email.country)
j.hidding@esciencecenter.nl
nl
p.rodriguez-sanchez@esciencencenter.nl
nl

Parse a CSV

Imagine we have stored numerical information in a CSV file that uses a semicolon as separator. The file looks like this:

1;-2;3.14;-4
5;-6.2;7;-8.1
9;-10;11;-12

We can first create a parser for a single line.

The recipe is the following:

  1. Split the content (floats) by separator (“;”)

  2. Consume any end-of-line chars that follow (zero or more)

[28]:
csvline = sequence(
    sep_by(scientific_number, text_literal(";")) >> push,
    many(eol), # Consume any eol chars (zero or more). Don't store them
    pop()) # Return pushed content

Let’s try it:

[29]:
data = b"1;-2;3.14;-4\n"
parse_bytes(csvline, data)
[29]:
[1, -2, 3.14, -4]

A complete CSV just contains several lines like the one above. Our syntax makes this generalization remarkably simple:

[30]:
csv = some(csvline) # A csv contains at least one line

Let’s try it:

[31]:
data = b"1;-2;3;-4\n5;-6.2;7;-8.1\n9;-10;11;-12"

parse_bytes(csv, data)
[31]:
[[1, -2, 3, -4], [5, -6.2, 7, -8.1], [9, -10, 11, -12]]
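
As a cross-check that does not depend on byteparsing, the same recipe (split each line on ";" and convert every field to a number) can be sketched with the standard library alone. `parse_csv_line` and `parse_csv` are hypothetical helper names, and unlike the parser above this version converts every field to float.

```python
from typing import List


def parse_csv_line(line: bytes, sep: bytes = b";") -> List[float]:
    """Split one line on the separator and convert each field to a float."""
    return [float(field.decode("ascii")) for field in line.split(sep)]


def parse_csv(data: bytes) -> List[List[float]]:
    """Parse every non-empty line of the data, whatever its line ending."""
    return [parse_csv_line(line) for line in data.splitlines() if line]


data = b"1;-2;3;-4\n5;-6.2;7;-8.1\n9;-10;11;-12"
print(parse_csv(data))
```

The values compare equal to the byteparsing result shown above, since in Python `1 == 1.0`.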