Advanced examples

This section shows several examples of using the byteparsing package to handle files that combine ASCII and binary data.

Parsing PPM files

To show how ASCII and binary data can be mixed, we parse Portable PixMap (PPM) files. These files have a small ASCII header, followed by the image itself in binary. The header looks something like this:

P6   # this marks the file type in the Netpbm family
640 480
255
<<binary rgb values: 3*w*h bytes>>
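To make the layout concrete, here is a minimal sketch that assembles a tiny PPM file in memory with NumPy alone. The 2x2 test image and the names `pixels` and `ppm_bytes` are made up for illustration. It assumes a maximum sample value of 255, so that every colour sample fits in one byte.

```python
import numpy as np

# Hypothetical 2x2 test image: red, green / blue, white.
pixels = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8)
height, width, _ = pixels.shape

# ASCII header: magic number, width and height, maximum sample value.
header = f"P6\n{width} {height}\n255\n".encode("ascii")

# Binary payload: row-major RGB triples, one byte per sample.
ppm_bytes = header + pixels.tobytes()

print(len(header), "header bytes,", len(ppm_bytes) - len(header), "payload bytes")
```

The payload length is exactly 3 * width * height bytes, which is what lets a parser read the binary part once the header is known.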
[1]:
import numpy as np
from dataclasses import dataclass
from byteparsing import parse_bytes
from byteparsing.parsers import (
    text_literal, integer, eol, named_sequence, sequence, construct,
    tokenize, item, array,  fmap, text_end_by, optional)

The PPM header format allows for comments in the ASCII header.

[2]:
comment = sequence(text_literal("#"), text_end_by("\n"))

We define a class that should contain all the data in the header.

[3]:
@dataclass
class Header:
    width: int
    height: int
    maxint: int

Then we can construct a parser for this header, using named_sequence and construct.

[4]:
header = named_sequence(
    _1 = tokenize(text_literal("P6")),
    _2 = optional(comment),
    width = tokenize(integer),
    height = tokenize(integer),
    maxint = tokenize(integer)) >> construct(Header)

We’ll have to pass on the header information to the parser for the binary blob somehow, so we define a function.

[5]:
def image_bytes(header: Header):
    shape = (header.height, header.width, 3)
    size = header.height * header.width * 3
    return array(np.uint8, size) >> fmap(lambda a: a.reshape(shape))

ppm_image = header >> image_bytes
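For comparison, the same two-stage idea (read the ASCII header first, then use it to size the binary read) can be sketched without byteparsing, using only `re` and `np.frombuffer`. This is not how the package works internally. The sample bytes and the regular expression are assumptions made for illustration, and the pattern handles at most one comment directly after the magic number.

```python
import re
import numpy as np

# Hypothetical 2x1 image: one red and one blue pixel.
raw = b"P6 # sample comment\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])

# Magic number, an optional comment, three integers, one whitespace byte.
header_re = re.compile(rb"P6\s+(?:#[^\n]*\n\s*)?(\d+)\s+(\d+)\s+(\d+)\s")
match = header_re.match(raw)
width, height, maxval = (int(group) for group in match.groups())

# The header determines how many bytes to read and how to shape them.
shape = (height, width, 3)
size = height * width * 3
pixels = np.frombuffer(
    raw, dtype=np.uint8, count=size, offset=match.end()).reshape(shape)

print(pixels.shape)
```

The dependency of the second step on the result of the first is the part that `header >> image_bytes` expresses in the parser above.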

Let’s test this on a sample image, and ignore the fact that PIL has a perfectly good parser for PPM files itself.

[6]:
raw_data = open("python-logo.ppm", "rb").read()
image = parse_bytes(ppm_image, raw_data)

from PIL import Image
Image.frombytes(mode="RGB", size=(image.shape[1], image.shape[0]), data=image)
[6]:
[Figure: the parsed image rendered with PIL (_images/advanced_12_0.png)]

Parsing PLY files

PLY is a file format storing 3D polygonal data that has support for both ASCII and binary formats. Some references: Wikipedia entry and Paul Bourke’s pages. The specification of the PLY format is a bit vague and allows for liberal interpretation and custom user defined fields. We take a look at the example on Paul Bourke’s page:

ply
format ascii 1.0           { ascii/binary, format version number }
comment made by Greg Turk  { comments keyword specified, like all lines }
comment this file is a cube
element vertex 8           { define "vertex" element, 8 of them in file }
property float x           { vertex contains float "x" coordinate }
property float y           { y coordinate is also a vertex property }
property float z           { z coordinate, too }
element face 6             { there are 6 "face" elements in the file }
property list uchar int vertex_index { "vertex_indices" is a list of ints }
end_header                 { delimits the end of the header }
0 0 0                      { start of vertex list }
0 0 1
0 1 1
0 1 0
1 0 0
1 0 1
1 1 1
1 1 0
4 0 1 2 3                  { start of face list }
4 7 6 5 4
4 0 4 5 1
4 1 5 6 2
4 2 6 7 3
4 3 7 4 0

We see a header that is always ASCII encoded. The header specifies a list of elements, each containing a list of properties. The listed elements determine how to parse the data section of the PLY file. The data section may be encoded in ASCII or binary.
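As a rough illustration of that structure, the sketch below reads the example header with plain string handling. It is not the byteparsing approach. The helper `read_ply_header` is hypothetical, and the explanatory remarks in braces from the listing are left out because they are not part of an actual file.

```python
ply_header = """\
ply
format ascii 1.0
comment made by Greg Turk
comment this file is a cube
element vertex 8
property float x
property float y
property float z
element face 6
property list uchar int vertex_index
end_header
"""

def read_ply_header(text):
    """Hypothetical helper: return the format and the list of elements."""
    lines = text.splitlines()
    if not lines or lines[0] != "ply":
        raise ValueError("not a PLY header")
    fmt = None
    elements = []
    for line in lines[1:]:
        words = line.split()
        if not words or words[0] == "comment":
            continue
        if words[0] == "end_header":
            break
        if words[0] == "format":
            fmt = words[1]
        elif words[0] == "element":
            elements.append(
                {"name": words[1], "count": int(words[2]), "properties": []})
        elif words[0] == "property":
            # The last word is the property name, the rest is its type,
            # for example "float" or "list uchar int".
            elements[-1]["properties"].append(
                (words[-1], " ".join(words[1:-1])))
    return fmt, elements

fmt, elements = read_ply_header(ply_header)
for element in elements:
    print(element["name"], element["count"], element["properties"])
```

Each element then tells a data parser how many records to expect and which typed fields each record contains.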

[7]:
from __future__ import annotations
from typing import Optional
from dataclasses import dataclass
from functools import partial
import numpy as np
import pprint

from byteparsing import (parse_bytes, Parser, parser)
from byteparsing.parsers import (
    sequence, named_sequence, choice, optional, value, repeat_n,
    text_literal, char, text_one_of, text_end_by,
    byte_none_of, byte_one_of,
    flush, flush_decode,
    push, pop, char_pred,
    many, some, many_char, some_char, many_char_0, some_char_0,
    integer, scientific_number, array, binary_value,
    fmap, construct)

pp = pprint.PrettyPrinter(indent=2, width=80)

The Stanford Bunny

Now that we have the capability to parse binary PLY files, we can load the Stanford Bunny. For visualisation purposes it helps to know that all faces in this file are triangles.

[28]:
from pathlib import Path

bunny = parse_bytes(ply_file, Path("_static/stanford_bunny.ply").open(mode="rb").read())
[29]:
vertices = bunny["data"]["vertex"].view((np.float32, 3))
triangles = np.array([row["vertex_indices"] for row in bunny["data"]["face"]])
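The `.view((np.float32, 3))` call reinterprets a structured array as a plain two-dimensional array without copying. The sketch below shows the idea on made-up data, assuming each vertex record consists of exactly three `float32` fields. The names `vertex_dtype`, `records` and `coords` are illustrative only.

```python
import numpy as np

# Hypothetical vertex records with three float32 fields each.
vertex_dtype = np.dtype(
    [("x", np.float32), ("y", np.float32), ("z", np.float32)])
records = np.array([(0.0, 0.0, 1.0), (1.0, 0.5, 0.0)], dtype=vertex_dtype)

# Reinterpret each 12-byte record as a row of three float32 values.
coords = records.view((np.float32, 3))

print(coords.shape)
print(np.shares_memory(coords, records))
```

This works only because the fields share one type and are stored contiguously. Records with mixed field types would need a different conversion.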
[30]:
from mpl_toolkits import mplot3d
from matplotlib import pyplot as plt
# For interactive use: install ipympl and run:
# %matplotlib widget
[31]:
fig = plt.figure(figsize=(12,12))
ax = plt.axes(projection="3d")
ax.azim = -18
ax.elev = 15
ax.plot_trisurf(*vertices.T[[2,0,1]], triangles=triangles);
[Figure: triangulated surface plot of the Stanford Bunny (_images/advanced_47_0.png)]

Parsing memory-mapped OpenFOAM files

The final example is to read an OpenFOAM file as a memory mapped array. There are some details that need attention.

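import mmap
from pathlib import Path

import numpy as np
from byteparsing import parse_bytes
from byteparsing.openfoam import foam_file

# Open for reading and writing, so that changes can be written back to disk.
with Path("pipeFlow/1.0/U").open(mode="r+b") as f:
  with mmap.mmap(f.fileno(), 0) as mm:
    content = parse_bytes(foam_file, mm)
    result = content["data"]["internalField"]

    # <<do work ...>>

    # Drop all references into the memory map before it is closed.
    del result
    del content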

The content is returned in the form of a nested dictionary. The "internalField" item is a name that one often finds in OpenFOAM files. The result object is a NumPy ndarray created using an np.frombuffer call. Any mutations to the NumPy array are directly reflected on disk. Because the data is not copied into memory up front, even large files can be accessed with a small memory footprint.

The final two del statements are necessary to ensure that no reference to the memory-mapped data outlives the memory map itself, which is closed as soon as we leave the with mmap ... context.
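The interplay between `mmap`, `np.frombuffer` and `del` can be tried without an OpenFOAM file. The following sketch uses a small scratch file in a temporary directory. The file name and contents are made up for illustration.

```python
import mmap
import os
import tempfile

import numpy as np

# Create a scratch file holding four little-endian float64 zeros.
path = os.path.join(tempfile.mkdtemp(), "scratch.bin")
with open(path, "wb") as f:
    f.write(np.zeros(4, dtype="<f8").tobytes())

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        # The array shares its memory with the map, so no data is copied.
        arr = np.frombuffer(mm, dtype="<f8")
        arr[:] = [1.0, 2.0, 3.0, 4.0]
        mm.flush()
        # Release the buffer before the map is closed on leaving the block.
        del arr

# Read the file back through the normal route to confirm the change.
on_disk = np.fromfile(path, dtype="<f8")
print(on_disk)
```

In CPython, leaving out the `del arr` line makes closing the map fail with a `BufferError`, because the array still holds an exported buffer.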