Advanced example: parsing PLY files
PLY is a file format storing 3D polygonal data that has support for both ASCII and binary formats. Some references: Wikipedia entry and Paul Bourke’s pages. The specification of the PLY format is a bit vague and allows for liberal interpretation and custom user defined fields. We take a look at the example on Paul Bourke’s page:
ply
format ascii 1.0 { ascii/binary, format version number }
comment made by Greg Turk { comments keyword specified, like all lines }
comment this file is a cube
element vertex 8 { define "vertex" element, 8 of them in file }
property float x { vertex contains float "x" coordinate }
property float y { y coordinate is also a vertex property }
property float z { z coordinate, too }
element face 6 { there are 6 "face" elements in the file }
property list uchar int vertex_index { "vertex_indices" is a list of ints }
end_header { delimits the end of the header }
0 0 0 { start of vertex list }
0 0 1
0 1 1
0 1 0
1 0 0
1 0 1
1 1 1
1 1 0
4 0 1 2 3 { start of face list }
4 7 6 5 4
4 0 4 5 1
4 1 5 6 2
4 2 6 7 3
4 3 7 4 0
We see a header that is always ASCII encoded. The header specifies a list of elements, each containing a list of properties. The listed elements determine how to parse the data section of the PLY file. The data section maybe encoded in ASCII or binary.
[6]:
from __future__ import annotations
from typing import Optional
from dataclasses import dataclass
from functools import partial
import numpy as np
import pprint
from byteparsing import (parse_bytes, Parser, parser)
from byteparsing.parsers import (
sequence, named_sequence, choice, optional, value, repeat_n,
text_literal, char, text_one_of, text_end_by,
byte_none_of, byte_one_of,
flush, flush_decode,
push, pop, char_pred,
many, some, many_char, some_char, many_char_0, some_char_0,
integer, scientific_number, array, binary_value,
fmap, construct)
pp = pprint.PrettyPrinter(indent=2, width=80)
Header
The header starts with a “magic number”, a line containing ply
.
[9]:
eol = choice(text_literal("\n"), text_literal("\n\r"))
ply_magic_number = sequence(text_literal("ply"), eol)
The second line indicates which variation of the PLY format this is. We will store header information in data classes and enums.
[14]:
from enum import Enum
class PlyFormat(Enum):
ASCII = 1
BINARY_LE = 2
BINARY_BE = 3
@staticmethod
def from_string(format_str: str) -> PlyFormat:
if format_str == "ascii":
return PlyFormat.ASCII
if format_str == "binary_little_endian":
return PlyFormat.BINARY_LE
if format_str == "binary_big_endian":
return PlyFormat.BINARY_BE
else:
raise ValueError(f"Unrecognized format string: {s}")
Since each item in the header is seperated by a newline, we need a tokenize
function that doesn’t skip newlines. From that we can define what we may consider to be a word
in a PLY header. We flush
the cursor, expect a ascii_alpha
character (words don’t start with numbers) and then many characters that are letters, numbers or underscores. At the end we flush
the cursor again, decoding to a Python string. We see here both many_char
and many_char_0
being used:
many_char
automatically flushes the cursor before and after which is not what we want when parsing a word. In fact, the only thing many_char_0
does, is to move the cursor end until the given parser no longer matches the input.
[15]:
ascii_alpha = char_pred(lambda c: 64 < c < 91 or 96 < c < 123)
ascii_num = char_pred(lambda c: 48 <= c < 58)
ascii_alpha_num = choice(ascii_alpha, ascii_num)
ascii_underscore = char(95)
def tokenize(p: Parser) -> Parser:
return sequence(p >> push, many_char(text_one_of(" ")), pop())
word = sequence(
flush(), ascii_alpha, many_char_0(choice(ascii_alpha_num, ascii_underscore)),
flush_decode())
The format line specifies the used data encoding: one of ascii
, binary_little_endian
or binary_big_endian
, and a version number. The version number is always “1.0”.
[24]:
ply_format = named_sequence(
_1 = tokenize(text_literal("format")),
format_str = tokenize(word),
_2 = sequence(tokenize(text_literal("1.0")), eol)) \
>> construct(PlyFormat.from_string)
[25]:
parse_bytes(ply_format, b"format binary_little_endian 1.0\n")
[25]:
<PlyFormat.BINARY_LE: 2>
Comments may be placed in the header by using the word comment at the start of the line. Everything from there until the end of the line should then be ignored.
[26]:
ply_comment = sequence(
tokenize(text_literal("comment")), flush(),
text_end_by("\n") >> push, optional(char("\r")), pop())
Now we need a parser for a data type, and given a data type we need to be able to parse that data either in ASCII or binary. There are two classes of types: primitive types and list types.
[27]:
ply_type_table = {
"char": "int8",
"uchar": "uint8",
"short": "int16",
"ushort": "uint16",
"int": "int32",
"uint": "uint32",
"float": "float32",
"double": "float64"
}
class PlyType:
pass
@dataclass
class PlyPrimitiveType(PlyType):
dtype: np.dtype
@staticmethod
def from_string(s: str) -> PlyPrimitiveType:
sanitized_name = ply_type_table.get(s, s)
return PlyPrimitiveType(np.dtype(sanitized_name))
@property
def byte_size(self) -> int:
return self.dtype.itemsize
def ascii(self) -> Parser:
return sequence(
flush(), some_char_0(byte_none_of(b"\n ")),
many_char_0(byte_one_of(b"\n ")), flush(self.dtype.type))
def binary(self) -> Parser:
return binary_value(self.dtype)
@dataclass
class PlyListType(PlyType):
size_type: PlyPrimitiveType
value_type: PlyPrimitiveType
def ascii(self) -> Parser:
return self.size_type.ascii() >> partial(repeat_n, self.value_type.ascii())
def binary(self) -> Parser:
return binary_value(self.size_type.dtype) >> partial(array, self.value_type.dtype)
[28]:
primitive_type = tokenize(word) >> fmap(PlyPrimitiveType.from_string)
list_type = named_sequence(
_1=tokenize(text_literal("list")),
size_type=primitive_type,
value_type=primitive_type) >> construct(PlyListType)
ply_type = choice(list_type, primitive_type)
[29]:
parse_bytes(ply_type, b"float float")
[29]:
PlyPrimitiveType(dtype=dtype('float32'))
[30]:
pp.pprint(parse_bytes(ply_type, b"list uint8 float"))
PlyListType(size_type=PlyPrimitiveType(dtype=dtype('uint8')),
value_type=PlyPrimitiveType(dtype=dtype('float32')))
[31]:
@dataclass
class PlyProperty:
dtype: PlyType
name: str
ply_property = named_sequence(
_1=tokenize(text_literal("property")),
dtype=ply_type,
name=tokenize(word),
_2=eol) >> construct(PlyProperty)
[32]:
parse_bytes(
ply_property,
b"property float x\n")
[32]:
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')), name='x')
[33]:
end_header = sequence(text_literal("end_header"), eol)
[34]:
@dataclass
class PlyElement:
name: str
size: int
properties: List[PlyProperty]
def ascii(self) -> Parser:
single_item = named_sequence(
**{p.name: p.dtype.ascii() for p in self.properties})
return repeat_n(single_item, self.size)
@property
def afine(self) -> bool:
return all(isinstance(p.dtype, PlyPrimitiveType)
for p in self.properties)
def binary(self) -> Parser:
if self.afine:
compound_type = [(p.name, p.dtype.dtype) for p in self.properties]
return array(compound_type, self.size)
else:
single_item = named_sequence(
**{p.name: p.dtype.binary() for p in self.properties})
return repeat_n(single_item, self.size)
ply_element = named_sequence(
_1=tokenize(text_literal("element")),
name=tokenize(word),
size=tokenize(integer),
_2=eol,
properties=some(ply_property)) >> construct(PlyElement)
[37]:
pp.pprint(
parse_bytes(
some(ply_element),
b"element vertex 8\nproperty float x\nproperty float y\nproperty float z\n" +
b"element face 6\nproperty list uchar int vertex_index\n"))
[ PlyElement(name='vertex',
size=8,
properties=[ PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='x'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='y'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='z')]),
PlyElement(name='face',
size=6,
properties=[ PlyProperty(dtype=PlyListType(size_type=PlyPrimitiveType(dtype=dtype('uint8')),
value_type=PlyPrimitiveType(dtype=dtype('int32'))),
name='vertex_index')])]
[38]:
pp.pprint(parse_bytes(ply_element, b"element face 6\nproperty list uchar int vertex_index\n"))
PlyElement(name='face',
size=6,
properties=[ PlyProperty(dtype=PlyListType(size_type=PlyPrimitiveType(dtype=dtype('uint8')),
value_type=PlyPrimitiveType(dtype=dtype('int32'))),
name='vertex_index')])
[39]:
@dataclass
class PlyHeader:
format: PlyFormat
comment: List[str]
elements: List[PlyElement]
def parser(self) -> Parser:
if self.format == PlyFormat.ASCII:
return named_sequence(
**{e.name: e.ascii() for e in self.elements})
if self.format == PlyFormat.BINARY_LE:
return named_sequence(
**{e.name: e.binary() for e in self.elements})
else:
raise NotImplementedError()
ply_header = named_sequence(
_1=ply_magic_number,
format=ply_format,
comment=many(ply_comment),
elements=some(ply_element),
_2=sequence(text_literal("end_header"), eol)) >> construct(PlyHeader)
def ply_data(header):
return named_sequence(header=value(header), data=header.parser())
ply_file = ply_header >> ply_data
The following ASCII example is given on Paul Bourke’s page.
[40]:
ascii_example = b"""ply
format ascii 1.0
comment made by Greg Turk
comment this file is a cube
element vertex 8
property float x
property float y
property float z
element face 6
property list uchar int vertex_index
end_header
0 0 0
0 0 1
0 1 1
0 1 0
1 0 0
1 0 1
1 1 1
1 1 0
4 0 1 2 3
4 7 6 5 4
4 0 4 5 1
4 1 5 6 2
4 2 6 7 3
4 3 7 4 0
"""
[41]:
pp.pprint(parse_bytes(ply_header, ascii_example))
PlyHeader(format=<PlyFormat.ASCII: 1>,
comment=['made by Greg Turk', 'this file is a cube'],
elements=[ PlyElement(name='vertex',
size=8,
properties=[ PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='x'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='y'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='z')]),
PlyElement(name='face',
size=6,
properties=[ PlyProperty(dtype=PlyListType(size_type=PlyPrimitiveType(dtype=dtype('uint8')),
value_type=PlyPrimitiveType(dtype=dtype('int32'))),
name='vertex_index')])])
No for the fun part! The header that we read actually encodes the parser for the rest of the file!
[42]:
pp.pprint(parse_bytes(ply_file, ascii_example))
{ 'data': { 'face': [ {'vertex_index': [0, 1, 2, 3]},
{'vertex_index': [7, 6, 5, 4]},
{'vertex_index': [0, 4, 5, 1]},
{'vertex_index': [1, 5, 6, 2]},
{'vertex_index': [2, 6, 7, 3]},
{'vertex_index': [3, 7, 4, 0]}],
'vertex': [ {'x': 0.0, 'y': 0.0, 'z': 0.0},
{'x': 0.0, 'y': 0.0, 'z': 1.0},
{'x': 0.0, 'y': 1.0, 'z': 1.0},
{'x': 0.0, 'y': 1.0, 'z': 0.0},
{'x': 1.0, 'y': 0.0, 'z': 0.0},
{'x': 1.0, 'y': 0.0, 'z': 1.0},
{'x': 1.0, 'y': 1.0, 'z': 1.0},
{'x': 1.0, 'y': 1.0, 'z': 0.0}]},
'header': PlyHeader(format=<PlyFormat.ASCII: 1>,
comment=['made by Greg Turk', 'this file is a cube'],
elements=[ PlyElement(name='vertex',
size=8,
properties=[ PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='x'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='y'),
PlyProperty(dtype=PlyPrimitiveType(dtype=dtype('float32')),
name='z')]),
PlyElement(name='face',
size=6,
properties=[ PlyProperty(dtype=PlyListType(size_type=PlyPrimitiveType(dtype=dtype('uint8')),
value_type=PlyPrimitiveType(dtype=dtype('int32'))),
name='vertex_index')])])}
The Stanford Bunny
Now that we have the capability to parse binary PLY files, we can load the Stanford Bunny. For visualisation purposes it helps to know that all faces in this file are triangles.
[22]:
from pathlib import Path
bunny = parse_bytes(ply_file, Path("_static/stanford_bunny.ply").open(mode="rb").read())
[23]:
vertices = bunny["data"]["vertex"].view((np.float32, 3))
triangles = np.array([row["vertex_indices"] for row in bunny["data"]["face"]])
[24]:
from mpl_toolkits import mplot3d
from matplotlib import pyplot as plt
# For interactive use: install ipympl and run:
# %matplotlib widget
[25]:
fig = plt.figure(figsize=(12,12))
ax = plt.axes(projection="3d")
ax.azim = -18
ax.elev = 15
ax.plot_trisurf(*vertices.T[[2,0,1]], triangles=triangles);