{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parsers\n", "\n", "## Parsing functions\n", "\n", "The `dataclass` `Parser` contains a single field, `func`.\n", "`func` is expected to be a function with the following signature:\n", "\n", "$$\n", "f: (Cursor, Aux) \\longrightarrow (data, Cursor, Aux)\n", "$$\n", "\n", "Wait a second! Why is the function returning something else than the `data` itself?\n", "\n", "The reason is that, apart from the obvious output of `data`, it is very convenient to return an **updated** `Cursor` and an **updated** `Aux`.\n", "Typically, the updated `Cursor` will contain only the remaining, non-parsed content of the data.\n", "This is very convenient if we want to concatenate different parsers, passing the output of the first to the next one... and so on.\n", "\n", "Let's see an example:\n", "\n", "In this example we will create a very simple parser, that just reads the first letter of a text string." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from byteparsing import Cursor" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Create some data to be parsed\n", "data = b\"Hello world!\"\n", "\n", "# Initialize the Cursor\n", "c = Cursor(data, begin=0, end=0)\n", "\n", "# Use an empty auxiliary variable\n", "a = []\n", "\n", "# Create a parsing function\n", "def read_one(c, a):\n", " c = c.increment() # Increase end index by one\n", " x = c.content_str # Read content\n", " c = c.flush() # Flush (i.e.: move begin to end)\n", " return x, c, a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the snippet below we see why it is convenient to use the updated `Cursor`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "H\n", "e\n", "l\n", "l\n", "o\n", " \n", "w\n", "o\n", "r\n", "l\n", "d\n", "!\n" ] } ], "source": [ "while c:\n", " x, c, a = read_one(c, a)\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even write a new parsing function, based on the previous one, that parses to the end of the string:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def read_all(c, a):\n", " x = [] # Initialize as empty list\n", " while c:\n", " temp, c, a = read_one(c, a)\n", " x.append(temp)\n", " return x, c, a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try it:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']\n", "Cursor(data=b'Hello world!', begin=12, end=12, encoding='utf-8')\n" ] } ], "source": [ "# Restart the Cursor\n", "c = Cursor(data, begin=0, end=0)\n", "\n", "x, c, a = read_all(c, a)\n", "print(x)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the composition of two parsing functions (say, $f$ and $g$) is slightly more complicated than $f \\circ g$, because the input and the output spaces of parsing functions are slightly different.\n", "\n", "For these and other reasons, it is advisable to manage parsing functions with a more flexible data structure.\n", "\n", "We introduce the Parser (data) class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Parser class\n", "\n", "This section is work in progress.\n", "\n", "We'll manage parsing functions using a `Parser` class.\n", "`Parser` is a `dataclass` that contains a single field, `func`, representing a parser function.\n", "\n", "Let's build a `Parser` class from the `read_one` parsing function defined in the previous section:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from byteparsing.trampoline import Parser, parser" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Initialize the Cursor\n", "c = Cursor(data, begin=0, end=0)\n", "\n", "# Use an empty auxiliary variable\n", "a = []\n", "\n", "# Create a function\n", "def read_one(c, a):\n", " c = c.increment() # Increase end index by one\n", " x = c.content_str # Read content\n", " c = c.flush() # Flush (i.e.: move begin to end)\n", " return x, c, a\n", "\n", "read_one_p = Parser(read_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the lines above are entirely equivalent to:\n", "\n", "```python\n", "@parser\n", "def read_one_p(c, a):\n", " c = c.increment() # Increase end index by one\n", " x = c.content_str # Read content\n", " c = c.flush() # Flush (i.e.: move begin to end)\n", " return x, c, a\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parsers are callable, but they don't return anything informative until they are invoked:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Call(p=, cursor=Cursor(data=b'Hello world!', begin=0, end=0, encoding='utf-8'), aux=[])\n" ] } ], "source": [ "print(read_one_p(c, a)) # Whithout invoking" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "H\n", "Cursor(data=b'Hello world!', begin=1, end=1, encoding='utf-8')\n" ] } ], "source": [ "x, c, a = read_one_p(c, a).invoke()\n", "print(x)\n", "print(c)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }