funcparserlib-0.3.6/CHANGES

The Changelog
=============
This is a changelog of [funcparserlib][1].
0.3.6, 2013-05-02
-----------------
A maintenance release.
* Python 3 compatibility
* #31 Fixed `many()` that consumed too many tokens in some cases
* #14 More info available in exception objects
0.3.5, 2011-01-13
-----------------
A maintenance release.
* Python 2.4 compatibility
* More readable terminal names for error reporting
* Fixed wrong token positions in lexer error messages
0.3.4, 2009-10-06
-----------------
A maintenance release.
* Switched from `setuptools` to `distutils`
* Fixed importing all symbols from `funcparserlib.lexer`
* Improved `run-tests` utility
0.3.3, 2009-08-03
-----------------
A bugfix release, added more docs.
* Fixed bug in results of skip + skip parsers
* Added FAQ question about infinite loops in parsers
* Debug rule tracing can be enabled again
0.3.2, 2009-07-26
-----------------
A bugfix release, added more docs.
* Fixed some string and number encoding issues in examples
* Added the Parsing Stages Illustrated page
0.3.1, 2009-07-26
-----------------
Major optimizations (10x faster than the version 0.3), added `forward_decl`,
`pretty_tree`, more docs.
* Added the Nested Brackets Mini-HOWTO
* Added the `pretty_tree` function for creating pseudographic trees
* Added the `forward_decl` function, that performs better than
`with_forward_decls`
* Wrapped parser is called directly without `__call__`, using `run`
* A single immutable input sequence is used in parsers
* The slow `logging` is enabled only when the `debug` flag is set
* Added the project Makefile and this file
0.3, 2009-07-23
---------------
Translated docs into English, added more docs and examples, internal
improvements.
* Added The `funcparserlib` Tutorial
* Translated docs from Russian into English
* Added `pure` and `bind` functions on `Parser`s making them monads
* Added a JSON parser as an example
0.2, 2009-07-07
---------------
Added `with_forward_decls`, internal improvements.
* Added the `with_forward_decls` combinator for dealing with forward
declarations
* Switched to iterative implementation of `many`
* Uncurried parser function type in order to simplify things
* Improvements of the DOT parser
0.1, 2009-06-26
---------------
Initial release.
[1]: http://code.google.com/p/funcparserlib/
funcparserlib-0.3.6/doc/Brackets.md

Nested Brackets Mini-HOWTO
==========================
- Author: Andrey Vlasovskikh
- License: Creative Commons Attribution-Noncommercial-Share Alike 3.0
- Library Homepage: http://code.google.com/p/funcparserlib/
- Library Version: 0.3.6
Intro
-----
Let's try out `funcparserlib` using a tiny example: parsing strings of nested
curly brackets. It is well known that this cannot be done with regular
expressions, so we need a parser. For more complex examples, see [The funcparserlib
Tutorial][tutorial] or
other examples at [the funcparserlib homepage][funcparserlib].
[funcparserlib]: http://code.google.com/p/funcparserlib/
[tutorial]: http://archlinux.folding-maps.org/2009/funcparserlib/Tutorial
Here is the EBNF grammar of our curly brackets language:
nested = "{" , { nested } , "}" ;
i.e. `nested` is a sequence of the symbol `{`, followed by zero or more
occurrences of the `nested` production itself, followed by the symbol `}`. Let's
develop a parser for this grammar.
We will parse plain strings, but in real life you may wish to use
`funcparserlib.lexer` or any other lexer to tokenize the input and parse tokens,
not just symbols.
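For illustration, here is a minimal sketch of tokenizing bracket strings with
`make_tokenizer` from `funcparserlib.lexer` (the spec names `SPACE` and `BRACKET`
are arbitrary; the rest of this HOWTO sticks to plain strings):

    from funcparserlib.lexer import make_tokenizer

    # Each spec is a (name, (regexp, flags...)) pair; order matters.
    specs = [
        ('SPACE', (r'[ \t\r\n]+',)),
        ('BRACKET', (r'[{}]',)),
    ]
    tokenize = make_tokenizer(specs)
    # Drop whitespace tokens, keep only the brackets
    tokens = [t for t in tokenize('{ {} }') if t.type != 'SPACE']
    assert [t.value for t in tokens] == ['{', '{', '}', '}']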
We will use the following `funcparserlib` functions: `a`, `forward_decl`,
`maybe`, `finished`, `many`, `skip`, `pretty_tree`. The library actually exports
only 11 functions, so the API is quite compact.
Nested Brackets Checker
-----------------------
Basic usage:
>>> from funcparserlib.parser import a
>>> brackets = a('{') + a('}')
>>> brackets.parse('{}')
('{', '}')
Let's write a nested brackets checker:
>>> from funcparserlib.parser import forward_decl, maybe
>>> nested = forward_decl()
>>> nested.define(a('{') + maybe(nested) + a('}'))
Test it:
>>> nested.parse('{}')
('{', None, '}')
>>> nested.parse('{{}}')
('{', ('{', None, '}'), '}')
>>> nested.parse('{{}')
Traceback (most recent call last):
...
NoParseError: no tokens left in the stream:
>>> nested.parse('{foo}')
Traceback (most recent call last):
...
NoParseError: got unexpected token: f
>>> nested.parse('{}foo')
('{', None, '}')
In the last test we have parsed only a valid prefix of the input. Let's ensure that all
the input symbols have been parsed:
>>> from funcparserlib.parser import finished
>>> input = nested + finished
Test it:
>>> input.parse('{}foo')
Traceback (most recent call last):
...
NoParseError: should have reached <EOF>: f
Allow zero or more nested brackets:
>>> from funcparserlib.parser import many
>>> nested = forward_decl()
>>> nested.define(a('{') + many(nested) + a('}'))
>>> input = nested + finished
Test it:
>>> input.parse('{{}{}}')
('{', [('{', [], '}'), ('{', [], '}')], '}', None)
Skip `None`, the result of `finished`:
>>> from funcparserlib.parser import skip
>>> end_ = skip(finished)
>>> input = nested + end_
Test it:
>>> input.parse('{{}{}}')
('{', [('{', [], '}'), ('{', [], '}')], '}')
Textual Parse Tree
------------------
Objectify a parse tree:
>>> class Bracket(object):
... def __init__(self, kids):
... self.kids = kids
... def __repr__(self):
... return 'Bracket(%r)' % self.kids
>>> a_ = lambda x: skip(a(x))
>>> nested = forward_decl()
>>> nested.define(a_('{') + many(nested) + a_('}') >> Bracket)
>>> input = nested + end_
Test it:
>>> nested.parse('{{{}{}}{}}')
Bracket([Bracket([Bracket([]), Bracket([])]), Bracket([])])
Draw a textual parse tree:
>>> from funcparserlib.util import pretty_tree
>>> def ptree(t):
... def kids(x):
... if isinstance(x, Bracket):
... return x.kids
... else:
... return []
... def show(x):
... if isinstance(x, Bracket):
... return '{}'
... else:
... return repr(x)
... return pretty_tree(t, kids, show)
Test it:
>>> print ptree(nested.parse('{{{}{}}{}}'))
{}
|-- {}
| |-- {}
| `-- {}
`-- {}
>>> print ptree(nested.parse('{{{{}}{}}{{}}{}{{}{}}}'))
{}
|-- {}
| |-- {}
| | `-- {}
| `-- {}
|-- {}
| `-- {}
|-- {}
`-- {}
|-- {}
`-- {}
Nesting Level
-------------
Let's count the nesting level:
>>> def count(x):
... return 1 if len(x) == 0 else max(x) + 1
>>> nested = forward_decl()
>>> nested.define(a_('{') + many(nested) + a_('}') >> count)
>>> input = nested + end_
Test it:
>>> input.parse('{}')
1
>>> input.parse('{{{}}}')
3
>>> input.parse('{{}{{{}}}}')
4
>>> input.parse('{{{}}{}}')
3
funcparserlib-0.3.6/doc/Changes.md (a hard link to funcparserlib-0.3.6/CHANGES)

funcparserlib-0.3.6/doc/FAQ.md

`funcparserlib` FAQ
===================
Frequently asked questions related to [funcparserlib][1].
1. Why did my parser enter an infinite loop?
--------------------------------------------
Because the grammar you've defined allows infinite empty sequences. It's a
general pitfall of grammar rules and it must be avoided. Let's explain why this
may happen.
_A universally successful parser_ is a parser that _may consume no tokens_ from
the input sequence and still _return a value_ without raising `NoParseError`. It
may consume some tokens and return values when it can, but when it cannot, it
just returns a value, not an error.
The basic parser combinators that return parsers having this property are
`pure`, `maybe`, and `many`:
* A result of `pure` always returns its argument without even accessing the
input sequence
* A result of `maybe` always returns either a parsing result or `None`
* A result of `many` always returns a list (maybe the empty one)
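A quick sketch of this behavior, using a plain string as the input sequence so
that single characters serve as tokens:

    from funcparserlib.parser import a, maybe, pure

    x = a('x')
    assert maybe(x).parse('xy') == 'x'   # 'x' is there, so it is consumed
    assert maybe(x).parse('y') is None   # 'x' is absent, still a result
    assert pure(42).parse('y') == 42     # consumes nothing, returns its argument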
By composing parsers with these combinators, you can create your own universally
successful parsers (and you have probably already done so!).
One more fact: given some parser `p`, the `many` combinator returns a parser `q`
that applies `p` to the input sequence repeatedly, until `p` fails with `NoParseError`.
So we can deduce that, given a universally successful parser, `many` returns a
parser that may apply it to the input _forever._ This is the cause of an infinite
loop.
You **must not** pass a universally successful parser to the `many` combinator.
Consider the following parsers:
from funcparserlib.parser import a, many, maybe, pure
const = lambda x: lambda _: x
x = a('x')
p1 = maybe(x)
p2 = many(p1)
p3 = maybe(x) + x
p4 = many(p3)
p5 = x | many(x)
p6 = many(p5)
p7 = x + many(p4)
p8 = x >> const(True)
p9 = pure(True)
Here `p1`, `p2`, `p4`, `p5`, `p6`, and `p9` are universally successful parsers
while `p3`, `p7`, and `p8` are not. Parsers `p2`, `p6`, and `p7` may enter an
infinite loop, while others cannot. Just apply the statements we have made
above to these parsers to figure out why.
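If a universally successful parser of yours ends up inside `many`, the usual fix
is to restructure the grammar so that the repeated parser can actually fail. A
minimal sketch:

    from funcparserlib.parser import a, many

    x = a('x')

    # Risky: maybe(x) always succeeds, so many(maybe(x)) may loop forever
    # once there are no more 'x' tokens to consume.

    # Safe: "zero or more x" is expressed directly as many(x), which stops
    # as soon as x fails and returns a (possibly empty) list.
    xs = many(x)
    assert xs.parse('xxy') == ['x', 'x']
    assert xs.parse('y') == []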
[1]: http://code.google.com/p/funcparserlib/
funcparserlib-0.3.6/doc/Illustrated.md

Parsing Stages Illustrated
==========================
- Author: Andrey Vlasovskikh
- License: Creative Commons Attribution-Noncommercial-Share Alike 3.0
- Library Homepage: http://code.google.com/p/funcparserlib/
- Library Version: 0.3.6
Given some language, for example, the [GraphViz DOT][dot] graph language (see
[its grammar][dot-grammar]), you can *easily write your own parser* for it in
Python using `funcparserlib`.
Then you can:
1. Take a piece of source code in this DOT language:
>>> s = '''\
... digraph g1 {
... n1 -> n2 ->
... subgraph n3 {
... nn1 -> nn2 -> nn3;
... nn3 -> nn1;
... };
... subgraph n3 {} -> n1;
... }
... '''
that stands for the graph:

2. Import your small parser (we use one shipped as an example with
`funcparserlib` here):
>>> import sys, os
>>> sys.path.append(os.path.join(os.getcwd(), '../examples/dot'))
>>> import dot as dotparser
3. Transform the source code into a sequence of tokens:
>>> toks = dotparser.tokenize(s)
>>> print '\n'.join(unicode(tok) for tok in toks)
1,0-1,7: Name 'digraph'
1,8-1,10: Name 'g1'
1,11-1,12: Op '{'
2,4-2,6: Name 'n1'
2,7-2,9: Op '->'
2,10-2,12: Name 'n2'
2,13-2,15: Op '->'
3,4-3,12: Name 'subgraph'
3,13-3,15: Name 'n3'
3,16-3,17: Op '{'
4,8-4,11: Name 'nn1'
4,12-4,14: Op '->'
4,15-4,18: Name 'nn2'
4,19-4,21: Op '->'
4,22-4,25: Name 'nn3'
4,25-4,26: Op ';'
5,8-5,11: Name 'nn3'
5,12-5,14: Op '->'
5,15-5,18: Name 'nn1'
5,18-5,19: Op ';'
6,4-6,5: Op '}'
6,5-6,6: Op ';'
7,4-7,12: Name 'subgraph'
7,13-7,15: Name 'n3'
7,16-7,17: Op '{'
7,17-7,18: Op '}'
7,19-7,21: Op '->'
7,22-7,24: Name 'n1'
7,24-7,25: Op ';'
8,0-8,1: Op '}'
4. Parse the sequence of tokens into a parse tree:
>>> tree = dotparser.parse(toks)
>>> from textwrap import fill
>>> print fill(repr(tree), 70)
Graph(strict=None, type='digraph', id='g1', stmts=[Edge(nodes=['n1',
'n2', SubGraph(id='n3', stmts=[Edge(nodes=['nn1', 'nn2', 'nn3'],
attrs=[]), Edge(nodes=['nn3', 'nn1'], attrs=[])])], attrs=[]),
Edge(nodes=[SubGraph(id='n3', stmts=[]), 'n1'], attrs=[])])
5. Pretty-print the parse tree:
>>> print dotparser.pretty_parse_tree(tree)
Graph [id=g1, strict=False, type=digraph]
`-- stmts
|-- Edge
| |-- nodes
| | |-- n1
| | |-- n2
| | `-- SubGraph [id=n3]
| | `-- stmts
| | |-- Edge
| | | |-- nodes
| | | | |-- nn1
| | | | |-- nn2
| | | | `-- nn3
| | | `-- attrs
| | `-- Edge
| | |-- nodes
| | | |-- nn3
| | | `-- nn1
| | `-- attrs
| `-- attrs
`-- Edge
|-- nodes
| |-- SubGraph [id=n3]
| | `-- stmts
| `-- n1
`-- attrs
6. And so on. Basically, you get full access to the tree-like structure of the
DOT file.
See [the source code][dot-py] of the DOT parser and the docs at [the funcparserlib
homepage][funcparserlib] for details.
[dot]: http://www.graphviz.org/
[dot-grammar]: http://www.graphviz.org/doc/info/lang.html
[funcparserlib]: http://code.google.com/p/funcparserlib/
[dot-py]: http://code.google.com/p/funcparserlib/source/browse/examples/dot/dot.py
funcparserlib-0.3.6/doc/index.md

funcparserlib
=============
A recursive descent parsing library based on functional combinators.
Installation
------------
The `funcparserlib` library is installed via the standard Python installation
tools `pip` and `distribute`:
$ pip install funcparserlib
You can also download the source code manually and install it using:
$ python setup.py install
It is also possible to run tests via `unittest`, `nosetests` or `tox`:
$ python -m unittest discover funcparserlib.tests
$ nosetests funcparserlib.tests
$ tox
Documentation
-------------
The comprehensive [funcparserlib Tutorial][1] is available as `./doc/Tutorial.md`.
A short intro to `funcparserlib` can be found in the [Nested Brackets
Mini-HOWTO][2], see `./doc/Brackets.md`.
See also comments inside the modules `funcparserlib.parser` and
`funcparserlib.lexer` or generate the API docs from the modules using `pydoc`.
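For example, to read the API docs of the parser module in a terminal:

    $ python -m pydoc funcparserlib.parser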
There are a couple of examples available in the `./examples` directory:
* GraphViz DOT parser
* JSON parser
See also [the changelog][3] and [FAQ][4].
[1]: Tutorial
[2]: Brackets
[3]: Changes
[4]: FAQ
funcparserlib-0.3.6/doc/Makefile

test: test-tutorial test-brackets
test-tutorial:
python -c 'import doctest; doctest.testfile("Tutorial.md")'
test-brackets:
python -c 'import doctest; doctest.testfile("Brackets.md")'
funcparserlib-0.3.6/doc/Tutorial.md

The `funcparserlib` Tutorial
============================
- Author: Andrey Vlasovskikh
- License: Creative Commons Attribution-Noncommercial-Share Alike 3.0
- Library Homepage: http://code.google.com/p/funcparserlib/
- Library Version: 0.3.6
Foreword
--------
This is an epic tutorial that explains how to write parsers using
`funcparserlib`. As the tutorial contains lots of code listings, it is written
using the exciting [doctest][] module. This module is a part of the Python
standard library. Using it, you can _execute the tutorial file_ in order to make
sure that all the code listings work as described here.
Although writing functional parsers and functional programming in general is
fun, the large size of the tutorial makes it a bit monotonous. To prevent the
reader from getting bored, some bits of humor and interesting facts were added.
Some knowledge of general parsing concepts is assumed, as well as some
familiarity with functional programming. Experience with Haskell or Scheme would
be nice, but it is not required.
Any comments and suggestions are welcome! Especially corrections related to the
English language, as the author is not a native English speaker. Please post
your comments to [the issues list][funcparserlib-issues] on Google Code.
Contents
--------
1. Intro
2. Diving In
3. Lexing with `tokenize`
4. The Library Basics
1. Parser Combinators
2. The `some` Combinator
3. The `>>` Combinator
4. The `+` Combinator
5. Getting First Numbers
1. The `a` Combinator
2. Pythonic Uncurrying
6. Making a Choice
1. The `|` Combinator
2. Conflicting Alternatives
3. The Fear of Left-Recursion
4. The `many` Combinator
7. Ordering Calculations
1. Operator Precedence
2. The `with_forward_decls` Combinator
8. Polishing the Code
1. The `skip` Combinator
2. The `finished` Combinator
3. The `maybe` Combinator
9. Advanced Topics
1. Parser Type Classes
2. Papers on Functional Parsers
Intro
-----
In this tutorial, we will write _an expression calculator_ that uses syntax
similar to Python or Haskell expressions. Writing a calculator is a common
example in articles related to parsers and parsing techniques, so it is a good
starting point in learning `funcparserlib`.
If you are interested in more real-world examples, see the sources of [a
GraphViz DOT parser][dot-parser] or [a JSON parser][json-parser] available in
the `./examples` directory of `funcparserlib`. If you need just a short intro
instead of the full tutorial, see the [Nested Brackets Mini-HOWTO][nested].
We will show how to write a parser and an evaluator of expressions using
`funcparserlib`. The library comes with its own lexer module, but in this
example we will use the standard Python module [tokenize][] as a lexer.
`funcparserlib` parser combinators are completely agnostic of what the tokens
are and how they have been produced, so you can use any lexer you like.
Here are some expressions we want to be able to parse and calculate:
1
2 + 3
2 ** 32 - 1
3.1415926 * (2 + 7.18281828e-1)
Diving In
---------
Here is a complete expression calculator program.
You are not assumed to understand it now. Just look at its shape and try to get
some feeling of its structure.
In the end of this tutorial you will fully understand this code and will be able
to write parsers for your own needs.
>>> from StringIO import StringIO
>>> from tokenize import generate_tokens
>>> import operator, token
>>> from funcparserlib.parser import (some, a, many, skip, finished, maybe,
... with_forward_decls)
>>> class Token(object):
... def __init__(self, code, value, start=(0, 0), stop=(0, 0), line=''):
... self.code = code
... self.value = value
... self.start = start
... self.stop = stop
... self.line = line
...
... @property
... def type(self):
... return token.tok_name[self.code]
...
... def __unicode__(self):
... pos = '-'.join('%d,%d' % x for x in [self.start, self.stop])
... return "%s %s '%s'" % (pos, self.type, self.value)
...
... def __repr__(self):
... return 'Token(%r, %r, %r, %r, %r)' % (
... self.code, self.value, self.start, self.stop, self.line)
...
... def __eq__(self, other):
... return (self.code, self.value) == (other.code, other.value)
>>> def tokenize(s):
... 'str -> [Token]'
... return list(Token(*t)
... for t in generate_tokens(StringIO(s).readline)
... if t[0] not in [token.NEWLINE])
>>> def parse(tokens):
... 'Sequence(Token) -> int or float or None'
... # Well known functions
... const = lambda x: lambda _: x
... unarg = lambda f: lambda x: f(*x)
...
... # Semantic actions and auxiliary functions
... tokval = lambda tok: tok.value
... makeop = lambda s, f: op(s) >> const(f)
... def make_number(s):
... try:
... return int(s)
... except ValueError:
... return float(s)
... def eval_expr(z, list):
... 'float, [((float, float -> float), float)] -> float'
... return reduce(lambda s, (f, x): f(s, x), list, z)
... eval = unarg(eval_expr)
...
... # Primitives
... number = (
... some(lambda tok: tok.code == token.NUMBER)
... >> tokval
... >> make_number)
... op = lambda s: a(Token(token.OP, s)) >> tokval
... op_ = lambda s: skip(op(s))
...
... add = makeop('+', operator.add)
... sub = makeop('-', operator.sub)
... mul = makeop('*', operator.mul)
... div = makeop('/', operator.div)
... pow = makeop('**', operator.pow)
...
... mul_op = mul | div
... add_op = add | sub
...
... # Means of composition
... @with_forward_decls
... def primary():
... return number | (op_('(') + expr + op_(')'))
... factor = primary + many(pow + primary) >> eval
... term = factor + many(mul_op + factor) >> eval
... expr = term + many(add_op + term) >> eval
...
... # Toplevel parsers
... endmark = a(Token(token.ENDMARKER, ''))
... end = skip(endmark + finished)
... toplevel = maybe(expr) + end
...
... return toplevel.parse(tokens)
A couple of tests:
>>> assert parse(tokenize('')) is None
>>> assert parse(tokenize('1')) == 1
>>> assert parse(tokenize('2 + 3')) == 5
>>> assert parse(tokenize('2 * (3 + 4)')) == 14
OK, now let's forget about all this stuff:
>>> del StringIO, generate_tokens, operator, token
>>> del Token, tokenize, parse
and start from scratch!
Lexing with `tokenize`
----------------------
We start with lexing in order to be able to define parsers in terms of tokens,
not just characters. This section is auxiliary and it is completely unrelated to
`funcparserlib`. But we just need tokens to start writing parsers. You may skip
this section and start with “The Library Basics”.
We will need to `generate_tokens` using the standard `tokenize` module:
>>> from tokenize import generate_tokens
Import some standard library stuff:
>>> from StringIO import StringIO
>>> from pprint import pformat
This is an output from the tokenizer:
>>> ts = list(generate_tokens(StringIO('3 * (4 + 5)').readline))
>>> print pformat(ts)
[(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'),
(51, '*', (1, 2), (1, 3), '3 * (4 + 5)'),
(51, '(', (1, 4), (1, 5), '3 * (4 + 5)'),
(2, '4', (1, 5), (1, 6), '3 * (4 + 5)'),
(51, '+', (1, 7), (1, 8), '3 * (4 + 5)'),
(2, '5', (1, 9), (1, 10), '3 * (4 + 5)'),
(51, ')', (1, 10), (1, 11), '3 * (4 + 5)'),
(0, '', (2, 0), (2, 0), '')]
As we can see, the lexer has already thrown away the spaces. Each token is a
5-tuple of the token code, the token string, the beginning and ending of the
token, and the line on which it was found.
Let's make the output more pretty by wrapping a token in a class. We could
definitely go on without such a wrapper, but it will make messages more readable
and allow access to the fields of the token by name.
Import a standard module containing the code-to-name map for tokens:
>>> import token
Define the wrapper class:
>>> class Token(object):
... def __init__(self, code, value, start=(0, 0), stop=(0, 0), line=''):
... self.code = code
... self.value = value
... self.start = start
... self.stop = stop
... self.line = line
...
... @property
... def type(self):
... return token.tok_name[self.code]
...
... def __unicode__(self):
... pos = '-'.join('%d,%d' % x for x in [self.start, self.stop])
... return "%s %s '%s'" % (pos, self.type, self.value)
...
... def __repr__(self):
... return 'Token(%r, %r, %r, %r, %r)' % (
... self.code, self.value, self.start, self.stop, self.line)
...
... def __eq__(self, other):
... return (self.code, self.value) == (other.code, other.value)
Functions `__repr__` and `__eq__` will be used later. Let's see what it will
look like:
>>> print '\n'.join(unicode(Token(*t)) for t in ts)
1,0-1,1 NUMBER '3'
1,2-1,3 OP '*'
1,4-1,5 OP '('
1,5-1,6 NUMBER '4'
1,7-1,8 OP '+'
1,9-1,10 NUMBER '5'
1,10-1,11 OP ')'
2,0-2,0 ENDMARKER ''
So we are basically done with lexing. The last thing left is to write _the_
lexer function:
>>> def tokenize(s):
... 'str -> [Token]'
... return list(Token(*t)
... for t in generate_tokens(StringIO(s).readline)
... if t[0] not in [token.NEWLINE])
Here we have just added filtering out of the newline tokens.
The Library Basics
------------------
`funcparserlib` is a library for recursive descent parsing using parser
combinators. The parsers made with its help are LL(*) parsers. It means that
it's very easy to write them without thinking about look-aheads and all that
hardcore parsing stuff. But recursive descent parsing is a rather slow
method compared to LL(k) or LR(k) algorithms. So the primary domain for
`funcparserlib` is parsing small languages or external DSLs (domain specific
languages).
### Parser Combinators
_A parser_ is basically a function `f` of type (we will use a Haskell-ish
notation for types):
f :: [a] -> (b, [a])
that takes a list of tokens of arbitrary type `a` and returns a pair of the
parsed value of arbitrary type `b` and the list of tokens left. We can define
an alias for this type:
type Parser(a, b) = [a] -> (b, [a])
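To get a feel for this type, here is a toy parser function written in exactly
this style (just an illustration with a made-up name `a_char`, not
`funcparserlib` code):

    def a_char(c):
        'Return a parser function that accepts the single token c.'
        def parse(tokens):
            # Consume one token if it matches, return (value, rest)
            if tokens and tokens[0] == c:
                return tokens[0], tokens[1:]
            raise ValueError('expected %r' % (c,))
        return parse

    assert a_char('x')('xyz') == ('x', 'yz')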
_Parser combinators_ are just higher-order functions that take parsers as their
arguments and return them as result values. Parser combinators are:
* First-class values
* Extremely composable
* Tend to make the code quite compact
* Resemble the readable notation of xBNF grammars
`funcparserlib` uses a more advanced parser type in order to generalize away
from lists to [sequences][] and provide more readable error reports by tracking
a parsing state (in a functional way of course):
f :: Sequence(a), State -> (b, State)
But this parser type is no fun any more. In order to get rid of it as well as to
use overloaded operators `funcparserlib` wraps parser functions into a class (we
have already seen this approach earlier in the lexer). This class is named
`Parser` and all the combinators we will be using deal with objects of this
class. So the typedef `Parser(a, b)` above is just a parameterized class, not a
function. The parser itself is invoked via the `Parser.run` function.
In fact, all the plain parser functions are hidden from you by `funcparserlib`
so you don't need to know these internals. So, every parser `p` you have ever
met `isinstance` of the `Parser` class.
So let's leave parser functions behind the barrier of abstraction. But if you
are interested in how all this stuff really works, just look into [the
sources][parser-py] of `funcparserlib`! There are only approximately 300 lines
of documented code there. And you are already familiar with the basic idea.
### The `some` Combinator
Initial imports:
>>> from funcparserlib.parser import some, a, many, skip, finished, maybe
Let's recall the expressions we would like to parse:
1
2 + 3
2 ** 32 - 1
3.1415926 * (2 + 7.18281828e-1)
So our grammar consists of expressions, that consist of numbers or nested
expressions. All the expressions we have seen so far are binary.
Let's start with just numbers. Number is some token of type `'NUMBER'`:
>>> number = some(lambda tok: tok.type == 'NUMBER')
We have just introduced the first parser combinator — `some`. Dealing with
parser combinators, we should always keep in mind their types in order to know
precisely what they do. `some` has the following type:
some :: (a -> bool) -> Parser(a, a)
`some` takes as its input a predicate function from token of arbitrary type `a`
and returns a parser of `Sequence(a)` that returns a result of type `a`. The
first `a` in `Parser` is the type of tokens in the input sequence, and the
second one is the type of a parsed token. The type doesn't change during
parsing, so we get exactly the token that satisfies the predicate (there is only
one function from `a` to `a`: `id = lambda x: x`).
The resulting parser acts like a filter by parsing only those tokens that
satisfy the predicate. Hence the name: _some_ token satisfying the predicate
will be returned by the parser.
And this is how it works:
>>> number.parse(tokenize('5'))
Token(2, '5', (1, 0), (1, 1), '5')
and how it reports errors:
>>> number.parse(tokenize('a'))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 1,0-1,1 NAME 'a'
Notice that the lexer and the `Token` wrapper class help us identify the
position in which the error occurred.
### The `>>` Combinator
Using `some`, we have got a parsed `Token`. But we need numbers, not `Token`s, to
calculate an expression! So the result of the `number` parser is not
appropriate. It should have the type `int` or `float`. We need some tool to
transform a `Parser(Token, Token)` into a `Parser(Token, int or float)` (note:
we use dynamic typing here).
And this tool is called the `>>` combinator. It has the type:
(>>) :: Parser(a, b), (b -> c) -> Parser(a, c)
Again, its type suggests what it can possibly do. It returns a parser, that
applies the `Parser(a, b)` to the input sequence and then maps the result of
type `b` to type `c` using a function `b -> c` (for functionally inclined: a
parser is a functor where `>>` is its `fmap`).
Let's write a function that maps a `Token` to an `int` or a `float`:
>>> def make_number(tok):
... 'Token -> int or float'
... try:
... return int(tok.value)
... except ValueError:
... return float(tok.value)
OK, but we can split this one into two more primitive, useful functions:
>>> def tokval(tok):
... 'Token -> str'
... return tok.value
>>> def make_number(s):
... try:
... return int(s)
... except ValueError:
... return float(s)
Let's use these functions in our `number` parser:
>>> number = (
... some(lambda tok: tok.type == 'NUMBER')
... >> tokval
... >> make_number)
Now we got exactly what we needed:
>>> number.parse(tokenize('5'))
5
>>> '%g' % number.parse(tokenize('1.6e-19'))
'1.6e-19'
See how composition works. We compose a parser `some(...)` of type
`Parser(Token, Token)` with the function `tokval` and we get a value of type
`Parser` again, but this time it is `Parser(Token, str)`. Let's put it this way:
the set of parsers is closed under the application of `>>` to a parser and a
function of type `a -> b`.
### The `+` Combinator
Having just numbers is boring. We need some operations on them. Let's start with
the only one operator `**` (because `+` could be confusing in this context) and
apply it to numbers only, not to expressions.
In the expression `2 ** 32`, we need some way of saying “a number `2` is
followed by an operator `**`, followed by a number `32`.” In
`funcparserlib`, we do this by using the `+` combinator.
The `+` combinator is a sequential composition of two parsers. It has the
following type (warning: dynamic typing tricks ahead):
(+) :: Parser(a, b), Parser(a, c) -> Parser(a, _Tuple(b, c))
It basically does the following. Given two parsers of `Sequence(a)` to `b` and
`c`, respectively, it returns a parser, that applies the first one to the
sequence, then applies the second one to the sequence left, and combines the
results into a `_Tuple`.
The `_Tuple` is some sort of magic that simplifies access to the parsing
results. It accumulates all the parsed values preventing the nesting of tuples.
We can “turn off” the `_Tuple` to see what will happen by
explicitly casting every value parsed by a composed parser to `tuple`:
>>> p = (number + number >> tuple) + number >> tuple
>>> p.parse(tokenize('1 2 3'))
((1, 2), 3)
We have got nested tuples. To get the first number from the result `t` we need
to use `t[0][0]`. The second and the third ones are `t[0][1]` and `t[1]`. Well,
it is pretty inconsistent (but it is OK for you, Lisp hackers).
So the magic does the following:
>>> p = number + number + number
>>> p.parse(tokenize('1 2 3'))
(1, 2, 3)
Now it's OK for everyone (except for very statically typed persons).
OK, let's write a parser for the power operator expression. We have already got
a number parser. Now we need an operator parser. How about this one:
>>> pow = some(lambda tok: tok.type == 'OP' and tok.value == '**') >> tokval
It will work, but let's abstract away from the operator name:
>>> def op(s):
... 'str -> Parser(Token, str)'
... return (
... some(lambda tok: tok.type == 'OP' and tok.value == s)
... >> tokval)
Getting First Numbers
---------------------
### The `a` Combinator
Continuing with the `op`, we can define it using `lambda`:
>>> op = (lambda s:
... some(lambda tok: tok.type == 'OP' and tok.value == s)
... >> tokval)
We need to parse an exact token here, the token `s`. So maybe we can come up with
some combinator, that takes as its input a value and returns a parser, that
parses a token only if it is equal to that value. Let's call this combinator `a`
(because it parses _a_ token given to it). Here is its type:
a :: Eq(a) => a -> Parser(a, a)
`a` requires an equality constraint (we have already defined `__eq__` for
`Token`) on its input type `a`.
The definition of the combinator is straightforward:
a = lambda x: some(lambda y: x == y)
It's quite useful in practice, so `funcparserlib` already contains such a
combinator. You can just import it from there (as we have already done earlier).
Let's rewrite `op` using `a`:
>>> op = lambda s: a(Token(token.OP, s)) >> tokval
>>> pow = op('**')
and test it:
>>> pow.parse(tokenize('**'))
'**'
Oops, we got just a string `'**'`, but we wanted a function `**` (for Lisp
hackers: it would be nice to just `(eval (quote **))`). We have already seen
this problem before. Let's just transform the parser using `>>`:
>>> import operator
>>> pow = op('**') >> (lambda x: operator.pow)
OK, but the `x` isn't used here, so the classic function `const` comes to our
minds (for combinatorically inclined: it is just `K`):
>>> const = lambda x: lambda _: x
The revisited version of `pow` is:
>>> pow = op('**') >> const(operator.pow)
Let's test it again:
>>> f = pow.parse(tokenize('**'))
>>> f(2, 12)
4096
>>> del f
### Pythonic Uncurrying
OK, it's time to put it all together. Let's define the `eval_expr` function,
that will map the result of parsing an expression to the resulting value:
>>> def eval_expr(x):
... return x[1](x[0], x[2])
Then define a simple expression parser (we don't recur on the subparts of the
expression yet):
>>> expr = number + pow + number >> eval_expr
Test it:
>>> expr.parse(tokenize('2 ** 12'))
4096
Cool! Our first real calculation!
But the `eval_expr` function isn't very clean. Why doesn't it just take
positional arguments instead of a tuple? Because `+` returns a tuple (the magic
`_Tuple`). Hey, don't allow some code to force you to make your functions less
clean than they should be!
Let's make the arguments positional and provide a wrapper for calling
`eval_expr` with a single tuple. In fact, this task is quite general. We can
turn any function of `n` arguments into a function of a single `n`-tuple (for
functionally inclined: we can uncurry it):
>>> unarg = lambda f: lambda x: f(*x)
So the new `eval_expr` is:
>>> eval_expr = unarg(lambda a, f, b: f(a, b))
Yes, it is cleaner now than it was before.
Redefine `expr` and test it:
>>> expr = number + pow + number >> eval_expr
>>> expr.parse(tokenize('2 ** 12'))
4096
Making a Choice
---------------
### The `|` Combinator
So far so good. Now we need to support more than one operation. We already know
how define a new operation. But how do we choose between, say, `**` and `-`
while parsing? The combinators we learned so far are pretty determinate. Well,
except for `some` that returns something that satisfies the predicate. In this
particular case we could continue with only `some`, but this approach is _ad
hoc_ so we need a general one.
And the general approach is the choice combinator `|`. It allows choice
composition of parsers. Given two parsers of `Sequence(a)` returning `b` and
`c`, respectively, it returns a parser of `Sequence(a)` that applies the first
parser, and in case it has failed applies the second one. Here is it's type (for
Haskell hackers: dynamic typing again, there should be `Either b c` here):
(|) :: Parser(a, b), Parser(a, c) -> Parser(a, b or c)
Let's see how it works by defining one more operator:
>>> sub = op('-') >> const(operator.sub)
and then using the choice combinator in `expr`:
>>> expr = number + (pow | sub) + number >> eval_expr
Test it:
>>> expr.parse(tokenize('2 ** 8'))
256
>>> expr.parse(tokenize('256 - 1'))
255
and what if none of the alternatives matches:
>>> expr.parse(tokenize('2 + 2'))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 1,2-1,3 OP '+'
Let's cover all the basic arithmetic binary operators using one more bit of
abstraction:
>>> makeop = lambda s, f: op(s) >> const(f)
>>> add = makeop('+', operator.add)
>>> sub = makeop('-', operator.sub)
>>> mul = makeop('*', operator.mul)
>>> div = makeop('/', operator.div)
>>> pow = makeop('**', operator.pow)
>>> operator = add | sub | mul | div | pow
>>> expr = number + operator + number >> eval_expr
Test it:
>>> expr.parse(tokenize('2 + 2'))
4
>>> expr.parse(tokenize('2 * 2'))
4
Yay! We can do elementary school arithmetics!
### Conflicting Alternatives
OK, we have got a parser for expressions containing a binary operation, so we
can write a toplevel parser of single numbers _and_ expressions of numbers:
>>> toplevel = number | expr
Test it:
>>> toplevel.parse(tokenize('5'))
5
>>> toplevel.parse(tokenize('2 + 3')) == 5
False
>>> toplevel.parse(tokenize('2 + 3'))
2
Oops, it does the arithmetic wrong! We have encountered a common problem in
parsing. The first alternative of `toplevel` parses a subtree of some next
alternative (because `number` is a subpart of `expr`). We should be careful and
compose parsers using `|` so that they don't conflict with each other:
>>> toplevel = expr | number
Remember that the longest token sequence should be parsed first!
Let's test it:
>>> toplevel.parse(tokenize('5'))
5
>>> toplevel.parse(tokenize('2 + 3'))
5
### The Fear of Left-Recursion
We have defined the `toplevel` parser, that can parse expressions of numbers or
just numbers. But what about expressions of expressions of numbers, etc.? We
want to be able to parse the following expression:
2 ** 32 - 1
In order to build (or evaluate) its parse tree we could write a recursive parser:
expr = (expr + operator + expr) | number
but we cannot, because in top-down parsing algorithms (like the one used in
`funcparserlib`) left-recursion leads to non-termination of parsing!
How to avoid left-recursion on `expr` here? Let's start thinking in terms of
EBNF (Extended Backus-Naur Form) that is used widely in grammar definitions. Our
parser corresponds to these EBNF productions:
<expr> ::= <binary> | <number> ;
<binary> ::= <expr> , <op> , <expr> ;
Left-recursion is still there of course. But we can rewrite them this way using
EBNF repetition syntax:
<expr> ::= <number> , { <op> , <number> }
Here `{` and `}` mean “zero or more times”. As we can see,
left-recursion has been thrown away here. It is always possible to get rid of it
using a formal method, but usually you can just look at your grammar and
modify it a little to make it non-left-recursive.
Remember that the left-recursion must be avoided!
### The `many` Combinator
The new definition of `<expr>` doesn't have left-recursion any more, but it
assumes a new parser combinator for doing things many times, as suggested by the
`{` `}` notation.
This combinator is called `many`. It returns a parser that applies a parser
passed as its argument to a sequence of tokens as many times as the parser
succeeds. The resulting parser returns a list of results containing zero or more
parsed tokens. Here is its type:
many :: Parser(a, b) -> Parser(a, [b])
It works like this:
>>> many(number).parse(tokenize('1'))
[1]
>>> many(number).parse(tokenize('1 2 3'))
[1, 2, 3]
>>> many(number).parse(tokenize('1 foo'))
[1]
>>> many(number).parse(tokenize('foo'))
[]
With `many`, we can avoid left-recursion and translate the `<expr>` production
of EBNF directly into the parser of `funcparserlib`:
>>> expr = number + many(operator + number)
Let's test it:
>>> expr.parse(tokenize('2 + 3'))
(2, [(<built-in function add>, 3)])
It seems that we forgot to map parsing results to numbers again. Let's fix
this:
>>> def eval_expr(z, list):
... return reduce(lambda s, (f, x): f(s, x), list, z)
Here we fold the `list` of pairs of an operator and its right operand, starting with the
initial value `z` using a function that applies the operator `f` to the
accumulated value `s` and the right operand `x` (for functionally inclined: we
just `foldl` the list of functions and their right arguments using function
application).
Well, for _not_ functionally inclined: just write your own `eval_expr` for
evaluating results of the new `expr` and then look how your recursion pattern is
abstracted in the code above.
Now let's refine `expr` with `eval_expr`:
>>> expr = number + many(operator + number) >> unarg(eval_expr)
and test it:
>>> expr.parse(tokenize('2 * 3 + 4'))
10
>>> expr.parse(tokenize('1 * 2 * 3 * 4'))
24
Cool, we just have calculated the factorial of 4!
>>> expr.parse(tokenize('2 ** 32 - 1')) == 4294967295
True
and this is the largest `unsigned int` possible on 32-bit computers.
Ordering Calculations
---------------------
### Operator Precedence
And how about this one:
>>> expr.parse(tokenize('2 + 3 * 4'))
20
Wait, it should be `14`, not `20`, because `2 + 3 * 4` is really `2 + (3 * 4)`.
Our parser is unaware of operator precedence.
There are two basic approaches for dealing with precedence in parsers. The first
one is to provide special constructs for specifying precedence and the second
one is to modify the grammar to reflect the precedence rules. We will use the
second one.
According to this quite popular approach, our modified grammar will look like
this:
>>> f = unarg(eval_expr)
>>> mul_op = mul | div
>>> add_op = add | sub
>>> factor = number + many(pow + number) >> f
>>> term = factor + many(mul_op + factor) >> f
>>> expr = term + many(add_op + term) >> f
The nesting levels in the parse tree mirror the precedence levels of the
operators. So `1` as a tree is something like `Expr(Term(Factor(Number(1))))`
but it's OK since it's only a parse tree, not an AST (abstract syntax tree). In a
typical AST, such wrapper nodes are thrown away. We don't transform our parse
tree into an AST because we write an interpreter that evaluates parse tree nodes
(does semantic actions) while parsing.
Let's test our new `expr`:
>>> expr.parse(tokenize('1'))
1
>>> expr.parse(tokenize('2 + 3 * 4'))
14
>>> expr.parse(tokenize('3 + 2 * 2 ** 3 - 4 * 4'))
3
### The `with_forward_decls` Combinator
Initial deletions:
>>> del expr
The last thing we want to see in our expressions is parentheses. That's an easy
one. Let's just add one more nesting level of operators. Parentheses have the
highest precedence, so they should be nested in `factor`. We can write the new
nested parser `primary`:
>>> primary = number | ((op('(') + expr + op(')')) >> (lambda x: x[1]))
Traceback (most recent call last):
...
NameError: name 'expr' is not defined
Oops, in fact, we cannot yet! The definition is recursive. `primary` uses
`expr`, but `expr` uses `term` that uses `factor` that uses `primary`.
Variable binding rules in Python don't allow using a variable before it got
assigned a value in the current scope. But it's OK to use it within a nested
scope; think of mutually recursive function definitions. So we have to wrap the
parser that is assigned to `primary` into a function of no arguments (sometimes
called a suspension or a thunk) in order to evaluate the parser lazily (for
Haskell hackers: you got it for free, lazy guys).
Such a combinator is provided by `funcparserlib`. It is called
`with_forward_decls` and its type is:
with_forward_decls :: (None -> Parser(a, b)) -> Parser(a, b)
Import it:
>>> from funcparserlib.parser import with_forward_decls
Another way to define mutually recursive parsers is via the `forward_decl`
combinator. It uses some bits of mutable state, but it is more efficient and
probably will be the recommended way to deal with recursive definitions. See the
sources for details. But let's use `with_forward_decls` here.
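For reference, a sketch of the same recursive definitions using `forward_decl`
might look like the following (we keep using `with_forward_decls` in the rest of
the tutorial):

    from funcparserlib.parser import forward_decl

    # Declare expr first, use it inside primary, then define it
    expr = forward_decl()
    primary = number | ((op('(') + expr + op(')')) >> (lambda x: x[1]))
    factor = primary + many(pow + primary) >> f
    term = factor + many(mul_op + factor) >> f
    expr.define(term + many(add_op + term) >> f)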
Finally, we can write a definition of `primary` that has a forward declaration
of `expr`:
>>> primary = with_forward_decls(lambda:
... number | ((op('(') + expr + op(')')) >> (lambda x: x[1])))
or equivalently using Python decorators syntax:
>>> @with_forward_decls
... def primary():
... return number | ((op('(') + expr + op(')')) >> (lambda x: x[1]))
and redefine the dependent parsers:
>>> factor = primary + many(pow + primary) >> f
>>> term = factor + many(mul_op + factor) >> f
>>> expr = term + many(add_op + term) >> f
Let's test it:
>>> expr.parse(tokenize('2 + 3 * 4'))
14
>>> expr.parse(tokenize('(2 + 3) * 4'))
20
>>> expr.parse(tokenize('((1 + 1) ** (((8))))'))
256
So, we are basically done with our expression parser. But there are still some
minor issues we want to cover.
One not so minor thing we still don't have in our expressions is the unary `-`
for negative numbers. Its implementation is left as an exercise for the reader
(for Haskell hackers: you may wish to add functions support to our calculator
and implement `-` as a function `negate`).
Polishing the Code
------------------
Let's cover some minor issues we mentioned in the previous section.
### The `skip` Combinator
First of all, the parentheses parser we have defined is quite ugly:
primary = with_forward_decls(lambda:
number | ((op('(') + expr + op(')')) >> (lambda x: x[1])))
What we really want to say here is: “`primary` is a parser
`with_forward_decls`, that parses a `number` or (an `op('(')` followed by an
`expr`, followed by an `op(')')`) where the `op`s are of no use and should be
skipped, so the return value is just the `number` or the `expr`.”
The `skip` combinator will help us to write exactly that. It has the following
type (warning: dynamic typing magic is back again):
skip :: Parser(a, b) -> Parser(a, _Ignored(b))
A magic `_Ignored(b)` value is a trivial container for values of `b` that is
completely ignored by the `+` combinator during concatenation of its magic
`_Tuple` of results.
Look at the examples:
>>> (number + number).parse(tokenize('2 3'))
(2, 3)
>>> (skip(number) + number).parse(tokenize('2 3'))
3
>>> (skip(number) + number).parse(tokenize('+ 2 3'))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 1,0-1,1 OP '+'
Note that `skip` still requires its argument parser to succeed.
So let's rewrite the `primary` parser using `op_` (for Haskell hackers: notice
a naming analogy with functions like `sequence_`):
>>> op_ = lambda s: skip(op(s))
>>> primary = with_forward_decls(lambda:
... number | (op_('(') + expr + op_(')')))
and redefine the dependent parsers:
>>> factor = primary + many(pow + primary) >> f
>>> term = factor + many(mul_op + factor) >> f
>>> expr = term + many(add_op + term) >> f
Finally, test it:
>>> expr.parse(tokenize('(2 + 3) * 4'))
20
>>> expr.parse(tokenize('3.1415926 * (2 + 7.18281828e-1)'))
8.539734075559272
### The `finished` Combinator
It seems that we have almost finished with our calculator. Let's fix some more
subtle problems. Suppose the user typed the following string:
'2 + 3 foo'
It seems like a syntax error: `'foo'` is clearly not a part of our expression
grammar. Let's test it:
>>> expr.parse(tokenize('2 + 3 foo'))
5
No, it _is_ a part of our grammar somehow. Let's look at the sequence of tokens
in this example:
>>> print '\n'.join(map(unicode, tokenize('2 + 3 foo')))
1,0-1,1 NUMBER '2'
1,2-1,3 OP '+'
1,4-1,5 NUMBER '3'
1,6-1,9 NAME 'foo'
2,0-2,0 ENDMARKER ''
Our `expr` parses the first three tokens and then stops calculating the result.
Why does it behave this way? Let's recall the type of a parser function (that is
hidden inside `Parser`):
p :: Sequence(a), State -> (b, State)
A parser function takes tokens from the input sequence and transforms them into
a tuple of a resulting value of type `b` _and_ the rest of the input sequence.
The `Parser.parse` function that we are using drops the rest of the sequence and
returns only the resulting value. Hence, only the first three tokens were parsed
in our example.
So we need some means to make sure that the input sequence is parsed to its very
end. There are two things we have to do. The first one is to consume the
`ENDMARKER` token returned by `tokenize.generate_tokens`. And the second one is
to check that nothing is left in the stream.
Checking the `ENDMARKER` is easy:
>>> endmark = a(Token(token.ENDMARKER, ''))
>>> toplevel = expr + skip(endmark)
Test it:
>>> toplevel.parse(tokenize('2 + 3 foo'))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 1,6-1,9 NAME 'foo'
>>> toplevel.parse(tokenize('2 + 3'))
5
Now we need to check that nothing is left in the sequence after the `ENDMARKER`.
In the context of a parser _function_ it is easy again. We have to check the
length of the input sequence. Let's call it `finished`:
@Parser
def finished(tokens, s):
if len(tokens) == 0:
return (None, s)
else:
raise NoParseError('sequence must be empty', s)
Notice, that the function is wrapped into a `Parser` object.
But functions like this one expose too many internal details. In fact, we have
managed so far without dealing with all these `Parser` and `NoParseError`
classes, manipulations with a parsing state, etc. So it is a rare case when we
really need the details.
As this particular parser is useful in practice, it is provided by
`funcparserlib` so we can just import it and forget about the internals of
parsers again.
Let's rewrite `toplevel` again:
>>> toplevel = expr + skip(endmark + finished)
>>> toplevel.parse(tokenize('2 + 3'))
5
Test it using a hand-crafted illegal sequence of tokens:
>>> toplevel.parse([
... Token(token.NUMBER, '5'),
... Token(token.ENDMARKER, ''),
... Token(token.ENDMARKER, '')])
Traceback (most recent call last):
...
NoParseError: should have reached <EOF>: 0,0-0,0 ENDMARKER ''
### The `maybe` Combinator
And what about the empty input:
>>> toplevel.parse(tokenize(''))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 1,0-1,0 ENDMARKER ''
In a calculator (as in any shell) the empty string should be considered as a
no-op command. The result should be nothing, not an error message.
Let's allow the empty input in `toplevel`:
>>> end = skip(endmark + finished)
>>> toplevel = (end >> const(None)) | (expr + end)
Why `>> const(None)`, not just `end`? Because `skip` returns a value of type
`_Ignored(a)` and we need just `None`.
Test it:
>>> toplevel.parse(tokenize('2 + 3'))
5
>>> toplevel.parse(tokenize('')) is None
True
`toplevel` is now correct, but its definition uses too many words. Basically we
want to say just this: “`toplevel` consists of an optional `expr`, plus
the `end` of the input.” This reminds us of optional production brackets
`[` `]` in EBNF. In an EBNF grammar, we can write:
<toplevel> ::= [ <expr> ] , <end>
Why not just add the equivalent `maybe` combinator to our tools? `funcparserlib`
already includes `maybe`, and it is quite useful in practice.
But let's try to come up with its definition ourselves!
We could write the following `_maybe` combinator, that returns a parser
returning either the result of the given parser or `None` if the parser fails:
>>> _maybe = lambda x: x | (some(const(True)) >> const(None))
The first alternative is the parser that is to be made optional and the second
one is the parser that always succeeds (it isn't so, see below) and returns
`None`.
Test it:
>>> _maybe(op('(')).parse(tokenize('()'))
'('
>>> (_maybe(op('(')) + number).parse(tokenize('5'))
Traceback (most recent call last):
...
NoParseError: got unexpected token: 2,0-2,0 ENDMARKER ''
Oops, it doesn't work! The reason is that `some(const(True))` always consumes
one token despite the fact that the predicate `const(True)` doesn't require a
token. We need some parser that does nothing and keeps its input untouched
returning its argument as a result. It is called the `pure` combinator (for
functionally inclined: a parser is a pointed functor). Here is its type:
pure :: b -> Parser(a, b)
`pure` itself is not so useful in practice. But the real `maybe` combinator from
`funcparserlib` is defined in terms of `pure`:
maybe = lambda x: x | pure(None)
We will just import `maybe` from `funcparserlib` (we have already done this in
the beginning). Here is its type (for Haskell hackers: yes, it should return
`Maybe b`):
maybe :: Parser(a, b) -> Parser(a, b or None)
Given `maybe`, let's rewrite `toplevel` once again. But this time we are about
to define an interface function for parsing as we did for lexing:
>>> def parse(tokens):
... 'Sequence(Token) -> int or float or None'
...
... # All our parsers should be defined here
...
... toplevel = maybe(expr) + end
... return toplevel.parse(tokens)
`toplevel` is very nice now!
Let's test it:
>>> parse(tokenize('2 + 3'))
5
>>> parse(tokenize('')) is None
True
Now we have completed our calculator!
Go make yourself a cup of tea and revisit the full source code in the
“Diving In” section! Or maybe read some advanced materials below.
And don't forget to write some comments [here][funcparserlib-issues]!
Advanced Topics
---------------
### Parser Type Classes
Parsers can be thought of as instances of type classes. Parsers are monads
(therefore, applicative pointed functors). The monadic nature of parsers is used
in the implementation of some combinators, see [the source code][parser-py].
Also parsers form two monoids under sequential composition and choice
composition.
Haskell hackers may have extra fun by considering the following pseudo-Haskell
instances for parsers:
instance Functor (Parser a) where
fmap f x = x >> f
instance Pointed (Parser a) where
pure x = pure x
instance Monad (Parser a b) where
x >>= f = x.bind(f)
instance Monoid (Parser a b) where
mempty = skip(pure(const(None)))
mappend x y = x + y
instance Monoid (Parser a b) where
mempty = some(const(False))
mappend x y = x | y
[doctest]: http://docs.python.org/library/doctest.html
[tokenize]: http://docs.python.org/library/tokenize.html
[funcparserlib]: http://code.google.com/p/funcparserlib/
[funcparserlib-issues]: http://code.google.com/p/funcparserlib/issues/list
[dot-parser]: http://code.google.com/p/funcparserlib/source/browse/examples/dot/dot.py
[json-parser]: http://code.google.com/p/funcparserlib/source/browse/examples/json/json.py
[nested]: http://archlinux.folding-maps.org/2009/funcparserlib/Brackets
[sequences]: http://www.python.org/dev/peps/pep-3119/#sequences
[parser-py]: http://code.google.com/p/funcparserlib/source/browse/src/funcparserlib/parser.py
### Papers on Functional Parsers
TODO: There are lots of them. Write a review.
funcparserlib-0.3.6/funcparserlib/__init__.py

funcparserlib-0.3.6/funcparserlib/lexer.py

# -*- coding: utf-8 -*-
# Copyright (c) 2008/2013 Andrey Vlasovskikh
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
__all__ = ['make_tokenizer', 'Token', 'LexerError']
import re
class LexerError(Exception):
def __init__(self, place, msg):
self.place = place
self.msg = msg
def __str__(self):
s = u'cannot tokenize data'
line, pos = self.place
return u'%s: %d,%d: "%s"' % (s, line, pos, self.msg)
class Token(object):
def __init__(self, type, value, start=None, end=None):
self.type = type
self.value = value
self.start = start
self.end = end
def __repr__(self):
return u'Token(%r, %r)' % (self.type, self.value)
def __eq__(self, other):
# FIXME: Case sensitivity is assumed here
return self.type == other.type and self.value == other.value
def _pos_str(self):
if self.start is None or self.end is None:
return ''
else:
sl, sp = self.start
el, ep = self.end
return u'%d,%d-%d,%d:' % (sl, sp, el, ep)
def __str__(self):
s = u"%s %s '%s'" % (self._pos_str(), self.type, self.value)
return s.strip()
@property
def name(self):
return self.value
def pformat(self):
return u"%s %s '%s'" % (self._pos_str().ljust(20),
self.type.ljust(14),
self.value)
def make_tokenizer(specs):
"""[(str, (str, int?))] -> (str -> Iterable(Token))"""
def compile_spec(spec):
name, args = spec
return name, re.compile(*args)
compiled = [compile_spec(s) for s in specs]
def match_specs(specs, str, i, position):
line, pos = position
for type, regexp in specs:
m = regexp.match(str, i)
if m is not None:
value = m.group()
nls = value.count(u'\n')
n_line = line + nls
if nls == 0:
n_pos = pos + len(value)
else:
n_pos = len(value) - value.rfind(u'\n') - 1
return Token(type, value, (line, pos + 1), (n_line, n_pos))
else:
errline = str.splitlines()[line - 1]
raise LexerError((line, pos + 1), errline)
def f(str):
length = len(str)
line, pos = 1, 0
i = 0
while i < length:
t = match_specs(compiled, str, i, (line, pos))
yield t
line, pos = t.end
i += len(t.value)
return f
# This is an example of a token spec. See also [this article][1] for a
# discussion of searching for multiline comments using regexps (including `*?`).
#
# [1]: http://ostermiller.org/findcomment.html
_example_token_specs = [
('COMMENT', (r'\(\*(.|[\r\n])*?\*\)', re.MULTILINE)),
('COMMENT', (r'\{(.|[\r\n])*?\}', re.MULTILINE)),
('COMMENT', (r'//.*',)),
('NL', (r'[\r\n]+',)),
('SPACE', (r'[ \t\r\n]+',)),
('NAME', (r'[A-Za-z_][A-Za-z_0-9]*',)),
('REAL', (r'[0-9]+\.[0-9]*([Ee][+\-]?[0-9]+)*',)),
('INT', (r'[0-9]+',)),
('INT', (r'\$[0-9A-Fa-f]+',)),
('OP', (r'(\.\.)|(<>)|(<=)|(>=)|(:=)|[;,=\(\):\[\]\.+\-<>\*/@\^]',)),
('STRING', (r"'([^']|(''))*'",)),
('CHAR', (r'#[0-9]+',)),
('CHAR', (r'#\$[0-9A-Fa-f]+',)),
]
#tokenize = make_tokenizer(_example_token_specs)
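# A minimal usage sketch of `make_tokenizer` (illustrative only, using a
# made-up two-rule spec rather than anything from the library itself):
#
#   _sketch_tokenize = make_tokenizer([
#       ('NUMBER', (r'[0-9]+',)),
#       ('SPACE', (r'[ \t]+',)),
#   ])
#   tokens = [t for t in _sketch_tokenize(u'12 345') if t.type != 'SPACE']
#   # -> two tokens of type 'NUMBER' with values u'12' and u'345'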
funcparserlib-0.3.6/funcparserlib/parser.py 0000644 0000765 0000024 00000027774 12135637150 021173 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
# Copyright (c) 2008/2013 Andrey Vlasovskikh
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""A recurisve descent parser library based on functional combinators.
Basic combinators are taken from Harrison's book ["Introduction to Functional
Programming"][1] and translated from ML into Python. See also [a Russian
translation of the book][2].
[1]: http://www.cl.cam.ac.uk/teaching/Lectures/funprog-jrh-1996/
[2]: http://code.google.com/p/funprog-ru/
A parser `p` is represented by a function of type:
p :: Sequence(a), State -> (b, State)
that takes as its input a sequence of tokens of arbitrary type `a` and a
current parsing state and returns a pair of a parsed token of arbitrary type
`b` and the new parsing state.
The parsing state includes the current position in the sequence being parsed and
the position of the rightmost token that has been consumed while parsing.
Parser functions are wrapped into an object of the class `Parser`. This class
implements custom operators `+` for sequential composition of parsers, `|` for
choice composition, `>>` for transforming the result of parsing. The method
`Parser.parse` provides an easier way for invoking a parser hiding details
related to a parser state:
Parser.parse :: Parser(a, b), Sequence(a) -> b
Although this module is able to deal with sequences of any kind of objects, the
recommended way of using it is applying a parser to a `Sequence(Token)`.
`Token` objects are produced by a regexp-based tokenizer defined in
`funcparserlib.lexer`. By using it this way you get more readable parsing error
messages (as `Token` objects contain their position in the source file) and good
separation of lexical and syntactic levels of the grammar. See examples for more
info.
Debug messages are emitted via a `logging.Logger` object named
`"funcparserlib"`.
"""
__all__ = [
'some', 'a', 'many', 'pure', 'finished', 'maybe', 'skip', 'oneplus',
'forward_decl', 'NoParseError',
]
import logging
log = logging.getLogger('funcparserlib')
debug = False
class Parser(object):
"""A wrapper around a parser function that defines some operators for parser
composition.
"""
def __init__(self, p):
"""Wraps a parser function p into an object."""
self.define(p)
def named(self, name):
"""Specifies the name of the parser for more readable parsing log."""
self.name = name
return self
def define(self, p):
"""Defines a parser wrapped into this object."""
f = getattr(p, 'run', p)
if debug:
setattr(self, '_run', f)
else:
setattr(self, 'run', f)
self.named(getattr(p, 'name', p.__doc__))
def run(self, tokens, s):
"""Sequence(a), State -> (b, State)
Runs a parser wrapped into this object.
"""
if debug:
log.debug(u'trying %s' % self.name)
return self._run(tokens, s)
def _run(self, tokens, s):
raise NotImplementedError(u'you must define() a parser')
def parse(self, tokens):
"""Sequence(a) -> b
Applies the parser to a sequence of tokens producing a parsing result.
It provides a way to invoke a parser hiding details related to the
parser state. Also it makes error messages more readable by specifying
the position of the rightmost token that has been reached.
"""
try:
(tree, _) = self.run(tokens, State())
return tree
except NoParseError, e:
max = e.state.max
if len(tokens) > max:
tok = tokens[max]
else:
tok = u''
raise NoParseError(u'%s: %s' % (e.msg, tok), e.state)
def __add__(self, other):
"""Parser(a, b), Parser(a, c) -> Parser(a, _Tuple(b, c))
A sequential composition of parsers.
        NOTE: The real type of the parsed value isn't always as specified.
Here we use dynamic typing for ignoring the tokens that are of no
interest to the user. Also we merge parsing results into a single _Tuple
        unless the user explicitly prevents it. See also skip and >>
combinators.
"""
def magic(v1, v2):
vs = [v for v in [v1, v2] if not isinstance(v, _Ignored)]
if len(vs) == 1:
return vs[0]
elif len(vs) == 2:
if isinstance(vs[0], _Tuple):
return _Tuple(v1 + (v2,))
else:
return _Tuple(vs)
else:
return _Ignored(())
@Parser
def _add(tokens, s):
(v1, s2) = self.run(tokens, s)
(v2, s3) = other.run(tokens, s2)
return magic(v1, v2), s3
# or in terms of bind and pure:
# _add = self.bind(lambda x: other.bind(lambda y: pure(magic(x, y))))
_add.name = u'(%s , %s)' % (self.name, other.name)
return _add
def __or__(self, other):
"""Parser(a, b), Parser(a, c) -> Parser(a, b or c)
A choice composition of two parsers.
NOTE: Here we are not providing the exact type of the result. In a
        statically typed language something like Either b c could be used. See
also + combinator.
"""
@Parser
def _or(tokens, s):
try:
return self.run(tokens, s)
except NoParseError, e:
return other.run(tokens, State(s.pos, e.state.max))
_or.name = u'(%s | %s)' % (self.name, other.name)
return _or
def __rshift__(self, f):
"""Parser(a, b), (b -> c) -> Parser(a, c)
Given a function from b to c, transforms a parser of b into a parser of
        c. It is useful for transforming a parser value into another value for
making it a part of a parse tree or an AST.
This combinator may be thought of as a functor from b -> c to Parser(a,
b) -> Parser(a, c).
"""
@Parser
def _shift(tokens, s):
(v, s2) = self.run(tokens, s)
return f(v), s2
# or in terms of bind and pure:
# _shift = self.bind(lambda x: pure(f(x)))
_shift.name = u'(%s)' % (self.name,)
return _shift
def bind(self, f):
"""Parser(a, b), (b -> Parser(a, c)) -> Parser(a, c)
NOTE: A monadic bind function. It is used internally to implement other
combinators. Functions bind and pure make the Parser a Monad.
"""
@Parser
def _bind(tokens, s):
(v, s2) = self.run(tokens, s)
return f(v).run(tokens, s2)
_bind.name = u'(%s >>=)' % (self.name,)
return _bind
class State(object):
"""A parsing state that is maintained basically for error reporting.
It consists of the current position pos in the sequence being parsed and
the position max of the rightmost token that has been consumed while
parsing.
"""
def __init__(self, pos=0, max=0):
self.pos = pos
self.max = max
def __str__(self):
return unicode((self.pos, self.max))
def __repr__(self):
return u'State(%r, %r)' % (self.pos, self.max)
class NoParseError(Exception):
def __init__(self, msg=u'', state=None):
self.msg = msg
self.state = state
def __str__(self):
return self.msg
class _Tuple(tuple):
pass
class _Ignored(object):
def __init__(self, value):
self.value = value
def __repr__(self):
return u'_Ignored(%s)' % repr(self.value)
@Parser
def finished(tokens, s):
"""Parser(a, None)
Throws an exception if any tokens are left in the input unparsed.
"""
if s.pos >= len(tokens):
return None, s
else:
        raise NoParseError(u'should have reached <EOF>', s)
finished.name = u'finished'
def many(p):
"""Parser(a, b) -> Parser(a, [b])
    Returns a parser that repeatedly applies the parser p to the input sequence
    of tokens while it successfully parses them. The resulting parser returns a
list of parsed values.
"""
@Parser
def _many(tokens, s):
"""Iterative implementation preventing the stack overflow."""
res = []
try:
while True:
(v, s) = p.run(tokens, s)
res.append(v)
except NoParseError, e:
return res, State(s.pos, e.state.max)
_many.name = u'{ %s }' % p.name
return _many
def some(pred):
"""(a -> bool) -> Parser(a, a)
Returns a parser that parses a token if it satisfies a predicate pred.
"""
@Parser
def _some(tokens, s):
if s.pos >= len(tokens):
raise NoParseError(u'no tokens left in the stream', s)
else:
t = tokens[s.pos]
if pred(t):
pos = s.pos + 1
s2 = State(pos, max(pos, s.max))
if debug:
log.debug(u'*matched* "%s", new state = %s' % (t, s2))
return t, s2
else:
if debug:
log.debug(u'failed "%s", state = %s' % (t, s))
raise NoParseError(u'got unexpected token', s)
_some.name = u'(some)'
return _some
def a(value):
"""Eq(a) -> Parser(a, a)
Returns a parser that parses a token that is equal to the value value.
"""
name = getattr(value, 'name', value)
return some(lambda t: t == value).named(u'(a "%s")' % (name,))
def pure(x):
@Parser
def _pure(_, s):
return x, s
_pure.name = u'(pure %r)' % (x,)
return _pure
def maybe(p):
"""Parser(a, b) -> Parser(a, b or None)
    Returns a parser that returns None if parsing fails.
NOTE: In a statically typed language, the type Maybe b could be more
    appropriate.
"""
return (p | pure(None)).named(u'[ %s ]' % (p.name,))
def skip(p):
"""Parser(a, b) -> Parser(a, _Ignored(b))
    Returns a parser whose results are ignored by the combinator +. It is useful
    for throwing away elements of concrete syntax (e.g. ",", ";").
"""
return p >> _Ignored
def oneplus(p):
"""Parser(a, b) -> Parser(a, [b])
Returns a parser that applies the parser p one or more times.
"""
q = p + many(p) >> (lambda x: [x[0]] + x[1])
return q.named(u'(%s , { %s })' % (p.name, p.name))
def with_forward_decls(suspension):
"""(None -> Parser(a, b)) -> Parser(a, b)
Returns a parser that computes itself lazily as a result of the suspension
provided. It is needed when some parsers contain forward references to
parsers defined later and such references are cyclic. See examples for more
details.
"""
@Parser
def f(tokens, s):
return suspension().run(tokens, s)
return f
def forward_decl():
"""None -> Parser(?, ?)
Returns an undefined parser that can be used as a forward declaration. You
will be able to define() it when all the parsers it depends on are
available.
"""
@Parser
def f(tokens, s):
raise NotImplementedError(u'you must define() a forward_decl somewhere')
return f
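# An illustrative sketch of `forward_decl` for a recursive grammar of nested
# curly brackets (assumes plain characters as tokens; not part of the library):
#
#   nested = forward_decl()
#   nested.define(a(u'{') + many(nested) + a(u'}'))
#   (nested + skip(finished)).parse(u'{{}{}}')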
if __name__ == '__main__':
import doctest
doctest.testmod()
funcparserlib-0.3.6/funcparserlib/tests/ 0000755 0000765 0000024 00000000000 12140502162 020434 5 ustar vlan staff 0000000 0000000 funcparserlib-0.3.6/funcparserlib/tests/__init__.py 0000644 0000765 0000024 00000000000 12135637150 022545 0 ustar vlan staff 0000000 0000000 funcparserlib-0.3.6/funcparserlib/tests/dot.py 0000644 0000765 0000024 00000014416 12135637150 021614 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
r"""A DOT language parser using funcparserlib.
The parser is based on [the DOT grammar][1]. It is fairly complete, except for
a few unsupported features:
* String escapes `\"`
* Ports and compass points
* XML identifiers
At the moment, the parser builds only a parse tree, not an abstract syntax tree
(AST) or an API for dealing with DOT.
[1]: http://www.graphviz.org/doc/info/lang.html
"""
import sys
import os
from re import MULTILINE
from funcparserlib.util import pretty_tree
from funcparserlib.lexer import make_tokenizer, Token, LexerError
from funcparserlib.parser import (some, a, maybe, many, finished, skip,
oneplus, forward_decl, NoParseError)
try:
from collections import namedtuple
except ImportError:
# Basic implementation of namedtuple for 2.1 < Python < 2.6
def namedtuple(name, fields):
"""Only space-delimited fields are supported."""
def prop(i, name):
return name, property(lambda self: self[i])
def new(cls, *args, **kwargs):
args = list(args)
n = len(args)
for i in range(n, len(names)):
name = names[i - n]
args.append(kwargs[name])
return tuple.__new__(cls, args)
names = dict((i, f) for i, f in enumerate(fields.split(u' ')))
methods = dict(prop(i, f) for i, f in enumerate(fields.split(u' ')))
methods.update({
'__new__': new,
'__repr__': lambda self: u'%s(%s)' % (
name,
u', '.join(u'%s=%r' % (
f, getattr(self, f)) for f in fields.split(u' ')))})
return type(name, (tuple,), methods)
ENCODING = u'UTF-8'
Graph = namedtuple('Graph', 'strict type id stmts')
SubGraph = namedtuple('SubGraph', 'id stmts')
Node = namedtuple('Node', 'id attrs')
Attr = namedtuple('Attr', 'name value')
Edge = namedtuple('Edge', 'nodes attrs')
DefAttrs = namedtuple('DefAttrs', 'object attrs')
def tokenize(str):
"""str -> Sequence(Token)"""
specs = [
(u'Comment', (ur'/\*(.|[\r\n])*?\*/', MULTILINE)),
(u'Comment', (ur'//.*',)),
(u'NL', (ur'[\r\n]+',)),
(u'Space', (ur'[ \t\r\n]+',)),
(u'Name', (ur'[A-Za-z\200-\377_][A-Za-z\200-\377_0-9]*',)),
(u'Op', (ur'[{};,=\[\]]|(->)|(--)',)),
(u'Number', (ur'-?(\.[0-9]+)|([0-9]+(\.[0-9]*)?)',)),
(u'String', (ur'"[^"]*"',)), # '\"' escapes are ignored
]
useless = [u'Comment', u'NL', u'Space']
t = make_tokenizer(specs)
return [x for x in t(str) if x.type not in useless]
def parse(seq):
"""Sequence(Token) -> object"""
unarg = lambda f: lambda args: f(*args)
tokval = lambda x: x.value
flatten = lambda list: sum(list, [])
n = lambda s: a(Token(u'Name', s)) >> tokval
op = lambda s: a(Token(u'Op', s)) >> tokval
op_ = lambda s: skip(op(s))
id_types = [u'Name', u'Number', u'String']
id = some(lambda t: t.type in id_types).named(u'id') >> tokval
make_graph_attr = lambda args: DefAttrs(u'graph', [Attr(*args)])
make_edge = lambda x, xs, attrs: Edge([x] + xs, attrs)
node_id = id # + maybe(port)
a_list = (
id +
maybe(op_(u'=') + id) +
skip(maybe(op(u',')))
>> unarg(Attr))
attr_list = (
many(op_(u'[') + many(a_list) + op_(u']'))
>> flatten)
attr_stmt = (
(n(u'graph') | n(u'node') | n(u'edge')) +
attr_list
>> unarg(DefAttrs))
graph_attr = id + op_(u'=') + id >> make_graph_attr
node_stmt = node_id + attr_list >> unarg(Node)
    # We use a forward_decl because of circular definitions like (stmt_list ->
# stmt -> subgraph -> stmt_list)
subgraph = forward_decl()
edge_rhs = skip(op(u'->') | op(u'--')) + (subgraph | node_id)
edge_stmt = (
(subgraph | node_id) +
oneplus(edge_rhs) +
attr_list
>> unarg(make_edge))
stmt = (
attr_stmt
| edge_stmt
| subgraph
| graph_attr
| node_stmt
)
stmt_list = many(stmt + skip(maybe(op(u';'))))
subgraph.define(
skip(n(u'subgraph')) +
maybe(id) +
op_(u'{') +
stmt_list +
op_(u'}')
>> unarg(SubGraph))
graph = (
maybe(n(u'strict')) +
maybe(n(u'graph') | n(u'digraph')) +
maybe(id) +
op_(u'{') +
stmt_list +
op_(u'}')
>> unarg(Graph))
dotfile = graph + skip(finished)
return dotfile.parse(seq)
def pretty_parse_tree(x):
"""object -> str"""
Pair = namedtuple(u'Pair', u'first second')
p = lambda x, y: Pair(x, y)
def kids(x):
"""object -> list(object)"""
if isinstance(x, (Graph, SubGraph)):
return [p(u'stmts', x.stmts)]
elif isinstance(x, (Node, DefAttrs)):
return [p(u'attrs', x.attrs)]
elif isinstance(x, Edge):
return [p(u'nodes', x.nodes), p(u'attrs', x.attrs)]
elif isinstance(x, Pair):
return x.second
else:
return []
def show(x):
"""object -> str"""
if isinstance(x, Pair):
return x.first
elif isinstance(x, Graph):
return u'Graph [id=%s, strict=%r, type=%s]' % (
x.id, x.strict is not None, x.type)
elif isinstance(x, SubGraph):
return u'SubGraph [id=%s]' % (x.id,)
elif isinstance(x, Edge):
return u'Edge'
elif isinstance(x, Attr):
return u'Attr [name=%s, value=%s]' % (x.name, x.value)
elif isinstance(x, DefAttrs):
return u'DefAttrs [object=%s]' % (x.object,)
elif isinstance(x, Node):
return u'Node [id=%s]' % (x.id,)
else:
return unicode(x)
return pretty_tree(x, kids, show)
def main():
#import logging
#logging.basicConfig(level=logging.DEBUG)
#import funcparserlib
#funcparserlib.parser.debug = True
try:
stdin = os.fdopen(sys.stdin.fileno(), u'rb')
input = stdin.read().decode(ENCODING)
tree = parse(tokenize(input))
#print pformat(tree)
print pretty_parse_tree(tree).encode(ENCODING)
except (NoParseError, LexerError), e:
msg = (u'syntax error: %s' % e).encode(ENCODING)
print >> sys.stderr, msg
sys.exit(1)
if __name__ == '__main__':
main()
funcparserlib-0.3.6/funcparserlib/tests/json.py 0000644 0000765 0000024 00000007302 12135637150 021773 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
r"""A JSON parser using funcparserlib.
The parser is based on [the JSON grammar][1].
[1]: http://tools.ietf.org/html/rfc4627
"""
import sys
import os
import re
import logging
from re import VERBOSE
from pprint import pformat
from funcparserlib.lexer import make_tokenizer, Token, LexerError
from funcparserlib.parser import (some, a, maybe, many, finished, skip,
forward_decl, NoParseError)
ENCODING = u'UTF-8'
regexps = {
u'escaped': ur'''
\\ # Escape
((?P["\\/bfnrt]) # Standard escapes
| (u(?P[0-9A-Fa-f]{4}))) # uXXXX
''',
u'unescaped': ur'''
[^"\\] # Unescaped: avoid ["\\]
''',
}
re_esc = re.compile(regexps[u'escaped'], VERBOSE)
def tokenize(str):
"""str -> Sequence(Token)"""
specs = [
(u'Space', (ur'[ \t\r\n]+',)),
(u'String', (ur'"(%(unescaped)s | %(escaped)s)*"' % regexps, VERBOSE)),
(u'Number', (ur'''
-? # Minus
(0|([1-9][0-9]*)) # Int
(\.[0-9]+)? # Frac
([Ee][+-][0-9]+)? # Exp
''', VERBOSE)),
(u'Op', (ur'[{}\[\]\-,:]',)),
(u'Name', (ur'[A-Za-z_][A-Za-z_0-9]*',)),
]
useless = [u'Space']
t = make_tokenizer(specs)
return [x for x in t(str) if x.type not in useless]
def parse(seq):
"""Sequence(Token) -> object"""
const = lambda x: lambda _: x
tokval = lambda x: x.value
toktype = lambda t: some(lambda x: x.type == t) >> tokval
op = lambda s: a(Token(u'Op', s)) >> tokval
op_ = lambda s: skip(op(s))
n = lambda s: a(Token(u'Name', s)) >> tokval
def make_array(n):
if n is None:
return []
else:
return [n[0]] + n[1]
def make_object(n):
return dict(make_array(n))
def make_number(n):
try:
return int(n)
except ValueError:
return float(n)
def unescape(s):
std = {
u'"': u'"', u'\\': u'\\', u'/': u'/', u'b': u'\b', u'f': u'\f',
u'n': u'\n', u'r': u'\r', u't': u'\t',
}
def sub(m):
if m.group(u'standard') is not None:
return std[m.group(u'standard')]
else:
return unichr(int(m.group(u'unicode'), 16))
return re_esc.sub(sub, s)
def make_string(n):
return unescape(n[1:-1])
null = n(u'null') >> const(None)
true = n(u'true') >> const(True)
false = n(u'false') >> const(False)
number = toktype(u'Number') >> make_number
string = toktype(u'String') >> make_string
value = forward_decl()
member = string + op_(u':') + value >> tuple
object = (
op_(u'{') +
maybe(member + many(op_(u',') + member)) +
op_(u'}')
>> make_object)
array = (
op_(u'[') +
maybe(value + many(op_(u',') + value)) +
op_(u']')
>> make_array)
value.define(
null
| true
| false
| object
| array
| number
| string)
json_text = object | array
json_file = json_text + skip(finished)
return json_file.parse(seq)
def loads(s):
"""str -> object"""
return parse(tokenize(s))
def main():
logging.basicConfig(level=logging.DEBUG)
try:
stdin = os.fdopen(sys.stdin.fileno(), 'rb')
input = stdin.read().decode(ENCODING)
tree = loads(input)
print pformat(tree)
except (NoParseError, LexerError), e:
msg = (u'syntax error: %s' % e).encode(ENCODING)
print >> sys.stderr, msg
sys.exit(1)
if __name__ == '__main__':
main()
funcparserlib-0.3.6/funcparserlib/tests/test_dot.py 0000644 0000765 0000024 00000013116 12135637150 022647 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
import unittest
from funcparserlib.parser import NoParseError
from funcparserlib.lexer import LexerError
from dot import parse, tokenize, Graph, Edge, SubGraph, DefAttrs, Attr, Node
class DotTest(unittest.TestCase):
def t(self, data, expected=None):
self.assertEqual(parse(tokenize(data)), expected)
def test_comments(self):
self.t(u'''
/* комм 1 */
graph /* комм 4 */ g1 {
// комм 2 /* комм 3 */
}
// комм 5
''',
Graph(strict=None, type=u'graph', id=u'g1', stmts=[]))
def test_connected_subgraph(self):
self.t(u'''
digraph g1 {
n1 -> n2 ->
subgraph n3 {
nn1 -> nn2 -> nn3;
nn3 -> nn1;
};
subgraph n3 {} -> n1;
}
''',
Graph(strict=None, type=u'digraph', id=u'g1', stmts=[
Edge(
nodes=[
u'n1',
u'n2',
SubGraph(id=u'n3', stmts=[
Edge(
nodes=[u'nn1', u'nn2', u'nn3'],
attrs=[]),
Edge(
nodes=[u'nn3', u'nn1'],
attrs=[])])],
attrs=[]),
Edge(
nodes=[
SubGraph(id=u'n3', stmts=[]),
u'n1'],
attrs=[])]))
def test_default_attrs(self):
self.t(u'''
digraph g1 {
page="3,3";
graph [rotate=90];
node [shape=box, color="#0000ff"];
edge [style=dashed];
n1 -> n2 -> n3;
n3 -> n1;
}
''',
Graph(strict=None, type=u'digraph', id=u'g1', stmts=[
DefAttrs(object=u'graph', attrs=[
Attr(name=u'page', value=u'"3,3"')]),
DefAttrs(object=u'graph', attrs=[
Attr(name=u'rotate', value=u'90')]),
DefAttrs(object=u'node', attrs=[
Attr(name=u'shape', value=u'box'),
Attr(name=u'color', value=u'"#0000ff"')]),
DefAttrs(object=u'edge', attrs=[
Attr(name=u'style', value=u'dashed')]),
Edge(nodes=[u'n1', u'n2', u'n3'], attrs=[]),
Edge(nodes=[u'n3', u'n1'], attrs=[])]))
def test_empty_graph(self):
self.t(u'''
graph g1 {}
''',
Graph(strict=None, type=u'graph', id=u'g1', stmts=[]))
def test_few_attrs(self):
self.t(u'''
digraph g1 {
n1 [attr1, attr2 = value2];
}
''',
Graph(strict=None, type=u'digraph', id=u'g1', stmts=[
Node(id=u'n1', attrs=[
Attr(name=u'attr1', value=None),
Attr(name=u'attr2', value=u'value2')])]))
def test_few_nodes(self):
self.t(u'''
graph g1 {
n1;
n2;
n3
}
''',
Graph(strict=None, type=u'graph', id=u'g1', stmts=[
Node(id=u'n1', attrs=[]),
Node(id=u'n2', attrs=[]),
Node(id=u'n3', attrs=[])]))
def test_illegal_comma(self):
try:
self.t(u'''
graph g1 {
n1;
n2;
n3,
}
''')
except NoParseError:
pass
else:
self.fail('must raise NoParseError')
def test_null(self):
try:
self.t(u'')
except NoParseError:
pass
else:
self.fail('must raise NoParseError')
def test_simple_cycle(self):
self.t(u'''
digraph g1 {
n1 -> n2 [w=5];
n2 -> n3 [w=10];
n3 -> n1 [w=7];
}
''',
Graph(strict=None, type=u'digraph', id=u'g1', stmts=[
Edge(nodes=[u'n1', u'n2'], attrs=[
Attr(name=u'w', value=u'5')]),
Edge(nodes=[u'n2', u'n3'], attrs=[
Attr(name=u'w', value=u'10')]),
Edge(nodes=[u'n3', u'n1'], attrs=[
Attr(name=u'w', value=u'7')])]))
def test_single_unicode_char(self):
try:
self.t(u'ф')
except LexerError:
pass
else:
self.fail('must raise LexerError')
def test_unicode_names(self):
self.t(u'''
digraph g1 {
n1 -> "Медведь" [label="Поехали!"];
"Медведь" -> n3 [label="Добро пожаловать!"];
n3 -> n1 ["Водка"="Селёдка"];
}
''',
Graph(strict=None, type=u'digraph', id=u'g1', stmts=[
Edge(nodes=[u'n1', u'"Медведь"'], attrs=[
Attr(name=u'label', value=u'"Поехали!"')]),
Edge(nodes=[u'"Медведь"', u'n3'], attrs=[
Attr(name=u'label', value=u'"Добро пожаловать!"')]),
Edge(nodes=[u'n3', u'n1'], attrs=[
Attr(name=u'"Водка"', value=u'"Селёдка"')])]))
funcparserlib-0.3.6/funcparserlib/tests/test_json.py 0000644 0000765 0000024 00000005254 12135637150 023036 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
import unittest
from funcparserlib.parser import NoParseError
from funcparserlib.lexer import LexerError
import json
class JsonTest(unittest.TestCase):
def t(self, data, expected=None):
self.assertEqual(json.loads(data), expected)
def test_1_array(self):
self.t(u'[1]', [1])
def test_1_object(self):
self.t(u'{"foo": "bar"}', {u'foo': u'bar'})
def test_bool_and_null(self):
self.t(u'[null, true, false]', [None, True, False])
def test_empty_array(self):
self.t(u'[]', [])
def test_empty_object(self):
self.t(u'{}', {})
def test_many_array(self):
self.t(u'[1, 2, [3, 4, 5], 6]', [1, 2, [3, 4, 5], 6])
def test_many_object(self):
self.t(u'''
{
"foo": 1,
"bar":
{
"baz": 2,
"quux": [true, false],
"{}": {}
},
"spam": "eggs"
}
''', {
u'foo': 1,
u'bar': {
u'baz': 2,
u'quux': [True, False],
u'{}': {},
},
u'spam': u'eggs',
})
def test_null(self):
try:
self.t(u'')
except NoParseError:
pass
else:
self.fail('must raise NoParseError')
def test_numbers(self):
self.t(u'''\
[
0, 1, -1, 14, -14, 65536,
0.0, 3.14, -3.14, -123.456,
6.67428e-11, -1.602176e-19, 6.67428E-11
]
''', [
0, 1, -1, 14, -14, 65536,
0.0, 3.14, -3.14, -123.456,
6.67428e-11, -1.602176e-19, 6.67428E-11,
])
def test_strings(self):
self.t(ur'''
[
["", "hello", "hello world!"],
["привет, мир!", "λx.x"],
["\"", "\\", "\/", "\b", "\f", "\n", "\r", "\t"],
["\u0000", "\u03bb", "\uffff", "\uFFFF"],
["вот функция идентичности:\nλx.x\nили так:\n\u03bbx.x"]
]
''', [
[u'', u'hello', u'hello world!'],
[u'привет, мир!', u'λx.x'],
[u'"', u'\\', u'/', u'\x08', u'\x0c', u'\n', u'\r', u'\t'],
[u'\u0000', u'\u03bb', u'\uffff', u'\uffff'],
[u'вот функция идентичности:\nλx.x\nили так:\n\u03bbx.x'],
])
def test_toplevel_string(self):
try:
self.t(u'неправильно')
except LexerError:
pass
else:
self.fail('must raise LexerError')
funcparserlib-0.3.6/funcparserlib/tests/test_parsing.py 0000644 0000765 0000024 00000003652 12135641116 023525 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
import unittest
from funcparserlib.lexer import make_tokenizer, LexerError, Token
from funcparserlib.parser import a, many, some, skip, NoParseError
class ParsingTest(unittest.TestCase):
# Issue 31
def test_many_backtracking(self):
x = a(u'x')
y = a(u'y')
expr = many(x + y) + x + x
self.assertEqual(expr.parse(u'xyxyxx'),
([(u'x', u'y'), (u'x', u'y')], u'x', u'x'))
# Issue 14
def test_error_info(self):
tokenize = make_tokenizer([
(u'keyword', (ur'(is|end)',)),
(u'id', (ur'[a-z]+',)),
(u'space', (ur'[ \t]+',)),
(u'nl', (ur'[\n\r]+',)),
])
try:
list(tokenize(u'f is ф'))
except LexerError, e:
self.assertEqual(unicode(e),
u'cannot tokenize data: 1,6: "f is \u0444"')
else:
self.fail(u'must raise LexerError')
sometok = lambda type: some(lambda t: t.type == type)
keyword = lambda s: a(Token(u'keyword', s))
id = sometok(u'id')
is_ = keyword(u'is')
end = keyword(u'end')
nl = sometok(u'nl')
equality = id + skip(is_) + id >> tuple
expr = equality + skip(nl)
file = many(expr) + end
msg = """\
spam is eggs
eggs isnt spam
end"""
toks = [x for x in tokenize(msg) if x.type != u'space']
try:
file.parse(toks)
except NoParseError, e:
self.assertEqual(e.msg,
u"got unexpected token: 2,11-2,14: id 'spam'")
self.assertEqual(e.state.pos, 4)
self.assertEqual(e.state.max, 7)
# May raise KeyError
t = toks[e.state.max]
self.assertEqual(t, Token(u'id', u'spam'))
self.assertEqual((t.start, t.end), ((2, 11), (2, 14)))
else:
self.fail(u'must raise NoParseError')
funcparserlib-0.3.6/funcparserlib/util.py 0000644 0000765 0000024 00000003634 12135637150 020641 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
# Copyright (c) 2008/2013 Andrey Vlasovskikh
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
def pretty_tree(x, kids, show):
"""(a, (a -> list(a)), (a -> str)) -> str
Returns a pseudographic tree representation of x similar to the tree command
in Unix.
"""
(MID, END, CONT, LAST, ROOT) = (u'|-- ', u'`-- ', u'| ', u' ', u'')
def rec(x, indent, sym):
line = indent + sym + show(x)
xs = kids(x)
if len(xs) == 0:
return line
else:
if sym == MID:
next_indent = indent + CONT
elif sym == ROOT:
next_indent = indent + ROOT
else:
next_indent = indent + LAST
syms = [MID] * (len(xs) - 1) + [END]
lines = [rec(x, next_indent, sym) for x, sym in zip(xs, syms)]
return u'\n'.join([line] + lines)
return rec(x, u'', ROOT)
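# A minimal usage sketch of `pretty_tree` (illustrative only): a tree given as
# nested (name, children) pairs, rendered in the style of the Unix tree command.
#
#   t = (u'a', [(u'b', []), (u'c', [(u'd', [])])])
#   print pretty_tree(t, lambda x: x[1], lambda x: x[0])
#   # a
#   # |-- b
#   # `-- c
#   #     `-- d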
funcparserlib-0.3.6/funcparserlib.egg-info/ 0000755 0000765 0000024 00000000000 12140502162 020764 5 ustar vlan staff 0000000 0000000 funcparserlib-0.3.6/funcparserlib.egg-info/dependency_links.txt 0000644 0000765 0000024 00000000001 12140502162 025032 0 ustar vlan staff 0000000 0000000
funcparserlib-0.3.6/funcparserlib.egg-info/PKG-INFO 0000644 0000765 0000024 00000000461 12140502162 022062 0 ustar vlan staff 0000000 0000000 Metadata-Version: 1.0
Name: funcparserlib
Version: 0.3.6
Summary: Recursive descent parsing library based on functional combinators
Home-page: http://code.google.com/p/funcparserlib/
Author: Andrey Vlasovskikh
Author-email: andrey.vlasovskikh@gmail.com
License: MIT
Description: UNKNOWN
Platform: UNKNOWN
funcparserlib-0.3.6/funcparserlib.egg-info/SOURCES.txt 0000644 0000765 0000024 00000001101 12140502162 022641 0 ustar vlan staff 0000000 0000000 CHANGES
LICENSE
MANIFEST.in
README
setup.py
doc/Brackets.md
doc/Changes.md
doc/FAQ.md
doc/Illustrated.md
doc/Makefile
doc/Tutorial.md
doc/index.md
funcparserlib/__init__.py
funcparserlib/lexer.py
funcparserlib/parser.py
funcparserlib/util.py
funcparserlib.egg-info/PKG-INFO
funcparserlib.egg-info/SOURCES.txt
funcparserlib.egg-info/dependency_links.txt
funcparserlib.egg-info/top_level.txt
funcparserlib/tests/__init__.py
funcparserlib/tests/dot.py
funcparserlib/tests/json.py
funcparserlib/tests/test_dot.py
funcparserlib/tests/test_json.py
funcparserlib/tests/test_parsing.py funcparserlib-0.3.6/funcparserlib.egg-info/top_level.txt 0000644 0000765 0000024 00000000016 12140502162 023513 0 ustar vlan staff 0000000 0000000 funcparserlib
funcparserlib-0.3.6/LICENSE 0000644 0000765 0000024 00000002054 12135635351 015454 0 ustar vlan staff 0000000 0000000 Copyright © 2009/2013 Andrey Vlasovskikh
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
funcparserlib-0.3.6/MANIFEST.in 0000644 0000765 0000024 00000000111 12135631377 016201 0 ustar vlan staff 0000000 0000000 recursive-include doc *
include LICENSE CHANGES MANIFEST.in requires.txt
funcparserlib-0.3.6/PKG-INFO 0000644 0000765 0000024 00000000461 12140502162 015531 0 ustar vlan staff 0000000 0000000 Metadata-Version: 1.0
Name: funcparserlib
Version: 0.3.6
Summary: Recursive descent parsing library based on functional combinators
Home-page: http://code.google.com/p/funcparserlib/
Author: Andrey Vlasovskikh
Author-email: andrey.vlasovskikh@gmail.com
License: MIT
Description: UNKNOWN
Platform: UNKNOWN
funcparserlib-0.3.6/README 0000644 0000765 0000024 00000000000 12140501202 022771 1funcparserlib-0.3.6/doc/index.md ustar vlan staff 0000000 0000000 funcparserlib-0.3.6/setup.cfg 0000644 0000765 0000024 00000000073 12140502162 016254 0 ustar vlan staff 0000000 0000000 [egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0
funcparserlib-0.3.6/setup.py 0000644 0000765 0000024 00000000773 12140500634 016156 0 ustar vlan staff 0000000 0000000 # -*- coding: utf-8 -*-
from setuptools import setup
import sys
extra = {}
if sys.version_info >= (3,):
extra['use_2to3'] = True
setup(
name='funcparserlib',
version='0.3.6',
packages=['funcparserlib', 'funcparserlib.tests'],
author='Andrey Vlasovskikh',
author_email='andrey.vlasovskikh@gmail.com',
description='Recursive descent parsing library based on functional '
'combinators',
license='MIT',
url='http://code.google.com/p/funcparserlib/',
**extra)