parsimonious-0.10.0/.github/workflows/main.yml
---
name: CI
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.7', '3.8', '3.9', '3.10']
    name: Python ${{ matrix.python-version}}
    steps:
      - uses: actions/checkout@v2.3.5
      - name: Set up Python
        uses: actions/setup-python@v2.2.2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Update pip and install dev requirements
        run: |
          python -m pip install --upgrade pip
          pip install tox tox-gh-actions
      - name: Test
        run: tox

parsimonious-0.10.0/.gitignore
.tox
*.egg-info
*.egg
*.pyc
build
dist

parsimonious-0.10.0/LICENSE
Copyright (c) 2012 Erik Rose

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

parsimonious-0.10.0/MANIFEST.in
include README.rst
include LICENSE

parsimonious-0.10.0/README.rst
============
Parsimonious
============

Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure
Python—and the most usable. It's based on parsing expression grammars (PEGs),
which means you feed it a simplified sort of EBNF notation.

Parsimonious was designed to undergird a MediaWiki parser that wouldn't take 5
seconds or a GB of RAM to do one page, but it's applicable to all sorts of
languages.

:Code:     https://github.com/erikrose/parsimonious/
:Issues:   https://github.com/erikrose/parsimonious/issues
:License:  MIT License (MIT)
:Package:  https://pypi.org/project/parsimonious/

Goals
=====

* Speed
* Frugal RAM use
* Minimalistic, understandable, idiomatic Python code
* Readable grammars
* Extensible grammars
* Complete test coverage
* Separation of concerns.
Some Python parsing kits mix recognition with instructions about how to turn the resulting tree into some kind of other representation. This is limiting when you want to do several different things with a tree: for example, render wiki markup to HTML *or* to text. * Good error reporting. I want the parser to work *with* me as I develop a grammar. Install ======= To install Parsimonious, run:: $ pip install parsimonious Example Usage ============= Here's how to build a simple grammar: .. code:: python >>> from parsimonious.grammar import Grammar >>> grammar = Grammar( ... """ ... bold_text = bold_open text bold_close ... text = ~"[A-Z 0-9]*"i ... bold_open = "((" ... bold_close = "))" ... """) You can have forward references and even right recursion; it's all taken care of by the grammar compiler. The first rule is taken to be the default start symbol, but you can override that. Next, let's parse something and get an abstract syntax tree: .. code:: python >>> print(grammar.parse('((bold stuff))')) You'd typically then use a ``nodes.NodeVisitor`` subclass (see below) to walk the tree and do something useful with it. Another example would be to implement a parser for ``.ini``-files. Consider the following: .. code:: python grammar = Grammar( r""" expr = (entry / emptyline)* entry = section pair* section = lpar word rpar ws pair = key equal value ws? key = word+ value = (word / quoted)+ word = ~r"[-\w]+" quoted = ~'"[^\"]+"' equal = ws? "=" ws? lpar = "[" rpar = "]" ws = ~"\s*" emptyline = ws+ """ ) We could now implement a subclass of ``NodeVisitor`` like so: .. code:: python class IniVisitor(NodeVisitor): def visit_expr(self, node, visited_children): """ Returns the overall output. """ output = {} for child in visited_children: output.update(child[0]) return output def visit_entry(self, node, visited_children): """ Makes a dict of the section (as key) and the key/value pairs. """ key, values = visited_children return {key: dict(values)} def visit_section(self, node, visited_children): """ Gets the section name. """ _, section, *_ = visited_children return section.text def visit_pair(self, node, visited_children): """ Gets each key/value pair, returns a tuple. """ key, _, value, *_ = node.children return key.text, value.text def generic_visit(self, node, visited_children): """ The generic visit method. """ return visited_children or node And call it like that: .. code:: python from parsimonious.grammar import Grammar from parsimonious.nodes import NodeVisitor data = """[section] somekey = somevalue someotherkey=someothervalue [anothersection] key123 = "what the heck?" key456="yet another one here" """ tree = grammar.parse(data) iv = IniVisitor() output = iv.visit(tree) print(output) This would yield .. code:: python {'section': {'somekey': 'somevalue', 'someotherkey': 'someothervalue'}, 'anothersection': {'key123': '"what the heck?"', 'key456': '"yet another one here"'}} Status ====== * Everything that exists works. Test coverage is good. * I don't plan on making any backward-incompatible changes to the rule syntax in the future, so you can write grammars with confidence. * It may be slow and use a lot of RAM; I haven't measured either yet. However, I have yet to begin optimizing in earnest. * Error reporting is now in place. ``repr`` methods of expressions, grammars, and nodes are clear and helpful as well. The ``Grammar`` ones are even round-trippable! * The grammar extensibility story is underdeveloped at the moment. 
You should be able to extend a grammar by simply concatenating more rules onto the existing ones; later rules of the same name should override previous ones. However, this is untested and may not be the final story. * Sphinx docs are coming, but the docstrings are quite useful now. * Note that there may be API changes until we get to 1.0, so be sure to pin to the version you're using. Coming Soon ----------- * Optimizations to make Parsimonious worthy of its name * Tighter RAM use * Better-thought-out grammar extensibility story * Amazing grammar debugging A Little About PEG Parsers ========================== PEG parsers don't draw a distinction between lexing and parsing; everything is done at once. As a result, there is no lookahead limit, as there is with, for instance, Yacc. And, due to both of these properties, PEG grammars are easier to write: they're basically just a more practical dialect of EBNF. With caching, they take O(grammar size * text length) memory (though I plan to do better), but they run in O(text length) time. More Technically ---------------- PEGs can describe a superset of *LL(k)* languages, any deterministic *LR(k)* language, and many others—including some that aren't context-free (http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what would be ambiguous languages if described in canonical EBNF. They do this by trading the ``|`` alternation operator for the ``/`` operator, which works the same except that it makes priority explicit: ``a / b / c`` first tries matching ``a``. If that fails, it tries ``b``, and, failing that, moves on to ``c``. Thus, ambiguity is resolved by always yielding the first successful recognition. Writing Grammars ================ Grammars are defined by a series of rules. The syntax should be familiar to anyone who uses regexes or reads programming language manuals. An example will serve best: .. code:: python my_grammar = Grammar(r""" styled_text = bold_text / italic_text bold_text = "((" text "))" italic_text = "''" text "''" text = ~"[A-Z 0-9]*"i """) You can wrap a rule across multiple lines if you like; the syntax is very forgiving. Syntax Reference ---------------- ==================== ======================================================== ``"some literal"`` Used to quote literals. Backslash escaping and Python conventions for "raw" and Unicode strings help support fiddly characters. ``b"some literal"`` A bytes literal. Using bytes literals and regular expressions allows your grammar to parse binary files. Note that all literals and regular expressions must be of the same type within a grammar. In grammars that process bytestrings, you should make the grammar string an ``r"""string"""`` so that byte literals like ``\xff`` work correctly. [space] Sequences are made out of space- or tab-delimited things. ``a b c`` matches spots where those 3 terms appear in that order. ``a / b / c`` Alternatives. The first to succeed of ``a / b / c`` wins. ``thing?`` An optional expression. This is greedy, always consuming ``thing`` if it exists. ``&thing`` A lookahead assertion. Ensures ``thing`` matches at the current position but does not consume it. ``!thing`` A negative lookahead assertion. Matches if ``thing`` isn't found here. Doesn't consume any text. ``things*`` Zero or more things. This is greedy, always consuming as many repetitions as it can. ``things+`` One or more things. This is greedy, always consuming as many repetitions as it can. ``~r"regex"ilmsuxa`` Regexes have ``~`` in front and are quoted like literals. 
Any flags_ (``asilmx``) follow the end quotes as single chars. Regexes are good for representing character classes (``[a-z0-9]``) and optimizing for speed. The downside is that they won't be able to take advantage of our fancy debugging, once we get that working. Ultimately, I'd like to deprecate explicit regexes and instead have Parsimonious dynamically build them out of simpler primitives. Parsimonious uses the regex_ library instead of the built-in re module. ``~br"regex"`` A bytes regex; required if your grammar parses bytestrings. ``(things)`` Parentheses are used for grouping, like in every other language. ``thing{n}`` Exactly ``n`` repetitions of ``thing``. ``thing{n,m}`` Between ``n`` and ``m`` repititions (inclusive.) ``thing{,m}`` At most ``m`` repetitions of ``thing``. ``thing{n,}`` At least ``n`` repetitions of ``thing``. ==================== ======================================================== .. _flags: https://docs.python.org/3/howto/regex.html#compilation .. _regex: https://github.com/mrabarnett/mrab-regex Optimizing Grammars =================== Don't Repeat Expressions ------------------------ If you need a ``~"[a-z0-9]"i`` at two points in your grammar, don't type it twice. Make it a rule of its own, and reference it from wherever you need it. You'll get the most out of the caching this way, since cache lookups are by expression object identity (for speed). Even if you have an expression that's very simple, not repeating it will save RAM, as there can, at worst, be a cached int for every char in the text you're parsing. In the future, we may identify repeated subexpressions automatically and factor them up while building the grammar. How much should you shove into one regex, versus how much should you break them up to not repeat yourself? That's a fine balance and worthy of benchmarking. More stuff jammed into a regex will execute faster, because it doesn't have to run any Python between pieces, but a broken-up one will give better cache performance if the individual pieces are re-used elsewhere. If the pieces of a regex aren't used anywhere else, by all means keep the whole thing together. Quantifiers ----------- Bring your ``?`` and ``*`` quantifiers up to the highest level you can. Otherwise, lower-level patterns could succeed but be empty and put a bunch of useless nodes in your tree that didn't really match anything. Processing Parse Trees ====================== A parse tree has a node for each expression matched, even if it matched a zero-length string, like ``"thing"?`` might. The ``NodeVisitor`` class provides an inversion-of-control framework for walking a tree and returning a new construct (tree, string, or whatever) based on it. For now, have a look at its docstrings for more detail. There's also a good example in ``grammar.RuleVisitor``. Notice how we take advantage of nodes' iterability by using tuple unpacks in the formal parameter lists: .. code:: python def visit_or_term(self, or_term, (slash, _, term)): ... For reference, here is the production the above unpacks:: or_term = "/" _ term When something goes wrong in your visitor, you get a nice error like this:: [normal traceback here...] VisitationException: 'Node' object has no attribute 'foo' Parse tree: <-- *** We were here. *** The parse tree is tacked onto the exception, and the node whose visitor method raised the error is pointed out. Why No Streaming Tree Processing? --------------------------------- Some have asked why we don't process the tree as we go, SAX-style. 
There are two main reasons: 1. It wouldn't work. With a PEG parser, no parsing decision is final until the whole text is parsed. If we had to change a decision, we'd have to backtrack and redo the SAX-style interpretation as well, which would involve reconstituting part of the AST and quite possibly scuttling whatever you were doing with the streaming output. (Note that some bursty SAX-style processing may be possible in the future if we use cuts.) 2. It interferes with the ability to derive multiple representations from the AST: for example, turning wiki markup into first HTML and then text. Future Directions ================= Rule Syntax Changes ------------------- * Maybe support left-recursive rules like PyMeta, if anybody cares. * Ultimately, I'd like to get rid of explicit regexes and break them into more atomic things like character classes. Then we can dynamically compile bits of the grammar into regexes as necessary to boost speed. Optimizations ------------- * Make RAM use almost constant by automatically inserting "cuts", as described in http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf. This would also improve error reporting, as we wouldn't backtrack out of everything informative before finally failing. * Find all the distinct subexpressions, and unify duplicates for a better cache hit ratio. * Think about having the user (optionally) provide some representative input along with a grammar. We can then profile against it, see which expressions are worth caching, and annotate the grammar. Perhaps there will even be positions at which a given expression is more worth caching. Or we could keep a count of how many times each cache entry has been used and evict the most useless ones as RAM use grows. * We could possibly compile the grammar into VM instructions, like in "A parsing machine for PEGs" by Medeiros. * If the recursion gets too deep in practice, use trampolining to dodge it. Niceties -------- * Pijnu has a raft of tree manipulators. I don't think I want all of them, but a judicious subset might be nice. Don't get into mixing formatting with tree manipulation. https://github.com/erikrose/pijnu/blob/master/library/node.py#L333. PyPy's parsing lib exposes a sane subset: http://doc.pypy.org/en/latest/rlib.html#tree-transformations. Version History =============== (Next release) * ... 0.10.0 * Fix infinite recursion in __eq__ in some cases. (FelisNivalis) * Improve error message in left-recursive rules. (lucaswiman) * Add support for range ``{min,max}`` repetition expressions (righthandabacus) * Fix bug in ``*`` and ``+`` for token grammars (lucaswiman) * Add support for grammars on bytestrings (lucaswiman) * Fix LazyReference resolution bug #134 (righthandabacus) * ~15% speedup on benchmarks with a faster node cache (ethframe) .. warning:: This release makes backward-incompatible changes: * Fix precedence of string literal modifiers ``u/r/b``. This will break grammars with no spaces between a reference and a string literal. 
(lucaswiman) 0.9.0 * Add support for Python 3.7, 3.8, 3.9, 3.10 (righthandabacus, Lonnen) * Drop support for Python 2.x, 3.3, 3.4 (righthandabacus, Lonnen) * Remove six and go all in on Python 3 idioms (Lonnen) * Replace re with regex for improved handling of unicode characters in regexes (Oderjunkie) * Dropped nose for unittest (swayson) * `Grammar.__repr__()` now correctly escapes backslashes (ingolemo) * Custom rules can now be class methods in addition to functions (James Addison) * Make the ascii flag available in the regex syntax (Roman Inflianskas) 0.8.1 * Switch to a function-style ``print`` in the benchmark tests so we work cleanly as a dependency on Python 3. (Edward Betts) 0.8.0 * Make Grammar iteration ordered, making the ``__repr__`` more like the original input. (Lucas Wiman) * Improve text representation and error messages for anonymous subexpressions. (Lucas Wiman) * Expose BadGrammar and VisitationError as top-level imports. * No longer crash when you try to compare a Node to an instance of a different class. (Esben Sonne) * Pin ``six`` at 1.9.0 to ensure we have ``python_2_unicode_compatible``. (Sam Raker) * Drop Python 2.6 support. 0.7.0 * Add experimental token-based parsing, via TokenGrammar class, for those operating on pre-lexed streams of tokens. This can, for example, help parse indentation-sensitive languages that use the "off-side rule", like Python. (Erik Rose) * Common codebase for Python 2 and 3: no more 2to3 translation step (Mattias Urlichs, Lucas Wiman) * Drop Python 3.1 and 3.2 support. * Fix a bug in ``Grammar.__repr__`` which fails to work on Python 3 since the string_escape codec is gone in Python 3. (Lucas Wiman) * Don't lose parentheses when printing representations of expressions. (Michael Kelly) * Make Grammar an immutable mapping (until we add automatic recompilation). (Michael Kelly) 0.6.2 * Make grammar compilation 100x faster. Thanks to dmoisset for the initial patch. 0.6.1 * Fix bug which made the default rule of a grammar invalid when it contained a forward reference. 0.6 .. warning:: This release makes backward-incompatible changes: * The ``default_rule`` arg to Grammar's constructor has been replaced with a method, ``some_grammar.default('rule_name')``, which returns a new grammar just like the old except with its default rule changed. This is to free up the constructor kwargs for custom rules. * ``UndefinedLabel`` is no longer a subclass of ``VisitationError``. This matters only in the unlikely case that you were catching ``VisitationError`` exceptions and expecting to thus also catch ``UndefinedLabel``. * Add support for "custom rules" in Grammars. These provide a hook for simple custom parsing hooks spelled as Python lambdas. For heavy-duty needs, you can put in Compound Expressions with LazyReferences as subexpressions, and the Grammar will hook them up for optimal efficiency--no calling ``__getitem__`` on Grammar at parse time. * Allow grammars without a default rule (in cases where there are no string rules), which leads to also allowing empty grammars. Perhaps someone building up grammars dynamically will find that useful. * Add ``@rule`` decorator, allowing grammars to be constructed out of notations on ``NodeVisitor`` methods. This saves looking back and forth between the visitor and the grammar when there is only one visitor per grammar. * Add ``parse()`` and ``match()`` convenience methods to ``NodeVisitor``. This makes the common case of parsing a string and applying exactly one visitor to the AST shorter and simpler. 
* Improve exception message when you forget to declare a visitor method. * Add ``unwrapped_exceptions`` attribute to ``NodeVisitor``, letting you name certain exceptions which propagate out of visitors without being wrapped by ``VisitationError`` exceptions. * Expose much more of the library in ``__init__``, making your imports shorter. * Drastically simplify reference resolution machinery. (Vladimir Keleshev) 0.5 .. warning:: This release makes some backward-incompatible changes. See below. * Add alpha-quality error reporting. Now, rather than returning ``None``, ``parse()`` and ``match()`` raise ``ParseError`` if they don't succeed. This makes more sense, since you'd rarely attempt to parse something and not care if it succeeds. It was too easy before to forget to check for a ``None`` result. ``ParseError`` gives you a human-readable unicode representation as well as some attributes that let you construct your own custom presentation. * Grammar construction now raises ``ParseError`` rather than ``BadGrammar`` if it can't parse your rules. * ``parse()`` now takes an optional ``pos`` argument, like ``match()``. * Make the ``_str__()`` method of ``UndefinedLabel`` return the right type. * Support splitting rules across multiple lines, interleaving comments, putting multiple rules on one line (but don't do that) and all sorts of other horrific behavior. * Tolerate whitespace after opening parens. * Add support for single-quoted literals. 0.4 * Support Python 3. * Fix ``import *`` for ``parsimonious.expressions``. * Rewrite grammar compiler so right-recursive rules can be compiled and parsing no longer fails in some cases with forward rule references. 0.3 * Support comments, the ``!`` ("not") operator, and parentheses in grammar definition syntax. * Change the ``&`` operator to a prefix operator to conform to the original PEG syntax. The version in Parsing Techniques was infix, and that's what I used as a reference. However, the unary version is more convenient, as it lets you spell ``AB & A`` as simply ``A &B``. * Take the ``print`` statements out of the benchmark tests. * Give Node an evaluate-able ``__repr__``. 0.2 * Support matching of prefixes and other not-to-the-end slices of strings by making ``match()`` public and able to initialize a new cache. Add ``match()`` callthrough method to ``Grammar``. * Report a ``BadGrammar`` exception (rather than crashing) when there are mistakes in a grammar definition. * Simplify grammar compilation internals: get rid of superfluous visitor methods and factor up repetitive ones. Simplify rule grammar as well. * Add ``NodeVisitor.lift_child`` convenience method. * Rename ``VisitationException`` to ``VisitationError`` for consistency with the standard Python exception hierarchy. * Rework ``repr`` and ``str`` values for grammars and expressions. Now they both look like rule syntax. Grammars are even round-trippable! This fixes a unicode encoding error when printing nodes that had parsed unicode text. * Add tox for testing. Stop advertising Python 2.5 support, which never worked (and won't unless somebody cares a lot, since it makes Python 3 support harder). * Settle (hopefully) on the term "rule" to mean "the string representation of a production". Get rid of the vague, mysterious "DSL". 0.1 * A rough but useable preview release Thanks to Wiki Loves Monuments Panama for showing their support with a generous gift. 
parsimonious-0.10.0/parsimonious/000077500000000000000000000000001430470422700170535ustar00rootroot00000000000000parsimonious-0.10.0/parsimonious/__init__.py000066400000000000000000000006451430470422700211710ustar00rootroot00000000000000"""Parsimonious's public API. Import from here. Things may move around in modules deeper than this one. """ from parsimonious.exceptions import (ParseError, IncompleteParseError, VisitationError, UndefinedLabel, BadGrammar) from parsimonious.grammar import Grammar, TokenGrammar from parsimonious.nodes import NodeVisitor, VisitationError, rule parsimonious-0.10.0/parsimonious/exceptions.py000066400000000000000000000104101430470422700216020ustar00rootroot00000000000000from textwrap import dedent from parsimonious.utils import StrAndRepr class ParseError(StrAndRepr, Exception): """A call to ``Expression.parse()`` or ``match()`` didn't match.""" def __init__(self, text, pos=-1, expr=None): # It would be nice to use self.args, but I don't want to pay a penalty # to call descriptors or have the confusion of numerical indices in # Expression.match_core(). self.text = text self.pos = pos self.expr = expr def __str__(self): rule_name = (("'%s'" % self.expr.name) if self.expr.name else str(self.expr)) return "Rule %s didn't match at '%s' (line %s, column %s)." % ( rule_name, self.text[self.pos:self.pos + 20], self.line(), self.column()) # TODO: Add line, col, and separated-out error message so callers can build # their own presentation. def line(self): """Return the 1-based line number where the expression ceased to match.""" # This is a method rather than a property in case we ever wanted to # pass in which line endings we want to use. if isinstance(self.text, list): # TokenGrammar return None else: return self.text.count('\n', 0, self.pos) + 1 def column(self): """Return the 1-based column where the expression ceased to match.""" # We choose 1-based because that's what Python does with SyntaxErrors. try: return self.pos - self.text.rindex('\n', 0, self.pos) except (ValueError, AttributeError): return self.pos + 1 class LeftRecursionError(ParseError): def __str__(self): rule_name = self.expr.name if self.expr.name else str(self.expr) window = self.text[self.pos:self.pos + 20] return dedent(f""" Left recursion in rule {rule_name!r} at {window!r} (line {self.line()}, column {self.column()}). Parsimonious is a packrat parser, so it can't handle left recursion. See https://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion for how to rewrite your grammar into a rule that does not use left-recursion. """ ).strip() class IncompleteParseError(ParseError): """A call to ``parse()`` matched a whole Expression but did not consume the entire text.""" def __str__(self): return "Rule '%s' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with '%s' (line %s, column %s)." % ( self.expr.name, self.text[self.pos:self.pos + 20], self.line(), self.column()) class VisitationError(Exception): """Something went wrong while traversing a parse tree. This exception exists to augment an underlying exception with information about where in the parse tree the error occurred. Otherwise, it could be tiresome to figure out what went wrong; you'd have to play back the whole tree traversal in your head. """ # TODO: Make sure this is pickleable. Probably use @property pattern. Make # the original exc and node available on it if they don't cause a whole # raft of stack frames to be retained. 
def __init__(self, exc, exc_class, node): """Construct. :arg exc: What went wrong. We wrap this and add more info. :arg node: The node at which the error occurred """ self.original_class = exc_class super().__init__( '%s: %s\n\n' 'Parse tree:\n' '%s' % (exc_class.__name__, exc, node.prettily(error=node))) class BadGrammar(StrAndRepr, Exception): """Something was wrong with the definition of a grammar. Note that a ParseError might be raised instead if the error is in the grammar definition syntax. """ class UndefinedLabel(BadGrammar): """A rule referenced in a grammar was never defined. Circular references and forward references are okay, but you have to define stuff at some point. """ def __init__(self, label): self.label = label def __str__(self): return 'The label "%s" was never defined.' % self.label parsimonious-0.10.0/parsimonious/expressions.py000066400000000000000000000406261430470422700220170ustar00rootroot00000000000000"""Subexpressions that make up a parsed grammar These do the parsing. """ # TODO: Make sure all symbol refs are local--not class lookups or # anything--for speed. And kill all the dots. from collections import defaultdict from inspect import getfullargspec, isfunction, ismethod, ismethoddescriptor import regex as re from parsimonious.exceptions import ParseError, IncompleteParseError, LeftRecursionError from parsimonious.nodes import Node, RegexNode from parsimonious.utils import StrAndRepr def is_callable(value): criteria = [isfunction, ismethod, ismethoddescriptor] return any([criterion(value) for criterion in criteria]) def expression(callable, rule_name, grammar): """Turn a plain callable into an Expression. The callable can be of this simple form:: def foo(text, pos): '''If this custom expression matches starting at text[pos], return the index where it stops matching. Otherwise, return None.''' if the expression matched: return end_pos If there child nodes to return, return a tuple:: return end_pos, children If the expression doesn't match at the given ``pos`` at all... :: return None If your callable needs to make sub-calls to other rules in the grammar or do error reporting, it can take this form, gaining additional arguments:: def foo(text, pos, cache, error, grammar): # Call out to other rules: node = grammar['another_rule'].match_core(text, pos, cache, error) ... # Return values as above. The return value of the callable, if an int or a tuple, will be automatically transmuted into a :class:`~parsimonious.Node`. If it returns a Node-like class directly, it will be passed through unchanged. :arg rule_name: The rule name to attach to the resulting :class:`~parsimonious.Expression` :arg grammar: The :class:`~parsimonious.Grammar` this expression will be a part of, to make delegating to other rules possible """ # Resolve unbound methods; allows grammars to use @staticmethod custom rules # https://stackoverflow.com/questions/41921255/staticmethod-object-is-not-callable if ismethoddescriptor(callable) and hasattr(callable, '__func__'): callable = callable.__func__ num_args = len(getfullargspec(callable).args) if ismethod(callable): # do not count the first argument (typically 'self') for methods num_args -= 1 if num_args == 2: is_simple = True elif num_args == 5: is_simple = False else: raise RuntimeError("Custom rule functions must take either 2 or 5 " "arguments, not %s." 
% num_args) class AdHocExpression(Expression): def _uncached_match(self, text, pos, cache, error): result = (callable(text, pos) if is_simple else callable(text, pos, cache, error, grammar)) if isinstance(result, int): end, children = result, None elif isinstance(result, tuple): end, children = result else: # Node or None return result return Node(self, text, pos, end, children=children) def _as_rhs(self): return '{custom function "%s"}' % callable.__name__ return AdHocExpression(name=rule_name) IN_PROGRESS = object() class Expression(StrAndRepr): """A thing that can be matched against a piece of text""" # Slots are about twice as fast as __dict__-based attributes: # http://stackoverflow.com/questions/1336791/dictionary-vs-object-which-is-more-efficient-and-why # Top-level expressions--rules--have names. Subexpressions are named ''. __slots__ = ['name', 'identity_tuple'] def __init__(self, name=''): self.name = name self.identity_tuple = (self.name, ) def __hash__(self): return hash(self.identity_tuple) def __eq__(self, other): return self._eq_check_cycles(other, set()) def __ne__(self, other): return not (self == other) def _eq_check_cycles(self, other, checked): # keep a set of all pairs that are already checked, so we won't fall into infinite recursions. checked.add((id(self), id(other))) return other.__class__ is self.__class__ and self.identity_tuple == other.identity_tuple def resolve_refs(self, rule_map): # Nothing to do on the base expression. return self def parse(self, text, pos=0): """Return a parse tree of ``text``. Raise ``ParseError`` if the expression wasn't satisfied. Raise ``IncompleteParseError`` if the expression was satisfied but didn't consume the full string. """ node = self.match(text, pos=pos) if node.end < len(text): raise IncompleteParseError(text, node.end, self) return node def match(self, text, pos=0): """Return the parse tree matching this expression at the given position, not necessarily extending all the way to the end of ``text``. Raise ``ParseError`` if there is no match there. :arg pos: The index at which to start matching """ error = ParseError(text) node = self.match_core(text, pos, defaultdict(dict), error) if node is None: raise error return node def match_core(self, text, pos, cache, error): """Internal guts of ``match()`` This is appropriate to call only from custom rules or Expression subclasses. :arg cache: The packrat cache:: {(oid, pos): Node tree matched by object `oid` at index `pos` ...} :arg error: A ParseError instance with ``text`` already filled in but otherwise blank. We update the error reporting info on this object as we go. (Sticking references on an existing instance is faster than allocating a new one for each expression that fails.) We return None rather than raising and catching ParseErrors because catching is slow. """ # TODO: Optimize. Probably a hot spot. # # Is there a faster way of looking up cached stuff? # # If this is slow, think about the array module. It might (or might # not!) use more RAM, but it'll likely be faster than hashing things # all the time. Also, can we move all the allocs up front? # # To save space, we have lots of choices: (0) Quit caching whole Node # objects. Cache just what you need to reconstitute them. (1) Cache # only the results of entire rules, not subexpressions (probably a # horrible idea for rules that need to backtrack internally a lot). (2) # Age stuff out of the cache somehow. LRU? (3) Cuts. 
expr_cache = cache[id(self)] if pos in expr_cache: node = expr_cache[pos] else: # TODO: Set default value to prevent infinite recursion in left-recursive rules. expr_cache[pos] = IN_PROGRESS # Mark as in progress node = expr_cache[pos] = self._uncached_match(text, pos, cache, error) if node is IN_PROGRESS: raise LeftRecursionError(text, pos=-1, expr=self) # Record progress for error reporting: if node is None and pos >= error.pos and ( self.name or getattr(error.expr, 'name', None) is None): # Don't bother reporting on unnamed expressions (unless that's all # we've seen so far), as they're hard to track down for a human. # Perhaps we could include the unnamed subexpressions later as # auxiliary info. error.expr = self error.pos = pos return node def __str__(self): return '<%s %s>' % ( self.__class__.__name__, self.as_rule()) def as_rule(self): """Return the left- and right-hand sides of a rule that represents me. Return unicode. If I have no ``name``, omit the left-hand side. """ rhs = self._as_rhs().strip() if rhs.startswith('(') and rhs.endswith(')'): rhs = rhs[1:-1] return ('%s = %s' % (self.name, rhs)) if self.name else rhs def _unicode_members(self): """Return an iterable of my unicode-represented children, stopping descent when we hit a named node so the returned value resembles the input rule.""" return [(m.name or m._as_rhs()) for m in self.members] def _as_rhs(self): """Return the right-hand side of a rule that represents me. Implemented by subclasses. """ raise NotImplementedError class Literal(Expression): """A string literal Use these if you can; they're the fastest. """ __slots__ = ['literal'] def __init__(self, literal, name=''): super().__init__(name) self.literal = literal self.identity_tuple = (name, literal) def _uncached_match(self, text, pos, cache, error): if text.startswith(self.literal, pos): return Node(self, text, pos, pos + len(self.literal)) def _as_rhs(self): return repr(self.literal) class TokenMatcher(Literal): """An expression matching a single token of a given type This is for use only with TokenGrammars. """ def _uncached_match(self, token_list, pos, cache, error): if token_list[pos].type == self.literal: return Node(self, token_list, pos, pos + 1) class Regex(Expression): """An expression that matches what a regex does. Use these as much as you can and jam as much into each one as you can; they're fast. """ __slots__ = ['re'] def __init__(self, pattern, name='', ignore_case=False, locale=False, multiline=False, dot_all=False, unicode=False, verbose=False, ascii=False): super().__init__(name) self.re = re.compile(pattern, (ignore_case and re.I) | (locale and re.L) | (multiline and re.M) | (dot_all and re.S) | (unicode and re.U) | (verbose and re.X) | (ascii and re.A)) self.identity_tuple = (self.name, self.re) def _uncached_match(self, text, pos, cache, error): """Return length of match, ``None`` if no match.""" m = self.re.match(text, pos) if m is not None: span = m.span() node = RegexNode(self, text, pos, pos + span[1] - span[0]) node.match = m # TODO: A terrible idea for cache size? 
return node def _regex_flags_from_bits(self, bits): """Return the textual equivalent of numerically encoded regex flags.""" flags = 'ilmsuxa' return ''.join(flags[i - 1] if (1 << i) & bits else '' for i in range(1, len(flags) + 1)) def _as_rhs(self): return '~{!r}{}'.format(self.re.pattern, self._regex_flags_from_bits(self.re.flags)) class Compound(Expression): """An abstract expression which contains other expressions""" __slots__ = ['members'] def __init__(self, *members, **kwargs): """``members`` is a sequence of expressions.""" super().__init__(kwargs.get('name', '')) self.members = members def resolve_refs(self, rule_map): self.members = tuple(m.resolve_refs(rule_map) for m in self.members) return self def _eq_check_cycles(self, other, checked): return ( super()._eq_check_cycles(other, checked) and len(self.members) == len(other.members) and all(m._eq_check_cycles(mo, checked) for m, mo in zip(self.members, other.members) if (id(m), id(mo)) not in checked) ) def __hash__(self): # Note we leave members out of the hash computation, since compounds can get added to # sets, then have their members mutated. See RuleVisitor._resolve_refs. # Equality should still work, but we want the rules to go into the correct hash bucket. return hash((self.__class__, self.name)) class Sequence(Compound): """A series of expressions that must match contiguous, ordered pieces of the text In other words, it's a concatenation operator: each piece has to match, one after another. """ def _uncached_match(self, text, pos, cache, error): new_pos = pos children = [] for m in self.members: node = m.match_core(text, new_pos, cache, error) if node is None: return None children.append(node) length = node.end - node.start new_pos += length # Hooray! We got through all the members! return Node(self, text, pos, new_pos, children) def _as_rhs(self): return '({0})'.format(' '.join(self._unicode_members())) class OneOf(Compound): """A series of expressions, one of which must match Expressions are tested in order from first to last. The first to succeed wins. """ def _uncached_match(self, text, pos, cache, error): for m in self.members: node = m.match_core(text, pos, cache, error) if node is not None: # Wrap the succeeding child in a node representing the OneOf: return Node(self, text, pos, node.end, children=[node]) def _as_rhs(self): return '({0})'.format(' / '.join(self._unicode_members())) class Lookahead(Compound): """An expression which consumes nothing, even if its contained expression succeeds""" __slots__ = ['negativity'] def __init__(self, member, *, negative=False, **kwargs): super().__init__(member, **kwargs) self.negativity = bool(negative) def _uncached_match(self, text, pos, cache, error): node = self.members[0].match_core(text, pos, cache, error) if (node is None) == self.negativity: # negative lookahead == match only if not found return Node(self, text, pos, pos) def _as_rhs(self): return '%s%s' % ('!' if self.negativity else '&', self._unicode_members()[0]) def _eq_check_cycles(self, other, checked): return ( super()._eq_check_cycles(other, checked) and self.negativity == other.negativity ) def Not(term): return Lookahead(term, negative=True) # Quantifiers. None of these is strictly necessary, but they're darn handy. 
class Quantifier(Compound): """An expression wrapper like the */+/?/{n,m} quantifier in regexes.""" __slots__ = ['min', 'max'] def __init__(self, member, *, min=0, max=float('inf'), name='', **kwargs): super().__init__(member, name=name, **kwargs) self.min = min self.max = max def _uncached_match(self, text, pos, cache, error): new_pos = pos children = [] size = len(text) while new_pos < size and len(children) < self.max: node = self.members[0].match_core(text, new_pos, cache, error) if node is None: break # no more matches children.append(node) length = node.end - node.start if len(children) >= self.min and length == 0: # Don't loop infinitely break new_pos += length if len(children) >= self.min: return Node(self, text, pos, new_pos, children) def _as_rhs(self): if self.min == 0 and self.max == 1: qualifier = '?' elif self.min == 0 and self.max == float('inf'): qualifier = '*' elif self.min == 1 and self.max == float('inf'): qualifier = '+' elif self.max == float('inf'): qualifier = '{%d,}' % self.min elif self.min == 0: qualifier = '{,%d}' % self.max else: qualifier = '{%d,%d}' % (self.min, self.max) return '%s%s' % (self._unicode_members()[0], qualifier) def _eq_check_cycles(self, other, checked): return ( super()._eq_check_cycles(other, checked) and self.min == other.min and self.max == other.max ) def ZeroOrMore(member, name=''): return Quantifier(member, name=name, min=0, max=float('inf')) def OneOrMore(member, name='', min=1): return Quantifier(member, name=name, min=min, max=float('inf')) def Optional(member, name=''): return Quantifier(member, name=name, min=0, max=1) parsimonious-0.10.0/parsimonious/grammar.py000066400000000000000000000475661430470422700210750ustar00rootroot00000000000000"""A convenience which constructs expression trees from an easy-to-read syntax Use this unless you have a compelling reason not to; it performs some optimizations that would be tedious to do when constructing an expression tree by hand. """ from collections import OrderedDict from textwrap import dedent from parsimonious.exceptions import BadGrammar, UndefinedLabel from parsimonious.expressions import (Literal, Regex, Sequence, OneOf, Lookahead, Quantifier, Optional, ZeroOrMore, OneOrMore, Not, TokenMatcher, expression, is_callable) from parsimonious.nodes import NodeVisitor from parsimonious.utils import evaluate_string class Grammar(OrderedDict): """A collection of rules that describe a language You can start parsing from the default rule by calling ``parse()`` directly on the ``Grammar`` object:: g = Grammar(''' polite_greeting = greeting ", my good " title greeting = "Hi" / "Hello" title = "madam" / "sir" ''') g.parse('Hello, my good sir') Or start parsing from any of the other rules; you can pull them out of the grammar as if it were a dictionary:: g['title'].parse('sir') You could also just construct a bunch of ``Expression`` objects yourself and stitch them together into a language, but using a ``Grammar`` has some important advantages: * Languages are much easier to define in the nice syntax it provides. * Circular references aren't a pain. * It does all kinds of whizzy space- and time-saving optimizations, like factoring up repeated subexpressions into a single object, which should increase cache hit ratio. [Is this implemented yet?] """ def __init__(self, rules='', **more_rules): """Construct a grammar. :arg rules: A string of production rules, one per line. :arg default_rule: The name of the rule invoked when you call :meth:`parse()` or :meth:`match()` on the grammar. 
Defaults to the first rule. Falls back to None if there are no string-based rules in this grammar. :arg more_rules: Additional kwargs whose names are rule names and values are Expressions or custom-coded callables which accomplish things the built-in rule syntax cannot. These take precedence over ``rules`` in case of naming conflicts. """ decorated_custom_rules = { k: (expression(v, k, self) if is_callable(v) else v) for k, v in more_rules.items()} exprs, first = self._expressions_from_rules(rules, decorated_custom_rules) super().__init__(exprs.items()) self.default_rule = first # may be None def default(self, rule_name): """Return a new Grammar whose :term:`default rule` is ``rule_name``.""" new = self._copy() new.default_rule = new[rule_name] return new def _copy(self): """Return a shallow copy of myself. Deep is unnecessary, since Expression trees are immutable. Subgrammars recreate all the Expressions from scratch, and AbstractGrammars have no Expressions. """ new = Grammar.__new__(Grammar) super(Grammar, new).__init__(self.items()) new.default_rule = self.default_rule return new def _expressions_from_rules(self, rules, custom_rules): """Return a 2-tuple: a dict of rule names pointing to their expressions, and then the first rule. It's a web of expressions, all referencing each other. Typically, there's a single root to the web of references, and that root is the starting symbol for parsing, but there's nothing saying you can't have multiple roots. :arg custom_rules: A map of rule names to custom-coded rules: Expressions """ tree = rule_grammar.parse(rules) return RuleVisitor(custom_rules).visit(tree) def parse(self, text, pos=0): """Parse some text with the :term:`default rule`. :arg pos: The index at which to start parsing """ self._check_default_rule() return self.default_rule.parse(text, pos=pos) def match(self, text, pos=0): """Parse some text with the :term:`default rule` but not necessarily all the way to the end. :arg pos: The index at which to start parsing """ self._check_default_rule() return self.default_rule.match(text, pos=pos) def _check_default_rule(self): """Raise RuntimeError if there is no default rule defined.""" if not self.default_rule: raise RuntimeError("Can't call parse() on a Grammar that has no " "default rule. Choose a specific rule instead, " "like some_grammar['some_rule'].parse(...).") def __str__(self): """Return a rule string that, when passed to the constructor, would reconstitute the grammar.""" exprs = [self.default_rule] if self.default_rule else [] exprs.extend(expr for expr in self.values() if expr is not self.default_rule) return '\n'.join(expr.as_rule() for expr in exprs) def __repr__(self): """Return an expression that will reconstitute the grammar.""" return "Grammar({!r})".format(str(self)) class TokenGrammar(Grammar): """A Grammar which takes a list of pre-lexed tokens instead of text This is useful if you want to do the lexing yourself, as a separate pass: for example, to implement indentation-based languages. """ def _expressions_from_rules(self, rules, custom_rules): tree = rule_grammar.parse(rules) return TokenRuleVisitor(custom_rules).visit(tree) class BootstrappingGrammar(Grammar): """The grammar used to recognize the textual rules that describe other grammars This grammar gets its start from some hard-coded Expressions and claws its way from there to an expression tree that describes how to parse the grammar description syntax. 
""" def _expressions_from_rules(self, rule_syntax, custom_rules): """Return the rules for parsing the grammar definition syntax. Return a 2-tuple: a dict of rule names pointing to their expressions, and then the top-level expression for the first rule. """ # Hard-code enough of the rules to parse the grammar that describes the # grammar description language, to bootstrap: comment = Regex(r'#[^\r\n]*', name='comment') meaninglessness = OneOf(Regex(r'\s+'), comment, name='meaninglessness') _ = ZeroOrMore(meaninglessness, name='_') equals = Sequence(Literal('='), _, name='equals') label = Sequence(Regex(r'[a-zA-Z_][a-zA-Z_0-9]*'), _, name='label') reference = Sequence(label, Not(equals), name='reference') quantifier = Sequence(Regex(r'[*+?]'), _, name='quantifier') # This pattern supports empty literals. TODO: A problem? spaceless_literal = Regex(r'u?r?"[^"\\]*(?:\\.[^"\\]*)*"', ignore_case=True, dot_all=True, name='spaceless_literal') literal = Sequence(spaceless_literal, _, name='literal') regex = Sequence(Literal('~'), literal, Regex('[ilmsuxa]*', ignore_case=True), _, name='regex') atom = OneOf(reference, literal, regex, name='atom') quantified = Sequence(atom, quantifier, name='quantified') term = OneOf(quantified, atom, name='term') not_term = Sequence(Literal('!'), term, _, name='not_term') term.members = (not_term,) + term.members sequence = Sequence(term, OneOrMore(term), name='sequence') or_term = Sequence(Literal('/'), _, term, name='or_term') ored = Sequence(term, OneOrMore(or_term), name='ored') expression = OneOf(ored, sequence, term, name='expression') rule = Sequence(label, equals, expression, name='rule') rules = Sequence(_, OneOrMore(rule), name='rules') # Use those hard-coded rules to parse the (more extensive) rule syntax. # (For example, unless I start using parentheses in the rule language # definition itself, I should never have to hard-code expressions for # those above.) rule_tree = rules.parse(rule_syntax) # Turn the parse tree into a map of expressions: return RuleVisitor().visit(rule_tree) # The grammar for parsing PEG grammar definitions: # This is a nice, simple grammar. We may someday add to it, but it's a safe bet # that the future will always be a superset of this. rule_syntax = (r''' # Ignored things (represented by _) are typically hung off the end of the # leafmost kinds of nodes. Literals like "/" count as leaves. rules = _ rule* rule = label equals expression equals = "=" _ literal = spaceless_literal _ # So you can't spell a regex like `~"..." ilm`: spaceless_literal = ~"u?r?b?\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""is / ~"u?r?b?'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'"is expression = ored / sequence / term or_term = "/" _ term ored = term or_term+ sequence = term term+ not_term = "!" 
term _ lookahead_term = "&" term _ term = not_term / lookahead_term / quantified / atom quantified = atom quantifier atom = reference / literal / regex / parenthesized regex = "~" spaceless_literal ~"[ilmsuxa]*"i _ parenthesized = "(" _ expression ")" _ quantifier = ~r"[*+?]|\{\d*,\d+\}|\{\d+,\d*\}|\{\d+\}" _ reference = label !equals # A subsequent equal sign is the only thing that distinguishes a label # (which begins a new rule) from a reference (which is just a pointer to a # rule defined somewhere else): label = ~"[a-zA-Z_][a-zA-Z_0-9]*(?![\"'])" _ # _ = ~r"\s*(?:#[^\r\n]*)?\s*" _ = meaninglessness* meaninglessness = ~r"\s+" / comment comment = ~r"#[^\r\n]*" ''') class LazyReference(str): """A lazy reference to a rule, which we resolve after grokking all the rules""" name = '' def resolve_refs(self, rule_map): """ Traverse the rule map following top-level lazy references, until we reach a cycle (raise an error) or a concrete expression. For example, the following is a circular reference: foo = bar baz = foo2 foo2 = foo Note that every RHS of a grammar rule _must_ be either a LazyReference or a concrete expression, so the reference chain will eventually either terminate or find a cycle. """ seen = set() cur = self while True: if cur in seen: raise BadGrammar(f"Circular Reference resolving {self.name}={self}.") else: seen.add(cur) try: cur = rule_map[str(cur)] except KeyError: raise UndefinedLabel(cur) if not isinstance(cur, LazyReference): return cur # Just for debugging: def _as_rhs(self): return '' % self class RuleVisitor(NodeVisitor): """Turns a parse tree of a grammar definition into a map of ``Expression`` objects This is the magic piece that breathes life into a parsed bunch of parse rules, allowing them to go forth and parse other things. """ quantifier_classes = {'?': Optional, '*': ZeroOrMore, '+': OneOrMore} visit_expression = visit_term = visit_atom = NodeVisitor.lift_child def __init__(self, custom_rules=None): """Construct. :arg custom_rules: A dict of {rule name: expression} holding custom rules which will take precedence over the others """ self.custom_rules = custom_rules or {} self._last_literal_node_and_type = None def visit_parenthesized(self, node, parenthesized): """Treat a parenthesized subexpression as just its contents. Its position in the tree suffices to maintain its grouping semantics. """ left_paren, _, expression, right_paren, _ = parenthesized return expression def visit_quantifier(self, node, quantifier): """Turn a quantifier into just its symbol-matching node.""" symbol, _ = quantifier return symbol def visit_quantified(self, node, quantified): atom, quantifier = quantified try: return self.quantifier_classes[quantifier.text](atom) except KeyError: # This should pass: assert re.full_match("\{(\d*)(,(\d*))?\}", quantifier) quantifier = quantifier.text[1:-1].split(",") if len(quantifier) == 1: min_match = max_match = int(quantifier[0]) else: min_match = int(quantifier[0]) if quantifier[0] else 0 max_match = int(quantifier[1]) if quantifier[1] else float('inf') return Quantifier(atom, min=min_match, max=max_match) def visit_lookahead_term(self, node, lookahead_term): ampersand, term, _ = lookahead_term return Lookahead(term) def visit_not_term(self, node, not_term): exclamation, term, _ = not_term return Not(term) def visit_rule(self, node, rule): """Assign a name to the Expression and return it.""" label, equals, expression = rule expression.name = label # Assign a name to the expr. 
return expression def visit_sequence(self, node, sequence): """A parsed Sequence looks like [term node, OneOrMore node of ``another_term``s]. Flatten it out.""" term, other_terms = sequence return Sequence(term, *other_terms) def visit_ored(self, node, ored): first_term, other_terms = ored return OneOf(first_term, *other_terms) def visit_or_term(self, node, or_term): """Return just the term from an ``or_term``. We already know it's going to be ored, from the containing ``ored``. """ slash, _, term = or_term return term def visit_label(self, node, label): """Turn a label into a unicode string.""" name, _ = label return name.text def visit_reference(self, node, reference): """Stick a :class:`LazyReference` in the tree as a placeholder. We resolve them all later. """ label, not_equals = reference return LazyReference(label) def visit_regex(self, node, regex): """Return a ``Regex`` expression.""" tilde, literal, flags, _ = regex flags = flags.text.upper() pattern = literal.literal # Pull the string back out of the Literal # object. return Regex(pattern, ignore_case='I' in flags, locale='L' in flags, multiline='M' in flags, dot_all='S' in flags, unicode='U' in flags, verbose='X' in flags, ascii='A' in flags) def visit_spaceless_literal(self, spaceless_literal, visited_children): """Turn a string literal into a ``Literal`` that recognizes it.""" literal_value = evaluate_string(spaceless_literal.text) if self._last_literal_node_and_type: last_node, last_type = self._last_literal_node_and_type if last_type != type(literal_value): raise BadGrammar(dedent(f"""\ Found {last_node.text} ({last_type}) and {spaceless_literal.text} ({type(literal_value)}) string literals. All strings in a single grammar must be of the same type. """) ) self._last_literal_node_and_type = spaceless_literal, type(literal_value) return Literal(literal_value) def visit_literal(self, node, literal): """Pick just the literal out of a literal-and-junk combo.""" spaceless_literal, _ = literal return spaceless_literal def generic_visit(self, node, visited_children): """Replace childbearing nodes with a list of their children; keep others untouched. For our case, if a node has children, only the children are important. Otherwise, keep the node around for (for example) the flags of the regex rule. Most of these kept-around nodes are subsequently thrown away by the other visitor methods. We can't simply hang the visited children off the original node; that would be disastrous if the node occurred in more than one place in the tree. """ return visited_children or node # should semantically be a tuple def visit_rules(self, node, rules_list): """Collate all the rules into a map. Return (map, default rule). The default rule is the first one. Or, if you have more than one rule of that name, it's the last-occurring rule of that name. (This lets you override the default rule when you extend a grammar.) If there are no string-based rules, the default rule is None, because the custom rules, due to being kwarg-based, are unordered. """ _, rules = rules_list # Map each rule's name to its Expression. Later rules of the same name # override earlier ones. This lets us define rules multiple times and # have the last declaration win, so you can extend grammars by # concatenation. rule_map = OrderedDict((expr.name, expr) for expr in rules) # And custom rules override string-based rules. This is the least # surprising choice when you compare the dict constructor: # dict({'x': 5}, x=6). rule_map.update(self.custom_rules) # Resolve references. 
This tolerates forward references. for name, rule in list(rule_map.items()): if hasattr(rule, 'resolve_refs'): # Some custom rules may not define a resolve_refs method, # though anything that inherits from Expression will have it. rule_map[name] = rule.resolve_refs(rule_map) # isinstance() is a temporary hack around the fact that * rules don't # always get transformed into lists by NodeVisitor. We should fix that; # it's surprising and requires writing lame branches like this. return rule_map, (rule_map[rules[0].name] if isinstance(rules, list) and rules else None) class TokenRuleVisitor(RuleVisitor): """A visitor which builds expression trees meant to work on sequences of pre-lexed tokens rather than strings""" def visit_spaceless_literal(self, spaceless_literal, visited_children): """Turn a string literal into a ``TokenMatcher`` that matches ``Token`` objects by their ``type`` attributes.""" return TokenMatcher(evaluate_string(spaceless_literal.text)) def visit_regex(self, node, regex): tilde, literal, flags, _ = regex raise BadGrammar('Regexes do not make sense in TokenGrammars, since ' 'TokenGrammars operate on pre-lexed tokens rather ' 'than characters.') # Bootstrap to level 1... rule_grammar = BootstrappingGrammar(rule_syntax) # ...and then to level 2. This establishes that the node tree of our rule # syntax is built by the same machinery that will build trees of our users' # grammars. And the correctness of that tree is tested, indirectly, in # test_grammar. rule_grammar = Grammar(rule_syntax) # TODO: Teach Expression trees how to spit out Python representations of # themselves. Then we can just paste that in above, and we won't have to # bootstrap on import. Though it'll be a little less DRY. [Ah, but this is not # so clean, because it would have to output multiple statements to get multiple # refs to a single expression hooked up.] parsimonious-0.10.0/parsimonious/nodes.py000066400000000000000000000315251430470422700205430ustar00rootroot00000000000000"""Nodes that make up parse trees Parsing spits out a tree of these, which you can then tell to walk itself and spit out a useful value. Or you can walk it yourself; the structural attributes are public. """ # TODO: If this is slow, think about using cElementTree or something. from inspect import isfunction from sys import version_info, exc_info from parsimonious.exceptions import VisitationError, UndefinedLabel class Node(object): """A parse tree node Consider these immutable once constructed. As a side effect of a memory-saving strategy in the cache, multiple references to a single ``Node`` might be returned in a single parse tree. So, if you start messing with one, you'll see surprising parallel changes pop up elsewhere. My philosophy is that parse trees (and their nodes) should be representation-agnostic. That is, they shouldn't get all mixed up with what the final rendered form of a wiki page (or the intermediate representation of a programming language, or whatever) is going to be: you should be able to parse once and render several representations from the tree, one after another. """ # I tried making this subclass list, but it got ugly. I had to construct # invalid ones and patch them up later, and there were other problems. __slots__ = ['expr', # The expression that generated me 'full_text', # The full text fed to the parser 'start', # The position in the text where that expr started matching 'end', # The position after start where the expr first didn't # match. [start:end] follow Python slice conventions. 
'children'] # List of child parse tree nodes def __init__(self, expr, full_text, start, end, children=None): self.expr = expr self.full_text = full_text self.start = start self.end = end self.children = children or [] @property def expr_name(self): # backwards compatibility return self.expr.name def __iter__(self): """Support looping over my children and doing tuple unpacks on me. It can be very handy to unpack nodes in arg lists; see :class:`PegVisitor` for an example. """ return iter(self.children) @property def text(self): """Return the text this node matched.""" return self.full_text[self.start:self.end] # From here down is just stuff for testing and debugging. def prettily(self, error=None): """Return a unicode, pretty-printed representation of me. :arg error: The node to highlight because an error occurred there """ # TODO: If a Node appears multiple times in the tree, we'll point to # them all. Whoops. def indent(text): return '\n'.join((' ' + line) for line in text.splitlines()) ret = [u'<%s%s matching "%s">%s' % ( self.__class__.__name__, (' called "%s"' % self.expr_name) if self.expr_name else '', self.text, ' <-- *** We were here. ***' if error is self else '')] for n in self: ret.append(indent(n.prettily(error=error))) return '\n'.join(ret) def __str__(self): """Return a compact, human-readable representation of me.""" return self.prettily() def __eq__(self, other): """Support by-value deep comparison with other nodes for testing.""" if not isinstance(other, Node): return NotImplemented return (self.expr == other.expr and self.full_text == other.full_text and self.start == other.start and self.end == other.end and self.children == other.children) def __ne__(self, other): return not self == other def __repr__(self, top_level=True): """Return a bit of code (though not an expression) that will recreate me.""" # repr() of unicode flattens everything out to ASCII, so we don't need # to explicitly encode things afterward. ret = ["s = %r" % self.full_text] if top_level else [] ret.append("%s(%r, s, %s, %s%s)" % ( self.__class__.__name__, self.expr, self.start, self.end, (', children=[%s]' % ', '.join([c.__repr__(top_level=False) for c in self.children])) if self.children else '')) return '\n'.join(ret) class RegexNode(Node): """Node returned from a ``Regex`` expression Grants access to the ``re.Match`` object, in case you want to access capturing groups, etc. """ __slots__ = ['match'] class RuleDecoratorMeta(type): def __new__(metaclass, name, bases, namespace): def unvisit(name): """Remove any leading "visit_" from a method name.""" return name[6:] if name.startswith('visit_') else name methods = [v for k, v in namespace.items() if hasattr(v, '_rule') and isfunction(v)] if methods: from parsimonious.grammar import Grammar # circular import dodge methods.sort(key=(lambda x: x.func_code.co_firstlineno) if version_info[0] < 3 else (lambda x: x.__code__.co_firstlineno)) # Possible enhancement: once we get the Grammar extensibility story # solidified, we can have @rules *add* to the default grammar # rather than pave over it. namespace['grammar'] = Grammar( '\n'.join('{name} = {expr}'.format(name=unvisit(m.__name__), expr=m._rule) for m in methods)) return super(RuleDecoratorMeta, metaclass).__new__(metaclass, name, bases, namespace) class NodeVisitor(object, metaclass=RuleDecoratorMeta): """A shell for writing things that turn parse trees into something useful Performs a depth-first traversal of an AST. 
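    As an illustrative sketch (not taken from the library docs), a visitor
    that sums the digits of a string might look like this; the grammar and
    the ``DigitSummer`` name are invented for the example::

        from parsimonious.grammar import Grammar
        from parsimonious.nodes import NodeVisitor

        class DigitSummer(NodeVisitor):
            grammar = Grammar('''
                digits = digit+
                digit  = ~"[0-9]"
                ''')

            def visit_digits(self, node, visited_children):
                return sum(visited_children)

            def visit_digit(self, node, visited_children):
                return int(node.text)

        DigitSummer().parse('123')  # -> 6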
Subclass this, add methods for each expr you care about, instantiate, and call ``visit(top_node_of_parse_tree)``. It'll return the useful stuff. This API is very similar to that of ``ast.NodeVisitor``. These could easily all be static methods, but that would add at least as much weirdness at the call site as the ``()`` for instantiation. And this way, we support subclasses that require state: options, for example, or a symbol table constructed from a programming language's AST. We never transform the parse tree in place, because... * There are likely multiple references to the same ``Node`` object in a parse tree, and changes to one reference would surprise you elsewhere. * It makes it impossible to report errors: you'd end up with the "error" arrow pointing someplace in a half-transformed mishmash of nodes--and that's assuming you're even transforming the tree into another tree. Heaven forbid you're making it into a string or something else. """ #: The :term:`default grammar`: the one recommended for use with this #: visitor. If you populate this, you will be able to call #: :meth:`NodeVisitor.parse()` as a shortcut. grammar = None #: Classes of exceptions you actually intend to raise during visitation #: and which should propagate out of the visitor. These will not be #: wrapped in a VisitationError when they arise. unwrapped_exceptions = () # TODO: If we need to optimize this, we can go back to putting subclasses # in charge of visiting children; they know when not to bother. Or we can # mark nodes as not descent-worthy in the grammar. def visit(self, node): """Walk a parse tree, transforming it into another representation. Recursively descend a parse tree, dispatching to the method named after the rule in the :class:`~parsimonious.grammar.Grammar` that produced each node. If, for example, a rule was... :: bold = '' ...the ``visit_bold()`` method would be called. It is your responsibility to subclass :class:`NodeVisitor` and implement those methods. """ method = getattr(self, 'visit_' + node.expr_name, self.generic_visit) # Call that method, and show where in the tree it failed if it blows # up. try: return method(node, [self.visit(n) for n in node]) except (VisitationError, UndefinedLabel): # Don't catch and re-wrap already-wrapped exceptions. raise except Exception as exc: # implentors may define exception classes that should not be # wrapped. if isinstance(exc, self.unwrapped_exceptions): raise # Catch any exception, and tack on a parse tree so it's easier to # see where it went wrong. exc_class = type(exc) raise VisitationError(exc, exc_class, node) from exc def generic_visit(self, node, visited_children): """Default visitor method :arg node: The node we're visiting :arg visited_children: The results of visiting the children of that node, in a list I'm not sure there's an implementation of this that makes sense across all (or even most) use cases, so we leave it to subclasses to implement for now. """ raise NotImplementedError('No visitor method was defined for this expression: %s' % node.expr.as_rule()) # Convenience methods: def parse(self, text, pos=0): """Parse some text with this Visitor's default grammar and return the result of visiting it. ``SomeVisitor().parse('some_string')`` is a shortcut for ``SomeVisitor().visit(some_grammar.parse('some_string'))``. """ return self._parse_or_match(text, pos, 'parse') def match(self, text, pos=0): """Parse and visit some text with this Visitor's default grammar, but don't insist on parsing all the way to the end. 
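        For example (an illustrative sketch, not from the library docs; the
        ``Greeter`` class and its one-rule grammar are invented here)::

            from parsimonious.grammar import Grammar
            from parsimonious.nodes import NodeVisitor

            class Greeter(NodeVisitor):
                grammar = Grammar('greeting = "hi"')
                def visit_greeting(self, node, visited_children):
                    return node.text.upper()

            Greeter().match('hi there')   # -> 'HI'; trailing text is ignored
            Greeter().parse('hi there')   # raises IncompleteParseError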
``SomeVisitor().match('some_string')`` is a shortcut for ``SomeVisitor().visit(some_grammar.match('some_string'))``. """ return self._parse_or_match(text, pos, 'match') # Internal convenience methods to help you write your own visitors: def lift_child(self, node, children): """Lift the sole child of ``node`` up to replace the node.""" first_child, = children return first_child # Private methods: def _parse_or_match(self, text, pos, method_name): """Execute a parse or match on the default grammar, followed by a visitation. Raise RuntimeError if there is no default grammar specified. """ if not self.grammar: raise RuntimeError( "The {cls}.{method}() shortcut won't work because {cls} was " "never associated with a specific " "grammar. Fill out its " "`grammar` attribute, and try again.".format( cls=self.__class__.__name__, method=method_name)) return self.visit(getattr(self.grammar, method_name)(text, pos=pos)) def rule(rule_string): """Decorate a NodeVisitor ``visit_*`` method to tie a grammar rule to it. The following will arrange for the ``visit_digit`` method to receive the results of the ``~"[0-9]"`` parse rule:: @rule('~"[0-9]"') def visit_digit(self, node, visited_children): ... Notice that there is no "digit = " as part of the rule; that gets inferred from the method name. In cases where there is only one kind of visitor interested in a grammar, using ``@rule`` saves you having to look back and forth between the visitor and the grammar definition. On an implementation level, all ``@rule`` rules get stitched together into a :class:`~parsimonious.Grammar` that becomes the NodeVisitor's :term:`default grammar`. Typically, the choice of a default rule for this grammar is simple: whatever ``@rule`` comes first in the class is the default. But the choice may become surprising if you divide the ``@rule`` calls among subclasses. At the moment, which method "comes first" is decided simply by comparing line numbers, so whatever method is on the smallest-numbered line will be the default. In a future release, this will change to pick the first ``@rule`` call on the basemost class that has one. That way, a subclass which does not override the default rule's ``visit_*`` method won't unintentionally change which rule is the default. """ def decorator(method): method._rule = rule_string # XXX: Maybe register them on a class var instead so we can just override a @rule'd visitor method on a subclass without blowing away the rule string that comes with it. return method return decorator parsimonious-0.10.0/parsimonious/tests/000077500000000000000000000000001430470422700202155ustar00rootroot00000000000000parsimonious-0.10.0/parsimonious/tests/__init__.py000066400000000000000000000000001430470422700223140ustar00rootroot00000000000000parsimonious-0.10.0/parsimonious/tests/benchmarks.py000066400000000000000000000061641430470422700227130ustar00rootroot00000000000000"""Benchmarks for Parsimonious Run these with ``python parsimonious/tests/benchmarks.py``. They don't run during normal test runs because they're not tests--they don't assert anything. Also, they're a bit slow. These differ from the ones in test_benchmarks in that these are meant to be compared from revision to revision of Parsimonious to make sure we're not getting slower. test_benchmarks simply makes sure our choices among implementation alternatives remain valid. 
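A quick ad-hoc version of the same idea, should you want one while hacking
(an illustrative sketch -- the one-rule grammar below is invented, and the
functions in this file are the real revision-to-revision comparison)::

    from timeit import repeat
    from parsimonious.grammar import Grammar

    grammar = Grammar('ones = "1"+')
    text = '1' * 10000
    print(min(repeat(lambda: grammar.parse(text), repeat=5, number=1)))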
""" from __future__ import print_function import gc from timeit import repeat from parsimonious.grammar import Grammar def test_not_really_json_parsing(): """As a baseline for speed, parse some JSON. I have no reason to believe that JSON is a particularly representative or revealing grammar to test with. Also, this is a naive, unoptimized, incorrect grammar, so don't use it as a basis for comparison with other parsers. It's just meant to compare across versions of Parsimonious. """ father = """{ "id" : 1, "married" : true, "name" : "Larry Lopez", "sons" : null, "daughters" : [ { "age" : 26, "name" : "Sandra" }, { "age" : 25, "name" : "Margaret" }, { "age" : 6, "name" : "Mary" } ] }""" more_fathers = ','.join([father] * 60) json = '{"fathers" : [' + more_fathers + ']}' grammar = Grammar(r""" value = space (string / number / object / array / true_false_null) space object = "{" members "}" members = (pair ("," pair)*)? pair = string ":" value array = "[" elements "]" elements = (value ("," value)*)? true_false_null = "true" / "false" / "null" string = space "\"" chars "\"" space chars = ~"[^\"]*" # TODO implement the real thing number = (int frac exp) / (int exp) / (int frac) / int int = "-"? ((digit1to9 digits) / digit) frac = "." digits exp = e digits digits = digit+ e = "e+" / "e-" / "e" / "E+" / "E-" / "E" digit1to9 = ~"[1-9]" digit = ~"[0-9]" space = ~"\s*" """) # These number and repetition values seem to keep results within 5% of the # difference between min and max. We get more consistent results running a # bunch of single-parse tests and taking the min rather than upping the # NUMBER and trying to stomp out the outliers with averaging. NUMBER = 1 REPEAT = 5 total_seconds = min(repeat(lambda: grammar.parse(json), lambda: gc.enable(), # so we take into account how we treat the GC repeat=REPEAT, number=NUMBER)) seconds_each = total_seconds / NUMBER kb = len(json) / 1024.0 print('Took %.3fs to parse %.1fKB: %.0fKB/s.' % (seconds_each, kb, kb / seconds_each)) if __name__ == "__main__": test_not_really_json_parsing()parsimonious-0.10.0/parsimonious/tests/test_benchmarks.py000066400000000000000000000033601430470422700237450ustar00rootroot00000000000000"""Tests to show that the benchmarks we based our speed optimizations on are still valid""" import unittest from functools import partial from timeit import timeit timeit = partial(timeit, number=500000) class TestBenchmarks(unittest.TestCase): def test_lists_vs_dicts(self): """See what's faster at int key lookup: dicts or lists.""" list_time = timeit('item = l[9000]', 'l = [0] * 10000') dict_time = timeit('item = d[9000]', 'd = {x: 0 for x in range(10000)}') # Dicts take about 1.6x as long as lists in Python 2.6 and 2.7. self.assertTrue(list_time < dict_time, '%s < %s' % (list_time, dict_time)) def test_call_vs_inline(self): """How bad is the calling penalty?""" no_call = timeit('l[0] += 1', 'l = [0]') call = timeit('add(); l[0] += 1', 'l = [0]\n' 'def add():\n' ' pass') # Calling a function is pretty fast; it takes just 1.2x as long as the # global var access and addition in l[0] += 1. self.assertTrue(no_call < call, '%s (no call) < %s (call)' % (no_call, call)) def test_startswith_vs_regex(self): """Can I beat the speed of regexes by special-casing literals?""" re_time = timeit( 'r.match(t, 19)', 'import re\n' "r = re.compile('hello')\n" "t = 'this is the finest hello ever'") startswith_time = timeit("t.startswith('hello', 19)", "t = 'this is the finest hello ever'") # Regexes take 2.24x as long as simple string matching. 
self.assertTrue(startswith_time < re_time, '%s (startswith) < %s (re)' % (startswith_time, re_time))parsimonious-0.10.0/parsimonious/tests/test_expressions.py000066400000000000000000000332441430470422700242160ustar00rootroot00000000000000# coding=utf-8 from unittest import TestCase from parsimonious.exceptions import ParseError, IncompleteParseError from parsimonious.expressions import (Literal, Regex, Sequence, OneOf, Not, Quantifier, Optional, ZeroOrMore, OneOrMore, Expression) from parsimonious.grammar import Grammar, rule_grammar from parsimonious.nodes import Node class LengthTests(TestCase): """Tests for returning the right lengths I wrote these before parse tree generation was implemented. They're partially redundant with TreeTests. """ def len_eq(self, node, length): """Return whether the match lengths of 2 nodes are equal. Makes tests shorter and lets them omit positional stuff they don't care about. """ node_length = None if node is None else node.end - node.start assert node_length == length def test_regex(self): self.len_eq(Literal('hello').match('ehello', 1), 5) # simple self.len_eq(Regex('hello*').match('hellooo'), 7) # * self.assertRaises(ParseError, Regex('hello*').match, 'goodbye') # no match self.len_eq(Regex('hello', ignore_case=True).match('HELLO'), 5) def test_sequence(self): self.len_eq(Sequence(Regex('hi*'), Literal('lo'), Regex('.ingo')).match('hiiiilobingo1234'), 12) # succeed self.assertRaises(ParseError, Sequence(Regex('hi*'), Literal('lo'), Regex('.ingo')).match, 'hiiiilobing') # don't self.len_eq(Sequence(Regex('hi*')).match('>hiiii', 1), 5) # non-0 pos def test_one_of(self): self.len_eq(OneOf(Literal('aaa'), Literal('bb')).match('aaa'), 3) # first alternative self.len_eq(OneOf(Literal('aaa'), Literal('bb')).match('bbaaa'), 2) # second self.assertRaises(ParseError, OneOf(Literal('aaa'), Literal('bb')).match, 'aa') # no match def test_not(self): self.len_eq(Not(Regex('.')).match(''), 0) # match self.assertRaises(ParseError, Not(Regex('.')).match, 'Hi') # don't def test_optional(self): self.len_eq(Sequence(Optional(Literal('a')), Literal('b')).match('b'), 1) # contained expr fails self.len_eq(Sequence(Optional(Literal('a')), Literal('b')).match('ab'), 2) # contained expr succeeds self.len_eq(Optional(Literal('a')).match('aa'), 1) self.len_eq(Optional(Literal('a')).match('bb'), 0) def test_zero_or_more(self): self.len_eq(ZeroOrMore(Literal('b')).match(''), 0) # zero self.len_eq(ZeroOrMore(Literal('b')).match('bbb'), 3) # more self.len_eq(Regex('^').match(''), 0) # Validate the next test. 
# Try to make it loop infinitely using a zero-length contained expression: self.len_eq(ZeroOrMore(Regex('^')).match(''), 0) def test_one_or_more(self): self.len_eq(OneOrMore(Literal('b')).match('b'), 1) # one self.len_eq(OneOrMore(Literal('b')).match('bbb'), 3) # more self.len_eq(OneOrMore(Literal('b'), min=3).match('bbb'), 3) # with custom min; success self.len_eq(Quantifier(Literal('b'), min=3, max=5).match('bbbb'), 4) # with custom min and max; success self.len_eq(Quantifier(Literal('b'), min=3, max=5).match('bbbbbb'), 5) # with custom min and max; success self.assertRaises(ParseError, OneOrMore(Literal('b'), min=3).match, 'bb') # with custom min; failure self.assertRaises(ParseError, Quantifier(Literal('b'), min=3, max=5).match, 'bb') # with custom min and max; failure self.len_eq(OneOrMore(Regex('^')).match('bb'), 0) # attempt infinite loop class TreeTests(TestCase): """Tests for building the right trees We have only to test successes here; failures (None-returning cases) are covered above. """ def test_simple_node(self): """Test that leaf expressions like ``Literal`` make the right nodes.""" h = Literal('hello', name='greeting') self.assertEqual(h.match('hello'), Node(h, 'hello', 0, 5)) def test_sequence_nodes(self): """Assert that ``Sequence`` produces nodes with the right children.""" s = Sequence(Literal('heigh', name='greeting1'), Literal('ho', name='greeting2'), name='dwarf') text = 'heighho' self.assertEqual(s.match(text), Node(s, text, 0, 7, children=[Node(s.members[0], text, 0, 5), Node(s.members[1], text, 5, 7)])) def test_one_of(self): """``OneOf`` should return its own node, wrapping the child that succeeds.""" o = OneOf(Literal('a', name='lit'), name='one_of') text = 'aa' self.assertEqual(o.match(text), Node(o, text, 0, 1, children=[ Node(o.members[0], text, 0, 1)])) def test_optional(self): """``Optional`` should return its own node wrapping the succeeded child.""" expr = Optional(Literal('a', name='lit'), name='opt') text = 'a' self.assertEqual(expr.match(text), Node(expr, text, 0, 1, children=[ Node(expr.members[0], text, 0, 1)])) # Test failure of the Literal inside the Optional; the # LengthTests.test_optional is ambiguous for that. text = '' self.assertEqual(expr.match(text), Node(expr, text, 0, 0)) def test_zero_or_more_zero(self): """Test the 0 case of ``ZeroOrMore``; it should still return a node.""" expr = ZeroOrMore(Literal('a'), name='zero') text = '' self.assertEqual(expr.match(text), Node(expr, text, 0, 0)) def test_one_or_more_one(self): """Test the 1 case of ``OneOrMore``; it should return a node with a child.""" expr = OneOrMore(Literal('a', name='lit'), name='one') text = 'a' self.assertEqual(expr.match(text), Node(expr, text, 0, 1, children=[ Node(expr.members[0], text, 0, 1)])) # Things added since Grammar got implemented are covered in integration # tests in test_grammar. class ParseTests(TestCase): """Tests for the ``parse()`` method""" def test_parse_success(self): """Make sure ``parse()`` returns the tree on success. There's not much more than that to test that we haven't already vetted above. """ expr = OneOrMore(Literal('a', name='lit'), name='more') text = 'aa' self.assertEqual(expr.parse(text), Node(expr, text, 0, 2, children=[ Node(expr.members[0], text, 0, 1), Node(expr.members[0], text, 1, 2)])) class ErrorReportingTests(TestCase): """Tests for reporting parse errors""" def test_inner_rule_succeeding(self): """Make sure ``parse()`` fails and blames the rightward-progressing-most named Expression when an Expression isn't satisfied. 
Make sure ParseErrors have nice Unicode representations. """ grammar = Grammar(""" bold_text = open_parens text close_parens open_parens = "((" text = ~"[a-zA-Z]+" close_parens = "))" """) text = '((fred!!' try: grammar.parse(text) except ParseError as error: self.assertEqual(error.pos, 6) self.assertEqual(error.expr, grammar['close_parens']) self.assertEqual(error.text, text) self.assertEqual(str(error), "Rule 'close_parens' didn't match at '!!' (line 1, column 7).") def test_rewinding(self): """Make sure rewinding the stack and trying an alternative (which progresses farther) from a higher-level rule can blame an expression within the alternative on failure. There's no particular reason I suspect this wouldn't work, but it's a more real-world example than the no-alternative cases already tested. """ grammar = Grammar(""" formatted_text = bold_text / weird_text bold_text = open_parens text close_parens weird_text = open_parens text "!!" bork bork = "bork" open_parens = "((" text = ~"[a-zA-Z]+" close_parens = "))" """) text = '((fred!!' try: grammar.parse(text) except ParseError as error: self.assertEqual(error.pos, 8) self.assertEqual(error.expr, grammar['bork']) self.assertEqual(error.text, text) def test_no_named_rule_succeeding(self): """Make sure ParseErrors have sane printable representations even if we never succeeded in matching any named expressions.""" grammar = Grammar('''bork = "bork"''') try: grammar.parse('snork') except ParseError as error: self.assertEqual(error.pos, 0) self.assertEqual(error.expr, grammar['bork']) self.assertEqual(error.text, 'snork') def test_parse_with_leftovers(self): """Make sure ``parse()`` reports where we started failing to match, even if a partial match was successful.""" grammar = Grammar(r'''sequence = "chitty" (" " "bang")+''') try: grammar.parse('chitty bangbang') except IncompleteParseError as error: self.assertEqual(str( error), "Rule 'sequence' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with 'bang' (line 1, column 12).") def test_favoring_named_rules(self): """Named rules should be used in error messages in favor of anonymous ones, even if those are rightward-progressing-more, and even if the failure starts at position 0.""" grammar = Grammar(r'''starts_with_a = &"a" ~"[a-z]+"''') try: grammar.parse('burp') except ParseError as error: self.assertEqual(str(error), "Rule 'starts_with_a' didn't match at 'burp' (line 1, column 1).") def test_line_and_column(self): """Make sure we got the line and column computation right.""" grammar = Grammar(r""" whee_lah = whee "\n" lah "\n" whee = "whee" lah = "lah" """) try: grammar.parse('whee\nlahGOO') except ParseError as error: # TODO: Right now, this says "Rule # didn't match". That's not the greatest. Fix that, then fix this. self.assertTrue(str(error).endswith(r"""didn't match at 'GOO' (line 2, column 4).""")) class RepresentationTests(TestCase): """Tests for str(), unicode(), and repr() of expressions""" def test_unicode_crash(self): """Make sure matched unicode strings don't crash ``__str__``.""" grammar = Grammar(r'string = ~r"\S+"u') str(grammar.parse('中文')) def test_unicode(self): """Smoke-test the conversion of expressions to bits of rules. A slightly more comprehensive test of the actual values is in ``GrammarTests.test_unicode``. """ str(rule_grammar) def test_unicode_keep_parens(self): """Make sure converting an expression to unicode doesn't strip parenthesis. 
""" # ZeroOrMore self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs")* "spam"')), "foo = 'bar' ('baz' 'eggs')* 'spam'") # Quantifiers self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){2,4} "spam"')), "foo = 'bar' ('baz' 'eggs'){2,4} 'spam'") self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){2,} "spam"')), "foo = 'bar' ('baz' 'eggs'){2,} 'spam'") self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){1,} "spam"')), "foo = 'bar' ('baz' 'eggs')+ 'spam'") self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){,4} "spam"')), "foo = 'bar' ('baz' 'eggs'){,4} 'spam'") self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){0,1} "spam"')), "foo = 'bar' ('baz' 'eggs')? 'spam'") self.assertEqual(str(Grammar('foo = "bar" ("baz" "eggs"){0,} "spam"')), "foo = 'bar' ('baz' 'eggs')* 'spam'") # OneOf self.assertEqual(str(Grammar('foo = "bar" ("baz" / "eggs") "spam"')), "foo = 'bar' ('baz' / 'eggs') 'spam'") # Lookahead self.assertEqual(str(Grammar('foo = "bar" &("baz" "eggs") "spam"')), "foo = 'bar' &('baz' 'eggs') 'spam'") # Multiple sequences self.assertEqual(str(Grammar('foo = ("bar" "baz") / ("baff" "bam")')), "foo = ('bar' 'baz') / ('baff' 'bam')") def test_unicode_surrounding_parens(self): """ Make sure there are no surrounding parens around the entire right-hand side of an expression (as they're unnecessary). """ self.assertEqual(str(Grammar('foo = ("foo" ("bar" "baz"))')), "foo = 'foo' ('bar' 'baz')") class SlotsTests(TestCase): """Tests to do with __slots__""" def test_subclassing(self): """Make sure a subclass of a __slots__-less class can introduce new slots itself. This isn't supposed to work, according to the language docs: When inheriting from a class without __slots__, the __dict__ attribute of that class will always be accessible, so a __slots__ definition in the subclass is meaningless. But it does. """ class Smoo(Quantifier): __slots__ = ['smoo'] def __init__(self): self.smoo = 'smoo' smoo = Smoo() self.assertEqual(smoo.__dict__, {}) # has a __dict__ but with no smoo in it self.assertEqual(smoo.smoo, 'smoo') # The smoo attr ended up in a slot. parsimonious-0.10.0/parsimonious/tests/test_grammar.py000066400000000000000000000626121430470422700232630ustar00rootroot00000000000000# coding=utf-8 from sys import version_info from unittest import TestCase import pytest from parsimonious.exceptions import BadGrammar, LeftRecursionError, ParseError, UndefinedLabel, VisitationError from parsimonious.expressions import Literal, Lookahead, Regex, Sequence, TokenMatcher, is_callable from parsimonious.grammar import rule_grammar, rule_syntax, RuleVisitor, Grammar, TokenGrammar, LazyReference from parsimonious.nodes import Node from parsimonious.utils import Token class BootstrappingGrammarTests(TestCase): """Tests for the expressions in the grammar that parses the grammar definition syntax""" def test_quantifier(self): text = '*' quantifier = rule_grammar['quantifier'] self.assertEqual(quantifier.parse(text), Node(quantifier, text, 0, 1, children=[ Node(quantifier.members[0], text, 0, 1), Node(rule_grammar['_'], text, 1, 1)])) text = '?' 
self.assertEqual(quantifier.parse(text), Node(quantifier, text, 0, 1, children=[ Node(quantifier.members[0], text, 0, 1), Node(rule_grammar['_'], text, 1, 1)])) text = '+' self.assertEqual(quantifier.parse(text), Node(quantifier, text, 0, 1, children=[ Node(quantifier.members[0], text, 0, 1), Node(rule_grammar['_'], text, 1, 1)])) def test_spaceless_literal(self): text = '"anything but quotes#$*&^"' spaceless_literal = rule_grammar['spaceless_literal'] self.assertEqual(spaceless_literal.parse(text), Node(spaceless_literal, text, 0, len(text), children=[ Node(spaceless_literal.members[0], text, 0, len(text))])) text = r'''r"\""''' self.assertEqual(spaceless_literal.parse(text), Node(spaceless_literal, text, 0, 5, children=[ Node(spaceless_literal.members[0], text, 0, 5)])) def test_regex(self): text = '~"[a-zA-Z_][a-zA-Z_0-9]*"LI' regex = rule_grammar['regex'] self.assertEqual(rule_grammar['regex'].parse(text), Node(regex, text, 0, len(text), children=[ Node(Literal('~'), text, 0, 1), Node(rule_grammar['spaceless_literal'], text, 1, 25, children=[ Node(rule_grammar['spaceless_literal'].members[0], text, 1, 25)]), Node(regex.members[2], text, 25, 27), Node(rule_grammar['_'], text, 27, 27)])) def test_successes(self): """Make sure the PEG recognition grammar succeeds on various inputs.""" self.assertTrue(rule_grammar['label'].parse('_')) self.assertTrue(rule_grammar['label'].parse('jeff')) self.assertTrue(rule_grammar['label'].parse('_THIS_THING')) self.assertTrue(rule_grammar['atom'].parse('some_label')) self.assertTrue(rule_grammar['atom'].parse('"some literal"')) self.assertTrue(rule_grammar['atom'].parse('~"some regex"i')) self.assertTrue(rule_grammar['quantified'].parse('~"some regex"i*')) self.assertTrue(rule_grammar['quantified'].parse('thing+')) self.assertTrue(rule_grammar['quantified'].parse('"hi"?')) self.assertTrue(rule_grammar['term'].parse('this')) self.assertTrue(rule_grammar['term'].parse('that+')) self.assertTrue(rule_grammar['sequence'].parse('this that? other')) self.assertTrue(rule_grammar['ored'].parse('this / that+ / "other"')) # + is higher precedence than &, so 'anded' should match the whole # thing: self.assertTrue(rule_grammar['lookahead_term'].parse('&this+')) self.assertTrue(rule_grammar['expression'].parse('this')) self.assertTrue(rule_grammar['expression'].parse('this? that other*')) self.assertTrue(rule_grammar['expression'].parse('&this / that+ / "other"')) self.assertTrue(rule_grammar['expression'].parse('this / that? / "other"+')) self.assertTrue(rule_grammar['expression'].parse('this? that other*')) self.assertTrue(rule_grammar['rule'].parse('this = that\r')) self.assertTrue(rule_grammar['rule'].parse('this = the? that other* \t\r')) self.assertTrue(rule_grammar['rule'].parse('the=~"hi*"\n')) self.assertTrue(rule_grammar.parse(''' this = the? that other* that = "thing" the=~"hi*" other = "ahoy hoy" ''')) class RuleVisitorTests(TestCase): """Tests for ``RuleVisitor`` As I write these, Grammar is not yet fully implemented. Normally, there'd be no reason to use ``RuleVisitor`` directly. """ def test_round_trip(self): """Test a simple round trip. Parse a simple grammar, turn the parse tree into a map of expressions, and use that to parse another piece of text. Not everything was implemented yet, but it was a big milestone and a proof of concept. 
""" tree = rule_grammar.parse('''number = ~"[0-9]+"\n''') rules, default_rule = RuleVisitor().visit(tree) text = '98' self.assertEqual(default_rule.parse(text), Node(default_rule, text, 0, 2)) def test_undefined_rule(self): """Make sure we throw the right exception on undefined rules.""" tree = rule_grammar.parse('boy = howdy\n') self.assertRaises(UndefinedLabel, RuleVisitor().visit, tree) def test_optional(self): tree = rule_grammar.parse('boy = "howdy"?\n') rules, default_rule = RuleVisitor().visit(tree) howdy = 'howdy' # It should turn into a Node from the Optional and another from the # Literal within. self.assertEqual(default_rule.parse(howdy), Node(default_rule, howdy, 0, 5, children=[ Node(Literal("howdy"), howdy, 0, 5)])) def function_rule(text, pos): """This is an example of a grammar rule implemented as a function, and is provided as a test fixture.""" token = 'function' return pos + len(token) if text[pos:].startswith(token) else None class GrammarTests(TestCase): """Integration-test ``Grammar``: feed it a PEG and see if it works.""" def method_rule(self, text, pos): """This is an example of a grammar rule implemented as a method, and is provided as a test fixture.""" token = 'method' return pos + len(token) if text[pos:].startswith(token) else None @staticmethod def descriptor_rule(text, pos): """This is an example of a grammar rule implemented as a descriptor, and is provided as a test fixture.""" token = 'descriptor' return pos + len(token) if text[pos:].startswith(token) else None rules = {"descriptor_rule": descriptor_rule} def test_expressions_from_rules(self): """Test the ``Grammar`` base class's ability to compile an expression tree from rules. That the correct ``Expression`` tree is built is already tested in ``RuleGrammarTests``. This tests only that the ``Grammar`` base class's ``_expressions_from_rules`` works. """ greeting_grammar = Grammar('greeting = "hi" / "howdy"') tree = greeting_grammar.parse('hi') self.assertEqual(tree, Node(greeting_grammar['greeting'], 'hi', 0, 2, children=[ Node(Literal('hi'), 'hi', 0, 2)])) def test_unicode(self): """Assert that a ``Grammar`` can convert into a string-formatted series of rules.""" grammar = Grammar(r""" bold_text = bold_open text bold_close text = ~"[A-Z 0-9]*"i bold_open = "((" bold_close = "))" """) lines = str(grammar).splitlines() self.assertEqual(lines[0], 'bold_text = bold_open text bold_close') self.assertTrue("text = ~'[A-Z 0-9]*'i%s" % ('u' if version_info >= (3,) else '') in lines) self.assertTrue("bold_open = '(('" in lines) self.assertTrue("bold_close = '))'" in lines) self.assertEqual(len(lines), 4) def test_match(self): """Make sure partial-matching (with pos) works.""" grammar = Grammar(r""" bold_text = bold_open text bold_close text = ~"[A-Z 0-9]*"i bold_open = "((" bold_close = "))" """) s = ' ((boo))yah' self.assertEqual(grammar.match(s, pos=1), Node(grammar['bold_text'], s, 1, 8, children=[ Node(grammar['bold_open'], s, 1, 3), Node(grammar['text'], s, 3, 6), Node(grammar['bold_close'], s, 6, 8)])) def test_bad_grammar(self): """Constructing a Grammar with bad rules should raise ParseError.""" self.assertRaises(ParseError, Grammar, 'just a bunch of junk') def test_comments(self): """Test tolerance of comments and blank lines in and around rules.""" grammar = Grammar(r"""# This is a grammar. # It sure is. bold_text = stars text stars # nice text = ~"[A-Z 0-9]*"i #dude stars = "**" # Pretty good #Oh yeah.#""") # Make sure a comment doesn't need a # \n or \r to end. 
self.assertEqual(list(sorted(str(grammar).splitlines())), ['''bold_text = stars text stars''', # TODO: Unicode flag is on by default in Python 3. I wonder if we # should turn it on all the time in Parsimonious. """stars = '**'""", '''text = ~'[A-Z 0-9]*'i%s''' % ('u' if version_info >= (3,) else '')]) def test_multi_line(self): """Make sure we tolerate all sorts of crazy line breaks and comments in the middle of rules.""" grammar = Grammar(""" bold_text = bold_open # commenty comment text # more comment bold_close text = ~"[A-Z 0-9]*"i bold_open = "((" bold_close = "))" """) self.assertTrue(grammar.parse('((booyah))') is not None) def test_not(self): """Make sure "not" predicates get parsed and work properly.""" grammar = Grammar(r'''not_arp = !"arp" ~"[a-z]+"''') self.assertRaises(ParseError, grammar.parse, 'arp') self.assertTrue(grammar.parse('argle') is not None) def test_lookahead(self): grammar = Grammar(r'''starts_with_a = &"a" ~"[a-z]+"''') self.assertRaises(ParseError, grammar.parse, 'burp') s = 'arp' self.assertEqual(grammar.parse('arp'), Node(grammar['starts_with_a'], s, 0, 3, children=[ Node(Lookahead(Literal('a')), s, 0, 0), Node(Regex(r'[a-z]+'), s, 0, 3)])) def test_parens(self): grammar = Grammar(r'''sequence = "chitty" (" " "bang")+''') # Make sure it's not as if the parens aren't there: self.assertRaises(ParseError, grammar.parse, 'chitty bangbang') s = 'chitty bang bang' self.assertEqual(str(grammar.parse(s)), """ """) def test_resolve_refs_order(self): """Smoke-test a circumstance where lazy references don't get resolved.""" grammar = Grammar(""" expression = "(" terms ")" terms = term+ term = number number = ~r"[0-9]+" """) grammar.parse('(34)') def test_resolve_refs_completeness(self): """Smoke-test another circumstance where lazy references don't get resolved.""" grammar = Grammar(r""" block = "{" _ item* "}" _ # An item is an element of a block. item = number / word / block / paren # Parens are for delimiting subexpressions. paren = "(" _ item* ")" _ # Words are barewords, unquoted things, other than literals, that can live # in lists. We may renege on some of these chars later, especially ".". We # may add Unicode. word = spaceless_word _ spaceless_word = ~r"[-a-z`~!@#$%^&*_+=|\\;<>,.?][-a-z0-9`~!@#$%^&*_+=|\\;<>,.?]*"i number = ~r"[0-9]+" _ # There are decimals and strings and other stuff back on the "parsing" branch, once you get this working. _ = meaninglessness* meaninglessness = whitespace whitespace = ~r"\s+" """) grammar.parse('{log (add 3 to 5)}') def test_infinite_loop(self): """Smoke-test a grammar that was causing infinite loops while building. This was going awry because the "int" rule was never getting marked as resolved, so it would just keep trying to resolve it over and over. """ Grammar(""" digits = digit+ int = digits digit = ~"[0-9]" number = int main = number """) def test_circular_toplevel_reference(self): with pytest.raises(VisitationError): Grammar(""" foo = bar bar = foo """) with pytest.raises(VisitationError): Grammar(""" foo = foo bar = foo """) with pytest.raises(VisitationError): Grammar(""" foo = bar bar = baz baz = foo """) def test_right_recursive(self): """Right-recursive refs should resolve.""" grammar = Grammar(""" digits = digit digits? 
digit = ~r"[0-9]" """) self.assertTrue(grammar.parse('12') is not None) def test_badly_circular(self): """Uselessly circular references should be detected by the grammar compiler.""" self.skipTest('We have yet to make the grammar compiler detect these.') Grammar(""" foo = bar bar = foo """) def test_parens_with_leading_whitespace(self): """Make sure a parenthesized expression is allowed to have leading whitespace when nested directly inside another.""" Grammar("""foo = ( ("c") )""").parse('c') def test_single_quoted_literals(self): Grammar("""foo = 'a' '"'""").parse('a"') def test_simple_custom_rules(self): """Run 2-arg custom-coded rules through their paces.""" grammar = Grammar(""" bracketed_digit = start digit end start = '[' end = ']'""", digit=lambda text, pos: (pos + 1) if text[pos].isdigit() else None) s = '[6]' self.assertEqual(grammar.parse(s), Node(grammar['bracketed_digit'], s, 0, 3, children=[ Node(grammar['start'], s, 0, 1), Node(grammar['digit'], s, 1, 2), Node(grammar['end'], s, 2, 3)])) def test_complex_custom_rules(self): """Run 5-arg custom rules through their paces. Incidentally tests returning an actual Node from the custom rule. """ grammar = Grammar(""" bracketed_digit = start digit end start = '[' end = ']' real_digit = '6'""", # In this particular implementation of the digit rule, no node is # generated for `digit`; it falls right through to `real_digit`. # I'm not sure if this could lead to problems; I can't think of # any, but it's probably not a great idea. digit=lambda text, pos, cache, error, grammar: grammar['real_digit'].match_core(text, pos, cache, error)) s = '[6]' self.assertEqual(grammar.parse(s), Node(grammar['bracketed_digit'], s, 0, 3, children=[ Node(grammar['start'], s, 0, 1), Node(grammar['real_digit'], s, 1, 2), Node(grammar['end'], s, 2, 3)])) def test_lazy_custom_rules(self): """Make sure LazyReferences manually shoved into custom rules are resolved. Incidentally test passing full-on Expressions as custom rules and having a custom rule as the default one. """ grammar = Grammar(""" four = '4' five = '5'""", forty_five=Sequence(LazyReference('four'), LazyReference('five'), name='forty_five')).default('forty_five') s = '45' self.assertEqual(grammar.parse(s), Node(grammar['forty_five'], s, 0, 2, children=[ Node(grammar['four'], s, 0, 1), Node(grammar['five'], s, 1, 2)])) def test_unconnected_custom_rules(self): """Make sure custom rules that aren't hooked to any other rules still get included in the grammar and that lone ones get set as the default. Incidentally test Grammar's `rules` default arg. """ grammar = Grammar(one_char=lambda text, pos: pos + 1).default('one_char') s = '4' self.assertEqual(grammar.parse(s), Node(grammar['one_char'], s, 0, 1)) def test_callability_of_routines(self): self.assertTrue(is_callable(function_rule)) self.assertTrue(is_callable(self.method_rule)) self.assertTrue(is_callable(self.rules['descriptor_rule'])) def test_callability_custom_rules(self): """Confirms that functions, methods and method descriptors can all be used to supply custom grammar rules. 
""" grammar = Grammar(""" default = function method descriptor """, function=function_rule, method=self.method_rule, descriptor=self.rules['descriptor_rule'], ) result = grammar.parse('functionmethoddescriptor') rule_names = [node.expr.name for node in result.children] self.assertEqual(rule_names, ['function', 'method', 'descriptor']) def test_lazy_default_rule(self): """Make sure we get an actual rule set as our default rule, even when the first rule has forward references and is thus a LazyReference at some point during grammar compilation. """ grammar = Grammar(r""" styled_text = text text = "hi" """) self.assertEqual(grammar.parse('hi'), Node(grammar['text'], 'hi', 0, 2)) def test_immutable_grammar(self): """Make sure that a Grammar is immutable after being created.""" grammar = Grammar(r""" foo = 'bar' """) def mod_grammar(grammar): grammar['foo'] = 1 self.assertRaises(TypeError, mod_grammar, [grammar]) def mod_grammar(grammar): new_grammar = Grammar(r""" baz = 'biff' """) grammar.update(new_grammar) self.assertRaises(AttributeError, mod_grammar, [grammar]) def test_repr(self): self.assertTrue(repr(Grammar(r'foo = "a"'))) def test_rule_ordering_is_preserved(self): grammar = Grammar('\n'.join('r%s = "something"' % i for i in range(100))) self.assertEqual( list(grammar.keys()), ['r%s' % i for i in range(100)]) def test_rule_ordering_is_preserved_on_shallow_copies(self): grammar = Grammar('\n'.join('r%s = "something"' % i for i in range(100)))._copy() self.assertEqual( list(grammar.keys()), ['r%s' % i for i in range(100)]) def test_repetitions(self): grammar = Grammar(r''' left_missing = "a"{,5} right_missing = "a"{5,} exact = "a"{5} range = "a"{2,5} optional = "a"? plus = "a"+ star = "a"* ''') should_parse = [ ("left_missing", ["a" * i for i in range(6)]), ("right_missing", ["a" * i for i in range(5, 8)]), ("exact", ["a" * 5]), ("range", ["a" * i for i in range(2, 6)]), ("optional", ["", "a"]), ("plus", ["a", "aa"]), ("star", ["", "a", "aa"]), ] for rule, examples in should_parse: for example in examples: assert grammar[rule].parse(example) should_not_parse = [ ("left_missing", ["a" * 6]), ("right_missing", ["a" * i for i in range(5)]), ("exact", ["a" * i for i in list(range(5)) + list(range(6, 10))]), ("range", ["a" * i for i in list(range(2)) + list(range(6, 10))]), ("optional", ["aa"]), ("plus", [""]), ("star", ["b"]), ] for rule, examples in should_not_parse: for example in examples: with pytest.raises(ParseError): grammar[rule].parse(example) def test_equal(self): grammar_def = (r""" x = y / z / "" y = "y" x z = "z" x """) assert Grammar(grammar_def) == Grammar(grammar_def) self.assertEqual(Grammar(rule_syntax), Grammar(rule_syntax)) self.assertNotEqual(Grammar('expr = ~"[a-z]{1,3}"'), Grammar('expr = ~"[a-z]{2,3}"')) self.assertNotEqual(Grammar('expr = ~"[a-z]{1,3}"'), Grammar('expr = ~"[a-z]{1,4}"')) self.assertNotEqual(Grammar('expr = &"a"'), Grammar('expr = !"a"')) class TokenGrammarTests(TestCase): """Tests for the TokenGrammar class and associated machinery""" def test_parse_success(self): """Token literals should work.""" s = [Token('token1'), Token('token2')] grammar = TokenGrammar(""" foo = token1 "token2" token1 = "token1" """) self.assertEqual(grammar.parse(s), Node(grammar['foo'], s, 0, 2, children=[ Node(grammar['token1'], s, 0, 1), Node(TokenMatcher('token2'), s, 1, 2)])) def test_parse_failure(self): """Parse failures should work normally with token literals.""" grammar = TokenGrammar(""" foo = "token1" "token2" """) with pytest.raises(ParseError) as e: 
grammar.parse([Token('tokenBOO'), Token('token2')]) assert "Rule 'foo' didn't match at" in str(e.value) def test_token_repr(self): t = Token('💣') self.assertTrue(isinstance(t.__repr__(), str)) self.assertEqual('', t.__repr__()) def test_token_star_plus_expressions(self): a = Token("a") b = Token("b") grammar = TokenGrammar(""" foo = "a"* bar = "a"+ """) assert grammar["foo"].parse([]) is not None assert grammar["foo"].parse([a]) is not None assert grammar["foo"].parse([a, a]) is not None with pytest.raises(ParseError): grammar["foo"].parse([a, b]) with pytest.raises(ParseError): grammar["foo"].parse([b]) assert grammar["bar"].parse([a]) is not None with pytest.raises(ParseError): grammar["bar"].parse([a, b]) with pytest.raises(ParseError): grammar["bar"].parse([b]) def test_precedence_of_string_modifiers(): # r"strings", etc. should be parsed as a single literal, not r followed # by a string literal. g = Grammar(r""" escaped_bell = r"\b" r = "irrelevant" """) assert isinstance(g["escaped_bell"], Literal) assert g["escaped_bell"].literal == "\\b" with pytest.raises(ParseError): g.parse("irrelevant\b") g2 = Grammar(r""" escaped_bell = r"\b" """) assert g2.parse("\\b") def test_binary_grammar(): g = Grammar(r""" file = header body terminator header = b"\xFF" length b"~" length = ~rb"\d+" body = ~b"[^\xFF]*" terminator = b"\xFF" """) length = 22 assert g.parse(b"\xff22~" + (b"a" * 22) + b"\xff") is not None def test_inconsistent_string_types_in_grammar(): with pytest.raises(VisitationError) as e: Grammar(r""" foo = b"foo" bar = "bar" """) assert e.value.original_class is BadGrammar with pytest.raises(VisitationError) as e: Grammar(r""" foo = ~b"foo" bar = "bar" """) assert e.value.original_class is BadGrammar # The following should parse without errors because they use the same # string types: Grammar(r""" foo = b"foo" bar = b"bar" """) Grammar(r""" foo = "foo" bar = "bar" """) def test_left_associative(): # Regression test for https://github.com/erikrose/parsimonious/issues/209 language_grammar = r""" expression = operator_expression / non_operator_expression non_operator_expression = number_expression operator_expression = expression "+" non_operator_expression number_expression = ~"[0-9]+" """ grammar = Grammar(language_grammar) with pytest.raises(LeftRecursionError) as e: grammar["operator_expression"].parse("1+2") assert "Parsimonious is a packrat parser, so it can't handle left recursion." 
in str(e.value) parsimonious-0.10.0/parsimonious/tests/test_nodes.py000066400000000000000000000145611430470422700227450ustar00rootroot00000000000000# -*- coding: utf-8 -*- from unittest import SkipTest, TestCase from parsimonious import Grammar, NodeVisitor, VisitationError, rule from parsimonious.expressions import Literal from parsimonious.nodes import Node class HtmlFormatter(NodeVisitor): """Visitor that turns a parse tree into HTML fragments""" grammar = Grammar("""bold_open = '(('""") # just partial def visit_bold_open(self, node, visited_children): return '' def visit_bold_close(self, node, visited_children): return '' def visit_text(self, node, visited_children): """Return the text verbatim.""" return node.text def visit_bold_text(self, node, visited_children): return ''.join(visited_children) class ExplosiveFormatter(NodeVisitor): """Visitor which raises exceptions""" def visit_boom(self, node, visited_children): raise ValueError class SimpleTests(TestCase): def test_visitor(self): """Assert a tree gets visited correctly.""" grammar = Grammar(r''' bold_text = bold_open text bold_close text = ~'[a-zA-Z 0-9]*' bold_open = '((' bold_close = '))' ''') text = '((o hai))' tree = Node(grammar['bold_text'], text, 0, 9, [Node(grammar['bold_open'], text, 0, 2), Node(grammar['text'], text, 2, 7), Node(grammar['bold_close'], text, 7, 9)]) self.assertEqual(grammar.parse(text), tree) result = HtmlFormatter().visit(tree) self.assertEqual(result, 'o hai') def test_visitation_exception(self): self.assertRaises(VisitationError, ExplosiveFormatter().visit, Node(Literal(''), '', 0, 0)) def test_str(self): """Test str and unicode of ``Node``.""" n = Node(Literal('something', name='text'), 'o hai', 0, 5) good = '' self.assertEqual(str(n), good) def test_repr(self): """Test repr of ``Node``.""" s = 'hai ö' boogie = 'böogie' n = Node(Literal(boogie), s, 0, 3, children=[ Node(Literal(' '), s, 3, 4), Node(Literal('ö'), s, 4, 5)]) self.assertEqual(repr(n), str("""s = {hai_o}\nNode({boogie}, s, 0, 3, children=[Node({space}, s, 3, 4), Node({o}, s, 4, 5)])""").format( hai_o=repr(s), boogie=repr(Literal(boogie)), space=repr(Literal(" ")), o=repr(Literal("ö")), ) ) def test_parse_shortcut(self): """Exercise the simple case in which the visitor takes care of parsing.""" self.assertEqual(HtmlFormatter().parse('(('), '') def test_match_shortcut(self): """Exercise the simple case in which the visitor takes care of matching.""" self.assertEqual(HtmlFormatter().match('((other things'), '') class CoupledFormatter(NodeVisitor): @rule('bold_open text bold_close') def visit_bold_text(self, node, visited_children): return ''.join(visited_children) @rule('"(("') def visit_bold_open(self, node, visited_children): return '' @rule('"))"') def visit_bold_close(self, node, visited_children): return '' @rule('~"[a-zA-Z 0-9]*"') def visit_text(self, node, visited_children): """Return the text verbatim.""" return node.text class DecoratorTests(TestCase): def test_rule_decorator(self): """Make sure the @rule decorator works.""" self.assertEqual(CoupledFormatter().parse('((hi))'), 'hi') def test_rule_decorator_subclassing(self): """Make sure we can subclass and override visitor methods without blowing away the rules attached to them.""" class OverridingFormatter(CoupledFormatter): def visit_text(self, node, visited_children): """Return the text capitalized.""" return node.text.upper() @rule('"not used"') def visit_useless(self, node, visited_children): """Get in the way. 
Tempt the metaclass to pave over the superclass's grammar with a new one.""" raise SkipTest("I haven't got around to making this work yet.") self.assertEqual(OverridingFormatter().parse('((hi))'), 'HI') class PrimalScream(Exception): pass class SpecialCasesTests(TestCase): def test_unwrapped_exceptions(self): class Screamer(NodeVisitor): grammar = Grammar("""greeting = 'howdy'""") unwrapped_exceptions = (PrimalScream,) def visit_greeting(self, thing, visited_children): raise PrimalScream('This should percolate up!') self.assertRaises(PrimalScream, Screamer().parse, 'howdy') def test_node_inequality(self): node = Node(Literal('12345'), 'o hai', 0, 5) self.assertTrue(node != 5) self.assertTrue(node != None) self.assertTrue(node != Node(Literal('23456'), 'o hai', 0, 5)) self.assertTrue(not (node != Node(Literal('12345'), 'o hai', 0, 5))) def test_generic_visit_NotImplementedError_unnamed_node(self): """ Test that generic_visit provides informative error messages when visitors are not defined. Regression test for https://github.com/erikrose/parsimonious/issues/110 """ class MyVisitor(NodeVisitor): grammar = Grammar(r''' bar = "b" "a" "r" ''') unwrapped_exceptions = (NotImplementedError, ) with self.assertRaises(NotImplementedError) as e: MyVisitor().parse('bar') self.assertIn("No visitor method was defined for this expression: 'b'", str(e.exception)) def test_generic_visit_NotImplementedError_named_node(self): """ Test that generic_visit provides informative error messages when visitors are not defined. """ class MyVisitor(NodeVisitor): grammar = Grammar(r''' bar = myrule myrule myrule myrule = ~"[bar]" ''') unwrapped_exceptions = (NotImplementedError, ) with self.assertRaises(NotImplementedError) as e: MyVisitor().parse('bar') self.assertIn("No visitor method was defined for this expression: myrule = ~'[bar]'", str(e.exception)) parsimonious-0.10.0/parsimonious/utils.py000066400000000000000000000021761430470422700205730ustar00rootroot00000000000000"""General tools which don't depend on other parts of Parsimonious""" import ast class StrAndRepr(object): """Mix-in which gives the class the same __repr__ and __str__.""" def __repr__(self): return self.__str__() def evaluate_string(string): """Piggyback on Python's string support so we can have backslash escaping and niceties like \n, \t, etc. This also supports: 1. b"strings", allowing grammars to parse bytestrings, in addition to str. 2. r"strings" to simplify regexes. """ return ast.literal_eval(string) class Token(StrAndRepr): """A class to represent tokens, for use with TokenGrammars You will likely want to subclass this to hold additional information, like the characters that you lexed to create this token. Alternately, feel free to create your own class from scratch. The only contract is that tokens must have a ``type`` attr. 
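    An illustrative sketch (not from the library docs) of a subclass that
    also remembers the lexed text, fed to a ``TokenGrammar``; the
    ``LexedToken`` name and the toy grammar are invented for the example::

        from parsimonious.grammar import TokenGrammar
        from parsimonious.utils import Token

        class LexedToken(Token):
            def __init__(self, type, text):
                super().__init__(type)
                self.text = text

        grammar = TokenGrammar('''
            greeting = hello name
            hello = "hello"
            name = "name"
            ''')
        grammar.parse([LexedToken('hello', 'Hello'), LexedToken('name', 'Alice')])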
""" __slots__ = ['type'] def __init__(self, type): self.type = type def __str__(self): return '' % (self.type,) def __eq__(self, other): return self.type == other.type parsimonious-0.10.0/setup.py000066400000000000000000000026751430470422700160470ustar00rootroot00000000000000from sys import version_info from io import open from setuptools import setup, find_packages long_description=open('README.rst', 'r', encoding='utf8').read() setup( name='parsimonious', version='0.10.0', description='(Soon to be) the fastest pure-Python PEG parser I could muster', long_description=long_description, long_description_content_type='text/x-rst', author='Erik Rose', author_email='erikrose@grinchcentral.com', license='MIT', packages=find_packages(exclude=['ez_setup']), test_suite='tests', url='https://github.com/erikrose/parsimonious', include_package_data=True, install_requires=['regex>=2022.3.15'], classifiers=[ 'Intended Audience :: Developers', 'Natural Language :: English', 'Development Status :: 3 - Alpha', 'License :: OSI Approved :: MIT License', 'Operating System :: OS Independent', 'Programming Language :: Python :: 3 :: Only', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.7', 'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9', 'Programming Language :: Python :: 3.10', 'Topic :: Scientific/Engineering :: Information Analysis', 'Topic :: Software Development :: Libraries', 'Topic :: Text Processing :: General'], keywords=['parse', 'parser', 'parsing', 'peg', 'packrat', 'grammar', 'language'], ) parsimonious-0.10.0/tox.ini000066400000000000000000000003241430470422700156350ustar00rootroot00000000000000[tox] envlist = py37, py38, py39, py310 [gh-actions] python = 3.7: py37 3.8: py38 3.9: py39 3.10: py310 [testenv] usedevelop = True commands = py.test --tb=native {posargs:parsimonious} deps = pytest