instaparse-1.4.7/.gitattributes:

* text auto
*.clj text
*.md text
*.png binary

instaparse-1.4.7/.gitignore:

/target
/lib
/classes
/checkouts
/bin
.project
.classpath
pom.xml
*.jar
*.class
.lein-deps-sum
.lein-failures
.lein-plugins
ideas.txt
benchmarks.txt
todo.txt
/.settings
.nrepl-port
.lein-repl-history
*~
*#*#
.cljs_node_repl/
.idea/
*.iml
*.asc
.nrepl-history

instaparse-1.4.7/CHANGES.md:

# Instaparse Change Log
## 1.4.7
### Enhancements
* `visualize` now supports `:output-file :buffered-image`, which returns a java.awt.image.BufferedImage object.
### Bugfixes
* Fixed problem where `visualize` with `:output-file` didn't work on rootless trees.
## 1.4.6
### Performance improvements
* Better performance for ABNF grammars in Clojurescript.
## 1.4.5
### Bugfixes
* Fixed regression in 1.4.4 involving parsers based off of URIs.
* defparser now supports the full range of relevant parser options.
## 1.4.4
### Enhancements
* Instaparse is now cross-platform compatible between Clojure and Clojurescript.
### Features
* defparser - builds parser at compile time
## 1.4.3
### Bugfixes
* Fixed bug with insta/transform on tree with hidden root tag and strings at the top level of the tree.
## 1.4.2
### Bugfixes
* Fixed problem with counted repetitions in ABNF.
## 1.4.1
### Features
* New function `add-line-and-column-info-to-metadata` in the instaparse.core namespace.
### Enhancements
* Added new combinators for unicode character ranges, for better portability to Clojurescript.
### Bugfixes
* Improved compatibility with boot, which allows having multiple versions of Clojure on the classpath, by changing string-reader to detect which version of Clojure it is running on (necessary because of a breaking change in Clojure 1.7).
* Fixed bug with the way failure messages were printed in certain cases.
## 1.4.0
### Bugfixes
* In 1.3.6, parsing of any CharSequence was introduced; however, the error messages for failed parses weren't printing properly. This has been fixed.
* 1.4.0 uses a more robust algorithm for handling nested negative lookaheads, in
response to a bug report where the existing mechanism produced incorrect parses
(in addition to the correct parse) for a very unusual case.
### Enhancements
* New support for tracing the steps the parser goes through. Call your parser with
the optional flag `:trace true`. The first time you use this flag, it triggers a
recompilation of the code with additional tracing and profiling steps.
To restore the code to its non-instrumented form, call `(insta/disable-tracing!)`.
## 1.3.6
### Enhancements
* Modified for compatibility with Clojure 1.7.0-alpha6
* Instaparse now can parse anything supporting the CharSequence interface, not just strings.
Specifically, this allows instaparse to operate on StringBuilder objects.
## 1.3.5
### Bugfixes
* Fixed bug with `transform` on hiccup data structures with numbers or other atomic data as leaves.
* Fixed bug with character concatenation support in ABNF grammar
### Enhancements
* Added support for Unicode characters to ABNF.
## 1.3.4
### Enhancements
* Modified for compatibility with Clojure 1.7.0-alpha2.
## 1.3.3
### Enhancements
Made two changes to make it possible to use instaparse on Google App Engine.
* Removed dependency on javax.swing.text.Segment class.
* Added `:no-slurp true` keyword option to `insta/parser` to disable URI slurping behavior, since GAE does not support slurp.
## 1.3.2
### Bugfixes
* Regular expressions on empty strings weren't properly returning a failure.
## 1.3.1
### Enhancements
* Updated tests to use Clojure 1.6.0's final release.
* Added `:ci-string true` flag to `insta/parser`.
## 1.3.0
### Compatibility with Clojure 1.6
## 1.2.16
### Bugfixes
* Calling `empty` on a FlattenOnDemandVector now returns [].
## 1.2.15
### Enhancements
* :auto-whitespace can now take the keyword :standard or :comma to access one of the predefined whitespace parsers.
### Bugfixes
* Fixed newline problem visualizing parse trees on Linux.
* Fixed problem with visualizing rootless trees.
## 1.2.11
### Minor enhancements
* Further refinements to the way ordered choice interacts with epsilon parsers.
## 1.2.10
### Bugfixes
* Fixed bug introduced by 1.2.9 affecting ordered choice.
## 1.2.9
### Bugfixes
* Fixed bug where ordered choice was ignoring epsilon parser.
## 1.2.8
### Bugfixes
* Fixed bug introduced by 1.2.7, affecting printing of grammars with regexes.
### Enhancements
* Parser printing format now includes <> hidden information and tags.
## 1.2.7
### Bugfixes
* Fixed bug when regular expression contains | character.
## 1.2.6
### Bugfixes
* Changed the pre-condition assertion for the auto-whitespace option, which was causing a problem with "lein jar".
## 1.2.5
### Bugfixes
* Improved handling of unusual characters in ABNF grammars.
## 1.2.4
### Bugfixes
* When parsing in :total mode with :enlive as the output format, changed the content of failure node from vector to list to match the rest of the enlive output.
## 1.2.3
### Bugfixes
* Fixed problem when epsilon was the only thing in a nonterminal, e.g., "S = epsilon"
### Features
* Added experimental `:auto-whitespace` feature. See the [Experimental Features Document](docs/ExperimentalFeatures.md) for more details.
## 1.2.2
### Bugfixes
* Fixed reflection warning.
## 1.2.1
### Bugfixes
* I had accidentally left a dependency on tools.trace in the repeat.clj file, used while I was debugging that namespace. Removed it.
## 1.2.0
### New Features
* `span` function returns substring indexes into the parsed text for a portion of the parse tree.
* `visualize` function draws the parse tree, using rhizome and graphviz if installed.
* `:optimize :memory` flag that, for suitable parsers, will perform the parsing in discrete chunks, using less memory.
* New parsing flag to undo the effect of the <> hide notation.
+ `(my-parser text :unhide :tags)` - reveals tags, i.e., `<>` applied on the left-hand sides of rules.
+ `(my-parser text :unhide :content)` - reveals content hidden on the right-hand side of rules with `<>`
+ `(my-parser text :unhide :all)` - reveals both tags and content.
### Notable Performance Improvements
* Dramatic performance improvement (quadratic time reduced to linear) when repetition parsers (+ or *) operate on text whose parse tree contains a large number of repetitions.
* Performance improvement for regular expressions.
### Minor Enhancements
* Added more support to IncrementalVector for a wider variety of vector operations, including subvec, nth, and vec.
## 1.1.0
### Breaking Changes
* When you run a parser in "total" mode, the failure node is no longer tagged with `:failure`, but instead is tagged with `:instaparse/failure`.
### New Features
* Comments now supported in CFGs. Use (* and *) notation.
* Added `ebnf` combinator to the `instaparse/combinators` namespace. This new combinator converts string specifications to the combinator-built equivalent. See combinator section of the updated tutorial for details.
* ABNF: can now create a parser from a specification using `:input-format :abnf` for ABNF parser syntax.
* New combinators related to ABNF:
1. `abnf` -- converts ABNF string fragments to combinators.
2. `string-ci` -- case-insensitive strings.
3. `rep` -- between m and n repetitions.
* New core function related to ABNF:
`set-default-input-format!` -- initially defaults to :ebnf
### Minor Enhancements
* Added comments to regexes used by the parser that processes the context-free grammar syntax, improving the readability of error messages if you have a faulty grammar specification.
### Bug Fixes
* Backslashes in front of quotation mark were escaping the quotation mark, even if the backslash itself was escaped.
* Unescaped double-quote marks weren't properly handled, e.g., (parser "A = '\"'").
* Nullable Plus: ((parser "S = ('a'?)+") "") previously returned a failure, now returns [:S]
* Fixed problem with failure reporting that would occur if parse failed on an input that ended with a newline character.

instaparse-1.4.7/README.md:

# Instaparse 1.4.7
*What if context-free grammars were as easy to use as regular expressions?*
## Features
Instaparse aims to be the simplest way to build parsers in Clojure.
+ Turns *standard EBNF or ABNF notation* for context-free grammars into an executable parser that takes a string as an input and produces a parse tree for that string.
+ *No Grammar Left Behind*: Works for *any* context-free grammar, including *left-recursive*, *right-recursive*, and *ambiguous* grammars.
+ Extends the power of context-free grammars with PEG-like syntax for lookahead and negative lookahead.
+ Supports both of Clojure's most popular tree formats (hiccup and enlive) as output targets.
+ Detailed reporting of parse errors.
+ Optionally produces lazy sequence of all parses (especially useful for diagnosing and debugging ambiguous grammars).
+ "Total parsing" mode where leftover string is embedded in the parse tree.
+ Optional combinator library for building grammars programmatically.
+ Performant.
## Quickstart
Instaparse requires Clojure v1.5.1 or later, or ClojureScript v1.7.28 or later.
Add the following line to your leiningen dependencies:
[instaparse "1.4.7"]
Require instaparse in your namespace header:
(ns example.core
(:require [instaparse.core :as insta]))
### Creating your first parser
Here's a typical example of a context-free grammar one might see in a textbook on automata and/or parsing. It is a common convention in many textbooks to use the capital letter `S` to indicate the starting rule, so for this example, we'll follow that convention:
S = AB*
AB = A B
A = 'a'+
B = 'b'+
This looks for alternating runs of 'a' followed by runs of 'b'. So for example "aaaaabbaaabbb" satisfies this grammar. On the other hand,
"aaabbbbaa" does not (because the grammar specifies that each run of 'a' must be followed by a run of 'b').
With instaparse, turning this grammar into an executable parser is as simple as typing the grammar in:
(def as-and-bs
(insta/parser
"S = AB*
AB = A B
A = 'a'+
B = 'b'+"))
=> (as-and-bs "aaaaabbbaaaabb")
[:S
[:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]]
[:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
At this point, if you know EBNF notation for context-free grammars, you probably know enough to dive in and start playing around. However, instaparse is rich with features, so if you want to know the full scope of what it can do, read on...
## Tutorial
### Notation
Instaparse supports most of the common notations for context-free grammars. For example, a popular alternative to `*` is to surround the term with curly braces `{}`, and a popular alternative to `?` is to surround the term with square brackets `[]`. Rules can be specified with `=`, `:`, `:=`, or `::=`. Rules can optionally end with `;`. Instaparse is very flexible in terms of how you use whitespace (as in Clojure, `,` is treated as whitespace) and you can liberally use parentheses for grouping. Terminal strings can be enclosed in either single quotes or double quotes (however, since you are writing the grammar specification inside of a Clojure double-quoted string, any use of double-quotes would have to be escaped, therefore single-quotes are easier to read). Newlines are optional; you can put the entire grammar on one line if you desire. In fact, all these notations can be mixed up in the same specification if you want.
So here is an equally valid (but messier) way to write out the exact same grammar, just to illustrate the flexibility that you have:
(def as-and-bs-alternative
(insta/parser
"S:={AB} ;
AB ::= (A, B)
A : \"a\" + ;
B ='b' + ;"))
Note that regardless of the notation you use in your specification, when you evaluate the parser at the REPL, the rules will be pretty-printed:
=> as-and-bs-alternative
S = AB*
AB = A B
A = "a"+
B = "b"+
Here's a quick guide to the syntax for defining context-free grammars:
| Category | Notations | Example |
|----------|-----------|---------|
| Rule | : := ::= = | S = A |
| End of rule | ; . (optional) | S = A; |
| Alternation | \| | A \| B |
| Concatenation | whitespace or , | A B |
| Grouping | () | (A \| B) C |
| Optional | ? [] | A? [A] |
| One or more | + | A+ |
| Zero or more | * {} | A* {A} |
| String terminal | "" '' | 'a' "a" |
| Regex terminal | #"" #'' | #'a' #"a" |
| Epsilon | Epsilon epsilon EPSILON eps ε "" '' | S = 'a' S \| Epsilon |
| Comment | (* *) | (* This is a comment *) |
As is the norm in EBNF notation, concatenation has a higher precedence than alternation, so in the absence of parentheses, something like `A B | C D` means `(A B) | (C D)`.
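For instance, here is a tiny sketch of that grouping in action (the parser name `precedence-example` is just for illustration):

```clojure
;; A minimal sketch: with no parentheses, 'a' 'b' | 'c' 'd' groups as ('a' 'b') | ('c' 'd')
(def precedence-example
  (insta/parser
    "S = 'a' 'b' | 'c' 'd'"))

;; (precedence-example "ab") => [:S "a" "b"]
;; (precedence-example "cd") => [:S "c" "d"]
;; (precedence-example "ad") fails -- 'a' 'd' is not one of the two alternatives
```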
### Input from resource file
Parsers can also be built from a specification contained in a file, either locally or on the web. For example, I stored on github a file with a simple grammar to parse text containing a single 'a' surrounded optionally by whitespace. The specification in the file looks like this:
S = #"\s*" "a" #"\s*"
Building the parser from the URI is easy:
(insta/parser "https://gist.github.com/Engelberg/5283346/raw/77e0b1d0cd7388a7ddf43e307804861f49082eb6/SingleA")
This provides a convenient way to share parser specifications over the Internet.
You can also use a specification contained in a local resource in your classpath:
(insta/parser (clojure.java.io/resource "myparser.bnf"))
### `defparser`
On ClojureScript, the `(def my-parser (insta/parser "..."))` use case
has the following disadvantages:
- ClojureScript does not support `slurp`, so `parser` cannot automatically read from file paths / URLs.
- Having to parse a grammar string at runtime can impact the startup performance of an application or webpage.
To solve those problems, a macro `instaparse.core/defparser` is
provided that, if given a string for a grammar specification, will
parse that as a grammar up front and emit more performant code.
```clojure
;; Clojure
(:require [instaparse.core :as insta :refer [defparser]])
;; ClojureScript
(:require [instaparse.core :as insta :refer-macros [defparser]])
=> (time (def p (insta/parser "S = A B; A = 'a'+; B = 'b'+")))
"Elapsed time: 4.368179 msecs"
#'user/p
=> (time (defparser p "S = A B; A = 'a'+; B = 'b'+")) ; the meat of the work happens at macro-time
"Elapsed time: 0.091689 msecs"
#'user/p
=> (defparser p "https://gist.github.com/Engelberg/5283346/raw/77e0b1d0cd7388a7ddf43e307804861f49082eb6/SingleA") ; works even in cljs!
#'user/p
=> (defparser p [:S (c/plus (c/string "a"))]) ; still works, but won't do any extra magic behind the scenes
#'user/p
=> (defparser p "S = 1*'a'" :input-format :abnf :output-format :enlive) ; takes additional keyword arguments
#'user/p
```
`defparser` is primarily useful in Clojurescript, but works in both Clojure and Clojurescript for cross-platform compatibility.
### Escape characters
Putting your grammar in a separate resource file has an additional advantage -- it provides a very straightforward "what you see is what you get" view of the grammar. The only escape characters needed are the ordinary escape characters for strings and regular expressions (additionally, instaparse also supports `\'` inside single-quoted strings).
When you specify a grammar directly in your Clojure code as a double-quoted string, extra escape characters may be needed in the strings and regexes of your grammar:
1. All `"` string and regex delimiters must be turned into `\"` or replaced with a single-quote `'`.
2. All backslash characters in your strings and regexes `\` should be escaped and turned into `\\`. (In some cases you can get away with not escaping the backslash, but it is best practice to be consistent and always do it.)
For example, the above grammar could be written in Clojure as:
(insta/parser "S = #'\\s*' 'a' #'\\s*'")
It is unfortunate that this extra level of escaping is necessary. Many programming languages provide some sort of facility for creating "raw strings" which are taken verbatim (e.g., Python's triple-quoted strings). I don't understand why Clojure does not support raw strings, but it doesn't.
Fortunately, for many grammars this is a non-issue, and if the escaping does get bad enough to affect readability, there is always the option of storing the grammar in a separate file.
### Output format
When building parsers, you can specify an output format of either :hiccup or :enlive. :hiccup is the default, but here is an example of the above parser with :enlive set as the output format:
(def as-and-bs-enlive
(insta/parser
"S = AB*
AB = A B
A = 'a'+
B = 'b'+"
:output-format :enlive))
=> (as-and-bs-enlive "aaaaabbbaaaabb")
{:tag :S,
:content
({:tag :AB,
:content
({:tag :A, :content ("a" "a" "a" "a" "a")}
{:tag :B, :content ("b" "b" "b")})}
{:tag :AB,
:content
({:tag :A, :content ("a" "a" "a" "a")}
{:tag :B, :content ("b" "b")})})}
I find the hiccup format to be pleasant and compact, especially when working with the parsed output in the REPL. The main advantage of the enlive format is that it allows you to use the very powerful enlive library to select and transform nodes in your tree.
If you want to alter instaparse's default output format:
(insta/set-default-output-format! :enlive)
### Controlling the tree structure
The principles of instaparse's output trees:
- Every rule equals one level of nesting in the tree.
- Each level is automatically tagged with the name of the rule.
To better understand this, take a look at these two variations of the same parser we've been discussing:
(def as-and-bs-variation1
(insta/parser
"S = AB*
AB = 'a'+ 'b'+"))
=> (as-and-bs-variation1 "aaaaabbbaaaabb")
[:S
[:AB "a" "a" "a" "a" "a" "b" "b" "b"]
[:AB "a" "a" "a" "a" "b" "b"]]
(def as-and-bs-variation2
(insta/parser
"S = ('a'+ 'b'+)*"))
=> (as-and-bs-variation2 "aaaaabbbaaaabb")
[:S "a" "a" "a" "a" "a" "b" "b" "b" "a" "a" "a" "a" "b" "b"]
#### Hiding content
For this next example, let's consider a parser that looks for a sequence of a's or b's surrounded by parens.
(def paren-ab
(insta/parser
"paren-wrapped = '(' seq-of-A-or-B ')'
seq-of-A-or-B = ('a' | 'b')*"))
=> (paren-ab "(aba)")
[:paren-wrapped "(" [:seq-of-A-or-B "a" "b" "a"] ")"]
It's very common in parsers to have elements that need to be present in the input and parsed, but we'd rather not have them appear in the output. In the above example, the parens are essential to the grammar yet the tree would be much easier to read and manipulate if we could hide those parens; once the string has been parsed, the parens themselves carry no additional semantic value.
In instaparse, you can use angle brackets `<>` to hide parsed elements, suppressing them from the tree output.
(def paren-ab-hide-parens
(insta/parser
"paren-wrapped = <'('> seq-of-A-or-B <')'>
seq-of-A-or-B = ('a' | 'b')*"))
=> (paren-ab-hide-parens "(aba)")
[:paren-wrapped [:seq-of-A-or-B "a" "b" "a"]]
Voila! The parens "(" and ")" tokens have been hidden. Angle brackets are a powerful tool for hiding whitespace and other delimiters from the output.
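Here is a small sketch of the same technique applied to a different delimiter (the parser name and grammar are illustrative, not part of the tutorial's running examples):

```clojure
;; A minimal sketch of hiding a delimiter: the commas and surrounding whitespace
;; are required by the grammar but suppressed from the output tree.
(def comma-separated-as
  (insta/parser
    "list = 'a' (<#'\\s*,\\s*'> 'a')*"))

;; (comma-separated-as "a, a,a") => [:list "a" "a" "a"]
```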
#### Hiding tags
Continuing with the same example parser, let's say we decide that the :seq-of-A-or-B tag is also superfluous -- we'd rather not have that extra nesting level appear in the output tree.
We've already seen that one option is to simply lift the right-hand side of the seq-of-A-or-B rule into the paren-wrapped rule, as follows:
(def paren-ab-manually-flattened
(insta/parser
"paren-wrapped = <'('> ('a'|'b')* <')'>"))
=> (paren-ab-manually-flattened "(aba)")
[:paren-wrapped "a" "b" "a"]
But sometimes, it is ugly or impractical to do this. It would be nice to have a way to express the concept of "repeated sequence of a's and b's" as a separate rule, without necessarily introducing an additional level of nesting.
Again, the angle brackets come to the rescue. We simply use the angle brackets to hide the *name* of the rule. Since each name corresponds to a level of nesting, hiding the name means the parsed contents of that rule will appear in the output tree without the tag and its associated new level of nesting.
(def paren-ab-hide-tag
(insta/parser
"paren-wrapped = <'('> seq-of-A-or-B <')'>
<seq-of-A-or-B> = ('a' | 'b')*"))
=> (paren-ab-hide-tag "(aba)")
[:paren-wrapped "a" "b" "a"]
You might wonder what would happen if we hid the root tag as well. Let's take a look:
(def paren-ab-hide-both-tags
(insta/parser
"<paren-wrapped> = <'('> seq-of-A-or-B <')'>
<seq-of-A-or-B> = ('a' | 'b')*"))
=> (paren-ab-hide-both-tags "(aba)")
("a" "b" "a")
With no root tag, the parser just returns a sequence of children. So in the above example where *all* the tags are hidden, you just get a sequence of parsed elements. Sometimes that's what you want, but in general, I recommend that you don't hide the root tag, ensuring the output is a well-formed tree.
#### Revealing hidden information
Sometimes, after setting up the parser to hide content and tags, you temporarily want to reveal the hidden information, perhaps for debugging purposes.
The optional keyword argument `:unhide :content` reveals the hidden content in the tree output.
=> (paren-ab-hide-both-tags "(aba)" :unhide :content)
("(" "a" "b" "a" ")")
The optional keyword argument `:unhide :tags` reveals the hidden tags in the tree output.
=> (paren-ab-hide-both-tags "(aba)" :unhide :tags)
[:paren-wrapped [:seq-of-A-or-B "a" "b" "a"]]
The optional keyword argument `:unhide :all` reveals all hidden information.
=> (paren-ab-hide-both-tags "(aba)" :unhide :all)
[:paren-wrapped "(" [:seq-of-A-or-B "a" "b" "a"] ")"]
### No Grammar Left Behind
One of the things that really sets instaparse apart from other Clojure parser generators is that it can handle any context-free grammar. For example, some parsers only accept LL(1) grammars, others accept LALR grammars. Many of the libraries use a recursive-descent strategy that fails for left-recursive grammars. If you are willing to learn the esoteric restrictions posed by the library, it is usually possible to rework your grammar to fit that mold. But instaparse lets you write your grammar in whatever way is most natural.
#### Right recursion
No problem:
=> ((insta/parser "S = 'a' S | Epsilon") "aaaa")
[:S "a" [:S "a" [:S "a" [:S "a" [:S]]]]]
Note the use of Epsilon, a common name for the "empty" parser that always succeeds without consuming any characters. You can also just use an empty string if you prefer.
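As a quick check (a sketch, not one of the original examples), the same grammar written with an empty string in place of `Epsilon` should produce the same tree, since the notation table above lists `''` as just another way to write Epsilon:

```clojure
;; A minimal sketch: '' is treated as Epsilon, so the tree is identical
=> ((insta/parser "S = 'a' S | ''") "aaaa")
[:S "a" [:S "a" [:S "a" [:S "a" [:S]]]]]
```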
#### Left recursion
No problem:
=> ((insta/parser "S = S 'a' | Epsilon") "aaaa")
[:S [:S [:S [:S [:S] "a"] "a"] "a"] "a"]
As you can see, either of these recursive parsers will generate a parse tree that is deeply nested. Unfortunately, Clojure does not handle deeply-nested data structures very well. If you were to run the above parser on, say, a string of 20,000 a's, instaparse will happily try to generate the corresponding parse tree but then Clojure will stack overflow when it tries to hash the tree.
So, as is often advisable in Clojure, use recursion judiciously in a way that will keep your trees a manageable depth. For the above parser, it is almost certainly better to just do:
=> ((insta/parser "S = 'a'*") "aaaa")
[:S "a" "a" "a" "a"]
#### Infinite loops
If you specify an unterminated recursive grammar, instaparse will handle that gracefully as well and terminate with an error, rather than getting caught in an infinite loop:
=> ((insta/parser "S = S") "a")
Parse error at line 1, column 1:
a
^
### Ambiguous grammars
(def ambiguous
(insta/parser
"S = A A
A = 'a'*"))
This grammar is interesting because even though it specifies a repeated run of a's, there are many possible ways the grammar can chop it up. Our parser will faithfully return one of the possible parses:
=> (ambiguous "aaaaaa")
[:S [:A "a"] [:A "a" "a" "a" "a" "a"]]
However, we can do better. First, I should point out that `(ambiguous "aaaaaa")` is really just shorthand for `(insta/parse ambiguous "aaaaaa")`. Parsers are not actually functions, but are records that implement the function interface as a shorthand for calling the insta/parse function.
`insta/parse` is the way you ask a parser to produce a single parse tree. But there is another library function `insta/parses` that asks the parser to produce a lazy sequence of all parse trees. Compare:
=> (insta/parse ambiguous "aaaaaa")
[:S [:A "a"] [:A "a" "a" "a" "a" "a"]]
=> (insta/parses ambiguous "aaaaaa")
([:S [:A "a"] [:A "a" "a" "a" "a" "a"]]
[:S [:A "a" "a" "a" "a" "a" "a"] [:A]]
[:S [:A "a" "a"] [:A "a" "a" "a" "a"]]
[:S [:A "a" "a" "a"] [:A "a" "a" "a"]]
[:S [:A "a" "a" "a" "a"] [:A "a" "a"]]
[:S [:A "a" "a" "a" "a" "a"] [:A "a"]]
[:S [:A] [:A "a" "a" "a" "a" "a" "a"]])
You may wonder, why is this useful? Two reasons:
1. Sometimes it is difficult to remove ambiguity from a grammar, but the ambiguity doesn't really matter -- any parse tree will do. In these situations, instaparse's ability to work with ambiguous grammars can be quite handy.
2. Instaparse's ability to generate a sequence of all parses provides a powerful tool for debugging and thus *removing* ambiguity from an unintentionally ambiguous grammar. It turns out that when designing a context-free grammar, it's all too easy to accidentally introduce some unintentional ambiguity. Other parser tools often report ambiguities as cryptic "shift-reduce" messages, if at all. It's rather empowering to see the precise parse that instaparse finds when multiple parses are possible.
I generally test my parsers using the `insta/parses` function so I can immediately spot any ambiguities I've inadvertently introduced. When I'm confident the parser is not ambiguous, I switch to `insta/parse` or, equivalently, just call the parser as if it were a function.
### Regular expressions: A word of warning
As you can see from the above example, instaparse flexibly interprets * and +, trying all possible numbers of repetitions in order to create a parse tree. It is easy to become spoiled by this, and then forget that regular expressions have different semantics. Instaparse's regular expressions are just Clojure/Java regular expressions, which behave in a greedy manner.
To better understand this point, contrast the above parser with this one:
(def not-ambiguous
(insta/parser
"S = A A
A = #'a*'"))
=> (insta/parses not-ambiguous "aaaaaa")
([:S [:A "aaaaaa"] [:A ""]])
In this parser, the * is *inside* the regular expression, which means that it follows greedy regular expression semantics. Therefore, the first A eats all the a's it can, leaving no a's for the second A.
For this reason, it is wise to use regular expressions judiciously, mainly to express the patterns of your tokens, and leave the overall task of parsing to instaparse. Regular expressions can often be tortured and abused into serving as a crude parser, but don't do it! There's no need; with instaparse, you now have an equally convenient but more expressive tool to bring to bear on parsing problems.
Here is an example that I think is a tasteful use of regular expressions to split a sentence on whitespaces, categorizing the tokens as words or numbers:
(def words-and-numbers
(insta/parser
"sentence = token (<whitespace> token)*
<token> = word | number
whitespace = #'\\s+'
word = #'[a-zA-Z]+'
number = #'[0-9]+'"))
=> (words-and-numbers "abc 123 def")
[:sentence [:word "abc"] [:number "123"] [:word "def"]]
### Partial parses
By default, instaparse assumes you are looking for a parse tree that covers the entire input string. However, sometimes it may be useful to look at all the partial parses that satisfy the grammar while consuming some initial portion of the input string.
For this purpose, both `insta/parse` and `insta/parses` take a keyword argument, `:partial` that you simply set to true.
(def repeated-a
(insta/parser
"S = 'a'+"))
=> (insta/parses repeated-a "aaaaaa")
([:S "a" "a" "a" "a" "a" "a"])
=> (insta/parses repeated-a "aaaaaa" :partial true)
([:S "a"]
[:S "a" "a"]
[:S "a" "a" "a"]
[:S "a" "a" "a" "a"]
[:S "a" "a" "a" "a" "a"]
[:S "a" "a" "a" "a" "a" "a"])
Of course, using `:partial true` with `insta/parse` means that you'll only get the first parse result found.
=> (insta/parse repeated-a "aaaaaa" :partial true)
[:S "a"]
### PEG extensions
PEGs are a popular alternative to context-free grammars. On the surface, PEGs look very similar to CFGs, but the various choice operators are meant to be interpreted in a strictly greedy, ordered way that removes any ambiguity from the grammar. Some view this lack of ambiguity as an advantage, but it does limit the expressiveness of PEGs relative to context-free grammars. Furthermore, PEGs are usually tightly coupled to a specific parsing strategy that forbids left-recursion, further limiting their utility.
To combat that lost expressiveness, PEGs adopted a few operators that actually allow PEGs to do some things that CFGs cannot express. Even though the underlying paradigm is different, I've swiped these juicy bits from PEGs and included them in instaparse, giving instaparse more expressive power than either traditional PEGs or traditional CFGs.
Here is a table of the PEG operators that have been adapted for use in instaparse; I'll explain them in more detail shortly.
| Category | Notations | Example |
|----------|-----------|---------|
| Lookahead | & | &A |
| Negative lookahead | ! | !A |
| Ordered Choice | / | A / B |
#### Lookahead
The symbol for lookahead is `&`, and is generally used as part of a chain of concatenated parsers. Lookahead tests whether there are some number of characters that lie ahead in the text stream that satisfy the parser. It performs this test without actually "consuming" characters. Only if that lookahead test succeeds do the remaining parsers in the chain execute.
That's a mouthful, and hard to understand in the abstract, so let's look at a concrete example:
(def lookahead-example
(insta/parser
"S = &'ab' ('a' | 'b')+"))
The `('a' | 'b')+` part should be familiar at this point, and you hopefully recognize this as a parser that ensures the text is a string entirely of a's and b's. The other part, `&'ab'` is the lookahead. Notice how the `&` precedes the expression it is operating on. Before processing the `('a' | 'b')+`, it looks ahead to verify that the `'ab'` parser could hypothetically be satisfied by the upcoming characters. In other words, it will only accept strings that start off with the characters `ab`.
=> (lookahead-example "abaaaab")
[:S "a" "b" "a" "a" "a" "a" "b"]
=> (lookahead-example "bbaaaab")
Parse error at line 1, column 1:
bbaaaab
^
Expected:
"ab"
If you write something like `&'a'+` with no parens, this will be interpreted as `&('a'+)`.
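Here is a small sketch of that reading in action (the parser name is hypothetical) -- the lookahead applies to the whole repetition:

```clojure
;; A minimal sketch: &'a'+ is read as &('a'+), a lookahead for a run of one or more a's
(def lookahead-plus-example
  (insta/parser
    "S = &'a'+ ('a' | 'b')+"))

;; (lookahead-plus-example "aab") => [:S "a" "a" "b"]
;; (lookahead-plus-example "bba") fails -- no run of a's lies ahead at the start
```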
Here is my favorite example of lookahead, a parser that only succeeds on strings with a run of a's followed by a run of b's followed by a run of c's, where each of those runs must be the same length. If you've ever taken an automata course, you may remember that there is a very elegant proof that it is impossible to express this set of constraints with a pure context-free grammar. Well, with lookahead, it *is* possible:
(def abc
(insta/parser
"S = &(A 'c') 'a'+ B
A = 'a' A? 'b'
<B> = 'b' B? 'c'"))
=> (abc "aaabbbccc")
[:S "a" "a" "a" "b" "b" "b" "c" "c" "c"]
This example succeeds because there are three a's followed by three b's followed by three c's. Verifying that this parser fails for unequal runs and other mixes of letters is left as an exercise for the reader.
#### Negative lookahead
Negative lookahead uses the symbol `!`, and like `&`, it precedes the expression. It does exactly what you'd expect -- it performs a lookahead and confirms that the parser is *not* satisfied by the upcoming characters in the stream.
(def negative-lookahead-example
(insta/parser
"S = !'ab' ('a' | 'b')+"))
So this parser turns around the meaning of the previous example, accepting all strings of a's and b's that *don't* start off with `ab`.
=> (negative-lookahead-example "abaaaab")
Parse error at line 1, column 1:
abaaaab
^
Expected:
NOT "ab"
=> (negative-lookahead-example "bbaaaab")
[:S "b" "b" "a" "a" "a" "a" "b"]
One issue with negative lookahead is that it introduces the possibility of paradoxes. Consider:
S = !S 'a'
How should this parser behave on an input of "a"? If S succeeds, it should fail, and if it fails it should succeed.
PEGs simply don't allow this sort of grammar, but the whole spirit of instaparse is to flexibly allow recursive grammars, so I needed to find some way to handle it. Basically, I've taken steps to make sure that a paradoxical grammar won't cause instaparse to go into an infinite loop. It will terminate, but I make no promises about what the results will be. If you specify a paradoxical grammar, it's a garbage-in-garbage-out kind of situation (although to be clear, instaparse won't return complete garbage; it will make some sort of reasonable judgment about how to interpret it). If you're curious about how instaparse behaves with the above paradoxical example, here it is:
=> ((insta/parser "S = !S 'a'") "a")
[:S "a"]
Negative lookahead, when used properly, is an extremely powerful tool for removing ambiguity from your parser. To illustrate this, let's take a look at a very common parsing task, which involves tokenizing a string of characters into a combination of identifiers and reserved keywords. Our first attempt at this ends up ambiguous:
(def ambiguous-tokenizer
(insta/parser
"sentence = token (<whitespace> token)*
<token> = keyword | identifier
whitespace = #'\\s+'
identifier = #'[a-zA-Z]+'
keyword = 'cond' | 'defn'"))
=> (insta/parses ambiguous-tokenizer "defn my cond")
([:sentence [:identifier "defn"] [:identifier "my"] [:identifier "cond"]]
[:sentence [:keyword "defn"] [:identifier "my"] [:identifier "cond"]]
[:sentence [:identifier "defn"] [:identifier "my"] [:keyword "cond"]]
[:sentence [:keyword "defn"] [:identifier "my"] [:keyword "cond"]])
Each of our keywords not only fits the description of keyword, but also of identifier, so our parser doesn't know which way to parse those words. Instaparse makes no guarantee about what order it processes alternatives, and in this situation, we see that in fact, the combination we wanted was listed last among the possible parses. Negative lookahead provides an easy way to remove this ambiguity:
(def unambiguous-tokenizer
(insta/parser
"sentence = token (<whitespace> token)*
<token> = keyword | !keyword identifier
whitespace = #'\\s+'
identifier = #'[a-zA-Z]+'
keyword = 'cond' | 'defn'"))
=> (insta/parses unambiguous-tokenizer "defn my cond")
([:sentence [:keyword "defn"] [:identifier "my"] [:keyword "cond"]])
#### Ordered choice
As I mentioned earlier, a PEG's interpretation of `+`, `*`, and `|` are subtly different from the way those symbols are interpreted in CFGs. `+` and `*` are interpreted greedily, just as they are in regular expressions. `|` proceeds in a rather strict order, trying the first alternative first, and only proceeding if that one fails. To remind users that these multiple choices are strictly ordered, PEGs commonly use the forward slash `/` rather than `|`.
Although the PEG paradigm of forced order is antithetical to instaparse's flexible parsing strategy, I decided to co-opt the `/` notation to express a preference of one alternative over another.
With that in mind, let's look back at the `ambiguous-tokenizer` example from the previous section. In that example, we found that our desired parse, in which the keywords were classified, ended up at the bottom of the heap:
=> (insta/parses ambiguous-tokenizer "defn my cond")
([:sentence [:identifier "defn"] [:identifier "my"] [:identifier "cond"]]
[:sentence [:keyword "defn"] [:identifier "my"] [:identifier "cond"]]
[:sentence [:identifier "defn"] [:identifier "my"] [:keyword "cond"]]
[:sentence [:keyword "defn"] [:identifier "my"] [:keyword "cond"]])
We've already seen one way to remove the ambiguity by using negative lookahead. But now we have another tool in our toolbox, `/`, which will allow the ambiguity to remain, while bringing the desired parse result to the top of the list.
(def preferential-tokenizer
(insta/parser
"sentence = token (<whitespace> token)*
<token> = keyword / identifier
whitespace = #'\\s+'
identifier = #'[a-zA-Z]+'
keyword = 'cond' | 'defn'"))
=> (insta/parses preferential-tokenizer "defn my cond")
([:sentence [:keyword "defn"] [:identifier "my"] [:keyword "cond"]]
[:sentence [:identifier "defn"] [:identifier "my"] [:keyword "cond"]]
[:sentence [:keyword "defn"] [:identifier "my"] [:identifier "cond"]]
[:sentence [:identifier "defn"] [:identifier "my"] [:identifier "cond"]])
The ordered choice operator has its uses, but don't go overboard. There are two main reasons why it is generally better to use the regular unordered alternation operator.
1. When ordered choice interacts with a complex mix of recursion, other ordered choice operators, and indeterminate operators like `+` and `*`, it can quickly become difficult to reason about how the parsing will actually play out.
2. The next version of instaparse will support multithreading. In that version, every use of `|` will be an opportunity to exploit parallelism. On the contrary, uses of `/` will create a bottleneck where options have to be pursued in a specific order.
### Parse errors
`(insta/parse my-parser "parse this text")` will either return a parse tree or a failure object. The failure object will pretty-print at the REPL, showing you the furthest point it reached while parsing your text, and listing all the possible tokens that would have allowed it to proceed.
`(insta/parses my-parser "parse this text")` will return a sequence of all the parse trees, so in the event that no parse can be found, it will simply return an empty list. However, the failure object is still there, attached to the empty list as metadata.
`(insta/failure? result)` will detect both these scenarios and return true if the result is either a failure object, or an empty list with a failure object attached as metadata.
`(insta/get-failure result)` provides a unified way to extract the failure object in both these cases. If the result is a failure object, then it is directly returned, and if the result is an empty list with the failure attached as metadata, then the failure object is retrieved from the metadata.
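Here is a quick sketch of these functions in action, reusing the `repeated-a` parser from the "Partial parses" section above (the `result` binding is just for illustration):

```clojure
;; A minimal sketch of checking for and extracting a failure
(def result (insta/parses repeated-a "aaab"))

result                      ;; => ()   -- no complete parse exists
(insta/failure? result)     ;; => true
(insta/get-failure result)  ;; => the failure object, retrieved from the metadata
```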
### Total parse mode
Sometimes knowing the point of failure is not enough and you need to know the entire context of the parse tree when it failed. To help with these sorts of situations, instaparse offers a "total parse" mode inspired by Christophe Grand's parsley parser. This total parse mode guarantees to parse the entire string; if the parser fails, it completes the parse anyway, embedding the failure point as a node in the parse tree.
To demonstrate, let's revisit the ultra-simple `repeated-a` parser.
=> repeated-a
S = "a"+
=> (repeated-a "aaaaaaaa")
[:S "a" "a" "a" "a" "a" "a" "a" "a"]
On a string with a valid parse, the total parse mode performs identically:
=> (repeated-a "aaaaaaaa" :total true)
[:S "a" "a" "a" "a" "a" "a" "a" "a"]
On a failure, note the difference:
=> (repeated-a "aaaabaaa")
Parse error at line 1, column 5:
aaaabaaa
^
Expected:
"a"
=> (repeated-a "aaaabaaa" :total true)
[:S "a" "a" "a" "a" [:instaparse/failure "baaa"]]
Note that this kind of total parse result is still considered a "failure", and we can test for that and retrieve the failure object using `insta/failure?` and `insta/get-failure`, respectively.
=> (insta/failure? (repeated-a "aaaabaaa" :total true))
true
=> (insta/get-failure (repeated-a "aaaabaaa" :total true))
Parse error at line 1, column 5:
aaaabaaa
^
Expected:
"a"
I find that the total parse mode is the most valuable diagnostic tool when the cause of the error is far away from the point where the parser actually fails. A typical example might be a grammar where you are looking for phrases delimited by quotes, and the text neglects to include a closing quote mark around some phrase in the middle of the text. The parser doesn't fail until it hits the end of the text without encountering a closing quote mark.
In such a case, a quick look at the total parse tree will show you the context of the failure, making it easy to spot the location where the run-on phrase began.
### Parsing from another start rule
Another valuable tool for interactive debugging is the ability to test out individual rules. To demonstrate this, let's look back at our very first parser:
=> as-and-bs
S = AB*
AB = A B
A = "a"+
B = "b"+
As we've seen throughout this tutorial, by default, instaparse assumes that the very first rule is your "starting production", the rule from which parsing initially proceeds. But we can easily set other rules to be the starting production with the `:start` keyword argument.
=> (as-and-bs "aaa" :start :A)
[:A "a" "a" "a"]
=> (as-and-bs "aab" :start :A)
Parse error at line 1, column 3:
aab
^
Expected:
"a"
=> (as-and-bs "aabb" :start :AB)
[:AB [:A "a" "a"] [:B "b" "b"]]
=> (as-and-bs "aabbaabb" :start :AB)
Parse error at line 1, column 5:
aabbaabb
^
Expected:
"b"
The `insta/parser` function, which builds the parser from the specification, also accepts the :start keyword to set the default start rule to something other than the first rule listed.
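For example (a sketch; the parser name is hypothetical), the same grammar can be built with `:B` as its default start rule:

```clojure
;; A minimal sketch of setting the default start rule at construction time
(def bs-parser
  (insta/parser
    "S = AB*
     AB = A B
     A = 'a'+
     B = 'b'+"
    :start :B))

;; (bs-parser "bbb") => [:B "b" "b" "b"]
```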
#### Review of keyword arguments
At this point, you've seen all the keyword arguments that an instaparse-generated parser accepts, `:start :rule-name`, `:partial true`, and `:total true`. All these keyword arguments can be freely mixed and work with both `insta/parse` and `insta/parses`.
You've also seen both keyword arguments that can be used when building the parser from the specification: `:output-format (:enlive or :hiccup)` and `:start :rule-name` to set a different default start rule than the first rule.
### Transforming the tree
A parser's job is to turn a string into some kind of tree structure. What you do with it from there is up to you. It is delightfully easy to manipulate trees in Clojure. There are wonderful tools available: enlive, zippers, match, and tree-seq. But even without those tools, most tree manipulations are straightforward to perform in Clojure with recursion.
Since tree transformations are already so easy to perform in Clojure, there's not much point in building a sophisticated transform library into instaparse. Nevertheless, I did include one function, `insta/transform`, that addresses the most common transformation needs.
`insta/transform` takes a map from tree tags to transform functions. A transform function is defined as a function which takes the children of the tree node as inputs and returns a replacement node. In other words, if you want to turn all nodes in your tree of the form `[:switch x y]` into `[:switch y x]`, then you'd call:
(insta/transform {:switch (fn [x y] [:switch y x])}
my-tree)
Let's make this concrete with an example. So far, throughout the tutorial, we were able to adequately express the tokens of our languages with strings or regular expressions. But sometimes, regular expressions are not sufficient, and we want to bring the full power of context-free grammars to bear on the problem of processing the individual tokens. When we do that, we end up with a bunch of individual characters where we really want a string or a number.
To illustrate this, let's revisit the `words-and-numbers` example, but this time, we'll imagine that regular expressions aren't rich enough to specify the constraints on those tokens and we need our grammar to process the string one character at a time:
(def words-and-numbers-one-character-at-a-time
(insta/parser
"sentence = token (<whitespace> token)*
<token> = word | number
whitespace = #'\\s+'
word = letter+
number = digit+
<letter> = #'[a-zA-Z]'
<digit> = #'[0-9]'"))
=> (words-and-numbers-one-character-at-a-time "abc 123 def")
[:sentence [:word "a" "b" "c"] [:number "1" "2" "3"] [:word "d" "e" "f"]]
We'd really like to simplify these `:word` and `:number` terminals. So for `:word` nodes, we want to concatenate the strings with clojure's built-in `str` function, and for `:number` nodes, we want to concatenate the strings and convert the string to a number. We can do this quite simply as follows:
=> (insta/transform
{:word str,
:number (comp clojure.edn/read-string str)}
(words-and-numbers-one-character-at-a-time "abc 123 def"))
[:sentence "abc" 123 "def"]
Or, if you're a fan of threading macros, try this version:
=> (->> (words-and-numbers-one-character-at-a-time "abc 123 def")
(insta/transform
{:word str,
:number (comp clojure.edn/read-string str)}))
The `insta/transform` function auto-detects whether you are using enlive or hiccup trees, and processes accordingly.
`insta/transform` performs its transformations in a bottom-up manner, which means that taken to an extreme, `insta/transform` can be used not only to rearrange a tree, but to evaluate it. Including a grammar for infix arithmetic math expressions has become nearly obligatory in parser tutorials, so I might as well use that in order to demonstrate evaluation. I've leveraged instaparse's principle of "one rule per node type" and the hide notation `<>` to get a nice clean unambiguous tree that includes only the relevant information for evaluation.
(def arithmetic
(insta/parser
"expr = add-sub
<add-sub> = mul-div | add | sub
add = add-sub <'+'> mul-div
sub = add-sub <'-'> mul-div
<mul-div> = term | mul | div
mul = mul-div <'*'> term
div = mul-div <'/'> term
<term> = number | <'('> add-sub <')'>
number = #'[0-9]+'"))
=> (arithmetic "1-2/(3-4)+5*6")
[:expr
[:add
[:sub
[:number "1"]
[:div [:number "2"] [:sub [:number "3"] [:number "4"]]]]
[:mul [:number "5"] [:number "6"]]]]
With the tree in this shape, it's trivial to evaluate it:
=> (->> (arithmetic "1-2/(3-4)+5*6")
(insta/transform
{:add +, :sub -, :mul *, :div /,
:number clojure.edn/read-string :expr identity}))
33
`insta/transform` is designed to play nicely with all the possible outputs of `insta/parse` and `insta/parses`. So if the input is a sequence of parse trees, it will return a sequence of transformed parse trees. If the input is a Failure object, then the Failure object is passed through unchanged. This means you can safely chain a transform to your parser without taking special cases. To demonstrate this, let's look back at the `ambiguous` parser from earlier in the tutorial:
(def ambiguous
(insta/parser
"S = A A
A = 'a'*"))
=> (->> (insta/parses ambiguous "aaaaaa")
(insta/transform {:A str}))
([:S "a" "aaaaa"]
[:S "aaaaaa" ""]
[:S "aa" "aaaa"]
[:S "aaa" "aaa"]
[:S "aaaa" "aa"]
[:S "aaaaa" "a"]
[:S "" "aaaaaa"])
=> (->> (ambiguous "aabaaa")
(insta/transform {:A str}))
Parse error at line 1, column 3:
aabaaa
^
Expected:
"a"
### Understanding the tree
#### Character spans
The trees produced by instaparse are annotated with metadata so that for each subtree, you can easily recover the start and end index of the input text parsed by that subtree. The convenience function for extracting this metadata is `insta/span`. To demonstrate, let's revisit our first example.
=> (as-and-bs "aaaaabbbaaaabb")
[:S
[:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]]
[:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
=> (meta (as-and-bs "aaaaabbbaaaabb"))
{:instaparse.gll/start-index 0, :instaparse.gll/end-index 14}
=> (insta/span (as-and-bs "aaaaabbbaaaabb"))
[0 14]
=> (count "aaaaabbbaaaabb")
14
As you can see, `insta/span` returns a pair containing the start index (inclusive) and end index (exclusive), the customary way to represent the start and end of a substring. So far, this isn't particularly interesting -- we already knew that the entire string was successfully parsed. But since `span` works on all the subtrees, this gives us a powerful tool for exploring the provenance of each portion of the tree. To demonstrate this, here's a quick helper function (not part of instaparse's API) that takes a hiccup tree and replaces all the tags with the character spans.
(defn spans [t]
(if (sequential? t)
(cons (insta/span t) (map spans (next t)))
t))
=> (spans (as-and-bs "aaaabbbaabbab"))
([0 13]
([0 7] ([0 4] "a" "a" "a" "a") ([4 7] "b" "b" "b"))
([7 11] ([7 9] "a" "a") ([9 11] "b" "b"))
([11 13] ([11 12] "a") ([12 13] "b")))
`insta/span` works on all the tree types produced by instaparse. Furthermore, when you use `insta/transform` to transform your parse tree, `insta/span` will work on the transformed tree as well -- the span metadata is preserved for every node in the transformed tree to which metadata can be attached. Keep in mind that although most types of Clojure data support metadata, primitives such as strings or numbers do not, so if you transform any of your nodes into such primitive data types, `insta/span` on those nodes will simply return `nil`.
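To make that concrete, here is a sketch (the `transformed` binding is just for illustration) reusing the `words-and-numbers-one-character-at-a-time` parser from the transform section:

```clojure
;; A minimal sketch of span behavior on a transformed tree
(def transformed
  (insta/transform
    {:word str}
    (words-and-numbers-one-character-at-a-time "abc 123")))

;; transformed => [:sentence "abc" [:number "1" "2" "3"]]
(insta/span transformed)          ;; => [0 7] -- the vector still carries span metadata
(insta/span (nth transformed 1))  ;; => nil   -- "abc" is now a string, which cannot hold metadata
```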
##### Line and column information
Sometimes, when the input string contains newline characters, it is useful to have the span metadata in the form of line and column numbers. By default, instaparse doesn't do this, because generating line and column information requires a second pass over the input string and parse tree. However, the function `insta/add-line-and-column-info-to-metadata` performs this second pass, taking the input string and parse tree, returning a parse tree with the additional metadata. Make sure to pass in the same input string from which the parse tree was derived!
=> (def multiline-text "This is line 1\nThis is line 2")
=> (words-and-numbers multiline-text)
[:sentence [:word "This"] [:word "is"] [:word "line"] [:number "1"]
[:word "This"] [:word "is"] [:word "line"] [:number "2"]]
=> (def parsed-multiline-text-with-line-and-column-metadata
(insta/add-line-and-column-info-to-metadata
multiline-text
(words-and-numbers multiline-text)))
The additional information is in the metadata, so the tree itself is not visibly changed:
=> parsed-multiline-text-with-line-and-column-metadata
[:sentence [:word "This"] [:word "is"] [:word "line"] [:number "1"]
[:word "This"] [:word "is"] [:word "line"] [:number "2"]]
But now let's inspect the metadata for the overall parse tree.
=> (meta parsed-multiline-text-with-line-and-column-metadata)
{:instaparse.gll/end-column 15, :instaparse.gll/end-line 2,
:instaparse.gll/start-column 1, :instaparse.gll/start-line 1,
:instaparse.gll/start-index 0, :instaparse.gll/end-index 29}
And let's take a look at the metadata for the word "is" on the second line of the text.
=> (meta (nth parsed-multiline-text-with-line-and-column-metadata 6))
{:instaparse.gll/end-column 8, :instaparse.gll/end-line 2,
:instaparse.gll/start-column 6, :instaparse.gll/start-line 2,
:instaparse.gll/start-index 20, :instaparse.gll/end-index 22}
start-line and start-column point to the same character as start-index, and end-line and end-column point to the same character as end-index. So just like the regular span metadata, the line/column start point is inclusive and the end point is exclusive. However, line and column numbers are 1-based counts, rather than 0-based. So, for example, index number 0 of the string corresponds to line 1, column 1.
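Applying those conventions to the first word "This" on line 1 gives the following (a worked sketch of what the metadata should contain, not captured REPL output):

```clojure
=> (meta (nth parsed-multiline-text-with-line-and-column-metadata 1))
{:instaparse.gll/end-column 5, :instaparse.gll/end-line 1,
 :instaparse.gll/start-column 1, :instaparse.gll/start-line 1,
 :instaparse.gll/start-index 0, :instaparse.gll/end-index 4}
```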
#### Visualizing the tree
Instaparse contains a function, `insta/visualize` *(Clojure only)*, that will give you a visual overview of the parse tree, showing the tags, the character spans, and the leaves of the tree.
=> (insta/visualize (as-and-bs "aaabbab"))
The visualize function, by default, pops open the tree in a new window. To actually save the tree image as a file for this tutorial, I used both of the optional keyword arguments supported by `insta/visualize`. First, the `:output-file` keyword argument supplies the destination where the image should be saved. Second, the keyword `:options` is used to supply an option map of additional drawing parameters. I lowered it to 63dpi so it wouldn't take up so much screen real estate. So my function call looked like:
=> (insta/visualize (as-and-bs "aaabbab") :output-file "images/vizexample1.png" :options {:dpi 63})
`insta/visualize` draws the tree using the [rhizome](https://github.com/ztellman/rhizome) library, which in turn uses [graphviz](http://www.graphviz.org). Unfortunately, Java, and by extension Clojure, has a bit of a weakness when it comes to libraries depending on other libraries. If you want to use two libraries that rely on two different versions of a third library, you're in for a headache.
In this instance, rhizome is a particularly fast-moving target. As of the time of this writing, rhizome 0.1.8 is the most current version, released just a few weeks after version 0.1.6. If I were to make instaparse depend on rhizome 0.1.8, then in a few weeks when 0.1.9 is released, it will become more difficult to use instaparse in projects which rely on the most recent version of rhizome.
For this reason, I've done something a bit unusual: rather than include rhizome directly in instaparse's dependencies, I've set things up so that `insta/visualize` will use whatever version of rhizome *you've* put in your project.clj dependencies (must be version 0.1.8 or greater). On top of that, rhizome assumes that you have graphviz installed on your system. If rhizome is not in your dependencies, or graphviz is not installed, `insta/visualize` will throw an error with a message reminding you of the necessary dependencies. To find the most current version number for rhizome, and for links to graphviz installers, check out the [rhizome github site](https://github.com/ztellman/rhizome).
If you don't want to use `insta/visualize`, there is no need to add rhizome to your dependencies and no need to install graphviz. All the other instaparse functions will work just fine.
### Combinators
I truly believe that ordinary EBNF notation is the clearest, most concise way to express a context-free grammar. Nevertheless, there may be times when it is useful to build parsers with parser combinators. If you want to use instaparse in this way, you'll need to use the `instaparse.combinators` namespace. If you are not interested in the combinator interface, feel free to skip this section -- the combinators provide no additional power or expressiveness over the string representation.
Each construct you've seen from the string specification has a corresponding parser combinator. Most are straightforward, but the last few lines of the table will require some additional explanation.
| String syntax | Combinator | Mnemonic |
|---------------|------------|----------|
| Epsilon | Epsilon | Epsilon |
| A \| B \| C | (alt A B C) | Alternation |
| A B C | (cat A B C) | Concatenation |
| A? | (opt A) | Optional |
| A+ | (plus A) | Plus |
| A* | (star A) | Star |
| A / B / C | (ord A B C) | Ordered Choice |
| &A | (look A) | Lookahead |
| !A | (neg A) | Negative lookahead |
| `<A>` | (hide A) | Hide |
| "string" | (string "string") | String |
| #"regexp" | (regexp "regexp") | Regular Expression |
| A non-terminal | (nt :non-terminal) | Non-terminal |
| `<S> = ...` | {:S (hide-tag ...)} | Hide tag |
When using combinators, instead of building a string, your goal is to build a *grammar map*. So a spec that looks like this:
S = ...
A = ...
B = ...
becomes
{:S ... combinators describing right-hand-side of S rule ...
:A ... combinators describing right-hand-side of A rule ...
:B ... combinators describing right-hand-side of B rule ...}
You can also build it as a vector:
[:S ... combinators describing right-hand-side of S rule ...
:A ... combinators describing right-hand-side of A rule ...
:B ... combinators describing right-hand-side of B rule ...]
The main difference is that if you use the map representation, you'll eventually need to specify the start rule, but if you use the vector, instaparse will assume the first rule is the start rule. Either way, I'm going to refer to the above structure as a *grammar map*.
Most of the combinators, if you consult the above table, are pretty obvious. Here are a few additional things to keep in mind, and then a concrete example will follow:
1. Literal strings must be wrapped in a call to the `string` combinator.
2. Regular expressions must be wrapped in a call to the `regexp` combinator.
3. Any reference on the right-hand side of a rule to a non-terminal (i.e., a name of another rule) must be wrapped in a call to the `nt` combinator.
4. Angle brackets on the right-hand side of a rule correspond to the `hide` combinator.
5. Even though the notation for hiding a rule name is to put angle brackets around the name (on the left-hand side), this is implemented by wrapping the `hide-tag` combinator around the entire *right-hand side* of the rule expressed as combinators.
Hopefully this will all be clarified with an example. Do you remember the parser that looks for equal numbers of a's followed by b's followed by c's?
S = &(A 'c') 'a'+ B
A = 'a' A? 'b'
<B> = 'b' B? 'c'
Well, here's the corresponding grammar map:
(use 'instaparse.combinators)
(def abc-grammar-map
{:S (cat (look (cat (nt :A) (string "c")))
(plus (string "a"))
(nt :B))
:A (cat (string "a") (opt (nt :A)) (string "b"))
:B (hide-tag (cat (string "b") (opt (nt :B)) (string "c")))})
Once you've built your grammar map, you turn it into an executable parser by calling `insta/parser`. As I mentioned before, if you use map notation, you'll need to specify the start rule.
(insta/parser abc-grammar-map :start :S)
The result is a parser that is the same as the one built from the string specification.
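For example, here's a quick check I'd expect to hold for the `abc-grammar-map` defined above (the output shown is my expectation, not copied from a REPL session):

```clojure
(def abc-combinator-parser
  (insta/parser abc-grammar-map :start :S))

(abc-combinator-parser "aaabbbccc")
;; => [:S "a" "a" "a" "b" "b" "b" "c" "c" "c"]
```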
To my eye, the string is dramatically more readable, but if you need or want to use the combinator approach, it's there for you to utilize.
#### String to combinator conversion
Shortly after I published the first version of instaparse, I received a question, "String specifications can be combined with `clojure.string/join` and combinator grammar maps can be combined with `merge` --- is there any way to mix and match string and combinator grammar representations?" At the time, there wasn't, but now there is. As of version 1.1, there is a new function `ebnf` in the `instaparse.combinators` namespace which *converts* EBNF strings into the same underlying structure that is built by the combinator library, thus allowing for further manipulation by combinators. (EBNF stands for Extended Backus-Naur Form, the technical name for the syntax used by instaparse and described in this tutorial.) For example,
(ebnf "'a'* | 'b'+")
produces the same structure as if you had typed the combinator version
(alt (star (string "a")) (plus (string "b")))
You can also pass entire rules to `ebnf` and you'll get back the corresponding grammar map:
(ebnf "A = 'a'*; B = 'b'+")
produces
{:A (star (string "a"))
:B (plus (string "b"))}
This opens up the possibility of building a grammar from a mixture of combinators, and strings that have been converted to combinators. Here's a contrived example:
(def combo-build-example
(insta/parser
(merge
{:S (alt (nt :A) (nt :B))}
(ebnf "A = 'a'*")
{:B (ebnf "'b'+")})
:start :S))
### ABNF
Instaparse's primary input format is based on EBNF syntax, but an alternative input format, ABNF, is available. Most users will not need the ABNF input format, but if you need to implement a parser whose specification was written in ABNF syntax, it is very easy to do. Please read [instaparse's ABNF documentation](https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md) for details.
### String case sensitivity
One interesting difference between EBNF and ABNF grammars is that in EBNF, string terminals are case-sensitive whereas in ABNF, all string terminals are case-*in*sensitive. If you like ABNF's case-insensitive approach, but want to use Instaparse's somewhat richer EBNF syntax, there are a couple options available to you.
If you want *all* of the string terminals in your Instaparse EBNF grammar to be case-insensitive, the simplest solution is to use the `:string-ci true` keyword argument when calling `insta/parser` to make the strings case-insensitive:
=> ((insta/parser "S = 'a'+") "AaaAaa")
Parse error at line 1, column 1:
AaaAaa
^
Expected:
"a"
=> ((insta/parser "S = 'a'+" :string-ci true) "AaaAaa")
[:S "a" "a" "a" "a" "a" "a"]
On the other hand, if you want to cherry-pick certain string tokens to be case-insensitive, simply convert your string tokens into case-insensitive regexes, for example, replacing the string `'select'` with `#'(?i)select'`.
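Here's a minimal sketch of that cherry-picking approach; the grammar and rule names are made up for illustration, and only the `select` token is case-insensitive:

```clojure
(def select-parser
  (insta/parser
    "statement = select <' '> column
     select = #'(?i)select'
     column = #'[a-zA-Z]+'"))

(select-parser "SeLeCt foo")
;; => [:statement [:select "SeLeCt"] [:column "foo"]]
```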
### Serialization
You can serialize an instaparse parser with `print-dup`, and deserialize it with `read`. (You can't use `clojure.edn/read` because edn does not support regular expressions.)
Typically, it is more convenient to store and/or transmit the string specification used to generate the parser. The string specification allows the parser to be rebuilt with a different output format; `print-dup` captures the state of the parser after the output format has been "baked in". However, if you have built the parser with the combinators, rather than via a string spec, or if you are storing the parser inside of other Clojure data structures that need to be serialized, then `print-dup` may be your best option.
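Here's a minimal sketch of what that round trip can look like (it relies on the default `*read-eval*` behavior, since `print-dup` output uses reader evaluation):

```clojure
(def p (insta/parser "S = 'a'+"))

;; serialize to a string (which you could then write to a file)
(def serialized
  (binding [*print-dup* true]
    (pr-str p)))

;; deserialize with the Clojure reader, not clojure.edn
(def p2 (read-string serialized))

(p2 "aaa")
;; => [:S "a" "a" "a"]
```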
## Performance notes
Some of the parsing libraries out there were written as a learning exercise -- monadic parser combinators, for example, are a great way to develop an appreciation for monads. There's nothing wrong with taking the fruits of a learning exercise and making it available to the public, but there are enough Clojure parser libraries out there that it is getting to be hard to tell the difference between those that are "ready for primetime" and those that aren't. For example, some of the libraries rely heavily on nested continuations, a strategy that is almost certain to cause a stack overflow on moderately large inputs. Others rely heavily on memoization, but never bother to clear the cache between inputs, eventually exhausting all available memory if you use the parser repeatedly.
I'm not going to make any precise performance guarantees -- the flexible, general nature of instaparse means that it is possible to write grammars that behave poorly. Nevertheless, I want to convey that performance is something I have taken seriously. I spent countless hours profiling instaparse's behavior on strange grammars and large inputs, using that data to improve performance. Just as one example, I discovered that for a large class of grammars, the biggest bottleneck was Clojure's hashing strategy, so I implemented a wrapper around Clojure's vectors that uses an alternative hashing strategy, successfully reducing running time on many parsers from quadratic to linear. (A shout-out to Christophe Grand who provided me with valuable guidance on this particular improvement.)
I've also worked to remove "performance surprises". For example, both left-recursion and right-recursion have sufficiently similar performance that you really don't need to agonize over which one to use -- choose whichever style best fits the problem at hand. If you express your grammar in a natural way, odds are good that you'll find the performance of the generated parser to be satisfactory. An additional performance boost in the form of multithreading is slated for the next release.
One performance caveat: instaparse is fairly memory-hungry, relying on extensive caching of intermediate results to keep the computational costs reasonable. This is not unusual -- caching is commonplace in many modern parsers, trading off space for time -- but it's worth bearing in mind. Packrat/PEG parsers and many recursive descent parsers employ a similar memory-intensive strategy, but there are other alternatives out there if that kind of memory usage is unacceptable. As one would expect, instaparse parsers do not hold onto the memory cache once the parse is complete; that memory is made available for garbage collection.
The [performance notes document](https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md) contains a deeper discussion of performance and a few helpful hints for getting the best performance out of your parser.
## Reference
All the functionality you've seen in this tutorial is packed into a small set of core functions. Here are the doc strings:
=> (doc insta/parser)
-------------------------
instaparse.core/parser
([grammar-specification & {:as options}])
Takes a string specification of a context-free grammar,
or a URI for a text file containing such a specification,
or a map of parser combinators and returns a parser for that grammar.
Optional keyword arguments:
:input-format :ebnf
or
:input-format :abnf
:output-format :enlive
or
:output-format :hiccup
:start :keyword (where :keyword is name of starting production rule)
:string-ci true (treat all string literals as case insensitive)
:auto-whitespace (:standard or :comma)
or
:auto-whitespace custom-whitespace-parser
Clj only:
:no-slurp true (disables use of slurp to auto-detect whether
input is a URI. When using this option, input
must be a grammar string or grammar map. Useful
for platforms where slurp is slow or not available.)
=> (doc insta/parse)
-------------------------
instaparse.core/parse
([parser text & {:as options}])
Use parser to parse the text. Returns first parse tree found
that completely parses the text. If no parse tree is possible, returns
a Failure object.
Optional keyword arguments:
:start :keyword (where :keyword is name of starting production rule)
:partial true (parses that don't consume the whole string are okay)
:total true (if parse fails, embed failure node in tree)
:unhide <:tags or :content or :all> (for this parse, disable hiding)
:optimize :memory (when possible, employ strategy to use less memory)
Clj only:
:trace true (print diagnostic trace while parsing)
=> (doc insta/parses)
-------------------------
instaparse.core/parses
([parser text & {:as options}])
Use parser to parse the text. Returns lazy seq of all parse trees
that completely parse the text. If no parse tree is possible, returns
() with a Failure object attached as metadata.
Optional keyword arguments:
:start :keyword (where :keyword is name of starting production rule)
:partial true (parses that don't consume the whole string are okay)
:total true (if parse fails, embed failure node in tree)
:unhide <:tags or :content or :all> (for this parse, disable hiding)
Clj only:
:trace true (print diagnostic trace while parsing)
=> (doc insta/set-default-output-format!)
-------------------------
instaparse.core/set-default-output-format!
([type])
Changes the default output format. Input should be :hiccup or :enlive
=> (doc insta/failure?)
-------------------------
instaparse.core/failure?
([result])
Tests whether a parse result is a failure.
=> (doc insta/get-failure)
-------------------------
instaparse.core/get-failure
([result])
Extracts failure object from failed parse result.
=> (doc insta/transform)
-------------------------
instaparse.core/transform
([transform-map parse-tree])
Takes a transform map and a parse tree (or seq of parse-trees).
A transform map is a mapping from tags to
functions that take a node's contents and return
a replacement for the node, i.e.,
{:node-tag (fn [child1 child2 ...] node-replacement),
:another-node-tag (fn [child1 child2 ...] node-replacement)}
=> (doc insta/span)
-------------------------
instaparse.core/span
([tree])
Takes a subtree of the parse tree and returns a [start-index end-index] pair
indicating the span of text parsed by this subtree.
start-index is inclusive and end-index is exclusive, as is customary
with substrings.
Returns nil if no span metadata is attached.
=> (doc insta/add-line-and-column-info-to-metadata)
-------------------------
instaparse.core/add-line-and-column-info-to-metadata
([text parse-tree])
Given a string `text` and a `parse-tree` for text, return parse tree
with its metadata annotated with line and column info. The info can
then be found in the metadata map under the keywords:
:instaparse.gll/start-line, :instaparse.gll/start-column,
:instaparse.gll/end-line, :instaparse.gll/end-column
The start is inclusive, the end is exclusive. Lines and columns are 1-based.
=> (doc insta/visualize)
-------------------------
instaparse.core/visualize
([tree & {output-file :output-file, options :options}])
Creates a graphviz visualization of the parse tree.
Optional keyword arguments:
:output-file output-file (will save the tree image to output-file)
:options options (options passed along to rhizome)
Important: This function will only work if you have added rhizome
to your dependencies, and installed graphviz on your system.
See https://github.com/ztellman/rhizome for more information.
## Experimental Features
See the [Experimental Features](docs/ExperimentalFeatures.md) page for a discussion of new features under active development, including memory optimization and automatic handling of whitespace.
## Communication
I try to be very responsive to issues posted to the github issues page. But if you have a general question, need some help troubleshooting a grammar, or have something interesting you've done in instaparse that you'd like to share, consider joining the [Instaparse Google Group](https://groups.google.com/d/forum/instaparse) and posting there.
## Special Thanks
My interest in this project began while watching a video of Matt Might's [*Parsing with Derivatives*](http://www.youtube.com/watch?v=ZzsK8Am6dKU) talk. That video convinced me that the world would be a better place if building parsers were as easy as working with regular expressions, and that the ability to handle arbitrary, possibly-ambiguous grammars was essential to that goal.
Matt Might has published a [paper](http://matt.might.net/papers/might2011derivatives.pdf) about a specific approach to achieving that goal, but I had difficulty getting his *Parsing with Derivatives* technique to work in a performant way.
I probably would have given up, but then Danny Yoo released the [Ragg parser generator](http://hashcollision.org/ragg/index.html) for the Racket language. The Ragg library was a huge inspiration -- a model for what I wanted instaparse to become. I asked Danny what technique he used, and he gave me more information about the algorithm he used. However, he told me that if he were to do it again from scratch, he'd probably choose to use a [GLL algorithm](http://ldta.info/2009/ldta2009proceedings.pdf) by Adrian Johnstone and Elizabeth Scott, and he pointed me to a fantastic article about it by Vegard Øye, [posted on Github with source code in Racket](https://github.com/epsil/gll).
That article had a link to a [paper](http://www.cs.uwm.edu/%7Edspiewak/papers/generalized-parser-combinators.pdf) and [Scala code](https://github.com/djspiewak/gll-combinators) by Daniel Spiewak, which was also extremely helpful.
Alex Engelberg coded the first version of instaparse, proving the capabilities of the GLL algorithm. He encouraged me to take his code and build and document a user-friendly API around it. He continues to be a main contributor on the project, most recently developing the ABNF front-end, bringing the Clojurescript port up to feature parity with the Clojure version, and working out the details of merging the two codebases.
I studied a number of other Clojure parser generators to help frame my ideas about what the API should look like. I communicated with Eric Normand ([squarepeg](https://github.com/ericnormand/squarepeg)) and Christophe Grand ([parsley](https://github.com/cgrand/parsley)), both of whom provided useful advice and encouraged me to pursue my vision.
YourKit is kindly supporting open source projects with its full-featured Java Profiler.
YourKit, LLC is the creator of innovative and intelligent tools for profiling
Java and .NET applications. Take a look at YourKit's leading software products:
[YourKit Java Profiler](http://www.yourkit.com/java/profiler/index.jsp) and
[YourKit .NET Profiler](http://www.yourkit.com/.net/profiler/index.jsp).
# ABNF Input Format
ABNF is an alternative input format for instaparse grammar specifications. ABNF does not provide any additional expressive power over instaparse's default EBNF-based syntax, so if you are new to instaparse and parsing, you do not need to read this document -- stick with the syntax described in [the tutorial](https://github.com/Engelberg/instaparse/blob/master/README.md).
ABNF's main virtue is that it is precisely specified and commonly used in protocol specifications. If you use such protocols, instaparse's ABNF input format is a simple way to turn the ABNF specification into an executable parser. However, unless you are working with such specifications, you do not need the ABNF input format.
## EBNF vs ABNF
### EBNF
The most common notation for expressing context-free grammars is [Backus-Naur Form](http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form), or BNF for short. BNF, however, is a little too simplistic. People wanted more convenient notation for expressing repetitions, so [EBNF](http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form), or *Extended* Backus-Naur Form was developed.
There is a hodge-podge of various syntax extensions that all fall under the umbrella of EBNF. For example, one standard specifies that repetitions should be specified with `{}`, but regular expression operators such as `+`, `*`, and `?` are far more popular.
When creating the primary input format for instaparse, I based the syntax off of EBNF. I consulted various standards I found on the internet, and filtered it through my own experience of what I've seen in various textbooks and specs over the years. I included the official repetition operators as well as the ones derived from regular expressions. I also incorporated PEG-like syntax extensions.
What I ended up with was a slightly tweaked version of EBNF, making it relatively easy to turn any EBNF-specified grammar into an executable parser. However, with multiple competing standards and actively-used variations, there's no guarantee that an EBNF grammar that you find will perfectly align with instaparse's syntax. You may need to make a few tweaks to get it to work.
### ABNF
From what I can tell, the purpose of [ABNF](http://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_Form), or *Augmented* Backus-Naur Form, was to create a grammar syntax that would have a single, well-defined, formal standard, so that all ABNF grammars would look exactly the same.
For this reason, ABNF seems to be a more popular grammar syntax in the world of specifications and protocols. For example, if you want to know the formal definition of what constitutes a valid URI, there's an ABNF grammar for that.
After instaparse's initial release, I received a couple requests to support ABNF as an alternative input format. Since ABNF is so precisely defined, in theory, any ABNF grammar should work without modification. In practice, I've found that many ABNF specifications have one or two small typos; nevertheless, applying instaparse to ABNF is mostly a trivial copy-paste exercise.
I included whatever further extensions and extra instaparse goodies I could safely include, but omitted any extension that would conflict with the ABNF standard and jeopardize the ability to use ABNF grammar specifications without modification.
Aside from just wanting to adhere to the ABNF specification, I can think of a few niceties that ABNF provides over EBNF:
1. ABNF has a convenient syntax for specifying bounded repetitions, for example, something like "between 3 and 5 repetitions of the letter a".
2. Convenient syntax for expressing characters and ranges of characters.
3. ABNF comes with a "standard library" of a dozen or so common token rules.
## Usage
To get a feeling for what ABNF syntax looks like, first check out this [ABNF specification for phone URIs.](https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt) I copied and pasted it directly from the formal spec -- found one typo which I fixed.
(def phone-uri-parser
(insta/parser "https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt"
:input-format :abnf))
=> (phone-uri-parser "tel:+1-201-555-0123")
[:telephone-uri
"tel:"
[:telephone-subscriber
[:global-number
[:global-number-digits
"+"
[:DIGIT "1"]
[:phonedigit [:visual-separator "-"]]
[:phonedigit [:DIGIT "2"]]
[:phonedigit [:DIGIT "0"]]
[:phonedigit [:DIGIT "1"]]
[:phonedigit [:visual-separator "-"]]
[:phonedigit [:DIGIT "5"]]
[:phonedigit [:DIGIT "5"]]
[:phonedigit [:DIGIT "5"]]
[:phonedigit [:visual-separator "-"]]
[:phonedigit [:DIGIT "0"]]
[:phonedigit [:DIGIT "1"]]
[:phonedigit [:DIGIT "2"]]
[:phonedigit [:DIGIT "3"]]]]]]
The usage, as you can see, is almost identical to the way you build parsers using the `insta/parser` constructor. The only difference is the additional keyword argument `:input-format :abnf`.
If you find yourself working with a whole series of ABNF parser specifications, you may find it more convenient to call
(insta/set-default-input-format! :abnf)
to alter the default input format. Changing the default makes it unnecessary to specify `:input-format :abnf` with each call to the parser constructor.
Here is the doc string:
=> (doc insta/set-default-input-format!)
-------------------------
instaparse.core/set-default-input-format!
([type])
Changes the default input format. Input should be :abnf or :ebnf
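Putting that together, here's a small sketch of how I'd expect the default-switching to work (remember to switch back if the rest of your code expects EBNF):

```clojure
(insta/set-default-input-format! :abnf)

;; now plain insta/parser calls are interpreted as ABNF
(def abnf-as (insta/parser "S = 1*\"a\""))

(abnf-as "aaa")
;; => [:S "a" "a" "a"]
;; "AAA" also parses, since ABNF strings are case-insensitive

(insta/set-default-input-format! :ebnf)   ; restore the default
```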
## ABNF Syntax Guide
Category | Notations | Example | Notes |
--- | --- | --- | --- |
Rule | = =/ | S = A | =/ is usually used to extend an already-defined rule |
Alternation | / | A / B | Despite the use of /, this is unordered choice |
Concatenation | whitespace | A B | |
Grouping | () | (A / B) C | |
Bounded Repetition | * | 3*5 A | In ABNF, repetition precedes the element |
Optional | *1 | *1 A | |
One or more | 1* | 1* A | |
Zero or more | * | *A |
String terminal | "" '' | 'a' "a" | Single-quoted strings are an instaparse extension |
Regex terminal | #"" #'' | #'a' #"a" | Regexes are an instaparse extension |
Character terminal | %d %b %x | %x30-37 |
Comment | ; | ; comment to the end of the line |
Lookahead | & | &A | Lookahead is an instaparse extension |
Negative lookahead | ! | !A | Negative lookahead is an instaparse extension |
Some important things to be aware of:
+ According to the ABNF standard, all strings are *case-insensitive*.
+ ABNF strings do not support any kind of escape characters. Use ABNF's character notation to specify unusual characters.
+ In ABNF, there is one repetition operator, `*`, and it *precedes* the thing that it is operating on. So, for example, `3*5` means "between 3 and 5 repetitions". The first number defaults to 0 and the second defaults to infinity, so you can omit one or both numbers to get effects comparable to EBNF's `+`, `*`, and `?`. `4*4` could just be written as `4`.
+ Use `;` for comments to the end of the line. The ABNF specification has rigid definitions about where comments can be, but in instaparse the rules for comment placement are a bit more flexible and intuitive.
+ ABNF uses `/` for the ordinary alternative operator with no order implied.
+ ABNF allows the restatement of a rule name to specify multiple alternatives. The custom is to use `=/` in definitions that are adding alternatives, for example `S = 'a' / 'b'` could be written as:
S = 'a'
S =/ 'b'
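Here's a hedged, runnable sketch of the bounded-repetition and `=/` notations just described (the rule name is arbitrary):

```clojure
(def abnf-repetition-example
  (insta/parser
    "S = 3*5\"a\"   ; between 3 and 5 a's
     S =/ 1*2\"b\"  ; or, extending the rule, one or two b's"
    :input-format :abnf))

(abnf-repetition-example "aaaa")
;; => [:S "a" "a" "a" "a"]

(abnf-repetition-example "bb")
;; => [:S "b" "b"]
;; "BB" would parse as well -- ABNF strings are case-insensitive
```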
## Extensions
Instaparse extends ABNF by allowing single-quoted strings and both double-quoted and single-quoted regular expressions. The PEG extensions of lookahead `&` and negative lookahead `!` are permitted, but the PEG extension of ordered choice could not be included because of the syntactic conflict with ABNF's usage of `/` for unordered alternatives.
Instaparse is somewhat more flexible with whitespace than the ABNF specification dictates, but somewhat less flexible than you might expect from the EBNF input format. For example, in instaparse's EBNF mode, `(A B)C` would be just fine, but ABNF insists on at least one space to indicate concatenation, so you'd have to write `(A B) C`. I relaxed whitespace restrictions when I could do so without radically deviating from the specification.
### Angle brackets
The ABNF input format supports instaparse's angle bracket notation, where angle brackets can be used to hide certain parts of the grammar from the resulting tree structure. Including instaparse's angle bracket notation was a bit of a tough decision because technically angle brackets are reserved for special use in ABNF grammars.
However, in ABNF notation, angle brackets are meant to be used for prose descriptions of some concept that can't be mechanically specified in the grammar. For example:
P = <some prose description of P>
I realized that such constructs can't be mechanically handled anyway, so I might as well co-opt the angle bracket notation, as I did with the EBNF syntax, for the very handy purpose of hiding.
This means that when you paste in an ABNF specification, it is always wise to do a quick scan to make sure that no angle brackets were used. They are rarely used, but one [notably strange use of angle brackets](http://w3-org.9356.n7.nabble.com/ipath-empty-ABNF-rule-td192464.html) occurs in the URI specification, which uses a repetition count of zero (`0<pchar>`) to designate the empty string. So be aware of these sorts of possibilities, but you're unlikely to run into them.
## The standard rules
The ABNF specification states that the following rules are always available for use in ABNF grammars:
Name | Explanation |
--- | --- |
ALPHA | Alphabetic character |
BIT | 0 or 1 |
CHAR | ASCII character |
CR | \r |
CRLF | \r\n |
CTL | control character |
DIGIT | 0-9 |
DQUOTE | " |
HEXDIG | Hexadecimal digit: 0-9 or A-F |
HTAB | \t |
LF | \n |
LWSP | A specific mixture of whitespace and CRLF (see note below) |
OCTET | 8-bit character |
SP | the space character |
VCHAR | visible character |
WSP | space or tab |
LWSP is particularly quirky, defined to be either a space or tab character, or an alternating sequence of carriage-return-linefeed and a single space or tab character. It's very specific, presumably relevant to some particular protocol, but not generally useful and I don't recommend using it.
## Combinators
The `instaparse.combinators` namespace contains a few combinators that are not documented in the main tutorial, but are listed here because they are only relevant to ABNF grammars.
String syntax | Combinator | Functionality |
--- | --- | --- |
"abc" (as used in ABNF) | (string-ci "abc") | string, case-insensitive |
3*5 (as used in ABNF) | (rep 3 5 parser) | repetition |
%d97 (as used in ABNF) | (unicode-char 97) | unicode code point |
%d97-122 (as used in ABNF) | (unicode-char 97 122) | unicode range |
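As a small sketch of the `rep` and `string-ci` combinators in action (the rule name is arbitrary):

```clojure
(require '[instaparse.core :as insta]
         '[instaparse.combinators :refer [rep string-ci]])

;; the ABNF fragment  S = 3*5"ab"  written directly with combinators
(def rep-example
  (insta/parser {:S (rep 3 5 (string-ci "ab"))} :start :S))

(rep-example "ababab")
;; => [:S "ab" "ab" "ab"]
;; "AbaBab" would match too, since string-ci is case-insensitive
```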
Finally, just as there exists an `ebnf` function in the combinators namespace that turns EBNF fragments into combinator-built data structures, there exists an `abnf` function which does the same for ABNF fragments.
This means it is entirely possible to take fragments of EBNF syntax along with fragments of ABNF syntax, and convert all the pieces, merging them into a grammar map along with other pieces built from combinators. I don't expect that many people will need this ability to mix and match, but it's there if you need it.
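For instance, here's a sketch of the kind of mixing and matching I have in mind, using nothing beyond the `ebnf` and `abnf` functions described above:

```clojure
(require '[instaparse.core :as insta]
         '[instaparse.combinators :refer [ebnf abnf]])

(def mixed-parser
  (insta/parser
    (merge
      (ebnf "S = A B")          ; EBNF rule
      {:A (abnf "1*\"a\"")}     ; ABNF fragment as the right-hand side of :A
      (ebnf "B = 'b'*"))        ; another EBNF rule
    :start :S))

(mixed-parser "aaabb")
;; => [:S [:A "a" "a" "a"] [:B "b" "b"]]
```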
## Case Sensitivity
I've already mentioned that in ABNF syntax, strings are *case-insensitive*, meaning that the string terminal "abc" in an ABNF grammar also matches "aBc", "AbC", etc. Many ABNF grammar specifications leverage this case insensitivity, for example, the spec for hexadecimal digits include the strings "A", "B", "C", "D", "E", and "F", and this is intended to match the lowercase letters as well.
A lesser-known quirk of ABNF syntax is that, in theory, non-terminal rule names are also case-insensitive. So for example, in the ABNF rule `S = 'a' s`, the lowercase `s` is actually referring back to the uppercase `S`. Although the specification of ABNF syntax allows for this possibility, as best as I can determine, this "feature" simply isn't used. It would be confusing and bad form to refer to a non-terminal in different places of your grammar with a different mixture of cases.
Therefore, by default in instaparse, ABNF non-terminals are in fact, case-sensitive. This makes it easier for ABNF grammars to play nicely with EBNF grammars, grammar maps, and instaparse's transform function, all of which are case-sensitive.
If you find yourself working with an ABNF grammar that uses an inconsistent mix of lowercase and uppercase letters to refer to the same non-terminal rules, you have two options available to you. The first possibility, of course, is to simply go through and fix the inconsistencies. The second option is to bind the dynamic variable `instaparse.abnf/*case-insensitive*` to true while building the parser from the ABNF grammar.
Under the hood, this works by *converting all non-terminals to uppercase*. This means that in the resulting parse tree, all the rule names will be uppercase, so plan your tree traversals and transformations accordingly.
As an example, let's revisit the usage example from above:
(def phone-uri-parser
(binding [instaparse.abnf/*case-insensitive* true]
(insta/parser "https://raw.github.com/Engelberg/instaparse/master/test/instaparse/phone_uri.txt"
:input-format :abnf)))
=> (phone-uri-parser "tel:+1-201-555-0123")
[:TELEPHONE-URI
"tel:"
[:TELEPHONE-SUBSCRIBER
[:GLOBAL-NUMBER
[:GLOBAL-NUMBER-DIGITS
"+"
[:DIGIT "1"]
[:PHONEDIGIT [:VISUAL-SEPARATOR "-"]]
[:PHONEDIGIT [:DIGIT "2"]]
[:PHONEDIGIT [:DIGIT "0"]]
[:PHONEDIGIT [:DIGIT "1"]]
[:PHONEDIGIT [:VISUAL-SEPARATOR "-"]]
[:PHONEDIGIT [:DIGIT "5"]]
[:PHONEDIGIT [:DIGIT "5"]]
[:PHONEDIGIT [:DIGIT "5"]]
[:PHONEDIGIT [:VISUAL-SEPARATOR "-"]]
[:PHONEDIGIT [:DIGIT "0"]]
[:PHONEDIGIT [:DIGIT "1"]]
[:PHONEDIGIT [:DIGIT "2"]]
[:PHONEDIGIT [:DIGIT "3"]]]]]]
The `*case-insensitive*` dynamic variable is also obeyed by the `abnf` combinator.
# Instaparse Experimental Features
This document provides an explanation of some of the things I'm experimenting with in instaparse. Please try the new features and let me know what you think.
## Optimizing memory
I've added a new, experimental `:optimize :memory` flag that can conserve memory usage for certain classes of grammars. I discussed the motivation for this in the [Performance document](Performance.md). The idea is to make it more practical to use instaparse in situations where you need to parse files containing a large number of independent chunks.
Usage looks like this:
(def my-parser (insta/parser my-grammar))
(my-parser text :optimize :memory)
It works for grammars where the top-level production is of the form
start = chunk+
or
start = header chunk+
I don't mean that it literally needs to use the words `start` or `header` or `chunk`. What I mean is that the optimizer looks for top-level productions that finish off with some sort of repeating structure. To be properly optimized, you want to ensure that the `chunk` rule is written with no ambiguity about where a chunk begins and ends.
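To make that concrete, here's a hedged sketch of a grammar with the right shape (all rule names and the input are made up):

```clojure
;; each entry is one line, so chunk boundaries are unambiguous
(def log-parser
  (insta/parser
    "log = entry+
     entry = level <':'> message <#'\\n?'>
     level = #'[A-Z]+'
     message = #'[^\\n]*'"))

(log-parser "INFO:starting up\nWARN:low disk\n" :optimize :memory)
;; => [:log [:entry [:level "INFO"] [:message "starting up"]]
;;          [:entry [:level "WARN"] [:message "low disk"]]]
```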
Behind the scenes, here's what the optimization algorithm is doing: After successfully parsing a `chunk`, the parser *forgets* all the backtracking information and continues parsing the remaining text totally fresh looking for the next chunk, with no sense of history about what has come before. As long as it keeps finding one chunk after another, it can get through a very large file with far less memory usage than the standard algorithm.
The downside of this approach is that if the parser hits a spot that doesn't match the repeating chunk rule, there's no way for it to know for sure that this is a fatal failure. It is entirely possible that there is some other interpretation of an earlier chunk that would make the whole input parseable. The standard instaparse approach is to backtrack and look for alternative interpretations before declaring a failure. However, without that backtracking history, there's no way to do that.
So when you use the `:optimize :memory` flag and your parser hits an error using the "parse one chunk at a time and forget the past" strategy, it *restarts the entire parse process* with the original strategy.
I'm not entirely sure this was the right design decision, and would welcome feedback on this point. Here are the tradeoffs:
Advantage of the current approach: With this *fall back to the original strategy if the optimizer doesn't work* approach, it should be totally safe to try the optimizer, even if you don't know for sure up front whether the optimizer will work. With the `:optimize :memory` flag, the output will always be exactly the same as if you hadn't used the flag. (A metadata annotation, however, will let you know whether the parse was successfully completed entirely with the optimization strategy.) I like the safety of this approach, and how it is amenable to the attitude of "Let's try this optimization flag out and see if it helps."
Disadvantage of the current approach: If you're operating on a block of input text so large that the memory optimization is a *necessity*, then if you have a flaw in your text, you're in trouble -- the parsing restarts with the original strategy and if the flaw is fairly late in your file, you could exhaust your memory.
An alternative design would be to say that if you've enabled the `:optimize :memory` flag, and it hits an apparent flaw in the input, then it's immediately reported as a failure, without any attempt to try the more sophisticated strategy and see whether backtracking might help the situation. This would be good for people willing to expend the effort to ensure the grammar conforms to the optimizer's constraints and has no ambiguity in the chunk definition. It would then be correct to report a failure right away if encountered by the optimization strategy -- no need to fall back to the original strategy because there's no ambiguity and no alternative interpretation.
However, if the flag behaved in this way, then it is possible that if the grammar weren't well-suited for the optimizer, the `:optimize :memory` flag might return a failure in some instances where the regular strategy would return success. In some sense, this would give the programmer maximum control: the programmer can *choose* to rerun the input without the `:optimize :memory` flag or can accept the failure at face value if confident in the grammar's suitability for the optimization strategy.
So I'm torn: right now the optimizer falls back to the regular strategy because I like that it is dead simple to use, it's safe to try without a deep understanding of what is going on, and it will always give correct output. But I recognize that having the optimizer simply report the failure gives the programmer greatest control over whether to restart with the regular strategy or not.
What do you think is the better design choice?
## Auto Whitespace
I have received several requests for instaparse to support the parsing of streams of tokens, rather than just strings. There appear to be two main motivations for this request:
1. For some grammars, explicitly specifying all the places where whitespace can go is a pain.
2. For parsing indentation-sensitive languages, it is useful to have a pre-processing pass that identifies `indent` and `dedent` tokens.
I'm still thinking about developing a token-processing version of instaparse. But if I can find a way to address the underlying needs while maintaining the "token-free" simplicity of instaparse, that would be even better.
This new experimental "auto whitespace" feature addresses the first issue, simplifying the specification of grammars where you pretty much want to allow optional whitespace between all your tokens. Here's how to use the new feature:
First, you want to develop a parser that consumes whitespace. The simplest, most common way to do this would be:
(def whitespace
(insta/parser
"whitespace = #'\\s+'"))
Let's test it out:
=> (whitespace " ")
[:whitespace " "]
=> (whitespace " \t \n \t ")
[:whitespace " \t \n \t "]
Important: Your whitespace parser should *not* accept the empty string.
=> (whitespace "")
Parse error at line 1, column 1:
nil
^
Expected:
#"^\s+" (followed by end-of-string)
Good, this is what we want. Now, we can define a parser similar to the `words-and-numbers` parser from the tutorial, but this time we'll use the auto-whitespace feature.
(def words-and-numbers-auto-whitespace
(insta/parser
"sentence = token+
<token> = word | number
word = #'[a-zA-Z]+'
number = #'[0-9]+'"
:auto-whitespace whitespace))
Notice the use of the `:auto-whitespace` keyword, and how we call it with the whitespace parser we developed earlier.
=> (words-and-numbers-auto-whitespace " abc 123 45 de ")
[:sentence [:word "abc"] [:number "123"] [:number "45"] [:word "de"]]
Behind the scenes, here's what's going on: the whitespace parsing rule(s) are merged into the new parser, and an optional version of the starting production for the whitespace rule is liberally inserted before all tokens and at the end. In this case, that means `<whitespace?>` is inserted all over the place. You can see the insertion points by viewing the parser:
=> words-and-numbers-auto-whitespace
sentence = token+ <whitespace?>
whitespace = #"\s+"
<token> = word | number
word = <whitespace?> #"[a-zA-Z]+"
number = <whitespace?> #"[0-9]+"
You can also see that the whitespace is in fact getting parsed, and is just being hidden:
=> (words-and-numbers-auto-whitespace " abc 123 45 de " :unhide :content)
[:sentence " " [:word "abc"] " " [:number "123"] " " [:number "45"] " " [:word "de"] " "]
Because the whitespace parser rules are merged into the new parser, don't create any rules in your parser with the same names as those in the whitespace parser. If you do, one of the rules will get clobbered and you'll run into problems. (TODO: Report an error if a user tries to do this)
Note that it makes no difference whether the `:output-format` of the whitespace parser is :enlive or :hiccup. The rules and the starting production for the whitespace parser are all that matter.
Because the :auto-whitespace feature allows you to specify your notion of whitespace, you have the total flexibility to define this however you want. For example, let's say I want to allow not only whitespace, but `(* comments *)` between any tokens. Again, we start by developing a corresponding parser:
(def whitespace-or-comments-v1
(insta/parser
"ws-or-comment = #'\\s+' | comment
comment = '(*' inside-comment* '*)'
inside-comment = !( '*)' | '(*' ) #'.' | comment"))
Does it eat whitespace?
=> (whitespace-or-comments-v1 " ")
[:ws-or-comment " "]
Check. Does it handle a comment?
=> (whitespace-or-comments-v1 "(* comment *)")
Check. Can it handle nested comments?
=> (whitespace-or-comments-v1 "(* (* comment *) *)")
And we mustn't forget -- make sure it *doesn't* parse the empty string:
=> (whitespace-or-comments-v1 "")
However, there's a problem here. The auto-whitespace feature inserts optional `?` versions of the whitespace parser everywhere, *not* repeating versions. It's up to us to make sure that the whitespace parser consumes the *full extent* of any whitespace that could appear between tokens. In other words, if we want to allow multiple comments in a row, we need to spell that out:
(def whitespace-or-comments-v2
(insta/parser
"ws-or-comments = #'\\s+' | comments
comments = comment+
comment = '(*' inside-comment* '*)'
inside-comment = !( '*)' | '(*' ) #'.' | comment"))
=> (whitespace-or-comments-v2 "(* comment1 *)(* (* nested comment *) *)")
There's still one more issue, though. Right now, our parser specifies complete empty whitespace, or a series of comments. But if we want to intermingle whitespace and comments, it won't work:
=> (whitespace-or-comments-v2 " (* comment1 *) (* comment2 *) ")
Parse error at line 1, column 1:
(* comment1 *) (* comment2 *)
^
Expected one of:
#"^\s+" (followed by end-of-string)
"(*"
I could go through and manually insert optional whitespace, but wouldn't it be deliciously meta to use the auto-whitespace feature with our previous, simple whitespace parser to define our whitespace-or-comments parser?
(def whitespace-or-comments
(insta/parser
"ws-or-comments = #'\\s+' | comments
comments = comment+
comment = '(*' inside-comment* '*)'
inside-comment = !( '*)' | '(*' ) #'.' | comment"
:auto-whitespace whitespace))
Now it works:
=> (whitespace-or-comments " (* comment1 *) (* comment2 *) ")
Just out of curiosity, let's see where the `<whitespace?>` got inserted:
=> whitespace-or-comments
ws-or-comments = (<whitespace?> #"\s+" | comments) <whitespace?>
whitespace = #"\s+"
comments = comment+
comment = <whitespace?> "(*" inside-comment* <whitespace?> "*)"
inside-comment = !(<whitespace?> "*)" | <whitespace?> "(*") <whitespace?> #"." | comment
Note that the auto-insertion process inserted `<whitespace?>` right before the `"*)"`, but this isn't particularly useful, because all whitespace before `*)` would already be eaten by the `inside-comment` rule. If you were inserting the optional whitespace by hand, you'd probably realize it was unnecessary there. However, when you let the system automatically insert it everywhere, some of the insertions might be gratuitous. But that's okay, having the extra optional whitespace inserted there doesn't really hurt us either.
Now that we have thoroughly tested our whitespace-or-comments parser, we can use it to enrich our words-and-numbers parser:
(def words-and-numbers-auto-whitespace-and-comments
(insta/parser
"sentence = token+
<token> = word | number
word = #'[a-zA-Z]+'
number = #'[0-9]+'"
:auto-whitespace whitespace-or-comments))
=> (words-and-numbers-auto-whitespace-and-comments " abc 123 (* 456 *) (* (* 7*) 89 *) def ")
[:sentence [:word "abc"] [:number "123"] [:word "def"]]
=> words-and-numbers-auto-whitespace-and-comments
sentence = token+ <ws-or-comments?>
inside-comment = !(<whitespace?> "*)" | <whitespace?> "(*") <whitespace?> #"." | comment
comment = <whitespace?> "(*" inside-comment* <whitespace?> "*)"
comments = comment+
ws-or-comments = (<whitespace?> #"\s+" | comments) <whitespace?>
whitespace = #"\s+"
<token> = word | number
word = <ws-or-comments?> #"[a-zA-Z]+"
number = <ws-or-comments?> #"[0-9]+"
Note that this feature is only useful in grammars where all the strings and regexes are, conceptually, the "tokens" of your language. Occasionally, you'll see situations where grammars specify tokens through rules that build up the tokens character-by-character, for example:
month = ('M'|'m') 'arch'
If you try to use the auto-whitespace feature with a grammar like this, it will end up allowing space between the "m" and the "arch", which isn't what you want. The key is to try to express such tokens using a single regular expression:
month = #'[Mm]arch'
### Predefined whitespace parsers
There's no doubt that the following whitespace rule is by far the most common:
whitespace = #"\s+"
So for this common case, there's no need to create a separate whitespace parser. You can access this predefined whitespace parser with the option:
:auto-whitespace :standard
At this time, one other predefined whitespace parser is available, for Clojure-like parsing tasks where the comma is also treated as whitespace. The rule that will be added to your grammar is:
whitespace = #"[,\s]+"
and you can access it with the option:
:auto-whitespace :comma
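As a quick sketch of the `:comma` variant, here's the words-and-numbers grammar from above again; the output is what I'd expect, not copied from a REPL:

```clojure
(def words-and-numbers-comma-whitespace
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :comma))

(words-and-numbers-comma-whitespace "abc,123, 45 de")
;; => [:sentence [:word "abc"] [:number "123"] [:number "45"] [:word "de"]]
```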
Let me know what you think of the auto-whitespace feature. Is it sufficiently simple and useful to belong in the instaparse library?
# Instaparse Performance Notes
In the instaparse tutorial, I make the claim that instaparse is performant without really defining what I mean. I explained that I've spent a lot of time on optimization, without really specifying what I'm trying to optimize. In this document, I'd like to [elaborate on these points](https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#specific-performance-goals), and talk a bit about how I view [instaparse's role](https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#the-role-of-instaparse) in the parser ecosystem. Finally, I'll provide [specific tips on how to get good performance from instaparse parsers](https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips).
## A bit of history
For decades, parsing has been considered a "solved problem" because there are well-known algorithms that can parse a stream of text blazingly fast, in a single linear pass, using minimal memory. The catch is that these algorithms only apply to certain types of context-free grammars -- these classes of easily-parsed grammars go by names like LL(1) and LALR(1), acronyms describing the parsing technique that applies. The good news is that most context-free grammars can, with some effort, be converted into the kind of format required by parsing algorithms. Furthermore, if you are knowledgable about parsing algorithms and are the one constructing the language / data format to be parsed, you can intentionally constrain the syntax to ensure that it can easily be parsed.
If you can do that, great! If there's already a parser written for the kind of data you're working with, even better! However, the programming world is awash with ad hoc config files and data files that don't use an existing standard like XML or JSON. Sometimes you find yourself needing to work with something that's a little too complex to tease apart just with regular expressions, yet hard to justify the time and energy it would take to study up on LL, LALR, etc. and learn how to parse the data within the constraints of tools using those parsing algorithms.
## The role of instaparse
That's where instaparse comes in. Instaparse can handle arbitrary context-free grammars written using standard notation, so it's easy to apply it, even for a quickie one-time parsing task.
Shortly after the release of instaparse, there were a couple great testimonial blog posts about instaparse. [This blog post by Brandon Harvey](http://walkwithoutrhythm.net/blog/2013/05/16/instaparse-parsing-with-clojure-is-the-new-black/) especially made my day, because it perfectly captured what I had hoped to achieve with instaparse.
In his blog post, Brandon describes some cave data that he wanted to parse. Ideally, he wanted to figure out how to get "from a big fat unwieldy string to a nice, regular tree-shaped data structure in 20 minutes or less." The cave data is clearly structured and looks kind of like JSON, but it isn't quite JSON.
First, he tried using another Clojure parsing library (a rather excellent library provided you're working with a grammar that fits its constraints), but couldn't figure out how to express his grammar in a way that worked. He got bogged down with a bunch of shift/reduce conflicts and other errors that he didn't know how to interpret without understanding the underlying machinery. Using instaparse, he expressed the grammar in the way that seemed most natural, and it worked.
This brings me to a point I'd like to make before discussing performance:
*Instaparse aims to be more flexible than traditional parser libraries --- more tolerant of grammars that are ambiguous, require backtracking, or use a mixture of left and right recursion.*
To accomplish this, instaparse uses a fundamentally different algorithm than those found in traditional parser libraries, which achieve their speeds and performance guarantees by restricting lookahead and limiting backtracking.
## Specific performance goals
With that disclaimer in mind, here are the specifics of what I strive for:
+ For typical, real-world grammars, I want the running time to be linear with respect to the size of the input. In other words, if you double the size of your text, it should take about twice as long to parse. (Of course, I'm using Clojure data structures, so in practice, the running time is more like O(n * log32 n), but that's pretty close to linear.)
+ If your grammar is unambiguous and LL(1), the parser should be competitive with parsers generated by tools that *only* accept unambiguous LL(1) grammars (i.e., within some reasonable constant factor).
+ If you have a reasonable grammar, even one that isn't expressed in "just the right way", it should still have solid performance.
+ Performance should degrade gracefully as you incorporate more ambiguity and heavy backtracking into the grammar.
Roughly speaking, the goal is for instaparse to be performant in the same sense that Clojure is performant. Clojure is not quite as fast as languages like Java or C++ and consumes considerably more memory, but we use it because it offers greater expressivity and flexibility with enough speed to be useful for a wide range of tasks.
## Specific optimizations
There were a lot of algorithmic coding decisions that I made by benchmarking multiple alternatives and data structures. I won't go into them all here. My aim in this section is to give you a sense for how I go about optimizing and what sorts of things I focus on.
Here is the gist of my optimization process: I take a grammar, try it on increasingly large inputs, and track the running-time growth. If the growth is quadratic (or worse), I profile and investigate to try to track down the offending code and rework it into linear behavior. My goal is to ensure that as many grammars as possible have linear growth.
As I mentioned in the tutorial, one of the first things I noticed in my profiling was how critical hashing was. This is a great example of how an algorithm that seems like it *should* be linear can go awry without careful attention to implementation details. We all know that inserting something into a hash map is essentially constant time, so we take that for granted in our analysis. As long as the algorithm only performs O(n) insertions/lookups in the hash table, it should have linear performance, right? Well, if the thing you are inserting into the hash table takes O(n) time to compute the hash, you're in big trouble!
So the first big accomplishment of my optimization efforts was to reduce the hashing time to constant for all the information cached by instaparse. Version 1.2 of instaparse sports two new equally significant performance improvements:
First, I discovered that on long texts with long repeating sequences, linear-time concatenation of the internal partial tree results was a huge bottleneck, leading to overall quadratic behavior. So in 1.2, I converted over to using a custom data structure with O(1) concatenation. RRB-trees would be another data structure that could potentially solve my concatenation problem, so this is something I intend to look at after the Clojure implementation of RRB-trees matures.
The other major performance improvement in 1.2 compensated for an unfortunate change that Oracle made in Java 1.7 to the String class, changing Strings so that the substring operation is O(n) rather than O(1), copying the substring into a freshly allocated string. Instaparse handles regular expressions by testing the regular expression against a substring of the input text that skips past the part of the text already parsed. This strategy, which creates rather large substrings frequently, needed to be modified in light of Java 1.7's poor substring behavior.
With these version 1.2 modifications in place, I'm now getting linear-time behavior for all the parsers in my test suite that aren't explicitly designed to demonstrate huge amounts of ambiguity. This is exactly where I want instaparse to be.
## Memory
When talking about performance, the other big discussion point is, of course, memory consumption. As I mentioned in the tutorial, instaparse does use a lot of memory. There's really no way around this; it all comes back to my earlier point that instaparse aims to gracefully handle arbitrary levels of ambiguity and backtracking, which means that the entire text needs to reside in memory and lots of intermediate results need to be cached.
Instaparse's own syntax for context-free grammars is parsed by an instaparse parser, and is a great example of the practical value of backtracking.
Consider the following grammar. The actual semantics of the grammar is not important here, just think about the syntax of the grammar specification and consider how instaparse's `parser` function needs to parse the grammar string as a series of rules:
(insta/parser
"A = B B
B = 'b'")
You might expect instaparse to impose a requirement that each line of the grammar be clearly terminated by an end-of-line character, such as `;` or a newline, but in fact, instaparse's CFG parser has no problem if you write out the grammar all mushed together on one line:
(insta/parser "A = B B B = 'b'")
Working from left-to-right, when it processes the third `B`, it is entirely possible that what it has seen so far should be interpreted as the rule:
A = B B B
But when it encounters the `=`, it realizes that the only sensible interpretation is for the third `B` to be the beginning of a new rule, and instaparse sorts it all out.
Taken to an extreme, consider the parser defined by the following grammar:
S = 'ab'+ | 'a' 'ba'+
If you use this parser to parse a long string of "abababab...aba", there's no way to determine when looking at the first 'a' which way to interpret it. The parser can try one path, perhaps assuming that it is part of the `'ab'+` rule, but it won't know until it gets to the very end of the string that it has chosen incorrectly, and has to back up and try another path. Looking at this example, it should be clear that there's no way to parse the input string in a single linear pass with bounded memory.
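To see the kind of late-breaking decision this grammar forces, here's a small sketch (the outputs are what I'd expect):

```clojure
(def ab-parser (insta/parser "S = 'ab'+ | 'a' 'ba'+"))

(ab-parser "ababab")
;; => [:S "ab" "ab" "ab"]

(ab-parser "ababa")   ; only the second alternative can consume the whole string
;; => [:S "a" "ba" "ba"]
```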
For this reason, I haven't put as much effort into optimizing memory usage -- a lot of data needs to be retained throughout the parsing process, and there simply is less scope for improvement, I think. Certainly Java 1.7's substring behavior was causing massive memory churn, so the changes I made in instaparse 1.2 will also benefit the memory side of the performance equation. But other than that, I haven't found any big wins for optimizing memory consumption.
In theory, I can imagine that there might be a way to intelligently figure out which cached data can be safely discarded, but in the context of left-recursion this is an extremely hard problem to solve. Chalk this up as a future research problem, but one that is not likely to bear fruit in the short-term. I have made one step in this direction which I will detail further in the section below about performance tips.
## Performance Tips
Occasionally, I receive a question about whether there's a *best* way to write instaparse grammars for maximum performance. I've tried very hard to make it so that instaparse's performance isn't ultra-sensitive to the exact way you word the grammar. My hope is that most people will find these performance tips to be completely unnecessary. However, for those that are interested, here are some recommendations:
1. Instaparse's algorithm is in the family of LL parsing algorithms. So if you know how to easily write your grammar as an LL grammar, that's probably going to yield the best possible performance. If not, don't worry about it.
2. If your token is a string, use a string literal, not a regular expression. For example, prefer `'apple'` to `#'apple'`.
3. When the greedy behavior of regular expressions is what you want, prefer using `*` and `+` *inside* the regular expression rather than outside. This comes up very commonly in processing whitespace. In most applications, once you hit whitespace, you want to eat up all the whitespace to get to the next token. So you'll get better performance with `#'\\s*'` than with `#'\\s'*`. In my parsers, I routinely have a rule for optional whitespace that looks like `ows = #'\\s*'` and then I sprinkle `<ows>` liberally in my other rules wherever I want to potentially allow whitespace.
4. Related to the previous point, prefer using regular expressions to define tokens in their entirety rather than using instaparse to build up the tokens by analyzing the string character by character. For example, if an identifer in your language is a letter followed by a series of letters or digits, you'll be better off with the rule
Identifier = #'[a-zA-Z][a-zA-Z0-9]*'
rather than
Identifier = Letter LetterOrDigit*
Letter = #'[a-zA-Z]'
LetterOrDigit = #'[a-zA-Z0-9]'
5. Remove as much ambiguity from your grammar as you can. Instaparse works with ambiguous grammars, but dealing with that ambiguity can take a toll on performance. Use the `insta/parses` function on a variety of sample inputs in order to troubleshoot your grammar and discover ways in which your inputs might have multiple interpretations.
6. Even if `insta/parses` returns a single answer, think about whether you've created a lot of *internal ambiguity*, i.e., situations where the parser won't be able to work out the interpretation of the text until it has gotten much further along. One way to analyze this is to test the various rules in your grammar using `insta/parses` with the `:partial true` flag to get a feel for how many scenarios it has to consider before it can be sure it has found the whole chunk of text defined by that rule.
7. Watch out for ambiguity in your hidden content. One time I was working with a grammar that I was convinced was unambiguous -- `insta/parses` always returned a single answer. However, it turned out that the definition of whitespace was highly ambiguous. I didn't realize it because the whitespace was hidden. To help diagnose these sorts of problems, try running `insta/parses` with the `:unhide :all` flag.
8. Prefer Java 1.7. I've received one report where instaparse, running on Java 1.6, was running out of memory on a large input, whereas the exact same grammar on the same input ran perfectly fine on Java 1.7.
9. Prefer using `*` and `+` over recursion to describe simple repetition. For example, the rule:
    A = 'a'+
can be internally optimized in ways that
    A = 'a' A | 'a'
cannot.
10. Feed instaparse smaller chunks of text. The reality is that most large parsing tasks involve a series of individual data records that could potentially be parsed independently of one another. As has been discussed earlier in this document, if you feed instaparse the entire block of text, instaparse has to assume the worst -- that it might encounter some sort of failure that causes it to go back and reinterpret all the text it has processed so far. Consider preprocessing the text, chopping it into strings representing the individual data records, and passing the smaller strings into instaparse in order to limit the scope of what possibilities it needs to consider and how much history it needs to track.
For example, I saw one grammar where each line of text represented a new record, and the grammar looked like:
    document = line+
    line = ...
Instead of applying this grammar to the entire document at once, why not build a parser where `line` is the top-level starting rule, and then map this parser over a `line-seq` of the text? (A sketch of this approach appears after this list.)
I've added a new, experimental `:optimize :memory` flag that attempts to automate this kind of preprocessing, chopping the text into smaller independent chunks in order to use less memory. This only works on grammars that describe these sorts of repeated data records (possibly with a header at the beginning of the file). If instaparse can't find the pattern or runs into any sort of failure, it will fall back to its usual parsing strategy in order to make sure it has considered all possibilities. Using this flag will likely slow down your parser, but if your data lends itself to this alternative strategy, you'll use much less memory.
I consider the `:optimize :memory` flag to be an *alpha* feature, subject to change. If you try it and find it useful, or try it on something where you'd expect it to help and it doesn't, please send me your feedback.
11. As of version 1.2, the enlive output format is slightly faster than hiccup. This may change in the future, so I don't recommend that you base your choice of output format on this slight differential. However, if you're trying to eke out the best possible performance, you might find it useful to experiment with both output formats to see whether one performs better for you than the other.
12. As of version 1.4, instaparse has a way to print a trace of the parser's execution process, as well as some profiling information which can be useful to determine whether your parser behaves linearly with respect to the size of the input. [Read about the new tracing feature here.](https://github.com/Engelberg/instaparse/blob/master/docs/Tracing.md)
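To tie several of these tips together, here is a minimal, hypothetical sketch. The grammar, the sample inputs, and the `parse-records` function are invented purely for illustration; the pieces it exercises (string literals for fixed tokens, regex-defined tokens, a hidden greedy-whitespace rule, `insta/parses` with `:partial` and `:unhide`, and mapping a per-record parser over a `line-seq`) are just the techniques from tips 2-7 and 10 above.

```clojure
(require '[instaparse.core :as insta]
         '[clojure.java.io :as io])

;; A small, invented per-record grammar: each whole token is a single regex
;; (tip 4), the fixed '=' token is a string literal (tip 2), and optional
;; whitespace is one greedy regex, hidden with <ows> at each use (tip 3).
(def line-parser
  (insta/parser
    "line = ident <ows> <'='> <ows> number
     ident = #'[a-zA-Z][a-zA-Z0-9]*'
     number = #'[0-9]+'
     ows = #'\\s*'"))

;; Tips 5-7: look for ambiguity, internal ambiguity, and hidden ambiguity.
(count (insta/parses line-parser "answer = 42"))                ; more than 1 means ambiguity
(insta/parses line-parser "answer = 42 and more" :partial true) ; parses that match a prefix of the input
(insta/parses line-parser "answer = 42" :unhide :all)           ; include the hidden whitespace in the results

;; Tip 10: parse one record per line instead of the whole file at once
;; (the file path is a placeholder).
(defn parse-records [path]
  (with-open [rdr (io/reader path)]
    (doall (map line-parser (line-seq rdr)))))
```

For a large file whose records you would rather not split yourself, the experimental `:optimize :memory` flag described in tip 10 is the built-in alternative to this kind of manual chopping.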
instaparse-1.4.7/docs/Tracing.md

# Tracing
Instaparse 1.4.0 and up (in Clojure only) features the ability to look at a trace of what the parser is doing. As an example, let's take a look at the as-and-bs parser from the tutorial.
```
=> (as-and-bs "aaabb")
[:S [:AB [:A "a" "a" "a"] [:B "b" "b"]]]
```
Now let's look at a trace. We do this by calling the parser with the optional keyword argument `:trace true`. `insta/parse` and `insta/parses` both can take this optional argument.
```
=> (as-and-bs "aaabb" :trace true)
```
One of my design goals for the tracing feature was that if you don't use it, you shouldn't pay a performance penalty. So by default, the parsing code is not instrumented for tracing. The very first time you call a parser with `:trace true`, you may notice a slight pause as instaparse recompiles itself to support tracing. The trace then prints to standard out and looks like this:
```
Initiating full parse: S at index 0 (aaabb)
Initiating full parse: AB* at index 0 (aaabb)
Initiating parse: AB at index 0 (aaabb)
Initiating parse: A B at index 0 (aaabb)
Initiating parse: A at index 0 (aaabb)
Initiating parse: "a"+ at index 0 (aaabb)
Initiating parse: "a" at index 0 (aaabb)
Result for "a" at index 0 (aaabb) => "a"
Result for "a"+ at index 0 (aaabb) => ("a")
Result for A at index 0 (aaabb) => [:A "a"]
Initiating parse: B at index 1 (aabb)
Initiating parse: "b"+ at index 1 (aabb)
Initiating parse: "b" at index 1 (aabb)
No result for "b" at index 1 (aabb)
Initiating parse: "a" at index 1 (aabb)
Result for "a" at index 1 (aabb) => "a"
Result for "a"+ at index 0 (aaabb) => ("a" "a")
Result for A at index 0 (aaabb) => [:A "a" "a"]
Initiating parse: B at index 2 (abb)
Initiating parse: "b"+ at index 2 (abb)
Initiating parse: "b" at index 2 (abb)
No result for "b" at index 2 (abb)
Initiating parse: "a" at index 2 (abb)
Result for "a" at index 2 (abb) => "a"
Result for "a"+ at index 0 (aaabb) => ("a" "a" "a")
Result for A at index 0 (aaabb) => [:A "a" "a" "a"]
Initiating parse: B at index 3 (bb)
Initiating parse: "b"+ at index 3 (bb)
Initiating parse: "b" at index 3 (bb)
Result for "b" at index 3 (bb) => "b"
Result for "b"+ at index 3 (bb) => ("b")
Result for B at index 3 (bb) => [:B "b"]
Result for A B at index 0 (aaabb) => ([:A "a" "a" "a"] [:B "b"])
Result for AB at index 0 (aaabb) => [:AB [:A "a" "a" "a"] [:B "b"]]
Initiating parse: AB at index 4 (b)
Initiating parse: A B at index 4 (b)
Initiating parse: A at index 4 (b)
Initiating parse: "a"+ at index 4 (b)
Initiating parse: "a" at index 4 (b)
No result for "a" at index 4 (b)
Initiating parse: "b" at index 4 (b)
Result for "b" at index 4 (b) => "b"
Result for "b"+ at index 3 (bb) => ("b" "b")
Result for B at index 3 (bb) => [:B "b" "b"]
Result for A B at index 0 (aaabb) => ([:A "a" "a" "a"] [:B "b" "b"])
Result for AB at index 0 (aaabb) => [:AB [:A "a" "a" "a"] [:B "b" "b"]]
Result for AB* at index 0 (aaabb) => ([:AB [:A "a" "a" "a"] [:B "b" "b"]])
Result for S at index 0 (aaabb) => [:S [:AB [:A "a" "a" "a"] [:B "b" "b"]]]
Successful parse.
Profile: {:push-message 21, :push-result 21, :push-listener 24, :push-stack 26, :push-full-listener 2, :create-node 26}
[:S [:AB [:A "a" "a" "a"] [:B "b" "b"]]]
```
Let me explain what some of these lines mean.
```
Initiating full parse: S at index 0 (aaabb)
```
A "full parse" means that it only succeeds if it consumes the entire string. Usually, we're looking to completely parse an entire string, and that's what "full parse" reflects.
It is important to understand that the word "initiating" does not necessarily mean that it is starting to work on that parse sub-problem right away. It just means that we're putting it on a stack of sub-problems to try to solve.
Notice the `(aaabb)` in parens. This is giving us the next several characters from this point in the string, which makes it a little easier to see at a glance where we are in the string (although, of course the index number can always be used to figure it out precisely).
```
Initiating full parse: AB* at index 0 (aaabb)
Initiating parse: AB at index 0 (aaabb)
```
Note that AB* needs to be a full parse to be satisfied, but that kicks off another subproblem, which is to look for a parse of AB (not necessarily a full parse) at index 0.
```
Initiating parse: A at index 0 (aaabb)
Initiating parse: "a"+ at index 0 (aaabb)
Initiating parse: "a" at index 0 (aaabb)
Result for "a" at index 0 (aaabb) => "a"
Result for "a"+ at index 0 (aaabb) => ("a")
Result for A at index 0 (aaabb) => [:A "a"]
```
Note that after initiating a bunch of parse subtasks, we start to see some results. Again, the content in the parentheses is a look ahead at the next several characters in the string, just to get our bearings. The information after the `=>` is the parse result that was found. Typically, the parse results are found in reverse order from the order in which the subtasks are initiated, because when initiated, the subtasks are put on a stack.
```
No result for "b" at index 1 (aabb)
```
The tracing mechanism reports when tokens (i.e., strings or regular expressions) are sought but not found. In general, the tracing mechanism does not report when subtasks involving non-terminals fail (because internally, instaparse does not transmit failure messages between subtasks).
```
Result for S at index 0 (aaabb) => [:S [:AB [:A "a" "a" "a"] [:B "b" "b"]]]
Successful parse.
```
At the end, we see the final parse, followed by some profiling data:
```
Profile: {:push-message 21, :push-result 21, :push-listener 24, :push-stack 26, :push-full-listener 2, :create-node 26}
```
The details of the profiling data don't matter that much, other than to know that it's a measure of how much work instaparse had to do to come up with the result. Repeating the trace with an input of `"aaaaaabbbb"` we get the profiling results:
```
Profile: {:push-message 40, :push-result 40, :push-listener 48, :push-stack 50, :push-full-listener 2, :create-node 50}
```
The key here is that we doubled the length of the input string, and this doubled the amount of work that instaparse needed to do. That's good; it means that this parser behaves linearly with respect to its input size.

Even though the code is instrumented with tracing functionality, you still need to explicitly request the trace each time. If you don't request the trace, it won't display:
```
=> (as-and-bs "aaabb")
[:S [:AB [:A "a" "a" "a"] [:B "b" "b"]]]
```
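If you want to repeat this kind of linearity check on your own parser, one low-tech approach is to re-run it with `:trace true` on inputs of increasing size and compare the `Profile:` lines by eye. Here is a minimal sketch, assuming the `as-and-bs` parser from the tutorial is already defined:

```clojure
;; Run the traced parser on progressively larger inputs and compare the
;; Profile: line printed at the end of each trace; roughly proportional
;; counts suggest linear behavior on this kind of input.
(doseq [n [5 10 20 40]]
  (println "=== input of length" (* 2 n) "===")
  (as-and-bs (apply str (concat (repeat n \a) (repeat n \b))) :trace true))
```

The traces themselves will be long; only the final `Profile:` line of each run matters for this comparison.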
Now let's look at an example with negative lookahead. Here is the parser:
```
=> negative-lookahead-example
S = !"ab" ("a" | "b")+
=> (negative-lookahead-example "aabb")
[:S "a" "a" "b" "b"]
```
Let's run it with the trace:
```
=> (negative-lookahead-example "aabb" :trace true)
Initiating full parse: S at index 0 (aabb)
Initiating full parse: !"ab" ("a" | "b")+ at index 0 (aabb)
Initiating parse: !"ab" at index 0 (aabb)
Initiating parse: "ab" at index 0 (aabb)
No result for "ab" at index 0 (aabb)
Exhausted results for "ab" at index 0 (aabb)
Negation satisfied: !"ab" at index 0 (aabb)
Initiating full parse: ("a" | "b")+ at index 0 (aabb)
Initiating parse: "a" | "b" at index 0 (aabb)
Initiating parse: "b" at index 0 (aabb)
No result for "b" at index 0 (aabb)
Initiating parse: "a" at index 0 (aabb)
Result for "a" at index 0 (aabb) => "a"
Result for "a" | "b" at index 0 (aabb) => "a"
Initiating parse: "a" | "b" at index 1 (abb)
Initiating parse: "b" at index 1 (abb)
No result for "b" at index 1 (abb)
Initiating parse: "a" at index 1 (abb)
Result for "a" at index 1 (abb) => "a"
Result for "a" | "b" at index 1 (abb) => "a"
Initiating parse: "a" | "b" at index 2 (bb)
Initiating parse: "b" at index 2 (bb)
Result for "b" at index 2 (bb) => "b"
Result for "a" | "b" at index 2 (bb) => "b"
Initiating parse: "a" | "b" at index 3 (b)
Initiating parse: "b" at index 3 (b)
Result for "b" at index 3 (b) => "b"
Result for "a" | "b" at index 3 (b) => "b"
Result for ("a" | "b")+ at index 0 (aabb) => ("a" "a" "b" "b")
Result for !"ab" ("a" | "b")+ at index 0 (aabb) => ("a" "a" "b" "b")
Result for S at index 0 (aabb) => [:S "a" "a" "b" "b"]
Successful parse.
Profile: {:push-message 12, :push-result 12, :push-listener 14, :push-stack 17, :push-full-listener 3, :create-node 17}
[:S "a" "a" "b" "b"]
```
The interesting thing with negative lookahead (or ordered choice) is this sequence of lines:
```
Initiating parse: !"ab" at index 0 (aabb)
Initiating parse: "ab" at index 0 (aabb)
No result for "ab" at index 0 (aabb)
Exhausted results for "ab" at index 0 (aabb)
Negation satisfied: !"ab" at index 0 (aabb)
```
To do negative lookahead, the parser sets up a subtask to try to parse the very thing we want to avoid. If the parser runs out of work to do, then the trace tells us that the negation was in fact satisfied.
When you are done tracing, you probably will want to recompile the code without all the tracing and profiling instrumentation. You can either restart the REPL or just type:
```
=> (insta/disable-tracing!)
nil
```
instaparse-1.4.7/images/vizexample1.png (binary PNG image data omitted)