Janet 1.38.0-73334f3 Documentation
(Other Versions:
1.37.1
1.36.0
1.35.0
1.34.0
1.31.0
1.29.1
1.28.0
1.27.0
1.26.0
1.25.1
1.24.0
1.23.0
1.22.0
1.21.0
1.20.0
1.19.0
1.18.1
1.17.1
1.16.1
1.15.0
1.13.1
1.12.2
1.11.1
1.10.1
1.9.1
1.8.1
1.7.0
1.6.0
1.5.1
1.5.0
1.4.0
1.3.1
)
Parsing Expression Grammars
A common programming task is recognizing patterns in text, be it filtering emails from a list or extracting data from a CSV file. Programming languages and libraries usually offer a number of tools for this, including prebuilt parsers, simple operations on strings (splitting a string on commas), and regular expressions. The pre-built or custom-built parser is usually the most robust solution, but can be very complex to maintain and may not exist for many languages. String functions are not powerful enough for a large class of languages, and regular expressions can be hard to read (which characters are escaped?) and under-powered (don't parse HTML with regex!).
PEGs, or Parsing Expression Grammars, are another formalism for recognizing
languages. PEGs are easier to write than a custom parser and more powerful than
regular expressions. They also can produce grammars that are easily
understandable and fast. PEGs can also be compiled to a bytecode format that can
be reused. Janet offers the peg
module for writing and evaluating PEGs.
Janet's peg
module borrows syntax and ideas from both LPeg and REBOL/Red
parse module. Janet has no built-in regex module because PEGs offer a superset
of the functionality of regular expressions.
Below is a simple example for checking if a string is a valid IP address. Notice how the grammar is descriptive enough that you can read it even if you don't know the PEG syntax (example is translated from a RED language blog post).
(def ip-address
'{:dig (range "09")
:0-4 (range "04")
:0-5 (range "05")
:byte (choice
(sequence "25" :0-5)
(sequence "2" :0-4 :dig)
(sequence "1" :dig :dig)
(between 1 2 :dig))
:main (sequence :byte "." :byte "." :byte "." :byte)})
(peg/match ip-address "0.0.0.0") # -> @[]
(peg/match ip-address "elephant") # -> nil
(peg/match ip-address "256.0.0.0") # -> nil
(peg/match ip-address "0.0.0.0more text") # -> @[]
The API
The peg
module has few functions because the complexity is exposed
through the pattern syntax. Note that there is only one match function,
peg/match
. Variations on matching, such as parsing or searching, can be
implemented inside patterns. PEGs can also be compiled ahead of time with
peg/compile
if a PEG will be reused many times.
(peg/match peg text [,start=0] & arguments)
Match a PEG against some text. Returns an array of captured data if the text
matches, or nil
if there is no match. The caller can provide an optional
start
index to begin matching, otherwise the PEG starts on the first
character of text. A PEG can either be a compiled PEG object or PEG source.
(peg/compile peg)
Compiles a PEG source data structure into a new PEG. Throws an error if there are problems with the PEG code.
Primitive patterns
Larger patterns are built up with primitive patterns, which recognize individual
characters, string literals, or a given number of characters. A character in
Janet is considered a byte, so PEGs will work on any string of bytes. No special
meaning is given to the 0
byte, or the string terminator as in many
languages.
Pattern Signature | What it matches |
---|---|
"cat" (a string) | The literal string. |
3 (an integer) | If positive, matches a number of characters, and advances that many characters. If 0, counts as an (empty string) match. If negative, matches if not that many characters and does not advance. For example, -1 will match the end of a string. |
(range "az" "AZ") |
Matches characters in a range and advances 1 character. Multiple ranges can be combined together. |
(set "abcd") |
Match any character in the argument string. Advances 1 character. |
true |
Always matches. Does not advance any characters. Epsilon in NFA. |
false |
Never matches. Does not advance any characters. Equivalent to
(! true) .
|
Primitive patterns are not that useful by themselves, but can be passed to
peg/match
and peg/compile
like any other pattern.
(peg/match "hello" "hello") # -> @[]
(peg/match "hello" "hi") # -> nil
(peg/match 1 "hi") # -> @[]
(peg/match 1 "") # -> nil
(peg/match '(range "AZ") "F") # -> @[]
(peg/match '(range "AZ") "-") # -> nil
(peg/match '(set "AZ") "F") # -> nil
(peg/match '(set "ABCDEFGHIJKLMNOPQRSTUVWXYZ") "F") # -> @[]
Combining patterns
These primitive patterns can be combined with several combinators to match a wide number of languages. These combinators can be thought of as the looping and branching forms in a traditional language (that is how they are implemented when compiled to bytecode).
Pattern Signature | What it matches |
---|---|
(choice a b c ...) |
Tries to match a, then b, and so on. Will succeed on the first successful match, and fails if none of the arguments match the text. |
(+ a b c ...) |
Alias for (choice a b c ...) |
(sequence a b c) |
Tries to match a, b, c and so on in sequence. If any of these arguments fail to match the text, the whole pattern fails. |
(* a b c ...) |
Alias for (sequence a b c ...) |
(any x) |
Matches 0 or more repetitions of x. |
(some x) |
Matches 1 or more repetitions of x. |
(between min max x) |
Matches between min and max (inclusive) repetitions of x. |
(at-least n x) |
Matches at least n repetitions of x. |
(at-most n x) |
Matches at most n repetitions of x. |
(repeat n x) |
Matches exactly n repetitions of x. |
(if cond patt) |
Tries to match patt only if cond matches as well. cond will not produce any captures. |
(if-not cond patt) |
Tries to match only if cond does not match. cond will not produce any captures. |
(not patt) |
Matches only if patt does not match. Will not produce captures or advance any characters. |
(! patt) |
Alias for (not patt) |
(look ?offset patt) |
Matches only if patt matches at a fixed offset. offset can be any integer and defaults to 0. The peg will not advance any characters. |
(> offset ?patt) |
Alias for (look offset ?patt) |
(to patt) |
Match up to patt (but not including it). If the end of the input is reached and patt is not matched, the entire pattern does not match. |
(thru patt) |
Match up through patt (thus including it). If the end of the input is reached and patt is not matched, the entire pattern does not match. |
(backmatch ?tag) |
If tag is provided, matches against the tagged capture. If no tag is provided, matches against the last capture, but only if that capture is untagged. The peg advances if there was a match. |
(opt patt) |
Alias for (between 0 1 patt) |
(? patt) |
Alias for (between 0 1 patt) |
(n patt) |
Alias for (repeat n patt) |
(sub window-patt patt) |
Match window-patt , and if it succeeds match
patt against the bytes that window-patt matched.
patt cannot match more than window-patt ; it will
see end-of-input at the end of the substring matched by
window-patt . If patt also succeeds, sub will
advance to the end of what window-patt matched. |
(split separator-patt patt) |
Split the remaining input by separator-patt , and execute
patt on each substring. patt will execute with its
input constrained to the next instance of separator-patt ,
as if narrowed by (sub (to separator-patt) ...) .
split will continue to match separators and patterns until
it reaches the end of the input; if you don't want to match to the
end of the input you should first narrow it with
(sub ... (split ...)) . |
PEGs try to match an input text with a pattern in a greedy manner. This means
that if a rule fails to match, that rule will fail and not try again. The only
backtracking provided in a PEG is provided by the (choice x y z ...)
special, which will try rules in order until one succeeds, and the whole pattern
succeeds. If no sub-pattern succeeds, then the whole pattern fails. Note that
this means that the order of x y z
in choice does matter. If
y
matches everything that z
matches, z
will never succeed.
Captures
So far we have only been concerned with "does this text match this language?".
This is useful, but it is often more useful to extract data from text if it does
match a PEG. The peg
module uses the concept of a capture stack to
extract data from text. As the PEG is trying to match a piece of text, some
forms may push Janet values onto the capture stack as a side effect. If the text
matches the main PEG language, (peg/match)
will return the final capture
stack as an array.
Capture specials will only push captures to the capture stack if their child
pattern matches the text. Most captures specials will match the same text as
their first argument pattern. In addition, most specials that produce captures
can take an optional argument tag
that applies a keyword tag to the
capture. These tagged captures can then be recaptured via the (backref
tag)
special in subsequent matches. Tagged captures, when combined with the
(cmt)
special, provide a powerful form of look-behind that can make many
grammars simpler.
Pattern Signature | What it matches |
---|---|
(capture patt ?tag) |
Captures all of the text in patt if patt matches. If patt contains any captures, then those captures will be pushed on to the capture stack before the total text. |
(<- patt ?tag) |
Alias for (capture patt ?tag) |
(quote patt ?tag) |
Another alias for (capture patt ?tag) . This allows code
like 'patt to capture a pattern. |
(group patt ?tag) |
Captures an array of all of the captures in patt. |
(replace patt subst ?tag) |
Replaces the captures produced by patt by applying subst to them.
If subst is a table or struct, will push (get subst last-capture)
to the capture stack after removing the old captures. If subst is a
function, will call subst with the captures of patt as arguments and
push the result to the capture stack. Otherwise, will push subst
literally to the capture stack. |
(/ patt subst ?tag) |
Alias for (replace patt subst ?tag) |
(constant k ?tag) |
Captures a constant value and advances no characters. |
(argument n ?tag) |
Captures the nth extra argument to the match function and does not advance. |
(position ?tag) |
Captures the current index into the text and advances no input. |
($ ?tag) |
Alias for (position ?tag) . |
(column ?tag) |
Captures the column number of the current position in the matched text. |
(line ?tag) |
Captures the line number of the current position in the matched text. |
(accumulate patt ?tag) |
Capture a string that is the concatenation of all captures in patt. This will try to be efficient and not create intermediate strings if possible. |
(% patt ?tag) |
Alias for (accumulate patt ?tag) |
(cmt patt fun ?tag) |
Invokes fun with all of the captures of patt as arguments (if patt matches). If the result is truthy, then captures the result. The whole expression fails if fun returns false or nil. |
(backref tag ?tag) |
Duplicates the last capture with the tag tag . If no such
capture exists then the match fails. |
(-> tag ?tag) |
Alias for (backref tag) . |
(unref patt ?tag) |
Lets a user "scope" tagged captures. After the rule has matched, all captures with the given tag can no longer be referred to by their tag. However, such captures from outside rule are kept as is. If no tag is given, all tagged captures from rule are unreferenced. Note that this doesn't drop the captures, merely removes their association with the tag. This means subsequent calls to backref and backmatch will no longer "see" these tagged captures. |
(error ?patt) |
Throws a Janet error. If optional argument patt is
provided and it matches successfully, the error thrown will be the
last capture of patt , or a generic error if patt
produces no captures. If no argument is provided, a generic error is
thrown. If patt does not match, no error will be thrown.
|
(drop patt) |
Ignores (drops) all captures from patt. |
(only-tags patt) |
Ignores all captures from patt , while making tagged
captures within patt available for future back-referencing. |
(nth index patt ?tag) |
Capture one of the captures in patt at index .
If no such capture exists, then the match fails. |
(uint num-bytes ?tag) |
Capture a little endian, unsigned, two's complement integer from
num-bytes . Works for 1 to 8 byte integers. |
(uint-be num-bytes ?tag) |
Capture a big endian, unsigned, two's complement integer from
num-bytes . Works for 1 to 8 byte integers. |
(int num-bytes ?tag) |
Capture a little endian, signed, two's complement integer from
num-bytes . Works for 1 to 8 byte integers. |
(int-be num-bytes ?tag) |
Capture a big endian, signed, two's complement integer from
num-bytes . Works for 1 to 8 byte integers. |
(lenprefix n patt) |
Matches n repetitions of a pattern, where n is supplied from other parsed input and is not a constant. |
(number patt ?base ?tag) |
Capture a number parsed from the matched pattern. This uses the
same parser as Janet itself, so it supports all numeric notation
allowed in Janet files. To tag the capture without forcing a
particular base, use (number patt nil :tag) . |
Grammars and recursion
The feature that makes PEGs so much more powerful than pattern matching solutions like (vanilla) regex is mutual recursion. To do recursion in a PEG, you can wrap multiple patterns in a grammar, which is a Janet struct. The patterns must be named by keywords, which can then be used in all sub-patterns in the grammar.
Each grammar, defined by a struct, must also have a main rule, called
:main
, that is the pattern that the entire grammar is defined by.
An example grammar that uses mutual recursion:
(def my-grammar
'{:a (* "a" :b "a")
:b (* "b" (+ :a 0) "b")
:main (* "(" :b ")")})
(peg/match my-grammar "(bb)") # -> @[]
(peg/match my-grammar "(babbab)") # -> @[]
(peg/match my-grammar "(baab)") # -> nil
(peg/match my-grammar "(babaabab)") # -> nil
Keep in mind that recursion is implemented with a stack, meaning that very recursive grammars can overflow the stack. The compiler is able to turn some recursion into iteration via tail-call optimization, but some patterns may fail on large inputs. It is also possible to construct (very poorly written) patterns that will result in long loops and be very slow in general.
Built-in patterns
Besides the primitive patterns and pattern combinators given above, the
peg
module also provides a default grammar with a handful of commonly
used patterns. All of these shorthands can be defined with the combinators above
and primitive patterns, but you may see these aliases in other grammars and they
can make grammars simpler and easier to read.
Name | Expanded | Description |
---|---|---|
:d | (range "09") | Matches an ASCII digit. |
:a | (range "az" "AZ") | Matches an ASCII letter. |
:w | (range "az" "AZ" "09") | Matches an ASCII digit or letter. |
:s | (set " \t\r\n\0\f\v") | Matches an ASCII whitespace character. |
:h | (range "09" "af" "AF") | Matches a hex character. |
:D | (if-not :d 1) | Matches a character that is not an ASCII digit. |
:A | (if-not :a 1) | Matches a character that is not an ASCII letter. |
:W | (if-not :w 1) | Matches a character that is not an ASCII digit or letter. |
:S | (if-not :s 1) | Matches a character that is not ASCII whitespace. |
:H | (if-not :h 1) | Matches a character that is not a hex character. |
:d+ | (some :d) | Matches 1 or more ASCII digits. |
:a+ | (some :a) | Matches 1 or more ASCII letters. |
:w+ | (some :w) | Matches 1 or more ASCII digits and letters. |
:s+ | (some :s) | Matches 1 or more ASCII whitespace characters. |
:h+ | (some :h) | Matches 1 or more hex characters. |
:d* | (any :d) | Matches 0 or more ASCII digits. |
:a* | (any :a) | Matches 0 or more ASCII letters. |
:w* | (any :w) | Matches 0 or more ASCII digits and letters. |
:s* | (any :s) | Matches 0 or more ASCII whitespace characters. |
:h* | (any :h) | Matches 0 or more hex characters. |
All of these aliases are defined in default-peg-grammar
, which is a table
that maps from the alias name to the expanded form. You can even add your own
aliases here which are then available for all PEGs in the program. Modifiying
this table will not affect already compiled PEGs.
String searching and other idioms
Although all pattern matching is done in anchored mode, operations like global substitution and searching can be implemented with the PEG module. A simple Janet function that produces PEGs that search for strings shows how captures and looping specials can be composed, and how quasiquoting can be used to embed values in patterns.
(defn finder
"Creates a peg that finds all locations of str in the text."
[str]
(peg/compile ~(any (+ (* ($) ,str) 1))))
(def where-are-the-dogs? (finder "dog"))
(peg/match where-are-the-dogs? "dog dog cat dog") # -> @[0 4 12]
# Our finder function also works on any pattern, not just strings.
(def find-cats (finder '(* "c" (some "a") "t")))
(peg/match find-cats "cat ct caat caaaaat cat") # -> @[0 7 12 20]
We can also wrap a PEG to turn it into a global substitution grammar with the
accumulate special (%)
.
(defn replacer
"Creates a peg that replaces instances of patt with subst."
[patt subst]
(peg/compile ~(% (any (+ (/ (<- ,patt) ,subst) (<- 1))))))