Pegex::Syntax(3pm)

Pegex::Syntax(3)      User Contributed Perl Documentation     Pegex::Syntax(3)

Pegex Syntax

       The term "Pegex" can mean both the Pegex Parser Framework and the
       Pegex Grammar Language Syntax that is used to write Pegex grammar
       files. This document details the Pegex Syntax.

       Pegex is a self-hosting language. That means that the grammar for
       defining the Pegex Language is written in the Pegex Language itself.
       You can see it for yourself here:
       <https://github.com/ingydotnet/pegex-pgx/blob/master/pegex.pgx>.

       I encourage you to take a quick look at that link even now. A Pegex
       grammar (like this one) is made up of 2 parts: a meta section and a
       rule section.

       The meta section just contains keyword/value meta attributes about the
       grammar, such as the grammar's name and version.

       The real meat of a Pegex grammar is in its rules. The very first rule
       of the grammar above is (basically):

           grammar: meta_section rule_section

       This says that a grammar IS a meta_section followed by a rule_section.
       But hey, we already knew that!

Meta Section

       The meta statements at the top of a grammar file look like this:

           %pegexKeyword value

       Let's look at the top of the pegex.pgx grammar:

           # This is the Pegex grammar for Pegex grammars!
           %grammar pegex
           %version 0.1.0

       This defines two meta values: "grammar" and "version", which specify
       the name and the version of the grammar, respectively.

       You'll also notice that the first line is a comment. Comments start
       with a "#" and go until the end of the line. Comments are allowed
       almost anywhere in the grammar: on their own lines, after statements,
       and even within regex definitions, as we will see later.

       The Pegex Meta Section ends when the Pegex Rule Section begins (with
       the first rule definition).

Rule Section

       The remainder of a Pegex grammar is a set of named rules. Each rule is
       a rule name, followed by a ':', followed by the definition of the rule,
       followed by a ';' or a newline.

       Here are a couple of rules from the pegex.pgx grammar. (These are the
       rules that start to define a rule!)

           rule_definition:
               rule_start
               rule_group
               ending

           rule_start: /
               ( rule_name )     # Capture the rule_name
               BLANK*
               COLON -
           /

       Rule definitions are infix expressions. They consist of tokens
       separated by operators, with parentheses to disambiguate binding
       precedence. There are 3 distinct token types and 3 operators.

       The 3 token types are: rule-reference, regex and error-message. The 3
       operators are AND (' '), OR ('|') and ALT ('%', '%%').

       Here's an example from a Pegex grammar for parsing JSON:

           json: hash | array
           array: / LSQUARE / ( node* % / COMMA / ) (
               / RSQUARE / | `missing ']'` )

       This is saying: "json is either a hash or array. array is '[', zero or
       more nodes separated by commas, and a ']'. error if no ']'".

       "hash", "array" and "node" are rule references, meaning that they refer
       to named rules within the grammar that must match at that point. Text
       surrounded by a pair of '/' chars forms a regex. Text surrounded by
       backticks is an error message.

       "LSQUARE", "RSQUARE" and "COMMA" are also rule references. Rules may be
       referred to inside of regexes, as long as they refer to regexes
       themselves. In this way big regexes can be assembled from smaller ones,
       thus leading to reuse and readability. Finally, the '*' after "node" is
       called a "quantifier". More about those later.

   Rule References
       A rule reference is the name of a rule inside angle brackets. The
       brackets are usually optional. Inside a regex, a rule reference without
       "<>" must be preceded by a whitespace character.

           <sub_rule_name>
           sub_rule_name

       When used outside a regex, a reference can have a number of prefix
       modifiers. Note that the angle brackets are not required here, but add
       to readability.

           =rule  # Zero-width positive assertion (look-ahead)
           !rule  # Zero-width negative assertion (look-ahead)
           .rule  # Skip (i.e. parse but don't capture a subpattern)
           -rule  # Flat (flatten the array captures)
           +rule  # Always wrap

       (Skipping and wrapping are explained in "Return Values".)

       A reference can also have a suffixed quantifier. Similar to regular
       expression syntax, a quantifier indicates how many times a rule
       (reference) should match.

           rule?      # optional
           rule*      # 0 or more times
           rule+      # 1 or more times
           <rule>8    # exactly 8 times
           <rule>2+   # 2 or more times
           <rule>2-3  # 2 or 3 times
           <rule>0-6  # 0 to 6 times

       Note that you must use angle brackets if you are using a numeric
       quantifier:

           rule8    # WRONG!  This would match rule "rule8".
           rule2+   # WRONG!  This would match rule "rule2", 1 or more times.
           rule2-3  # WRONG!  Pegex syntax error
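
       For readers who already know Perl regular expressions, the quantifier
       suffixes read much like counted repetition. The sketch below maps each
       suffix form listed above to its regular expression analogue. The helper
       name perl_quantifier is hypothetical and not part of Pegex; it only
       illustrates the correspondence and says nothing about how Pegex
       implements quantifiers internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pegex): translate a Pegex quantifier
# suffix into the analogous Perl regular expression quantifier.
sub perl_quantifier {
    my ($suffix) = @_;
    return $suffix   if $suffix =~ /\A[?*+]\z/;          # ?, * and + carry over unchanged
    return "{$1}"    if $suffix =~ /\A(\d+)\z/;          # 8   -> {8}
    return "{$1,}"   if $suffix =~ /\A(\d+)\+\z/;        # 2+  -> {2,}
    return "{$1,$2}" if $suffix =~ /\A(\d+)-(\d+)\z/;    # 2-3 -> {2,3}
    die "unrecognized quantifier '$suffix'\n";
}

# Print the analogue for every suffix form shown in the list above.
printf "%-4s => %s\n", $_, perl_quantifier($_) for qw( ? * + 8 2+ 2-3 0-6 );
```

       The main difference is that a Pegex quantifier repeats a whole rule
       reference rather than a single regular expression atom.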

       There is a special set of predefined "Atoms" that refer to regular
       expression fragments. Atoms exist for every punctuation character and
       for characters commonly found in regular expressions. Atoms enhance
       readability in grammar texts, and allow special characters (like slash
       or hash) to be used as Pegex syntax.

       For example, a regex to match a comment might be '#' followed by
       anything, followed by a newline. In Pegex, you would write:

           comment: / HASH ANY* EOL /

       instead of:

           comment: /#.*\r?\n/

       Pegex would compile the former into the latter.
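
       To make that compilation step concrete, here is a minimal Perl sketch
       of atom expansion using only core Perl. The %atoms table and the
       expand_atoms function are hypothetical stand-ins written for this
       example, with fragment values assumed so that they reproduce the
       regular expression shown above. Pegex's own compiler is more involved.

```perl
use strict;
use warnings;

# Hand-written stand-in for a few atoms; the values are assumptions chosen
# to reproduce the regular expression shown above.
my %atoms = (
    HASH => '#',
    ANY  => '.',
    EOL  => '\r?\n',
);

sub expand_atoms {
    my ($regex) = @_;
    # Swap each known atom name for its regular expression fragment.
    $regex =~ s/\b([A-Z]+)\b/exists $atoms{$1} ? $atoms{$1} : $1/ge;
    # Pegex regexes ignore whitespace, so drop it from the result.
    $regex =~ s/\s+//g;
    return $regex;
}

my $regexp = expand_atoms('HASH ANY* EOL');
print "$regexp\n";    # prints #.*\r?\n
print "comment matched\n" if "# just a note\n" =~ /\A(?:$regexp)\z/;
```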

       Here are some atoms:

           DASH    # -
           PLUS    # +
           TILDE   # ~
           SLASH   # /
           HASH    # # (literal)
           QMARK   # ? (literal)
           STAR    # * (literal)
           LPAREN  # ( (literal)
           RPAREN  # ) (literal)
           WORD    # \w
           WS      # \s

       The full list can be found in the Atoms source code:
       <https://metacpan.org/source/Pegex::Grammar::Atoms>.

   Regexes
       In Pegex we call the syntax for a regular expression a "regex". That
       is, when the term "regex" is used, it refers to Pegex syntax, and when
       the term "regular expression" is used, it refers to the actual regular
       expression that the regex is compiled into.

       A regex is a string inside forward slashes.

           /regex/

       The regex syntax mostly follows Perl, with the following exceptions:

           # Any rules in angle brackets are referenced in the regex
           / ( <rule1> | 'non_rule' ) /  # "non_rule" is interpreted literally

           # The syntax implies a /x modifier, so whitespace and comments are
           # ignored.
           / (
               rule1+   # Match rule1 one or more times
               |
               rule2
           ) /

           # Whitespace is declared with dash and plus.
           / - rule3 + /  # - = \s*, + = \s+, etc.

           # Any (?XX ) syntax can have the question mark removed
           / (: a | b ) /  # same as / (?: a | b ) /
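
       For comparison, the last regex above can be written as a plain Perl
       regular expression. This is only a sketch of the equivalence described
       in the comments, assuming "a" and "b" are literal characters rather
       than rule references: the /x flag supplies the whitespace and comment
       handling that Pegex implies, and "(:" is spelled out as "(?:".

```perl
use strict;
use warnings;

# Plain Perl counterpart of the Pegex regex / (: a | b ) /.
my $regexp = qr/
    (?: a | b )    # either "a" or "b", grouped without capturing
/x;

for my $input (qw( a b ab )) {
    my $verdict = $input =~ /\A$regexp\z/ ? 'matches' : 'does not match';
    print "$input $verdict\n";
}
```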

   Error Message
       An error message is a string inside backticks. If the parser gets to an
       error message in the grammar, it throws a parse error with that
       message.

           `error message`

   Operators
       The Pegex operators in descending precedence order are: ALT, AND, and
       OR.

       AND and OR are the most common operators. AND is represented by the
       absence of an operator, as in these rules:

           r1: <a><b>
           r2: a b

       Those are both the same. They mean rule "a" AND (followed immediately
       by) rule "b".

       OR means match one or the other.

           r: a | b | c

       means match rule "a" OR rule "b" OR rule "c". The rules are checked in
       order and if one matches, the others are skipped.

       ALT means alternation. It's a way to specify a separator in a list.

           r: a+ % b

       would match these:

           a
           aba
           ababa

       "%%" means that a trailing separator is optional.

           r: a+ %% b

       would match these:

           a
           ab
           aba
           abab
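
       The two ALT forms can be compared against ordinary Perl regular
       expressions. The sketch below assumes, purely for illustration, that
       "a" and "b" are the literal characters a and b. Under that assumption
       "a+ % b" accepts the same strings as a(?:ba)*, and "a+ %% b" the same
       strings as a(?:ba)*b?.

```perl
use strict;
use warnings;

# Treat "a" and "b" as literal characters (an assumption for illustration).
my $alt       = qr/\A a (?: b a )*    \z/x;   # like  r: a+ %  b
my $alt_trail = qr/\A a (?: b a )* b? \z/x;   # like  r: a+ %% b

# Report which of the sample inputs each form accepts.
for my $input (qw( a ab aba abab ababa )) {
    printf "%-6s %%: %-3s  %%%%: %s\n",
        $input,
        ($input =~ $alt       ? 'yes' : 'no'),
        ($input =~ $alt_trail ? 'yes' : 'no');
}
```

       Only the "%%" form accepts the inputs that end in a separator ("ab"
       and "abab").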

       ALT operators take precedence over everything else, similar to other
       parsers. These rules have the same binding precedence:

           r1: a b | c % d
           r2: (a b) | (c % d)

       Parens are not only used for indicating binding precedence; they also
       can create quantifiable groups:

           r1: (a b)+ c

       would match:

           abababc

Return Values

       All return values are based on the capture groups ("$1/$2/$3/etc." type
       variables) of parsed regex statements. The exact structure of the
       result tree depends on the type of Receiver used. For example,
       Pegex::Tree will return:

           $1               # single capture group
           [ $1, $2, ... ]  # multiple capture groups
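
       The same shaping rule can be sketched in core Perl. The function
       shape_captures below is hypothetical and is not Pegex::Tree's actual
       code; it only mimics the behavior described above for a single regular
       expression match: one capture group yields a plain scalar, several
       yield an array reference.

```perl
use strict;
use warnings;

# Hypothetical helper (not Pegex::Tree's actual code): shape the captures
# of one regular expression match the way the text describes.
sub shape_captures {
    my ($text, $regexp) = @_;
    my @captures = $text =~ $regexp
        or return;                       # no match at all
    return @captures == 1
        ? $captures[0]                   # single capture group: plain scalar
        : [ @captures ];                 # several groups: array reference
}

my $one  = shape_captures('foobar', qr/([a-z]+)/);
my $many = shape_captures('123+',   qr/(\d+)(\+)/);
print "$one\n";               # prints foobar
print "@{$many}\n";           # prints 123 +
```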

       This would be a match directly from the regex rule. As results are
       passed back up through parent rules, they are put into arrays, but only
       if there is more than one result. For example:

           r: (a b)+ % +
           a: /( ALPHA+ )/
           b: /( DIGIT+ )( PLUS )/

           # input = foobar123+
           # output (using Pegex::Tree) = [
           #     'foobar', [ '123', '+' ]
           # ]
           #
           # input = foobar123+ boofar789+
           # output (using Pegex::Tree) = [
           #     [ 'foobar', [ '123', '+' ] ],
           #     [ 'boofar', [ '789', '+' ] ],
           # ]

   Skipping
       Any rule can use the skip modifier (DOT) to completely skip the return
       from that rule (and any children below it). The rule is still
       processed, but nothing is put into the tree. (This is different from,
       say, putting "undef" into the return.) This can also affect the number
       of values returned, and thus, whether a value comes as an array:

           r: (a .b)+ % +
           a: /( ALPHA+ )/
           b: /( DIGIT+ )( PLUS )/

           # input = foobar123+ boofar789+
           # output (using Pegex::Tree) = [
           #     'foobar',
           #     'boofar',
           # ]

       The skip modifier can also be used with groups. (This is the only group
       modifier allowed so far.)

           r: .(a b)+ % +
           a: /( ALPHA+ )/
           b: /( DIGIT+ )( PLUS )/

           # output (using Pegex::Tree) = []

   Wrapping
       You can also turn on "wrapping" with the Pegex::Tree::Wrap receiver.
       This will wrap all match values in a hash with the rule name, like so:

           { rule_A => $match }
           { rule_B => [ @matches ] }

       Note that this behavior can be "hard set" with the "+/-" rule
       modifiers:

           -rule  # Flatten array captures
           +rule  # Always wrap (even if using Pegex::Tree)

       This is simply a check in the receiver's "gotrule" method. So any
       specific "got_*" receiver method overrides even these settings and
       may pass the match through as-is. In this case, the return value of
       the "got_*" sub dictates what ultimately gets put into the tree object:

           +rule_A   # the + has no effect here

           sub got_rule_A {
               my ($self, $matches_arrayref) = @_;
               return $matches_arrayref;
               # will be received as [ @matches ]
           }

       You can "correct" this behavior by passing it back to "gotrule":

           +rule_A   # now + is honored

           sub got_rule_A {
               my ($self, $matches_arrayref) = @_;
               return $self->gotrule($matches_arrayref);
               # will be received as { rule_A => [ @matches ] }
           }
