1Pegex::Syntax(3) User Contributed Perl Documentation Pegex::Syntax(3)
2
3
4
6 The term "Pegex" can be used to mean both the Pegex Parser Framework
7 and also the Pegex Grammar Language Syntax that is used to write Pegex
8 grammar files. This document details the Pegex Syntax.
9
10 Pegex is a self-hosting language. That means that the grammar for
11 defining the Pegex Language is written in the Pegex Language itself.
12 You can see it for yourself here:
13 <https://github.com/ingydotnet/pegex-pgx/blob/master/pegex.pgx>.
14
15 I encourage you to take a quick look at that link even now. A Pegex
16 grammar (like this one) is made up of 2 parts: a meta section and a
17 rule section.
18
19 The meta section just contains keyword/value meta attributes about the
20 grammar, such as the grammar's name and version.
21
22 The real meat of a Pegex grammar is in its rules. The very first rule
23 of the grammar above is (basically):
24
25 grammar: meta_section rule_section
26
27 Which says, a grammar IS a meta_section followed by a rule_section. But
28 hey, we already knew that!
29
30 Meta Section
31 The meta statements at the top of a grammar file look like this:
32
33 %pegexKeyword value
34
35 Let's look at the top of the pegex.pgx grammar:
36
37 # This is the Pegex grammar for Pegex grammars!
38 %grammar pegex
39 %version 0.1.0
40
41 This defines two meta values: "grammar" and "version", which specify
42 the name and the version of the grammar, respectively.
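
As a rough illustration of the meta statement format (this is not Pegex's own parser), the following Python sketch collects "%keyword value" pairs until the first line that is not a meta statement. The name read_meta and the regular expression are assumptions made for this example only.

```python
import re

# Hypothetical pattern for one meta statement: "%keyword value".
META_LINE = re.compile(r"%(\w+)\s+(.*\S)\s*$")

def read_meta(grammar_text):
    """Collect meta statements from the top of a grammar text."""
    meta = {}
    for line in grammar_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        match = META_LINE.match(stripped)
        if match is None:
            break  # first non-meta line: the rule section begins here
        meta[match.group(1)] = match.group(2)
    return meta

example = """# This is the Pegex grammar for Pegex grammars!
%grammar pegex
%version 0.1.0

grammar: meta_section rule_section
"""
print(read_meta(example))
```

The loop stops at the first rule definition, mirroring the statement that the meta section ends where the rule section begins.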
43
44 You'll also notice that the first line is a comment. Comments start
45 with a "#" and go until the end of the line. Comments are allowed
46 almost anywhere in the grammar: on their own lines, after
47 statements, and even within regex definitions, as we will see later.
48
49 The Pegex Meta Section ends when the Pegex Rule Section begins (with
50 the first rule definition).
51
52 Rule Section
53 The remainder of a Pegex grammar is a set of named rules. Each rule is
54 a rule name, followed by a ':', followed by the definition of the rule,
55 followed by a ';' or a newline.
56
57 Here are a couple of rules from the pegex.pgx grammar. (These are the
58 rules that start to define a rule!)
59
60 rule_definition:
61 rule_start
62 rule_group
63 ending
64
65 rule_start: /
66 ( rule_name ) # Capture the rule_name
67 BLANK*
68 COLON -
69 /
70
71 Rule definitions are infix expressions. They consist of tokens
72 separated by operators, with parentheses to disambiguate binding
73 precedence. There are 3 distinct token types and 3 operators.
74
75 The 3 token types are: rule-reference, regex and error-message. The 3
76 operators are AND (' '), OR ('|') and ALT ('%', '%%').
77
78 Here's an example from a Pegex grammar for parsing JSON:
79
80 json: hash | array
81 array: / LSQUARE / ( node* % / COMMA / ) (
82 / RSQUARE / | `missing ']'` )
83
84 This is saying: "json is either a hash or array. array is '[', zero or
85 more nodes separated by commas, and a ']'. error if no ']'".
86
87 "hash", "array" and "node" are rule references, meaning that they refer
88 to named rules within the grammar that must match at that point. Text
89 surrounded by a pair of '/' chars forms a regex. Text surrounded by
90 backticks is an error message.
91
92 "LSQUARE", "RSQUARE" and "COMMA" are also rule references. Rules may be
93 referred to inside of regexes, as long as they refer to regexes
94 themselves. In this way big regexes can be assembled from smaller ones,
95 thus leading to reuse and readability. Finally, the '*' after "node" is
96 called a "quantifier". More about those later.
97
98 Rule References
99 A rule reference is the name of a rule inside angle brackets. The
100 brackets are usually optional. Inside a regex, a rule reference without
101 "<>" must be preceded by a whitespace character.
102
103 <sub_rule_name>
104 sub_rule_name
105
106 When used outside a regex, a reference can have a number of prefix
107 modifiers. Note that the angle brackets are not required here, but add
108 to readability.
109
110 =rule # Zero-width positive assertion (look-ahead)
111 !rule # Zero-width negative assertion (look-ahead)
112 .rule # Skip (i.e. parse but don't capture a subpattern)
113 -rule # Flat (flatten the array captures)
114 +rule # Always wrap
115
116 (Skipping and wrapping are explained in "Return Values".)
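
The two assertion prefixes ("=" and "!") behave like the look-ahead constructs of ordinary regular expressions. The Python sketch below uses the re module as a stand-in to show the zero-width behavior; it is an analogy under that assumption, not Pegex code.

```python
import re

# "=rule" analogue: positive look-ahead, checks for digits without consuming.
positive = re.compile(r"(?=\d+)")
# "!rule" analogue: negative look-ahead, succeeds only if no digit follows.
negative = re.compile(r"(?!\d)")

m = positive.match("123abc")
print(m is not None, m.end())            # the match consumed no input
print(negative.match("123abc") is None)  # a digit follows, so it fails
print(negative.match("abc") is not None) # no digit follows, so it succeeds
```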
117
118 A reference can also have a number of suffixed quantifiers. Similar to
119 regular expression syntax, a quantifier indicates how many times a rule
120 (reference) should match.
121
122 rule? # optional
123 rule* # 0 or more times
124 rule+ # 1 or more times
125 <rule>8 # exactly 8 times
126 <rule>2+ # 2 or more times
127 <rule>2-3 # 2 or 3 times
128 <rule>0-6 # 0 to 6 times
129
130 Note that you must use angle brackets if you are using a numeric
131 quantifier:
132
133 rule8 # WRONG! This would match rule "rule8".
134 rule2+ # WRONG! This would match rule "rule2", 1 or more times.
135 rule2-3 # WRONG! Pegex syntax error
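
To make the quantifier table concrete, here is a small Python sketch that translates each suffix into the equivalent regular-expression quantifier. The function names are hypothetical, and the translation is an assumption based on the table above rather than a description of how the Pegex compiler works internally.

```python
import re

def quantifier_to_regex(suffix):
    """Translate a Pegex-style quantifier suffix into a regex quantifier."""
    if suffix in ("", "?", "*", "+"):
        return suffix
    m = re.fullmatch(r"(\d+)(?:(\+)|-(\d+))?", suffix)
    if m is None:
        raise ValueError(f"unrecognized quantifier: {suffix!r}")
    low, open_ended, high = m.groups()
    if open_ended:
        return "{%s,}" % low           # "2+"  -> at least 2
    if high is not None:
        return "{%s,%s}" % (low, high) # "2-3" -> between 2 and 3
    return "{%s}" % low                # "8"   -> exactly 8

def matches(char, suffix, text):
    """Check whether `text` is `char` repeated as the suffix allows."""
    pattern = "(?:%s)%s" % (re.escape(char), quantifier_to_regex(suffix))
    return re.fullmatch(pattern, text) is not None

print(quantifier_to_regex("2-3"))
print(matches("x", "2-3", "xxx"))
```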
136
137 There is a special set of predefined "Atoms" that refer to regular
138 expression fragments. Atoms exist for every punctuation character and
139 for characters commonly found in regular expressions. Atoms enhance
140 readability in grammar texts, and allow special characters (like slash
141 or hash) to be used as Pegex syntax.
142
143 For example, a regex to match a comment might be '#' followed by
144 anything, followed by a newline. In Pegex, you would write:
145
146 comment: / HASH ANY* EOL /
147
148 instead of:
149
150 comment: /#.*\r?\n/
151
152 Pegex would compile the former into the latter.
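
The expansion can be pictured with a short Python sketch. The ATOMS table below is hypothetical and holds only the three atoms from this example, with values chosen to reproduce the regular expression shown above; the real definitions live in the Atoms source code.

```python
import re

# Hypothetical atom table with just the atoms this example needs.
ATOMS = {"HASH": "#", "ANY": ".", "EOL": r"\r?\n"}

def expand_atoms(pegex_regex):
    """Replace atom names with their fragments and drop the whitespace."""
    body = pegex_regex.strip().strip("/")
    expanded = re.sub(r"[A-Z]+", lambda m: ATOMS[m.group(0)], body)
    return re.sub(r"\s+", "", expanded)

pattern = expand_atoms("/ HASH ANY* EOL /")
print(pattern)
print(re.fullmatch(pattern, "# a comment\n") is not None)
```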
153
154 Here are some atoms:
155
156 DASH # -
157 PLUS # +
158 TILDE # ~
159 SLASH # /
160 HASH # # (literal)
161 QMARK # ? (literal)
162 STAR # * (literal)
163 LPAREN # ( (literal)
164 RPAREN # ) (literal)
165 WORD # \w
166 WS # \s
167
168 The full list can be found in the Atoms source code:
169 <https://metacpan.org/source/Pegex::Grammar::Atoms>.
170
171 Regexes
172 In Pegex we call the syntax for a regular expression a "regex". That
173 is, when the term "regex" is used, it refers to Pegex syntax, and when
174 the term "regular expression" is used, it refers to the actual regular
175 expression that the regex is compiled into.
176
177 A regex is a string inside forward slashes.
178
179 /regex/
180
181 The regex syntax mostly follows Perl, with the following exceptions:
182
183 # Any rules in angle brackets are referenced in the regex
184 / ( <rule1> | 'non_rule' ) / # "non_rule" is interpreted literally
185
186 # The syntax implies a /x modifier, so whitespace and comments are
187 # ignored.
188 / (
189 rule1+ # Match rule1 one or more times
190 |
191 rule2
192 ) /
193
194 # Whitespace is declared with dash and plus.
195 / - rule3 + / # - = \s*, + = \s+, etc.
196
197 # Any (?XX ) syntax can have the question mark removed
198 / (: a | b ) / # same as / (?: a | b ) /
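
For readers more familiar with other regex engines, the Python sketch below shows plain regular expressions that correspond to the exceptions listed above. It uses Python's re module as an analogy (re.VERBOSE plays the role of Perl's /x), and the literal word "x" stands in for a rule reference; none of this is Pegex's own output.

```python
import re

# Implied /x: whitespace and "#" comments inside the pattern are ignored.
verbose = re.compile(
    r"""
    (
        a+      # match "a" one or more times
      |
        b
    )
    """,
    re.VERBOSE,
)

# "/ - x + /" written out: "-" becomes \s* and "+" becomes \s+.
spaced = re.compile(r"\s*x\s+")

# "/ (: a | b ) /" written out with the question mark restored.
grouped = re.compile(r"(?:a|b)")

print(verbose.fullmatch("aaa") is not None)
print(spaced.fullmatch("  x ") is not None)
print(grouped.fullmatch("b").groups())
```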
199
200 Error Message
201 An error message is a string inside backticks. If the parser gets to an
202 error message in the grammar, it throws a parse error with that
203 message.
204
205 `error message`
206
207 Operators
208 The Pegex operators in descending precedence order are: ALT, AND, and
209 OR.
210
211 AND and OR are the most common operators. AND is represented by the
212 absence of an operator, as in these rules:
213
214 r1: <a><b>
215 r2: a b
216
217 Those are both the same. They mean rule "a" AND (followed immediately
218 by) rule "b".
219
220 OR means match one or the other.
221
222 r: a | b | c
223
224 means match rule "a" OR rule "b" OR rule "c". The rules are checked in
225 order and if one matches, the others are skipped.
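
A minimal Python sketch of these two operators, written as small matcher functions, may help. The helper names literal, seq and choice are hypothetical; they model the behavior described here (sequence, and ordered choice where the first success wins) and are not taken from the Pegex source.

```python
def literal(text):
    """Return a matcher that consumes `text` at `pos`, or None on failure."""
    def match(source, pos):
        end = pos + len(text)
        return end if source.startswith(text, pos) else None
    return match

def seq(*matchers):
    """AND: every matcher must succeed, each starting where the last ended."""
    def match(source, pos):
        for m in matchers:
            pos = m(source, pos)
            if pos is None:
                return None
        return pos
    return match

def choice(*matchers):
    """OR: try the matchers in order and stop at the first success."""
    def match(source, pos):
        for m in matchers:
            end = m(source, pos)
            if end is not None:
                return end
        return None
    return match

a, b, c = literal("a"), literal("b"), literal("c")
r2 = seq(a, b)       # r2: a b
r = choice(a, b, c)  # r: a | b | c

print(r2("ab", 0), r("c", 0), r("d", 0))
```

Each matcher returns the position after the match, or None. Because choice stops at the first alternative that succeeds, the order in which alternatives are written matters.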
226
227 ALT means alternation. It's a way to specify a separator in a list.
228
229 r: a+ % b
230
231 would match these:
232
233 a
234 aba
235 ababa
236
237 "%%" means that a trailing separator is optional.
238
239 r: a+ %% b
240
241 would match these:
242
243 a
244 ab
245 aba
246 abab
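
The two forms can be checked against equivalent regular expressions. This Python sketch treats "a" and "b" as literal characters, which is an assumption made for the example; in a grammar they would be rule references.

```python
import re

# "a+ % b": one or more "a", separated by "b".
separated = re.compile(r"a(?:ba)*")
# "a+ %% b": the same, with an optional trailing separator.
trailing_ok = re.compile(r"a(?:ba)*b?")

for text in ("a", "ab", "aba", "abab", "ababa"):
    print(
        text,
        separated.fullmatch(text) is not None,
        trailing_ok.fullmatch(text) is not None,
    )
```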
247
248 ALT and AND both bind more tightly than OR, similar to other
249 parsers. These two rules have the same binding precedence:
250
251 r1: a b | c % d
252 r2: (a b) | (c % d)
253
254 Parens are not only used for indicating binding precedence; they also
255 can create quantifiable groups:
256
257 r1: (a b)+ c
258
259 would match:
260
261 abababc
262
263 Return Values
264 All return values are based on the capture groups ("$1/$2/$3/etc." type
265 variables) of parsed RE statements. The exact structure of the result
266 tree depends on the type of Receiver used. For example, Pegex::Tree
267 will return:
268
269 $1 # single capture group
270 [ @+[1..$#+] ] # multiple capture groups
271
272 This would be a match directly from the RE rule. As rules go further
273 back, things are put into arrays, but only if there is more than one
274 result. For example:
275
276 r: (a b)+ % +
277 a: /( ALPHA+ )/
278 b: /( DIGIT+ )( PLUS )/
279
280 # input = foobar123+
281 # output (using Pegex::Tree) = [
282 # 'foobar', [ '123', '+' ]
283 # ]
284 #
285 # input = foobar123+ boofar789+
286 # output (using Pegex::Tree) = [
287 # [ 'foobar', [ '123', '+' ] ],
288 # [ 'boofar', [ '789', '+' ] ],
289 # ]
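
The single-versus-multiple capture rule can be modeled in a few lines of Python. The function capture_value is hypothetical, and the character classes stand in for the ALPHA, DIGIT and PLUS atoms; this is a sketch of the documented result shape, not of Pegex::Tree itself.

```python
import re

def capture_value(pattern, text):
    """Return one capture as a scalar and several captures as a list."""
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    groups = list(match.groups())
    return groups[0] if len(groups) == 1 else groups

value_a = capture_value(r"([A-Za-z]+)", "foobar")  # like rule "a"
value_b = capture_value(r"([0-9]+)(\+)", "123+")   # like rule "b"
print([value_a, value_b])
```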
290
291 Skipping
292 Any rule can use the skip modifier (DOT) to completely skip the return
293 from that rule (and any children below it). The rule is still
294 processed, but nothing is put into the tree. (This is different from,
295 say, putting "undef" into the return.) This can also affect the number
296 of values returned, and thus, whether a value comes as an array:
297
298 r: (a .b)+ % +
299 a: /( ALPHA+ )/
300 b: /( DIGIT+ )( PLUS )/
301
302 # input = foobar123+ boofar789+
303 # output (using Pegex::Tree) = [
304 # 'foobar',
305 # 'boofar',
306 # ]
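
One way to picture the effect of skipping on the result shape is the Python model below. The SKIP marker and the collect function are assumptions invented for this sketch; the model is chosen only because it reproduces the outputs shown in this document, and it does not describe how Pegex::Tree is implemented.

```python
SKIP = object()  # hypothetical marker for a result that was skipped

def collect(values):
    """Drop skipped results, then unwrap a list holding a single value."""
    kept = [v for v in values if v is not SKIP]
    return kept[0] if len(kept) == 1 else kept

# r: (a .b)+ % +   with input "foobar123+ boofar789+"
with_skip = collect([collect(["foobar", SKIP]), collect(["boofar", SKIP])])

# r: (a b)+ % +    with the same input, nothing skipped
without_skip = collect([
    collect(["foobar", ["123", "+"]]),
    collect(["boofar", ["789", "+"]]),
])

print(with_skip)
print(without_skip)
```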
307
308 The skip modifier can also be used with groups. (This is the only group
309 modifier allowed so far.)
310
311 r: .(a b)+ % +
312 a: /( ALPHA+ )/
313 b: /( DIGIT+ )( PLUS )/
314
315 # output (using Pegex::Tree) = []
316
317 Wrapping
318 You can also turn on "wrapping" with the Pegex::Tree::Wrap receiver.
319 This will wrap all match values in a hash with the rule name, like so:
320
321 { rule_A => $match }
322 { rule_B => [ @matches ] }
323
324 Note that this behavior can be "hard set" with the "+/-" rule
325 modifiers:
326
327 -rule # Flatten array captures
328 +rule # Always wrap (even if using Pegex::Tree)
329
330 This is simply a check in the "gotrule" for the receiver. So, any
331 specific "got_*" receiver methods will override even these settings,
332 and choose to pass the match as-is. In this case, the "got_*" sub
333 return value dictates what ultimately gets put into the tree object:
334
335 +rule_A # the + has no effect here
336
337 sub got_rule_A {
338 my ($self, $matches_arrayref) = @_;
339 return $matches_arrayref;
340 # will be received as [ @matches ]
341 }
342
343 You can "correct" this behavior by passing it back to "gotrule":
344
345 +rule_A # now + is honored
346
347 sub got_rule_A {
348 my ($self, $matches_arrayref) = @_;
349 return $self->gotrule($matches_arrayref);
350 # will be received as { rule_A => [ @matches ] }
351 }
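
The interaction between "+rule", "gotrule" and a "got_*" method can be modeled with a toy receiver in Python. Everything here (the class names, the receive method, the force_wrap parameter) is hypothetical and exists only to illustrate the dispatch order described above; the real receivers are Perl classes with different signatures.

```python
class TreeModel:
    """Toy stand-in for a receiver; not the Pegex::Tree implementation."""

    def __init__(self, wrap=False):
        self.wrap = wrap  # receiver-wide default, as with a wrapping receiver

    def gotrule(self, rule, match, force_wrap=None):
        # force_wrap=True models the "+rule" modifier.
        wrap = self.wrap if force_wrap is None else force_wrap
        return {rule: match} if wrap else match

    def receive(self, rule, match, force_wrap=None):
        handler = getattr(self, "got_" + rule, None)
        if handler is not None:
            return handler(match)  # a got_* method bypasses gotrule
        return self.gotrule(rule, match, force_wrap)

class PassThrough(TreeModel):
    def got_rule_A(self, match):
        return match  # the "+" request is never seen

class Delegating(TreeModel):
    def got_rule_A(self, match):
        # Hand the match back to gotrule so that wrapping is applied.
        return self.gotrule("rule_A", match, force_wrap=True)

print(TreeModel().receive("rule_A", ["x", "y"], force_wrap=True))
print(PassThrough().receive("rule_A", ["x", "y"], force_wrap=True))
print(Delegating().receive("rule_A", ["x", "y"], force_wrap=True))
```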
352
353 See Also
354 • Pegex::API
355
356 • Pegex::Tutorial
357
358 • Pegex::Resources
359
360
361
362perl v5.38.0 2023-07-21 Pegex::Syntax(3)