Pegex::Tutorial::JSON(3)  User Contributed Perl Documentation  Pegex::Tutorial::JSON(3)

This document details the creation of the CPAN module Pegex::JSON,
which is a JSON parser/decoder written in Perl using the Pegex parsing
framework. The code lives on GitHub here:
<https://github.com/ingydotnet/pegex-json-pm>.

Have a look at
<https://github.com/ingydotnet/pegex-json-pm/blob/master/test/test.t>.
This simple test has a bunch of small pieces of JSON and their YAML
equivalents. It asserts that when the JSON is decoded, it will match
the YAML.

The test is written in a testing language known as TestML. TestML just
happens to also use Pegex in its compiler. Both TestML and Pegex are
Acmeist frameworks, meaning that they are intended to work in multiple
programming languages.

You can run the test like normal:

    > prove -lv t/test.t

The next thing to do is write the JSON grammar in the Pegex grammar
language. Writing grammars is the heart and soul of using Pegex. A
grammar is simply a definition of a language that specifies what is
what, and how it must be structured.

Since Pegex is Acmeist, I put the JSON grammar in its own repo so that
it could be shared by many different projects in different programming
languages. The grammar file is here:
<https://github.com/ingydotnet/json-pgx/blob/master/json.pgx>.

Let's look at this small but complete language definition in detail.

The file starts with some comments. Comments can be used liberally in
Pegex and go from a '#' to the end of a line. Just as you would expect.

    # A simple grammar for the simple JSON data language.
    # For parser implementations that use this grammar, see:
    # * https://github.com/ingydotnet/pegex-json-pm

Next we have what is called the Meta section of the grammar.

    %grammar json
    %version 0.0.1

Meta section lines are of the form:

    %key value

Your grammar should have at least a name and a version.

Everything else in the grammar is a set of things called "rules". A
rule has a name and a definition. The first rule in a grammar is
special. When a parser starts parsing it defaults to using the first
rule as the starting rule (although this can be overridden).

We start the JSON grammar with this rule:

    json: map | seq

The name of this rule is 'json'. When we start parsing JSON we say that
the entire text must match the rule 'json'. This makes a lot of sense.

This style of parsing is known as Top Down and Recursive Descent. Pegex
is both of these. It should be noted that Pegex does not tokenize a
text before parsing it. The rules themselves form a kind of tokenizer,
pulling out the desired data segments as needed.

In this rule, we are saying that a 'json' document is either a 'map'
(aka mapping or hash) OR it is a 'seq' (aka sequence or array), which,
assuming you know JSON (almost everybody does), are the only things
allowed at the top level.

In this rule, 'map' and 'seq' are called 'rule references'. They point
to other named rules that are expected to be in the grammar. References
are usually just the name of the subrule itself, but can also be
enclosed in angle brackets (which are sometimes required), i.e. the rule
above could also be written like this:

    json: <map> | <seq>

We are also introduced to the OR operator which is a single PIPE
character. It should also be noted that a COLON simply separates a rule
name and its definition.

The next line defines a new rule called 'node':

    node: map | seq | scalar

We are calling 'node' the list of general structures that any given
point in the JSON data graph can be. This is simply a mapping, sequence
or scalar.

Moving on, we need rules describing 'map', 'seq' and 'scalar'. A grammar
is complete when all of its rule references are defined.

Let's start with map:

    map:
        /- LCURLY -/
        pair* % /- COMMA -/
        /- RCURLY -/

This seems a lot more complicated, but let's break things down, one at
a time. What we will find this rule to mean is that a map is a '{'
followed by zero or more (key/value) pairs separated by a comma, then
ending with a '}'. Along the way there may also be intermittent
whitespace. The '-' character indicates whitespace, but we'll cover
that more later.

The first thing we notice is that a rule definition can span multiple
lines. A rule definition ends when the next rule begins. Pegex actually
allows for multiple rules on one line, but they must be separated by
semicolons, like so:

    rule1: a|b; rule2: c|d; rule3: e|f

The next thing we see are forward slash characters. Like in Perl and
JavaScript, a pair of slashes indicates a regular expression. In this
rule we have 3 regular expressions.

It is a good time to note that Pegex grammars get compiled into a
simple data structure that you can express as JSON or YAML. In fact the
repository containing the Pegex grammar that we are discussing also
contains the matching compiled forms. See:
<https://github.com/ingydotnet/json-pgx/blob/master/json.pgx.yaml>. A
quick glance at this file shows all the same rule definitions, but the
regexes look much different.

That's because Pegex tries to make regexes readable and composable.
That means that complex regexes can be defined as small parts that get
composed into bigger parts. By the time they get compiled, they can be
quite hard to understand.

For example if we had this Pegex grammar:

    greeting: / hello COMMA - world /
    hello: / (:'O' SPACE 'HAI' | 'Hey' SPACE 'there') /
    world: / (:Earth|Mars|Venus) /

It would compile to:

    greeting:
      .rgx: (?:O\ HAI|Hey\ there),\s*(?:Earth|Mars|Venus)

Note that the 'hello' and 'world' rules are gone, but their definitions
have been baked into the one big regex for 'greeting'.
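
To see that "baking in" concretely, here is a minimal sketch in Python.
It pastes hand-expanded fragments together and checks the result
against the compiled regex shown above. The dictionary and the string
templating are purely illustrative assumptions; this is not how Pegex
stores or compiles rules internally.

```python
import re

# Hand-expanded fragments mirroring the 'hello' and 'world' rules, plus the
# two atoms used by 'greeting'. These names are illustrative, not Pegex API.
fragments = {
    "hello": r"(?:O\ HAI|Hey\ there)",
    "world": r"(?:Earth|Mars|Venus)",
    "COMMA": r",",
    "ws": r"\s*",  # what '-' usually expands to
}

# greeting: / hello COMMA - world /
greeting = "{hello}{COMMA}{ws}{world}".format(**fragments)

print(greeting)
print(bool(re.fullmatch(greeting, "O HAI, Mars")))      # matches
print(bool(re.fullmatch(greeting, "Hey there,Venus")))  # whitespace is optional
print(bool(re.fullmatch(greeting, "Hello, Earth")))     # 'Hello' is not a greeting here
```

The composed string is character for character the '.rgx' value above,
which is the whole point: the small named parts exist only in the
grammar source.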

Additionally there are references to things like COMMA and SPACE. These
are called Pegex Atoms, and there are atoms for all the punctuation
characters, whitespace chars, and others. The full list is here:
Pegex::Grammar::Atoms.

Having to write out 'SEMI' instead of ';' seems strange at first, but
it is how Pegex easily separates metasyntax from text to be matched.
Once you get used to it, it is very readable.

The actual whitespace (and comments) inside a regex are completely
ignored by Pegex. This is the same as Perl's 'x' regex flag.

Finally the '-' is Pegex's 'possible whitespace' indicator, and usually
expands to "\s*". It actually expands to "ws1", which expands to "ws*",
which expands to "WS*", which expands to "\s*" (unless you override any
of those rules).

Getting back to JSON...

The rule we defined for 'map' should now be more readable. Let's look
at it again, but this time in a more compact form:

    map: /- LCURLY -/ (pair* % /- COMMA -/) /- RCURLY -/

I've compacted the regexes (since they just mean curlies and commas
with possible whitespace), and I've added parentheses around the middle
stuff to indicate that the '%' operator has a tighter binding.

So what is the '%' operator? It was borrowed from Perl 6 Rules.
Consider:

    a+ % b

This means one or more 'a', separated by 'b'. Simple. The %% operator
means the same thing, except it indicates that a trailing 'b' is OK.

This notation is handy for things like comma separated lists. (Which is
exactly what we are using it for here.)

The rule above means zero or more 'pair's separated by commas. (A
trailing comma is not allowed, which is strictly correct for JSON.)
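
If the separator operators feel abstract, it can help to spell them out
as ordinary regexes. The sketch below (Python, with made-up helper
names) does that for the case where 'a' and 'b' are themselves plain
regexes. Pegex applies '%' at the rule level rather than by rewriting a
regex, so treat this only as a statement of what the operators mean.

```python
import re

def sep_one_or_more(a, b):
    """a+ % b : one or more a, separated by b, no trailing b."""
    return f"(?:{a})(?:(?:{b})(?:{a}))*"

def sep_zero_or_more(a, b):
    """a* % b : zero or more a, separated by b, no trailing b."""
    return f"(?:{sep_one_or_more(a, b)})?"

def sep_trailing_ok(a, b):
    """a+ %% b : like a+ % b, but one trailing b is allowed."""
    return f"{sep_one_or_more(a, b)}(?:{b})?"

ITEM = r"\d+"        # stand-in for 'a'
COMMA = r"\s*,\s*"   # stand-in for 'b', i.e. /- COMMA -/

def ok(pattern, text):
    return re.fullmatch(pattern, text) is not None

print(ok(sep_one_or_more(ITEM, COMMA), "1, 2,3"))   # separated items
print(ok(sep_one_or_more(ITEM, COMMA), "1,2,"))     # trailing separator
print(ok(sep_zero_or_more(ITEM, COMMA), ""))        # empty list
print(ok(sep_trailing_ok(ITEM, COMMA), "1,2,"))     # trailing separator with %%
```

The second call is the interesting one for JSON: with a single '%' the
trailing comma makes the match fail, which is the behaviour the 'map'
rule relies on.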

Now is a good time to bring up 'rule quantifiers'. A rule quantifier is
a suffix to a rule reference, and can be one of ? * or +. These
suffixes mean the same thing that they would in regexes.

There are two other quantifier suffixes. '2+' is equivalent to the
regex syntax {2,} and 2-5 is the same as {2,5}. When you use one of
these two forms, you need to put the rule reference in angle brackets,
or else the number looks like part of the rule name. For example:

    rule1: <rule2>5-9 <rule3>29+

not:

    rule1: rule25-9 rule329+
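
The correspondence between the quantifier suffixes and regex syntax is
mechanical enough to write down as a tiny function. This is only a
sketch of the mapping described above (the function name is made up),
not a piece of Pegex:

```python
import re

def quantifier_to_regex(suffix):
    """Translate a Pegex rule quantifier suffix into the equivalent regex one."""
    if suffix in ("", "?", "*", "+"):
        return suffix                          # same meaning as in a regex
    m = re.fullmatch(r"(\d+)\+", suffix)       # e.g. '2+'  means {2,}
    if m:
        return "{%s,}" % m.group(1)
    m = re.fullmatch(r"(\d+)-(\d+)", suffix)   # e.g. '2-5' means {2,5}
    if m:
        return "{%s,%s}" % m.groups()
    raise ValueError(f"not a quantifier suffix: {suffix!r}")

for suffix in ("?", "*", "+", "2+", "2-5", "5-9", "29+"):
    print(suffix, "->", quantifier_to_regex(suffix))
```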

Let's take a look at that rule after Pegex compilation:

    map:
      .all:
      - .rgx: \s*\{\s*
      - +min: 0
        .ref: pair
        .sep:
          .rgx: \s*,\s*
      - .rgx: \s*\}\s*

The rule for 'map' says that the text must match 3 things: a regex
(opening curly brace), zero or more occurrences of a rule called 'pair'
separated by a regex (comma), and finally another regex (closing
curly).
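
One way to convince yourself that this tree really is "just data" is to
walk it with a few lines of code. The Python sketch below is a toy
matcher written for this tutorial, not the Pegex parser. It understands
only the keys that appear above ('.all', '.rgx', '.ref', '+min',
'.sep'), and the 'pair' rule is a simplified stand-in so that the
example is self-contained.

```python
import re

# The compiled 'map' rule from above as a Python dict. 'pair' is a stand-in
# (a bare "key": digits regex); the real rule is defined later in the grammar.
TREE = {
    "map": {".all": [
        {".rgx": r"\s*\{\s*"},
        {"+min": 0, ".ref": "pair", ".sep": {".rgx": r"\s*,\s*"}},
        {".rgx": r"\s*\}\s*"},
    ]},
    "pair": {".rgx": r'"\w+"\s*:\s*\d+'},
}

def match(node, text, pos):
    """Return the position just after a match of `node` at `pos`, or None."""
    if ".ref" in node:
        if "+min" in node:
            return match_repeated(node, text, pos)
        return match(TREE[node[".ref"]], text, pos)  # plain reference: once
    if ".rgx" in node:
        m = re.compile(node[".rgx"]).match(text, pos)
        return m.end() if m else None
    if ".all" in node:  # AND: every part must match, in order
        for part in node[".all"]:
            pos = match(part, text, pos)
            if pos is None:
                return None
        return pos
    raise ValueError(f"unknown node: {node!r}")

def match_repeated(node, text, pos):
    """Match a quantified reference with a lower bound ('+min') and an
    optional separator ('.sep'). Upper bounds are not handled in this sketch."""
    rule, sep = TREE[node[".ref"]], node.get(".sep")
    count = 0
    while True:
        start = pos
        if count and sep is not None:
            start = match(sep, text, pos)
            if start is None:
                break
        end = match(rule, text, start)
        if end is None or end == start:
            break  # stop on failure or on an empty match (avoids looping forever)
        pos, count = end, count + 1
    return pos if count >= node["+min"] else None

def parses(text):
    return match(TREE["map"], text, 0) == len(text)

print(parses('{ "a": 1, "b": 2 }'))
print(parses('{}'))
print(parses('{ "a": 1, }'))
```

Note how a separator that is not followed by another 'pair' is not
consumed, so the trailing comma in the last input is left for the
closing-curly regex to choke on.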

One thing that we have silently covered is the AND operator. That's
because there is no operator symbol for it. Consider the rules:

    a: b c+ /d/
    b: c | d
    c: d e | f % g

The PIPE character between 2 things means OR. No symbol between 2
things means AND. A PERCENT means ALTERNATION. ALTERNATION binds
tightest and OR binds loosest, with AND in the middle. Precedence can
be disambiguated with parentheses. Thus the rule for 'c' can be
restated:

    c: ((d e) | (f % g))

OK. I think we've covered just about everything needed so far. That was
a lot of learning for one rule, but now you know most of Pegex!

The next three rules need no new knowledge. Take a look at these and
see if you can figure them out.

    pair:
        string
        /- COLON -/
        node

    seq:
        /- LSQUARE -/
        node* % /- COMMA -/
        /- RSQUARE -/

    scalar:
        string |
        number |
        boolean |
        null

A pair (you know... a hash key/value), is a string, a colon, and some
node. A seq is zero or more comma-separated nodes between square
brackets. A scalar can be one of 4 forms. Simple.

One interesting point that has just arisen here is the use of
recursion. The rules for pair and seq both reference the rule for
node. Thus, the grammar is recursive descent. It starts with the rule
for the thing as a whole (ie 'json') and descends (recursively) until
it matches all the specific characters.
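
The recursion is easier to see when the rules are written as functions
that call each other. The Python sketch below is a hand-written parser
in the same top-down style, covering only 'seq' and a simplified
integer scalar; it is an illustration of the idea, not something Pegex
generates.

```python
import re

WS = re.compile(r"\s*")
NUMBER = re.compile(r"-?\d+")  # simplified stand-in for the 'scalar' rule

def skip_ws(text, pos):
    return WS.match(text, pos).end()

def parse_node(text, pos):
    """node: seq | scalar   (the 'map' alternative is left out for brevity)"""
    if text.startswith("[", pos):
        return parse_seq(text, pos)
    m = NUMBER.match(text, pos)
    if not m:
        raise ValueError(f"expected a node at offset {pos}")
    return int(m.group()), m.end()

def parse_seq(text, pos):
    """seq: '[' node* % ',' ']'   (calls parse_node, which may call parse_seq)"""
    pos = skip_ws(text, pos + 1)  # step over '[' and any whitespace
    items = []
    if text.startswith("]", pos):
        return items, pos + 1
    while True:
        value, pos = parse_node(text, pos)
        items.append(value)
        pos = skip_ws(text, pos)
        if text.startswith(",", pos):
            pos = skip_ws(text, pos + 1)
            continue
        if text.startswith("]", pos):
            return items, pos + 1
        raise ValueError(f"expected ',' or ']' at offset {pos}")

def parse(text):
    value, pos = parse_node(text, skip_ws(text, 0))
    if skip_ws(text, pos) != len(text):
        raise ValueError("trailing text after the top-level node")
    return value

print(parse("[1, [2, 3], []]"))
```

Each level of nesting in the input corresponds to one more pair of
parse_node/parse_seq frames on the call stack, which is all that
"descends (recursively)" means here.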

Pegex Regexes in More Depth
Next we have the definition for a JSON string:

    # string and number are interpretations of http://www.json.org/
    string: /
        DOUBLE
        (
            (:
                BACK (:     # Backslash escapes
                    [
                        DOUBLE  # Double Quote
                        BACK    # Back Slash
                        SLASH   # Forward Slash
                        b       # Back Space
                        f       # Form Feed
                        n       # New Line
                        r       # Carriage Return
                        t       # Horizontal Tab
                    ]
                |
                    u HEX{4}    # Unicode octet pair
                )
            |
                [^ DOUBLE CONTROLS ]    # Anything else
            )*
        )
        DOUBLE
    /

which Pegex compiles to this (simple and obvious;) regex:

    /"((?:\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*)"/
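
Since the compiled form is an ordinary regex, you can poke at it in any
language with a compatible regex engine. Here is a quick check in
Python, with the pattern copied verbatim from above (only the
surrounding slashes are dropped). The sample inputs are my own.

```python
import re

# The compiled 'string' regex from above, copied verbatim.
STRING = re.compile(r'"((?:\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*)"')

def captured(text):
    """Return the text captured between the quotes, or None if no full match."""
    m = STRING.fullmatch(text)
    return m.group(1) if m else None

print(captured('"hello"'))            # plain text
print(captured(r'"say \"hi\""'))      # escaped quotes stay escaped in the capture
print(captured(r'"caf\u00e9"'))       # \uXXXX escape
print(captured('"line one\nline two"'))  # a raw newline is a control char: no match
print(captured('"unterminated'))      # missing closing quote: no match
```

Notice that the capture is still in its escaped form. The regex only
finds the string; turning `\"` or `\u00e9` into real characters is
left to whatever consumes the capture.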

Let's see what's new here...

First off, we have lots of whitespace and comments. This should make it
pretty easy to at least get the overall picture of what is being
accomplished.

Understanding how the text between a pair of '/' characters gets
transformed into a real regular expression is the key to really
understanding Pegex.

Notice the '*', the '{4}', the '|', the '(...)' and the '^...'. All of
this punctuation gets passed on verbatim into the compiled regex. There
are just a few exceptions. Let's cover them in detail.

Everything inside "<some_rule_ref>" gets replaced by the regex that the
reference points to. Rule references inside a regex must point directly
to another regex rule, although those rules can point to even more regex
parts.

The "-" characters get replaced by 'ws1', which is subsequently
replaced by its rule definition. "+" gets replaced by "ws2" (which
resolves to "\s+" by default).

Finally '(:' gets replaced by '(?:'. This is simply to make your
non-capturing paren syntax be a little prettier. In general, you can
leave out '?' after a '(' and Pegex will put them in for you.

That's it. Everything else that you put between slash characters will
go verbatim into the regex.

In some sense, Pegex is just a very highly organized way to write a
parser from regular expressions. Being really good at Pegex does
require a fairly solid understanding of how regexes work, but given that
regexes are so very common, Pegex makes the task of turning them into a
parser quite simple.

Capturing Data
The next thing to cover is regex capturing. When you are parsing data,
it is generally important to pull out certain chunks of text so that
you can do things with them.

Pegex has one very simple, straightforward and obvious way to do this:
Any capturing parens in any regexes will capture data and pass it on to
the receiver object. Since Pegex is built over regexes, this makes
perfect sense.

We will talk about receiver objects in the next section. For now, just
know that a receiver object is the thing that makes something out of
the data and events matched during a parse. In our JSON case here, our
receiver will make a Perl data structure that matches the JSON we are
parsing. Obvious!

In the rule for 'string' above, we are capturing the characters between
the double quotes. This is the raw data that will be turned into a Perl
scalar.

Finishing up the JSON Grammar
There are just 5 more simple rules needed to complete the JSON Pegex
grammar:

    number: /(
        DASH?
        (: 0 | [1-9] DIGIT* )
        (: DOT DIGIT* )?
        (: [eE] [ DASH PLUS ]? DIGIT+ )?
    )/

    boolean: true | false

    true: /true/

    false: /false/

    null: /null/

Note that the rule for 'number' captures data, but the other rules
don't. That's because as long as the receiver is told that a 'null'
rule was matched, it can turn it into a Perl "undef". It's not
important what the actual matching text was (even though in this case
it has to be exactly the string 'null').
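
Here is the same idea as a small Python sketch. The number regex is my
own hand expansion of the rule above (atoms replaced by what they stand
for, '(:' by '(?:'), not output copied from the Pegex compiler, and the
lookup table is a made-up stand-in for what a receiver might do with
the rules that capture nothing.

```python
import re

# Hand expansion of the 'number' rule: DASH is '-', DIGIT is [0-9], DOT is '.'.
NUMBER = re.compile(r"(-?(?:0|[1-9][0-9]*)(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?)")

def is_number(text):
    return NUMBER.fullmatch(text) is not None

for text in ("0", "-12", "3.14", "2.5E-3", "01", "+1", ".5", "1e"):
    print(repr(text), is_number(text))

# Rules that capture nothing: the rule name alone is enough to pick a value.
VALUE_FOR_RULE = {"true": True, "false": False, "null": None}
print(VALUE_FOR_RULE["null"])
```

One detail worth knowing: because the rule says 'DOT DIGIT*', the
expanded regex also accepts a number like '1.' with no digits after the
point, which strict JSON does not allow.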

The Pegex JSON grammar that we just looked at in excruciating detail
gets compiled into a Perl data structure and then embedded into a
grammar class. A Pegex::Parser object (the thing that does the parsing)
requires a grammar object (what to look for) and a receiver object
(what to do when you find it).

It should be noted that Pegex uses Pegex to parse Pegex grammars. That
is, this grammar:
<https://github.com/ingydotnet/pegex-pgx/blob/master/pegex.pgx> is used
by Pegex to parse our json.pgx grammar (and yes, it can parse pegex.pgx
too).

It is conceivable that every time we wanted to parse JSON, we could
parse the json.pgx grammar first, but that would be a waste of time. So
we cache the big grammar tree as a pure Perl data structure inside
Pegex::JSON::Grammar.

If for some reason we did need to change the json.pgx, we would have to
recompile it to Perl and copy/paste it into our module. This would be a
pain, so there is a special command to do it for us. Just run this:

    perl -Ilib -MPegex::JSON::Grammar=compile

If you are in heavy development mode and changing the grammar a lot,
you can simply set an environment variable like this:

    export PERL_PEGEX_AUTO_COMPILE=Pegex::JSON::Grammar

and recompilation will happen automatically. This is possible because
of this line in the grammar module:

    use constant file => '../json-pgx/json.pgx';

That line is only used during development and requires the grammar file
to be in that location.

Every Pegex parse requires a grammar object (the JSON grammar), an
input object (the JSON being parsed) and a receiver object (the Perl
maker).

One of the hallmarks of Pegex is that it keeps the grammar separate
from the (receiver) code, thus

perl v5.34.0                      2021-07-22          Pegex::Tutorial::JSON(3)