Pegex::Tutorial::JSON(3pm)

Pegex::Tutorial::JSON(3)  User Contributed Perl Documentation  Pegex::Tutorial::JSON(3)

How to write a JSON Parser in Pegex

       This document details the creation of the CPAN module: Pegex::JSON
       which is a JSON parser/decoder written in Perl using the Pegex parsing
       framework. The code lives on github here:
       <https://github.com/ingydotnet/pegex-json-pm>.

Test First

       Have a look at
       <https://github.com/ingydotnet/pegex-json-pm/blob/master/test/test.t>.
       This simple test has a bunch of small pieces of JSON and their YAML
       equivalents. It asserts that when the JSON is decoded, it will match
       the YAML.

       The test is written in a testing language known as TestML. TestML just
       happens to also use Pegex in its compiler. Both TestML and Pegex are
       Acmeist frameworks, meaning that they are intended to work in multiple
       programming languages.

       You can run the test in the usual way:

           > prove -lv t/test.t

The Pegex JSON Grammar

       The next thing to do is write the JSON grammar in the Pegex grammar
       language.  Writing grammars is the heart and soul of using Pegex. A
       grammar is simply a definition of a language that specifies what is
       what, and how it must be structured.

       Since Pegex is Acmeist, I put the JSON grammar in its own repo so that
       it could be shared by many different projects in different programming
       languages.  The grammar file is here:
       <https://github.com/ingydotnet/json-pgx/blob/master/json.pgx>.

       Let's look at this small but complete language definition in detail.

       The file starts with some comments. Comments can be used liberally in
       Pegex and go from a '#' to the end of a line, just as you would expect.

           # A simple grammar for the simple JSON data language.
           # For parser implementations that use this grammar, see:
           # * https://github.com/ingydotnet/pegex-json-pm

       Next we have what is called the Meta section of the grammar.

           %grammar json
           %version 0.0.1

       Meta section lines are of the form:

           %key value

       Your grammar should have at least a name and a version.

       Everything else in the grammar is a set of things called "rules". A
       rule has a name and a definition. The first rule in a grammar is
       special. When a parser starts parsing it defaults to using the first
       rule as the starting rule (although this can be overridden).

       We start the JSON grammar with this rule:

           json: map | seq

       The name of this rule is 'json'. When we start parsing JSON we say that
       the entire text must match the rule 'json'. This makes a lot of sense.

       This style of parsing is known as Top Down and Recursive Descent. Pegex
       is both of these. It should be noted that Pegex does not tokenize a
       text before parsing it. The rules themselves form a kind of tokenizer,
       pulling out the desired data segments as needed.

       In this rule, we are saying that a 'json' document is either a 'map'
       (aka mapping or hash) OR it is a 'seq' (aka sequence or array), which,
       assuming you know JSON (almost everybody does), are the only things
       allowed at the top level.

       In this rule, 'map' and 'seq' are called 'rule references'. They point
       to other named rules that are expected to be in the grammar. References
       are usually just the name of the subrule itself, but can also be
       enclosed in angle brackets (which are sometimes required), i.e. the
       rule above could also be written like this:

           json: <map> | <seq>

       We are also introduced to the OR operator which is a single PIPE
       character. It should also be noted that a COLON simply separates a rule
       name and its definition.

       The next line defines a new rule called 'node':

           node: map | seq | scalar

       We are calling 'node' the list of general structures that any given
       point in the JSON data graph can be. This is simply a mapping, sequence
       or scalar.

       Moving on, we need rules describing 'map', 'seq' and 'scalar'. A
       grammar is complete when all of its rule references are defined.

       Let's start with map:

           map:
               /- LCURLY -/
               pair* % /- COMMA -/
               /- RCURLY -/

       This seems a lot more complicated, but let's break things down, one at
       a time.  What we will find this rule to mean is that a map is a '{'
       followed by zero or more (key/value) pairs separated by commas, then
       ending with a '}'. Along the way there may also be intermittent
       whitespace. The '-' character indicates whitespace, but we'll cover
       that more later.

       The first thing we notice is that a rule definition can span multiple
       lines. A rule definition ends when the next rule begins. Pegex actually
       allows for multiple rules on one line, but they must be separated by
       semicolons, like so:

           rule1: a|b; rule2: c|d; rule3: e|f

       The next thing we see are forward slash characters. Like in Perl and
       JavaScript, a pair of slashes indicates a regular expression. In this
       rule we have 3 regular expressions.

       It is a good time to note that Pegex grammars get compiled into a
       simple data structure that you can express as JSON or YAML. In fact the
       repository containing the Pegex grammar that we are discussing also
       contains the matching compiled forms. See:
       <https://github.com/ingydotnet/json-pgx/blob/master/json.pgx.yaml>.  A
       quick glance at this file shows all the same rule definitions, but the
       regexes look much different.

       That's because Pegex tries to make regexes readable and composable.
       That means that complex regexes can be defined as small parts that get
       composed into bigger parts. By the time they get compiled, they can be
       quite hard to understand.

       For example if we had this Pegex grammar:

           greeting: / hello COMMA - world /
           hello: / (:'O' SPACE 'HAI' | 'Hey' SPACE 'there') /
           world: / (:Earth|Mars|Venus) /

       It would compile to:

           greeting:
             .rgx: (?:O\ HAI|Hey\ there),\s*(?:Earth|Mars|Venus)

       Note that the 'hello' and 'world' rules are gone, but their definitions
       have been baked into the one big regex for 'greeting'.
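
       Since the compiled form is an ordinary regex, it can be tried out in
       plain Perl without Pegex at all. The sketch below copies the compiled
       'greeting' regex from the YAML above; the is_greeting helper and the
       sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# The compiled 'greeting' regex, copied from the YAML shown above.
# '\ ' matches a literal space; '\s*' is what the '-' expanded to.
my $greeting = qr/(?:O\ HAI|Hey\ there),\s*(?:Earth|Mars|Venus)/;

# Hypothetical helper: true when the whole input is a greeting.
sub is_greeting {
    my ($text) = @_;
    return $text =~ /\A$greeting\z/ ? 1 : 0;
}

for my $text ('O HAI, Earth', 'Hey there,Mars', 'Hello, world') {
    printf "%-16s %s\n", $text,
        is_greeting($text) ? 'matches' : 'does not match';
}
```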

       Additionally there are references to things like COMMA and SPACE. These
       are called Pegex Atoms, and there are atoms for all the punctuation
       characters, whitespace chars, and others. The full list is here:
       Pegex::Grammar::Atoms.

       Having to write out 'SEMI' instead of ';' seems strange at first, but
       it is how Pegex easily separates metasyntax from text to be matched.
       Once you get used to it, it is very readable.

       The actual whitespace (and comments) inside a regex are completely
       ignored by Pegex. This is the same as Perl's 'x' regex flag.

       Finally the '-' is Pegex's 'possible whitespace' indicator, and usually
       expands to "\s*". It actually expands to "ws1", which expands to "ws*",
       which expands to "WS*", which expands to "\s*" (unless you override any
       of those rules).

       Getting back to JSON...

       The rule we defined for 'map' should now be more readable. Let's look
       at it again, but this time in a more compact form:

           map: /- LCURLY -/   (pair* % /- COMMA -/)   /- RCURLY -/

       I've compacted the regexes (since they just mean curlies and commas
       with possible whitespace), and I've added parentheses around the middle
       stuff to indicate that the '%' operator has a tighter binding.

       So what is the '%' operator? It was borrowed from Perl 6 Rules.
       Consider:

           a+ % b

       This means one or more 'a', separated by 'b'. Simple. The %% operator
       means the same thing, except it indicates that a trailing 'b' is OK.

       This notation is handy for things like comma separated lists (which is
       exactly what we are using it for here).

       The 'map' rule above means zero or more 'pair's separated by commas
       (trailing comma not allowed, which is strictly correct for JSON).
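
       One way to get a feel for '%' and '%%' is to spell out the same idea as
       an ordinary Perl regex. The sketch below is only an analogy, not how
       Pegex implements the operators; the separated helper and its
       parameters are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Build a plain regex with the meaning of a separated list.
#   $item, $sep : regex fragments for the element and the separator
#   $min        : 0 for '*', 1 for '+'
#   $trailing   : true for '%%' (a trailing separator is allowed)
sub separated {
    my ($item, $sep, $min, $trailing) = @_;
    my $list = "(?:$item)(?:(?:$sep)(?:$item))*";
    $list .= "(?:$sep)?" if $trailing;
    $list = "(?:$list)?" if $min == 0;
    return qr/\A$list\z/;
}

my $plus          = separated('a', ',', 1, 0);   # like  a+ % COMMA
my $star          = separated('a', ',', 0, 0);   # like  a* % COMMA
my $plus_trailing = separated('a', ',', 1, 1);   # like  a+ %% COMMA

for my $text ('a,a,a', 'a,a,', '') {
    printf "'%s': plus=%d star=%d plus_trailing=%d\n", $text,
        ($text =~ $plus          ? 1 : 0),
        ($text =~ $star          ? 1 : 0),
        ($text =~ $plus_trailing ? 1 : 0);
}
```

       Only the '%%' variant accepts 'a,a,', and only the '*' variant accepts
       the empty string.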

       Now is a good time to bring up 'rule quantifiers'. A rule quantifier is
       a suffix to a rule reference, and can be one of ? * or +. These
       suffixes mean the same thing that they would in regexes.

       There are two other quantifier suffixes. '2+' is equivalent to the
       regex syntax {2,} and 2-5 is the same as {2,5}. When you use one of
       these two forms, you need to put the rule reference in angle brackets,
       or else the number looks like part of the rule name. For example:

           rule1: <rule2>5-9 <rule3>29+

       not:

           rule1: rule25-9 rule329+

       Let's take a look at the 'map' rule after Pegex compilation:

           map:
             .all:
             - .rgx: \s*\{\s*
             - +min: 0
               .ref: pair
               .sep:
                 .rgx: \s*,\s*
             - .rgx: \s*\}\s*

       The rule for 'map' says that the text must match 3 things: a regex
       (opening curly brace), zero or more occurrences of a rule called 'pair'
       separated by a regex (comma), and finally another regex (closing
       curly).
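
       The compiled tree is plain data, so it can be held in a Perl hash and
       walked like any other structure. The hash below is transcribed by hand
       from the YAML above (it is not output from Pegex), and the describe
       helper is a hypothetical function that turns each node into words.

```perl
use strict;
use warnings;

# Hand-transcribed from the compiled YAML above; not generated by Pegex.
my $map_rule = {
    '.all' => [
        { '.rgx' => '\s*\{\s*' },
        {
            '+min' => 0,
            '.ref' => 'pair',
            '.sep' => { '.rgx' => '\s*,\s*' },
        },
        { '.rgx' => '\s*\}\s*' },
    ],
};

# Hypothetical helper: describe one compiled node in words.
sub describe {
    my ($node) = @_;
    return sprintf('regex /%s/', $node->{'.rgx'}) if exists $node->{'.rgx'};
    my $text = sprintf("rule '%s'", $node->{'.ref'});
    $text .= sprintf(' (at least %d)', $node->{'+min'}) if exists $node->{'+min'};
    $text .= ' separated by ' . describe($node->{'.sep'}) if $node->{'.sep'};
    return $text;
}

print "map must match, in order:\n";
print "  - ", describe($_), "\n" for @{ $map_rule->{'.all'} };
```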

       One thing that we have silently covered is the AND operator. That's
       because there is no operator symbol for it. Consider the rules:

           a: b c+ /d/
           b: c | d
           c: d e | f % g

       The PIPE character between 2 things means OR. No symbol between 2
       things means AND. A PERCENT means ALTERNATION (items alternating with
       separators). ALTERNATION binds tightest and OR binds loosest, with AND
       in the middle. Precedence can be made explicit with parentheses. Thus
       the rule for 'c' can be restated:

           c: ((d e) | (f % g))

       OK. I think we've covered just about everything needed so far. That was
       a lot of learning for one rule, but now you know most of Pegex!

       The next three rules need no new knowledge. Take a look at these and
       see if you can figure them out.

           pair:
               string
               /- COLON -/
               node

           seq:
               /- LSQUARE -/
               node* % /- COMMA -/
               /- RSQUARE -/

           scalar:
               string |
               number |
               boolean |
               null

       A pair (you know... a hash key/value), is a string, a colon, and some
       node. A seq is zero or more comma-separated nodes between square
       brackets. A scalar can be one of 4 forms. Simple.

       One interesting point that has just arisen here is the use of
       recursion.  The rules for pair and seq both reference the rule for
       node, which in turn references map and seq. Thus, the grammar is
       recursive descent. It starts with the rule for the thing as a whole
       (i.e. 'json') and descends (recursively) until it matches all the
       specific characters.
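
       To see the recursion at work, here is a hand-written recognizer with
       one Perl sub per grammar rule. It is a sketch of the recursive descent
       idea only, not how Pegex itself is implemented. The sub names are
       hypothetical, and the two regexes are hand-expanded from the 'string'
       and 'number' rules defined later in this document.

```perl
use strict;
use warnings;

# Hand-expanded from the grammar's 'string' and 'number' rules.
my $STRING = qr/"(?:\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*"/;
my $NUMBER = qr/-?(?:0|[1-9][0-9]*)(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?/;

# Match one punctuation char with optional whitespace, like /- COMMA -/.
sub punct {
    my ($in, $char) = @_;
    return $$in =~ /\G\s*\Q$char\E\s*/gc ? 1 : 0;
}

# Try a rule; rewind the match position if it fails.
sub attempt {
    my ($in, $rule) = @_;
    my $start = pos($$in) // 0;
    return 1 if $rule->($in);
    pos($$in) = $start;
    return 0;
}

# item* % COMMA : zero or more items, separated by commas.
sub list_of {
    my ($in, $item) = @_;
    return 1 unless attempt($in, $item);
    while (1) {
        my $start = pos($$in) // 0;
        unless (punct($in, ',') && $item->($in)) {
            pos($$in) = $start;    # a trailing comma is not consumed
            return 1;
        }
    }
}

sub parse_scalar {
    my ($in) = @_;
    return $$in =~ /\G(?:$STRING|$NUMBER|true|false|null)/gc ? 1 : 0;
}

sub parse_pair {
    my ($in) = @_;
    return ($$in =~ /\G$STRING/gc) && punct($in, ':') && parse_node($in) ? 1 : 0;
}

sub parse_map {
    my ($in) = @_;
    return punct($in, '{') && list_of($in, \&parse_pair) && punct($in, '}') ? 1 : 0;
}

sub parse_seq {
    my ($in) = @_;
    return punct($in, '[') && list_of($in, \&parse_node) && punct($in, ']') ? 1 : 0;
}

# node: map | seq | scalar  (this is where the recursion closes)
sub parse_node {
    my ($in) = @_;
    return attempt($in, \&parse_map) || attempt($in, \&parse_seq)
        || attempt($in, \&parse_scalar) ? 1 : 0;
}

# json: map | seq, and the whole text must be consumed.
sub is_json {
    my ($text) = @_;
    pos($text) = 0;
    my $ok = attempt(\$text, \&parse_map) || attempt(\$text, \&parse_seq);
    return $ok && pos($text) == length($text) ? 1 : 0;
}

print is_json('{"a": [1, 2, {"b": null}]}') ? "valid\n" : "invalid\n";
print is_json('[1,]')                       ? "valid\n" : "invalid\n";
```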

   Pegex Regexes in More Depth
       Next we have the definition for a JSON string:

           # string and number are interpretations of http://www.json.org/
           string: /
               DOUBLE
                   (
                       (:
                           BACK (:       # Backslash escapes
                               [
                                   DOUBLE      # Double Quote
                                   BACK        # Back Slash
                                   SLASH       # Forward Slash
                                   b           # Back Space
                                   f           # Form Feed
                                   n           # New Line
                                   r           # Carriage Return
                                   t           # Horizontal Tab
                               ]
                           |
                               u HEX{4}        # Unicode octet pair
                           )
                       |
                           [^ DOUBLE CONTROLS ]  # Anything else
                       )*
                   )
               DOUBLE
           /

       which Pegex compiles to this (simple and obvious;) regex:

           /"((?:\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*)"/
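
       The compiled regex can be exercised directly in Perl. The sketch below
       copies it from above (with the '/' inside the character class escaped
       so it can sit between Perl's own slashes); the string_body helper is a
       hypothetical name used only for this example.

```perl
use strict;
use warnings;

# The compiled 'string' regex shown above.
my $string = qr/"((?:\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*)"/;

# Hypothetical helper: return the raw text between the quotes, or undef
# when $text is not exactly one JSON string.
sub string_body {
    my ($text) = @_;
    return $text =~ /\A$string\z/ ? $1 : undef;
}

for my $text ('"hello"', '"tab\there"', '"oops') {
    my $body = string_body($text);
    printf "%-14s => %s\n", $text, defined $body ? "[$body]" : 'no match';
}
```

       Note that the capture is the raw text; escapes such as \t are still
       two characters at this point.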

       Let's see what's new here...

       First off, we have lots of whitespace and comments. This should make it
       pretty easy to at least get the overall picture of what is being
       accomplished.

       Understanding how the text between a pair of '/' characters gets
       transformed into a real regular expression is the key to really
       understanding Pegex.

       Notice the '*', the '{4}', the '|', the '(...)' and the '^...'. All of
       this punctuation gets passed on verbatim into the compiled regex. There
       are just a few exceptions. Let's cover them in detail.

       Everything inside "<some_rule_ref>" gets replaced by the regex that the
       reference points to. Rule references inside a regex must point directly
       to a rule that is itself a regex, although those rules can point to
       even more regex parts.

       The "-" characters get replaced by 'ws1', which is subsequently
       replaced by its rule definition. "+" gets replaced by "ws2" (which
       resolves to "\s+" by default).

       Finally '(:' gets replaced by '(?:'. This is simply to make your
       non-capturing paren syntax a little prettier. In general, you can
       leave out the '?' after a '(' and Pegex will put it in for you.

       That's it. Everything else that you put between slash characters will
       go verbatim into the regex.

       In some sense, Pegex is just a very highly organized way to write a
       parser from regular expressions. To be really good at Pegex does
       require a fairly solid understanding of how regexes work, but given
       that regexes are so very common, Pegex makes the task of turning them
       into a Parser quite simple.
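
       The substitutions described above can be imitated with a toy
       translator. This is a deliberately small sketch and not Pegex's
       compiler: it only knows four atoms (with the expansions visible in the
       compiled output quoted in this document), it treats whitespace
       separated tokens one at a time, and the expand_regex name is made up.

```perl
use strict;
use warnings;

# A few atoms only; the real atom list is much longer.
my %ATOM = (
    LCURLY => '\{',
    RCURLY => '\}',
    COMMA  => ',',
    SPACE  => '\ ',
);

# Toy translation of the text between two slashes into a regex string.
# Splitting on whitespace also discards the whitespace, as Pegex does.
sub expand_regex {
    my ($source) = @_;
    $source =~ s/#.*$//mg;                                   # strip comments
    my @out;
    for my $token (split ' ', $source) {
        if    ($token eq '-')        { push @out, '\s*' }    # possible whitespace
        elsif ($token eq '+')        { push @out, '\s+' }    # required whitespace
        elsif (exists $ATOM{$token}) { push @out, $ATOM{$token} }
        else {
            $token =~ s/\(:/(?:/g;                           # '(:' becomes '(?:'
            push @out, $token;                               # the rest is verbatim
        }
    }
    return join '', @out;
}

print expand_regex('- LCURLY -'), "\n";
print expand_regex('- COMMA -   # separator'), "\n";
print expand_regex('(: a | b )'), "\n";
```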

   Capturing Data
       The next thing to cover is regex capturing. When you are parsing data,
       it is generally important to pull out certain chunks of text so that
       you can do things with them.

       Pegex has one very simple, straightforward and obvious way to do this:
       Any capturing parens in any regexes will capture data and pass it on to
       the receiver object. Since Pegex is built over regexes, this makes
       perfect sense.

       We will talk about receiver objects in the next section. For now, just
       know that a receiver object is the thing that makes something out of
       the data and events matched during a parse. In our JSON case here, our
       receiver will make a Perl data structure that matches the JSON we are
       parsing. Obvious!

       In the rule for 'string' above, we are capturing the characters between
       the double quotes. This is the raw data that will be turned into a Perl
       scalar.
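
       The hand-off from capture to receiver can be pictured with a few lines
       of plain Perl. Nothing below uses Pegex: Toy::Receiver, its got method
       and the match_rule stand-in for the parser are all hypothetical, and
       only the regex is taken from the compiled 'string' rule above.

```perl
use strict;
use warnings;

package Toy::Receiver;    # hypothetical class, not part of Pegex

sub new {
    my ($class) = @_;
    return bless({ seen => [] }, $class);
}

# Called once per matched rule with whatever the regex captured.
sub got {
    my ($self, $rule, @captures) = @_;
    push @{ $self->{seen} }, $rule;
    return $captures[0];          # the captured text becomes the value
}

package main;

# The compiled 'string' regex; the outer parens capture the body.
my $string = qr/"((?:\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})|[^"\x00-\x1f])*)"/;

# Stand-in for the parser: match one rule, hand the captures over.
sub match_rule {
    my ($receiver, $rule, $regex, $text) = @_;
    my @captures = $text =~ /\A$regex\z/ or return;
    return $receiver->got($rule, @captures);
}

my $receiver = Toy::Receiver->new;
my $value    = match_rule($receiver, 'string', $string, '"hello"');
print "receiver built: $value\n";
```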

   Finishing up the JSON Grammar
       There are just 5 more simple rules needed to complete the JSON Pegex
       grammar:

           number: /(
               DASH?
               (: 0 | [1-9] DIGIT* )
               (: DOT DIGIT* )?
               (: [eE] [ DASH PLUS ]? DIGIT+ )?
           )/

           boolean: true | false

           true: /true/

           false: /false/

           null: /null/

       Note that the rule for 'number' captures data, but the other rules
       don't.  That's because as long as the receiver is told that a 'null'
       rule was matched, it can turn it into a Perl "undef". It's not
       important what the actual matching text was (even though in this case
       it has to be exactly the string 'null').
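
       As with 'string', the 'number' rule boils down to one regex with a
       single capture. The version below is expanded by hand, assuming DASH,
       DOT, PLUS and DIGIT stand for '-', '.', '+' and [0-9]; it is not copied
       from Pegex's compiled output, and number_text is a hypothetical helper.

```perl
use strict;
use warnings;

# Hand-expanded from the 'number' rule above (atom meanings assumed).
my $number = qr/(-?(?:0|[1-9][0-9]*)(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?)/;

# Hypothetical helper: return the captured text, or undef when $text is
# not exactly one number.
sub number_text {
    my ($text) = @_;
    return $text =~ /\A$number\z/ ? $1 : undef;
}

for my $text ('0', '-12.5e+3', '01', 'abc') {
    my $got = number_text($text);
    if (defined $got) {
        # Adding 0 is one simple way to turn the capture into a Perl number.
        printf "%-9s captured, numeric value %s\n", $text, $got + 0;
    }
    else {
        printf "%-9s no match\n", $text;
    }
}
```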

Pegex::JSON::Grammar - The Pegex Grammar Class

       The Pegex JSON grammar that we just looked at in excruciating detail
       gets compiled into a Perl data structure and then embedded into a
       grammar class. A Pegex::Parser object (the thing that does the parsing)
       requires a grammar object (what to look for) and a receiver object
       (what to do when you find it).
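
       Putting those pieces together might look like the sketch below. It
       assumes that Pegex and Pegex::JSON are installed and that
       Pegex::Parser->new accepts grammar and receiver arguments and offers a
       parse method; the decode_json_with_pegex wrapper is a hypothetical
       name, and the modules are loaded inside an eval so the script still
       runs where they are missing.

```perl
use strict;
use warnings;

# Load the modules only if they are installed.
my $have_pegex = eval {
    require Pegex::Parser;
    require Pegex::JSON::Grammar;
    require Pegex::JSON::Data;
    1;
};

# Hypothetical wrapper: decode a JSON text into a Perl data structure.
sub decode_json_with_pegex {
    my ($json) = @_;
    return unless $have_pegex;
    my $parser = Pegex::Parser->new(
        grammar  => Pegex::JSON::Grammar->new,   # what to look for
        receiver => Pegex::JSON::Data->new,      # what to do when found
    );
    return $parser->parse($json);
}

my $data = decode_json_with_pegex('{"name": "Pegex", "tags": ["peg", "regex"]}');
print $have_pegex
    ? "name = $data->{name}\n"
    : "Pegex::JSON is not installed; skipping\n";
```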

       It should be noted that Pegex uses Pegex to parse Pegex grammars. That
       is, this grammar:
       <https://github.com/ingydotnet/pegex-pgx/blob/master/pegex.pgx> is used
       by Pegex to parse our json.pgx grammar (and yes, it can parse pegex.pgx
       too).

       It is conceivable that every time we wanted to parse JSON, we could
       parse the json.pgx grammar first, but that would be a waste of time. So
       we cache the big grammar tree as a pure Perl data structure inside
       Pegex::JSON::Grammar.

       If for some reason we did need to change the json.pgx, we would have to
       recompile it to Perl and copy/paste it into our module. This would be a
       pain, so there is a special command to do it for us. Just run this:

           perl -Ilib -MPegex::JSON::Grammar=compile

       If you are in heavy development mode and changing the grammar a lot,
       you can simply set an environment variable like this:

           export PERL_PEGEX_AUTO_COMPILE=Pegex::JSON::Grammar

       and recompilation will happen automatically. This is possible because
       of this line in the grammar module:

           use constant file => '../json-pgx/json.pgx';

       That line is only used during development and requires the grammar file
       to be in that location.

Pegex::JSON::Data - The Pegex Receiver Class

       Every Pegex parse requires a grammar object (the JSON grammar), an
       input object (the JSON being parsed) and a receiver object (the Perl
       maker).

       One of the hallmarks of Pegex is that it keeps the grammar separate
       from the (receiver) code, thus

perl v5.34.0                      2022-01-21          Pegex::Tutorial::JSON(3)