Syntax::Highlight::Perl::Improved(3)   User Contributed Perl Documentation   Syntax::Highlight::Perl::Improved(3)

NAME
       Syntax::Highlight::Perl::Improved - Highlighting of Perl Syntactical
       Structures

       This file documents Syntax::Highlight::Perl::Improved version 1.0.

SYNOPSIS
           # simple procedural
           use Syntax::Highlight::Perl::Improved ':BASIC';  # or ':FULL'

           print format_string($my_string);


           # OO
           use Syntax::Highlight::Perl::Improved;

           my $formatter = Syntax::Highlight::Perl::Improved->new;
           print $formatter->format_string($my_string);

DESCRIPTION
       This module provides syntax highlighting for Perl code. The design
       bias is roughly line-oriented and streamed (ie, processing a file
       line-by-line in a single pass). Provisions may be made in the future
       for tasks related to "back-tracking" (ie, re-doing a single line in
       the middle of a stream) such as speeding up state copying.

   Constructors
       The only constructor provided is new(). When called on an existing
       object, new() will create a new copy of that object. Otherwise, new()
       creates a new copy of the (internal) Default Object. Note that the
       use of the procedural syntax modifies the Default Object and that
       those changes will be reflected in any subsequent new() calls.

   Formatting
       Formatting is done using the format_string() method. Call
       format_string() with one or more strings to format, or it will
       default to using $_.

   Setting and Getting Formats
       You can set the text used for formatting a syntax element using
       set_format() (or set the start and end format individually using
       set_start_format() and set_end_format(), respectively).

       You can also retrieve the text used for formatting an element via
       get_start_format() or get_end_format(). Bulk retrieval of the names
       or values of defined formats is possible via get_format_names_list()
       (names), and get_start_format_values_list() and
       get_end_format_values_list() (values).

       See "FORMAT TYPES" later in this document for information on what
       format elements can be used.

   Checking and Setting the State
       You can check certain aspects of the state of the formatter via the
       methods: in_heredoc(), in_string(), in_pod(), was_pod(), in_data(),
       and line_count().

       You can reset all of the above states (and a few other internal ones)
       using reset().

   Stable and Unstable Formatting Modes
       You can set or check the stability of formatting via unstable().

       In unstable (TRUE) mode, formatting is not considered to be
       persistent with nested formats. Or, put another way, when unstable,
       the formatter can only "remember" one format at a time and must
       reinstate formatting for each token. An example of unstable
       formatting is using ANSI color escape sequences in a terminal.

       In stable (FALSE) mode (the default), formatting is considered
       persistent within arbitrarily nested formats. Even in stable mode,
       however, formatting is never allowed to span multiple lines; it is
       always fully closed at the end of the line and reinstated at the
       beginning of a new line, if necessary. This is to ensure properly
       balanced tags when only formatting a partial code snippet. An example
       of stable formatting is HTML.
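
       The difference between the two modes can be pictured with a small,
       self-contained sketch. It is not the module's code: the token list,
       the format tables, and the subs render_stable() and render_unstable()
       are all hypothetical, and the sketch assumes that in unstable mode
       only the innermost format of each token is emitted.

```perl
use strict;
use warnings;

# Each token carries the stack of formats active for it, outermost first.
# These three tokens model the Perl string  "a $x b".
my @tokens = (
    [ '"a ', ['String'] ],
    [ '$x',  ['String', 'Scalar'] ],
    [ ' b"', ['String'] ],
);

# Hypothetical start/end strings for two output targets.
my %html = ( String => ['<s>', '</s>'],       Scalar => ['<v>', '</v>'] );
my %ansi = ( String => ["\e[32m", "\e[0m"],   Scalar => ["\e[36m", "\e[0m"] );

# Stable: formats persist while nested, so only open or close what changed.
sub render_stable {
    my ($formats, @toks) = @_;
    my ($out, @open) = ('');
    for my $tok (@toks) {
        my ($text, $stack) = @$tok;
        my $keep = 0;
        $keep++ while $keep < @open && $keep < @$stack
                   && $open[$keep] eq $stack->[$keep];
        while (@open > $keep) {
            my $closing = pop @open;
            $out .= $formats->{$closing}[1];
        }
        for my $name (@$stack[ $keep .. $#$stack ]) {
            $out .= $formats->{$name}[0];
            push @open, $name;
        }
        $out .= $text;
    }
    while (@open) {
        my $closing = pop @open;
        $out .= $formats->{$closing}[1];
    }
    return $out;
}

# Unstable: nothing persists, so every token is wrapped on its own.
sub render_unstable {
    my ($formats, @toks) = @_;
    return join '', map {
        my ($text, $stack) = @$_;
        my $format = $formats->{ $stack->[-1] };
        $format->[0] . $text . $format->[1];
    } @toks;
}

print render_stable(\%html, @tokens), "\n";
print render_unstable(\%ansi, @tokens), "\n";
```

       In the stable rendering the String format stays open around the
       nested scalar. In the unstable rendering it is closed before the
       scalar and reinstated after it.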

   Substitutions
       Using define_substitution(), you can have the formatter substitute
       certain strings with others, after the original string has been
       parsed (but before formatting is applied). This is useful for
       escaping characters special to the output mode (eg, > and < in HTML)
       without them affecting the way the code is parsed.

       You can retrieve the current substitutions (as a hash-ref) via
       substitutions().
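
       As a rough illustration of why substitution happens after parsing,
       the following self-contained sketch escapes already-classified
       tokens for HTML. The token list and the sub apply_substitutions()
       are hypothetical stand-ins, not part of the module. Only the shape
       of the table (strings to replace as keys, replacements as values)
       follows the description above.

```perl
use strict;
use warnings;

# Substitution table: keys are the strings to replace, values the replacements.
my %substitutions = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;' );

sub apply_substitutions {
    my ($text, $subs) = @_;
    return $text unless %$subs;
    # Longest keys first, so a multi-character key wins over its own prefix.
    my $pattern = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$subs;
    # One pass only: replacement text is never scanned again, so the "&"
    # in "&lt;" is not escaped a second time.
    $text =~ s/($pattern)/$subs->{$1}/g;
    return $text;
}

# Tokens as a parser might already have classified them:  $a < $b && $c
my @tokens = (
    [ Scalar   => '$a' ],
    [ Operator => '<'  ],
    [ Scalar   => '$b' ],
    [ Operator => '&&' ],
    [ Scalar   => '$c' ],
);

# Substitute inside each token first, then wrap it in its format.
my $html = join ' ', map {
    my ($type, $text) = @$_;
    sprintf '<span class="%s">%s</span>',
        $type, apply_substitutions($text, \%substitutions);
} @tokens;

print "$html\n";
```

       Because each token was classified before its text was rewritten,
       the "<" is still recognized as an operator even though it reaches
       the output as "&lt;".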

FORMAT TYPES
       The Syntax::Highlight::Perl::Improved formatter recognizes and
       differentiates between many Perl syntactical elements. Each type of
       syntactical element has a Format Type associated with it. There is
       also a 'DEFAULT' type that is applied to any element whose Format
       Type does not have a value.

       Several of the Format Types have an underscore in their name. This
       underscore is special, and indicates that the Format Type can be
       "generalized." This means that you can assign a value to just the
       first part of the Format Type name (the part before the underscore)
       and that value will be applied to all Format Types with the same
       first part. For example, the Format Types for all types of variables
       begin with "Variable_". Thus, if you assign a value to the Format
       Type "Variable", it will be applied to any type of variable.
       Generalized Format Types take precedence over non-generalized Format
       Types. So the value assigned to "Variable" would be applied to
       "Variable_Scalar", even if "Variable_Scalar" had a value explicitly
       assigned to it.

       You can also define a "short-cut" name for each Format Type that can
       be generalized. The short-cut name is the part of the Format Type
       name after the underscore. For example, the short-cut for
       "Variable_Scalar" would be "Scalar". Short-cut names have the least
       precedence and are only used if neither the generalized Type name
       nor the full Type name has a value.

       Following is a list of all the syntactical elements that
       Syntax::Highlight::Perl::Improved currently recognizes, along with a
       short description of what each would be applied to.

       Comment_Normal
           A normal Perl comment. Starts with '#' and goes until the end of
           the line.

       Comment_POD
           Inline documentation. Starts with a line beginning with an equal
           sign ('=') followed by a word (eg: '=pod') and continuing until a
           line beginning with '=cut'.

       Directive
           Either the "she-bang" line at the beginning of the file, or a
           line directive altering what the compiler thinks the current line
           and file is.

       Label
           A loop or statement label (to be the target of a goto, next, last
           or redo).

       Quote
           Any string or character that begins or ends a String. Including,
           but not necessarily limited to: quote-like regular expression
           operators ("m//", "s///", "tr///", etc), a Here-Document
           terminating line, the lone period terminating a format, and, of
           course, normal quotes ("'", """, "`", "q{}", "qq{}", "qr{}",
           "qx{}").

       String
           Any text within quotes, "format"s, Here-Documents, Regular
           Expressions, and the like.

       Subroutine
           The identifier used to define, identify, or call a subroutine (or
           method). Note that Syntax::Highlight::Perl::Improved cannot
           recognize a subroutine if it is called without using parentheses
           or an ampersand, or methods called using the indirect object
           syntax. It formats those as barewords.

       Variable_Scalar
           A scalar variable.

           Note that (theoretically) this format is not applied to non-
           scalar variables that are being used as scalars (ie: array or
           hash lookups, nor references to anything other than scalars).
           Syntax::Highlight::Perl::Improved figures out (or at least tries
           to) the actual type of the variable being used (by looking at how
           you're subscripting it) and formats it accordingly. The first
           character of the variable (ie, the "$", "@", "%", or "*") tells
           you the type of value being used, and the color (hopefully) tells
           you the type of variable being used to get that value.

           (See "KNOWN ISSUES" for information about when this doesn't work
           quite right.)

       Variable_Array
           An array variable (but not usually a slice; see above).

       Variable_Hash
           A hash variable.

       Variable_Typeglob
           A typeglob. Note that typeglobs not beginning with an asterisk
           (*) (eg: filehandles) are formatted as barewords. This is
           because, well, they are.

       Whitespace
           Whitespace. Not usually formatted but it can be.

       Character
           A special, or backslash-escaped, character. For example: "\n"
           (newline), or "\d" (digits).

           Only occurs within strings or regular expressions.

       Keyword
           A Perl keyword. Some examples include: my, local, sub, next.

           Note that Perl does not make any distinction between keywords and
           built-in functions (at least not in the documentation). Thus I
           had to make a subjective call as to what would be considered
           keywords and what would be built-in functions.

           The list of keywords can be found (and overloaded) in the
           variable $Syntax::Highlight::Perl::Improved::keyword_list_re as a
           pre-compiled regular expression.

       Builtin_Function
           A Perl built-in function, called as a function (ie, using
           parentheses).

           The list of built-in functions can be found (and overloaded) in
           the variable $Syntax::Highlight::Perl::Improved::builtin_list_re
           as a pre-compiled regular expression.

       Builtin_Operator
           A Perl built-in function, called as a list or unary operator (ie,
           without using parentheses).

           The list of built-in functions can be found (and overloaded) in
           the variable $Syntax::Highlight::Perl::Improved::builtin_list_re
           as a pre-compiled regular expression.

       Operator
           A Perl operator.

           The list of operators can be found (and overloaded) in the
           variable $Syntax::Highlight::Perl::Improved::operator_list_re as
           a pre-compiled regular expression.

       Bareword
           A bareword. This can be a user-defined subroutine called without
           parentheses, a typeglob used without an asterisk (*), or just a
           plain old bareword.

       Package
           The name of a package or pragmatic module.

           Note that this does not apply to the package portion of a fully
           qualified variable name.

       Number
           A numeric literal.

       Symbol
           A symbol (ie, non-operator punctuation).

       CodeTerm
           The special tokens that signal the end of executable code and the
           beginning of the DATA section. Specifically, "__END__" and
           "__DATA__".

       DATA
           Anything in the DATA section (see "CodeTerm").
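
       The Variable_Scalar entry above says the formatter looks at how a
       variable is subscripted to decide which Variable_* type applies. A
       minimal sketch of that idea follows. The sub variable_format_type()
       and the table %type_for_sigil are hypothetical, and the single
       regular expression is an assumption made for illustration, not the
       module's actual tokenizer.

```perl
use strict;
use warnings;

# Fallback when there is no subscript: the sigil names the variable's type.
my %type_for_sigil = (
    '$' => 'Variable_Scalar',
    '@' => 'Variable_Array',
    '%' => 'Variable_Hash',
    '*' => 'Variable_Typeglob',
);

# Classify one variable use by the container it actually reads from.
sub variable_format_type {
    my ($source) = @_;
    my ($sigil, $subscript) =
        $source =~ /^([\$\@%*])\w+(?:::\w+)*([\[\{])?/
        or return;
    # A subscript directly after the name reveals the "real" variable.
    return 'Variable_Array' if defined $subscript && $subscript eq '[';
    return 'Variable_Hash'  if defined $subscript && $subscript eq '{';
    return $type_for_sigil{$sigil};
}

for my $use ('$count', '$list[0]', '$seen{key}', '@seen{qw(a b)}', '$ref->[0]') {
    printf "%-16s %s\n", $use, variable_format_type($use);
}
```

       Because the sketch only accepts a subscript that directly follows
       the identifier, it shares the limitation listed under "KNOWN ISSUES":
       whitespace or a line-break before the subscript makes the variable
       look like a plain scalar.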

       Syntax::Highlight::Perl::Improved uses OO method-calls internally
       (and actually defines a Default Object that is used when the
       functions are invoked procedurally) so you will not gain anything
       (efficiency-wise) by using the procedural interface. It is just a
       matter of style.

       It is actually recommended that you use the OO interface, as this
       allows you to instantiate multiple, concurrent-yet-separate
       formatters. Though I cannot think of why you would need multiple
       formatters instantiated. :-)

       One point to note: the new() method uses the Default Object to
       initialize new objects. This means that any changes to the state of
       the Default Object (including Format definitions) made by using the
       procedural interface will be reflected in any subsequently created
       objects. This can be useful in some cases (eg, call set_format()
       procedurally just before creating a batch of new objects to define
       default Formats for them all) but will most likely lead to trouble.
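
       The interaction between the Default Object and new() can be modelled
       in a few lines. The package Demo::Formatter below is a hypothetical
       stand-in written for this illustration. It copies one level deep for
       brevity and makes no claim about how the real module copies its
       state.

```perl
package Demo::Formatter;    # hypothetical stand-in, not the real module
use strict;
use warnings;

# The shared object that procedural calls operate on.
my $Default = bless { formats => {} }, __PACKAGE__;

sub new {
    my ($proto) = @_;
    # Called on an object: copy it. Called on the class: copy the default.
    my $source = ref($proto) ? $proto : $Default;
    return bless { formats => { %{ $source->{formats} } } }, ref($source);
}

sub set_format {
    # A procedural call passes no object, so fall back to the default.
    my $self = ref($_[0]) ? shift : $Default;
    %{ $self->{formats} } = ( %{ $self->{formats} }, @_ );
    return;
}

package main;

my $before = Demo::Formatter->new;
Demo::Formatter::set_format(Comment => ['<i>', '</i>']);    # procedural
my $after  = Demo::Formatter->new;
my $copy   = $after->new;
$copy->set_format(Comment => ['<b>', '</b>']);

print "before has Comment: ", (exists $before->{formats}{Comment} ? "yes" : "no"), "\n";
print "after:  $after->{formats}{Comment}[0]\n";
print "copy:   $copy->{formats}{Comment}[0]\n";
```

       The object created before the procedural call is unaffected, the one
       created afterwards inherits the new format, and the copy made from
       an existing object can then diverge from its source.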

       new PACKAGE
       new OBJECT
           Creates a new object. If called on an existing object, creates a
           new copy of that object (which is thenceforth totally separate
           from the original).

       reset
           Resets the object's internal state. This breaks out of strings
           and here-docs, ends PODs, resets the line-count, and otherwise
           gets the object back into a "normal" state to begin processing a
           new stream.

           Note that this does not reset any user options (including formats
           and format stability).

       unstable EXPR
       unstable
           Returns true if the formatter is in unstable mode.

           If called with a non-zero number, puts the formatter into
           unstable formatting mode.

           In unstable mode, it is assumed that formatting is not persistent
           from one token to the next and that each token must be explicitly
           formatted.

       in_heredoc
           Returns true if the next string to be formatted will be inside a
           Here-Document.

       in_string
           Returns true if the next string to be formatted will be inside a
           multi-line string.

       in_pod
           Returns true if the formatter would consider the next string
           passed to it as being within a POD structure. This is false
           immediately before any POD instigators ("=pod", "=head1",
           "=item", etc), true immediately after an instigator, throughout
           the POD and immediately before the POD terminator ("=cut"), and
           false immediately after the POD terminator.

       was_pod
           Returns true if the last line of the string just formatted was
           part of a POD structure. This includes the "/^=\w+/" POD
           instigators and terminators.
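
           The timing of in_pod() and was_pod() is easiest to see on a
           line-by-line trace. The sketch below is a hypothetical model of
           the behaviour described in the two entries above, not the
           module's implementation. The sub trace_pod_state() is invented
           for the example, and treating a stray "=cut" as POD is an
           assumption.

```perl
use strict;
use warnings;

# For each line, record [ in_pod before the line, was_pod after the line ].
sub trace_pod_state {
    my @lines = @_;
    my ($in_pod, @trace) = (0);
    for my $line (@lines) {
        my $before = $in_pod;
        my $was_pod;
        if ($line =~ /^=cut\b/) {
            $was_pod = 1;    # assumption: a stray =cut still counts as POD
            $in_pod  = 0;    # the terminator ends the POD block
        }
        elsif ($line =~ /^=\w+/) {
            $was_pod = 1;    # instigators are themselves part of the POD
            $in_pod  = 1;
        }
        else {
            $was_pod = $in_pod;
        }
        push @trace, [ $before, $was_pod ];
    }
    return @trace;
}

my @source = ('my $x = 1;', '=head1 NAME', 'Some text', '=cut', 'print $x;');
my @trace  = trace_pod_state(@source);
for my $i (0 .. $#source) {
    printf "in_pod=%d was_pod=%d  %s\n", @{ $trace[$i] }, $source[$i];
}
```

           The "=head1" line is reached with in_pod false but leaves
           was_pod true, and the "=cut" line is reached with in_pod true
           and also leaves was_pod true.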

       in_data
           Returns true if the next string to be formatted will be inside
           the DATA section (ie, follows a "__DATA__" or "__END__" tag).

       line_count
           Returns the number of lines processed by the formatter.

       substitutions
           Returns a reference to the substitution table used. The
           substitution table is a hash whose keys are the strings to be
           replaced, and whose values are what to replace them with.

       define_substitution HASH_REF
       define_substitution LIST
           Allows the user to define certain characters that will be
           substituted before formatting is done (but after they have been
           processed for meaning).

           If the first parameter is a reference to a hash, the formatter
           will replace its own hash with the given one, and subsequent
           changes to the hash outside the formatter will be reflected.

           Otherwise, it will copy the arguments passed into its own hash,
           and any substitutions already defined (but not in the parameter
           list) will be preserved. (ie, the new substitutions will be
           added, without destroying what was there already.)

       set_start_format HASH_REF
       set_start_format LIST
           Given either a list of keys/values, or a reference to a hash of
           keys/values, copies them into the object's list of start Formats.

       set_end_format HASH_REF
       set_end_format LIST
           Given either a list of keys/values, or a reference to a hash of
           keys/values, copies them into the object's list of end Formats.

       set_format LIST
           Sets the formatting string for one or more formats.

           You should pass a list of keys/values where the keys are the
           format names and the values are references to arrays containing
           the starting and ending formatting strings (in that order) for
           that format.
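
           A short usage sketch for HTML output follows. It relies only on
           methods documented in this page (new(), define_substitution(),
           set_format() and format_string()). The CSS class names, the
           sample code string and the helper highlight_html() are
           illustrative assumptions. The module is loaded at run time so
           that the script still compiles on a system where it is not
           installed.

```perl
use strict;
use warnings;

# Format name => [ start string, end string ], the shape set_format() expects.
my %html_formats = (
    Comment_Normal => [ '<span class="comment">',  '</span>' ],
    String         => [ '<span class="string">',   '</span>' ],
    Keyword        => [ '<span class="keyword">',  '</span>' ],
    Variable       => [ '<span class="variable">', '</span>' ],  # generalized
);

# HTML-special characters, escaped after parsing via define_substitution().
my %html_escapes = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;' );

# Hypothetical helper: returns highlighted HTML, or undef without the module.
sub highlight_html {
    my ($code) = @_;
    eval { require Syntax::Highlight::Perl::Improved; 1 } or return;
    my $formatter = Syntax::Highlight::Perl::Improved->new;
    $formatter->define_substitution(%html_escapes);
    $formatter->set_format(%html_formats);
    return scalar $formatter->format_string($code);
}

my $html = highlight_html('my $total = $price < 10 ? "low" : "high"; # compare');
print defined $html
    ? "<pre>$html</pre>\n"
    : "Syntax::Highlight::Perl::Improved is not installed\n";
```

           Assigning to the generalized name "Variable" covers scalars,
           arrays, hashes and typeglobs at once, as described under "FORMAT
           TYPES".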

       get_start_format LIST
           Retrieve the string that is inserted to begin a given format type
           (starting format string).

           The names are looked for in the following order:

           First: Prefer the names joined by underscore, from most general
           to least. For example, given ("Variable", "Scalar"): "Variable"
           then "Variable_Scalar".

           Second: Then try each name singly, in reverse order. For example,
           "Scalar" then "Variable".

           See "FORMAT TYPES" for more information.
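
           The lookup order above can be written out as a small sketch. The
           subs candidate_names() and lookup_format() are hypothetical, and
           the final fall-back to a 'DEFAULT' entry is an assumption based
           on the description of the 'DEFAULT' type under "FORMAT TYPES".

```perl
use strict;
use warnings;

# Candidate format names, in lookup order, for a LIST such as
# ("Variable", "Scalar").
sub candidate_names {
    my @names = @_;
    my (@candidates, %seen);
    # First: underscore-joined prefixes, from most general to least.
    push @candidates, join '_', @names[ 0 .. $_ ] for 0 .. $#names;
    # Second: each name on its own, in reverse order.
    push @candidates, reverse @names;
    # Drop repeats while keeping the first occurrence of each name.
    return grep { !$seen{$_}++ } @candidates;
}

sub lookup_format {
    my ($formats, @names) = @_;
    for my $name (candidate_names(@names)) {
        return $formats->{$name} if defined $formats->{$name};
    }
    return $formats->{DEFAULT};    # assumption: fall back to 'DEFAULT'
}

my %start = ( Variable_Scalar => '<scalar>', Scalar => '<short>', DEFAULT => '<plain>' );
print join(' ', candidate_names('Variable', 'Scalar')), "\n";
print lookup_format(\%start, 'Variable', 'Scalar'), "\n";
print lookup_format(\%start, 'Comment', 'POD'), "\n";
```

           With these tables the full name "Variable_Scalar" is found
           before the short-cut "Scalar". Adding a value for "Variable"
           would override both.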

       get_end_format LIST
           Retrieve the string that is inserted to end a given format type
           (ending format string).

       get_format_names_list
           Returns a list of the names of all the Formats defined.

       get_start_format_values_list
           Returns a list of the values of all the start Formats defined (in
           the same order as the names returned by get_format_names_list()).

       get_end_format_values_list
           Returns a list of the values of all the end Formats defined (in
           the same order as the names returned by get_format_names_list()).

       format_string LIST
           Formats one or more strings of Perl code. If no strings are
           specified, defaults to $_. Returns the list of formatted strings
           (or the first string formatted if called in scalar context).

           Note: The end of the string is considered to be the end of a
           line, regardless of whether or not there is a trailing line-break
           (but trailing line-breaks will not cause an extra, empty line).

           Another Note: The function actually uses $/ to determine line-
           breaks, unless $/ is set to "\n" (newline). If $/ is "\n", then
           it looks for the first match of "m/\r?\n|\n?\r/" in the string
           and uses that to determine line-breaks. This is to make it easy
           to handle non-unix text. Whatever characters it ends up using as
           line-breaks are preserved.
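
           The line-break rule in the second note can be sketched as
           follows. The subs detect_line_break() and split_lines() are
           hypothetical helpers written for this example. Only the use of
           $/ and the regular expression come from the description above,
           and defaulting to "\n" when the text has no line-break at all is
           an assumption.

```perl
use strict;
use warnings;

sub detect_line_break {
    my ($text) = @_;
    # A non-default input record separator is used as-is.
    return $/ if defined $/ && $/ ne "\n";
    # Otherwise adopt the first line-break sequence found in the text.
    return $text =~ /(\r?\n|\n?\r)/ ? $1 : "\n";
}

sub split_lines {
    my ($text) = @_;
    my $break = detect_line_break($text);
    # Split just after each line-break so the terminator stays attached
    # to its line and can be emitted unchanged.
    return split /(?<=\Q$break\E)/, $text;
}

for my $sample ("one\ntwo\n", "one\r\ntwo\r\n", "one\rtwo") {
    my @lines = split_lines($sample);
    printf "%d lines, break is %d character(s) long\n",
        scalar @lines, length detect_line_break($sample);
}
```

           Keeping the terminator attached to each line is what allows the
           original line-breaks to be preserved in the output, whatever
           convention the input used.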

       format_token TOKEN, LIST
           Returns TOKEN wrapped in the start and end Formats corresponding
           to LIST (as would be returned by get_start_format( LIST ) and
           get_end_format( LIST ), respectively).

           No syntax checking is done on TOKEN but substitutions defined
           with define_substitution() are performed.

KNOWN ISSUES
       •   Barewords used as keys to a hash are formatted as strings. This
           is Good. They should not be, however, if they are not the only
           thing within the curly braces. That can be fixed.

       •   This version does not handle formats (see perlform(1)) very well.
           It treats them as Here-Documents and ignores the rules for
           comment lines, as well as the fact that picture lines are not
           supposed to be interpolated. Thus, your picture lines will look
           strange with the '@'s being formatted as array variables (albeit,
           invalid ones). Ideally, it would also treat value lines as normal
           Perl code and format accordingly. I think I'll get to the comment
           lines and non-interpolating picture lines first. If/When I do get
           this fixed, I will most likely add a format type of 'Format' or
           something, so that they can be formatted differently, if so
           desired.

       •   This version does not handle Regular Expression significant
           characters. It simply treats Regular Expressions as interpolated
           strings.

       •   User-defined subroutines, called without parentheses, are
           formatted as barewords. This is because there is no way to tell
           them apart from barewords without parsing the code, and doing so
           would require us to go as far as perl does when doing the "-c"
           check (ie, executing BEGIN and END blocks and the like). That's
           not going to happen.

       •   If you are indexing (subscripting) an array or hash, the
           formatter tries to figure out the "real" variable class by
           looking at how you index the variable. However, if you do
           something funky (but legal in Perl) and put line-breaks or
           comments between the variable class character ($) and your
           identifier, the formatter will get confused and treat your
           variable as a scalar until it finds the index character. Then it
           will format the scalar class character ($) as a scalar and your
           identifier as the "correct" class.

       •   If you put a line-break between your variable identifier and its
           indexing character (see above), which is also legal in Perl, the
           formatter will never find it and will treat your variable as a
           scalar.

       •   If you put a line-break between a bareword hash-subscript and the
           hash variable, or between a bareword and its associated "=>"
           operator, the bareword will not be formatted correctly (as a
           string). (Noticing a pattern here?)

       Bug reports are always welcome. Email me at <davidcyl@cpan.org>.

       David C.Y. Liu <davidcyl@cpan.org>

       based on code by Cory Johns darkness@yossman.net

       Copyright (c) 2004 David C.Y. Liu. This library is free software; you
       can redistribute and/or modify it under the same conditions as Perl
       itself.

       Note: This is Cory Johns's todo list, not mine. Currently none of
       these features are planned for the near future.

       1.  Improve handling of regular expressions. Add support for regexp-
           special characters. Recognize the /e option to the substitution
           operator (maybe).

       2.  Improve handling of formats. Don't treat format definitions as
           interpolating. Handle format-comments. Possibly format value
           lines as normal Perl code.

       3.  Create in-memory deep-copy routine to replace eval(Data::Dumper)
           deep-copy.

       4.  Generalize state transitions (reset() and, in the future,
           copy_state()) to use non-hard-coded keys and values for state
           variables. Probably will extrapolate them into an overloadable
           hash, and use the aforementioned deep-copy to assign them.

       5.  Create a method to save or copy states between objects
           (copy_state()). Would be useful for using this module in an
           editor.

       6.  Add support for greater-than-one length special characters.
           Specifically, octal, hexadecimal, and control character codes.
           For example, "\644", "\x1a4" or "\c[".

       05-03-2004 David C.Y. Liu (Version 1.01)
           •   Added 'our' to the keywords list.

           •   Fixed bug that prevented interpolation inside qq() quotes.

           •   Renamed to Syntax::Highlight::Perl::Improved.

       04-04-2001 Cory Johns
           •   Fixed problem with special characters not formatting inside
               of Here-Documents.

           •   Fixed bug causing hash variables to format inside of Here-
               Documents.

       03-30-2001 Cory Johns
           •   Fixed bug where quote-terminators were checked for inside of
               Here-Documents.

       03-29-2001 Cory Johns
           •   Moved token processing tests from _format_line() into
               _process_token() (where they should've been all along),
               generally making _format_line() more logical. Contemplating
               extrapolating the tokenizing and token loop into its own
               subroutine to avoid all the recursive calls.

           •   Fixed bug that caused special characters to be recognized
               outside of strings.

           •   Added $VERSION variable.

           •   Added support for different types of literal numbers:
               floating point, exponential notation (eg: 1.3e10),
               hexadecimal, and underscore-separated.

           •   Added the "CodeTerm" and "DATA" Formats.

       03-27-2001 Cory Johns
           •   Added was_pod() and updated the documentation for in_pod().

       03-20-2001 Cory Johns
           •   Added support for Perl formats (ie, "format = ...").

perl v5.36.0                         2023-01-20                         Syntax::Highlight::Perl::Improved(3)