Syntax::Highlight::Perl::Improved(3)  User Contributed Perl Documentation  Syntax::Highlight::Perl::Improved(3)

NAME

       Syntax::Highlight::Perl::Improved - Highlighting of Perl Syntactical
       Structures

VERSION

       This file documents Syntax::Highlight::Perl::Improved version 1.0.

SYNOPSIS

           # simple procedural
           use Syntax::Highlight::Perl::Improved ':BASIC';  # or ':FULL'

           print format_string($my_string);


           # OO
           use Syntax::Highlight::Perl::Improved;

           my $formatter = new Syntax::Highlight::Perl::Improved;
           print $formatter->format_string($my_string);

DESCRIPTION

       This module provides syntax highlighting for Perl code.  The design
       bias is roughly line-oriented and streamed (ie, processing a file
       line-by-line in a single pass).  Provisions may be made in the future
       for tasks related to "back-tracking" (ie, re-doing a single line in
       the middle of a stream) such as speeding up state copying.

   Constructors
       The only constructor provided is "new()".  When called on an existing
       object, "new()" will create a new copy of that object.  Otherwise,
       "new()" creates a new copy of the (internal) Default Object.  Note that
       the use of the procedural syntax modifies the Default Object and that
       those changes will be reflected in any subsequent "new()" calls.

   Formatting
       Formatting is done using the "format_string()" method.  Call
       "format_string()" with one or more strings to format, or it will
       default to using $_.

   Setting and Getting Formats
       You can set the text used for formatting a syntax element using
       "set_format()" (or set the start and end format individually using
       "set_start_format()" and "set_end_format()", respectively).

       You can also retrieve the text used for formatting an element via
       "get_start_format()" or "get_end_format()".  Bulk retrieval of the
       names or values of defined formats is possible via
       "get_format_names_list()" (names), "get_start_format_values_list()"
       and "get_end_format_values_list()".

       See "FORMAT TYPES" later in this document for information on what
       format elements can be used.

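       As a minimal sketch of the calls described above, the following sets
       HTML-style start and end strings with "set_format()".  The
       html_formats() table, its CSS class names and the html_formatter()
       helper are illustrative assumptions, not part of the module; only
       "new()" and "set_format()" are taken from this document.

```perl
use strict;
use warnings;

# Hypothetical table: Format Type name => [ start string, end string ].
# The class names are made up for this example.
sub html_formats {
    return (
        Comment_Normal => [ '<span class="comment">',  '</span>' ],
        Keyword        => [ '<span class="keyword">',  '</span>' ],
        String         => [ '<span class="string">',   '</span>' ],
        # Generalized name: meant to cover every Variable_* Format Type.
        Variable       => [ '<span class="variable">', '</span>' ],
    );
}

# Hypothetical helper: build a formatter configured with the table above.
# The module is loaded lazily, so this file compiles even where the
# module is not installed.
sub html_formatter {
    require Syntax::Highlight::Perl::Improved;
    my $formatter = Syntax::Highlight::Perl::Improved->new;
    $formatter->set_format(html_formats());
    return $formatter;
}
```

       With the module installed, "html_formatter()->format_string($code)"
       should then wrap the recognized elements in the strings given above.
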
   Checking and Setting the State
       You can check certain aspects of the state of the formatter via the
       methods: "in_heredoc()", "in_string()", "in_pod()", "was_pod()",
       "in_data()", and "line_count()".

       You can reset all of the above states (and a few other internal ones)
       using "reset()".

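       The state methods are most useful when a stream is fed to the
       formatter one line at a time.  The sketch below shows one possible
       use; format_stream() is a hypothetical helper, not something the
       module provides, and it assumes only that $formatter offers the
       "reset()", "format_string()", "was_pod()" and "line_count()" methods
       named in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: format a filehandle line by line, leaving POD out
# of the output.  Returns the formatter's line count when done.
sub format_stream {
    my ($formatter, $in, $out) = @_;

    $formatter->reset;                    # begin a new stream
    while (defined(my $line = <$in>)) {
        my $formatted = $formatter->format_string($line);
        next if $formatter->was_pod;      # the line just formatted was POD
        print {$out} $formatted;
    }
    return $formatter->line_count;
}
```

       A call such as "format_stream($formatter, \*STDIN, \*STDOUT)" would
       filter standard input.  Because the formatter keeps its state between
       calls, multi-line constructs should carry over from one line to the
       next.
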
   Stable and Unstable Formatting Modes
       You can set or check the stability of formatting via "unstable()".

       In unstable (TRUE) mode, formatting is not considered to be persistent
       with nested formats.  Or, put another way, when unstable, the formatter
       can only "remember" one format at a time and must reinstate formatting
       for each token.  An example of unstable formatting is using ANSI color
       escape sequences in a terminal.

       In stable (FALSE) mode (the default), formatting is considered
       persistent within arbitrarily nested formats.  Even in stable mode,
       however, formatting is never allowed to span multiple lines; it is
       always fully closed at the end of the line and reinstated at the
       beginning of a new line, if necessary.  This is to ensure properly
       balanced tags when only formatting a partial code snippet.  An example
       of stable formatting is HTML.

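       The ANSI case mentioned above might be set up roughly as follows.
       This is only a sketch: the colour choices, the ansi_formats() table
       and the ansi_formatter() helper are assumptions made for
       illustration, while "set_format()" and "unstable()" are the methods
       described in this document.

```perl
use strict;
use warnings;

# Hypothetical colour table: Format Type name => [ start, end ].
# "\e[0m" resets every terminal attribute at once, so an enclosing
# colour does not survive the end of a nested one.
sub ansi_formats {
    my $reset = "\e[0m";
    return (
        Comment => [ "\e[32m", $reset ],   # green
        Keyword => [ "\e[1m",  $reset ],   # bold
        String  => [ "\e[33m", $reset ],   # yellow
    );
}

# Hypothetical helper; the module is loaded lazily so that this file
# compiles even where the module is not installed.
sub ansi_formatter {
    require Syntax::Highlight::Perl::Improved;
    my $formatter = Syntax::Highlight::Perl::Improved->new;
    $formatter->set_format(ansi_formats());
    $formatter->unstable(1);               # non-zero: unstable mode
    return $formatter;
}
```

       For HTML output the "unstable(1)" call would simply be left out,
       since stable mode is the default.
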
   Substitutions
       Using "define_substitution()", you can have the formatter substitute
       certain strings with others, after the original string has been parsed
       (but before formatting is applied).  This is useful for escaping
       characters special to the output mode (eg, > and < in HTML) without
       them affecting the way the code is parsed.

       You can retrieve the current substitutions (as a hash-ref) via
       "substitutions()".

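       To illustrate the idea, here is a plain-Perl sketch of what a
       substitution table does to a single, already-parsed token.  Both
       html_escapes() and substitute_token() are hypothetical names, and
       the single-pass replacement is an assumption about a reasonable
       approach, not a description of the module's internals.

```perl
use strict;
use warnings;

# Hypothetical escape table for HTML output.
sub html_escapes {
    return ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
}

# Illustration only: replace every key of %table found in $token with
# its value.  One left-to-right pass is used so that the "&" of an
# inserted "&lt;" is not escaped a second time.
sub substitute_token {
    my ($token, %table) = @_;
    return $token unless %table;

    # Longest keys first, so a longer key wins over one of its prefixes.
    my $pattern = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %table;

    $token =~ s/($pattern)/$table{$1}/g;
    return $token;
}
```

       The same table could be handed to the formatter with
       "$formatter->define_substitution(html_escapes())", using the LIST
       form described under "METHODS".
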

FORMAT TYPES

       The Syntax::Highlight::Perl::Improved formatter recognizes and
       differentiates between many Perl syntactical elements.  Each type of
       syntactical element has a Format Type associated with it.  There is
       also a 'DEFAULT' type that is applied to any element whose Format Type
       does not have a value.

       Several of the Format Types have underscores in their name.  This
       underscore is special, and indicates that the Format Type can be
       "generalized."  This means that you can assign a value to just the
       first part of the Format Type name (the part before the underscore) and
       that value will be applied to all Format Types with the same first
       part.  For example, the Format Types for all types of variables begin
       with "Variable_".  Thus, if you assign a value to the Format Type
       "Variable", it will be applied to any type of variable.  Generalized
       Format Types take precedence over non-generalized Format Types.  So the
       value assigned to "Variable" would be applied to "Variable_Scalar",
       even if "Variable_Scalar" had a value explicitly assigned to it.

       You can also define a "short-cut" name for each Format Type that can be
       generalized.  The short-cut name would be the part of the Format Type
       name after the underscore.  For example, the short-cut for
       "Variable_Scalar" would be "Scalar".  Short-cut names have the least
       precedence and are only assigned if neither the generalized Type name
       nor the full Type name has a value.

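       The precedence rules above can be sketched in plain Perl.  The
       lookup_format() function below is a hypothetical illustration, not
       the module's own code: it tries the underscore-joined names from
       most general to least, then each name singly in reverse order, and
       finally 'DEFAULT'.  Returning an empty string when nothing matches is
       an assumption.

```perl
use strict;
use warnings;

# Illustration only: $formats is a hash-ref of Format Type name => value,
# @names are the parts of the name, eg ("Variable", "Scalar").
sub lookup_format {
    my ($formats, @names) = @_;

    # 1. Joined names, most general first: "Variable", "Variable_Scalar".
    for my $i (0 .. $#names) {
        my $name = join '_', @names[0 .. $i];
        return $formats->{$name} if defined $formats->{$name};
    }

    # 2. Each name singly, in reverse order: "Scalar", then "Variable".
    for my $name (reverse @names) {
        return $formats->{$name} if defined $formats->{$name};
    }

    # 3. Fall back to 'DEFAULT' (empty string if unset: an assumption).
    return defined $formats->{DEFAULT} ? $formats->{DEFAULT} : '';
}
```

       For example, with values assigned to both "Variable" and
       "Variable_Scalar", a lookup for ("Variable", "Scalar") yields the
       value of "Variable", matching the rule that generalized names win.
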
       Following is a list of all the syntactical elements that
       Syntax::Highlight::Perl::Improved currently recognizes, along with a
       short description of what each would be applied to.

       Comment_Normal
           A normal Perl comment.  Starts with '#' and goes until the end of
           the line.

       Comment_POD
           Inline documentation.  Starts with a line beginning with an equal
           sign ('=') followed by a word (eg: '=pod') and continuing until a
           line beginning with '=cut'.

       Directive
           Either the "she-bang" line at the beginning of the file, or a line
           directive altering what the compiler thinks the current line and
           file is.

       Label
           A loop or statement label (to be the target of a goto, next, last
           or redo).

       Quote
           Any string or character that begins or ends a String.  Including,
           but not necessarily limited to: quote-like regular expression
           operators ("m//", "s///", "tr///", etc), a Here-Document
           terminating line, the lone period terminating a format, and, of
           course, normal quotes ("'", """, "`", "q{}", "qq{}", "qr{}",
           "qx{}").

       String
           Any text within quotes, "format"s, Here-Documents, Regular
           Expressions, and the like.

       Subroutine
           The identifier used to define, identify, or call a subroutine (or
           method).  Note that Syntax::Highlight::Perl::Improved cannot
           recognize a subroutine if it is called without using parentheses or
           an ampersand, or methods called using the indirect object syntax.
           It formats those as barewords.

       Variable_Scalar
           A scalar variable.

           Note that (theoretically) this format is not applied to non-scalar
           variables that are being used as scalars (ie: array or hash
           lookups, or references to anything other than scalars).
           Syntax::Highlight::Perl::Improved figures out (or at least tries
           to) the actual type of the variable being used (by looking at how
           you're subscripting it) and formats it accordingly.  The first
           character of the variable (ie, the "$", "@", "%", or "*") tells you
           the type of value being used, and the color (hopefully) tells you
           the type of variable being used to get that value.

           (See "KNOWN ISSUES" for information about when this doesn't work
           quite right.)

       Variable_Array
           An array variable (but not usually a slice; see above).

       Variable_Hash
           A hash variable.

       Variable_Typeglob
           A typeglob.  Note that typeglobs not beginning with an asterisk (*)
           (eg: filehandles) are formatted as barewords.  This is because,
           well, they are.

       Whitespace
           Whitespace.  Not usually formatted but it can be.

       Character
           A special, or backslash-escaped, character.  For example: "\n"
           (newline), or "\d" (digits).

           Only occurs within strings or regular expressions.

       Keyword
           A Perl keyword.  Some examples include: my, local, sub, next.

           Note that Perl does not make any distinction between keywords and
           built-in functions (at least not in the documentation).  Thus I had
           to make a subjective call as to what would be considered keywords
           and what would be built-in functions.

           The list of keywords can be found (and overloaded) in the variable
           $Syntax::Highlight::Perl::Improved::keyword_list_re as a
           pre-compiled regular expression.

       Builtin_Function
           A Perl built-in function, called as a function (ie, using
           parentheses).

           The list of built-in functions can be found (and overloaded) in the
           variable $Syntax::Highlight::Perl::Improved::builtin_list_re as a
           pre-compiled regular expression.

       Builtin_Operator
           A Perl built-in function, called as a list or unary operator (ie,
           without using parentheses).

           The list of built-in functions can be found (and overloaded) in the
           variable $Syntax::Highlight::Perl::Improved::builtin_list_re as a
           pre-compiled regular expression.

       Operator
           A Perl operator.

           The list of operators can be found (and overloaded) in the variable
           $Syntax::Highlight::Perl::Improved::operator_list_re as a
           pre-compiled regular expression.

       Bareword
           A bareword.  This can be a user-defined subroutine called without
           parentheses, a typeglob used without an asterisk (*), or just a
           plain old bareword.

       Package
           The name of a package or pragmatic module.

           Note that this does not apply to the package portion of a fully
           qualified variable name.

       Number
           A numeric literal.

       Symbol
           A symbol (ie, non-operator punctuation).

       CodeTerm
           The special tokens that signal the end of executable code and the
           beginning of the DATA section.  Specifically, "__END__" and
           "__DATA__".

       DATA
           Anything in the DATA section (see "CodeTerm").

PROCEDURAL vs. OBJECT ORIENTED

       Syntax::Highlight::Perl::Improved uses OO method-calls internally (and
       actually defines a Default Object that is used when the functions are
       invoked procedurally) so you will not gain anything (efficiency-wise)
       by using the procedural interface.  It is just a matter of style.

       It is actually recommended that you use the OO interface, as this
       allows you to instantiate multiple, concurrent-yet-separate formatters.
       Though I cannot think of why you would need multiple formatters
       instantiated. :-)

       One point to note: the "new()" method uses the Default Object to
       initialize new objects.  This means that any changes to the state of
       the Default Object (including Format definitions) made by using the
       procedural interface will be reflected in any subsequently created
       objects.  This can be useful in some cases (eg, call "set_format()"
       procedurally just before creating a batch of new objects to define
       default Formats for them all) but will most likely lead to trouble.

METHODS

       new PACKAGE
       new OBJECT
           Creates a new object.  If called on an existing object, creates a
           new copy of that object (which is thenceforth totally separate from
           the original).

       reset
           Resets the object's internal state.  This breaks out of strings and
           here-docs, ends PODs, resets the line-count, and otherwise gets the
           object back into a "normal" state to begin processing a new stream.

           Note that this does not reset any user options (including formats
           and format stability).

       unstable EXPR
       unstable
           Returns true if the formatter is in unstable mode.

           If called with a non-zero number, puts the formatter into unstable
           formatting mode.

           In unstable mode, it is assumed that formatting is not persistent
           from one token to the next and that each token must be explicitly
           formatted.

       in_heredoc
           Returns true if the next string to be formatted will be inside a
           Here-Document.

       in_string
           Returns true if the next string to be formatted will be inside a
           multi-line string.

       in_pod
           Returns true if the formatter would consider the next string passed
           to it as being within a POD structure.  This is false immediately
           before any POD instigators ("=pod", "=head1", "=item", etc), true
           immediately after an instigator, throughout the POD and immediately
           before the POD terminator ("=cut"), and false immediately after the
           POD terminator.

       was_pod
           Returns true if the last line of the string just formatted was part
           of a POD structure.  This includes the "/^=\w+/" POD instigators
           and terminators.

       in_data
           Returns true if the next string to be formatted will be inside the
           DATA section (ie, follows a "__DATA__" or "__END__" tag).

       line_count
           Returns the number of lines processed by the formatter.

       substitutions
           Returns a reference to the substitution table used.  The
           substitution table is a hash whose keys are the strings to be
           replaced, and whose values are what to replace them with.

       define_substitution HASH_REF
       define_substitution LIST
           Allows the user to define certain characters that will be
           substituted before formatting is done (but after they have been
           processed for meaning).

           If the first parameter is a reference to a hash, the formatter will
           replace its own hash with the given one, and subsequent changes to
           the hash outside the formatter will be reflected.

           Otherwise, it will copy the arguments passed into its own hash,
           and any substitutions already defined (but not in the parameter
           list) will be preserved (ie, the new substitutions will be added,
           without destroying what was there already).

       set_start_format HASH_REF
       set_start_format LIST
           Given either a list of keys/values, or a reference to a hash of
           keys/values, copy them into the object's list of start Formats.

       set_end_format HASH_REF
       set_end_format LIST
           Given either a list of keys/values, or a reference to a hash of
           keys/values, copy them into the object's list of end Formats.

       set_format LIST
           Sets the formatting string for one or more formats.

           You should pass a list of keys/values where the keys are the format
           names and the values are references to arrays containing the
           starting and ending formatting strings (in that order) for that
           format.

       get_start_format LIST
           Retrieve the string that is inserted to begin a given format type
           (starting format string).

           The names are looked for in the following order:

           First: Prefer the names joined by underscore, from most general to
           least.  For example, given ("Variable", "Scalar"): "Variable" then
           "Variable_Scalar".

           Second: Then try each name singly, in reverse order.  For example,
           "Scalar" then "Variable".

           See "FORMAT TYPES" for more information.

       get_end_format LIST
           Retrieve the string that is inserted to end a given format type
           (ending format string).

       get_format_names_list
           Returns a list of the names of all the Formats defined.

       get_start_format_values_list
           Returns a list of the values of all the start Formats defined (in
           the same order as the names returned by "get_format_names_list()").

       get_end_format_values_list
           Returns a list of the values of all the end Formats defined (in the
           same order as the names returned by "get_format_names_list()").

       format_string LIST
           Formats one or more strings of Perl code.  If no strings are
           specified, defaults to $_.  Returns the list of formatted strings
           (or the first string formatted if called in scalar context).

           Note:  The end of the string is considered to be the end of a line,
           regardless of whether or not there is a trailing line-break (but
           trailing line-breaks will not cause an extra, empty line).

           Another Note:  The function actually uses $/ to determine
           line-breaks, unless $/ is set to "\n" (newline).  If $/ is "\n",
           then it looks for the first match of "m/\r?\n|\n?\r/" in the string
           and uses that to determine line-breaks.  This is to make it easy to
           handle non-unix text.  Whatever characters it ends up using as
           line-breaks are preserved.

       format_token TOKEN, LIST
           Returns TOKEN wrapped in the start and end Formats corresponding to
           LIST (as would be returned by "get_start_format( LIST )" and
           "get_end_format( LIST )", respectively).

           No syntax checking is done on TOKEN but substitutions defined with
           "define_substitution()" are performed.

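       The line-break rule described under "format_string" can be sketched
       in plain Perl as follows.  detect_line_break() is a hypothetical name
       used for illustration only; the fallback to "\n" when the text holds
       no line-break, and the handling of an undefined $/, are assumptions.

```perl
use strict;
use warnings;

# Illustration only: decide which character sequence separates lines.
sub detect_line_break {
    my ($text) = @_;

    # A record separator other than "\n" is used as given.
    return $/ if defined($/) && $/ ne "\n";

    # Otherwise take the first thing in the text that looks like a
    # line-break, using the pattern quoted in this document.
    return $1 if $text =~ m/(\r?\n|\n?\r)/;

    return "\n";    # assumption: nothing found, fall back to newline
}
```

       With the default $/, text using "\r\n" line endings would be split
       on "\r\n", and text using a lone "\r" would be split on "\r".
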

KNOWN ISSUES or LIMITATIONS

       •   Barewords used as keys to a hash are formatted as strings.  This is
           Good.  They should not be, however, if they are not the only thing
           within the curly braces.  That can be fixed.

       •   This version does not handle formats (see perlform(1)) very well.
           It treats them as Here-Documents and ignores the rules for comment
           lines, as well as the fact that picture lines are not supposed to
           be interpolated.  Thus, your picture lines will look strange with
           the '@'s being formatted as array variables (albeit, invalid ones).
           Ideally, it would also treat value lines as normal Perl code and
           format accordingly.  I think I'll get to the comment lines and
           non-interpolating picture lines first.  If/When I do get this
           fixed, I will most likely add a format type of 'Format' or
           something, so that they can be formatted differently, if so
           desired.

       •   This version does not handle Regular Expression significant
           characters.  It simply treats Regular Expressions as interpolated
           strings.

       •   User-defined subroutines, called without parentheses, are formatted
           as barewords.  This is because there is no way to tell them apart
           from barewords without parsing the code, and would require us to go
           as far as perl does when doing the "-c" check (ie, executing BEGIN
           and END blocks and the like).  That's not going to happen.

       •   If you are indexing (subscripting) an array or hash, the formatter
           tries to figure out the "real" variable class by looking at how you
           index the variable.  However, if you do something funky (but legal
           in Perl) and put line-breaks or comments between the variable class
           character ($) and your identifier, the formatter will get confused
           and treat your variable as a scalar.  Until it finds the index
           character.  Then it will format the scalar class character ($) as a
           scalar and your identifier as the "correct" class.

       •   If you put a line-break between your variable identifier and its
           indexing character (see above), which is also legal in Perl, the
           formatter will never find it and treat your variable as a scalar.

       •   If you put a line-break between a bareword hash-subscript and the
           hash variable, or between a bareword and its associated "=>"
           operator, the bareword will not be formatted correctly (as a
           string).  (Noticing a pattern here?)

BUGS

       Bug reports are always welcome. Email me at <davidcyl@cpan.org>.

AUTHOR

       David C.Y. Liu <davidcyl@cpan.org>

       based on code by Cory Johns <darkness@yossman.net>

       Copyright (c) 2004 David C.Y. Liu.  This library is free software; you
       can redistribute and/or modify it under the same conditions as Perl
       itself.

TO DO

       Note: This is Cory Johns's todo list, not mine. Currently none of these
       features are planned for the near future.

       1.  Improve handling of regular expressions.  Add support for
           regexp-special characters.  Recognize the /e option to the
           substitution operator (maybe).

       2.  Improve handling of formats.  Don't treat format definitions as
           interpolating.  Handle format-comments.  Possibly format value
           lines as normal Perl code.

       3.  Create in-memory deep-copy routine to replace "eval(Data::Dumper)"
           deep-copy.

       4.  Generalize state transitions ("reset()" and, in the future,
           "copy_state()") to use non-hard-coded keys and values for state
           variables.  Probably will extrapolate them into an overloadable
           hash, and use the aforementioned deep-copy to assign them.

       5.  Create a method to save or copy states between objects
           ("copy_state()").  Would be useful for using this module in an
           editor.

       6.  Add support for greater-than-one length special characters.
           Specifically, octal, hexadecimal, and control character codes.  For
           example, "\644", "\x1a4" or "\c[".

REVISIONS

   05-03-2004  David C.Y. Liu (Version 1.01)
       •   Added 'our' to the keywords list.

       •   Fixed bug that prevented interpolation inside qq() quotes.

       •   Renamed to Syntax::Highlight::Perl::Improved.

   04-04-2001  Cory Johns
       •   Fixed problem with special characters not formatting inside of
           Here-Documents.

       •   Fixed bug causing hash variables to format inside of
           Here-Documents.

   03-30-2001  Cory Johns
       •   Fixed bug where quote-terminators were checked for inside of
           Here-Documents.

   03-29-2001  Cory Johns
       •   Moved token processing tests from _format_line() into
           _process_token() (where they should've been all along), generally
           making _format_line() more logical.  Contemplating extrapolating
           the tokenizing and token loop into its own subroutine to avoid all
           the recursive calls.

       •   Fixed bug that caused special characters to be recognized outside
           of strings.

       •   Added $VERSION variable.

       •   Added support for different types of literal numbers: floating
           point, exponential notation (eg: 1.3e10), hexadecimal, and
           underscore-separated.

       •   Added the "CodeTerm" and "DATA" Formats.

   03-27-2001  Cory Johns
       •   Added was_pod() and updated the documentation for in_pod().

   03-20-2001  Cory Johns
       •   Added support for Perl formats (ie, "format = ...").

POD ERRORS

       Hey! The above document had some coding errors, which are explained
       below:

       Around line 47:
           You forgot a '=back' before '=head2'

       Around line 102:
           =back without =over

perl v5.36.0                      2022-07-22  Syntax::Highlight::Perl::Improved(3)