SGML::Parser::OpenSP(3)  User Contributed Perl Documentation  SGML::Parser::OpenSP(3)

NAME

       SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS

         use SGML::Parser::OpenSP;

         my $p = SGML::Parser::OpenSP->new;
         my $h = ExampleHandler->new;

         $p->catalogs(qw(xhtml.soc));
         $p->warnings(qw(xml valid));
         $p->handler($h);

         $p->parse("example.xhtml");

DESCRIPTION

       This module provides an interface to the OpenSP SGML parser. OpenSP and
       this module are event based. As the parser recognizes parts of the
       document (say, the start or end of an element), any handlers registered
       for that type of event are called with suitable parameters.

COMMON METHODS

       new()
           Returns a new SGML::Parser::OpenSP object. Takes no arguments.

       parse($file)
           Parses the file passed as an argument. Note that this must be a
           filename and not a filehandle. See "PROCESSING FILES" below for
           details.

       parse_string($data)
           Parses the data passed as an argument. See "PROCESSING FILES" below
           for details.

       halt()
           Halts processing before parsing the entire document. Takes no
           arguments.

       split_message()
           Splits OpenSP's error messages into their component parts. See
           "POST-PROCESSING ERROR MESSAGES" below for details.

       get_location()
           Returns a hash reference with positioning information. See
           "POSITIONING INFORMATION" below for details.

CONFIGURATION

   BOOLEAN OPTIONS
       $p->handler([$handler])
           Report events to the blessed reference $handler.

   ERROR MESSAGE FORMAT
       $p->show_open_entities([$bool])
           Describe open entities in error messages. Error messages always
           include the position of the most recently opened external entity.
           The default is false.

       $p->show_open_elements([$bool])
           Show the generic identifiers of open elements in error messages.
           The default is false.

       $p->show_error_numbers([$bool])
           Show message numbers in error messages.

   GENERATED EVENTS
       $p->output_comment_decls([$bool])
           Generate "comment_decl" events. The default is false.

       $p->output_marked_sections([$bool])
           Generate marked section events ("marked_section_start",
           "marked_section_end", "ignored_chars"). The default is false.

       $p->output_general_entities([$bool])
           Generate "general_entity" events. The default is false.

   IO SETTINGS
       $p->map_catalog_document([$bool])
           Treat the arguments to "parse" as catalog files rather than as the
           document entity. The document entity is specified by the first
           DOCUMENT entry in the catalog files. The default is false.

       $p->restrict_file_reading([$bool])
           Restrict file reading to the specified directories (see the
           "search_dirs" method and the "SGML_SEARCH_PATH" environment
           variable). You should turn this option on and configure the search
           paths accordingly if you intend to process untrusted resources. The
           default is false.

       $p->catalogs([@catalogs])
           Map public identifiers and entity names to system identifiers using
           the specified catalog entry files. Multiple catalogs are allowed.
           If there is a catalog entry file called "catalog" in the same place
           as the document entity, it will be searched for immediately after
           those specified.

       $p->search_dirs([@search_dirs])
           Search the specified directories for files specified in system
           identifiers. Multiple directories are allowed. See the description
           of the osfile storage manager in the OpenSP documentation for more
           information about file searching.

       $p->pass_file_descriptor([$bool])
           Instruct "parse_string" to pass the input data down to the guts of
           OpenSP using the "OSFD" storage manager (if true) or the "OSFILE"
           storage manager (if false). This amounts to the difference between
           passing a file descriptor and a (temporary) file name.

           The default is true except on platforms, such as Win32, which are
           known to not support passing file descriptors around in this
           manner. On platforms which support it you can call this method with
           a false parameter to force use of temporary file names instead.

           In general, this will do the right thing on its own so it's best to
           consider this an internal method. If your platform is such that you
           have to force use of the OSFILE storage manager, please report it
           as a bug and include the values of $^O, $Config{archname}, and a
           description of the platform (e.g. "Windows Vista Service Pack 42").

   PROCESSING OPTIONS
       $p->include_params([@include_params])
           For each name in @include_params pretend that

             <!ENTITY % name "INCLUDE">

           occurs at the start of the document type declaration subset in the
           SGML document entity. Since repeated definitions of an entity are
           ignored, this definition will take precedence over any other
           definitions of this entity in the document type declaration.
           Multiple names are allowed.  If the SGML declaration replaces the
           reserved name INCLUDE then the new reserved name will be the
           replacement text of the entity. Typically the document type
           declaration will contain

             <!ENTITY % name "IGNORE">

           and will use %name; in the status keyword specification of a marked
           section declaration. In this case the effect of the option will be
           to cause the marked section not to be ignored.

       $p->active_links([@active_links])
           ???

   ENABLING WARNINGS
       Additional warnings can be enabled using

         $p->warnings([@warnings])

       The following values can be used to enable warnings:

       xml Warn about constructs that are not allowed by XML.

       mixed
           Warn about mixed content models that do not allow #pcdata anywhere.

       sgmldecl
           Warn about various dubious constructions in the SGML declaration.

       should
           Warn about various recommendations made in ISO 8879 that the
           document does not comply with. (Recommendations are expressed with
           "should", as distinct from requirements, which are usually
           expressed with "shall".)

       default
           Warn about defaulted references.

       duplicate
           Warn about duplicate entity declarations.

       undefined
           Warn about undefined elements: elements used in the DTD but not
           defined.

       unclosed
           Warn about unclosed start and end-tags.

       empty
           Warn about empty start and end-tags.

       net Warn about net-enabling start-tags and null end-tags.

       min-tag
           Warn about minimized start and end-tags. Equivalent to the
           combination of the unclosed, empty and net warnings.

       unused-map
           Warn about unused short reference maps: maps that are declared with
           a short reference mapping declaration but never used in a short
           reference use declaration in the DTD.

       unused-param
           Warn about parameter entities that are defined but not used in a
           DTD.  Unused internal parameter entities whose text is "INCLUDE" or
           "IGNORE" won't get the warning.

       notation-sysid
           Warn about notations for which no system identifier could be
           generated.

       all Warn about conditions that should usually be avoided (in the
           opinion of the author). Equivalent to: "mixed", "should",
           "default", "undefined", "sgmldecl", "unused-map", "unused-param",
           "empty" and "unclosed".

   DISABLING WARNINGS
       A warning can be disabled by using its name prefixed with "no-".  Thus
       calling warnings(qw(all no-duplicate)) will enable all warnings except
       those about duplicate entity declarations.

       The following values for warnings() disable errors:

       no-idref
           Do not give an error for an ID reference value which no element has
           as its ID. The effect will be as if each attribute declared as an
           ID reference value had been declared as a name.

       no-significant
           Do not give an error when a character that is not a significant
           character in the reference concrete syntax occurs in a literal in
           the SGML declaration. This may be useful in conjunction with
           certain buggy test suites.

       no-valid
           Do not require the document to be type-valid. This has the effect
           of changing the SGML declaration to specify "VALIDITY NOASSERT" and
           "IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has the
           effect of changing the SGML declaration to specify "VALIDITY TYPE"
           and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
           "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF"
           specified in the SGML declaration will be used.

   XML WARNINGS
       The following warnings are turned on for the "xml" warning described
       above:

       inclusion
           Warn about inclusions in element type declarations.

       exclusion
           Warn about exclusions in element type declarations.

       rcdata-content
           Warn about RCDATA declared content in element type declarations.

       cdata-content
           Warn about CDATA declared content in element type declarations.

       ps-comment
           Warn about comments in parameter separators.

       attlist-group-decl
           Warn about name groups in attribute declarations.

       element-group-decl
           Warn about name groups in element type declarations.

       pi-entity
           Warn about PI entities.

       internal-sdata-entity
           Warn about internal SDATA entities.

       internal-cdata-entity
           Warn about internal CDATA entities.

       external-sdata-entity
           Warn about external SDATA entities.

       external-cdata-entity
           Warn about external CDATA entities.

       bracket-entity
           Warn about bracketed text entities.

       data-atts
           Warn about attribute definition list declarations for notations.

       missing-system-id
           Warn about external identifiers without system identifiers.

       conref
           Warn about content reference attributes.

       current
           Warn about current attributes.

       nutoken-decl-value
           Warn about attributes with a declared value of NUTOKEN or NUTOKENS.

       number-decl-value
           Warn about attributes with a declared value of NUMBER or NUMBERS.

       name-decl-value
           Warn about attributes with a declared value of NAME or NAMES.

       named-char-ref
           Warn about named character references.

       refc
           Warn about omitted refc delimiters.

       temp-ms
           Warn about TEMP marked sections.

       rcdata-ms
           Warn about RCDATA marked sections.

       instance-include-ms
           Warn about INCLUDE marked sections in the document instance.

       instance-ignore-ms
           Warn about IGNORE marked sections in the document instance.

       and-group
           Warn about AND connectors in model groups.

       rank
           Warn about ranked elements.

       empty-comment-decl
           Warn about empty comment declarations.

       att-value-not-literal
           Warn about attribute values which are not literals.

       missing-att-name
           Warn about omitted attribute names in start tags.

       comment-decl-s
           Warn about spaces before the MDC in comment declarations.

       comment-decl-multiple
           Warn about comment declarations containing multiple comments.

       missing-status-keyword
           Warn about marked sections without a status keyword.

       multiple-status-keyword
           Warn about marked sections with multiple status keywords.

       instance-param-entity
           Warn about parameter entities in the document instance.

       min-param
           Warn about minimization parameters in element type declarations.

       mixed-content-xml
           Warn about cases of mixed content which are not allowed in XML.

       name-group-not-or
           Warn about name groups with a connector different from OR.

       pi-missing-name
           Warn about processing instructions which don't start with a name.

       instance-status-keyword-s
           Warn about spaces between DSO and status keyword in marked
           sections.

       external-data-entity-ref
           Warn about references to external data entities in the content.

       att-value-external-entity-ref
           Warn about references to external data entities in attribute
           values.

       data-delim
           Warn about occurrences of `<' and `&' as data.

       explicit-sgml-decl
           Warn about an explicit SGML declaration.

       internal-subset-ms
           Warn about marked sections in the internal subset.

       default-entity
           Warn about a default entity declaration.

       non-sgml-char-ref
           Warn about numeric character references to non-SGML characters.

       internal-subset-ps-param-entity
           Warn about parameter entity references in parameter separators in
           the internal subset.

       internal-subset-ts-param-entity
           Warn about parameter entity references in token separators in the
           internal subset.

       internal-subset-literal-param-entity
           Warn about parameter entity references in parameter literals in the
           internal subset.

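       The configuration methods in this section might be combined as in the
       following sketch, which sets up a parser for untrusted input. The
       helper name configure_parser and its search_dirs/catalogs arguments
       are made up for this illustration, and the directory in the usage
       comment is a placeholder; only methods documented above are called.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): applies a conservative
# configuration to $p, which is expected to be an SGML::Parser::OpenSP
# object.
sub configure_parser {
    my ($p, %opt) = @_;

    # Only read files below the configured search directories.
    $p->restrict_file_reading(1);
    $p->search_dirs(@{ $opt{search_dirs} }) if $opt{search_dirs};

    # Resolve public identifiers through the given catalog files.
    $p->catalogs(@{ $opt{catalogs} }) if $opt{catalogs};

    # Enable the "all" warning set, except duplicate entity declarations.
    $p->warnings(qw(all no-duplicate));

    # Include element and entity context in error messages.
    $p->show_open_elements(1);
    $p->show_open_entities(1);

    return $p;
}

# Usage with the real module (requires OpenSP to be installed):
#
#   use SGML::Parser::OpenSP;
#   my $p = configure_parser(
#       SGML::Parser::OpenSP->new,
#       search_dirs => ['/usr/share/sgml'],   # placeholder path
#       catalogs    => ['xhtml.soc'],
#   );
```

       Passing the parser object in, rather than constructing it inside the
       helper, keeps the choice of handler and input with the caller.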

PROCESSING FILES

       In order to start processing a document and receive events, the
       "parse" method must be called. It takes one argument specifying the
       path to a file (not a file handle). You must set an event handler using
       the "handler" method prior to using this method. The return value of
       "parse" is currently undefined.

EVENT HANDLERS

       In order to receive data from the parser you need to write an event
       handler. For example,

         package ExampleHandler;

         sub new { bless {}, shift }

         sub start_element
         {
             my ($self, $elem) = @_;
             printf "  * %s\n", $elem->{Name};
         }

       This handler prints the name of each element as it is found in the
       document. For a typical XHTML document this might result in something
       like

         * html
         * head
         * title
         * body
         * p
         * ...

       The events closely match those in the generic interface to OpenSP; see
       <http://openjade.sf.net/doc/generic.htm> for more information.

       Event names have been changed to lowercase with underscores separating
       words, while property names are capitalized. Arrays are represented as
       Perl array references. "Position" information is not passed to the
       handler but is made available through the "get_location" method, which
       can be called from event handlers. Some redundant information has also
       been stripped, and the generic identifier of an element is stored in
       the "Name" hash entry.

       For example, for an EndElementEvent the "end_element" handler gets
       called with a hash reference

         {
           Name => 'gi'
         }

       The following events are defined:

         * appinfo
         * processing_instruction
         * start_element
         * end_element
         * data
         * sdata
         * external_data_entity_ref
         * subdoc_entity_ref
         * start_dtd
         * end_dtd
         * end_prolog
         * general_entity       # set $p->output_general_entities(1)
         * comment_decl         # set $p->output_comment_decls(1)
         * marked_section_start # set $p->output_marked_sections(1)
         * marked_section_end   # set $p->output_marked_sections(1)
         * ignored_chars        # set $p->output_marked_sections(1)
         * error
         * open_entity_change

       If the documentation of the generic interface to OpenSP states that
       certain data is not valid, it will not be available through this
       interface (i.e., the respective key does not exist in the hash ref).

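       A handler may implement any subset of the events listed above. The
       following sketch handles three of them, counting elements by name and
       accumulating character data. The package name TextCollector is made up
       for this illustration, and the "Data" key on "data" events is an
       assumption derived from the capitalization rule described above; check
       it against the generic interface documentation before relying on it.

```perl
use strict;
use warnings;

package TextCollector;

sub new { bless { count => {}, text => '', depth => 0 }, shift }

# Called for every start-tag; the generic identifier is in "Name".
sub start_element {
    my ($self, $elem) = @_;
    $self->{count}{ $elem->{Name} }++;
    $self->{depth}++;
}

# Called for every end-tag.
sub end_element {
    my ($self, $elem) = @_;
    $self->{depth}--;
}

# Called for character data. The "Data" key is an assumption (see above).
sub data {
    my ($self, $event) = @_;
    $self->{text} .= $event->{Data} if defined $event->{Data};
}

package main;

# Usage with the real module (requires OpenSP to be installed):
#
#   use SGML::Parser::OpenSP;
#   my $p = SGML::Parser::OpenSP->new;
#   my $h = TextCollector->new;
#   $p->handler($h);
#   $p->parse("example.xhtml");
#   printf "%s: %d\n", $_, $h->{count}{$_} for sort keys %{ $h->{count} };
```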

POSITIONING INFORMATION

       Event handlers can call the "get_location" method on the parser object
       to retrieve positioning information. The method returns a hash
       reference with the following properties:

         LineNumber   => ..., # line number
         ColumnNumber => ..., # column number
         ByteOffset   => ..., # number of preceding bytes
         EntityOffset => ..., # number of preceding bit combinations
         EntityName   => ..., # name of the external entity
         FileName     => ..., # name of the file

       Any of these values can be "undef" or an empty string.

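       As a sketch of how a handler might use this, the following package
       keeps a reference to the parser and records where each element starts.
       The package name LocatingHandler, the "parser" and "seen" hash keys and
       the "?" placeholder are made up for this illustration. Weakening the
       stored parser reference is a design choice of the sketch, on the
       assumption that the parser in turn keeps a reference to its handler.

```perl
use strict;
use warnings;
use Scalar::Util ();

package LocatingHandler;

# $parser is the parser object that will call this handler.
sub new {
    my ($class, $parser) = @_;
    my $self = bless { parser => $parser, seen => [] }, $class;
    # Avoid a reference cycle between parser and handler (assumption: the
    # parser stores the handler it is given).
    Scalar::Util::weaken($self->{parser});
    return $self;
}

# Location values can be undef or empty, so substitute a placeholder.
sub _or_unknown {
    my ($value) = @_;
    return (defined $value && length $value) ? $value : '?';
}

sub start_element {
    my ($self, $elem) = @_;
    my $loc = $self->{parser}->get_location;
    push @{ $self->{seen} }, join ':',
        _or_unknown($loc->{FileName}),
        _or_unknown($loc->{LineNumber}),
        _or_unknown($loc->{ColumnNumber}),
        $elem->{Name};
}

package main;

# Usage with the real module (requires OpenSP to be installed). Keep $p in
# a variable of its own, since the handler only holds a weak reference:
#
#   use SGML::Parser::OpenSP;
#   my $p = SGML::Parser::OpenSP->new;
#   my $h = LocatingHandler->new($p);
#   $p->handler($h);
#   $p->parse("example.xhtml");
#   print "$_\n" for @{ $h->{seen} };
```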

POST-PROCESSING ERROR MESSAGES

       OpenSP returns error messages in the form of a string rather than as
       individual components of the message such as line numbers or message
       text. The "split_message" method on the parser object can be used to
       post-process these error message strings as reliably as possible. It
       can be used, for example, from an error event handler if the parser
       object is accessible:

         sub error
         {
           my $self = shift;
           my $erro = shift;
           my $mess = $self->{parser}->split_message($erro);
         }

       See the description of "split_message" in the
       SGML::Parser::OpenSP::Tools documentation.

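       Building on the handler above, the following sketch collects split
       messages and stops the parse once a limit is reached, using the
       "halt" method described under "COMMON METHODS". The package name
       ErrorCollector, its "limit" argument and the "raw"/"split" keys are
       made up for this illustration. The structure of the value returned by
       "split_message" is not inspected here, and wrapping the call in "eval"
       is a precaution of the sketch rather than documented behaviour.

```perl
use strict;
use warnings;

package ErrorCollector;

sub new {
    my ($class, %args) = @_;
    return bless {
        parser => $args{parser},        # the parser object calling us
        limit  => $args{limit} || 10,   # hypothetical cut-off
        errors => [],
    }, $class;
}

sub error {
    my ($self, $erro) = @_;

    # Keep the raw event next to its split form; "split" stays undef if
    # split_message fails for some reason.
    my $mess = eval { $self->{parser}->split_message($erro) };
    push @{ $self->{errors} }, { raw => $erro, split => $mess };

    # Ask the parser to stop once the limit has been reached.
    $self->{parser}->halt if @{ $self->{errors} } == $self->{limit};
}

package main;

# Usage with the real module (requires OpenSP to be installed):
#
#   use SGML::Parser::OpenSP;
#   my $p = SGML::Parser::OpenSP->new;
#   my $h = ErrorCollector->new(parser => $p, limit => 5);
#   $p->handler($h);
#   $p->parse("example.xhtml");
#   printf "%d error event(s)\n", scalar @{ $h->{errors} };
```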

UNICODE SUPPORT

       All strings passed to event handlers and returned from helper routines
       are UTF-8 encoded with the UTF-8 flag turned on. Helper functions like
       "split_message" expect (but don't check) that string arguments are
       UTF-8 encoded and have the UTF-8 flag turned on. Passing unexpected
       input to helper functions results in undefined behavior and should be
       avoided.

       "parse" has limited support for binary input, but the binary input must
       be compatible with OpenSP's generic interface requirements and you must
       specify the encoding through means available to OpenSP to enable it to
       properly decode the binary input. Any encoding meta data about such
       binary input specific to Perl (such as encoding disciplines for file
       handles when you pass a file descriptor) will be ignored. For more
       specific information refer to the OpenSP manual.

       •   <http://openjade.sourceforge.net/doc/sysid.htm>

       •   <http://openjade.sourceforge.net/doc/charset.htm>

ENVIRONMENT VARIABLES

       OpenSP supports a number of environment variables to control specific
       processing aspects, such as "SGML_SEARCH_PATH" or "SP_CHARSET_FIXED".
       Portable applications need to ensure that these are set before the
       OpenSP library is loaded into memory, which happens when the XS code is
       loaded. This means you need to set them in a "BEGIN" block:

         BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
         use SGML::Parser::OpenSP;
         # ...

       Otherwise changes to the environment might not propagate to OpenSP.
       This applies specifically to Win32 systems.

       SGML_SEARCH_PATH
           See <http://openjade.sourceforge.net/doc/sysid.htm>.

       SP_HTTP_USER_AGENT
           The "User-Agent" header for HTTP requests.

       SP_HTTP_ACCEPT
           The "Accept" header for HTTP requests.

       SP_MESSAGE_FORMAT
           Enable run-time selection of the message format. The value is one
           of "XML", "NONE", "TRADITIONAL". Whether this has an effect depends
           on a compile-time setting which might not be enabled in your OpenSP
           build. This module assumes that no such support was compiled in.

       SGML_CATALOG_FILES
       SP_USE_DOCUMENT_CATALOG
           See <http://openjade.sourceforge.net/doc/catalog.htm>.

       SP_SYSTEM_CHARSET
       SP_CHARSET_FIXED
       SP_BCTF
       SP_ENCODING
           See <http://openjade.sourceforge.net/doc/charset.htm>.

       Note that you can use the "search_dirs" method instead of
       "SGML_SEARCH_PATH", the "catalogs" method instead of
       "SGML_CATALOG_FILES", and attributes on storage object specifications
       instead of "SP_BCTF" and "SP_ENCODING". For example, if
       "SP_CHARSET_FIXED" is set to 1 you can use

         $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

       to process "example.xhtml" using the "UTF-8" character encoding.

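       If the encoding varies between documents, the storage object
       specification shown above can be built by a small helper. The function
       name osfile_sysid and the restriction on encoding names are made up
       for this sketch; it only produces strings of the form shown in the
       example and does not cover other storage manager attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "<OSFILE encoding='...'>path" for use as the
# argument to parse().
sub osfile_sysid {
    my ($path, $encoding) = @_;

    # Refuse encoding names that could break out of the quoted attribute.
    die "unexpected encoding name: $encoding\n"
        unless $encoding =~ /\A[A-Za-z0-9._:-]+\z/;

    return "<OSFILE encoding='$encoding'>$path";
}

# Usage with the real module (requires OpenSP to be installed):
#
#   BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
#   use SGML::Parser::OpenSP;
#   ...
#   $p->parse(osfile_sysid('example.xhtml', 'UTF-8'));
```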

KNOWN ISSUES

       OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
       "SP_WIDE_SYSTEM" undefined; otherwise this module will break at runtime
       or fail to compile.

BUG REPORTS

       Please report bugs in this module via
       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

       Please report bugs in OpenSP via
       <http://sf.net/tracker/?group_id=2115&atid=102115>

       Please send comments and questions to the spo-devel mailing list; see
       <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO

       •   <http://openjade.sf.net/doc/generic.htm>

       •   <http://openjade.sf.net/>

       •   <http://sf.net/projects/spo/>

AUTHORS

         Terje Bless <link@cpan.org> wrote version 0.01.
         Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02+.

         Copyright (c) 2006-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.36.0                      2023-01-20           SGML::Parser::OpenSP(3)