SGML::Parser::OpenSP(3)  User Contributed Perl Documentation  SGML::Parser::OpenSP(3)

NAME

       SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS

         use SGML::Parser::OpenSP;

         my $p = SGML::Parser::OpenSP->new;
         my $h = ExampleHandler->new;

         $p->catalogs(qw(xhtml.soc));
         $p->warnings(qw(xml valid));
         $p->handler($h);

         $p->parse("example.xhtml");

DESCRIPTION

       This module provides an interface to the OpenSP SGML parser. Both
       OpenSP and this module are event based: as the parser recognizes a
       part of the document (say, the start or end of an element), any
       handlers registered for that type of event are called with suitable
       parameters.

COMMON METHODS

       new()
           Returns a new SGML::Parser::OpenSP object. Takes no arguments.

       parse($file)
           Parses the file passed as an argument. Note that this must be a
           filename and not a filehandle. See "PROCESSING FILES" below for
           details.

       parse_string($data)
           Parses the data passed as an argument. See "PROCESSING FILES" below
           for details.

       halt()
           Halts processing before parsing the entire document. Takes no
           arguments.

       split_message()
           Splits OpenSP's error messages into their component parts. See
           "POST-PROCESSING ERROR MESSAGES" below for details.

       get_location()
           See "POSITIONING INFORMATION" below for details.

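       As a rough sketch of how halt() might be used, the handler below stops
       the parse once it has seen a "title" element. It assumes the handler
       keeps the parser in a "parser" hash entry. TitleFinder and StubParser
       are hypothetical names, and the stub merely stands in for a real
       SGML::Parser::OpenSP object so that the logic runs without OpenSP.

```perl
use strict;
use warnings;

# Hypothetical handler: notes the first "title" element, then asks the
# parser to stop.
package TitleFinder;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, found => 0 }, $class;
}

sub start_element {
    my ($self, $elem) = @_;
    return unless lc($elem->{Name}) eq 'title';
    $self->{found} = 1;
    $self->{parser}->halt;    # takes no arguments
    return;
}

# Hypothetical stand-in for the parser; it only counts halt() calls.
package StubParser;

sub new  { return bless { halted => 0 }, shift }
sub halt { $_[0]{halted}++; return }

package main;

my $stub    = StubParser->new;
my $handler = TitleFinder->new($stub);

$handler->start_element({ Name => 'html' });
$handler->start_element({ Name => 'title' });

print "halt requested: $stub->{halted}\n";
```

       With the real module, the object returned by SGML::Parser::OpenSP->new
       would take the place of the stub.
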

CONFIGURATION

       BOOLEAN OPTIONS

       $p->handler([$handler])
           Report events to the blessed reference $handler.

       ERROR MESSAGE FORMAT

       $p->show_open_entities([$bool])
           Describe open entities in error messages. Error messages always
           include the position of the most recently opened external entity.
           The default is false.

       $p->show_open_elements([$bool])
           Show the generic identifiers of open elements in error messages.
           The default is false.

       $p->show_error_numbers([$bool])
           Show message numbers in error messages.

       GENERATED EVENTS

       $p->output_comment_decls([$bool])
           Generate "comment_decl" events. The default is false.

       $p->output_marked_sections([$bool])
           Generate marked section events ("marked_section_start",
           "marked_section_end", "ignored_chars"). The default is false.

       $p->output_general_entities([$bool])
           Generate "general_entity" events. The default is false.

       IO SETTINGS

       $p->map_catalog_document([$bool])
           Treat the arguments to "parse" as catalog files rather than as the
           document entity. The document entity is then specified by the
           first DOCUMENT entry in the catalog files. The default is false.

       $p->restrict_file_reading([$bool])
           Restrict file reading to the specified directories (see the
           "search_dirs" method and the "SGML_SEARCH_PATH" environment
           variable). You should turn this option on and configure the search
           paths accordingly if you intend to process untrusted resources. The
           default is false.

       $p->catalogs([@catalogs])
           Map public identifiers and entity names to system identifiers using
           the specified catalog entry files. Multiple catalogs are allowed.
           If there is a catalog entry file called "catalog" in the same place
           as the document entity, it will be searched for immediately after
           those specified.

       $p->search_dirs([@search_dirs])
           Search the specified directories for files specified in system
           identifiers. Multiple directories are allowed. See the description
           of the osfile storage manager in the OpenSP documentation for more
           information about file searching.

       $p->pass_file_descriptor([$bool])
           Instruct "parse_string" to pass the input data down to the guts of
           OpenSP using the "OSFD" storage manager (if true) or the "OSFILE"
           storage manager (if false). This amounts to the difference between
           passing a file descriptor and a (temporary) file name.

           The default is true except on platforms, such as Win32, which are
           known not to support passing file descriptors around in this
           manner. On platforms which support it you can call this method with
           a false parameter to force use of temporary file names instead.

           In general, this will do the right thing on its own so it's best to
           consider this an internal method. If your platform is such that you
           have to force use of the OSFILE storage manager, please report it
           as a bug and include the values of $^O, $Config{archname}, and a
           description of the platform (e.g. "Windows Vista Service Pack 42").

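       The advice on untrusted resources given for "restrict_file_reading"
       could be wrapped up roughly as follows. This is a sketch only: the
       helper name, the RecordingParser stub and the directory paths are all
       hypothetical, and the stub exists just to let the example run without
       OpenSP installed.

```perl
use strict;
use warnings;

# Hypothetical helper: confine a parser to a fixed set of directories
# before it is handed untrusted input.
sub configure_for_untrusted_input {
    my ($parser, @dirs) = @_;
    $parser->restrict_file_reading(1);
    $parser->search_dirs(@dirs);
    return $parser;
}

# Hypothetical recording stub; it stores whatever each method receives.
package RecordingParser;

sub new                   { return bless {}, shift }
sub restrict_file_reading { my $s = shift; $s->{restrict} = shift; return }
sub search_dirs           { my $s = shift; $s->{dirs} = [@_]; return }

package main;

my $parser = configure_for_untrusted_input(
    RecordingParser->new,
    '/usr/share/sgml',    # example paths only, adjust for your system
    '/srv/app/dtd',
);

print "restricted: $parser->{restrict}\n";
print "dirs: @{ $parser->{dirs} }\n";
```

       With the real module, the same helper would be given the object
       returned by SGML::Parser::OpenSP->new.
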
       PROCESSING OPTIONS

       $p->include_params([@include_params])
           For each name in @include_params pretend that

             <!ENTITY % name "INCLUDE">

           occurs at the start of the document type declaration subset in the
           SGML document entity. Since repeated definitions of an entity are
           ignored, this definition will take precedence over any other
           definitions of this entity in the document type declaration.
           Multiple names are allowed. If the SGML declaration replaces the
           reserved name INCLUDE then the new reserved name will be the
           replacement text of the entity. Typically the document type
           declaration will contain

             <!ENTITY % name "IGNORE">

           and will use %name; in the status keyword specification of a marked
           section declaration. In this case the effect of the option will be
           to cause the marked section not to be ignored.

       $p->active_links([@active_links])
           ???

       ENABLING WARNINGS

       Additional warnings can be enabled using

         $p->warnings([@warnings])

       The following values can be used to enable warnings:

       xml Warn about constructs that are not allowed by XML.

       mixed
           Warn about mixed content models that do not allow #pcdata anywhere.

       sgmldecl
           Warn about various dubious constructions in the SGML declaration.

       should
           Warn about various recommendations made in ISO 8879 that the
           document does not comply with. (Recommendations are expressed with
           ``should'', as distinct from requirements which are usually
           expressed with ``shall''.)

       default
           Warn about defaulted references.

       duplicate
           Warn about duplicate entity declarations.

       undefined
           Warn about undefined elements: elements used in the DTD but not
           defined.

       unclosed
           Warn about unclosed start and end-tags.

       empty
           Warn about empty start and end-tags.

       net Warn about net-enabling start-tags and null end-tags.

       min-tag
           Warn about minimized start and end-tags. Equivalent to the
           combination of the "unclosed", "empty" and "net" warnings.

       unused-map
           Warn about unused short reference maps: maps that are declared with
           a short reference mapping declaration but never used in a short
           reference use declaration in the DTD.

       unused-param
           Warn about parameter entities that are defined but not used in a
           DTD. Unused internal parameter entities whose text is "INCLUDE" or
           "IGNORE" won't get the warning.

       notation-sysid
           Warn about notations for which no system identifier could be
           generated.

       all Warn about conditions that should usually be avoided (in the
           opinion of the author). Equivalent to: "mixed", "should",
           "default", "undefined", "sgmldecl", "unused-map", "unused-param",
           "empty" and "unclosed".

       DISABLING WARNINGS

       A warning can be disabled by using its name prefixed with "no-". Thus
       calling warnings(qw(all no-duplicate)) will enable all warnings except
       those about duplicate entity declarations.

       The following values for "warnings()" disable errors:

       no-idref
           Do not give an error for an ID reference value which no element has
           as its ID. The effect will be as if each attribute declared as an
           ID reference value had been declared as a name.

       no-significant
           Do not give an error when a character that is not a significant
           character in the reference concrete syntax occurs in a literal in
           the SGML declaration. This may be useful in conjunction with
           certain buggy test suites.

       no-valid
           Do not require the document to be type-valid. This has the effect
           of changing the SGML declaration to specify "VALIDITY NOASSERT" and
           "IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has the
           effect of changing the SGML declaration to specify "VALIDITY TYPE"
           and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
           "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF"
           settings given in the SGML declaration will be used.

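       The "no-" prefix convention lends itself to building the argument list
       programmatically. The helper below is a hypothetical illustration and
       not part of the module; it only assembles the list that would be
       passed to warnings().

```perl
use strict;
use warnings;

# Hypothetical helper: combine warnings to enable with warnings to
# disable into one argument list for $p->warnings(...).
sub warning_args {
    my ($enable, $disable) = @_;
    return (@$enable, map { "no-$_" } @$disable);
}

my @args = warning_args([qw(all)], [qw(duplicate idref)]);
print "@args\n";

# With the real module loaded, the list would be passed on as
#   $p->warnings(@args);
```

       Keeping the enabled and disabled names in separate lists makes it
       harder to forget the prefix when the set of warnings is configurable.
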
       XML WARNINGS

       The following warnings are turned on for the "xml" warning described
       above:

       inclusion
           Warn about inclusions in element type declarations.

       exclusion
           Warn about exclusions in element type declarations.

       rcdata-content
           Warn about RCDATA declared content in element type declarations.

       cdata-content
           Warn about CDATA declared content in element type declarations.

       ps-comment
           Warn about comments in parameter separators.

       attlist-group-decl
           Warn about name groups in attribute declarations.

       element-group-decl
           Warn about name groups in element type declarations.

       pi-entity
           Warn about PI entities.

       internal-sdata-entity
           Warn about internal SDATA entities.

       internal-cdata-entity
           Warn about internal CDATA entities.

       external-sdata-entity
           Warn about external SDATA entities.

       external-cdata-entity
           Warn about external CDATA entities.

       bracket-entity
           Warn about bracketed text entities.

       data-atts
           Warn about attribute definition list declarations for notations.

       missing-system-id
           Warn about external identifiers without system identifiers.

       conref
           Warn about content reference attributes.

       current
           Warn about current attributes.

       nutoken-decl-value
           Warn about attributes with a declared value of NUTOKEN or NUTOKENS.

       number-decl-value
           Warn about attributes with a declared value of NUMBER or NUMBERS.

       name-decl-value
           Warn about attributes with a declared value of NAME or NAMES.

       named-char-ref
           Warn about named character references.

       refc
           Warn about omitted refc delimiters.

       temp-ms
           Warn about TEMP marked sections.

       rcdata-ms
           Warn about RCDATA marked sections.

       instance-include-ms
           Warn about INCLUDE marked sections in the document instance.

       instance-ignore-ms
           Warn about IGNORE marked sections in the document instance.

       and-group
           Warn about AND connectors in model groups.

       rank
           Warn about ranked elements.

       empty-comment-decl
           Warn about empty comment declarations.

       att-value-not-literal
           Warn about attribute values which are not literals.

       missing-att-name
           Warn about omitted attribute names in start tags.

       comment-decl-s
           Warn about spaces before the MDC in comment declarations.

       comment-decl-multiple
           Warn about comment declarations containing multiple comments.

       missing-status-keyword
           Warn about marked sections without a status keyword.

       multiple-status-keyword
           Warn about marked sections with multiple status keywords.

       instance-param-entity
           Warn about parameter entities in the document instance.

       min-param
           Warn about minimization parameters in element type declarations.

       mixed-content-xml
           Warn about cases of mixed content which are not allowed in XML.

       name-group-not-or
           Warn about name groups with a connector different from OR.

       pi-missing-name
           Warn about processing instructions which don't start with a name.

       instance-status-keyword-s
           Warn about spaces between DSO and status keyword in marked
           sections.

       external-data-entity-ref
           Warn about references to external data entities in the content.

       att-value-external-entity-ref
           Warn about references to external data entities in attribute
           values.

       data-delim
           Warn about occurrences of `<' and `&' as data.

       explicit-sgml-decl
           Warn about an explicit SGML declaration.

       internal-subset-ms
           Warn about marked sections in the internal subset.

       default-entity
           Warn about a default entity declaration.

       non-sgml-char-ref
           Warn about numeric character references to non-SGML characters.

       internal-subset-ps-param-entity
           Warn about parameter entity references in parameter separators in
           the internal subset.

       internal-subset-ts-param-entity
           Warn about parameter entity references in token separators in the
           internal subset.

       internal-subset-literal-param-entity
           Warn about parameter entity references in parameter literals in the
           internal subset.

PROCESSING FILES

       To start processing a document and receive events, call the "parse"
       method. It takes one argument specifying the path to a file (not a
       file handle). You must set an event handler using the "handler" method
       before calling it. The return value of "parse" is currently undefined.

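       The required order of calls (set a handler, then parse) might look
       like the following sketch. CountingHandler and the file name are
       hypothetical, and the parser is only touched when both the module and
       the input file are available, so the example also runs where OpenSP is
       not installed.

```perl
use strict;
use warnings;

# Hypothetical handler that counts start tags.
package CountingHandler;

sub new           { return bless { elements => 0 }, shift }
sub start_element { $_[0]{elements}++; return }

package main;

my $file    = 'example.xhtml';    # assumed input file, may not exist
my $handler = CountingHandler->new;

# Only use the parser when the module and the input are both present.
if (-e $file && eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = SGML::Parser::OpenSP->new;
    $p->handler($handler);        # must be set before parse() is called
    $p->parse($file);             # return value is undefined, so ignore it
}

print "start tags seen: $handler->{elements}\n";
```

       Because the return value of "parse" carries no information, results
       are collected in the handler object and read back afterwards.
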

EVENT HANDLERS

       In order to receive data from the parser you need to write an event
       handler. For example,

         package ExampleHandler;

         sub new { bless {}, shift }

         sub start_element
         {
             my ($self, $elem) = @_;
             printf "  * %s\n", $elem->{Name};
         }

       This handler prints each element name as it is found in the document.
       For a typical XHTML document this might result in something like

         * html
         * head
         * title
         * body
         * p
         * ...

       The events closely match those in the generic interface to OpenSP; see
       <http://openjade.sf.net/doc/generic.htm> for more information.

       Event names have been changed to lowercase with underscores separating
       words, and property names are capitalized. Arrays are represented as
       Perl array references. "Position" information is not passed to the
       handler but is made available through the "get_location" method, which
       can be called from event handlers. Some redundant information has also
       been stripped, and the generic identifier of an element is stored in
       the "Name" hash entry.

       For example, for an EndElementEvent the "end_element" handler gets
       called with a hash reference

         {
           Name => 'gi'
         }

       The following events are defined:

         * appinfo
         * processing_instruction
         * start_element
         * end_element
         * data
         * sdata
         * external_data_entity_ref
         * subdoc_entity_ref
         * start_dtd
         * end_dtd
         * end_prolog
         * general_entity       # set $p->output_general_entities(1)
         * comment_decl         # set $p->output_comment_decls(1)
         * marked_section_start # set $p->output_marked_sections(1)
         * marked_section_end   # set $p->output_marked_sections(1)
         * ignored_chars        # set $p->output_marked_sections(1)
         * error
         * open_entity_change

       If the documentation of the generic interface to OpenSP states that
       certain data is not valid, it will not be available through this
       interface (i.e., the respective key does not exist in the hash ref).

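       A handler will often pair "start_element" with "end_element". The
       sketch below builds an indented outline using only the "Name" entry
       described above. OutlineHandler is a hypothetical name, and the events
       are fed in by hand here in place of a real parse.

```perl
use strict;
use warnings;

# Hypothetical handler: collects an indented outline of the element tree.
package OutlineHandler;

sub new { return bless { depth => 0, lines => [] }, shift }

sub start_element {
    my ($self, $elem) = @_;
    push @{ $self->{lines} }, ('  ' x $self->{depth}) . $elem->{Name};
    $self->{depth}++;
    return;
}

sub end_element {
    my ($self, $elem) = @_;
    $self->{depth}-- if $self->{depth} > 0;
    return;
}

package main;

# Simulate the events a parser would deliver for a tiny document.
my $h = OutlineHandler->new;
$h->start_element({ Name => 'html' });
$h->start_element({ Name => 'head' });
$h->end_element({ Name => 'head' });
$h->start_element({ Name => 'body' });
$h->end_element({ Name => 'body' });
$h->end_element({ Name => 'html' });

print "$_\n" for @{ $h->{lines} };
```

       Real events carry more keys than "Name"; the handler simply ignores
       the ones it does not need.
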

POSITIONING INFORMATION

       Event handlers can call the "get_location" method on the parser object
       to retrieve positioning information. The method returns a hash
       reference with the following properties:

         LineNumber   => ..., # line number
         ColumnNumber => ..., # column number
         ByteOffset   => ..., # number of preceding bytes
         EntityOffset => ..., # number of preceding bit combinations
         EntityName   => ..., # name of the external entity
         FileName     => ..., # name of the file

       Any of these can be "undef" or an empty string.

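       Since any property may be "undef" or empty, a handler should guard
       against both before printing. A minimal sketch follows, assuming the
       handler stores the parser in a "parser" hash entry. LocatingHandler
       and FixedLocationParser are hypothetical, and the stub returns a fixed
       location so that the example runs without OpenSP.

```perl
use strict;
use warnings;

# Hypothetical handler: records the position at which each element starts.
package LocatingHandler;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, seen => [] }, $class;
}

# Substitute '?' for properties that are undef or an empty string.
sub _or_unknown {
    my ($value) = @_;
    return defined($value) && length($value) ? $value : '?';
}

sub start_element {
    my ($self, $elem) = @_;
    my $loc = $self->{parser}->get_location;
    push @{ $self->{seen} }, sprintf(
        '%s at line %s, column %s',
        $elem->{Name},
        _or_unknown($loc->{LineNumber}),
        _or_unknown($loc->{ColumnNumber}),
    );
    return;
}

# Hypothetical stub standing in for the parser object.
package FixedLocationParser;

sub new { return bless {}, shift }

sub get_location {
    return { LineNumber => 3, ColumnNumber => 7, FileName => '' };
}

package main;

my $h = LocatingHandler->new(FixedLocationParser->new);
$h->start_element({ Name => 'p' });

print "$_\n" for @{ $h->{seen} };
```
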

POST-PROCESSING ERROR MESSAGES

       OpenSP returns error messages in the form of a string rather than as
       individual components such as line numbers or message text. The
       "split_message" method on the parser object can be used to post-process
       these error message strings as reliably as possible. It can be called,
       for example, from an error event handler if the parser object is
       accessible:

         sub error
         {
           my $self = shift;
           my $erro = shift;
           my $mess = $self->{parser}->split_message($erro);
         }

       See the documentation of "split_message" in the
       SGML::Parser::OpenSP::Tools documentation.

UNICODE SUPPORT

       All strings returned from event handlers and helper routines are UTF-8
       encoded with the UTF-8 flag turned on. Helper functions like
       "split_message" expect (but don't check) that string arguments are
       UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper
       functions is undefined when you pass unexpected input, so doing so
       should be avoided.

       "parse" has limited support for binary input, but the binary input must
       be compatible with OpenSP's generic interface requirements, and you
       must specify the encoding through means available to OpenSP to enable
       it to properly decode the binary input. Any encoding metadata about
       such binary input that is specific to Perl (such as encoding
       disciplines for file handles when you pass a file descriptor) will be
       ignored. For more specific information refer to the OpenSP manual.

       * <http://openjade.sourceforge.net/doc/sysid.htm>
       * <http://openjade.sourceforge.net/doc/charset.htm>

ENVIRONMENT VARIABLES

       OpenSP supports a number of environment variables that control
       specific processing aspects, such as "SGML_SEARCH_PATH" or
       "SP_CHARSET_FIXED". Portable applications need to ensure that these are
       set before the OpenSP library is loaded into memory, which happens when
       the XS code is loaded. This means you need to set them in a "BEGIN"
       block:

         BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
         use SGML::Parser::OpenSP;
         # ...

       Otherwise changes to the environment might not propagate to OpenSP.
       This applies specifically to Win32 systems.

       SGML_SEARCH_PATH
           See <http://openjade.sourceforge.net/doc/sysid.htm>.

       SP_HTTP_USER_AGENT
           The "User-Agent" header for HTTP requests.

       SP_HTTP_ACCEPT
           The "Accept" header for HTTP requests.

       SP_MESSAGE_FORMAT
           Enable run-time selection of the message format. The value is one
           of "XML", "NONE", "TRADITIONAL". Whether this has an effect depends
           on a compile-time setting which might not be enabled in your OpenSP
           build. This module assumes that no such support was compiled in.

       SGML_CATALOG_FILES
       SP_USE_DOCUMENT_CATALOG
           See <http://openjade.sourceforge.net/doc/catalog.htm>.

       SP_SYSTEM_CHARSET
       SP_CHARSET_FIXED
       SP_BCTF
       SP_ENCODING
           See <http://openjade.sourceforge.net/doc/charset.htm>.

       Note that you can use the "search_dirs" method instead of
       "SGML_SEARCH_PATH", the "catalogs" method instead of
       "SGML_CATALOG_FILES", and attributes on storage object specifications
       instead of "SP_BCTF" and "SP_ENCODING". For example, if
       "SP_CHARSET_FIXED" is set to 1 you can use

         $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

       to process "example.xhtml" using the "UTF-8" character encoding.

KNOWN ISSUES

       OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
       "SP_WIDE_SYSTEM" undefined; otherwise this module will break at runtime
       or fail to compile.

BUG REPORTS

       Please report bugs in this module via
       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

       Please report bugs in OpenSP via
       <http://sf.net/tracker/?group_id=2115&atid=102115>

       Please send comments and questions to the spo-devel mailing list, see
       <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO

       * <http://openjade.sf.net/doc/generic.htm>
       * <http://openjade.sf.net/>
       * <http://sf.net/projects/spo/>

AUTHOR / COPYRIGHT / LICENSE

         Terje Bless <link@cpan.org> wrote version 0.01.
         Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02.

         Copyright (c) 2006 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.8.8                       2006-08-30           SGML::Parser::OpenSP(3)