SGML::Parser::OpenSP(3)   User Contributed Perl Documentation   SGML::Parser::OpenSP(3)

NAME
SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS
    use SGML::Parser::OpenSP;

    my $p = SGML::Parser::OpenSP->new;
    my $h = ExampleHandler->new;

    $p->catalogs(qw(xhtml.soc));
    $p->warnings(qw(xml valid));
    $p->handler($h);

    $p->parse("example.xhtml");

DESCRIPTION
This module provides an interface to the OpenSP SGML parser. Both OpenSP
and this module are event based: as the parser recognizes parts of the
document (say, the start or end of an element), any handlers registered
for that type of event are called with suitable parameters.

METHODS
new()
    Returns a new SGML::Parser::OpenSP object. Takes no arguments.

parse($file)
    Parses the file passed as an argument. Note that this must be a
    filename and not a filehandle. See "PROCESSING FILES" below for
    details.

parse_string($data)
    Parses the data passed as an argument. See "PROCESSING FILES" below
    for details.

halt()
    Halts processing before the entire document has been parsed. Takes
    no arguments.

split_message()
    Splits OpenSP's error messages into their component parts. See
    "POST-PROCESSING ERROR MESSAGES" below for details.

get_location()
    Returns the parser's current position in the document. See
    "POSITIONING INFORMATION" below for details.

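As an illustration of how "halt" fits together with the other methods,
the following is a minimal sketch of a handler that stops the parser once
the first title element has been read. The class name FirstTitleHandler
and the wrapper first_title are hypothetical, and the sketch assumes that
character data arrives in a "data" event under the "Data" key, following
the capitalization rule described under "EVENT HANDLERS". Element names
are compared case-insensitively because SGML documents may report them in
upper case.

```perl
use strict;
use warnings;

# Hypothetical handler: keeps the parser it was given so that it can call
# halt() from inside an event callback.
package FirstTitleHandler;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, in_title => 0, title => '' }, $class;
}

sub start_element {
    my ($self, $elem) = @_;
    $self->{in_title} = 1 if lc($elem->{Name}) eq 'title';
}

sub data {
    my ($self, $event) = @_;
    # Assumption: the text of a data event is stored under "Data".
    $self->{title} .= $event->{Data} if $self->{in_title};
}

sub end_element {
    my ($self, $elem) = @_;
    return unless lc($elem->{Name}) eq 'title';
    $self->{in_title} = 0;
    $self->{parser}->halt;    # nothing further is needed from the document
}

package main;

# Hypothetical wrapper; needs SGML::Parser::OpenSP to be installed.
sub first_title {
    my ($file, @catalogs) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = FirstTitleHandler->new($p);
    $p->catalogs(@catalogs);
    $p->handler($h);
    $p->parse($file);
    return $h->{title};
}
```

A call such as first_title("example.xhtml", "xhtml.soc") would then return
the collected title text, provided the assumptions above hold.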
CONFIGURATION
BOOLEAN OPTIONS
$p->handler([$handler])
    Report events to the blessed reference $handler.

ERROR MESSAGE FORMAT
$p->show_open_entities([$bool])
    Describe open entities in error messages. Error messages always
    include the position of the most recently opened external entity.
    The default is false.

$p->show_open_elements([$bool])
    Show the generic identifiers of open elements in error messages.
    The default is false.

$p->show_error_numbers([$bool])
    Show message numbers in error messages.

GENERATED EVENTS
$p->output_comment_decls([$bool])
    Generate "comment_decl" events. The default is false.

$p->output_marked_sections([$bool])
    Generate marked section events ("marked_section_start",
    "marked_section_end", "ignored_chars"). The default is false.

$p->output_general_entities([$bool])
    Generate "general_entity" events. The default is false.

IO SETTINGS
$p->map_catalog_document([$bool])
    "parse" arguments specify catalog files rather than the document
    entity. The document entity is specified by the first DOCUMENT
    entry in the catalog files. The default is false.

$p->restrict_file_reading([$bool])
    Restrict file reading to the specified directories (see the
    "search_dirs" method and the "SGML_SEARCH_PATH" environment
    variable). You should turn this option on and configure the search
    paths accordingly if you intend to process untrusted resources. The
    default is false.

$p->catalogs([@catalogs])
    Map public identifiers and entity names to system identifiers using
    the specified catalog entry files. Multiple catalogs are allowed.
    If there is a catalog entry file called "catalog" in the same place
    as the document entity, it will be searched for immediately after
    those specified.

$p->search_dirs([@search_dirs])
    Search the specified directories for files specified in system
    identifiers. Multiple values are allowed. See the description of
    the osfile storage manager in the OpenSP documentation for more
    information about file searching.

$p->pass_file_descriptor([$bool])
    Instruct "parse_string" to pass the input data down to the guts of
    OpenSP using the "OSFD" storage manager (if true) or the "OSFILE"
    storage manager (if false). This amounts to the difference between
    passing a file descriptor and a (temporary) file name.

    The default is true except on platforms, such as Win32, which are
    known not to support passing file descriptors around in this
    manner. On platforms which support it you can call this method with
    a false parameter to force use of temporary file names instead.

    In general, this will do the right thing on its own, so it is best
    to consider this an internal method. If your platform is such that
    you have to force use of the OSFILE storage manager, please report
    it as a bug and include the values of $^O, $Config{archname}, and a
    description of the platform (e.g. "Windows Vista Service Pack 42").

PROCESSING OPTIONS
$p->include_params([@include_params])
    For each name in @include_params pretend that

        <!ENTITY % name "INCLUDE">

    occurs at the start of the document type declaration subset in the
    SGML document entity. Since repeated definitions of an entity are
    ignored, this definition will take precedence over any other
    definitions of this entity in the document type declaration.
    Multiple names are allowed. If the SGML declaration replaces the
    reserved name INCLUDE then the new reserved name will be the
    replacement text of the entity. Typically the document type
    declaration will contain

        <!ENTITY % name "IGNORE">

    and will use %name; in the status keyword specification of a marked
    section declaration. In this case the effect of the option will be
    to cause the marked section not to be ignored.

$p->active_links([@active_links])
    ???

ENABLING WARNINGS
Additional warnings can be enabled using

    $p->warnings([@warnings])

The following values can be used to enable warnings:

xml
    Warn about constructs that are not allowed by XML.

mixed
    Warn about mixed content models that do not allow #pcdata anywhere.

sgmldecl
    Warn about various dubious constructions in the SGML declaration.

should
    Warn about various recommendations made in ISO 8879 that the
    document does not comply with. (Recommendations are expressed with
    "should", as distinct from requirements, which are usually
    expressed with "shall".)

default
    Warn about defaulted references.

duplicate
    Warn about duplicate entity declarations.

undefined
    Warn about undefined elements: elements used in the DTD but not
    defined.

unclosed
    Warn about unclosed start and end-tags.

empty
    Warn about empty start and end-tags.

net
    Warn about net-enabling start-tags and null end-tags.

min-tag
    Warn about minimized start and end-tags. Equivalent to the
    combination of the unclosed, empty and net warnings.

unused-map
    Warn about unused short reference maps: maps that are declared with
    a short reference mapping declaration but never used in a short
    reference use declaration in the DTD.

unused-param
    Warn about parameter entities that are defined but not used in a
    DTD. Unused internal parameter entities whose text is "INCLUDE" or
    "IGNORE" won't get the warning.

notation-sysid
    Warn about notations for which no system identifier could be
    generated.

all
    Warn about conditions that should usually be avoided (in the
    opinion of the author). Equivalent to: "mixed", "should",
    "default", "undefined", "sgmldecl", "unused-map", "unused-param",
    "empty" and "unclosed".

DISABLING WARNINGS
A warning can be disabled by using its name prefixed with "no-". Thus
calling warnings(qw(all no-duplicate)) will enable all warnings except
those about duplicate entity declarations.

The following values for "warnings()" disable errors:

no-idref
    Do not give an error for an ID reference value which no element has
    as its ID. The effect will be as if each attribute declared as an
    ID reference value had been declared as a name.

no-significant
    Do not give an error when a character that is not a significant
    character in the reference concrete syntax occurs in a literal in
    the SGML declaration. This may be useful in conjunction with
    certain buggy test suites.

no-valid
    Do not require the document to be type-valid. This has the effect
    of changing the SGML declaration to specify "VALIDITY NOASSERT" and
    "IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has the
    effect of changing the SGML declaration to specify "VALIDITY TYPE"
    and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
    "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF"
    specified in the SGML declaration will be used.

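The configuration methods above can be combined in one place. The
following is a minimal sketch, not a recommended default: the helper name
configure_for_validation is hypothetical, and the particular choice of
options is only an example. It uses nothing beyond the setter methods and
warning names documented in this section.

```perl
use strict;
use warnings;

# Hypothetical helper: applies an example configuration to a parser
# object using the documented setter methods.
sub configure_for_validation {
    my ($p, @catalogs) = @_;

    $p->catalogs(@catalogs);

    # Enable the "xml" and "all" groups, require type-validity, and
    # switch off warnings about duplicate entity declarations.
    $p->warnings(qw(xml valid all no-duplicate));

    # Make error messages more informative.
    $p->show_open_elements(1);
    $p->show_error_numbers(1);

    return $p;
}
```

With the module installed, this could be called as
configure_for_validation(SGML::Parser::OpenSP->new, "xhtml.soc") before
registering a handler and calling "parse".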
XML WARNINGS
The following warnings are turned on for the "xml" warning described
above:

inclusion
    Warn about inclusions in element type declarations.

exclusion
    Warn about exclusions in element type declarations.

rcdata-content
    Warn about RCDATA declared content in element type declarations.

cdata-content
    Warn about CDATA declared content in element type declarations.

ps-comment
    Warn about comments in parameter separators.

attlist-group-decl
    Warn about name groups in attribute declarations.

element-group-decl
    Warn about name groups in element type declarations.

pi-entity
    Warn about PI entities.

internal-sdata-entity
    Warn about internal SDATA entities.

internal-cdata-entity
    Warn about internal CDATA entities.

external-sdata-entity
    Warn about external SDATA entities.

external-cdata-entity
    Warn about external CDATA entities.

bracket-entity
    Warn about bracketed text entities.

data-atts
    Warn about attribute definition list declarations for notations.

missing-system-id
    Warn about external identifiers without system identifiers.

conref
    Warn about content reference attributes.

current
    Warn about current attributes.

nutoken-decl-value
    Warn about attributes with a declared value of NUTOKEN or NUTOKENS.

number-decl-value
    Warn about attributes with a declared value of NUMBER or NUMBERS.

name-decl-value
    Warn about attributes with a declared value of NAME or NAMES.

named-char-ref
    Warn about named character references.

refc
    Warn about omitted refc delimiters.

temp-ms
    Warn about TEMP marked sections.

rcdata-ms
    Warn about RCDATA marked sections.

instance-include-ms
    Warn about INCLUDE marked sections in the document instance.

instance-ignore-ms
    Warn about IGNORE marked sections in the document instance.

and-group
    Warn about AND connectors in model groups.

rank
    Warn about ranked elements.

empty-comment-decl
    Warn about empty comment declarations.

att-value-not-literal
    Warn about attribute values which are not literals.

missing-att-name
    Warn about omitted attribute names in start tags.

comment-decl-s
    Warn about spaces before the MDC in comment declarations.

comment-decl-multiple
    Warn about comment declarations containing multiple comments.

missing-status-keyword
    Warn about marked sections without a status keyword.

multiple-status-keyword
    Warn about marked sections with multiple status keywords.

instance-param-entity
    Warn about parameter entities in the document instance.

min-param
    Warn about minimization parameters in element type declarations.

mixed-content-xml
    Warn about cases of mixed content which are not allowed in XML.

name-group-not-or
    Warn about name groups with a connector different from OR.

pi-missing-name
    Warn about processing instructions which don't start with a name.

instance-status-keyword-s
    Warn about spaces between DSO and status keyword in marked
    sections.

external-data-entity-ref
    Warn about references to external data entities in the content.

att-value-external-entity-ref
    Warn about references to external data entities in attribute
    values.

data-delim
    Warn about occurrences of "<" and "&" as data.

explicit-sgml-decl
    Warn about an explicit SGML declaration.

internal-subset-ms
    Warn about marked sections in the internal subset.

default-entity
    Warn about a default entity declaration.

non-sgml-char-ref
    Warn about numeric character references to non-SGML characters.

internal-subset-ps-param-entity
    Warn about parameter entity references in parameter separators in
    the internal subset.

internal-subset-ts-param-entity
    Warn about parameter entity references in token separators in the
    internal subset.

internal-subset-literal-param-entity
    Warn about parameter entity references in parameter literals in the
    internal subset.

PROCESSING FILES
To start processing a document and receive events, call the "parse"
method. It takes one argument specifying the path to a file (not a file
handle). You must set an event handler using the "handler" method before
calling "parse". The return value of "parse" is currently undefined.

EVENT HANDLERS
To receive data from the parser you need to write an event handler. For
example,

    package ExampleHandler;

    # Constructor: the handler needs no state of its own.
    sub new { bless {}, shift }

    # Called for every start-tag; $elem is a hash reference.
    sub start_element
    {
        my ($self, $elem) = @_;
        printf "  * %s\n", $elem->{Name};
    }

This handler prints each element name as it is found in the document.
For a typical XHTML document this might result in something like

    * html
    * head
    * title
    * body
    * p
    * ...

The events closely match those in the generic interface to OpenSP; see
<http://openjade.sf.net/doc/generic.htm> for more information.

Event names are written in lowercase with underscores separating words,
and property names are capitalized. Arrays are represented as Perl array
references. "Position" information is not passed to the handler but is
made available through the "get_location" method, which can be called
from event handlers. Some redundant information has also been stripped,
and the generic identifier of an element is stored in the "Name" hash
entry.

For example, for an EndElementEvent the "end_element" handler gets
called with a hash reference

    {
        Name => 'gi'
    }

The following events are defined:

    * appinfo
    * processing_instruction
    * start_element
    * end_element
    * data
    * sdata
    * external_data_entity_ref
    * subdoc_entity_ref
    * start_dtd
    * end_dtd
    * end_prolog
    * general_entity        # set $p->output_general_entities(1)
    * comment_decl          # set $p->output_comment_decls(1)
    * marked_section_start  # set $p->output_marked_sections(1)
    * marked_section_end    # set $p->output_marked_sections(1)
    * ignored_chars         # set $p->output_marked_sections(1)
    * error
    * open_entity_change

If the documentation of the generic interface to OpenSP states that
certain data is not valid, it will not be available through this
interface (i.e., the respective key does not exist in the hash ref).

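A handler only needs to define methods for the events it is interested
in. As a minimal sketch, the hypothetical OutlineHandler below pairs
"start_element" with "end_element" to record an indented outline of the
element structure, and stores "error" events for later inspection. It
relies only on the "Name" hash entry described above.

```perl
use strict;
use warnings;

# Hypothetical handler that records the element tree as indented lines.
package OutlineHandler;

sub new {
    my ($class) = @_;
    return bless { depth => 0, lines => [], errors => [] }, $class;
}

sub start_element {
    my ($self, $elem) = @_;
    # Indent by two spaces per open ancestor element.
    push @{ $self->{lines} }, ('  ' x $self->{depth}) . $elem->{Name};
    $self->{depth}++;
}

sub end_element {
    my ($self, $elem) = @_;
    $self->{depth}-- if $self->{depth} > 0;
}

sub error {
    my ($self, $event) = @_;
    # Keep the event untouched; it can be examined after parsing.
    push @{ $self->{errors} }, $event;
}

# Returns the outline collected so far as a single string.
sub outline {
    my ($self) = @_;
    return join "\n", @{ $self->{lines} };
}
```

After $p->handler($h) and $p->parse(...), calling $h->outline would
return the recorded structure.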
POSITIONING INFORMATION
Event handlers can call the "get_location" method on the parser object
to retrieve positioning information. The method returns a hash reference
with the following properties:

    LineNumber   => ..., # line number
    ColumnNumber => ..., # column number
    ByteOffset   => ..., # number of preceding bytes
    EntityOffset => ..., # number of preceding bit combinations
    EntityName   => ..., # name of the external entity
    FileName     => ..., # name of the file

Any of these values can be "undef" or an empty string.

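Because any of the location values may be missing, code that formats them
should substitute a placeholder. The following sketch shows one way to do
that from inside a handler. The class name LocatingHandler and the helper
_or_unknown are hypothetical, and the handler assumes it was given the
parser object when it was constructed.

```perl
use strict;
use warnings;

# Hypothetical handler that tags each element with its source position.
package LocatingHandler;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, seen => [] }, $class;
}

# get_location values may be undef or empty, so fall back to "?".
sub _or_unknown {
    my ($value) = @_;
    return (defined($value) && length($value)) ? $value : '?';
}

sub start_element {
    my ($self, $elem) = @_;
    my $loc = $self->{parser}->get_location;
    push @{ $self->{seen} }, sprintf('%s:%s:%s %s',
        _or_unknown($loc->{FileName}),
        _or_unknown($loc->{LineNumber}),
        _or_unknown($loc->{ColumnNumber}),
        $elem->{Name});
}
```

Each entry in $h->{seen} then has the form "file:line:column name", with
"?" standing in for any value that was not available.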
POST-PROCESSING ERROR MESSAGES
OpenSP returns error messages in the form of a string rather than as
individual components such as line numbers or message text. The
"split_message" method on the parser object can be used to post-process
these error message strings as reliably as possible. It can be used, for
example, from an error event handler, provided the parser object is
accessible there:

    sub error
    {
        my $self = shift;   # the handler object
        my $erro = shift;   # the error event
        # $self->{parser} must have been stored by the handler itself
        my $mess = $self->{parser}->split_message($erro);
    }

See the documentation of "split_message" in the
SGML::Parser::OpenSP::Tools documentation.

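One way to make the parser object accessible is to hand it to the
handler's constructor. The sketch below does this and weakens the stored
reference, on the assumption that the parser keeps a reference to its
handler and the two would otherwise form a reference cycle. The class
name MessageHandler is hypothetical, and the structure returned by
"split_message" is stored as-is rather than interpreted.

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical error-collecting handler with access to its parser.
package MessageHandler;

sub new {
    my ($class, $parser) = @_;
    my $self = bless { parser => $parser, messages => [] }, $class;
    # Weak reference: the caller must keep the parser object alive.
    Scalar::Util::weaken($self->{parser});
    return $self;
}

sub error {
    my ($self, $event) = @_;
    # Store whatever split_message returns for later inspection.
    push @{ $self->{messages} }, $self->{parser}->split_message($event);
}
```

The handler would be created with MessageHandler->new($p) and registered
with $p->handler(...) as usual. After parsing, $h->{messages} holds one
entry per reported error.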
UNICODE SUPPORT
All strings returned from event handlers and helper routines are UTF-8
encoded with the UTF-8 flag turned on. Helper functions like
"split_message" expect (but don't check) that string arguments are UTF-8
encoded and have the UTF-8 flag turned on. The behavior of helper
functions is undefined for unexpected input, so such input should be
avoided.

"parse" has limited support for binary input, but the binary input must
be compatible with OpenSP's generic interface requirements and you must
specify the encoding through means available to OpenSP to enable it to
properly decode the binary input. Any encoding metadata about such
binary input that is specific to Perl (such as encoding disciplines for
file handles when you pass a file descriptor) will be ignored. For more
specific information refer to the OpenSP manual.

·   <http://openjade.sourceforge.net/doc/sysid.htm>

·   <http://openjade.sourceforge.net/doc/charset.htm>

ENVIRONMENT VARIABLES
OpenSP supports a number of environment variables, such as
"SGML_SEARCH_PATH" or "SP_CHARSET_FIXED", that control specific aspects
of processing. Portable applications need to ensure that these are set
before the OpenSP library is loaded into memory, which happens when the
XS code is loaded. This means you need to set them in a "BEGIN" block
that precedes the "use" statement:

    BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
    use SGML::Parser::OpenSP;
    # ...

Otherwise changes to the environment might not propagate to OpenSP.
This applies specifically to Win32 systems.

SGML_SEARCH_PATH
    See <http://openjade.sourceforge.net/doc/sysid.htm>.

SP_HTTP_USER_AGENT
    The "User-Agent" header for HTTP requests.

SP_HTTP_ACCEPT
    The "Accept" header for HTTP requests.

SP_MESSAGE_FORMAT
    Enable run-time selection of the message format. The value is one
    of "XML", "NONE", "TRADITIONAL". Whether this has an effect depends
    on a compile-time setting which might not be enabled in your OpenSP
    build. This module assumes that no such support was compiled in.

SGML_CATALOG_FILES
SP_USE_DOCUMENT_CATALOG
    See <http://openjade.sourceforge.net/doc/catalog.htm>.

SP_SYSTEM_CHARSET
SP_CHARSET_FIXED
SP_BCTF
SP_ENCODING
    See <http://openjade.sourceforge.net/doc/charset.htm>.

Note that you can use the "search_dirs" method instead of
"SGML_SEARCH_PATH", the "catalogs" method instead of
"SGML_CATALOG_FILES", and attributes on storage object specifications
instead of "SP_BCTF" and "SP_ENCODING" respectively. For example, if
"SP_CHARSET_FIXED" is set to 1 you can use

    $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

to process "example.xhtml" using the "UTF-8" character encoding.

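If several files need the same storage object attributes, the system
identifier can be assembled by a small helper. The following is only a
sketch: the function name osfile_sysid is hypothetical, it produces the
same form as the example above, and it makes no attempt to handle
attribute values that contain a single quote.

```perl
use strict;
use warnings;

# Hypothetical helper: builds "<OSFILE name='value' ...>file" from a
# file name and a list of attribute name/value pairs.
sub osfile_sysid {
    my ($file, %attr) = @_;
    # Sort the names so that the result is deterministic.
    my $spec = join '', map { " $_='$attr{$_}'" } sort keys %attr;
    return "<OSFILE$spec>$file";
}
```

A call such as $p->parse(osfile_sysid("example.xhtml", encoding =>
"UTF-8")) passes the same string to "parse" as the literal example above.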
KNOWN ISSUES
OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
"SP_WIDE_SYSTEM" undefined; otherwise this module will break at runtime
or fail to compile.

BUG REPORTS
Please report bugs in this module via
<http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

Please report bugs in OpenSP via
<http://sf.net/tracker/?group_id=2115&atid=102115>

Please send comments and questions to the spo-devel mailing list, see
<http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO
·   <http://openjade.sf.net/doc/generic.htm>

·   <http://openjade.sf.net/>

·   <http://sf.net/projects/spo/>

AUTHORS
Terje Bless <link@cpan.org> wrote version 0.01.
Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02+.

COPYRIGHT AND LICENSE
Copyright (c) 2006-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
This module is licensed under the same terms as Perl itself.

perl v5.30.0                      2019-07-26           SGML::Parser::OpenSP(3)