SGML::Parser::OpenSP(3)  User Contributed Perl Documentation  SGML::Parser::OpenSP(3)

NAME
       SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS
         use SGML::Parser::OpenSP;

         my $p = SGML::Parser::OpenSP->new;
         my $h = ExampleHandler->new;

         $p->catalogs(qw(xhtml.soc));
         $p->warnings(qw(xml valid));
         $p->handler($h);

         $p->parse("example.xhtml");

DESCRIPTION
       This module provides an interface to the OpenSP SGML parser. OpenSP and
       this module are event based. As the parser recognizes parts of the
       document (say, the start or end of an element), any handlers registered
       for that type of event are called with suitable parameters.

METHODS
       new()
           Returns a new SGML::Parser::OpenSP object. Takes no arguments.

       parse($file)
           Parses the file passed as an argument. Note that this must be a
           filename and not a filehandle. See "PROCESSING FILES" below for
           details.

       parse_string($data)
           Parses the data passed as an argument. See "PROCESSING FILES" below
           for details.

       halt()
           Halts processing before parsing the entire document. Takes no
           arguments.

       split_message()
           Splits OpenSP's error messages into their component parts. See
           "POST-PROCESSING ERROR MESSAGES" below for details.

       get_location()
           Returns positioning information as a hash reference. See
           "POSITIONING INFORMATION" below for details.

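       As a rough sketch of how "halt" might be used, the handler below asks
       the parser to stop after a fixed number of start tags. The class name
       FirstElements and its "parser", "limit" and "names" fields are
       hypothetical and exist only for this illustration; the inline document
       is likewise made up. The real parser is used only if the module can be
       loaded.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: records element names and asks the parser
    # to stop once 'limit' start tags have been seen.
    package FirstElements;

    sub new {
        my ($class, %args) = @_;
        return bless {
            parser => $args{parser},
            limit  => $args{limit},
            names  => [],
        }, $class;
    }

    sub start_element {
        my ($self, $elem) = @_;
        push @{ $self->{names} }, $elem->{Name};
        # halt() takes no arguments and stops processing early.
        $self->{parser}->halt if @{ $self->{names} } >= $self->{limit};
    }
}

# Only talk to the real parser when the module is installed.
if (eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = SGML::Parser::OpenSP->new;
    my $h = FirstElements->new(parser => $p, limit => 2);
    $p->handler($h);
    $p->parse_string(
        "<!DOCTYPE doc [<!ELEMENT doc - - (a+)><!ELEMENT a - - (#PCDATA)>]>"
      . "<doc><a>1</a><a>2</a><a>3</a></doc>");
    print "$_\n" for @{ $h->{names} };
}
```

       The handler needs a reference to the parser because "halt" is a
       method of the parser object, not of the handler.
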
CONFIGURATION
   BOOLEAN OPTIONS
       $p->handler([$handler])
           Report events to the blessed reference $handler.

   ERROR MESSAGE FORMAT
       $p->show_open_entities([$bool])
           Describe open entities in error messages. Error messages always
           include the position of the most recently opened external entity.
           The default is false.

       $p->show_open_elements([$bool])
           Show the generic identifiers of open elements in error messages.
           The default is false.

       $p->show_error_numbers([$bool])
           Show message numbers in error messages.

   GENERATED EVENTS
       $p->output_comment_decls([$bool])
           Generate "comment_decl" events. The default is false.

       $p->output_marked_sections([$bool])
           Generate marked section events ("marked_section_start",
           "marked_section_end", "ignored_chars"). The default is false.

       $p->output_general_entities([$bool])
           Generate "general_entity" events. The default is false.

   IO SETTINGS
       $p->map_catalog_document([$bool])
           Treat the arguments to "parse" as catalog files rather than as the
           document entity. The document entity is then specified by the
           first DOCUMENT entry in the catalog files. The default is false.

       $p->restrict_file_reading([$bool])
           Restrict file reading to the specified directories (see the
           "search_dirs" method and the "SGML_SEARCH_PATH" environment
           variable). You should turn this option on and configure the search
           paths accordingly if you intend to process untrusted resources.
           The default is false.

       $p->catalogs([@catalogs])
           Map public identifiers and entity names to system identifiers using
           the specified catalog entry files. Multiple catalogs are allowed.
           If there is a catalog entry file called "catalog" in the same place
           as the document entity, it is searched immediately after those
           specified.

       $p->search_dirs([@search_dirs])
           Search the specified directories for files specified in system
           identifiers. Multiple directories are allowed. See the description
           of the osfile storage manager in the OpenSP documentation for more
           information about file searching.

       $p->pass_file_descriptor([$bool])
           Instruct "parse_string" to pass the input data down to the guts of
           OpenSP using the "OSFD" storage manager (if true) or the "OSFILE"
           storage manager (if false). This amounts to the difference between
           passing a file descriptor and a (temporary) file name.

           The default is true except on platforms, such as Win32, which are
           known not to support passing file descriptors around in this
           manner. On platforms which support it you can call this method with
           a false parameter to force use of temporary file names instead.

           In general, this will do the right thing on its own so it's best to
           consider this an internal method. If your platform is such that you
           have to force use of the OSFILE storage manager, please report it
           as a bug and include the values of $^O, $Config{archname}, and a
           description of the platform (e.g. "Windows Vista Service Pack 42").

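       The advice about untrusted resources can be sketched as a small helper
       that switches on "restrict_file_reading" and sets the search
       directories and catalogs in one place. The helper name
       configure_for_untrusted and its option names are assumptions made for
       this example, and the directory shown is only a placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: confine file access to the given directories and
# register the catalogs that map identifiers to local files.
sub configure_for_untrusted {
    my ($parser, %opt) = @_;
    $parser->restrict_file_reading(1);
    $parser->search_dirs(@{ $opt{search_dirs} });
    $parser->catalogs(@{ $opt{catalogs} });
    return $parser;
}

# Only talk to the real parser when the module is installed.
if (eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = configure_for_untrusted(
        SGML::Parser::OpenSP->new,
        search_dirs => ['/usr/share/sgml'],   # placeholder path, adjust locally
        catalogs    => ['xhtml.soc'],         # catalog name from the synopsis
    );
}
```

       Keeping these three calls together makes it harder to enable the
       restriction while forgetting to configure the paths it depends on.
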
   PROCESSING OPTIONS
       $p->include_params([@include_params])
           For each name in @include_params pretend that

             <!ENTITY % name "INCLUDE">

           occurs at the start of the document type declaration subset in the
           SGML document entity. Since repeated definitions of an entity are
           ignored, this definition will take precedence over any other
           definitions of this entity in the document type declaration.
           Multiple names are allowed. If the SGML declaration replaces the
           reserved name INCLUDE then the new reserved name will be the
           replacement text of the entity. Typically the document type
           declaration will contain

             <!ENTITY % name "IGNORE">

           and will use %name; in the status keyword specification of a marked
           section declaration. In this case the effect of the option will be
           to cause the marked section not to be ignored.

       $p->active_links([@active_links])
           ???

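       To illustrate "include_params", the sketch below uses a made-up
       document whose DTD declares a parameter entity "draft" as IGNORE and
       uses it as the status keyword of a marked section. The TextCollector
       class is hypothetical, and the "Data" key of the "data" event is an
       assumption based on the naming rules for event properties rather than
       something stated in this section.

```perl
use strict;
use warnings;

# A document whose DTD marks the "draft" section as ignored by default.
my $doc = <<'SGML';
<!DOCTYPE doc [
<!ELEMENT doc - - (#PCDATA)>
<!ENTITY % draft "IGNORE">
]>
<doc>final text<![%draft;[ plus draft text]]></doc>
SGML

{
    # Hypothetical handler that concatenates character data.
    # Assumption: the data event carries its text under the "Data" key.
    package TextCollector;
    sub new  { bless { text => '' }, shift }
    sub data { my ($self, $d) = @_; $self->{text} .= $d->{Data}; return }
}

# Only talk to the real parser when the module is installed.
if (eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = SGML::Parser::OpenSP->new;
    my $h = TextCollector->new;
    $p->handler($h);
    # Pretend <!ENTITY % draft "INCLUDE"> was declared first.
    $p->include_params('draft');
    $p->parse_string($doc);
    print $h->{text}, "\n";
}
```

       Without the "include_params" call the marked section would keep its
       IGNORE status and its content would not be reported as data.
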
   ENABLING WARNINGS
       Additional warnings can be enabled using

         $p->warnings([@warnings])

       The following values can be used to enable warnings:

       xml
           Warn about constructs that are not allowed by XML.

       mixed
           Warn about mixed content models that do not allow #pcdata anywhere.

       sgmldecl
           Warn about various dubious constructions in the SGML declaration.

       should
           Warn about various recommendations made in ISO 8879 that the
           document does not comply with. (Recommendations are expressed with
           "should", as distinct from requirements which are usually
           expressed with "shall".)

       default
           Warn about defaulted references.

       duplicate
           Warn about duplicate entity declarations.

       undefined
           Warn about undefined elements: elements used in the DTD but not
           defined.

       unclosed
           Warn about unclosed start and end-tags.

       empty
           Warn about empty start and end-tags.

       net
           Warn about net-enabling start-tags and null end-tags.

       min-tag
           Warn about minimized start and end-tags. Equivalent to the
           combination of the unclosed, empty and net warnings.

       unused-map
           Warn about unused short reference maps: maps that are declared with
           a short reference mapping declaration but never used in a short
           reference use declaration in the DTD.

       unused-param
           Warn about parameter entities that are defined but not used in a
           DTD. Unused internal parameter entities whose text is "INCLUDE" or
           "IGNORE" won't get the warning.

       notation-sysid
           Warn about notations for which no system identifier could be
           generated.

       all
           Warn about conditions that should usually be avoided (in the
           opinion of the author). Equivalent to: "mixed", "should", "default",
           "undefined", "sgmldecl", "unused-map", "unused-param", "empty" and
           "unclosed".

   DISABLING WARNINGS
       A warning can be disabled by using its name prefixed with "no-". Thus
       calling warnings(qw(all no-duplicate)) will enable all warnings except
       those about duplicate entity declarations.

       The following values for "warnings()" disable errors:

       no-idref
           Do not give an error for an ID reference value which no element has
           as its ID. The effect will be as if each attribute declared as an
           ID reference value had been declared as a name.

       no-significant
           Do not give an error when a character that is not a significant
           character in the reference concrete syntax occurs in a literal in
           the SGML declaration. This may be useful in conjunction with
           certain buggy test suites.

       no-valid
           Do not require the document to be type-valid. This has the effect
           of changing the SGML declaration to specify "VALIDITY NOASSERT" and
           "IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has the
           effect of changing the SGML declaration to specify "VALIDITY TYPE"
           and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
           "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF"
           specified in the SGML declaration will be used.

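       The "no-" prefix rule can be sketched as a tiny helper that builds the
       argument list for "warnings" from the names to enable and the names to
       switch off again. The helper name warning_list and its "enable" and
       "disable" options are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for warnings() from the
# names to enable and the names to disable (prefixed with "no-").
sub warning_list {
    my (%opt) = @_;
    my @enable  = @{ $opt{enable}  || [] };
    my @disable = map { "no-$_" } @{ $opt{disable} || [] };
    return (@enable, @disable);
}

my @w = warning_list(enable => ['all'], disable => ['duplicate']);
print "@w\n";   # prints "all no-duplicate"

# Only talk to the real parser when the module is installed.
if (eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = SGML::Parser::OpenSP->new;
    $p->warnings(@w);
}
```
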
   XML WARNINGS
       The following warnings are turned on for the "xml" warning described
       above:

       inclusion
           Warn about inclusions in element type declarations.

       exclusion
           Warn about exclusions in element type declarations.

       rcdata-content
           Warn about RCDATA declared content in element type declarations.

       cdata-content
           Warn about CDATA declared content in element type declarations.

       ps-comment
           Warn about comments in parameter separators.

       attlist-group-decl
           Warn about name groups in attribute declarations.

       element-group-decl
           Warn about name groups in element type declarations.

       pi-entity
           Warn about PI entities.

       internal-sdata-entity
           Warn about internal SDATA entities.

       internal-cdata-entity
           Warn about internal CDATA entities.

       external-sdata-entity
           Warn about external SDATA entities.

       external-cdata-entity
           Warn about external CDATA entities.

       bracket-entity
           Warn about bracketed text entities.

       data-atts
           Warn about attribute definition list declarations for notations.

       missing-system-id
           Warn about external identifiers without system identifiers.

       conref
           Warn about content reference attributes.

       current
           Warn about current attributes.

       nutoken-decl-value
           Warn about attributes with a declared value of NUTOKEN or NUTOKENS.

       number-decl-value
           Warn about attributes with a declared value of NUMBER or NUMBERS.

       name-decl-value
           Warn about attributes with a declared value of NAME or NAMES.

       named-char-ref
           Warn about named character references.

       refc
           Warn about omitted refc delimiters.

       temp-ms
           Warn about TEMP marked sections.

       rcdata-ms
           Warn about RCDATA marked sections.

       instance-include-ms
           Warn about INCLUDE marked sections in the document instance.

       instance-ignore-ms
           Warn about IGNORE marked sections in the document instance.

       and-group
           Warn about AND connectors in model groups.

       rank
           Warn about ranked elements.

       empty-comment-decl
           Warn about empty comment declarations.

       att-value-not-literal
           Warn about attribute values which are not literals.

       missing-att-name
           Warn about omitted attribute names in start tags.

       comment-decl-s
           Warn about spaces before the MDC in comment declarations.

       comment-decl-multiple
           Warn about comment declarations containing multiple comments.

       missing-status-keyword
           Warn about marked sections without a status keyword.

       multiple-status-keyword
           Warn about marked sections with multiple status keywords.

       instance-param-entity
           Warn about parameter entities in the document instance.

       min-param
           Warn about minimization parameters in element type declarations.

       mixed-content-xml
           Warn about cases of mixed content which are not allowed in XML.

       name-group-not-or
           Warn about name groups with a connector different from OR.

       pi-missing-name
           Warn about processing instructions which don't start with a name.

       instance-status-keyword-s
           Warn about spaces between DSO and status keyword in marked
           sections.

       external-data-entity-ref
           Warn about references to external data entities in the content.

       att-value-external-entity-ref
           Warn about references to external data entities in attribute
           values.

       data-delim
           Warn about occurrences of "<" and "&" as data.

       explicit-sgml-decl
           Warn about an explicit SGML declaration.

       internal-subset-ms
           Warn about marked sections in the internal subset.

       default-entity
           Warn about a default entity declaration.

       non-sgml-char-ref
           Warn about numeric character references to non-SGML characters.

       internal-subset-ps-param-entity
           Warn about parameter entity references in parameter separators in
           the internal subset.

       internal-subset-ts-param-entity
           Warn about parameter entity references in token separators in the
           internal subset.

       internal-subset-literal-param-entity
           Warn about parameter entity references in parameter literals in the
           internal subset.

PROCESSING FILES
       In order to start processing of a document and receive events, the
       "parse" method must be called. It takes one argument specifying the
       path to a file (not a file handle). You must set an event handler using
       the "handler" method prior to using this method. The return value of
       "parse" is currently undefined.

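       The two rules above (set the handler first, pass a file name rather
       than a handle) can be sketched as a thin wrapper. The name parse_file
       is hypothetical, and the wrapper is only one possible way to enforce
       these rules.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical wrapper: installs the handler first and refuses anything
# that is a reference or a glob, since parse() expects a file name.
sub parse_file {
    my ($parser, $handler, $path) = @_;
    croak "parse() needs a file name, not a reference or file handle"
        if ref($path) || ref(\$path) eq 'GLOB';
    $parser->handler($handler);
    $parser->parse($path);
    return;   # the return value of parse() is currently undefined
}

# A file handle is rejected before the parser is ever involved.
eval { parse_file(undef, undef, \*STDIN) };
print "rejected: $@" if $@;

# Intended use with the real module:
#   parse_file(SGML::Parser::OpenSP->new, ExampleHandler->new, "example.xhtml");
```
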
EVENT HANDLERS
       In order to receive data from the parser you need to write an event
       handler. For example,

         package ExampleHandler;

         # Handlers are plain blessed references.
         sub new { bless {}, shift }

         # Called for each start tag; "Name" holds the generic identifier.
         sub start_element
         {
             my ($self, $elem) = @_;
             printf " * %s\n", $elem->{Name};
         }

       This handler prints the name of each element as it is found in the
       document. For a typical XHTML document this might result in something
       like

         * html
         * head
         * title
         * body
         * p
         * ...

       The events closely match those in the generic interface to OpenSP; see
       <http://openjade.sf.net/doc/generic.htm> for more information.

       Event names are written in lowercase with underscores separating
       words, and property names are capitalized. Arrays are represented as
       Perl array references. "Position" information is not passed to the
       handler but made available through the "get_location" method, which
       can be called from event handlers. Some redundant information has also
       been stripped, and the generic identifier of an element is stored in
       the "Name" hash entry.

       For example, for an EndElementEvent the "end_element" handler gets
       called with a hash reference

         {
           Name => 'gi'
         }

       The following events are defined:

         * appinfo
         * processing_instruction
         * start_element
         * end_element
         * data
         * sdata
         * external_data_entity_ref
         * subdoc_entity_ref
         * start_dtd
         * end_dtd
         * end_prolog
         * general_entity       # set $p->output_general_entities(1)
         * comment_decl         # set $p->output_comment_decls(1)
         * marked_section_start # set $p->output_marked_sections(1)
         * marked_section_end   # set $p->output_marked_sections(1)
         * ignored_chars        # set $p->output_marked_sections(1)
         * error
         * open_entity_change

       If the documentation of the generic interface to OpenSP states that
       certain data is not valid, it will not be available through this
       interface (i.e., the respective key does not exist in the hash ref).

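       Because a handler is an ordinary Perl object, it can be exercised
       without the parser by calling its event methods directly. The sketch
       below pairs "start_element" with "end_element" to print an indented
       outline; the class name OutlineHandler and its "depth" and "lines"
       fields are hypothetical.

```perl
use strict;
use warnings;

{
    # Hypothetical handler that records an indented outline of the
    # element structure.
    package OutlineHandler;

    sub new { bless { depth => 0, lines => [] }, shift }

    sub start_element {
        my ($self, $elem) = @_;
        # Indent by the current depth, then descend one level.
        push @{ $self->{lines} }, ('  ' x $self->{depth}++) . $elem->{Name};
        return;
    }

    sub end_element {
        my ($self, $elem) = @_;
        $self->{depth}--;
        return;
    }
}

# Feed the handler by hand, in the order a parser would report events.
my $h = OutlineHandler->new;
$h->start_element({ Name => 'html' });
$h->start_element({ Name => 'head' });
$h->end_element({ Name => 'head' });
$h->start_element({ Name => 'body' });
$h->end_element({ Name => 'body' });
$h->end_element({ Name => 'html' });
print "$_\n" for @{ $h->{lines} };
```

       Events for which the handler defines no method, such as "data" here,
       are simply not part of this sketch.
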
POSITIONING INFORMATION
       Event handlers can call the "get_location" method on the parser object
       to retrieve positioning information. The method returns a hash
       reference with the following properties:

         LineNumber   => ..., # line number
         ColumnNumber => ..., # column number
         ByteOffset   => ..., # number of preceding bytes
         EntityOffset => ..., # number of preceding bit combinations
         EntityName   => ..., # name of the external entity
         FileName     => ..., # name of the file

       Any of these values can be "undef" or an empty string.

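       A handler that uses "get_location" has to cope with missing values.
       The following sketch substitutes "?" for anything that is undefined or
       empty. LocatingHandler and FakeParser are hypothetical names;
       FakeParser merely stands in for the real parser so that the example
       runs without OpenSP.

```perl
use strict;
use warnings;

{
    # Hypothetical handler that notes where each element starts.
    package LocatingHandler;

    sub new {
        my ($class, %args) = @_;
        return bless { parser => $args{parser}, seen => [] }, $class;
    }

    sub start_element {
        my ($self, $elem) = @_;
        my $loc = $self->{parser}->get_location;
        # Any property may be undef or an empty string.
        my ($line, $col) = map { defined($_) && length($_) ? $_ : '?' }
                           @{$loc}{qw(LineNumber ColumnNumber)};
        push @{ $self->{seen} }, "$elem->{Name} at $line:$col";
        return;
    }
}

{
    # Stand-in for the real parser, for illustration only.
    package FakeParser;
    sub new          { bless {}, shift }
    sub get_location { return { LineNumber => 12, ColumnNumber => undef } }
}

my $h = LocatingHandler->new(parser => FakeParser->new);
$h->start_element({ Name => 'p' });
print "$_\n" for @{ $h->{seen} };   # prints "p at 12:?"
```
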
POST-PROCESSING ERROR MESSAGES
       OpenSP returns error messages in the form of a string rather than as
       individual components of the message such as line numbers or message
       text. The "split_message" method on the parser object can be used to
       post-process these error message strings as reliably as possible. It
       can be used, for example, from an error event handler if the parser
       object is accessible:

         sub error
         {
             my $self = shift;    # the handler object
             my $erro = shift;    # the error event
             # Requires that the handler stored the parser in $self->{parser}.
             my $mess = $self->{parser}->split_message($erro);
         }

       See the description of "split_message" in the
       SGML::Parser::OpenSP::Tools documentation for details.

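       One way to make the parser object accessible to the handler is to pass
       it to the handler's constructor, as sketched below. The ErrorCollector
       class and its fields are hypothetical. The result of "split_message"
       is stored as-is, since its structure is described in the Tools
       documentation rather than here. Weakening the stored reference is a
       design choice of this sketch, made on the assumption that the parser
       also keeps a reference to the handler.

```perl
use strict;
use warnings;

{
    # Hypothetical handler that post-processes every error event.
    package ErrorCollector;
    use Scalar::Util qw(weaken);

    sub new {
        my ($class, %args) = @_;
        my $self = bless { parser => $args{parser}, errors => [] }, $class;
        # Avoid a reference cycle between handler and parser.
        weaken($self->{parser});
        return $self;
    }

    sub error {
        my ($self, $error) = @_;
        push @{ $self->{errors} }, $self->{parser}->split_message($error);
        return;
    }
}

# Only talk to the real parser when the module is installed.
if (eval { require SGML::Parser::OpenSP; 1 }) {
    my $p = SGML::Parser::OpenSP->new;
    my $h = ErrorCollector->new(parser => $p);
    $p->handler($h);
    # <b> is not declared in this made-up DTD, which is likely to be
    # reported through the error event.
    $p->parse_string(
        "<!DOCTYPE doc [<!ELEMENT doc - - (#PCDATA)>]><doc><b>x</b></doc>");
    printf "%d error(s) collected\n", scalar @{ $h->{errors} };
}
```
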
UNICODE SUPPORT
       All strings passed to event handlers and returned from helper routines
       are UTF-8 encoded with the UTF-8 flag turned on. Helper functions like
       "split_message" expect (but don't check) that string arguments are
       UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper
       functions is undefined for any other input, so avoid passing it.

       "parse" has limited support for binary input, but the binary input must
       be compatible with OpenSP's generic interface requirements and you must
       specify the encoding through means available to OpenSP to enable it to
       properly decode the binary input. Any encoding metadata about such
       binary input that is specific to Perl (such as encoding disciplines for
       file handles when you pass a file descriptor) will be ignored. For more
       specific information refer to the OpenSP manual.

         * <http://openjade.sourceforge.net/doc/sysid.htm>
         * <http://openjade.sourceforge.net/doc/charset.htm>

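       Since the strings handed to event handlers are character strings, they
       usually need to be encoded again before being written to a raw file
       handle. A minimal sketch using the core Encode module follows; the
       helper name to_utf8_bytes is hypothetical and the sample string merely
       stands in for a value received from an event.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into UTF-8 bytes.
sub to_utf8_bytes {
    my ($string) = @_;
    return encode('UTF-8', $string);
}

my $name  = "caf\x{e9}";            # stands in for a string from an event
my $bytes = to_utf8_bytes($name);

# One character (e-acute) needs two bytes in UTF-8.
printf "%d characters, %d bytes\n", length($name), length($bytes);
# prints "4 characters, 5 bytes"
```
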
ENVIRONMENT VARIABLES
       OpenSP supports a number of environment variables to control specific
       processing aspects such as "SGML_SEARCH_PATH" or "SP_CHARSET_FIXED".
       Portable applications need to ensure that these are set prior to
       loading the OpenSP library into memory, which happens when the XS code
       is loaded. This means you need to set them in a "BEGIN" block:

         BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
         use SGML::Parser::OpenSP;
         # ...

       Otherwise changes to the environment might not propagate to OpenSP.
       This applies specifically to Win32 systems.

       SGML_SEARCH_PATH
           See <http://openjade.sourceforge.net/doc/sysid.htm>.

       SP_HTTP_USER_AGENT
           The "User-Agent" header for HTTP requests.

       SP_HTTP_ACCEPT
           The "Accept" header for HTTP requests.

       SP_MESSAGE_FORMAT
           Enable run time selection of the message format. The value is one
           of "XML", "NONE", "TRADITIONAL". Whether this will have an effect
           depends on a compile time setting which might not be enabled in
           your OpenSP build. This module assumes that no such support was
           compiled in.

       SGML_CATALOG_FILES
       SP_USE_DOCUMENT_CATALOG
           See <http://openjade.sourceforge.net/doc/catalog.htm>.

       SP_SYSTEM_CHARSET
       SP_CHARSET_FIXED
       SP_BCTF
       SP_ENCODING
           See <http://openjade.sourceforge.net/doc/charset.htm>.

       Note that you can use the "search_dirs" method instead of
       "SGML_SEARCH_PATH", the "catalogs" method instead of
       "SGML_CATALOG_FILES", and attributes on storage object specifications
       instead of "SP_BCTF" and "SP_ENCODING". For example, if
       "SP_CHARSET_FIXED" is set to 1 you can use

         $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

       to process "example.xhtml" using the "UTF-8" character encoding.

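       When the encoding varies per file, the storage object specification
       can be assembled with a small helper such as the one sketched below.
       The name osfile_sysid is hypothetical, and the helper does no escaping
       of attribute values, so it is only suitable for simple values like
       encoding names.

```perl
use strict;
use warnings;

# Must happen before the OpenSP library is loaded.
BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }

# Hypothetical helper: build "<OSFILE attr='value'>path" from a path and
# a list of attributes. Attribute values are not escaped.
sub osfile_sysid {
    my ($path, %attr) = @_;
    my $attrs = join '', map { " $_='$attr{$_}'" } sort keys %attr;
    return "<OSFILE$attrs>$path";
}

my $sysid = osfile_sysid('example.xhtml', encoding => 'UTF-8');
print "$sysid\n";   # prints "<OSFILE encoding='UTF-8'>example.xhtml"

# Intended use with the real module:
#   $p->parse($sysid);
```
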
KNOWN ISSUES
       OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
       "SP_WIDE_SYSTEM" undefined; otherwise this module will not compile or
       will break at runtime.

BUG REPORTS
       Please report bugs in this module via
       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

       Please report bugs in OpenSP via
       <http://sf.net/tracker/?group_id=2115&atid=102115>

       Please send comments and questions to the spo-devel mailing list, see
       <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO
         * <http://openjade.sf.net/doc/generic.htm>
         * <http://openjade.sf.net/>
         * <http://sf.net/projects/spo/>

AUTHORS
         Terje Bless <link@cpan.org> wrote version 0.01.
         Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02.

         Copyright (c) 2006 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.8.8                       2006-08-30           SGML::Parser::OpenSP(3)