HTML::WikiConverter::Dialects(3)  User Contributed Perl Documentation  HTML::WikiConverter::Dialects(3)

NAME
HTML::WikiConverter::Dialects - How to add a dialect

SYNOPSIS
    # In your dialect module:

    package HTML::WikiConverter::MySlimWiki;
    use base 'HTML::WikiConverter';

    sub rules { {
        b      => { start => '**', end => '**' },
        i      => { start => '//', end => '//' },
        strong => { alias => 'b' },
        em     => { alias => 'i' },
        hr     => { replace => "\n----\n" }
    } }

    # In a nearby piece of code:

    package main;
    use Test::More tests => 5;

    my $wc = HTML::WikiConverter->new(
        dialect => 'MySlimWiki'
    );

    is( $wc->html2wiki( '<b>text</b>' ), '**text**', 'b' );
    is( $wc->html2wiki( '<i>text</i>' ), '//text//', 'i' );
    is( $wc->html2wiki( '<strong>text</strong>' ), '**text**', 'strong' );
    is( $wc->html2wiki( '<em>text</em>' ), '//text//', 'em' );
    is( $wc->html2wiki( '<hr/>' ), '----', 'hr' );

DESCRIPTION
HTML::WikiConverter (or H::WC, for short) is an HTML to wiki converter.
It can convert HTML source into a variety of wiki markups, called wiki
"dialects". This manual describes how to create your own dialect
to be plugged into HTML::WikiConverter.

Each dialect has a separate dialect module containing rules for
converting HTML into wiki markup specific to that dialect. Currently,
all dialect modules are in the "HTML::WikiConverter::" package space
and subclass HTML::WikiConverter. For example, the MediaWiki dialect
module is HTML::WikiConverter::MediaWiki, while PhpWiki's is
HTML::WikiConverter::PhpWiki. However, dialect modules need not be in
the "HTML::WikiConverter::" package space; you may just as easily use
"package MyWikiDialect;" and H::WC will Do The Right Thing.

From now on, I'll be using the terms "dialect" and "dialect module"
interchangeably.

Subclassing
To interface with H::WC, dialects need to subclass it. This is done
like so at the start of the dialect module:

    package HTML::WikiConverter::MySlimWiki;
    use base 'HTML::WikiConverter';

Conversion rules
Dialects guide H::WC's conversion process with a set of rules that
define how HTML elements are turned into their wiki counterparts. Each
rule corresponds to an HTML tag and there may be any number of rules.
Rules are specified in your dialect's rules() method, which returns a
reference to a hash of rules. Each entry in the hash maps a tag name to
a set of subrules, as in:

    $tag => \%subrules

where $tag is the name of the HTML tag (e.g., "b", "em", etc.) and
%subrules contains subrules that specify how that tag will be converted
when it is encountered in the HTML input.
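
As a minimal sketch of that structure, the snippet below defines a
rules() method and then walks the returned hash. The package name
MySketchDialect is hypothetical, and the "use base" line is left out so
that the snippet runs without H::WC installed; a real dialect would
subclass H::WC as described above.

```perl
use strict;
use warnings;

# Hypothetical dialect package. A real dialect would also say:
#   use base 'HTML::WikiConverter';
package MySketchDialect;

sub rules {
    return {
        b      => { start => '**', end => '**' },
        strong => { alias => 'b' },
        hr     => { replace => "\n----\n" },
    };
}

package main;

my $rules = MySketchDialect->rules;

# Each key is a tag name; each value is a hash reference of subrules.
my @summary;
for my $tag ( sort keys %$rules ) {
    my $subrules = $rules->{$tag};
    push @summary, "$tag: " . join( ',', sort keys %$subrules );
}
print "$_\n" for @summary;
```

Returning the hash reference explicitly, as here, behaves the same as
the "sub rules { { ... } }" form used elsewhere in this manual.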

Subrules

The following subrules are recognized:

    start
    end

    preserve
    attributes
    empty

    replace
    alias

    block
    line_format
    line_prefix

    trim

A simple example

The following rules could be used for a dialect that uses "*asterisks*"
for bold and "_underscores_" for italic text:

    sub rules { {
        b => { start => '*', end => '*' },
        i => { start => '_', end => '_' },
    } }

Aliases

To add "<strong>" and "<em>" as aliases of "<b>" and "<i>", use the
"alias" subrule:

    strong => { alias => 'b' },
    em     => { alias => 'i' },

(The "alias" subrule cannot be used with any other subrule.)

Blocks

Many dialects separate paragraphs and other block-level elements with a
blank line. To indicate this, use the "block" subrule:

    p => { block => 1 },

(To better support nested block elements, if block elements are
nested inside each other, blank lines are only added to the outermost
element.)

Line formatting

Many dialects require that the text of an element be contained on a
single line, or that it contain no blank lines, and so on. These
options can be specified using the "line_format" subrule, which can be
assigned the value "single", "multi", or "blocks".

If the element must be contained on a single line, then the
"line_format" subrule should be "single". If the element can span
multiple lines, but there can be no blank lines contained within, then
use "multi". If blank lines (which delimit blocks) are allowed, then
use "blocks". For example, paragraphs are specified like so in the
MediaWiki dialect:

    p => { block => 1, line_format => 'multi', trim => 'both' },

Trimming whitespace

The "trim" subrule specifies whether leading or trailing whitespace (or
both) should be stripped from the element. To strip leading whitespace
only, use "leading"; for trailing whitespace, use "trailing"; for both,
use the aptly named "both"; for neither (the default), use "none".
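
The effect of the four "trim" values can be pictured with a small
standalone helper. This is only an illustration of the behavior
described above, not H::WC's internal code, and the name apply_trim is
made up for this sketch.

```perl
use strict;
use warnings;

# Illustrative helper (not part of H::WC): applies a "trim" value to a string.
sub apply_trim {
    my ( $text, $trim ) = @_;
    $trim = 'none' unless defined $trim;    # "none" is the documented default
    $text =~ s/^\s+// if $trim eq 'leading'  or $trim eq 'both';
    $text =~ s/\s+$// if $trim eq 'trailing' or $trim eq 'both';
    return $text;
}

# Show each mode with brackets marking the ends of the result.
print "$_: [", apply_trim( "  text  ", $_ ), "]\n"
    for qw( leading trailing both none );
```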

Line prefixes

Some elements require that each line be prefixed with a particular
string. This is specified with the "line_prefix" subrule. For example,
preformatted text in MediaWiki is prefixed with a space:

    pre => { block => 1, line_prefix => ' ' },
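
To make "each line" concrete, the sketch below prefixes every line of a
multi-line string. The helper name apply_line_prefix is hypothetical,
and the code only mirrors the behavior described here; H::WC's own
implementation may differ.

```perl
use strict;
use warnings;

# Illustrative helper (not part of H::WC): prefix every line of $text.
sub apply_line_prefix {
    my ( $text, $prefix ) = @_;
    # A limit of -1 keeps trailing empty lines instead of dropping them.
    return join "\n", map { $prefix . $_ } split /\n/, $text, -1;
}

my $pre = apply_line_prefix( "line one\nline two", ' ' );
print "$pre\n";
```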

Replacement

In some cases, conversion from HTML to wiki markup is as simple as
string replacement. To replace a tag and its contents with a particular
string, use the "replace" subrule. For example, in PhpWiki, three
percent signs, "%%%", represent a line break, "<br>", hence:

    br => { replace => '%%%' },

(The "replace" subrule cannot be used with any other subrule.)

Preserving HTML tags

Some dialects allow a subset of HTML in their markup. While H::WC
ignores unhandled HTML tags by default (i.e., if H::WC encounters a tag
that does not exist in a dialect's rule specification, then the
contents of the tag are simply passed through to the wiki markup), you
may specify that some be preserved using the "preserve" subrule. For
example, to allow the "<font>" tag in wiki markup:

    font => { preserve => 1 },

Preserved tags may also specify a list of attributes that may pass
through from HTML to wiki markup. This is done with the "attributes"
subrule:

    font => { preserve => 1, attributes => [ qw/ style class / ] },

(The "attributes" subrule can only be used if the "preserve" subrule is
also present.)

Some HTML elements have no content (e.g., line breaks, images) and the
wiki dialect might require them to be preserved in a more
XHTML-friendly way. To indicate that a preserved tag should have no
content, use the "empty" subrule. This will cause the element to be
replaced with "<tag />" and no end tag. For example, MediaWiki handles
line breaks like so:

    br => {
        preserve   => 1,
        attributes => [ qw/ id class title style clear / ],
        empty      => 1
    },

This will convert, for example, "<br clear='both'>" into "<br
clear='both' />". Without specifying the "empty" subrule, this would be
converted into the (probably undesirable) "<br clear='both'></br>".

(The "empty" subrule can only be used if the "preserve" subrule is also
present.)

Rules that depend on attribute values

In some circumstances, you might want your dialect's conversion rules
to depend on the value of one or more of the converter's attributes.
This can be achieved by producing rules in a conditional manner within
rules(). For example:

    sub rules {
        my $self = shift;

        my %rules = (
            em     => { start => "''", end => "''" },
            strong => { start => "'''", end => "'''" },
        );

        $rules{i} = { preserve => 1 } if $self->preserve_italic;
        $rules{b} = { preserve => 1 } if $self->preserve_bold;

        return \%rules;
    }
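
The same pattern can be tried in isolation. In the sketch below the
package name MyConditionalDialect, its constructor, and its
preserve_bold accessor are stand-ins written for this example so that
it runs without H::WC; in a real dialect the object and the accessor
would come from H::WC and the dialect's attributes() method.

```perl
use strict;
use warnings;

package MyConditionalDialect;    # hypothetical name

# Minimal stand-ins so the sketch runs without H::WC installed.
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub preserve_bold { $_[0]{preserve_bold} }

sub rules {
    my $self = shift;

    my %rules = ( strong => { start => "'''", end => "'''" } );

    # The "b" rule exists only when the attribute is true.
    $rules{b} = { preserve => 1 } if $self->preserve_bold;

    return \%rules;
}

package main;

my $plain    = MyConditionalDialect->new( preserve_bold => 0 )->rules;
my $preserve = MyConditionalDialect->new( preserve_bold => 1 )->rules;

print 'without preserve_bold: ', join( ',', sort keys %$plain ),    "\n";
print 'with preserve_bold:    ', join( ',', sort keys %$preserve ), "\n";
```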

Dynamic subrules
Instead of simple strings, you may use coderefs as values for the
"start", "end", "replace", and "line_prefix" subrules. If you do, the
code will be called when the subrule is applied, and will be passed
three arguments: the current H::WC object, the current HTML::Element
node being operated on, and a reference to the hash containing the
dialect's subrules associated with elements of that type.

For example, MoinMoin handles lists like so:

    ul => { line_format => 'multi', block => 1, line_prefix => ' ' },
    li => { start => \&_li_start, trim => 'leading' },
    ol => { alias => 'ul' },

It then defines _li_start():

    sub _li_start {
        my( $self, $node, $subrules ) = @_;
        my $bullet = '';
        $bullet = '*'  if $node->parent->tag eq 'ul';
        $bullet = '1.' if $node->parent->tag eq 'ol';
        return "\n$bullet ";
    }

This prefixes every unordered list item with "*" and every ordered list
item with "1.", which MoinMoin requires. It also puts each list item on
its own line and places a space between the prefix and the content of
the list item.
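
A coderef like this can be exercised without building a full HTML tree.
The sketch below calls _li_start() directly; MockNode is a hypothetical
stand-in that provides only the tag() and parent() methods the function
uses, whereas H::WC would pass real HTML::Element nodes.

```perl
use strict;
use warnings;

# Tiny stand-in for HTML::Element, offering only tag() and parent().
package MockNode;
sub new    { my ( $class, %args ) = @_; return bless {%args}, $class }
sub tag    { $_[0]{tag} }
sub parent { $_[0]{parent} }

package main;

sub _li_start {
    my ( $self, $node, $subrules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
}

my $ul_item = MockNode->new( tag => 'li', parent => MockNode->new( tag => 'ul' ) );
my $ol_item = MockNode->new( tag => 'li', parent => MockNode->new( tag => 'ol' ) );

# _li_start ignores its first and third arguments, so undef is passed here.
my $ul_start = _li_start( undef, $ul_item, undef );
my $ol_start = _li_start( undef, $ol_item, undef );

print "ul item starts with: [$ul_start]\n";
print "ol item starts with: [$ol_start]\n";
```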

Subrule validation
Certain subrule combinations are not allowed. Hopefully it's intuitive
why this is, but in case it's not, prohibited combinations have been
mentioned above parenthetically. For example, the "replace" and "alias"
subrules cannot be combined with any other subrules, and "attributes"
can only be specified alongside "preserve". Invalid subrule
combinations will trigger a fatal error when the H::WC object is
instantiated.
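
The combinations named in this manual can be summarized as a small
check. The function subrule_errors below is a hypothetical helper
written for illustration; it collects messages rather than dying, and
H::WC's own validation may differ in detail.

```perl
use strict;
use warnings;

# Illustrative check of the prohibited combinations listed in this manual.
sub subrule_errors {
    my ( $tag, $subrules ) = @_;
    my @errors;
    my @names = keys %$subrules;

    # "alias" and "replace" must each stand alone.
    for my $exclusive (qw( alias replace )) {
        push @errors, "$tag: '$exclusive' cannot be combined with other subrules"
            if exists $subrules->{$exclusive} and @names > 1;
    }

    # "attributes" and "empty" only make sense on preserved tags.
    for my $dependent (qw( attributes empty )) {
        push @errors, "$tag: '$dependent' requires 'preserve'"
            if exists $subrules->{$dependent} and not exists $subrules->{preserve};
    }

    return @errors;
}

my @ok  = subrule_errors( br => { preserve => 1, empty => 1 } );
my @bad = subrule_errors( em => { alias => 'i', trim => 'both' } );

print 'br: ', scalar(@ok), " error(s)\n";
print "$_\n" for @bad;
```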

Dialect attributes
H::WC's constructor accepts a number of attributes that help determine
how conversion takes place. Dialects can alter these attributes or add
their own by defining an attributes() method, which returns a reference
to a hash of attributes. Each entry in the hash maps the attribute's
name to an attribute specification, as in:

    $attr => \%spec

where $attr is the name of the attribute and %spec is a
Params::Validate specification for the attribute.

For example, to add a boolean attribute called "camel_case" which is
disabled by default:

    sub attributes { {
        camel_case => { default => 0 },
    } }

Attributes defined like this are given accessor and mutator methods via
Perl's "AUTOLOAD" mechanism, so you can later say:

    my $ok = $wc->camel_case;
    $wc->camel_case(0);

You may override the default H::WC attributes using this mechanism. For
example, while H::WC considers the "base_uri" attribute optional, it is
required for the PbWiki dialect. PbWiki can override this
default-optional behavior by saying:

    sub attributes { {
        base_uri => { optional => 0 }
    } }

Preprocessing
The first step H::WC takes in converting HTML source to wiki markup is
to parse the HTML into a syntax tree using HTML::TreeBuilder. It is
often useful for dialects to preprocess the tree prior to converting it
into wiki markup. Dialects that need to preprocess the tree can define
a "preprocess_node" method that will be called on each node of the tree
(traversal is done in pre-order). The method receives two arguments:
the H::WC object and the current HTML::Element node being traversed.
It may modify the node or decide to ignore it; its return value is
discarded.

Built-in preprocessors

Because they are commonly needed, H::WC automatically carries out two
preprocessing steps, regardless of the dialect: 1) relative URIs in
images and links are converted to absolute URIs (based upon the
"base_uri" parameter), and 2) ignorable text (e.g. between a "</td>"
and "<td>") is discarded.

H::WC also provides additional preprocessing steps that may be
explicitly enabled by dialect modules.

strip_aname
    Removes any anchor elements that do not contain an "href"
    attribute.

caption2para
    Removes table captions and reinserts them as paragraphs before the
    table.

Dialects may apply these optional preprocessing steps by calling them
as methods on the dialect object inside "preprocess_node". For example:

    sub preprocess_node {
        my( $self, $node ) = @_;
        $self->strip_aname($node);
        $self->caption2para($node);
    }

Postprocessing
Once the work of converting HTML is complete, it is sometimes useful to
postprocess the resulting wiki markup. Postprocessing can be used to
clean up whitespace, fix subtle bugs introduced in the markup during
conversion, etc.

Dialects that want to postprocess the wiki markup should define a
"postprocess_output" method that will be called just before the
"html2wiki" method returns to the client. The method will be passed two
arguments: the H::WC object and a reference to the wiki markup. The
method may modify the wiki markup that the reference points to; its
return value is discarded.

For example, to replace each pair of consecutive line breaks with a
pair of newlines, a dialect might implement this:

    sub postprocess_output {
        my( $self, $outref ) = @_;
        $$outref =~ s/<br>\s*<br>/\n\n/gs;
    }

(This example assumes that HTML line breaks were replaced with "<br>"
in the wiki markup.)
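
Because the method works through a reference, the caller's string is
changed in place. The sketch below shows this by calling the function
directly, passing undef in place of the H::WC object (which this
particular method never uses); the sample markup is made up for the
example.

```perl
use strict;
use warnings;

sub postprocess_output {
    my ( $self, $outref ) = @_;
    # Collapse two "<br>" markers, and any whitespace between them,
    # into a pair of newlines.
    $$outref =~ s/<br>\s*<br>/\n\n/gs;
}

my $wiki = "first paragraph<br>\n<br>second paragraph";

# Pass a reference so the function can edit $wiki itself.
postprocess_output( undef, \$wiki );

print "$wiki\n";
```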

Dialect utility methods
H::WC defines a set of utility methods that dialect modules may find
useful.

get_elem_contents

    my $wiki = $wc->get_elem_contents( $node );

Converts the contents of $node into wiki markup and returns the
resulting wiki markup.

get_wiki_page

    my $title = $wc->get_wiki_page( $url );

Attempts to extract the title of a wiki page from the given URL,
returning the title on success, "undef" on failure. If "wiki_uri" is
empty, this method always returns "undef". See "ATTRIBUTES" in
HTML::WikiConverter for details on how the "wiki_uri" attribute is
interpreted.

is_camel_case

    my $ok = $wc->is_camel_case( $str );

Returns true if $str is in CamelCase, false otherwise. CamelCase-ness
is determined using the same rules that Kwiki's formatting module uses.

get_attr_str

    my $attr_str = $wc->get_attr_str( $node, @attrs );

Returns a string containing the specified attributes in the given node.
The returned string is suitable for insertion into an HTML tag. For
example, if $node contains the HTML

    <span id="ht" class="head" onclick="editPage()">Header</span>

and @attrs contains "id" and "class", then get_attr_str() will return
'id="ht" class="head"'.

_attr

    my $value = $wc->_attr( $name );

Returns the value of the named attribute. This is rarely needed since
you can access attribute values by treating the attribute name as a
method (i.e., "$wc->$name"). This low-level method of accessing
attributes is provided for when you need to override an attribute's
accessor/mutator method, as in:

    sub attributes { {
        my_attr => { default => 1 },
    } }

    sub my_attr {
        # @value is empty when reading and holds the new value when setting
        my( $wc, @value ) = @_;
        # do something special
        return $wc->_attr( my_attr => @value );
    }

AUTHOR
David J. Iberri <diberri@cpan.org>

COPYRIGHT
Copyright 2006 David J. Iberri, all rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.38.0  2023-07-20  HTML::WikiConverter::Dialects(3)