HTML::WikiConverter::Dialects(3)  User Contributed Perl Documentation  HTML::WikiConverter::Dialects(3)

NAME
HTML::WikiConverter::Dialects - How to add a dialect

SYNOPSIS
    # In your dialect module:

    package HTML::WikiConverter::MySlimWiki;
    use base 'HTML::WikiConverter';

    sub rules { {
        b      => { start => '**', end => '**' },
        i      => { start => '//', end => '//' },
        strong => { alias => 'b' },
        em     => { alias => 'i' },
        hr     => { replace => "\n----\n" }
    } }

    # In a nearby piece of code:

    package main;
    use Test::More tests => 5;

    my $wc = HTML::WikiConverter->new(
        dialect => 'MySlimWiki'
    );

    is( $wc->html2wiki( '<b>text</b>' ), '**text**', 'b' );
    is( $wc->html2wiki( '<i>text</i>' ), '//text//', 'i' );
    is( $wc->html2wiki( '<strong>text</strong>' ), '**text**', 'strong' );
    is( $wc->html2wiki( '<em>text</em>' ), '//text//', 'em' );
    is( $wc->html2wiki( '<hr/>' ), '----', 'hr' );

DESCRIPTION
HTML::WikiConverter (or H::WC, for short) is an HTML to wiki converter.
It can convert HTML source into a variety of wiki markups, called wiki
"dialects". This manual describes how to create your own dialect
to be plugged into HTML::WikiConverter.

Each dialect has a separate dialect module containing rules for
converting HTML into wiki markup specific for that dialect. Currently,
all dialect modules are in the "HTML::WikiConverter::" package space
and subclass HTML::WikiConverter. For example, the MediaWiki dialect
module is HTML::WikiConverter::MediaWiki, while PhpWiki's is
HTML::WikiConverter::PhpWiki. However, dialect modules need not be in
the "HTML::WikiConverter::" package space; you may just as easily use
"package MyWikiDialect;" and H::WC will Do The Right Thing.

From now on, I'll be using the terms "dialect" and "dialect module"
interchangeably.

Subclassing
To interface with H::WC, dialects need to subclass it. This is done
like so at the start of the dialect module:

    package HTML::WikiConverter::MySlimWiki;
    use base 'HTML::WikiConverter';

Conversion rules
Dialects guide H::WC's conversion process with a set of rules that
define how HTML elements are turned into their wiki counterparts. Each
rule corresponds to an HTML tag and there may be any number of rules.
Rules are specified in your dialect's "rules()" method, which returns a
reference to a hash of rules. Each entry in the hash maps a tag name to
a set of subrules, as in:

    $tag => \%subrules

where $tag is the name of the HTML tag (e.g., "b", "em", etc.) and
%subrules contains subrules that specify how that tag will be converted
when it is encountered in the HTML input.
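
As a rough sketch of that shape, the snippet below defines a "rules()"
method that returns a hash reference and then walks the tag-to-subrules
mapping. The package name "MyWikiDialect" is only illustrative, and the
"use base 'HTML::WikiConverter'" line is deliberately left out so the
snippet runs without the module installed; a real dialect needs it.

```perl
use strict;
use warnings;

package MyWikiDialect;    # hypothetical dialect name, for illustration only

# rules() returns a reference to a hash: tag name => hashref of subrules.
sub rules {
    return {
        b  => { start   => '**', end => '**' },
        hr => { replace => "\n----\n" },
    };
}

package main;

my $rules = MyWikiDialect->rules;

# List each tag together with the names of the subrules it defines.
my @summary;
for my $tag ( sort keys %$rules ) {
    my $subrules = $rules->{$tag};
    push @summary, "$tag: " . join( ', ', sort keys %$subrules );
}
print "$_\n" for @summary;
```

Each key of the outer hash is a tag name and each value is itself a hash
reference, which is all the "$tag => \%subrules" notation means.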

Subrules

The following subrules are recognized:

    start
    end

    preserve
    attributes
    empty

    replace
    alias

    block
    line_format
    line_prefix

    trim

A simple example

The following rules could be used for a dialect that uses "*asterisks*"
for bold and "_underscores_" for italic text:

    sub rules {
        return {
            b => { start => '*', end => '*' },
            i => { start => '_', end => '_' },
        };
    }

Aliases

To add "<strong>" and "<em>" as aliases of "<b>" and "<i>", use the
"alias" subrule:

    strong => { alias => 'b' },
    em     => { alias => 'i' },

(The "alias" subrule cannot be used with any other subrule.)
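
One way to picture an alias is as a lookup that redirects to the target
tag's subrules. The sketch below is only an illustration of that idea,
not H::WC's internal code; the helper name "effective_subrules" is made
up, and it follows a single level of aliasing.

```perl
use strict;
use warnings;

# Rules as a dialect might declare them (same values as the examples above).
my %rules = (
    b      => { start => '*', end => '*' },
    i      => { start => '_', end => '_' },
    strong => { alias => 'b' },
    em     => { alias => 'i' },
);

# Hypothetical helper: return the subrules that apply to $tag,
# following at most one level of aliasing.
sub effective_subrules {
    my ($tag) = @_;
    my $subrules = $rules{$tag} or return;
    return exists $subrules->{alias} ? $rules{ $subrules->{alias} } : $subrules;
}

my $strong = effective_subrules('strong');
print "strong is wrapped in '$strong->{start}' and '$strong->{end}'\n";
```

Because an aliased tag borrows everything from its target, it has
nothing of its own to add, which is consistent with "alias" not being
combinable with other subrules.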

Blocks

Many dialects separate paragraphs and other block-level elements with a
blank line. To indicate this, use the "block" subrule:

    p => { block => 1 },

(To better support nested block elements, if block elements are
nested inside each other, blank lines are only added to the outermost
element.)

Line formatting

Many dialects require that the text of an element be contained on a
single line of text, or that it cannot contain any newlines, etc. These
options can be specified using the "line_format" subrule, which can be
assigned the value "single", "multi", or "blocks".

If the element must be contained on a single line, then the
"line_format" subrule should be "single". If the element can span
multiple lines, but there can be no blank lines contained within, then
use "multi". If blank lines (which delimit blocks) are allowed, then
use "blocks". For example, paragraphs are specified like so in the
MediaWiki dialect:

    p => { block => 1, line_format => 'multi', trim => 'both' },
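
To make the three values concrete, here is a small sketch of what each
one permits. It is one reading of the description above, written as a
standalone function; "apply_line_format" is a hypothetical name and
H::WC's own whitespace handling may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the three "line_format" values.
sub apply_line_format {
    my ( $text, $format ) = @_;
    if ( $format eq 'single' ) {
        $text =~ s/\s*\n\s*/ /g;    # no newlines at all: join into one line
    }
    elsif ( $format eq 'multi' ) {
        $text =~ s/\n\s*\n/\n/g;    # newlines are fine, blank lines are not
    }
    # 'blocks': blank lines are allowed, so the text is left untouched
    return $text;
}

my $text = "one\ntwo\n\nthree";
for my $format (qw( single multi blocks )) {
    my $shown = apply_line_format( $text, $format );
    $shown =~ s/\n/\\n/g;           # make newlines visible when printing
    print "$format: $shown\n";
}
```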

Trimming whitespace

The "trim" subrule specifies whether leading or trailing whitespace (or
both) should be stripped from the element. To strip leading whitespace
only, use "leading"; for trailing whitespace, use "trailing"; for both,
use the aptly named "both"; for neither (the default), use "none".
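
The four values can be sketched as a plain string function. This is an
illustration of the semantics only; "trim_text" is a hypothetical name,
not part of H::WC.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a "trim" value to an element's text.
sub trim_text {
    my ( $text, $mode ) = @_;
    $text =~ s/\A\s+// if $mode eq 'leading'  || $mode eq 'both';
    $text =~ s/\s+\z// if $mode eq 'trailing' || $mode eq 'both';
    return $text;    # 'none' falls through unchanged
}

for my $mode (qw( leading trailing both none )) {
    printf "%-8s [%s]\n", $mode, trim_text( '  text  ', $mode );
}
```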

Line prefixes

Some elements require that each line be prefixed with a particular
string. This is specified with the "line_prefix" subrule. For example,
preformatted text in MediaWiki is prefixed with a space:

    pre => { block => 1, line_prefix => ' ' },
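
The effect on an element's text can be sketched in a few lines. The
function name "prefix_lines" is made up for this illustration and is not
an H::WC method.

```perl
use strict;
use warnings;

# Hypothetical helper: put $prefix in front of every line of $text.
sub prefix_lines {
    my ( $text, $prefix ) = @_;
    # The -1 limit keeps trailing empty fields, so blank lines survive.
    return join "\n", map { $prefix . $_ } split /\n/, $text, -1;
}

print prefix_lines( "first line\nsecond line", ' ' ), "\n";
```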

Replacement

In some cases, conversion from HTML to wiki markup is as simple as
string replacement. To replace a tag and its contents with a particular
string, use the "replace" subrule. For example, in PhpWiki, three
percent signs, "%%%", represent a line break, "<br>", hence:

    br => { replace => '%%%' },

(The "replace" subrule cannot be used with any other subrule.)

Preserving HTML tags

Some dialects allow a subset of HTML in their markup. While H::WC
ignores unhandled HTML tags by default (i.e., if H::WC encounters a tag
that does not exist in a dialect's rule specification, then the
contents of the tag are simply passed through to the wiki markup), you
may specify that some be preserved using the "preserve" subrule. For
example, to allow the "<font>" tag in wiki markup:

    font => { preserve => 1 },

Preserved tags may also specify a list of attributes that are passed
through from HTML to wiki markup. This is done with the "attributes"
subrule:

    font => { preserve => 1, attributes => [ qw/ style class / ] },

(The "attributes" subrule can only be used if the "preserve" subrule is
also present.)

Some HTML elements have no content (e.g., line breaks, images) and the
wiki dialect might require them to be preserved in a more
XHTML-friendly way. To indicate that a preserved tag should have no
content, use the "empty" subrule. This will cause the element to be
replaced with "<tag />" and no end tag. For example, MediaWiki handles
line breaks like so:

    br => {
        preserve   => 1,
        attributes => [ qw/ id class title style clear / ],
        empty      => 1
    },

This will convert, for example, "<br clear='both'>" into
"<br clear='both' />". Without specifying the "empty" subrule, this
would be converted into the (probably undesirable)
"<br clear='both'></br>".

(The "empty" subrule can only be used if the "preserve" subrule is also
present.)
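
The interplay of "preserve", "attributes", and "empty" can be sketched
with plain data. In the snippet below, "preserved_tag" is a hypothetical
helper and the element's attributes are passed as an ordinary hash
rather than an HTML::Element node; attribute values are not escaped and
are always double-quoted, which may differ from what H::WC emits.

```perl
use strict;
use warnings;

# Hypothetical helper: render a preserved element from its tag name,
# its HTML attributes, its subrules, and its already-converted content.
sub preserved_tag {
    my ( $tag, $attrs, $subrules, $content ) = @_;

    # Keep only attributes named in the "attributes" subrule, in that order.
    my @kept = grep { exists $attrs->{$_} } @{ $subrules->{attributes} || [] };
    my $attr_str = join '', map { qq{ $_="$attrs->{$_}"} } @kept;

    return "<$tag$attr_str />" if $subrules->{empty};
    return "<$tag$attr_str>$content</$tag>";
}

my $br_rule = {
    preserve   => 1,
    attributes => [ qw/ id class title style clear / ],
    empty      => 1,
};

# The onclick attribute is not in the allowed list, so it is dropped.
print preserved_tag( 'br', { clear => 'both', onclick => 'x()' }, $br_rule, '' ), "\n";
```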

Rules that depend on attribute values

In some circumstances, you might want your dialect's conversion rules
to depend on the value of one or more attributes. This can be achieved
by producing rules in a conditional manner within "rules()". For
example:

    sub rules {
        my $self = shift;

        my %rules = (
            em     => { start => "''", end => "''" },
            strong => { start => "'''", end => "'''" },
        );

        $rules{i} = { preserve => 1 } if $self->preserve_italic;
        $rules{b} = { preserve => 1 } if $self->preserve_bold;

        return \%rules;
    }

Dynamic subrules
Instead of simple strings, you may use coderefs as values for the
"start", "end", "replace", and "line_prefix" subrules. If you do, the
code will be called when the subrule is applied, and will be passed
three arguments: the current H::WC object, the current HTML::Element
node being operated on, and a reference to the hash containing the
dialect's subrules associated with elements of that type.

For example, MoinMoin handles lists like so:

    ul => { line_format => 'multi', block => 1, line_prefix => ' ' },
    li => { start => \&_li_start, trim => 'leading' },
    ol => { alias => 'ul' },

It then defines "_li_start()":

    sub _li_start {
        my( $self, $node, $subrules ) = @_;
        my $bullet = '';
        $bullet = '*' if $node->parent->tag eq 'ul';
        $bullet = '1.' if $node->parent->tag eq 'ol';
        return "\n$bullet ";
    }

This prefixes every unordered list item with "*" and every ordered list
item with "1.", which MoinMoin requires. It also puts each list item on
its own line and places a space between the prefix and the content of
the list item.
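
The string-or-coderef choice can be pictured as a small dispatch step.
The sketch below is an assumption about how such a step could look, not
H::WC's implementation; "subrule_value" is a hypothetical helper, and a
plain hash with a "bullet" key stands in for the HTML::Element node.

```perl
use strict;
use warnings;

# Hypothetical helper: evaluate a subrule value that may be a plain
# string or a coderef. Coderefs receive the three documented arguments.
sub subrule_value {
    my ( $wc, $node, $subrules, $name ) = @_;
    my $value = $subrules->{$name};
    return ref($value) eq 'CODE' ? $value->( $wc, $node, $subrules ) : $value;
}

my %li = (
    start => sub { my ( $wc, $node ) = @_; return "\n$node->{bullet} " },
    trim  => 'leading',
);

my $fake_node = { bullet => '*' };    # stand-in for an HTML::Element node
my $start = subrule_value( undef, $fake_node, \%li, 'start' );
my $trim  = subrule_value( undef, $fake_node, \%li, 'trim' );
print "trim is '$trim'; start ends with '", substr( $start, 1 ), "'\n";
```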

Subrule validation
Certain subrule combinations are not allowed. Hopefully it's intuitive
why this is, but in case it's not, prohibited combinations have been
mentioned above parenthetically. For example, the "replace" and "alias"
subrules cannot be combined with any other subrules, and "attributes"
can only be specified alongside "preserve". Invalid subrule
combinations will trigger a fatal error when the H::WC object is
instantiated.
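
The prohibited combinations mentioned in this manual can be summarized
as a small checker. This is a sketch of the constraints only;
"check_subrules" and its error messages are hypothetical, and H::WC
performs its own validation with its own wording.

```perl
use strict;
use warnings;

# Hypothetical validator mirroring the combinations described in this manual.
sub check_subrules {
    my ( $tag, $subrules ) = @_;
    my @names = keys %$subrules;

    # "alias" and "replace" must each stand alone.
    for my $solo (qw( alias replace )) {
        die "'$solo' cannot be combined with other subrules in '$tag' rule\n"
            if exists $subrules->{$solo} && @names > 1;
    }

    # "attributes" and "empty" only make sense alongside "preserve".
    for my $dependent (qw( attributes empty )) {
        die "'$dependent' requires 'preserve' in '$tag' rule\n"
            if exists $subrules->{$dependent} && !exists $subrules->{preserve};
    }
    return 1;
}

for my $rule ( { alias => 'b' }, { alias => 'b', trim => 'both' } ) {
    my $ok = eval { check_subrules( 'strong', $rule ) };
    print $ok ? "accepted\n" : "rejected: $@";
}
```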

Dialect attributes
H::WC's constructor accepts a number of attributes that help determine
how conversion takes place. Dialects can alter these attributes or add
their own by defining an "attributes()" method, which returns a
reference to a hash of attributes. Each entry in the hash maps the
attribute's name to an attribute specification, as in:

    $attr => \%spec

where $attr is the name of the attribute and %spec is a
Params::Validate specification for the attribute.

For example, to add a boolean attribute called "camel_case" which is
disabled by default:

    sub attributes {
        return {
            camel_case => { default => 0 },
        };
    }

Attributes defined like this are given accessor and mutator methods via
Perl's "AUTOLOAD" mechanism, so you can later say:

    my $ok = $wc->camel_case;
    $wc->camel_case(0);
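
For readers unfamiliar with "AUTOLOAD", the following sketch shows the
general technique of serving accessor and mutator calls from one
catch-all method. "TinyConverter" is a hypothetical stand-in class; it
does not reproduce how H::WC stores or validates its attributes.

```perl
use strict;
use warnings;

package TinyConverter;    # hypothetical stand-in, not HTML::WikiConverter

our $AUTOLOAD;

sub new {
    my ( $class, %args ) = @_;
    # Start from the declared default, then apply constructor arguments.
    my %attrs = ( camel_case => 0, %args );
    return bless { attrs => \%attrs }, $class;
}

# Called for any method that is not defined explicitly.
sub AUTOLOAD {
    my ( $self, @args ) = @_;
    ( my $name = $AUTOLOAD ) =~ s/.*:://;    # strip the package prefix
    return if $name eq 'DESTROY';
    die "No such attribute '$name'\n" unless exists $self->{attrs}{$name};
    $self->{attrs}{$name} = $args[0] if @args;    # mutator form
    return $self->{attrs}{$name};                 # accessor form
}

package main;

my $wc = TinyConverter->new;
$wc->camel_case(1);
print "camel_case is now ", $wc->camel_case, "\n";
```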

You may override the default H::WC attributes using this mechanism. For
example, while H::WC considers the "base_uri" attribute optional, it is
required for the PbWiki dialect. PbWiki can override this
default-optional behavior by saying:

    sub attributes {
        return {
            base_uri => { optional => 0 },
        };
    }

Preprocessing
The first step H::WC takes in converting HTML source to wiki markup is
to parse the HTML into a syntax tree using HTML::TreeBuilder. It is
often useful for dialects to preprocess the tree prior to converting it
into wiki markup. Dialects that need to preprocess the tree can define
a "preprocess_node" method that will be called on each node of the tree
(traversal is done in pre-order). The method receives two arguments,
the H::WC object, and the current HTML::Element node being traversed.
It may modify the node or decide to ignore it; its return value is
discarded.

Built-in preprocessors

Because they are commonly needed, H::WC automatically carries out two
preprocessing steps, regardless of the dialect: 1) relative URIs in
images and links are converted to absolute URIs (based upon the
"base_uri" parameter), and 2) ignorable text (e.g. between a "</td>"
and "<td>") is discarded.

H::WC also provides additional preprocessing steps that may be
explicitly enabled by dialect modules.

strip_aname
    Removes any anchor elements that do not contain an "href"
    attribute.

caption2para
    Removes table captions and reinserts them as paragraphs before the
    table.

Dialects may apply these optional preprocessing steps by calling them
as methods on the dialect object inside "preprocess_node". For example:

    sub preprocess_node {
        my( $self, $node ) = @_;
        $self->strip_aname($node);
        $self->caption2para($node);
    }

Postprocessing
Once the work of converting HTML is complete, it is sometimes useful to
postprocess the resulting wiki markup. Postprocessing can be used to
clean up whitespace, fix subtle bugs introduced in the markup during
conversion, etc.

Dialects that want to postprocess the wiki markup should define a
"postprocess_output" method that will be called just before the
"html2wiki" method returns to the client. The method will be passed two
arguments, the H::WC object and a reference to the wiki markup. The
method may modify the wiki markup that the reference points to; its
return value is discarded.

For example, to replace a series of line breaks with a pair of
newlines, a dialect might implement this:

    sub postprocess_output {
        my( $self, $outref ) = @_;
        $$outref =~ s/<br>\s*<br>/\n\n/gs;
    }

(This example assumes that HTML line breaks were replaced with "<br>"
in the wiki markup.)
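
A whitespace cleanup pass follows the same pattern. The sketch below
uses the documented calling convention but is called as a plain function
with "undef" in place of the H::WC object, so it runs without the module
installed; the particular cleanups are illustrative choices, not
something H::WC requires.

```perl
use strict;
use warnings;

# Same calling convention as postprocess_output: the invocant comes first,
# followed by a reference to the wiki markup, which is edited in place.
sub postprocess_output {
    my ( $self, $outref ) = @_;
    $$outref =~ s/[ \t]+$//mg;      # strip trailing spaces and tabs on every line
    $$outref =~ s/\n{3,}/\n\n/g;    # collapse runs of blank lines into one
}

my $wiki = "Intro  \n\n\n\n* item\t\n";
postprocess_output( undef, \$wiki );
print $wiki;
```

Because the markup is modified through the reference, nothing needs to
be returned.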

Dialect utility methods
H::WC defines a set of utility methods that dialect modules may find
useful.

get_elem_contents

    my $wiki = $wc->get_elem_contents( $node );

Converts the contents of $node into wiki markup and returns the
resulting wiki markup.

get_wiki_page

    my $title = $wc->get_wiki_page( $url );

Attempts to extract the title of a wiki page from the given URL,
returning the title on success, "undef" on failure. If "wiki_uri" is
empty, this method always returns "undef". See "ATTRIBUTES" in
HTML::WikiConverter for details on how the "wiki_uri" attribute is
interpreted.

is_camel_case

    my $ok = $wc->is_camel_case( $str );

Returns true if $str is in CamelCase, false otherwise. CamelCase-ness
is determined using the same rules that Kwiki's formatting module uses.

get_attr_str

    my $attr_str = $wc->get_attr_str( $node, @attrs );

Returns a string containing the specified attributes in the given node.
The returned string is suitable for insertion into an HTML tag. For
example, if $node contains the HTML

    <span id="ht" class="head" onclick="editPage()">Header</span>

and @attrs contains "id" and "class", then "get_attr_str()" will return
'id="ht" class="head"'.

_attr

    my $value = $wc->_attr( $name );

Returns the value of the named attribute. This is rarely needed since
you can access attribute values by treating the attribute name as a
method (i.e., "$wc->$name"). This low-level method of accessing
attributes is provided for when you need to override an attribute's
accessor/mutator method, as in:

    sub attributes { {
        my_attr => { default => 1 },
    } }

    sub my_attr {
        my( $wc, $name, $new_value ) = @_;
        # do something special
        return $wc->_attr( $name => $new_value );
    }

AUTHOR
David J. Iberri <diberri@cpan.org>

COPYRIGHT
Copyright 2006 David J. Iberri, all rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.34.0                      2022-01-21  HTML::WikiConverter::Dialects(3)