Unicode::LineBreak(3)   User Contributed Perl Documentation   Unicode::LineBreak(3)

NAME
       Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm

SYNOPSIS
           use Unicode::LineBreak;
           $lb = Unicode::LineBreak->new();
           $broken = $lb->break($string);

DESCRIPTION
       Unicode::LineBreak performs the Line Breaking Algorithm described in
       Unicode Standard Annex #14 [UAX #14]. The informative East_Asian_Width
       property defined by Annex #11 [UAX #11] is also taken into account
       when determining breaking positions.

   Terminology
       The following terms are used for convenience.

       A mandatory break is obligatory line breaking behavior defined by the
       core rules and performed regardless of surrounding characters. An
       arbitrary break is line breaking behavior that the core rules allow
       and that the user may choose to perform. Arbitrary breaks include the
       direct breaks and indirect breaks defined by [UAX #14].

       Alphabetic characters are characters between which line breaks are
       usually not allowed, unless other characters provide break
       opportunities. Ideographic characters are characters that usually
       allow line breaks both before and after themselves. [UAX #14]
       classifies most alphabetic characters as AL and most ideographic
       characters as ID. (These terms are inaccurate from the point of view
       of grammatology.) In several scripts, breaking positions cannot be
       determined from the characters alone, so a dictionary-based heuristic
       is used.

       The number of columns of a string is not always equal to the number
       of characters it contains: each character is either wide, narrow or
       nonspacing, occupying 2, 1 or 0 columns respectively. Some characters
       may be either wide or narrow depending on the context in which they
       are used. Customization may give characters still other widths.
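
The difference between character count and column count can be checked with the companion Unicode::GCString module. The following is a minimal sketch, assuming that module is installed and that its "chars" and "columns" methods behave as its own documentation describes; the sample strings are arbitrary.

```perl
use strict;
use warnings;
use Unicode::GCString;

# "A" is a narrow character; the two CJK ideographs below are wide.
my $narrow = Unicode::GCString->new("A");
my $wide   = Unicode::GCString->new("\x{6F22}\x{5B57}");

printf "narrow: %d character(s), %d column(s)\n", $narrow->chars, $narrow->columns;
printf "wide:   %d character(s), %d column(s)\n", $wide->chars,   $wide->columns;
```

With the default (non-tailored) settings, the narrow string should occupy one column and the two ideographs four.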

PUBLIC INTERFACE
   Line Breaking
       new ([KEY => VALUE, ...])
           Constructor. For the KEY => VALUE pairs, see "Options".

       break (STRING)
           Instance method. Breaks the Unicode string STRING and returns
           the result. In array context, returns an array of the lines
           contained in the result.

       break_partial (STRING)
           Instance method. Same as break() but accepts incremental input.
           Pass "undef" as the STRING argument to indicate that the input
           is complete.

       config (KEY)
       config (KEY => VALUE, ...)
           Instance method. Gets or updates the configuration. For the
           KEY => VALUE pairs, see "Options".

       copy
           Copy constructor. Creates a copy of the object instance.
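
The methods above can be exercised together as in the following sketch. It is illustrative only: the sample text, the chunking rule and the ColMax values are arbitrary choices, and the default "SIMPLE" format is assumed.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $text = join ' ', ("The quick brown fox jumps over the lazy dog.") x 3;
my $lb   = Unicode::LineBreak->new(ColMax => 30);

# Scalar context: one string with newlines inserted at the chosen breaks.
my $folded = $lb->break($text);

# List context: the same result, one element per line.
my @lines = $lb->break($text);

# Incremental input: feed chunks, then pass undef to flush the remainder.
my $partial = '';
for my $chunk (split /(?<=\. )/, $text) {
    $partial .= $lb->break_partial($chunk);
}
$partial .= $lb->break_partial(undef);

# copy() gives an independent object; config() reads or updates options.
my $narrow = $lb->copy;
$narrow->config(ColMax => 20);

printf "original ColMax=%s, copy ColMax=%s\n",
    $lb->config('ColMax'), $narrow->config('ColMax');
print $folded, "\n";
```

Because "SIMPLE" only inserts newlines, removing them from either result should give back the input text.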

   Getting Information
       breakingRule (BEFORESTR, AFTERSTR)
           Instance method. Gets the possible line breaking behavior
           between the strings BEFORESTR and AFTERSTR. See "Constants" for
           the returned value.

           Note: This method gives only an approximate description of line
           breaking behavior. Use break() and related methods to wrap
           actual text.

       context ([Charset => CHARSET], [Language => LANGUAGE])
           Function. Gets the language/region context used by the
           character set CHARSET or the language LANGUAGE.
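
As a rough illustration of these two calls, the sketch below maps the value returned by breakingRule() back to the names listed under "Constants" and asks context() about a language tag. The constants are written fully qualified so that no particular import list is assumed; the sample character pairs and the language tag "ja" are arbitrary.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new();

# Map the four documented behavior values back to their names.
my %name = (
    Unicode::LineBreak::MANDATORY()  => 'MANDATORY',
    Unicode::LineBreak::DIRECT()     => 'DIRECT',
    Unicode::LineBreak::INDIRECT()   => 'INDIRECT',
    Unicode::LineBreak::PROHIBITED() => 'PROHIBITED',
);

my @rules;
for my $pair (["a", "b"], ["(", "a"], ["\x{6F22}", "\x{5B57}"]) {
    my $rule = $lb->breakingRule(@$pair);
    push @rules, $rule;
    printf "U+%04X / U+%04X: %s\n", ord $pair->[0], ord $pair->[1],
        defined $rule ? ($name{$rule} // 'unknown') : 'undef';
}

# context() is a plain function, not a method.
my $context = Unicode::LineBreak::context(Language => 'ja');
print "context for 'ja': $context\n";
```

As noted above, the reported behaviors are only approximate; break() remains the way to wrap real text.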

   Options
       The "new" and "config" methods accept the following pairs. Some of
       them affect the number of columns ([E]), grapheme cluster
       segmentation ([G]) (see also Unicode::GCString) or line breaking
       behavior ([L]).

       BreakIndent => "YES" | "NO"
           [L] Always allows a break after SPACEs at the beginning of a
           line, a.k.a. indent. [UAX #14] does not take such usage of
           SPACE into account. Default is "YES".

           Note: This option was introduced in release 1.011.

       CharMax => NUMBER
           [L] Maximum number of characters allowed in one line, not
           counting trailing SPACEs and the newline sequence. Note that
           the number of characters generally does not represent the
           length of a line. Default is 998. 0 means unlimited (as of
           release 2012.01).

       ColMin => NUMBER
           [L] Minimum number of columns that an arbitrarily broken line
           may contain, not counting trailing SPACEs and the newline
           sequence. Default is 0.

       ColMax => NUMBER
           [L] Maximum number of columns a line may contain, not counting
           trailing SPACEs and the newline sequence. In other words, the
           maximum length of a line. Default is 76.

           See also the "Urgent" option and "User-Defined Breaking
           Behaviors".

       ComplexBreaking => "YES" | "NO"
           [L] Performs heuristic breaking in South East Asian complex
           contexts. Default is "YES" if word segmentation for South East
           Asian writing systems is enabled.

       Context => CONTEXT
           [E][L] Specifies the language/region context. Currently
           available contexts are "EASTASIAN" and "NONEASTASIAN". Default
           is "NONEASTASIAN".

           In the "EASTASIAN" context, characters whose East_Asian_Width
           property is ambiguous (A) are treated as "wide", and characters
           of Line Breaking Class AI are treated as ideographic (ID).

           In the "NONEASTASIAN" context, characters whose
           East_Asian_Width property is ambiguous (A) are treated as
           "narrow", and characters of Line Breaking Class AI are treated
           as alphabetic (AL).

       EAWidth => "[" ORD "=>" PROPERTY "]"
       EAWidth => "undef"
           [E] Tailors the classification of the East_Asian_Width
           property. ORD is the UCS scalar value of a character, or an
           array reference of such values. PROPERTY is one of the
           East_Asian_Width property values or extended values (see
           "Constants"). This option may be specified multiple times. If
           "undef" is specified, all tailoring assigned before is
           canceled.

           By default, no tailoring is applied. See also "Tailoring
           Character Properties".

       Format => METHOD
           [L] Specifies the method used to format broken lines.

           "SIMPLE"
               Default method. Only inserts a newline at arbitrary
               breaking positions.

           "NEWLINE"
               Inserts newline sequences, or replaces existing ones, with
               the one specified by the "Newline" option, and removes
               SPACEs preceding newline sequences or end-of-text. Then
               appends a newline at the end of the text if one does not
               exist.

           "TRIM"
               Inserts a newline at arbitrary breaking positions and
               removes SPACEs preceding newline sequences.

           "undef"
               Does nothing; not even newlines are inserted.

           Subroutine reference
               See "Formatting Lines".

       HangulAsAL => "YES" | "NO"
           [L] Treats hangul syllables and conjoining jamos as alphabetic
           characters (AL). Default is "NO".

       LBClass => "[" ORD "=>" CLASS "]"
       LBClass => "undef"
           [G][L] Tailors the classification of the line breaking
           property. ORD is the UCS scalar value of a character, or an
           array reference of such values. CLASS is one of the line
           breaking classes (see "Constants"). This option may be
           specified multiple times. If "undef" is specified, all
           tailoring assigned before is canceled.

           By default, no tailoring is applied. See also "Tailoring
           Character Properties".

       LegacyCM => "YES" | "NO"
           [G][L] Treats combining characters led by a SPACE as an
           isolated combining character (ID). As of Unicode 5.0, such use
           of SPACE is not recommended. Default is "YES".

       Newline => STRING
           [L] Unicode string to be used as the newline sequence. Default
           is "\n".

       Prep => METHOD
           [L] Adds user-defined line breaking behavior(s). This option
           may be specified multiple times. The following methods are
           available.

           "NONBREAKURI"
               Does not break URIs.

           "BREAKURI"
               Breaks URIs according to a rule suitable for printed
               materials. For more details see [CMOS], sections 6.17 and
               17.11.

           "[" REGEX, SUBREF "]"
               Sequences matching the regular expression REGEX are broken
               by the subroutine referred to by SUBREF. For more details
               see "User-Defined Breaking Behaviors".

           "undef"
               Cancels all methods assigned before.

       Sizing => METHOD
           [L] Specifies the method used to calculate the size of a
           string. The following options are available.

           "UAX11"
               Default method. Sizes are computed from the columns of
               each character according to the built-in character
               database.

           "undef"
               The number of grapheme clusters (see Unicode::GCString)
               contained in the string.

           Subroutine reference
               See "Calculating String Size".

           See also the "ColMax", "ColMin" and "EAWidth" options.

       Urgent => METHOD
           [L] Specifies the method used to handle excessive lines. The
           following options are available.

           "CROAK"
               Prints an error message and dies.

           "FORCE"
               Forces a break within the excessive fragment.

           "undef"
               Default method. Does not break the excessive fragment.

           Subroutine reference
               See "User-Defined Breaking Behaviors".

       ViramaAsJoiner => "YES" | "NO"
           [G] A virama sign ("halant" in Hindi, "coeng" in Khmer) and the
           letter following it are not separated. Default is "YES". Note:
           This option was introduced in release 2012.001_29. In earlier
           releases, the behavior was fixed to "NO". The "default"
           grapheme cluster defined by [UAX #29] does not include this
           feature.
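
Several of the options above can be combined in one constructor call. The sketch below is only an illustration under assumed inputs: the URL, the long word and the ColMax values are made up, and two separate objects are used so that the "Prep" and "Urgent" settings do not interact.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $url  = 'http://www.example.org/a/rather/long/path/to/some/page.html';
my $word = 'Pneumonoultramicroscopicsilicovolcanoconiosis';

# NEWLINE formatting with CRLF as the newline, keeping URIs in one piece.
my $mail = Unicode::LineBreak->new(
    ColMax  => 24,
    Format  => 'NEWLINE',
    Newline => "\r\n",
    Prep    => 'NONBREAKURI',
);
my $folded = $mail->break("See $url for the details.");

# FORCE splits a fragment that would otherwise exceed ColMax.
my $strict = Unicode::LineBreak->new(ColMax => 10, Urgent => 'FORCE');
my @pieces = map { my $line = "$_"; $line =~ s/\n\z//; $line }
             $strict->break($word);

print "$_\n" for @pieces;
```

With the default "Urgent" setting the URL line simply stays longer than ColMax; "CROAK" could be used instead where an over-long line should be treated as an error.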

   Constants
       "EA_Na", "EA_N", "EA_A", "EA_W", "EA_H", "EA_F"
           Index values specifying the six East_Asian_Width property
           values defined by [UAX #11]: narrow (Na), neutral (N),
           ambiguous (A), wide (W), halfwidth (H) and fullwidth (F).

       "EA_Z"
           Index value specifying nonspacing characters.

           Note: This "nonspacing" value is an extension by this module,
           not a part of [UAX #11].

       "LB_BK", "LB_CR", "LB_LF", "LB_NL", "LB_SP", "LB_OP", "LB_CL", "LB_CP",
       "LB_QU", "LB_GL", "LB_NS", "LB_EX", "LB_SY", "LB_IS", "LB_PR", "LB_PO",
       "LB_NU", "LB_AL", "LB_HL", "LB_ID", "LB_IN", "LB_HY", "LB_BA", "LB_BB",
       "LB_B2", "LB_CB", "LB_ZW", "LB_CM", "LB_WJ", "LB_H2", "LB_H3", "LB_JL",
       "LB_JV", "LB_JT", "LB_SG", "LB_AI", "LB_CJ", "LB_SA", "LB_XX", "LB_RI"
           Index values specifying the 40 line breaking property values
           (classes) defined by [UAX #14].

           Note: Property value CP was introduced in Unicode 5.2.0.
           Property values HL and CJ were introduced in Unicode 6.1.0.
           Property value RI was introduced in Unicode 6.2.0.

       "MANDATORY", "DIRECT", "INDIRECT", "PROHIBITED"
           Four values specifying line breaking behaviors: mandatory
           break; both direct break and indirect break are allowed;
           indirect break is allowed but direct break is prohibited;
           prohibited break.

       "Unicode::LineBreak::SouthEastAsian::supported"
           Flag to determine whether word segmentation for South East
           Asian writing systems is enabled. If this feature is enabled,
           a non-empty string is set. Otherwise, "undef" is set.

           N.B.: The current release supports only the Thai script of the
           modern Thai language.

       "UNICODE_VERSION"
           A string specifying the version of the Unicode standard that
           this module refers to.

CUSTOMIZATION
   Formatting Lines
       If you specify a subroutine reference as the value of the "Format"
       option, it should accept three arguments:

           $MODIFIED = &subroutine(SELF, EVENT, STR);

       SELF is a Unicode::LineBreak object, EVENT is a string indicating
       the context in which the subroutine was called, and STR is a
       fragment of the Unicode string preceding or following the breaking
       position.

           EVENT |When Fired            |Value of STR
           -----------------------------------------------------------------
           "sot" |Beginning of text     |Fragment of first line
           "sop" |After mandatory break |Fragment of next line
           "sol" |After arbitrary break |Fragment of continuation line
           ""    |Just before any       |Complete line without trailing
                 |breaks                |SPACEs
           "eol" |Arbitrary break       |SPACEs preceding breaking position
           "eop" |Mandatory break       |Newline and its preceding SPACEs
           "eot" |End of text           |SPACEs (and newline) at end of
                 |                      |text
           -----------------------------------------------------------------

       The subroutine should return the modified text fragment, or may
       return "undef" to indicate that no modification occurred. Note that
       a modification in the context of "sot", "sop" or "sol" may affect
       the decision of subsequent breaking positions, while modifications
       in the other contexts do not.

       Note: String arguments are actually sequences of grapheme clusters.
       See Unicode::GCString.

       For example, the following code folds lines, removing trailing
       spaces:

           sub fmt {
               # At "eol", "eop" and "eot", replace the SPACEs (and any
               # newline) with a single newline.
               if ($_[1] =~ /^eo/) {
                   return "\n";
               }
               return undef;
           }
           my $lb = Unicode::LineBreak->new(Format => \&fmt);
           $output = $lb->break($text);
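
To see in which order the events from the table above fire, a callback can simply record them. This is a minimal sketch: trace_events is a hypothetical helper name, and because it always returns "undef" the text is passed through unmodified (no newlines are inserted).

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my @events;

# Hypothetical tracing callback: records each EVENT, leaves STR unchanged.
sub trace_events {
    my ($self, $event, $str) = @_;
    push @events, $event;
    return undef;    # undef means "no modification"
}

my $text = "The quick brown fox jumps over the lazy dog.";
my $lb   = Unicode::LineBreak->new(ColMax => 20, Format => \&trace_events);
my $out  = $lb->break($text);

# Show the empty-string event as "" so that it is visible in the output.
print join(' ', map { length $_ ? $_ : '""' } @events), "\n";
```

The recorded sequence should start with "sot" and end with "eot", with "eol" and "sol" appearing wherever an arbitrary break was chosen.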

   User-Defined Breaking Behaviors
       When a line generated by an arbitrary break is expected to fall
       outside the limits of CharMax, ColMax or ColMin, an urgent break
       may be performed on the following string. If you specify a
       subroutine reference as the value of the "Urgent" option, it should
       accept two arguments:

           @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is the Unicode string
       to be broken.

       The subroutine should return an array of the pieces into which STR
       is broken.

       Note: The string argument is actually a sequence of grapheme
       clusters. See Unicode::GCString.

       For example, the following code inserts hyphens into the names of
       certain chemical substances (such as Titin) so that they may be
       folded:

           sub hyphenize {
               return map {$_ =~ s/yl$/yl-/; $_} split /(\w+?yl(?=\w))/, $_[1];
           }
           my $lb = Unicode::LineBreak->new(Urgent => \&hyphenize);
           $output = $lb->break("Methionylthreonylthreonylglutaminylarginyl...");

       If you specify a [REGEX, SUBREF] array reference as a value of the
       "Prep" option, the subroutine should accept two arguments:

           @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is the Unicode string
       matched by REGEX.

       The subroutine should return an array of the pieces into which STR
       is broken.

       For example, the following code breaks HTTP URLs using the [CMOS]
       rule.

           my $url = qr{http://[\x21-\x7E]+}i;
           sub breakurl {
               my $self = shift;
               my $str = shift;
               return split m{(?<=[/]) (?=[^/]) |
                              (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                              (?<=[=&]) (?=.)}x, $str;
           }
           my $lb = Unicode::LineBreak->new(Prep => [$url, \&breakurl]);
           $output = $lb->break($string);

     Preserving State

       A Unicode::LineBreak object can behave as a hash reference. Any
       items may be preserved throughout its life.

       For example, the following code separates paragraphs with empty
       lines.

           sub paraformat {
               my $self = shift;
               my $action = shift;
               my $str = shift;

               if ($action eq 'sot' or $action eq 'sop') {
                   $self->{'line'} = '';
               } elsif ($action eq '') {
                   $self->{'line'} = $str;
               } elsif ($action eq 'eol') {
                   return "\n";
               } elsif ($action eq 'eop') {
                   if (length $self->{'line'}) {
                       return "\n\n";
                   } else {
                       return "\n";
                   }
               } elsif ($action eq 'eot') {
                   return "\n";
               }
               return undef;
           }
           my $lb = Unicode::LineBreak->new(Format => \&paraformat);
           $output = $lb->break($string);

   Calculating String Size
       If you specify a subroutine reference as the value of the "Sizing"
       option, it will be called with five arguments:

           $COLS = &subroutine(SELF, LEN, PRE, SPC, STR);

       SELF is a Unicode::LineBreak object, LEN is the size of the
       preceding string, PRE is the preceding Unicode string, SPC is the
       additional SPACEs and STR is the Unicode string to be processed.

       The subroutine should return the calculated number of columns of
       "PRE.SPC.STR". The number of columns need not be an integer: its
       unit may be freely chosen, but it should be the same as that of the
       "ColMin" and "ColMax" options.

       Note: String arguments are actually sequences of grapheme clusters.
       See Unicode::GCString.

       For example, the following code processes lines with tab stops at
       every eight columns.

           sub tabbedsizing {
               my ($self, $cols, $pre, $spc, $str) = @_;

               my $spcstr = $spc.$str;
               # Consume leading SPACE-class clusters one at a time.
               while ($spcstr->lbc == LB_SP) {
                   my $c = $spcstr->item(0);
                   if ($c eq "\t") {
                       # Advance to the next multiple of eight columns.
                       $cols += 8 - $cols % 8;
                   } else {
                       $cols += $c->columns;
                   }
                   $spcstr = $spcstr->substr(1);
               }
               $cols += $spcstr->columns;
               return $cols;
           }
           my $lb = Unicode::LineBreak->new(LBClass => [ord("\t") => LB_SP],
                                            Sizing => \&tabbedsizing);
           $output = $lb->break($string);

   Tailoring Character Properties
       Character properties may be tailored with the "LBClass" and
       "EAWidth" options. Some constants are defined for convenience of
       tailoring.

     Line Breaking Properties

       Non-starters of Kana-like Characters

       By default, several hiragana, katakana and characters corresponding
       to kana are treated as non-starters (NS or CJ). When the following
       pair(s) are specified as the value of the "LBClass" option, these
       characters are treated as normal ideographic characters (ID).

       "KANA_NONSTARTERS() => LB_ID"
           All of the characters below.

       "IDEOGRAPHIC_ITERATION_MARKS() => LB_ID"
           Ideographic iteration marks: U+3005 IDEOGRAPHIC ITERATION MARK,
           U+303B VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D HIRAGANA
           ITERATION MARK, U+309E HIRAGANA VOICED ITERATION MARK, U+30FD
           KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION
           MARK.

           N.B. Some of them are neither hiragana nor katakana.

       "KANA_SMALL_LETTERS() => LB_ID"
       "KANA_PROLONGED_SOUND_MARKS() => LB_ID"
           Hiragana or katakana small letters: Hiragana small letters
           U+3041 A, U+3043 I, U+3045 U, U+3047 E, U+3049 O, U+3063 TU,
           U+3083 YA, U+3085 YU, U+3087 YO, U+308E WA, U+3095 KA, U+3096
           KE. Katakana small letters U+30A1 A, U+30A3 I, U+30A5 U, U+30A7
           E, U+30A9 O, U+30C3 TU, U+30E3 YA, U+30E5 YU, U+30E7 YO, U+30EE
           WA, U+30F5 KA, U+30F6 KE. Katakana phonetic extensions U+31F0
           KU - U+31FF RO. Halfwidth katakana small letters U+FF67 A -
           U+FF6F TU.

           Hiragana or katakana prolonged sound marks: U+30FC
           KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 HALFWIDTH
           KATAKANA-HIRAGANA PROLONGED SOUND MARK.

           N.B. These letters are optionally treated either as
           non-starters or as normal ideographic characters. See [JIS X
           4051] 6.1.1, [JLREQ] 3.1.7 or [UAX #14].

           N.B. U+3095, U+3096, U+30F5 and U+30F6 are considered to be
           neither hiragana nor katakana.

       "MASU_MARK() => LB_ID"
           U+303C MASU MARK.

           N.B. Although this character is not kana, it is usually
           regarded as an abbreviation of the sequence of hiragana ま す
           or katakana マ ス, MA and SU.

           N.B. This character is classified as a non-starter (NS) by
           [UAX #14] and as the class corresponding to ID by [JIS X 4051]
           and [JLREQ].

       Ambiguous Quotation Marks

       By default, some punctuation marks are ambiguous quotation marks
       (QU).

       "BACKWARD_QUOTES() => LB_OP, FORWARD_QUOTES() => LB_CL"
           Some languages (Dutch, English, Italian, Portuguese, Spanish,
           Turkish and most East Asian languages) use rotated-9-style
           marks (‘ “) as opening and 9-style marks (’ ”) as closing
           quotation marks.

       "FORWARD_QUOTES() => LB_OP, BACKWARD_QUOTES() => LB_CL"
           Some others (Czech, German and Slovak) use 9-style marks (’ ”)
           as opening and rotated-9-style marks (‘ “) as closing quotation
           marks.

       "BACKWARD_GUILLEMETS() => LB_OP, FORWARD_GUILLEMETS() => LB_CL"
           French, Greek, Russian etc. use left-pointing guillemets (« ‹)
           as opening and right-pointing guillemets (» ›) as closing
           quotation marks.

       "FORWARD_GUILLEMETS() => LB_OP, BACKWARD_GUILLEMETS() => LB_CL"
           German and Slovak use right-pointing guillemets (» ›) as
           opening and left-pointing guillemets (« ‹) as closing quotation
           marks.

       Danish, Finnish, Norwegian and Swedish use 9-style or
       right-pointing marks (’ ” » ›) as both opening and closing
       quotation marks.

       IDEOGRAPHIC SPACE

       "IDEOGRAPHIC_SPACE() => LB_BA"
           U+3000 IDEOGRAPHIC SPACE will not be placed at the beginning of
           a line. This is the default behavior.

       "IDEOGRAPHIC_SPACE() => LB_ID"
           IDEOGRAPHIC SPACE can be placed at the beginning of a line.
           This was the default behavior in Unicode 6.2 and earlier.

       "IDEOGRAPHIC_SPACE() => LB_SP"
           IDEOGRAPHIC SPACE will not be placed at the beginning of a
           line, and will protrude from the end of a line.
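
The effect of an "LBClass" tailoring can be checked through Unicode::GCString, whose "lbc" method reports the class of the first character of a string. The sketch below assumes that the tailoring constants are reachable by their fully qualified names (which avoids relying on any import list) and that a Unicode::LineBreak object may be passed as the second argument of Unicode::GCString->new, as that module documents.

```perl
use strict;
use warnings;
use Unicode::LineBreak;
use Unicode::GCString;

my $ideographic_space = "\x{3000}";

# Tailor U+3000 IDEOGRAPHIC SPACE to class SP.
my $lb = Unicode::LineBreak->new(
    LBClass => [
        Unicode::LineBreak::IDEOGRAPHIC_SPACE() => Unicode::LineBreak::LB_SP()
    ],
);

# The second argument makes the string use the tailored object.
my $default  = Unicode::GCString->new($ideographic_space);
my $tailored = Unicode::GCString->new($ideographic_space, $lb);

printf "default class index:  %d\n", $default->lbc;
printf "tailored class index: %d (LB_SP is %d)\n",
    $tailored->lbc, Unicode::LineBreak::LB_SP();
```

The other pairs listed above, such as the quotation mark constants, can be passed to "LBClass" in the same way.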

     East_Asian_Width Properties

       Some particular letters of the Latin, Greek and Cyrillic scripts
       have the ambiguous (A) East_Asian_Width property, so these
       characters are treated as wide in the "EASTASIAN" context. When
       "EAWidth => [ AMBIGUOUS_"*"() => EA_N ]" is specified, those
       characters are always treated as narrow.

       "AMBIGUOUS_ALPHABETICS() => EA_N"
           Treats all of the characters below as East_Asian_Width neutral
           (N).

       "AMBIGUOUS_CYRILLIC() => EA_N"
       "AMBIGUOUS_GREEK() => EA_N"
       "AMBIGUOUS_LATIN() => EA_N"
           Treats letters of the Cyrillic, Greek and Latin scripts that
           have ambiguous (A) width as neutral (N).

       On the other hand, although several characters were occasionally
       rendered as wide characters by a number of implementations for East
       Asian character sets, they are given the narrow (Na)
       East_Asian_Width property just because they have fullwidth (F)
       compatibility characters. When "EAWidth" is specified as below,
       those characters are treated as ambiguous, that is, wide in the
       "EASTASIAN" context.

       "QUESTIONABLE_NARROW_SIGNS() => EA_A"
           U+00A2 CENT SIGN, U+00A3 POUND SIGN, U+00A5 YEN SIGN (or yuan
           sign), U+00A6 BROKEN BAR, U+00AC NOT SIGN, U+00AF MACRON.
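
The following sketch compares column counts with and without "EAWidth" tailoring in the "EASTASIAN" context. It is illustrative only: the two sample characters are arbitrary, the constants are written fully qualified so that no import list is assumed, and utf8::upgrade is applied because a Latin-1 range character is not otherwise stored as a Unicode string.

```perl
use strict;
use warnings;
use Unicode::LineBreak;
use Unicode::GCString;

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA, ambiguous (A)
my $yen   = "\x{00A5}";    # YEN SIGN, narrow (Na)
utf8::upgrade($yen);       # make sure it is passed as a Unicode string

my $plain    = Unicode::LineBreak->new(Context => 'EASTASIAN');
my $tailored = Unicode::LineBreak->new(
    Context => 'EASTASIAN',
    EAWidth => [
        Unicode::LineBreak::AMBIGUOUS_GREEK() => Unicode::LineBreak::EA_N()
    ],
    EAWidth => [
        Unicode::LineBreak::QUESTIONABLE_NARROW_SIGNS()
            => Unicode::LineBreak::EA_A()
    ],
);

# Returns the column count of a string under the given object's settings.
sub cols { Unicode::GCString->new($_[0], $_[1])->columns }

printf "alpha: %d -> %d column(s)\n", cols($alpha, $plain), cols($alpha, $tailored);
printf "yen:   %d -> %d column(s)\n", cols($yen, $plain),   cols($yen, $tailored);
```

Under these assumptions the Greek letter should shrink from two columns to one, and the yen sign should grow from one column to two.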

   Configuration File
       Built-in defaults of the option parameters for the "new" and
       "config" methods can be overridden by the configuration file
       Unicode/LineBreak/Defaults.pm. For more details, read
       Unicode/LineBreak/Defaults.pm.sample.

BUGS
       Please report bugs or buggy behaviors to the developer.

       CPAN Request Tracker:
       <http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-LineBreak>.

VERSION
       Consult the $VERSION variable.

   Incompatible Changes
       Release 2012.06
           •   The eawidth() method was deprecated. "columns" in
               Unicode::GCString may be used instead.

           •   The lbclass() method was deprecated. Use "lbc" in
               Unicode::GCString or "lbcext" in Unicode::GCString.

   Conformance to Standards
       The character properties this module is based on are defined by
       the Unicode Standard version 8.0.0.

       This module is intended to implement UAX14-C2.

IMPLEMENTATION NOTES
       •   Some ideographic characters may be treated either as NS or as
           ID by choice.

       •   Hangul syllables and conjoining jamos may be treated as either
           ID or AL by choice.

       •   Characters assigned to AI may be resolved to either AL or ID by
           choice.

       •   Character(s) assigned to CB are not resolved.

       •   Characters assigned to CJ are always resolved to NS. A more
           flexible tailoring mechanism is provided.

       •   When word segmentation for South East Asian writing systems is
           not supported, characters assigned to SA are resolved to AL,
           except that characters that have the Grapheme_Cluster_Break
           property value Extend or SpacingMark are resolved to CM.

       •   Characters assigned to SG or XX are resolved to AL.

       •   Code points in the following UCS ranges are given fixed
           property values even if they have not been assigned any
           characters.

               Ranges             | UAX #14    | UAX #11    | Description
               -------------------------------------------------------------
               U+20A0..U+20CF     | PR [*1]    | N [*2]     | Currency symbols
               U+3400..U+4DBF     | ID         | W          | CJK ideographs
               U+4E00..U+9FFF     | ID         | W          | CJK ideographs
               U+D800..U+DFFF     | AL (SG)    | N          | Surrogates
               U+E000..U+F8FF     | AL (XX)    | F or N (A) | Private use
               U+F900..U+FAFF     | ID         | W          | CJK ideographs
               U+20000..U+2FFFD   | ID         | W          | CJK ideographs
               U+30000..U+3FFFD   | ID         | W          | Old hanzi
               U+F0000..U+FFFFD   | AL (XX)    | F or N (A) | Private use
               U+100000..U+10FFFD | AL (XX)    | F or N (A) | Private use
               Other unassigned   | AL (XX)    | N          | Unassigned,
                                  |            |            | reserved or
                                  |            |            | noncharacters
               -------------------------------------------------------------
               [*1] Except U+20A7 PESETA SIGN (PO),
                 U+20B6 LIVRE TOURNOIS SIGN (PO), U+20BB NORDIC MARK SIGN (PO)
                 and U+20BE LARI SIGN (PO).
               [*2] Except U+20A9 WON SIGN (H) and U+20AC EURO SIGN
                 (F or N (A)).

       •   Characters belonging to General Category Mn, Me, Cc, Cf, Zl or
           Zp are treated as nonspacing by this module.

REFERENCES
       [CMOS]
           The Chicago Manual of Style, 15th edition. University of
           Chicago Press, 2003.

       [JIS X 4051]
           JIS X 4051:2004 日本語文書の組版方法 (Formatting Rules for
           Japanese Documents). Japanese Standards Association, 2004.

       [JLREQ]
           Anan, Yasuhiro et al. Requirements for Japanese Text Layout,
           W3C Working Group Note 3 April 2012.
           <http://www.w3.org/TR/2012/NOTE-jlreq-20120403/>.

       [UAX #11]
           A. Freytag (ed.) (2008-2009). Unicode Standard Annex #11: East
           Asian Width, Revisions 17-19.
           <http://unicode.org/reports/tr11/>.

       [UAX #14]
           A. Freytag and A. Heninger (eds.) (2008-2015). Unicode Standard
           Annex #14: Unicode Line Breaking Algorithm, Revisions 22-35.
           <http://unicode.org/reports/tr14/>.

       [UAX #29]
           Mark Davis (ed.) (2009-2013). Unicode Standard Annex #29:
           Unicode Text Segmentation, Revisions 15-23.
           <http://www.unicode.org/reports/tr29/>.

SEE ALSO
       Text::LineFold, Text::Wrap, Unicode::GCString.

AUTHOR
       Copyright (C) 2009-2018 Hatuka*nezumi - IKEDA Soji
       <hatuka(at)nezumi.nu>.

       This program is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself.

perl v5.34.0                     2022-01-21             Unicode::LineBreak(3)