Unicode::LineBreak(3pm)

Unicode::LineBreak(3) User Contributed Perl Documentation Unicode::LineBreak(3)

NAME

       Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm

SYNOPSIS

           use Unicode::LineBreak;
           $lb = Unicode::LineBreak->new();
           $broken = $lb->break($string);

DESCRIPTION

       Unicode::LineBreak performs the Line Breaking Algorithm described in
       Unicode Standard Annex #14 [UAX #14].  The informative
       East_Asian_Width property defined by Annex #11 [UAX #11] is also
       taken into account when determining breaking positions.

   Terminology
       The following terms are used for convenience.

       A mandatory break is an obligatory line break defined by the core
       rules and performed regardless of the surrounding characters.  An
       arbitrary break is a line break that the core rules allow and that
       the user chooses to perform.  Arbitrary breaks include the direct
       breaks and indirect breaks defined by [UAX #14].

       Alphabetic characters are characters between which line breaks are
       usually not allowed unless other characters provide break
       opportunities.  Ideographic characters are characters that usually
       allow line breaks both before and after themselves.  [UAX #14]
       classifies most alphabetic characters as AL and most ideographic
       characters as ID (these terms are inaccurate from the point of view
       of grammatology).  In several scripts, breaking positions cannot be
       determined from individual characters, so a dictionary-based
       heuristic is used.

       The number of columns of a string is not always equal to the number
       of characters it contains: each character is either wide, narrow or
       nonspacing, occupying 2, 1 or 0 columns, respectively.  Several
       characters may be either wide or narrow depending on the context in
       which they are used.  Characters may be given other widths by
       customization.

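       The difference between character count and column count can be
       checked with the companion class Unicode::GCString.  The following
       is a minimal sketch, assuming the default "NONEASTASIAN" context;
       the sample strings are chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::GCString;

# Three narrow letters: one column each.
my $narrow = Unicode::GCString->new("abc");

# Three CJK ideographs (U+65E5 U+672C U+8A9E): two columns each.
my $wide = Unicode::GCString->new("\x{65E5}\x{672C}\x{8A9E}");

# "e" followed by U+0301 COMBINING ACUTE ACCENT: the nonspacing mark
# adds no column, and both code points form a single grapheme cluster.
my $combined = Unicode::GCString->new("e\x{0301}");

printf "%d %d %d\n", $narrow->columns, $wide->columns, $combined->columns;
```
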

PUBLIC INTERFACE

   Line Breaking
       new ([KEY => VALUE, ...])
           Constructor.  For the KEY => VALUE pairs see "Options".

       break (STRING)
           Instance method.  Breaks the Unicode string STRING and returns
           the result.  In array context, returns an array of the lines
           contained in the result.

       break_partial (STRING)
           Instance method.  Same as break() but accepts incremental
           input.  Pass "undef" as the STRING argument to indicate that the
           input is complete.

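       Incremental use of break_partial() might look like the following
       sketch.  The chunk list is a hypothetical stand-in for input that
       arrives in pieces (for example from a pipe), and the ColMax value is
       arbitrary.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(ColMax => 20);

# Hypothetical input arriving in pieces.
my @chunks = ("The quick brown fox ", "jumps over ", "the lazy dog.");

my $out = '';
for my $chunk (@chunks) {
    # Each call returns whatever part of the result is already settled.
    $out .= $lb->break_partial($chunk);
}
# undef signals that the input is complete and flushes the remainder.
$out .= $lb->break_partial(undef);

print $out, "\n";
```

       With the default "SIMPLE" format only newlines are inserted, so
       removing them from the result gives back the concatenated input.
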
       config (KEY)
       config (KEY => VALUE, ...)
           Instance method.  Gets or updates the configuration.  For the
           KEY => VALUE pairs see "Options".

       copy
           Copy constructor.  Creates a copy of the object instance.

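       As a sketch of how config() and copy() fit together, the following
       derives a second object with a narrower limit.  It assumes, as the
       description of the copy constructor suggests, that the copy carries
       its own configuration; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(ColMax => 40);

# Derive a second object and change only its column limit.
my $narrow = $lb->copy;
$narrow->config(ColMax => 20);

# config(KEY) reads a single option back.
printf "original: %d, copy: %d\n",
    $lb->config('ColMax'), $narrow->config('ColMax');
```
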
   Getting Information
       breakingRule (BEFORESTR, AFTERSTR)
           Instance method.  Gets the possible line breaking behavior
           between the strings BEFORESTR and AFTERSTR.  See "Constants"
           for the returned value.

           Note: This method gives only an approximate description of the
           line breaking behavior.  Use break() and related methods to
           wrap actual text.

       context ([Charset => CHARSET], [Language => LANGUAGE])
           Function.  Gets the language/region context used by the
           character set CHARSET or the language LANGUAGE.

   Options
       The "new" and "config" methods accept the following pairs.  Some of
       them affect the number of columns ([E]), grapheme cluster
       segmentation ([G]) (see also Unicode::GCString) or line breaking
       behavior ([L]).

       BreakIndent => "YES" | "NO"
           [L] Always allows a break after SPACEs at the beginning of a
           line, a.k.a. indent.  [UAX #14] does not take such use of SPACE
           into account.  Default is "YES".

           Note: This option was introduced in release 1.011.

       CharMax => NUMBER
           [L] Maximum number of characters in one line, not counting
           trailing SPACEs and the newline sequence.  Note that the number
           of characters generally does not represent the length of a
           line.  Default is 998.  0 means unlimited (as of release
           2012.01).

       ColMin => NUMBER
           [L] Minimum number of columns that a line broken arbitrarily
           may include, not counting trailing spaces and newline
           sequences.  Default is 0.

       ColMax => NUMBER
           [L] Maximum number of columns a line may include, not counting
           trailing spaces and the newline sequence.  In other words, the
           maximum length of a line.  Default is 76.

       See also "Urgent" option and "User-Defined Breaking Behaviors".

       ComplexBreaking => "YES" | "NO"
           [L] Performs heuristic breaking in South East Asian complex
           contexts.  Default is "YES" if word segmentation for South East
           Asian writing systems is enabled.

       Context => CONTEXT
           [E][L] Specifies the language/region context.  Currently
           available contexts are "EASTASIAN" and "NONEASTASIAN".  The
           default context is "NONEASTASIAN".

           In the "EASTASIAN" context, characters with the East_Asian_Width
           property ambiguous (A) are treated as "wide", and characters
           with Line Breaking Class AI as ideographic (ID).

           In the "NONEASTASIAN" context, characters with the
           East_Asian_Width property ambiguous (A) are treated as "narrow",
           and characters with Line Breaking Class AI as alphabetic (AL).

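       The effect of "Context" on column counts can be sketched as follows.
       This assumes that the Unicode::GCString constructor accepts the same
       KEY => VALUE options, and uses U+03B1 GREEK SMALL LETTER ALPHA as a
       sample character with ambiguous (A) width.

```perl
use strict;
use warnings;
use Unicode::GCString;

# U+03B1 GREEK SMALL LETTER ALPHA has East_Asian_Width ambiguous (A).
my $alpha = "\x{03B1}";

my $ea    = Unicode::GCString->new($alpha, Context => 'EASTASIAN');
my $nonea = Unicode::GCString->new($alpha, Context => 'NONEASTASIAN');

# Wide in the East Asian context, narrow otherwise.
printf "EASTASIAN: %d, NONEASTASIAN: %d\n", $ea->columns, $nonea->columns;
```
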
       EAWidth => "[" ORD "=>" PROPERTY "]"
       EAWidth => "undef"
           [E] Tailors the classification of the East_Asian_Width property.
           ORD is the UCS scalar value of a character or an array
           reference of such values.  PROPERTY is one of the
           East_Asian_Width property values or extended values (see
           "Constants").  This option may be specified multiple times.  If
           "undef" is specified, all tailoring assigned before is
           canceled.

           By default, no tailoring is applied.  See also "Tailoring
           Character Properties".

       Format => METHOD
           [L] Specifies the method used to format broken lines.

           "SIMPLE"
               Default method.  Only inserts a newline at arbitrary
               breaking positions.

           "NEWLINE"
               Inserts newline sequences, or replaces existing ones, with
               the one specified by the "Newline" option, and removes
               SPACEs preceding newline sequences or end-of-text.  Then
               appends a newline at the end of the text if there is none.

           "TRIM"
               Inserts a newline at arbitrary breaking positions and
               removes SPACEs preceding newline sequences.

           "undef"
               Does nothing; not even newlines are inserted.

           Subroutine reference
               See "Formatting Lines".

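       A small sketch of the "NEWLINE" format combined with "ColMax"
       follows.  The text and the limit of 20 columns are arbitrary; the
       point is that trailing SPACEs are dropped and the result ends with a
       newline.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $text = "The quick brown fox jumps over the lazy dog";

my $lb = Unicode::LineBreak->new(
    ColMax  => 20,
    Format  => 'NEWLINE',
    Newline => "\n",
);
my $out = $lb->break($text);

print $out;
```
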
       HangulAsAL => "YES" | "NO"
           [L] Treats hangul syllables and conjoining jamos as alphabetic
           characters (AL).  Default is "NO".

       LBClass => "[" ORD "=>" CLASS "]"
       LBClass => "undef"
           [G][L] Tailors the classification of the line breaking
           property.  ORD is the UCS scalar value of a character or an
           array reference of such values.  CLASS is one of the line
           breaking classes (see "Constants").  This option may be
           specified multiple times.  If "undef" is specified, all
           tailoring assigned before is canceled.

           By default, no tailoring is applied.  See also "Tailoring
           Character Properties".

       LegacyCM => "YES" | "NO"
           [G][L] Treats a combining character preceded by a SPACE as an
           isolated combining character (ID).  As of Unicode 5.0, such use
           of SPACE is not recommended.  Default is "YES".

       Newline => STRING
           [L] Unicode string to be used as the newline sequence.  Default
           is "\n".

       Prep => METHOD
           [L] Adds user-defined line breaking behavior(s).  This option
           may be specified multiple times.  The following methods are
           available.

           "NONBREAKURI"
               Does not break URIs.

           "BREAKURI"
               Breaks URIs according to a rule suitable for printed
               materials.  For more details see [CMOS], sections 6.17 and
               17.11.

           "[" REGEX, SUBREF "]"
               Sequences matching the regular expression REGEX are broken
               by the subroutine referred to by SUBREF.  For more details
               see "User-Defined Breaking Behaviors".

           "undef"
               Cancels all methods assigned before.

       Sizing => METHOD
           [L] Specifies the method used to calculate the size of a
           string.  The following options are available.

           "UAX11"
               Default method.  Sizes are computed from the columns of each
               character according to the built-in character database.

           "undef"
               The size is the number of grapheme clusters (see
               Unicode::GCString) contained in the string.

           Subroutine reference
               See "Calculating String Size".

           See also "ColMax", "ColMin" and "EAWidth" options.

       Urgent => METHOD
           [L] Specifies the method used to handle excessively long lines.
           The following options are available.

           "CROAK"
               Prints an error message and dies.

           "FORCE"
               Forcibly breaks the excessively long fragment.

           "undef"
               Default method.  Does not break the excessively long
               fragment.

           Subroutine reference
               See "User-Defined Breaking Behaviors".

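       The difference between the default and "FORCE" can be sketched with
       a single word that is longer than "ColMax".  The word and the limit
       of 10 columns are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $word = "Methionylthreonylthreonylglutaminylarginyl";

# Default: a fragment without break opportunities is left as it is.
my $default = Unicode::LineBreak->new(ColMax => 10);
my @kept = split /\n/, $default->break($word);

# FORCE: the fragment is cut so that no line exceeds ColMax.
my $forced = Unicode::LineBreak->new(ColMax => 10, Urgent => 'FORCE');
my @cut = split /\n/, $forced->break($word);

print scalar(@kept), " line(s) by default, ", scalar(@cut), " with FORCE\n";
```
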
       ViramaAsJoiner => "YES" | "NO"
           [G] A virama sign ("halant" in Hindi, "coeng" in Khmer) and the
           letter following it are not broken apart.  Default is "YES".
           Note: This option was introduced in release 2012.001_29.  In
           previous releases, the behavior was fixed to "NO".  The
           "default" grapheme cluster defined by [UAX #29] does not
           include this feature.

   Constants
       "EA_Na", "EA_N", "EA_A", "EA_W", "EA_H", "EA_F"
           Index values specifying the six East_Asian_Width property values
           defined by [UAX #11]: narrow (Na), neutral (N), ambiguous (A),
           wide (W), halfwidth (H) and fullwidth (F).

       "EA_Z"
           Index value specifying nonspacing characters.

           Note: This "nonspacing" value is an extension by this module
           and not a part of [UAX #11].

       "LB_BK", "LB_CR", "LB_LF", "LB_NL", "LB_SP", "LB_OP", "LB_CL", "LB_CP",
       "LB_QU", "LB_GL", "LB_NS", "LB_EX", "LB_SY", "LB_IS", "LB_PR", "LB_PO",
       "LB_NU", "LB_AL", "LB_HL", "LB_ID", "LB_IN", "LB_HY", "LB_BA", "LB_BB",
       "LB_B2", "LB_CB", "LB_ZW", "LB_CM", "LB_WJ", "LB_H2", "LB_H3", "LB_JL",
       "LB_JV", "LB_JT", "LB_SG", "LB_AI", "LB_CJ", "LB_SA", "LB_XX", "LB_RI"
           Index values specifying the 40 line breaking property values
           (classes) defined by [UAX #14].

           Note: Property value CP was introduced by Unicode 5.2.0.
           Property values HL and CJ were introduced by Unicode 6.1.0.
           Property value RI was introduced by Unicode 6.2.0.

       "MANDATORY", "DIRECT", "INDIRECT", "PROHIBITED"
           Four values specifying line breaking behaviors: mandatory
           break; both direct break and indirect break are allowed;
           indirect break is allowed but direct break is prohibited;
           prohibited break.

       "Unicode::LineBreak::SouthEastAsian::supported"
           Flag indicating whether word segmentation for South East Asian
           writing systems is enabled.  If this feature is enabled, a
           non-empty string is set.  Otherwise, "undef" is set.

           N.B.: The current release supports only the Thai script of the
           modern Thai language.

       "UNICODE_VERSION"
           A string specifying the version of the Unicode standard this
           module refers to.

CUSTOMIZATION

   Formatting Lines
       If you specify a subroutine reference as the value of the "Format"
       option, it should accept three arguments:

           $MODIFIED = &subroutine(SELF, EVENT, STR);

       SELF is a Unicode::LineBreak object, EVENT is a string identifying
       the context in which the subroutine was called, and STR is the
       fragment of the Unicode string preceding or following the breaking
       position.

           EVENT |When Fired           |Value of STR
           -----------------------------------------------------------------
           "sot" |Beginning of text    |Fragment of first line
           "sop" |After mandatory break|Fragment of next line
           "sol" |After arbitrary break|Fragment of continuation line
           ""    |Just before any      |Complete line without trailing
                 |breaks               |SPACEs
           "eol" |Arbitrary break      |SPACEs preceding breaking position
           "eop" |Mandatory break      |Newline and SPACEs preceding it
           "eot" |End of text          |SPACEs (and newline) at end of
                 |                     |text
           -----------------------------------------------------------------

       The subroutine should return the modified text fragment, or may
       return "undef" to indicate that no modification occurred.  Note
       that a modification in the context of "sot", "sop" or "sol" may
       affect the decision on subsequent breaking positions, while a
       modification in the other contexts does not.

       Note: String arguments are actually sequences of grapheme clusters.
       See Unicode::GCString.

       For example, the following code folds lines, removing trailing
       spaces:

           sub fmt {
               # "eol", "eop" and "eot": replace the trailing SPACEs (and
               # newline) with a single newline.
               if ($_[1] =~ /^eo/) {
                   return "\n";
               }
               # Any other event: leave the fragment unchanged.
               return undef;
           }
           my $lb = Unicode::LineBreak->new(Format => \&fmt);
           $output = $lb->break($text);

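       To see which events a "Format" subroutine receives, a callback can
       simply record them.  The following is an illustrative sketch; the
       subroutine name, the sample text and the ColMax value are
       assumptions made for the example.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my @events;

# Format callback that records every event and changes nothing.
sub record {
    my ($self, $event, $str) = @_;
    push @events, $event;
    return undef;    # undef: no modification
}

my $lb = Unicode::LineBreak->new(ColMax => 10, Format => \&record);
$lb->break("alpha beta gamma\n");

# Show the empty event name as "" so that it stays visible.
print join(' ', map { length($_) ? $_ : '""' } @events), "\n";
```
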
   User-Defined Breaking Behaviors
       When a line generated by an arbitrary break is expected to exceed
       the limits set by CharMax, ColMax or ColMin, an urgent break may be
       performed on the following string.  If you specify a subroutine
       reference as the value of the "Urgent" option, it should accept two
       arguments:

           @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is the Unicode string to
       be broken.

       The subroutine should return an array of the fragments into which
       STR is broken.

       Note: The string argument is actually a sequence of grapheme
       clusters.  See Unicode::GCString.

       For example, the following code inserts hyphens into the names of
       several chemical substances (such as Titin) so that they may be
       folded:

           sub hyphenize {
               return map {$_ =~ s/yl$/yl-/; $_} split /(\w+?yl(?=\w))/, $_[1];
           }
           my $lb = Unicode::LineBreak->new(Urgent => \&hyphenize);
           $output = $lb->break("Methionylthreonylthreonylglutaminylarginyl...");

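       The splitting logic of such a callback can be tried on its own,
       without a Unicode::LineBreak object.  In the sketch below the
       subroutine is called directly with "undef" in place of SELF and a
       shortened, made-up substance name; empty fragments produced by
       split are filtered out for display.

```perl
use strict;
use warnings;

# Same logic as the Urgent callback: split after each "yl" that is
# followed by another word character, and append a hyphen there.
sub hyphenize {
    my ($self, $str) = @_;
    return map { s/yl$/yl-/; $_ } split /(\w+?yl(?=\w))/, "$str";
}

# Called directly here; SELF is not used by this callback.
my @parts = grep { length($_) > 0 }
            hyphenize(undef, "Methionylthreonylglutamine");

print join(' ', @parts), "\n";    # prints "Methionyl- threonyl- glutamine"
```
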
       If you specify a [REGEX, SUBREF] array reference as a value of the
       "Prep" option, the subroutine should accept two arguments:

           @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is the Unicode string
       matched by REGEX.

       The subroutine should return an array of the fragments into which
       STR is broken.

       For example, the following code breaks HTTP URLs using the [CMOS]
       rule.

           my $url = qr{http://[\x21-\x7E]+}i;
           sub breakurl {
               my $self = shift;
               my $str = shift;
               return split m{(?<=[/]) (?=[^/]) |
                              (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                              (?<=[=&]) (?=.)}x, $str;
           }
           my $lb = Unicode::LineBreak->new(Prep => [$url, \&breakurl]);
           $output = $lb->break($string);

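       The fragments such a "Prep" subroutine returns can be inspected by
       calling it directly.  The sketch below reuses the same split
       pattern on a made-up URL and passes "undef" in place of SELF, which
       the subroutine does not use.

```perl
use strict;
use warnings;

# Same pattern as above: break after a slash, before punctuation such
# as "." or "?", and after "=" or "&".
sub breakurl {
    my ($self, $str) = @_;
    return split m{(?<=[/]) (?=[^/]) |
                   (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                   (?<=[=&]) (?=.)}x, $str;
}

my $sample = "http://example.com/a?x=1";    # hypothetical URL
my @pieces = breakurl(undef, $sample);

print join(' | ', @pieces), "\n";
# prints "http:// | example | .com/ | a | ?x | = | 1"
```

       Because every split point is zero-width, joining the fragments
       gives back the original URL.
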
       Preserving State

       A Unicode::LineBreak object can behave as a hash reference.  Any
       items may be preserved throughout its life.

       For example, the following code separates paragraphs with empty
       lines.

           sub paraformat {
               my $self = shift;
               my $action = shift;
               my $str = shift;

               if ($action eq 'sot' or $action eq 'sop') {
                   # Start of a paragraph: forget the previous line.
                   $self->{'line'} = '';
               } elsif ($action eq '') {
                   # Remember the complete line for the "eop" check.
                   $self->{'line'} = $str;
               } elsif ($action eq 'eol') {
                   return "\n";
               } elsif ($action eq 'eop') {
                   # Add an empty line after a non-empty paragraph.
                   if (length $self->{'line'}) {
                       return "\n\n";
                   } else {
                       return "\n";
                   }
               } elsif ($action eq 'eot') {
                   return "\n";
               }
               return undef;
           }
           my $lb = Unicode::LineBreak->new(Format => \&paraformat);
           $output = $lb->break($string);

   Calculating String Size
       If you specify a subroutine reference as the value of the "Sizing"
       option, it will be called with five arguments:

           $COLS = &subroutine(SELF, LEN, PRE, SPC, STR);

       SELF is a Unicode::LineBreak object, LEN is the size of the
       preceding string, PRE is the preceding Unicode string, SPC is the
       additional SPACEs and STR is the Unicode string to be processed.

       The subroutine should return the calculated number of columns of
       "PRE.SPC.STR".  The number of columns need not be an integer: the
       unit may be freely chosen, but it should be the same as that of the
       "ColMin" and "ColMax" options.

       Note: String arguments are actually sequences of grapheme clusters.
       See Unicode::GCString.

       For example, the following code processes lines with tab stops at
       every eight columns.

           sub tabbedsizing {
               my ($self, $cols, $pre, $spc, $str) = @_;

               my $spcstr = $spc.$str;
               # Consume leading SPACEs and tabs one cluster at a time.
               while ($spcstr->lbc == LB_SP) {
                   my $c = $spcstr->item(0);
                   if ($c eq "\t") {
                       # Advance to the next multiple of eight columns.
                       $cols += 8 - $cols % 8;
                   } else {
                       $cols += $c->columns;
                   }
                   $spcstr = $spcstr->substr(1);
               }
               # Add the columns of the remaining text.
               $cols += $spcstr->columns;
               return $cols;
           }
           my $lb = Unicode::LineBreak->new(LBClass => [ord("\t") => LB_SP],
                                            Sizing => \&tabbedsizing);
           $output = $lb->break($string);

   Tailoring Character Properties
       Character properties may be tailored with the "LBClass" and
       "EAWidth" options.  Some constants are defined for convenience of
       tailoring.

       Line Breaking Properties

       Non-starters of Kana-like Characters

       By default, several hiragana, katakana and characters corresponding
       to kana are treated as non-starters (NS or CJ).  When the following
       pair(s) are specified as the value of the "LBClass" option, these
       characters are treated as normal ideographic characters (ID).

       "KANA_NONSTARTERS() => LB_ID"
           All of the characters below.

       "IDEOGRAPHIC_ITERATION_MARKS() => LB_ID"
           Ideographic iteration marks: U+3005 IDEOGRAPHIC ITERATION MARK,
           U+303B VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D HIRAGANA
           ITERATION MARK, U+309E HIRAGANA VOICED ITERATION MARK, U+30FD
           KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION MARK.

           N.B. Some of them are neither hiragana nor katakana.

       "KANA_SMALL_LETTERS() => LB_ID"
       "KANA_PROLONGED_SOUND_MARKS() => LB_ID"
           Hiragana or katakana small letters: Hiragana small letters U+3041
           A, U+3043 I, U+3045 U, U+3047 E, U+3049 O, U+3063 TU, U+3083 YA,
           U+3085 YU, U+3087 YO, U+308E WA, U+3095 KA, U+3096 KE.  Katakana
           small letters U+30A1 A, U+30A3 I, U+30A5 U, U+30A7 E, U+30A9 O,
           U+30C3 TU, U+30E3 YA, U+30E5 YU, U+30E7 YO, U+30EE WA, U+30F5 KA,
           U+30F6 KE.  Katakana phonetic extensions U+31F0 KU - U+31FF RO.
           Halfwidth katakana small letters U+FF67 A - U+FF6F TU.

           Hiragana or katakana prolonged sound marks: U+30FC
           KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 HALFWIDTH
           KATAKANA-HIRAGANA PROLONGED SOUND MARK.

           N.B. These letters are optionally treated either as non-starters
           or as normal ideographic characters.  See [JIS X 4051] 6.1.1,
           [JLREQ] 3.1.7 or [UAX #14].

           N.B. U+3095, U+3096, U+30F5 and U+30F6 are considered to be
           neither hiragana nor katakana.

       "MASU_MARK() => LB_ID"
           U+303C MASU MARK.

           N.B. Although this character is not kana, it is usually regarded
           as an abbreviation of the sequence of hiragana ま す or katakana
           マ ス, MA and SU.

           N.B. This character is classified as non-starter (NS) by
           [UAX #14] and as the class corresponding to ID by [JIS X 4051]
           and [JLREQ].

       Ambiguous Quotation Marks

       By default, some punctuation marks are treated as ambiguous
       quotation marks (QU).

       "BACKWARD_QUOTES() => LB_OP, FORWARD_QUOTES() => LB_CL"
           Some languages (Dutch, English, Italian, Portuguese, Spanish,
           Turkish and most East Asian languages) use rotated-9-style
           punctuation marks (‘ “) as opening and 9-style punctuation
           marks (’ ”) as closing quotation marks.

       "FORWARD_QUOTES() => LB_OP, BACKWARD_QUOTES() => LB_CL"
           Some others (Czech, German and Slovak) use 9-style punctuation
           marks (’ ”) as opening and rotated-9-style punctuation marks
           (‘ “) as closing quotation marks.

       "BACKWARD_GUILLEMETS() => LB_OP, FORWARD_GUILLEMETS() => LB_CL"
           French, Greek, Russian etc. use left-pointing guillemets (« ‹) as
           opening and right-pointing guillemets (» ›) as closing quotation
           marks.

       "FORWARD_GUILLEMETS() => LB_OP, BACKWARD_GUILLEMETS() => LB_CL"
           German and Slovak use right-pointing guillemets (» ›) as opening
           and left-pointing guillemets (« ‹) as closing quotation marks.

       Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing
       punctuation marks (’ ” » ›) as both opening and closing quotation
       marks.

       IDEOGRAPHIC SPACE

       "IDEOGRAPHIC_SPACE() => LB_BA"
           U+3000 IDEOGRAPHIC SPACE will not be placed at the beginning of
           a line.  This is the default behavior.

       "IDEOGRAPHIC_SPACE() => LB_ID"
           IDEOGRAPHIC SPACE can be placed at the beginning of a line.
           This was the default behavior in Unicode 6.2 and earlier.

       "IDEOGRAPHIC_SPACE() => LB_SP"
           IDEOGRAPHIC SPACE will not be placed at the beginning of a line,
           and will protrude from the end of a line.

       East_Asian_Width Properties

       Some particular letters of the Latin, Greek and Cyrillic scripts
       have the ambiguous (A) East_Asian_Width property.  Thus, these
       characters are treated as wide in the "EASTASIAN" context.  When
       "EAWidth => [ AMBIGUOUS_"*"() => EA_N ]" is specified, those
       characters are always treated as narrow.

       "AMBIGUOUS_ALPHABETICS() => EA_N"
           Treat all of the characters below as East_Asian_Width neutral
           (N).

       "AMBIGUOUS_CYRILLIC() => EA_N"
       "AMBIGUOUS_GREEK() => EA_N"
       "AMBIGUOUS_LATIN() => EA_N"
           Treat letters of the Cyrillic, Greek and Latin scripts having
           ambiguous (A) width as neutral (N).

       On the other hand, although several characters were occasionally
       rendered as wide characters by a number of implementations for East
       Asian character sets, they are given the narrow (Na)
       East_Asian_Width property just because they have fullwidth (F)
       compatibility characters.  When "EAWidth" is specified as below,
       those characters are treated as ambiguous, that is, wide in the
       "EASTASIAN" context.

       "QUESTIONABLE_NARROW_SIGNS() => EA_A"
           U+00A2 CENT SIGN, U+00A3 POUND SIGN, U+00A5 YEN SIGN (or yuan
           sign), U+00A6 BROKEN BAR, U+00AC NOT SIGN, U+00AF MACRON.

   Configuration File
       Built-in defaults of option parameters for the "new" and "config"
       methods can be overridden by a configuration file:
       Unicode/LineBreak/Defaults.pm.  For more details read
       Unicode/LineBreak/Defaults.pm.sample.

BUGS

       Please report bugs or buggy behaviors to the developer.

       CPAN Request Tracker:
       <http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-LineBreak>.

VERSION

       Consult the $VERSION variable.

   Incompatible Changes
       Release 2012.06
           •   The eawidth() method was deprecated.  "columns" in
               Unicode::GCString may be used instead.

           •   The lbclass() method was deprecated.  Use "lbc" in
               Unicode::GCString or "lbcext" in Unicode::GCString.

   Conformance to Standards
       The character properties this module is based on are defined by
       Unicode Standard version 8.0.0.

       This module is intended to implement UAX14-C2.

IMPLEMENTATION NOTES

       •   Some ideographic characters may be treated either as NS or as ID
           by choice.

       •   Hangul syllables and conjoining jamos may be treated as either ID
           or AL by choice.

       •   Characters assigned to AI may be resolved to either AL or ID by
           choice.

       •   Character(s) assigned to CB are not resolved.

       •   Characters assigned to CJ are always resolved to NS.  A more
           flexible tailoring mechanism is provided.

       •   When word segmentation for South East Asian writing systems is
           not supported, characters assigned to SA are resolved to AL,
           except that characters that have the Grapheme_Cluster_Break
           property value Extend or SpacingMark are resolved to CM.

       •   Characters assigned to SG or XX are resolved to AL.

       •   Code points of the following UCS ranges are given fixed property
           values even if they have not been assigned any characters.

               Ranges             | UAX #14    | UAX #11    | Description
               -------------------------------------------------------------
               U+20A0..U+20CF     | PR [*1]    | N [*2]     | Currency symbols
               U+3400..U+4DBF     | ID         | W          | CJK ideographs
               U+4E00..U+9FFF     | ID         | W          | CJK ideographs
               U+D800..U+DFFF     | AL (SG)    | N          | Surrogates
               U+E000..U+F8FF     | AL (XX)    | F or N (A) | Private use
               U+F900..U+FAFF     | ID         | W          | CJK ideographs
               U+20000..U+2FFFD   | ID         | W          | CJK ideographs
               U+30000..U+3FFFD   | ID         | W          | Old hanzi
               U+F0000..U+FFFFD   | AL (XX)    | F or N (A) | Private use
               U+100000..U+10FFFD | AL (XX)    | F or N (A) | Private use
               Other unassigned   | AL (XX)    | N          | Unassigned,
                                  |            |            | reserved or
                                  |            |            | noncharacters
               -------------------------------------------------------------
               [*1] Except U+20A7 PESETA SIGN (PO),
                 U+20B6 LIVRE TOURNOIS SIGN (PO), U+20BB NORDIC MARK SIGN (PO)
                 and U+20BE LARI SIGN (PO).
               [*2] Except U+20A9 WON SIGN (H) and U+20AC EURO SIGN
                 (F or N (A)).

       •   Characters belonging to General Category Mn, Me, Cc, Cf, Zl or Zp
           are treated as nonspacing by this module.

REFERENCES

       [CMOS]
           The Chicago Manual of Style, 15th edition.  University of Chicago
           Press, 2003.

       [JIS X 4051]
           JIS X 4051:2004 日本語文書の組版方法 (Formatting Rules for Japanese
           Documents).  Japanese Standards Association, 2004.

       [JLREQ]
           Anan, Yasuhiro et al.  Requirements for Japanese Text Layout, W3C
           Working Group Note 3 April 2012.
           <http://www.w3.org/TR/2012/NOTE-jlreq-20120403/>.

       [UAX #11]
           A. Freytag (ed.) (2008-2009).  Unicode Standard Annex #11: East
           Asian Width, Revisions 17-19.  <http://unicode.org/reports/tr11/>.

       [UAX #14]
           A. Freytag and A. Heninger (eds.) (2008-2015).  Unicode Standard
           Annex #14: Unicode Line Breaking Algorithm, Revisions 22-35.
           <http://unicode.org/reports/tr14/>.

       [UAX #29]
           Mark Davis (ed.) (2009-2013).  Unicode Standard Annex #29: Unicode
           Text Segmentation, Revisions 15-23.
           <http://www.unicode.org/reports/tr29/>.

AUTHOR

       Copyright (C) 2009-2018 Hatuka*nezumi - IKEDA Soji
       <hatuka(at)nezumi.nu>.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-21             Unicode::LineBreak(3)