hunspell(4) Kernel Interfaces Manual hunspell(4)

hunspell - format of Hunspell dictionaries and affix files
7
9 Hunspell(1) requires two files to define the language that it is
10 spellchecking. The first file is a dictionary containing words for the
11 language, and the second is an "affix" file that defines the meaning
12 of special flags in the dictionary.
13
A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except a personal dictionary) contains the
approximate word count (used to size the hash table). Each word may
optionally be followed by a slash ("/") and one or more flags, which
represent affixes or special attributes. Dictionary words can also
contain slashes, written with the "\/" syntax. The default flag format
is a single (usually alphabetic) character. A Hunspell dictionary file
may also contain an optional morphological field, separated from the
word by a tab.

Morphological descriptions have a custom format.
24
An affix file (*.aff) may contain many optional attributes. For
example, SET sets the character encoding of the affix and dictionary
files. TRY sets the characters that are tried when generating
suggestions. REP sets a replacement table for multi-character
corrections in suggestion mode. PFX and SFX define prefix and suffix
classes named with affix flags.
31
The following affix file example defines UTF-8 character encoding.
With this TRY line, suggestions differ from the misspelled word by one
English letter or an apostrophe. With these REP definitions, Hunspell
can suggest the right word form when the misspelled word contains f
instead of ph, or vice versa.
37
38
39 SET UTF-8
40 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
41
42 REP 2
43 REP f ph
44 REP ph f
45
46 PFX A Y 1
47 PFX A 0 re .
48
49 SFX B Y 2
50 SFX B 0 ed [^y]
51 SFX B y ied y
52
The affix file defines two affix classes. Class A defines an `re-'
prefix. Class B defines two `-ed' suffixes. The first suffix can be
added to a word if its last character isn't `y'. The second suffix can
be added to words ending in `y'. (See details later.) The following
dictionary file uses these affix classes.
58
59
60 3
61 hello
62 try/B
63 work/AB
64
65 All accepted words with this example: hello, try, tried, work, worked,
66 rework, reworked.
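
How the two files combine can be illustrated with a short Python
sketch. It is not Hunspell's implementation (Hunspell strips affixes
from the input word rather than generating forms); the AFFIXES and
DICTIONARY tables and the helper names are hypothetical, and the data
is hard-coded from the example above.

```python
import re

# flag -> (kind, cross_product, rules); a rule is (strip, affix, condition).
AFFIXES = {
    "A": ("PFX", True, [("0", "re", ".")]),
    "B": ("SFX", True, [("0", "ed", "[^y]"), ("y", "ied", "y")]),
}
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(kind, word, strip, affix, condition):
    """Return the affixed form, or None if the rule does not apply."""
    strip = "" if strip == "0" else strip
    affix = "" if affix == "0" else affix
    if kind == "SFX":
        if not re.search(condition + "$", word) or not word.endswith(strip):
            return None
        return word[: len(word) - len(strip)] + affix
    if not re.match(condition, word) or not word.startswith(strip):
        return None
    return affix + word[len(strip):]


def expand(word, flags):
    """Generate every form of `word` licensed by its flags."""
    suffixed = {}  # suffixed form -> may it also take a prefix?
    for flag in flags:
        kind, cross, rules = AFFIXES[flag]
        if kind == "SFX":
            for rule in rules:
                form = apply_rule(kind, word, *rule)
                if form is not None:
                    suffixed[form] = cross
    forms = {word} | set(suffixed)
    for flag in flags:
        kind, cross, rules = AFFIXES[flag]
        if kind == "PFX":
            # Prefix + suffix needs the cross product flag on both classes.
            stems = [word] + [f for f, ok in suffixed.items() if ok and cross]
            for stem in stems:
                for rule in rules:
                    form = apply_rule(kind, stem, *rule)
                    if form is not None:
                        forms.add(form)
    return forms


accepted = set()
for word, flags in DICTIONARY.items():
    accepted |= expand(word, flags)
print(sorted(accepted))
```

The sketch treats the condition as an ordinary regular expression, which
is sufficient for the two conditions used here.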
67
68
70 SET encoding
71 Set character encoding of words and morphemes in affix and dic‐
72 tionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
73 ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251,
74 ISCII-DEVANAGARI.
75
76 FLAG value
Set flag type. The default type is a single extended ASCII (8-bit)
character. The `UTF-8' value sets UTF-8 encoded Unicode character
flags. The `long' value sets two-character (double extended ASCII)
flags, and `num' sets decimal number flags. Decimal flags are numbered
from 1 to 65535 and are separated by commas in flag fields. BUG: the
UTF-8 flag type doesn't work on the ARM platform.
84
85 COMPLEXPREFIXES
86 Set twofold prefix stripping (but single suffix stripping) for
87 agglutinative languages with right-to-left writing system.
88
89 LANG langcode
Set language code. Hunspell may enable language-specific code
according to the LANG value. At present there is language-specific
code for az_AZ, hu_HU and tr_TR (see the source code).
93
94 IGNORE characters
Ignore characters from dictionary words, affixes and input words.
Useful for optional characters, such as Arabic diacritical marks
(Harakat).
98
99 AF number_of_flag_vector_aliases
100
101 AF flag_vector
Hunspell can substitute a natural number for an affix flag set (alias
compression). A first example, a dictionary file with alias
compression:
105
106 3
107 hello
108 try/1
109 work/2
110
111 AF definitions in the affix file:
112
113 SET UTF-8
114 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
115 AF 2
116 AF A
117 AF AB
118
119 See also tests/alias* examples.
120
Note: If the affix file contains the FLAG parameter, define it before
the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.
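
Resolving an alias amounts to a table lookup. The Python sketch below
shows the idea under simplifying assumptions: the function names are
hypothetical, alias numbers are taken to start at 1, and escaped
slashes in words are not handled.

```python
def read_aliases(aff_lines):
    """Collect the AF flag vectors; the first AF line is only the count."""
    aliases = []
    header_seen = False
    for line in aff_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "AF":
            if not header_seen:
                header_seen = True
            else:
                aliases.append(fields[1])
    return aliases


def expand_entry(entry, aliases):
    """Replace the alias number of a dictionary entry with its flags."""
    word, slash, alias = entry.partition("/")
    if not slash:
        return entry
    return word + "/" + aliases[int(alias) - 1]


aliases = read_aliases(["SET UTF-8", "AF 2", "AF A", "AF AB"])
print([expand_entry(e, aliases) for e in ["hello", "try/1", "work/2"]])
```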
126
127 AM number_of_morphological_description_aliases
128
129 AM morphological_description
Hunspell can also substitute a natural number for a morphological
description (alias compression). See tests/alias* examples.
133
135 TRY characters
Hunspell can suggest the right word forms when they differ from the
bad form by one TRY character. The parameter of TRY is case
sensitive.
139
140 NOSUGGEST flag
141 Words signed with NOSUGGEST flag are not suggested. Proposed
142 flag for vulgar and obscene words.
143
144 MAXNGRAMSUGS num
145 Set number of n-gram suggestions. Value 0 switches off the n-
146 gram suggestions.
147
148 NOSPLITSUGS
149 Disable split-word suggestions.
150
151 SUGSWITHDOTS
152 Add dot(s) to suggestions, if input word terminates in dot(s).
153 (Not for OpenOffice.org dictionaries, because OpenOffice.org has
154 an automatic dot expansion mechanism.)
155
156 REP number_of_replacement_definitions
157
158 REP what replacement
Language-dependent phonetic information can be defined in the affix
file (.aff) with a replacement table. The first REP line is the
header of the table, and one or more REP data lines follow it. With
this table, Hunspell can suggest the right forms for typical spelling
mistakes in which the incorrect form differs from the right form by
more than one letter. For example, a possible English replacement
table for misspelled consonants:
167
168 REP 8
169 REP f ph
170 REP ph f
171 REP f gh
172 REP gh f
173 REP j dg
174 REP dg j
175 REP k ch
176 REP ch k
177
178 Note: It's very useful to define replacements for the most typical one-
179 character mistakes, too: with REP you can add higher priority to a sub‐
180 set of the TRY suggestions (suggestion list begins with the REP sugges‐
181 tions).
182
183 Note II: Replacement table can be used for a stricter compound word
184 checking (forbidding generated compound words, if they are also simple
185 words with typical fault, see CHECKCOMPOUNDREP).
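
A minimal Python sketch of how a replacement table can drive
suggestions: each entry is tried at every position of the misspelled
word, and candidates found in the word list are kept. The function
name and the tiny word list are hypothetical, and Hunspell's real
suggestion code is more elaborate.

```python
def rep_suggestions(word, rep_table, dictionary):
    """Try every REP entry at every position; keep known words."""
    suggestions = []
    for what, replacement in rep_table:
        start = word.find(what)
        while start != -1:
            candidate = word[:start] + replacement + word[start + len(what):]
            if candidate in dictionary and candidate not in suggestions:
                suggestions.append(candidate)
            start = word.find(what, start + 1)
    return suggestions


REP_TABLE = [("f", "ph"), ("ph", "f"), ("f", "gh"), ("gh", "f"),
             ("j", "dg"), ("dg", "j"), ("k", "ch"), ("ch", "k")]
WORDS = {"phone", "laugh", "chemistry"}  # illustrative word list

print(rep_suggestions("fone", REP_TABLE, WORDS))
print(rep_suggestions("lauf", REP_TABLE, WORDS))
```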
186
187
188 MAP number_of_map_definitions
189
190 MAP string_of_related_chars
Language-dependent information on characters that should be
considered related (i.e. nearer to each other than to characters
outside the set) can be defined in the affix file (.aff) with a
character map table. With this table, Hunspell can suggest the right
forms for words in which the wrong letter from a related set was
chosen more than once.

For example, a possible mapping relates the German umlauted ü to the
plain u; the word Frühstück should be written with umlauted u's, not
plain ones:
201
202 MAP 1
203 MAP uü
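
The effect of a map table can be sketched in Python by substituting
every character with each member of its related set and keeping the
candidates that are known words. The names are hypothetical, and the
exhaustive enumeration grows exponentially with the number of mapped
characters, so this is an illustration rather than a practical
algorithm.

```python
from itertools import product


def map_candidates(word, map_sets):
    """All spellings obtained by swapping characters within a MAP set."""
    related = {}
    for chars in map_sets:
        for ch in chars:
            related[ch] = chars
    choices = [related.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


def map_suggestions(word, map_sets, dictionary):
    """Candidates that are dictionary words and differ from the input."""
    return [c for c in map_candidates(word, map_sets)
            if c in dictionary and c != word]


print(map_suggestions("Fruhstuck", ["uü"], {"Frühstück"}))
```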
204
206 BREAK number_of_break_definitions
207
208 BREAK character_or_character_sequence
Define break points for breaking words and checking the word parts
separately. Rationale: useful for compounding with a joining
character or string (for example, the hyphen in English and German,
or the hyphen and n-dash in Hungarian). Dashes are often bad break
points for tokenization, because compounds with dashes may contain
invalid parts, too. With BREAK, Hunspell can check both sides of
these compounds, breaking the words at dashes and n-dashes:
217
218 BREAK 2
219 BREAK -
220 BREAK -- # n-dash
221
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds.
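
The recursive breaking can be sketched in Python as follows. The
function name is hypothetical, the word list is assumed to contain foo
and bar, and both parts of a break are required to be non-empty, which
is a simplification made for this sketch.

```python
def check_with_breaks(word, dictionary, breaks):
    """Accept a word if it is known or splits into accepted parts."""
    if word in dictionary:
        return True
    for sep in breaks:
        pos = word.find(sep, 1)  # start at 1 so the left part is non-empty
        while pos != -1 and pos + len(sep) < len(word):
            left, right = word[:pos], word[pos + len(sep):]
            if (check_with_breaks(left, dictionary, breaks)
                    and check_with_breaks(right, dictionary, breaks)):
                return True
            pos = word.find(sep, pos + 1)
    return False


WORDS = {"foo", "bar"}
BREAKS = ["-", "--"]
for candidate in ["foo-bar", "bar-foo", "foo-foo--bar-bar", "foo-baz"]:
    print(candidate, check_with_breaks(candidate, WORDS, BREAKS))
```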
224
Note: COMPOUNDRULE is better (or will be better) for handling dashes
and other compound joining characters or character strings. Use BREAK
if you want to check words with dashes or other joining characters and
there is no time or possibility to describe precise compound rules with
COMPOUNDRULE (so far, COMPOUNDRULE handles only the last suffixation of
the compound word).

Note II: For command line spell checking, set the WORDCHARS parameter:
WORDCHARS --- (see the tests/break.* example).
234
235 COMPOUNDRULE number_of_compound_definitions
236
237 COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The first
COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound flags
and the star and question mark metacharacters. A flag followed by a
`*' matches a sequence of zero or more words signed with this
compound flag. A flag followed by a `?' matches zero or one word
signed with this compound flag. See tests/compound*.* examples.
246
247 Note: `*' and `?' metacharacters work only with the default
248 8-bit character and the UTF-8 FLAG types.
249
Note II: COMPOUNDRULE flags are not yet compatible with the
COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags on
different words).
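
The pattern semantics described above can be sketched in Python with a
small backtracking matcher. It assumes single-character flags and
takes a compound that has already been split into parts, each given as
the set of its compound flags; the pattern `A*B?C' and the function
names are hypothetical.

```python
def parse_pattern(pattern):
    """Turn 'A*B?C' into [('A', '*'), ('B', '?'), ('C', '')]."""
    items = []
    for ch in pattern:
        if ch in "*?" and items:
            flag, _ = items[-1]
            items[-1] = (flag, ch)
        else:
            items.append((ch, ""))
    return items


def matches(items, parts_flags):
    """Check a sequence of per-word flag sets against the pattern."""
    def go(i, j):
        if i == len(items):
            return j == len(parts_flags)
        flag, quantifier = items[i]
        fits = j < len(parts_flags) and flag in parts_flags[j]
        if quantifier == "":
            return fits and go(i + 1, j + 1)
        if quantifier == "?":
            return go(i + 1, j) or (fits and go(i + 1, j + 1))
        # '*': skip the item, or consume one word and stay on it
        return go(i + 1, j) or (fits and go(i, j + 1))
    return go(0, 0)


RULE = parse_pattern("A*B?C")
print(matches(RULE, [{"A"}, {"A"}, {"C"}]))
print(matches(RULE, [{"B"}, {"B"}, {"C"}]))
```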
253
254 COMPOUNDMIN num
255 Minimum length of words in compound words. Default value is 3
256 letters.
257
258 COMPOUNDFLAG flag
Words signed with COMPOUNDFLAG may be in compound words (except when
the word is shorter than COMPOUNDMIN). Affixes with COMPOUNDFLAG also
permit compounding of affixed words.
262
263 COMPOUNDBEGIN flag
264 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
265 first elements in compound words.
266
267 COMPOUNDLAST flag
268 Words signed with COMPOUNDLAST (or with a signed affix) may be
269 last elements in compound words.
270
271 COMPOUNDMIDDLE flag
272 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
273 middle elements in compound words.
274
275 ONLYINCOMPOUND flag
276 Suffixes signed with ONLYINCOMPOUND flag may be only inside of
277 compounds (Fuge-elements in German, fogemorphemes in Swedish).
278 ONLYINCOMPOUND flag works also with words (see tests/onlyincom‐
279 pound.*).
280
281 COMPOUNDPERMITFLAG flag
282 Prefixes are allowed at the beginning of compounds, suffixes are
283 allowed at the end of compounds by default. Affixes with COM‐
284 POUNDPERMITFLAG may be inside of compounds.
285
286 COMPOUNDFORBIDFLAG flag
287 Suffixes with this flag forbid compounding of the affixed word.
288
289 COMPOUNDROOT flag
290 COMPOUNDROOT flag signs the compounds in the dictionary (Now it
291 is used only in the Hungarian language specific code).
292
293 COMPOUNDWORDMAX number
294 Set maximum word count in a compound word. (Default is unlim‐
295 ited.)
296
297 CHECKCOMPOUNDDUP
298 Forbid word duplication in compounds (eg. foofoo).
299
300 CHECKCOMPOUNDREP
301 Forbid compounding, if the (usually bad) compound word may be a
302 non compound word with a REP fault. Useful for languages with
303 `compound friendly' orthography.
304
305 CHECKCOMPOUNDCASE
306 Forbid upper case characters at word bound in compounds.
307
308 CHECKCOMPOUNDTRIPLE
309 Forbid compounding, if compound word contains triple letters
310 (eg. foo|ox or xo|oof). Bug: missing multi-byte character sup‐
311 port in UTF-8 encoding (works only for 7-bit ASCII characters).
312
313 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
314
315 CHECKCOMPOUNDPATTERN endchars beginchars
316 Forbid compounding, if first word in compound ends with end‐
317 chars, and next word begins with beginchars.
318
319 COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables that a compound may
contain when it has more words than COMPOUNDWORDMAX. The second
parameter is the list of vowels (for counting syllables).
324
325 SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
327
329 PFX flag cross_product number
330
331 PFX flag stripping prefix condition morphological_description
332
333 SFX flag cross_product number
334
335 SFX flag stripping suffix condition morphological_description
An affix is either a prefix or a suffix attached to root words to
make other words. Affix classes can be defined with an arbitrary
number of affix rules. Affix classes are signed with affix flags. The
first line of an affix class definition is the header. The fields of
an affix class header:
341
342 (0) Option name (PFX or SFX)
343
344 (1) Flag (name of the affix class)
345
346 (2) Cross product (permission to combine prefixes and suffixes).
347 Possible values: Y (yes) or N (no)
348
349 (3) Line count of the following rules.
350
Fields of an affix rule:
352
353 (0) Option name
354
355 (1) Flag
356
357 (2) stripping characters from beginning (at prefix rules) or end
358 (at suffix rules) of the word
359
360 (3) affix (optionally with flags of continuation classes, sepa‐
361 rated by a slash)
362
363 (4) condition.
364
Zero stripping or a zero affix is indicated by zero. A zero condition
is indicated by a dot. The condition is a simplified,
regular-expression-like pattern, which must be met before the affix
can be applied. (A dot matches an arbitrary character. Characters in
brackets match an arbitrary character from the bracketed set. A dash
has no special meaning, but a circumflex (^) immediately after the
opening bracket selects the complement of the set.)
372
373 (5) Custom morphological description.
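
The condition syntax is close to, but not the same as, ordinary
regular expressions. The Python sketch below translates a condition
into a pattern for the standard re module, keeping the dash literal
inside brackets and anchoring the pattern at the end of the word for
suffixes and at the beginning for prefixes. The function name is
hypothetical and empty bracket sets are not handled.

```python
import re


def condition_to_regex(condition, kind):
    """Compile an affix condition; `kind` is 'PFX' or 'SFX'."""
    parts = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == "[":
            end = condition.index("]", i)
            body = condition[i + 1:end]
            negate = body.startswith("^")
            if negate:
                body = body[1:]
            # re.escape keeps '-' literal, so 'a-c' is not a range
            parts.append("[" + ("^" if negate else "") + re.escape(body) + "]")
            i = end + 1
        elif ch == ".":
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    pattern = "".join(parts)
    if kind == "SFX":
        return re.compile(pattern + r"\Z")
    return re.compile(r"\A" + pattern)


not_y = condition_to_regex("[^y]", "SFX")
print(bool(not_y.search("work")), bool(not_y.search("try")))
```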
374
375
377 CIRCUMFIX flag
378 Affixes signed with CIRCUMFIX flag may be on a word when this
379 word also has a prefix with CIRCUMFIX flag and vice versa.
380
381 FORBIDDENWORD flag
This flag signs a forbidden word form. Because affixed forms are also
forbidden, we can subtract a subset from the set of accepted affixed
and compound words.
385
386 KEEPCASE flag
Forbid uppercased and capitalized forms of words signed with the
KEEPCASE flag. Useful for special orthographies (measurements and
currency often keep their case in uppercased texts) and writing
systems (e.g. keeping lower case of IPA characters).
391
392 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
393 CASE flag may be capitalised and uppercased, but uppercased
394 forms of these words may not contain sharp s, only SS. See ger‐
395 mancompounding example in the tests directory of the Hunspell
396 distribution.
397
398 LEMMA_PRESENT flag
Generally, dictionary words appear as lemmas in the output of
morphological analysis. Sometimes dictionary words are not lemmas,
but affixed (not real) stems or virtual stems. In this case the
lemmas (real stems) need to be put into the morphological
description, and the not-real lemmas are excluded from morphological
analysis by adding the LEMMA_PRESENT flag to the dictionary words.
405
406 NEEDAFFIX flag
This flag signs virtual stems in the dictionary. Only affixed forms
of these words will be accepted by Hunspell, unless the dictionary
word has a homonym or a zero affix. NEEDAFFIX also works with
prefixes and prefix + suffix combinations (see tests/pseudoroot5.*).
412
413 PSEUDOROOT flag
414 Deprecated. (Former name of the NEEDAFFIX option.)
415
416 WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot, dash,
n-dash, numbers and the percent sign are word characters in Hungarian.
420
421 CHECKSHARPS
In uppercased (German) words, the SS letter pair may be the upper
case form of sharp s (ß). Hunspell can handle this special casing
with the CHECKSHARPS declaration (see also the KEEPCASE flag and the
tests/germancompounding example) in both spelling and suggestion.
426
427
Hunspell's affix rules have an optional morphological description
field. There is a similar optional field in the dictionary file,
separated by a tab:
432
433
434 word/flags morphology
435
We define a simple resource with morphological information.
437
438 Affix file:
439
440
441 SFX X Y 1
442 SFX X 0 able . +ABLE
443
444 Dictionary file:
445
446
447 drink/X [VERB]
448
449 Test file:
450
451
452 drink
453 drinkable
454
455 Test:
456
457
458 $ hunmorph test.aff test.dic test.txt
459 drink: drink[VERB]
460 drinkable: drink[VERB]+ABLE
461
You can see in the example that the analyzer concatenates the
morphological fields in item-and-arrangement style.
464
465
Ispell's original algorithm strips only one suffix. Hunspell can strip
a second one as well.

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterize agglutinative languages.
473
474 Extending the previous example by adding a second suffix (affix class Y
475 will be the continuation class of the suffix `able'):
476
477
478 SFX Y Y 1
479 SFX Y 0 s . +PLUR
480
481 SFX X Y 1
482 SFX X 0 able/Y . +ABLE
483
484 Dictionary file:
485
486
487 drink/X [VERB]
488
489 Test file:
490
491
492 drink
493 drinkable
494 drinkables
495
496 Test:
497
498
499 $ hunmorph test.aff test.dic test.txt
500 drink: drink[VERB]
501 drinkable: drink[VERB]+ABLE
502 drinkables: drink[VERB]+ABLE+PLUR
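
The concatenation of morphological fields across two suffix levels can
be reproduced with a short Python sketch. It generates forms from the
stem instead of stripping suffixes from the input, ignores conditions
(all are `.' in this example), and uses hypothetical table and
function names.

```python
# flag -> list of (strip, affix, morphological tag, continuation flags)
SUFFIXES = {
    "X": [("0", "able", "+ABLE", "Y")],
    "Y": [("0", "s", "+PLUR", "")],
}
DICTIONARY = [("drink", "X", "[VERB]")]


def derive(form, flags, analysis, depth):
    """Yield (form, analysis) pairs using at most `depth` suffixes."""
    yield form, analysis
    if depth == 0:
        return
    for flag in flags:
        for strip, affix, tag, continuation in SUFFIXES[flag]:
            if strip != "0" and not form.endswith(strip):
                continue
            base = form if strip == "0" else form[: -len(strip)]
            yield from derive(base + affix, continuation,
                              analysis + tag, depth - 1)


def analyze(word):
    """Return every analysis whose generated form equals `word`."""
    results = []
    for stem, flags, morph in DICTIONARY:
        for form, analysis in derive(stem, flags, stem + morph, 2):
            if form == word:
                results.append(analysis)
    return results


for word in ["drink", "drinkable", "drinkables"]:
    print(word + ":", " ".join(analyze(word)))
```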
503
Theoretically, twofold suffix stripping needs only the square root of
the number of suffix rules required by an implementation that strips a
single suffix. In practice, it allowed us to elaborate the Hungarian
inflectional morphology.

Note: In the grammar of the Hunlex preprocessor, not only twofold but
multiple suffix stripping can be used.
511
512
Hunspell can handle more than 65000 affix classes. There are two
additional syntaxes for giving flags in affix and dictionary files.

The FLAG long command sets 2-character flags:
518
519
520 FLAG long
SFX Y1 Y 1
SFX Y1 0 s .
523
524 Dictionary record with the Y1, Z3, F? flags:
525
526
527 foo/Y1Z3F?
528
The FLAG num command sets numerical flags separated by commas:
530
531
532 FLAG num
SFX 65000 Y 1
SFX 65000 0 s .
535
536 Dictionary example:
537
538
539 foo/65000,12,2756
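
Splitting a flag field according to the flag type can be sketched in
Python as below. The function name and the flag_type strings are
hypothetical; the default and UTF-8 types are both treated as one
character per flag, since Python strings are already decoded.

```python
def parse_flags(field, flag_type="default"):
    """Split the flag field of a dictionary entry into single flags."""
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        flags = [int(part) for part in field.split(",")]
        for flag in flags:
            if not 1 <= flag <= 65535:
                raise ValueError("numeric flag out of range: %d" % flag)
        return flags
    return list(field)  # one character per flag


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
```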
540
542 Hunspell's dictionary can contain repeating elements that are homonyms:
543
544
545 work/A [VERB]
546 work/B [NOUN]
547
548 An affix file:
549
550
551 SFX A Y 1
552 SFX A 0 s . +SG3
553
554 SFX B Y 1
555 SFX B 0 s . +PLUR
556
557 Test file:
558
559
560 works
561
562 Test:
563
564
565 > works
566 work[VERB]+SG3
567 work[NOUN]+PLUR
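
A Python sketch of why homonyms yield several analyses: the dictionary
is kept as a list, so both `work' entries are tried and each
contributes its own result. The table layout and names are
hypothetical, and each class is reduced to the single suffix of this
example.

```python
SUFFIXES = {"A": ("s", "+SG3"), "B": ("s", "+PLUR")}
DICTIONARY = [("work", "A", "[VERB]"), ("work", "B", "[NOUN]")]


def analyze(word):
    """Collect one analysis per matching dictionary entry and suffix."""
    analyses = []
    for stem, flags, morph in DICTIONARY:
        if word == stem:
            analyses.append(stem + morph)
        for flag in flags:
            suffix, tag = SUFFIXES[flag]
            if word == stem + suffix:
                analyses.append(stem + morph + tag)
    return analyses


print(analyze("works"))
```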
568
569 This feature also gives a way to forbid illegal prefix/suffix combina‐
570 tions in difficult cases.
571
572
An interesting side-effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to make the lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier HunSpell code. This
may have been legitimate for a single case, but in fact prefix--suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay, or drinkable,
undrinkable but *undrink). Put simply, the prefix un- is legitimate
here only if the base drink is suffixed with -able. If both these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.
591
In the next example, suffix class R has a prefix `continuation' class
(class P).
594
595
596 PFX P Y 1
597 PFX P 0 un . [prefix_un]+
598
599 SFX S Y 1
600 SFX S 0 s . +PL
601
602 SFX Q Y 1
603 SFX Q 0 s . +3SGV
604
605 SFX R Y 1
606 SFX R 0 able/PS . +DER_V_ADJ_ABLE
607
608 Dictionary:
609
610
611 2
612 drink/RQ [verb]
613 drink/S [noun]
614
615 Morphological analysis:
616
617
618 > drink
619 drink[verb]
620 drink[noun]
621 > drinks
622 drink[verb]+3SGV
623 drink[noun]+PL
624 > drinkable
625 drink[verb]+DER_V_ADJ_ABLE
626 > drinkables
627 drink[verb]+DER_V_ADJ_ABLE+PL
628 > undrinkable
629 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
630 > undrinkables
631 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
632 > undrink
633 Unknown word.
634 > undrinks
635 Unknown word.
636
Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is a single affix in morphology.
We also need the CIRCUMFIX option for correct morphological analysis.
641
642
643 # circumfixes: ~ obligate prefix/suffix combinations
644 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
645 # nagy, nagyobb, legnagyobb, legeslegnagyobb
646 # (great, greater, greatest, most greatest)
647
648 CIRCUMFIX X
649
650 PFX A Y 1
651 PFX A 0 leg/X .
652
653 PFX B Y 1
654 PFX B 0 legesleg/X .
655
656 SFX C Y 3
657 SFX C 0 obb . +COMPARATIVE
658 SFX C 0 obb/AX . +SUPERLATIVE
659 SFX C 0 obb/BX . +SUPERSUPERLATIVE
660
661 Dictionary:
662
663
664 1
665 nagy/C [MN]
666
667 Analysis:
668
669
670 > nagy
671 nagy[MN]
672 > nagyobb
673 nagy[MN]+COMPARATIVE
674 > legnagyobb
675 nagy[MN]+SUPERLATIVE
676 > legeslegnagyobb
677 nagy[MN]+SUPERSUPERLATIVE
678
Allowing free compounding decreases the precision of recognition, not
to mention stemming and morphological analysis. Although lexical
switches were introduced by Ispell to license compounding of bases,
this proves not to be restrictive enough. For example:
684
685
686 # affix file
687 COMPOUNDFLAG X
688
689 2
690 foo/X
691 bar/X
692
With this resource, foobar and barfoo are also accepted words.
694
695 This has been improved upon with the introduction of direction-sensi‐
696 tive compounding, i.e., lexical features can specify separately whether
697 a base can occur as leftmost or rightmost constituent in compounds.
698 This, however, is still insufficient to handle the intricate patterns
699 of compounding, not to mention idiosyncratic (and language specific)
700 norms of hyphenation.
701
The MySpell algorithm allows any affixed form of words that are
lexically marked as potential members of compounds. Hunspell improved
on this, and its recursive compound checking rules make it possible to
implement the intricate spelling conventions of Hungarian compounds.
For example, the noteworthy Hungarian `6--3' rule can be set with the
COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
options. As a further Hungarian example, derivational suffixes often
modify compounding properties. Hunspell allows compounding flags on
affixes, and there are two special flags (COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
derivations.
712
715 We also need several Hunspell features for handling German compounding:
716
717
718 # German compounding
719
720 # set language to handle special casing of German sharp s
721
722 LANG de_DE
723
724 # compound flags
725
726 COMPOUNDBEGIN U
727 COMPOUNDMIDDLE V
728 COMPOUNDEND W
729
730 # Prefixes are allowed at the beginning of compounds,
731 # suffixes are allowed at the end of compounds by default:
732 # (prefix)?(root)+(affix)?
733 # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
734 COMPOUNDPERMITFLAG P
735
736 # for German fogemorphemes (Fuge-element)
737 # Hint: ONLYINCOMPOUND is not required everywhere, but the
738 # checking will be a little faster with it.
739
740 ONLYINCOMPOUND X
741
742 # forbid uppercase characters at compound word bounds
743 CHECKCOMPOUNDCASE
744
745 # for handling Fuge-elements with dashes (Arbeits-)
746 # dash will be a special word
747
748 COMPOUNDMIN 1
749 WORDCHARS -
750
751 # compound settings and fogemorpheme for `Arbeit'
752
753 SFX A Y 3
754 SFX A 0 s/UPX .
755 SFX A 0 s/VPDX .
756 SFX A 0 0/WXD .
757
758 SFX B Y 2
759 SFX B 0 0/UPX .
760 SFX B 0 0/VWXDP .
761
762 # a suffix for `Computer'
763
764 SFX C Y 1
765 SFX C 0 n/WD .
766
767 # for forbid exceptions (*Arbeitsnehmer)
768
769 FORBIDDENWORD Z
770
771 # dash prefix for compounds with dash (Arbeits-Computer)
772
773 PFX - Y 1
774 PFX - 0 -/P .
775
776 # decapitalizing prefix
777 # circumfix for positioning in compounds
778
779 PFX D Y 29
780 PFX D A a/PX A
781 PFX D Ä ä/PX Ä
782 .
783 .
784 PFX D Y y/PX Y
785 PFX D Z z/PX Z
786
787 Example dictionary:
788
789
790 4
791 Arbeit/A-
792 Computer/BC-
793 -/W
794 Arbeitsnehmer/Z
795
Accepted compound words with the previous resource:
797
798
799 Computer
800 Computern
801 Arbeit
802 Arbeits-
803 Computerarbeit
804 Computerarbeits-
805 Arbeitscomputer
806 Arbeitscomputern
807 Computerarbeitscomputer
808 Computerarbeitscomputern
809 Arbeitscomputerarbeit
810 Computerarbeits-Computer
811 Computerarbeits-Computern
812
813 Not accepted compoundings:
814
815
816 computer
817 arbeit
818 Arbeits
819 arbeits
820 ComputerArbeit
821 ComputerArbeits
822 Arbeitcomputer
823 ArbeitsComputer
824 Computerarbeitcomputer
825 ComputerArbeitcomputer
826 ComputerArbeitscomputer
827 Arbeitscomputerarbeits
828 Computerarbeits-computer
829 Arbeitsnehmer
830
831 This solution is still not ideal, however, and will be replaced by a
832 pattern-based compound-checking algorithm which is closely integrated
833 with input buffer tokenization. Patterns describing compounds come as a
834 separate input resource that can refer to high-level properties of con‐
835 stituent parts (e.g. the number of syllables, affix flags, and contain‐
836 ment of hyphens). The patterns are matched against potential segmenta‐
837 tions of compounds to assess wellformedness.
838
839
841 Problems with the 8-bit encoding
842
Both Ispell and Myspell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a language
like Hungarian has a standard 8-bit character set (ISO 8859-2), it
fails to allow a full implementation of Hungarian orthographic
conventions. For instance, the '--' symbol (n-dash) is missing from
this character set, even though it is not only the official symbol to
delimit parenthetic clauses in the language, but can also appear in
compound words as a special 'big' hyphen.
851
MySpell has some 8-bit encoding tables, but there are also languages
without a standard 8-bit encoding. For example, many African languages
have non-Latin or extended Latin characters.
855
Similarly, using the original spelling of certain foreign names like
Ångström or Molière is encouraged by the Hungarian spelling norm, and,
since the characters 'Å' and 'è' are not part of ISO 8859-2, when they
combine with inflections containing characters found only in ISO 8859-2
(like elative -ből, ablative -től or delative -ről with double acute),
the result is words (like Ångströmről or Molière-től) that cannot be
encoded using any single 8-bit encoding scheme.
863
The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. Unfortunately, switching to
Unicode (e.g., UTF-16 encoding) would require a great deal of code
optimization and would have an impact on the efficiency of the
algorithm. The Dömölki algorithm used in checking affixing conditions
utilizes 256-byte character arrays, which would grow to 64k with
Unicode encoding. Since online affixing for a richly agglutinative
language can easily have several hundred such arrays (in the case of
the standard Hungarian resources we use, this number is ca. 300 or
more, since redundant storage of structurally identical affix patterns
improves efficiency), switching to Unicode would incur high resource
costs. Nonetheless, it is clear that trading efficiency for
encoding-independence has its advantages when it comes to a truly
multi-lingual application, so extending the architecture in this
direction was among our plans for a long while.
879
880 A hybrid solution
881
Recently we successfully implemented memory- and time-efficient
Unicode handling. In non-UTF-8 character encodings Hunspell works with
the original 8-bit algorithms, but with UTF-8 encoded dictionary and
affix files Hunspell uses hybrid string manipulation and condition
checking to support Unicode:
887
Affixes and words are stored in UTF-8 and handled mostly in UTF-8
during analysis; for condition checking and suggestion they are
converted to UTF-16.
891
The Dömölki algorithm is used for storing and checking 7-bit ASCII
(ISO 646) condition characters, and sorted UTF-16 lists are used for
the other Unicode characters of condition patterns.
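
The hybrid idea, a direct lookup table for 7-bit ASCII plus a sorted
list for everything else, can be sketched in Python as follows. This
only illustrates the data layout and is not Hunspell's code: the class
name is hypothetical, and Python code points stand in for UTF-16 code
units.

```python
from bisect import bisect_left


class CharSet:
    """Character set of a condition: ASCII table plus sorted list."""

    def __init__(self, chars):
        self.ascii_table = [False] * 128  # direct lookup for ASCII
        others = set()
        for ch in chars:
            code = ord(ch)
            if code < 128:
                self.ascii_table[code] = True
            else:
                others.add(code)
        self.sorted_others = sorted(others)  # binary search for the rest

    def __contains__(self, ch):
        code = ord(ch)
        if code < 128:
            return self.ascii_table[code]
        i = bisect_left(self.sorted_others, code)
        return i < len(self.sorted_others) and self.sorted_others[i] == code


vowels = CharSet("aeiouáéíóöőúüű")
print("a" in vowels, "ő" in vowels, "b" in vowels)
```

The table stays at a fixed small size per condition, while the sorted
list grows only with the non-ASCII characters that a condition actually
uses.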
895
So far, Hunspell supports only the first 65536 characters (Basic
Multilingual Plane) of the Unicode Standard.
898
899
901 hunspell (1), ispell (1), ispell (4)
902
903
904
905
906 2005-12-31 hunspell(4)