hunspell(5) File Formats Manual hunspell(5)

hunspell - format of Hunspell dictionaries and affix files
7
Hunspell(1) requires two files to define how a language is spell
checked: a dictionary file containing words and applicable flags, and
an affix file that specifies how these flags control spell checking.
A personal dictionary file is optional.
13
14
A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except for personal dictionaries) contains
the approximate word count (for optimal hash memory size). Each word
may optionally be followed by a slash ("/") and one or more flags,
which represent the word's attributes, for example affixes.

Note: Dictionary words can also contain slashes when escaped with the
"\/" syntax.
24
25
Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word separated by
a slash sets the affixation: the personal word accepts the same
affixes as that second word.
30
31
32 foo
33 Foo/Simpson
34 *bar
35
36 In this example, "foo" and "Foo" are personal words, plus Foo will be
37 recognized with affixes of Simpson (Foo's etc.) and bar is a forbidden
38 word.
39
40
42 Dictionary file:
43
44 3
45 hello
46 try/B
47 work/AB
48
49 The flags B and A specify attributes of these words.
50
51 Affix file:
52
53
54 SET UTF-8
55 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
56
57 REP 2
58 REP f ph
59 REP ph f
60
61 PFX A Y 1
62 PFX A 0 re .
63
64 SFX B Y 2
65 SFX B 0 ed [^y]
66 SFX B y ied y
67
In the affix file, prefix A and suffix B have been defined. Flag A
defines a `re-' prefix. Flag B defines two `-ed' suffix rules. The
first B rule adds `ed' to a word whose last character isn't `y'. The
second B rule applies to words ending in `y': it strips the `y' and
adds `ied'.
72
73
74 All accepted words with this dictionary and affix combination are:
75 "hello", "try", "tried", "work", "worked", "rework", "reworked".
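
The combination of the example dictionary and affix file can be
sketched in Python. This is only an illustration under simplifying
assumptions, not Hunspell's implementation: the rule tuples and the
helper names (apply_suffix, apply_prefix, generate) are hypothetical,
conditions are treated as ordinary regular expressions, and the cross
product is always allowed (both classes above use Y).

```python
import re

# (strip, add, condition) rules copied from the example affix file
PREFIXES = {"A": [("", "re", ".")]}
SUFFIXES = {"B": [("", "ed", "[^y]"), ("y", "ied", "y")]}
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_suffix(word, strip, add, cond):
    """Return the suffixed form, or None if the condition fails."""
    if not re.search(cond + "$", word):
        return None
    base = word[: len(word) - len(strip)] if strip else word
    return base + add


def apply_prefix(word, strip, add, cond):
    """Return the prefixed form, or None if the condition fails."""
    if not re.match(cond, word):
        return None
    return add + word[len(strip):]


def generate(dictionary):
    """Collect every word form licensed by the stems and their flags."""
    forms = set()
    for stem, flags in dictionary.items():
        # the stem itself plus every suffixed form
        stage = {stem}
        for flag in flags:
            for rule in SUFFIXES.get(flag, []):
                form = apply_suffix(stem, *rule)
                if form:
                    stage.add(form)
        forms |= stage
        # cross product: prefixes also apply to the suffixed forms
        for flag in flags:
            for rule in PREFIXES.get(flag, []):
                for form in stage:
                    prefixed = apply_prefix(form, *rule)
                    if prefixed:
                        forms.add(prefixed)
    return forms


print(sorted(generate(DICTIONARY)))
```

The generated set matches the seven accepted words listed above.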
76
77
The Hunspell source distribution contains more than 80 examples of
option usage.
81
82
83 SET encoding
84 Set character encoding of words and morphemes in affix and dic‐
85 tionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
86 ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, cp1251, ISCII-DEVANA‐
87 GARI.
88
89 SET UTF-8
90
91 FLAG value
Set flag type. The default type is the extended ASCII (8-bit)
character. The `UTF-8' parameter sets UTF-8 encoded Unicode
character flags. The `long' value sets the two-character (double
extended ASCII) flag type, and `num' sets the decimal number flag
type. Decimal flags are numbered from 1 to 65000 and are separated
by commas in flag fields. BUG: the UTF-8 flag type doesn't work on
the ARM platform.
99
100 FLAG long
101
102 COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping), e.g.
for morphologically complex languages with right-to-left writing
systems.
106
107
108 LANG langcode
109 Set language code for language specific functions of Hunspell.
110 Use it to activate special casing of Azeri (LANG az) and Turkish
111 (LANG tr).
112
113 IGNORE characters
Sets characters to ignore in dictionary words, affixes and input
words. Useful for optional characters, such as Arabic (harakat) or
Hebrew (niqqud) diacritical marks (see the tests/ignore.* test
dictionary in the Hunspell distribution).
118
119
120 AF number_of_flag_vector_aliases
121
122 AF flag_vector
123 Hunspell can substitute affix flag sets with ordinal numbers in
124 affix rules (alias compression, see makealias tool). First exam‐
125 ple with alias compression:
126
127 3
128 hello
129 try/1
130 work/2
131
132 AF definitions in the affix file:
133
134 AF 2
135 AF A
136 AF AB
137
It is equivalent to the following dic file:
139
140 3
141 hello
142 try/A
143 work/AB
144
145 See also tests/alias* examples of the source distribution.
146
Note I: If the affix file contains the FLAG parameter, define it before
the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.
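
The alias substitution can be sketched as follows. This is a minimal
illustration, assuming dictionary entries without morphological fields;
expand_aliases is a hypothetical helper name, not part of Hunspell.

```python
def expand_aliases(dic_lines, af_lines):
    """Replace numeric flag aliases in dictionary entries by flag sets."""
    # af_lines[0] is the "AF n" header; the aliases are numbered from 1
    aliases = [line.split()[1] for line in af_lines[1:]]
    expanded = []
    for entry in dic_lines:
        word, sep, flags = entry.partition("/")
        if sep and flags.isdigit():
            flags = aliases[int(flags) - 1]
        expanded.append(word + sep + flags)
    return expanded


print(expand_aliases(["3", "hello", "try/1", "work/2"],
                     ["AF 2", "AF A", "AF AB"]))
```

Applied to the compressed example above, the sketch yields the same
entries as the uncompressed dic file.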
152
153 AM number_of_morphological_aliases
154
155 AM morphological_fields
Hunspell can also substitute morphological data with ordinal
numbers in affix rules (alias compression). See tests/alias*
examples.
159
Suggestion parameters can tune Hunspell's default suggestions: n-gram
(a similarity search over the dictionary words based on shared
character sequences of length 1 to 4), character swap and character
deletion. REP is recommended for fixing typical and especially bad
language-specific mistakes, because REP suggestions have the highest
priority in the suggestion list. PHONE is for languages whose
orthography is not pronunciation-based.
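
The idea behind the n-gram similarity search can be sketched as below.
This is a simplified illustration only: ngram_score is a hypothetical
helper, and Hunspell's real scoring applies additional weighting that
is not described in this manual.

```python
def ngram_score(bad, candidate, max_n=4):
    """Count the 1..max_n character sequences of `bad` found in `candidate`."""
    score = 0
    for n in range(1, max_n + 1):
        for i in range(len(bad) - n + 1):
            if bad[i:i + n] in candidate:
                score += 1
    return score


def rank(bad, dictionary):
    """Order dictionary words from most to least similar to `bad`."""
    return sorted(dictionary, key=lambda w: ngram_score(bad, w), reverse=True)


print(rank("helo", ["world", "hello"]))
```

A word sharing many short sequences with the misspelling ranks above
one that shares only a few single characters.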
168
169 KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words with one different character
replaced by a neighbor KEY character. Characters that are not
neighbors are separated by vertical line characters in the KEY
string. Suggested KEY parameters for QWERTY and Dvorak keyboard
layouts:
174
175 KEY qwertyuiop|asdfghjkl|zxcvbnm
176 KEY pyfgcrl|aeouidhtns|qjkxbmwvz
177
178 Using the first QWERTY layout, Hunspell suggests "nude" and "node" for
179 "*nide". A character may have more neighbors, too:
180
181 KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
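
A sketch of the neighbor replacement, assuming that only directly
adjacent characters within one KEY group count as neighbors. The
helper names key_candidates and suggest are hypothetical.

```python
def key_candidates(word, key):
    """Replace each character by its left and right neighbor in KEY."""
    groups = key.split("|")
    candidates = []
    for i, ch in enumerate(word):
        for group in groups:
            for j, g in enumerate(group):
                if g != ch:
                    continue
                for k in (j - 1, j + 1):
                    if 0 <= k < len(group):
                        cand = word[:i] + group[k] + word[i + 1:]
                        if cand not in candidates:
                            candidates.append(cand)
    return candidates


def suggest(word, key, dictionary):
    """Keep only the candidates that are dictionary words."""
    return [c for c in key_candidates(word, key) if c in dictionary]


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
print(suggest("nide", QWERTY, {"nude", "node", "nine"}))
```

With the QWERTY string, `i' has the neighbors `u' and `o', which gives
the "nude" and "node" suggestions mentioned above.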
182
183 TRY characters
184 Hunspell can suggest right word forms, when they differ from the
185 bad input word by one TRY character. The parameter of TRY is
186 case sensitive.
187
188 NOSUGGEST flag
189 Words signed with NOSUGGEST flag are not suggested (but still
190 accepted when typed correctly). Proposed flag for vulgar and
191 obscene words (see also SUBSTANDARD).
192
193 MAXCPDSUGS num
194 Set max. number of suggested compound words generated by com‐
195 pound rules. The number of the suggested compound words may be
196 greater from the same 1-character distance type.
197
198 MAXNGRAMSUGS num
199 Set max. number of n-gram suggestions. Value 0 switches off the
200 n-gram suggestions (see also MAXDIFF).
201
202 MAXDIFF [0-10]
203 Set the similarity factor for the n-gram based suggestions (5 =
204 default value; 0 = fewer n-gram suggestions, but min. 1; 10 =
205 MAXNGRAMSUGS n-gram suggestions).
206
207 ONLYMAXDIFF
208 Remove all bad n-gram suggestions (default mode keeps one, see
209 MAXDIFF).
210
211 NOSPLITSUGS
212 Disable word suggestions with spaces.
213
214 SUGSWITHDOTS
215 Add dot(s) to suggestions, if input word terminates in dot(s).
216 (Not for LibreOffice dictionaries, because LibreOffice has an
217 automatic dot expansion mechanism.)
218
219 REP number_of_replacement_definitions
220
221 REP what replacement
This table specifies modifications to try first. The first REP
line is the header of the table and is followed by one or more
REP data lines. With this table, Hunspell can suggest the right
forms for typical spelling mistakes when the incorrect form
differs by more than 1 letter from the right form. The search
string supports the regex boundary signs (^ and $). For example,
a possible English replacement table definition to handle
misspelled consonants:
230
231 REP 5
232 REP f ph
233 REP ph f
234 REP tion$ shun
235 REP ^cooccurr co-occurr
236 REP ^alot$ a_lot
237
238 Note I: It's very useful to define replacements for the most typical
239 one-character mistakes, too: with REP you can add higher priority to a
240 subset of the TRY suggestions (suggestion list begins with the REP sug‐
241 gestions).
242
Note II: To suggest separated words, specify spaces with underscores:
244
245
246 REP 1
247 REP onetwothree one_two_three
248
249 Note III: Replacement table can be used for a stricter compound word
250 checking with the option CHECKCOMPOUNDREP.
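
How a replacement table produces candidates can be sketched as follows.
This is an illustration under assumptions: rep_candidates is a
hypothetical helper, each candidate applies one replacement at one
position, and the candidates would still have to be checked against
the dictionary.

```python
def rep_candidates(word, table):
    """Apply each REP rule once, at every position where it matches."""
    candidates = []
    for what, replacement in table:
        replacement = replacement.replace("_", " ")  # underscore = space
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        needle = what.lstrip("^").rstrip("$")
        start = word.find(needle)
        while start != -1:
            end = start + len(needle)
            if (not at_start or start == 0) and (not at_end or end == len(word)):
                cand = word[:start] + replacement + word[end:]
                if cand not in candidates:
                    candidates.append(cand)
            start = word.find(needle, start + 1)
    return candidates


TABLE = [("f", "ph"), ("ph", "f"), ("tion$", "shun"),
         ("^cooccurr", "co-occurr"), ("^alot$", "a_lot")]
print(rep_candidates("alot", TABLE))
print(rep_candidates("fone", TABLE))
```

The ^ and $ signs restrict a rule to the start or end of the word, so
the last rule applies to "alot" but not to "alots".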
251
252
253 MAP number_of_map_definitions
254
255 MAP string_of_related_chars_or_parenthesized_character_sequences
We can define language-dependent information on characters and
character sequences that should be considered related (i.e.
nearer than other chars not in the set) in the affix file (.aff)
by a map table. With this table, Hunspell can suggest the right
forms for words that use the wrong letter or letter group from a
related set more than once in a word (see REP).

For example, a possible mapping could be for the German umlauted
ü versus the regular u; the word Frühstück really should be
written with umlauted u's and not regular ones:
267
268 MAP 1
269 MAP uü
270
271 Use parenthesized groups for character sequences (eg. for composed Uni‐
272 code characters):
273
MAP 3
MAP ß(ss) (character sequence)
MAP ﬁ(fi) ("fi" compatibility characters for Unicode fi ligature)
MAP (ọ́)o (composed Unicode character: ó with bottom dot)
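
The effect of a map table can be sketched as generating every variant
of a word in which members of one related set are exchanged. The
helpers parse_map and variants are hypothetical names, and the sketch
handles a single MAP line at a time.

```python
import re


def parse_map(entry):
    """Split a MAP parameter into items: 'ß(ss)' -> ['ß', 'ss']."""
    return [seq or ch for seq, ch in re.findall(r"\(([^)]+)\)|(.)", entry)]


def variants(word, group):
    """All forms of `word` with items of `group` swapped for each other."""
    if not word:
        return {""}
    out = set()
    for item in group:
        if word.startswith(item):
            for alt in group:
                for tail in variants(word[len(item):], group):
                    out.add(alt + tail)
    # the first character may also stay as it is
    for tail in variants(word[1:], group):
        out.add(word[0] + tail)
    return out


print(sorted(variants("Fruhstuck", parse_map("uü"))))
print(sorted(variants("Strasse", parse_map("ß(ss)"))))
```

Because every occurrence is exchanged independently, a word with the
wrong letter in two places (Fruhstuck) still reaches the right form
(Frühstück), which a single-character replacement cannot do.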
278
279 PHONE number_of_phone_definitions
280
281 PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm
borrowed from Aspell. It is useful for languages whose
orthography is not pronunciation-based. You can add a full
alphabet conversion and other rules for conversion of special
letter sequences. For detailed documentation see
http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
UTF-8 characters do not yet work in bracket expressions. Dash
expressions operate on signed bytes, not on UTF-8 characters.
290
291 WARN flag
292 This flag is for rare words, which are also often spelling mis‐
293 takes, see option -r of command line Hunspell and FORBIDWARN.
294
295 FORBIDWARN
296 Words with flag WARN aren't accepted by the spell checker using
297 this parameter.
298
300 BREAK number_of_break_definitions
301
302 BREAK character_or_character_sequence
Define new break points for breaking words and checking word
parts separately. Use ^ and $ to delete characters at the start
and end of the word. Rationale: useful for compounding with
joining characters or strings (for example, hyphen in English and
German or hyphen and n-dash in Hungarian). Dashes are often bad
break points for tokenization, because compounds with dashes may
contain invalid parts, too. With BREAK, Hunspell can check
both sides of these compounds, breaking the words at dashes and
n-dashes:
312
313 BREAK 2
314 BREAK -
315 BREAK -- # n-dash
316
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:
320
321 BREAK 3
322 BREAK -
323 BREAK ^-
324 BREAK -$
325
326 Hunspell doesn't accept the "-word" and "word-" forms by this BREAK
327 definition:
328
329 BREAK 1
330 BREAK -
331
332 Switching off the default values:
333
334 BREAK 0
335
336 Note II: COMPOUNDRULE is better for handling dashes and other compound
337 joining characters or character strings. Use BREAK, if you want to
338 check words with dashes or other joining characters and there is no
339 time or possibility to describe precise compound rules with COM‐
340 POUNDRULE (COMPOUNDRULE handles only the suffixation of the last word
341 part of a compound word).
342
Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter, e.g. WORDCHARS --- (see the
tests/break.* example).
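
The recursive breaking described above can be sketched as follows. It
is a simplified model, not Hunspell's code: accepts is a hypothetical
helper, and the dictionary is a plain set of words.

```python
def accepts(word, words, breaks):
    """True if `word`, or all of its parts after breaking, are known."""
    if word in words:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # delete the characters at the start of the word
            lead = pattern[1:]
            if lead and word.startswith(lead) and accepts(word[len(lead):], words, breaks):
                return True
        elif pattern.endswith("$"):
            # delete the characters at the end of the word
            tail = pattern[:-1]
            if tail and word.endswith(tail) and accepts(word[:-len(tail)], words, breaks):
                return True
        else:
            # break inside the word and check both sides recursively
            pos = word.find(pattern, 1)
            while pos != -1 and pos + len(pattern) < len(word):
                left, right = word[:pos], word[pos + len(pattern):]
                if accepts(left, words, breaks) and accepts(right, words, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


WORDS = {"foo", "bar"}
print(accepts("foo-foo--bar-bar", WORDS, ["-", "--"]))
print(accepts("-foo", WORDS, ["-", "^-", "-$"]))
print(accepts("-foo", WORDS, ["-"]))
```

With the default definition the "-word" and "word-" forms pass, while
a table containing only the plain dash rejects them.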
346
347 COMPOUNDRULE number_of_compound_definitions
348
349 COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The
first COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags, parentheses, and the star and question mark meta
characters. A flag followed by a `*' matches a word sequence of
0 or more matches of words signed with this compound flag. A
flag followed by a `?' matches a word sequence of 0 or 1 matches
of a word signed with this compound flag. See tests/compound*.*
examples.
359
360 Note: en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
361 ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
362 1000122nd etc.).
363
364 Note II: In the case of long and numerical flag types use only
365 parenthesized flags: (1500)*(2000)?
366
367 Note III: COMPOUNDRULE flags work completely separately from the
368 compounding mechanisms using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
369 compound flags. (Use these flags on different entries for
370 words).
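
Because the pattern syntax is regex-like, a rule over single-character
flags can be sketched with an ordinary regular expression over flag
sequences. The lexicon, the rule strings and the helper names below are
made up for illustration; long and numerical (parenthesized) flags are
not handled.

```python
import re
from itertools import product


def segmentations(word, lexicon):
    """Yield every split of `word` into words of the lexicon."""
    if not word:
        yield []
    for i in range(1, len(word) + 1):
        if word[:i] in lexicon:
            for rest in segmentations(word[i:], lexicon):
                yield [word[:i]] + rest


def matches_rule(word, lexicon, rule):
    """True if some split of `word` carries flags matching `rule`."""
    pattern = re.compile(rule)  # e.g. "A*B?C" is already a valid regex
    for parts in segmentations(word, lexicon):
        if len(parts) < 2:
            continue
        # each part contributes one of its flags
        for flags in product(*(lexicon[p] for p in parts)):
            if pattern.fullmatch("".join(flags)):
                return True
    return False


LEXICON = {"a": "A", "b": "B", "c": "BC"}  # word -> its compound flags
print(matches_rule("abc", LEXICON, "ABC"))
print(matches_rule("aac", LEXICON, "A*B?C"))
print(matches_rule("ba", LEXICON, "ABC"))
```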
371
372
373 COMPOUNDMIN num
374 Minimum length of words used for compounding. Default value is
375 3 letters.
376
377 COMPOUNDFLAG flag
Words signed with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.
381
382 COMPOUNDBEGIN flag
383 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
384 first elements in compound words.
385
386 COMPOUNDLAST flag
387 Words signed with COMPOUNDLAST (or with a signed affix) may be
388 last elements in compound words.
389
390 COMPOUNDMIDDLE flag
391 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
392 middle elements in compound words.
393
394 ONLYINCOMPOUND flag
395 Suffixes signed with ONLYINCOMPOUND flag may be only inside of
396 compounds (Fuge-elements in German, fogemorphemes in Swedish).
397 ONLYINCOMPOUND flag works also with words (see tests/onlyincom‐
398 pound.*). Note: also valuable to flag compounding parts which
399 are not correct as a word by itself.
400
401 COMPOUNDPERMITFLAG flag
402 Prefixes are allowed at the beginning of compounds, suffixes are
403 allowed at the end of compounds by default. Affixes with COM‐
404 POUNDPERMITFLAG may be inside of compounds.
405
406 COMPOUNDFORBIDFLAG flag
407 Suffixes with this flag forbid compounding of the affixed word.
408
409 COMPOUNDMORESUFFIXES
410 Allow twofold suffixes within compounds.
411
412 COMPOUNDROOT flag
413 COMPOUNDROOT flag signs the compounds in the dictionary (Now it
414 is used only in the Hungarian language specific code).
415
416 COMPOUNDWORDMAX number
417 Set maximum word count in a compound word. (Default is unlim‐
418 ited.)
419
420 CHECKCOMPOUNDDUP
421 Forbid word duplication in compounds (e.g. foofoo).
422
423 CHECKCOMPOUNDREP
424 Forbid compounding, if the (usually bad) compound word may be a
425 non compound word with a REP fault. Useful for languages with
426 `compound friendly' orthography.
427
428 CHECKCOMPOUNDCASE
429 Forbid upper case characters at word boundaries in compounds.
430
431 CHECKCOMPOUNDTRIPLE
432 Forbid compounding, if compound word contains triple repeating
433 letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
434 ter support in UTF-8 encoding (works only for 7-bit ASCII char‐
435 acters).
436
437 SIMPLIFIEDTRIPLE
438 Allow simplified 2-letter forms of the compounds forbidden by
439 CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and
440 for the old German orthography: Schiff|fahrt -> Schiffahrt).
441
442 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
443
444 CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
445 Forbid compounding, if the first word in the compound ends with
446 endchars, and next word begins with beginchars and (optionally)
447 they have the requested flags. The optional replacement parame‐
448 ter allows simplified compound form.
449
450 The special "endchars" pattern 0 (zero) limits the rule to the
451 unmodified stems (stems and stems with zero affixes):
452
453 CHECKCOMPOUNDPATTERN 0/x /y
454
Note: COMPOUNDMIN doesn't work correctly with the compound word
alternation, so you may need to set COMPOUNDMIN to a lower value.
457
458 FORCEUCASE flag
A last word part of a compound with the FORCEUCASE flag forces
capitalization of the whole compound word. E.g. the Dutch word
"straat" (street) with the FORCEUCASE flag will be allowed only in
capitalized compound forms, according to the Dutch spelling rules
for proper names.
464
465 COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables allowed in a
compound when it contains more words than COMPOUNDWORDMAX. The
second parameter is the list of vowels (for counting syllables).
470
471 SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
473
475 PFX flag cross_product number
476
477 PFX flag stripping prefix [condition [morphological_fields...]]
478
479 SFX flag cross_product number
480
481 SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words
to make other words. We can define affix classes with an
arbitrary number of affix rules. Affix classes are signed with
affix flags. The first line of an affix class definition is the
header. The fields of an affix class header:
487
488 (0) Option name (PFX or SFX)
489
490 (1) Flag (name of the affix class)
491
492 (2) Cross product (permission to combine prefixes and suffixes).
493 Possible values: Y (yes) or N (no)
494
495 (3) Line count of the following rules.
496
Fields of an affix rule:
498
499 (0) Option name
500
501 (1) Flag
502
503 (2) stripping characters from beginning (at prefix rules) or end
504 (at suffix rules) of the word
505
506 (3) affix (optionally with flags of continuation classes, sepa‐
507 rated by a slash)
508
509 (4) condition.
510
Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (A dot matches an arbitrary character. Characters in
brackets match any single character from the listed subset. A
dash has no special meaning, but a circumflex (^) right after the
opening bracket selects the complement of the character set.)

(5) Optional morphological fields separated by spaces or tabs.
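
The rule fields listed above can be sketched as a small parser plus a
condition check. The helper names (parse_rule, condition_regex,
apply_rule) are hypothetical, the sketch expects rule lines rather
than class headers, and it relies on the condition to guarantee that
the stripping characters are present.

```python
import re


def parse_rule(line):
    """Split an affix rule line into its fields (0) to (5)."""
    fields = line.split()
    option, flag, strip, affix = fields[:4]
    affix, _, continuation = affix.partition("/")
    return {
        "option": option,
        "flag": flag,
        "strip": "" if strip == "0" else strip,
        "affix": "" if affix == "0" else affix,
        "continuation": continuation,
        "condition": fields[4] if len(fields) > 4 else ".",
        "morph": fields[5:],
    }


def condition_regex(rule):
    """Translate the simplified condition into a Python regex."""
    parts, in_brackets = [], False
    for ch in rule["condition"]:
        if ch == "[":
            in_brackets = True
            parts.append("[")
        elif ch == "]":
            in_brackets = False
            parts.append("]")
        elif ch == "^" and in_brackets and parts[-1] == "[":
            parts.append("^")          # complement of the set
        elif ch == "." and not in_brackets:
            parts.append(".")          # arbitrary character
        else:
            parts.append(re.escape(ch))  # dash etc. stay literal
    body = "".join(parts)
    return re.compile("^" + body if rule["option"] == "PFX" else body + "$")


def apply_rule(word, rule):
    """Return the affixed word, or None if the condition is not met."""
    if not condition_regex(rule).search(word):
        return None
    if rule["option"] == "PFX":
        return rule["affix"] + word[len(rule["strip"]):]
    return word[: len(word) - len(rule["strip"])] + rule["affix"]


print(apply_rule("try", parse_rule("SFX B y ied y")))
print(apply_rule("try", parse_rule("SFX B 0 ed [^y]")))
```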
521
522
524 CIRCUMFIX flag
525 Affixes signed with CIRCUMFIX flag may be on a word when this
526 word also has a prefix with CIRCUMFIX flag and vice versa (see
527 circumfix.* test files in the source distribution).
528
529 FORBIDDENWORD flag
This flag signs a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of the
accepted affixed and compound words. Note: useful to forbid
erroneous words generated by the compounding mechanism.
534
535 FULLSTRIP
With FULLSTRIP, affix rules can strip whole words, not only words
minus one character, before adding the affixes (see the
fullstrip.* test files in the source distribution). Note:
conditions may be as long as the word even without FULLSTRIP.
540
541 KEEPCASE flag
542 Forbid uppercased and capitalized forms of words signed with
543 KEEPCASE flags. Useful for special orthographies (measurements
544 and currency often keep their case in uppercased texts) and
545 writing systems (e.g. keeping lower case of IPA characters).
546 Also valuable for words erroneously written in the wrong case.
547
548 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
549 CASE flag may be capitalized and uppercased, but uppercased
550 forms of these words may not contain sharp s, only SS. See ger‐
551 mancompounding example in the tests directory of the Hunspell
552 distribution.
553
554
555 ICONV number_of_ICONV_definitions
556
557 ICONV pattern pattern2
558 Define input conversion table. Note: useful to convert one type
559 of quote to another one, or change ligature.
560
561 OCONV number_of_OCONV_definitions
562
563 OCONV pattern pattern2
564 Define output conversion table.
565
566 LEMMA_PRESENT flag
567 Deprecated. Use "st:" field instead of LEMMA_PRESENT.
568
569 NEEDAFFIX flag
This flag signs virtual stems in the dictionary, i.e. words that
are valid only when affixed, unless the dictionary word has a
homonym or a zero affix. NEEDAFFIX also works with prefixes and
prefix + suffix combinations (see tests/pseudoroot5.*).
574
575 PSEUDOROOT flag
576 Deprecated. (Former name of the NEEDAFFIX option.)
577
578 SUBSTANDARD flag
579 SUBSTANDARD flag signs affix rules and dictionary words (allo‐
580 morphs) not used in morphological generation (and in suggestion
581 in the future versions). See also NOSUGGEST.
582
583 WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot,
dash, n-dash, numbers and the percent sign are word characters in
Hungarian.
587
588 CHECKSHARPS
589 SS letter pair in uppercased (German) words may be upper case
590 sharp s (ß). Hunspell can handle this special casing with the
591 CHECKSHARPS declaration (see also KEEPCASE flag and tests/ger‐
592 mancompounding example) in both spelling and suggestion.
593
594
596 Hunspell's dictionary items and affix rules may have optional space or
597 tabulator separated morphological description fields, started with
598 3-character (two letters and a colon) field IDs:
599
600
601 word/flags po:noun is:nom
602
Example: We define a simple resource with morphological information, a
derivative suffix (ds:) and a part of speech category (po:):
605
606 Affix file:
607
608
609 SFX X Y 1
610 SFX X 0 able . ds:able
611
612 Dictionary file:
613
614
615 drink/X po:verb
616
617 Test file:
618
619
620 drink
621 drinkable
622
623 Test:
624
625
626 $ analyze test.aff test.dic test.txt
627 > drink
628 analyze(drink) = po:verb
629 stem(drink) = po:verb
630 > drinkable
631 analyze(drinkable) = po:verb ds:able
632 stem(drinkable) = drinkable
633
634 You can see in the example, that the analyzer concatenates the morpho‐
635 logical fields in item and arrangement style.
636
637
639 Default morphological and other IDs (used in suggestion, stemming and
640 morphological generation):
641
642 ph: Alternative transliteration for better suggestion. It's useful
643 for words with foreign pronunciation. (Dictionary based phonetic
644 suggestion.) For example:
645
646
647 Marseille ph:maarsayl
648
649 st: Stem. Optional: default stem is the dictionary item in morpho‐
650 logical analysis. Stem field is useful for virtual stems (dic‐
651 tionary words with NEEDAFFIX flag) and morphological exceptions
652 instead of new, single used morphological rules.
653
654 feet st:foot is:plural
655 mice st:mouse is:plural
656 teeth st:tooth is:plural
657
658 Word forms with multiple stems need multiple dictionary items:
659
660
661 lay po:verb st:lie is:past_2
662 lay po:verb is:present
663 lay po:noun
664
665 al: Allomorph(s). A dictionary item is the stem of its allomorphs.
666 Morphological generation needs stem, allomorph and affix fields.
667
668 sing al:sang al:sung
669 sang st:sing
670 sung st:sing
671
672 po: Part of speech category.
673
674 ds: Derivational suffix(es). Stemming doesn't remove derivational
675 suffixes. Morphological generation depends on the order of the
676 suffix fields.
677
678 In affix rules:
679
680
681 SFX Y Y 1
682 SFX Y 0 ly . ds:ly_adj
683
684 In the dictionary:
685
686
687 ably st:able ds:ly_adj
688 able al:ably
689
690 is: Inflectional suffix(es). All inflectional suffixes are removed
691 by stemming. Morphological generation depends on the order of
692 the suffix fields.
693
694
695 feet st:foot is:plural
696
697 ts: Terminal suffix(es). Terminal suffix fields are inflectional
698 suffix fields "removed" by additional (not terminal) suffixes.
699
700 Useful for zero morphemes and affixes removed by splitting
701 rules.
702
703
704 work/D ts:present
705
706 SFX D Y 2
707 SFX D 0 ed . is:past_1
708 SFX D 0 ed . is:past_2
709
710 Typical example of the terminal suffix is the zero morpheme of the nom‐
711 inative case.
712
713
714 sp: Surface prefix. Temporary solution for adding prefixes to the
715 stems and generated word forms. See tests/morph.* example.
716
717
718 pa: Parts of the compound words. Output fields of morphological
719 analysis for stemming.
720
721 dp: Planned: derivational prefix.
722
723 ip: Planned: inflectional prefix.
724
725 tp: Planned: terminal prefix.
726
727
Ispell's original algorithm strips only one suffix. Hunspell can strip
one more suffix (or one more prefix in COMPLEXPREFIXES mode).

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterize agglutinative languages.
735
736 A second `s' suffix (affix class Y) will be the continuation class of
737 the suffix `able' in the following example:
738
739
740 SFX Y Y 1
741 SFX Y 0 s .
742
743 SFX X Y 1
744 SFX X 0 able/Y .
745
746 Dictionary file:
747
748
749 drink/X
750
751 Test file:
752
753
754 drink
755 drinkable
756 drinkables
757
758 Test:
759
760
761 $ hunspell -m -d test <test.txt
762 drink st:drink
763 drinkable st:drink fl:X
764 drinkables st:drink fl:X fl:Y
765
In theory, twofold suffix stripping needs only the square root of the
number of suffix rules that a single-level implementation would need.
In our practice, twofold suffix stripping was sufficient to describe
the Hungarian inflectional morphology.
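
The continuation class mechanism of the example can be sketched as a
two-level expansion. The SUFFIXES table mirrors the affix file above;
the expand helper is a hypothetical name and ignores conditions and
stripping, which the example does not use.

```python
# flag -> list of (suffix, continuation flags), from the example above
SUFFIXES = {
    "Y": [("s", "")],       # SFX Y 0 s .
    "X": [("able", "Y")],   # SFX X 0 able/Y .
}


def expand(stem, flags, depth=2):
    """Generate the forms reachable with at most `depth` suffixes."""
    forms = {stem}
    if depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation in SUFFIXES.get(flag, []):
            forms |= expand(stem + suffix, continuation, depth - 1)
    return forms


print(sorted(expand("drink", "X")))
```

With n first-level and n second-level rules, such an expansion can
cover up to n * n suffix combinations, which is where the square root
estimate comes from.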
770
771
Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for giving flags in affix and dictionary files.

The FLAG long command sets 2-character flags:
777
778
779 FLAG long
780 SFX Y1 Y 1
781 SFX Y1 0 s 1
782
783 Dictionary record with the Y1, Z3, F? flags:
784
785
786 foo/Y1Z3F?
787
The FLAG num command sets numerical flags separated by commas:
789
790
791 FLAG num
792 SFX 65000 Y 1
793 SFX 65000 0 s 1
794
795 Dictionary example:
796
797
798 foo/65000,12,2756
799
The third syntax is Unicode character flags (FLAG UTF-8).
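
Splitting a flag field according to the flag type can be sketched as
follows. parse_flags and its flag_type values are hypothetical names
chosen for this illustration.

```python
def parse_flags(field, flag_type="char"):
    """Split the flag field of a dictionary entry into single flags."""
    if flag_type == "long":
        # two characters per flag
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        # decimal numbers separated by commas
        return [int(n) for n in field.split(",")]
    # default 8-bit flags and UTF-8 flags: one character per flag
    return list(field)


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
print(parse_flags("AB"))
```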
801
802
804 Hunspell's dictionary can contain repeating elements that are homonyms:
805
806
807 work/A po:verb
808 work/B po:noun
809
810 An affix file:
811
812
SFX A Y 1
SFX A 0 s . is:sg3

SFX B Y 1
SFX B 0 s . is:plur
818
819 Test file:
820
821
822 works
823
824 Test:
825
826
827 $ hunspell -d test -m <testwords
828 work st:work po:verb is:sg3
829 work st:work po:noun is:plur
830
831 This feature also gives a way to forbid illegal prefix/suffix combina‐
832 tions.
833
834
An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to render lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier Hunspell code. This
may have been legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay or drinkable,
undrinkable but *undrink). Put simply, the prefix un- is legitimate
here only if the base drink is suffixed with -able. If both of these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.
853
In the next example, suffix class R has a prefix `continuation' class
(class P).
856
857
858 PFX P Y 1
859 PFX P 0 un . [prefix_un]+
860
861 SFX S Y 1
862 SFX S 0 s . +PL
863
864 SFX Q Y 1
865 SFX Q 0 s . +3SGV
866
867 SFX R Y 1
868 SFX R 0 able/PS . +DER_V_ADJ_ABLE
869
870 Dictionary:
871
872
873 2
874 drink/RQ [verb]
875 drink/S [noun]
876
877 Morphological analysis:
878
879
880 > drink
881 drink[verb]
882 drink[noun]
883 > drinks
884 drink[verb]+3SGV
885 drink[noun]+PL
886 > drinkable
887 drink[verb]+DER_V_ADJ_ABLE
888 > drinkables
889 drink[verb]+DER_V_ADJ_ABLE+PL
890 > undrinkable
891 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
892 > undrinkables
893 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
894 > undrink
895 Unknown word.
896 > undrinks
897 Unknown word.
898
Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is a single affix in morphology.
We also need the CIRCUMFIX option for correct morphological analysis.
903
904
905 # circumfixes: ~ obligate prefix/suffix combinations
906 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
907 # nagy, nagyobb, legnagyobb, legeslegnagyobb
908 # (great, greater, greatest, most greatest)
909
910 CIRCUMFIX X
911
912 PFX A Y 1
913 PFX A 0 leg/X .
914
915 PFX B Y 1
916 PFX B 0 legesleg/X .
917
918 SFX C Y 3
919 SFX C 0 obb . +COMPARATIVE
920 SFX C 0 obb/AX . +SUPERLATIVE
921 SFX C 0 obb/BX . +SUPERSUPERLATIVE
922
923 Dictionary:
924
925
926 1
927 nagy/C [MN]
928
929 Analysis:
930
931
932 > nagy
933 nagy[MN]
934 > nagyobb
935 nagy[MN]+COMPARATIVE
936 > legnagyobb
937 nagy[MN]+SUPERLATIVE
938 > legeslegnagyobb
939 nagy[MN]+SUPERSUPERLATIVE
940
Allowing free compounding yields a decrease in the precision of
recognition, not to mention stemming and morphological analysis.
Although lexical switches were introduced in Ispell to license
compounding of bases, this proves not to be restrictive enough. For
example:
946
947
948 # affix file
949 COMPOUNDFLAG X
950
951 2
952 foo/X
953 bar/X
954
With this resource, foobar and barfoo are also accepted words.
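
Free compounding with COMPOUNDFLAG can be sketched as a recursive split
into flagged words. The is_compound helper and its keyword arguments
are hypothetical; they loosely model COMPOUNDMIN, COMPOUNDWORDMAX and
CHECKCOMPOUNDDUP and ignore affixes.

```python
def is_compound(word, lexicon, compound_min=3, word_max=None, check_dup=False):
    """True if `word` splits into two or more words of the lexicon."""

    def split(rest, parts):
        if not rest:
            return len(parts) >= 2
        if word_max is not None and len(parts) >= word_max:
            return False
        for i in range(compound_min, len(rest) + 1):
            part = rest[:i]
            if part not in lexicon:
                continue
            if check_dup and parts and parts[-1] == part:
                continue  # forbid direct duplication such as foofoo
            if split(rest[i:], parts + [part]):
                return True
        return False

    return split(word, [])


FLAGGED = {"foo", "bar"}  # words carrying the compound flag X
print(is_compound("foobar", FLAGGED))
print(is_compound("foofoo", FLAGGED, check_dup=True))
```

Without further restrictions every ordering and repetition of the
flagged words is accepted, which is why the options above exist.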
956
957 This has been improved upon with the introduction of direction-sensi‐
958 tive compounding, i.e., lexical features can specify separately whether
959 a base can occur as leftmost or rightmost constituent in compounds.
960 This, however, is still insufficient to handle the intricate patterns
961 of compounding, not to mention idiosyncratic (and language specific)
962 norms of hyphenation.
963
The Myspell algorithm allows any affixed form of words that are
lexically marked as potential members of compounds. Hunspell improved
on this, and its recursive compound checking rules make it possible to
implement the intricate spelling conventions of Hungarian compounds.
For example, the noteworthy Hungarian `6-3' rule can be implemented
with the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
SYLLABLENUM options. As a further example from Hungarian, derivational
suffixes often modify compounding properties. Hunspell allows
compounding flags on affixes, and there are two special flags
(COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
compounding of the derivations.
974

We also need several Hunspell features for handling German compounding:

# German compounding

# set language to handle special casing of German sharp s

LANG de_DE

# compound flags

COMPOUNDBEGIN U
COMPOUNDMIDDLE V
COMPOUNDEND W

# Prefixes are allowed at the beginning of compounds,
# suffixes are allowed at the end of compounds by default:
# (prefix)?(root)+(affix)?
# Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
COMPOUNDPERMITFLAG P

# for German linking morphemes (Fuge-elements)
# Hint: ONLYINCOMPOUND is not required everywhere, but the
# checking will be a little faster with it.

ONLYINCOMPOUND X

# forbid uppercase characters at compound word bounds
CHECKCOMPOUNDCASE

# for handling Fuge-elements with dashes (Arbeits-)
# dash will be a special word

COMPOUNDMIN 1
WORDCHARS -

# compound settings and linking morpheme for `Arbeit'

SFX A Y 3
SFX A 0 s/UPX .
SFX A 0 s/VPDX .
SFX A 0 0/WXD .

SFX B Y 2
SFX B 0 0/UPX .
SFX B 0 0/VWXDP .

# a suffix for `Computer'

SFX C Y 1
SFX C 0 n/WD .

# for forbidding exceptions (*Arbeitsnehmer)

FORBIDDENWORD Z

# dash prefix for compounds with dash (Arbeits-Computer)

PFX - Y 1
PFX - 0 -/P .

# decapitalizing prefix
# circumfix for positioning in compounds

PFX D Y 29
PFX D A a/PX A
PFX D Ä ä/PX Ä
.
.
PFX D Y y/PX Y
PFX D Z z/PX Z

Example dictionary:

4
Arbeit/A-
Computer/BC-
-/W
Arbeitsnehmer/Z

Accepted compound words with the previous resource:

Computer
Computern
Arbeit
Arbeits-
Computerarbeit
Computerarbeits-
Arbeitscomputer
Arbeitscomputern
Computerarbeitscomputer
Computerarbeitscomputern
Arbeitscomputerarbeit
Computerarbeits-Computer
Computerarbeits-Computern

Not accepted compounds:

computer
arbeit
Arbeits
arbeits
ComputerArbeit
ComputerArbeits
Arbeitcomputer
ArbeitsComputer
Computerarbeitcomputer
ComputerArbeitcomputer
ComputerArbeitscomputer
Arbeitscomputerarbeits
Computerarbeits-computer
Arbeitsnehmer
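
The role of CHECKCOMPOUNDCASE in these two lists can be illustrated
with a small Python sketch. It is a simplification for illustration
only, not Hunspell's implementation: the function name
violates_compound_case is hypothetical, the caller is assumed to supply
one candidate segmentation with non-empty parts, and the exemption for
boundaries next to a dash is an assumption made to match the accepted
form Computerarbeits-Computer.

```python
# Simplified illustration of the CHECKCOMPOUNDCASE idea (an assumption,
# not Hunspell source code): reject a segmentation when an upper-case
# letter touches an inner word boundary, unless a dash is at the boundary.

def violates_compound_case(parts):
    """Return True if an upper-case letter touches an inner boundary.

    `parts` is one proposed segmentation of a compound, for example
    ["Computer", "Arbeit"]. Every part is assumed to be non-empty.
    """
    for left, right in zip(parts, parts[1:]):
        a, b = left[-1], right[0]  # characters on both sides of the boundary
        if "-" in (a, b):
            continue  # boundaries next to a dash are exempt
        if a.isupper() or b.isupper():
            return True
    return False


if __name__ == "__main__":
    for parts in (["Computer", "arbeit"],
                  ["Computer", "Arbeit"],
                  ["Computer", "arbeits", "-Computer"]):
        print("".join(parts), violates_compound_case(parts))
```

Under these assumptions ComputerArbeit is rejected because `A' follows
a boundary, while Computerarbeits-Computer passes because the capital
letter is separated from the boundary by the dash prefix.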

This solution is still not ideal, however, and will be replaced by a
pattern-based compound-checking algorithm which is closely integrated
with input buffer tokenization. Patterns describing compounds come as a
separate input resource that can refer to high-level properties of
constituent parts (e.g. the number of syllables, affix flags, and
containment of hyphens). The patterns are matched against potential
segmentations of compounds to assess well-formedness.


Both Ispell and MySpell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a language
like Hungarian has a standard 8-bit character set (ISO 8859-2), that
set fails to allow a full implementation of Hungarian orthographic
conventions. For instance, the '–' symbol (en dash) is missing from
this character set, despite the fact that it is not only the official
symbol for delimiting parenthetic clauses in the language, but can also
appear in compound words as a special 'big' hyphen.

MySpell has some 8-bit encoding tables, but there are also languages
without a standard 8-bit encoding. For example, many African languages
use non-Latin or extended Latin characters.

Similarly, the Hungarian spelling norm encourages using the original
spelling of certain foreign names like Ångström or Molière. Since the
characters 'Å' and 'è' are not part of ISO 8859-2, combining such names
with inflections whose characters occur only in ISO 8859-2 (like
elative -ből, ablative -től or delative -ről, with double acute)
results in words (like Ångströmről or Molière-től) that cannot be
encoded using any single 8-bit encoding scheme.
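
The encoding problem can be checked directly. The following is a
minimal Python sketch, not part of Hunspell: it relies on Python's
standard codec names iso8859_1, iso8859_2 and utf_8, and the helper
names encodable and encoding_table are hypothetical. It reports which
of the three encodings can represent each example word.

```python
# Illustrative sketch (not Hunspell code): test which encodings can
# represent the example words from the text.

def encodable(word, encoding):
    """Return True if every character of `word` exists in `encoding`."""
    try:
        word.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


WORDS = ["Ångströmről", "Molière-től"]
ENCODINGS = ["iso8859_1", "iso8859_2", "utf_8"]


def encoding_table(words=WORDS, encodings=ENCODINGS):
    """Map each word to the subset of `encodings` that can encode it."""
    return {w: [e for e in encodings if encodable(w, e)] for w in words}


if __name__ == "__main__":
    for word, usable in encoding_table().items():
        print(word, "->", ", ".join(usable))
```

For both words only utf_8 remains: ISO 8859-1 lacks the double acute
letters, and ISO 8859-2 lacks 'Å' and 'è'.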

The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding independence has its advantages when it comes
to a truly multilingual application. Hunspell implements memory- and
time-efficient Unicode handling. With non-UTF-8 character encodings,
Hunspell works with the original 8-bit strings. With UTF-8 encoding,
affixes and words are stored in UTF-8 and are handled mostly in UTF-8
during analysis, but are converted to UTF-16 for condition checking
and suggestion. Unicode text analysis and spell checking have a minimal
(0-20%) time overhead and a minimal or reasonable memory overhead,
depending on the language (its UTF-8 encoding and affixation).
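
Why the memory overhead depends on the language can be seen from raw
string sizes. The Python sketch below is only an illustration: the
sample words are arbitrary, the function name encoded_sizes is
hypothetical, and the numbers describe encoded strings, not Hunspell's
actual memory use.

```python
# Illustrative sketch: byte length of a word in an 8-bit encoding,
# in UTF-8 and in UTF-16 (without a byte order mark).

def encoded_sizes(word, eight_bit="iso8859_2"):
    """Return the (8-bit, UTF-8, UTF-16) byte lengths of `word`."""
    return (
        len(word.encode(eight_bit)),
        len(word.encode("utf_8")),
        len(word.encode("utf_16_le")),  # the "_le" codec adds no BOM
    )


if __name__ == "__main__":
    for word in ["hello", "árvíztűrő"]:
        print(word, encoded_sizes(word))
```

A pure ASCII word such as "hello" takes the same space in UTF-8 as in
an 8-bit encoding, while the Hungarian sample grows from 9 to 13 bytes
because each of its four accented letters needs two bytes.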


Aspell dictionaries can be easily converted into Hunspell dictionaries.
Conversion steps:

dictionary (xx.cwl -> xx.wl):

preunzip xx.cwl
wc -l < xx.wl > xx.dic
cat xx.wl >> xx.dic

affix file

If the affix file exists, copy it:
cp xx_affix.dat xx.aff
If not, create it with the suitable character encoding (see xx.dat):
echo "SET ISO8859-x" > xx.aff
or
echo "SET UTF-8" > xx.aff

It's useful to add a TRY option listing the characters of the
dictionary in frequency order, which guides edit-distance suggestions:
echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
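
The same steps can also be scripted. The Python sketch below mirrors
the shell commands above under some assumptions: the word list has
already been decompressed (for example with preunzip) and is passed in
as a list of lines, any text after a slash on a line is treated as
flags and ignored when counting characters, and the function names
make_dic, try_chars and make_aff are hypothetical, not part of Hunspell
or Aspell. Instead of a fixed keyboard order, it derives the TRY string
from the character frequencies of the word list itself.

```python
from collections import Counter


def make_dic(words):
    """Return .dic content: the word count on the first line, then the words."""
    words = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(words))] + words) + "\n"


def try_chars(words):
    """Return the characters used in `words`, most frequent first."""
    counts = Counter(
        ch
        for w in words
        for ch in w.strip().split("/")[0]  # assumption: flags follow a slash
    )
    # Sort by descending frequency; break ties by character for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "".join(ch for ch, _ in ranked)


def make_aff(words, encoding="UTF-8"):
    """Return minimal .aff content consisting of a SET and a TRY line."""
    return "SET {}\nTRY {}\n".format(encoding, try_chars(words))


if __name__ == "__main__":
    sample = ["hello", "try", "work"]
    print(make_dic(sample), end="")
    print(make_aff(sample), end="")
```

Writing the two returned strings to xx.dic and xx.aff, using the
encoding named in the SET line, gives the same kind of result as the
shell commands.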


hunspell(1), ispell(1), ispell(4)

2014-05-26 hunspell(5)