hunspell(5) File Formats Manual hunspell(5)
2
3
4
6 hunspell - format of Hunspell dictionaries and affix files
7
Hunspell(1) requires two files to define how a language is spell
checked: a dictionary file containing words and their applicable
flags, and an affix file that specifies how these flags control
spell checking. A personal dictionary file is optional.
13
14
A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except a personal dictionary) contains
the approximate word count, used to size the hash table. Each word
may optionally be followed by a slash ("/") and one or more flags,
which represent word attributes such as affixes.
21
Note: Dictionary words can also contain slashes when they are escaped
as "\/".
24
It is worth adding not only words but also word pairs to the
dictionary to get correct suggestions for common misspellings with a
missing space, as in the following example for the misspellings "alot"
and "inspite" (see also "REP" and the "ph:" field for other ways to
get correct suggestions for common misspellings):
30
31
32 3
33 word
34 a lot
35 in spite
36
Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word, separated
from the first by a slash, is a model word whose affixation the first
word follows.
41
42
43 foo
44 Foo/Simpson
45 *bar
46
In this example, "foo" and "Foo" are personal words, "Foo" is also
recognized with the affixes of "Simpson" ("Foo's" etc.), and "bar" is
a forbidden word.
50
51
53 Dictionary file:
54
55 3
56 hello
57 try/B
58 work/AB
59
60 The flags B and A specify attributes of these words.
61
62 Affix file:
63
64
65 SET UTF-8
66 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
67
68 REP 2
69 REP f ph
70 REP ph f
71
72 PFX A Y 1
73 PFX A 0 re .
74
75 SFX B Y 2
76 SFX B 0 ed [^y]
77 SFX B y ied y
78
In the affix file, prefix class A and suffix class B have been
defined. Class A defines a `re-' prefix. Class B defines two `-ed'
suffixes. The first B suffix can be added to a word if its last
character isn't `y'. The second B suffix applies to words ending in
`y': it strips the `y' and adds `-ied'.
83
84 All accepted words with this dictionary and affix combination are:
85 "hello", "try", "tried", "work", "worked", "rework", "reworked".
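The expansion above can be reproduced with a small sketch. This is not Hunspell's implementation: the `Rule` class and the `apply_rule` and `expand` helpers are hypothetical names, the rules are transcribed by hand from the example affix file, and conditions are simply handed to Python's `re` module.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    kind: str   # "PFX" or "SFX"
    strip: str  # characters to remove ("" stands for the 0 in the affix file)
    add: str    # affix to attach
    cond: str   # condition; "." matches any character


# Rules transcribed by hand from the example affix file. Both classes use
# cross product Y, so the prefix may combine with suffixed forms.
RULES = {
    "A": [Rule("PFX", "", "re", ".")],
    "B": [Rule("SFX", "", "ed", "[^y]"), Rule("SFX", "y", "ied", "y")],
}


def apply_rule(word, rule):
    """Return the affixed form, or None when the rule does not apply."""
    if rule.kind == "SFX":
        if not word.endswith(rule.strip):
            return None
        if not re.search(f"(?:{rule.cond})$", word):
            return None
        return word[: len(word) - len(rule.strip)] + rule.add
    if not word.startswith(rule.strip) or not re.match(rule.cond, word):
        return None
    return rule.add + word[len(rule.strip):]


def expand(entry):
    """List the word forms licensed by one dictionary entry such as 'work/AB'."""
    word, _, flags = entry.partition("/")
    rules = [rule for flag in flags for rule in RULES[flag]]
    suffixed = []
    for rule in rules:
        if rule.kind == "SFX":
            form = apply_rule(word, rule)
            if form is not None:
                suffixed.append(form)
    forms = [word] + suffixed
    for rule in rules:
        if rule.kind == "PFX":
            for base in [word] + suffixed:
                form = apply_rule(base, rule)
                if form is not None:
                    forms.append(form)
    return forms


accepted = [form for entry in ["hello", "try/B", "work/AB"] for form in expand(entry)]
print(accepted)
# ['hello', 'try', 'tried', 'work', 'worked', 'rework', 'reworked']
```

The sketch generates forms from stems, whereas a spell checker works in the opposite direction and strips affixes from the input word.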
86
87
89 Hunspell source distribution contains more than 80 examples for option
90 usage.
91
92
93 SET encoding
94 Set character encoding of words and morphemes in affix and dic‐
95 tionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
96 ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, cp1251, ISCII-DEVANA‐
97 GARI.
98
99 SET UTF-8
100
101 FLAG value
Set flag type. The default type is a single extended ASCII (8-bit)
character. The `UTF-8' value sets UTF-8 encoded Unicode character
flags. The `long' value sets flags of two extended ASCII
characters, and the `num' value sets decimal number flags.
Decimal flags are numbered from 1 to 65000 and are separated by
commas in flag fields.
108
109 FLAG long
110
111 COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping), e.g.
for morphologically complex languages with right-to-left writing
systems.
115
116
117 LANG langcode
Set language code for language-specific functions of Hunspell.
Use it to activate special casing of Azeri (LANG az), Turkish
(LANG tr) and Crimean Tatar (LANG crh), and the Hungarian-specific
(not generalized) syllable-counting compounding rules (LANG hu).
122
123
124 IGNORE characters
Set characters to ignore in dictionary words, affixes and input
words. Useful for optional characters, such as Arabic (harakat) or
Hebrew (niqqud) diacritical marks (see the tests/ignore.* test
dictionary in the Hunspell distribution).
129
130
131 AF number_of_flag_vector_aliases
132
133 AF flag_vector
Hunspell can substitute affix flag sets with ordinal numbers in
affix rules (alias compression, see the makealias tool). The first
example, rewritten with alias compression:
137
138 3
139 hello
140 try/1
141 work/2
142
143 AF definitions in the affix file:
144
145 AF 2
146 AF A
147 AF AB
148
It is equivalent to the following dic file:
150
151 3
152 hello
153 try/A
154 work/AB
155
156 See also tests/alias* examples of the source distribution.
157
Note I: If the affix file contains the FLAG parameter, define it before
the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.
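As an illustration only, alias expansion amounts to a 1-based lookup into the list of AF lines. The `expand_aliases` helper below is a hypothetical name, and the sketch ignores escaped slashes and morphological fields.

```python
def expand_aliases(dic_lines, af_table):
    """Replace numeric flag aliases with the flag vectors they stand for."""
    expanded = []
    for line in dic_lines:
        word, sep, alias = line.partition("/")
        if sep and alias.isdigit():
            # Aliases are ordinal numbers: 1 refers to the first AF line.
            line = f"{word}/{af_table[int(alias) - 1]}"
        expanded.append(line)
    return expanded


AF = ["A", "AB"]  # the two AF lines that follow the "AF 2" header
print(expand_aliases(["hello", "try/1", "work/2"], AF))
# ['hello', 'try/A', 'work/AB']
```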
163
164 AM number_of_morphological_aliases
165
166 AM morphological_fields
Hunspell can also substitute morphological data with ordinal
numbers in affix rules (alias compression). See the tests/alias*
examples.
170
Suggestion parameters can tune the default suggestions of Hunspell:
n-gram suggestions (a similarity search among the dictionary words
based on shared character sequences of length 1 to 4), character
swaps and deletions. REP is recommended for fixing typical,
language-specific misspellings, especially the severe ones, because
REP suggestions have the highest priority in the suggestion list.
PHONE is for languages whose orthography is not pronunciation based.
179
180 For short common misspellings, it's important to use the ph: field (see
181 later) to give the best suggestions.
182
183 KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words that differ from the input
by one character replaced with a neighboring KEY character.
Groups of characters that are not neighbors are separated by
vertical line characters in the KEY string. Suggested KEY
parameters for QWERTY and Dvorak keyboard layouts:
188
189 KEY qwertyuiop|asdfghjkl|zxcvbnm
190 KEY pyfgcrl|aeouidhtns|qjkxbmwvz
191
Using the first (QWERTY) layout, Hunspell suggests "nude" and "node"
for "*nide". A character may have more than two neighbors, too:
194
195 KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
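A minimal sketch of the neighbor search, assuming that characters adjacent within one `|`-separated group count as neighbors. The helper names and the tiny dictionary are hypothetical, and Hunspell's real suggestion code also ranks and limits candidates.

```python
def key_neighbors(key):
    """Map each character to its neighbors within its KEY group."""
    neighbors = {}
    for group in key.split("|"):
        for i, ch in enumerate(group):
            adjacent = neighbors.setdefault(ch, [])
            if i > 0:
                adjacent.append(group[i - 1])
            if i + 1 < len(group):
                adjacent.append(group[i + 1])
    return neighbors


def key_suggestions(word, key, dictionary):
    """Replace one character at a time with a neighbor and keep known words."""
    neighbors = key_neighbors(key)
    found = []
    for i, ch in enumerate(word):
        for replacement in neighbors.get(ch, []):
            candidate = word[:i] + replacement + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
print(key_suggestions("nide", QWERTY, {"nude", "node", "nice"}))
# ['nude', 'node']
```

Because a character may appear in several groups, as in the longer KEY line above, the neighbor lists are accumulated across groups rather than overwritten.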
196
197 TRY characters
Hunspell can suggest correct word forms that differ from the bad
input word by one TRY character. The TRY parameter is case
sensitive.
201
202 NOSUGGEST flag
Words marked with the NOSUGGEST flag are not suggested (but are
still accepted when typed correctly). Intended for vulgar and
obscene words (see also SUBSTANDARD).
206
207 MAXCPDSUGS num
208 Set max. number of suggested compound words generated by com‐
209 pound rules. The number of the suggested compound words may be
210 greater from the same 1-character distance type.
211
212 MAXNGRAMSUGS num
213 Set max. number of n-gram suggestions. Value 0 switches off the
214 n-gram suggestions (see also MAXDIFF).
215
216 MAXDIFF [0-10]
217 Set the similarity factor for the n-gram based suggestions (5 =
218 default value; 0 = fewer n-gram suggestions, but min. 1; 10 =
219 MAXNGRAMSUGS n-gram suggestions).
220
221 ONLYMAXDIFF
222 Remove all bad n-gram suggestions (default mode keeps one, see
223 MAXDIFF).
224
225 NOSPLITSUGS
226 Disable word suggestions with spaces.
227
228 SUGSWITHDOTS
229 Add dot(s) to suggestions, if input word terminates in dot(s).
230 (Not for LibreOffice dictionaries, because LibreOffice has an
231 automatic dot expansion mechanism.)
232
233 REP number_of_replacement_definitions
234
235 REP what replacement
This table specifies modifications to try first. The first REP
line is the header of the table, and one or more REP data lines
follow it. With this table, Hunspell can suggest the right forms
for typical spelling mistakes when the incorrect form differs by
more than one letter from the right form (see also "ph:"). The
search string supports the regex boundary signs (^ and $). For
example, a possible English replacement table to handle
misspelled consonants:
244
245 REP 5
246 REP f ph
247 REP ph f
248 REP tion$ shun
249 REP ^cooccurr co-occurr
250 REP ^alot$ a_lot
251
Note I: It is also very useful to define replacements for the most
typical one-character mistakes: with REP you can give higher priority
to a subset of the TRY suggestions (the suggestion list begins with the
REP suggestions).

Note II: To suggest separated words, specify spaces with underscores:
258
259
260 REP 1
261 REP onetwothree one_two_three
262
Note III: The replacement table can be used for stricter compound word
checking with the CHECKCOMPOUNDREP option.
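The replacement table can be sketched as follows. The `parse_rep` and `rep_candidates` helpers are hypothetical names. The sketch only generates candidates, while a real checker would keep just those candidates that the dictionary accepts.

```python
def parse_rep(lines):
    """Parse REP lines (header first) into (text, replacement, at_start, at_end)."""
    rules = []
    for line in lines[1:]:  # lines[0] is the "REP <count>" header
        _, what, replacement = line.split()
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        text = what[1 if at_start else 0: len(what) - (1 if at_end else 0)]
        # Underscores in the replacement stand for spaces.
        rules.append((text, replacement.replace("_", " "), at_start, at_end))
    return rules


def rep_candidates(word, rules):
    """Apply every rule at every position where it is allowed to match."""
    candidates = []
    for text, replacement, at_start, at_end in rules:
        pos = word.find(text)
        while pos != -1:
            start_ok = not at_start or pos == 0
            end_ok = not at_end or pos + len(text) == len(word)
            if start_ok and end_ok:
                candidate = word[:pos] + replacement + word[pos + len(text):]
                if candidate not in candidates:
                    candidates.append(candidate)
            pos = word.find(text, pos + 1)
    return candidates


TABLE = parse_rep([
    "REP 5",
    "REP f ph",
    "REP ph f",
    "REP tion$ shun",
    "REP ^cooccurr co-occurr",
    "REP ^alot$ a_lot",
])
print(rep_candidates("alot", TABLE))          # ['a lot']
print(rep_candidates("fysics", TABLE))        # ['physics']
print(rep_candidates("cooccurrence", TABLE))  # ['co-occurrence']
```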
265
266
267 MAP number_of_map_definitions
268
269 MAP string_of_related_chars_or_parenthesized_character_sequences
In the affix file (.aff), a map table can define language-
dependent information on characters and character sequences that
should be considered related (i.e. nearer to each other than to
characters outside the set). With this table, Hunspell can
suggest the right forms for words in which the wrong letter or
letter group from a related set was chosen more than once (see
REP).

For example, a possible mapping for German relates the umlauted
ü to the plain u; the word Frühstück should be written with
umlauted u's, not plain ones:
281
282 MAP 1
283 MAP uü
284
285 Use parenthesized groups for character sequences (eg. for composed Uni‐
286 code characters):
287
288 MAP 3
289 MAP ß(ss) (character sequence)
290 MAP fi(fi) ("fi" compatibility characters for Unicode fi ligature)
291 MAP (ọ́)o (composed Unicode character: ó with bottom dot)
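A rough sketch of MAP-based suggestions, restricted to single-character groups (parenthesized sequences are not handled). The function names and the one-word dictionary are hypothetical. Enumerating all combinations is exponential in the number of mapped characters, which is acceptable only for a demonstration.

```python
from itertools import product


def map_candidates(word, groups):
    """Enumerate every spelling obtained by swapping related characters."""
    related = {}
    for group in groups:
        for ch in group:
            related[ch] = group
    # Each position offers either its whole related group or just itself.
    choices = [related.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


def map_suggestions(word, groups, dictionary):
    """Keep the candidates that are dictionary words other than the input."""
    return [c for c in map_candidates(word, groups)
            if c in dictionary and c != word]


print(map_suggestions("Fruhstuck", ["uü"], {"Frühstück"}))
# ['Frühstück']
```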
292
293 PHONE number_of_phone_definitions
294
295 PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm
borrowed from Aspell. It is useful for languages whose
orthography is not pronunciation based. You can add a full
alphabet conversion and other rules for the conversion of special
letter sequences. For detailed documentation see
http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
UTF-8 characters do not work in bracket expressions yet. Dash
expressions still operate on signed bytes, not UTF-8 characters.
304
305 WARN flag
This flag is for rare words that are also often spelling
mistakes; see option -r of command line Hunspell and FORBIDWARN.
308
309 FORBIDWARN
With this parameter, the spell checker does not accept words
with the WARN flag.
312
314 BREAK number_of_break_definitions
315
316 BREAK character_or_character_sequence
Define new break points for breaking words and checking the word
parts separately. Use ^ and $ to delete characters at the start
and end of the word. Rationale: useful for compounding with
joining characters or strings (for example, the hyphen in English
and German, or the hyphen and n-dash in Hungarian). Dashes are
often bad break points for tokenization, because compounds with
dashes may also contain parts that are not valid words. With
BREAK, Hunspell can check both sides of these compounds, breaking
the words at dashes and n-dashes:
326
327 BREAK 2
328 BREAK -
329 BREAK -- # n-dash
330
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:
334
335 BREAK 3
336 BREAK -
337 BREAK ^-
338 BREAK -$
339
With the following BREAK definition, Hunspell doesn't accept the
"-word" and "word-" forms:
342
343 BREAK 1
344 BREAK -
345
346 Switching off the default values:
347
348 BREAK 0
349
350 Note II: COMPOUNDRULE is better for handling dashes and other compound
351 joining characters or character strings. Use BREAK, if you want to
352 check words with dashes or other joining characters and there is no
353 time or possibility to describe precise compound rules with COM‐
354 POUNDRULE (COMPOUNDRULE handles only the suffixation of the last word
355 part of a compound word).
356
Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter: WORDCHARS --- (see the
tests/break.* example).
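The recursive breaking described in this section can be approximated by the sketch below. It is a simplification under stated assumptions: `accepts` is a hypothetical name, the dictionary is a plain set of words, `--` stands for the n-dash as in the rendering above, and both sides of an interior break must be non-empty.

```python
def accepts(word, dictionary, breaks):
    """Accept a word directly or through recursively checked break points."""
    if word in dictionary:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            lead = pattern[1:]
            # ^ patterns delete characters from the start of the word.
            if lead and word.startswith(lead) and len(word) > len(lead):
                if accepts(word[len(lead):], dictionary, breaks):
                    return True
        elif pattern.endswith("$"):
            tail = pattern[:-1]
            # $ patterns delete characters from the end of the word.
            if tail and word.endswith(tail) and len(word) > len(tail):
                if accepts(word[:-len(tail)], dictionary, breaks):
                    return True
        elif pattern:
            pos = word.find(pattern, 1)
            while pos != -1:
                left, right = word[:pos], word[pos + len(pattern):]
                if right and accepts(left, dictionary, breaks) \
                        and accepts(right, dictionary, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


WORDS = {"foo", "bar"}
print(accepts("foo-foo--bar-bar", WORDS, ["-", "--"]))  # True
print(accepts("-foo", WORDS, ["-", "^-", "-$"]))        # True (default breaks)
print(accepts("-foo", WORDS, ["-"]))                    # False
```

Every recursive call receives a strictly shorter string, so the search terminates.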
360
361 COMPOUNDRULE number_of_compound_definitions
362
363 COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The
first COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags, parentheses, and the star and question mark
metacharacters. A flag followed by a `*' matches a sequence of
zero or more words marked with this compound flag. A flag
followed by a `?' matches zero or one word marked with this
compound flag. See the tests/compound*.* examples.
373
374 Note: en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
375 ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
376 1000122nd etc.).
377
378 Note II: In the case of long and numerical flag types use only
379 parenthesized flags: (1500)*(2000)?
380
381 Note III: COMPOUNDRULE flags work completely separately from the
382 compounding mechanisms using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
383 compound flags. (Use these flags on different entries for
384 words).
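One way to picture a compound pattern is to translate it into an ordinary regular expression in which each flag becomes an alternation of the words carrying it. This is a sketch, not Hunspell's matcher. `compile_rule` is a hypothetical name, only single-character flags are supported (so no parenthesized long or numerical flags), and the toy dictionary and the rule `A*B?C` are made up for the example.

```python
import re


def compile_rule(pattern, entries):
    """Translate a single-character-flag COMPOUNDRULE into a compiled regex."""
    words_by_flag = {}
    for entry in entries:
        word, _, flags = entry.partition("/")
        for flag in flags:
            words_by_flag.setdefault(flag, []).append(word)
    parts = []
    for ch in pattern:
        if ch in "*?":
            parts.append(ch)  # quantifiers keep their regex meaning
        else:
            # A flag matches any one word marked with it (KeyError if unused).
            words = words_by_flag[ch]
            parts.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
    return re.compile("".join(parts))


rule = compile_rule("A*B?C", ["a/A", "b/B", "c/BC"])
for candidate in ["aac", "abc", "c", "acc", "ba", "ab"]:
    print(candidate, bool(rule.fullmatch(candidate)))
# aac True, abc True, c True, acc True, ba False, ab False
```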
385
386
387 COMPOUNDMIN num
388 Minimum length of words used for compounding. Default value is
389 3 letters.
390
391 COMPOUNDFLAG flag
Words marked with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.
395
396 COMPOUNDBEGIN flag
397 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
398 first elements in compound words.
399
400 COMPOUNDLAST flag
401 Words signed with COMPOUNDLAST (or with a signed affix) may be
402 last elements in compound words.
403
404 COMPOUNDMIDDLE flag
405 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
406 middle elements in compound words.
407
408 ONLYINCOMPOUND flag
Suffixes marked with the ONLYINCOMPOUND flag may occur only
inside compounds (Fuge-elements in German, fogemorphemes in
Swedish). The ONLYINCOMPOUND flag also works with words (see
tests/onlyincompound.*). Note: it is also useful for flagging
compounding parts that are not correct as words by themselves.
414
415 COMPOUNDPERMITFLAG flag
416 Prefixes are allowed at the beginning of compounds, suffixes are
417 allowed at the end of compounds by default. Affixes with COM‐
418 POUNDPERMITFLAG may be inside of compounds.
419
420 COMPOUNDFORBIDFLAG flag
421 Suffixes with this flag forbid compounding of the affixed word.
422 Dictionary words with this flag are removed from the beginning
423 and middle of compound words, overriding the effect of COMPOUND‐
424 PERMITFLAG.
425
426 COMPOUNDMORESUFFIXES
427 Allow twofold suffixes within compounds.
428
429 COMPOUNDROOT flag
430 COMPOUNDROOT flag signs the compounds in the dictionary (Now it
431 is used only in the Hungarian language specific code).
432
433 COMPOUNDWORDMAX number
434 Set maximum word count in a compound word. (Default is unlim‐
435 ited.)
436
437 CHECKCOMPOUNDDUP
438 Forbid word duplication in compounds (e.g. foofoo).
439
440 CHECKCOMPOUNDREP
441 Forbid compounding, if the (usually bad) compound word may be a
442 non-compound word with a REP fault. Useful for languages with
443 `compound friendly' orthography.
444
445 CHECKCOMPOUNDCASE
446 Forbid upper case characters at word boundaries in compounds.
447
448 CHECKCOMPOUNDTRIPLE
449 Forbid compounding, if compound word contains triple repeating
450 letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
451 ter support in UTF-8 encoding (works only for 7-bit ASCII char‐
452 acters).
453
454 SIMPLIFIEDTRIPLE
455 Allow simplified 2-letter forms of the compounds forbidden by
456 CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and
457 for the old German orthography: Schiff|fahrt -> Schiffahrt).
458
459 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
460
461 CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
462 Forbid compounding, if the first word in the compound ends with
463 endchars, and next word begins with beginchars and (optionally)
464 they have the requested flags. The optional replacement parame‐
465 ter allows simplified compound form.
466
467 The special "endchars" pattern 0 (zero) limits the rule to the
468 unmodified stems (stems and stems with zero affixes):
469
470 CHECKCOMPOUNDPATTERN 0/x /y
471
Note: COMPOUNDMIN doesn't work correctly with compound word
alternation, so COMPOUNDMIN may need to be set to a lower value.
474
475 FORCEUCASE flag
A last word part with the FORCEUCASE flag forces capitalization
of the whole compound word. E.g. the Dutch word "straat"
(street) with the FORCEUCASE flag is allowed only in capitalized
compound forms, according to the Dutch spelling rules for proper
names.
481
482 COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables allowed in a
compound that contains more words than COMPOUNDWORDMAX. The
second parameter is the list of vowels (for counting syllables).
487
488 SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
490
492 PFX flag cross_product number
493
494 PFX flag stripping prefix [condition [morphological_fields...]]
495
496 SFX flag cross_product number
497
498 SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words
to make other words. We can define affix classes with an
arbitrary number of affix rules. Affix classes are marked with
affix flags. The first line of an affix class definition is the
header. The fields of an affix class header:
504
505 (0) Option name (PFX or SFX)
506
507 (1) Flag (name of the affix class)
508
509 (2) Cross product (permission to combine prefixes and suffixes).
510 Possible values: Y (yes) or N (no)
511
512 (3) Line count of the following rules.
513
Fields of an affix rule:
515
516 (0) Option name
517
518 (1) Flag
519
520 (2) stripping characters from beginning (at prefix rules) or end
521 (at suffix rules) of the word
522
523 (3) affix (optionally with flags of continuation classes, sepa‐
524 rated by a slash)
525
526 (4) condition.
527
Zero stripping or a zero affix is indicated by zero. A zero
condition is indicated by a dot. The condition is a simplified,
regular expression-like pattern that must be met before the
affix can be applied. (A dot matches an arbitrary character.
Characters in brackets match any single character from the set.
A dash has no special meaning, but a circumflex (^) right after
the opening bracket complements the character set.)
535
536 (5) Optional morphological fields separated by spaces or tabula‐
537 tors.
538
539
541 CIRCUMFIX flag
542 Affixes signed with CIRCUMFIX flag may be on a word when this
543 word also has a prefix with CIRCUMFIX flag and vice versa (see
544 circumfix.* test files in the source distribution).
545
546 FORBIDDENWORD flag
This flag marks a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of accepted
affixed and compound words. Note: useful for forbidding erroneous
words generated by the compounding mechanism.
551
552 FULLSTRIP
With FULLSTRIP, affix rules can strip full words, not only words
shorter by at least one character, before adding the affixes
(see the fullstrip.* test files in the source distribution).
Note: conditions may cover the full word length without
FULLSTRIP, too.
557
558 KEEPCASE flag
559 Forbid uppercased and capitalized forms of words signed with
560 KEEPCASE flags. Useful for special orthographies (measurements
561 and currency often keep their case in uppercased texts) and
562 writing systems (e.g. keeping lower case of IPA characters).
563 Also valuable for words erroneously written in the wrong case.
564
565 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
566 CASE flag may be capitalized and uppercased, but uppercased
567 forms of these words may not contain sharp s, only SS. See ger‐
568 mancompounding example in the tests directory of the Hunspell
569 distribution.
570
571
572 ICONV number_of_ICONV_definitions
573
574 ICONV pattern pattern2
575 Define input conversion table. Note: useful to convert one type
576 of quote to another one, or change ligature.
577
578 OCONV number_of_OCONV_definitions
579
580 OCONV pattern pattern2
581 Define output conversion table.
582
583 LEMMA_PRESENT flag
584 Deprecated. Use "st:" field instead of LEMMA_PRESENT.
585
586 NEEDAFFIX flag
587 This flag signs virtual stems in the dictionary, words only
588 valid when affixed. Except, if the dictionary word has a
589 homonym or a zero affix. NEEDAFFIX works also with prefixes and
590 prefix + suffix combinations (see tests/needaffix5.*).
591
592 PSEUDOROOT flag
593 Deprecated. (Former name of the NEEDAFFIX option.)
594
595 SUBSTANDARD flag
596 SUBSTANDARD flag signs affix rules and dictionary words (allo‐
597 morphs) not used in morphological generation and root words re‐
598 moved from suggestion. See also NOSUGGEST.
599
600 WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot,
dash, n-dash, digits and the percent sign are word characters in
Hungarian.
604
605 CHECKSHARPS
In uppercased (German) words, the letter pair SS may stand for an
uppercased sharp s (ß). Hunspell can handle this special casing
with the CHECKSHARPS declaration (see also the KEEPCASE flag and
the tests/germancompounding example) in both spelling and
suggestion.
610
611
613 Hunspell's dictionary items and affix rules may have optional space or
614 tabulator separated morphological description fields, started with
615 3-character (two letters and a colon) field IDs:
616
617
618 word/flags po:noun is:nom
619
Example: We define a simple resource with morphological information, a
derivational suffix (ds:) and a part of speech category (po:):
622
623 Affix file:
624
625
626 SFX X Y 1
627 SFX X 0 able . ds:able
628
629 Dictionary file:
630
631
632 drink/X po:verb
633
634 Test file:
635
636
637 drink
638 drinkable
639
640 Test:
641
642
643 $ analyze test.aff test.dic test.txt
644 > drink
645 analyze(drink) = po:verb
646 stem(drink) = po:verb
647 > drinkable
648 analyze(drinkable) = po:verb ds:able
649 stem(drinkable) = drinkable
650
As the example shows, the analyzer concatenates the morphological
fields in item-and-arrangement style.
653
654
656 Default morphological and other IDs (used in suggestion, stemming and
657 morphological generation):
658
ph: Alternative transliteration for better suggestions, i.e.
misspellings related to the special orthography and pronunciation
of the word. This is the best way to handle common misspellings,
so it is worth adding a ph: field to the few thousand most
affected dictionary words (or word pairs etc.) to get correct
suggestions for their misspellings.
665
666
667 For example:
668
669
670 Wednesday ph:wendsay ph:wensday
671 Marseille ph:maarsayl
672
673 Hunspell adds all ph: transliterations to the inner REP table, so it
674 will always suggest the correct word for the specified misspellings
675 with the highest priority.
676
The previous example is equivalent to the following REP definition:
678
679
680 REP 6
681 REP wendsay Wednesday
682 REP Wendsay Wednesday
683 REP wensday Wednesday
684 REP Wensday Wednesday
685 REP maarsayl Marseille
686 REP Maarsayl Marseille
687
An asterisk at the end of a ph: pattern strips the terminating
character from both the pattern and the word in the associated REP
rule:
691
692
693 pretty ph:prity*
694
results in
696
697
698 REP 1
699 REP prit prett
700
This REP rule gives the following correct suggestions:
702
703
704 *prity -> pretty
705 *pritier -> prettier
706 *pritiest -> prettiest
707
Moreover, ph: fields can handle suggestions of more than two words, as
well as different suggestions for the same misspelling:
710
711 do not know ph:dunno
712 don't know ph:dunno
713
714 results
715
716
717 *dunno -> do not know, don't know
718
719 Note: if available, ph: is used in n-gram similarity, too.
720
The ASCII arrow "->" in a ph: pattern defines a REP rule (see REP),
creating an arbitrary replacement rule associated with the dictionary
item:
723
724 happy/B ph:hepy ph:hepi->happi
725
726 results
727
728
729 *hepy -> happy
730 *hepiest -> happiest
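The three ph: forms shown above (plain pattern, trailing asterisk, and arrow) can be summarized in a sketch that derives the equivalent REP pairs. `ph_rules` is a hypothetical name. The capitalized variants are added only for capitalized words, an assumption read off the Wednesday and Marseille examples rather than taken from the Hunspell sources.

```python
def ph_rules(entry):
    """Derive (misspelling, replacement) pairs from the ph: fields of an entry."""
    tokens = entry.split()
    first = next((i for i, t in enumerate(tokens) if t.startswith("ph:")),
                 len(tokens))
    # The word may contain spaces ("do not know"); flags follow a slash.
    word = " ".join(tokens[:first]).split("/")[0]
    rules = []
    for token in tokens[first:]:
        if not token.startswith("ph:"):
            continue
        pattern, target = token[3:], word
        if "->" in pattern:
            # Arrow form: an explicit pattern -> replacement pair.
            pattern, target = pattern.split("->", 1)
        elif pattern.endswith("*"):
            # Asterisk form: drop the asterisk, then the last character
            # of both the pattern and the word.
            pattern, target = pattern[:-2], word[:-1]
        rules.append((pattern, target))
        if word[:1].isupper() and pattern[:1].islower():
            rules.append((pattern[0].upper() + pattern[1:], target))
    return rules


print(ph_rules("Wednesday ph:wendsay ph:wensday"))
print(ph_rules("pretty ph:prity*"))                # [('prit', 'prett')]
print(ph_rules("happy/B ph:hepy ph:hepi->happi"))  # [('hepy', 'happy'), ('hepi', 'happi')]
```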
731
st: Stem. Optional: in morphological analysis, the default stem is
the dictionary item. The stem field is useful for virtual stems
(dictionary words with the NEEDAFFIX flag) and for morphological
exceptions, instead of new morphological rules that would be
used only once.
736
737 feet st:foot is:plural
738 mice st:mouse is:plural
739 teeth st:tooth is:plural
740
741 Word forms with multiple stems need multiple dictionary items:
742
743
744 lay po:verb st:lie is:past_2
745 lay po:verb is:present
746 lay po:noun
747
748 al: Allomorph(s). A dictionary item is the stem of its allomorphs.
749 Morphological generation needs stem, allomorph and affix fields.
750
751 sing al:sang al:sung
752 sang st:sing
753 sung st:sing
754
755 po: Part of speech category.
756
757 ds: Derivational suffix(es). Stemming doesn't remove derivational
758 suffixes. Morphological generation depends on the order of the
759 suffix fields.
760
761 In affix rules:
762
763
764 SFX Y Y 1
765 SFX Y 0 ly . ds:ly_adj
766
767 In the dictionary:
768
769
770 ably st:able ds:ly_adj
771 able al:ably
772
773 is: Inflectional suffix(es). All inflectional suffixes are removed
774 by stemming. Morphological generation depends on the order of
775 the suffix fields.
776
777
778 feet st:foot is:plural
779
780 ts: Terminal suffix(es). Terminal suffix fields are inflectional
781 suffix fields "removed" by additional (not terminal) suffixes.
782
783 Useful for zero morphemes and affixes removed by splitting
784 rules.
785
786
787 work/D ts:present
788
789 SFX D Y 2
790 SFX D 0 ed . is:past_1
791 SFX D 0 ed . is:past_2
792
793 Typical example of the terminal suffix is the zero morpheme of the nom‐
794 inative case.
795
796
797 sp: Surface prefix. Temporary solution for adding prefixes to the
798 stems and generated word forms. See tests/morph.* example.
799
800
801 pa: Parts of the compound words. Output fields of morphological
802 analysis for stemming.
803
804 dp: Planned: derivational prefix.
805
806 ip: Planned: inflectional prefix.
807
808 tp: Planned: terminal prefix.
809
810
Ispell's original algorithm strips only one suffix. Hunspell can strip
one more suffix (or one more prefix in COMPLEXPREFIXES mode).

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterizes agglutinative
languages.
818
819 A second `s' suffix (affix class Y) will be the continuation class of
820 the suffix `able' in the following example:
821
822
823 SFX Y Y 1
824 SFX Y 0 s .
825
826 SFX X Y 1
827 SFX X 0 able/Y .
828
829 Dictionary file:
830
831
832 drink/X
833
834 Test file:
835
836
837 drink
838 drinkable
839 drinkables
840
841 Test:
842
843
844 $ hunspell -m -d test <test.txt
845 drink st:drink
846 drinkable st:drink fl:X
847 drinkables st:drink fl:X fl:Y
848
In theory, twofold suffix stripping needs only the square root of the
number of suffix rules required by single suffix stripping. In
practice, twofold suffix stripping was enough to describe the
Hungarian inflectional morphology.
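The drink/X example can be mimicked with a short generator that follows continuation classes up to two suffixes deep. This is a sketch with hypothetical names (`RULES`, `generate`). Conditions are left out because both rules in the example use the `.` condition.

```python
# flag -> list of (strip, add, continuation flags), transcribed from the
# SFX Y and SFX X classes of the example.
RULES = {
    "Y": [("", "s", "")],
    "X": [("", "able", "Y")],
}


def generate(word, flags, depth=2):
    """Generate the word and its suffixed forms, at most `depth` suffixes deep."""
    forms = [word]
    if depth == 0:
        return forms
    for flag in flags:
        for strip, add, continuation in RULES[flag]:
            if word.endswith(strip):
                form = word[: len(word) - len(strip)] + add
                # The continuation class decides which suffixes may follow.
                forms += generate(form, continuation, depth - 1)
    return forms


print(generate("drink", "X"))
# ['drink', 'drinkable', 'drinkables']
```

With `depth=1` the sketch behaves like single suffix stripping and stops at "drinkable".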
853
854
Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for giving flags in affix and dictionary files.
858
859 FLAG long command sets 2-character flags:
860
861
862 FLAG long
863 SFX Y1 Y 1
864 SFX Y1 0 s 1
865
866 Dictionary record with the Y1, Z3, F? flags:
867
868
869 foo/Y1Z3F?
870
871 FLAG num command sets numerical flags separated by comma:
872
873
874 FLAG num
875 SFX 65000 Y 1
876 SFX 65000 0 s 1
877
878 Dictionary example:
879
880
881 foo/65000,12,2756
882
The third one is Unicode character flags (FLAG UTF-8).
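A minimal sketch of how a flag field might be split under each flag type. `parse_flags` and its `flag_type` argument are hypothetical names, and the default branch treats every character as one flag, which covers both the 8-bit and the UTF-8 types once the text is decoded.

```python
def parse_flags(field, flag_type="short"):
    """Split the flag field of a dictionary entry according to the FLAG type."""
    if flag_type == "num":
        # Decimal flags are separated by commas.
        return [int(number) for number in field.split(",")]
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    # Default and UTF-8 flag types: one character per flag.
    return list(field)


print(parse_flags("Y1Z3F?", "long"))        # ['Y1', 'Z3', 'F?']
print(parse_flags("65000,12,2756", "num"))  # [65000, 12, 2756]
print(parse_flags("AB"))                    # ['A', 'B']
```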
884
885
887 Hunspell's dictionary can contain repeating elements that are homonyms:
888
889
890 work/A po:verb
891 work/B po:noun
892
893 An affix file:
894
895
896 SFX A Y 1
SFX A 0 s . is:sg3
898
899 SFX B Y 1
900 SFX B 0 s . is:plur
901
902 Test file:
903
904
905 works
906
907 Test:
908
909
910 $ hunspell -d test -m <testwords
911 work st:work po:verb is:sg3
912 work st:work po:noun is:plur
913
914 This feature also gives a way to forbid illegal prefix/suffix combina‐
915 tions.
916
917
An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to make the lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier HunSpell code. This
may have been legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay, or drinkable,
undrinkable but *undrink). In simple words, the prefix un- is
legitimate here only if the base drink is suffixed with -able. If both
of these patterns are handled by on-line affix rules and affix rules
are checked against the base only, there is no way to express this
dependency, and the system will necessarily over- or undergenerate.
936
In the next example, suffix class R has a prefix `continuation' class
(class P).
939
940
941 PFX P Y 1
942 PFX P 0 un . [prefix_un]+
943
944 SFX S Y 1
945 SFX S 0 s . +PL
946
947 SFX Q Y 1
948 SFX Q 0 s . +3SGV
949
950 SFX R Y 1
951 SFX R 0 able/PS . +DER_V_ADJ_ABLE
952
953 Dictionary:
954
955
956 2
957 drink/RQ [verb]
958 drink/S [noun]
959
960 Morphological analysis:
961
962
963 > drink
964 drink[verb]
965 drink[noun]
966 > drinks
967 drink[verb]+3SGV
968 drink[noun]+PL
969 > drinkable
970 drink[verb]+DER_V_ADJ_ABLE
971 > drinkables
972 drink[verb]+DER_V_ADJ_ABLE+PL
973 > undrinkable
974 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
975 > undrinkables
976 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
977 > undrink
978 Unknown word.
979 > undrinks
980 Unknown word.
981
Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is a single affix in morphology.
We also need the CIRCUMFIX option for correct morphological analysis.
986
987
988 # circumfixes: ~ obligate prefix/suffix combinations
989 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
990 # nagy, nagyobb, legnagyobb, legeslegnagyobb
991 # (great, greater, greatest, most greatest)
992
993 CIRCUMFIX X
994
995 PFX A Y 1
996 PFX A 0 leg/X .
997
998 PFX B Y 1
999 PFX B 0 legesleg/X .
1000
1001 SFX C Y 3
1002 SFX C 0 obb . +COMPARATIVE
1003 SFX C 0 obb/AX . +SUPERLATIVE
1004 SFX C 0 obb/BX . +SUPERSUPERLATIVE
1005
1006 Dictionary:
1007
1008
1009 1
1010 nagy/C [MN]
1011
1012 Analysis:
1013
1014
1015 > nagy
1016 nagy[MN]
1017 > nagyobb
1018 nagy[MN]+COMPARATIVE
1019 > legnagyobb
1020 nagy[MN]+SUPERLATIVE
1021 > legeslegnagyobb
1022 nagy[MN]+SUPERSUPERLATIVE
1023
Allowing free compounding decreases the precision of recognition, not
to mention stemming and morphological analysis. Although Ispell
introduced lexical switches to license the compounding of bases, this
proves not to be restrictive enough. For example:
1029
1030
1031 # affix file
1032 COMPOUNDFLAG X
1033
1034 2
1035 foo/X
1036 bar/X
1037
With this resource, foobar and barfoo are also accepted words.
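Free compounding with a single COMPOUNDFLAG can be pictured as a recursive split of the input into flagged words. The sketch below uses hypothetical names and assumes a minimum part length of 3, the documented COMPOUNDMIN default. It ignores affixes, case and every other compound check.

```python
def is_compound(word, parts, min_len=3):
    """True if `word` is a concatenation of two or more compoundable parts."""
    def split(rest, count):
        if not rest:
            return count >= 2
        return any(
            rest.startswith(part) and split(rest[len(part):], count + 1)
            for part in parts
            if part and len(part) >= min_len
        )
    return split(word, 0)


FLAGGED = {"foo", "bar"}  # the words carrying COMPOUNDFLAG X
for candidate in ["foobar", "barfoo", "foobarfoo", "foo", "foobaz"]:
    print(candidate, is_compound(candidate, FLAGGED))
# foobar True, barfoo True, foobarfoo True, foo False, foobaz False
```

Nothing in this scheme restricts which word may come first or last, which is the lack of precision that the direction-sensitive flags address.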
1039
1040 This has been improved upon with the introduction of direction-sensi‐
1041 tive compounding, i.e., lexical features can specify separately whether
1042 a base can occur as leftmost or rightmost constituent in compounds.
1043 This, however, is still insufficient to handle the intricate patterns
1044 of compounding, not to mention idiosyncratic (and language specific)
1045 norms of hyphenation.
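
Direction-sensitive compounding can be sketched in the same spirit. The word sets below are hypothetical illustration data, and is_directional_compound is an assumed helper name; the model only checks that the first, inner and last constituents come from the matching position sets.

```python
def is_directional_compound(word, begin, middle, end):
    """True if word is begin + zero or more middle parts + end.

    All parts are assumed to be non-empty strings.
    """

    def rest_ok(rest):
        if rest in end:
            return True
        return any(
            rest.startswith(m) and len(rest) > len(m) and rest_ok(rest[len(m):])
            for m in middle
        )

    return any(
        word.startswith(b) and len(word) > len(b) and rest_ok(word[len(b):])
        for b in begin
    )


# Hypothetical data: "foo" may only start and "bar" may only end a compound.
BEGIN, MIDDLE, END = {"foo"}, {"baz"}, {"bar"}
for w in ("foobar", "barfoo", "foobazbar"):
    print(w, is_directional_compound(w, BEGIN, MIDDLE, END))
```

With these sets "foobar" is accepted and "barfoo" is rejected, but the model still knows nothing about affixed forms or hyphenation, which is the insufficiency noted above.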
1046
The Hunspell algorithm currently allows any affixed form of words that
are lexically marked as potential members of compounds. Hunspell
improved this, and its recursive compound checking rules make it
possible to implement the intricate spelling conventions of Hungarian
compounds. For example, the noteworthy Hungarian `6-3' rule can be set
using the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
SYLLABLENUM options. As a further Hungarian example, derivational
suffixes often modify compounding properties. Hunspell allows
compounding flags on affixes, and there are two special flags
(COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
compounding of the derivations.
1057
Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the
affixed word.
1059
1060 We also need several Hunspell features for handling German compounding:
1061
1062
1063 # German compounding
1064
1065 # set language to handle special casing of German sharp s
1066
1067 LANG de_DE
1068
1069 # compound flags
1070
1071 COMPOUNDBEGIN U
1072 COMPOUNDMIDDLE V
1073 COMPOUNDEND W
1074
1075 # Prefixes are allowed at the beginning of compounds,
1076 # suffixes are allowed at the end of compounds by default:
1077 # (prefix)?(root)+(affix)?
1078 # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
1079 COMPOUNDPERMITFLAG P
1080
# for German fugemorphemes (Fuge-element)
1082 # Hint: ONLYINCOMPOUND is not required everywhere, but the
1083 # checking will be a little faster with it.
1084
1085 ONLYINCOMPOUND X
1086
1087 # forbid uppercase characters at compound word bounds
1088 CHECKCOMPOUNDCASE
1089
1090 # for handling Fuge-elements with dashes (Arbeits-)
1091 # dash will be a special word
1092
1093 COMPOUNDMIN 1
1094 WORDCHARS -
1095
# compound settings and fugemorpheme for `Arbeit'
1097
1098 SFX A Y 3
1099 SFX A 0 s/UPX .
1100 SFX A 0 s/VPDX .
1101 SFX A 0 0/WXD .
1102
1103 SFX B Y 2
1104 SFX B 0 0/UPX .
1105 SFX B 0 0/VWXDP .
1106
1107 # a suffix for `Computer'
1108
1109 SFX C Y 1
1110 SFX C 0 n/WD .
1111
# to forbid exceptions (*Arbeitsnehmer)
1113
1114 FORBIDDENWORD Z
1115
1116 # dash prefix for compounds with dash (Arbeits-Computer)
1117
1118 PFX - Y 1
1119 PFX - 0 -/P .
1120
1121 # decapitalizing prefix
1122 # circumfix for positioning in compounds
1123
1124 PFX D Y 29
1125 PFX D A a/PX A
1126 PFX D Ä ä/PX Ä
1127 .
1128 .
1129 PFX D Y y/PX Y
1130 PFX D Z z/PX Z
1131
1132 Example dictionary:
1133
1134
1135 4
1136 Arbeit/A-
1137 Computer/BC-
1138 -/W
1139 Arbeitsnehmer/Z
1140
Accepted compound words with the previous resource:
1142
1143
1144 Computer
1145 Computern
1146 Arbeit
1147 Arbeits-
1148 Computerarbeit
1149 Computerarbeits-
1150 Arbeitscomputer
1151 Arbeitscomputern
1152 Computerarbeitscomputer
1153 Computerarbeitscomputern
1154 Arbeitscomputerarbeit
1155 Computerarbeits-Computer
1156 Computerarbeits-Computern
1157
Compounds that are not accepted:
1159
1160
1161 computer
1162 arbeit
1163 Arbeits
1164 arbeits
1165 ComputerArbeit
1166 ComputerArbeits
1167 Arbeitcomputer
1168 ArbeitsComputer
1169 Computerarbeitcomputer
1170 ComputerArbeitcomputer
1171 ComputerArbeitscomputer
1172 Arbeitscomputerarbeits
1173 Computerarbeits-computer
1174 Arbeitsnehmer
1175
This solution is still not ideal, however, and will be replaced by a
pattern-based compound-checking algorithm which is closely integrated
with input buffer tokenization. Patterns describing compounds come as a
separate input resource that can refer to high-level properties of
constituent parts (e.g. the number of syllables, affix flags, and
containment of hyphens). The patterns are matched against potential
segmentations of compounds to assess well-formedness.
1183
1184
Both Ispell and Myspell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a language
like Hungarian has a standard 8-bit character set (ISO 8859-2), that
set fails to allow a full implementation of Hungarian orthographic
conventions. For instance, the '--' symbol (en dash) is missing from
this character set, even though it is not only the official symbol to
delimit parenthetic clauses in the language, but can also occur in
compound words as a special 'big' hyphen.
1194
MySpell has some 8-bit encoding tables, but there are also languages
without a standard 8-bit encoding. For example, many African languages
have non-Latin or extended Latin characters.
1198
Similarly, using the original spelling of certain foreign names like
Ångström or Molière is encouraged by the Hungarian spelling norm. Since
the characters 'Å' and 'è' are not part of ISO 8859-2, when they
combine with inflections containing characters only in ISO 8859-2 (like
elative -ből, ablative -től or delative -ről with double acute), the
result is words (like Ångströmről or Molière-től) that cannot be
encoded using any single 8-bit encoding scheme.
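
The encoding problem above can be checked directly. The following is a small sketch using Python's standard codecs; the helper name encodable is hypothetical, and ISO 8859-1 is included only as a second 8-bit encoding for comparison.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


WORDS = ["Ångströmről", "Molière-től"]
for word in WORDS:
    for enc in ("iso8859_1", "iso8859_2", "utf-8"):
        print(word, enc, encodable(word, enc))
```

Each word mixes a character that is missing from ISO 8859-2 ('Å' or 'è') with 'ő', which is missing from ISO 8859-1, so only the UTF-8 column succeeds.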
1206
The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding-independence has its advantages when it comes
to a truly multilingual application. Hunspell implements memory- and
time-efficient Unicode handling. In non-UTF-8 character encodings,
Hunspell works with the original 8-bit strings. In UTF-8 encoding,
affixes and words are stored in UTF-8 and are handled mostly in UTF-8
during analysis; for condition checking and suggestion they are
converted to UTF-16. Unicode text analysis and spell checking have a
minimal (0-20%) time overhead and a minimal or reasonable memory
overhead, depending on the language (its UTF-8 encoding and
affixation).
1218
1219
Aspell dictionaries can be easily converted into Hunspell format.
Conversion steps:
1223
1224 dictionary (xx.cwl -> xx.wl):
1225
1226 preunzip xx.cwl
1227 wc -l < xx.wl > xx.dic
1228 cat xx.wl >> xx.dic
1229
1230 affix file
1231
1232 If the affix file exists, copy it:
1233 cp xx_affix.dat xx.aff
1234 If not, create it with the suitable character encoding (see xx.dat)
1235 echo "SET ISO8859-x" > xx.aff
1236 or
1237 echo "SET UTF-8" > xx.aff
1238
It is useful to add a TRY option listing the characters of the
dictionary in frequency order, to guide edit-distance suggestions:
1241 echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
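
The dictionary step and the TRY line can also be sketched in Python. This is an illustration under assumptions, not part of Hunspell or Aspell: the function names are hypothetical, empty lines are skipped (unlike `wc -l`), and anything after a slash in a word list entry is treated as flags and ignored when counting characters.

```python
from collections import Counter


def wordlist_to_dic(wordlist_text):
    """Build .dic content: the word count on line one, then the words."""
    words = [w for w in wordlist_text.splitlines() if w]
    return "\n".join([str(len(words))] + words) + "\n"


def try_line(wordlist_text):
    """Build a TRY line with characters ordered by descending frequency."""
    counts = Counter(
        ch
        for entry in wordlist_text.split()
        for ch in entry.split("/")[0]  # assumption: text after "/" is flags
    )
    # Ties are broken by code point so the result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "TRY " + "".join(ordered)


sample = "foo\nbar\n"
print(wordlist_to_dic(sample), end="")
print(try_line(sample))
```

A TRY line derived from the word list itself reflects the character frequencies of that dictionary, unlike the keyboard-order string used in the command above.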
1242
1243
1245 hunspell (1), ispell (1), ispell (4)
1246
1247
1248
1249
1250 2017-09-20 hunspell(5)