hunspell(5) File Formats Manual hunspell(5)

hunspell - format of Hunspell dictionaries and affix files
7
Hunspell(1) requires two files to define how a language is spell
checked: a dictionary file containing words and applicable flags, and
an affix file that specifies how these flags control spell checking.
A personal dictionary file is optional.
13
14
A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except a personal dictionary) contains the
approximate word count (for optimal hash memory size). Each word may
optionally be followed by a slash ("/") and one or more flags, which
represent word attributes, for example affixes.
21
Note: Dictionary words can also contain slashes when escaped with the
"\/" syntax.
24
It is worth adding not only words but also word pairs to the dictionary
to get correct suggestions for common misspellings with a missing
space, as in the following example for the bad "alot" and "inspite"
(see also "REP" and the "ph:" field about correct suggestions for
common misspellings):
30
31
32 3
33 word
34 a lot
35 in spite
36
Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word separated by
a slash is a model word whose affixation the entry inherits.
41
42
43 foo
44 Foo/Simpson
45 *bar
46
47 In this example, "foo" and "Foo" are personal words, plus Foo will be
48 recognized with affixes of Simpson (Foo's etc.) and bar is a forbidden
49 word.
50
51
53 Dictionary file:
54
55 3
56 hello
57 try/B
58 work/AB
59
60 The flags B and A specify attributes of these words.
61
62 Affix file:
63
64
65 SET UTF-8
66 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
67
68 REP 2
69 REP f ph
70 REP ph f
71
72 PFX A Y 1
73 PFX A 0 re .
74
75 SFX B Y 2
76 SFX B 0 ed [^y]
77 SFX B y ied y
78
In the affix file, prefix class A and suffix class B have been defined.
Class A defines a `re-' prefix. Class B defines two `-ed' suffixes. The
first B suffix can be added to a word if the last character of the word
isn't `y'. The second suffix replaces the final `y' of words ending in
`y' with `ied'.
83
84 All accepted words with this dictionary and affix combination are:
85 "hello", "try", "tried", "work", "worked", "rework", "reworked".
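
The expansion above can be reproduced with a small sketch. This is not
Hunspell's implementation (Hunspell strips affixes from the input word
instead of generating forms); the names RULES, DICTIONARY, apply_rule
and expand are hypothetical, and conditions are treated as ordinary
Python regular expressions, which is an assumption that holds for the
simple conditions used here.

```python
import re

# Affix rules from the example: (kind, strip, add, condition).
# "0" means "nothing to strip"; "." means "no condition".
RULES = {
    "A": [("PFX", "0", "re", ".")],
    "B": [("SFX", "0", "ed", "[^y]"), ("SFX", "y", "ied", "y")],
}
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(word, rule):
    """Return the affixed form, or None if the rule does not apply."""
    kind, strip, add, cond = rule
    strip = "" if strip == "0" else strip
    if kind == "PFX":
        if not re.match(cond, word) or not word.startswith(strip):
            return None
        return add + word[len(strip):]
    if not re.search(f"(?:{cond})$", word) or not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + add


def expand(dictionary, rules):
    """Generate every form the dictionary and affix rules accept."""
    forms = set()
    for word, flags in dictionary.items():
        forms.add(word)
        prefixes = [r for f in flags for r in rules[f] if r[0] == "PFX"]
        suffixes = [r for f in flags for r in rules[f] if r[0] == "SFX"]
        suffixed = [s for s in (apply_rule(word, r) for r in suffixes) if s]
        forms.update(suffixed)
        for rule in prefixes:
            # Cross product (Y in both headers): prefix + suffixed forms.
            for base in [word] + suffixed:
                prefixed = apply_rule(base, rule)
                if prefixed:
                    forms.add(prefixed)
    return forms


print(sorted(expand(DICTIONARY, RULES)))
```

The sketch always combines prefixes with suffixed forms, which matches
the example only because both class headers allow the cross product.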
86
87
89 Hunspell source distribution contains more than 80 examples for option
90 usage.
91
92
SET encoding
Set character encoding of words and morphemes in affix and
dictionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, cp1251, ISCII-DEVANAGARI.
98
99 SET UTF-8
100
FLAG value
Set flag type. The default type is the extended ASCII (8-bit)
character. The `UTF-8' value sets UTF-8 encoded Unicode character
flags. The `long' value sets the double extended ASCII character
flag type, and `num' sets the decimal number flag type. Decimal
flags are numbered from 1 to 65000 and are separated by commas in
flag fields. BUG: the UTF-8 flag type doesn't work on the ARM
platform.
109
110 FLAG long
111
112 COMPLEXPREFIXES
113 Set twofold prefix stripping (but single suffix stripping) eg.
114 for morphologically complex languages with right-to-left writing
115 system.
116
117
118 LANG langcode
119 Set language code for language-specific functions of Hunspell.
120 Use it to activate special casing of Azeri (LANG az), Turkish
121 (LANG tr) and Crimean Tatar (LANG crh), also not generalized
122 syllable-counting compounding rules of Hungarian (LANG hu).
123
124
IGNORE characters
Sets characters to ignore in dictionary words, affixes and input
words. Useful for optional characters, such as Arabic (harakat) or
Hebrew (niqqud) diacritical marks (see the tests/ignore.* test
dictionary in the Hunspell distribution).
130
131
132 AF number_of_flag_vector_aliases
133
134 AF flag_vector
135 Hunspell can substitute affix flag sets with ordinal numbers in
136 affix rules (alias compression, see makealias tool). First exam‐
137 ple with alias compression:
138
139 3
140 hello
141 try/1
142 work/2
143
144 AF definitions in the affix file:
145
146 AF 2
147 AF A
148 AF AB
149
It is equivalent to the following dic file:
151
152 3
153 hello
154 try/A
155 work/AB
156
157 See also tests/alias* examples of the source distribution.
158
159 Note I: If affix file contains the FLAG parameter, define it before the
160 AF definitions.
161
162 Note II: Use makealias utility in Hunspell distribution to compress aff
163 and dic files.
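
The alias substitution can be sketched as follows. The function
resolve_aliases is hypothetical and only handles the single-character
flag case shown in the example above; it is an illustration, not the
makealias tool.

```python
def resolve_aliases(aff_lines, dic_lines):
    """Replace numeric flag aliases in .dic entries with the AF flag sets."""
    # The first AF line is the count header; the rest are flag vectors.
    vectors = [line.split()[1] for line in aff_lines if line.startswith("AF ")][1:]
    resolved = []
    for entry in dic_lines:
        word, sep, alias = entry.partition("/")
        if sep and alias.isdigit():
            # Aliases are 1-based ordinal numbers.
            entry = f"{word}/{vectors[int(alias) - 1]}"
        resolved.append(entry)
    return resolved


aff = ["AF 2", "AF A", "AF AB"]
dic = ["3", "hello", "try/1", "work/2"]
print(resolve_aliases(aff, dic))
```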
164
165 AM number_of_morphological_aliases
166
167 AM morphological_fields
168 Hunspell can substitute also morphological data with ordinal
169 numbers in affix rules (alias compression). See tests/alias*
170 examples.
171
Suggestion parameters can optimize the default n-gram (similarity
search in the dictionary words based on common character sequences of
1, 2, 3 and 4 characters), character swap and deletion suggestions of
Hunspell. REP is recommended for fixing typical and especially bad
language-specific mistakes, because the REP suggestions have the
highest priority in the suggestion list. PHONE is for languages whose
orthography is not based on pronunciation.

For short common misspellings, it's important to use the ph: field (see
later) to give the best suggestions.
183
KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words with one different character
replaced by a neighbor KEY character. Characters that are not
neighbors are separated by vertical line characters in the KEY
string. Suggested KEY parameters for QWERTY and Dvorak keyboard
layouts:
189
190 KEY qwertyuiop|asdfghjkl|zxcvbnm
191 KEY pyfgcrl|aeouidhtns|qjkxbmwvz
192
193 Using the first QWERTY layout, Hunspell suggests "nude" and "node" for
194 "*nide". A character may have more neighbors, too:
195
196 KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
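
A minimal sketch of the neighbor-key search, assuming that neighbors
are the characters directly before and after a character within one
group of the KEY string. The function key_suggestions and the tiny word
set are hypothetical; Hunspell's own search has further steps that are
not modeled here.

```python
def key_suggestions(word, key, dictionary):
    """Suggest dictionary words that differ by one neighboring key."""
    suggestions = []
    for i, ch in enumerate(word):
        for row in key.split("|"):
            for j, key_ch in enumerate(row):
                if key_ch != ch:
                    continue
                # Neighbors are the characters directly left and right
                # of this occurrence within the same group.
                for k in (j - 1, j + 1):
                    if 0 <= k < len(row):
                        candidate = word[:i] + row[k] + word[i + 1:]
                        if candidate in dictionary and candidate not in suggestions:
                            suggestions.append(candidate)
    return suggestions


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
print(key_suggestions("nide", QWERTY, {"nude", "node", "nice"}))
```

Because a character may occur in several groups, the same function also
covers KEY strings like the longer one above, where each group lists one
key together with its neighbors.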
197
198 TRY characters
199 Hunspell can suggest right word forms, when they differ from the
200 bad input word by one TRY character. The parameter of TRY is
201 case sensitive.
202
203 NOSUGGEST flag
204 Words signed with NOSUGGEST flag are not suggested (but still
205 accepted when typed correctly). Proposed flag for vulgar and
206 obscene words (see also SUBSTANDARD).
207
MAXCPDSUGS num
Set the maximum number of suggested compound words generated by
compound rules. The number of suggested compound words may exceed
this limit for suggestions of the same 1-character distance type.
212
213 MAXNGRAMSUGS num
214 Set max. number of n-gram suggestions. Value 0 switches off the
215 n-gram suggestions (see also MAXDIFF).
216
217 MAXDIFF [0-10]
218 Set the similarity factor for the n-gram based suggestions (5 =
219 default value; 0 = fewer n-gram suggestions, but min. 1; 10 =
220 MAXNGRAMSUGS n-gram suggestions).
221
222 ONLYMAXDIFF
223 Remove all bad n-gram suggestions (default mode keeps one, see
224 MAXDIFF).
225
226 NOSPLITSUGS
227 Disable word suggestions with spaces.
228
229 SUGSWITHDOTS
230 Add dot(s) to suggestions, if input word terminates in dot(s).
231 (Not for LibreOffice dictionaries, because LibreOffice has an
232 automatic dot expansion mechanism.)
233
234 REP number_of_replacement_definitions
235
REP what replacement
This table specifies modifications to try first. The first REP line
is the header of this table, and one or more REP data lines follow
it. With this table, Hunspell can suggest the right forms for
typical spelling mistakes when the incorrect form differs by more
than 1 letter from the right form (see also "ph:"). The search
string supports the regex boundary signs (^ and $). For example, a
possible English replacement table definition to handle misspelled
consonants:
245
246 REP 5
247 REP f ph
248 REP ph f
249 REP tion$ shun
250 REP ^cooccurr co-occurr
251 REP ^alot$ a_lot
252
253 Note I: It's very useful to define replacements for the most typical
254 one-character mistakes, too: with REP you can add higher priority to a
255 subset of the TRY suggestions (suggestion list begins with the REP sug‐
256 gestions).
257
Note II: To suggest separated words, specify spaces with underscores:
259
260
261 REP 1
262 REP onetwothree one_two_three
263
264 Note III: Replacement table can be used for a stricter compound word
265 checking with the option CHECKCOMPOUNDREP.
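
The way a REP table produces suggestions can be sketched like this. The
function rep_suggestions and the word set are hypothetical, and the
sketch assumes that a candidate counts as a suggestion only if it is
listed in the dictionary as is (including word pairs such as "a lot").

```python
def rep_suggestions(word, rep_table, dictionary):
    """Apply each REP rule at every allowed position; keep dictionary hits."""
    suggestions = []
    for what, replacement in rep_table:
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        pattern = what.strip("^$")
        # Underscores in the replacement stand for spaces.
        replacement = replacement.replace("_", " ")
        start = word.find(pattern)
        while start != -1:
            end = start + len(pattern)
            if (not at_start or start == 0) and (not at_end or end == len(word)):
                candidate = word[:start] + replacement + word[end:]
                if candidate in dictionary and candidate not in suggestions:
                    suggestions.append(candidate)
            start = word.find(pattern, start + 1)
    return suggestions


REP = [("f", "ph"), ("ph", "f"), ("tion$", "shun"),
       ("^cooccurr", "co-occurr"), ("^alot$", "a_lot")]
WORDS = {"phone", "a lot"}
print(rep_suggestions("fone", REP, WORDS))
print(rep_suggestions("alot", REP, WORDS))
```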
266
267
268 MAP number_of_map_definitions
269
270 MAP string_of_related_chars_or_parenthesized_character_sequences
271 We can define language-dependent information on characters and
272 character sequences that should be considered related (i.e.
273 nearer than other chars not in the set) in the affix file (.aff)
274 by a map table. With this table, Hunspell can suggest the right
275 forms for words, which incorrectly choose the wrong letter or
276 letter groups from a related set more than once in a word (see
277 REP).
278
279 For example a possible mapping could be for the German umlauted
280 ü versus the regular u; the word Frühstück really should be
281 written with umlauted u's and not regular ones
282
283 MAP 1
284 MAP uü
285
286 Use parenthesized groups for character sequences (eg. for composed Uni‐
287 code characters):
288
289 MAP 3
290 MAP ß(ss) (character sequence)
MAP ﬁ(fi) ("fi" compatibility characters for Unicode fi ligature)
292 MAP (ọ́)o (composed Unicode character: ó with bottom dot)
293
294 PHONE number_of_phone_definitions
295
PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm borrowed
from Aspell. It is useful for languages whose orthography is not
based on pronunciation. You can add a full alphabet conversion and
other rules for conversion of special letter sequences. For detailed
documentation see
http://aspell.net/man-html/Phonetic-Code.html.
Note: Multibyte UTF-8 characters do not work with bracket
expressions yet. Dash expressions work on signed bytes, not on
UTF-8 characters, yet.
305
306 WARN flag
307 This flag is for rare words, which are also often spelling mis‐
308 takes, see option -r of command line Hunspell and FORBIDWARN.
309
310 FORBIDWARN
311 Words with flag WARN aren't accepted by the spell checker using
312 this parameter.
313
315 BREAK number_of_break_definitions
316
BREAK character_or_character_sequence
Define new break points for breaking words and checking word parts
separately. Use ^ and $ to delete characters at the start and end
of the word. Rationale: useful for compounding with joining
characters or strings (for example, hyphen in English and German or
hyphen and n-dash in Hungarian). Dashes are often bad break points
for tokenization, because compounds with dashes may contain invalid
parts, too. With BREAK, Hunspell can check both sides of these
compounds, breaking the words at dashes and n-dashes:
327
328 BREAK 2
329 BREAK -
330 BREAK -- # n-dash
331
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:
335
336 BREAK 3
337 BREAK -
338 BREAK ^-
339 BREAK -$
340
341 Hunspell doesn't accept the "-word" and "word-" forms by this BREAK
342 definition:
343
344 BREAK 1
345 BREAK -
346
347 Switching off the default values:
348
349 BREAK 0
350
351 Note II: COMPOUNDRULE is better for handling dashes and other compound
352 joining characters or character strings. Use BREAK, if you want to
353 check words with dashes or other joining characters and there is no
354 time or possibility to describe precise compound rules with COM‐
355 POUNDRULE (COMPOUNDRULE handles only the suffixation of the last word
356 part of a compound word).
357
Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter, for example WORDCHARS --- (see
the tests/break.* example).
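
Recursive breaking can be sketched as below. The function accepts is
hypothetical and assumes that every BREAK pattern contains at least one
character besides ^ and $; it illustrates the recursion only and is not
Hunspell's code.

```python
def accepts(word, dictionary, breaks):
    """Check a word, breaking it recursively at the BREAK patterns."""
    if word in dictionary:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # Strip the pattern from the start of the word.
            rest = pattern[1:]
            if word.startswith(rest) and accepts(word[len(rest):], dictionary, breaks):
                return True
        elif pattern.endswith("$"):
            # Strip the pattern from the end of the word.
            rest = pattern[:-1]
            if word.endswith(rest) and accepts(word[:-len(rest)], dictionary, breaks):
                return True
        else:
            # Break inside the word and check both sides separately.
            pos = word.find(pattern, 1)
            while pos != -1 and pos + len(pattern) < len(word):
                left, right = word[:pos], word[pos + len(pattern):]
                if accepts(left, dictionary, breaks) and accepts(right, dictionary, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


WORDS = {"foo", "bar"}
print(accepts("foo-foo--bar-bar", WORDS, ["-", "--"]))
print(accepts("-foo", WORDS, ["-"]))
print(accepts("-foo", WORDS, ["-", "^-", "-$"]))
```

With the single pattern "-", the forms "-foo" and "foo-" are rejected
because one side of the break would be empty; the default definition
accepts them through the ^- and -$ patterns.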
361
362 COMPOUNDRULE number_of_compound_definitions
363
COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The first
COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags, parentheses, and the star and question mark meta characters.
A flag followed by a `*' matches a sequence of 0 or more words
signed with this compound flag. A flag followed by a `?' matches a
sequence of 0 or 1 words signed with this compound flag. See the
tests/compound*.* examples.
374
375 Note: en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
376 ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
377 1000122nd etc.).
378
379 Note II: In the case of long and numerical flag types use only
380 parenthesized flags: (1500)*(2000)?
381
382 Note III: COMPOUNDRULE flags work completely separately from the
383 compounding mechanisms using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
384 compound flags. (Use these flags on different entries for
385 words).
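
A compound pattern can be checked with a sketch like the following. It
assumes single-character letter flags, so that the pattern can be reused
directly as a Python regular expression over the sequence of flags, and
it assumes a compound has at least two parts. The function
matches_compound_rule, the flags A and B and the toy words are
hypothetical.

```python
import re


def matches_compound_rule(word, dictionary, rule):
    """Check whether some split of `word` matches the flag pattern."""
    pattern = re.compile(rule)

    def search(rest, flags_so_far, parts):
        if not rest:
            # Require at least two parts and a full match of the rule.
            return parts >= 2 and pattern.fullmatch(flags_so_far) is not None
        for end in range(1, len(rest) + 1):
            # Try every flag of every dictionary word that starts `rest`.
            for flag in dictionary.get(rest[:end], ""):
                if search(rest[end:], flags_so_far + flag, parts + 1):
                    return True
        return False

    return search(word, "", 0)


WORDS = {"foo": "A", "bar": "B"}
print(matches_compound_rule("foofoobar", WORDS, "A*B"))
print(matches_compound_rule("barfoo", WORDS, "A*B"))
```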
386
387
388 COMPOUNDMIN num
389 Minimum length of words used for compounding. Default value is
390 3 letters.
391
COMPOUNDFLAG flag
Words signed with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.
396
397 COMPOUNDBEGIN flag
398 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
399 first elements in compound words.
400
401 COMPOUNDLAST flag
402 Words signed with COMPOUNDLAST (or with a signed affix) may be
403 last elements in compound words.
404
405 COMPOUNDMIDDLE flag
406 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
407 middle elements in compound words.
408
409 ONLYINCOMPOUND flag
410 Suffixes signed with ONLYINCOMPOUND flag may be only inside of
411 compounds (Fuge-elements in German, fogemorphemes in Swedish).
412 ONLYINCOMPOUND flag works also with words (see tests/onlyincom‐
413 pound.*). Note: also valuable to flag compounding parts which
414 are not correct as a word by itself.
415
416 COMPOUNDPERMITFLAG flag
417 Prefixes are allowed at the beginning of compounds, suffixes are
418 allowed at the end of compounds by default. Affixes with COM‐
419 POUNDPERMITFLAG may be inside of compounds.
420
421 COMPOUNDFORBIDFLAG flag
422 Suffixes with this flag forbid compounding of the affixed word.
423 Dictionary words with this flag are removed from the beginning
424 and middle of compound words, overriding the effect of COMPOUND‐
425 PERMITFLAG.
426
427 COMPOUNDMORESUFFIXES
428 Allow twofold suffixes within compounds.
429
430 COMPOUNDROOT flag
431 COMPOUNDROOT flag signs the compounds in the dictionary (Now it
432 is used only in the Hungarian language specific code).
433
434 COMPOUNDWORDMAX number
435 Set maximum word count in a compound word. (Default is unlim‐
436 ited.)
437
438 CHECKCOMPOUNDDUP
439 Forbid word duplication in compounds (e.g. foofoo).
440
441 CHECKCOMPOUNDREP
442 Forbid compounding, if the (usually bad) compound word may be a
443 non-compound word with a REP fault. Useful for languages with
444 `compound friendly' orthography.
445
446 CHECKCOMPOUNDCASE
447 Forbid upper case characters at word boundaries in compounds.
448
449 CHECKCOMPOUNDTRIPLE
450 Forbid compounding, if compound word contains triple repeating
451 letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
452 ter support in UTF-8 encoding (works only for 7-bit ASCII char‐
453 acters).
454
455 SIMPLIFIEDTRIPLE
456 Allow simplified 2-letter forms of the compounds forbidden by
457 CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and
458 for the old German orthography: Schiff|fahrt -> Schiffahrt).
459
460 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
461
462 CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
463 Forbid compounding, if the first word in the compound ends with
464 endchars, and next word begins with beginchars and (optionally)
465 they have the requested flags. The optional replacement parame‐
466 ter allows simplified compound form.
467
468 The special "endchars" pattern 0 (zero) limits the rule to the
469 unmodified stems (stems and stems with zero affixes):
470
471 CHECKCOMPOUNDPATTERN 0/x /y
472
Note: COMPOUNDMIN doesn't work correctly with the compound word
alternation, so COMPOUNDMIN may need to be set to a lower value.
475
FORCEUCASE flag
A last word part of a compound with the FORCEUCASE flag forces
capitalization of the whole compound word. E.g. the Dutch word
"straat" (street) with the FORCEUCASE flag will be allowed only in
capitalized compound forms, according to the Dutch spelling rules
for proper names.
482
COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables that may be in a
compound if it has more words than COMPOUNDWORDMAX. The second
parameter is the list of vowels (for calculating syllables).

SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
491
493 PFX flag cross_product number
494
495 PFX flag stripping prefix [condition [morphological_fields...]]
496
497 SFX flag cross_product number
498
SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words to
make other words. We can define affix classes with an arbitrary
number of affix rules. Affix classes are signed with affix flags.
The first line of an affix class definition is the header. The
fields of an affix class header:
505
506 (0) Option name (PFX or SFX)
507
508 (1) Flag (name of the affix class)
509
510 (2) Cross product (permission to combine prefixes and suffixes).
511 Possible values: Y (yes) or N (no)
512
513 (3) Line count of the following rules.
514
Fields of an affix rule:
516
517 (0) Option name
518
519 (1) Flag
520
521 (2) stripping characters from beginning (at prefix rules) or end
522 (at suffix rules) of the word
523
524 (3) affix (optionally with flags of continuation classes, sepa‐
525 rated by a slash)
526
527 (4) condition.
528
Zero stripping or affix is indicated by zero. Zero condition is
indicated by dot. A condition is a simplified, regular
expression-like pattern, which must be met before the affix can be
applied. (A dot stands for an arbitrary character. Characters in
square brackets stand for an arbitrary character from that character
subset. A dash has no special meaning, but a circumflex (^) right
after the opening bracket selects the complement character set.)
536
537 (5) Optional morphological fields separated by spaces or tabula‐
538 tors.
539
540
542 CIRCUMFIX flag
543 Affixes signed with CIRCUMFIX flag may be on a word when this
544 word also has a prefix with CIRCUMFIX flag and vice versa (see
545 circumfix.* test files in the source distribution).
546
FORBIDDENWORD flag
This flag signs a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of accepted
affixed and compound words. Note: useful for forbidding erroneous
words generated by the compounding mechanism.
552
FULLSTRIP
With FULLSTRIP, affix rules can strip full words, not only all but
one of their characters, before adding the affixes (see the
fullstrip.* test files in the source distribution). Note:
conditions may be as long as the word even without FULLSTRIP.
558
559 KEEPCASE flag
560 Forbid uppercased and capitalized forms of words signed with
561 KEEPCASE flags. Useful for special orthographies (measurements
562 and currency often keep their case in uppercased texts) and
563 writing systems (e.g. keeping lower case of IPA characters).
564 Also valuable for words erroneously written in the wrong case.
565
566 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
567 CASE flag may be capitalized and uppercased, but uppercased
568 forms of these words may not contain sharp s, only SS. See ger‐
569 mancompounding example in the tests directory of the Hunspell
570 distribution.
571
572
573 ICONV number_of_ICONV_definitions
574
575 ICONV pattern pattern2
576 Define input conversion table. Note: useful to convert one type
577 of quote to another one, or change ligature.
578
579 OCONV number_of_OCONV_definitions
580
581 OCONV pattern pattern2
582 Define output conversion table.
583
584 LEMMA_PRESENT flag
585 Deprecated. Use "st:" field instead of LEMMA_PRESENT.
586
NEEDAFFIX flag
This flag signs virtual stems in the dictionary, words that are
valid only when affixed, unless the dictionary word has a homonym
or a zero affix. NEEDAFFIX also works with prefixes and prefix +
suffix combinations (see tests/needaffix5.*).
592
593 PSEUDOROOT flag
594 Deprecated. (Former name of the NEEDAFFIX option.)
595
596 SUBSTANDARD flag
597 SUBSTANDARD flag signs affix rules and dictionary words (allo‐
598 morphs) not used in morphological generation and root words
599 removed from suggestion. See also NOSUGGEST.
600
WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot, dash,
n-dash, numbers and the percent sign are word characters in
Hungarian.
605
606 CHECKSHARPS
607 SS letter pair in uppercased (German) words may be upper case
608 sharp s (ß). Hunspell can handle this special casing with the
609 CHECKSHARPS declaration (see also KEEPCASE flag and tests/ger‐
610 mancompounding example) in both spelling and suggestion.
611
612
614 Hunspell's dictionary items and affix rules may have optional space or
615 tabulator separated morphological description fields, started with
616 3-character (two letters and a colon) field IDs:
617
618
619 word/flags po:noun is:nom
620
Example: We define a simple resource with morphological information, a
derivative suffix (ds:) and a part of speech category (po:):
623
624 Affix file:
625
626
627 SFX X Y 1
628 SFX X 0 able . ds:able
629
630 Dictionary file:
631
632
633 drink/X po:verb
634
635 Test file:
636
637
638 drink
639 drinkable
640
641 Test:
642
643
644 $ analyze test.aff test.dic test.txt
645 > drink
646 analyze(drink) = po:verb
647 stem(drink) = po:verb
648 > drinkable
649 analyze(drinkable) = po:verb ds:able
650 stem(drinkable) = drinkable
651
652 You can see in the example, that the analyzer concatenates the morpho‐
653 logical fields in item and arrangement style.
654
655
657 Default morphological and other IDs (used in suggestion, stemming and
658 morphological generation):
659
ph: Alternative transliteration for better suggestions, i.e.
misspellings related to the special orthography and pronunciation
of the word. This is the best way to handle common misspellings, so
it is worth adding a ph: field to the few thousand most affected
dictionary words (or word pairs etc.) to get correct suggestions
for their misspellings.
666
667
668 For example:
669
670
671 Wednesday ph:wendsay ph:wensday
672 Marseille ph:maarsayl
673
674 Hunspell adds all ph: transliterations to the inner REP table, so it
675 will always suggest the correct word for the specified misspellings
676 with the highest priority.
677
The previous example is equivalent to the following REP definition:
679
680
681 REP 6
682 REP wendsay Wednesday
683 REP Wendsay Wednesday
684 REP wensday Wednesday
685 REP Wensday Wednesday
686 REP maarsayl Marseille
687 REP Maarsayl Marseille
688
689 The asterisk at the end of the ph: pattern means stripping the termi‐
690 nating character both from the pattern and the word in the associated
691 REP rule:
692
693
694 pretty ph:prity*
695
results in the following REP rule:


REP 1
REP prit prett

which gives the following correct suggestions:
703
704
705 *prity -> pretty
706 *pritier -> prettier
707 *pritiest -> prettiest
708
709 Moreover, ph: fields can handle suggestions with more than two words,
710 also different suggestions for the same misspelling:
711
712 do not know ph:dunno
713 don't know ph:dunno
714
715 results
716
717
718 *dunno -> do not know, don't know
719
720 Note: if available, ph: is used in n-gram similarity, too.
721
722 The ASCII arrow "->" in a ph: pattern means a REP rule (see REP), cre‐
723 ating arbitrary replacement rule associated to the dictionary item:
724
725 happy/B ph:hepy ph:hepi->happi
726
727 results
728
729
730 *hepy -> happy
731 *hepiest -> happiest
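
The mapping from ph: fields to REP-style rules described above can be
sketched as follows. The function ph_to_rep is hypothetical; in
particular, emitting an extra capitalized pattern for capitalized words
is inferred only from the Wednesday and Marseille example and may not
match Hunspell's behavior in every case.

```python
def ph_to_rep(word, ph_patterns):
    """Derive REP-style (what, replacement) pairs from ph: patterns."""
    rules = []
    for pattern in ph_patterns:
        if "->" in pattern:
            # Explicit rule: the left side is replaced by the right side.
            what, replacement = pattern.split("->", 1)
            rules.append((what, replacement))
        elif pattern.endswith("*"):
            # Drop the asterisk and the last character of the pattern,
            # and the last character of the word.
            rules.append((pattern[:-2], word[:-1]))
        else:
            rules.append((pattern, word))
            if word[0].isupper():
                # Capitalized words also get a capitalized pattern.
                rules.append((pattern.capitalize(), word))
    return rules


print(ph_to_rep("Wednesday", ["wendsay", "wensday"]))
print(ph_to_rep("pretty", ["prity*"]))
print(ph_to_rep("happy", ["hepy", "hepi->happi"]))
```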
732
733 st: Stem. Optional: default stem is the dictionary item in morpho‐
734 logical analysis. Stem field is useful for virtual stems (dic‐
735 tionary words with NEEDAFFIX flag) and morphological exceptions
736 instead of new, single used morphological rules.
737
738 feet st:foot is:plural
739 mice st:mouse is:plural
740 teeth st:tooth is:plural
741
742 Word forms with multiple stems need multiple dictionary items:
743
744
745 lay po:verb st:lie is:past_2
746 lay po:verb is:present
747 lay po:noun
748
749 al: Allomorph(s). A dictionary item is the stem of its allomorphs.
750 Morphological generation needs stem, allomorph and affix fields.
751
752 sing al:sang al:sung
753 sang st:sing
754 sung st:sing
755
756 po: Part of speech category.
757
758 ds: Derivational suffix(es). Stemming doesn't remove derivational
759 suffixes. Morphological generation depends on the order of the
760 suffix fields.
761
762 In affix rules:
763
764
765 SFX Y Y 1
766 SFX Y 0 ly . ds:ly_adj
767
768 In the dictionary:
769
770
771 ably st:able ds:ly_adj
772 able al:ably
773
774 is: Inflectional suffix(es). All inflectional suffixes are removed
775 by stemming. Morphological generation depends on the order of
776 the suffix fields.
777
778
779 feet st:foot is:plural
780
781 ts: Terminal suffix(es). Terminal suffix fields are inflectional
782 suffix fields "removed" by additional (not terminal) suffixes.
783
784 Useful for zero morphemes and affixes removed by splitting
785 rules.
786
787
788 work/D ts:present
789
790 SFX D Y 2
791 SFX D 0 ed . is:past_1
792 SFX D 0 ed . is:past_2
793
794 Typical example of the terminal suffix is the zero morpheme of the nom‐
795 inative case.
796
797
798 sp: Surface prefix. Temporary solution for adding prefixes to the
799 stems and generated word forms. See tests/morph.* example.
800
801
802 pa: Parts of the compound words. Output fields of morphological
803 analysis for stemming.
804
805 dp: Planned: derivational prefix.
806
807 ip: Planned: inflectional prefix.
808
809 tp: Planned: terminal prefix.
810
811
Ispell's original algorithm strips only one suffix. Hunspell can strip
one more (or one more prefix in COMPLEXPREFIXES mode).

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterize agglutinative languages.
819
820 A second `s' suffix (affix class Y) will be the continuation class of
821 the suffix `able' in the following example:
822
823
824 SFX Y Y 1
825 SFX Y 0 s .
826
827 SFX X Y 1
828 SFX X 0 able/Y .
829
830 Dictionary file:
831
832
833 drink/X
834
835 Test file:
836
837
838 drink
839 drinkable
840 drinkables
841
842 Test:
843
844
845 $ hunspell -m -d test <test.txt
846 drink st:drink
847 drinkable st:drink fl:X
848 drinkables st:drink fl:X fl:Y
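
Continuation classes can be sketched as a generator of word forms. The
table SUFFIXES and the function generate are hypothetical, conditions
and stripping are left out, and the depth limit of two reflects the
twofold stripping described above.

```python
# Suffix classes: flag -> list of (suffix, continuation_flags).
SUFFIXES = {
    "Y": [("s", "")],
    "X": [("able", "Y")],
}


def generate(stem, flags, suffixes, depth=2):
    """Generate forms with at most `depth` stacked suffixes."""
    forms = [stem]
    if depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation in suffixes[flag]:
            # The continuation flags decide which suffix may follow.
            forms.extend(generate(stem + suffix, continuation, suffixes, depth - 1))
    return forms


print(generate("drink", "X", SUFFIXES))
```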
849
Theoretically, twofold suffix stripping needs only the square root of
the number of suffix rules needed by an implementation that strips a
single suffix. In our practice, we could elaborate the Hungarian
inflectional morphology with twofold suffix stripping.
854
855
Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for giving flags in affix and dictionary files.
859
860 FLAG long command sets 2-character flags:
861
862
863 FLAG long
864 SFX Y1 Y 1
SFX Y1 0 s .
866
867 Dictionary record with the Y1, Z3, F? flags:
868
869
870 foo/Y1Z3F?
871
872 FLAG num command sets numerical flags separated by comma:
873
874
875 FLAG num
876 SFX 65000 Y 1
SFX 65000 0 s .
878
879 Dictionary example:
880
881
882 foo/65000,12,2756
883
884 The third one is the Unicode character flags.
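
How a flag field is split under each flag type can be sketched as
follows. The function parse_flags and its flag_type argument are
hypothetical names used for illustration.

```python
def parse_flags(field, flag_type="char"):
    """Split the flag field of a dictionary entry into single flags."""
    if flag_type == "long":
        # Two characters per flag.
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        # Decimal numbers separated by commas.
        return field.split(",")
    # Default and UTF-8: one character per flag.
    return list(field)


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
print(parse_flags("AB"))
```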
885
886
888 Hunspell's dictionary can contain repeating elements that are homonyms:
889
890
891 work/A po:verb
892 work/B po:noun
893
894 An affix file:
895
896
897 SFX A Y 1
SFX A 0 s . is:sg3
899
900 SFX B Y 1
901 SFX B 0 s . is:plur
902
903 Test file:
904
905
906 works
907
908 Test:
909
910
911 $ hunspell -d test -m <testwords
works st:work po:verb is:sg3
works st:work po:noun is:plur
914
915 This feature also gives a way to forbid illegal prefix/suffix combina‐
916 tions.
917
918
An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to render lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier Hunspell code. This
may have been legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay or drinkable,
undrinkable but *undrink). In simple words, here the prefix un- is
legitimate only if the base drink is suffixed with -able. If both these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.

In the next example, suffix class R has a prefix `continuation' class
(class P).
940
941
942 PFX P Y 1
943 PFX P 0 un . [prefix_un]+
944
945 SFX S Y 1
946 SFX S 0 s . +PL
947
948 SFX Q Y 1
949 SFX Q 0 s . +3SGV
950
951 SFX R Y 1
952 SFX R 0 able/PS . +DER_V_ADJ_ABLE
953
954 Dictionary:
955
956
957 2
958 drink/RQ [verb]
959 drink/S [noun]
960
961 Morphological analysis:
962
963
964 > drink
965 drink[verb]
966 drink[noun]
967 > drinks
968 drink[verb]+3SGV
969 drink[noun]+PL
970 > drinkable
971 drink[verb]+DER_V_ADJ_ABLE
972 > drinkables
973 drink[verb]+DER_V_ADJ_ABLE+PL
974 > undrinkable
975 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
976 > undrinkables
977 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
978 > undrink
979 Unknown word.
980 > undrinks
981 Unknown word.
982
984 Conditional affixes implemented by a continuation class are not enough
985 for circumfixes, because a circumfix is one affix in morphology. We
986 also need CIRCUMFIX option for correct morphological analysis.
987
988
989 # circumfixes: ~ obligate prefix/suffix combinations
990 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
991 # nagy, nagyobb, legnagyobb, legeslegnagyobb
992 # (great, greater, greatest, most greatest)
993
994 CIRCUMFIX X
995
996 PFX A Y 1
997 PFX A 0 leg/X .
998
999 PFX B Y 1
1000 PFX B 0 legesleg/X .
1001
1002 SFX C Y 3
1003 SFX C 0 obb . +COMPARATIVE
1004 SFX C 0 obb/AX . +SUPERLATIVE
1005 SFX C 0 obb/BX . +SUPERSUPERLATIVE
1006
1007 Dictionary:
1008
1009
1010 1
1011 nagy/C [MN]
1012
1013 Analysis:
1014
1015
1016 > nagy
1017 nagy[MN]
1018 > nagyobb
1019 nagy[MN]+COMPARATIVE
1020 > legnagyobb
1021 nagy[MN]+SUPERLATIVE
1022 > legeslegnagyobb
1023 nagy[MN]+SUPERSUPERLATIVE
1024
Allowing free compounding decreases the precision of recognition, not
to mention stemming and morphological analysis. Although lexical
switches are introduced to license compounding of bases by Ispell, this
proves not to be restrictive enough. For example:
1030
1031
1032 # affix file
1033 COMPOUNDFLAG X
1034
1035 2
1036 foo/X
1037 bar/X
1038
1039 With this resource, foobar and barfoo also are accepted words.
1040
1041 This has been improved upon with the introduction of direction-sensi‐
1042 tive compounding, i.e., lexical features can specify separately whether
1043 a base can occur as leftmost or rightmost constituent in compounds.
1044 This, however, is still insufficient to handle the intricate patterns
1045 of compounding, not to mention idiosyncratic (and language specific)
1046 norms of hyphenation.
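Direction-sensitive compounding can be sketched by recording, for each base, the positions in which it may occur. The Python below is a toy model under stated assumptions: accepts and the lexicon layout are hypothetical, and the position names only loosely mirror the COMPOUNDBEGIN, COMPOUNDMIDDLE and COMPOUNDEND options; affixes and hyphenation are not modelled.

```python
def accepts(word, lexicon):
    """True if word is a compound whose members all sit in allowed positions.

    lexicon maps each base to a set drawn from {"begin", "middle", "end"}.
    """
    def walk(rest, index):
        for member, positions in lexicon.items():
            if not rest.startswith(member):
                continue
            tail = rest[len(member):]
            if not tail:
                # Last member: needs "end", and a lone word is no compound.
                if index > 0 and "end" in positions:
                    return True
                continue
            role = "begin" if index == 0 else "middle"
            if role in positions and walk(tail, index + 1):
                return True
        return False
    return walk(word, 0)


lexicon = {"foo": {"begin"}, "baz": {"middle"}, "bar": {"end"}}
for w in ["foobar", "barfoo", "foobazbar", "foobaz"]:
    print(w, accepts(w, lexicon))
```

With these restrictions `foobar' is accepted and `barfoo' is not, but the model still says nothing about affixed members or hyphens, which is the remaining gap noted above.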
1047
1048 The Myspell algorithm allows any affixed form of words that are
1049 lexically marked as potential members of compounds. Hunspell improves
1050 on this: its recursive compound checking rules make it possible to
1051 implement the intricate spelling conventions of Hungarian compounds.
1052 For example, the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
1053 SYLLABLENUM options can be used to implement the noteworthy Hungarian
1054 `6-3' rule. As a further example, Hungarian derivational suffixes often
1055 modify compounding properties. Hunspell allows compounding flags on
1056 affixes, and there are two special flags (COMPOUNDPERMITFLAG and
1057 COMPOUNDFORBIDFLAG) to permit or prohibit compounding of derivations.
1058
1059 Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the affixed word.
1060
1061 We also need several Hunspell features for handling German compounding:
1062
1063
1064 # German compounding
1065
1066 # set language to handle special casing of German sharp s
1067
1068 LANG de_DE
1069
1070 # compound flags
1071
1072 COMPOUNDBEGIN U
1073 COMPOUNDMIDDLE V
1074 COMPOUNDEND W
1075
1076 # Prefixes are allowed at the beginning of compounds,
1077 # suffixes are allowed at the end of compounds by default:
1078 # (prefix)?(root)+(affix)?
1079 # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
1080 COMPOUNDPERMITFLAG P
1081
1082 # for German fugemorphemes (Fuge-elements)
1083 # Hint: ONLYINCOMPOUND is not required everywhere, but the
1084 # checking will be a little faster with it.
1085
1086 ONLYINCOMPOUND X
1087
1088 # forbid uppercase characters at compound word bounds
1089 CHECKCOMPOUNDCASE
1090
1091 # for handling Fuge-elements with dashes (Arbeits-)
1092 # dash will be a special word
1093
1094 COMPOUNDMIN 1
1095 WORDCHARS -
1096
1097 # compound settings and fugemorpheme for `Arbeit'
1098
1099 SFX A Y 3
1100 SFX A 0 s/UPX .
1101 SFX A 0 s/VPDX .
1102 SFX A 0 0/WXD .
1103
1104 SFX B Y 2
1105 SFX B 0 0/UPX .
1106 SFX B 0 0/VWXDP .
1107
1108 # a suffix for `Computer'
1109
1110 SFX C Y 1
1111 SFX C 0 n/WD .
1112
1113 # to forbid exceptions (*Arbeitsnehmer)
1114
1115 FORBIDDENWORD Z
1116
1117 # dash prefix for compounds with dash (Arbeits-Computer)
1118
1119 PFX - Y 1
1120 PFX - 0 -/P .
1121
1122 # decapitalizing prefix
1123 # circumfix for positioning in compounds
1124
1125 PFX D Y 29
1126 PFX D A a/PX A
1127 PFX D Ä ä/PX Ä
1128 .
1129 .
1130 PFX D Y y/PX Y
1131 PFX D Z z/PX Z
1132
1133 Example dictionary:
1134
1135
1136 4
1137 Arbeit/A-
1138 Computer/BC-
1139 -/W
1140 Arbeitsnehmer/Z
1141
1142 Accepted compound words with the previous resource:
1143
1144
1145 Computer
1146 Computern
1147 Arbeit
1148 Arbeits-
1149 Computerarbeit
1150 Computerarbeits-
1151 Arbeitscomputer
1152 Arbeitscomputern
1153 Computerarbeitscomputer
1154 Computerarbeitscomputern
1155 Arbeitscomputerarbeit
1156 Computerarbeits-Computer
1157 Computerarbeits-Computern
1158
1159 Compounds that are not accepted:
1160
1161
1162 computer
1163 arbeit
1164 Arbeits
1165 arbeits
1166 ComputerArbeit
1167 ComputerArbeits
1168 Arbeitcomputer
1169 ArbeitsComputer
1170 Computerarbeitcomputer
1171 ComputerArbeitcomputer
1172 ComputerArbeitscomputer
1173 Arbeitscomputerarbeits
1174 Computerarbeits-computer
1175 Arbeitsnehmer
1176
1177 This solution is still not ideal, however, and will be replaced by a
1178 pattern-based compound-checking algorithm which is closely integrated
1179 with input buffer tokenization. Patterns describing compounds come as a
1180 separate input resource that can refer to high-level properties of con‐
1181 stituent parts (e.g. the number of syllables, affix flags, and contain‐
1182 ment of hyphens). The patterns are matched against potential segmenta‐
1183 tions of compounds to assess wellformedness.
1184
1185
1187 Both Ispell and Myspell use 8-bit character encodings, which is a
1188 major deficiency when it comes to scalability. Although a language
1189 like Hungarian has a standard 8-bit character set (ISO 8859-2), that
1190 set does not allow a full implementation of Hungarian orthographic
1191 conventions. For instance, the en dash (–) is missing from this
1192 character set, even though it is not only the official symbol for
1193 delimiting parenthetic clauses in the language, but can also appear
1194 in compound words as a special `big' hyphen.
1195
1196 MySpell has some 8-bit encoding tables, but there are also languages
1197 without a standard 8-bit encoding. For example, many African
1198 languages use non-Latin or extended Latin characters.
1199
1200 Similarly, using the original spelling of certain foreign names like
1201 Ångström or Molière is encouraged by the Hungarian spelling norm.
1202 Since the characters 'Å' and 'è' are not part of ISO 8859-2, combining
1203 such names with inflections whose characters exist only in ISO 8859-2
1204 (like elative -ből, ablative -től or delative -ről with double acute)
1205 results in words (like Ångströmről or Molière-től) that cannot be
1206 encoded using any single 8-bit encoding scheme.
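The encoding problem can be checked directly with Python's standard codecs. This short sketch only tests ISO 8859-1, ISO 8859-2 and UTF-8; the helper name encodable is hypothetical, and the strings are normalized to NFC first so that each accented letter is a single code point.

```python
import unicodedata


def encodable(word, encoding):
    """True if every character of word exists in the given encoding."""
    try:
        unicodedata.normalize("NFC", word).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for word in ["Ångströmről", "Molière-től"]:
    for encoding in ["iso-8859-1", "iso-8859-2", "utf-8"]:
        print(word, encoding, encodable(word, encoding))
```

ISO 8859-1 lacks the double-acute `ő' of the Hungarian suffix, ISO 8859-2 lacks the `Å' and `è' of the foreign names, and only UTF-8 encodes both words.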
1207
1208 The problems raised by 8-bit encodings have long been recognized by
1209 proponents of Unicode. It is clear that trading efficiency for
1210 encoding-independence has its advantages when it comes to a truly
1211 multilingual application. Hunspell implements memory- and time-efficient
1212 Unicode handling. In non-UTF-8 character encodings, Hunspell works
1213 with the original 8-bit strings. In UTF-8 encoding, affixes and words
1214 are stored in UTF-8 and are handled mostly in UTF-8 during analysis;
1215 for condition checking and suggestion they are converted to UTF-16.
1216 Unicode text analysis and spell checking have a minimal (0-20%) time
1217 overhead and a minimal or reasonable memory overhead, depending on
1218 the language (its UTF-8 encoding and affixation).
1219
1220
1222 Aspell dictionaries can be easily converted to Hunspell format.
1223 Conversion steps:
1224
1225 dictionary (xx.cwl -> xx.wl):
1226
1227 preunzip xx.cwl
1228 wc -l < xx.wl > xx.dic
1229 cat xx.wl >> xx.dic
1230
1231 affix file
1232
1233 If the affix file exists, copy it:
1234 cp xx_affix.dat xx.aff
1235 If not, create it with the suitable character encoding (see xx.dat)
1236 echo "SET ISO8859-x" > xx.aff
1237 or
1238 echo "SET UTF-8" > xx.aff
1239
1240 It's useful to add a TRY option listing the characters of the
1241 dictionary in frequency order to guide edit-distance suggestions:
1242 echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
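The dictionary and TRY steps above can also be scripted. The Python sketch below is an assumption-laden illustration rather than an official tool: build_dic and try_line are hypothetical helper names, the word list is assumed to be already decompressed (one word per line), and the TRY characters are simply ordered by how often they occur in that list.

```python
from collections import Counter


def build_dic(words):
    """Return .dic text: the word count on line one, then one word per line."""
    return "".join(line + "\n" for line in [str(len(words)), *words])


def try_line(words):
    """Return a TRY option listing characters by descending frequency."""
    counts = Counter(ch for word in words for ch in word)
    return "TRY " + "".join(ch for ch, _ in counts.most_common())


words = ["aab", "ab", "a"]  # stands in for the contents of xx.wl
print(build_dic(words), end="")
print(try_line(words))
```

Unlike the fixed keyboard-order string in the shell example, this derives the character order from the word list itself; characters with equal counts keep the order in which they were first seen.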
1243
1244
1246 hunspell (1), ispell (1), ispell (4)
1247
1248
1249
1250
1251 2017-09-20 hunspell(5)