hunspell(4) Kernel Interfaces Manual hunspell(4)

hunspell - format of Hunspell dictionaries and affix files
7
Hunspell(1) requires two files to define how a language is spell
checked: a dictionary file containing words and their applicable
flags, and an affix file that specifies how these flags control
spell checking. A personal dictionary file is optional.
13
14
A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except a personal dictionary) contains
the approximate word count (used to size the hash table). Each word
may optionally be followed by a slash ("/") and one or more flags,
which represent word attributes, for example affixes.
21
Note: Dictionary words can also contain slashes when they are escaped
with a backslash ("\/").
24
25
Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word separated by
a slash sets the affixation: the first word takes the affixes of the
second.
30
31
32 foo
33 Foo/Simpson
34 *bar
35
In this example, "foo" and "Foo" are personal words; in addition,
"Foo" is recognized with the affixes of "Simpson" (Foo's etc.), and
"bar" is a forbidden word.
39
40
42 Dictionary file:
43
44 3
45 hello
46 try/B
47 work/AB
48
49 The flags B and A specify attributes of these words.
50
51 Affix file:
52
53
54 SET UTF-8
55 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
56
57 REP 2
58 REP f ph
59 REP ph f
60
61 PFX A Y 1
62 PFX A 0 re .
63
64 SFX B Y 2
65 SFX B 0 ed [^y]
66 SFX B y ied y
67
In the affix file, prefix class A and suffix class B are defined.
Class A defines a `re-' prefix. Class B defines two `-ed' suffixes:
the first can be added to a word whose last character is not `y'; the
second can be added to words ending in `y', replacing the `y' with
`ied'.
72
73
74 All accepted words with this dictionary and affix combination are:
75 "hello", "try", "tried", "work", "worked", "rework", "reworked".
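
As a rough illustration of how the example above arrives at its seven
accepted words, the following Python sketch expands the dictionary with
the two affix classes. It is not Hunspell's implementation: the table
layout and the names PREFIXES, SUFFIXES, DICTIONARY and expand are
assumptions made for this sketch, and conditions are simply handed to
Python's re module.

```python
import re

# Affix classes from the example: flag -> list of (strip, add, condition).
# "0" in the affix file means "nothing"; it is written here as "".
PREFIXES = {"A": [("", "re", ".")]}
SUFFIXES = {"B": [("", "ed", "[^y]"), ("y", "ied", "y")]}

# Dictionary entries from the example: word -> flags.
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def add_suffix(word, strip, add, cond):
    """Return the suffixed form, or None if the rule does not apply."""
    if not re.search(f"(?:{cond})$", word) or not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + add


def add_prefix(word, strip, add, cond):
    """Return the prefixed form, or None if the rule does not apply."""
    if not re.match(f"(?:{cond})", word) or not word.startswith(strip):
        return None
    return add + word[len(strip):]


def expand(dictionary):
    """Generate every accepted form (both classes allow cross products)."""
    accepted = set()
    for word, flags in dictionary.items():
        forms = {word}
        for flag in flags:
            for rule in SUFFIXES.get(flag, []):
                form = add_suffix(word, *rule)
                if form:
                    forms.add(form)
        # Cross product: prefixes apply to the stem and to suffixed forms.
        unprefixed = set(forms)
        for flag in flags:
            for rule in PREFIXES.get(flag, []):
                for form in unprefixed:
                    prefixed = add_prefix(form, *rule)
                    if prefixed:
                        forms.add(prefixed)
        accepted |= forms
    return accepted


print(sorted(expand(DICTIONARY)))
```

The sketch generates forms from the dictionary, which is easy to read
but is the reverse of what a spell checker does at run time, where
affixes are stripped from the input word.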
76
77
The Hunspell source distribution contains more than 80 examples of
option usage.
81
82
83 SET encoding
84 Set character encoding of words and morphemes in affix and dic‐
85 tionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
86 ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251,
87 ISCII-DEVANAGARI.
88
89 SET UTF-8
90
91 FLAG value
Set flag type. The default type is a single extended ASCII (8-bit)
character. The `UTF-8' value sets UTF-8 encoded Unicode character
flags. The `long' value sets two-character (double extended ASCII)
flags, and `num' sets decimal number flags. Decimal flags are
numbered from 1 to 65000 and are separated by commas in flag
fields. BUG: the UTF-8 flag type doesn't work on the ARM platform.
99
100 FLAG long
101
102 COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping), e.g.
for morphologically complex languages with right-to-left writing
systems.
106
107
108 LANG langcode
109 Set language code for language specific functions of Hunspell.
110 Use it to activate special casing of Azeri (LANG az) and Turkish
111 (LANG tr).
112
113 IGNORE characters
Sets characters to ignore in dictionary words, affixes and input
words. Useful for optional characters, such as Arabic (harakat) or
Hebrew (niqqud) diacritical marks (see the tests/ignore.* test
dictionary in the Hunspell distribution).
118
119
120 AF number_of_flag_vector_aliases
121
122 AF flag_vector
123 Hunspell can substitute affix flag sets with ordinal numbers in
124 affix rules (alias compression, see makealias tool). First exam‐
125 ple with alias compression:
126
127 3
128 hello
129 try/1
130 work/2
131
132 AF definitions in the affix file:
133
134 AF 2
135 AF A
136 AF AB
137
This is equivalent to the following dic file:
139
140 3
141 hello
142 try/A
143 work/AB
144
145 See also tests/alias* examples of the source distribution.
146
Note I: If the affix file contains the FLAG parameter, define it
before the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.
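
To make the alias mechanism concrete, here is a minimal Python sketch
that reads the AF table and turns the compressed dictionary entries back
into their plain form. The function names read_aliases and expand_entry
are hypothetical, and the sketch ignores escaped slashes and the FLAG
setting.

```python
def read_aliases(aff_lines):
    """Collect the flag vectors of the AF table, in order."""
    aliases = []
    header_seen = False
    for line in aff_lines:
        parts = line.split()
        if len(parts) != 2 or parts[0] != "AF":
            continue
        if not header_seen:
            # The first AF line only carries the number of aliases.
            header_seen = True
            continue
        aliases.append(parts[1])
    return aliases


def expand_entry(entry, aliases):
    """Replace an ordinal number after the slash with its flag vector."""
    word, slash, flags = entry.partition("/")
    if slash and flags.isdigit():
        # Alias numbers start at 1.
        return f"{word}/{aliases[int(flags) - 1]}"
    return entry


aliases = read_aliases(["AF 2", "AF A", "AF AB"])
print([expand_entry(e, aliases) for e in ["hello", "try/1", "work/2"]])
```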
152
153 AM number_of_morphological_aliases
154
155 AM morphological_fields
Hunspell can also substitute morphological data with ordinal
numbers in affix rules (alias compression). See the tests/alias*
examples.
159
Suggestion parameters tune the default suggestions of Hunspell: n-gram
(a similarity search over the dictionary words based on common
character sequences of length 1 to 4), character swap and character
deletion. REP is recommended for fixing typical, and especially bad,
language-specific mistakes, because REP suggestions have the highest
priority in the suggestion list. PHONE is for languages whose
orthography is not pronunciation based.
168
169 KEY characters_separated_by_vertical_line_optionally
Hunspell searches for and suggests words in which one character is
replaced by a neighboring character of the KEY string. Characters
that are not neighbors are separated by vertical line characters.
Suggested KEY parameters for QWERTY and Dvorak keyboard layouts:
174
175 KEY qwertyuiop|asdfghjkl|zxcvbnm
176 KEY pyfgcrl|aeouidhtns|qjkxbmwvz
177
Using the first (QWERTY) layout, Hunspell suggests "nude" and "node"
for "*nide". A character may have more neighbors, too, by listing it
more than once:
180
181 KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
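
The neighbor lookup described above can be sketched in a few lines of
Python. This is an illustration only, not Hunspell's suggestion code;
key_neighbors and key_suggestions are hypothetical names, and the small
word set stands in for a real dictionary.

```python
def key_neighbors(key, char):
    """Return the characters adjacent to `char` anywhere in the KEY string."""
    found = []
    for i, c in enumerate(key):
        if c != char:
            continue
        for j in (i - 1, i + 1):
            # The vertical line separates characters that are not neighbors.
            if 0 <= j < len(key) and key[j] != "|" and key[j] not in found:
                found.append(key[j])
    return found


def key_suggestions(word, key, dictionary):
    """Replace one character by a neighbor and keep the dictionary words."""
    result = []
    for i, char in enumerate(word):
        for neighbor in key_neighbors(key, char):
            candidate = word[:i] + neighbor + word[i + 1:]
            if candidate in dictionary and candidate not in result:
                result.append(candidate)
    return result


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
print(key_suggestions("nide", QWERTY, {"nude", "node", "nice"}))
```

Because the lookup scans every occurrence of a character, a KEY string
that lists a character several times simply gives it more neighbors.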
182
183 TRY characters
Hunspell can suggest correct word forms when they differ from the
misspelled input word by one TRY character. The TRY parameter is
case sensitive.
187
188 NOSUGGEST flag
Words marked with the NOSUGGEST flag are not suggested (but are
still accepted when typed correctly). Intended for vulgar and
obscene words (see also SUBSTANDARD).
192
193 MAXCPDSUGS num
Set the maximum number of suggested compound words generated by
compound rules. The number of suggested compound words may exceed
this limit when they are of the same 1-character distance type.
197
198 MAXNGRAMSUGS num
199 Set max. number of n-gram suggestions. Value 0 switches off the
200 n-gram suggestions (see also MAXDIFF).
201
202 MAXDIFF [0-10]
203 Set the similarity factor for the n-gram based suggestions (5 =
204 default value; 0 = fewer n-gram suggestions, but min. 1; 10 =
205 MAXNGRAMSUGS n-gram suggestions).
206
207 ONLYMAXDIFF
208 Remove all bad n-gram suggestions (default mode keeps one, see
209 MAXDIFF).
210
211 NOSPLITSUGS
212 Disable word suggestions with spaces.
213
214 SUGSWITHDOTS
215 Add dot(s) to suggestions, if input word terminates in dot(s).
216 (Not for OpenOffice.org dictionaries, because OpenOffice.org has
217 an automatic dot expansion mechanism.)
218
219 REP number_of_replacement_definitions
220
221 REP what replacement
This table specifies modifications to try first. The first REP line
is the header of the table, followed by one or more REP data lines.
With this table, Hunspell can suggest the right forms for typical
spelling mistakes when the incorrect form differs from the right
form by more than one letter. The search string supports the regex
boundary signs (^ and $). For example, a possible English
replacement table to handle misspelled consonants:
230
231 REP 5
232 REP f ph
233 REP ph f
234 REP tion$ shun
235 REP ^cooccurr co-occurr
236 REP ^alot$ a_lot
237
238 Note I: It's very useful to define replacements for the most typical
239 one-character mistakes, too: with REP you can add higher priority to a
240 subset of the TRY suggestions (suggestion list begins with the REP sug‐
241 gestions).
242
Note II: To suggest separated words, specify spaces with underscores:
244
245
246 REP 1
247 REP onetwothree one_two_three
248
Note III: The replacement table can also be used for stricter compound
word checking with the CHECKCOMPOUNDREP option.
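
A minimal Python sketch of how such a replacement table could be
applied, assuming the semantics described above (^ and $ anchor the
search string, underscores stand for spaces). The names rep_candidates
and rep_suggestions are hypothetical and the word set is a stand-in for
a real dictionary; Hunspell's own ranking and filtering are not modeled.

```python
def rep_candidates(word, rules):
    """Apply each REP rule at every position where it matches."""
    candidates = []
    for pattern, replacement in rules:
        at_start = pattern.startswith("^")
        at_end = pattern.endswith("$")
        text = pattern.lstrip("^").rstrip("$")
        if not text:
            continue
        replacement = replacement.replace("_", " ")
        start = word.find(text)
        while start != -1:
            end = start + len(text)
            if (not at_start or start == 0) and (not at_end or end == len(word)):
                candidate = word[:start] + replacement + word[end:]
                if candidate not in candidates:
                    candidates.append(candidate)
            start = word.find(text, start + 1)
    return candidates


def rep_suggestions(word, rules, dictionary):
    """Keep candidates whose space-separated parts are all known words."""
    return [
        c for c in rep_candidates(word, rules)
        if all(part in dictionary for part in c.split())
    ]


RULES = [("f", "ph"), ("ph", "f"), ("tion$", "shun"),
         ("^cooccurr", "co-occurr"), ("^alot$", "a_lot")]
WORDS = {"a", "lot", "phone"}
print(rep_suggestions("alot", RULES, WORDS))
print(rep_suggestions("fone", RULES, WORDS))
```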
251
252
253 MAP number_of_map_definitions
254
255 MAP string_of_related_chars_or_parenthesized_character_sequences
We can define language-dependent information on characters and
character sequences that should be considered related (i.e.
nearer than other chars not in the set) in the affix file (.aff)
by a map table. With this table, Hunspell can suggest the right
forms for words that incorrectly use the wrong letter or letter
group from a related set more than once in a word (see REP).

For example, a possible mapping could be between the German
umlauted ü and the regular u; the word Frühstück should be
written with umlauted u's and not regular ones:
267
268 MAP 1
269 MAP uü
270
271 Use parenthesized groups for character sequences (eg. for composed Uni‐
272 code characters):
273
274 MAP 3
275 MAP ß(ss) (character sequence)
276 MAP fi(fi) ("fi" compatibility characters for Unicode fi ligature)
277 MAP (ọ́)o (composed Unicode character: ó with bottom dot)
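
The idea behind MAP can be illustrated with a short Python sketch that
tries every combination of related characters or sequences and keeps
the variants found in a word list. The names parse_map_entry, variants
and map_suggestions are hypothetical; the exhaustive search is chosen
for clarity and is not how Hunspell bounds its own search.

```python
import re


def parse_map_entry(entry):
    """Split a MAP line body into members; (..) groups are sequences."""
    return [group or char
            for group, char in re.findall(r"\(([^)]+)\)|(.)", entry)]


def variants(word, related_sets, pos=0):
    """Yield the word with any number of related substitutions applied."""
    if pos >= len(word):
        yield word
        return
    # Leave the current position unchanged.
    yield from variants(word, related_sets, pos + 1)
    # Or swap a member found here for another member of the same set.
    for related in related_sets:
        for member in related:
            if not word.startswith(member, pos):
                continue
            for other in related:
                if other != member:
                    changed = word[:pos] + other + word[pos + len(member):]
                    yield from variants(changed, related_sets,
                                        pos + len(other))


def map_suggestions(word, map_entries, dictionary):
    related_sets = [parse_map_entry(e) for e in map_entries]
    found = {v for v in variants(word, related_sets)
             if v in dictionary and v != word}
    return sorted(found)


print(map_suggestions("Fruhstuck", ["uü"], {"Frühstück"}))
print(map_suggestions("Strasse", ["ß(ss)"], {"Straße"}))
```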
278
279 PHONE number_of_phone_definitions
280
281 PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm
borrowed from Aspell. It is useful for languages whose orthography
is not pronunciation based. You can add a full alphabet conversion
and other rules for the conversion of special letter sequences.
For detailed documentation see
http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
UTF-8 characters do not yet work in bracket expressions. Dash
expressions currently operate on signed bytes, not on UTF-8
characters.
290
291 WARN flag
This flag is for rare words that are also often spelling mistakes;
see option -r of command line Hunspell and FORBIDWARN.
294
295 FORBIDWARN
Words with the WARN flag aren't accepted by the spell checker when
this parameter is set.
298
300 BREAK number_of_break_definitions
301
302 BREAK character_or_character_sequence
Define new break points for breaking words and checking word
parts separately. Use ^ and $ to delete characters at the start
and end of the word. Rationale: useful for compounding with joining
characters or strings (for example, hyphen in English and German
or hyphen and n-dash in Hungarian). Dashes are often bad break
points for tokenization, because compounds with dashes may
contain invalid parts, too. With BREAK, Hunspell can check both
sides of these compounds, breaking the words at dashes and
n-dashes:
312
313 BREAK 2
314 BREAK -
315 BREAK -- # n-dash
316
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:
320
321 BREAK 3
322 BREAK -
323 BREAK ^-
324 BREAK -$
325
326 Hunspell doesn't accept the "-word" and "word-" forms by this BREAK
327 definition:
328
329 BREAK 1
330 BREAK -
331
332 Switching off the default values:
333
334 BREAK 0
335
Note II: COMPOUNDRULE is better for handling dashes and other compound
joining characters or character strings. Use BREAK if you want to
check words with dashes or other joining characters and there is no
time or possibility to describe precise compound rules with
COMPOUNDRULE (COMPOUNDRULE handles only the suffixation of the last
word part of a compound word).

Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter: WORDCHARS --- (see the
tests/break.* example).
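
The recursive breaking described in this section can be sketched as
follows. This Python fragment is an illustration under simplifying
assumptions, not Hunspell's tokenizer: the function name accepts and
the two-word dictionary are hypothetical, "--" stands in for the n-dash
as in the example above, and empty word parts are always rejected.

```python
def accepts(word, words, breaks):
    """Accept a word directly or by breaking it at the BREAK patterns."""
    if word in words:
        return True
    if not word:
        return False
    for pattern in breaks:
        if pattern.startswith("^"):
            # Delete the pattern at the start of the word.
            text = pattern[1:]
            if text and word.startswith(text) and \
                    accepts(word[len(text):], words, breaks):
                return True
        elif pattern.endswith("$"):
            # Delete the pattern at the end of the word.
            text = pattern[:-1]
            if text and word.endswith(text) and \
                    accepts(word[:-len(text)], words, breaks):
                return True
        elif pattern:
            # Break inside the word and check both sides recursively.
            pos = word.find(pattern)
            while pos != -1:
                left, right = word[:pos], word[pos + len(pattern):]
                if accepts(left, words, breaks) and \
                        accepts(right, words, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


WORDS = {"foo", "bar"}
print(accepts("foo-foo--bar-bar", WORDS, ["-", "--"]))
print(accepts("-foo", WORDS, ["-", "^-", "-$"]))
print(accepts("-foo", WORDS, ["-"]))
```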
346
347 COMPOUNDRULE number_of_compound_definitions
348
349 COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The
first COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags, parentheses, and the star and question mark metacharacters.
A flag followed by a `*' matches a sequence of 0 or more words
marked with this compound flag. A flag followed by a `?' matches
a sequence of 0 or 1 words marked with this compound flag. See
the tests/compound*.* examples.
359
360 Note: en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
361 ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
362 1000122nd etc.).
363
364 Note II: In the case of long and numerical flag types use only
365 parenthesized flags: (1500)*(2000)?
366
Note III: COMPOUNDRULE flags work completely separately from the
compounding mechanism using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
compound flags. (Use these flags on different entries for words.)
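
Since compound patterns are regex-like, one way to picture them is to
match the pattern against the sequence of flags of the word parts. The
Python sketch below does this for single-character flags only; the
dictionary, the flags A, B, C and the rule "A*B?C" are made up for the
illustration, and Hunspell's real matcher is not modeled.

```python
import re
from itertools import product


def rule_to_regex(rule):
    """Turn a pattern of single-character flags into a Python regex."""
    return re.compile(
        "".join(c if c in "*?()" else re.escape(c) for c in rule))


def segmentations(word, dictionary):
    """Yield every split of the word into dictionary words."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in dictionary:
            for rest in segmentations(word[end:], dictionary):
                yield [head] + rest


def matches_compound_rule(word, dictionary, rules):
    """True if some split has a flag sequence matching some rule."""
    patterns = [rule_to_regex(rule) for rule in rules]
    for parts in segmentations(word, dictionary):
        # Each part contributes one of its flags to the sequence.
        for flags in product(*(dictionary[p] for p in parts)):
            sequence = "".join(flags)
            if any(p.fullmatch(sequence) for p in patterns):
                return True
    return False


WORDS = {"foo": "A", "bar": "B", "baz": "C"}  # hypothetical entries
print(matches_compound_rule("foofoobarbaz", WORDS, ["A*B?C"]))
print(matches_compound_rule("barbarbaz", WORDS, ["A*B?C"]))
```

Long and numerical flag types would need the parenthesized form
mentioned in Note II and a different tokenization of the rule.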
371
372
373 COMPOUNDMIN num
374 Minimum length of words used for compounding. Default value is
375 3 letters.
376
377 COMPOUNDFLAG flag
Words marked with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.
381
382 COMPOUNDBEGIN flag
383 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
384 first elements in compound words.
385
386 COMPOUNDLAST flag
387 Words signed with COMPOUNDLAST (or with a signed affix) may be
388 last elements in compound words.
389
390 COMPOUNDMIDDLE flag
391 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
392 middle elements in compound words.
393
394 ONLYINCOMPOUND flag
Suffixes marked with the ONLYINCOMPOUND flag may occur only inside
compounds (Fuge-elements in German, fogemorphemes in Swedish).
The ONLYINCOMPOUND flag also works with words (see
tests/onlyincompound.*). Note: it is also useful for flagging
compound parts that are not correct words by themselves.
400
401 COMPOUNDPERMITFLAG flag
402 Prefixes are allowed at the beginning of compounds, suffixes are
403 allowed at the end of compounds by default. Affixes with COM‐
404 POUNDPERMITFLAG may be inside of compounds.
405
406 COMPOUNDFORBIDFLAG flag
407 Suffixes with this flag forbid compounding of the affixed word.
408
409 COMPOUNDROOT flag
The COMPOUNDROOT flag marks compounds in the dictionary (currently
used only in the Hungarian language specific code).
412
413 COMPOUNDWORDMAX number
414 Set maximum word count in a compound word. (Default is unlim‐
415 ited.)
416
417 CHECKCOMPOUNDDUP
418 Forbid word duplication in compounds (e.g. foofoo).
419
420 CHECKCOMPOUNDREP
421 Forbid compounding, if the (usually bad) compound word may be a
422 non compound word with a REP fault. Useful for languages with
423 `compound friendly' orthography.
424
425 CHECKCOMPOUNDCASE
426 Forbid upper case characters at word boundaries in compounds.
427
428 CHECKCOMPOUNDTRIPLE
429 Forbid compounding, if compound word contains triple repeating
430 letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
431 ter support in UTF-8 encoding (works only for 7-bit ASCII char‐
432 acters).
433
434 SIMPLIFIEDTRIPLE
435 Allow simplified 2-letter forms of the compounds forbidden by
436 CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and
437 for the old German orthography: Schiff|fahrt -> Schiffahrt).
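
A small Python sketch of the two checks, assuming that "triple" means
three identical letters in a row that span the boundary between two
compound parts. The function names are hypothetical, and the sketch
works on Python strings, so it does not share the byte-level limitation
mentioned above.

```python
def has_triple_at_boundary(left, right):
    """True if joining the parts creates a triple letter at the boundary."""
    joined = left + right
    boundary = len(left)
    # Only windows that contain letters from both parts are relevant.
    for start in range(max(0, boundary - 2), boundary):
        window = joined[start:start + 3]
        if len(window) == 3 and len(set(window)) == 1:
            return True
    return False


def simplified_form(left, right):
    """Drop one letter of the triple, as SIMPLIFIEDTRIPLE allows."""
    if has_triple_at_boundary(left, right):
        return left + right[1:]
    return left + right


print(has_triple_at_boundary("foo", "ox"))
print(simplified_form("Schiff", "fahrt"))
```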
438
439 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
440
441 CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
442 Forbid compounding, if the first word in the compound ends with
443 endchars, and next word begins with beginchars and (optionally)
444 they have the requested flags. The optional replacement parame‐
445 ter allows simplified compound form.
446
447 The special "endchars" pattern 0 (zero) limits the rule to the
448 unmodified stems (stems and stems with zero affixes):
449
450 CHECKCOMPOUNDPATTERN 0/x /y
451
Note: COMPOUNDMIN doesn't work correctly with the compound word
alternation, so you may need to set COMPOUNDMIN to a lower value.
454
455 FORCEUCASE flag
A last word part of a compound with the FORCEUCASE flag forces
capitalization of the whole compound word. E.g. the Dutch word
"straat" (street) with the FORCEUCASE flag is allowed only in
capitalized compound forms, according to the Dutch spelling rules
for proper names.
461
462 COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables that a compound may
contain if it has more words than COMPOUNDWORDMAX. The second
parameter is the list of vowels (for counting syllables).
467
468 SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
470
472 PFX flag cross_product number
473
474 PFX flag stripping prefix [condition [morphological_fields...]]
475
476 SFX flag cross_product number
477
478 SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words
to make other words. We can define affix classes with an arbitrary
number of affix rules. Affix classes are identified by affix flags.
The first line of an affix class definition is the header. The
fields of an affix class header:
484
485 (0) Option name (PFX or SFX)
486
487 (1) Flag (name of the affix class)
488
489 (2) Cross product (permission to combine prefixes and suffixes).
490 Possible values: Y (yes) or N (no)
491
492 (3) Line count of the following rules.
493
Fields of an affix rule:
495
496 (0) Option name
497
498 (1) Flag
499
500 (2) stripping characters from beginning (at prefix rules) or end
501 (at suffix rules) of the word
502
503 (3) affix (optionally with flags of continuation classes, sepa‐
504 rated by a slash)
505
506 (4) condition.
507
Zero stripping or a zero affix is indicated by zero. A zero
condition is indicated by a dot. The condition is a simplified,
regular expression-like pattern, which must be met before the
affix can be applied. (A dot matches an arbitrary character.
Characters in brackets match an arbitrary character from the
character subset. A dash has no special meaning, but a circumflex
(^) right after the opening bracket selects the complement of the
character set.)
515
(5) Optional morphological fields separated by spaces or tabs.
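
The rule fields listed above map naturally onto a small record type.
The following Python sketch parses one affix rule line into such a
record; AffixRule and parse_affix_rule are hypothetical names, header
lines must be skipped by the caller, and continuation flags are kept as
a raw string because their interpretation depends on the FLAG setting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AffixRule:
    kind: str           # (0) option name: "PFX" or "SFX"
    flag: str           # (1) name of the affix class
    strip: str          # (2) characters to strip ("" for zero)
    affix: str          # (3) affix to add ("" for zero)
    continuation: str   # (3) flags after the slash, if any
    condition: str      # (4) condition ("." when omitted)
    morph: List[str]    # (5) optional morphological fields


def parse_affix_rule(line):
    """Parse a rule line (not a class header) into an AffixRule."""
    fields = line.split()
    if len(fields) < 4 or fields[0] not in ("PFX", "SFX"):
        raise ValueError(f"not an affix rule: {line!r}")
    kind, flag, strip, affix = fields[:4]
    affix, _, continuation = affix.partition("/")
    return AffixRule(
        kind=kind,
        flag=flag,
        strip="" if strip == "0" else strip,
        affix="" if affix == "0" else affix,
        continuation=continuation,
        condition=fields[4] if len(fields) > 4 else ".",
        morph=fields[5:],
    )


print(parse_affix_rule("SFX X 0 able/Y . ds:able"))
```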
518
519
521 CIRCUMFIX flag
522 Affixes signed with CIRCUMFIX flag may be on a word when this
523 word also has a prefix with CIRCUMFIX flag and vice versa (see
524 circumfix.* test files in the source distribution).
525
526 FORBIDDENWORD flag
This flag marks a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of accepted
affixed and compound words. Note: useful for forbidding erroneous
words generated by the compounding mechanism.
531
532 FULLSTRIP
With FULLSTRIP, affix rules can strip whole words, not just all
but one of their characters, before adding the affixes (see the
fullstrip.* test files in the source distribution). Note:
conditions may be as long as the word even without FULLSTRIP.
537
538 KEEPCASE flag
539 Forbid uppercased and capitalized forms of words signed with
540 KEEPCASE flags. Useful for special orthographies (measurements
541 and currency often keep their case in uppercased texts) and
542 writing systems (e.g. keeping lower case of IPA characters).
543 Also valuable for words erroneously written in the wrong case.
544
545 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
546 CASE flag may be capitalized and uppercased, but uppercased
547 forms of these words may not contain sharp s, only SS. See ger‐
548 mancompounding example in the tests directory of the Hunspell
549 distribution.
550
551
552 ICONV number_of_ICONV_definitions
553
554 ICONV pattern pattern2
555 Define input conversion table. Note: useful to convert one type
556 of quote to another one, or change ligature.
557
558 OCONV number_of_OCONV_definitions
559
560 OCONV pattern pattern2
561 Define output conversion table.
562
563 LEMMA_PRESENT flag
564 Deprecated. Use "st:" field instead of LEMMA_PRESENT.
565
566 NEEDAFFIX flag
This flag marks virtual stems in the dictionary: words that are
valid only when affixed, unless the dictionary word has a homonym
or a zero affix. NEEDAFFIX also works with prefixes and
prefix + suffix combinations (see tests/pseudoroot5.*).
571
572 PSEUDOROOT flag
573 Deprecated. (Former name of the NEEDAFFIX option.)
574
575 SUBSTANDARD flag
576 SUBSTANDARD flag signs affix rules and dictionary words (allo‐
577 morphs) not used in morphological generation (and in suggestion
578 in the future versions). See also NOSUGGEST.
579
580 WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot, dash,
n-dash, digits and percent sign are word characters in Hungarian.
584
585 CHECKSHARPS
The SS letter pair in uppercased (German) words may stand for an
uppercased sharp s (ß). Hunspell can handle this special casing in
both spelling and suggestion with the CHECKSHARPS declaration (see
also the KEEPCASE flag and the tests/germancompounding example).
590
591
Hunspell's dictionary items and affix rules may have optional
morphological description fields, separated by spaces or tabs and
starting with 3-character (two letters and a colon) field IDs:
596
597
598 word/flags po:noun is:nom
599
Example: We define a simple resource with morphological information, a
derivational suffix (ds:) and a part of speech category (po:):
602
603 Affix file:
604
605
606 SFX X Y 1
607 SFX X 0 able . ds:able
608
609 Dictionary file:
610
611
612 drink/X po:verb
613
614 Test file:
615
616
617 drink
618 drinkable
619
620 Test:
621
622
623 $ analyze test.aff test.dic test.txt
624 > drink
625 analyze(drink) = po:verb
626 stem(drink) = po:verb
627 > drinkable
628 analyze(drinkable) = po:verb ds:able
629 stem(drinkable) = drinkable
630
As the example shows, the analyzer concatenates the morphological
fields in item-and-arrangement style.
633
634
636 Default morphological and other IDs (used in suggestion, stemming and
637 morphological generation):
638
639 ph: Alternative transliteration for better suggestion. It's useful
640 for words with foreign pronunciation. (Dictionary based phonetic
641 suggestion.) For example:
642
643
644 Marseille ph:maarsayl
645
st: Stem. Optional: the default stem in morphological analysis is
the dictionary item. The stem field is useful for virtual stems
(dictionary words with the NEEDAFFIX flag) and for morphological
exceptions, instead of new morphological rules that are used only
once.
650
651 feet st:foot is:plural
652 mice st:mouse is:plural
653 teeth st:tooth is:plural
654
655 Word forms with multiple stems need multiple dictionary items:
656
657
658 lay po:verb st:lie is:past_2
659 lay po:verb is:present
660 lay po:noun
661
662 al: Allomorph(s). A dictionary item is the stem of its allomorphs.
663 Morphological generation needs stem, allomorph and affix fields.
664
665 sing al:sang al:sung
666 sang st:sing
667 sung st:sing
668
669 po: Part of speech category.
670
671 ds: Derivational suffix(es). Stemming doesn't remove derivational
672 suffixes. Morphological generation depends on the order of the
673 suffix fields.
674
675 In affix rules:
676
677
678 SFX Y Y 1
679 SFX Y 0 ly . ds:ly_adj
680
681 In the dictionary:
682
683
684 ably st:able ds:ly_adj
685 able al:ably
686
687 is: Inflectional suffix(es). All inflectional suffixes are removed
688 by stemming. Morphological generation depends on the order of
689 the suffix fields.
690
691
692 feet st:foot is:plural
693
694 ts: Terminal suffix(es). Terminal suffix fields are inflectional
695 suffix fields "removed" by additional (not terminal) suffixes.
696
697 Useful for zero morphemes and affixes removed by splitting
698 rules.
699
700
701 work/D ts:present
702
703 SFX D Y 2
704 SFX D 0 ed . is:past_1
705 SFX D 0 ed . is:past_2
706
707 Typical example of the terminal suffix is the zero morpheme of the nom‐
708 inative case.
709
710
711 sp: Surface prefix. Temporary solution for adding prefixes to the
712 stems and generated word forms. See tests/morph.* example.
713
714
715 pa: Parts of the compound words. Output fields of morphological
716 analysis for stemming.
717
718 dp: Planned: derivational prefix.
719
720 ip: Planned: inflectional prefix.
721
722 tp: Planned: terminal prefix.
723
724
Ispell's original algorithm strips only one suffix. Hunspell can strip
a second one as well (or one additional prefix in COMPLEXPREFIXES
mode).

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterizes agglutinative languages.
732
733 A second `s' suffix (affix class Y) will be the continuation class of
734 the suffix `able' in the following example:
735
736
737 SFX Y Y 1
738 SFX Y 0 s .
739
740 SFX X Y 1
741 SFX X 0 able/Y .
742
743 Dictionary file:
744
745
746 drink/X
747
748 Test file:
749
750
751 drink
752 drinkable
753 drinkables
754
755 Test:
756
757
758 $ hunspell -m -d test <test.txt
759 drink st:drink
760 drinkable st:drink fl:X
761 drinkables st:drink fl:X fl:Y
762
Theoretically, twofold suffix stripping needs only the square root of
the number of suffix rules that an implementation stripping a single
suffix would need. In practice, twofold suffix stripping was sufficient
for us to describe the Hungarian inflectional morphology.
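
The following Python sketch shows the stripping order for the
drink/drinkable/drinkables example: first the outer suffix is removed,
then, if the remainder is not a stem carrying that flag, an inner suffix
whose continuation class licenses the outer one. It is a simplified
model, not Hunspell's code: the names are hypothetical and conditions
are omitted because every rule in the example uses the dot.

```python
# flag -> list of (strip, add, continuation flags)
SUFFIXES = {
    "Y": [("", "s", "")],
    "X": [("", "able", "Y")],
}
DICTIONARY = {"drink": "X"}


def strip_suffix(word, strip, add):
    """Undo a suffix rule: remove `add`, put back `strip`."""
    if add and not word.endswith(add):
        return None
    return word[:len(word) - len(add)] + strip


def analyze(word):
    """Return (stem, [flags used]) or None for an unknown word."""
    if word in DICTIONARY:
        return word, []
    for outer_flag, outer_rules in SUFFIXES.items():
        for strip, add, _ in outer_rules:
            base = strip_suffix(word, strip, add)
            if base is None:
                continue
            # One suffix: the stem itself must carry the flag.
            if outer_flag in DICTIONARY.get(base, ""):
                return base, [outer_flag]
            # Two suffixes: the inner rule must list the outer flag
            # as a continuation class.
            for inner_flag, inner_rules in SUFFIXES.items():
                for istrip, iadd, continuation in inner_rules:
                    if outer_flag not in continuation:
                        continue
                    stem = strip_suffix(base, istrip, iadd)
                    if stem is not None and \
                            inner_flag in DICTIONARY.get(stem, ""):
                        return stem, [inner_flag, outer_flag]
    return None


for w in ["drink", "drinkable", "drinkables", "drinks"]:
    print(w, analyze(w))
```

Note that "drinks" is rejected here: class Y is reachable only through
the continuation class of `able', because the stem carries flag X only.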
767
768
Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for specifying flags in affix and dictionary files.
772
The FLAG long command sets 2-character flags:
774
775
776 FLAG long
777 SFX Y1 Y 1
778 SFX Y1 0 s 1
779
780 Dictionary record with the Y1, Z3, F? flags:
781
782
783 foo/Y1Z3F?
784
The FLAG num command sets numerical flags separated by commas:
786
787
788 FLAG num
789 SFX 65000 Y 1
790 SFX 65000 0 s 1
791
792 Dictionary example:
793
794
795 foo/65000,12,2756
796
The third syntax uses Unicode character flags (FLAG UTF-8).
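
As an illustration of the flag syntaxes, this Python sketch splits the
flag field of a dictionary entry according to the flag type. The
function name parse_flags and the type name "char" for the default are
assumptions of the sketch. Python strings are already Unicode, so the
default and the UTF-8 case look the same here.

```python
def parse_flags(field, flag_type="char"):
    """Split a flag field into individual flags.

    flag_type: "char" (default, one character per flag), "UTF-8",
    "long" (two characters per flag) or "num" (comma-separated numbers).
    """
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        return [int(number) for number in field.split(",") if number]
    return list(field)


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
print(parse_flags("AB"))
```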
798
799
801 Hunspell's dictionary can contain repeating elements that are homonyms:
802
803
804 work/A po:verb
805 work/B po:noun
806
807 An affix file:
808
809
SFX A Y 1
SFX A 0 s . is:sg3

SFX B Y 1
SFX B 0 s . is:plur
815
816 Test file:
817
818
819 works
820
821 Test:
822
823
824 $ hunspell -d test -m <testwords
825 work st:work po:verb is:sg3
826 work st:work po:noun is:plur
827
828 This feature also gives a way to forbid illegal prefix/suffix combina‐
829 tions.
830
831
An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to make the lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier HunSpell code. This
may have been legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay, or drinkable,
undrinkable but *undrink). Simply put, the prefix un- is legitimate
here only if the base drink is suffixed with -able. If both these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.
850
In the next example, suffix class R has a prefix `continuation' class
(class P).
853
854
855 PFX P Y 1
856 PFX P 0 un . [prefix_un]+
857
858 SFX S Y 1
859 SFX S 0 s . +PL
860
861 SFX Q Y 1
862 SFX Q 0 s . +3SGV
863
864 SFX R Y 1
865 SFX R 0 able/PS . +DER_V_ADJ_ABLE
866
867 Dictionary:
868
869
870 2
871 drink/RQ [verb]
872 drink/S [noun]
873
874 Morphological analysis:
875
876
877 > drink
878 drink[verb]
879 drink[noun]
880 > drinks
881 drink[verb]+3SGV
882 drink[noun]+PL
883 > drinkable
884 drink[verb]+DER_V_ADJ_ABLE
885 > drinkables
886 drink[verb]+DER_V_ADJ_ABLE+PL
887 > undrinkable
888 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
889 > undrinkables
890 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
891 > undrink
892 Unknown word.
893 > undrinks
894 Unknown word.
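
The dependency in this example can be modeled with a short Python
sketch in which a prefix is accepted only if its flag is licensed,
either by the stem or by the continuation class of the suffix that was
applied. This licensing rule is a simplification assumed for the
illustration, and the names ENTRIES, licensing_flags and accepts are
hypothetical; morphological output is not produced.

```python
# Homonymous dictionary entries: (stem, flags).
ENTRIES = [("drink", {"R", "Q"}), ("drink", {"S"})]
PREFIXES = {"P": "un"}
# flag -> (suffix, continuation flags)
SUFFIXES = {
    "S": ("s", set()),
    "Q": ("s", set()),
    "R": ("able", {"P", "S"}),
}


def licensing_flags(form):
    """Yield, per analysis of an unprefixed form, the flags it licenses."""
    for stem, flags in ENTRIES:
        if form == stem:
            yield flags
        for first in flags:
            if first not in SUFFIXES:
                continue
            add1, continuation = SUFFIXES[first]
            if form == stem + add1:
                yield flags | continuation
            for second in continuation:
                if second in SUFFIXES and \
                        form == stem + add1 + SUFFIXES[second][0]:
                    yield flags | continuation


def accepts(word):
    if any(True for _ in licensing_flags(word)):
        return True
    for prefix_flag, prefix in PREFIXES.items():
        if word.startswith(prefix):
            rest = word[len(prefix):]
            if any(prefix_flag in flags for flags in licensing_flags(rest)):
                return True
    return False


for w in ["drinks", "drinkables", "undrinkable", "undrink", "undrinks"]:
    print(w, accepts(w))
```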
895
Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is one affix in morphology. We
also need the CIRCUMFIX option for correct morphological analysis.
900
901
902 # circumfixes: ~ obligate prefix/suffix combinations
903 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
904 # nagy, nagyobb, legnagyobb, legeslegnagyobb
905 # (great, greater, greatest, most greatest)
906
907 CIRCUMFIX X
908
909 PFX A Y 1
910 PFX A 0 leg/X .
911
912 PFX B Y 1
913 PFX B 0 legesleg/X .
914
915 SFX C Y 3
916 SFX C 0 obb . +COMPARATIVE
917 SFX C 0 obb/AX . +SUPERLATIVE
918 SFX C 0 obb/BX . +SUPERSUPERLATIVE
919
920 Dictionary:
921
922
923 1
924 nagy/C [MN]
925
926 Analysis:
927
928
929 > nagy
930 nagy[MN]
931 > nagyobb
932 nagy[MN]+COMPARATIVE
933 > legnagyobb
934 nagy[MN]+SUPERLATIVE
935 > legeslegnagyobb
936 nagy[MN]+SUPERSUPERLATIVE
937
Allowing free compounding decreases the precision of recognition, not
to mention stemming and morphological analysis. Although lexical
switches were introduced by Ispell to license compounding of bases,
this proves not to be restrictive enough. For example:
943
944
945 # affix file
946 COMPOUNDFLAG X
947
948 2
949 foo/X
950 bar/X
951
With this resource, foobar and barfoo are also accepted words.
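
A minimal Python sketch of this kind of free compounding: a word is
accepted as a compound if it can be split into two or more dictionary
words that all carry the compound flag and are at least COMPOUNDMIN
letters long. The function name is_compound and the extra unflagged
entry "baz" are assumptions of the sketch.

```python
def is_compound(word, dictionary, compound_flag, compound_min=3, parts=0):
    """True if the word splits into at least two flagged dictionary words."""
    if not word:
        return parts >= 2
    for end in range(compound_min, len(word) + 1):
        head = word[:end]
        if compound_flag in dictionary.get(head, "") and is_compound(
                word[end:], dictionary, compound_flag, compound_min,
                parts + 1):
            return True
    return False


WORDS = {"foo": "X", "bar": "X", "baz": ""}
print(is_compound("foobar", WORDS, "X"))
print(is_compound("barfoo", WORDS, "X"))
print(is_compound("foobaz", WORDS, "X"))
```

Nothing in this model distinguishes the first part from the last one,
which is exactly the lack of restrictiveness discussed here.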
953
954 This has been improved upon with the introduction of direction-sensi‐
955 tive compounding, i.e., lexical features can specify separately whether
956 a base can occur as leftmost or rightmost constituent in compounds.
957 This, however, is still insufficient to handle the intricate patterns
958 of compounding, not to mention idiosyncratic (and language specific)
959 norms of hyphenation.
960
The MySpell algorithm allows any affixed form of words that are
lexically marked as potential members of compounds. Hunspell improved
on this, and its recursive compound checking rules make it possible to
implement the intricate spelling conventions of Hungarian compounds.
For example, the noteworthy Hungarian `6-3' rule can be set using the
COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
options. As a further example, in Hungarian, derivational suffixes
often modify compounding properties. Hunspell allows compounding flags
on affixes, and there are two special flags (COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
derivations.
971
974 We also need several Hunspell features for handling German compounding:
975
976
# German compounding

# set language to handle special casing of German sharp s

LANG de_DE

# compound flags

COMPOUNDBEGIN U
COMPOUNDMIDDLE V
COMPOUNDEND W

# Prefixes are allowed at the beginning of compounds,
# suffixes are allowed at the end of compounds by default:
# (prefix)?(root)+(affix)?
# Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
COMPOUNDPERMITFLAG P

# for German fugemorphemes (Fuge-elements)
# Hint: ONLYINCOMPOUND is not required everywhere, but the
# checking will be a little faster with it.

ONLYINCOMPOUND X

# forbid uppercase characters at compound word bounds
CHECKCOMPOUNDCASE

# for handling Fuge-elements with dashes (Arbeits-)
# dash will be a special word

COMPOUNDMIN 1
WORDCHARS -

# compound settings and fugemorpheme for `Arbeit'

SFX A Y 3
SFX A 0 s/UPX .
SFX A 0 s/VPDX .
SFX A 0 0/WXD .

SFX B Y 2
SFX B 0 0/UPX .
SFX B 0 0/VWXDP .

# a suffix for `Computer'

SFX C Y 1
SFX C 0 n/WD .

# forbid exceptions (*Arbeitsnehmer)

FORBIDDENWORD Z

# dash prefix for compounds with dash (Arbeits-Computer)

PFX - Y 1
PFX - 0 -/P .

# decapitalizing prefix
# circumfix for positioning in compounds

PFX D Y 29
PFX D A a/PX A
PFX D Ä ä/PX Ä
.
.
PFX D Y y/PX Y
PFX D Z z/PX Z
1045
Example dictionary:


4
Arbeit/A-
Computer/BC-
-/W
Arbeitsnehmer/Z
1054
Accepted compound words with the previous resources:


Computer
Computern
Arbeit
Arbeits-
Computerarbeit
Computerarbeits-
Arbeitscomputer
Arbeitscomputern
Computerarbeitscomputer
Computerarbeitscomputern
Arbeitscomputerarbeit
Computerarbeits-Computer
Computerarbeits-Computern
1071
Words and compounds that are not accepted:


computer
arbeit
Arbeits
arbeits
ComputerArbeit
ComputerArbeits
Arbeitcomputer
ArbeitsComputer
Computerarbeitcomputer
ComputerArbeitcomputer
ComputerArbeitscomputer
Arbeitscomputerarbeits
Computerarbeits-computer
Arbeitsnehmer
1089
This solution is still not ideal, however, and will be replaced by a
pattern-based compound-checking algorithm that is closely integrated
with input buffer tokenization. Patterns describing compounds come as a
separate input resource that can refer to high-level properties of
constituent parts (e.g. the number of syllables, affix flags, and the
presence of hyphens). The patterns are matched against potential
segmentations of compounds to assess well-formedness.
1097
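As a rough cross-check of the German example above, the listed accepted
and rejected forms can be reproduced with a hand-written regular
expression. This is only an illustrative sketch: the pattern and the
names BEGIN, MIDDLE, END, AFTER_DASH and is_accepted are hypothetical,
cover just `Arbeit' and `Computer', and say nothing about how Hunspell
checks compounds internally or about words outside the two lists.

```python
import re

# Hypothetical position classes, loosely mirroring the example flags
# U (COMPOUNDBEGIN), V (COMPOUNDMIDDLE) and W (COMPOUNDEND).
BEGIN = r"(?:Arbeits|Computer)"         # capitalised first member
MIDDLE = r"(?:arbeits|computer)"        # decapitalised inner member
END = r"(?:arbeit|computern?)"          # decapitalised last member
AFTER_DASH = r"(?:Arbeit|Computern?)"   # capitalised member after a dash

PATTERN = re.compile(
    rf"(?:Arbeit|Computern?"                          # simple words
    rf"|{BEGIN}{MIDDLE}*(?:{END}|-{AFTER_DASH}?))"    # compounds
)


def is_accepted(word: str) -> bool:
    """Return True if the toy pattern matches the whole word."""
    return PATTERN.fullmatch(word) is not None


# The two lists from the example, copied verbatim.
ACCEPTED = [
    "Computer", "Computern", "Arbeit", "Arbeits-", "Computerarbeit",
    "Computerarbeits-", "Arbeitscomputer", "Arbeitscomputern",
    "Computerarbeitscomputer", "Computerarbeitscomputern",
    "Arbeitscomputerarbeit", "Computerarbeits-Computer",
    "Computerarbeits-Computern",
]
REJECTED = [
    "computer", "arbeit", "Arbeits", "arbeits", "ComputerArbeit",
    "ComputerArbeits", "Arbeitcomputer", "ArbeitsComputer",
    "Computerarbeitcomputer", "ComputerArbeitcomputer",
    "ComputerArbeitscomputer", "Arbeitscomputerarbeits",
    "Computerarbeits-computer", "Arbeitsnehmer",
]

# Collect any word on which the toy pattern disagrees with the lists.
mismatches = [w for w in ACCEPTED if not is_accepted(w)]
mismatches += [w for w in REJECTED if is_accepted(w)]
```

The sketch makes the positional idea visible: the linking `s' only
appears on non-final members, and capitalisation is allowed only at the
start of the word or directly after a dash.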
1098
Both Ispell and MySpell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a language
like Hungarian has a standard 8-bit character set (ISO 8859-2), that
set does not allow a full implementation of Hungarian orthographic
conventions. For instance, the en dash (`–') is missing from this
character set, even though it is the official symbol for delimiting
parenthetic clauses in the language and can also appear in compound
words as a special `big' hyphen.
1108
MySpell has some 8-bit encoding tables, but there are also languages
without a standard 8-bit encoding. For example, many African
languages use non-Latin or extended Latin characters.
1112
Similarly, the Hungarian spelling norm encourages using the original
spelling of certain foreign names such as Ångström or Molière. Since
the characters `Å' and `è' are not part of ISO 8859-2, combining these
names with inflections containing characters only in ISO 8859-2 (like
elative -ből, ablative -től or delative -ről, with double acute)
results in words (like Ångströmről or Molière-től) that cannot be
encoded using any single 8-bit encoding scheme.
1120
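The encoding problem described above can be checked directly. The
following is a minimal sketch using Python's standard codecs; the helper
name encodable is hypothetical, and ISO 8859-1 is included only as an
assumed second 8-bit candidate for comparison.

```python
# Ångströmről and Molière-től, written with escapes so the example does
# not depend on the encoding of this file.
WORDS = ["\u00c5ngstr\u00f6mr\u0151l", "Moli\u00e8re-t\u0151l"]


def encodable(word: str, encoding: str) -> bool:
    """Return True if every character of word exists in encoding."""
    try:
        word.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Map (word, encoding) to whether the word can be represented.
RESULTS = {
    (word, enc): encodable(word, enc)
    for word in WORDS
    for enc in ("iso8859_1", "iso8859_2", "utf-8")
}

for (word, enc), ok in RESULTS.items():
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(word), enc, "ok" if ok else "not encodable")
```
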
The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding independence has its advantages when it comes
to a truly multilingual application. Hunspell implements memory- and
time-efficient Unicode handling. With non-UTF-8 character encodings,
Hunspell works with the original 8-bit strings. With UTF-8 encoding,
affixes and words are stored in UTF-8 and are handled mostly in UTF-8
during analysis; for condition checking and suggestion they are
converted to UTF-16. Unicode text analysis and spell checking have a
minimal (0-20%) time overhead and a minimal or reasonable memory
overhead, depending on the language (its UTF-8 encoding and
affixation).
1132
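How the memory overhead depends on a language's UTF-8 encoding can be
illustrated by comparing encoded sizes. This is a small sketch, not a
measurement of Hunspell itself; the function name encoded_sizes is
hypothetical, and the byte counts describe only the encoded strings.

```python
WORD = "\u00c5ngstr\u00f6mr\u0151l"  # Ångströmről


def encoded_sizes(word: str) -> dict:
    """Return the character count and the encoded sizes in bytes."""
    return {
        "characters": len(word),
        "utf-8": len(word.encode("utf-8")),
        # utf-16-le avoids counting a byte order mark.
        "utf-16": len(word.encode("utf-16-le")),
    }


SIZES = encoded_sizes(WORD)
print(SIZES)
```

Each accented letter in this word takes two bytes in UTF-8, so a word
with many such letters grows more than a mostly unaccented one.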
1133
Aspell dictionaries can easily be converted to Hunspell format.
Conversion steps:

dictionary (xx.cwl -> xx.wl):

    preunzip xx.cwl
    wc -l < xx.wl > xx.dic
    cat xx.wl >> xx.dic

affix file:

If the affix file exists, copy it:
    cp xx_affix.dat xx.aff
If not, create it with the suitable character encoding (see xx.dat):
    echo "SET ISO8859-x" > xx.aff
or
    echo "SET UTF-8" > xx.aff

It is useful to add a TRY option listing the characters of the
dictionary in order of frequency, which guides the edit-distance
suggestions:
    echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
1156
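The TRY line can also be derived from the word list instead of being
typed by hand. The following is a minimal sketch under stated
assumptions: the function names build_dic and try_line are hypothetical,
the word list is already unpacked into memory, and anything after the
first slash on a line is treated as flags and ignored when counting.

```python
from collections import Counter


def build_dic(words):
    """Return .dic content: the word count, then one word per line."""
    return f"{len(words)}\n" + "".join(w + "\n" for w in words)


def try_line(words):
    """Return a TRY line with characters ordered by falling frequency."""
    # Count only the word part, not the flags after the slash.
    counts = Counter(ch for w in words for ch in w.split("/", 1)[0])
    # Ties are broken by code point so the result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "TRY " + "".join(ordered)


# Tiny made-up word list, only to show the shape of the output.
SAMPLE = ["aab/S", "ab", "a", "c"]
DIC_TEXT = build_dic(SAMPLE)
TRY_TEXT = try_line(SAMPLE)
```

The resulting text could be written to xx.dic and appended to xx.aff in
the same way as the shell commands above.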
1157
hunspell(1), ispell(1), ispell(4)
1160
1161
1162
1163
1164 2011-02-16 hunspell(4)