1hunspell(4) Kernel Interfaces Manual hunspell(4)
2
3
4
6 hunspell - format of Hunspell dictionaries and affix files
7
9 Hunspell(1) requires two files to define the language that it is spell
10 checking. The first file is a dictionary containing words for the lan‐
11 guage, and the second is an "affix" file that defines the meaning of
12 special flags in the dictionary.
13
A dictionary file (*.dic) contains a list of words, one per line. The
first line of the dictionaries (except personal dictionaries) contains
the approximate word count (for optimal hash memory size). Each word
may optionally be followed by a slash ("/") and one or more flags,
which represent affixes or special attributes. Dictionary words can
also contain slashes with the "\/" syntax. The default flag format is a
single (usually alphabetic) character. After the dictionary words there
are also optional fields separated by tabulators or spaces (spaces only
work as morphological field separators if they are followed by
morphological field ids, see also Optional data fields).
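
The layout of a dictionary entry can be illustrated with a small Python
sketch. It is not part of Hunspell; the function name parse_dic_line is
hypothetical, and the sketch assumes that morphological field IDs are two
lowercase letters followed by a colon and that "\/" stands for a literal
slash inside the word.

```python
import re

# Fields are separated by tabs, or by spaces that are followed by a
# morphological field ID such as "po:".
FIELD_SEPARATOR = re.compile(r"\t+|[ \t]+(?=[a-z][a-z]:)")

def parse_dic_line(line):
    """Split one .dic entry into (word, flags, morphological fields)."""
    head, *morph = FIELD_SEPARATOR.split(line.strip())
    # The first unescaped slash separates the word from its flags.
    parts = re.split(r"(?<!\\)/", head, maxsplit=1)
    word = parts[0].replace("\\/", "/")
    flags = parts[1] if len(parts) > 1 else ""
    return word, flags, morph
```

For example, parse_dic_line("word/flags po:noun is:nom") separates the
word, the flag string and the two morphological fields.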
24
Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word separated by a
slash sets the affixation.
28
29
30 foo
31 Foo/Simpson
32 *bar
33
34 In this example, "foo" and "Foo" are personal words, plus Foo will be
35 recognized with affixes of Simpson (Foo's etc.) and bar is a forbidden
36 word.
37
An affix file (*.aff) may contain a lot of optional attributes. For
example, SET is used for setting the character encoding of affix and
dictionary files. TRY sets the change characters for suggestions. REP
sets a replacement table for multiple character corrections in
suggestion mode. PFX and SFX define prefix and suffix classes named with
affix flags.
44
The following affix file example defines UTF-8 character encoding.
`TRY' suggestions differ from the bad word by one English letter or an
apostrophe. With these REP definitions, Hunspell can suggest the right
word form when the misspelled word contains f instead of ph and vice
versa.
50
51
52 SET UTF-8
53 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
54
55 REP 2
56 REP f ph
57 REP ph f
58
59 PFX A Y 1
60 PFX A 0 re .
61
62 SFX B Y 2
63 SFX B 0 ed [^y]
64 SFX B y ied y
65
There are two affix classes in the affix file. Class A defines a `re-'
prefix. Class B defines two `-ed' suffixes. The first suffix can be added
to a word if the last character of the word isn't `y'. The second suffix
can be added to words ending in a `y'. (See later.) The
following dictionary file uses these affix classes.
71
72
73 3
74 hello
75 try/B
76 work/AB
77
78 All accepted words with this dictionary: "hello", "try", "tried",
79 "work", "worked", "rework", "reworked".
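
How the two example files combine can be sketched in Python. This is an
illustration only, not Hunspell's algorithm: the names RULES, DICTIONARY,
apply_rule and expand are hypothetical, flags are assumed to be single
characters, and every class is assumed to allow the cross product (Y).

```python
import re

# Affix rules from the example: (kind, flag, strip, affix, condition)
RULES = [
    ("PFX", "A", "", "re", "."),
    ("SFX", "B", "", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}

def apply_rule(word, rule):
    """Return the affixed form, or None if the rule does not apply."""
    kind, _flag, strip, affix, cond = rule
    if kind == "SFX":
        # Suffix conditions are matched against the end of the word.
        if not re.search(f"(?:{cond})$", word) or not word.endswith(strip):
            return None
        return word[: len(word) - len(strip)] + affix
    # Prefix conditions are matched against the start of the word.
    if not re.match(cond, word) or not word.startswith(strip):
        return None
    return affix + word[len(strip):]

def expand(dictionary, rules):
    """Return every word form accepted by the dictionary and rules."""
    accepted = set()
    for word, flags in dictionary.items():
        mine = [r for r in rules if r[1] in flags]
        suffixed = {word} | {apply_rule(word, r) for r in mine if r[0] == "SFX"}
        suffixed.discard(None)
        accepted |= suffixed
        # Cross product: prefixes combine with the bare and suffixed forms.
        for r in mine:
            if r[0] == "PFX":
                accepted |= {apply_rule(w, r) for w in suffixed}
    accepted.discard(None)
    return accepted
```

Calling expand(DICTIONARY, RULES) yields the seven word forms listed
above.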
80
81
The Hunspell source distribution contains more than 80 examples of
option usage.
85
86
87 SET encoding
88 Set character encoding of words and morphemes in affix and dic‐
89 tionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
90 ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251,
91 ISCII-DEVANAGARI.
92
93 FLAG value
Set flag type. Default type is the extended ASCII (8-bit) character.
The `UTF-8' parameter sets UTF-8 encoded Unicode character flags. The
`long' value sets the double extended ASCII character flag type, the
`num' value sets the decimal number flag type. Decimal flags are
numbered from 1 to 65000 and are separated by commas in flag fields.
BUG: UTF-8 flag type doesn't work on ARM platform.
101
102 COMPLEXPREFIXES
103 Set twofold prefix stripping (but single suffix stripping) for
104 agglutinative languages with right-to-left writing system.
105
106 LANG langcode
Set language code. Hunspell may enable language-specific code based on
the LANG code. At present there is az_AZ, hu_HU and tr_TR specific code
in Hunspell (see the source code).
110
111 IGNORE characters
Ignore characters from dictionary words, affixes and input words.
Useful for optional characters, such as Arabic diacritical marks
(Harakat).
115
116 AF number_of_flag_vector_aliases
117
118 AF flag_vector
119 Hunspell can substitute affix flag sets with ordinal numbers in
120 affix rules (alias compression, see makealias tool). First exam‐
121 ple with alias compression:
122
123 3
124 hello
125 try/1
126 work/2
127
128 AF definitions in the affix file:
129
130 SET UTF-8
131 TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
132 AF 2
133 AF A
134 AF AB
135
It is equivalent to the following dic file:
137
138 3
139 hello
140 try/A
141 work/AB
142
143 See also tests/alias* examples of the source distribution.
144
Note: If the affix file contains the FLAG parameter, define it before
the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.
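
The alias lookup in the example above can be sketched in Python. The
function name resolve_aliases is hypothetical; the sketch only assumes,
as the example shows, that aliases are numbered from 1 in the order of
the AF lines.

```python
def resolve_aliases(dic_lines, af_table):
    """Replace numeric flag aliases in .dic entries with their flag sets."""
    resolved = []
    for line in dic_lines:
        word, sep, alias = line.partition("/")
        if sep and alias.isdigit():
            # AF aliases are numbered from 1 in definition order.
            line = f"{word}/{af_table[int(alias) - 1]}"
        resolved.append(line)
    return resolved

# AF table from the example: alias 1 is "A", alias 2 is "AB".
print(resolve_aliases(["hello", "try/1", "work/2"], ["A", "AB"]))
```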
150
151 AM number_of_morphological_aliases
152
153 AM morphological_fields
154 Hunspell can substitute also morphological data with ordinal
155 numbers in affix rules (alias compression). See tests/alias*
156 examples.
157
Suggestion parameters can optimize the default n-gram, character swap
and deletion suggestions of Hunspell. REP is recommended for fixing
typical and especially bad language-specific misspellings, because the
REP suggestions have the highest priority in the suggestion list. PHONE
is for languages whose orthography is not based on pronunciation.
164
165 KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words with one different character
replaced by a neighbor KEY character. Characters that are not neighbors
are separated in the KEY string by vertical line characters. Suggested
KEY parameters for QWERTY and Dvorak keyboard layouts:
170
171 KEY qwertyuiop|asdfghjkl|zxcvbnm
172 KEY pyfgcrl|aeouidhtns|qjkxbmwvz
173
174 Using the first QWERTY layout, Hunspell suggests "nude" and "node" for
175 "*nide". A character may have more neighbors, too:
176
177 KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
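
The neighbor substitution described for KEY can be sketched in Python.
This is a simplified model rather than Hunspell's code: the names
key_neighbor_candidates, suggest and WORDS are hypothetical, and the word
list is invented for the illustration.

```python
def key_neighbor_candidates(word, key):
    """Yield variants of word with one character replaced by a KEY neighbor."""
    groups = key.split("|")
    for i, ch in enumerate(word):
        for group in groups:
            for j, key_ch in enumerate(group):
                if key_ch != ch:
                    continue
                # Neighbors are the characters directly left and right of
                # the match inside the same group.
                for k in (j - 1, j + 1):
                    if 0 <= k < len(group):
                        yield word[:i] + group[k] + word[i + 1:]

QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
WORDS = {"nude", "node", "nice"}  # hypothetical word list

def suggest(word):
    """Keep only the candidates that are known words."""
    return [c for c in key_neighbor_candidates(word, QWERTY) if c in WORDS]
```

With this word list, suggest("nide") returns "nude" and "node", because
"u" and "o" are the neighbors of "i" in the first group.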
178
179 TRY characters
180 Hunspell can suggest right word forms, when they differ from the
181 bad input word by one TRY character. The parameter of TRY is
182 case sensitive.
183
184 NOSUGGEST flag
185 Words signed with NOSUGGEST flag are not suggested. Proposed
186 flag for vulgar and obscene words (see also SUBSTANDARD).
187
188 MAXNGRAMSUGS num
189 Set number of n-gram suggestions. Value 0 switches off the n-
190 gram suggestions.
191
192 NOSPLITSUGS
193 Disable split-word suggestions.
194
195 SUGSWITHDOTS
196 Add dot(s) to suggestions, if input word terminates in dot(s).
197 (Not for OpenOffice.org dictionaries, because OpenOffice.org has
198 an automatic dot expansion mechanism.)
199
200 REP number_of_replacement_definitions
201
202 REP what replacement
We can define language-dependent phonetic information in the affix file
(.aff) by a replacement table. The first REP line is the header of this
table, and one or more REP data lines follow it. With this table,
Hunspell can suggest the right forms for typical spelling mistakes when
the incorrect form differs by more than 1 letter from the right form.
For example, a possible English replacement table definition to handle
misspelled consonants:
211
212 REP 8
213 REP f ph
214 REP ph f
215 REP f gh
216 REP gh f
217 REP j dg
218 REP dg j
219 REP k ch
220 REP ch k
221
222 Note I: It's very useful to define replacements for the most typical
223 one-character mistakes, too: with REP you can add higher priority to a
224 subset of the TRY suggestions (suggestion list begins with the REP sug‐
225 gestions).
226
Note II: To suggest separate words with REP, specify the space with an
underline:
229
230
231 REP 1
232 REP alot a_lot
233
Note III: The replacement table can be used for stricter compound word
checking (forbidding generated compound words if they are also simple
words with a typical fault, see CHECKCOMPOUNDREP).
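
A minimal Python sketch of how a replacement table can produce
suggestions, including the underline convention for spaces. It is an
illustration under stated assumptions, not Hunspell's implementation:
rep_candidates, suggest, REP_TABLE and WORDS are hypothetical names, and
the word list is invented.

```python
def rep_candidates(word, rep_table):
    """Yield word with one occurrence of a REP pattern replaced."""
    for what, replacement in rep_table:
        replacement = replacement.replace("_", " ")  # underline means space
        start = word.find(what)
        while start != -1:
            yield word[:start] + replacement + word[start + len(what):]
            start = word.find(what, start + 1)

REP_TABLE = [("f", "ph"), ("ph", "f"), ("alot", "a_lot")]
WORDS = {"phone", "fish", "a", "lot"}  # hypothetical word list

def suggest(word):
    """Return candidates whose space-separated parts are all known words."""
    out = []
    for cand in rep_candidates(word, REP_TABLE):
        if all(part in WORDS for part in cand.split(" ")) and cand not in out:
            out.append(cand)
    return out
```

Here suggest("fone") proposes "phone" and suggest("alot") proposes
"a lot".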
237
238
239 MAP number_of_map_definitions
240
241 MAP string_of_related_chars_or_parenthesized_character_sequences
We can define language-dependent information on characters and
character sequences that should be considered related (i.e. nearer than
other chars not in the set) in the affix file (.aff) by a map table.
With this table, Hunspell can suggest the right forms for words that
use the wrong letter or letter group from a related set more than once
in a word (see REP).

For example, a possible mapping could be for the German umlauted ü
versus the regular u; the word Frühstück really should be written with
umlauted u's and not regular ones:
253
254 MAP 1
255 MAP uü
256
Use parenthesized groups for character sequences (e.g. for composed
Unicode characters):
259
260 MAP 3
261 MAP ß(ss) (character sequence)
MAP ﬁ(fi) ("fi" compatibility characters for Unicode fi ligature)
263 MAP (ọ́)o (composed Unicode character: ó with bottom dot)
264
265 PHONE number_of_phone_definitions
266
267 PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm borrowed
from Aspell. It is useful for languages whose orthography is not based
on pronunciation. You can add a full alphabet conversion and other
rules for conversion of special letter sequences. For detailed
documentation see http://aspell.net/man-html/Phonetic-Code.html. Note:
Multibyte UTF-8 characters do not work with bracket expressions yet.
Dash expressions operate on signed bytes and not on UTF-8 characters
yet.
276
278 BREAK number_of_break_definitions
279
280 BREAK character_or_character_sequence
Define new break points for breaking words and checking word parts
separately. Use ^ and $ to delete characters at the start and end of
the word. Rationale: useful for compounding with joining characters or
strings (for example, hyphen in English and German or hyphen and n-dash
in Hungarian). Dashes are often bad break points for tokenization,
because compounds with dashes may contain invalid parts, too. With
BREAK, Hunspell can check both sides of these compounds, breaking the
words at dashes and n-dashes:
290
291 BREAK 2
292 BREAK -
293 BREAK -- # n-dash
294
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:
298
299 BREAK 3
300 BREAK -
301 BREAK ^-
302 BREAK -$
303
304 Hunspell doesn't accept the "-word" and "word-" forms by this BREAK
305 definition:
306
307 BREAK 1
308 BREAK -
309
Note II: COMPOUNDRULE is better (or will be better) for handling dashes
and other compound joining characters or character strings. Use BREAK
if you want to check words with dashes or other joining characters and
there is no time or possibility to describe precise compound rules with
COMPOUNDRULE (COMPOUNDRULE so far handles only the last suffixation of
the compound word).

Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter: WORDCHARS --- (see the
tests/break.* example).
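
The recursive breaking described above can be modeled with a short
Python sketch. It is a simplified model and may differ from Hunspell's
actual behavior in corner cases; the function name accepts and the word
list are hypothetical. A pattern starting with ^ or ending with $ is
treated as a string that may be dropped from the start or end of the
word.

```python
def accepts(word, dictionary, breaks):
    """Return True if word, or every part after breaking, is in dictionary."""
    if word in dictionary:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # "^x": drop x from the start of the word, then check the rest.
            lead = pattern[1:]
            if lead and word.startswith(lead) and len(word) > len(lead):
                if accepts(word[len(lead):], dictionary, breaks):
                    return True
        elif pattern.endswith("$"):
            # "x$": drop x from the end of the word, then check the rest.
            trail = pattern[:-1]
            if trail and word.endswith(trail) and len(word) > len(trail):
                if accepts(word[: -len(trail)], dictionary, breaks):
                    return True
        else:
            # Break inside the word; both parts must be non-empty.
            pos = word.find(pattern, 1)
            while pos != -1 and pos + len(pattern) < len(word):
                left, right = word[:pos], word[pos + len(pattern):]
                if accepts(left, dictionary, breaks) and accepts(right, dictionary, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False

WORDS = {"foo", "bar"}  # hypothetical word list
```

In this model, accepts("foo-foo--bar-bar", WORDS, ["-", "--"]) succeeds,
while accepts("-foo", WORDS, ["-"]) fails unless the "^-" pattern of the
default definition is also present.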
320
321 COMPOUNDRULE number_of_compound_definitions
322
323 COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The first
COMPOUNDRULE is a header with the number of the following COMPOUNDRULE
definitions. Compound patterns consist of compound flags, parentheses,
and star and question mark meta characters. A flag followed by a `*'
matches a word sequence of 0 or more matches of words signed with this
compound flag. A flag followed by a `?' matches a word sequence of 0 or
1 matches of a word signed with this compound flag. See
tests/compound*.* examples.
333
334 Note: en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
335 ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
336 1000122nd etc.).
337
338 Note II: In the case of long and numerical flag types use only
339 parenthesized flags: (1500)*(2000)?
340
Note III: COMPOUNDRULE flags are not yet compatible with the
COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags on
different words).
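
Because the pattern syntax is regex-like, the matching of a compound
against a rule can be sketched in Python by treating the rule as a
regular expression over flag sequences. This is only an illustration:
the names segmentations, matches_compound_rule and WORDS are
hypothetical, the dictionary is invented, flags are assumed to be single
letters, and parenthesized long or numerical flags are not handled.

```python
import re
from itertools import product

def segmentations(word, dictionary):
    """Yield every way to split word into dictionary words."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in dictionary:
            for rest in segmentations(word[end:], dictionary):
                yield [head] + rest

def matches_compound_rule(word, dictionary, rule):
    """Check word against one COMPOUNDRULE pattern."""
    pattern = re.compile(rule)
    for parts in segmentations(word, dictionary):
        # Each part contributes one of its flags; try every choice.
        for flags in product(*(dictionary[p] for p in parts)):
            if pattern.fullmatch("".join(flags)):
                return True
    return False

WORDS = {"foo": "A", "bar": "B", "baz": "C"}  # hypothetical word -> flags
```

With the hypothetical rule "A*B?C", the compound "foofoobaz" has the
flag sequence AAC and matches, while "bazfoo" (CA) does not.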
344
345
346 COMPOUNDMIN num
347 Minimum length of words in compound words. Default value is 3
348 letters.
349
350 COMPOUNDFLAG flag
Words signed with COMPOUNDFLAG may be in compound words (except when
the word is shorter than COMPOUNDMIN). Affixes with COMPOUNDFLAG also
permit compounding of affixed words.
354
355 COMPOUNDBEGIN flag
356 Words signed with COMPOUNDBEGIN (or with a signed affix) may be
357 first elements in compound words.
358
359 COMPOUNDLAST flag
360 Words signed with COMPOUNDLAST (or with a signed affix) may be
361 last elements in compound words.
362
363 COMPOUNDMIDDLE flag
364 Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
365 middle elements in compound words.
366
367 ONLYINCOMPOUND flag
368 Suffixes signed with ONLYINCOMPOUND flag may be only inside of
369 compounds (Fuge-elements in German, fogemorphemes in Swedish).
370 ONLYINCOMPOUND flag works also with words (see tests/onlyincom‐
371 pound.*).
372
373 COMPOUNDPERMITFLAG flag
374 Prefixes are allowed at the beginning of compounds, suffixes are
375 allowed at the end of compounds by default. Affixes with COM‐
376 POUNDPERMITFLAG may be inside of compounds.
377
378 COMPOUNDFORBIDFLAG flag
379 Suffixes with this flag forbid compounding of the affixed word.
380
381 COMPOUNDROOT flag
382 COMPOUNDROOT flag signs the compounds in the dictionary (Now it
383 is used only in the Hungarian language specific code).
384
385 COMPOUNDWORDMAX number
386 Set maximum word count in a compound word. (Default is unlim‐
387 ited.)
388
389 CHECKCOMPOUNDDUP
390 Forbid word duplication in compounds (e.g. foofoo).
391
392 CHECKCOMPOUNDREP
393 Forbid compounding, if the (usually bad) compound word may be a
394 non compound word with a REP fault. Useful for languages with
395 `compound friendly' orthography.
396
397 CHECKCOMPOUNDCASE
Forbid upper case characters at word boundaries in compounds.
399
400 CHECKCOMPOUNDTRIPLE
401 Forbid compounding, if compound word contains triple repeating
402 letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
403 ter support in UTF-8 encoding (works only for 7-bit ASCII char‐
404 acters).
405
406 SIMPLIFIEDTRIPLE
407 Allow simplified 2-letter forms of the compounds forbidden by
408 CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and
409 for the old German orthography: Schiff|fahrt -> Schiffahrt).
410
411 CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
412
413 CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
Forbid compounding if the first word in the compound ends with
endchars, the next word begins with beginchars, and (optionally) they
have the requested flags. The optional replacement parameter allows a
simplified compound form. Note: COMPOUNDMIN doesn't work correctly with
the compound word alternation, so COMPOUNDMIN may need to be set to a
lower value.
420
421 COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first parameter
is the maximum number of syllables that may be in a compound if it
contains more words than COMPOUNDWORDMAX. The second parameter is the
list of vowels (for calculating syllables).
426
427 SYLLABLENUM flags
Needed for special compounding rules in Hungarian.
429
431 PFX flag cross_product number
432
433 PFX flag stripping prefix [condition [morphological_fields...]]
434
435 SFX flag cross_product number
436
437 SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words to make
other words. We can define affix classes with an arbitrary number of
affix rules. Affix classes are signed with affix flags. The first line
of an affix class definition is the header. The fields of an affix
class header:
443
444 (0) Option name (PFX or SFX)
445
446 (1) Flag (name of the affix class)
447
448 (2) Cross product (permission to combine prefixes and suffixes).
449 Possible values: Y (yes) or N (no)
450
451 (3) Line count of the following rules.
452
Fields of an affix rule:
454
455 (0) Option name
456
457 (1) Flag
458
459 (2) stripping characters from beginning (at prefix rules) or end
460 (at suffix rules) of the word
461
462 (3) affix (optionally with flags of continuation classes, sepa‐
463 rated by a slash)
464
465 (4) condition.
466
Zero stripping or a zero affix is indicated by zero. A zero condition
is indicated by a dot. The condition is a simplified, regular
expression-like pattern, which must be met before the affix can be
applied. (A dot matches an arbitrary character. Characters in square
brackets match any one character from the given set. A dash has no
special meaning, but a circumflex (^) right after the opening bracket
selects the complement of the set.)
474
475 (5) Optional morphological fields separated by spaces or tabula‐
476 tors.
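
The rule fields listed above can be read into a small record, as in the
following Python sketch. The class AffixRule and the function
parse_affix_rule are hypothetical names used only for illustration; the
sketch assumes whitespace-separated fields and does not handle escaped
slashes in the affix.

```python
from dataclasses import dataclass, field

@dataclass
class AffixRule:
    kind: str            # "PFX" or "SFX"
    flag: str            # name of the affix class
    strip: str           # characters removed from the word ("" for zero)
    affix: str           # characters added ("" for zero)
    continuation: str    # flags of continuation classes, if any
    condition: str       # "." means no condition
    morph: list = field(default_factory=list)

def parse_affix_rule(line):
    """Parse one PFX/SFX rule line (not the class header) into its fields."""
    kind, flag, strip, affix, *rest = line.split()
    affix, _, continuation = affix.partition("/")
    return AffixRule(
        kind=kind,
        flag=flag,
        strip="" if strip == "0" else strip,
        affix="" if affix == "0" else affix,
        continuation=continuation,
        condition=rest[0] if rest else ".",
        morph=rest[1:],
    )

print(parse_affix_rule("SFX X 0 able/Y . ds:able"))
```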
477
478
480 CIRCUMFIX flag
481 Affixes signed with CIRCUMFIX flag may be on a word when this
482 word also has a prefix with CIRCUMFIX flag and vice versa.
483
484 FORBIDDENWORD flag
485 This flag signs forbidden word form. Because affixed forms are
486 also forbidden, we can subtract a subset from set of the
487 accepted affixed and compound words.
488
489 FULLSTRIP
With FULLSTRIP, affix rules can strip whole words, not only up to one
character less than the whole word.

Note: conditions may be as long as the word without FULLSTRIP, too.
494
495 KEEPCASE flag
496 Forbid uppercased and capitalized forms of words signed with
497 KEEPCASE flags. Useful for special orthographies (measurements
498 and currency often keep their case in uppercased texts) and
499 writing systems (e.g. keeping lower case of IPA characters).
500
501 Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
502 CASE flag may be capitalized and uppercased, but uppercased
503 forms of these words may not contain sharp s, only SS. See ger‐
504 mancompounding example in the tests directory of the Hunspell
505 distribution.
506
Note: Using a lot of zero affixes may have a big cost, because every
zero affix is checked during affix analysis before the other affixes.
510
511 ICONV number_of_ICONV_definitions
512
513 ICONV pattern pattern2
514 Define input conversion table.
515
516 OCONV number_of_OCONV_definitions
517
518 OCONV pattern pattern2
519 Define output conversion table.
520
521 LEMMA_PRESENT flag
522 Not used in Hunspell 1.2. Use "st:" field instead of
523 LEMMA_PRESENT.
524
525 NEEDAFFIX flag
This flag signs virtual stems in the dictionary. Only affixed forms of
these words will be accepted by Hunspell, unless the dictionary word
has a homonym or a zero affix. NEEDAFFIX works also with prefixes and
prefix + suffix combinations (see tests/pseudoroot5.*).
531
532 PSEUDOROOT flag
533 Deprecated. (Former name of the NEEDAFFIX option.)
534
535 SUBSTANDARD flag
536 SUBSTANDARD flag signs affix rules and dictionary words (allo‐
537 morphs) not used in morphological generation (and in suggestion
538 in the future versions). See also NOSUGGEST.
539
540 WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line interface
with additional word characters. For example, dot, dash, n-dash,
numbers and the percent sign are word characters in Hungarian.
544
545 CHECKSHARPS
546 SS letter pair in uppercased (German) words may be upper case
547 sharp s (ß). Hunspell can handle this special casing with the
548 CHECKSHARPS declaration (see also KEEPCASE flag and tests/ger‐
549 mancompounding example) in both spelling and suggestion.
550
551
553 Hunspell's dictionary items and affix rules may have optional space or
554 tabulator separated morphological description fields, started with
555 3-character (two letters and a colon) field IDs:
556
557
558 word/flags po:noun is:nom
559
Example: We define a simple resource with morphological information, a
derivative suffix (ds:) and a part of speech category (po:):
562
563 Affix file:
564
565
566 SFX X Y 1
567 SFX X 0 able . ds:able
568
569 Dictionary file:
570
571
572 drink/X po:verb
573
574 Test file:
575
576
577 drink
578 drinkable
579
580 Test:
581
582
583 $ analyze test.aff test.dic test.txt
584 > drink
585 analyze(drink) = po:verb
586 stem(drink) = po:verb
587 > drinkable
588 analyze(drinkable) = po:verb ds:able
589 stem(drinkable) = drinkable
590
As the example shows, the analyzer concatenates the morphological
fields of the dictionary item and the affix rule, in item-and-arrangement
style.
593
594
596 Default morphological and other IDs (used in suggestion, stemming and
597 morphological generation):
598
599 ph: Alternative transliteration for better suggestion. It's useful
600 for words with foreign pronunciation. (Dictionary based phonetic
601 suggestion.) For example:
602
603
604 Marseille ph:maarsayl
605
st: Stem. Optional: the default stem is the dictionary item in
morphological analysis. The stem field is useful for virtual stems
(dictionary words with NEEDAFFIX flag) and for morphological exceptions,
instead of new morphological rules that would be used only once.
610
611 feet st:foot is:plural
612 mice st:mouse is:plural
613 teeth st:tooth is:plural
614
615 Word forms with multiple stems need multiple dictionary items:
616
617
618 lay po:verb st:lie is:past_2
619 lay po:verb is:present
620 lay po:noun
621
622 al: Allomorph(s). A dictionary item is the stem of its allomorphs.
623 Morphological generation needs stem, allomorph and affix fields.
624
625 sing al:sang al:sung
626 sang st:sing
627 sung st:sing
628
629 po: Part of speech category.
630
631 ds: Derivational suffix(es). Stemming doesn't remove derivational
632 suffixes. Morphological generation depends on the order of the
633 suffix fields.
634
635 In affix rules:
636
637
638 SFX Y Y 1
639 SFX Y 0 ly . ds:ly_adj
640
641 In the dictionary:
642
643
644 ably st:able ds:ly_adj
645 able al:ably
646
647 is: Inflectional suffix(es). All inflectional suffixes are removed
648 by stemming. Morphological generation depends on the order of
649 the suffix fields.
650
651
652 feet st:foot is:plural
653
654 ts: Terminal suffix(es). Terminal suffix fields are inflectional
655 suffix fields "removed" by additional (not terminal) suffixes.
656
657 Useful for zero morphemes and affixes removed by splitting
658 rules.
659
660
661 work/D ts:present
662
663 SFX D Y 2
664 SFX D 0 ed . is:past_1
665 SFX D 0 ed . is:past_2
666
667 Typical example of the terminal suffix is the zero morpheme of the nom‐
668 inative case.
669
670
671 sp: Surface prefix. Temporary solution for adding prefixes to the
672 stems and generated word forms. See tests/morph.* example.
673
674
675 pa: Parts of the compound words. Output fields of morphological
676 analysis for stemming.
677
678 dp: Planned: derivational prefix.
679
680 ip: Planned: inflectional prefix.
681
682 tp: Planned: terminal prefix.
683
684
Ispell's original algorithm strips only one suffix. Hunspell can strip
one more (or an additional prefix in COMPLEXPREFIXES mode).

The twofold suffix stripping is a significant improvement in handling
the immense number of suffixes that characterize agglutinative
languages.
692
693 A second `s' suffix (affix class Y) will be the continuation class of
694 the suffix `able' in the following example:
695
696
697 SFX Y Y 1
698 SFX Y 0 s .
699
700 SFX X Y 1
701 SFX X 0 able/Y .
702
703 Dictionary file:
704
705
706 drink/X
707
708 Test file:
709
710
711 drink
712 drinkable
713 drinkables
714
715 Test:
716
717
718 $ hunspell -m -d test <test.txt
719 drink st:drink
720 drinkable st:drink fl:X
721 drinkables st:drink fl:X fl:Y
722
Theoretically, twofold suffix stripping needs only the square root of
the number of suffix rules required by an implementation that strips a
single suffix. In practice, this allowed us to describe the Hungarian
inflectional morphology with twofold suffix stripping.
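
The effect of continuation classes in the drink example can be sketched
in Python from the generation side. This is an illustration, not
Hunspell's stripping code: SUFFIXES and generate are hypothetical names,
and every rule is assumed to have zero stripping and no condition, as in
the example.

```python
# Suffix classes from the example: flag -> list of (suffix, continuation flags)
SUFFIXES = {
    "Y": [("s", "")],
    "X": [("able", "Y")],
}

def generate(stem, flags, depth=2):
    """Return stem plus the forms built from at most depth chained suffixes."""
    forms = {stem}
    if depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation in SUFFIXES.get(flag, []):
            # The continuation flags decide which suffix may follow this one.
            forms |= generate(stem + suffix, continuation, depth - 1)
    return forms

print(sorted(generate("drink", "X")))
```

With depth=1, which corresponds to stripping a single suffix, the form
"drinkables" is no longer produced.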
727
728
Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for giving flags in affix and dictionary files.
732
733 FLAG long command sets 2-character flags:
734
735
736 FLAG long
737 SFX Y1 Y 1
738 SFX Y1 0 s 1
739
740 Dictionary record with the Y1, Z3, F? flags:
741
742
743 foo/Y1Z3F?
744
745 FLAG num command sets numerical flags separated by comma:
746
747
748 FLAG num
749 SFX 65000 Y 1
750 SFX 65000 0 s 1
751
752 Dictionary example:
753
754
755 foo/65000,12,2756
756
757 The third one is the Unicode character flags.
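
The three flag syntaxes differ only in how the flag field of an entry is
split, which the following Python sketch illustrates. The function name
parse_flags and its flag_type argument are hypothetical.

```python
def parse_flags(flag_field, flag_type="default"):
    """Split the flag field of an entry according to the FLAG type."""
    if flag_type == "long":
        # Two characters per flag.
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    if flag_type == "num":
        # Decimal numbers separated by commas.
        return [int(n) for n in flag_field.split(",")]
    # Default and UTF-8 types: one character per flag.
    return list(flag_field)

print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
```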
758
759
761 Hunspell's dictionary can contain repeating elements that are homonyms:
762
763
764 work/A po:verb
765 work/B po:noun
766
767 An affix file:
768
769
770 SFX A Y 1
SFX A 0 s . is:sg3
772
773 SFX B Y 1
774 SFX B 0 s . is:plur
775
776 Test file:
777
778
779 works
780
781 Test:
782
783
784 $ hunspell -d test -m <testwords
785 work st:work po:verb is:sg3
786 work st:work po:noun is:plur
787
788 This feature also gives a way to forbid illegal prefix/suffix combina‐
789 tions.
790
791
An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to render lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier Hunspell code. This
may have been legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay or drinkable,
undrinkable but *undrink). In simple words, here, the prefix un- is
legitimate only if the base drink is suffixed with -able. If both these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.
810
In the next example, suffix class R has a prefix `continuation' class
(class P).
813
814
815 PFX P Y 1
816 PFX P 0 un . [prefix_un]+
817
818 SFX S Y 1
819 SFX S 0 s . +PL
820
821 SFX Q Y 1
822 SFX Q 0 s . +3SGV
823
824 SFX R Y 1
825 SFX R 0 able/PS . +DER_V_ADJ_ABLE
826
827 Dictionary:
828
829
830 2
831 drink/RQ [verb]
832 drink/S [noun]
833
834 Morphological analysis:
835
836
837 > drink
838 drink[verb]
839 drink[noun]
840 > drinks
841 drink[verb]+3SGV
842 drink[noun]+PL
843 > drinkable
844 drink[verb]+DER_V_ADJ_ABLE
845 > drinkables
846 drink[verb]+DER_V_ADJ_ABLE+PL
847 > undrinkable
848 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
849 > undrinkables
850 [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
851 > undrink
852 Unknown word.
853 > undrinks
854 Unknown word.
855
Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is one affix in morphology. We
also need the CIRCUMFIX option for correct morphological analysis.
860
861
862 # circumfixes: ~ obligate prefix/suffix combinations
863 # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
864 # nagy, nagyobb, legnagyobb, legeslegnagyobb
865 # (great, greater, greatest, most greatest)
866
867 CIRCUMFIX X
868
869 PFX A Y 1
870 PFX A 0 leg/X .
871
872 PFX B Y 1
873 PFX B 0 legesleg/X .
874
875 SFX C Y 3
876 SFX C 0 obb . +COMPARATIVE
877 SFX C 0 obb/AX . +SUPERLATIVE
878 SFX C 0 obb/BX . +SUPERSUPERLATIVE
879
880 Dictionary:
881
882
883 1
884 nagy/C [MN]
885
886 Analysis:
887
888
889 > nagy
890 nagy[MN]
891 > nagyobb
892 nagy[MN]+COMPARATIVE
893 > legnagyobb
894 nagy[MN]+SUPERLATIVE
895 > legeslegnagyobb
896 nagy[MN]+SUPERSUPERLATIVE
897
Allowing free compounding yields a decrease in precision of
recognition, not to mention stemming and morphological analysis.
Although lexical switches are introduced to license compounding of
bases by Ispell, this proves not to be restrictive enough. For example:
903
904
905 # affix file
906 COMPOUNDFLAG X
907
908 2
909 foo/X
910 bar/X
911
With this resource, foobar and barfoo are also accepted words.
913
914 This has been improved upon with the introduction of direction-sensi‐
915 tive compounding, i.e., lexical features can specify separately whether
916 a base can occur as leftmost or rightmost constituent in compounds.
917 This, however, is still insufficient to handle the intricate patterns
918 of compounding, not to mention idiosyncratic (and language specific)
919 norms of hyphenation.
920
The MySpell algorithm currently allows any affixed form of words which
are lexically marked as potential members of compounds. Hunspell
improved this, and its recursive compound checking rules make it
possible to implement the intricate spelling conventions of Hungarian
compounds. For example, the noteworthy Hungarian `6-3' rule can be set
using the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
SYLLABLENUM options. As a further example, in Hungarian, derivational
suffixes often modify compounding properties. Hunspell allows the
compounding flags on the affixes, and there are two special flags
(COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
compounding of the derivations. Suffixes with COMPOUNDFORBIDFLAG forbid
compounding of the affixed word.
933
934 We also need several Hunspell features for handling German compounding:
935
936
937 # German compounding
938
939 # set language to handle special casing of German sharp s
940
941 LANG de_DE
942
943 # compound flags
944
945 COMPOUNDBEGIN U
946 COMPOUNDMIDDLE V
947 COMPOUNDEND W
948
949 # Prefixes are allowed at the beginning of compounds,
950 # suffixes are allowed at the end of compounds by default:
951 # (prefix)?(root)+(affix)?
952 # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
953 COMPOUNDPERMITFLAG P
954
955 # for German fogemorphemes (Fuge-element)
956 # Hint: ONLYINCOMPOUND is not required everywhere, but the
957 # checking will be a little faster with it.
958
959 ONLYINCOMPOUND X
960
961 # forbid uppercase characters at compound word bounds
962 CHECKCOMPOUNDCASE
963
964 # for handling Fuge-elements with dashes (Arbeits-)
965 # dash will be a special word
966
967 COMPOUNDMIN 1
968 WORDCHARS -
969
970 # compound settings and fogemorpheme for `Arbeit'
971
972 SFX A Y 3
973 SFX A 0 s/UPX .
974 SFX A 0 s/VPDX .
975 SFX A 0 0/WXD .
976
977 SFX B Y 2
978 SFX B 0 0/UPX .
979 SFX B 0 0/VWXDP .
980
981 # a suffix for `Computer'
982
983 SFX C Y 1
984 SFX C 0 n/WD .
985
986 # for forbid exceptions (*Arbeitsnehmer)
987
988 FORBIDDENWORD Z
989
990 # dash prefix for compounds with dash (Arbeits-Computer)
991
992 PFX - Y 1
993 PFX - 0 -/P .
994
995 # decapitalizing prefix
996 # circumfix for positioning in compounds
997
998 PFX D Y 29
999 PFX D A a/PX A
1000 PFX D Ä ä/PX Ä
1001 .
1002 .
1003 PFX D Y y/PX Y
1004 PFX D Z z/PX Z
1005
1006 Example dictionary:
1007
1008
1009 4
1010 Arbeit/A-
1011 Computer/BC-
1012 -/W
1013 Arbeitsnehmer/Z
1014
Accepted compound words with the previous resource:
1016
1017
1018 Computer
1019 Computern
1020 Arbeit
1021 Arbeits-
1022 Computerarbeit
1023 Computerarbeits-
1024 Arbeitscomputer
1025 Arbeitscomputern
1026 Computerarbeitscomputer
1027 Computerarbeitscomputern
1028 Arbeitscomputerarbeit
1029 Computerarbeits-Computer
1030 Computerarbeits-Computern
1031
Compounds that are not accepted:
1033
1034
1035 computer
1036 arbeit
1037 Arbeits
1038 arbeits
1039 ComputerArbeit
1040 ComputerArbeits
1041 Arbeitcomputer
1042 ArbeitsComputer
1043 Computerarbeitcomputer
1044 ComputerArbeitcomputer
1045 ComputerArbeitscomputer
1046 Arbeitscomputerarbeits
1047 Computerarbeits-computer
1048 Arbeitsnehmer
1049
1050 This solution is still not ideal, however, and will be replaced by a
1051 pattern-based compound-checking algorithm which is closely integrated
1052 with input buffer tokenization. Patterns describing compounds come as a
1053 separate input resource that can refer to high-level properties of con‐
1054 stituent parts (e.g. the number of syllables, affix flags, and contain‐
1055 ment of hyphens). The patterns are matched against potential segmenta‐
1056 tions of compounds to assess wellformedness.
1057
1058
Both Ispell and Myspell use 8-bit character encodings, which is a major
deficiency when it comes to scalability. Although a language like
Hungarian has a standard 8-bit character set (ISO 8859-2), it fails to
allow a full implementation of Hungarian orthographic conventions. For
instance, the '--' symbol (n-dash) is missing from this character set,
even though it is not only the official symbol to delimit parenthetic
clauses in the language, but can also appear in compound words as a
special 'big' hyphen.

MySpell has some 8-bit encoding tables, but there are languages without
a standard 8-bit encoding, too. For example, a lot of African languages
have non-Latin or extended Latin characters.
1072
Similarly, using the original spelling of certain foreign names like
Ångström or Molière is encouraged by the Hungarian spelling norm, and,
since characters 'Å' and 'è' are not part of ISO 8859-2, when they
combine with inflections containing characters only in ISO 8859-2 (like
elative -ből, ablative -től or delative -ről with double acute), these
result in words (like Ångströmről or Molière-től) that cannot be
encoded using a single 8-bit encoding scheme such as ISO 8859-1 or
ISO 8859-2.
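
The encoding problem can be checked directly with Python's standard
codecs. The helper name encodable is hypothetical; the codec names are
the ones Python uses for ISO 8859-1, ISO 8859-2 and UTF-8.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for word in ("Ångströmről", "Molière-től"):
    for encoding in ("iso8859-1", "iso8859-2", "utf-8"):
        print(word, encoding, encodable(word, encoding))
```

Both words fail in ISO 8859-1 because of 'ő', and in ISO 8859-2 because
of 'Å' or 'è'; only UTF-8 can represent them.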
1080
The problems raised in relation to 8-bit encoding have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding-independence has its advantages when it comes
to a truly multi-lingual application. Hunspell implements memory- and
time-efficient Unicode handling. In non-UTF-8 character encodings
Hunspell works with the original 8-bit strings. In UTF-8 encoding,
affixes and words are stored in UTF-8 and handled mostly in UTF-8
during analysis, and are converted to UTF-16 for condition checking and
suggestion. Unicode text analysis and spell checking have a minimal
(0-20%) time overhead and a minimal or reasonable memory overhead,
depending on the language (its UTF-8 encoding and affixation).
1092
1093
Aspell dictionaries can be easily converted to Hunspell format.
Conversion steps:
1097
1098 dictionary (xx.cwl -> xx.wl):
1099
1100 preunzip xx.cwl
1101 wc -l < xx.wl > xx.dic
1102 cat xx.wl >> xx.dic
1103
1104 affix file
1105
1106 If the affix file exists, copy it:
1107 cp xx_affix.dat xx.aff
1108 If not, create it with the suitable character encoding (see xx.dat)
1109 echo "SET ISO8859-x" > xx.aff
1110 or
1111 echo "SET UTF-8" > xx.aff
1112
It's useful to add a TRY option with the characters of the dictionary
in frequency order to set edit distance suggestions:
1115 echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
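
The dictionary step above (a word count line followed by the word list)
can also be sketched in Python. The function name wordlist_to_dic is
hypothetical, and the sketch assumes the word list has already been
decompressed and read into memory.

```python
def wordlist_to_dic(words):
    """Build .dic content: a word-count line followed by one word per line."""
    words = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(words)), *words]) + "\n"

print(wordlist_to_dic(["foo\n", "bar\n"]), end="")
```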
1116
1117
hunspell(1), ispell(1), ispell(4)
1120
1121
1122
1123
1124 2010-03-03 hunspell(4)