hunspell(4) Kernel Interfaces Manual hunspell(4)

hunspell - format of Hunspell dictionaries and affix files

Hunspell(1) requires two files to define the language that it is spell
checking. The first file is a dictionary containing words for the
language, and the second is an "affix" file that defines the meaning of
special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line. The
first line of a dictionary (except a personal dictionary) contains the
approximate word count (used to size the hash table). Each word may
optionally be followed by a slash ("/") and one or more flags, which
represent affixes or special attributes. Dictionary words can also
contain slashes, written with the "\/" syntax. The default flag format
is a single (usually alphabetic) character. The dictionary words may
also be followed by optional fields separated by tabs or spaces (spaces
work as morphological field separators only if they are followed by
morphological field IDs; see also Optional data fields).

Personal dictionaries are simple word lists. An asterisk in the first
character position marks a forbidden word. A second word separated by a
slash sets the affixation: the first word takes the affixes of the
second.

foo
Foo/Simpson
*bar

In this example, "foo" and "Foo" are personal words, Foo is also
recognized with the affixes of Simpson (Foo's etc.), and bar is a
forbidden word.

An affix file (*.aff) may contain many optional attributes. For
example, SET sets the character encoding of the affix and dictionary
files. TRY sets the characters tried when generating suggestions. REP
sets a replacement table for multi-character corrections in suggestion
mode. PFX and SFX define prefix and suffix classes named with affix
flags.

The following affix file example sets the UTF-8 character encoding.
The `TRY' suggestions differ from the misspelled word by one English
letter or an apostrophe. With these REP definitions, Hunspell can
suggest the right word form when the misspelled word contains f instead
of ph and vice versa.

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

REP 2
REP f ph
REP ph f

PFX A Y 1
PFX A 0 re .

SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y

There are two affix classes in the affix file. Class A defines a `re-'
prefix. Class B defines two `-ed' suffixes. The first suffix can be
added to a word if the last character of the word isn't `y'. The second
suffix can be added to words ending in `y'; it strips the `y' before
adding `ied'. (See later.) The following dictionary file uses these
affix classes.

3
hello
try/B
work/AB

All accepted words with this dictionary: "hello", "try", "tried",
"work", "worked", "rework", "reworked".
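
The expansion above can be reproduced with a short Python sketch. It is
only an illustration of how the rules combine, not Hunspell's actual
algorithm (Hunspell strips affixes from input words instead of
generating a word list). The function names are hypothetical, flags are
assumed to be single characters, and conditions are treated as ordinary
regular expressions.

```python
import re

# Rules from the example affix file: (flag, strip, affix, condition).
# "0" stands for an empty strip string.
PREFIXES = [("A", "0", "re", ".")]
SUFFIXES = [("B", "0", "ed", "[^y]"), ("B", "y", "ied", "y")]


def with_suffixes(word, flags):
    """Return the suffixed forms of word permitted by its flags."""
    forms = []
    for flag, strip, affix, cond in SUFFIXES:
        # A suffix condition is tested against the end of the word.
        if flag in flags and re.search(cond + "$", word):
            stem = word if strip == "0" else word[:-len(strip)]
            forms.append(stem + affix)
    return forms


def with_prefixes(word, flags):
    """Return the prefixed forms of word permitted by its flags."""
    forms = []
    for flag, strip, affix, cond in PREFIXES:
        # A prefix condition is tested against the start of the word.
        if flag in flags and re.match(cond, word):
            stem = word if strip == "0" else word[len(strip):]
            forms.append(affix + stem)
    return forms


def expand(entry):
    """Expand one dictionary line such as "work/AB" into its word forms."""
    word, _, flags = entry.partition("/")
    forms = [word] + with_suffixes(word, flags)
    # Cross product Y: a prefix may also go on an already suffixed form.
    for base in list(forms):
        forms.extend(with_prefixes(base, flags))
    return forms


accepted = [form for entry in ["hello", "try/B", "work/AB"]
            for form in expand(entry)]
print(accepted)
```

Running the sketch lists the same seven words as the paragraph above.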

The Hunspell source distribution contains more than 80 examples of
option usage.

SET encoding
Set the character encoding of words and morphemes in affix and
dictionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251,
ISCII-DEVANAGARI.

FLAG value
Set the flag type. The default type is the extended ASCII (8-bit)
character. The `UTF-8' parameter sets UTF-8 encoded Unicode character
flags. The `long' value sets the double extended ASCII character flag
type, and `num' sets the decimal number flag type. Decimal flags are
numbered from 1 to 65000 and are separated by commas in flag fields.
BUG: The UTF-8 flag type doesn't work on the ARM platform.

COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping) for
agglutinative languages with right-to-left writing systems.

LANG langcode
Set the language code. Hunspell may enable language-specific code
based on the LANG value. At present there is specific code for az_AZ,
hu_HU and tr_TR in Hunspell (see the source code).

IGNORE characters
Ignore the given characters in dictionary words, affixes and input
words. Useful for optional characters, such as Arabic diacritical
marks (Harakat).

AF number_of_flag_vector_aliases

AF flag_vector
Hunspell can substitute affix flag sets with ordinal numbers in
affix rules (alias compression, see the makealias tool). The first
example with alias compression:

3
hello
try/1
work/2

AF definitions in the affix file:

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
AF 2
AF A
AF AB

It is equivalent to the following dic file:

3
hello
try/A
work/AB

See also the tests/alias* examples of the source distribution.
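
As a rough illustration of the equivalence shown above, the following
Python sketch expands alias numbers back into flag sets. The helper
names are hypothetical, and the sketch assumes the default
single-character flag type, so an all-digit flag field is always an
alias number.

```python
def read_aliases(aff_lines):
    """Collect the AF flag vectors; the first AF line is the count header."""
    values = [line.split()[1] for line in aff_lines if line.startswith("AF ")]
    return values[1:]


def expand_entry(entry, aliases):
    """Replace an alias number after the slash with its flag vector."""
    word, slash, flags = entry.partition("/")
    if slash and flags.isdigit():
        flags = aliases[int(flags) - 1]  # aliases are numbered from 1
    return word + slash + flags


aff = ["SET UTF-8", "AF 2", "AF A", "AF AB"]
aliases = read_aliases(aff)
print([expand_entry(e, aliases) for e in ["hello", "try/1", "work/2"]])
```

The count header is skipped rather than validated here; a real reader
would probably check it against the number of AF lines that follow.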

Note: If the affix file contains the FLAG parameter, define it before
the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution to
compress aff and dic files.

AM number_of_morphological_aliases

AM morphological_fields
Hunspell can also substitute morphological data with ordinal numbers
in affix rules (alias compression). See the tests/alias* examples.

Suggestion parameters can optimize the default n-gram, character swap
and deletion suggestions of Hunspell. REP is recommended for fixing
typical and especially bad language-specific mistakes, because the REP
suggestions have the highest priority in the suggestion list. PHONE is
for languages whose orthography is not based on pronunciation.

KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words with one different character
replaced by a neighboring KEY character. Characters that are not
neighbors are separated by vertical line characters in the KEY string.
Suggested KEY parameters for QWERTY and Dvorak keyboard layouts:

KEY qwertyuiop|asdfghjkl|zxcvbnm
KEY pyfgcrl|aeouidhtns|qjkxbmwvz

Using the first QWERTY layout, Hunspell suggests "nude" and "node" for
"*nide". A character may have more neighbors, too:

KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
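
The neighbor search can be sketched in Python as follows. This is a
simplified model under stated assumptions: the function names and the
three-word dictionary are made up for the illustration, and the order
of the results is not meant to match Hunspell's suggestion ranking.

```python
def key_neighbors(key):
    """Map each character to the set of its neighbors in a KEY string."""
    neighbors = {}
    for group in key.split("|"):
        for i, ch in enumerate(group):
            # Characters directly before and after ch in the same group.
            adjacent = group[max(i - 1, 0):i] + group[i + 1:i + 2]
            neighbors.setdefault(ch, set()).update(adjacent)
    return neighbors


def key_suggestions(word, key, dictionary):
    """Return dictionary words that differ from word by one neighbor key."""
    neighbors = key_neighbors(key)
    found = []
    for i, ch in enumerate(word):
        for repl in sorted(neighbors.get(ch, ())):
            candidate = word[:i] + repl + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


qwerty = "qwertyuiop|asdfghjkl|zxcvbnm"
print(key_suggestions("nide", qwerty, {"nude", "node", "nice"}))
```

Because a character may appear in several groups, as in the longer KEY
line above, the sketch merges the neighbor sets of all groups.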

TRY characters
Hunspell can suggest the right word forms when they differ from the
bad input word by one TRY character. The parameter of TRY is case
sensitive.

NOSUGGEST flag
Words marked with the NOSUGGEST flag are not suggested. Proposed flag
for vulgar and obscene words (see also SUBSTANDARD).

MAXNGRAMSUGS num
Set the number of n-gram suggestions. The value 0 switches off the
n-gram suggestions.

NOSPLITSUGS
Disable split-word suggestions.

SUGSWITHDOTS
Add dot(s) to suggestions if the input word terminates in dot(s). (Not
for OpenOffice.org dictionaries, because OpenOffice.org has an
automatic dot expansion mechanism.)

REP number_of_replacement_definitions

REP what replacement
We can define language-dependent phonetic information in the affix
file (.aff) with a replacement table. The first REP line is the header
of this table, and one or more REP data lines follow it. With this
table, Hunspell can suggest the right forms for typical spelling
mistakes where the incorrect form differs by more than one letter from
the right form. For example, a possible English replacement table
definition to handle misspelled consonants:

REP 8
REP f ph
REP ph f
REP f gh
REP gh f
REP j dg
REP dg j
REP k ch
REP ch k

Note I: It's also very useful to define replacements for the most
typical one-character mistakes: with REP you can give higher priority
to a subset of the TRY suggestions (the suggestion list begins with the
REP suggestions).

Note II: When suggesting separate words with REP, you can specify a
space with an underscore:

REP 1
REP alot a_lot

Note III: The replacement table can also be used for stricter compound
word checking (forbidding generated compound words if they are also
simple words with a typical mistake, see CHECKCOMPOUNDREP).
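
A minimal Python sketch of replacement-table suggestions, for
illustration only: it tries each REP pair at each position and keeps
the candidates found in a word list. The function name and the small
dictionary are assumptions, and Hunspell's own ranking and affix
handling are not modeled.

```python
def rep_suggestions(word, rep_table, dictionary):
    """Apply each (what, replacement) pair once and keep known results."""
    found = []
    for what, replacement in rep_table:
        replacement = replacement.replace("_", " ")  # underscore means space
        start = word.find(what)
        while start != -1:
            candidate = word[:start] + replacement + word[start + len(what):]
            # A replacement with a space yields several words to check.
            known = all(part in dictionary for part in candidate.split(" "))
            if known and candidate not in found:
                found.append(candidate)
            start = word.find(what, start + 1)
    return found


table = [("f", "ph"), ("ph", "f"), ("alot", "a_lot")]
words = {"phone", "fish", "a", "lot"}
print(rep_suggestions("fone", table, words))
print(rep_suggestions("alot", table, words))
```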

MAP number_of_map_definitions

MAP string_of_related_chars
We can define language-dependent information on characters that
should be considered related (i.e. nearer than other chars not in the
set) in the affix file (.aff) with a character map table. With this
table, Hunspell can suggest the right forms for words that incorrectly
choose the wrong letter from a related set more than once in a word.

For example, a possible mapping could be for the German umlauted ü
versus the regular u; the word Frühstück really should be written with
umlauted u's and not regular ones:

MAP 1
MAP uü
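
The effect of a MAP table can be approximated in Python by trying every
combination of related characters. This brute-force sketch only
illustrates the idea; the function name is hypothetical, and the number
of candidates grows exponentially with the number of mapped characters,
so it is not how a production checker would do it.

```python
from itertools import product


def map_suggestions(word, map_sets, dictionary):
    """Try all combinations of related characters and keep known words."""
    options = []
    for ch in word:
        # Alternatives for this position: the MAP set containing ch,
        # or just ch itself when no set mentions it.
        related = next((s for s in map_sets if ch in s), ch)
        options.append(related)
    candidates = ("".join(chars) for chars in product(*options))
    return [c for c in candidates if c != word and c in dictionary]


print(map_suggestions("Fruhstuck", ["uü"], {"Frühstück"}))
```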

PHONE number_of_phone_definitions

PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm borrowed
from Aspell. It is useful for languages whose orthography is not based
on pronunciation. You can add a full alphabet conversion and other
rules for the conversion of special letter sequences. For detailed
documentation see http://aspell.net/man-html/Phonetic-Code.html. Note:
Multibyte UTF-8 characters do not work with bracket expressions yet.
Dash expressions work on signed bytes, not on UTF-8 characters, yet.

BREAK number_of_break_definitions

BREAK character_or_character_sequence
Define new break points for breaking words and checking word parts
separately. Use ^ and $ to delete characters at the start and end of
the word. Rationale: useful for compounding with joining characters or
strings (for example, hyphen in English and German or hyphen and n-dash
in Hungarian). (Dashes are often bad break points for tokenization,
because compounds with dashes may contain invalid parts, too.) With
BREAK, Hunspell can check both sides of these compounds, breaking the
words at dashes and n-dashes:

BREAK 2
BREAK -
BREAK -- # n-dash

Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
be valid compounds. Note: The default word break of Hunspell is
equivalent to the following BREAK definition:

BREAK 3
BREAK -
BREAK ^-
BREAK -$

Hunspell doesn't accept the "-word" and "word-" forms with this BREAK
definition:

BREAK 1
BREAK -
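
The recursive breaking described above can be modeled with a small
Python function. It is a sketch under simplifying assumptions: the
function name is hypothetical, a plain Python set stands in for the
dictionary, and affixes are ignored.

```python
def accepts(word, breaks, dictionary):
    """Accept word if it is known or if a break leaves only accepted parts."""
    if word in dictionary:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # "^x": delete x at the start of the word.
            sep = pattern[1:]
            if word.startswith(sep) and accepts(word[len(sep):], breaks, dictionary):
                return True
        elif pattern.endswith("$"):
            # "x$": delete x at the end of the word.
            sep = pattern[:-1]
            if word.endswith(sep) and accepts(word[:-len(sep)], breaks, dictionary):
                return True
        else:
            # Plain pattern: break inside the word and check both sides.
            pos = word.find(pattern, 1)
            while pos != -1 and pos + len(pattern) < len(word):
                left, right = word[:pos], word[pos + len(pattern):]
                if accepts(left, breaks, dictionary) and accepts(right, breaks, dictionary):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


words = {"foo", "bar"}
default_breaks = ["-", "^-", "-$"]
print(accepts("foo-bar", ["-"], words))        # both parts are known
print(accepts("-foo", ["-"], words))           # no "^-" rule here
print(accepts("-foo", default_breaks, words))  # "^-" removes the dash
```

Every recursive call works on a strictly shorter string, so the search
terminates as long as the break strings are not empty.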

Note II: COMPOUNDRULE is better (or will be better) for handling
dashes and other compound joining characters or character strings. Use
BREAK if you want to check words with dashes or other joining
characters and there is no time or possibility to describe precise
compound rules with COMPOUNDRULE (COMPOUNDRULE handles only the last
suffixation of the compound word so far).

Note III: For command line spell checking of words with extra
characters, set the WORDCHARS parameter: WORDCHARS --- (see the
tests/break.* example).

COMPOUNDRULE number_of_compound_definitions

COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The first
COMPOUNDRULE is a header with the number of the following COMPOUNDRULE
definitions. Compound patterns consist of compound flags, parentheses,
and the star and question mark metacharacters. A flag followed by a `*'
matches a word sequence of 0 or more words marked with this compound
flag. A flag followed by a `?' matches a word sequence of 0 or 1 words
marked with this compound flag. See the tests/compound*.* examples.

Note: The en_US dictionary of OpenOffice.org uses COMPOUNDRULE for
ordinal number recognition (1st, 2nd, 11th, 12th, 22nd, 112th,
1000122nd etc.).

Note II: With long and numerical flag types, use only parenthesized
flags: (1500)*(2000)?

Note III: COMPOUNDRULE flags are not yet compatible with the
COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags on
different words).
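
To make the pattern semantics concrete, here is a Python sketch that
matches a sequence of words, each given as a set of flags, against a
compound pattern. The function names are hypothetical, and the sketch
assumes the input has already been split into words; finding the
segmentation is a separate problem that it does not address.

```python
import re


def parse_rule(rule):
    """Split "A*B?" or "(1500)*(2000)?" into (flag, quantifier) pairs."""
    tokens = re.findall(r"(\(([^)]+)\)|.)([*?]?)", rule)
    return [(inner or whole, quant) for whole, inner, quant in tokens]


def matches(rule, word_flags):
    """Check whether the words (one flag set per word) fit the rule."""
    tokens = parse_rule(rule)

    def step(t, w):
        if t == len(tokens):
            return w == len(word_flags)
        flag, quant = tokens[t]
        fits = w < len(word_flags) and flag in word_flags[w]
        if quant == "*":
            # Skip this token, or consume one word and stay on the token.
            return step(t + 1, w) or (fits and step(t, w + 1))
        if quant == "?":
            # Skip this token, or consume exactly one word.
            return step(t + 1, w) or (fits and step(t + 1, w + 1))
        return fits and step(t + 1, w + 1)

    return step(0, 0)


print(matches("A*B", [{"A"}, {"A"}, {"B"}]))
print(matches("(1500)*(2000)?", [{"2000"}, {"1500"}]))
```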

COMPOUNDMIN num
Minimum length of words in compound words. The default value is 3
letters.

COMPOUNDFLAG flag
Words marked with COMPOUNDFLAG may be in compound words (except when
the word is shorter than COMPOUNDMIN). Affixes with COMPOUNDFLAG also
permit compounding of affixed words.

COMPOUNDBEGIN flag
Words marked with COMPOUNDBEGIN (or with a marked affix) may be first
elements in compound words.

COMPOUNDLAST flag
Words marked with COMPOUNDLAST (or with a marked affix) may be last
elements in compound words.

COMPOUNDMIDDLE flag
Words marked with COMPOUNDMIDDLE (or with a marked affix) may be
middle elements in compound words.

ONLYINCOMPOUND flag
Suffixes marked with the ONLYINCOMPOUND flag may occur only inside
compounds (Fuge-elements in German, fogemorphemes in Swedish). The
ONLYINCOMPOUND flag also works with words (see
tests/onlyincompound.*).

COMPOUNDPERMITFLAG flag
By default, prefixes are allowed at the beginning of compounds and
suffixes are allowed at the end of compounds. Affixes with
COMPOUNDPERMITFLAG may be inside compounds.

COMPOUNDFORBIDFLAG flag
Suffixes with this flag forbid compounding of the affixed word.

COMPOUNDROOT flag
The COMPOUNDROOT flag marks the compounds in the dictionary (currently
it is used only in the Hungarian language-specific code).

COMPOUNDWORDMAX number
Set the maximum word count in a compound word. (The default is
unlimited.)

CHECKCOMPOUNDDUP
Forbid word duplication in compounds (e.g. foofoo).

CHECKCOMPOUNDREP
Forbid compounding if the (usually bad) compound word may be a
non-compound word with a REP fault. Useful for languages with
`compound friendly' orthography.

CHECKCOMPOUNDCASE
Forbid upper case characters at word boundaries in compounds.

CHECKCOMPOUNDTRIPLE
Forbid compounding if the compound word contains triple repeating
letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte character
support in UTF-8 encoding (works only for 7-bit ASCII characters).

SIMPLIFIEDTRIPLE
Allow simplified 2-letter forms of the compounds forbidden by
CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian (and for
the old German orthography: Schiff|fahrt -> Schiffahrt).
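
The triple-letter test and its simplified form can be illustrated with
a few lines of Python. The function names are hypothetical; the sketch
works on Python characters, so it does not reproduce the 7-bit ASCII
limitation mentioned above.

```python
def has_triple(left, right):
    """True if joining left and right repeats one letter three times."""
    joined = left + right
    boundary = len(left)
    # Only three-letter windows that span the boundary matter.
    for start in (boundary - 2, boundary - 1):
        if start >= 0 and start + 3 <= len(joined):
            window = joined[start:start + 3]
            if window[0] == window[1] == window[2]:
                return True
    return False


def simplified(left, right):
    """Drop one of the three letters, as SIMPLIFIEDTRIPLE would allow."""
    return left + right[1:] if has_triple(left, right) else left + right


print(has_triple("foo", "ox"), has_triple("xo", "oof"))
print(simplified("Schiff", "fahrt"))
```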

CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions

CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
Forbid compounding if the first word in the compound ends with
endchars, the next word begins with beginchars, and (optionally) they
have the requested flags. The optional replacement parameter allows a
simplified compound form. Note: COMPOUNDMIN doesn't work correctly
with the compound word alternation, so COMPOUNDMIN may need to be set
to a lower value.

COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first parameter
is the maximum number of syllables that may be in a compound if the
compound contains more words than COMPOUNDWORDMAX. The second
parameter is the list of vowels (for counting syllables).

SYLLABLENUM flags
Needed for special compounding rules in Hungarian.

PFX flag cross_product number

PFX flag stripping prefix [condition [morphological_fields...]]

SFX flag cross_product number

SFX flag stripping suffix [condition [morphological_fields...]]
An affix is either a prefix or a suffix attached to root words to make
other words. We can define affix classes with an arbitrary number of
affix rules. Affix classes are identified by affix flags. The first
line of an affix class definition is the header. The fields of an
affix class header:

(0) Option name (PFX or SFX)

(1) Flag (name of the affix class)

(2) Cross product (permission to combine prefixes and suffixes).
Possible values: Y (yes) or N (no)

(3) Line count of the following rules.

Fields of an affix rule:

(0) Option name

(1) Flag

(2) Characters to strip from the beginning (in prefix rules) or end
(in suffix rules) of the word

(3) Affix (optionally with flags of continuation classes, separated by
a slash)

(4) Condition.

Zero stripping or affix is indicated by zero. Zero condition is
indicated by a dot. The condition is a simplified, regular
expression-like pattern, which must be met before the affix can be
applied. (A dot stands for an arbitrary character. Characters in
brackets stand for an arbitrary character from the character subset. A
dash has no special meaning, but a circumflex (^) right after the
opening bracket negates the character set.)

(5) Optional morphological fields separated by spaces or tabs.

CIRCUMFIX flag
Affixes marked with the CIRCUMFIX flag may be on a word when this word
also has a prefix with the CIRCUMFIX flag, and vice versa.

FORBIDDENWORD flag
This flag marks a forbidden word form. Because affixed forms are also
forbidden, we can subtract a subset from the set of accepted affixed
and compound words.

FULLSTRIP
With FULLSTRIP, affix rules can strip whole words, not only up to one
character fewer than the word length.

Note: conditions may be as long as the word even without FULLSTRIP.

KEEPCASE flag
Forbid uppercased and capitalized forms of words marked with the
KEEPCASE flag. Useful for special orthographies (measurements and
currency often keep their case in uppercased texts) and writing
systems (e.g. keeping lower case of IPA characters).

Note: With the CHECKSHARPS declaration, words with sharp s and the
KEEPCASE flag may be capitalized and uppercased, but uppercased forms
of these words may not contain sharp s, only SS. See the
germancompounding example in the tests directory of the Hunspell
distribution.

Note: Using a lot of zero affixes may have a big cost, because every
zero affix is checked during affix analysis before the other affixes.

ICONV number_of_ICONV_definitions

ICONV pattern pattern2
Define input conversion table.

OCONV number_of_OCONV_definitions

OCONV pattern pattern2
Define output conversion table.

LEMMA_PRESENT flag
Not used in Hunspell 1.2. Use the "st:" field instead of
LEMMA_PRESENT.

NEEDAFFIX flag
This flag marks virtual stems in the dictionary. Only affixed forms
of these words will be accepted by Hunspell, unless the dictionary
word has a homonym or a zero affix. NEEDAFFIX also works with prefixes
and prefix + suffix combinations (see tests/pseudoroot5.*).

PSEUDOROOT flag
Deprecated. (Former name of the NEEDAFFIX option.)

SUBSTANDARD flag
The SUBSTANDARD flag marks affix rules and dictionary words
(allomorphs) not used in morphological generation (and in suggestion
in future versions). See also NOSUGGEST.

WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line interface
with additional word characters. For example, dot, dash, n-dash,
numbers and the percent sign are word characters in Hungarian.

CHECKSHARPS
The SS letter pair in uppercased (German) words may be an upper case
sharp s (ß). Hunspell can handle this special casing with the
CHECKSHARPS declaration (see also the KEEPCASE flag and the
tests/germancompounding example) in both spelling and suggestion.

Hunspell's dictionary items and affix rules may have optional space or
tab separated morphological description fields, starting with
3-character (two letters and a colon) field IDs:

word/flags po:noun is:nom

Example: We define a simple resource with morphological information, a
derivative suffix (ds:) and a part of speech category (po:):

Affix file:

SFX X Y 1
SFX X 0 able . ds:able

Dictionary file:

drink/X po:verb

Test file:

drink
drinkable

Test:

$ analyze test.aff test.dic test.txt
> drink
analyze(drink) = po:verb
stem(drink) = po:verb
> drinkable
analyze(drinkable) = po:verb ds:able
stem(drinkable) = drinkable

You can see in the example that the analyzer concatenates the
morphological fields in item and arrangement style.

Default morphological and other IDs (used in suggestion, stemming and
morphological generation):

ph: Alternative transliteration for better suggestions. It's useful
for words with foreign pronunciation. (Dictionary based phonetic
suggestion.) For example:

Marseille ph:maarsayl

st: Stem. Optional: the default stem is the dictionary item in
morphological analysis. The stem field is useful for virtual stems
(dictionary words with the NEEDAFFIX flag) and for morphological
exceptions, instead of new morphological rules that would be used only
once.

feet st:foot is:plural
mice st:mouse is:plural
teeth st:tooth is:plural

Word forms with multiple stems need multiple dictionary items:

lay po:verb st:lie is:past_2
lay po:verb is:present
lay po:noun

al: Allomorph(s). A dictionary item is the stem of its allomorphs.
Morphological generation needs stem, allomorph and affix fields.

sing al:sang al:sung
sang st:sing
sung st:sing

po: Part of speech category.

ds: Derivational suffix(es). Stemming doesn't remove derivational
suffixes. Morphological generation depends on the order of the suffix
fields.

In affix rules:

SFX Y Y 1
SFX Y 0 ly . ds:ly_adj

In the dictionary:

ably st:able ds:ly_adj
able al:ably

is: Inflectional suffix(es). All inflectional suffixes are removed by
stemming. Morphological generation depends on the order of the suffix
fields.

feet st:foot is:plural

ts: Terminal suffix(es). Terminal suffix fields are inflectional
suffix fields "removed" by additional (not terminal) suffixes.

Useful for zero morphemes and affixes removed by splitting rules.

work/D ts:present

SFX D Y 2
SFX D 0 ed . is:past_1
SFX D 0 ed . is:past_2

A typical example of the terminal suffix is the zero morpheme of the
nominative case.

sp: Surface prefix. Temporary solution for adding prefixes to the
stems and generated word forms. See the tests/morph.* example.

pa: Parts of the compound words. Output fields of morphological
analysis for stemming.

dp: Planned: derivational prefix.

ip: Planned: inflectional prefix.

tp: Planned: terminal prefix.

Ispell's original algorithm strips only one suffix. Hunspell can strip
one more (or an additional prefix in COMPLEXPREFIXES mode).

Twofold suffix stripping is a significant improvement in handling the
immense number of suffixes that characterize agglutinative languages.

A second `s' suffix (affix class Y) will be the continuation class of
the suffix `able' in the following example:

SFX Y Y 1
SFX Y 0 s .

SFX X Y 1
SFX X 0 able/Y .

Dictionary file:

drink/X

Test file:

drink
drinkable
drinkables

Test:

$ hunspell -m -d test <test.txt
drink st:drink
drinkable st:drink fl:X
drinkables st:drink fl:X fl:Y
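
The role of the continuation class can be shown with a small generator
in Python. It is a sketch, not Hunspell's analysis code: the table
layout and the function name are assumptions, and the rules are
hard-coded from the example above with their conditions and strip
fields left out.

```python
# Suffix classes from the example: flag -> list of (affix, continuation flags).
SUFFIX_CLASSES = {
    "X": [("able", "Y")],
    "Y": [("s", "")],
}


def generate(word, flags, depth=2):
    """Generate word forms using at most `depth` suffixes (two by default)."""
    forms = [word]
    if depth == 0:
        return forms
    for flag in flags:
        for affix, continuation in SUFFIX_CLASSES.get(flag, []):
            # The continuation flags decide which suffix may follow.
            forms.extend(generate(word + affix, continuation, depth - 1))
    return forms


print(generate("drink", "X"))           # twofold stripping
print(generate("drink", "X", depth=1))  # single suffix only
```

With depth=1 the sketch behaves like single suffix stripping, so
"drinkables" is no longer produced.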

In theory, twofold suffix stripping needs only the square root of the
number of suffix rules needed by an implementation with single suffix
stripping. In practice, we were able to describe the Hungarian
inflectional morphology with twofold suffix stripping.

Hunspell can handle more than 65000 affix classes. There are three new
syntaxes for giving flags in affix and dictionary files.

The FLAG long command sets 2-character flags:

FLAG long
SFX Y1 Y 1
SFX Y1 0 s 1

Dictionary record with the Y1, Z3, F? flags:

foo/Y1Z3F?

The FLAG num command sets numerical flags separated by commas:

FLAG num
SFX 65000 Y 1
SFX 65000 0 s 1

Dictionary example:

foo/65000,12,2756

The third one is the Unicode character flags.
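
How a flag field is split under each flag type can be sketched in
Python as follows. The function name and the type labels passed to it
are illustrative choices; the sketch treats the default and UTF-8 types
alike, one character per flag, because Python strings are already
sequences of Unicode characters.

```python
def parse_flags(field, flag_type="default"):
    """Split the flag field of a dictionary entry by FLAG type."""
    if flag_type == "long":
        # Two characters per flag.
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        # Decimal numbers separated by commas.
        return [int(number) for number in field.split(",")]
    # Default and UTF-8 types: one character per flag.
    return list(field)


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
print(parse_flags("AB"))
```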

Hunspell's dictionary can contain repeating elements that are
homonyms:

work/A po:verb
work/B po:noun

An affix file:

SFX A Y 1
SFX A 0 s . is:sg3

SFX B Y 1
SFX B 0 s . is:plur

Test file:

works

Test:

$ hunspell -d test -m <testwords
work st:work po:verb is:sg3
work st:work po:noun is:plur

This feature also gives a way to forbid illegal prefix/suffix
combinations.

An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For instance,
in Hungarian, superlatives are formed by simultaneous prefixation of
leg- and suffixation of -bb to the adjective base. A problem with the
one-level architecture is that there is no way to render lexical
licensing of particular prefixes and suffixes interdependent, and
therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
vén `old'. Until the introduction of clusters, a special treatment of
the superlative had to be hardwired in the earlier HunSpell code. This
may have been legitimate for a single case, but in fact prefix--suffix
dependencies are ubiquitous in category-changing derivational patterns
(cf. English payable, non-payable but *non-pay or drinkable,
undrinkable but *undrink). In simple words, here, the prefix un- is
legitimate only if the base drink is suffixed with -able. If both these
patterns are handled by on-line affix rules and affix rules are checked
against the base only, there is no way to express this dependency and
the system will necessarily over- or undergenerate.

In the next example, suffix class R has a prefix `continuation' class
(class P).

PFX P Y 1
PFX P 0 un . [prefix_un]+

SFX S Y 1
SFX S 0 s . +PL

SFX Q Y 1
SFX Q 0 s . +3SGV

SFX R Y 1
SFX R 0 able/PS . +DER_V_ADJ_ABLE

Dictionary:

2
drink/RQ [verb]
drink/S [noun]

Morphological analysis:

> drink
drink[verb]
drink[noun]
> drinks
drink[verb]+3SGV
drink[noun]+PL
> drinkable
drink[verb]+DER_V_ADJ_ABLE
> drinkables
drink[verb]+DER_V_ADJ_ABLE+PL
> undrinkable
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE
> undrinkables
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
> undrink
Unknown word.
> undrinks
Unknown word.

Conditional affixes implemented by a continuation class are not enough
for circumfixes, because a circumfix is one affix in morphology. We
also need the CIRCUMFIX option for correct morphological analysis.

# circumfixes: ~ obligate prefix/suffix combinations
# superlative in Hungarian: leg- (prefix) AND -bb (suffix)
# nagy, nagyobb, legnagyobb, legeslegnagyobb
# (great, greater, greatest, most greatest)

CIRCUMFIX X

PFX A Y 1
PFX A 0 leg/X .

PFX B Y 1
PFX B 0 legesleg/X .

SFX C Y 3
SFX C 0 obb . +COMPARATIVE
SFX C 0 obb/AX . +SUPERLATIVE
SFX C 0 obb/BX . +SUPERSUPERLATIVE

Dictionary:

1
nagy/C [MN]

Analysis:

> nagy
nagy[MN]
> nagyobb
nagy[MN]+COMPARATIVE
> legnagyobb
nagy[MN]+SUPERLATIVE
> legeslegnagyobb
nagy[MN]+SUPERSUPERLATIVE

Allowing free compounding decreases the precision of recognition, not
to mention stemming and morphological analysis. Although lexical
switches are introduced to license compounding of bases by Ispell,
this proves not to be restrictive enough. For example:

# affix file
COMPOUNDFLAG X

2
foo/X
bar/X

With this resource, foobar and barfoo are also accepted words.

This has been improved upon with the introduction of
direction-sensitive compounding, i.e., lexical features can specify
separately whether a base can occur as the leftmost or rightmost
constituent in compounds. This, however, is still insufficient to
handle the intricate patterns of compounding, not to mention
idiosyncratic (and language specific) norms of hyphenation.

The MySpell algorithm allows any affixed form of words that are
lexically marked as potential members of compounds. Hunspell improved
this, and its recursive compound checking rules make it possible to
implement the intricate spelling conventions of Hungarian compounds.
For example, the noteworthy Hungarian `6-3' rule can be set using the
COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
options. As a further example from Hungarian, derivational suffixes
often modify compounding properties. Hunspell allows compounding flags
on affixes, and there are two special flags (COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
derivations.

Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the
affixed word.

We also need several Hunspell features for handling German compounding:

# German compounding

# set language to handle special casing of German sharp s

LANG de_DE

# compound flags

COMPOUNDBEGIN U
COMPOUNDMIDDLE V
COMPOUNDEND W

# Prefixes are allowed at the beginning of compounds,
# suffixes are allowed at the end of compounds by default:
# (prefix)?(root)+(affix)?
# Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
COMPOUNDPERMITFLAG P

# for German fogemorphemes (Fuge-element)
# Hint: ONLYINCOMPOUND is not required everywhere, but the
# checking will be a little faster with it.

ONLYINCOMPOUND X

# forbid uppercase characters at compound word bounds
CHECKCOMPOUNDCASE

# for handling Fuge-elements with dashes (Arbeits-)
# dash will be a special word

COMPOUNDMIN 1
WORDCHARS -

# compound settings and fogemorpheme for `Arbeit'

SFX A Y 3
SFX A 0 s/UPX .
SFX A 0 s/VPDX .
SFX A 0 0/WXD .

SFX B Y 2
SFX B 0 0/UPX .
SFX B 0 0/VWXDP .

# a suffix for `Computer'

SFX C Y 1
SFX C 0 n/WD .

# for forbid exceptions (*Arbeitsnehmer)

FORBIDDENWORD Z

# dash prefix for compounds with dash (Arbeits-Computer)

PFX - Y 1
PFX - 0 -/P .

# decapitalizing prefix
# circumfix for positioning in compounds

PFX D Y 29
PFX D A a/PX A
PFX D Ä ä/PX Ä
.
.
PFX D Y y/PX Y
PFX D Z z/PX Z

Example dictionary:

4
Arbeit/A-
Computer/BC-
-/W
Arbeitsnehmer/Z

Accepted compound words with the previous resource:

Computer
Computern
Arbeit
Arbeits-
Computerarbeit
Computerarbeits-
Arbeitscomputer
Arbeitscomputern
Computerarbeitscomputer
Computerarbeitscomputern
Arbeitscomputerarbeit
Computerarbeits-Computer
Computerarbeits-Computern

Compounds that are not accepted:

computer
arbeit
Arbeits
arbeits
ComputerArbeit
ComputerArbeits
Arbeitcomputer
ArbeitsComputer
Computerarbeitcomputer
ComputerArbeitcomputer
ComputerArbeitscomputer
Arbeitscomputerarbeits
Computerarbeits-computer
Arbeitsnehmer

This solution is still not ideal, however, and will be replaced by a
pattern-based compound-checking algorithm which is closely integrated
with input buffer tokenization. Patterns describing compounds come as a
separate input resource that can refer to high-level properties of
constituent parts (e.g. the number of syllables, affix flags, and
containment of hyphens). The patterns are matched against potential
segmentations of compounds to assess wellformedness.

Both Ispell and Myspell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a language
like Hungarian has a standard 8-bit character set (ISO 8859-2), it
fails to allow a full implementation of Hungarian orthographic
conventions. For instance, the '--' symbol (n-dash) is missing from
this character set, although it is not only the official symbol to
delimit parenthetic clauses in the language, but it can also be in
compound words as a special 'big' hyphen.

MySpell has some 8-bit encoding tables, but there are also languages
without a standard 8-bit encoding. For example, a lot of African
languages have non-Latin or extended Latin characters.

Similarly, using the original spelling of certain foreign names like
Ångström or Molière is encouraged by the Hungarian spelling norm, and,
since the characters 'Å' and 'è' are not part of ISO 8859-2, when they
combine with inflections containing characters only in ISO 8859-2 (like
elative -ből, ablative -től or delative -ről with double acute), these
result in words (like Ångströmről or Molière-től) that cannot be
encoded using any single 8-bit encoding scheme.

The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding-independence has its advantages when it comes
to a truly multi-lingual application. Hunspell implements memory- and
time-efficient Unicode handling. In non-UTF-8 character encodings
Hunspell works with the original 8-bit strings. In UTF-8 encoding,
affixes and words are stored in UTF-8, are handled mostly in UTF-8
during analysis, and are converted to UTF-16 for condition checking and
suggestion. Unicode text analysis and spell checking have a minimal
(0-20%) time overhead and a minimal or reasonable memory overhead,
depending on the language (its UTF-8 encoding and affixation).

hunspell(1), ispell(1), ispell(4)

2008-08-15 hunspell(4)