hunspell(4)                Kernel Interfaces Manual                hunspell(4)

NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define how a language is spell
       checked: a dictionary file containing words and applicable flags, and
       an affix file that specifies how these flags control spell checking.
       An optional third file is the personal dictionary file.

Dictionary file

       A dictionary file (*.dic) contains a list of words, one per line.  The
       first line of every dictionary (except personal dictionaries) contains
       the approximate word count (used to size the hash table).  Each word
       may optionally be followed by a slash ("/") and one or more flags,
       which represent the word's attributes, for example its affixes.

       Note: Dictionary words can also contain slashes when these are escaped
       with the "\/" syntax.

Personal dictionary file

       Personal dictionaries are simple word lists.  An asterisk in the first
       character position marks a forbidden word.  A second word, separated
       from the first by a slash, is a model word whose affixation the first
       word inherits.

              foo
              Foo/Simpson
              *bar

       In this example, "foo" and "Foo" are personal words; "Foo" is also
       recognized with the affixes of "Simpson" (Foo's etc.), and "bar" is a
       forbidden word.
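
       The line format described above can be illustrated with a small
       Python sketch.  It is not part of Hunspell; the function name
       parse_personal_line is hypothetical, and escaped slashes are not
       handled.

```python
def parse_personal_line(line):
    """Return (word, model_word, forbidden) for one personal dictionary line."""
    # A leading asterisk marks a forbidden word.
    forbidden = line.startswith("*")
    body = line[1:] if forbidden else line
    # An optional second word after the slash is the affixation model.
    word, _, model = body.partition("/")
    return word, (model or None), forbidden


for entry in ["foo", "Foo/Simpson", "*bar"]:
    print(parse_personal_line(entry))
```

       For the three example lines this yields a plain word, a word with
       the model "Simpson", and a forbidden word.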
39
40

Short example

       Dictionary file:

              3
              hello
              try/B
              work/AB

       The flags B and A specify attributes of these words.

       Affix file:

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

              REP 2
              REP f ph
              REP ph f

              PFX A Y 1
              PFX A 0 re .

              SFX B Y 2
              SFX B 0 ed [^y]
              SFX B y ied y

       In the affix file, prefix class A and suffix class B are defined.
       Class A defines a `re-' prefix.  Class B defines two `-ed' suffixes:
       the first rule appends `ed' to a word whose last character is not
       `y'; the second applies to words ending in `y', stripping the `y'
       and appending `ied'.

       All words accepted with this dictionary and affix combination are:
       "hello", "try", "tried", "work", "worked", "rework", "reworked".
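
       How the accepted word list follows from the two files can be
       sketched in Python.  This is a simplified illustration, not
       Hunspell's implementation: the rule tuples, apply_rule and expand
       are hypothetical names, conditions are treated as ordinary regular
       expressions, and only single-character flags are handled.

```python
import re

# Affix rules from the short example: (kind, flag, strip, add, condition).
RULES = [
    ("PFX", "A", "", "re", "."),
    ("SFX", "B", "", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(word, rule):
    """Return the affixed form, or None if the rule does not apply."""
    kind, _flag, strip, add, cond = rule
    if kind == "SFX":
        # Suffix conditions are checked against the end of the word.
        if not re.search(cond + "$", word) or not word.endswith(strip):
            return None
        return word[:len(word) - len(strip)] + add
    # Prefix conditions are checked against the start of the word.
    if not re.match(cond, word) or not word.startswith(strip):
        return None
    return add + word[len(strip):]


def expand(dictionary, rules):
    accepted = set()
    for stem, flags in dictionary.items():
        accepted.add(stem)
        prefixes = [r for r in rules if r[0] == "PFX" and r[1] in flags]
        suffixes = [r for r in rules if r[0] == "SFX" and r[1] in flags]
        suffixed = [w for w in (apply_rule(stem, r) for r in suffixes) if w]
        accepted.update(suffixed)
        # Both headers say Y (cross product), so prefixes may also be
        # combined with the suffixed forms.
        for base in [stem] + suffixed:
            for rule in prefixes:
                form = apply_rule(base, rule)
                if form:
                    accepted.add(form)
    return accepted


print(sorted(expand(DICTIONARY, RULES)))
```

       The printed set contains exactly the seven words listed above.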
76
77

AFFIX FILE GENERAL OPTIONS

       The Hunspell source distribution contains more than 80 examples of
       option usage.
81
82
83       SET encoding
84              Set character encoding of words and morphemes in affix and  dic‐
85              tionary  files.  Possible values: UTF-8, ISO8859-1 - ISO8859-10,
86              ISO8859-13  -  ISO8859-15,  KOI8-R,  KOI8-U,   microsoft-cp1251,
87              ISCII-DEVANAGARI.
88
89              SET UTF-8
90
91       FLAG value
              Set the flag type.  The default type is a single extended ASCII
              (8-bit) character.  The `UTF-8' value selects UTF-8 encoded
              Unicode character flags, the `long' value selects two-character
              extended ASCII flags, and the `num' value selects decimal number
              flags.  Decimal flags are numbered from 1 to 65000 and are
              separated by commas in flag fields.  BUG: the UTF-8 flag type
              doesn't work on the ARM platform.
99
100              FLAG long
101
102       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping),
              e.g. for morphologically complex languages with a right-to-left
              writing system.
106
107
108       LANG langcode
109              Set  language  code for language specific functions of Hunspell.
110              Use it to activate special casing of Azeri (LANG az) and Turkish
111              (LANG tr).
112
113       IGNORE characters
              Sets characters to ignore in dictionary words, affixes and
              input words.  Useful for optional characters, such as Arabic
              (harakat) or Hebrew (niqqud) diacritical marks (see the
              tests/ignore.* test dictionary in the Hunspell distribution).
118
119
120       AF number_of_flag_vector_aliases
121
122       AF flag_vector
              Hunspell can substitute affix flag sets with ordinal numbers in
              affix rules (alias compression, see the makealias tool).  The
              short example above, rewritten with alias compression:
126
127              3
128              hello
129              try/1
130              work/2
131
132       AF definitions in the affix file:
133
134              AF 2
135              AF A
136              AF AB
137
       This is equivalent to the following dic file:
139
140              3
141              hello
142              try/A
143              work/AB
144
145       See also tests/alias* examples of the source distribution.
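
       The alias lookup can be sketched in a few lines of Python.  This
       is only an illustration under the assumption that the real flags
       are non-numeric (the default flag type); expand_aliases is a
       hypothetical helper, not a Hunspell function.

```python
def expand_aliases(dic_lines, af_lines):
    """Replace numeric alias references in dic lines with their flag sets."""
    # af_lines[0] is the "AF <count>" header; alias numbering starts at 1.
    aliases = [line.split()[1] for line in af_lines[1:]]
    expanded = []
    for line in dic_lines:
        word, sep, ref = line.partition("/")
        if sep and ref.isdigit():
            line = word + "/" + aliases[int(ref) - 1]
        expanded.append(line)
    return expanded


AF = ["AF 2", "AF A", "AF AB"]
DIC = ["3", "hello", "try/1", "work/2"]
print(expand_aliases(DIC, AF))
```

       Applied to the compressed example, this reproduces the
       uncompressed dic file shown above.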
146
147       Note I: If affix file contains the FLAG parameter, define it before the
148       AF definitions.
149
150       Note II: Use makealias utility in Hunspell distribution to compress aff
151       and dic files.
152
153       AM number_of_morphological_aliases
154
155       AM morphological_fields
              Hunspell can also substitute morphological data with ordinal
              numbers in affix rules (alias compression).  See the
              tests/alias* examples.
159

AFFIX FILE OPTIONS FOR SUGGESTION

       Suggestion parameters can tune Hunspell's default n-gram suggestions
       (a similarity search in the dictionary words based on common
       character sequences of length 1, 2, 3 and 4), character swap
       suggestions and deletion suggestions.  REP is recommended for fixing
       typical and especially serious language-specific spelling errors,
       because REP suggestions have the highest priority in the suggestion
       list.  PHONE is for languages whose orthography is not pronunciation
       based.
168
169       KEY characters_separated_by_vertical_line_optionally
              Hunspell searches for and suggests words that differ from the
              input by one character replaced with a neighboring KEY
              character.  Characters that are not neighbors are separated in
              the KEY string by vertical line characters.  Suggested KEY
              parameters for QWERTY and Dvorak keyboard layouts:
174
175              KEY qwertyuiop|asdfghjkl|zxcvbnm
176              KEY pyfgcrl|aeouidhtns|qjkxbmwvz
177
178       Using  the first QWERTY layout, Hunspell suggests "nude" and "node" for
179       "*nide". A character may have more neighbors, too:
180
181              KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
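
       The neighbor search can be sketched as follows.  This is an
       illustration only: key_neighbors, key_suggestions and the tiny
       LEXICON are hypothetical, and the order of the results is not
       meant to match Hunspell's ranking.

```python
def key_neighbors(key):
    """Map each character to the characters adjacent to it in a KEY string."""
    neighbors = {}
    for group in key.split("|"):
        for i, ch in enumerate(group):
            adjacent = group[max(i - 1, 0):i] + group[i + 1:i + 2]
            # A character may occur in several groups, so merge the sets.
            neighbors.setdefault(ch, set()).update(adjacent)
    return neighbors


def key_suggestions(word, key, lexicon):
    """Suggest lexicon words reachable by one neighbor-key replacement."""
    neighbors = key_neighbors(key)
    found = []
    for i, ch in enumerate(word):
        for repl in sorted(neighbors.get(ch, ())):
            candidate = word[:i] + repl + word[i + 1:]
            if candidate in lexicon and candidate not in found:
                found.append(candidate)
    return found


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
LEXICON = {"nude", "node", "nice"}  # hypothetical word list
print(key_suggestions("nide", QWERTY, LEXICON))
```

       With the QWERTY definition, "i" has the neighbors "u" and "o", so
       "nide" leads to "node" and "nude" but not to "nice".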
182
183       TRY characters
184              Hunspell can suggest right word forms, when they differ from the
185              bad  input  word  by  one TRY character. The parameter of TRY is
186              case sensitive.
187
188       NOSUGGEST flag
189              Words signed with NOSUGGEST flag are not  suggested  (but  still
190              accepted  when  typed  correctly).  Proposed flag for vulgar and
191              obscene words (see also SUBSTANDARD).
192
193       MAXCPDSUGS num
              Set the maximum number of suggested compound words generated
              by compound rules.  The number of suggested compound words may
              exceed this limit when they have the same 1-character distance
              type.
197
198       MAXNGRAMSUGS num
199              Set max. number of n-gram suggestions. Value 0 switches off  the
200              n-gram suggestions (see also MAXDIFF).
201
202       MAXDIFF [0-10]
203              Set  the similarity factor for the n-gram based suggestions (5 =
204              default value; 0 = fewer n-gram suggestions, but min.  1;  10  =
205              MAXNGRAMSUGS n-gram suggestions).
206
207       ONLYMAXDIFF
208              Remove  all  bad n-gram suggestions (default mode keeps one, see
209              MAXDIFF).
210
211       NOSPLITSUGS
212              Disable word suggestions with spaces.
213
214       SUGSWITHDOTS
215              Add dot(s) to suggestions, if input word terminates  in  dot(s).
216              (Not for OpenOffice.org dictionaries, because OpenOffice.org has
217              an automatic dot expansion mechanism.)
218
219       REP number_of_replacement_definitions
220
221       REP what replacement
              This table specifies modifications to try first.  The first
              REP line is the header of this table, and one or more REP data
              lines follow it.  With this table, Hunspell can suggest the
              right forms for typical spelling mistakes when the incorrect
              form differs from the right form by more than one letter.  The
              search string supports the regex boundary signs (^ and $).
              For example, a possible English replacement table definition
              to handle misspelled consonants:
230
231              REP 5
232              REP f ph
233              REP ph f
234              REP tion$ shun
235              REP ^cooccurr co-occurr
236              REP ^alot$ a_lot
237
238       Note  I:  It's  very useful to define replacements for the most typical
239       one-character mistakes, too: with REP you can add higher priority to  a
240       subset of the TRY suggestions (suggestion list begins with the REP sug‐
241       gestions).
242
       Note II: To suggest separate words, write spaces as underscores:
244
245
246              REP 1
247              REP onetwothree one_two_three
248
249       Note III: Replacement table can be used for a  stricter  compound  word
250       checking with the option CHECKCOMPOUNDREP.
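
       How a REP table produces candidate corrections can be sketched in
       Python.  This is a simplified illustration, not Hunspell's code:
       rep_candidates is a hypothetical name, and a real checker would
       keep only those candidates that are accepted words.

```python
import re

# The example replacement table from above, as (what, replacement) pairs.
REP_TABLE = [
    ("f", "ph"),
    ("ph", "f"),
    ("tion$", "shun"),
    ("^cooccurr", "co-occurr"),
    ("^alot$", "a_lot"),
]


def rep_candidates(word, table):
    """Return every form obtained by applying one REP rule at one position."""
    candidates = []
    for what, replacement in table:
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        core = what[1:] if at_start else what
        core = core[:-1] if at_end else core
        # Only ^ and $ keep their regex meaning; the rest is literal text.
        pattern = ("^" if at_start else "") + re.escape(core) + ("$" if at_end else "")
        for match in re.finditer(pattern, word):
            # Underscores in the replacement stand for spaces.
            fixed = replacement.replace("_", " ")
            candidates.append(word[:match.start()] + fixed + word[match.end():])
    return candidates


print(rep_candidates("fone", REP_TABLE))
print(rep_candidates("alot", REP_TABLE))
```

       For "fone" the only candidate is "phone"; for "alot" it is the
       two-word form "a lot".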
251
252
253       MAP number_of_map_definitions
254
255       MAP string_of_related_chars_or_parenthesized_character_sequences
256              We  can  define language-dependent information on characters and
257              character sequences that  should  be  considered  related  (i.e.
258              nearer than other chars not in the set) in the affix file (.aff)
259              by a map table.  With this table, Hunspell can suggest the right
260              forms  for  words,  which incorrectly choose the wrong letter or
261              letter groups from a related set more than once in a  word  (see
262              REP).
263
264              For  example a possible mapping could be for the German umlauted
265              ü versus the regular u; the  word  Frühstück  really  should  be
266              written with umlauted u's and not regular ones
267
268              MAP 1
269              MAP uü
270
271       Use parenthesized groups for character sequences (eg. for composed Uni‐
272       code characters):
273
274              MAP 3
275              MAP ß(ss)  (character sequence)
276              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
277              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)
278
279       PHONE number_of_phone_definitions
280
281       PHONE what replacement
              PHONE uses a table-driven phonetic transcription algorithm
              borrowed from Aspell.  It is useful for languages whose
              orthography is not pronunciation based.  You can add a full
              alphabet conversion and other rules for the conversion of
              special letter sequences.  For detailed documentation see
              http://aspell.net/man-html/Phonetic-Code.html.  Note: Multibyte
              UTF-8 characters do not yet work in bracket expressions.  Dash
              expressions operate on signed bytes, not on UTF-8 characters,
              for now.
290
291       WARN flag
              This flag is for rare words which are also often spelling
              mistakes; see option -r of command line Hunspell and
              FORBIDWARN.
294
295       FORBIDWARN
296              Words with flag WARN aren't accepted by the spell checker  using
297              this parameter.
298

OPTIONS FOR COMPOUNDING

300       BREAK number_of_break_definitions
301
302       BREAK character_or_character_sequence
              Define new break points for breaking words and checking the
              word parts separately.  Use ^ and $ to delete characters at
              the start and end of the word, respectively.  Rationale: useful
              for compounding with joining characters or strings (for
              example, the hyphen in English and German, or the hyphen and
              n-dash in Hungarian).  Dashes are often bad break points for
              tokenization, because compounds with dashes may also contain
              invalid parts.  With BREAK, Hunspell can check both sides of
              these compounds, breaking the words at dashes and n-dashes:
312
313              BREAK 2
314              BREAK -
315              BREAK --    # n-dash
316
       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds.  Note: The default word break of Hunspell is
       equivalent to the following BREAK definition:
320
321              BREAK 3
322              BREAK -
323              BREAK ^-
324              BREAK -$
325
326       Hunspell  doesn't  accept  the  "-word" and "word-" forms by this BREAK
327       definition:
328
329              BREAK 1
330              BREAK -
331
332       Switching off the default values:
333
334              BREAK 0
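
       The recursive breaking described above can be sketched in Python.
       This is a simplified model, not Hunspell's implementation:
       is_valid and the two-word LEXICON are hypothetical, and "--"
       stands for the n-dash as in the example.

```python
def is_valid(word, lexicon, breaks):
    """Accept a word if it, or all of its parts after breaking, is known."""
    if word in lexicon:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # ^x: drop x at the start of the word and check the rest.
            sep = pattern[1:]
            if sep and word.startswith(sep) and is_valid(word[len(sep):], lexicon, breaks):
                return True
        elif pattern.endswith("$"):
            # x$: drop x at the end of the word and check the rest.
            sep = pattern[:-1]
            if sep and word.endswith(sep) and is_valid(word[:-len(sep)], lexicon, breaks):
                return True
        else:
            # Break inside the word; both parts must be non-empty and valid.
            pos = word.find(pattern, 1)
            while pos != -1 and pos + len(pattern) < len(word):
                left, right = word[:pos], word[pos + len(pattern):]
                if is_valid(left, lexicon, breaks) and is_valid(right, lexicon, breaks):
                    return True
                pos = word.find(pattern, pos + 1)
    return False


LEXICON = {"foo", "bar"}  # hypothetical dictionary
DEFAULT_BREAKS = ["-", "^-", "-$"]

print(is_valid("foo-foo--bar-bar", LEXICON, ["-", "--"]))
print(is_valid("-foo", LEXICON, DEFAULT_BREAKS))
print(is_valid("-foo", LEXICON, ["-"]))
```

       In this model "-foo" is accepted with the default break
       definition but rejected when only "BREAK -" is defined, matching
       the behavior described above.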
335
336       Note II: COMPOUNDRULE is better for handling dashes and other  compound
337       joining  characters  or  character  strings.  Use BREAK, if you want to
338       check words with dashes or other joining characters  and  there  is  no
339       time  or  possibility  to  describe  precise  compound  rules with COM‐
340       POUNDRULE (COMPOUNDRULE handles only the suffixation of the  last  word
341       part of a compound word).
342
       Note III: For command line spell checking of words with extra
       characters, set the WORDCHARS parameter, for example WORDCHARS ---
       (see the tests/break.* example).
346
347       COMPOUNDRULE number_of_compound_definitions
348
349       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax.  The
              first COMPOUNDRULE line is a header giving the number of
              COMPOUNDRULE definitions that follow.  Compound patterns
              consist of compound flags, parentheses, and the star and
              question mark metacharacters.  A flag followed by a `*' matches
              a sequence of zero or more words signed with this compound
              flag.  A flag followed by a `?' matches zero or one word
              signed with this compound flag.  See the tests/compound*.*
              examples.
359
360              Note:  en_US  dictionary of OpenOffice.org uses COMPOUNDRULE for
361              ordinal number recognition (1st, 2nd, 11th, 12th,  22nd,  112th,
362              1000122nd etc.).
363
364              Note  II:  In the case of long and numerical flag types use only
365              parenthesized flags: (1500)*(2000)?
366
              Note III: COMPOUNDRULE flags work completely separately from
              the compounding mechanism that uses COMPOUNDFLAG,
              COMPOUNDBEGIN, etc. compound flags.  (Use these flags on
              different entries for words.)
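
       The pattern matching behind COMPOUNDRULE can be sketched in
       Python.  This is an illustration under stated assumptions: flags
       are single letters, the compound has already been split into its
       word parts, and matches_rule, FLAGS_OF and the rule "A*B?" are
       hypothetical.

```python
import re
from itertools import product


def matches_rule(parts, flags_of, rule):
    """Check whether the flags of the word parts can match a compound rule."""
    # With single-character flags, "(A)" means the same as "A".
    regex = re.compile(rule.replace("(", "").replace(")", ""))
    # Each part may carry several flags; try every combination.
    choices = [flags_of.get(part, "") for part in parts]
    return any(regex.fullmatch("".join(combo)) for combo in product(*choices))


FLAGS_OF = {"foo": "A", "bar": "B"}  # hypothetical dictionary flags
print(matches_rule(["foo", "foo", "bar"], FLAGS_OF, "A*B?"))
print(matches_rule(["bar", "foo"], FLAGS_OF, "A*B?"))
```

       Here "foo"+"foo"+"bar" matches the rule A*B?, while "bar"+"foo"
       does not, because a B-flagged word may only come last.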
371
372
373       COMPOUNDMIN num
374              Minimum  length of words used for compounding.  Default value is
375              3 letters.
376
377       COMPOUNDFLAG flag
              Words signed with COMPOUNDFLAG may be in compound words (except
              when the word is shorter than COMPOUNDMIN).  Affixes with
              COMPOUNDFLAG also permit compounding of affixed words.
381
382       COMPOUNDBEGIN flag
383              Words signed with COMPOUNDBEGIN (or with a signed affix) may  be
384              first elements in compound words.
385
386       COMPOUNDLAST flag
387              Words  signed  with COMPOUNDLAST (or with a signed affix) may be
388              last elements in compound words.
389
390       COMPOUNDMIDDLE flag
391              Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
392              middle elements in compound words.
393
394       ONLYINCOMPOUND flag
395              Suffixes  signed  with ONLYINCOMPOUND flag may be only inside of
396              compounds (Fuge-elements in German, fogemorphemes  in  Swedish).
397              ONLYINCOMPOUND  flag works also with words (see tests/onlyincom‐
398              pound.*).  Note: also valuable to flag compounding  parts  which
399              are not correct as a word by itself.
400
401       COMPOUNDPERMITFLAG flag
402              Prefixes are allowed at the beginning of compounds, suffixes are
403              allowed at the end of compounds by default.  Affixes  with  COM‐
404              POUNDPERMITFLAG may be inside of compounds.
405
406       COMPOUNDFORBIDFLAG flag
407              Suffixes with this flag forbid compounding of the affixed word.
408
409       COMPOUNDROOT flag
410              COMPOUNDROOT  flag signs the compounds in the dictionary (Now it
411              is used only in the Hungarian language specific code).
412
413       COMPOUNDWORDMAX number
414              Set maximum word count in a compound word.  (Default  is  unlim‐
415              ited.)
416
417       CHECKCOMPOUNDDUP
418              Forbid word duplication in compounds (e.g. foofoo).
419
420       CHECKCOMPOUNDREP
421              Forbid  compounding, if the (usually bad) compound word may be a
422              non compound word with a REP fault. Useful  for  languages  with
423              `compound friendly' orthography.
424
425       CHECKCOMPOUNDCASE
426              Forbid upper case characters at word boundaries in compounds.
427
428       CHECKCOMPOUNDTRIPLE
429              Forbid  compounding,  if compound word contains triple repeating
430              letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
431              ter  support in UTF-8 encoding (works only for 7-bit ASCII char‐
432              acters).
433
434       SIMPLIFIEDTRIPLE
435              Allow simplified 2-letter forms of the  compounds  forbidden  by
436              CHECKCOMPOUNDTRIPLE.  It's useful for Swedish and Norwegian (and
437              for the old German orthography: Schiff|fahrt -> Schiffahrt).
438
439       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
440
441       CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
442              Forbid compounding, if the first word in the compound ends  with
443              endchars,  and next word begins with beginchars and (optionally)
444              they have the requested flags.  The optional replacement parame‐
445              ter allows simplified compound form.
446
447              The  special  "endchars" pattern 0 (zero) limits the rule to the
448              unmodified stems (stems and stems with zero affixes):
449
450              CHECKCOMPOUNDPATTERN 0/x /y
451
       Note: COMPOUNDMIN doesn't work correctly with compound word
       alternation, so COMPOUNDMIN may need to be set to a lower value.
454
455       FORCEUCASE flag
              The last word part of a compound with the FORCEUCASE flag
              forces capitalization of the whole compound word.  E.g. the
              Dutch word "straat" (street) with the FORCEUCASE flag is
              allowed only in capitalized compound forms, according to the
              Dutch spelling rules for proper names.
461
462       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian.  The first
              parameter is the maximum number of syllables allowed in a
              compound when the compound contains more words than
              COMPOUNDWORDMAX.  The second parameter is the list of vowels
              (used for counting syllables).
467
468       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.
470

AFFIX FILE OPTIONS FOR AFFIX CREATION

472       PFX flag cross_product number
473
474       PFX flag stripping prefix [condition [morphological_fields...]]
475
476       SFX flag cross_product number
477
478       SFX flag stripping suffix [condition [morphological_fields...]]
              An affix is either a prefix or a suffix attached to root words
              to make other words.  Affix classes can be defined with an
              arbitrary number of affix rules.  Affix classes are identified
              by affix flags.  The first line of an affix class definition
              is the header.  The fields of an affix class header:
484
485              (0) Option name (PFX or SFX)
486
487              (1) Flag (name of the affix class)
488
489              (2) Cross product (permission to combine prefixes and suffixes).
490              Possible values: Y (yes) or N (no)
491
492              (3) Line count of the following rules.
493
              Fields of an affix rule:
495
496              (0) Option name
497
498              (1) Flag
499
500              (2) stripping characters from beginning (at prefix rules) or end
501              (at suffix rules) of the word
502
503              (3) affix (optionally with flags of continuation classes,  sepa‐
504              rated by a slash)
505
506              (4) condition.
507
              Zero stripping or a zero affix is indicated by zero.  A zero
              condition is indicated by a dot.  The condition is a simplified,
              regular expression-like pattern which must be met before the
              affix can be applied.  (A dot matches an arbitrary character.
              Characters in square brackets match any one character from
              that set.  A dash has no special meaning, but a circumflex (^)
              right after the opening bracket complements the character
              set.)
515
516              (5) Optional morphological fields separated by spaces or tabula‐
517              tors.
518
519

AFFIX FILE OTHER OPTIONS

521       CIRCUMFIX flag
522              Affixes signed with CIRCUMFIX flag may be on a  word  when  this
523              word  also  has a prefix with CIRCUMFIX flag and vice versa (see
524              circumfix.* test files in the source distribution).
525
526       FORBIDDENWORD flag
              This flag marks a forbidden word form.  Because affixed forms
              are also forbidden, we can subtract a subset from the set of
              accepted affixed and compound words.  Note: useful for
              forbidding erroneous words generated by the compounding
              mechanism.
531
532       FULLSTRIP
              With FULLSTRIP, affix rules can strip a full word before adding
              the affixes, rather than at most all but one of its characters
              (see the fullstrip.* test files in the source distribution).
              Note: conditions may be as long as the word even without
              FULLSTRIP.
537
538       KEEPCASE flag
539              Forbid uppercased and capitalized forms  of  words  signed  with
540              KEEPCASE  flags.  Useful for special orthographies (measurements
541              and currency often keep their  case  in  uppercased  texts)  and
542              writing  systems  (e.g.  keeping  lower case of IPA characters).
543              Also valuable for words erroneously written in the wrong case.
544
545              Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
546              CASE  flag  may  be  capitalized  and uppercased, but uppercased
547              forms of these words may not contain sharp s, only SS. See  ger‐
548              mancompounding  example  in  the tests directory of the Hunspell
549              distribution.
550
551
552       ICONV number_of_ICONV_definitions
553
554       ICONV pattern pattern2
555              Define input conversion table.  Note: useful to convert one type
556              of quote to another one, or change ligature.
557
558       OCONV number_of_OCONV_definitions
559
560       OCONV pattern pattern2
561              Define output conversion table.
562
563       LEMMA_PRESENT flag
564              Deprecated. Use "st:" field instead of LEMMA_PRESENT.
565
566       NEEDAFFIX flag
567              This  flag  signs  virtual  stems  in the dictionary, words only
568              valid when affixed.   Except,  if  the  dictionary  word  has  a
569              homonym or a zero affix.  NEEDAFFIX works also with prefixes and
570              prefix + suffix combinations (see tests/pseudoroot5.*).
571
572       PSEUDOROOT flag
573              Deprecated. (Former name of the NEEDAFFIX option.)
574
575       SUBSTANDARD flag
576              SUBSTANDARD flag signs affix rules and dictionary  words  (allo‐
577              morphs)  not used in morphological generation (and in suggestion
578              in the future versions). See also NOSUGGEST.
579
580       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters.  For example, dot,
              dash, n-dash, numbers and the percent sign are word characters
              in Hungarian.
584
585       CHECKSHARPS
586              SS letter pair in uppercased (German) words may  be  upper  case
587              sharp  s  (ß).  Hunspell can handle this special casing with the
588              CHECKSHARPS declaration (see also KEEPCASE flag  and  tests/ger‐
589              mancompounding example) in both spelling and suggestion.
590
591

Morphological analysis

593       Hunspell's  dictionary items and affix rules may have optional space or
594       tabulator separated  morphological  description  fields,  started  with
595       3-character (two letters and a colon) field IDs:
596
597
598               word/flags po:noun is:nom
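
       Splitting such a line into its parts can be sketched in Python.
       This is only an illustration; parse_entry is a hypothetical
       helper, and escaped slashes in the word are not handled.

```python
def parse_entry(line):
    """Split a dictionary line into (word, flags, [(field_id, value), ...])."""
    head, *fields = line.split()
    word, _, flags = head.partition("/")
    # Field IDs are two letters followed by a colon; IDs may repeat.
    return word, flags, [tuple(field.split(":", 1)) for field in fields]


print(parse_entry("word/flags po:noun is:nom"))
print(parse_entry("sing al:sang al:sung"))
```

       The fields are kept as a list of pairs rather than a mapping,
       because the same field ID can occur more than once on a line.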
599
       Example: We define a simple resource with morphological information, a
       derivational suffix (ds:) and a part of speech category (po:):
602
603       Affix file:
604
605
606               SFX X Y 1
607               SFX X 0 able . ds:able
608
609       Dictionary file:
610
611
612               drink/X po:verb
613
614       Test file:
615
616
617               drink
618               drinkable
619
620       Test:
621
622
623               $ analyze test.aff test.dic test.txt
624               > drink
625               analyze(drink) = po:verb
626               stem(drink) = po:verb
627               > drinkable
628               analyze(drinkable) = po:verb ds:able
629               stem(drinkable) = drinkable
630
631       You can see in the example, that the analyzer concatenates the  morpho‐
632       logical fields in item and arrangement style.
633
634

Optional data fields

636       Default  morphological  and other IDs (used in suggestion, stemming and
637       morphological generation):
638
639       ph:    Alternative transliteration for better suggestion.  It's  useful
640              for words with foreign pronunciation. (Dictionary based phonetic
641              suggestion.)  For example:
642
643
644              Marseille ph:maarsayl
645
       st:    Stem.  Optional: in morphological analysis the default stem is
              the dictionary item.  The stem field is useful for virtual
              stems (dictionary words with the NEEDAFFIX flag) and for
              morphological exceptions, instead of adding new morphological
              rules that are used only once.
650
651              feet  st:foot  is:plural
652              mice  st:mouse is:plural
653              teeth st:tooth is:plural
654
655       Word forms with multiple stems need multiple dictionary items:
656
657
658              lay po:verb st:lie is:past_2
659              lay po:verb is:present
660              lay po:noun
661
662       al:    Allomorph(s).  A  dictionary item is the stem of its allomorphs.
663              Morphological generation needs stem, allomorph and affix fields.
664
665              sing al:sang al:sung
666              sang st:sing
667              sung st:sing
668
669       po:    Part of speech category.
670
671       ds:    Derivational suffix(es).  Stemming doesn't  remove  derivational
672              suffixes.   Morphological generation depends on the order of the
673              suffix fields.
674
675              In affix rules:
676
677
678              SFX Y Y 1
679              SFX Y 0 ly . ds:ly_adj
680
681       In the dictionary:
682
683
684              ably st:able ds:ly_adj
685              able al:ably
686
687       is:    Inflectional suffix(es).  All inflectional suffixes are  removed
688              by  stemming.   Morphological generation depends on the order of
689              the suffix fields.
690
691
692              feet st:foot is:plural
693
694       ts:    Terminal suffix(es).  Terminal suffix  fields  are  inflectional
695              suffix fields "removed" by additional (not terminal) suffixes.
696
697              Useful  for  zero  morphemes  and  affixes  removed by splitting
698              rules.
699
700
701              work/D ts:present
702
703              SFX D Y 2
704              SFX D   0 ed . is:past_1
705              SFX D   0 ed . is:past_2
706
707       Typical example of the terminal suffix is the zero morpheme of the nom‐
708       inative case.
709
710
711       sp:    Surface  prefix.  Temporary  solution for adding prefixes to the
712              stems and generated word forms. See tests/morph.* example.
713
714
715       pa:    Parts of the compound  words.  Output  fields  of  morphological
716              analysis for stemming.
717
718       dp:    Planned: derivational prefix.
719
720       ip:    Planned: inflectional prefix.
721
722       tp:    Planned: terminal prefix.
723
724

Twofold suffix stripping

       Ispell's original algorithm strips only one suffix.  Hunspell can
       strip a second one as well (or an additional prefix in
       COMPLEXPREFIXES mode).

       Twofold suffix stripping is a significant improvement in handling
       the immense number of suffixes that characterizes agglutinative
       languages.
732
733       A second `s' suffix (affix class Y) will be the continuation  class  of
734       the suffix `able' in the following example:
735
736
737               SFX Y Y 1
738               SFX Y 0 s .
739
740               SFX X Y 1
741               SFX X 0 able/Y .
742
743       Dictionary file:
744
745
746               drink/X
747
748       Test file:
749
750
751               drink
752               drinkable
753               drinkables
754
755       Test:
756
757
758               $ hunspell -m -d test <test.txt
759               drink st:drink
760               drinkable st:drink fl:X
761               drinkables st:drink fl:X fl:Y
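
       The role of the continuation class can be sketched in Python by
       generating word forms instead of stripping them.  This is an
       illustration only: the SUFFIXES table and expand are hypothetical,
       and conditions are omitted because both rules use the dot.

```python
# flag -> list of (strip, add, continuation flags), from the example above.
SUFFIXES = {
    "Y": [("", "s", "")],
    "X": [("", "able", "Y")],
}


def expand(stem, flags, depth=2):
    """Generate forms with at most `depth` stacked suffixes."""
    forms = {stem}
    if depth == 0:
        return forms
    for flag in flags:
        for strip, add, continuation in SUFFIXES.get(flag, []):
            if stem.endswith(strip):
                form = stem[:len(stem) - len(strip)] + add
                # The continuation flags decide which suffix may follow.
                forms |= expand(form, continuation, depth - 1)
    return forms


print(sorted(expand("drink", "X")))
```

       With depth=1 only "drink" and "drinkable" are produced, which
       corresponds to single suffix stripping.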
762
       In theory, twofold suffix stripping needs only the square root of the
       number of suffix rules required by a single-level (Ispell-style)
       implementation.  In practice, it was sufficient to describe Hungarian
       inflectional morphology.
767
768

Extended affix classes

       Hunspell can handle more than 65000 affix classes.  There are three
       new syntaxes for specifying flags in affix and dictionary files.
772
773       FLAG long command sets 2-character flags:
774
775
776                FLAG long
777                SFX Y1 Y 1
778                SFX Y1 0 s 1
779
780       Dictionary record with the Y1, Z3, F? flags:
781
782
783                foo/Y1Z3F?
784
785       FLAG num command sets numerical flags separated by comma:
786
787
                FLAG num
                SFX 65000 Y 1
                SFX 65000 0 s .
791
792       Dictionary example:
793
794
795                foo/65000,12,2756
796
       The third syntax, FLAG UTF-8, sets Unicode character flags.
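
       How the flag field of a dictionary entry splits into individual flags
       depends on the flag type. The following Python sketch illustrates the
       idea; parse_flags and split_entry are hypothetical helper names, the
       input is assumed to be already decoded text, and escaped slashes
       inside words are not handled.

```python
def parse_flags(field, flag_type="short"):
    """Split the flag field of a dictionary entry into individual flags.

    flag_type mirrors the FLAG option: "short" (one character per flag),
    "long" (two characters per flag), "num" (comma-separated decimal
    numbers) or "UTF-8" (one Unicode character per flag).
    """
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        return [int(n) for n in field.split(",")]
    # "short" and "UTF-8": each character of the decoded field is one flag.
    return list(field)


def split_entry(line, flag_type="short"):
    """Split 'word/flags' into the word and its list of flags."""
    word, _, field = line.partition("/")
    flags = parse_flags(field, flag_type) if field else []
    return word, flags
```

       For example, split_entry("foo/Y1Z3F?", "long") gives the flags Y1, Z3
       and F?, and split_entry("foo/65000,12,2756", "num") gives the numbers
       65000, 12 and 2756.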
798
799

Homonyms

801       Hunspell's dictionary can contain repeating elements that are homonyms:
802
803
804               work/A    po:verb
805               work/B    po:noun
806
807       An affix file:
808
809
               SFX A Y 1
               SFX A 0 s . is:sg3

               SFX B Y 1
               SFX B 0 s . is:plur
815
816       Test file:
817
818
819               works
820
821       Test:
822
823
               $ hunspell -d test -m <testwords
               works st:work po:verb is:sg3
               works st:work po:noun is:plur
827
828       This  feature also gives a way to forbid illegal prefix/suffix combina‐
829       tions.
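
       The point of homonyms is that the same stem may appear several times,
       each time with its own flags and morphological fields. A minimal
       Python sketch of this lookup follows; the ENTRIES and SUFFIXES tables
       and the analyze function are invented for illustration, and the field
       names follow the test output shown above.

```python
# Entries are kept in a list so that the same stem may occur twice.
ENTRIES = [
    ("work", "A", "po:verb"),
    ("work", "B", "po:noun"),
]
# flag -> (suffix text, morphological field of the suffix)
SUFFIXES = {
    "A": ("s", "is:sg3"),
    "B": ("s", "is:plur"),
}


def analyze(word):
    """Return one analysis string per matching homonym."""
    analyses = []
    for stem, flags, morph in ENTRIES:
        if word == stem:
            analyses.append(f"st:{stem} {morph}")
        for flag in flags:
            suffix, field = SUFFIXES[flag]
            if word == stem + suffix:
                analyses.append(f"st:{stem} {morph} {field}")
    return analyses
```

       Because each entry only licenses its own suffix class, works is
       analysed once as a verb form and once as a noun form.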
830
831

Prefix-suffix dependencies

       An interesting side effect of multi-step stripping is that the
       appropriate treatment of circumfixes now comes for free.  For
       instance, in Hungarian, superlatives are formed by simultaneously
       prefixing leg- and suffixing -bb to the adjective base.  A problem
       with the one-level architecture is that there is no way to make the
       lexical licensing of particular prefixes and suffixes interdependent,
       and therefore incorrect forms are recognized as valid, e.g. *legvén =
       leg + vén `old'.  Until the introduction of clusters, a special
       treatment of the superlative had to be hardwired in the earlier
       Hunspell code.  This may have been legitimate for a single case, but
       in fact prefix-suffix dependencies are ubiquitous in
       category-changing derivational patterns (cf. English payable,
       non-payable but *non-pay, or drinkable, undrinkable but *undrink).
       Put simply, the prefix un- is legitimate here only if the base drink
       is suffixed with -able.  If both of these patterns are handled by
       on-line affix rules, and affix rules are checked against the base
       only, there is no way to express this dependency, and the system will
       necessarily over- or undergenerate.
850
       In the next example, suffix class R has a prefix `continuation' class
       (class P).
853
854
855              PFX P Y 1
856              PFX P   0 un . [prefix_un]+
857
858              SFX S Y 1
859              SFX S   0 s . +PL
860
861              SFX Q Y 1
862              SFX Q   0 s . +3SGV
863
864              SFX R Y 1
865              SFX R   0 able/PS . +DER_V_ADJ_ABLE
866
867       Dictionary:
868
869
870              2
871              drink/RQ  [verb]
872              drink/S   [noun]
873
874       Morphological analysis:
875
876
877              > drink
878              drink[verb]
879              drink[noun]
880              > drinks
881              drink[verb]+3SGV
882              drink[noun]+PL
883              > drinkable
884              drink[verb]+DER_V_ADJ_ABLE
885              > drinkables
886              drink[verb]+DER_V_ADJ_ABLE+PL
887              > undrinkable
888              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
889              > undrinkables
890              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
891              > undrink
892              Unknown word.
893              > undrinks
894              Unknown word.
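
       The dependency can be made concrete by expanding the example into the
       set of word forms it licenses. The Python sketch below does this
       under simplifying assumptions (affixes only add text, conditions are
       ignored, at most two suffixes); the tables and the generate function
       are illustrative names, not part of Hunspell.

```python
PREFIXES = {"P": "un"}
# flag -> (suffix text, continuation flags)
SUFFIXES = {
    "S": ("s", ""),
    "Q": ("s", ""),
    "R": ("able", "PS"),
}
DICTIONARY = [("drink", "RQ"), ("drink", "S")]


def generate():
    """Return every word form licensed by the tables above."""
    forms = set()
    for stem, flags in DICTIONARY:
        forms.add(stem)
        for flag in flags:
            if flag in PREFIXES:  # prefix licensed by the stem itself
                forms.add(PREFIXES[flag] + stem)
            if flag not in SUFFIXES:
                continue
            suffix, continuation = SUFFIXES[flag]
            derived = stem + suffix
            forms.add(derived)
            for cflag in continuation:
                if cflag in PREFIXES:  # prefix licensed by the suffix
                    forms.add(PREFIXES[cflag] + derived)
                if cflag in SUFFIXES:  # second suffix
                    second = derived + SUFFIXES[cflag][0]
                    forms.add(second)
                    # The prefix may also combine with both suffixes.
                    for pflag in continuation:
                        if pflag in PREFIXES:
                            forms.add(PREFIXES[pflag] + second)
    return forms
```

       Since only the suffix -able carries class P, the prefix un- never
       attaches to the bare stem: undrinkable is generated, undrink and
       undrinks are not.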
895

Circumfix

       Conditional affixes implemented by a continuation class are not
       enough for circumfixes, because a circumfix is a single affix in
       morphology.  The CIRCUMFIX option is also needed for correct
       morphological analysis.
900
901
902              # circumfixes: ~ obligate prefix/suffix combinations
903              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
904              # nagy, nagyobb, legnagyobb, legeslegnagyobb
905              # (great, greater, greatest, most greatest)
906
907              CIRCUMFIX X
908
909              PFX A Y 1
910              PFX A 0 leg/X .
911
912              PFX B Y 1
913              PFX B 0 legesleg/X .
914
915              SFX C Y 3
916              SFX C 0 obb . +COMPARATIVE
917              SFX C 0 obb/AX . +SUPERLATIVE
918              SFX C 0 obb/BX . +SUPERSUPERLATIVE
919
920       Dictionary:
921
922
923              1
924              nagy/C    [MN]
925
926       Analysis:
927
928
929              > nagy
930              nagy[MN]
931              > nagyobb
932              nagy[MN]+COMPARATIVE
933              > legnagyobb
934              nagy[MN]+SUPERLATIVE
935              > legeslegnagyobb
936              nagy[MN]+SUPERSUPERLATIVE
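
       The effect of the CIRCUMFIX flag in this example can be sketched in
       Python as follows. The sketch assumes a simplified reading of the
       option: a suffix rule carrying the circumfix flag is valid only
       together with a prefix that it licenses and that carries the same
       flag. All names below are illustrative.

```python
CIRCUMFIX = "X"
# prefix flag -> (prefix text, flags on the prefix rule)
PREFIXES = {"A": ("leg", "X"), "B": ("legesleg", "X")}
# rules of suffix class C: (suffix text, continuation flags, analysis tag)
SUFFIX_C = [
    ("obb", "", "+COMPARATIVE"),
    ("obb", "AX", "+SUPERLATIVE"),
    ("obb", "BX", "+SUPERSUPERLATIVE"),
]
DICTIONARY = {"nagy": ("C", "[MN]")}


def analyze(word):
    """Return the analyses of word under the tables above."""
    results = []
    for stem, (flags, tag) in DICTIONARY.items():
        if word == stem:
            results.append(stem + tag)
        if "C" not in flags:
            continue
        for suffix, continuation, suffix_tag in SUFFIX_C:
            if CIRCUMFIX not in continuation:
                # Plain suffix: valid only without a prefix.
                if word == stem + suffix:
                    results.append(stem + tag + suffix_tag)
                continue
            # Circumfix half: needs a licensed prefix with the same flag.
            for pflag, (prefix, pflags) in PREFIXES.items():
                if pflag in continuation and CIRCUMFIX in pflags:
                    if word == prefix + stem + suffix:
                        results.append(stem + tag + suffix_tag)
    return results
```

       In this model nagyobb receives only the comparative analysis, and a
       form with the prefix but without the suffix (legnagy) is rejected.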
937

Compounds

       Allowing free compounding decreases the precision of recognition, not
       to mention stemming and morphological analysis.  Although Ispell
       introduced lexical switches to license compounding of bases, this
       proves not to be restrictive enough. For example:
943
944
              # affix file
              COMPOUNDFLAG X

              # dictionary file
              2
              foo/X
              bar/X
951
       With this resource, foobar and barfoo are also accepted words.
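
       Why a single compound flag is so permissive can be seen in a small
       Python sketch. It assumes that every word carrying the compound flag
       may appear in any position and any number of times; the min_part
       parameter plays the role of a minimum part length, and the function
       name is hypothetical.

```python
COMPOUND_WORDS = {"foo", "bar"}  # entries carrying the compound flag X


def is_compound(word, parts=COMPOUND_WORDS, min_part=3):
    """True if word is a concatenation of two or more entries of parts."""
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in parts and (tail in parts or is_compound(tail, parts, min_part)):
            return True
    return False
```

       Both is_compound("foobar") and is_compound("barfoo") are true, and so
       is any longer chain such as foobarfoo, which is exactly the
       overgeneration described here.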
953
954       This has been improved upon with the introduction  of  direction-sensi‐
955       tive compounding, i.e., lexical features can specify separately whether
956       a base can occur as leftmost or  rightmost  constituent  in  compounds.
957       This,  however,  is still insufficient to handle the intricate patterns
958       of compounding, not to mention idiosyncratic  (and  language  specific)
959       norms of hyphenation.
960
       The MySpell algorithm allows any affixed form of words that are
       lexically marked as potential members of compounds.  Hunspell
       improved on this, and its recursive compound checking rules make it
       possible to implement the intricate spelling conventions of Hungarian
       compounds. For example, the noteworthy Hungarian `6-3' rule can be
       set with the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
       SYLLABLENUM options.  As a further example, derivational suffixes in
       Hungarian often modify compounding properties. Hunspell allows
       compounding flags on affixes, and there are two special flags
       (COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
       compounding of the derivations.

       Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the
       affixed word.
973
974       We also need several Hunspell features for handling German compounding:
975
976
977              # German compounding
978
979              # set language to handle special casing of German sharp s
980
981              LANG de_DE
982
983              # compound flags
984
985              COMPOUNDBEGIN U
986              COMPOUNDMIDDLE V
987              COMPOUNDEND W
988
989              # Prefixes are allowed at the beginning of compounds,
990              # suffixes are allowed at the end of compounds by default:
991              # (prefix)?(root)+(affix)?
992              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
993              COMPOUNDPERMITFLAG P
994
995              # for German fogemorphemes (Fuge-element)
996              # Hint: ONLYINCOMPOUND is not required everywhere, but the
997              # checking will be a little faster with it.
998
999              ONLYINCOMPOUND X
1000
1001              # forbid uppercase characters at compound word bounds
1002              CHECKCOMPOUNDCASE
1003
1004              # for handling Fuge-elements with dashes (Arbeits-)
1005              # dash will be a special word
1006
1007              COMPOUNDMIN 1
1008              WORDCHARS -
1009
1010              # compound settings and fogemorpheme for `Arbeit'
1011
1012              SFX A Y 3
1013              SFX A 0 s/UPX .
1014              SFX A 0 s/VPDX .
1015              SFX A 0 0/WXD .
1016
1017              SFX B Y 2
1018              SFX B 0 0/UPX .
1019              SFX B 0 0/VWXDP .
1020
1021              # a suffix for `Computer'
1022
1023              SFX C Y 1
1024              SFX C 0 n/WD .
1025
1026              # for forbid exceptions (*Arbeitsnehmer)
1027
1028              FORBIDDENWORD Z
1029
1030              # dash prefix for compounds with dash (Arbeits-Computer)
1031
1032              PFX - Y 1
1033              PFX - 0 -/P .
1034
1035              # decapitalizing prefix
1036              # circumfix for positioning in compounds
1037
1038              PFX D Y 29
1039              PFX D A a/PX A
1040              PFX D Ä ä/PX Ä
1041               .
1042               .
1043              PFX D Y y/PX Y
1044              PFX D Z z/PX Z
1045
1046       Example dictionary:
1047
1048
1049              4
1050              Arbeit/A-
1051              Computer/BC-
1052              -/W
1053              Arbeitsnehmer/Z
1054
       Accepted compound words with the previous resource:
1056
1057
1058              Computer
1059              Computern
1060              Arbeit
1061              Arbeits-
1062              Computerarbeit
1063              Computerarbeits-
1064              Arbeitscomputer
1065              Arbeitscomputern
1066              Computerarbeitscomputer
1067              Computerarbeitscomputern
1068              Arbeitscomputerarbeit
1069              Computerarbeits-Computer
1070              Computerarbeits-Computern
1071
       Compounds that are not accepted:
1073
1074
1075              computer
1076              arbeit
1077              Arbeits
1078              arbeits
1079              ComputerArbeit
1080              ComputerArbeits
1081              Arbeitcomputer
1082              ArbeitsComputer
1083              Computerarbeitcomputer
1084              ComputerArbeitcomputer
1085              ComputerArbeitscomputer
1086              Arbeitscomputerarbeits
1087              Computerarbeits-computer
1088              Arbeitsnehmer
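
       The position flags of this example can be illustrated with a heavily
       simplified Python sketch. It assumes that the affix rules have been
       expanded by hand into three sets of forms (beginning, middle and end
       of a compound), and it leaves out the dash handling, FORBIDDENWORD
       and single, non-compound words. The set and function names are
       invented for illustration.

```python
BEGIN = {"Arbeits", "Computer"}             # forms with COMPOUNDBEGIN (U)
MIDDLE = {"arbeits", "computer"}            # forms with COMPOUNDMIDDLE (V)
END = {"arbeit", "computer", "computern"}   # forms with COMPOUNDEND (W)


def is_compound(word):
    """True if word is begin + zero or more middles + end."""

    def rest_ok(tail):
        if tail in END:
            return True
        return any(
            tail.startswith(m) and rest_ok(tail[len(m):]) for m in MIDDLE
        )

    return any(word.startswith(b) and rest_ok(word[len(b):]) for b in BEGIN)
```

       Under these assumptions Computerarbeitscomputern is accepted, while
       Arbeitcomputer (missing linking s) and ComputerArbeit (uppercase
       letter at a compound boundary) are rejected, in line with the lists
       above.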
1089
1090       This solution is still not ideal, however, and will be  replaced  by  a
1091       pattern-based  compound-checking  algorithm which is closely integrated
1092       with input buffer tokenization. Patterns describing compounds come as a
1093       separate input resource that can refer to high-level properties of con‐
1094       stituent parts (e.g. the number of syllables, affix flags, and contain‐
1095       ment  of hyphens). The patterns are matched against potential segmenta‐
1096       tions of compounds to assess wellformedness.
1097
1098

Unicode character encoding

       Both Ispell and MySpell use 8-bit character encodings, which is a
       major deficiency when it comes to scalability.  Although a language
       like Hungarian has a standard 8-bit character set (ISO 8859-2), that
       set does not allow a full implementation of Hungarian orthographic
       conventions.  For instance, the '–' symbol (en dash) is missing from
       this character set, even though it is not only the official symbol
       for delimiting parenthetic clauses in the language but can also occur
       in compound words as a special 'big' hyphen.
1108
       MySpell has some 8-bit encoding tables, but there are also languages
       without a standard 8-bit encoding. For example, many African
       languages use non-Latin or extended Latin characters.
1112
       Similarly, the Hungarian spelling norm encourages using the original
       spelling of certain foreign names like Ångström or Molière.  Since
       the characters 'Å' and 'è' are not part of ISO 8859-2, combining such
       names with inflections that contain characters found only in
       ISO 8859-2 (like elative -ből, ablative -től or delative -ről, with
       double acute) results in words (like Ångströmről or Molière-től)
       that cannot be encoded using any single 8-bit encoding scheme.
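
       This can be checked with Python's standard codecs. The helper below
       is a small illustration (the function name encodable is
       hypothetical); it reports whether a string can be represented in a
       given encoding.

```python
def encodable(text, encoding):
    """Return True if text can be encoded with the named codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


WORDS = ("Ångströmről", "Molière-től")
ENCODINGS = ("iso8859-1", "iso8859-2", "utf-8")

# Map every (word, encoding) pair to whether the word fits the encoding.
RESULTS = {(w, e): encodable(w, e) for w in WORDS for e in ENCODINGS}
```

       Both words fail in ISO 8859-1 (no 'ő') and in ISO 8859-2 (no 'Å' or
       'è'), and both succeed in UTF-8.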
1120
       The problems raised by 8-bit encodings have long been recognized by
       proponents of Unicode. It is clear that trading efficiency for
       encoding independence has its advantages when it comes to a truly
       multilingual application. Hunspell implements memory- and
       time-efficient Unicode handling. In non-UTF-8 character encodings,
       Hunspell works with the original 8-bit strings. In UTF-8 encoding,
       affixes and words are stored in UTF-8 and are handled mostly in UTF-8
       during analysis; they are converted to UTF-16 for condition checking
       and suggestion.  Unicode text analysis and spell checking have a
       minimal (0-20%) time overhead and a minimal or reasonable memory
       overhead, depending on the language (its UTF-8 encoding and
       affixation).
1132
1133

Conversion of aspell dictionaries

       Aspell dictionaries can easily be converted to Hunspell format.
       Conversion steps:

       Dictionary (xx.cwl -> xx.wl):

              preunzip xx.cwl
              wc -l < xx.wl > xx.dic
              cat xx.wl >> xx.dic

       Affix file:

       If the affix file exists, copy it:

              cp xx_affix.dat xx.aff

       If not, create it with the suitable character encoding (see xx.dat):

              echo "SET ISO8859-x" > xx.aff

       or

              echo "SET UTF-8" > xx.aff

       It is useful to add a TRY option that lists the characters of the
       dictionary in frequency order, to guide edit-distance suggestions:

              echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
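
       The dictionary header and a frequency-ordered TRY line can also be
       derived from the word list itself. The following Python sketch shows
       one way to do so; make_dic and try_line are hypothetical helper
       names, and ordering ties by character is an arbitrary choice made
       here for reproducibility.

```python
from collections import Counter


def make_dic(words):
    """Return .dic text: the word count, then one word per line."""
    words = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(words)), *words]) + "\n"


def try_line(words):
    """Return a TRY line with the characters of words, most frequent first."""
    counts = Counter("".join(words))
    # Sort by descending frequency; break ties by character.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return "TRY " + "".join(ordered)
```

       For the word list hello, try, work, make_dic produces a file that
       starts with the line 3, and try_line(["hello"]) gives "TRY leho",
       because l is the only character that occurs twice.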
1156
1157