hunspell(5)                   File Formats Manual                  hunspell(5)

NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define how a language is spell
       checked: a dictionary file containing words and their applicable
       flags, and an affix file that specifies how these flags control
       spell checking. A personal dictionary file is optional.

Dictionary file

       A dictionary file (*.dic) contains a list of words, one per line. The
       first line of a dictionary (except a personal dictionary) contains
       the approximate word count (used to size the hash table). Each word
       may optionally be followed by a slash ("/") and one or more flags,
       which represent the word's attributes, for example its affixes.
21
       Note: Dictionary words can also contain slashes when they are escaped
       with a backslash ("\/").
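
       The layout described above can be read with a few lines of code. The
       following Python sketch is illustrative only and is not Hunspell's
       own parser. The function name parse_dic is hypothetical, and the
       sketch assumes that words contain no whitespace and that a backslash
       escapes a slash inside a word.

```python
def parse_dic(text):
    """Parse .dic text into (declared_count, [(word, flags), ...])."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0].split()[0])   # first line: approximate word count
    entries = []
    for line in lines[1:]:
        field = line.split()[0]                 # ignore any fields after the entry
        word, flags, i = "", "", 0
        while i < len(field):
            if field[i] == "\\" and field[i + 1:i + 2] == "/":
                word += "/"                     # escaped slash belongs to the word
                i += 2
            elif field[i] == "/":
                flags = field[i + 1:]           # first unescaped slash starts the flags
                break
            else:
                word += field[i]
                i += 1
        entries.append((word, flags))
    return declared_count, entries
```

       The flags are returned as one raw string, because how they are split
       depends on the FLAG option of the affix file.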
24
25

Personal dictionary file

       Personal dictionaries are simple word lists. An asterisk in the first
       character position marks a forbidden word. A second word, separated
       from the first by a slash, names a model word whose affixes the
       first word also accepts.
30
31
32              foo
33              Foo/Simpson
34              *bar
35
36       In  this  example, "foo" and "Foo" are personal words, plus Foo will be
37       recognized with affixes of Simpson (Foo's etc.) and bar is a  forbidden
38       word.
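
       The three kinds of personal dictionary lines can be told apart as in
       the following Python sketch. It only classifies the lines; the name
       parse_personal and the returned structure are hypothetical and are
       not part of Hunspell.

```python
def parse_personal(text):
    """Return (words, forbidden, models) from personal dictionary text."""
    words, forbidden, models = [], [], {}
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith("*"):
            forbidden.append(entry[1:])       # leading asterisk: forbidden word
        elif "/" in entry:
            word, model = entry.split("/", 1)
            words.append(word)
            models[word] = model              # word takes the affixes of the model
        else:
            words.append(entry)
    return words, forbidden, models
```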
39
40

Short example

42       Dictionary file:
43
44              3
45              hello
46              try/B
47              work/AB
48
49       The flags B and A specify attributes of these words.
50
51       Affix file:
52
53
54              SET UTF-8
55              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
56
57              REP 2
58              REP f ph
59              REP ph f
60
61              PFX A Y 1
62              PFX A 0 re .
63
64              SFX B Y 2
65              SFX B 0 ed [^y]
66              SFX B y ied y
67
       In the affix file, prefix class A and suffix class B are defined.
       Class A defines an `re-' prefix. Class B defines two `-ed' suffixes.
       The first B suffix can be added to a word whose last character is not
       `y'. The second B suffix replaces a final `y' with `-ied'.
72
73
74       All accepted words with this  dictionary  and  affix  combination  are:
75       "hello", "try", "tried", "work", "worked", "rework", "reworked".
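
       The list of accepted words can be reproduced with a small generator.
       The Python sketch below is a simplified model of the short example
       and not Hunspell's algorithm: the rules are transcribed by hand into
       tuples, flags are single characters, and all function names are
       hypothetical.

```python
import re

# Rules transcribed from the example affix file as
# (flag, strip, add, condition); "0" (nothing) is written as "".
PREFIXES = [("A", "", "re", ".")]
SUFFIXES = [("B", "", "ed", "[^y]"), ("B", "y", "ied", "y")]
DICTIONARY = [("hello", ""), ("try", "B"), ("work", "AB")]

def add_suffixes(word, flags):
    forms = []
    for flag, strip, add, cond in SUFFIXES:
        # A suffix condition is matched against the end of the word.
        if flag in flags and re.search("(?:" + cond + ")$", word):
            forms.append(word[:len(word) - len(strip)] + add)
    return forms

def add_prefixes(word, flags):
    forms = []
    for flag, strip, add, cond in PREFIXES:
        # A prefix condition is matched against the start of the word.
        if flag in flags and re.match(cond, word):
            forms.append(add + word[len(strip):])
    return forms

def accepted_words():
    result = []
    for word, flags in DICTIONARY:
        suffixed = add_suffixes(word, flags)
        result.append(word)
        result.extend(suffixed)
        # Cross product is Y for both classes, so the prefix may also be
        # combined with the suffixed forms.
        for form in [word] + suffixed:
            result.extend(add_prefixes(form, flags))
    return result
```

       Calling accepted_words() returns the seven words listed above.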
76
77

AFFIX FILE GENERAL OPTIONS

79       Hunspell  source distribution contains more than 80 examples for option
80       usage.
81
82
83       SET encoding
84              Set character encoding of words and morphemes in affix and  dic‐
85              tionary  files.  Possible values: UTF-8, ISO8859-1 - ISO8859-10,
86              ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U,  cp1251,  ISCII-DEVANA‐
87              GARI.
88
89              SET UTF-8
90
91       FLAG value
              Set the flag type. The default type is a single extended ASCII
              (8-bit) character. The `UTF-8' value sets UTF-8 encoded
              Unicode character flags. The `long' value sets two-character
              extended ASCII flags, and the `num' value sets decimal number
              flags. Decimal flags are numbered from 1 to 65000 and are
              separated by commas in flag fields. BUG: the UTF-8 flag type
              doesn't work on the ARM platform.
99
100              FLAG long
101
102       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping),
              e.g. for morphologically complex languages with a
              right-to-left writing system.
106
107
108       LANG langcode
109              Set  language  code for language specific functions of Hunspell.
110              Use it to activate special casing of Azeri (LANG az) and Turkish
111              (LANG tr).
112
113       IGNORE characters
              Set characters to ignore in dictionary words, affixes and
              input words. Useful for optional characters, such as Arabic
              (harakat) or Hebrew (niqqud) diacritical marks (see the
              tests/ignore.* test dictionary in the Hunspell distribution).
118
119
120       AF number_of_flag_vector_aliases
121
122       AF flag_vector
              Hunspell can substitute affix flag sets with ordinal numbers
              in dictionary entries and affix rules (alias compression, see
              the makealias tool). The short example above, with alias
              compression:
126
127              3
128              hello
129              try/1
130              work/2
131
132       AF definitions in the affix file:
133
134              AF 2
135              AF A
136              AF AB
137
       It is equivalent to the following dic file:
139
140              3
141              hello
142              try/A
143              work/AB
144
145       See also tests/alias* examples of the source distribution.
146
       Note I: If the affix file contains the FLAG parameter, define it
       before the AF definitions.

       Note II: Use the makealias utility in the Hunspell distribution to
       compress aff and dic files.
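
       Expanding an alias back to its flag vector is a table lookup. The
       following Python sketch shows the idea for the example above; the
       function name expand_alias is hypothetical, and the sketch handles
       only dictionary entries of the form word/number.

```python
def expand_alias(af_lines, dic_entry):
    """Replace a numeric alias in 'word/N' with the N-th AF flag vector."""
    # af_lines[0] is the header "AF <count>"; the rest are "AF <flags>".
    vectors = [line.split()[1] for line in af_lines[1:]]
    if "/" not in dic_entry:
        return dic_entry
    word, alias = dic_entry.split("/", 1)
    return word + "/" + vectors[int(alias) - 1]   # aliases count from 1

AF = ["AF 2", "AF A", "AF AB"]
```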
152
153       AM number_of_morphological_aliases
154
155       AM morphological_fields
156              Hunspell can substitute also  morphological  data  with  ordinal
157              numbers  in  affix  rules (alias compression).  See tests/alias*
158              examples.
159

AFFIX FILE OPTIONS FOR SUGGESTION

       Suggestion parameters tune Hunspell's default suggestions: n-gram
       suggestions (a similarity search over the dictionary words based on
       shared character sequences of length 1 to 4), character swaps and
       character deletions. Use REP to fix typical, and especially severe,
       language-specific mistakes, because REP suggestions have the highest
       priority in the suggestion list. PHONE is for languages whose
       orthography is not based on pronunciation.
168
169       KEY characters_separated_by_vertical_line_optionally
              Hunspell searches for and suggests words that differ from the
              input by one character replaced with a neighboring KEY
              character. Separate characters that are not neighbors with
              vertical line characters in the KEY string. Suggested KEY
              parameters for the QWERTY and Dvorak keyboard layouts:
174
175              KEY qwertyuiop|asdfghjkl|zxcvbnm
176              KEY pyfgcrl|aeouidhtns|qjkxbmwvz
177
178       Using  the first QWERTY layout, Hunspell suggests "nude" and "node" for
179       "*nide". A character may have more neighbors, too:
180
181              KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
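
       The neighbor replacement can be modeled as follows. This Python
       sketch is a simplified illustration, not Hunspell's implementation:
       it replaces one character at a time with its left or right neighbor
       in the KEY string and keeps the candidates found in a word set. The
       function names and the word set are hypothetical.

```python
def key_candidates(word, key):
    """List words made by replacing one character with a KEY neighbor."""
    candidates = []
    for i, ch in enumerate(word):
        for group in key.split("|"):
            for j, key_ch in enumerate(group):
                if key_ch != ch:
                    continue
                # Neighbors are the characters directly left and right.
                for k in (j - 1, j + 1):
                    if 0 <= k < len(group):
                        cand = word[:i] + group[k] + word[i + 1:]
                        if cand not in candidates:
                            candidates.append(cand)
    return candidates

def key_suggestions(word, key, dictionary):
    """Keep only the candidates that are known words."""
    return [c for c in key_candidates(word, key) if c in dictionary]

QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
```

       With a word set containing "nude" and "node", key_suggestions("nide",
       QWERTY, words) returns both, because `u' and `o' are the neighbors of
       `i' in the first group.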
182
183       TRY characters
184              Hunspell can suggest right word forms, when they differ from the
185              bad  input  word  by  one TRY character. The parameter of TRY is
186              case sensitive.
187
188       NOSUGGEST flag
189              Words signed with NOSUGGEST flag are not  suggested  (but  still
190              accepted  when  typed  correctly).  Proposed flag for vulgar and
191              obscene words (see also SUBSTANDARD).
192
193       MAXCPDSUGS num
194              Set max. number of suggested compound words  generated  by  com‐
195              pound  rules.  The number of the suggested compound words may be
196              greater from the same 1-character distance type.
197
198       MAXNGRAMSUGS num
199              Set max. number of n-gram suggestions. Value 0 switches off  the
200              n-gram suggestions (see also MAXDIFF).
201
202       MAXDIFF [0-10]
203              Set  the similarity factor for the n-gram based suggestions (5 =
204              default value; 0 = fewer n-gram suggestions, but min.  1;  10  =
205              MAXNGRAMSUGS n-gram suggestions).
206
207       ONLYMAXDIFF
208              Remove  all  bad n-gram suggestions (default mode keeps one, see
209              MAXDIFF).
210
211       NOSPLITSUGS
212              Disable word suggestions with spaces.
213
214       SUGSWITHDOTS
215              Add dot(s) to suggestions, if input word terminates  in  dot(s).
216              (Not  for  LibreOffice  dictionaries, because LibreOffice has an
217              automatic dot expansion mechanism.)
218
219       REP number_of_replacement_definitions
220
221       REP what replacement
              This table specifies modifications to try first. The first REP
              line is the header of the table, and one or more REP data
              lines follow it. With this table, Hunspell can suggest the
              right forms for typical spelling mistakes when the incorrect
              form differs from the right form by more than one letter. The
              search string supports the regex boundary signs (^ and $).
              For example, a possible English replacement table to handle
              misspelled consonants:
230
231              REP 5
232              REP f ph
233              REP ph f
234              REP tion$ shun
235              REP ^cooccurr co-occurr
236              REP ^alot$ a_lot
237
238       Note  I:  It's  very useful to define replacements for the most typical
239       one-character mistakes, too: with REP you can add higher priority to  a
240       subset of the TRY suggestions (suggestion list begins with the REP sug‐
241       gestions).
242
       Note II: To suggest separated words, specify spaces with underscores:
244
245
246              REP 1
247              REP onetwothree one_two_three
248
249       Note III: Replacement table can be used for a  stricter  compound  word
250       checking with the option CHECKCOMPOUNDREP.
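
       How a replacement table produces candidates can be sketched in
       Python. The code below is a simplified model under stated
       assumptions, not Hunspell's implementation: each rule is applied
       once at every matching position, ^ and $ restrict a match to the
       start or end of the word, and an underscore in the replacement
       stands for a space. A real checker would keep only the candidates
       that are valid words. The function name rep_candidates is
       hypothetical.

```python
def rep_candidates(word, rules):
    """Apply each (what, replacement) rule once at every match position."""
    candidates = []
    for what, replacement in rules:
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        pattern = what.lstrip("^").rstrip("$")
        replacement = replacement.replace("_", " ")   # underscore means space
        start = word.find(pattern)
        while start != -1:
            end = start + len(pattern)
            if (not at_start or start == 0) and (not at_end or end == len(word)):
                candidates.append(word[:start] + replacement + word[end:])
            start = word.find(pattern, start + 1)
    return candidates

RULES = [("f", "ph"), ("ph", "f"), ("tion$", "shun"),
         ("^cooccurr", "co-occurr"), ("^alot$", "a_lot")]
```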
251
252
253       MAP number_of_map_definitions
254
255       MAP string_of_related_chars_or_parenthesized_character_sequences
              A map table in the affix file (.aff) defines characters and
              character sequences that should be considered related (i.e.
              nearer to each other than to characters outside the set).
              With this table, Hunspell can suggest the right forms for
              words that use the wrong letter or letter group from a related
              set more than once (see REP).

              For example, a possible mapping for German relates the
              umlauted ü to the regular u; the word Frühstück should be
              written with umlauted u's and not regular ones:
267
268              MAP 1
269              MAP uü
270
271       Use parenthesized groups for character sequences (eg. for composed Uni‐
272       code characters):
273
274              MAP 3
275              MAP ß(ss)  (character sequence)
276              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
277              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)
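
       The effect of a map table can be illustrated by trying every
       combination of related characters. The Python sketch below is a
       simplified model, not Hunspell's implementation: it supports only
       single characters (no parenthesized sequences), the number of
       combinations grows exponentially with the number of mapped
       characters, and the function name and word set are hypothetical.

```python
from itertools import product

def map_suggestions(word, map_sets, dictionary):
    """Try every combination of related characters, keep known words."""
    options = []
    for ch in word:
        # Use the whole related set if ch belongs to one, else ch alone.
        related = next((s for s in map_sets if ch in s), ch)
        options.append(related)
    forms = ("".join(combo) for combo in product(*options))
    return [f for f in forms if f != word and f in dictionary]
```

       With the table MAP uü, the misspelling "Fruhstuck" has four
       combinations, and only "Frühstück" is a known word.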
278
279       PHONE number_of_phone_definitions
280
281       PHONE what replacement
              PHONE uses a table-driven phonetic transcription algorithm
              borrowed from Aspell. It is useful for languages whose
              orthography is not based on pronunciation. You can add a full
              alphabet conversion and other rules for converting special
              letter sequences. For detailed documentation see
              http://aspell.net/man-html/Phonetic-Code.html. Note: multibyte
              UTF-8 characters do not work in bracket expressions yet, and
              dash expressions operate on signed bytes, not on UTF-8
              characters.
290
291       WARN flag
292              This  flag is for rare words, which are also often spelling mis‐
293              takes, see option -r of command line Hunspell and FORBIDWARN.
294
295       FORBIDWARN
296              Words with flag WARN aren't accepted by the spell checker  using
297              this parameter.
298

OPTIONS FOR COMPOUNDING

300       BREAK number_of_break_definitions
301
302       BREAK character_or_character_sequence
              Define new break points for breaking words and checking the
              word parts separately. Use ^ and $ to delete characters at
              the start and end of the word. Rationale: useful for
              compounding with joining characters or strings (for example,
              the hyphen in English and German, or the hyphen and n-dash in
              Hungarian). Dashes are often bad break points for
              tokenization, because compounds with dashes may also contain
              invalid parts. With BREAK, Hunspell can check both sides of
              these compounds, breaking the words at dashes and n-dashes:
312
313              BREAK 2
314              BREAK -
315              BREAK --    # n-dash
316
       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds. Note: The default word break of Hunspell is
       equivalent to the following BREAK definition:
320
321              BREAK 3
322              BREAK -
323              BREAK ^-
324              BREAK -$
325
326       Hunspell  doesn't  accept  the  "-word" and "word-" forms by this BREAK
327       definition:
328
329              BREAK 1
330              BREAK -
331
332       Switching off the default values:
333
334              BREAK 0
335
336       Note II: COMPOUNDRULE is better for handling dashes and other  compound
337       joining  characters  or  character  strings.  Use BREAK, if you want to
338       check words with dashes or other joining characters  and  there  is  no
339       time  or  possibility  to  describe  precise  compound  rules with COM‐
340       POUNDRULE (COMPOUNDRULE handles only the suffixation of the  last  word
341       part of a compound word).
342
       Note III: For command line spell checking of words with extra
       characters, set the WORDCHARS parameter: WORDCHARS --- (see the
       tests/break.* example).
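
       Recursive breaking can be sketched as follows. This Python code is
       an illustrative model, not Hunspell's implementation: a word is
       accepted if it is in the word set, or if breaking it at a BREAK
       string yields parts that are all accepted. The ^ and $ forms delete
       the string at the start or end of the word. The function name and
       the word set are hypothetical.

```python
def accepts(word, dictionary, breaks):
    """Accept a word if it, or every part produced by breaking it, is known."""
    if word in dictionary:
        return True
    for b in breaks:
        if b.startswith("^"):
            x = b[1:]                      # delete x at the start of the word
            if word.startswith(x) and len(word) > len(x):
                if accepts(word[len(x):], dictionary, breaks):
                    return True
        elif b.endswith("$"):
            x = b[:-1]                     # delete x at the end of the word
            if word.endswith(x) and len(word) > len(x):
                if accepts(word[:-len(x)], dictionary, breaks):
                    return True
        else:
            # Break inside the word; both parts must be non-empty.
            i = word.find(b, 1)
            while i != -1 and i + len(b) < len(word):
                left, right = word[:i], word[i + len(b):]
                if accepts(left, dictionary, breaks) and accepts(right, dictionary, breaks):
                    return True
                i = word.find(b, i + 1)
    return False

WORDS = {"foo", "bar"}
```

       With the break list ["-", "^-", "-$"] the form "-foo" is accepted,
       while with ["-"] alone it is rejected.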
346
347       COMPOUNDRULE number_of_compound_definitions
348
349       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax. The
              first COMPOUNDRULE is a header with the number of the
              following COMPOUNDRULE definitions. Compound patterns consist
              of compound flags, parentheses, and the star and question mark
              metacharacters. A flag followed by a `*' matches a sequence of
              zero or more words signed with this compound flag. A flag
              followed by a `?' matches zero or one word signed with this
              compound flag. See the tests/compound*.* examples.
359
360              Note:  en_US  dictionary of OpenOffice.org uses COMPOUNDRULE for
361              ordinal number recognition (1st, 2nd, 11th, 12th,  22nd,  112th,
362              1000122nd etc.).
363
364              Note  II:  In the case of long and numerical flag types use only
365              parenthesized flags: (1500)*(2000)?
366
367              Note III: COMPOUNDRULE flags work completely separately from the
368              compounding  mechanisms  using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
369              compound flags.  (Use  these  flags  on  different  entries  for
370              words).
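
       The matching of a compound pattern against word flags can be
       modeled in Python. The sketch below is a simplified illustration and
       not Hunspell's implementation: it tries every split of the input
       into known words and matches the flag sets of the parts against the
       pattern. The dictionary, its flags and the pattern A*B?C are made
       up for the example, and all function names are hypothetical.

```python
import re

def parse_pattern(pattern):
    """Split a compound pattern into (flag, quantifier) tokens."""
    tokens = []
    for flag1, flag2, quant in re.findall(r"(?:\(([^)]+)\)|([^()*?]))([*?]?)", pattern):
        tokens.append((flag1 or flag2, quant))   # parenthesized or single flag
    return tokens

def flags_match(tokens, word_flags):
    """Match a list of per-word flag sets against the pattern tokens."""
    if not tokens:
        return not word_flags
    (flag, quant), rest = tokens[0], tokens[1:]
    fits = bool(word_flags) and flag in word_flags[0]
    if quant == "*":      # zero or more words with this flag
        return flags_match(rest, word_flags) or (fits and flags_match(tokens, word_flags[1:]))
    if quant == "?":      # zero or one word with this flag
        return flags_match(rest, word_flags) or (fits and flags_match(rest, word_flags[1:]))
    return fits and flags_match(rest, word_flags[1:])

def is_compound(word, pattern, dictionary):
    """Try every split of word into known words and test the pattern."""
    tokens = parse_pattern(pattern)

    def splits(rest):
        if not rest:
            yield []
        for i in range(1, len(rest) + 1):
            if rest[:i] in dictionary:
                for tail in splits(rest[i:]):
                    yield [rest[:i]] + tail

    return any(flags_match(tokens, [dictionary[p] for p in parts])
               for parts in splits(word))

# Hypothetical dictionary: word -> set of compound flags.
WORD_FLAGS = {"a": {"A"}, "b": {"B"}, "c": {"B", "C"}}
```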
371
372
373       COMPOUNDMIN num
374              Minimum  length of words used for compounding.  Default value is
375              3 letters.
376
377       COMPOUNDFLAG flag
              Words signed with COMPOUNDFLAG may be in compound words
              (except when the word is shorter than COMPOUNDMIN). Affixes
              with COMPOUNDFLAG also permit compounding of affixed words.
381
382       COMPOUNDBEGIN flag
383              Words signed with COMPOUNDBEGIN (or with a signed affix) may  be
384              first elements in compound words.
385
386       COMPOUNDLAST flag
387              Words  signed  with COMPOUNDLAST (or with a signed affix) may be
388              last elements in compound words.
389
390       COMPOUNDMIDDLE flag
391              Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
392              middle elements in compound words.
393
394       ONLYINCOMPOUND flag
              Suffixes signed with the ONLYINCOMPOUND flag may occur only
              inside compounds (Fuge-elements in German, fogemorphemes in
              Swedish). The ONLYINCOMPOUND flag also works with words (see
              tests/onlyincompound.*). Note: also valuable for flagging
              compounding parts that are not correct as words by
              themselves.
400
401       COMPOUNDPERMITFLAG flag
402              Prefixes are allowed at the beginning of compounds, suffixes are
403              allowed at the end of compounds by default.  Affixes  with  COM‐
404              POUNDPERMITFLAG may be inside of compounds.
405
406       COMPOUNDFORBIDFLAG flag
407              Suffixes with this flag forbid compounding of the affixed word.
408
409       COMPOUNDMORESUFFIXES
410              Allow twofold suffixes within compounds.
411
412       COMPOUNDROOT flag
413              COMPOUNDROOT  flag signs the compounds in the dictionary (Now it
414              is used only in the Hungarian language specific code).
415
416       COMPOUNDWORDMAX number
417              Set maximum word count in a compound word.  (Default  is  unlim‐
418              ited.)
419
420       CHECKCOMPOUNDDUP
421              Forbid word duplication in compounds (e.g. foofoo).
422
423       CHECKCOMPOUNDREP
424              Forbid  compounding, if the (usually bad) compound word may be a
425              non compound word with a REP fault. Useful  for  languages  with
426              `compound friendly' orthography.
427
428       CHECKCOMPOUNDCASE
429              Forbid upper case characters at word boundaries in compounds.
430
431       CHECKCOMPOUNDTRIPLE
432              Forbid  compounding,  if compound word contains triple repeating
433              letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
434              ter  support in UTF-8 encoding (works only for 7-bit ASCII char‐
435              acters).
436
437       SIMPLIFIEDTRIPLE
438              Allow simplified 2-letter forms of the  compounds  forbidden  by
439              CHECKCOMPOUNDTRIPLE.  It's useful for Swedish and Norwegian (and
440              for the old German orthography: Schiff|fahrt -> Schiffahrt).
441
442       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
443
444       CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
445              Forbid compounding, if the first word in the compound ends  with
446              endchars,  and next word begins with beginchars and (optionally)
447              they have the requested flags.  The optional replacement parame‐
448              ter allows simplified compound form.
449
450              The  special  "endchars" pattern 0 (zero) limits the rule to the
451              unmodified stems (stems and stems with zero affixes):
452
453              CHECKCOMPOUNDPATTERN 0/x /y
454
       Note: COMPOUNDMIN doesn't work correctly with the compound word
       alternation, so you may need to set COMPOUNDMIN to a lower value.
457
458       FORCEUCASE flag
              A last word part of a compound with the FORCEUCASE flag forces
              capitalization of the whole compound word. E.g. the Dutch
              word "straat" (street) with the FORCEUCASE flag will be
              allowed only in capitalized compound forms, according to the
              Dutch spelling rules for proper names.
464
465       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian. The first
              parameter is the maximum number of syllables that a compound
              may contain if it has more words than COMPOUNDWORDMAX. The
              second parameter is the list of vowels (for counting
              syllables).
470
471       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.
473

AFFIX FILE OPTIONS FOR AFFIX CREATION

475       PFX flag cross_product number
476
477       PFX flag stripping prefix [condition [morphological_fields...]]
478
479       SFX flag cross_product number
480
481       SFX flag stripping suffix [condition [morphological_fields...]]
              An affix is either a prefix or a suffix attached to root words
              to make other words. We can define affix classes with an
              arbitrary number of affix rules. Affix classes are signed with
              affix flags. The first line of an affix class definition is
              the header. The fields of an affix class header:
487
488              (0) Option name (PFX or SFX)
489
490              (1) Flag (name of the affix class)
491
492              (2) Cross product (permission to combine prefixes and suffixes).
493              Possible values: Y (yes) or N (no)
494
495              (3) Line count of the following rules.
496
              The fields of an affix rule:
498
499              (0) Option name
500
501              (1) Flag
502
503              (2) stripping characters from beginning (at prefix rules) or end
504              (at suffix rules) of the word
505
506              (3) affix (optionally with flags of continuation classes,  sepa‐
507              rated by a slash)
508
509              (4) condition.
510
              Zero stripping or a zero affix is indicated by zero. A zero
              condition is indicated by a dot. The condition is a
              simplified, regular expression-like pattern that must be met
              before the affix can be applied. (A dot stands for an
              arbitrary character. Characters in square brackets stand for
              an arbitrary character from that character subset. A dash has
              no special meaning, but a circumflex (^) after the opening
              bracket selects the complement of the character set.)
518
              (5) Optional morphological fields separated by spaces or tabs.
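
       The rule fields listed above can be split and applied as in the
       following Python sketch. It is an illustration under stated
       assumptions, not Hunspell's implementation: the condition is handed
       to Python's re module unchanged, which works for the simple
       conditions used in this manual but treats a dash inside brackets as
       a range. The names parse_rule and apply_rule are hypothetical.

```python
import re

def parse_rule(line):
    """Split an affix rule line into its fields."""
    fields = line.split()
    option, flag, strip, affix = fields[:4]
    condition = fields[4] if len(fields) > 4 else "."
    affix, _, continuation = affix.partition("/")
    return {
        "option": option,
        "flag": flag,
        "strip": "" if strip == "0" else strip,    # 0 means nothing to strip
        "affix": "" if affix == "0" else affix,    # 0 means empty affix
        "continuation": continuation,              # flags after the slash
        "condition": condition,
        "morph": fields[5:],                       # optional morphological fields
    }

def apply_rule(rule, word):
    """Return the affixed form, or None if the rule does not apply."""
    strip, affix, cond = rule["strip"], rule["affix"], rule["condition"]
    if rule["option"] == "SFX":
        # Suffix: the condition is matched against the end of the word.
        if not (word.endswith(strip) and re.search("(?:" + cond + ")$", word)):
            return None
        return word[:len(word) - len(strip)] + affix
    # Prefix: the condition is matched against the start of the word.
    if not (word.startswith(strip) and re.match(cond, word)):
        return None
    return affix + word[len(strip):]
```

       For example, apply_rule(parse_rule("SFX B y ied y"), "try") returns
       "tried", and the rule "SFX B 0 ed [^y]" does not apply to "try".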
521
522

AFFIX FILE OTHER OPTIONS

524       CIRCUMFIX flag
525              Affixes signed with CIRCUMFIX flag may be on a  word  when  this
526              word  also  has a prefix with CIRCUMFIX flag and vice versa (see
527              circumfix.* test files in the source distribution).
528
529       FORBIDDENWORD flag
              This flag signs a forbidden word form. Because affixed forms
              are also forbidden, we can subtract a subset from the set of
              accepted affixed and compound words. Note: useful to forbid
              erroneous words generated by the compounding mechanism.
534
535       FULLSTRIP
              With FULLSTRIP, affix rules can strip whole words before
              adding the affixes, not only all but one of their characters
              (see the fullstrip.* test files in the source distribution).
              Note: conditions may be as long as the word even without
              FULLSTRIP.
540
541       KEEPCASE flag
542              Forbid uppercased and capitalized forms  of  words  signed  with
543              KEEPCASE  flags.  Useful for special orthographies (measurements
544              and currency often keep their  case  in  uppercased  texts)  and
545              writing  systems  (e.g.  keeping  lower case of IPA characters).
546              Also valuable for words erroneously written in the wrong case.
547
548              Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
549              CASE  flag  may  be  capitalized  and uppercased, but uppercased
550              forms of these words may not contain sharp s, only SS. See  ger‐
551              mancompounding  example  in  the tests directory of the Hunspell
552              distribution.
553
554
555       ICONV number_of_ICONV_definitions
556
557       ICONV pattern pattern2
558              Define input conversion table.  Note: useful to convert one type
559              of quote to another one, or change ligature.
560
561       OCONV number_of_OCONV_definitions
562
563       OCONV pattern pattern2
564              Define output conversion table.
565
566       LEMMA_PRESENT flag
567              Deprecated. Use "st:" field instead of LEMMA_PRESENT.
568
569       NEEDAFFIX flag
              This flag signs virtual stems in the dictionary: words that
              are valid only when affixed, unless the dictionary word has a
              homonym or a zero affix. NEEDAFFIX also works with prefixes
              and prefix + suffix combinations (see tests/pseudoroot5.*).
574
575       PSEUDOROOT flag
576              Deprecated. (Former name of the NEEDAFFIX option.)
577
578       SUBSTANDARD flag
579              SUBSTANDARD flag signs affix rules and dictionary  words  (allo‐
580              morphs)  not used in morphological generation (and in suggestion
581              in the future versions). See also NOSUGGEST.
582
583       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters. For example, dot,
              dash, n-dash, numbers and the percent sign are word
              characters in Hungarian.
587
588       CHECKSHARPS
589              SS letter pair in uppercased (German) words may  be  upper  case
590              sharp  s  (ß).  Hunspell can handle this special casing with the
591              CHECKSHARPS declaration (see also KEEPCASE flag  and  tests/ger‐
592              mancompounding example) in both spelling and suggestion.
593
594

Morphological analysis

596       Hunspell's  dictionary items and affix rules may have optional space or
597       tabulator separated  morphological  description  fields,  started  with
598       3-character (two letters and a colon) field IDs:
599
600
601               word/flags po:noun is:nom
602
       Example: We define a simple resource with morphological information,
       a derivative suffix (ds:) and a part of speech category (po:):
605
606       Affix file:
607
608
609               SFX X Y 1
610               SFX X 0 able . ds:able
611
612       Dictionary file:
613
614
615               drink/X po:verb
616
617       Test file:
618
619
620               drink
621               drinkable
622
623       Test:
624
625
626               $ analyze test.aff test.dic test.txt
627               > drink
628               analyze(drink) = po:verb
629               stem(drink) = po:verb
630               > drinkable
631               analyze(drinkable) = po:verb ds:able
632               stem(drinkable) = drinkable
633
634       You can see in the example, that the analyzer concatenates the  morpho‐
635       logical fields in item and arrangement style.
636
637

Optional data fields

639       Default  morphological  and other IDs (used in suggestion, stemming and
640       morphological generation):
641
642       ph:    Alternative transliteration for better suggestion.  It's  useful
643              for words with foreign pronunciation. (Dictionary based phonetic
644              suggestion.)  For example:
645
646
647              Marseille ph:maarsayl
648
649       st:    Stem. Optional: default stem is the dictionary item  in  morpho‐
650              logical  analysis.  Stem field is useful for virtual stems (dic‐
651              tionary words with NEEDAFFIX flag) and morphological  exceptions
652              instead of new, single used morphological rules.
653
654              feet  st:foot  is:plural
655              mice  st:mouse is:plural
656              teeth st:tooth is:plural
657
658       Word forms with multiple stems need multiple dictionary items:
659
660
661              lay po:verb st:lie is:past_2
662              lay po:verb is:present
663              lay po:noun
664
665       al:    Allomorph(s).  A  dictionary item is the stem of its allomorphs.
666              Morphological generation needs stem, allomorph and affix fields.
667
668              sing al:sang al:sung
669              sang st:sing
670              sung st:sing
671
672       po:    Part of speech category.
673
674       ds:    Derivational suffix(es).  Stemming doesn't  remove  derivational
675              suffixes.   Morphological generation depends on the order of the
676              suffix fields.
677
678              In affix rules:
679
680
681              SFX Y Y 1
682              SFX Y 0 ly . ds:ly_adj
683
684       In the dictionary:
685
686
687              ably st:able ds:ly_adj
688              able al:ably
689
690       is:    Inflectional suffix(es).  All inflectional suffixes are  removed
691              by  stemming.   Morphological generation depends on the order of
692              the suffix fields.
693
694
695              feet st:foot is:plural
696
697       ts:    Terminal suffix(es).  Terminal suffix  fields  are  inflectional
698              suffix fields "removed" by additional (not terminal) suffixes.
699
700              Useful  for  zero  morphemes  and  affixes  removed by splitting
701              rules.
702
703
704              work/D ts:present
705
706              SFX D Y 2
707              SFX D   0 ed . is:past_1
708              SFX D   0 ed . is:past_2
709
710       Typical example of the terminal suffix is the zero morpheme of the nom‐
711       inative case.
712
713
714       sp:    Surface  prefix.  Temporary  solution for adding prefixes to the
715              stems and generated word forms. See tests/morph.* example.
716
717
718       pa:    Parts of the compound  words.  Output  fields  of  morphological
719              analysis for stemming.
720
721       dp:    Planned: derivational prefix.
722
723       ip:    Planned: inflectional prefix.
724
725       tp:    Planned: terminal prefix.
726
727

Twofold suffix stripping

       Ispell's original algorithm strips only one suffix. Hunspell can
       strip a second one (or one more prefix in COMPLEXPREFIXES mode).

       Twofold suffix stripping is a significant improvement in handling
       the immense number of suffixes that characterize agglutinative
       languages.
735
736       A second `s' suffix (affix class Y) will be the continuation  class  of
737       the suffix `able' in the following example:
738
739
740               SFX Y Y 1
741               SFX Y 0 s .
742
743               SFX X Y 1
744               SFX X 0 able/Y .
745
746       Dictionary file:
747
748
749               drink/X
750
751       Test file:
752
753
754               drink
755               drinkable
756               drinkables
757
758       Test:
759
760
761               $ hunspell -m -d test <test.txt
762               drink st:drink
763               drinkable st:drink fl:X
764               drinkables st:drink fl:X fl:Y
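
       Continuation classes can be viewed from the generation side, the
       reverse of stripping. The following Python sketch lists the forms
       that the example rules license with at most two stacked suffixes.
       It is a simplified model, not Hunspell's implementation, and the
       table layout and function name are hypothetical.

```python
# Suffix classes from the example: flag -> list of (suffix, continuation flags).
SUFFIX_CLASSES = {
    "Y": [("s", "")],
    "X": [("able", "Y")],
}

def generate(stem, flags, depth=2):
    """List the stem and its forms with up to `depth` stacked suffixes."""
    forms = [stem]
    if depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation in SUFFIX_CLASSES.get(flag, []):
            # The continuation class supplies the flags for the next level.
            forms.extend(generate(stem + suffix, continuation, depth - 1))
    return forms
```

       Here generate("drink", "X") yields the three words of the test file.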
765
       Theoretically, twofold suffix stripping needs only the square root of
       the number of suffix rules that an implementation with single suffix
       stripping would need. In our practice, twofold suffix stripping let
       us elaborate the Hungarian inflectional morphology.
770
771

Extended affix classes

       Hunspell can handle more than 65000 affix classes. There are three
       new syntaxes for giving flags in affix and dictionary files.
775
776       FLAG long command sets 2-character flags:
777
778
779                FLAG long
780                SFX Y1 Y 1
781                SFX Y1 0 s .
782
783       Dictionary record with the Y1, Z3, F? flags:
784
785
786                foo/Y1Z3F?
787
788       The FLAG num command sets numerical flags separated by commas:
789
790
791                FLAG num
792                SFX 65000 Y 1
793                SFX 65000 0 s .
794
795       Dictionary example:
796
797
798                foo/65000,12,2756
799
800       The third syntax, FLAG UTF-8, uses Unicode characters as flags.
801
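       How a flag string is split depends on the FLAG setting. The following
       sketch illustrates the difference; parse_flags, split_entry and the
       mode parameter are hypothetical names, and escaped slashes and
       morphological fields are not handled.

```python
def parse_flags(flag_string, mode="short"):
    """Split the flag part of a dictionary entry into single flags.

    mode is a hypothetical parameter mirroring the FLAG setting:
    "short" (one character per flag; decoded Unicode character flags
    split the same way), "long" (two characters per flag) or
    "num" (comma-separated decimal numbers).
    """
    if mode == "long":
        if len(flag_string) % 2:
            raise ValueError("long flags need an even number of characters")
        return [flag_string[i:i + 2] for i in range(0, len(flag_string), 2)]
    if mode == "num":
        return [int(part) for part in flag_string.split(",")]
    return list(flag_string)


def split_entry(line, mode="short"):
    """Split 'word/flags' into (word, [flags])."""
    word, _, flags = line.partition("/")
    return word, (parse_flags(flags, mode) if flags else [])
```

       With these definitions, split_entry("foo/Y1Z3F?", "long") gives
       ("foo", ["Y1", "Z3", "F?"]) and split_entry("foo/65000,12,2756", "num")
       gives ("foo", [65000, 12, 2756]).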
802

Homonyms

804       Hunspell's dictionary can contain repeating elements that are homonyms:
805
806
807               work/A    po:verb
808               work/B    po:noun
809
810       An affix file:
811
812
813               SFX A Y 1
814               SFX A 0 s . is:sg3
815
816               SFX B Y 1
817               SFX B 0 s . is:plur
818
819       Test file:
820
821
822               works
823
824       Test:
825
826
827               $ hunspell -d test -m <testwords
828               works st:work po:verb is:sg3
829               works st:work po:noun is:plur
830
831       This  feature also gives a way to forbid illegal prefix/suffix combina‐
832       tions.
833
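       The effect of repeated dictionary entries can be sketched as follows.
       The data layout (ENTRIES, SUFFIXES) and the function analyses are
       hypothetical and only mirror the work/A and work/B example above; the
       point is that each homonym is analysed independently with its own
       flags.

```python
# Hypothetical in-memory form of the example; not Hunspell's data model.
ENTRIES = [
    ("work", "A", "po:verb"),
    ("work", "B", "po:noun"),
]
# class -> (suffix text, morphological field added by the suffix)
SUFFIXES = {"A": ("s", "is:sg3"), "B": ("s", "is:plur")}


def analyses(word):
    """Return one analysis string per homonym that accepts the word."""
    results = []
    for stem, flags, morph in ENTRIES:
        if word == stem:
            results.append(f"st:{stem} {morph}")
        for flag in flags:
            suffix, suffix_morph = SUFFIXES[flag]
            if word == stem + suffix:
                results.append(f"st:{stem} {morph} {suffix_morph}")
    return results
```

       Here analyses("works") yields two results, one for the verb with
       is:sg3 and one for the noun with is:plur.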
834

Prefix--suffix dependencies

836       An interesting side effect of multi-step stripping is that the
837       appropriate treatment of circumfixes now comes for free.  For instance,
838       in Hungarian, superlatives are formed by simultaneous prefixation of
839       leg- and suffixation of -bb to the adjective base.  A problem with the
840       one-level architecture is that there is no way to make the lexical
841       licensing of particular prefixes and suffixes interdependent, and
842       therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
843       vén `old'.  Until the introduction of clusters, a special treatment of
844       the superlative had to be hardwired in the earlier Hunspell code.  This
845       may have been legitimate for a single case, but in fact prefix--suffix
846       dependencies are ubiquitous in category-changing derivational patterns
847       (cf. English payable, non-payable but *non-pay or drinkable,
848       undrinkable but *undrink).  Put simply, the prefix un- is legitimate
849       here only if the base drink is suffixed with -able.  If both these
850       patterns are handled by on-line affix rules and affix rules are checked
851       against the base only, there is no way to express this dependency and
852       the system will necessarily over- or undergenerate.
853
854       In the next example, suffix class R has a prefix `continuation' class
855       (class P).
856
857
858              PFX P Y 1
859              PFX P   0 un . [prefix_un]+
860
861              SFX S Y 1
862              SFX S   0 s . +PL
863
864              SFX Q Y 1
865              SFX Q   0 s . +3SGV
866
867              SFX R Y 1
868              SFX R   0 able/PS . +DER_V_ADJ_ABLE
869
870       Dictionary:
871
872
873              2
874              drink/RQ  [verb]
875              drink/S   [noun]
876
877       Morphological analysis:
878
879
880              > drink
881              drink[verb]
882              drink[noun]
883              > drinks
884              drink[verb]+3SGV
885              drink[noun]+PL
886              > drinkable
887              drink[verb]+DER_V_ADJ_ABLE
888              > drinkables
889              drink[verb]+DER_V_ADJ_ABLE+PL
890              > undrinkable
891              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
892              > undrinkables
893              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
894              > undrink
895              Unknown word.
896              > undrinks
897              Unknown word.
898
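       The dependency in the example above can be illustrated by generating
       every form the rules license. This is a simplified sketch under the
       assumption that a prefix is allowed when either the stem or the
       attached suffix carries its class; the names PREFIXES, SUFFIXES,
       ENTRIES and generate_forms are hypothetical, and conditions are
       ignored.

```python
# Hypothetical model of the example above; not Hunspell's implementation.
PREFIXES = {"P": "un"}
# class -> (suffix text, continuation classes carried by the suffix)
SUFFIXES = {"S": ("s", ""), "Q": ("s", ""), "R": ("able", "PS")}
ENTRIES = [("drink", "RQ"), ("drink", "S")]  # homonyms: verb, noun


def generate_forms():
    """Return every word form licensed by the stems and affix classes."""
    forms = set()
    for stem, flags in ENTRIES:
        stem_prefixes = [PREFIXES[f] for f in flags if f in PREFIXES]
        forms.add(stem)
        forms.update(p + stem for p in stem_prefixes)
        for cls in flags:
            if cls not in SUFFIXES:
                continue
            suffix, cont = SUFFIXES[cls]
            # Prefixes licensed by the stem or by this suffix.
            prefixes = stem_prefixes + [PREFIXES[c] for c in cont if c in PREFIXES]
            # The suffixed form, plus forms with a second (continuation) suffix.
            bases = [stem + suffix]
            bases += [stem + suffix + SUFFIXES[c][0] for c in cont if c in SUFFIXES]
            for base in bases:
                forms.add(base)
                forms.update(p + base for p in prefixes)
    return forms
```

       The resulting set contains undrinkable and undrinkables but neither
       undrink nor undrinks, because class P is reachable only through the
       suffix `able'.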

Circumfix

900       Conditional affixes implemented by a continuation class are not enough
901       for circumfixes, because a circumfix is a single affix in morphology.
902       The CIRCUMFIX option is also needed for correct morphological analysis.
903
904
905              # circumfixes: ~ obligate prefix/suffix combinations
906              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
907              # nagy, nagyobb, legnagyobb, legeslegnagyobb
908              # (great, greater, greatest, most greatest)
909
910              CIRCUMFIX X
911
912              PFX A Y 1
913              PFX A 0 leg/X .
914
915              PFX B Y 1
916              PFX B 0 legesleg/X .
917
918              SFX C Y 3
919              SFX C 0 obb . +COMPARATIVE
920              SFX C 0 obb/AX . +SUPERLATIVE
921              SFX C 0 obb/BX . +SUPERSUPERLATIVE
922
923       Dictionary:
924
925
926              1
927              nagy/C    [MN]
928
929       Analysis:
930
931
932              > nagy
933              nagy[MN]
934              > nagyobb
935              nagy[MN]+COMPARATIVE
936              > legnagyobb
937              nagy[MN]+SUPERLATIVE
938              > legeslegnagyobb
939              nagy[MN]+SUPERSUPERLATIVE
940
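       The Hungarian example above can be approximated as follows. This is a
       sketch under the assumption that a suffix rule carrying the CIRCUMFIX
       flag is valid only together with a prefix that it licenses and that
       carries the same flag; all names are hypothetical, and stripping and
       conditions are ignored.

```python
# Hypothetical model of the circumfix example; not Hunspell's code.
CIRCUMFIX = "X"
# prefix class -> (prefix text, flags carried by the prefix)
PREFIXES = {"A": ("leg", "X"), "B": ("legesleg", "X")}
# Rules of suffix class C: (suffix text, continuation flags, tag)
SUFFIX_C = [
    ("obb", "", "+COMPARATIVE"),
    ("obb", "AX", "+SUPERLATIVE"),
    ("obb", "BX", "+SUPERSUPERLATIVE"),
]
DICTIONARY = {"nagy": ("C", "[MN]")}  # stem -> (flags, tag)


def analyze(word):
    """Return the analyses of word; an empty list means it is rejected."""
    results = []
    for stem, (flags, tag) in DICTIONARY.items():
        if word == stem:
            results.append(stem + tag)
        if "C" not in flags:
            continue
        for suffix, cont, suffix_tag in SUFFIX_C:
            if CIRCUMFIX not in cont:
                # Plain suffix rule: valid without any prefix.
                if word == stem + suffix:
                    results.append(stem + tag + suffix_tag)
                continue
            # Circumfix half: requires a licensed prefix with the same flag.
            for cls, (prefix, prefix_flags) in PREFIXES.items():
                if (cls in cont and CIRCUMFIX in prefix_flags
                        and word == prefix + stem + suffix):
                    results.append(stem + tag + suffix_tag)
    return results
```

       In this model analyze("legnagyobb") returns ["nagy[MN]+SUPERLATIVE"],
       while analyze("legnagy") returns an empty list.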

Compounds

942       Allowing free compounding decreases the precision of recognition,
943       not to mention stemming and morphological analysis.   Although  lexical
944       switches are introduced to license compounding of bases by Ispell, this
945       proves not to be restrictive enough. For example:
946
947
948              # affix file
949              COMPOUNDFLAG X
950
951              2
952              foo/X
953              bar/X
954
955       With this resource, foobar and barfoo are also accepted words.
956
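       Free compounding with a single flag amounts to a recursive split of
       the input into flagged dictionary words. The sketch below assumes a
       minimum part length of 3 (the documented COMPOUNDMIN default); the
       entry baz and the function is_compound are hypothetical additions for
       illustration.

```python
# Hypothetical model of COMPOUNDFLAG; not Hunspell's implementation.
COMPOUNDFLAG = "X"
DICTIONARY = {"foo": "X", "bar": "X", "baz": ""}  # baz: no compound flag
COMPOUNDMIN = 3  # minimum length of a compound part


def is_compound(word, parts=0):
    """True if word splits into two or more COMPOUNDFLAG-marked parts."""
    if not word:
        return parts >= 2
    for end in range(COMPOUNDMIN, len(word) + 1):
        head = word[:end]
        if (COMPOUNDFLAG in DICTIONARY.get(head, "")
                and is_compound(word[end:], parts + 1)):
            return True
    return False
```

       Both is_compound("foobar") and is_compound("barfoo") are true, which
       shows why a single flag cannot restrict the order of the parts.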
957       This has been improved upon with the introduction  of  direction-sensi‐
958       tive compounding, i.e., lexical features can specify separately whether
959       a base can occur as leftmost or  rightmost  constituent  in  compounds.
960       This,  however,  is still insufficient to handle the intricate patterns
961       of compounding, not to mention idiosyncratic  (and  language  specific)
962       norms of hyphenation.
963
964       The MySpell algorithm allows any affixed form of words that are
965       lexically marked as potential members of compounds.  Hunspell improves
966       on this: its recursive compound checking rules make it possible to
967       implement the intricate spelling conventions of Hungarian compounds.
968       For example, the noteworthy Hungarian `6-3' rule can be expressed with
969       the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
970       options.  As a further example, Hungarian derivational suffixes often
971       modify compounding properties.  Hunspell allows compounding flags on
972       affixes, and there are two special flags (COMPOUNDPERMITFLAG and
973       COMPOUNDFORBIDFLAG) to permit or prohibit compounding of derivations.
974
975       Suffixes with COMPOUNDFORBIDFLAG forbid compounding of the affixed word.
976
977       We also need several Hunspell features for handling German compounding:
978
979
980              # German compounding
981
982              # set language to handle special casing of German sharp s
983
984              LANG de_DE
985
986              # compound flags
987
988              COMPOUNDBEGIN U
989              COMPOUNDMIDDLE V
990              COMPOUNDEND W
991
992              # Prefixes are allowed at the beginning of compounds,
993              # suffixes are allowed at the end of compounds by default:
994              # (prefix)?(root)+(affix)?
995              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
996              COMPOUNDPERMITFLAG P
997
998              # for German linking morphemes (Fugenelemente)
999              # Hint: ONLYINCOMPOUND is not required everywhere, but the
1000              # checking will be a little faster with it.
1001
1002              ONLYINCOMPOUND X
1003
1004              # forbid uppercase characters at compound word bounds
1005              CHECKCOMPOUNDCASE
1006
1007              # for handling Fuge-elements with dashes (Arbeits-)
1008              # dash will be a special word
1009
1010              COMPOUNDMIN 1
1011              WORDCHARS -
1012
1013              # compound settings and linking morpheme for `Arbeit'
1014
1015              SFX A Y 3
1016              SFX A 0 s/UPX .
1017              SFX A 0 s/VPDX .
1018              SFX A 0 0/WXD .
1019
1020              SFX B Y 2
1021              SFX B 0 0/UPX .
1022              SFX B 0 0/VWXDP .
1023
1024              # a suffix for `Computer'
1025
1026              SFX C Y 1
1027              SFX C 0 n/WD .
1028
1029              # to forbid exceptions (*Arbeitsnehmer)
1030
1031              FORBIDDENWORD Z
1032
1033              # dash prefix for compounds with dash (Arbeits-Computer)
1034
1035              PFX - Y 1
1036              PFX - 0 -/P .
1037
1038              # decapitalizing prefix
1039              # circumfix for positioning in compounds
1040
1041              PFX D Y 29
1042              PFX D A a/PX A
1043              PFX D Ä ä/PX Ä
1044               .
1045               .
1046              PFX D Y y/PX Y
1047              PFX D Z z/PX Z
1048
1049       Example dictionary:
1050
1051
1052              4
1053              Arbeit/A-
1054              Computer/BC-
1055              -/W
1056              Arbeitsnehmer/Z
1057
1058       Accepted compound words with the previous resource:
1059
1060
1061              Computer
1062              Computern
1063              Arbeit
1064              Arbeits-
1065              Computerarbeit
1066              Computerarbeits-
1067              Arbeitscomputer
1068              Arbeitscomputern
1069              Computerarbeitscomputer
1070              Computerarbeitscomputern
1071              Arbeitscomputerarbeit
1072              Computerarbeits-Computer
1073              Computerarbeits-Computern
1074
1075       Compounds that are not accepted:
1076
1077
1078              computer
1079              arbeit
1080              Arbeits
1081              arbeits
1082              ComputerArbeit
1083              ComputerArbeits
1084              Arbeitcomputer
1085              ArbeitsComputer
1086              Computerarbeitcomputer
1087              ComputerArbeitcomputer
1088              ComputerArbeitscomputer
1089              Arbeitscomputerarbeits
1090              Computerarbeits-computer
1091              Arbeitsnehmer
1092
1093       This solution is still not ideal, however, and will be  replaced  by  a
1094       pattern-based  compound-checking  algorithm which is closely integrated
1095       with input buffer tokenization. Patterns describing compounds come as a
1096       separate input resource that can refer to high-level properties of con‐
1097       stituent parts (e.g. the number of syllables, affix flags, and contain‐
1098       ment  of hyphens). The patterns are matched against potential segmenta‐
1099       tions of compounds to assess wellformedness.
1100
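       The positional part of the German example above (COMPOUNDBEGIN U,
       COMPOUNDMIDDLE V, COMPOUNDEND W) can be sketched with a table of
       already affixed and decapitalized forms. The PARTS table is a
       hand-made, hypothetical expansion of the SFX A, B and C rules; dashes,
       CHECKCOMPOUNDCASE and FORBIDDENWORD are deliberately left out.

```python
# Hypothetical expansion of the German rules into positional flags:
# U = may begin a compound, V = may stand in the middle, W = may end it.
PARTS = {
    "Arbeits": {"U"},
    "arbeits": {"V"},
    "arbeit": {"W"},
    "Computer": {"U"},
    "computer": {"V", "W"},
    "computern": {"W"},
}


def check(word, first=True):
    """True if word is a compound whose parts respect their positions."""
    for end in range(1, len(word) + 1):
        head, rest = word[:end], word[end:]
        flags = PARTS.get(head)
        if flags is None:
            continue
        if first:
            # A compound needs at least one more part after the first.
            if rest and "U" in flags and check(rest, False):
                return True
        elif not rest:
            if "W" in flags:
                return True
        elif "V" in flags and check(rest, False):
            return True
    return False
```

       This accepts Computerarbeit and Arbeitscomputern and rejects
       Arbeitcomputer and Arbeitscomputerarbeits, in line with the lists
       above.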
1101

Unicode character encoding

1103       Both Ispell and MySpell use 8-bit character encodings, which is a major
1104       deficiency when it comes to scalability.  Although a language like
1105       Hungarian has a standard 8-bit character set (ISO 8859-2), it fails to
1106       allow a full implementation of Hungarian orthographic conventions.  For
1107       instance, the '--' symbol (en dash) is missing from this character set,
1108       even though it is not only the official symbol to delimit parenthetic
1109       clauses in the language, but can also appear in compound words as a
1110       special 'big' hyphen.
1111
1112       MySpell has some 8-bit encoding tables, but there are also languages
1113       without a standard 8-bit encoding.  For example, many African languages
1114       have non-Latin or extended Latin characters.
1115
1116       Similarly, using the original spelling of certain foreign names like
1117       Ångström or Molière is encouraged by the Hungarian spelling norm.  Since
1118       the characters 'Å' and 'è' are not part of ISO 8859-2, when they combine
1119       with inflections containing characters only in ISO 8859-2 (like
1120       elative -ből, ablative -től or delative -ről with double acute), the
1121       results are words (like Ångströmről or Molière-től) that cannot be
1122       encoded using any single 8-bit encoding scheme.
1123
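       The encoding problem described above is easy to reproduce. The sketch
       below uses Python's standard codecs to test which encodings can
       represent the two example words; encodable and report are hypothetical
       helper names.

```python
def encodable(word, codec):
    """True if every character of word exists in the given codec."""
    try:
        word.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


WORDS = ["Ångströmről", "Molière-től"]
CODECS = ["iso8859_1", "iso8859_2", "utf-8"]


def report(words=WORDS, codecs=CODECS):
    """Map each word to the list of codecs that can encode it."""
    return {w: [c for c in codecs if encodable(w, c)] for w in words}
```

       For both words only utf-8 remains: ISO 8859-1 lacks `ő', and
       ISO 8859-2 lacks `Å' and `è'.
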
1124       The problems raised by 8-bit encodings have long been recognized by
1125       proponents of Unicode.  It is clear that trading efficiency for
1126       encoding independence has its advantages when it comes to a truly
1127       multilingual application.  Hunspell implements memory- and
1128       time-efficient Unicode handling.  In non-UTF-8 character encodings,
1129       Hunspell works with the original 8-bit strings.  In UTF-8 encoding,
1130       affixes and words are stored in UTF-8 and are mostly handled in UTF-8
1131       during analysis; for condition checking and suggestion they are
1132       converted to UTF-16.  Unicode text analysis and spell checking have a
1133       minimal (0-20%) time overhead and a minimal or reasonable memory
1134       overhead, depending on the language (its UTF-8 encoding and affixation).
1135
1136

Conversion of aspell dictionaries

1138       Aspell dictionaries can easily be converted to the Hunspell format.
1139       Conversion steps:
1140
1141       dictionary (xx.cwl -> xx.wl):
1142
1143       preunzip xx.cwl
1144       wc -l < xx.wl > xx.dic
1145       cat xx.wl >> xx.dic
1146
1147       affix file
1148
1149       If the affix file exists, copy it:
1150       cp xx_affix.dat xx.aff
1151       If not, create it with the suitable character encoding (see xx.dat)
1152       echo "SET ISO8859-x" > xx.aff
1153       or
1154       echo "SET UTF-8" > xx.aff
1155
1156       It is useful to add a TRY option listing the characters of the
1157       dictionary in frequency order, to guide edit distance suggestions:
1158       echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
1159
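       The dictionary and TRY steps above can also be scripted. The following
       sketch builds the .dic text (word count on the first line) and a TRY
       line ordered by character frequency from a word list; make_dic and
       make_try are hypothetical helpers, not part of Hunspell or Aspell.

```python
from collections import Counter


def make_dic(words):
    """Return .dic text: the word count on the first line, then the words."""
    words = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(words)), *words]) + "\n"


def make_try(words):
    """Return a TRY line listing characters by descending frequency."""
    counts = Counter("".join(words))
    # Ties are broken by character order to keep the output deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "TRY " + "".join(ordered)
```

       For example, make_dic(["foo", "bar"]) returns "2\nfoo\nbar\n", and
       make_try(["aab", "ab"]) returns "TRY ab".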
1160