hunspell(5)

hunspell(5)                   File Formats Manual                  hunspell(5)


NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define how a language is spell
       checked: a dictionary file containing words and their applicable
       flags, and an affix file that specifies how these flags control
       spell checking.  A personal dictionary file is optional.

Dictionary file

       A dictionary file (*.dic) contains a list of words, one per line.  The
       first line of a dictionary (except a personal dictionary) contains
       the approximate word count (for optimal hash memory size).  Each word
       may optionally be followed by a slash ("/") and one or more flags,
       which represent word attributes, for example affixes.

       Note: Dictionary words can also contain slashes when they are escaped
       with the "\/" syntax.
24
       It is worth adding not only words but also word pairs to the
       dictionary to get correct suggestions for common misspellings with a
       missing space, as in the following example for the incorrect "alot"
       and "inspite" (see also "REP" and the "ph:" field about correct
       suggestions for common misspellings):

32              3
33              word
34              a lot
35              in spite
36

Personal dictionary file

       Personal dictionaries are simple word lists.  An asterisk in the first
       character position marks a forbidden word.  A second word separated
       by a slash is a model word whose affixation the first word inherits.

43              foo
44              Foo/Simpson
45              *bar
46
       In this example, "foo" and "Foo" are personal words, "Foo" is also
       recognized with the affixes of "Simpson" ("Foo's" etc.), and "bar" is
       a forbidden word.


Short example

53       Dictionary file:
54
55              3
56              hello
57              try/B
58              work/AB
59
60       The flags B and A specify attributes of these words.
61
62       Affix file:
63
64
65              SET UTF-8
66              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
67
68              REP 2
69              REP f ph
70              REP ph f
71
72              PFX A Y 1
73              PFX A 0 re .
74
75              SFX B Y 2
76              SFX B 0 ed [^y]
77              SFX B y ied y
78
       The affix file defines prefix class A and suffix class B.  Class A
       defines a `re-' prefix.  Class B defines two `-ed' suffixes: the first
       can be added to a word whose last character isn't `y'; the second
       replaces a final `y' with `ied'.

       The words accepted by this dictionary and affix combination are:
       "hello", "try", "tried", "work", "worked", "rework", "reworked".
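
       The expansion above can be reproduced with a minimal Python sketch.
       It is an illustration only, not Hunspell's implementation: the names
       RULES, WORDS, apply_rule and expand are hypothetical, the rules are
       hard-coded from the short example, and the sketch assumes that every
       class allows cross products (both headers in the example use Y).

```python
import re

# Affix rules from the short example: (kind, flag, strip, affix, condition).
RULES = [
    ("PFX", "A", "", "re", "."),
    ("SFX", "B", "", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
# Dictionary entries from the short example: word -> flags.
WORDS = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(word, rule):
    """Return the affixed form, or None if the rule does not apply."""
    kind, _flag, strip, affix, cond = rule
    if kind == "PFX":
        # Prefix conditions are checked at the start of the word.
        if not re.match(cond, word) or not word.startswith(strip):
            return None
        return affix + word[len(strip):]
    # Suffix conditions are checked at the end of the word.
    if not re.search(cond + "$", word) or not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + affix


def expand(word, flags):
    """Generate the stem, its affixed forms, and prefix + suffix forms."""
    forms = {word}
    prefixes = [r for r in RULES if r[0] == "PFX" and r[1] in flags]
    suffixes = [r for r in RULES if r[0] == "SFX" and r[1] in flags]
    for rule in prefixes + suffixes:
        form = apply_rule(word, rule)
        if form is not None:
            forms.add(form)
    # Cross product (assumed allowed): prefix applied to each suffixed form.
    for s_rule in suffixes:
        suffixed = apply_rule(word, s_rule)
        if suffixed is None:
            continue
        for p_rule in prefixes:
            both = apply_rule(suffixed, p_rule)
            if both is not None:
                forms.add(both)
    return forms


accepted = set()
for word, flags in WORDS.items():
    accepted |= expand(word, flags)
print(sorted(accepted))
```

       The printed list contains the seven words listed above.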
86
87

AFFIX FILE GENERAL OPTIONS

89       Hunspell  source distribution contains more than 80 examples for option
90       usage.
91
92
93       SET encoding
94              Set character encoding of words and morphemes in affix and  dic‐
95              tionary  files.  Possible values: UTF-8, ISO8859-1 - ISO8859-10,
96              ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U,  cp1251,  ISCII-DEVANA‐
97              GARI.
98
99              SET UTF-8
100
101       FLAG value
              Set the flag type.  The default type is a single extended
              ASCII (8-bit) character.  The `UTF-8' value sets UTF-8 encoded
              Unicode character flags.  The `long' value sets two-character
              extended ASCII flags, and the `num' value sets decimal number
              flags.  Decimal flags are numbered from 1 to 65000 and are
              separated by commas in flag fields.  BUG: the UTF-8 flag type
              doesn't work on the ARM platform.
109
110              FLAG long
111
112       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping),
              e.g. for morphologically complex languages with a right-to-left
              writing system.
116
117
118       LANG langcode
119              Set  language  code for language-specific functions of Hunspell.
120              Use it to activate special casing of Azeri  (LANG  az),  Turkish
121              (LANG  tr)  and  Crimean  Tatar (LANG crh), also not generalized
122              syllable-counting compounding rules of Hungarian (LANG hu).
123
124
125       IGNORE characters
              Set characters to ignore in dictionary words, affixes and input
              words.  Useful for optional characters, such as Arabic (harakat)
              or Hebrew (niqqud) diacritical marks (see the tests/ignore.*
              test dictionary in the Hunspell distribution).
130
131
132       AF number_of_flag_vector_aliases
133
134       AF flag_vector
              Hunspell can replace affix flag sets with ordinal numbers
              (alias compression, see the makealias tool).  The dictionary of
              the short example with alias compression:
138
139              3
140              hello
141              try/1
142              work/2
143
144       AF definitions in the affix file:
145
146              AF 2
147              AF A
148              AF AB
149
       It is equivalent to the following dic file:
151
152              3
153              hello
154              try/A
155              work/AB
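
       The alias lookup can be sketched in Python as follows.  This is an
       illustration only: expand_aliases is a hypothetical helper, it assumes
       that every flag field in the dictionary is an alias number, and it
       ignores morphological fields and the AM table.

```python
def expand_aliases(dic_lines, af_lines):
    """Replace numeric flag aliases in .dic lines with the flag sets they name."""
    # af_lines[0] is the "AF <count>" header; alias n is the n-th line after it.
    aliases = [line.split()[1] for line in af_lines[1:]]
    expanded = []
    for line in dic_lines:
        word, sep, ref = line.partition("/")
        if sep and ref.isdigit():
            # Aliases are numbered from 1.
            line = word + "/" + aliases[int(ref) - 1]
        expanded.append(line)
    return expanded


dic = ["3", "hello", "try/1", "work/2"]
aff = ["AF 2", "AF A", "AF AB"]
print(expand_aliases(dic, aff))
```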
156
157       See also tests/alias* examples of the source distribution.
158
159       Note I: If affix file contains the FLAG parameter, define it before the
160       AF definitions.
161
162       Note II: Use makealias utility in Hunspell distribution to compress aff
163       and dic files.
164
165       AM number_of_morphological_aliases
166
167       AM morphological_fields
168              Hunspell  can  substitute  also  morphological data with ordinal
169              numbers in affix rules (alias  compression).   See  tests/alias*
170              examples.
171

AFFIX FILE OPTIONS FOR SUGGESTION

       Suggestion parameters can tune Hunspell's default suggestions: n-gram
       (a similarity search in the dictionary words based on common character
       sequences of 1 to 4 characters), character swap and character
       deletion.  REP is recommended for fixing typical and especially
       serious language-specific mistakes, because REP suggestions have the
       highest priority in the suggestion list.  PHONE is for languages
       whose orthography is not pronunciation-based.

       For short common misspellings, it is important to use the ph: field
       (see below) to give the best suggestions.
183
184       KEY characters_separated_by_vertical_line_optionally
              Hunspell searches for and suggests words that differ by one
              character replaced by a neighboring KEY character.  Groups of
              characters that are not neighbors are separated by vertical
              line characters in the KEY string.  Suggested KEY parameters
              for the QWERTY and Dvorak keyboard layouts:
189
190              KEY qwertyuiop|asdfghjkl|zxcvbnm
191              KEY pyfgcrl|aeouidhtns|qjkxbmwvz
192
       Using the first QWERTY layout, Hunspell suggests "nude" and "node" for
       "*nide".  A character may also appear in several groups, with
       different neighbors:
195
196              KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
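
       The neighbor substitution can be sketched in Python.  This is an
       illustration, not Hunspell's code: key_candidates and suggest are
       hypothetical names, and LEXICON is a made-up three-word dictionary
       used only to filter the candidates.

```python
QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
LEXICON = {"nude", "node", "hello"}  # hypothetical tiny dictionary


def key_candidates(word, key):
    """Replace each character by its left and right neighbors in KEY."""
    candidates = []
    for i, ch in enumerate(word):
        for group in key.split("|"):
            for j, g in enumerate(group):
                if g != ch:
                    continue
                # Neighbors are the adjacent characters within the same group.
                for k in (j - 1, j + 1):
                    if 0 <= k < len(group):
                        candidates.append(word[:i] + group[k] + word[i + 1:])
    return candidates


def suggest(word, key=QWERTY):
    """Keep only the candidates that are dictionary words."""
    return [c for c in key_candidates(word, key) if c in LEXICON]


print(suggest("nide"))  # ['nude', 'node']
```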
197
198       TRY characters
199              Hunspell can suggest right word forms, when they differ from the
200              bad  input  word  by  one TRY character. The parameter of TRY is
201              case sensitive.
202
203       NOSUGGEST flag
204              Words signed with NOSUGGEST flag are not  suggested  (but  still
205              accepted  when  typed  correctly).  Proposed flag for vulgar and
206              obscene words (see also SUBSTANDARD).
207
208       MAXCPDSUGS num
209              Set max. number of suggested compound words  generated  by  com‐
210              pound  rules.  The number of the suggested compound words may be
211              greater from the same 1-character distance type.
212
213       MAXNGRAMSUGS num
214              Set max. number of n-gram suggestions. Value 0 switches off  the
215              n-gram suggestions (see also MAXDIFF).
216
217       MAXDIFF [0-10]
218              Set  the similarity factor for the n-gram based suggestions (5 =
219              default value; 0 = fewer n-gram suggestions, but min.  1;  10  =
220              MAXNGRAMSUGS n-gram suggestions).
221
222       ONLYMAXDIFF
223              Remove  all  bad n-gram suggestions (default mode keeps one, see
224              MAXDIFF).
225
226       NOSPLITSUGS
227              Disable word suggestions with spaces.
228
229       SUGSWITHDOTS
230              Add dot(s) to suggestions, if input word terminates  in  dot(s).
231              (Not  for  LibreOffice  dictionaries, because LibreOffice has an
232              automatic dot expansion mechanism.)
233
234       REP number_of_replacement_definitions
235
236       REP what replacement
              This table specifies modifications to try first.  The first
              REP line is the header of the table, followed by one or more
              REP data lines.  With this table, Hunspell can suggest the right
              forms for typical spelling mistakes when the incorrect form
              differs from the right form by more than one letter (see also
              "ph:").  The search string supports the regex boundary signs
              (^ and $).  For example, a possible English replacement table
              to handle misspelled consonants:
245
246              REP 5
247              REP f ph
248              REP ph f
249              REP tion$ shun
250              REP ^cooccurr co-occurr
251              REP ^alot$ a_lot
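
       How such a table produces candidates can be sketched in Python.  This
       is an illustration only: rep_candidates is a hypothetical helper, it
       assumes that one occurrence is replaced per candidate, and it does not
       check the candidates against a dictionary.

```python
import re

# The replacement table above, without the header line.
REP_TABLE = [
    ("f", "ph"),
    ("ph", "f"),
    ("tion$", "shun"),
    ("^cooccurr", "co-occurr"),
    ("^alot$", "a_lot"),
]


def rep_candidates(word, table=REP_TABLE):
    """Apply each REP rule to each matching position, one at a time."""
    candidates = []
    for what, replacement in table:
        at_start = what.startswith("^")
        at_end = what.endswith("$")
        core = what[1:] if at_start else what
        core = core[:-1] if at_end else core
        # Only ^ and $ are treated as special; the rest is literal text.
        pattern = ("^" if at_start else "") + re.escape(core) + ("$" if at_end else "")
        # Underscores in the replacement stand for spaces.
        text = replacement.replace("_", " ")
        for m in re.finditer(pattern, word):
            candidates.append(word[: m.start()] + text + word[m.end():])
    return candidates


print(rep_candidates("alot"))  # ['a lot']
print(rep_candidates("fone"))  # ['phone']
```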
252
       Note I: It is also very useful to define replacements for the most
       typical one-character mistakes: with REP you can give higher priority
       to a subset of the TRY suggestions (the suggestion list begins with
       the REP suggestions).

       Note II: To suggest separate words, specify spaces with underscores:
259
260
261              REP 1
262              REP onetwothree one_two_three
263
264       Note III: Replacement table can be used for a  stricter  compound  word
265       checking with the option CHECKCOMPOUNDREP.
266
267
268       MAP number_of_map_definitions
269
270       MAP string_of_related_chars_or_parenthesized_character_sequences
              A map table in the affix file (.aff) defines language-dependent
              information on characters and character sequences that should
              be considered related (i.e. nearer than other characters not in
              the set).  With this table, Hunspell can suggest the right
              forms for words that incorrectly use the wrong letter or letter
              group from a related set more than once in a word (see REP).

              For example, a possible mapping for German relates the umlauted
              ü to the regular u; the word Frühstück should be written with
              umlauted u's and not regular ones:
282
283              MAP 1
284              MAP uü
285
286       Use parenthesized groups for character sequences (eg. for composed Uni‐
287       code characters):
288
289              MAP 3
290              MAP ß(ss)  (character sequence)
291              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
292              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)
293
294       PHONE number_of_phone_definitions
295
296       PHONE what replacement
              PHONE uses a table-driven phonetic transcription algorithm
              borrowed from Aspell.  It is useful for languages whose
              orthography is not pronunciation-based.  You can add a full
              alphabet conversion and other rules for the conversion of
              special letter sequences.  For detailed documentation see
              http://aspell.net/man-html/Phonetic-Code.html.  Note: Multibyte
              UTF-8 characters do not work in bracket expressions yet.  Dash
              expressions still operate on signed bytes rather than UTF-8
              characters.
305
306       WARN flag
              This flag is for rare words that are also often spelling
              mistakes; see option -r of command line Hunspell and FORBIDWARN.
309
310       FORBIDWARN
              With this parameter, words with the WARN flag are not accepted
              by the spell checker.
313

OPTIONS FOR COMPOUNDING

315       BREAK number_of_break_definitions
316
317       BREAK character_or_character_sequence
              Define new break points for breaking words and checking word
              parts separately.  Use ^ and $ to delete characters at the
              start and end of the word.  Rationale: useful for compounding
              with joining characters or strings (for example, the hyphen in
              English and German, or the hyphen and n-dash in Hungarian).
              Dashes are often bad break points for tokenization, because
              compounds with dashes may also contain invalid parts.  With
              BREAK, Hunspell can check both sides of these compounds,
              breaking the words at dashes and n-dashes:
327
328              BREAK 2
329              BREAK -
330              BREAK --    # n-dash
331
       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds.  Note: The default word break of Hunspell is
       equivalent to the following BREAK definition:
335
336              BREAK 3
337              BREAK -
338              BREAK ^-
339              BREAK -$
340
       With the following BREAK definition, Hunspell doesn't accept the
       "-word" and "word-" forms:
343
344              BREAK 1
345              BREAK -
346
347       Switching off the default values:
348
349              BREAK 0
350
       Note II: COMPOUNDRULE is better for handling dashes and other compound
       joining characters or character strings.  Use BREAK if you want to
       check words with dashes or other joining characters and there is no
       time or possibility to describe precise compound rules with
       COMPOUNDRULE (COMPOUNDRULE handles only the suffixation of the last
       word part of a compound word).

       Note III: For command line spell checking of words with extra
       characters, set the WORDCHARS parameter, for example WORDCHARS ---
       (see the tests/break.* example).
361
362       COMPOUNDRULE number_of_compound_definitions
363
364       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax.  The
              first COMPOUNDRULE line is a header with the number of the
              following COMPOUNDRULE definitions.  Compound patterns consist
              of compound flags, parentheses, and the star and question mark
              metacharacters.  A flag followed by a `*' matches a sequence of
              0 or more words signed with this compound flag.  A flag followed
              by a `?' matches a sequence of 0 or 1 words signed with this
              compound flag.  See the tests/compound*.* examples.
374
375              Note:  en_US  dictionary of OpenOffice.org uses COMPOUNDRULE for
376              ordinal number recognition (1st, 2nd, 11th, 12th,  22nd,  112th,
377              1000122nd etc.).
378
379              Note  II:  In the case of long and numerical flag types use only
380              parenthesized flags: (1500)*(2000)?
381
382              Note III: COMPOUNDRULE flags work completely separately from the
383              compounding  mechanisms  using COMPOUNDFLAG, COMPOUNDBEGIN, etc.
384              compound flags.  (Use  these  flags  on  different  entries  for
385              words).
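
       Matching a word sequence against a compound pattern can be sketched
       in Python.  This is an illustration only: parse_rule and matches are
       hypothetical helpers, and the flags X, Y, Z and the words in FLAGS_OF
       are invented for the example.

```python
import re


def parse_rule(pattern):
    """Split a COMPOUNDRULE pattern into (flag, quantifier) pairs."""
    pairs = re.findall(r"(\([^)]+\)|[^()*?])([*?]?)", pattern)
    # Parenthesized flags (long and numerical types) lose their parentheses.
    return [(flag.strip("()"), quant) for flag, quant in pairs]


def matches(parts, rule, flags_of):
    """Check whether the word parts fit the parsed rule, in order."""

    def go(i, j):
        if j == len(rule):
            return i == len(parts)
        flag, quant = rule[j]
        fits = i < len(parts) and flag in flags_of[parts[i]]
        if quant == "*":
            # Zero or more words with this flag.
            return go(i, j + 1) or (fits and go(i + 1, j))
        if quant == "?":
            # Zero or one word with this flag.
            return go(i, j + 1) or (fits and go(i + 1, j + 1))
        return fits and go(i + 1, j + 1)

    return go(0, 0)


# Hypothetical dictionary: word -> set of compound flags.
FLAGS_OF = {"foo": {"X"}, "bar": {"Y"}, "baz": {"Z"}}
RULE = parse_rule("X*Y?Z")

print(matches(["foo", "foo", "baz"], RULE, FLAGS_OF))  # True
print(matches(["baz", "foo"], RULE, FLAGS_OF))         # False
```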
386
387
388       COMPOUNDMIN num
389              Minimum  length of words used for compounding.  Default value is
390              3 letters.
391
392       COMPOUNDFLAG flag
              Words signed with COMPOUNDFLAG may be in compound words (except
              when the word is shorter than COMPOUNDMIN).  Affixes with
              COMPOUNDFLAG also permit compounding of affixed words.
396
397       COMPOUNDBEGIN flag
398              Words signed with COMPOUNDBEGIN (or with a signed affix) may  be
399              first elements in compound words.
400
401       COMPOUNDLAST flag
402              Words  signed  with COMPOUNDLAST (or with a signed affix) may be
403              last elements in compound words.
404
405       COMPOUNDMIDDLE flag
406              Words signed with COMPOUNDMIDDLE (or with a signed affix) may be
407              middle elements in compound words.
408
409       ONLYINCOMPOUND flag
              Suffixes signed with the ONLYINCOMPOUND flag may occur only
              inside compounds (Fuge-elements in German, fogemorphemes in
              Swedish).  The ONLYINCOMPOUND flag also works with words (see
              tests/onlyincompound.*).  Note: it is also useful for flagging
              compounding parts that are not correct as words by themselves.
415
416       COMPOUNDPERMITFLAG flag
417              Prefixes are allowed at the beginning of compounds, suffixes are
418              allowed at the end of compounds by default.  Affixes  with  COM‐
419              POUNDPERMITFLAG may be inside of compounds.
420
421       COMPOUNDFORBIDFLAG flag
422              Suffixes  with this flag forbid compounding of the affixed word.
423              Dictionary words with this flag are removed from  the  beginning
424              and middle of compound words, overriding the effect of COMPOUND‐
425              PERMITFLAG.
426
427       COMPOUNDMORESUFFIXES
428              Allow twofold suffixes within compounds.
429
430       COMPOUNDROOT flag
431              COMPOUNDROOT flag signs the compounds in the dictionary (Now  it
432              is used only in the Hungarian language specific code).
433
434       COMPOUNDWORDMAX number
435              Set  maximum  word  count in a compound word. (Default is unlim‐
436              ited.)
437
438       CHECKCOMPOUNDDUP
439              Forbid word duplication in compounds (e.g. foofoo).
440
441       CHECKCOMPOUNDREP
442              Forbid compounding, if the (usually bad) compound word may be  a
443              non-compound  word  with  a REP fault. Useful for languages with
444              `compound friendly' orthography.
445
446       CHECKCOMPOUNDCASE
447              Forbid upper case characters at word boundaries in compounds.
448
449       CHECKCOMPOUNDTRIPLE
450              Forbid compounding, if compound word contains  triple  repeating
451              letters (e.g. foo|ox or xo|oof). Bug: missing multi-byte charac‐
452              ter support in UTF-8 encoding (works only for 7-bit ASCII  char‐
453              acters).
454
455       SIMPLIFIEDTRIPLE
456              Allow  simplified  2-letter  forms of the compounds forbidden by
457              CHECKCOMPOUNDTRIPLE.  It's useful for Swedish and Norwegian (and
458              for the old German orthography: Schiff|fahrt -> Schiffahrt).
459
460       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions
461
462       CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
463              Forbid  compounding, if the first word in the compound ends with
464              endchars, and next word begins with beginchars and  (optionally)
465              they have the requested flags.  The optional replacement parame‐
466              ter allows simplified compound form.
467
468              The special "endchars" pattern 0 (zero) limits the rule  to  the
469              unmodified stems (stems and stems with zero affixes):
470
471              CHECKCOMPOUNDPATTERN 0/x /y
472
       Note: COMPOUNDMIN doesn't work correctly with compound word
       alternation, so it may be necessary to set COMPOUNDMIN to a lower
       value.
475
476       FORCEUCASE flag
              A last word part of a compound with the FORCEUCASE flag forces
              capitalization of the whole compound word.  E.g. the Dutch word
              "straat" (street) with the FORCEUCASE flag is allowed only in
              capitalized compound forms, according to the Dutch spelling
              rules for proper names.
482
483       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian.  The first
              parameter is the maximum number of syllables that a compound
              may contain if it has more words than COMPOUNDWORDMAX.  The
              second parameter is the list of vowels (for counting syllables).
488
489       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.
491

AFFIX FILE OPTIONS FOR AFFIX CREATION

493       PFX flag cross_product number
494
495       PFX flag stripping prefix [condition [morphological_fields...]]
496
497       SFX flag cross_product number
498
499       SFX flag stripping suffix [condition [morphological_fields...]]
              An affix is either a prefix or a suffix attached to root words
              to make other words.  Affix classes can be defined with an
              arbitrary number of affix rules.  Affix classes are signed with
              affix flags.  The first line of an affix class definition is the
              header.  The fields of an affix class header:
505
506              (0) Option name (PFX or SFX)
507
508              (1) Flag (name of the affix class)
509
510              (2) Cross product (permission to combine prefixes and suffixes).
511              Possible values: Y (yes) or N (no)
512
513              (3) Line count of the following rules.
514
              Fields of an affix rule:
516
517              (0) Option name
518
519              (1) Flag
520
521              (2) stripping characters from beginning (at prefix rules) or end
522              (at suffix rules) of the word
523
524              (3)  affix (optionally with flags of continuation classes, sepa‐
525              rated by a slash)
526
527              (4) condition.
528
              Zero stripping or a zero affix is indicated by zero.  A zero
              condition is indicated by a dot.  The condition is a simplified,
              regular expression-like pattern that must be met before the
              affix can be applied.  (A dot stands for an arbitrary character.
              Characters in brackets stand for an arbitrary character from
              that character subset.  A dash has no special meaning, but a
              circumflex (^) right after the opening bracket selects the
              complementary character set.)

              (5) Optional morphological fields separated by spaces or tabs.
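
       The field layout above can be sketched as a small Python parser.  This
       is an illustration only: parse_affix_class and the dictionary keys it
       returns are hypothetical, and the sketch assumes whitespace-separated
       fields with no comments.

```python
def parse_affix_class(lines):
    """Parse one affix class: a header line followed by its rule lines."""
    # Header fields: option name, flag, cross product, rule count.
    kind, flag, cross, count = lines[0].split()
    rules = []
    for line in lines[1 : 1 + int(count)]:
        fields = line.split()
        # The affix may carry continuation class flags after a slash.
        affix, _, continuation = fields[3].partition("/")
        rules.append({
            "strip": "" if fields[2] == "0" else fields[2],
            "affix": "" if affix == "0" else affix,
            "continuation": continuation,
            "condition": fields[4] if len(fields) > 4 else ".",
            "morph": fields[5:],
        })
    if len(rules) != int(count):
        raise ValueError("header count does not match the number of rules")
    return {"kind": kind, "flag": flag, "cross_product": cross == "Y", "rules": rules}


suffix_b = parse_affix_class(["SFX B Y 2", "SFX B 0 ed [^y]", "SFX B y ied y"])
print(suffix_b["rules"][1]["strip"], suffix_b["rules"][1]["affix"])  # y ied
```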
539
540

AFFIX FILE OTHER OPTIONS

542       CIRCUMFIX flag
543              Affixes  signed  with  CIRCUMFIX flag may be on a word when this
544              word also has a prefix with CIRCUMFIX flag and vice  versa  (see
545              circumfix.* test files in the source distribution).
546
547       FORBIDDENWORD flag
              This flag signs a forbidden word form.  Because affixed forms
              are also forbidden, a subset can be subtracted from the set of
              accepted affixed and compound words.  Note: useful for
              forbidding erroneous words generated by the compounding
              mechanism.
552
553       FULLSTRIP
              With FULLSTRIP, affix rules can strip a full word before adding
              the affix, not only all but one of its characters (see the
              fullstrip.* test files in the source distribution).  Note:
              conditions may be as long as the word even without FULLSTRIP.
558
559       KEEPCASE flag
560              Forbid  uppercased  and  capitalized  forms of words signed with
561              KEEPCASE flags. Useful for special  orthographies  (measurements
562              and  currency  often  keep  their  case in uppercased texts) and
563              writing systems (e.g. keeping lower  case  of  IPA  characters).
564              Also valuable for words erroneously written in the wrong case.
565
566              Note: With CHECKSHARPS declaration, words with sharp s and KEEP‐
567              CASE flag may be  capitalized  and  uppercased,  but  uppercased
568              forms  of these words may not contain sharp s, only SS. See ger‐
569              mancompounding example in the tests directory  of  the  Hunspell
570              distribution.
571
572
573       ICONV number_of_ICONV_definitions
574
575       ICONV pattern pattern2
576              Define input conversion table.  Note: useful to convert one type
577              of quote to another one, or change ligature.
578
579       OCONV number_of_OCONV_definitions
580
581       OCONV pattern pattern2
582              Define output conversion table.
583
584       LEMMA_PRESENT flag
585              Deprecated. Use "st:" field instead of LEMMA_PRESENT.
586
587       NEEDAFFIX flag
              This flag signs virtual stems in the dictionary, words that are
              valid only when affixed, except if the dictionary word has a
              homonym or a zero affix.  NEEDAFFIX also works with prefixes and
              prefix + suffix combinations (see tests/needaffix5.*).
592
593       PSEUDOROOT flag
594              Deprecated. (Former name of the NEEDAFFIX option.)
595
596       SUBSTANDARD flag
597              SUBSTANDARD  flag  signs affix rules and dictionary words (allo‐
598              morphs) not used in  morphological  generation  and  root  words
599              removed from suggestion. See also NOSUGGEST.
600
601       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters.  For example, dot,
              dash, n-dash, numbers and the percent sign are word characters
              in Hungarian.
605
606       CHECKSHARPS
607              SS  letter  pair  in uppercased (German) words may be upper case
608              sharp s (ß).  Hunspell can handle this special casing  with  the
609              CHECKSHARPS  declaration  (see also KEEPCASE flag and tests/ger‐
610              mancompounding example) in both spelling and suggestion.
611
612

Morphological analysis

614       Hunspell's dictionary items and affix rules may have optional space  or
615       tabulator  separated  morphological  description  fields,  started with
616       3-character (two letters and a colon) field IDs:
617
618
619               word/flags po:noun is:nom
620
       Example: We define a simple resource with morphological information, a
       derivational suffix (ds:) and a part of speech category (po:):
623
624       Affix file:
625
626
627               SFX X Y 1
628               SFX X 0 able . ds:able
629
630       Dictionary file:
631
632
633               drink/X po:verb
634
635       Test file:
636
637
638               drink
639               drinkable
640
641       Test:
642
643
644               $ analyze test.aff test.dic test.txt
645               > drink
646               analyze(drink) = po:verb
647               stem(drink) = po:verb
648               > drinkable
649               analyze(drinkable) = po:verb ds:able
650               stem(drinkable) = drinkable
651
652       You  can see in the example, that the analyzer concatenates the morpho‐
653       logical fields in item and arrangement style.
654
655

Optional data fields

657       Default morphological and other IDs (used in suggestion,  stemming  and
658       morphological generation):
659
       ph:    Alternative transliteration for better suggestions, i.e.
              misspellings related to the special orthography and
              pronunciation of the word.  This is the best way to handle
              common misspellings, so it is worth adding the ph: field to the
              few thousand most affected dictionary words (or word pairs
              etc.) to get correct suggestions for their misspellings.
666
667
668              For example:
669
670
671              Wednesday ph:wendsay ph:wensday
672              Marseille ph:maarsayl
673
       Hunspell adds all ph: transliterations to its internal REP table, so
       it always suggests the correct word for the specified misspellings
       with the highest priority.

       The previous example is equivalent to the following REP definition:
679
680
681              REP 6
682              REP wendsay Wednesday
683              REP Wendsay Wednesday
684              REP wensday Wednesday
685              REP Wensday Wednesday
686              REP maarsayl Marseille
687              REP Maarsayl Marseille
688
       An asterisk at the end of a ph: pattern strips the terminating
       character from both the pattern and the word in the associated REP
       rule:
692
693
694              pretty ph:prity*
695
       results in the REP rule


              REP 1
              REP prit prett

       which gives the following correct suggestions:
703
704
705              *prity -> pretty
706              *pritier -> prettier
707              *pritiest -> prettiest
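
       The conversion of ph: fields to REP rules shown so far can be sketched
       in Python.  This is an illustration, not Hunspell's code: ph_to_rep is
       a hypothetical helper, it does not handle the "->" form, and it adds a
       capitalized pattern only for capitalized dictionary words, which is an
       assumption drawn from the Wednesday, Marseille and pretty examples.

```python
def ph_to_rep(word, patterns):
    """Build (what, replacement) REP pairs from the ph: fields of a word."""
    rules = []
    for pattern in patterns:
        target = word
        if pattern.endswith("*"):
            # Drop the asterisk and the terminating character of the
            # pattern, and the terminating character of the word.
            pattern = pattern[:-2]
            target = word[:-1]
        rules.append((pattern, target))
        if word[:1].isupper():
            # Assumed: capitalized words also get a capitalized pattern.
            rules.append((pattern[:1].upper() + pattern[1:], target))
    return rules


for what, replacement in ph_to_rep("Wednesday", ["wendsay", "wensday"]):
    print("REP", what, replacement)
print(ph_to_rep("pretty", ["prity*"]))  # [('prit', 'prett')]
```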
708
709       Moreover, ph: fields can handle suggestions with more than  two  words,
710       also different suggestions for the same misspelling:
711
712              do not know ph:dunno
713              don't know ph:dunno
714
715       results
716
717
718              *dunno -> do not know, don't know
719
720       Note: if available, ph: is used in n-gram similarity, too.
721
       The ASCII arrow "->" in a ph: pattern defines a REP rule (see REP),
       creating an arbitrary replacement rule associated with the dictionary
       item:
724
725              happy/B ph:hepy ph:hepi->happi
726
727       results
728
729
730              *hepy -> happy
731              *hepiest -> happiest
732
       st:    Stem.  Optional: the default stem in morphological analysis is
              the dictionary item.  The stem field is useful for virtual
              stems (dictionary words with the NEEDAFFIX flag) and for
              morphological exceptions, instead of new morphological rules
              used only once.
737
738              feet  st:foot  is:plural
739              mice  st:mouse is:plural
740              teeth st:tooth is:plural
741
742       Word forms with multiple stems need multiple dictionary items:
743
744
745              lay po:verb st:lie is:past_2
746              lay po:verb is:present
747              lay po:noun
748
749       al:    Allomorph(s).  A  dictionary item is the stem of its allomorphs.
750              Morphological generation needs stem, allomorph and affix fields.
751
752              sing al:sang al:sung
753              sang st:sing
754              sung st:sing
755
756       po:    Part of speech category.
757
758       ds:    Derivational suffix(es).  Stemming doesn't  remove  derivational
759              suffixes.   Morphological generation depends on the order of the
760              suffix fields.
761
762              In affix rules:
763
764
765              SFX Y Y 1
766              SFX Y 0 ly . ds:ly_adj
767
768       In the dictionary:
769
770
771              ably st:able ds:ly_adj
772              able al:ably
773
774       is:    Inflectional suffix(es).  All inflectional suffixes are  removed
775              by  stemming.   Morphological generation depends on the order of
776              the suffix fields.
777
778
779              feet st:foot is:plural
780
781       ts:    Terminal suffix(es).  Terminal suffix  fields  are  inflectional
782              suffix fields "removed" by additional (not terminal) suffixes.
783
784              Useful  for  zero  morphemes  and  affixes  removed by splitting
785              rules.
786
787
788              work/D ts:present
789
790              SFX D Y 2
791              SFX D   0 ed . is:past_1
792              SFX D   0 ed . is:past_2
793
       A typical example of a terminal suffix is the zero morpheme of the
       nominative case.
796
797
798       sp:    Surface  prefix.  Temporary  solution for adding prefixes to the
799              stems and generated word forms. See tests/morph.* example.
800
801
802       pa:    Parts of the compound  words.  Output  fields  of  morphological
803              analysis for stemming.
804
805       dp:    Planned: derivational prefix.
806
807       ip:    Planned: inflectional prefix.
808
809       tp:    Planned: terminal prefix.
810
811

Twofold suffix stripping

       Ispell's original algorithm strips only one suffix. Hunspell can strip
       a second one as well (or a second prefix in COMPLEXPREFIXES mode).

       Twofold suffix stripping is a significant improvement in handling the
       immense number of suffixes that characterize agglutinative languages.
819
820       A second `s' suffix (affix class Y) will be the continuation  class  of
821       the suffix `able' in the following example:
822
823
824               SFX Y Y 1
825               SFX Y 0 s .
826
827               SFX X Y 1
828               SFX X 0 able/Y .
829
830       Dictionary file:
831
832
833               drink/X
834
835       Test file:
836
837
838               drink
839               drinkable
840               drinkables
841
842       Test:
843
844
845               $ hunspell -m -d test <test.txt
846               drink st:drink
847               drinkable st:drink fl:X
848               drinkables st:drink fl:X fl:Y
849
       In theory, twofold suffix stripping needs only the square root of the
       number of suffix rules required by a one-level implementation. In
       practice, it allowed us to elaborate the Hungarian inflectional
       morphology.
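
       The stripping order in the example can be sketched in a few lines of
       Python. This is a toy model under stated assumptions, not Hunspell's
       algorithm: flags are single characters, rules have no strip part and
       no conditions, and the names SUFFIXES, DICTIONARY, strip_once and
       analyze are all hypothetical.

```python
# Suffix rules modelled on the example: flag -> [(suffix, continuation flags)].
SUFFIXES = {
    "Y": [("s", "")],
    "X": [("able", "Y")],
}
DICTIONARY = {"drink": "X"}  # word -> flags


def strip_once(word):
    """Yield (remainder, flag, continuation) for every removable suffix."""
    for flag, rules in SUFFIXES.items():
        for suffix, cont in rules:
            if word.endswith(suffix) and len(word) > len(suffix):
                yield word[: -len(suffix)], flag, cont


def analyze(word):
    """Return (stem, flags) analyses found with at most two stripping steps."""
    results = []
    if word in DICTIONARY:
        results.append((word, []))
    for rest, outer, _ in strip_once(word):
        # One suffix: the stem itself must carry the suffix flag.
        if outer in DICTIONARY.get(rest, ""):
            results.append((rest, [outer]))
        # Two suffixes: the inner suffix must license the outer one
        # through its continuation class.
        for stem, inner, cont in strip_once(rest):
            if outer in cont and inner in DICTIONARY.get(stem, ""):
                results.append((stem, [inner, outer]))
    return results


if __name__ == "__main__":
    for w in ("drink", "drinkable", "drinkables", "drinks"):
        print(w, analyze(w))
```

       In this model "drinks" has no analysis, because class Y is reachable
       only through the continuation class of `able', not from the stem.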
854
855

Extended affix classes

       Hunspell can handle more than 65000 affix classes.  There are three new
       syntaxes for specifying flags in affix and dictionary files.
859
860       FLAG long command sets 2-character flags:
861
862
                FLAG long
                SFX Y1 Y 1
                SFX Y1 0 s .
866
867       Dictionary record with the Y1, Z3, F? flags:
868
869
870                foo/Y1Z3F?
871
872       FLAG num command sets numerical flags separated by comma:
873
874
                FLAG num
                SFX 65000 Y 1
                SFX 65000 0 s .
878
879       Dictionary example:
880
881
882                foo/65000,12,2756
883
       The third syntax, FLAG UTF-8, sets Unicode character flags.
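
       How a flag string after the slash is cut into individual flags
       depends on the FLAG setting. The following Python sketch illustrates
       the three syntaxes plus the default; the function name split_flags
       and its mode strings are assumptions made for this example, not part
       of any Hunspell API.

```python
def split_flags(flags, mode="short"):
    """Split the flag string of a dictionary entry into separate flags.

    mode is "short" (default, one character per flag), "long"
    (two characters per flag), "num" (comma-separated decimal numbers)
    or "UTF-8" (one Unicode character per flag).
    """
    if mode == "long":
        if len(flags) % 2:
            raise ValueError("FLAG long needs an even number of characters")
        return [flags[i:i + 2] for i in range(0, len(flags), 2)]
    if mode == "num":
        return flags.split(",")
    if mode in ("short", "UTF-8"):
        # Python strings are Unicode, so both cases are one character each.
        return list(flags)
    raise ValueError("unknown flag mode: " + mode)


if __name__ == "__main__":
    print(split_flags("Y1Z3F?", "long"))
    print(split_flags("65000,12,2756", "num"))
```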
885
886

Homonyms

888       Hunspell's dictionary can contain repeating elements that are homonyms:
889
890
891               work/A    po:verb
892               work/B    po:noun
893
894       An affix file:
895
896
               SFX A Y 1
               SFX A 0 s . is:sg3

               SFX B Y 1
               SFX B 0 s . is:plur
902
903       Test file:
904
905
906               works
907
908       Test:
909
910
               $ hunspell -d test -m <testwords
               works st:work po:verb is:sg3
               works st:work po:noun is:plur
914
915       This  feature also gives a way to forbid illegal prefix/suffix combina‐
916       tions.
917
918

Prefix--suffix dependencies

       An interesting side effect of multi-step stripping is that the
       appropriate treatment of circumfixes now comes for free.  For
       instance, in Hungarian, superlatives are formed by simultaneous
       prefixation of leg- and suffixation of -bb to the adjective base.  A
       problem with the one-level architecture is that there is no way to
       make the lexical licensing of particular prefixes and suffixes
       interdependent, and therefore incorrect forms are recognized as
       valid, e.g. *legvén = leg + vén `old'.  Until the introduction of
       clusters, a special treatment of the superlative had to be hardwired
       in the earlier Hunspell code.  This may have been legitimate for a
       single case, but in fact prefix--suffix dependencies are ubiquitous
       in category-changing derivational patterns (cf. English payable,
       non-payable but *non-pay, or drinkable, undrinkable but *undrink).
       In simple terms, the prefix un- is legitimate here only if the base
       drink is suffixed with -able.  If both these patterns are handled by
       on-line affix rules and affix rules are checked against the base
       only, there is no way to express this dependency and the system will
       necessarily over- or undergenerate.
937
       In the next example, suffix class R has a prefix `continuation' class
       (class P).
940
941
942              PFX P Y 1
943              PFX P   0 un . [prefix_un]+
944
945              SFX S Y 1
946              SFX S   0 s . +PL
947
948              SFX Q Y 1
949              SFX Q   0 s . +3SGV
950
951              SFX R Y 1
952              SFX R   0 able/PS . +DER_V_ADJ_ABLE
953
954       Dictionary:
955
956
957              2
958              drink/RQ  [verb]
959              drink/S   [noun]
960
961       Morphological analysis:
962
963
964              > drink
965              drink[verb]
966              drink[noun]
967              > drinks
968              drink[verb]+3SGV
969              drink[noun]+PL
970              > drinkable
971              drink[verb]+DER_V_ADJ_ABLE
972              > drinkables
973              drink[verb]+DER_V_ADJ_ABLE+PL
974              > undrinkable
975              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
976              > undrinkables
977              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
978              > undrink
979              Unknown word.
980              > undrinks
981              Unknown word.
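
       The licensing logic of this example can be modelled by generating
       word forms instead of analyzing them. The Python sketch below is a
       simplification under stated assumptions: single-character flags, no
       strip parts or conditions, at most two suffixes and one prefix. The
       names PREFIXES, SUFFIXES, ENTRIES, prefixed and generate are
       hypothetical.

```python
# Affix tables modelled on the example: flag -> [(affix, continuation flags)].
PREFIXES = {"P": [("un", "")]}
SUFFIXES = {
    "S": [("s", "")],
    "Q": [("s", "")],
    "R": [("able", "PS")],
}
ENTRIES = [("drink", "RQ"), ("drink", "S")]


def prefixed(word, flags):
    """Return word with every prefix whose class is listed in flags."""
    return {p + word for f in flags for p, _ in PREFIXES.get(f, [])}


def generate(entries):
    """Generate word forms using at most two suffixes and one prefix."""
    forms = set()
    for stem, flags in entries:
        forms.add(stem)
        forms |= prefixed(stem, flags)
        for flag in flags:
            for suffix, cont in SUFFIXES.get(flag, []):
                derived = stem + suffix
                licensed = flags + cont  # the suffix may add prefix classes
                forms.add(derived)
                forms |= prefixed(derived, licensed)
                for flag2 in cont:
                    for suffix2, _ in SUFFIXES.get(flag2, []):
                        forms.add(derived + suffix2)
                        forms |= prefixed(derived + suffix2, licensed)
    return forms


if __name__ == "__main__":
    print(sorted(generate(ENTRIES)))
```

       Because class P is reachable only through the continuation class of
       `able', the model produces undrinkable and undrinkables but neither
       undrink nor undrinks.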
982

Circumfix

       Conditional affixes implemented by a continuation class are not enough
       for circumfixes, because a circumfix is one affix in morphology. We
       also need the CIRCUMFIX option for correct morphological analysis.
987
988
989              # circumfixes: ~ obligate prefix/suffix combinations
990              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
991              # nagy, nagyobb, legnagyobb, legeslegnagyobb
992              # (great, greater, greatest, most greatest)
993
994              CIRCUMFIX X
995
996              PFX A Y 1
997              PFX A 0 leg/X .
998
999              PFX B Y 1
1000              PFX B 0 legesleg/X .
1001
1002              SFX C Y 3
1003              SFX C 0 obb . +COMPARATIVE
1004              SFX C 0 obb/AX . +SUPERLATIVE
1005              SFX C 0 obb/BX . +SUPERSUPERLATIVE
1006
1007       Dictionary:
1008
1009
1010              1
1011              nagy/C    [MN]
1012
1013       Analysis:
1014
1015
1016              > nagy
1017              nagy[MN]
1018              > nagyobb
1019              nagy[MN]+COMPARATIVE
1020              > legnagyobb
1021              nagy[MN]+SUPERLATIVE
1022              > legeslegnagyobb
1023              nagy[MN]+SUPERSUPERLATIVE
1024

Compounds

       Allowing free compounding yields a decrease in the precision of
       recognition, not to mention stemming and morphological analysis.
       Although Ispell introduced lexical switches to license compounding of
       bases, this proves not to be restrictive enough. For example:
1030
1031
              # affix file
              COMPOUNDFLAG X

              # dictionary file
              2
              foo/X
              bar/X
1038
       With this resource, foobar and barfoo are also accepted words.
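
       To see why a single compounding flag is so permissive, consider the
       following Python sketch. It assumes that every dictionary word
       carrying the flag may appear at any position, and it ignores minimum
       part length, affixes and every other compounding option. The names
       COMPOUND_WORDS and is_compound are hypothetical.

```python
# Dictionary words that carry the compounding flag X in the example.
COMPOUND_WORDS = {"foo", "bar"}


def is_compound(word, parts=COMPOUND_WORDS, min_parts=2):
    """Return True if word splits completely into at least min_parts parts."""

    def fits(rest, used):
        if not rest:
            return used >= min_parts
        return any(
            rest.startswith(p) and fits(rest[len(p):], used + 1)
            for p in parts
        )

    return fits(word, 0)


if __name__ == "__main__":
    for w in ("foobar", "barfoo", "foofoo", "foobarfoo", "foobaz"):
        print(w, is_compound(w))
```

       Under these assumptions the model also accepts repetitions such as
       "foofoo" and longer chains such as "foobarfoo", which illustrates the
       loss of precision described above.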
1040
1041       This has been improved upon with the introduction  of  direction-sensi‐
1042       tive compounding, i.e., lexical features can specify separately whether
1043       a base can occur as leftmost or  rightmost  constituent  in  compounds.
1044       This,  however,  is still insufficient to handle the intricate patterns
1045       of compounding, not to mention idiosyncratic  (and  language  specific)
1046       norms of hyphenation.
1047
       The MySpell algorithm currently allows any affixed form of words that
       are lexically marked as potential members of compounds.  Hunspell
       improved this, and its recursive compound checking rules make it
       possible to implement the intricate spelling conventions of Hungarian
       compounds. For example, the noteworthy Hungarian `6-3' rule can be set
       using the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
       SYLLABLENUM options.  As a further example, Hungarian derivational
       suffixes often modify compounding properties. Hunspell allows
       compounding flags on the affixes, and there are two special flags
       (COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
       compounding of the derivations.

       Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the
       affixed word.
1060
1061       We also need several Hunspell features for handling German compounding:
1062
1063
1064              # German compounding
1065
1066              # set language to handle special casing of German sharp s
1067
1068              LANG de_DE
1069
1070              # compound flags
1071
1072              COMPOUNDBEGIN U
1073              COMPOUNDMIDDLE V
1074              COMPOUNDEND W
1075
1076              # Prefixes are allowed at the beginning of compounds,
1077              # suffixes are allowed at the end of compounds by default:
1078              # (prefix)?(root)+(affix)?
1079              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
1080              COMPOUNDPERMITFLAG P
1081
1082              # for German fogemorphemes (Fuge-element)
1083              # Hint: ONLYINCOMPOUND is not required everywhere, but the
1084              # checking will be a little faster with it.
1085
1086              ONLYINCOMPOUND X
1087
1088              # forbid uppercase characters at compound word bounds
1089              CHECKCOMPOUNDCASE
1090
1091              # for handling Fuge-elements with dashes (Arbeits-)
1092              # dash will be a special word
1093
1094              COMPOUNDMIN 1
1095              WORDCHARS -
1096
1097              # compound settings and fogemorpheme for `Arbeit'
1098
1099              SFX A Y 3
1100              SFX A 0 s/UPX .
1101              SFX A 0 s/VPDX .
1102              SFX A 0 0/WXD .
1103
1104              SFX B Y 2
1105              SFX B 0 0/UPX .
1106              SFX B 0 0/VWXDP .
1107
1108              # a suffix for `Computer'
1109
1110              SFX C Y 1
1111              SFX C 0 n/WD .
1112
1113              # for forbid exceptions (*Arbeitsnehmer)
1114
1115              FORBIDDENWORD Z
1116
1117              # dash prefix for compounds with dash (Arbeits-Computer)
1118
1119              PFX - Y 1
1120              PFX - 0 -/P .
1121
1122              # decapitalizing prefix
1123              # circumfix for positioning in compounds
1124
1125              PFX D Y 29
1126              PFX D A a/PX A
1127              PFX D Ä ä/PX Ä
1128               .
1129               .
1130              PFX D Y y/PX Y
1131              PFX D Z z/PX Z
1132
1133       Example dictionary:
1134
1135
1136              4
1137              Arbeit/A-
1138              Computer/BC-
1139              -/W
1140              Arbeitsnehmer/Z
1141
       Accepted compound words with the previous resource:
1143
1144
1145              Computer
1146              Computern
1147              Arbeit
1148              Arbeits-
1149              Computerarbeit
1150              Computerarbeits-
1151              Arbeitscomputer
1152              Arbeitscomputern
1153              Computerarbeitscomputer
1154              Computerarbeitscomputern
1155              Arbeitscomputerarbeit
1156              Computerarbeits-Computer
1157              Computerarbeits-Computern
1158
1159       Not accepted compoundings:
1160
1161
1162              computer
1163              arbeit
1164              Arbeits
1165              arbeits
1166              ComputerArbeit
1167              ComputerArbeits
1168              Arbeitcomputer
1169              ArbeitsComputer
1170              Computerarbeitcomputer
1171              ComputerArbeitcomputer
1172              ComputerArbeitscomputer
1173              Arbeitscomputerarbeits
1174              Computerarbeits-computer
1175              Arbeitsnehmer
1176
1177       This solution is still not ideal, however, and will be  replaced  by  a
1178       pattern-based  compound-checking  algorithm which is closely integrated
1179       with input buffer tokenization. Patterns describing compounds come as a
1180       separate input resource that can refer to high-level properties of con‐
1181       stituent parts (e.g. the number of syllables, affix flags, and contain‐
1182       ment  of hyphens). The patterns are matched against potential segmenta‐
1183       tions of compounds to assess wellformedness.
1184
1185

Unicode character encoding

       Both Ispell and MySpell use 8-bit character encodings, which is a
       major deficiency when it comes to scalability.  Although a language
       like Hungarian has a standard 8-bit character set (ISO 8859-2), it
       fails to allow a full implementation of Hungarian orthographic
       conventions.  For instance, the `–' symbol (en dash) is missing from
       this character set, even though it is not only the official symbol
       to delimit parenthetic clauses in the language, but can also appear
       in compound words as a special `big' hyphen.
1195
       MySpell has some 8-bit encoding tables, but there are also languages
       without a standard 8-bit encoding. For example, many African
       languages use non-Latin or extended Latin characters.
1199
       Similarly, using the original spelling of certain foreign names like
       Ångström or Molière is encouraged by the Hungarian spelling norm.
       Since the characters 'Å' and 'è' are not part of ISO 8859-2, when
       they combine with inflections containing characters found only in
       ISO 8859-2 (like elative -ből, ablative -től or delative -ről with
       double acute), the result is words (like Ångströmről or Molière-től)
       that cannot be encoded using any single 8-bit encoding scheme.
1207
       The problems raised in relation to 8-bit encodings have long been
       recognized by proponents of Unicode. It is clear that trading
       efficiency for encoding-independence has its advantages when it comes
       to a truly multi-lingual application. Hunspell implements memory- and
       time-efficient Unicode handling. In non-UTF-8 character encodings
       Hunspell works with the original 8-bit strings. With UTF-8 encoding,
       affixes and words are stored in UTF-8 and are mostly handled in UTF-8
       during analysis; for condition checking and suggestion they are
       converted to UTF-16. Unicode text analysis and spell checking have a
       minimal (0-20%) time overhead and a minimal or reasonable memory
       overhead, depending on the language (its UTF-8 encoding and
       affixation).
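
       The encoding problem described for Ångströmről and Molière-től can
       be checked directly. The following Python sketch tries each word
       against ISO 8859-1, ISO 8859-2 and UTF-8 using the standard codecs;
       the helper name encodable is an assumption made for this example.

```python
import unicodedata


def encodable(word, encoding):
    """Return True if every character of word exists in the given encoding."""
    # Normalize to precomposed characters so that accents typed as
    # combining marks do not distort the result.
    word = unicodedata.normalize("NFC", word)
    try:
        word.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for word in ("Ångströmről", "Molière-től"):
        for encoding in ("iso-8859-1", "iso-8859-2", "utf-8"):
            print(word, encoding, encodable(word, encoding))
```

       Each word mixes a character missing from ISO 8859-2 with the double
       acute `ő', which is missing from ISO 8859-1, so only UTF-8 can
       represent both words.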
1219
1220

Conversion of aspell dictionaries

       Aspell dictionaries can easily be converted into Hunspell format.
       Conversion steps:

       Dictionary (xx.cwl -> xx.wl):

              preunzip xx.cwl
              wc -l < xx.wl > xx.dic
              cat xx.wl >> xx.dic

       Affix file:

       If the affix file exists, copy it:

              cp xx_affix.dat xx.aff

       If not, create it with the suitable character encoding (see xx.dat):

              echo "SET ISO8859-x" > xx.aff

       or

              echo "SET UTF-8" > xx.aff

       It is useful to add a TRY option listing the characters of the
       dictionary in frequency order, to guide edit-distance suggestions:

              echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
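
       The dictionary and TRY steps can also be sketched in Python, working
       on an already decompressed word list. This is an illustration only:
       the function names make_dic and make_try_line are hypothetical, and
       ordering characters by their raw count in the word list is an
       assumed reading of "frequency order".

```python
from collections import Counter


def make_dic(words):
    """Return .dic content: the word count on line one, then the words."""
    words = [w for w in words if w]  # drop empty lines
    return "\n".join([str(len(words)), *words]) + "\n"


def make_try_line(words):
    """Return a TRY line with characters ordered by descending frequency."""
    counts = Counter(ch for w in words for ch in w)
    # Ties are broken by code point so that the result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "TRY " + "".join(ordered)


if __name__ == "__main__":
    sample = ["foo", "bar", "baz"]
    print(make_dic(sample), end="")
    print(make_try_line(sample))
```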
1243
1244