hunspell(5)

hunspell(5)                   File Formats Manual                  hunspell(5)

NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define how a language is spell
       checked: a dictionary file containing words and their applicable
       flags, and an affix file that specifies how these flags control
       spell checking. A personal dictionary file is optional.

Dictionary file

       A dictionary file (*.dic) contains a list of words, one per line. The
       first line of a dictionary (except a personal dictionary) contains
       the approximate word count (for optimal hash memory size). Each word
       may optionally be followed by a slash ("/") and one or more flags,
       which represent the word's attributes, for example its affixes.

       Note: Dictionary words can also contain slashes when they are escaped
       as "\/".

       It is worth adding not only words but also word pairs to the
       dictionary to get correct suggestions for common misspellings with a
       missing space, as in the following example for the bad forms "alot"
       and "inspite" (see also "REP" and the "ph:" field about correct
       suggestions for common misspellings):

              3
              word
              a lot
              in spite

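       The word/flags split described above can be sketched in Python. This
       is a minimal illustration, not Hunspell's own parser: the function
       name split_dic_entry is hypothetical, and morphological fields after
       the flags are deliberately not handled.

```python
import re

def split_dic_entry(line):
    """Split a .dic entry into (word, flags); flags is '' when absent."""
    entry = line.strip()
    # Split at the first slash that is not escaped with a backslash.
    parts = re.split(r"(?<!\\)/", entry, maxsplit=1)
    # Turn the escaped form "\/" back into a plain slash in the word.
    word = parts[0].replace("\\/", "/")
    flags = parts[1] if len(parts) > 1 else ""
    return word, flags
```

       For instance, "try/B" splits into the word "try" with flags "B",
       while the word pair "a lot" is kept whole with no flags.
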

Personal dictionary file

       Personal dictionaries are simple word lists. An asterisk in the first
       character position marks a forbidden word. A second word, separated
       from the first by a slash, sets the affixation: the first word takes
       the affixes of the second.

              foo
              Foo/Simpson
              *bar

       In this example, "foo" and "Foo" are personal words, "Foo" is also
       recognized with the affixes of "Simpson" ("Foo's" etc.), and "bar" is
       a forbidden word.

Short example

       Dictionary file:

              3
              hello
              try/B
              work/AB

       The flags B and A specify attributes of these words.

       Affix file:

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

              REP 2
              REP f ph
              REP ph f

              PFX A Y 1
              PFX A 0 re .

              SFX B Y 2
              SFX B 0 ed [^y]
              SFX B y ied y

       In the affix file, prefix class A and suffix class B have been
       defined. Flag A defines a `re-' prefix. Class B defines two `-ed'
       suffixes. The first B suffix can be added to a word if the last
       character of the word isn't `y'. The second suffix applies to words
       ending in `y': the `y' is stripped and `ied' is added.

       All accepted words with this dictionary and affix combination are:
       "hello", "try", "tried", "work", "worked", "rework", "reworked".

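       The way these rules generate the accepted word list can be sketched
       in Python. This is an illustration under simplifying assumptions,
       not Hunspell's algorithm: the names RULES, DICTIONARY, apply_rule and
       accepted_forms are hypothetical, flags are single characters, and
       every class is treated as cross-product Y (as in the example).

```python
import re

# Affix rules from the short example: (kind, flag, strip, add, condition).
RULES = [
    ("PFX", "A", "", "re", "."),
    ("SFX", "B", "", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}

def apply_rule(word, rule):
    """Return the affixed form, or None if the condition does not match."""
    kind, _flag, strip, add, cond = rule
    if kind == "PFX":
        if not re.match(cond, word) or not word.startswith(strip):
            return None
        return add + word[len(strip):]
    if not re.search(cond + "$", word) or not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + add

def accepted_forms(dictionary, rules):
    """Collect stems, suffixed forms, and prefixed variants of both."""
    forms = set()
    for stem, flags in dictionary.items():
        suffixed = {stem}
        for rule in rules:
            if rule[0] == "SFX" and rule[1] in flags:
                form = apply_rule(stem, rule)
                if form:
                    suffixed.add(form)
        forms |= suffixed
        # Cross product: prefixes combine with the bare and suffixed forms.
        for rule in rules:
            if rule[0] == "PFX" and rule[1] in flags:
                for base in suffixed:
                    form = apply_rule(base, rule)
                    if form:
                        forms.add(form)
    return forms
```

       Calling accepted_forms(DICTIONARY, RULES) yields the seven words
       listed above.
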

AFFIX FILE GENERAL OPTIONS

       The Hunspell source distribution contains more than 80 examples of
       option usage.

       SET encoding
              Set the character encoding of words and morphemes in the affix
              and dictionary files. Possible values: UTF-8, ISO8859-1 to
              ISO8859-10, ISO8859-13 to ISO8859-15, KOI8-R, KOI8-U, cp1251,
              ISCII-DEVANAGARI.

              SET UTF-8

       FLAG value
              Set the flag type. The default type is a single extended ASCII
              (8-bit) character. The `UTF-8' value sets UTF-8 encoded
              Unicode character flags. The `long' value sets flags made of
              two extended ASCII characters, and the `num' value sets
              decimal number flags. Decimal flags are numbered from 1 to
              65000 and are separated by commas in flag fields.

              FLAG long

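       How a flag field is split under each flag type can be sketched as
       follows. The function parse_flags and its flag_type argument are
       hypothetical names for illustration; Python strings are already
       sequences of Unicode characters, so the default and `UTF-8' types
       are handled the same way here.

```python
def parse_flags(field, flag_type="char"):
    """Split a flag field into individual flags for the given FLAG type."""
    if flag_type == "long":
        # Two characters per flag.
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        # Decimal numbers separated by commas.
        return [int(part) for part in field.split(",") if part]
    # Default and UTF-8 types: one character per flag.
    return list(field)
```

       With this sketch, "AB" gives the flags A and B, "AaBb" with the
       `long' type gives Aa and Bb, and "1500,2000" with the `num' type
       gives 1500 and 2000.
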
       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping),
              e.g. for morphologically complex languages with a
              right-to-left writing system.

       LANG langcode
              Set the language code for language-specific functions of
              Hunspell. Use it to activate the special casing of Azeri
              (LANG az), Turkish (LANG tr) and Crimean Tatar (LANG crh), and
              the non-generalized syllable-counting compounding rules of
              Hungarian (LANG hu).

       IGNORE characters
              Set characters to ignore in dictionary words, affixes and
              input words. Useful for optional characters, such as Arabic
              (harakat) or Hebrew (niqqud) diacritical marks (see the
              tests/ignore.* test dictionary in the Hunspell distribution).

       AF number_of_flag_vector_aliases

       AF flag_vector
              Hunspell can substitute affix flag sets with ordinal numbers
              in affix rules (alias compression, see the makealias tool).
              The first example with alias compression:

              3
              hello
              try/1
              work/2

       AF definitions in the affix file:

              AF 2
              AF A
              AF AB

       This is equivalent to the following dic file:

              3
              hello
              try/A
              work/AB

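       The alias expansion shown above can be sketched in Python. The
       function expand_aliases is a hypothetical helper, not part of
       Hunspell or makealias; it assumes every flag field in the dic file
       is a numeric alias and ignores escaped slashes.

```python
def expand_aliases(aff_lines, dic_lines):
    """Replace numeric AF aliases in .dic entries with their flag vectors."""
    af = [line.split()[1] for line in aff_lines if line.split()[:1] == ["AF"]]
    aliases = af[1:]  # af[0] is the alias count from the header line
    out = [dic_lines[0]]  # keep the word-count line unchanged
    for entry in dic_lines[1:]:
        word, sep, alias = entry.partition("/")
        # Aliases are 1-based ordinal numbers into the AF table.
        out.append(f"{word}/{aliases[int(alias) - 1]}" if sep else entry)
    return out
```

       Applied to the three AF lines and the compressed dic file above, it
       returns the uncompressed entries try/A and work/AB.
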
       See also the tests/alias* examples in the source distribution.

       Note I: If the affix file contains the FLAG parameter, define it
       before the AF definitions.

       Note II: Use the makealias utility in the Hunspell distribution to
       compress aff and dic files.

       AM number_of_morphological_aliases

       AM morphological_fields
              Hunspell can also substitute morphological data with ordinal
              numbers in affix rules (alias compression). See the
              tests/alias* examples.

AFFIX FILE OPTIONS FOR SUGGESTION

       Suggestion parameters can optimize Hunspell's default n-gram
       suggestions (a similarity search in the dictionary words based on
       common character sequences of length 1, 2, 3 and 4), character swap
       suggestions and deletion suggestions. REP is recommended for fixing
       typical and especially bad language-specific errors, because REP
       suggestions have the highest priority in the suggestion list. PHONE
       is for languages whose orthography is not based on pronunciation.

       For short common misspellings, it's important to use the ph: field
       (see later) to give the best suggestions.

       KEY characters_separated_by_vertical_line_optionally
              Hunspell searches for and suggests words that differ from the
              input in one character replaced by a neighboring KEY
              character. Characters that are not neighbors are separated in
              the KEY string by vertical line characters. Suggested KEY
              parameters for the QWERTY and Dvorak keyboard layouts:

              KEY qwertyuiop|asdfghjkl|zxcvbnm
              KEY pyfgcrl|aeouidhtns|qjkxbmwvz

       Using the first (QWERTY) layout, Hunspell suggests "nude" and "node"
       for "*nide". A character may have more than two neighbors, too:

              KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm

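       The neighbor replacement described for KEY can be sketched in
       Python. The helper names key_neighbors and key_suggestions are
       hypothetical, and the sketch only covers single-character
       replacement filtered by a word set; Hunspell's real suggestion
       ranking is not modeled.

```python
def key_neighbors(key):
    """Map each character to the set of characters adjacent to it in KEY."""
    neighbors = {}
    for group in key.split("|"):
        for a, b in zip(group, group[1:]):
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return neighbors

def key_suggestions(word, key, dictionary):
    """Try each neighbor replacement; keep candidates found in dictionary."""
    neighbors = key_neighbors(key)
    found = set()
    for i, ch in enumerate(word):
        for repl in neighbors.get(ch, ()):
            candidate = word[:i] + repl + word[i + 1:]
            if candidate in dictionary:
                found.add(candidate)
    return sorted(found)
```

       With the QWERTY string above, `i' has the neighbors `u' and `o', so
       "nide" produces the candidates "nude" and "node" when both are in
       the word set.
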
       TRY characters
              Hunspell can suggest the right word forms when they differ
              from the bad input word by one TRY character. The parameter of
              TRY is case sensitive.

       NOSUGGEST flag
              Words marked with the NOSUGGEST flag are not suggested (but
              are still accepted when typed correctly). Proposed flag for
              vulgar and obscene words (see also SUBSTANDARD).

       MAXCPDSUGS num
              Set the maximum number of suggested compound words generated
              by compound rules. The number of suggested compound words may
              be greater for suggestions of the same 1-character distance
              type.

       MAXNGRAMSUGS num
              Set the maximum number of n-gram suggestions. A value of 0
              switches off the n-gram suggestions (see also MAXDIFF).

       MAXDIFF [0-10]
              Set the similarity factor for the n-gram based suggestions (5
              = default value; 0 = fewer n-gram suggestions, but at least 1;
              10 = MAXNGRAMSUGS n-gram suggestions).

       ONLYMAXDIFF
              Remove all bad n-gram suggestions (the default mode keeps one,
              see MAXDIFF).

       NOSPLITSUGS
              Disable word suggestions with spaces.

       SUGSWITHDOTS
              Add dot(s) to suggestions if the input word ends in dot(s).
              (Not for LibreOffice dictionaries, because LibreOffice has an
              automatic dot expansion mechanism.)

       REP number_of_replacement_definitions

       REP what replacement
              This table specifies modifications to try first. The first
              REP line is the header of the table, and one or more REP data
              lines follow it. With this table, Hunspell can suggest the
              right forms for typical spelling mistakes when the incorrect
              form differs by more than one letter from the right form (see
              also "ph:"). The search string supports the regex boundary
              signs (^ and $). For example, a possible English replacement
              table definition to handle misspelled consonants:

              REP 5
              REP f ph
              REP ph f
              REP tion$ shun
              REP ^cooccurr co-occurr
              REP ^alot$ a_lot

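       How a REP table turns a misspelling into candidates can be sketched
       in Python. The function rep_candidates is a hypothetical name; the
       sketch generates raw candidates only, whereas Hunspell also checks
       them against the dictionary. Underscores in the replacement are
       turned into spaces.

```python
def rep_candidates(word, table):
    """Apply each (what, replacement) pair once per match position."""
    candidates = []
    for what, replacement in table:
        at_start = what.startswith("^")  # pattern must match at word start
        at_end = what.endswith("$")      # pattern must match at word end
        needle = what.lstrip("^").rstrip("$")
        replacement = replacement.replace("_", " ")
        start = 0
        while (i := word.find(needle, start)) != -1:
            end = i + len(needle)
            if (not at_start or i == 0) and (not at_end or end == len(word)):
                candidates.append(word[:i] + replacement + word[end:])
            start = i + 1
    return candidates
```

       With the table above, "alot" yields the candidate "a lot" and "fone"
       yields "phone".
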
       Note I: It's very useful to define replacements for the most typical
       one-character mistakes, too: with REP you can give higher priority to
       a subset of the TRY suggestions (the suggestion list begins with the
       REP suggestions).

       Note II: To suggest separated words, specify spaces with underscores:

              REP 1
              REP onetwothree one_two_three

       Note III: The replacement table can be used for stricter compound
       word checking with the option CHECKCOMPOUNDREP.

       MAP number_of_map_definitions

       MAP string_of_related_chars_or_parenthesized_character_sequences
              A map table in the affix file (.aff) defines
              language-dependent information on characters and character
              sequences that should be considered related (i.e. nearer than
              other characters not in the set). With this table, Hunspell
              can suggest the right forms for words that incorrectly choose
              the wrong letter or letter group from a related set more than
              once in a word (see REP).

              For example, a possible mapping could be for the German
              umlauted ü versus the regular u; the word Frühstück really
              should be written with umlauted u's and not regular ones:

              MAP 1
              MAP uü

       Use parenthesized groups for character sequences (e.g. for composed
       Unicode characters):

              MAP 3
              MAP ß(ss)  (character sequence)
              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)

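       The effect of a MAP line, replacing related characters at several
       positions at once, can be sketched in Python. The names
       parse_map_group and map_candidates are hypothetical; the sketch
       enumerates all variants of a word, and a real checker would keep
       only those found in the dictionary.

```python
import re
from itertools import product

def parse_map_group(definition):
    """Split 'ß(ss)' into ['ß', 'ss']; parentheses hold multi-char items."""
    return [a or b for a, b in re.findall(r"\(([^)]*)\)|(.)", definition)]

def map_candidates(word, group):
    """Return every variant of word with group members swapped for others."""
    # Try longer members first so sequences win over single characters.
    members = sorted(group, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in members)
    pieces = re.split(f"({pattern})", word)
    # Odd-indexed pieces are matched members; each may become any member.
    options = [group if i % 2 else [p] for i, p in enumerate(pieces)]
    return {"".join(choice) for choice in product(*options)} - {word}
```

       For the group parsed from "uü", the input "Fruhstuck" produces three
       variants, one of which is "Frühstück".
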
       PHONE number_of_phone_definitions

       PHONE what replacement
              PHONE uses a table-driven phonetic transcription algorithm
              borrowed from Aspell. It is useful for languages whose
              orthography is not based on pronunciation. You can add a full
              alphabet conversion and other rules for the conversion of
              special letter sequences. For detailed documentation see
              http://aspell.net/man-html/Phonetic-Code.html. Note:
              Multibyte UTF-8 characters do not work with bracket
              expressions yet. Dash expressions work on signed bytes and
              not on UTF-8 characters yet.

       WARN flag
              This flag is for rare words that are also often spelling
              mistakes; see option -r of command line Hunspell and
              FORBIDWARN.

       FORBIDWARN
              With this parameter, words with the WARN flag aren't accepted
              by the spell checker.

OPTIONS FOR COMPOUNDING

       BREAK number_of_break_definitions

       BREAK character_or_character_sequence
              Define new break points for breaking words and checking word
              parts separately. Use ^ and $ to delete characters at the
              start and end of the word. Rationale: useful for compounding
              with joining characters or strings (for example, the hyphen in
              English and German, or the hyphen and n-dash in Hungarian).
              Dashes are often bad break points for tokenization, because
              compounds with dashes may contain invalid parts, too. With
              BREAK, Hunspell can check both sides of these compounds,
              breaking the words at dashes and n-dashes:

              BREAK 2
              BREAK -
              BREAK --    # n-dash

       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds. Note: The default word break of Hunspell is
       equivalent to the following BREAK definition:

              BREAK 3
              BREAK -
              BREAK ^-
              BREAK -$

       With the following BREAK definition, Hunspell doesn't accept the
       "-word" and "word-" forms:

              BREAK 1
              BREAK -

       Switching off the default values:

              BREAK 0

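       The recursive breaking described above can be sketched in Python.
       The function accepts is a hypothetical name and the logic is a
       simplified reading of the text (a plain set stands in for the
       dictionary); Hunspell's actual implementation may differ in detail.

```python
def accepts(word, dictionary, breaks):
    """Accept word if it, or every part obtained by breaking it, is known."""
    if word in dictionary:
        return True
    for pattern in breaks:
        if pattern.startswith("^"):
            # ^x: drop x from the start of the word and check the rest.
            lead = pattern[1:]
            if lead and word.startswith(lead) and len(word) > len(lead):
                if accepts(word[len(lead):], dictionary, breaks):
                    return True
        elif pattern.endswith("$"):
            # x$: drop x from the end of the word and check the rest.
            trail = pattern[:-1]
            if trail and word.endswith(trail) and len(word) > len(trail):
                if accepts(word[:-len(trail)], dictionary, breaks):
                    return True
        else:
            # Plain pattern: both sides of some occurrence must be accepted.
            start = 1
            while (i := word.find(pattern, start)) != -1:
                right = word[i + len(pattern):]
                if (right and accepts(word[:i], dictionary, breaks)
                        and accepts(right, dictionary, breaks)):
                    return True
                start = i + 1
    return False
```

       With the words foo and bar and the breaks "-" and "--", the forms
       foo-bar and foo-foo--bar-bar are accepted. The form "-foo" is
       accepted only when "^-" is among the break patterns.
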
       Note II: COMPOUNDRULE is better for handling dashes and other
       compound joining characters or character strings. Use BREAK if you
       want to check words with dashes or other joining characters and
       there is no time or possibility to describe precise compound rules
       with COMPOUNDRULE (COMPOUNDRULE handles only the suffixation of the
       last word part of a compound word).

       Note III: For command line spell checking of words with extra
       characters, set the WORDCHARS parameter: WORDCHARS --- (see the
       tests/break.* example).

       COMPOUNDRULE number_of_compound_definitions

       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax. The
              first COMPOUNDRULE is a header with the number of the
              following COMPOUNDRULE definitions. Compound patterns consist
              of compound flags, parentheses, and the star and question mark
              metacharacters. A flag followed by a `*' matches a sequence of
              0 or more words marked with this compound flag. A flag
              followed by a `?' matches a sequence of 0 or 1 words marked
              with this compound flag. See the tests/compound*.* examples.

              Note: The en_US dictionary of OpenOffice.org uses COMPOUNDRULE
              for ordinal number recognition (1st, 2nd, 11th, 12th, 22nd,
              112th, 1000122nd etc.).

              Note II: With long and numerical flag types, use only
              parenthesized flags: (1500)*(2000)?

              Note III: COMPOUNDRULE flags work completely separately from
              the compounding mechanisms that use the COMPOUNDFLAG,
              COMPOUNDBEGIN, etc. compound flags. (Use these flags on
              different dictionary entries.)

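       Matching a sequence of word parts against one compound pattern can
       be sketched in Python. Everything here is illustrative: the function
       matches_compound_rule, the flags A, B, C and the sample words are
       hypothetical, flags are assumed to be single letters, and the sketch
       checks only the flag sequence of parts that are already segmented.

```python
import re
from itertools import product

def matches_compound_rule(parts, flags_of, rule):
    """Check whether the flags of the word parts fit a compound pattern."""
    # Each part may carry several flags, so try every flag assignment.
    choices = [flags_of.get(part, "") for part in parts]
    if not all(choices):
        return False  # some part has no compound flag at all
    # With single-letter flags, the pattern reads directly as a regex.
    pattern = re.compile(rule)
    return any(pattern.fullmatch("".join(seq)) for seq in product(*choices))
```

       For example, with foo flagged A, bar flagged AB and baz flagged C,
       the pattern A*B?C accepts the part sequence foo, foo, bar, baz but
       rejects baz, foo.
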
       COMPOUNDMIN num
              Minimum length of words used for compounding. The default
              value is 3 letters.

       COMPOUNDFLAG flag
              Words marked with COMPOUNDFLAG may be in compound words
              (except when the word is shorter than COMPOUNDMIN). Affixes
              with COMPOUNDFLAG also permit compounding of affixed words.

       COMPOUNDBEGIN flag
              Words marked with COMPOUNDBEGIN (or with a marked affix) may
              be first elements in compound words.

       COMPOUNDLAST flag
              Words marked with COMPOUNDLAST (or with a marked affix) may be
              last elements in compound words.

       COMPOUNDMIDDLE flag
              Words marked with COMPOUNDMIDDLE (or with a marked affix) may
              be middle elements in compound words.

       ONLYINCOMPOUND flag
              Suffixes marked with the ONLYINCOMPOUND flag may occur only
              inside compounds (Fuge-elements in German, fogemorphemes in
              Swedish). The ONLYINCOMPOUND flag also works with words (see
              tests/onlyincompound.*). Note: It is also valuable for
              flagging compounding parts that are not correct as words by
              themselves.

       COMPOUNDPERMITFLAG flag
              By default, prefixes are allowed at the beginning of compounds
              and suffixes are allowed at the end of compounds. Affixes
              with COMPOUNDPERMITFLAG may be inside compounds.

       COMPOUNDFORBIDFLAG flag
              Suffixes with this flag forbid compounding of the affixed
              word. Dictionary words with this flag are removed from the
              beginning and middle of compound words, overriding the effect
              of COMPOUNDPERMITFLAG.

       COMPOUNDMORESUFFIXES
              Allow twofold suffixes within compounds.

       COMPOUNDROOT flag
              The COMPOUNDROOT flag marks the compounds in the dictionary
              (currently used only in the Hungarian language-specific
              code).

       COMPOUNDWORDMAX number
              Set the maximum word count in a compound word. (The default
              is unlimited.)

       CHECKCOMPOUNDDUP
              Forbid word duplication in compounds (e.g. foofoo).

       CHECKCOMPOUNDREP
              Forbid compounding if the (usually bad) compound word may be a
              non-compound word with a REP fault. Useful for languages with
              `compound friendly' orthography.

       CHECKCOMPOUNDCASE
              Forbid upper case characters at word boundaries in compounds.

       CHECKCOMPOUNDTRIPLE
              Forbid compounding if the compound word contains triple
              repeating letters (e.g. foo|ox or xo|oof). Bug: multi-byte
              character support is missing in UTF-8 encoding (works only
              for 7-bit ASCII characters).

       SIMPLIFIEDTRIPLE
              Allow simplified 2-letter forms of the compounds forbidden by
              CHECKCOMPOUNDTRIPLE. It's useful for Swedish and Norwegian
              (and for the old German orthography: Schiff|fahrt ->
              Schiffahrt).

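       The triple-letter test and the simplified form can be sketched in
       Python. The function names are hypothetical, and because Python
       strings are sequences of Unicode characters the sketch does not
       reproduce the 7-bit ASCII limitation noted above.

```python
def triple_at_boundary(left, right):
    """True if three identical letters in a row span the join of the parts."""
    joined, b = left + right, len(left)
    # A run of three that crosses the boundary starts at b-2 or b-1.
    return any(
        0 <= i and i + 2 < len(joined)
        and joined[i] == joined[i + 1] == joined[i + 2]
        for i in (b - 2, b - 1)
    )

def simplified_form(left, right):
    """Drop one of the three repeated letters where a triple would occur."""
    if triple_at_boundary(left, right):
        return left + right[1:]
    return left + right
```

       Both foo|ox and xo|oof are detected as triples, and
       simplified_form("Schiff", "fahrt") gives "Schiffahrt".
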
       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions

       CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
              Forbid compounding if the first word in the compound ends with
              endchars, the next word begins with beginchars, and
              (optionally) they have the requested flags. The optional
              replacement parameter allows a simplified compound form.

              The special "endchars" pattern 0 (zero) limits the rule to the
              unmodified stems (stems and stems with zero affixes):

              CHECKCOMPOUNDPATTERN 0/x /y

       Note: COMPOUNDMIN doesn't work correctly with the compound word
       alternation, so it may be necessary to set COMPOUNDMIN to a lower
       value.

       FORCEUCASE flag
              A last word part of a compound with the FORCEUCASE flag forces
              capitalization of the whole compound word. E.g. the Dutch
              word "straat" (street) with the FORCEUCASE flag is allowed
              only in capitalized compound forms, according to the Dutch
              spelling rules for proper names.

       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian. The first
              parameter is the maximum number of syllables that may be in a
              compound if the compound has more words than COMPOUNDWORDMAX.
              The second parameter is the list of vowels (for calculating
              syllables).

       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.

AFFIX FILE OPTIONS FOR AFFIX CREATION

       PFX flag cross_product number

       PFX flag stripping prefix [condition [morphological_fields...]]

       SFX flag cross_product number

       SFX flag stripping suffix [condition [morphological_fields...]]
              An affix is either a prefix or a suffix attached to root words
              to make other words. We can define affix classes with an
              arbitrary number of affix rules. Affix classes are marked with
              affix flags. The first line of an affix class definition is
              the header. The fields of an affix class header:

              (0) Option name (PFX or SFX)

              (1) Flag (name of the affix class)

              (2) Cross product (permission to combine prefixes and suffixes).
              Possible values: Y (yes) or N (no)

              (3) Line count of the following rules.

              Fields of an affix rule:

              (0) Option name

              (1) Flag

              (2) Characters to strip from the beginning (in prefix rules)
              or end (in suffix rules) of the word

              (3) Affix (optionally with flags of continuation classes,
              separated by a slash)

              (4) Condition.

              Zero stripping or a zero affix is indicated by a zero. A zero
              condition is indicated by a dot. The condition is a
              simplified, regular expression-like pattern that must be met
              before the affix can be applied. (A dot stands for an
              arbitrary character. Characters in brackets stand for an
              arbitrary character from that character subset. A dash has no
              special meaning, but a circumflex (^) right after the opening
              bracket selects the complement of the character set.)

              (5) Optional morphological fields separated by spaces or tabs.

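       The rule fields and the condition syntax listed above can be
       sketched in Python. The names parse_rule and condition_regex are
       hypothetical; parse_rule expects a rule line (not a class header),
       and condition_regex escapes dashes because the text above says a
       dash has no special meaning in a condition.

```python
import re

def condition_regex(condition, kind):
    """Compile an affix condition; kind is 'PFX' or 'SFX'."""
    # Escape dashes so Python does not read [a-z] as a character range.
    body = condition.replace("-", r"\-")
    # Prefix conditions match at the word start, suffix ones at the end.
    return re.compile("^" + body if kind == "PFX" else body + "$")

def parse_rule(line):
    """Split an affix rule line into its numbered fields."""
    kind, flag, strip, affix, *rest = line.split()
    affix, _, continuation = affix.partition("/")
    return {
        "kind": kind,
        "flag": flag,
        "strip": "" if strip == "0" else strip,   # zero means no stripping
        "affix": "" if affix == "0" else affix,   # zero means empty affix
        "continuation": continuation,
        "condition": rest[0] if rest else ".",
        "morph": rest[1:],
    }
```

       For the rule "SFX B 0 ed [^y]", the compiled condition matches
       "work" but not "try".
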

AFFIX FILE OTHER OPTIONS

       CIRCUMFIX flag
              Affixes marked with the CIRCUMFIX flag may be on a word when
              this word also has a prefix with the CIRCUMFIX flag, and vice
              versa (see the circumfix.* test files in the source
              distribution).

       FORBIDDENWORD flag
              This flag marks a forbidden word form. Because affixed forms
              are also forbidden, we can subtract a subset from the set of
              accepted affixed and compound words. Note: useful for
              forbidding erroneous words generated by the compounding
              mechanism.

       FULLSTRIP
              With FULLSTRIP, affix rules can strip full words, not only all
              but one character, before adding the affixes (see the
              fullstrip.* test files in the source distribution). Note:
              conditions may span the whole word length without FULLSTRIP,
              too.

       KEEPCASE flag
              Forbid uppercased and capitalized forms of words marked with
              the KEEPCASE flag. Useful for special orthographies
              (measurements and currency often keep their case in uppercased
              texts) and writing systems (e.g. keeping the lower case of IPA
              characters). Also valuable for words erroneously written in
              the wrong case.

              Note: With the CHECKSHARPS declaration, words with sharp s and
              the KEEPCASE flag may be capitalized and uppercased, but
              uppercased forms of these words may not contain sharp s, only
              SS. See the germancompounding example in the tests directory
              of the Hunspell distribution.

       ICONV number_of_ICONV_definitions

       ICONV pattern pattern2
              Define an input conversion table. Note: useful for converting
              one type of quote to another, or for changing ligatures.

       OCONV number_of_OCONV_definitions

       OCONV pattern pattern2
              Define an output conversion table.

       LEMMA_PRESENT flag
              Deprecated. Use the "st:" field instead of LEMMA_PRESENT.

       NEEDAFFIX flag
              This flag marks virtual stems in the dictionary: words that
              are valid only when affixed, unless the dictionary word has a
              homonym or a zero affix. NEEDAFFIX also works with prefixes
              and prefix + suffix combinations (see tests/needaffix5.*).

       PSEUDOROOT flag
              Deprecated. (Former name of the NEEDAFFIX option.)

       SUBSTANDARD flag
              The SUBSTANDARD flag marks affix rules and dictionary words
              (allomorphs) that are not used in morphological generation,
              and root words that are removed from suggestions. See also
              NOSUGGEST.

       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters. For example, the
              dot, dash, n-dash, numbers and percent sign are word
              characters in Hungarian.

       CHECKSHARPS
              The SS letter pair in uppercased (German) words may stand for
              an uppercased sharp s (ß). Hunspell can handle this special
              casing with the CHECKSHARPS declaration (see also the KEEPCASE
              flag and the tests/germancompounding example) in both spelling
              and suggestion.
Morphological analysis

       Hunspell's dictionary items and affix rules may have optional
       morphological description fields, separated by spaces or tabs and
       starting with 3-character field IDs (two letters and a colon):

               word/flags po:noun is:nom

       Example: We define a simple resource with morphological information,
       a derivational suffix (ds:) and a part of speech category (po:):

       Affix file:

               SFX X Y 1
               SFX X 0 able . ds:able

       Dictionary file:

               drink/X po:verb

       Test file:

               drink
               drinkable

       Test:

               $ analyze test.aff test.dic test.txt
               > drink
               analyze(drink) = po:verb
               stem(drink) = po:verb
               > drinkable
               analyze(drinkable) = po:verb ds:able
               stem(drinkable) = drinkable

       As the example shows, the analyzer concatenates the morphological
       fields in item and arrangement style.

Optional data fields

       Default morphological and other IDs (used in suggestion, stemming and
       morphological generation):

       ph:    Alternative transliteration for better suggestions, i.e. for
              misspellings related to the special orthography and
              pronunciation of the word. This is the best way to handle
              common misspellings, so it's worth adding a ph: field to the
              few thousand most affected dictionary words (or word pairs
              etc.) to get correct suggestions for their misspellings.

              For example:

              Wednesday ph:wendsay ph:wensday
              Marseille ph:maarsayl

       Hunspell adds all ph: transliterations to the inner REP table, so it
       will always suggest the correct word for the specified misspellings
       with the highest priority.

       The previous example is equivalent to the following REP definition:

              REP 6
              REP wendsay Wednesday
              REP Wendsay Wednesday
              REP wensday Wednesday
              REP Wensday Wednesday
              REP maarsayl Marseille
              REP Maarsayl Marseille

       An asterisk at the end of the ph: pattern means stripping the
       terminating character from both the pattern and the word in the
       associated REP rule:

              pretty ph:prity*

       will result in the REP rule

              REP 1
              REP prit prett

       which gives the following correct suggestions:

              *prity -> pretty
              *pritier -> prettier
              *pritiest -> prettiest

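       The expansion of ph: fields into REP pairs can be sketched in
       Python. The function ph_to_rep is a hypothetical name, and the rule
       that a capitalized pattern is added only for capitalized words is an
       inference from the two examples above, not a documented guarantee.
       Entries with other morphological fields or with the "->" form are
       not handled.

```python
def ph_to_rep(entry):
    """Expand the ph: fields of one dictionary entry into REP pairs."""
    head, *patterns = entry.split(" ph:")
    word = head.split("/")[0].strip()  # drop any flags after the slash
    rules = []
    for pattern in patterns:
        pattern, target = pattern.strip(), word
        if pattern.endswith("*"):
            # Strip the asterisk plus the last character of the pattern,
            # and the last character of the word.
            pattern, target = pattern[:-2], word[:-1]
        rules.append((pattern, target))
        if word[:1].isupper():
            # Assumed: capitalized words also get a capitalized pattern.
            rules.append((pattern[:1].upper() + pattern[1:], target))
    return rules
```

       For "pretty ph:prity*" this gives the single pair (prit, prett); for
       the Wednesday entry it gives the four pairs listed in the REP
       definition above.
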
708       Moreover,  ph:  fields can handle suggestions with more than two words,
709       also different suggestions for the same misspelling:
710
711              do not know ph:dunno
712              don't know ph:dunno
713
714       results
715
716
717              *dunno -> do not know, don't know
718
719       Note: if available, ph: is used in n-gram similarity, too.
720
721       The ASCII arrow "->" in a ph: pattern means a REP rule (see REP),  cre‐
722       ating arbitrary replacement rule associated to the dictionary item:
723
724              happy/B ph:hepy ph:hepi->happi
725
726       results
727
728
729              *hepy -> happy
730              *hepiest -> happiest
731
732       st:    Stem.  Optional:  default stem is the dictionary item in morpho‐
733              logical analysis. Stem field is useful for virtual  stems  (dic‐
734              tionary  words with NEEDAFFIX flag) and morphological exceptions
735              instead of new, single used morphological rules.
736
737              feet  st:foot  is:plural
738              mice  st:mouse is:plural
739              teeth st:tooth is:plural
740
741       Word forms with multiple stems need multiple dictionary items:
742
743
744              lay po:verb st:lie is:past_2
745              lay po:verb is:present
746              lay po:noun
747
748       al:    Allomorph(s). A dictionary item is the stem of  its  allomorphs.
749              Morphological generation needs stem, allomorph and affix fields.
750
751              sing al:sang al:sung
752              sang st:sing
753              sung st:sing
754
755       po:    Part of speech category.
756
757       ds:    Derivational  suffix(es).   Stemming doesn't remove derivational
758              suffixes.  Morphological generation depends on the order of  the
759              suffix fields.
760
761              In affix rules:
762
763
764              SFX Y Y 1
765              SFX Y 0 ly . ds:ly_adj
766
767       In the dictionary:
768
769
770              ably st:able ds:ly_adj
771              able al:ably
772
773       is:    Inflectional  suffix(es).  All inflectional suffixes are removed
774              by stemming.  Morphological generation depends on the  order  of
775              the suffix fields.
776
777
778              feet st:foot is:plural
779
780       ts:    Terminal  suffix(es).   Terminal  suffix fields are inflectional
781              suffix fields "removed" by additional (not terminal) suffixes.
782
783              Useful for zero  morphemes  and  affixes  removed  by  splitting
784              rules.
785
786
787              work/D ts:present
788
789              SFX D Y 2
790              SFX D   0 ed . is:past_1
791              SFX D   0 ed . is:past_2
792
       A typical example of a terminal suffix is the zero morpheme of the
       nominative case.
795
796
797       sp:    Surface prefix. Temporary solution for adding  prefixes  to  the
798              stems and generated word forms. See tests/morph.* example.
799
800
801       pa:    Parts  of  the  compound  words.  Output fields of morphological
802              analysis for stemming.
803
804       dp:    Planned: derivational prefix.
805
806       ip:    Planned: inflectional prefix.
807
808       tp:    Planned: terminal prefix.
809
810

Twofold suffix stripping

       Ispell's original algorithm strips only one suffix. Hunspell can strip
       one more suffix (or one more prefix in COMPLEXPREFIXES mode).

       Twofold suffix stripping is a significant improvement in handling the
       immense number of suffixes that characterize agglutinative languages.
818
       In the following example, the `s' suffix (affix class Y) is the
       continuation class of the suffix `able':
821
822
823               SFX Y Y 1
824               SFX Y 0 s .
825
826               SFX X Y 1
827               SFX X 0 able/Y .
828
829       Dictionary file:
830
831
832               drink/X
833
834       Test file:
835
836
837               drink
838               drinkable
839               drinkables
840
841       Test:
842
843
844               $ hunspell -m -d test <test.txt
845               drink st:drink
846               drinkable st:drink fl:X
847               drinkables st:drink fl:X fl:Y
848
       In theory, twofold suffix stripping needs only the square root of the
       number of suffix rules required by a one-level implementation such as
       Ispell. In our practice, we were able to describe the Hungarian
       inflectional morphology with twofold suffix stripping.
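
       The stripping order of the drink example can be sketched in Python as
       below. This is a simplified model, not Hunspell's implementation: the
       table layout and the name analyze are assumptions, flags are single
       characters, and the strip field ("0") and the condition (".") of the
       rules are left out.

```python
# suffix class -> list of (appended text, continuation classes)
SUFFIXES = {
    "Y": [("s", "")],
    "X": [("able", "Y")],
}
DICTIONARY = {"drink": "X"}   # word -> affix classes

def analyze(word):
    """Return (stem, [affix classes]) pairs using at most two suffix strips."""
    results = []
    if word in DICTIONARY:
        results.append((word, []))
    for outer_cls, outer_rules in SUFFIXES.items():
        for outer_add, _ in outer_rules:
            if not word.endswith(outer_add):
                continue
            base = word[: -len(outer_add)]
            # one suffix: its class must be listed on the dictionary word
            if outer_cls in DICTIONARY.get(base, ""):
                results.append((base, [outer_cls]))
            # two suffixes: the outer class must be a continuation class
            # of the inner suffix
            for inner_cls, inner_rules in SUFFIXES.items():
                for inner_add, continuation in inner_rules:
                    if outer_cls in continuation and base.endswith(inner_add):
                        stem = base[: -len(inner_add)]
                        if inner_cls in DICTIONARY.get(stem, ""):
                            results.append((stem, [inner_cls, outer_cls]))
    return results

for w in ("drink", "drinkable", "drinkables", "drinks"):
    print(w, analyze(w))
```

       In this model "drinks" gets no analysis, because class Y is reachable
       only through the `able' suffix and is not listed on the dictionary
       word.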
853
854

Extended affix classes

       Hunspell can handle more than 65000 affix classes.  There are three new
       syntaxes for giving flags in affix and dictionary files.
858
859       FLAG long command sets 2-character flags:
860
861
                FLAG long
                SFX Y1 Y 1
                SFX Y1 0 s .
865
866       Dictionary record with the Y1, Z3, F? flags:
867
868
869                foo/Y1Z3F?
870
871       FLAG num command sets numerical flags separated by comma:
872
873
                FLAG num
                SFX 65000 Y 1
                SFX 65000 0 s .
877
878       Dictionary example:
879
880
881                foo/65000,12,2756
882
       The third syntax, FLAG UTF-8, sets UTF-8 encoded Unicode character
       flags.
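
       How a flag string is split under each syntax can be sketched in Python
       as follows. The function name split_flags and the flag_type values are
       illustrative assumptions; the sketch works on already decoded text, so
       the default one-character flags and the Unicode character flags are
       handled the same way.

```python
def split_flags(flag_string, flag_type="char"):
    """Split the flag part of a dictionary or affix entry into flags."""
    if flag_type == "long":
        # two characters per flag: "Y1Z3F?" -> Y1, Z3, F?
        if len(flag_string) % 2:
            raise ValueError("long flags need an even number of characters")
        return [flag_string[i:i + 2] for i in range(0, len(flag_string), 2)]
    if flag_type == "num":
        # decimal numbers separated by commas: "65000,12,2756"
        return [int(part) for part in flag_string.split(",")]
    # default flags and Unicode character flags: one character per flag
    return list(flag_string)

print(split_flags("Y1Z3F?", "long"))
print(split_flags("65000,12,2756", "num"))
print(split_flags("XY"))
```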
884
885

Homonyms

       A Hunspell dictionary can contain repeated entries that are homonyms:
888
889
890               work/A    po:verb
891               work/B    po:noun
892
893       An affix file:
894
895
               SFX A Y 1
               SFX A 0 s . is:sg3

               SFX B Y 1
               SFX B 0 s . is:plur
901
902       Test file:
903
904
905               works
906
907       Test:
908
909
910               $ hunspell -d test -m <testwords
911               work st:work po:verb is:sg3
912               work st:work po:noun is:plur
913
914       This feature also gives a way to forbid illegal prefix/suffix  combina‐
915       tions.
916
917

Prefix--suffix dependencies

       An interesting side effect of multi-step stripping is that the
       appropriate treatment of circumfixes now comes for free. For instance,
       in Hungarian, superlatives are formed by simultaneous prefixation of
       leg- and suffixation of -bb to the adjective base. A problem with the
       one-level architecture is that there is no way to make the lexical
       licensing of particular prefixes and suffixes interdependent, and
       therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
       vén `old'. Until the introduction of clusters, a special treatment of
       the superlative had to be hardwired in the earlier HunSpell code. This
       may have been legitimate for a single case, but in fact prefix--suffix
       dependencies are ubiquitous in category-changing derivational patterns
       (cf. English payable, non-payable but *non-pay, or drinkable,
       undrinkable but *undrink). In simple terms, the prefix un- is
       legitimate here only if the base drink is suffixed with -able. If both
       of these patterns are handled by on-line affix rules and affix rules
       are checked against the base only, there is no way to express this
       dependency and the system will necessarily over- or undergenerate.
936
       In the next example, suffix class R has a prefix `continuation' class
       (class P).
939
940
941              PFX P Y 1
942              PFX P   0 un . [prefix_un]+
943
944              SFX S Y 1
945              SFX S   0 s . +PL
946
947              SFX Q Y 1
948              SFX Q   0 s . +3SGV
949
950              SFX R Y 1
951              SFX R   0 able/PS . +DER_V_ADJ_ABLE
952
953       Dictionary:
954
955
956              2
957              drink/RQ  [verb]
958              drink/S   [noun]
959
960       Morphological analysis:
961
962
963              > drink
964              drink[verb]
965              drink[noun]
966              > drinks
967              drink[verb]+3SGV
968              drink[noun]+PL
969              > drinkable
970              drink[verb]+DER_V_ADJ_ABLE
971              > drinkables
972              drink[verb]+DER_V_ADJ_ABLE+PL
973              > undrinkable
974              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
975              > undrinkables
976              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
977              > undrink
978              Unknown word.
979              > undrinks
980              Unknown word.
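
       The dependency in this example can also be checked by generating the
       licensed surface forms. The Python sketch below is a simplified model
       under stated assumptions: generate_forms and the table layout are
       hypothetical, flags are single characters, and a prefix is treated as
       licensed either by the dictionary entry or by the first suffix.

```python
PREFIXES = {"P": "un"}
# suffix class -> (appended text, continuation classes of the suffix)
SUFFIXES = {"S": ("s", ""), "Q": ("s", ""), "R": ("able", "PS")}
DICTIONARY = [("drink", "RQ"), ("drink", "S")]

def generate_forms():
    """Return the set of surface forms licensed by the tables above."""
    forms = set()
    for stem, flags in DICTIONARY:
        forms.add(stem)
        # prefixes licensed directly by the dictionary entry (none here)
        forms.update(PREFIXES[f] + stem for f in flags if f in PREFIXES)
        for first in flags:
            if first not in SUFFIXES:
                continue
            add, continuation = SUFFIXES[first]
            variants = [stem + add]
            # a second suffix must be a continuation class of the first one
            variants += [
                stem + add + SUFFIXES[second][0]
                for second in continuation
                if second in SUFFIXES
            ]
            forms.update(variants)
            # a prefix is licensed by the entry or by the first suffix
            for f in flags + continuation:
                if f in PREFIXES:
                    forms.update(PREFIXES[f] + v for v in variants)
    return forms

print(sorted(generate_forms()))
```

       In this model "undrinkable" is generated because class P is reachable
       through the `able' suffix, while "undrink" is not generated because no
       dictionary entry carries P.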
981

Circumfix

       Conditional affixes implemented by a continuation class are not enough
       for circumfixes, because a circumfix is a single affix in morphology.
       The CIRCUMFIX option is also needed for correct morphological analysis.
986
987
988              # circumfixes: ~ obligate prefix/suffix combinations
989              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
990              # nagy, nagyobb, legnagyobb, legeslegnagyobb
991              # (great, greater, greatest, most greatest)
992
993              CIRCUMFIX X
994
995              PFX A Y 1
996              PFX A 0 leg/X .
997
998              PFX B Y 1
999              PFX B 0 legesleg/X .
1000
1001              SFX C Y 3
1002              SFX C 0 obb . +COMPARATIVE
1003              SFX C 0 obb/AX . +SUPERLATIVE
1004              SFX C 0 obb/BX . +SUPERSUPERLATIVE
1005
1006       Dictionary:
1007
1008
1009              1
1010              nagy/C    [MN]
1011
1012       Analysis:
1013
1014
1015              > nagy
1016              nagy[MN]
1017              > nagyobb
1018              nagy[MN]+COMPARATIVE
1019              > legnagyobb
1020              nagy[MN]+SUPERLATIVE
1021              > legeslegnagyobb
1022              nagy[MN]+SUPERSUPERLATIVE
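
       The effect of the CIRCUMFIX flag on this example can be sketched in
       Python as follows. This is an illustration rather than Hunspell's
       algorithm: analyze and the tables are hypothetical, and the only rule
       modelled is that an affix carrying the circumfix flag is accepted only
       together with an affix on the other side that carries it too.

```python
CIRCUMFIX = "X"
# prefix class -> (prefix text, flags of the prefix rule)
PREFIXES = {"A": ("leg", "X"), "B": ("legesleg", "X")}
# suffix class -> list of (suffix text, flags of the rule, analysis label)
SUFFIXES = {
    "C": [
        ("obb", "", "+COMPARATIVE"),
        ("obb", "AX", "+SUPERLATIVE"),
        ("obb", "BX", "+SUPERSUPERLATIVE"),
    ]
}
DICTIONARY = {"nagy": ("C", "[MN]")}

def analyze(word):
    """Return the analyses of a word under the tables above."""
    results = []
    for stem, (flags, tag) in DICTIONARY.items():
        if word == stem:
            results.append(stem + tag)
        for cls in flags:
            for add, sfx_flags, label in SUFFIXES.get(cls, []):
                # no prefix, or a prefix class licensed by this suffix rule
                candidates = [("", "")]
                candidates += [PREFIXES[f] for f in sfx_flags if f in PREFIXES]
                for pfx, pfx_flags in candidates:
                    # a circumfix half is valid only with the other half
                    if (CIRCUMFIX in sfx_flags) != (CIRCUMFIX in pfx_flags):
                        continue
                    if word == pfx + stem + add:
                        results.append(stem + tag + label)
    return results

for w in ("nagy", "nagyobb", "legnagyobb", "legeslegnagyobb", "legnagy"):
    print(w, analyze(w))
```

       Without the circumfix check, "nagyobb" would also receive the
       superlative analyses, since all three suffix rules append the same
       text.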
1023

Compounds

       Allowing free compounding decreases the precision of recognition, not
       to mention stemming and morphological analysis. Although Ispell
       introduced lexical switches to license compounding of bases, this
       proves not to be restrictive enough. For example:
1029
1030
              # affix file
              COMPOUNDFLAG X

              # dictionary file
              2
              foo/X
              bar/X
1037
       With this resource, foobar and barfoo are also accepted words.
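
       Why both orders are accepted can be seen from a minimal Python sketch
       of free compounding. The name is_compound is hypothetical, and the
       sketch ignores affixes, case and minimum part length: it only asks
       whether the word splits into two or more words carrying the compound
       flag.

```python
COMPOUND_WORDS = {"foo", "bar"}   # dictionary words carrying COMPOUNDFLAG X

def is_compound(word, parts=COMPOUND_WORDS, min_parts=2):
    """True if the word is a concatenation of at least min_parts parts."""
    def split_ok(rest, count):
        if not rest:
            return count >= min_parts
        # try every flagged word as the next constituent, in any order
        return any(
            rest.startswith(part) and split_ok(rest[len(part):], count + 1)
            for part in parts
        )
    return split_ok(word, 0)

for w in ("foobar", "barfoo", "foofoobar", "foo", "foobaz"):
    print(w, is_compound(w))
```

       Nothing in this check depends on the position of a part, which is the
       lack of restriction the text describes.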
1039
1040       This  has  been improved upon with the introduction of direction-sensi‐
1041       tive compounding, i.e., lexical features can specify separately whether
1042       a  base  can  occur  as leftmost or rightmost constituent in compounds.
1043       This, however, is still insufficient to handle the  intricate  patterns
1044       of  compounding,  not  to mention idiosyncratic (and language specific)
1045       norms of hyphenation.
1046
       The earlier algorithm allowed any affixed form of words that are
       lexically marked as potential members of compounds. Hunspell improves
       on this, and its recursive compound checking rules make it possible to
       implement the intricate spelling conventions of Hungarian compounds.
       For example, the noteworthy Hungarian `6-3' rule can be set with the
       COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
       options. As a further example, derivational suffixes in Hungarian
       often modify compounding properties. Hunspell allows compounding flags
       on affixes, and there are two special flags (COMPOUNDPERMITFLAG and
       COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
       derivations.

       Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of the
       affixed word.
1059
1060       We also need several Hunspell features for handling German compounding:
1061
1062
1063              # German compounding
1064
1065              # set language to handle special casing of German sharp s
1066
1067              LANG de_DE
1068
1069              # compound flags
1070
1071              COMPOUNDBEGIN U
1072              COMPOUNDMIDDLE V
1073              COMPOUNDEND W
1074
1075              # Prefixes are allowed at the beginning of compounds,
1076              # suffixes are allowed at the end of compounds by default:
1077              # (prefix)?(root)+(affix)?
1078              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
1079              COMPOUNDPERMITFLAG P
1080
              # for German fugemorphemes (Fuge-element)
1082              # Hint: ONLYINCOMPOUND is not required everywhere, but the
1083              # checking will be a little faster with it.
1084
1085              ONLYINCOMPOUND X
1086
1087              # forbid uppercase characters at compound word bounds
1088              CHECKCOMPOUNDCASE
1089
1090              # for handling Fuge-elements with dashes (Arbeits-)
1091              # dash will be a special word
1092
1093              COMPOUNDMIN 1
1094              WORDCHARS -
1095
              # compound settings and fugemorpheme for `Arbeit'
1097
1098              SFX A Y 3
1099              SFX A 0 s/UPX .
1100              SFX A 0 s/VPDX .
1101              SFX A 0 0/WXD .
1102
1103              SFX B Y 2
1104              SFX B 0 0/UPX .
1105              SFX B 0 0/VWXDP .
1106
1107              # a suffix for `Computer'
1108
1109              SFX C Y 1
1110              SFX C 0 n/WD .
1111
              # forbid exceptions (*Arbeitsnehmer)
1113
1114              FORBIDDENWORD Z
1115
1116              # dash prefix for compounds with dash (Arbeits-Computer)
1117
1118              PFX - Y 1
1119              PFX - 0 -/P .
1120
1121              # decapitalizing prefix
1122              # circumfix for positioning in compounds
1123
1124              PFX D Y 29
1125              PFX D A a/PX A
1126              PFX D Ä ä/PX Ä
1127               .
1128               .
1129              PFX D Y y/PX Y
1130              PFX D Z z/PX Z
1131
1132       Example dictionary:
1133
1134
1135              4
1136              Arbeit/A-
1137              Computer/BC-
1138              -/W
1139              Arbeitsnehmer/Z
1140
       Accepted compound words with the previous resource:
1142
1143
1144              Computer
1145              Computern
1146              Arbeit
1147              Arbeits-
1148              Computerarbeit
1149              Computerarbeits-
1150              Arbeitscomputer
1151              Arbeitscomputern
1152              Computerarbeitscomputer
1153              Computerarbeitscomputern
1154              Arbeitscomputerarbeit
1155              Computerarbeits-Computer
1156              Computerarbeits-Computern
1157
       Not accepted compounds:
1159
1160
1161              computer
1162              arbeit
1163              Arbeits
1164              arbeits
1165              ComputerArbeit
1166              ComputerArbeits
1167              Arbeitcomputer
1168              ArbeitsComputer
1169              Computerarbeitcomputer
1170              ComputerArbeitcomputer
1171              ComputerArbeitscomputer
1172              Arbeitscomputerarbeits
1173              Computerarbeits-computer
1174              Arbeitsnehmer
1175
1176       This  solution  is  still not ideal, however, and will be replaced by a
1177       pattern-based compound-checking algorithm which is  closely  integrated
1178       with input buffer tokenization. Patterns describing compounds come as a
1179       separate input resource that can refer to high-level properties of con‐
1180       stituent parts (e.g. the number of syllables, affix flags, and contain‐
1181       ment of hyphens). The patterns are matched against potential  segmenta‐
1182       tions of compounds to assess wellformedness.
1183
1184

Unicode character encoding

       Both Ispell and MySpell use 8-bit character encodings, which is a major
       deficiency when it comes to scalability. Although a language like
       Hungarian has a standard 8-bit character set (ISO 8859-2), this set
       fails to allow a full implementation of Hungarian orthographic
       conventions. For instance, the '--' symbol (n-dash) is missing from
       this character set, despite the fact that it is not only the official
       symbol to delimit parenthetic clauses in the language, but can also
       occur in compound words as a special 'big' hyphen.
1194
       MySpell has some 8-bit encoding tables, but there are also languages
       without a standard 8-bit encoding. For example, many African languages
       have non-Latin or extended Latin characters.
1198
       Similarly, using the original spelling of certain foreign names like
       Ångström or Molière is encouraged by the Hungarian spelling norm. Since
       the characters 'Å' and 'è' are not part of ISO 8859-2, when they
       combine with inflections containing characters found only in ISO 8859-2
       (like elative -ből, ablative -től or delative -ről with double acute),
       the result is words (like Ångströmről or Molière-től) that cannot be
       encoded using any single 8-bit encoding scheme.
1206
       The problems raised by 8-bit encodings have long been recognized by
       proponents of Unicode. It is clear that trading efficiency for
       encoding-independence has its advantages when it comes to a truly
       multilingual application. Hunspell implements memory- and
       time-efficient Unicode handling. In non-UTF-8 character encodings
       Hunspell works with the original 8-bit strings. In UTF-8 encoding,
       affixes and words are stored in UTF-8 and are handled mostly in UTF-8
       during analysis; they are converted to UTF-16 for condition checking
       and suggestion. Unicode text analysis and spell checking have a minimal
       (0-20%) time overhead and a minimal or reasonable memory overhead,
       depending on the language (its UTF-8 encoding and affixation).
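
       The encoding problem described above is easy to reproduce. The short
       Python sketch below, which uses only the standard codecs, checks the
       two example words against ISO 8859-1, ISO 8859-2 and UTF-8; the helper
       name encodable is an assumption made for the illustration.

```python
def encodable(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for word in ("Ångströmről", "Molière-től"):
    report = {
        enc: encodable(word, enc)
        for enc in ("iso8859_1", "iso8859_2", "utf-8")
    }
    print(word, report)
```

       Each word mixes a letter that is missing from ISO 8859-2 with a
       double-acute letter that is missing from ISO 8859-1, so only the UTF-8
       check succeeds.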
1218
1219

Conversion of aspell dictionaries

       Aspell dictionaries can be easily converted to Hunspell format.
       Conversion steps:

       Dictionary (xx.cwl -> xx.wl):

              preunzip xx.cwl
              wc -l < xx.wl > xx.dic
              cat xx.wl >> xx.dic

       Affix file:

       If the affix file exists, copy it:

              cp xx_affix.dat xx.aff

       If not, create it with the suitable character encoding (see xx.dat):

              echo "SET ISO8859-x" > xx.aff

       or

              echo "SET UTF-8" > xx.aff

       It is useful to add a TRY option listing the characters of the
       dictionary in frequency order, to set edit distance suggestions:

              echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
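
       The same steps, including a TRY line computed from the word list
       instead of a fixed keyboard string, can be sketched in Python. The
       function names are hypothetical, the word list is assumed to be already
       unpacked, and the sketch builds strings only, so writing them to xx.dic
       and xx.aff in the right encoding is left to the caller.

```python
from collections import Counter

def make_dic(words):
    """Build .dic content: the word count on the first line, then the words."""
    entries = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(entries))] + entries) + "\n"

def try_chars(words):
    """Characters of the word list, most frequent first (flags ignored)."""
    counts = Counter()
    for w in words:
        counts.update(w.strip().split("/")[0])   # count letters, not flags
    return "".join(ch for ch, _ in counts.most_common())

def make_aff(words, encoding="UTF-8"):
    """Build a minimal .aff content with SET and TRY lines."""
    return f"SET {encoding}\nTRY {try_chars(words)}\n"

WORDS = ["aab", "ab"]
print(make_dic(WORDS), end="")
print(make_aff(WORDS), end="")
```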
1242
1243