hunspell(4)

hunspell(4)                Kernel Interfaces Manual                hunspell(4)

NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define the language that it is spell
       checking. The first file is a dictionary containing words for the
       language, and the second is an "affix" file that defines the meaning of
       special flags in the dictionary.

       A dictionary file (*.dic) contains a list of words, one per line. The
       first line of a dictionary (except a personal dictionary) contains the
       approximate word count (used to size the hash table). Each word may
       optionally be followed by a slash ("/") and one or more flags, which
       represent affixes or special attributes. Dictionary words can also
       contain slashes, written with the "\/" syntax. The default flag format
       is a single (usually alphabetic) character. After the dictionary word
       there may also be optional fields separated by tabs or spaces (spaces
       only work as morphological field separators if they are followed by
       morphological field IDs, see also Optional data fields).

       Personal dictionaries are simple word lists. An asterisk in the first
       character position marks a forbidden word. A second word separated by
       a slash sets the affixation.

              foo
              Foo/Simpson
              *bar

       In this example, "foo" and "Foo" are personal words, Foo will also be
       recognized with the affixes of Simpson (Foo's etc.), and bar is a
       forbidden word.

       An affix file (*.aff) may contain many optional attributes. For
       example, SET sets the character encoding of the affix and dictionary
       files. TRY sets the characters tried when generating suggestions. REP
       sets a replacement table for multi-character corrections in suggestion
       mode. PFX and SFX define prefix and suffix classes named with affix
       flags.

       The following affix file example declares UTF-8 character encoding.
       `TRY' suggestions differ from the bad word by an English letter or an
       apostrophe. With these REP definitions, Hunspell can suggest the right
       word form when the misspelled word contains f instead of ph and vice
       versa.

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

              REP 2
              REP f ph
              REP ph f

              PFX A Y 1
              PFX A 0 re .

              SFX B Y 2
              SFX B 0 ed [^y]
              SFX B y ied y

       There are two affix classes in the affix file. Class A defines a `re-'
       prefix. Class B defines two `-ed' suffixes. The first suffix can be
       added to a word if the last character of the word isn't `y'. The
       second suffix can be added to words ending in `y'; it strips the `y'
       and appends `ied'. (See OPTIONS FOR AFFIX CREATION.) The following
       dictionary file uses these affix classes.

              3
              hello
              try/B
              work/AB

       All accepted words with this dictionary: "hello", "try", "tried",
       "work", "worked", "rework", "reworked".

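       The expansion of this small dictionary can be sketched in Python. This
       is an illustration only, not Hunspell's implementation: the rule
       tables are transcribed by hand from the example above, the names
       PREFIXES, SUFFIXES and expand are hypothetical, and conditions are
       handed to Python's re module, which only approximates the condition
       syntax of affix files.

```python
import re

# Rules transcribed by hand from the example affix file:
# flag -> list of (strip, add, condition); "0" in the file means "nothing".
PREFIXES = {"A": [("", "re", ".")]}
SUFFIXES = {"B": [("", "ed", "[^y]"), ("y", "ied", "y")]}


def expand(word, flags):
    """Return every form licensed by the flags of one dictionary entry."""
    stems = {word}
    for flag in flags:
        for strip, add, cond in SUFFIXES.get(flag, []):
            # A suffix condition is tested against the end of the word.
            if re.search(cond + "$", word) and word.endswith(strip):
                stems.add(word[: len(word) - len(strip)] + add)
    forms = set(stems)
    for flag in flags:
        for strip, add, cond in PREFIXES.get(flag, []):
            # Cross product is Y in the example, so the prefix may also
            # attach to the suffixed forms.
            for stem in stems:
                if re.match(cond, stem) and stem.startswith(strip):
                    forms.add(add + stem[len(strip):])
    return forms


DICTIONARY = [("hello", ""), ("try", "B"), ("work", "AB")]
accepted = sorted(f for word, flags in DICTIONARY for f in expand(word, flags))
print(accepted)
```

       The printed list contains the same seven words that are listed above
       as accepted.
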

GENERAL OPTIONS

       The Hunspell source distribution contains more than 80 examples of
       option usage.

       SET encoding
              Set the character encoding of words and morphemes in affix and
              dictionary files. Possible values: UTF-8, ISO8859-1 -
              ISO8859-10, ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U,
              microsoft-cp1251, ISCII-DEVANAGARI.

       FLAG value
              Set the flag type. The default type is the extended ASCII
              (8-bit) character. The `UTF-8' parameter sets UTF-8 encoded
              Unicode character flags. The `long' value sets the
              two-character extended ASCII flag type, and `num' sets the
              decimal number flag type. Decimal flags are numbered from 1 to
              65000 and are separated by commas in flag fields. BUG: the
              UTF-8 flag type doesn't work on the ARM platform.

       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping) for
              agglutinative languages with right-to-left writing systems.

       LANG langcode
              Set the language code. Language-specific code in Hunspell is
              enabled by the LANG code. At present there is specific code
              for az_AZ, hu_HU and tr_TR (see the source code).

       IGNORE characters
              Ignore the given characters in dictionary words, affixes and
              input words. Useful for optional characters, such as Arabic
              diacritical marks (Harakat).

       AF number_of_flag_vector_aliases

       AF flag_vector
              Hunspell can substitute affix flag sets with ordinal numbers in
              affix rules (alias compression, see the makealias tool). The
              first example, with alias compression:

              3
              hello
              try/1
              work/2

       AF definitions in the affix file:

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
              AF 2
              AF A
              AF AB

       This is equivalent to the following dic file:

              3
              hello
              try/A
              work/AB

       See also the tests/alias* examples in the source distribution.

       Note: If the affix file contains the FLAG parameter, define it before
       the AF definitions.

       Note II: Use the makealias utility in the Hunspell distribution to
       compress aff and dic files.

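       The alias expansion shown above can be sketched as follows. This is a
       minimal illustration under stated assumptions, not the code Hunspell
       uses: read_aliases and expand_entry are hypothetical helper names,
       and only purely numeric flag fields are treated as aliases.

```python
def read_aliases(aff_lines):
    """Collect the AF flag vectors; alias numbers start at 1."""
    values = []
    for line in aff_lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] == "AF":
            values.append(parts[1])
    # The first AF line is the header carrying the count, not an alias.
    return values[1:]


def expand_entry(entry, aliases):
    """Replace a numeric alias after the slash with its flag vector."""
    word, slash, flags = entry.partition("/")
    if slash and flags.isdigit():
        flags = aliases[int(flags) - 1]
    return word + slash + flags


aff = ["SET UTF-8", "AF 2", "AF A", "AF AB"]
dic = ["hello", "try/1", "work/2"]
expanded = [expand_entry(entry, read_aliases(aff)) for entry in dic]
print(expanded)
```

       The result corresponds to the uncompressed dic file shown above.
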
       AM number_of_morphological_aliases

       AM morphological_fields
              Hunspell can also substitute morphological data with ordinal
              numbers in affix rules (alias compression). See the
              tests/alias* examples.


OPTIONS FOR SUGGESTION

       Suggestion parameters can optimize the default n-gram, character swap
       and deletion suggestions of Hunspell. REP is recommended for fixing
       typical and especially bad language-specific misspellings, because the
       REP suggestions have the highest priority in the suggestion list.
       PHONE is for languages whose orthography is not pronunciation based.

       KEY characters_separated_by_vertical_line_optionally
              Hunspell searches for and suggests words that differ in one
              character, replaced by a neighboring character of the KEY
              string. Characters that are not neighbors are separated by
              vertical line characters in the KEY string. Suggested KEY
              parameters for QWERTY and Dvorak keyboard layouts:

              KEY qwertyuiop|asdfghjkl|zxcvbnm
              KEY pyfgcrl|aeouidhtns|qjkxbmwvz

       Using the first (QWERTY) layout, Hunspell suggests "nude" and "node"
       for "*nide". A character may have more than two neighbors, too:

              KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm

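       The neighbor search can be pictured with a short Python sketch. It is
       an assumption-laden illustration rather than Hunspell's algorithm:
       key_candidates is a hypothetical helper, and the set WORDS stands in
       for a real dictionary.

```python
def key_candidates(word, key):
    """Replace each character by its left and right neighbors in KEY."""
    candidates = []
    for i, ch in enumerate(word):
        for pos, key_ch in enumerate(key):
            if key_ch != ch:
                continue
            for neighbor in (pos - 1, pos + 1):
                # The vertical line separates characters that are not neighbors.
                if 0 <= neighbor < len(key) and key[neighbor] != "|":
                    cand = word[:i] + key[neighbor] + word[i + 1:]
                    if cand not in candidates:
                        candidates.append(cand)
    return candidates


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
WORDS = {"nude", "node", "nine"}  # hypothetical stand-in for a dictionary
suggestions = [c for c in key_candidates("nide", QWERTY) if c in WORDS]
print(suggestions)
```

       Because a character can occur several times in the KEY string, the
       same loop also covers layouts where a key has more than two
       neighbors.
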
       TRY characters
              Hunspell can suggest the right word forms when they differ from
              the bad input word by one TRY character. The parameter of TRY
              is case sensitive.

       NOSUGGEST flag
              Words marked with the NOSUGGEST flag are not suggested.
              Proposed flag for vulgar and obscene words (see also
              SUBSTANDARD).

       MAXNGRAMSUGS num
              Set the number of n-gram suggestions. The value 0 switches off
              n-gram suggestions.

       NOSPLITSUGS
              Disable split-word suggestions.

       SUGSWITHDOTS
              Add dot(s) to suggestions if the input word ends in dot(s).
              (Not for OpenOffice.org dictionaries, because OpenOffice.org
              has an automatic dot expansion mechanism.)

       REP number_of_replacement_definitions

       REP what replacement
              Language-dependent phonetic information can be defined in the
              affix file (.aff) by a replacement table. The first REP line is
              the header of this table, and one or more REP data lines follow
              it. With this table, Hunspell can suggest the right forms for
              typical spelling mistakes when the incorrect form differs by
              more than one letter from the right form. For example, a
              possible English replacement table to handle misspelled
              consonants:

              REP 8
              REP f ph
              REP ph f
              REP f gh
              REP gh f
              REP j dg
              REP dg j
              REP k ch
              REP ch k

       Note I: It is also very useful to define replacements for the most
       typical one-character mistakes: with REP you can give higher priority
       to a subset of the TRY suggestions (the suggestion list begins with
       the REP suggestions).

       Note II: To suggest separated words with REP, specify the space with
       an underline:

              REP 1
              REP alot a_lot

       Note III: The replacement table can also be used for stricter compound
       word checking (forbidding generated compound words if they are also
       simple words with a typical mistake, see CHECKCOMPOUNDREP).

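       How a replacement table produces candidates can be sketched in
       Python. This is only an illustration of the idea: rep_candidates is a
       hypothetical helper, and a real checker would keep only the
       candidates that are found in the dictionary.

```python
def rep_candidates(word, table):
    """Apply each replacement at every position where its pattern occurs."""
    candidates = []
    for what, replacement in table:
        replacement = replacement.replace("_", " ")  # underline means space
        start = word.find(what)
        while start != -1:
            cand = word[:start] + replacement + word[start + len(what):]
            if cand not in candidates:
                candidates.append(cand)
            start = word.find(what, start + 1)
    return candidates


REP = [("f", "ph"), ("ph", "f"), ("alot", "a_lot")]
print(rep_candidates("fone", REP))
print(rep_candidates("alot", REP))
```

       Each candidate differs from the input by one table entry, which is
       how a misspelling such as "fone" can lead to "phone" even though the
       two forms differ by more than one letter.
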
       MAP number_of_map_definitions

       MAP string_of_related_chars_or_parenthesized_character_sequences
              Language-dependent information on characters and character
              sequences that should be considered related (i.e. nearer than
              other chars not in the set) can be defined in the affix file
              (.aff) by a map table. With this table, Hunspell can suggest
              the right forms for words that use the wrong letter or letter
              group from a related set more than once in a word (see REP).

              For example, a possible mapping could be for the German
              umlauted ü versus the regular u; the word Frühstück really
              should be written with umlauted u's and not regular ones:

              MAP 1
              MAP uü

       Use parenthesized groups for character sequences (e.g. for composed
       Unicode characters):

              MAP 3
              MAP ß(ss)  (character sequence)
              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)

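       A rough Python sketch of how a MAP entry could be read and used
       follows. The helpers parse_map and map_candidates are hypothetical
       and only show the idea of trying every combination of related
       members; Hunspell's own search may differ.

```python
from itertools import product


def parse_map(entry):
    """Split a MAP entry into members; parentheses group a sequence."""
    members, i = [], 0
    while i < len(entry):
        if entry[i] == "(":
            end = entry.index(")", i)
            members.append(entry[i + 1:end])
            i = end + 1
        else:
            members.append(entry[i])
            i += 1
    return members


def map_candidates(word, members):
    """Try every combination of related members at each matching position."""
    slots, i = [], 0
    while i < len(word):
        for member in members:
            if word.startswith(member, i):
                slots.append(members)
                i += len(member)
                break
        else:
            slots.append([word[i]])
            i += 1
    return {"".join(combo) for combo in product(*slots)} - {word}


print(sorted(map_candidates("Fruhstuck", parse_map("uü"))))
```

       Because both u positions are varied together, the candidate set
       contains the form with two umlauts, which a single-character
       correction could not reach.
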
       PHONE number_of_phone_definitions

       PHONE what replacement
              PHONE uses a table-driven phonetic transcription algorithm
              borrowed from Aspell. It is useful for languages whose
              orthography is not pronunciation based. You can add a full
              alphabet conversion and other rules for the conversion of
              special letter sequences. For detailed documentation see
              http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
              UTF-8 characters do not yet work in bracket expressions. Dash
              expressions operate on signed bytes, not yet on UTF-8
              characters.


OPTIONS FOR COMPOUNDING

       BREAK number_of_break_definitions

       BREAK character_or_character_sequence
              Define new break points for breaking words and checking word
              parts separately. Use ^ and $ to delete characters at the start
              and end of the word. Rationale: useful for compounding with
              joining characters or strings (for example, hyphen in English
              and German, or hyphen and n-dash in Hungarian). Dashes are
              often bad break points for tokenization, because compounds
              with dashes may contain invalid parts, too. With BREAK,
              Hunspell can check both sides of these compounds, breaking the
              words at dashes and n-dashes:

              BREAK 2
              BREAK -
              BREAK --    # n-dash

       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds. Note: The default word break of Hunspell is
       equivalent to the following BREAK definition:

              BREAK 3
              BREAK -
              BREAK ^-
              BREAK -$

       Hunspell doesn't accept the "-word" and "word-" forms with this BREAK
       definition:

              BREAK 1
              BREAK -

       Note II: COMPOUNDRULE is better (or will be better) for handling
       dashes and other compound joining characters or character strings.
       Use BREAK if you want to check words with dashes or other joining
       characters and there is no time or possibility to describe precise
       compound rules with COMPOUNDRULE (COMPOUNDRULE so far handles only
       the last suffixation of the compound word).

       Note III: For command line spell checking of words with extra
       characters, set the WORDCHARS parameter: WORDCHARS --- (see the
       tests/break.* example).

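       Recursive breaking can be sketched in Python as below. The sketch
       models only plain break strings such as "BREAK -"; the ^ and $ forms
       are not handled, is_accepted is a hypothetical helper, and the ASCII
       string "--" stands in for the n-dash as in the example above.

```python
def is_accepted(word, dictionary, breaks=("-",)):
    """Accept a known word, or a word whose parts around a break are accepted."""
    if word in dictionary:
        return True
    for brk in breaks:
        pos = word.find(brk)
        while pos != -1:
            left, right = word[:pos], word[pos + len(brk):]
            # Both sides must be non-empty and must be accepted themselves.
            if left and right:
                if is_accepted(left, dictionary, breaks) and is_accepted(
                    right, dictionary, breaks
                ):
                    return True
            pos = word.find(brk, pos + 1)
    return False


WORDS = {"foo", "bar"}  # hypothetical dictionary
print(is_accepted("foo-bar", WORDS))
print(is_accepted("foo-foo--bar-bar", WORDS, breaks=("-", "--")))
print(is_accepted("-foo", WORDS))
```

       With only the plain break string, the "-word" and "word-" forms are
       rejected because one side of the break is empty, which matches the
       behavior described for "BREAK 1 / BREAK -".
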
       COMPOUNDRULE number_of_compound_definitions

       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax. The
              first COMPOUNDRULE is a header with the number of the following
              COMPOUNDRULE definitions. Compound patterns consist of compound
              flags, parentheses, and the star and question mark
              metacharacters. A flag followed by a `*' matches a sequence of
              0 or more words marked with this compound flag. A flag followed
              by a `?' matches a sequence of 0 or 1 words marked with this
              compound flag. See the tests/compound*.* examples.

              Note: The en_US dictionary of OpenOffice.org uses COMPOUNDRULE
              for ordinal number recognition (1st, 2nd, 11th, 12th, 22nd,
              112th, 1000122nd etc.).

              Note II: With the long and numerical flag types, use only
              parenthesized flags: (1500)*(2000)?

              Note III: COMPOUNDRULE flags are not yet compatible with the
              COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these
              flags on different words).

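       Since the pattern syntax is regex-like, a rule with single-character
       flags can be checked with Python's re module. The following is a
       sketch under assumptions: the words, flags and helper names are
       hypothetical, parenthesized long or numerical flags are not
       supported, and the word is assumed to be already split into parts.

```python
import re
from itertools import product


def rule_to_regex(rule):
    """Translate a COMPOUNDRULE pattern that uses single-character flags."""
    # '*' and '?' keep their meaning; every other character is a literal flag.
    return re.compile("".join(c if c in "*?" else re.escape(c) for c in rule))


def parts_match(parts, flags_of, rule):
    """Check one segmentation: some choice of one flag per part must match."""
    regex = rule_to_regex(rule)
    choices = [flags_of.get(part, "") for part in parts]
    return any(regex.fullmatch("".join(combo)) for combo in product(*choices))


FLAGS = {"foo": "A", "bar": "B"}  # hypothetical words and compound flags
print(parts_match(["foo", "foo", "bar"], FLAGS, "A*B"))
print(parts_match(["bar", "foo"], FLAGS, "A*B"))
```

       The first call prints True because the flag sequence AAB matches
       A*B; the second prints False because BA does not.
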
       COMPOUNDMIN num
              Minimum length of words in compound words. The default value is
              3 letters.

       COMPOUNDFLAG flag
              Words marked with COMPOUNDFLAG may be in compound words (except
              when the word is shorter than COMPOUNDMIN). Affixes with
              COMPOUNDFLAG also permit compounding of affixed words.

       COMPOUNDBEGIN flag
              Words marked with COMPOUNDBEGIN (or with a marked affix) may be
              first elements in compound words.

       COMPOUNDLAST flag
              Words marked with COMPOUNDLAST (or with a marked affix) may be
              last elements in compound words.

       COMPOUNDMIDDLE flag
              Words marked with COMPOUNDMIDDLE (or with a marked affix) may
              be middle elements in compound words.

       ONLYINCOMPOUND flag
              Suffixes marked with the ONLYINCOMPOUND flag may occur only
              inside compounds (Fuge-elements in German, fogemorphemes in
              Swedish). The ONLYINCOMPOUND flag also works with words (see
              tests/onlyincompound.*).

       COMPOUNDPERMITFLAG flag
              By default, prefixes are allowed at the beginning of compounds
              and suffixes are allowed at the end of compounds. Affixes with
              COMPOUNDPERMITFLAG may also occur inside compounds.

       COMPOUNDFORBIDFLAG flag
              Suffixes with this flag forbid compounding of the affixed word.

       COMPOUNDROOT flag
              The COMPOUNDROOT flag marks the compounds in the dictionary
              (currently used only in the Hungarian language-specific code).

       COMPOUNDWORDMAX number
              Set the maximum word count in a compound word. (Default is
              unlimited.)

       CHECKCOMPOUNDDUP
              Forbid word duplication in compounds (e.g. foofoo).

       CHECKCOMPOUNDREP
              Forbid compounding if the (usually bad) compound word may be a
              non-compound word with a REP mistake. Useful for languages with
              `compound friendly' orthography.

       CHECKCOMPOUNDCASE
              Forbid upper case characters at word boundaries in compounds.

       CHECKCOMPOUNDTRIPLE
              Forbid compounding if the compound word contains a triple
              repeated letter (e.g. foo|ox or xo|oof). Bug: multi-byte
              character support is missing in UTF-8 encoding (works only for
              7-bit ASCII characters).

       SIMPLIFIEDTRIPLE
              Allow simplified 2-letter forms of the compounds forbidden by
              CHECKCOMPOUNDTRIPLE. Useful for Swedish and Norwegian (and for
              the old German orthography: Schiff|fahrt -> Schiffahrt).

       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions

       CHECKCOMPOUNDPATTERN endchars[/flag] beginchars[/flag] [replacement]
              Forbid compounding if the first word in the compound ends with
              endchars, the next word begins with beginchars and (optionally)
              they have the requested flags. The optional replacement
              parameter allows a simplified compound form. Note: COMPOUNDMIN
              doesn't work correctly with compound word alternation, so
              COMPOUNDMIN may need to be set to a lower value.

       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian. The first
              parameter is the maximum number of syllables allowed in a
              compound when the compound contains more words than
              COMPOUNDWORDMAX. The second parameter is the list of vowels
              (for counting syllables).

       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.


OPTIONS FOR AFFIX CREATION

       PFX flag cross_product number

       PFX flag stripping prefix [condition [morphological_fields...]]

       SFX flag cross_product number

       SFX flag stripping suffix [condition [morphological_fields...]]
              An affix is either a prefix or a suffix attached to root words
              to make other words. Affix classes can be defined with an
              arbitrary number of affix rules. Affix classes are identified
              by affix flags. The first line of an affix class definition is
              the header. The fields of an affix class header:

              (0) Option name (PFX or SFX)

              (1) Flag (name of the affix class)

              (2) Cross product (permission to combine prefixes and suffixes).
              Possible values: Y (yes) or N (no)

              (3) Line count of the following rules.

              Fields of an affix rule:

              (0) Option name

              (1) Flag

              (2) Characters to strip from the beginning (in prefix rules) or
              the end (in suffix rules) of the word

              (3) Affix (optionally with flags of continuation classes,
              separated by a slash)

              (4) Condition.

              Zero stripping or a zero affix is indicated by zero. A zero
              condition is indicated by a dot. The condition is a simplified,
              regular expression-like pattern, which must be met before the
              affix can be applied. (A dot matches an arbitrary character.
              Characters in brackets match an arbitrary character from the
              character subset. A dash has no special meaning, but a
              circumflex (^) right after the opening bracket selects the
              complement of the character set.)

              (5) Optional morphological fields separated by spaces or tabs.

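       The field layout of an affix rule can be made concrete with a small
       Python parser. This is a sketch, not Hunspell's parser: AffixRule,
       parse_rule and applies are hypothetical names, header lines are not
       handled, and the condition is approximated with Python's re module
       after escaping dashes, which have no special meaning in conditions.

```python
import re
from dataclasses import dataclass


@dataclass
class AffixRule:
    kind: str          # field 0: "PFX" or "SFX"
    flag: str          # field 1
    strip: str         # field 2, empty when the file says 0
    affix: str         # field 3, empty when the file says 0
    continuation: str  # flags after the slash in field 3
    condition: str     # field 4, "." when omitted
    morph: list        # field 5 onward


def parse_rule(line):
    """Parse one affix rule line (not a class header)."""
    fields = line.split()
    kind, flag, strip, affix = fields[:4]
    condition = fields[4] if len(fields) > 4 else "."
    affix, _, continuation = affix.partition("/")
    return AffixRule(kind, flag,
                     "" if strip == "0" else strip,
                     "" if affix == "0" else affix,
                     continuation, condition, fields[5:])


def applies(rule, word):
    """Test the condition at the start (PFX) or the end (SFX) of the word."""
    cond = rule.condition.replace("-", r"\-")
    pattern = "^" + cond if rule.kind == "PFX" else cond + "$"
    return re.search(pattern, word) is not None


rule = parse_rule("SFX B y ied y")
print(rule.strip, rule.affix, applies(rule, "try"), applies(rule, "work"))
```

       A rule such as "SFX X 0 able/Y . ds:able" parses into an empty strip
       field, the affix "able", the continuation class "Y", the zero
       condition and one morphological field.
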

OTHER OPTIONS

       CIRCUMFIX flag
              Affixes marked with the CIRCUMFIX flag may be on a word when
              this word also has a prefix with the CIRCUMFIX flag, and vice
              versa.

       FORBIDDENWORD flag
              This flag marks a forbidden word form. Because affixed forms
              are also forbidden, a subset can be subtracted from the set of
              accepted affixed and compound words.

       FULLSTRIP
              With FULLSTRIP, affix rules can strip whole words, not only all
              but one of their characters.

              Note: conditions may be as long as the word without FULLSTRIP,
              too.

       KEEPCASE flag
              Forbid uppercased and capitalized forms of words marked with
              the KEEPCASE flag. Useful for special orthographies
              (measurements and currency often keep their case in uppercased
              texts) and writing systems (e.g. keeping the lower case of IPA
              characters).

              Note: With the CHECKSHARPS declaration, words with sharp s and
              the KEEPCASE flag may be capitalized and uppercased, but
              uppercased forms of these words may not contain sharp s, only
              SS. See the germancompounding example in the tests directory of
              the Hunspell distribution.

              Note: Using a lot of zero affixes may have a big cost, because
              every zero affix is checked during affix analysis before the
              other affixes.

       ICONV number_of_ICONV_definitions

       ICONV pattern pattern2
              Define the input conversion table.

       OCONV number_of_OCONV_definitions

       OCONV pattern pattern2
              Define the output conversion table.

       LEMMA_PRESENT flag
              Not used in Hunspell 1.2. Use the "st:" field instead of
              LEMMA_PRESENT.

       NEEDAFFIX flag
              This flag marks virtual stems in the dictionary. Only affixed
              forms of these words are accepted by Hunspell, unless the
              dictionary word has a homonym or a zero affix. NEEDAFFIX also
              works with prefixes and prefix + suffix combinations (see
              tests/pseudoroot5.*).

       PSEUDOROOT flag
              Deprecated. (Former name of the NEEDAFFIX option.)

       SUBSTANDARD flag
              The SUBSTANDARD flag marks affix rules and dictionary words
              (allomorphs) not used in morphological generation (and, in
              future versions, in suggestion). See also NOSUGGEST.

       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters. For example, dot,
              dash, n-dash, numbers and the percent sign are word characters
              in Hungarian.

       CHECKSHARPS
              The SS letter pair in uppercased (German) words may be upper
              case sharp s (ß). Hunspell can handle this special casing with
              the CHECKSHARPS declaration (see also the KEEPCASE flag and the
              tests/germancompounding example) in both spelling and
              suggestion.


Morphological analysis

       Hunspell's dictionary items and affix rules may have optional
       morphological description fields, separated by spaces or tabs and
       starting with 3-character field IDs (two letters and a colon):

               word/flags po:noun is:nom

       Example: We define a simple resource with morphological information, a
       derivative suffix (ds:) and a part of speech category (po:):

       Affix file:

               SFX X Y 1
               SFX X 0 able . ds:able

       Dictionary file:

               drink/X po:verb

       Test file:

               drink
               drinkable

       Test:

               $ analyze test.aff test.dic test.txt
               > drink
               analyze(drink) = po:verb
               stem(drink) = po:verb
               > drinkable
               analyze(drinkable) = po:verb ds:able
               stem(drinkable) = drinkable

       As the example shows, the analyzer concatenates the morphological
       fields in item-and-arrangement style.

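       The field format and the concatenation can be illustrated with a few
       lines of Python. This is a sketch with hypothetical helper names
       (parse_fields, parse_entry); it ignores escaped slashes and does not
       reproduce the analyze tool itself.

```python
def parse_fields(text):
    """Return (id, value) pairs for every 'xx:value' token in the text."""
    fields = []
    for token in text.split():
        # A field ID is two letters followed by a colon.
        if len(token) > 3 and token[2] == ":" and token[:2].isalpha():
            fields.append((token[:2], token[3:]))
    return fields


def parse_entry(line):
    """Split a dictionary line into word, flags and morphological fields."""
    tokens = line.split()
    word, _, flags = tokens[0].partition("/")
    return word, flags, parse_fields(" ".join(tokens[1:]))


word, flags, entry_fields = parse_entry("drink/X po:verb")
rule_fields = parse_fields("SFX X 0 able . ds:able")
# Item-and-arrangement style: fields of the stem, then fields of the affix.
analysis = " ".join(f"{key}:{value}" for key, value in entry_fields + rule_fields)
print(analysis)
```

       The printed string has the same fields, in the same order, as the
       analysis of "drinkable" in the test above.
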

Optional data fields

       Default morphological and other IDs (used in suggestion, stemming and
       morphological generation):

       ph:    Alternative transliteration for better suggestions. Useful for
              words with foreign pronunciation. (Dictionary-based phonetic
              suggestion.) For example:

              Marseille ph:maarsayl

       st:    Stem. Optional: the default stem is the dictionary item in
              morphological analysis. The stem field is useful for virtual
              stems (dictionary words with the NEEDAFFIX flag) and for
              morphological exceptions, instead of new morphological rules
              used only once.

              feet  st:foot  is:plural
              mice  st:mouse is:plural
              teeth st:tooth is:plural

       Word forms with multiple stems need multiple dictionary items:

              lay po:verb st:lie is:past_2
              lay po:verb is:present
              lay po:noun

       al:    Allomorph(s). A dictionary item is the stem of its allomorphs.
              Morphological generation needs stem, allomorph and affix
              fields.

              sing al:sang al:sung
              sang st:sing
              sung st:sing

       po:    Part of speech category.

       ds:    Derivational suffix(es). Stemming doesn't remove derivational
              suffixes. Morphological generation depends on the order of the
              suffix fields.

              In affix rules:

              SFX Y Y 1
              SFX Y 0 ly . ds:ly_adj

       In the dictionary:

              ably st:able ds:ly_adj
              able al:ably

       is:    Inflectional suffix(es). All inflectional suffixes are removed
              by stemming. Morphological generation depends on the order of
              the suffix fields.

              feet st:foot is:plural

       ts:    Terminal suffix(es). Terminal suffix fields are inflectional
              suffix fields "removed" by additional (not terminal) suffixes.

              Useful for zero morphemes and affixes removed by splitting
              rules.

              work/D ts:present

              SFX D Y 2
              SFX D   0 ed . is:past_1
              SFX D   0 ed . is:past_2

       A typical example of a terminal suffix is the zero morpheme of the
       nominative case.

       sp:    Surface prefix. Temporary solution for adding prefixes to the
              stems and generated word forms. See the tests/morph.* example.

       pa:    Parts of the compound words. Output fields of morphological
              analysis for stemming.

       dp:    Planned: derivational prefix.

       ip:    Planned: inflectional prefix.

       tp:    Planned: terminal prefix.


Twofold suffix stripping

       Ispell's original algorithm strips only one suffix. Hunspell can strip
       one more (or an additional prefix in COMPLEXPREFIXES mode).

       Twofold suffix stripping is a significant improvement in handling the
       immense number of suffixes that characterize agglutinative languages.

       A second `s' suffix (affix class Y) is the continuation class of the
       suffix `able' in the following example:

               SFX Y Y 1
               SFX Y 0 s .

               SFX X Y 1
               SFX X 0 able/Y .

       Dictionary file:

               drink/X

       Test file:

               drink
               drinkable
               drinkables

       Test:

               $ hunspell -m -d test <test.txt
               drink st:drink
               drinkable st:drink fl:X
               drinkables st:drink fl:X fl:Y

       In theory, twofold suffix stripping needs only the square root of the
       number of suffix rules required by a one-level implementation such as
       Ispell's original algorithm. In our practice, we were able to
       elaborate the Hungarian inflectional morphology with twofold suffix
       stripping.

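       The two stripping steps of the example can be sketched in Python.
       The sketch is illustrative only: the rule table is transcribed by
       hand from the affix file above, the helper names are hypothetical,
       and conditions, stripping characters and prefixes are left out.

```python
# Suffix classes from the example: flag -> list of (suffix, continuation flags).
SUFFIXES = {"Y": [("s", "")], "X": [("able", "Y")]}
DICTIONARY = {"drink": "X"}  # word -> flags


def strip_once(word):
    """Yield (remainder, flag, continuation) for every suffix ending the word."""
    for flag, rules in SUFFIXES.items():
        for suffix, continuation in rules:
            if word.endswith(suffix) and len(word) > len(suffix):
                yield word[: -len(suffix)], flag, continuation


def analyze(word):
    """Return (stem, flags used) with at most two suffix-stripping steps."""
    if word in DICTIONARY:
        return word, []
    for rest, outer, _ in strip_once(word):
        if outer in DICTIONARY.get(rest, ""):
            return rest, [outer]
        for stem, inner, continuation in strip_once(rest):
            # The outer suffix must be a continuation class of the inner one.
            if inner in DICTIONARY.get(stem, "") and outer in continuation:
                return stem, [inner, outer]
    return None


for form in ("drink", "drinkable", "drinkables", "drinks"):
    print(form, analyze(form))
```

       The flags returned for "drinkable" and "drinkables" correspond to the
       fl: fields in the test output above, while "drinks" is rejected
       because the dictionary entry does not carry class Y directly.
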

Extended affix classes

       Hunspell can handle more than 65000 affix classes. There are three new
       syntaxes for giving flags in affix and dictionary files.

       The FLAG long command sets 2-character flags:

                FLAG long
                SFX Y1 Y 1
                SFX Y1 0 s 1

       Dictionary record with the Y1, Z3, F? flags:

                foo/Y1Z3F?

       The FLAG num command sets numerical flags separated by commas:

                FLAG num
                SFX 65000 Y 1
                SFX 65000 0 s 1

       Dictionary example:

                foo/65000,12,2756

       The third syntax is Unicode character flags (FLAG UTF-8).

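       How the flag field of a dictionary entry is split under each flag
       type can be sketched as follows. The function parse_flags and its
       flag_type argument are hypothetical names chosen for this
       illustration; the type names mirror the values of the FLAG option.

```python
def parse_flags(field, flag_type="char"):
    """Split the flag field of a dictionary entry according to the FLAG type."""
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        return [int(number) for number in field.split(",")]
    # Default and UTF-8 types: one character per flag.
    return list(field)


print(parse_flags("Y1Z3F?", "long"))
print(parse_flags("65000,12,2756", "num"))
print(parse_flags("AB"))
```

       The three calls reproduce the flag lists of the dictionary records
       shown above and of the default single-character format.
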

Homonyms

       Hunspell's dictionary can contain repeated entries that are homonyms:

               work/A    po:verb
               work/B    po:noun

       An affix file:

               SFX A Y 1
               SFX A 0 s . is:sg3

               SFX B Y 1
               SFX B 0 s . is:plur

       Test file:

               works

       Test:

               $ hunspell -d test -m <testwords
               work st:work po:verb is:sg3
               work st:work po:noun is:plur

       This feature also gives a way to forbid illegal prefix/suffix
       combinations.


Prefix--suffix dependencies

       An interesting side effect of multi-step stripping is that the
       appropriate treatment of circumfixes now comes for free.  For instance,
       in Hungarian, superlatives are formed by simultaneous prefixation of
       leg- and suffixation of -bb to the adjective base.  A problem with the
       one-level architecture is that there is no way to make the lexical
       licensing of particular prefixes and suffixes interdependent, and
       therefore incorrect forms are recognized as valid, e.g. *legvén = leg +
       vén `old'.  Until the introduction of clusters, a special treatment of
       the superlative had to be hardwired in the earlier Hunspell code.  This
       may have been legitimate for a single case, but in fact prefix--suffix
       dependencies are ubiquitous in category-changing derivational patterns
       (cf. English payable, non-payable but *non-pay, or drinkable,
       undrinkable but *undrink).  Put simply, the prefix un- is legitimate
       here only if the base drink is suffixed with -able.  If both of these
       patterns are handled by on-line affix rules and affix rules are checked
       against the base only, there is no way to express this dependency and
       the system will necessarily over- or undergenerate.
810
       In the next example, suffix class R has a prefix `continuation' class
       (class P).
813
814
815              PFX P Y 1
816              PFX P   0 un . [prefix_un]+
817
818              SFX S Y 1
819              SFX S   0 s . +PL
820
821              SFX Q Y 1
822              SFX Q   0 s . +3SGV
823
824              SFX R Y 1
825              SFX R   0 able/PS . +DER_V_ADJ_ABLE
826
827       Dictionary:
828
829
830              2
831              drink/RQ  [verb]
832              drink/S   [noun]
833
834       Morphological analysis:
835
836
837              > drink
838              drink[verb]
839              drink[noun]
840              > drinks
841              drink[verb]+3SGV
842              drink[noun]+PL
843              > drinkable
844              drink[verb]+DER_V_ADJ_ABLE
845              > drinkables
846              drink[verb]+DER_V_ADJ_ABLE+PL
847              > undrinkable
848              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
849              > undrinkables
850              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
851              > undrink
852              Unknown word.
853              > undrinks
854              Unknown word.
855
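       The effect of the continuation class can be sketched by generating all
       forms that the example licenses.  This is a simplified model, not
       Hunspell's stripping algorithm: it assumes that a prefix is allowed
       whenever its flag appears on the stem or on any applied suffix, and it
       stops after two suffix levels.

```python
PREFIXES = {"P": ("un", "[prefix_un]+")}
# flag -> (suffix, continuation flags, analysis tag)
SUFFIXES = {
    "S": ("s", "", "+PL"),
    "Q": ("s", "", "+3SGV"),
    "R": ("able", "PS", "+DER_V_ADJ_ABLE"),
}
DICTIONARY = [("drink", "RQ", "[verb]"), ("drink", "S", "[noun]")]


def build_table():
    """Map every licensed surface form to its list of analyses."""
    table = {}

    def emit(surface, analysis, flags):
        table.setdefault(surface, []).append(analysis)
        for flag in sorted(flags):
            if flag in PREFIXES:
                pfx, tag = PREFIXES[flag]
                table.setdefault(pfx + surface, []).append(tag + analysis)

    for stem, flags, tag in DICTIONARY:
        emit(stem, stem + tag, set(flags))
        for f1 in flags:
            if f1 not in SUFFIXES:
                continue
            s1, cont1, t1 = SUFFIXES[f1]
            seen1 = set(flags) | set(cont1)
            emit(stem + s1, stem + tag + t1, seen1)
            for f2 in cont1:          # second suffix from the continuation
                if f2 not in SUFFIXES:
                    continue
                s2, cont2, t2 = SUFFIXES[f2]
                emit(stem + s1 + s2, stem + tag + t1 + t2, seen1 | set(cont2))
    return table


def analyze(word):
    return build_table().get(word, [])


if __name__ == "__main__":
    for w in ["drinks", "undrinkable", "undrink"]:
        print(w, analyze(w) or "Unknown word.")
```

       Since class P is reachable only through the -able rule, un- never
       attaches to the bare stem or to drinks.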

Circumfix

       Conditional affixes implemented by a continuation class are not enough
       for circumfixes, because a circumfix is one affix in morphology.  We
       also need the CIRCUMFIX option for correct morphological analysis.
860
861
862              # circumfixes: ~ obligate prefix/suffix combinations
863              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
864              # nagy, nagyobb, legnagyobb, legeslegnagyobb
865              # (great, greater, greatest, most greatest)
866
867              CIRCUMFIX X
868
869              PFX A Y 1
870              PFX A 0 leg/X .
871
872              PFX B Y 1
873              PFX B 0 legesleg/X .
874
875              SFX C Y 3
876              SFX C 0 obb . +COMPARATIVE
877              SFX C 0 obb/AX . +SUPERLATIVE
878              SFX C 0 obb/BX . +SUPERSUPERLATIVE
879
880       Dictionary:
881
882
883              1
884              nagy/C    [MN]
885
886       Analysis:
887
888
889              > nagy
890              nagy[MN]
891              > nagyobb
892              nagy[MN]+COMPARATIVE
893              > legnagyobb
894              nagy[MN]+SUPERLATIVE
895              > legeslegnagyobb
896              nagy[MN]+SUPERSUPERLATIVE
897
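       The pairing that CIRCUMFIX enforces can be sketched as a check that a
       prefix and a suffix carrying the circumfix flag occur only together.
       The code is a toy model of the example above, not Hunspell's
       implementation, and its data layout is hypothetical.

```python
CIRCUMFIX = "X"
# flag -> (prefix, continuation flags)
PREFIXES = {"A": ("leg", "X"), "B": ("legesleg", "X")}
# flag -> list of (suffix, continuation flags, analysis tag)
SUFFIX_RULES = {
    "C": [
        ("obb", "", "+COMPARATIVE"),
        ("obb", "AX", "+SUPERLATIVE"),
        ("obb", "BX", "+SUPERSUPERLATIVE"),
    ]
}
DICTIONARY = {"nagy": ("C", "[MN]")}


def analyze(word):
    """Return the analyses of word; an empty list means unknown."""
    out = []
    for stem, (stem_flags, tag) in DICTIONARY.items():
        if word == stem:
            out.append(stem + tag)
        for sflag in stem_flags:
            for sfx, cont, stag in SUFFIX_RULES.get(sflag, []):
                sfx_is_circum = CIRCUMFIX in cont
                # A circumfix suffix is not valid without its prefix.
                if word == stem + sfx and not sfx_is_circum:
                    out.append(stem + tag + stag)
                # Prefixes must be licensed by the suffix continuation,
                # and both halves must agree on the circumfix flag.
                for pflag in cont:
                    if pflag not in PREFIXES:
                        continue
                    pfx, pcont = PREFIXES[pflag]
                    if (word == pfx + stem + sfx
                            and (CIRCUMFIX in pcont) == sfx_is_circum):
                        out.append(stem + tag + stag)
    return out


if __name__ == "__main__":
    for w in ["nagy", "nagyobb", "legnagyobb", "legeslegnagyobb", "legnagy"]:
        print(w, analyze(w))
```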

Compounds

       Allowing free compounding decreases the precision of recognition, not
       to mention stemming and morphological analysis.  Although MySpell
       introduced lexical switches to license compounding of bases, this
       proves not to be restrictive enough.  For example:
903
904
905              # affix file
906              COMPOUNDFLAG X
907
908              2
909              foo/X
910              bar/X
911
       With this resource, foobar and barfoo are also accepted words.
913
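       Why both orders are accepted can be seen from a sketch of free
       compounding.  The function below is illustrative only: it treats a
       word as a compound if it splits into two or more parts that all carry
       the compound flag, with a minimum part length of 3 (the documented
       COMPOUNDMIN default), and it ignores affixes entirely.

```python
def is_compound(word, parts, min_len=3):
    """True if word is a concatenation of at least two entries from parts."""

    def split(rest, count):
        if not rest:
            return count >= 2
        for i in range(min_len, len(rest) + 1):
            if rest[:i] in parts and split(rest[i:], count + 1):
                return True
        return False

    return split(word, 0)


if __name__ == "__main__":
    flagged = {"foo", "bar"}          # entries marked with COMPOUNDFLAG X
    for w in ["foobar", "barfoo", "foobarfoo", "foobaz"]:
        print(w, is_compound(w, flagged))
```

       Nothing in this scheme distinguishes the first part from the last,
       which is the lack of restrictiveness noted above.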
914       This has been improved upon with the introduction  of  direction-sensi‐
915       tive compounding, i.e., lexical features can specify separately whether
916       a base can occur as leftmost or  rightmost  constituent  in  compounds.
917       This,  however,  is still insufficient to handle the intricate patterns
918       of compounding, not to mention idiosyncratic  (and  language  specific)
919       norms of hyphenation.
920
       The MySpell algorithm currently allows any affixed form of words that
       are lexically marked as potential members of compounds.  Hunspell
       improved this, and its recursive compound checking rules make it
       possible to implement the intricate spelling conventions of Hungarian
       compounds.  For example, the noteworthy Hungarian `6-3' rule can be set
       using the COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and
       SYLLABLENUM options.  As a further example, in Hungarian, derivational
       suffixes often modify compounding properties.  Hunspell allows
       compounding flags on the affixes, and there are two special flags
       (COMPOUNDPERMITFLAG and COMPOUNDFORBIDFLAG) to permit or prohibit
       compounding of the derivations.  Suffixes with the COMPOUNDFORBIDFLAG
       flag forbid compounding of the affixed word.
933
934       We also need several Hunspell features for handling German compounding:
935
936
937              # German compounding
938
939              # set language to handle special casing of German sharp s
940
941              LANG de_DE
942
943              # compound flags
944
945              COMPOUNDBEGIN U
946              COMPOUNDMIDDLE V
947              COMPOUNDEND W
948
949              # Prefixes are allowed at the beginning of compounds,
950              # suffixes are allowed at the end of compounds by default:
951              # (prefix)?(root)+(affix)?
952              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
953              COMPOUNDPERMITFLAG P
954
              # for German linking morphemes (Fuge-elements)
956              # Hint: ONLYINCOMPOUND is not required everywhere, but the
957              # checking will be a little faster with it.
958
959              ONLYINCOMPOUND X
960
961              # forbid uppercase characters at compound word bounds
962              CHECKCOMPOUNDCASE
963
964              # for handling Fuge-elements with dashes (Arbeits-)
965              # dash will be a special word
966
967              COMPOUNDMIN 1
968              WORDCHARS -
969
              # compound settings and linking morpheme for `Arbeit'
971
972              SFX A Y 3
973              SFX A 0 s/UPX .
974              SFX A 0 s/VPDX .
975              SFX A 0 0/WXD .
976
977              SFX B Y 2
978              SFX B 0 0/UPX .
979              SFX B 0 0/VWXDP .
980
981              # a suffix for `Computer'
982
983              SFX C Y 1
984              SFX C 0 n/WD .
985
986              # for forbid exceptions (*Arbeitsnehmer)
987
988              FORBIDDENWORD Z
989
990              # dash prefix for compounds with dash (Arbeits-Computer)
991
992              PFX - Y 1
993              PFX - 0 -/P .
994
995              # decapitalizing prefix
996              # circumfix for positioning in compounds
997
998              PFX D Y 29
999              PFX D A a/PX A
1000              PFX D Ä ä/PX Ä
1001               .
1002               .
1003              PFX D Y y/PX Y
1004              PFX D Z z/PX Z
1005
1006       Example dictionary:
1007
1008
1009              4
1010              Arbeit/A-
1011              Computer/BC-
1012              -/W
1013              Arbeitsnehmer/Z
1014
       Accepted compound words with the previous resource:
1016
1017
1018              Computer
1019              Computern
1020              Arbeit
1021              Arbeits-
1022              Computerarbeit
1023              Computerarbeits-
1024              Arbeitscomputer
1025              Arbeitscomputern
1026              Computerarbeitscomputer
1027              Computerarbeitscomputern
1028              Arbeitscomputerarbeit
1029              Computerarbeits-Computer
1030              Computerarbeits-Computern
1031
       Compounds that are not accepted:
1033
1034
1035              computer
1036              arbeit
1037              Arbeits
1038              arbeits
1039              ComputerArbeit
1040              ComputerArbeits
1041              Arbeitcomputer
1042              ArbeitsComputer
1043              Computerarbeitcomputer
1044              ComputerArbeitcomputer
1045              ComputerArbeitscomputer
1046              Arbeitscomputerarbeits
1047              Computerarbeits-computer
1048              Arbeitsnehmer
1049
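       The role of the begin, middle and end flags in the German example can
       be sketched with a position-aware split.  This is a simplified model:
       the lexicon below is expanded by hand from the affix rules (an
       assumption, not program output), and dashes, FORBIDDENWORD and
       CHECKCOMPOUNDCASE are left out.

```python
# Surface forms and the compound positions they may occupy.
# Hand-expanded from the example rules; illustrative only.
LEXICON = {
    "Arbeits": {"begin"},
    "arbeits": {"middle"},
    "arbeit": {"end"},
    "Computer": {"begin"},
    "computer": {"middle", "end"},
    "computern": {"end"},
}


def is_compound(word, lexicon=LEXICON):
    """True if word = begin part + zero or more middle parts + end part."""

    def tail(rest):
        if "end" in lexicon.get(rest, ()):
            return True
        return any(
            "middle" in lexicon.get(rest[:i], ()) and tail(rest[i:])
            for i in range(1, len(rest))
        )

    return any(
        "begin" in lexicon.get(word[:i], ()) and tail(word[i:])
        for i in range(1, len(word))
    )


if __name__ == "__main__":
    for w in ["Computerarbeit", "Arbeitscomputer", "Arbeitcomputer",
              "Computerarbeitcomputer"]:
        print(w, is_compound(w))
```

       Because each form is licensed only for particular positions, a form
       without the linking s cannot start a compound and a form with it
       cannot end one.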
       This solution is still not ideal, however, and will be replaced by a
       pattern-based compound-checking algorithm that is closely integrated
       with input buffer tokenization.  Patterns describing compounds come as
       a separate input resource that can refer to high-level properties of
       constituent parts (e.g. the number of syllables, affix flags, and
       containment of hyphens).  The patterns are matched against potential
       segmentations of compounds to assess well-formedness.
1057
1058

Unicode character encoding

       Both Ispell and Myspell use 8-bit character encodings, which is a major
       deficiency when it comes to scalability.  Although a language like
       Hungarian has a standard 8-bit character set (ISO 8859-2), that set
       does not allow a full implementation of Hungarian orthographic
       conventions.  For instance, the '–' symbol (en dash) is missing from
       this character set, even though it is the official symbol for
       delimiting parenthetic clauses in the language and can also appear in
       compound words as a special 'big' hyphen.
1068
       MySpell has some 8-bit encoding tables, but there are also languages
       without a standard 8-bit encoding.  For example, many African languages
       have non-Latin or extended Latin characters.
1072
       Similarly, using the original spelling of certain foreign names like
       Ångström or Molière is encouraged by the Hungarian spelling norm.
       Since the characters 'Å' and 'è' are not part of ISO 8859-2, combining
       such names with inflections whose characters are found in ISO 8859-2
       (like elative -ből, ablative -től or delative -ről, with double acute)
       results in words (like Ångströmről or Molière-től) that cannot be
       encoded using any single 8-bit encoding scheme.
1080
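       The encoding problem can be checked directly.  The sketch below uses
       Python's standard codecs (iso8859_2, iso8859_1, utf-8) and a
       hypothetical helper, can_encode, to test whether a word fits into a
       given encoding.

```python
def can_encode(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for word in ["Ångström", "ről", "Ångströmről", "Molière-től"]:
        print(word,
              can_encode(word, "iso8859_2"),   # Central European
              can_encode(word, "iso8859_1"),   # Western European
              can_encode(word, "utf-8"))
```

       The name and the suffix each fit into one of the two 8-bit sets, but
       the inflected form fits into neither, while UTF-8 covers all of them.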
       The problems of 8-bit encodings have long been recognized by proponents
       of Unicode.  Trading some efficiency for encoding independence clearly
       has its advantages in a truly multilingual application.  Hunspell
       implements memory- and time-efficient Unicode handling.  With non-UTF-8
       character encodings, Hunspell works with the original 8-bit strings.
       With UTF-8 encoding, affixes and words are stored in UTF-8 and are
       handled mostly in UTF-8 during analysis; they are converted to UTF-16
       for condition checking and suggestion.  Unicode text analysis and spell
       checking have a minimal (0-20%) time overhead and a minimal or
       reasonable memory overhead, depending on the language (its UTF-8
       encoding and affixation).
1092
1093

Conversion of aspell dictionaries

       Aspell dictionaries can be easily converted to the Hunspell format.
       Conversion steps:

       Dictionary (xx.cwl -> xx.wl):

              preunzip xx.cwl
              wc -l < xx.wl > xx.dic
              cat xx.wl >> xx.dic

       Affix file:

       If the affix file exists, copy it:

              cp xx_affix.dat xx.aff

       If not, create it with the suitable character encoding (see xx.dat):

              echo "SET ISO8859-x" > xx.aff

       or

              echo "SET UTF-8" > xx.aff

       It is useful to add a TRY option listing the characters of the
       dictionary in frequency order, which guides edit-distance suggestions:

              echo "TRY qwertzuiopasdfghjklyxcvbnmQWERTZUIOPASDFGHJKLYXCVBNM" >>xx.aff
1116
1117
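       The conversion steps can also be sketched in Python, including a TRY
       line computed from the word list instead of a fixed keyboard order.
       The helper names are hypothetical, the word list is assumed to be
       already decompressed (one word per line), and anything after a slash
       is ignored when counting characters.

```python
from collections import Counter


def build_dic(words):
    """Return .dic content: the word count on line 1, then the words."""
    words = [w.strip() for w in words if w.strip()]
    return "\n".join([str(len(words))] + words) + "\n"


def try_line(words):
    """Return a TRY line with characters ordered by falling frequency."""
    counts = Counter(
        ch for w in words for ch in w.split("/")[0] if not ch.isspace()
    )
    # Ties are broken by code point so the result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "TRY " + "".join(ordered)


def build_aff(words, encoding="UTF-8"):
    """Return a minimal .aff content with SET and TRY lines."""
    return f"SET {encoding}\n{try_line(words)}\n"


if __name__ == "__main__":
    sample = ["foo", "bar", "baz"]
    print(build_dic(sample), end="")
    print(build_aff(sample), end="")
```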