hunspell(4)                Kernel Interfaces Manual                hunspell(4)

NAME

       hunspell - format of Hunspell dictionaries and affix files

DESCRIPTION

       Hunspell(1) requires two files to define the language that it is
       spellchecking. The first file is a dictionary containing words for the
       language, and the second is an "affix" file that defines the meaning
       of special flags in the dictionary.

       A dictionary file (*.dic) contains a list of words, one per line. The
       first line of a dictionary (except a personal dictionary) contains the
       approximate word count (used to size the hash table). Each word may
       optionally be followed by a slash ("/") and one or more flags, which
       represent affixes or special attributes. Dictionary words can also
       contain slashes, written with the "\/" syntax. The default flag format
       is a single (usually alphabetic) character. In a Hunspell dictionary
       file, there is also an optional morphological field, separated from
       the word by a tab character.

       Morphological descriptions have a custom format.

       An affix file (*.aff) may contain many optional attributes. For
       example, SET sets the character encoding of the affix and dictionary
       files. TRY sets the characters tried when generating suggestions. REP
       sets a replacement table for multiple-character corrections in
       suggestion mode. PFX and SFX define prefix and suffix classes named
       with affix flags.

       The following affix file example sets the UTF-8 character encoding.
       The `TRY' suggestions differ from the bad word by an English letter or
       an apostrophe. With these REP definitions, Hunspell can suggest the
       right word form when the misspelled word contains f instead of ph and
       vice versa.

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

              REP 2
              REP f ph
              REP ph f

              PFX A Y 1
              PFX A 0 re .

              SFX B Y 2
              SFX B 0 ed [^y]
              SFX B y ied y

       There are two affix classes in this example. Class A defines an `re-'
       prefix. Class B defines two `-ed' suffix rules. The first can be added
       to a word if the last character of the word isn't `y'. The second can
       be added to words ending in `y'. (See details later.) The following
       dictionary file uses these affix classes.

              3
              hello
              try/B
              work/AB

       All words accepted with this example: hello, try, tried, work, worked,
       rework, reworked.

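       The expansion of the example above can be sketched in Python. This is
       only an illustration, not Hunspell's implementation: the AFFIXES table
       is transcribed by hand from the example, the helper names are
       hypothetical, flags are assumed to be single characters, and the
       conditions are passed to Python's regular expression engine
       unchanged, which is adequate for these three rules only.

```python
import re

# flag -> (kind, cross_product, [(strip, affix, condition), ...])
AFFIXES = {
    "A": ("PFX", True, [("0", "re", ".")]),
    "B": ("SFX", True, [("0", "ed", "[^y]"), ("y", "ied", "y")]),
}
DICTIONARY = ["hello", "try/B", "work/AB"]


def apply_rule(word, kind, strip, add, condition):
    """Return the affixed form, or None if the rule does not apply."""
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    if kind == "SFX":
        if not word.endswith(strip) or not re.search("(?:" + condition + ")$", word):
            return None
        return word[: len(word) - len(strip)] + add
    if not word.startswith(strip) or not re.match(condition, word):
        return None
    return add + word[len(strip):]


def expand(entry):
    """Generate every word form licensed by one dictionary entry."""
    stem, _, flags = entry.partition("/")
    prefixes, suffixes = [], []
    for flag in flags:  # assumes the default one-character flag format
        kind, cross, rules = AFFIXES[flag]
        for rule in rules:
            (prefixes if kind == "PFX" else suffixes).append((cross, rule))
    forms = {stem}
    for _, p_rule in prefixes:
        prefixed = apply_rule(stem, "PFX", *p_rule)
        if prefixed is not None:
            forms.add(prefixed)
    for s_cross, s_rule in suffixes:
        suffixed = apply_rule(stem, "SFX", *s_rule)
        if suffixed is None:
            continue
        forms.add(suffixed)
        for p_cross, p_rule in prefixes:
            if s_cross and p_cross:  # cross product must be Y on both classes
                both = apply_rule(suffixed, "PFX", *p_rule)
                if both is not None:
                    forms.add(both)
    return forms


accepted = sorted(set().union(*(expand(entry) for entry in DICTIONARY)))
print(accepted)
```

       The printed list contains the same seven words as the list above.
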

GENERAL OPTIONS

       SET encoding
              Set the character encoding of words and morphemes in the affix
              and dictionary files. Possible values: UTF-8, ISO8859-1 -
              ISO8859-10, ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U,
              microsoft-cp1251, ISCII-DEVANAGARI.

       FLAG value
              Set the flag type. The default type is the extended ASCII
              (8-bit) character. The `UTF-8' parameter sets UTF-8 encoded
              Unicode character flags. The `long' value sets the double
              extended ASCII character flag type, and `num' sets the decimal
              number flag type. Decimal flags are numbered from 1 to 65535
              and are separated by commas in flag fields. BUG: the UTF-8 flag
              type doesn't work on the ARM platform.

       COMPLEXPREFIXES
              Set twofold prefix stripping (but single suffix stripping) for
              agglutinative languages with right-to-left writing systems.

       LANG langcode
              Set the language code. Hunspell may contain language-specific
              code that is enabled by the LANG code. At present there is
              specific code for az_AZ, hu_HU and TR_tr in Hunspell (see the
              source code).

       IGNORE characters
              Ignore characters from dictionary words, affixes and input
              words. Useful for optional characters, such as Arabic
              diacritical marks (Harakat).

       AF number_of_flag_vector_aliases

       AF flag_vector
              Hunspell can substitute affix flag sets with a natural number
              in affix rules (alias compression). First example with alias
              compression:

              3
              hello
              try/1
              work/2

       AF definitions in the affix file:

              SET UTF-8
              TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
              AF 2
              AF A
              AF AB

       See also tests/alias* examples.

       Note: If the affix file contains the FLAG parameter, define it before
       the AF definitions.

       Note II: Use the makealias utility in the Hunspell distribution to
       compress aff and dic files.

       AM number_of_morphological_description_aliases

       AM morphological_description
              Hunspell can also substitute morphological descriptions with a
              natural number in affix rules (alias compression). See
              tests/alias* examples.

OPTIONS FOR SUGGESTION

       TRY characters
              Hunspell can suggest correct word forms when they differ from
              the bad form by one TRY character. The parameter of TRY is case
              sensitive.

       NOSUGGEST flag
              Words marked with the NOSUGGEST flag are not suggested.
              Proposed flag for vulgar and obscene words.

       MAXNGRAMSUGS num
              Set the number of n-gram suggestions. Value 0 switches off the
              n-gram suggestions.

       NOSPLITSUGS
              Disable split-word suggestions.

       SUGSWITHDOTS
              Add dot(s) to suggestions if the input word ends in dot(s).
              (Not for OpenOffice.org dictionaries, because OpenOffice.org has
              an automatic dot expansion mechanism.)

       REP number_of_replacement_definitions

       REP what replacement
              We can define language-dependent phonetic information in the
              affix file (.aff) with a replacement table. The first REP line
              is the header of this table, and one or more REP data lines
              follow it. With this table, Hunspell can suggest the right
              forms for typical spelling mistakes where the incorrect form
              differs from the right form by more than one letter. For
              example, a possible English replacement table definition to
              handle misspelled consonants:

              REP 8
              REP f ph
              REP ph f
              REP f gh
              REP gh f
              REP j dg
              REP dg j
              REP k ch
              REP ch k

       Note: It's also very useful to define replacements for the most
       typical one-character mistakes: with REP you can give higher priority
       to a subset of the TRY suggestions (the suggestion list begins with
       the REP suggestions).

       Note II: The replacement table can also be used for stricter compound
       word checking (forbidding generated compound words if they are also
       simple words with a typical fault; see CHECKCOMPOUNDREP).

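       The idea behind REP suggestions can be sketched in Python. This is an
       illustration under simplifying assumptions, not Hunspell's suggestion
       code: each table entry is applied at one position at a time, and the
       `dictionary' argument is a hypothetical set of known words.

```python
REP_TABLE = [
    ("f", "ph"), ("ph", "f"), ("f", "gh"), ("gh", "f"),
    ("j", "dg"), ("dg", "j"), ("k", "ch"), ("ch", "k"),
]


def rep_candidates(word, table=REP_TABLE):
    """Apply each replacement at each position where it matches, one at a time."""
    candidates = []
    for what, replacement in table:
        start = word.find(what)
        while start != -1:
            candidate = word[:start] + replacement + word[start + len(what):]
            if candidate not in candidates:
                candidates.append(candidate)
            start = word.find(what, start + 1)
    return candidates


def rep_suggestions(word, dictionary):
    """Keep only the candidates that are known words."""
    return [c for c in rep_candidates(word) if c in dictionary]


print(rep_suggestions("fone", {"phone", "tough"}))  # prints ['phone']
print(rep_suggestions("touf", {"phone", "tough"}))  # prints ['tough']
```
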
       MAP number_of_map_definitions

       MAP string_of_related_chars
              We can define language-dependent information on characters that
              should be considered related (i.e. nearer than other characters
              not in the set) in the affix file (.aff) with a character map
              table. With this table, Hunspell can suggest the right forms
              for words that incorrectly choose the wrong letter from a
              related set more than once in a word.

              For example, a possible mapping could be for the German
              umlauted ü versus the regular u; the word Frühstück really
              should be written with umlauted u's and not regular ones:

              MAP 1
              MAP uü

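       A rough Python sketch of how a MAP table can produce candidates that
       differ in several related characters at once. The function name is
       hypothetical, each MAP entry is assumed to be a plain string of single
       characters, and the brute-force enumeration grows exponentially with
       the number of mapped characters in the word, so it only illustrates
       the principle.

```python
from itertools import product


def map_candidates(word, related_sets):
    """Return all variants of word obtained by swapping related characters."""
    related = {}
    for chars in related_sets:
        for ch in chars:
            related[ch] = chars
    # Each position offers either its related set or just the character itself.
    options = [related.get(ch, ch) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}


candidates = map_candidates("Fruhstuck", ["uü"])
print("Frühstück" in candidates)  # prints True
```
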

OPTIONS FOR COMPOUNDING

       BREAK number_of_break_definitions

       BREAK character_or_character_sequence
              Define break points for breaking words and checking word parts
              separately. Rationale: useful for compounding with joining
              characters or strings (for example, the hyphen in English and
              German, or the hyphen and n-dash in Hungarian). (Dashes are
              often bad break points for tokenization, because compounds
              with dashes may contain invalid parts, too.) With BREAK,
              Hunspell can check both sides of these compounds, breaking the
              words at dashes and n-dashes:

              BREAK 2
              BREAK -
              BREAK --    # n-dash

       Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar would
       be valid compounds.

       Note: COMPOUNDRULE is better (or will be better) for handling dashes
       and other compound joining characters or character strings. Use BREAK
       if you want to check words with dashes or other joining characters and
       there is no time or possibility to describe precise compound rules
       with COMPOUNDRULE (so far, COMPOUNDRULE handles only the last
       suffixation of the compound word).

       Note II: For command line spell checking, set the WORDCHARS
       parameter: WORDCHARS --- (see the tests/break.* example).

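       The recursive breaking described above can be sketched as follows.
       The function name and the `known_words' set are hypothetical, and the
       sketch requires every part to be non-empty, which may differ from
       Hunspell's actual handling of leading and trailing break characters.

```python
def check_with_breaks(word, known_words, breaks=("-", "--")):
    """Accept word if it is known, or if it splits at a break point
    into two parts that are both accepted (recursively)."""
    if word in known_words:
        return True
    for brk in breaks:
        pos = word.find(brk)
        while pos != -1:
            left, right = word[:pos], word[pos + len(brk):]
            if (left and right
                    and check_with_breaks(left, known_words, breaks)
                    and check_with_breaks(right, known_words, breaks)):
                return True
            pos = word.find(brk, pos + 1)
    return False


words = {"foo", "bar"}
print(check_with_breaks("foo-foo--bar-bar", words))  # prints True
print(check_with_breaks("foo-baz", words))           # prints False
```
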
       COMPOUNDRULE number_of_compound_definitions

       COMPOUNDRULE compound_pattern
              Define custom compound patterns with a regex-like syntax. The
              first COMPOUNDRULE is a header with the number of the following
              COMPOUNDRULE definitions. Compound patterns consist of compound
              flags and the star and question mark metacharacters. A flag
              followed by a `*' matches a sequence of 0 or more words marked
              with this compound flag. A flag followed by a `?' matches a
              sequence of 0 or 1 words marked with this compound flag. See
              tests/compound*.* examples.

              Note: The `*' and `?' metacharacters work only with the default
              8-bit character and the UTF-8 FLAG types.

              Note II: COMPOUNDRULE flags are not yet compatible with the
              COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these
              flags on different words).

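       The matching of a compound pattern against a sequence of words can be
       sketched with a small backtracking matcher. Everything here is
       illustrative: the pattern `A*B?C', the word list and the function
       names are made up for the example, flags are assumed to be single
       characters, and the word is assumed to be already segmented into its
       parts.

```python
def parse_pattern(pattern):
    """Split 'A*B?C' into [(flag, quantifier)], quantifier in '', '*', '?'."""
    items = []
    for ch in pattern:
        if ch in "*?" and items:
            items[-1] = (items[-1][0], ch)
        else:
            items.append((ch, ""))
    return items


def matches(parts, items, flags_of):
    """True if the word sequence 'parts' matches the whole pattern."""
    if not items:
        return not parts
    (flag, quant), rest = items[0], items[1:]
    head_ok = bool(parts) and flag in flags_of.get(parts[0], "")
    if quant == "":
        return head_ok and matches(parts[1:], rest, flags_of)
    if matches(parts, rest, flags_of):  # zero occurrences of this flag
        return True
    if not head_ok:
        return False
    # '*' may match again, '?' is consumed after one match
    return matches(parts[1:], items if quant == "*" else rest, flags_of)


FLAGS = {"a": "A", "b": "B", "c": "BC"}  # hypothetical dictionary: word -> flags
RULE = parse_pattern("A*B?C")
print(matches(["a", "a", "b", "c"], RULE, FLAGS))  # prints True
print(matches(["b", "b", "c"], RULE, FLAGS))       # prints False
```
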
       COMPOUNDMIN num
              Minimum length of words in compound words. The default value
              is 3 letters.

       COMPOUNDFLAG flag
              Words marked with COMPOUNDFLAG may be in compound words (except
              when the word is shorter than COMPOUNDMIN). Affixes with
              COMPOUNDFLAG also permit compounding of affixed words.

       COMPOUNDBEGIN flag
              Words marked with COMPOUNDBEGIN (or with a marked affix) may be
              first elements in compound words.

       COMPOUNDLAST flag
              Words marked with COMPOUNDLAST (or with a marked affix) may be
              last elements in compound words.

       COMPOUNDMIDDLE flag
              Words marked with COMPOUNDMIDDLE (or with a marked affix) may
              be middle elements in compound words.

       ONLYINCOMPOUND flag
              Suffixes marked with the ONLYINCOMPOUND flag may occur only
              inside compounds (Fuge-elements in German, fogemorphemes in
              Swedish). The ONLYINCOMPOUND flag also works with words (see
              tests/onlyincompound.*).

       COMPOUNDPERMITFLAG flag
              By default, prefixes are allowed at the beginning of compounds
              and suffixes at the end of compounds. Affixes with
              COMPOUNDPERMITFLAG may be inside compounds.

       COMPOUNDFORBIDFLAG flag
              Suffixes with this flag forbid compounding of the affixed word.

       COMPOUNDROOT flag
              The COMPOUNDROOT flag marks the compounds in the dictionary
              (currently used only in the Hungarian language-specific code).

       COMPOUNDWORDMAX number
              Set the maximum word count in a compound word. (Default is
              unlimited.)

       CHECKCOMPOUNDDUP
              Forbid word duplication in compounds (e.g. foofoo).

       CHECKCOMPOUNDREP
              Forbid compounding if the (usually bad) compound word may be a
              non-compound word with a REP fault. Useful for languages with
              `compound friendly' orthography.

       CHECKCOMPOUNDCASE
              Forbid upper case characters at word boundaries in compounds.

       CHECKCOMPOUNDTRIPLE
              Forbid compounding if the compound word contains triple letters
              (e.g. foo|ox or xo|oof). Bug: missing multi-byte character
              support in UTF-8 encoding (works only for 7-bit ASCII
              characters).

       CHECKCOMPOUNDPATTERN number_of_checkcompoundpattern_definitions

       CHECKCOMPOUNDPATTERN endchars beginchars
              Forbid compounding if the first word in the compound ends with
              endchars and the next word begins with beginchars.

       COMPOUNDSYLLABLE max_syllable vowels
              Needed for special compounding rules in Hungarian. The first
              parameter is the maximum number of syllables that may be in a
              compound if it contains more words than COMPOUNDWORDMAX. The
              second parameter is the list of vowels (for counting
              syllables).

       SYLLABLENUM flags
              Needed for special compounding rules in Hungarian.

OPTIONS FOR AFFIX CREATION

       PFX flag cross_product number

       PFX flag stripping prefix condition morphological_description

       SFX flag cross_product number

       SFX flag stripping suffix condition morphological_description
              An affix is either a prefix or a suffix attached to root words
              to make other words. We can define affix classes with an
              arbitrary number of affix rules. Affix classes are identified
              by affix flags. The first line of an affix class definition is
              the header. The fields of an affix class header:

              (0) Option name (PFX or SFX)

              (1) Flag (name of the affix class)

              (2) Cross product (permission to combine prefixes and suffixes).
              Possible values: Y (yes) or N (no)

              (3) Line count of the following rules.

              Fields of an affix rule:

              (0) Option name

              (1) Flag

              (2) Characters to strip from the beginning (in prefix rules) or
              end (in suffix rules) of the word

              (3) Affix (optionally with flags of continuation classes,
              separated by a slash)

              (4) Condition.

              Zero stripping or a zero affix is indicated by zero. A zero
              condition is indicated by a dot. The condition is a simplified,
              regular expression-like pattern, which must be met before the
              affix can be applied. (A dot stands for an arbitrary character.
              Characters in brackets stand for an arbitrary character from
              the character subset. A dash has no special meaning, but a
              circumflex (^) right after the opening bracket selects the
              complement of the character set.)

              (5) Custom morphological description.

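       One way to evaluate such a condition is to translate it into an
       ordinary regular expression. The Python sketch below follows the
       description above (dot, bracketed sets, complement with ^, no special
       meaning for the dash); the function name is hypothetical, and the
       anchoring at the start of the word for prefixes and at the end for
       suffixes is an assumption of this sketch.

```python
import re


def condition_to_regex(condition, kind):
    """Translate an affix condition into a compiled Python regex.

    kind is "PFX" (anchored at the start) or "SFX" (anchored at the end).
    """
    out, in_set = [], False
    for i, ch in enumerate(condition):
        if ch == "[":
            in_set = True
            out.append("[")
        elif ch == "]":
            in_set = False
            out.append("]")
        elif ch == "^" and in_set and condition[i - 1] == "[":
            out.append("^")          # complement of the character set
        elif ch == "." and not in_set:
            out.append(".")          # arbitrary character
        else:
            out.append(re.escape(ch))  # literal; also makes '-' in sets literal
    body = "".join(out)
    return re.compile("^" + body if kind == "PFX" else body + "$")


not_y = condition_to_regex("[^y]", "SFX")
print(bool(not_y.search("work")), bool(not_y.search("try")))  # prints True False
```
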

OTHER OPTIONS

       CIRCUMFIX flag
              Affixes marked with the CIRCUMFIX flag may be on a word when
              this word also has a prefix with the CIRCUMFIX flag, and vice
              versa.

       FORBIDDENWORD flag
              This flag marks a forbidden word form. Because affixed forms
              are also forbidden, we can subtract a subset from the set of
              accepted affixed and compound words.

       KEEPCASE flag
              Forbid uppercased and capitalized forms of words marked with
              the KEEPCASE flag. Useful for special orthographies
              (measurements and currency often keep their case in uppercased
              texts) and writing systems (e.g. keeping the lower case of IPA
              characters).

              Note: With the CHECKSHARPS declaration, words with sharp s and
              the KEEPCASE flag may be capitalized and uppercased, but
              uppercased forms of these words may not contain sharp s, only
              SS. See the germancompounding example in the tests directory
              of the Hunspell distribution.

       LEMMA_PRESENT flag
              Generally, dictionary words appear as lemmas in the output of
              morphological analysis. Sometimes dictionary words are not
              lemmas, but affixed (not real) stems and virtual stems. In
              this case, the lemmas (real stems) need to be put into the
              morphological description, and the LEMMA_PRESENT flag added to
              the dictionary words to keep the non-lemma stems out of the
              morphological analysis.

       NEEDAFFIX flag
              This flag marks virtual stems in the dictionary. Only affixed
              forms of these words will be accepted by Hunspell, except when
              the dictionary word has a homonym or a zero affix. NEEDAFFIX
              also works with prefixes and prefix + suffix combinations (see
              tests/pseudoroot5.*).

       PSEUDOROOT flag
              Deprecated. (Former name of the NEEDAFFIX option.)

       WORDCHARS characters
              WORDCHARS extends the tokenizer of the Hunspell command line
              interface with additional word characters. For example, dot,
              dash, n-dash, numbers and the percent sign are word characters
              in Hungarian.

       CHECKSHARPS
              The SS letter pair in uppercased (German) words may be the
              upper case form of sharp s (ß). Hunspell can handle this
              special casing with the CHECKSHARPS declaration (see also the
              KEEPCASE flag and the tests/germancompounding example) in both
              spelling and suggestion.

Morphological analysis

       Hunspell's affix rules have an optional morphological description
       field. There is a similar optional field in the dictionary file,
       separated by a tab character:

               word/flags    morphology

       We define a simple resource with morphological information.

       Affix file:

               SFX X Y 1
               SFX X 0 able . +ABLE

       Dictionary file:

               drink/X   [VERB]

       Test file:

               drink
               drinkable

       Test:

               $ hunmorph test.aff test.dic test.txt
               drink:     drink[VERB]
               drinkable: drink[VERB]+ABLE

       You can see in the example that the analyzer concatenates the
       morphological fields in item-and-arrangement style.

Twofold suffix stripping

       Ispell's original algorithm strips only one suffix. Hunspell can strip
       a second one as well.

       Twofold suffix stripping is a significant improvement in handling the
       immense number of suffixes that characterize agglutinative languages.

       Extending the previous example by adding a second suffix (affix class
       Y will be the continuation class of the suffix `able'):

               SFX Y Y 1
               SFX Y 0 s . +PLUR

               SFX X Y 1
               SFX X 0 able/Y . +ABLE

       Dictionary file:

               drink/X   [VERB]

       Test file:

               drink
               drinkable
               drinkables

       Test:

               $ hunmorph test.aff test.dic test.txt
               drink:      drink[VERB]
               drinkable:  drink[VERB]+ABLE
               drinkables: drink[VERB]+ABLE+PLUR

       Theoretically, twofold suffix stripping needs only the square root of
       the number of suffix rules that single suffix stripping would need.
       In practice, it allowed us to describe the Hungarian inflectional
       morphology.

       Note: The grammar of the Hunlex preprocessor supports not only
       twofold but multiple suffix stripping.

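       The effect of a continuation class can be sketched from the generation
       side in Python. The table below is transcribed by hand from the
       example in this section, the function name is hypothetical, and the
       sketch ignores stripping and conditions (both are trivial in the
       example); `depth' limits generation to two suffixes.

```python
# flag -> [(suffix, continuation_flags, morphological_description)]
SUFFIX_CLASSES = {
    "X": [("able", "Y", "+ABLE")],
    "Y": [("s", "", "+PLUR")],
}


def generate(stem, flags, morph, depth=2):
    """Map each generated word form to its concatenated analysis."""
    forms = {stem: morph}
    if depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation, description in SUFFIX_CLASSES[flag]:
            forms.update(
                generate(stem + suffix, continuation, morph + description, depth - 1)
            )
    return forms


for word, analysis in generate("drink", "X", "drink[VERB]").items():
    print(word + ":", analysis)
```

       The three lines printed correspond to the analyses shown in the test
       output above.
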

Extended affix classes

       Hunspell can handle more than 65000 affix classes. There are two new
       syntaxes for giving flags in affix and dictionary files.

       The FLAG long command sets 2-character flags:

                FLAG long
                SFX Y1 Y 1
                SFX Y1 0 s 1

       Dictionary record with the Y1, Z3, F? flags:

                foo/Y1Z3F?

       The FLAG num command sets numerical flags separated by commas:

                FLAG num
                SFX 65000 Y 1
                SFX 65000 0 s 1

       Dictionary example:

                foo/65000,12,2756

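       Splitting a flag field under the different FLAG types can be sketched
       as follows. The function name is hypothetical; the range check for
       numeric flags uses the limits given in the FLAG option description
       (1 to 65535), and the UTF-8 flag type is treated like the default
       one-character type here.

```python
def parse_flags(field, flag_type="default"):
    """Split the flag field of a dictionary entry into individual flags."""
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags must come in pairs of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        numbers = [int(n) for n in field.split(",")]
        if any(not 1 <= n <= 65535 for n in numbers):
            raise ValueError("numeric flags must be between 1 and 65535")
        return numbers
    return list(field)  # default: one character per flag


print(parse_flags("Y1Z3F?", "long"))        # prints ['Y1', 'Z3', 'F?']
print(parse_flags("65000,12,2756", "num"))  # prints [65000, 12, 2756]
```
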

Homonyms

       Hunspell's dictionary can contain repeated entries that are homonyms:

               work/A    [VERB]
               work/B    [NOUN]

       An affix file:

               SFX A Y 1
               SFX A 0 s . +SG3

               SFX B Y 1
               SFX B 0 s . +PLUR

       Test file:

               works

       Test:

               > works
               work[VERB]+SG3
               work[NOUN]+PLUR

       This feature also gives a way to forbid illegal prefix/suffix
       combinations in difficult cases.

Prefix--suffix dependencies

       An interesting side effect of multi-step stripping is that the
       appropriate treatment of circumfixes now comes for free. For instance,
       in Hungarian, superlatives are formed by simultaneous prefixation of
       leg- and suffixation of -bb to the adjective base. A problem with the
       one-level architecture is that there is no way to render lexical
       licensing of particular prefixes and suffixes interdependent, and
       therefore incorrect forms are recognized as valid, e.g. *legvén = leg
       + vén `old'. Until the introduction of clusters, a special treatment
       of the superlative had to be hardwired in the earlier HunSpell code.
       This may have been legitimate for a single case, but in fact
       prefix--suffix dependencies are ubiquitous in category-changing
       derivational patterns (cf. English payable, non-payable but *non-pay,
       or drinkable, undrinkable but *undrink). In simple terms, the prefix
       un- is legitimate here only if the base drink is suffixed with -able.
       If both these patterns are handled by on-line affix rules and affix
       rules are checked against the base only, there is no way to express
       this dependency and the system will necessarily over- or
       undergenerate.

       In the next example, suffix class R has a prefix `continuation' class
       (class P).

              PFX P Y 1
              PFX P   0 un . [prefix_un]+

              SFX S Y 1
              SFX S   0 s . +PL

              SFX Q Y 1
              SFX Q   0 s . +3SGV

              SFX R Y 1
              SFX R   0 able/PS . +DER_V_ADJ_ABLE

       Dictionary:

              2
              drink/RQ  [verb]
              drink/S   [noun]

       Morphological analysis:

              > drink
              drink[verb]
              drink[noun]
              > drinks
              drink[verb]+3SGV
              drink[noun]+PL
              > drinkable
              drink[verb]+DER_V_ADJ_ABLE
              > drinkables
              drink[verb]+DER_V_ADJ_ABLE+PL
              > undrinkable
              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE
              > undrinkables
              [prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
              > undrink
              Unknown word.
              > undrinks
              Unknown word.

Circumfix

       Conditional affixes implemented by a continuation class are not enough
       for circumfixes, because a circumfix is one affix in morphology. We
       also need the CIRCUMFIX option for correct morphological analysis.

              # circumfixes: ~ obligate prefix/suffix combinations
              # superlative in Hungarian: leg- (prefix) AND -bb (suffix)
              # nagy, nagyobb, legnagyobb, legeslegnagyobb
              # (great, greater, greatest, most greatest)

              CIRCUMFIX X

              PFX A Y 1
              PFX A 0 leg/X .

              PFX B Y 1
              PFX B 0 legesleg/X .

              SFX C Y 3
              SFX C 0 obb . +COMPARATIVE
              SFX C 0 obb/AX . +SUPERLATIVE
              SFX C 0 obb/BX . +SUPERSUPERLATIVE

       Dictionary:

              1
              nagy/C    [MN]

       Analysis:

              > nagy
              nagy[MN]
              > nagyobb
              nagy[MN]+COMPARATIVE
              > legnagyobb
              nagy[MN]+SUPERLATIVE
              > legeslegnagyobb
              nagy[MN]+SUPERSUPERLATIVE

Compounds

       Allowing free compounding decreases the precision of recognition, not
       to mention stemming and morphological analysis. Although lexical
       switches are introduced to license compounding of bases by Ispell,
       this proves not to be restrictive enough. For example:

              # affix file
              COMPOUNDFLAG X

              # dictionary file
              2
              foo/X
              bar/X

       With this resource, foobar and barfoo are also accepted words.

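       Why such a resource accepts both orders can be seen from a minimal
       Python sketch of unrestricted compounding. The function name and the
       word set are hypothetical; the only restriction modelled is the
       minimum part length, for which the sketch borrows the default of
       COMPOUNDMIN (3 letters).

```python
def is_compound(word, compound_words, compound_min=3):
    """True if word splits into two or more words that all carry the
    compound flag, each at least compound_min letters long."""
    for cut in range(compound_min, len(word) - compound_min + 1):
        head, tail = word[:cut], word[cut:]
        if head in compound_words and (
            tail in compound_words
            or is_compound(tail, compound_words, compound_min)
        ):
            return True
    return False


flagged = {"foo", "bar"}  # words marked with the compound flag X
print(is_compound("foobar", flagged), is_compound("barfoo", flagged))
# prints True True
```

       Nothing in this scheme distinguishes the first element from the last,
       which is the lack of restrictiveness described above.
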
       This has been improved upon with the introduction of
       direction-sensitive compounding, i.e., lexical features can specify
       separately whether a base can occur as the leftmost or rightmost
       constituent in compounds. This, however, is still insufficient to
       handle the intricate patterns of compounding, not to mention
       idiosyncratic (and language-specific) norms of hyphenation.

       The original algorithm allowed any affixed form of words that are
       lexically marked as potential members of compounds. Hunspell improved
       this, and its recursive compound checking rules make it possible to
       implement the intricate spelling conventions of Hungarian compounds.
       For example, the noteworthy Hungarian `6--3' rule can be set with the
       COMPOUNDWORDMAX, COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM
       options. As a further example, in Hungarian, derivational suffixes
       often modify compounding properties. Hunspell allows compounding flags
       on affixes, and there are two special flags (COMPOUNDPERMITFLAG and
       COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
       derivations.

       We also need several Hunspell features for handling German compounding:

              # German compounding

              # set language to handle special casing of German sharp s

              LANG de_DE

              # compound flags

              COMPOUNDBEGIN U
              COMPOUNDMIDDLE V
              COMPOUNDEND W

              # Prefixes are allowed at the beginning of compounds,
              # suffixes are allowed at the end of compounds by default:
              # (prefix)?(root)+(affix)?
              # Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
              COMPOUNDPERMITFLAG P

              # for German fogemorphemes (Fuge-element)
              # Hint: ONLYINCOMPOUND is not required everywhere, but the
              # checking will be a little faster with it.

              ONLYINCOMPOUND X

              # forbid uppercase characters at compound word bounds
              CHECKCOMPOUNDCASE

              # for handling Fuge-elements with dashes (Arbeits-)
              # dash will be a special word

              COMPOUNDMIN 1
              WORDCHARS -

              # compound settings and fogemorpheme for `Arbeit'

              SFX A Y 3
              SFX A 0 s/UPX .
              SFX A 0 s/VPDX .
              SFX A 0 0/WXD .

              SFX B Y 2
              SFX B 0 0/UPX .
              SFX B 0 0/VWXDP .

              # a suffix for `Computer'

              SFX C Y 1
              SFX C 0 n/WD .

              # for forbid exceptions (*Arbeitsnehmer)

              FORBIDDENWORD Z

              # dash prefix for compounds with dash (Arbeits-Computer)

              PFX - Y 1
              PFX - 0 -/P .

              # decapitalizing prefix
              # circumfix for positioning in compounds

              PFX D Y 29
              PFX D A a/PX A
              PFX D Ä ä/PX Ä
               .
               .
              PFX D Y y/PX Y
              PFX D Z z/PX Z

787       Example dictionary:
788
789
790              4
791              Arbeit/A-
792              Computer/BC-
793              -/W
794              Arbeitsnehmer/Z
795
796       Accepted compound compound words with the previous resource:
797
798
799              Computer
800              Computern
801              Arbeit
802              Arbeits-
803              Computerarbeit
804              Computerarbeits-
805              Arbeitscomputer
806              Arbeitscomputern
807              Computerarbeitscomputer
808              Computerarbeitscomputern
809              Arbeitscomputerarbeit
810              Computerarbeits-Computer
811              Computerarbeits-Computern
812
813       Not accepted compoundings:
814
815
816              computer
817              arbeit
818              Arbeits
819              arbeits
820              ComputerArbeit
821              ComputerArbeits
822              Arbeitcomputer
823              ArbeitsComputer
824              Computerarbeitcomputer
825              ComputerArbeitcomputer
826              ComputerArbeitscomputer
827              Arbeitscomputerarbeits
828              Computerarbeits-computer
829              Arbeitsnehmer
830
831       This solution is still not ideal, however, and will be replaced by a
832       pattern-based compound-checking algorithm that is closely integrated
833       with input buffer tokenization. Patterns describing compounds come as a
834       separate input resource that can refer to high-level properties of the
835       constituent parts (e.g. the number of syllables, affix flags, and the
836       presence of hyphens). The patterns are matched against potential
837       segmentations of compounds to assess well-formedness.
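
       The idea of matching patterns against potential segmentations can be
       sketched as follows. This is only an illustration: the source does not
       specify a pattern format, so the lexicon layout, the role names
       ("first", "middle", "last") and the functions segmentations,
       well_formed and accepts are all hypothetical, and case handling is
       ignored.

```python
# Toy lexicon (hypothetical): each part lists the positions it may occupy
# inside a compound. "arbeits" is "arbeit" plus its linking element.
LEXICON = {
    "computer": {"first", "middle", "last"},
    "arbeits": {"first", "middle"},
    "arbeit": {"last"},
}


def segmentations(word, lexicon):
    """Yield every way to split `word` into consecutive lexicon entries."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in segmentations(word[end:], lexicon):
                yield [head] + rest


def well_formed(parts, lexicon):
    """Pattern: at least two parts, each allowed in the position it occupies."""
    if len(parts) < 2:
        return False
    roles = ["first"] + ["middle"] * (len(parts) - 2) + ["last"]
    return all(role in lexicon[part] for part, role in zip(parts, roles))


def accepts(word, lexicon=LEXICON):
    """A compound is accepted if any of its segmentations matches the pattern."""
    return any(well_formed(parts, lexicon)
               for parts in segmentations(word, lexicon))


for candidate in ("arbeitscomputer", "arbeitcomputer", "computerarbeitscomputer"):
    print(candidate, accepts(candidate))
```

       A real pattern resource would presumably look at richer properties
       (syllable counts, affix flags, hyphens) than the single position set
       assumed here.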
838
839

Character encoding

841       Problems with the 8-bit encoding
842
843       Both Ispell and MySpell use 8-bit character encodings, which is a major
844       deficiency when it comes to scalability. Although a language like
845       Hungarian has a standard 8-bit character set (ISO 8859-2), that set
846       does not allow a full implementation of Hungarian orthographic
847       conventions. For instance, the en dash ('–') is missing from this
848       character set, even though it is not only the official symbol for
849       delimiting parenthetic clauses in the language but can also appear in
850       compound words as a special 'big' hyphen.
851
852       MySpell has some 8-bit encoding tables, but there are also languages
853       without a standard 8-bit encoding. For example, many African
854       languages use non-Latin or extended Latin characters.
855
856       Similarly, the Hungarian spelling norm encourages using the original
857       spelling of certain foreign names like Ångström or Molière. Since the
858       characters 'Å' and 'è' are not part of ISO 8859-2, combining them with
859       inflections containing characters found only in ISO 8859-2 (like elative
860       -ből, ablative -től or delative -ről, with double acute) results in
861       words (like Ångströmről or Molière-től) that cannot be encoded using
862       any single 8-bit encoding scheme.
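
       The encoding gap described above can be checked with Python's standard
       codecs. This is a small sketch, not part of Hunspell; the helper name
       encodable is hypothetical, and the test word is the delative form of
       Ångström.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "\u00c5ngstr\u00f6mr\u0151l"  # A-ring from Latin-1, o-double-acute from Latin-2

for encoding in ("iso8859_1", "iso8859_2", "utf-8"):
    print(encoding, encodable(word, encoding))

# The en dash (U+2013) is likewise absent from ISO 8859-2.
print("en dash in iso8859_2:", encodable("\u2013", "iso8859_2"))
```

       Neither 8-bit character set covers both letters at once, while UTF-8
       encodes the whole word.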
863
864       The problems raised in relation to 8-bit encodings have long been
865       recognized by proponents of Unicode. Unfortunately, switching to Unicode
866       (e.g., UTF-16 encoding) would require a great deal of code optimization
867       and would have an impact on the efficiency of the algorithm.
868       The Dömölki algorithm used in  checking  affixing  conditions  utilizes
869       256-byte  character arrays, which would grow to 64k with Unicode encod‐
870       ing. Since online affixing for a richly agglutinative language can eas‐
871       ily  have several hundred such arrays (in the case of the standard Hun‐
872       garian resources we use, this number is ca. 300 or more since redundant
873       storage  of structurally identical affix patterns improves efficiency),
874       switching to Unicode would incur high resource costs. Nonetheless, it
875       is clear that trading efficiency for encoding independence has its
876       advantages when it comes to a truly multilingual application, so
877       extending the architecture in this direction was among our plans for a
878       long while.
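
       The memory argument can be made concrete with a simplified table-driven
       condition check. The sketch below assumes one table entry per character
       code and one bit per condition position; it illustrates the general
       technique only and is not Hunspell's code. The names parse_condition,
       build_table and suffix_condition_ok are hypothetical.

```python
def parse_condition(cond):
    """Split a condition such as "[^aeiou]y" into (negated, codes) positions."""
    positions = []
    i = 0
    while i < len(cond):
        ch = cond[i]
        if ch == "[":
            end = cond.index("]", i)
            body = cond[i + 1:end]
            negated = body.startswith("^")
            if negated:
                body = body[1:]
            positions.append((negated, {ord(c) for c in body}))
            i = end + 1
        elif ch == ".":
            positions.append((True, set()))  # negated empty set: any character
            i += 1
        else:
            positions.append((False, {ord(ch)}))
            i += 1
    return positions


def build_table(cond, size=256):
    """Build a `size`-entry table; bit i of an entry says whether that
    character code is allowed at condition position i."""
    positions = parse_condition(cond)
    if len(positions) > 8:
        raise ValueError("this sketch stores at most 8 positions per byte")
    table = bytearray(size)
    for bit, (negated, codes) in enumerate(positions):
        for code in range(size):
            if (code in codes) != negated:
                table[code] |= 1 << bit
    return table, len(positions)


def suffix_condition_ok(word, table, n):
    """Check the last n bytes of an 8-bit encoded word against the table."""
    if len(word) < n:
        return False
    tail = word[len(word) - n:]
    return all(table[code] & (1 << i) for i, code in enumerate(tail))


table, n = build_table("[^aeiou]y")
print(suffix_condition_ok("happy".encode("iso8859_2"), table, n))  # True
print(suffix_condition_ok("play".encode("iso8859_2"), table, n))   # False

# Memory for ca. 300 such tables, 8-bit versus one entry per 16-bit code:
print(300 * 256)      # 76800 bytes
print(300 * 65536)    # 19660800 bytes
```

       With these assumptions, the same 300 tables grow from about 75 KB to
       almost 19 MB when every 16-bit code gets its own entry.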
879
880       A hybrid solution
881
882       We have recently implemented memory- and time-efficient Unicode
883       handling. With non-UTF-8 character encodings, Hunspell works with the
884       original 8-bit algorithms, but with a UTF-8 encoded dictionary and affix
885       file, Hunspell uses hybrid string manipulation and condition checking
886       to support Unicode:
887
888       Affixes and words are stored in UTF-8 and are handled mostly in UTF-8
889       during analysis; for condition checking and suggestion they are
890       converted to UTF-16.
891
892       The Dömölki algorithm is used for storing and checking 7-bit ASCII (ISO
893       646) condition characters, and sorted UTF-16 lists are used for the
894       other Unicode characters of condition patterns.
895
896       So far, Hunspell supports only the first 65536 characters (the Basic
897       Multilingual Plane) of the Unicode Standard.
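
       The hybrid scheme can be sketched as a character set that answers
       7-bit ASCII lookups from a small table and all other lookups by binary
       search in a sorted list. This is a simplified illustration under stated
       assumptions, not Hunspell's implementation: the class HybridCharSet and
       the function condition_matches are hypothetical, and Python code points
       stand in for 16-bit UTF-16 units, which is only valid inside the Basic
       Multilingual Plane.

```python
from bisect import bisect_left

BMP_LIMIT = 0x10000  # the first 65536 code points


class HybridCharSet:
    """One condition position: ASCII via table lookup, the rest via a sorted list."""

    def __init__(self, chars, negated=False):
        self.negated = negated
        self.ascii_table = bytearray(128)  # one flag per 7-bit ASCII code
        others = set()
        for ch in chars:
            code = ord(ch)
            if code < 128:
                self.ascii_table[code] = 1
            elif code < BMP_LIMIT:
                others.add(code)
            else:
                raise ValueError("character outside the Basic Multilingual Plane")
        self.others = sorted(others)  # sorted list of 16-bit values

    def matches(self, ch):
        code = ord(ch)
        if code < 128:
            found = bool(self.ascii_table[code])
        else:
            i = bisect_left(self.others, code)
            found = i < len(self.others) and self.others[i] == code
        return found != self.negated


def condition_matches(stored_utf8, positions):
    """Check a suffix condition against a word that is stored as UTF-8."""
    word = stored_utf8.decode("utf-8")  # converted only for the check
    n = len(positions)
    if len(word) < n:
        return False
    tail = word[len(word) - n:]
    return all(pos.matches(ch) for pos, ch in zip(positions, tail))


# Condition: a front rounded vowel followed by m, n or l.
condition = [HybridCharSet("\u00f6\u0151\u00fc\u0171"), HybridCharSet("mnl")]
print(condition_matches("\u00c5ngstr\u00f6m".encode("utf-8"), condition))
print(condition_matches("Angstrom".encode("utf-8"), condition))
```

       The 128-entry table keeps the common ASCII case as cheap as in the
       8-bit algorithm, while the sorted list avoids allocating a 65536-entry
       table for every condition position.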
898
899

SEE ALSO

901       hunspell (1), ispell (1), ispell (4)
902
903
904
905
906                                  2005-12-31                       hunspell(4)