Mail::SpamAssassin::Plugin::TextCat(3pm)

Mail::SpamAssassin::Plugin::TextCat(3)   User Contributed Perl Documentation   Mail::SpamAssassin::Plugin::TextCat(3)

NAME

       Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser

SYNOPSIS

         loadplugin     Mail::SpamAssassin::Plugin::TextCat

DESCRIPTION

       This plugin tries to guess the language used in the message body
       text.

       You can use the "ok_languages" directive to set which languages are
       considered okay for incoming mail. If the guessed language is not
       okay, "UNWANTED_LANGUAGE_BODY" is triggered.

       The plugin always adds the results to an "X-Language" name-value pair
       in the message metadata data structure. This may be useful as Bayes
       tokens and can also be used in rules for scoring. The results can also
       be added to marked-up messages using "add_header" with the _LANGUAGES_
       tag. See Mail::SpamAssassin::Conf for details.

       Note: the language cannot always be recognized with sufficient
       confidence.  In that case, no action is taken.

       You can use the _TEXTCATRESULTS_ tag to view the internal ngram
       scoring, which may help when fine-tuning settings.
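
       As a minimal sketch of the two tags described in this section, the
       following lines could be placed in a configuration file such as
       local.cf. The header names "Languages" and "TextCatResults" are
       illustrative choices, not defaults; "add_header" prefixes each name
       with "X-Spam-".

         add_header all Languages _LANGUAGES_
         add_header all TextCatResults _TEXTCATRESULTS_

       The second line is meant for tuning only and can be removed once the
       settings are satisfactory.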

USER OPTIONS

       ok_languages xx [ yy zz ... ]      (default: all)
           This option specifies which languages are considered okay for
           incoming mail.  SpamAssassin will try to detect the language
           used in the message body text.

           Note that the language cannot always be recognized with sufficient
           confidence. In that case, no action is taken.

           The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the
           languages detected are in the "ok" list. Note that this is the only
           effect of the "ok" list. It does not act as a whitelist against any
           other form of spam scanning.

           In your configuration, you must use the two- or three-letter
           language specifier in lowercase, not the English name for the
           language.  You may also specify "all" if a desired language is not
           listed, or if you want to allow any language.  The default setting
           is "all".

           Examples:

             ok_languages all         (allow all languages)
             ok_languages en          (only allow English)
             ok_languages en ja zh    (allow English, Japanese, and Chinese)

           Note: if there are multiple ok_languages lines, only the last one
           is used.
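
           As a sketch of how these pieces fit together, the lines below
           restrict the "ok" list and adjust the score of the rule. The
           score value 3.0 is an illustrative assumption, not a default or
           a recommendation.

             ok_languages en          # overridden by the next line
             ok_languages en de fr    # only this line takes effect
             score UNWANTED_LANGUAGE_BODY 3.0

           With this configuration, a message confidently detected as, say,
           Russian triggers "UNWANTED_LANGUAGE_BODY", while a message whose
           language cannot be determined triggers nothing.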

           Select the languages to allow from the list below:

           af   - Afrikaans
           am   - Amharic
           ar   - Arabic
           be   - Byelorussian
           bg   - Bulgarian
           bs   - Bosnian
           ca   - Catalan
           cs   - Czech
           cy   - Welsh
           da   - Danish
           de   - German
           el   - Greek
           en   - English
           eo   - Esperanto
           es   - Spanish
           et   - Estonian
           eu   - Basque
           fa   - Persian
           fi   - Finnish
           fr   - French
           fy   - Frisian
           ga   - Irish Gaelic
           gd   - Scottish Gaelic
           he   - Hebrew
           hi   - Hindi
           hr   - Croatian
           hu   - Hungarian
           hy   - Armenian
           id   - Indonesian
           is   - Icelandic
           it   - Italian
           ja   - Japanese
           ka   - Georgian
           ko   - Korean
           la   - Latin
           lt   - Lithuanian
           lv   - Latvian
           mr   - Marathi
           ms   - Malay
           ne   - Nepali
           nl   - Dutch
           no   - Norwegian
           pl   - Polish
           pt   - Portuguese
           qu   - Quechua
           rm   - Rhaeto-Romance
           ro   - Romanian
           ru   - Russian
           sa   - Sanskrit
           sco  - Scots
           sk   - Slovak
           sl   - Slovenian
           sq   - Albanian
           sr   - Serbian
           sv   - Swedish
           sw   - Swahili
           ta   - Tamil
           th   - Thai
           tl   - Tagalog
           tr   - Turkish
           uk   - Ukrainian
           vi   - Vietnamese
           yi   - Yiddish
           zh   - Chinese (both Traditional and Simplified)
           zh.big5   - Chinese (Traditional only)
           zh.gb2312 - Chinese (Simplified only)

       inactive_languages xx [ yy zz ... ]          (default: see below)
           This option specifies which languages will not be considered
           when trying to guess the language.  For performance reasons,
           supported languages that have fewer than about 5 million
           speakers are disabled by default.  Note that listing a language in
           "ok_languages" automatically enables it for that user.

           The default setting is:

           bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi

           That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian,
           Irish Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian,
           Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.
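
           For illustration, and assuming the directive replaces the whole
           default list rather than appending to it, the following line
           keeps the defaults but lets the guesser consider Lithuanian and
           Latvian by omitting "lt" and "lv":

             inactive_languages bs cy eo et eu fy ga gd is la rm sa sco sl yi

           Alternatively, since listing a language in "ok_languages" enables
           it automatically, adding "lt lv" there also activates both
           languages for that user.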

       textcat_max_languages N (default: 3)
           The maximum number of languages any one message can simultaneously
           match before its classification is considered unknown.  You can try
           reducing this to 2 or possibly even 1 for more confident results,
           as it's unusual for a message to contain multiple languages.

           See also the description of textcat_acceptable_score, as these
           settings are closely related.  Scoring affects how many languages
           might be matched, and this option sets the "false positive limit"
           beyond which the engine is considered unable to decide which
           languages the message really contains.
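
           A minimal sketch of a stricter setup, using an illustrative value
           rather than a recommendation:

             textcat_max_languages 1

           With this setting, a message for which two or more languages fall
           within the acceptable score range is treated as unknown, and no
           action is taken for it.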

       textcat_optimal_ngrams N (default: 0)
           If the number of occurrences of an ngram is lower than this
           number, the ngram is removed.  This can be used to speed up the
           program for longer inputs.  For shorter inputs, this should be
           set to 0.

       textcat_max_ngrams N (default: 400)
           The maximum number of ngrams that should be compared with each of
           the language models (note that each of those models is used
           completely).

       textcat_acceptable_score N (default: 1.02)
           Include any language that scores at least
           "textcat_acceptable_score" in the returned list of languages.

           This setting is basically a percentile range. Any language whose
           internal ngram score is within this factor of the best score
           (2 percent for the default of 1.02) is included in the results.
           Values larger than 1.05 are not recommended, as they can generate
           many false matches.  A setting of 1.00 would mean that the single
           best-scoring language is always forcibly selected, but this is
           not recommended, as textcat_max_languages then can't do its job
           of classifying the language as uncertain.

           See also the description of textcat_max_languages, as these
           settings are closely related.

           You can use the _TEXTCATRESULTS_ tag to view the internal ngram
           scoring, which may help when fine-tuning settings.
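
           As a worked illustration with hypothetical scores (the numbers,
           and the assumption that a lower score is better, are not taken
           from this document): with textcat_acceptable_score 1.02 and a
           best score of 10000, the cutoff is 10000 * 1.02 = 10200.

             en  10000   (best score, included)
             nl  10150   (within 2 percent of the best, included)
             de  10400   (beyond the cutoff, excluded)

           Two languages match here, which is within the default
           textcat_max_languages of 3, so both are reported. With
           textcat_acceptable_score 1.00, only "en" would be selected.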

perl v5.28.0                      2018-09-14   Mail::SpamAssassin::Plugin::TextCat(3)