Mail::SpamAssassin::Plugin::TextCat(3pm)

Mail::SpamAssassin::Plugin::TextCat(3)  User Contributed Perl Documentation  Mail::SpamAssassin::Plugin::TextCat(3)

NAME

       Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser

SYNOPSIS

         loadplugin     Mail::SpamAssassin::Plugin::TextCat

DESCRIPTION

       This plugin tries to guess the language used in the message body
       text.

       You can use the "ok_languages" directive to set which languages are
       considered okay for incoming mail. If the guessed language is not
       okay, "UNWANTED_LANGUAGE_BODY" is triggered. Alternatively, you can use
       the X-Languages metadata header directly in rules.

       The plugin always adds the results to an "X-Languages" name-value pair
       in the message metadata. This may be useful as Bayes tokens and can
       also be used in rules for scoring. The results can also be added to
       marked-up messages using "add_header" with the _LANGUAGES_ tag. See
       Mail::SpamAssassin::Conf for details.

       Note: the language cannot always be recognized with sufficient
       confidence. In that case, no action is taken.

       You can use the _TEXTCATRESULTS_ tag to view the internal ngram
       scoring, which may help when fine-tuning settings.

       Examples of using the X-Languages header directly in rules:

        header OK_LANGS X-Languages =~ /\ben\b/
        score OK_LANGS -1

        header BAD_LANGS X-Languages =~ /\b(?:ja|zh)\b/
        score BAD_LANGS 1

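       As a minimal sketch of the "add_header" usage mentioned above, the
       following lines add the detected languages (and, optionally, the
       ngram scoring details) to every scanned message. The header name
       "TextCat-Results" is a hypothetical choice for illustration;
       SpamAssassin normally prefixes such header names with "X-Spam-":

        add_header all Languages _LANGUAGES_
        add_header all TextCat-Results _TEXTCATRESULTS_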

USER OPTIONS

       ok_languages xx [ yy zz ... ]      (default: all)
           This option specifies which languages are considered okay for
           incoming mail. SpamAssassin will try to detect the language used
           in the message body text.

           Note that the language cannot always be recognized with sufficient
           confidence. In that case, no action is taken.

           The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the
           languages detected are in the "ok" list. Note that this is the only
           effect of the "ok" list. It does not act as a whitelist against any
           other form of spam scanning.

           In your configuration, you must use the two- or three-letter
           language specifier in lowercase, not the English name for the
           language. You may also specify "all" if a desired language is not
           listed, or if you want to allow any language. The default setting
           is "all".

           Examples:

             ok_languages all         (allow all languages)
             ok_languages en          (only allow English)
             ok_languages en ja zh    (allow English, Japanese, and Chinese)

           Note: if there are multiple ok_languages lines, only the last one
           is used.

           Select the languages to allow from the list below:

           af   - Afrikaans
           am   - Amharic
           ar   - Arabic
           be   - Byelorussian
           bg   - Bulgarian
           bs   - Bosnian
           ca   - Catalan
           cs   - Czech
           cy   - Welsh
           da   - Danish
           de   - German
           el   - Greek
           en   - English
           eo   - Esperanto
           es   - Spanish
           et   - Estonian
           eu   - Basque
           fa   - Persian
           fi   - Finnish
           fr   - French
           fy   - Frisian
           ga   - Irish Gaelic
           gd   - Scottish Gaelic
           he   - Hebrew
           hi   - Hindi
           hr   - Croatian
           hu   - Hungarian
           hy   - Armenian
           id   - Indonesian
           is   - Icelandic
           it   - Italian
           ja   - Japanese
           ka   - Georgian
           ko   - Korean
           la   - Latin
           lt   - Lithuanian
           lv   - Latvian
           mr   - Marathi
           ms   - Malay
           ne   - Nepali
           nl   - Dutch
           no   - Norwegian
           pl   - Polish
           pt   - Portuguese
           qu   - Quechua
           rm   - Rhaeto-Romance
           ro   - Romanian
           ru   - Russian
           sa   - Sanskrit
           sco  - Scots
           sk   - Slovak
           sl   - Slovenian
           sq   - Albanian
           sr   - Serbian
           sv   - Swedish
           sw   - Swahili
           ta   - Tamil
           th   - Thai
           tl   - Tagalog
           tr   - Turkish
           uk   - Ukrainian
           vi   - Vietnamese
           yi   - Yiddish
           zh   - Chinese (both Traditional and Simplified)
           zh.big5   - Chinese (Traditional only)
           zh.gb2312 - Chinese (Simplified only)

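           For illustration, a site that expects only English and German mail
           might combine "ok_languages" with a custom score for the rule it
           triggers. The score value 2.5 is an arbitrary assumption, not a
           recommended setting:

             ok_languages en de
             score UNWANTED_LANGUAGE_BODY 2.5
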
       inactive_languages xx [ yy zz ... ]          (default: see below)
           This option is used to specify which languages will not be
           considered when trying to guess the language. For performance
           reasons, supported languages that have fewer than about 5 million
           speakers are disabled by default. Note that listing a language in
           "ok_languages" automatically enables it for that user.

           The default setting is:

           bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi

           That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian,
           Irish Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian,
           Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.

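           As a sketch, assuming that the directive replaces the default list
           rather than extending it, the following line would re-enable
           detection of Lithuanian and Latvian while keeping the other
           default entries inactive:

             inactive_languages bs cy eo et eu fy ga gd is la rm sa sco sl yi
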
       textcat_max_languages N (default: 3)
           The maximum number of languages any one message can simultaneously
           match before its classification is considered unknown. You can try
           reducing this to 2 or possibly even 1 for more confident results,
           as it's unusual for a message to contain multiple languages.

           See also the description of textcat_acceptable_score, as these
           settings are closely related. Scoring affects how many languages
           might be matched, and this option sets the "false positive limit"
           beyond which the engine is assumed to be unable to decide which
           languages the message really contains.

       textcat_optimal_ngrams N (default: 0)
           If the number of ngrams is lower than this number, they will be
           removed. This can be used to speed up the program for longer
           inputs. For shorter inputs, this should be set to 0.

       textcat_max_ngrams N (default: 400)
           The maximum number of ngrams that should be compared with each of
           the language models (note that each of those models is used
           completely).

       textcat_acceptable_score N (default: 1.02)
           Include any language that scores at least
           "textcat_acceptable_score" in the returned list of languages.

           This setting is essentially a ratio relative to the best score: any
           language whose internal ngram score is within N of the best score
           (for example, within 2 percent for 1.02) is included in the
           results. Values larger than 1.05 are not recommended, as they can
           generate many false matches. A setting of 1.00 would mean that the
           single best-scoring language is always forcibly selected, but this
           is not recommended, because textcat_max_languages can then no
           longer classify the language as uncertain.

           See also the description of textcat_max_languages, as these
           settings are closely related.

           You can use the _TEXTCATRESULTS_ tag to view the internal ngram
           scoring, which may help when fine-tuning settings.

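           As a hedged starting point for tuning, the lines below keep the
           default ratio, lower the language limit, and expose the scoring
           for inspection. The values are illustrative assumptions rather
           than recommendations, and the header name "TextCat-Results" is
           hypothetical:

             textcat_acceptable_score 1.02
             textcat_max_languages 2
             add_header all TextCat-Results _TEXTCATRESULTS_
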
perl v5.32.1                      2021-03-25                      Mail::SpamAssassin::Plugin::TextCat(3)