Mail::SpamAssassin::Plugin::TextCat(3)  User Contributed Perl Documentation  Mail::SpamAssassin::Plugin::TextCat(3)

Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser

    loadplugin Mail::SpamAssassin::Plugin::TextCat

This plugin tries to guess the language used in the message body
text.

You can use the "ok_languages" directive to set which languages are
considered okay for incoming mail. If the guessed language is not
okay, "UNWANTED_LANGUAGE_BODY" is triggered. Alternatively, you can use
the "X-Languages" metadata header directly in rules.

The plugin always adds the results to an "X-Languages" name-value pair
in the message metadata data structure. This may be useful as Bayes
tokens and can also be used in rules for scoring. The results can also
be added to marked-up messages using "add_header" with the _LANGUAGES_
tag. See Mail::SpamAssassin::Conf for details.

Note: the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.

You can use the _TEXTCATRESULTS_ tag to view the internal ngram
scoring, which might help when fine-tuning settings.

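For example, both tags can be exposed as headers on scanned mail while
tuning. This is only a sketch: the header names "Languages" and
"TextCat-Results" are arbitrary choices for illustration, and
SpamAssassin prefixes headers added this way with "X-Spam-". The
"ifplugin" guard skips the lines when the plugin is not loaded.

    ifplugin Mail::SpamAssassin::Plugin::TextCat
      # guessed language codes, e.g. for inspection in delivered mail
      add_header all Languages _LANGUAGES_
      # internal ngram scoring, useful while adjusting the textcat_* options
      add_header all TextCat-Results _TEXTCATRESULTS_
    endif
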
Examples of using the "X-Languages" header directly in rules:

    header OK_LANGS X-Languages =~ /\ben\b/
    score OK_LANGS -1

    header BAD_LANGS X-Languages =~ /\b(?:ja|zh)\b/
    score BAD_LANGS 1

ok_languages xx [ yy zz ... ] (default: all)
    This option is used to specify which languages are considered okay
    for incoming mail. SpamAssassin will try to detect the language
    used in the message body text.

    Note that the language cannot always be recognized with sufficient
    confidence. In that case, no action is taken.

    The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the
    languages detected are in the "ok" list. Note that this is the only
    effect of the "ok" list. It does not act as a whitelist against any
    other form of spam scanning.

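    Because the list only controls whether "UNWANTED_LANGUAGE_BODY"
    fires, the weight of a mismatch is set separately with a "score"
    line. A sketch, in which the value 2.0 is an arbitrary illustration
    and not a recommended or default score:

        # allow English and German; anything else triggers the rule
        ok_languages en de
        # illustrative weight only
        score UNWANTED_LANGUAGE_BODY 2.0
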
    In your configuration, you must use the two- or three-letter
    language specifier in lowercase, not the English name for the
    language. You may also specify "all" if a desired language is not
    listed, or if you want to allow any language. The default setting
    is "all".

    Examples:

        ok_languages all         (allow all languages)
        ok_languages en          (only allow English)
        ok_languages en ja zh    (allow English, Japanese, and Chinese)

    Note: if there are multiple ok_languages lines, only the last one
    is used.

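    To illustrate the note above: in the following sketch only English
    ends up allowed, because the second line replaces the first instead
    of being merged with it.

        ok_languages fr de
        # the line below replaces the one above; fr and de are dropped
        ok_languages en

    To allow all three, list them on a single line:

        ok_languages en fr de
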
    Select the languages to allow from the list below:

        af        - Afrikaans
        am        - Amharic
        ar        - Arabic
        be        - Byelorussian
        bg        - Bulgarian
        bs        - Bosnian
        ca        - Catalan
        cs        - Czech
        cy        - Welsh
        da        - Danish
        de        - German
        el        - Greek
        en        - English
        eo        - Esperanto
        es        - Spanish
        et        - Estonian
        eu        - Basque
        fa        - Persian
        fi        - Finnish
        fr        - French
        fy        - Frisian
        ga        - Irish Gaelic
        gd        - Scottish Gaelic
        he        - Hebrew
        hi        - Hindi
        hr        - Croatian
        hu        - Hungarian
        hy        - Armenian
        id        - Indonesian
        is        - Icelandic
        it        - Italian
        ja        - Japanese
        ka        - Georgian
        ko        - Korean
        la        - Latin
        lt        - Lithuanian
        lv        - Latvian
        mr        - Marathi
        ms        - Malay
        ne        - Nepali
        nl        - Dutch
        no        - Norwegian
        pl        - Polish
        pt        - Portuguese
        qu        - Quechua
        rm        - Rhaeto-Romance
        ro        - Romanian
        ru        - Russian
        sa        - Sanskrit
        sco       - Scots
        sk        - Slovak
        sl        - Slovenian
        sq        - Albanian
        sr        - Serbian
        sv        - Swedish
        sw        - Swahili
        ta        - Tamil
        th        - Thai
        tl        - Tagalog
        tr        - Turkish
        uk        - Ukrainian
        vi        - Vietnamese
        yi        - Yiddish
        zh        - Chinese (both Traditional and Simplified)
        zh.big5   - Chinese (Traditional only)
        zh.gb2312 - Chinese (Simplified only)

inactive_languages xx [ yy zz ... ] (default: see below)
    This option is used to specify which languages will not be
    considered when trying to guess the language. For performance
    reasons, supported languages that have fewer than about 5 million
    speakers are disabled by default. Note that listing a language in
    "ok_languages" automatically enables it for that user.

    The default setting is:

        bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi

    That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian,
    Irish Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian,
    Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.

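    As a sketch, a site that wants Icelandic and Lithuanian to be
    detected without adding them to "ok_languages" could restate the
    default list with "is" and "lt" left out. This assumes that the
    directive replaces the default list as a whole and does not append
    to it, which this page does not state explicitly.

        # default list minus "is" and "lt"
        inactive_languages bs cy eo et eu fy ga gd la lv rm sa sco sl yi
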
textcat_max_languages N (default: 3)
    The maximum number of languages any one message can simultaneously
    match before its classification is considered unknown. You can try
    reducing this to 2 or possibly even 1 for more confident results,
    as it's unusual for a message to contain multiple languages.

    Read the description for textcat_acceptable_score as well, as these
    settings are closely related. Scoring affects how many languages
    might be matched, and here we set the "false positive limit" beyond
    which we consider the engine unable to decide which languages the
    message really contains.

textcat_optimal_ngrams N (default: 0)
    If the number of ngrams is lower than this number, then they will
    be removed. This can be used to speed up the program for longer
    inputs. For shorter inputs, this should be set to 0.

textcat_max_ngrams N (default: 400)
    The maximum number of ngrams that should be compared with each of
    the language models (note that each of those models is used
    completely).

textcat_acceptable_score N (default: 1.02)
    Include any language that scores at least
    "textcat_acceptable_score" in the returned list of languages.

    This setting is basically a percentage range expressed as a ratio.
    Any language whose internal ngram score is within that ratio of the
    best score (for example, within 2 percent for 1.02) is included in
    the results. Values larger than 1.05 are not recommended, as they
    can generate many false matches. A setting of 1.00 would mean that
    the single best-scoring language is always forcibly selected, but
    this is not recommended, as textcat_max_languages then cannot do
    its job of classifying the language as uncertain.

    Read the description for textcat_max_languages as well, as these
    settings are closely related.

    You can use the _TEXTCATRESULTS_ tag to view the internal ngram
    scoring, which might help when fine-tuning settings.

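    A worked sketch of how textcat_acceptable_score and
    textcat_max_languages might interact. The scores are made up for
    illustration, and the sketch assumes that a lower ngram score is
    the better match, which this page does not state.

        Hypothetical scores:  en 1000   nl 1015   de 1018   fr 1030
        Cutoff:               1000 * 1.02 = 1020
        Matched:              en nl de  (fr is outside the cutoff)

        textcat_max_languages 3  ->  3 matches, within the limit: en nl de
        textcat_max_languages 2  ->  3 matches, over the limit: unknown

    A stricter configuration along those lines could look like this:

        # treat a message as unknown if more than two languages match
        textcat_max_languages 2
        # keep the default 2 percent window around the best score
        textcat_acceptable_score 1.02
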
perl v5.34.0                      2021-07-23                      Mail::SpamAssassin::Plugin::TextCat(3)