1Mail::SpamAssassin::Plugin::ExtractText(3) User Contributed Perl Documentation Mail::SpamAssassin::Plugin::ExtractText(3)
2
3
4
6 ExtractText - extracts text from documents.
7
9 loadplugin Mail::SpamAssassin::Plugin::ExtractText
10
11 ifplugin Mail::SpamAssassin::Plugin::ExtractText
12
13 extracttext_external pdftotext /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
14 extracttext_use pdftotext .pdf application/pdf
15
16 # http://docx2txt.sourceforge.net
17 extracttext_external docx2txt /usr/bin/docx2txt {} -
18 extracttext_use docx2txt .docx application/docx
19
20 extracttext_external antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
21 extracttext_use antiword .doc application/(?:vnd\.?)?ms-?word.*
22
23 extracttext_external unrtf /usr/bin/unrtf --nopict {}
24 extracttext_use unrtf .doc .rtf application/rtf text/rtf
25
26 extracttext_external odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
27 extracttext_use odt2txt .odt .ott application/.*?opendocument.*text
28 extracttext_use odt2txt .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
29
30 extracttext_external tesseract {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
31 extracttext_use tesseract .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
32
33 add_header all ExtractText-Flags _EXTRACTTEXTFLAGS_
34 header PDF_NO_TEXT X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
35 describe PDF_NO_TEXT PDF without text
36 score PDF_NO_TEXT 0.001
37
38 header DOC_NO_TEXT X-ExtractText-Flags =~ /\b(?:antiword|openxml|unrtf|odt2txt)_NoText\b/
39 describe DOC_NO_TEXT Document without text
40 score DOC_NO_TEXT 0.001
41
42 header EXTRACTTEXT exists:X-ExtractText-Flags
43 describe EXTRACTTEXT Email processed by extracttext plugin
44 score EXTRACTTEXT 0.001
45
46 endif
47
49 This module uses external tools to extract text from message parts, and
50 then sets the text as the rendered part. The external tool must output
51 plain text, not HTML or any other non-textual format.
52
53 How text is extracted is fully configurable and is based on the MIME
54 part type and file name.
55
57 All configuration lines in user_prefs files will be ignored.
58
59 extracttext_maxparts (default: 10)
60 Configure the maximum number of MIME parts to analyze. A value of 0
61 means all MIME parts will be analyzed.
62
63 extracttext_timeout (default: 5 10)
64 Configure the timeout, in seconds, for external tool checks, per
65 attachment.
66
67 The second argument specifies the maximum total time for all checks.
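The interplay of the two timeout values can be sketched as follows. This is only an illustrative Python model, not the plugin's Perl code: the function name next_timeout and the rule "use the smaller of the per-attachment limit and the remaining total budget" are assumptions made for the example.

```python
def next_timeout(per_part, total, elapsed):
    """Return the time budget (seconds) for the next external tool run.

    per_part: first extracttext_timeout argument (per attachment)
    total:    second extracttext_timeout argument (all checks together)
    elapsed:  seconds already spent on earlier attachments
    Returns None once the total budget is used up (hypothetical policy).
    """
    remaining = total - elapsed
    if remaining <= 0:
        return None
    # Never exceed the per-attachment limit or the remaining total budget.
    return min(per_part, remaining)
```

With the default "5 10", a first attachment would get up to 5 seconds, and an attachment started after 7 seconds of total work would get at most 3 under this model.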
68
69 Tools
70 extracttext_use
71 Specifies what tool to use for what message parts.
72
73 The general syntax is
74
75 extracttext_use "name" "specifiers"
76
77 name
78 the internal name of a tool.
79
80 specifiers
81 File extensions and regular expressions for file names and MIME
82 types. The regular expressions are anchored to the beginning and end.
83
84 Examples
85
86 extracttext_use antiword .doc application/(?:vnd\.?)?ms-?word.*
87 extracttext_use openxml .docx .dotx .dotm application/(?:vnd\.?)openxml.*?word.*
88 extracttext_use openxml .doc .dot application/(?:vnd\.?)?ms-?word.*
89 extracttext_use unrtf .doc .rtf application/rtf text/rtf
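To illustrate what "anchored to beginning and end" means for specifiers, here is a minimal Python sketch. It is not the plugin's implementation: the helper name specifier_matches is hypothetical, and both the case-insensitive extension comparison and the use of Python's regex engine in place of Perl's are assumptions.

```python
import re

def specifier_matches(specifier, filename, mime_type):
    """Check one extracttext_use specifier against a message part."""
    if specifier.startswith("."):
        # File extension specifier, e.g. ".doc".
        # Assumption: the comparison ignores case.
        return filename.lower().endswith(specifier.lower())
    # Regular expression specifier: fullmatch() requires the pattern to
    # cover the whole string, i.e. it is anchored at both ends.
    pattern = re.compile(specifier)
    return bool(pattern.fullmatch(filename) or pattern.fullmatch(mime_type))
```

Under these assumptions, "text/rtf" matches the MIME type "text/rtf" but not "text/rtf-extra", because the whole string must match.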
90
91 extracttext_external
92 Defines an external tool. The tool must read a document on
93 standard input or from a file and write text to standard output.
94
95 The special keyword "{}" will be substituted at runtime with the
96 temporary filename to be scanned by the external tool.
97
98 Environment variables can be defined with "{KEY=VALUE}"; these
99 strings will be removed from the command line.
100
101 The command line must write its result directly to
102 STDOUT.
103
104 The general syntax is
105
106 extracttext_external "name" "command" "parameters"
107
108 name
109 The internal name of this tool.
110
111 command
112 The full path to the external command to run.
113
114 parameters
115 Parameters for the external command. The temporary file name
116 containing the document will be automatically added as the last
117 parameter.
118
119 Examples
120
121 extracttext_external antiword /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
122 extracttext_external unrtf /usr/bin/unrtf --nopict {}
123 extracttext_external odt2txt /usr/bin/odt2txt --encoding=UTF-8 {}
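The handling of "{}" and "{KEY=VALUE}" in a command line can be modeled roughly as below. This is a hedged Python sketch, not the plugin's Perl code: build_command and ENV_TOKEN are hypothetical names, and splitting the command line on whitespace is a simplifying assumption.

```python
import re

# A whole token of the form {KEY=VALUE} defines an environment variable.
ENV_TOKEN = re.compile(r"^\{([A-Za-z_][A-Za-z0-9_]*)=(.*)\}$")

def build_command(template, tmpfile):
    """Split a command template into (environment, argument list)."""
    env, argv = {}, []
    for token in template.split():
        match = ENV_TOKEN.match(token)
        if match:
            # Removed from the command line, kept as an environment variable.
            env[match.group(1)] = match.group(2)
        else:
            # "{}" is replaced by the temporary file name.
            argv.append(token.replace("{}", tmpfile))
    return env, argv
```

For the tesseract line from the synopsis, this model yields the environment {"OMP_THREAD_LIMIT": "1"} and an argument list in which "{}" has become the temporary file path, followed by "-".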
124
125 Metadata
126 The plugin adds some pseudo headers to the message. These headers are
127 seen by the bayes system, and can be used in normal SpamAssassin rules.
128
129 The headers are also available as template tags as noted below.
130
131 Example
132
133 The fictional example headers below are based on a message containing
134 this:
135
136 1 A perfectly normal PDF.
137 2 An OpenXML document with a Word document inside. Neither Office
138 document contains text.
139
140 Headers
141
142 X-ExtractText-Chars
143 Tag: _EXTRACTTEXTCHARS_
144
145 Contains a count of characters that were extracted.
146
147 X-ExtractText-Chars: 10970
148
149 X-ExtractText-Words
150 Tag: _EXTRACTTEXTWORDS_
151
152 Contains a count of "words" that were extracted.
153
154 X-ExtractText-Words: 1599
155
156 X-ExtractText-Tools
157 Tag: _EXTRACTTEXTTOOLS_
158
159 Contains chains of tools used for extraction.
160
161 X-ExtractText-Tools: pdftotext openxml_antiword
162
163 X-ExtractText-Types
164 Tag: _EXTRACTTEXTTYPES_
165
166 Contains chains of MIME types for parts found during extraction.
167
168 X-ExtractText-Types: application/pdf;
169 application/vnd.openxmlformats-officedocument.wordprocessingml.document,
170 application/ms-word
171
172 X-ExtractText-Extensions
173 Tag: _EXTRACTTEXTEXTENSIONS_
174
175 Contains chains of canonicalized file extensions for parts found
176 during extraction.
177
178 X-ExtractText-Extensions: pdf docx
179
180 X-ExtractText-Flags
181 Tag: _EXTRACTTEXTFLAGS_
182
183 Contains notes from the plugin.
184
185 X-ExtractText-Flags: openxml_NoText
186
187 Rules
188
189 Example:
190
191 header PDF_NO_TEXT X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
192 describe PDF_NO_TEXT PDF without text
193
194
195
196perl v5.38.0 2023-07-22 Mail::SpamAssassin::Plugin::ExtractText(3)