Mail::SpamAssassin::Plugin::ExtractText(3)  User Contributed Perl Documentation  Mail::SpamAssassin::Plugin::ExtractText(3)

NAME

       ExtractText - extracts text from documents.

SYNOPSIS

       loadplugin Mail::SpamAssassin::Plugin::ExtractText

       ifplugin Mail::SpamAssassin::Plugin::ExtractText

         extracttext_external  pdftotext  /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
         extracttext_use       pdftotext  .pdf application/pdf

         # http://docx2txt.sourceforge.net
         extracttext_external  docx2txt   /usr/bin/docx2txt {} -
         extracttext_use       docx2txt   .docx application/docx

         extracttext_external  antiword   /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
         extracttext_use       antiword   .doc application/(?:vnd\.?)?ms-?word.*

         extracttext_external  unrtf      /usr/bin/unrtf --nopict {}
         extracttext_use       unrtf      .doc .rtf application/rtf text/rtf

         extracttext_external  odt2txt    /usr/bin/odt2txt --encoding=UTF-8 {}
         extracttext_use       odt2txt    .odt .ott application/.*?opendocument.*text
         extracttext_use       odt2txt    .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter

         extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
         extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)

         add_header   all          ExtractText-Flags _EXTRACTTEXTFLAGS_
         header       PDF_NO_TEXT  X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
         describe     PDF_NO_TEXT  PDF without text
         score        PDF_NO_TEXT  0.001

         header       DOC_NO_TEXT  X-ExtractText-Flags =~ /\b(?:antiword|openxml|unrtf|odt2txt)_NoText\b/
         describe     DOC_NO_TEXT  Document without text
         score        DOC_NO_TEXT  0.001

         header       EXTRACTTEXT  exists:X-ExtractText-Flags
         describe     EXTRACTTEXT  Email processed by extracttext plugin
         score        EXTRACTTEXT  0.001

       endif

DESCRIPTION

       This module uses external tools to extract text from message parts and
       then sets that text as the rendered part. The external tool must output
       plain text, not HTML or any other non-textual result.

       How text is extracted is fully configurable and is based on the MIME
       part type and file name.

CONFIGURATION

       All configuration lines in user_prefs files will be ignored.

       extracttext_maxparts (default: 10)
           Sets the maximum number of MIME parts to analyze. A value of 0
           means that all MIME parts are analyzed.

       extracttext_timeout (default: 5 10)
           Sets the timeout, in seconds, for external tool checks, per
           attachment.

           The second argument specifies the maximum total time for all checks.
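
           As an illustration, both limits could be tightened on a busy
           server. This is a sketch only: the values 5, 3 and 8 are
           assumptions chosen for the example, not recommendations.

           ```
           ifplugin Mail::SpamAssassin::Plugin::ExtractText
             # Analyze at most 5 MIME parts per message (assumed value).
             extracttext_maxparts  5
             # Allow 3 seconds per attachment and 8 seconds for all checks
             # together (assumed values).
             extracttext_timeout   3 8
           endif
           ```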

   Tools
       extracttext_use
           Specifies which tool to use for which message parts.

           The general syntax is

           extracttext_use  "name"  "specifiers"

       name
           The internal name of a tool.

       specifiers
           File extensions and regular expressions for file names and MIME
           types. The regular expressions are anchored at the beginning and
           the end.

       Examples

               extracttext_use  antiword  .doc application/(?:vnd\.?)?ms-?word.*
               extracttext_use  openxml   .docx .dotx .dotm application/(?:vnd\.?)openxml.*?word.*
               extracttext_use  openxml   .doc .dot application/(?:vnd\.?)?ms-?word.*
               extracttext_use  unrtf     .doc .rtf application/rtf text/rtf
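
       Because the expressions are anchored, a pattern has to cover the whole
       MIME type. The sketch below reuses the pdftotext tool from the
       SYNOPSIS; the "x-" variant is an assumption added for illustration.

       ```
       # Matches application/pdf only. A longer type such as
       # application/pdfx does not match, since the pattern is anchored.
       extracttext_use  pdftotext  .pdf application/pdf

       # Also accepts an optional "x-" prefix (assumed variant).
       extracttext_use  pdftotext  .pdf application/(?:x-)?pdf
       ```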

       extracttext_external
           Defines an external tool. The tool must read a document on
           standard input or from a file and write text to standard output.

           The special keyword "{}" will be substituted at runtime with the
           temporary filename to be scanned by the external tool.

           Environment variables can be defined with "{KEY=VALUE}". These
           strings are removed from the command line.

           The command line must write its result directly to STDOUT.

           The general syntax is

           extracttext_external "name" "command" "parameters"

       name
           The internal name of this tool.

       command
           The full path to the external command to run.

       parameters
           Parameters for the external command. The temporary file name
           containing the document will be automatically added as the last
           parameter.

       Examples

               extracttext_external  antiword  /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
               extracttext_external  unrtf     /usr/bin/unrtf --nopict {}
               extracttext_external  odt2txt   /usr/bin/odt2txt --encoding=UTF-8 {}
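
       The tesseract definition from the SYNOPSIS combines all of these
       elements. The comments are an interpretation of the rules described
       above, not output from the plugin.

       ```
       # {OMP_THREAD_LIMIT=1} sets an environment variable and is removed
       # from the command line before the tool is run.
       # {} is replaced with the temporary file name of the attachment.
       # The trailing "-" tells tesseract to write its result to STDOUT.
       extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
       extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
       ```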

   Metadata
       The plugin adds some pseudo headers to the message. These headers are
       seen by the bayes system, and can be used in normal SpamAssassin rules.

       The headers are also available as template tags as noted below.

       Example

       The fictional example headers below are based on a message containing
       this:

       1. A perfectly normal PDF.
       2. An OpenXML document with a Word document inside. Neither Office
          document contains text.

       Headers

       X-ExtractText-Chars
           Tag: _EXTRACTTEXTCHARS_

           Contains a count of characters that were extracted.

           X-ExtractText-Chars: 10970

       X-ExtractText-Words
           Tag: _EXTRACTTEXTWORDS_

           Contains a count of "words" that were extracted.

           X-ExtractText-Words: 1599

       X-ExtractText-Tools
           Tag: _EXTRACTTEXTTOOLS_

           Contains chains of tools used for extraction.

           X-ExtractText-Tools: pdftotext openxml_antiword

       X-ExtractText-Types
           Tag: _EXTRACTTEXTTYPES_

           Contains chains of MIME types for parts found during extraction.

           X-ExtractText-Types: application/pdf;
           application/vnd.openxmlformats-officedocument.wordprocessingml.document,
           application/ms-word

       X-ExtractText-Extensions
           Tag: _EXTRACTTEXTEXTENSIONS_

           Contains chains of canonicalized file extensions for parts found
           during extraction.

           X-ExtractText-Extensions: pdf docx

       X-ExtractText-Flags
           Tag: _EXTRACTTEXTFLAGS_

           Contains notes from the plugin.

           X-ExtractText-Flags: openxml_NoText

       Rules

       Example:

               header    PDF_NO_TEXT  X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
               describe  PDF_NO_TEXT  PDF without text
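
       Other pseudo headers can be tested in the same way. A sketch, assuming
       the tesseract tool from the SYNOPSIS is configured; the rule name
       EXTRACT_OCR_USED is hypothetical.

       ```
       header    EXTRACT_OCR_USED  X-ExtractText-Tools =~ /tesseract/
       describe  EXTRACT_OCR_USED  Text was extracted with tesseract
       score     EXTRACT_OCR_USED  0.001
       ```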

perl v5.38.0                      2023-07-22  Mail::SpamAssassin::Plugin::ExtractText(3)