PDF2TXT(1)                       User Commands                      PDF2TXT(1)

NAME

       pdf2txt – extract text and images from PDF

SYNOPSIS

       pdf2txt [-h] [--version] [--debug] [--disable-caching]
               [--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]]
               [--pagenos PAGENOS] [--maxpages MAXPAGES]
               [--password PASSWORD] [--rotation ROTATION] [--no-laparams]
               [--detect-vertical] [--char-margin CHAR_MARGIN]
               [--word-margin WORD_MARGIN] [--line-margin LINE_MARGIN]
               [--boxes-flow BOXES_FLOW] [--all-texts] [--outfile OUTFILE]
               [--output_type OUTPUT_TYPE] [--codec CODEC]
               [--output-dir OUTPUT_DIR] [--layoutmode LAYOUTMODE]
               [--scale SCALE] [--strip-control] files [files ...]

DESCRIPTION

       A command-line tool for extracting text and images from PDF files and
       writing the result as plain text, html, xml or tags.

OPTIONS

   POSITIONAL ARGUMENTS
       files  One or more paths to PDF files.

   OPTIONAL ARGUMENTS
       -h, --help
              Show a help message and exit.

       --version, -v
              Show program’s version number and exit.

       --debug, -d
              Use debug logging level.

       --disable-caching, -C
              Disable caching of resources, such as fonts.

   PARSER
       Used during PDF parsing.

       --page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]
              A space-separated list of page numbers to parse.

       --pagenos PAGENOS, -p PAGENOS
              A comma-separated list of page numbers to parse.  Included for
              legacy applications; use --page-numbers for more idiomatic
              argument entry.

       --maxpages MAXPAGES, -m MAXPAGES
              The maximum number of pages to parse.

       --password PASSWORD, -P PASSWORD
              The password to use for decrypting the PDF file.

       --rotation ROTATION, -R ROTATION
              The number of degrees to rotate the PDF before other types of
              processing.
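
       The two page-selection options above differ only in how the list is
       written.  As a minimal sketch, a script that still holds a legacy
       comma-separated --pagenos value could translate it into --page-numbers
       arguments as follows.  The helper name pagenos_to_page_numbers is
       hypothetical and is not part of pdf2txt.

```python
from typing import List


def pagenos_to_page_numbers(pagenos: str) -> List[str]:
    """Turn a legacy --pagenos value such as "1,3,5" into --page-numbers arguments.

    Hypothetical helper for illustration only; it is not shipped with pdf2txt.
    """
    # Split on commas, ignore empty pieces, and check that each piece is an integer.
    pages = [int(part) for part in pagenos.split(",") if part.strip()]
    # --page-numbers takes the pages as separate, space-separated arguments.
    return ["--page-numbers"] + [str(page) for page in pages]


print(pagenos_to_page_numbers("1,3,5"))
# prints ['--page-numbers', '1', '3', '5']
```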

   LAYOUT ANALYSIS
       Used during layout analysis.

       --no-laparams, -n
              Ignore the layout analysis parameters.

       --detect-vertical, -V
              Consider vertical text during layout analysis.

       --char-margin CHAR_MARGIN, -M CHAR_MARGIN
              If two characters are closer together than this margin, they
              are considered to be part of the same line.  The margin is
              specified relative to the width of the character.

       --word-margin WORD_MARGIN, -W WORD_MARGIN
              If two characters on the same line are further apart than this
              margin, they are considered to be two separate words, and an
              intermediate space is added for readability.  The margin is
              specified relative to the width of the character.

       --line-margin LINE_MARGIN, -L LINE_MARGIN
              If two lines are closer together than this margin, they are
              considered to be part of the same paragraph.  The margin is
              specified relative to the height of a line.

       --boxes-flow BOXES_FLOW, -F BOXES_FLOW
              Specifies how much the horizontal and vertical position of a
              text matters when determining the order of lines.  The value
              should be within the range of -1.0 (only horizontal position
              matters) to +1.0 (only vertical position matters).  You can
              also pass None to disable advanced layout analysis and instead
              return text based on the position of the bottom-left corner of
              the text box.

       --all-texts, -A
              Perform layout analysis on text in figures.
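
       The --word-margin rule above can be illustrated with a small toy
       model.  This is a simplified sketch of the described behaviour, not
       the code pdf2txt runs: it assumes each character is given as a
       (text, x0, x1) tuple on a single line, and the function name
       join_line is hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed toy representation: (glyph, left edge, right edge) on one text line.
Char = Tuple[str, float, float]


def join_line(chars: List[Char], word_margin: float) -> str:
    """Join characters, adding a space where the gap exceeds the word margin.

    The margin is taken relative to the width of the current character,
    mirroring the wording of the --word-margin description.
    """
    pieces: List[str] = []
    prev_x1: Optional[float] = None
    for text, x0, x1 in chars:
        width = x1 - x0
        # A gap wider than word_margin * width is treated as a word break.
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            pieces.append(" ")
        pieces.append(text)
        prev_x1 = x1
    return "".join(pieces)


line = [("a", 0.0, 5.0), ("b", 5.2, 10.2), ("c", 13.0, 18.0)]
print(join_line(line, word_margin=0.1))  # prints "ab c"
print(join_line(line, word_margin=1.0))  # prints "abc"
```

       In this toy model a larger margin merges more characters into one
       word, and a smaller margin inserts more spaces.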

   OUTPUT
       Used during output generation.

       --outfile OUTFILE, -o OUTFILE
              Path to the file where output is written, or “-” (default) to
              write to stdout.

       --output_type OUTPUT_TYPE, -t OUTPUT_TYPE
              Type of output to generate {text,html,xml,tag}.

       --codec CODEC, -c CODEC
              Text encoding to use in the output file.

       --output-dir OUTPUT_DIR, -O OUTPUT_DIR
              The output directory to put extracted images in.  If not given,
              images are not extracted.

       --layoutmode LAYOUTMODE, -Y LAYOUTMODE
              Type of layout to use when generating html {normal,exact,loose}.
              If normal, each line is positioned separately in the html.  If
              exact, each character is positioned separately in the html.  If
              loose, the result is the same as normal but with an additional
              newline after each text line.  Only used when output_type is
              html.

       --scale SCALE, -s SCALE
              The amount of zoom to use when generating the html file.  Only
              used when output_type is html.

       --strip-control, -S
              Remove control statements from text.  Only used when
              output_type is xml.
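
       As a sketch of how the output options combine, the following builds
       an argument list for an html conversion using only the flags
       documented above.  The function name build_html_command and the file
       names are hypothetical, and the sketch assumes the tool is installed
       on the PATH as pdf2txt.

```python
from typing import List, Optional


def build_html_command(
    pdf_path: str,
    outfile: str,
    layoutmode: str = "normal",
    scale: Optional[float] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build an argument list for an html conversion (hypothetical helper)."""
    # --layoutmode accepts only the three values listed in the OUTPUT section.
    if layoutmode not in ("normal", "exact", "loose"):
        raise ValueError("layoutmode must be normal, exact or loose")
    argv = [
        "pdf2txt",  # assumed executable name, as used in this manual page
        "--output_type", "html",
        "--layoutmode", layoutmode,
        "--outfile", outfile,
    ]
    if scale is not None:
        argv += ["--scale", str(scale)]
    if output_dir is not None:
        # Without --output-dir, images are not extracted.
        argv += ["--output-dir", output_dir]
    argv.append(pdf_path)
    return argv


# Hypothetical file names, for illustration only.
print(build_html_command("report.pdf", "report.html", layoutmode="exact", scale=1.5))
```

       The resulting list can be handed to subprocess.run(argv, check=True);
       passing arguments as a list avoids shell quoting problems with paths
       that contain spaces.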

SEE ALSO

       dumppdf(1)

                                 October 2021                       PDF2TXT(1)