PDF2TXT(1) User Commands PDF2TXT(1)

NAME
pdf2txt – extract text and images from PDF

SYNOPSIS
pdf2txt [-h] [--version] [--debug] [--disable-caching]
[--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]] [--pagenos PAGENOS]
[--maxpages MAXPAGES] [--password PASSWORD] [--rotation ROTATION]
[--no-laparams] [--detect-vertical] [--char-margin CHAR_MARGIN]
[--word-margin WORD_MARGIN] [--line-margin LINE_MARGIN]
[--boxes-flow BOXES_FLOW] [--all-texts] [--outfile OUTFILE]
[--output_type OUTPUT_TYPE] [--codec CODEC] [--output-dir OUTPUT_DIR]
[--layoutmode LAYOUTMODE] [--scale SCALE] [--strip-control]
files [files ...]

DESCRIPTION
A command line tool that extracts text and images from PDF files and
outputs them as plain text, html, xml or tags.

POSITIONAL ARGUMENTS
files One or more paths to PDF files.

OPTIONAL ARGUMENTS
-h, --help
Show a help message and exit.

--version, -v
Show program’s version number and exit.

--debug, -d
Use debug logging level.

--disable-caching, -C
Disable caching of resources, such as fonts.

PARSER
Used during PDF parsing.

--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]
A space-separated list of page numbers to parse.

--pagenos PAGENOS, -p PAGENOS
A comma-separated list of page numbers to parse. Included for
legacy applications; use --page-numbers for more idiomatic
argument entry.
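The difference between the two page-selection options can be sketched with a small argparse parser. This is a minimal illustration, not the tool's actual source: the parser, the prog name pdf2txt-sketch and the helper selected_pages are hypothetical, and the sketch makes no claim about how the real tool interprets or renumbers the values.

```python
import argparse
from typing import Optional, Set


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mini-parser covering only the two page-selection options."""
    parser = argparse.ArgumentParser(prog="pdf2txt-sketch")
    # Space-separated integers, e.g. --page-numbers 1 3 5
    parser.add_argument("--page-numbers", type=int, nargs="+", default=None)
    # Legacy form: one comma-separated string, e.g. --pagenos 1,3,5
    parser.add_argument("--pagenos", "-p", type=str, default=None)
    parser.add_argument("files", nargs="+")
    return parser


def selected_pages(args: argparse.Namespace) -> Optional[Set[int]]:
    """Return the requested page numbers as written, or None for all pages."""
    if args.page_numbers:
        return set(args.page_numbers)
    if args.pagenos:
        return {int(part) for part in args.pagenos.split(",")}
    return None


if __name__ == "__main__":
    # The file is listed first so that the nargs="+" option does not try
    # to consume it as another page number.
    args = build_parser().parse_args(["doc.pdf", "--page-numbers", "1", "3", "5"])
    print(sorted(selected_pages(args)))
```

Because --page-numbers accepts several values, placing it directly before the file arguments can make a parser of this shape read the file name as a page number; listing the files first avoids that in this sketch.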

--maxpages MAXPAGES, -m MAXPAGES
The maximum number of pages to parse.

--password PASSWORD, -P PASSWORD
The password to use for decrypting the PDF file.

--rotation ROTATION, -R ROTATION
The number of degrees to rotate the PDF before other types of
processing.

LAYOUT ANALYSIS
Used during layout analysis.

--no-laparams, -n
Ignore layout analysis parameters.

--detect-vertical, -V
Consider vertical text during layout analysis.

--char-margin CHAR_MARGIN, -M CHAR_MARGIN
If two characters are closer together than this margin they are
considered to be part of the same line. The margin is specified
relative to the width of the character.

--word-margin WORD_MARGIN, -W WORD_MARGIN
If two characters on the same line are further apart than this
margin then they are considered to be two separate words, and an
intermediate space will be added for readability. The margin is
specified relative to the width of the character.
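The interplay of the two margins can be illustrated with a one-dimensional toy model. This is only a sketch under stated assumptions, not the tool's layout algorithm: the Char class and group_chars function are hypothetical, characters are reduced to their left and right edges, and the margin is scaled by the wider of the two neighbouring characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    text: str
    x0: float  # left edge
    x1: float  # right edge

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def group_chars(chars: List[Char], char_margin: float, word_margin: float) -> List[str]:
    """Split characters into lines, inserting a space between separate words."""
    lines: List[str] = []
    current = ""
    prev: Optional[Char] = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev is None:
            current = ch.text
        else:
            gap = ch.x0 - prev.x1
            # Assumption: both margins are relative to the wider character.
            width = max(prev.width, ch.width)
            if gap < char_margin * width:
                # Close enough to share a line; a larger gap marks a new word.
                if gap > word_margin * width:
                    current += " "
                current += ch.text
            else:
                # Too far apart: start a new line.
                lines.append(current)
                current = ch.text
        prev = ch
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    sample = [
        Char("H", 0.0, 10.0), Char("i", 10.5, 20.5),
        Char("y", 30.0, 40.0), Char("o", 40.5, 50.5),
        Char("Z", 100.0, 110.0),
    ]
    # Illustrative margin values, chosen only for this example.
    print(group_chars(sample, char_margin=2.0, word_margin=0.1))
```

In this toy, raising char_margin merges more distant characters into one line, while raising word_margin makes the model less eager to insert spaces.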

--line-margin LINE_MARGIN, -L LINE_MARGIN
If two lines are closer together than this margin they are
considered to be part of the same paragraph. The margin is
specified relative to the height of a line.

--boxes-flow BOXES_FLOW, -F BOXES_FLOW
Specifies how much the horizontal and vertical position of a text
matters when determining the order of lines. The value should
be within the range of -1.0 (only horizontal position matters)
to +1.0 (only vertical position matters). You can also pass
None to disable advanced layout analysis, and instead return
text based on the position of the bottom left corner of the text
box.
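One way to picture the effect of this value is a weighted sort key that blends the horizontal and vertical position of each text box. The sketch below is a toy under stated assumptions, not the tool's implementation: the Box type, the order_boxes function and the exact weighting formula are illustrative, and coordinates are assumed to follow the PDF convention of an origin at the bottom left with y increasing upward.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    name: str
    x0: float  # left edge
    y0: float  # bottom edge
    y1: float  # top edge


def order_boxes(boxes: List[Box], boxes_flow: Optional[float]) -> List[str]:
    """Return box names in reading order for a given boxes_flow value."""
    if boxes_flow is None:
        # No advanced analysis: order by bottom-left corner,
        # higher boxes first, then left to right.
        ordered = sorted(boxes, key=lambda b: (-b.y0, b.x0))
    else:
        if not -1.0 <= boxes_flow <= 1.0:
            raise ValueError("boxes_flow must be within -1.0 and +1.0")
        # Assumed weighting: at -1.0 only x0 contributes,
        # at +1.0 only the vertical position contributes.
        ordered = sorted(
            boxes,
            key=lambda b: (1 - boxes_flow) * b.x0 - (1 + boxes_flow) * (b.y0 + b.y1),
        )
    return [b.name for b in ordered]


if __name__ == "__main__":
    page = [
        Box("A", x0=0.0, y0=100.0, y1=110.0),
        Box("B", x0=200.0, y0=105.0, y1=115.0),
        Box("C", x0=0.0, y0=50.0, y1=60.0),
    ]
    for flow in (1.0, -1.0, None):
        print(flow, order_boxes(page, flow))
```

With +1.0 the slightly higher box B is read before A even though it sits far to the right; with -1.0 the two left-aligned boxes A and C come first regardless of height.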

--all-texts, -A
Perform layout analysis on text in figures.

OUTPUT
Used during output generation.

--outfile OUTFILE, -o OUTFILE
Path to the file where output is written, or “-” (default) to write
to stdout.

--output_type OUTPUT_TYPE, -t OUTPUT_TYPE
Type of output to generate {text,html,xml,tag}.

--codec CODEC, -c CODEC
Text encoding to use in output file.

--output-dir OUTPUT_DIR, -O OUTPUT_DIR
The output directory to put extracted images in. If not given,
images are not extracted.

--layoutmode LAYOUTMODE, -Y LAYOUTMODE
Type of layout to use when generating html {normal,exact,loose}.
If normal, each line is positioned separately in the html. If
exact, each character is positioned separately in the html. If
loose, same result as normal but with an additional newline after
each text line. Only used when output_type is html.

--scale SCALE, -s SCALE
The amount of zoom to use when generating the html file. Only used
when output_type is html.

--strip-control, -S
Remove control statements from text. Only used when output_type
is xml.
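When the tool is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, assuming the executable is installed under the name pdf2txt; the helper build_pdf2txt_argv and the file names are hypothetical. It only builds and prints the command, and uses no flags beyond those documented above.

```python
import shlex
from typing import List, Optional, Sequence

# Output types listed for --output_type above.
OUTPUT_TYPES = ("text", "html", "xml", "tag")


def build_pdf2txt_argv(
    files: Sequence[str],
    output_type: str = "text",
    outfile: str = "-",
    maxpages: Optional[int] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for pdf2txt (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type}")
    argv = ["pdf2txt", "--output_type", output_type, "--outfile", outfile]
    if maxpages is not None:
        argv += ["--maxpages", str(maxpages)]
    if output_dir is not None:
        # Images are only extracted when an output directory is given.
        argv += ["--output-dir", output_dir]
    argv += list(files)
    return argv


if __name__ == "__main__":
    argv = build_pdf2txt_argv(["report.pdf"], output_type="html", outfile="report.html")
    # Print a shell-quoted form of the command instead of running it.
    print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run on a system where the tool is available; keeping it as a list rather than a single string avoids shell quoting problems with unusual file names.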

SEE ALSO
dumppdf(1)

October 2021 PDF2TXT(1)