hmmscan(1) HMMER Manual hmmscan(1)
2
3
4
6 hmmscan - search protein sequence(s) against a protein profile database
7
8
9
11 hmmscan [options] <hmmdb> <seqfile>
12
13
14
15
17 hmmscan is used to search protein sequences against collections of pro‐
18 tein profiles. For each sequence in <seqfile>, use that query sequence
19 to search the target database of profiles in <hmmdb>, and output ranked
20 lists of the profiles with the most significant matches to the
21 sequence.
22
23
24 The <seqfile> may contain more than one query sequence. It can be in
25 FASTA format, or several other common sequence file formats (genbank,
26 embl, and uniprot, among others), or in alignment file formats (stock‐
27 holm, aligned fasta, and others). See the --qformat option for a com‐
28 plete list.
29
30
The <hmmdb> must first be prepared with hmmpress before hmmscan can
search it. hmmpress creates four binary files, suffixed .h3f, .h3i,
.h3m, and .h3p.
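
A pre-flight check can confirm that a database has been pressed before a search is launched. The sketch below is only an illustration: the function name is hypothetical, and it assumes the four auxiliary files sit next to the profile file with each suffix appended to the full <hmmdb> path.

```python
from pathlib import Path

# Suffixes of the four auxiliary binary files (.h3{fimp}).
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmmdb):
    """Return the auxiliary files that are absent for profile database `hmmdb`.

    Assumption: each suffix is appended to the full database path,
    e.g. toy.hmm -> toy.hmm.h3f.
    """
    candidates = [Path(str(hmmdb) + suffix) for suffix in PRESSED_SUFFIXES]
    return [path for path in candidates if not path.is_file()]
```

An empty list means all four files were found; otherwise the database probably still needs to be run through hmmpress.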
34
35
36 The query <seqfile> may be '-' (a dash character), in which case the
37 query sequences are read from a <stdin> pipe instead of from a file.
38 The <hmmdb> cannot be read from a <stdin> stream, because it needs to
39 have those four auxiliary binary files generated by hmmpress.
40
41
42 The output format is designed to be human-readable, but is often so
43 voluminous that reading it is impractical, and parsing it is a pain.
44 The --tblout and --domtblout options save output in simple tabular for‐
45 mats that are concise and easier to parse. The -o option allows redi‐
46 recting the main output, including throwing it away in /dev/null.
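
As an illustration of why the tabular output is easier to parse, here is a minimal Python sketch of a --tblout reader. The column layout is an assumption taken from the HMMER user guide rather than from this page: lines starting with '#' are comments, and 18 whitespace-separated fields precede a free-text description. The function name and dictionary keys are hypothetical.

```python
def parse_tblout(lines):
    """Yield one dict per data line of a --tblout file.

    Assumed layout (not stated in this man page): '#' starts a comment
    line, and 18 whitespace-separated fields come before the description.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # Split at most 18 times so the description stays in one piece.
        fields = line.split(None, 18)
        yield {
            "target": fields[0],
            "target_acc": fields[1],
            "query": fields[2],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        }
```

The function accepts any iterable of lines, so an open file handle can be passed directly.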
47
48
49
50
52 -h Help; print a brief reminder of command line usage and all
53 available options.
54
55
56
57
59 -o <f> Direct the main human-readable output to a file <f> instead of
60 the default stdout.
61
62
63 --tblout <f>
64 Save a simple tabular (space-delimited) file summarizing the
65 per-target output, with one data line per homologous target
66 model found.
67
68
69 --domtblout <f>
70 Save a simple tabular (space-delimited) file summarizing the
71 per-domain output, with one data line per homologous domain
72 detected in a query sequence for each homologous model.
73
74
75 --pfamtblout <f>
76 Save an especially succinct tabular (space-delimited) file sum‐
77 marizing the per-target output, with one data line per homolo‐
78 gous target model found.
79
80
81
82 --acc Use accessions instead of names in the main output, where avail‐
83 able for profiles and/or sequences.
84
85
86 --noali
87 Omit the alignment section from the main output. This can
88 greatly reduce the output volume.
89
90
91 --notextw
92 Unlimit the length of each line in the main output. The default
93 is a limit of 120 characters per line, which helps in displaying
94 the output cleanly on terminals and in editors, but can truncate
95 target profile description lines.
96
97
98 --textw <n>
99 Set the main output's line length limit to <n> characters per
100 line. The default is 120.
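
The output options above combine naturally when hmmscan is driven from a script. The following Python sketch only assembles an argument list in the synopsis order; the function and parameter names are hypothetical, and only flags documented on this page are used.

```python
def build_hmmscan_argv(hmmdb, seqfile="-", tblout=None, domtblout=None,
                       main_output=None, noali=False, textw=None):
    """Assemble an hmmscan argument list from the output options above."""
    argv = ["hmmscan"]
    if main_output is not None:
        argv += ["-o", main_output]        # redirect the main output
    if tblout is not None:
        argv += ["--tblout", tblout]       # per-target table
    if domtblout is not None:
        argv += ["--domtblout", domtblout]  # per-domain table
    if noali:
        argv.append("--noali")             # omit the alignment section
    if textw is not None:
        argv += ["--textw", str(textw)]    # line length limit
    # Synopsis order: hmmscan [options] <hmmdb> <seqfile>
    return argv + [hmmdb, seqfile]
```

On a machine where hmmscan is installed, the returned list could be handed to subprocess.run(argv, check=True); with the default seqfile of "-", the query sequences would be supplied on standard input.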
101
102
103
104
106 Reporting thresholds control which hits are reported in output files
107 (the main output, --tblout, and --domtblout).
108
109
110 -E <x> In the per-target output, report target profiles with an E-value
111 of <= <x>. The default is 10.0, meaning that on average, about
112 10 false positives will be reported per query, so you can see
113 the top of the noise and decide for yourself if it's really
114 noise.
115
116
-T <x> Instead of thresholding per-profile output on E-value, report
target profiles with a bit score of >= <x>.
119
120
121 --domE <x>
122 In the per-domain output, for target profiles that have already
123 satisfied the per-profile reporting threshold, report individual
124 domains with a conditional E-value of <= <x>. The default is
125 10.0. A conditional E-value means the expected number of addi‐
126 tional false positive domains in the smaller search space of
127 those comparisons that already satisfied the per-profile report‐
128 ing threshold (and thus must have at least one homologous domain
129 already).
130
131
132
--domT <x>
Instead of thresholding per-domain output on E-value, report
domains with a bit score of >= <x>.
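
The two-level logic of the reporting thresholds can be sketched in Python. This is one reading of the option descriptions above, not HMMER's actual code; the function names and the (evalue, score) tuple convention are assumptions made for the example.

```python
def passes(evalue, score, e_max, t_min=None):
    """Apply one threshold: bit score >= t_min if given, else E-value <= e_max."""
    if t_min is not None:
        return score >= t_min
    return evalue <= e_max


def domain_is_reported(target, domain, E=10.0, T=None, domE=10.0, domT=None):
    """Decide whether a domain is reported.

    `target` and `domain` are (evalue, score) pairs; the domain E-value
    is the conditional E-value. A domain is considered only when its
    target profile has already satisfied the per-profile threshold.
    """
    if not passes(target[0], target[1], E, T):
        return False
    return passes(domain[0], domain[1], domE, domT)
```

Passing T or domT switches the corresponding level from E-value to bit score thresholding, mirroring how -T and --domT replace -E and --domE.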
136
137
138
139
140
Inclusion thresholds are stricter than reporting thresholds. Inclusion
thresholds control which hits are considered to be reliable enough to
be included in an output alignment or a subsequent search round.
Unlike hmmsearch or phmmer, hmmscan has no alignment output, and unlike
jackhmmer it has no iterative search steps, so inclusion thresholds
have little effect. They only affect which domains get marked as
significant (!) or questionable (?) in domain output.
149
150
151 --incE <x>
152 Use an E-value of <= <x> as the per-target inclusion threshold.
153 The default is 0.01, meaning that on average, about 1 false pos‐
154 itive would be expected in every 100 searches with different
155 query sequences.
156
157
--incT <x>
Instead of using E-values for setting the inclusion threshold,
use a bit score of >= <x> as the per-target inclusion threshold.
It would be unusual to use bit score thresholds with hmmscan,
because a single score threshold is not expected to work for
different profiles; different profiles have slightly different
expected score distributions.
165
166
167 --incdomE <x>
168 Use a conditional E-value of <= <x> as the per-domain inclusion
169 threshold, in targets that have already satisfied the overall
170 per-target inclusion threshold. The default is 0.01.
171
172
--incdomT <x>
Instead of using E-values, use a bit score of >= <x> as the
per-domain inclusion threshold. As with --incT above, it would
be unusual to use a single bit score threshold in hmmscan.
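
The way inclusion thresholds determine the (!) and (?) marks can be illustrated with a small Python sketch. It assumes a domain is marked significant only when its target meets the per-target inclusion threshold and the domain itself meets the per-domain one; the function name is hypothetical and bit score variants are left out for brevity.

```python
def domain_mark(target_evalue, dom_cevalue, incE=0.01, incdomE=0.01):
    """Return '!' when both inclusion thresholds are met, else '?'.

    Defaults mirror the documented defaults of --incE and --incdomE.
    `dom_cevalue` is the domain's conditional E-value.
    """
    included = target_evalue <= incE and dom_cevalue <= incdomE
    return "!" if included else "?"
```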
177
178
179
180
182 Curated profile databases may define specific bit score thresholds for
183 each profile, superseding any thresholding based on statistical signif‐
184 icance alone.
185
To use these options, the profile must contain the appropriate (GA, TC,
and/or NC) optional score threshold annotation; this is picked up by
hmmbuild from Stockholm format alignment files. Each thresholding
option has two scores: the per-sequence threshold <x1> and the
per-domain threshold <x2>. These act as if -T <x1> --incT <x1>
--domT <x2> --incdomT <x2> had been applied specifically using each
model's curated thresholds.
193
194
195 --cut_ga
196 Use the GA (gathering) bit scores in the model to set per-
197 sequence (GA1) and per-domain (GA2) reporting and inclusion
198 thresholds. GA thresholds are generally considered to be the
199 reliable curated thresholds defining family membership; for
200 example, in Pfam, these thresholds define what gets included in
201 Pfam Full alignments based on searches with Pfam Seed models.
202
203
204 --cut_nc
205 Use the NC (noise cutoff) bit score thresholds in the model to
206 set per-sequence (NC1) and per-domain (NC2) reporting and inclu‐
207 sion thresholds. NC thresholds are generally considered to be
208 the score of the highest-scoring known false positive.
209
210
--cut_tc
Use the TC (trusted cutoff) bit score thresholds in the model to
set per-sequence (TC1) and per-domain (TC2) reporting and
inclusion thresholds. TC thresholds are generally considered to
be the score of the lowest-scoring known true positive that is
above all known false positives.
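
The mapping from a curated cutoff pair to the four equivalent bit score thresholds can be written out as a short Python sketch. The dictionary form used for a model's cutoffs is a hypothetical in-memory representation chosen for this example, not a HMMER data structure.

```python
# Which annotation each option reads from the model.
CUTOFF_FLAGS = {"--cut_ga": "GA", "--cut_tc": "TC", "--cut_nc": "NC"}


def effective_thresholds(model_cutoffs, flag):
    """Expand a model's curated (x1, x2) pair into four thresholds.

    `model_cutoffs` maps "GA"/"TC"/"NC" to (per-sequence, per-domain)
    bit scores, e.g. {"GA": (25.0, 22.0)} (hypothetical values).
    """
    key = CUTOFF_FLAGS[flag]
    if key not in model_cutoffs:
        raise ValueError(f"model has no {key} annotation for {flag}")
    x1, x2 = model_cutoffs[key]
    # Equivalent to -T <x1> --incT <x1> --domT <x2> --incdomT <x2>.
    return {"T": x1, "incT": x1, "domT": x2, "incdomT": x2}
```

Because the pair is looked up per model, each profile in the database gets its own thresholds, which is what distinguishes these options from a single global -T value.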
217
218
219
220
221
223 HMMER3 searches are accelerated in a three-step filter pipeline: the
224 MSV filter, the Viterbi filter, and the Forward filter. The first fil‐
225 ter is the fastest and most approximate; the last is the full Forward
226 scoring algorithm. There is also a bias filter step between MSV and
227 Viterbi. Targets that pass all the steps in the acceleration pipeline
228 are then subjected to postprocessing -- domain identification and scor‐
229 ing using the Forward/Backward algorithm.
230
231 Changing filter thresholds only removes or includes targets from con‐
232 sideration; changing filter thresholds does not alter bit scores, E-
233 values, or alignments, all of which are determined solely in postpro‐
234 cessing.
235
236
237 --max Turn off all filters, including the bias filter, and run full
238 Forward/Backward postprocessing on every target. This increases
239 sensitivity somewhat, at a large cost in speed.
240
241
242 --F1 <x>
243 Set the P-value threshold for the MSV filter step. The default
244 is 0.02, meaning that roughly 2% of the highest scoring nonho‐
245 mologous targets are expected to pass the filter.
246
247
248 --F2 <x>
249 Set the P-value threshold for the Viterbi filter step. The
250 default is 0.001.
251
252
253 --F3 <x>
254 Set the P-value threshold for the Forward filter step. The
255 default is 1e-5.
256
257
258 --nobias
259 Turn off the bias filter. This increases sensitivity somewhat,
260 but can come at a high cost in speed, especially if the query
261 has biased residue composition (such as a repetitive sequence
262 region, or if it is a membrane protein with large regions of
263 hydrophobicity). Without the bias filter, too many sequences may
264 pass the filter with biased queries, leading to slower than
265 expected performance as the computationally intensive For‐
266 ward/Backward algorithms shoulder an abnormally heavy load.
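
The filter cascade described above can be sketched in Python as a sequence of P-value checks. This is a simplified illustration under stated assumptions: a target passes a step when its P-value is at or below that step's threshold, the bias filter is left out, and the function names are hypothetical.

```python
# Documented default thresholds for --F1, --F2, and --F3.
DEFAULT_FILTERS = (("MSV", 0.02), ("Viterbi", 0.001), ("Forward", 1e-5))


def first_failed_filter(pvalues, thresholds=DEFAULT_FILTERS):
    """Return the name of the first filter a target fails, or None.

    `pvalues` maps a filter name to the P-value of the target's score
    at that step. None means the target reaches postprocessing.
    """
    for name, cutoff in thresholds:
        if pvalues[name] > cutoff:
            return name
    return None


def expected_nonhomologous_survivors(n_targets, cutoff):
    """Rough count of nonhomologous targets expected to pass one filter."""
    return n_targets * cutoff
```

For example, with 10,000 nonhomologous targets and the default MSV threshold of 0.02, about 200 would be expected to pass on to the next step.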
267
268
269
270
272 --nonull2
273 Turn off the null2 score corrections for biased composition.
274
275
276 -Z <x> Assert that the total number of targets in your searches is <x>,
277 for the purposes of per-sequence E-value calculations, rather
278 than the actual number of targets seen.
279
280
281 --domZ <x>
282 Assert that the total number of targets in your searches is <x>,
283 for the purposes of per-domain conditional E-value calculations,
284 rather than the number of targets that passed the reporting
285 thresholds.
286
287
288 --seed <n>
289 Set the random number seed to <n>. Some steps in postprocessing
290 require Monte Carlo simulation. The default is to use a fixed
291 seed (42), so that results are exactly reproducible. Any other
292 positive integer will give different (but also reproducible)
293 results. A choice of 0 uses an arbitrarily chosen seed.
294
295
296 --qformat <s>
297 Assert that the query sequence file is in format <s>. Accepted
298 formats include fasta, embl, genbank, ddbj, uniprot, stockholm,
299 pfam, a2m, and afa.
300
301
302 --cpu <n>
303 Set the number of parallel worker threads to <n>. By default,
304 HMMER sets this to the number of CPU cores it detects in your
305 machine - that is, it tries to maximize the use of your avail‐
306 able processor cores. Setting <n> higher than the number of
307 available cores is of little if any value, but you may want to
308 set it to something less. You can also control this number by
309 setting an environment variable, HMMER_NCPU.
310
311 This option is only available if HMMER was compiled with POSIX
312 threads support. This is the default, but it may have been
313 turned off for your site or machine for some reason.
314
315
316
--stall
For debugging the MPI master/worker version: pause after start,
to enable the developer to attach debuggers to the running
master and worker processes. Send a SIGCONT signal to release
the pause. (Under gdb: (gdb) signal SIGCONT)
322
323 (Only available if optional MPI support was enabled at compile-
324 time.)
325
326
327 --mpi Run in MPI master/worker mode, using mpirun.
328
329 (Only available if optional MPI support was enabled at compile-
330 time.)
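
The effect of the -Z option above can be illustrated with a little arithmetic. The Python sketch below assumes that an E-value is the P-value multiplied by the number of targets, so an E-value computed for one database size scales linearly to another; that relationship comes from the HMMER user guide rather than this page, and the function name is hypothetical.

```python
def rescale_evalue(evalue, z_seen, z_asserted):
    """Convert an E-value computed for z_seen targets to one for z_asserted.

    Assumption: E-value = P-value * Z, so E scales linearly with Z.
    """
    if z_seen <= 0 or z_asserted <= 0:
        raise ValueError("target counts must be positive")
    return evalue * z_asserted / z_seen
```

Under this assumption, a hit with an E-value of 0.5 against 100 profiles would have an E-value of 5.0 if the search were treated as covering 1000 profiles.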
331
332
333
334
335
336
337
338
340 See hmmer(1) for a master man page with a list of all the individual
341 man pages for programs in the HMMER package.
342
343
For complete documentation, see the user guide that came with your
HMMER distribution (Userguide.pdf), or see the HMMER web page.
346
347
348
349
351 Copyright (C) 2015 Howard Hughes Medical Institute.
352 Freely distributed under the GNU General Public License (GPLv3).
353
For additional information on copyright and licensing, see the file
called COPYRIGHT in your HMMER source distribution, or see the HMMER
web page.
357
358
359
361 Eddy/Rivas Laboratory
362 Janelia Farm Research Campus
363 19700 Helix Drive
364 Ashburn VA 20147 USA
365 http://eddylab.org
366
367
368
369
HMMER 3.1b2 February 2015 hmmscan(1)