tabix(1) - f36

tabix(1)                     Bioinformatics tools                     tabix(1)

NAME

       tabix - Generic indexer for TAB-delimited genome position files

SYNOPSIS

       tabix  [-0lf]  [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol]
       [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]]

DESCRIPTION

       Tabix indexes a TAB-delimited genome position file in.tab.bgz and
       creates an index file (in.tab.bgz.tbi or in.tab.bgz.csi) when no region
       is given on the command line. The input data file must be sorted by
       position and compressed with bgzip, which has a gzip(1)-like interface.

       After indexing, tabix can quickly retrieve data lines overlapping
       regions specified in the format "chr:beginPos-endPos". (Coordinates
       specified in this region format are 1-based and inclusive.)

       Fast data retrieval also works over a network if a URI is given as the
       file name; in this case the index file will be downloaded if it is not
       present locally.

       The tabix (.tbi) and BAI index formats can handle individual
       chromosomes up to 512 Mbp (2^29 bases) in length. If your input file
       might contain data lines with begin or end positions greater than that,
       you will need to use a CSI index.
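
       The 2^29 limit above can be checked before indexing. The following is
       a minimal sketch, not part of tabix: needs_csi is a hypothetical helper
       that scans uncompressed TAB-delimited text on standard input and
       reports which index type the largest coordinate calls for. The column
       defaults (4 and 5) are an assumption matching the gff preset.

```shell
# Hypothetical helper (not part of tabix): print "csi" if any begin or end
# position exceeds 2^29, otherwise "tbi".
# Usage: needs_csi [begCol [endCol]] < uncompressed.tab
needs_csi() {
    awk -F '\t' -v b="${1:-4}" -v e="${2:-5}" '
        BEGIN { limit = 2 ^ 29; max = 0 }   # 2^29 = 536870912 bases (512 Mbp)
        /^#/  { next }                      # skip meta lines
        {
            if ($b + 0 > max) max = $b + 0  # $b and $e are the chosen columns
            if ($e + 0 > max) max = $e + 0
        }
        END { print (max > limit ? "csi" : "tbi") }
    '
}

# Possible use on an already compressed file (gzip can read bgzip output):
#   gzip -dc sorted.gff.gz | needs_csi
```

       If the helper prints "csi", the file would be indexed with -C rather
       than with the default .tbi format.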

INDEXING OPTIONS

       -0, --zero-based
                 Specify that the position in the data file is 0-based (e.g.
                 UCSC files) rather than 1-based.

       -b, --begin INT
                 Column of the start chromosomal position. [4]

       -c, --comment CHAR
                 Skip lines starting with the character CHAR. [#]

       -C, --csi Produce a CSI format index instead of the classical tabix or
                 BAI style indices.

       -e, --end INT
                 Column of the end chromosomal position. The end column can be
                 the same as the start column. [5]

       -f, --force
                 Overwrite the index file if it already exists.

       -m, --min-shift INT
                 Set the minimal interval size for CSI indices to 2^INT. [14]

       -p, --preset STR
                 Input format for indexing. Valid values are: gff, bed, sam,
                 vcf. This option should not be applied together with any of
                 -s, -b, -e, -c and -0; it is not used for data retrieval
                 because this setting is stored in the index file. [gff]

       -s, --sequence INT
                 Column of the sequence name. Options -s, -b, -e, -S, -c and -0
                 are all stored in the index file and thus not used in data
                 retrieval. [1]

       -S, --skip-lines INT
                 Skip the first INT lines in the data file. [0]
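
       For a table that matches none of the presets, the column options are
       given explicitly, and the file must first be sorted on those same
       columns. The sketch below assumes a hypothetical file in.tsv with the
       sequence name in column 1, the start in column 2 and the end in
       column 3; sort_for_tabix is an illustrative helper, not part of tabix.

```shell
# Hypothetical helper: sort a TAB-delimited table by sequence name and start
# position, keeping the "#" meta lines at the top.
# Usage: sort_for_tabix FILE seqCol begCol
sort_for_tabix() {
    file=$1 seq=$2 beg=$3
    tab=$(printf '\t')
    grep '^#' "$file" || true                 # meta lines first (may be none)
    grep -v '^#' "$file" |
        LC_ALL=C sort -t "$tab" -k"$seq,$seq" -k"$beg,${beg}n"
}

# Possible use, assuming bgzip and tabix are installed:
#   sort_for_tabix in.tsv 1 2 | bgzip > sorted.tsv.gz
#   tabix -s 1 -b 2 -e 3 sorted.tsv.gz
```

       The column numbers passed to the helper mirror the values later given
       to -s and -b, so the sort order and the index agree.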

QUERYING AND OTHER OPTIONS

       -h, --print-header
              Also print the header/meta lines.

       -H, --only-header
              Print only the header/meta lines.

       -l, --list-chroms
              List the sequence names stored in the index file.

       -r, --reheader FILE
              Replace the header with the content of FILE.

       -R, --regions FILE
              Restrict to regions listed in FILE. FILE can be a BED file
              (requires a .bed, .bed.gz, or .bed.bgz file name extension) or a
              TAB-delimited file with CHROM, POS, and, optionally, POS_TO
              columns, where positions are 1-based and inclusive. When this
              option is in use, the input file may not be sorted.

       -T, --targets FILE
              Similar to -R but the entire input will be read sequentially and
              regions not listed in FILE will be skipped.

       -D     Do not download the index file before opening it. Valid for
              remote files only.

       --cache INT
              Set the BGZF block cache size to INT megabytes. [10]

              This is of most benefit when the -R option is used, which can
              cause blocks to be read more than once. Setting the size to 0
              will disable the cache.

       --separate-regions
              This option can be used when multiple regions are supplied on
              the command line and the user needs to quickly see which file
              records belong to which region. For this, a line with the name
              of the region, preceded by the file-specific comment symbol, is
              inserted in the output before its corresponding group of
              records.

       --verbosity INT
              Set the verbosity of logging messages printed to stderr. The
              default is 3, which turns on error and warning messages; 2
              reduces warning messages; 1 prints only error messages and 0 is
              mostly silent. Values higher than 3 produce additional
              informational and debugging messages.
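
       The two file types accepted by -R use different coordinate conventions:
       BED intervals are 0-based and half-open, while the CHROM/POS/POS_TO
       form is 1-based and inclusive. The following sketch converts one to the
       other; bed_to_regions is a hypothetical helper, and skipping "track"
       and "browser" lines is an assumption about the BED input.

```shell
# Hypothetical helper: convert BED lines on stdin (0-based, half-open) to
# CHROM, POS, POS_TO lines (1-based, inclusive).
bed_to_regions() {
    awk -F '\t' '
        BEGIN { OFS = "\t" }
        /^#/ || /^track/ || /^browser/ { next }   # skip non-data lines
        { print $1, $2 + 1, $3 }                  # shift the start; end is unchanged
    '
}

# Possible use, assuming tabix is installed and sorted.gff.gz is indexed:
#   bed_to_regions < targets.bed > targets.tsv
#   tabix -R targets.tsv sorted.gff.gz
```

       Only the start moves because the BED end is exclusive, which makes it
       equal to the last included base in 1-based terms.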

EXAMPLE

       (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) |
           bgzip > sorted.gff.gz;

       tabix -p gff sorted.gff.gz;

       tabix sorted.gff.gz chr1:10,000,000-20,000,000;
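
       To make the meaning of a region query concrete, the sketch below
       selects the same kind of lines by a slow linear scan of uncompressed
       GFF text: records on the named sequence whose start-to-end interval
       overlaps the 1-based, inclusive region. scan_region is a hypothetical
       helper for cross-checking small files, not how tabix works internally,
       and it assumes sequence names that contain no ":" character.

```shell
# Hypothetical helper: linear-scan counterpart of "tabix file chr:beg-end"
# for uncompressed GFF on stdin (sequence in column 1, start 4, end 5).
scan_region() {
    region=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
    chr=${region%%:*}                        # text before the first ":"
    range=${region#*:}                       # "beg-end"
    beg=${range%-*}
    end=${range#*-}
    awk -F '\t' -v c="$chr" -v b="$beg" -v e="$end" '
        /^#/ { next }                                  # skip meta lines
        $1 == c && $4 + 0 <= e + 0 && $5 + 0 >= b + 0  # overlap test
    '
}

# Possible use:
#   gzip -dc sorted.gff.gz | scan_region chr1:10,000,000-20,000,000
```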

NOTES

       It is straightforward to achieve overlap queries using the standard
       B-tree index (with or without binning) implemented in all SQL
       databases, or the R-tree index in PostgreSQL and Oracle. But there are
       still many reasons to use tabix. Firstly, tabix works directly with
       many widely used TAB-delimited formats such as GFF/GTF and BED. We do
       not need to design a database schema or specialized binary formats.
       Data do not need to be duplicated in different formats, either.
       Secondly, tabix works on compressed data files while most SQL databases
       do not. The GenCode annotation GTF can be compressed down to 4% of its
       original size. Thirdly, tabix is fast. The same indexing algorithm is
       known to work efficiently for an alignment with a few billion short
       reads. SQL databases probably cannot easily handle data at this scale.
       Last but not least, tabix supports remote data retrieval. One can put
       the data file and the index on an FTP or HTTP server, and other users
       or even web services will be able to get a slice without downloading
       the entire file.

AUTHOR

       Tabix was written by Heng Li. The BGZF library was originally
       implemented by Bob Handsaker and modified by Heng Li for remote file
       access and in-memory caching.