tabix(1) - f34

tabix(1)                     Bioinformatics tools                     tabix(1)

NAME

       tabix - Generic indexer for TAB-delimited genome position files

SYNOPSIS

       tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol]
       [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]]

DESCRIPTION

       Tabix indexes a TAB-delimited genome position file in.tab.bgz and
       creates an index file (in.tab.bgz.tbi or in.tab.bgz.csi) when region is
       absent from the command line. The input data file must be position
       sorted and compressed by bgzip, which has a gzip(1)-like interface.

       After indexing, tabix can quickly retrieve data lines overlapping
       regions specified in the format "chr:beginPos-endPos". (Coordinates
       specified in this region format are 1-based and inclusive.)
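
       The region format can be illustrated with a small Python sketch. It is
       not tabix's own parser: parse_region and overlaps are hypothetical
       helper names, only the full "chr:beginPos-endPos" form is handled, and
       thousands separators are accepted on the assumption that regions may be
       written as in the EXAMPLE section.

```python
import re

# Hypothetical helpers for illustration; not part of tabix.
_REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<beg>\d[\d,]*)-(?P<end>\d[\d,]*)$")


def parse_region(region):
    """Return (chrom, beg, end) with 1-based, inclusive integer coordinates."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"not in chr:beginPos-endPos form: {region!r}")
    # Commas are treated as thousands separators and dropped.
    beg = int(m.group("beg").replace(",", ""))
    end = int(m.group("end").replace(",", ""))
    if beg < 1 or end < beg:
        raise ValueError(f"invalid coordinates in {region!r}")
    return m.group("chrom"), beg, end


def overlaps(rec_beg, rec_end, beg, end):
    """True if two 1-based, inclusive intervals share at least one position."""
    return rec_beg <= end and beg <= rec_end


chrom, beg, end = parse_region("chr1:10,000,000-20,000,000")
print(chrom, beg, end)
print(overlaps(19_999_990, 20_000_010, beg, end))
```

       Because both ends are inclusive, a record ending exactly at beginPos or
       starting exactly at endPos still counts as overlapping.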

       Fast data retrieval also works over a network if a URI is given as the
       file name; in this case the index file will be downloaded if it is not
       present locally.

INDEXING OPTIONS

       -0, --zero-based
                 Specify that the position in the data file is 0-based (e.g.
                 UCSC files) rather than 1-based.

       -b, --begin INT
                 Column of start chromosomal position. [4]

       -c, --comment CHAR
                 Skip lines starting with character CHAR. [#]

       -C, --csi Produce CSI format index instead of classical tabix or BAI
                 style indices.

       -e, --end INT
                 Column of end chromosomal position. The end column can be the
                 same as the start column. [5]

       -f, --force
                 Overwrite the index file if it is present.

       -m, --min-shift INT
                 Set minimal interval size for CSI indices to 2^INT. [14]

       -p, --preset STR
                 Input format for indexing. Valid values are: gff, bed, sam,
                 vcf. This option should not be applied together with any of
                 -s, -b, -e, -c and -0; it is not used for data retrieval
                 because this setting is stored in the index file. [gff]

       -s, --sequence INT
                 Column of sequence name. Options -s, -b, -e, -S, -c and -0
                 are all stored in the index file and thus not used in data
                 retrieval. [1]

       -S, --skip-lines INT
                 Skip first INT lines in the data file. [0]
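
       What the column options describe can be sketched in Python. This is an
       illustration only, not how tabix is implemented: iter_records and
       min_interval_size are hypothetical names, the defaults mirror the
       bracketed values above, and -0 is assumed to mean a 0-based start with
       an exclusive end (the usual UCSC/BED convention), so only the start is
       shifted.

```python
def iter_records(lines, seq_col=1, beg_col=4, end_col=5,
                 meta_char="#", skip_lines=0, zero_based=False):
    """Yield (chrom, beg, end) in 1-based, inclusive coordinates.

    Hypothetical helper: column numbers are 1-based, as in -s, -b and -e.
    """
    for i, line in enumerate(lines):
        # -S skips the first N lines; -c skips lines starting with meta_char.
        if i < skip_lines or line.startswith(meta_char):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[seq_col - 1]
        beg = int(fields[beg_col - 1])
        end = int(fields[end_col - 1])
        if zero_based:
            # Assumption: 0-based start, exclusive end, so only beg moves.
            beg += 1
        yield chrom, beg, end


def min_interval_size(min_shift=14):
    """Smallest interval size implied by -m: 2 to the power of min_shift."""
    return 1 << min_shift


gff = ["##gff-version 3\n", "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1\n"]
bed = ["chr1\t0\t100\n"]
print(list(iter_records(gff)))
print(list(iter_records(bed, beg_col=2, end_col=3, zero_based=True)))
print(min_interval_size())
```

       With the default -m value of 14, the minimal interval size works out to
       2^14 = 16384 positions.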

QUERYING AND OTHER OPTIONS

       -h, --print-header
              Print also the header/meta lines.

       -H, --only-header
              Print only the header/meta lines.

       -l, --list-chroms
              List the sequence names stored in the index file.

       -r, --reheader FILE
              Replace the header with the content of FILE.

       -R, --regions FILE
              Restrict to regions listed in FILE. FILE can be a BED file
              (requires a .bed, .bed.gz, or .bed.bgz file name extension) or a
              TAB-delimited file with CHROM, POS, and, optionally, POS_TO
              columns, where positions are 1-based and inclusive. When this
              option is in use, the input file may not be sorted.

       -T, --targets FILE
              Similar to -R but the entire input will be read sequentially and
              regions not listed in FILE will be skipped.
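
       The two region file layouts accepted by -R and -T can be sketched in
       Python as follows. The helper names are hypothetical, and two
       behaviours are assumptions of this sketch rather than documented tabix
       behaviour: BED intervals are read as 0-based with an exclusive end, and
       empty lines or lines starting with "#" are skipped.

```python
def looks_like_bed(filename):
    """True if the file name carries one of the BED extensions listed above."""
    return filename.endswith((".bed", ".bed.gz", ".bed.bgz"))


def parse_regions(lines, is_bed=False):
    """Return a list of (chrom, beg, end) in 1-based, inclusive coordinates."""
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption of this sketch: skip blank and "#" lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom = fields[0]
        if is_bed:
            # Assumption: BED is 0-based with an exclusive end.
            beg, end = int(fields[1]) + 1, int(fields[2])
        else:
            # CHROM, POS and optional POS_TO; a lone POS is a single position.
            beg = int(fields[1])
            end = int(fields[2]) if len(fields) > 2 else beg
        regions.append((chrom, beg, end))
    return regions


print(parse_regions(["chr1\t100\n", "chr1\t500\t900\n"]))
print(parse_regions(["chr1\t99\t200\n"], is_bed=looks_like_bed("targets.bed.gz")))
```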

EXAMPLE

       (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) |
       bgzip > sorted.gff.gz;

       tabix -p gff sorted.gff.gz;

       tabix sorted.gff.gz chr1:10,000,000-20,000,000;
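
       The sorting step of the first command can be mirrored in Python to show
       what "position sorted" means for a GFF file. This is a sketch, not a
       replacement for the pipeline: sort_gff_lines is a hypothetical helper,
       sequence names are compared by code point (comparable to running
       sort(1) in the C locale), and the result would still have to be
       compressed with bgzip before indexing.

```python
def sort_gff_lines(lines):
    """Keep "#" header lines first, then sort records by name and start.

    Hypothetical helper mirroring: grep ^"#" plus sort -k1,1 -k4,4n.
    """
    header = [line for line in lines if line.startswith("#")]
    body = [line for line in lines if not line.startswith("#")]

    def key(line):
        fields = line.split("\t")
        # Column 1 is the sequence name, column 4 the numeric start position.
        return fields[0], int(fields[3])

    return header + sorted(body, key=key)


lines = [
    "##gff-version 3\n",
    "chr2\tsrc\tgene\t50\t80\n",
    "chr1\tsrc\tgene\t1000\t2000\n",
    "chr1\tsrc\tgene\t200\t300\n",
]
for line in sort_gff_lines(lines):
    print(line, end="")
```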

NOTES

       It is straightforward to achieve overlap queries using the standard
       B-tree index (with or without binning) implemented in all SQL
       databases, or the R-tree index in PostgreSQL and Oracle. But there are
       still many reasons to use tabix. Firstly, tabix works directly with
       many widely used TAB-delimited formats such as GFF/GTF and BED. We do
       not need to design a database schema or specialized binary formats.
       Data do not need to be duplicated in different formats, either.
       Secondly, tabix works on compressed data files while most SQL databases
       do not. The GenCode annotation GTF can be compressed down to 4% of its
       original size. Thirdly, tabix is fast. The same indexing algorithm is
       known to work efficiently for an alignment with a few billion short
       reads. SQL databases probably cannot easily handle data at this scale.
       Last but not least, tabix supports remote data retrieval. One can put
       the data file and the index on an FTP or HTTP server, and other users
       or even web services will be able to get a slice without downloading
       the entire file.

AUTHOR

       Tabix was written by Heng Li. The BGZF library was originally
       implemented by Bob Handsaker and modified by Heng Li for remote file
       access and in-memory caching.