faidx(5)                     Bioinformatics formats                     faidx(5)

faidx - an index enabling random access to FASTA and FASTQ files

file.fa.fai, file.fasta.fai, file.fq.fai, file.fastq.fai

Using an fai index file in conjunction with a FASTA/FASTQ file
containing reference sequences enables efficient access to arbitrary
regions within those reference sequences. The index file typically has
the same filename as the corresponding FASTA/FASTQ file, with .fai
appended.

An fai index file is a text file consisting of lines each with five
TAB-delimited columns for a FASTA file and six for FASTQ:

    NAME        Name of this reference sequence
    LENGTH      Total length of this reference sequence, in bases
    OFFSET      Offset in the FASTA/FASTQ file of this sequence's first base
    LINEBASES   The number of bases on each line
    LINEWIDTH   The number of bytes in each line, including the newline
    QUALOFFSET  Offset of sequence's first quality within the FASTQ file

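As an illustration of the column layout, one line of an fai index could be read as follows. This is a minimal sketch, not the htslib implementation; the names FaiRecord and parse_fai_line are hypothetical.

```python
from typing import NamedTuple, Optional


class FaiRecord(NamedTuple):
    """One line of an fai index (hypothetical helper type)."""
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int
    qualoffset: Optional[int] = None  # sixth column, present only for FASTQ


def parse_fai_line(line: str) -> FaiRecord:
    """Split one TAB-delimited index line into its five or six columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    name, *numbers = fields
    return FaiRecord(name, *map(int, numbers))


print(parse_fai_line("one\t66\t5\t30\t31\n"))
```

A five-column line leaves qualoffset as None, which lets a caller tell FASTA and FASTQ index lines apart.
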
The NAME and LENGTH columns contain the same data as would appear in
the SN and LN fields of a SAM @SQ header for the same reference
sequence.

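Because of this correspondence, @SQ header lines can be derived from the first two columns of an index. The sketch below assumes only the SN and LN fields are wanted; the function name sq_header_line is hypothetical, and the two index lines are those of the FASTA example on this page.

```python
def sq_header_line(name: str, length: int) -> str:
    """Format a SAM @SQ line from the NAME and LENGTH columns."""
    return f"@SQ\tSN:{name}\tLN:{length}"


fai_text = "one\t66\t5\t30\t31\ntwo\t28\t98\t14\t15\n"
for line in fai_text.splitlines():
    name, length = line.split("\t")[:2]
    print(sq_header_line(name, int(length)))
```
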
The OFFSET column contains the offset within the FASTA/FASTQ file, in
bytes starting from zero, of the first base of this reference sequence,
i.e., of the character following the newline at the end of the header
line (the ">" line in FASTA, "@" in FASTQ). Typically the lines of a
fai index file appear in the order in which the reference sequences
appear in the FASTA/FASTQ file, so .fai files are typically sorted
according to this column.

The LINEBASES column contains the number of bases in each of the
sequence lines that form the body of this reference sequence, apart
from the final line which may be shorter. The LINEWIDTH column
contains the number of bytes in each of the sequence lines (except
perhaps the final line), thus differing from LINEBASES in that it also
counts the bytes forming the line terminator.

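Taken together, OFFSET, LINEBASES and LINEWIDTH are enough to locate any base without reading the rest of the file. The arithmetic is not spelled out above, but it follows from the column definitions; the sketch below assumes a 0-based position, and the name base_byte_offset is hypothetical.

```python
def base_byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Byte offset in the file of the base at 0-based position pos."""
    # Whole lines before the base each occupy LINEWIDTH bytes;
    # the remainder is the column within the base's own line.
    full_lines, column = divmod(pos, linebases)
    return offset + full_lines * linewidth + column


# A sequence indexed with OFFSET=5, LINEBASES=30, LINEWIDTH=31:
print(base_byte_offset(5, 30, 31, 0))   # 5, the first base
print(base_byte_offset(5, 30, 31, 30))  # 36, the first base of the second line
```
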
The QUALOFFSET column, present only for FASTQ files, works the same
way as OFFSET but gives the offset of the first quality score of this
reference sequence, i.e., of the character following the newline at
the end of the "+" line.

FASTA Files
In order to be indexed with samtools faidx, a FASTA file must be a text
file of the form

    >name [description...]
    ATGCATGCATGCATGCATGCATGCATGCAT
    GCATGCATGCATGCATGCATGCATGCATGC
    ATGCAT
    >name [description...]
    ATGCATGCATGCAT
    GCATGCATGCATGC
    [...]

In particular, each reference sequence must be "well-formatted", i.e.,
all of its sequence lines must be the same length, apart from the final
sequence line which may be shorter. (While this sequence line length
must be the same within each sequence, it may vary between different
reference sequences in the same FASTA file.)

This also means that although the FASTA file may have Unix- or
Windows-style or other line termination, the newline characters present
must be consistent, at least within each reference sequence.

The samtools implementation uses the first word of the ">" header line
text (i.e., up to the first whitespace character, having skipped any
initial whitespace after the ">") as the NAME column.

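The rule for deriving NAME can be mimicked in a few lines. This is a sketch rather than the samtools code: the name header_name is hypothetical, accepting "@" headers as well is an assumption based on FASTQ being indexed the same way, and Python's notion of whitespace may differ slightly from that used by samtools.

```python
def header_name(header_line: str) -> str:
    """Return the first whitespace-delimited word after the leading '>' or '@'."""
    if header_line[:1] not in (">", "@"):
        raise ValueError("not a FASTA/FASTQ header line")
    # str.split() with no argument skips leading whitespace and
    # splits on runs of whitespace, matching the rule described above.
    words = header_line[1:].split()
    # Returning "" for an empty header is an assumption of this sketch.
    return words[0] if words else ""


print(header_name(">two another chromosome\n"))  # two
```
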
FASTQ Files
FASTQ files are indexed in the same way as FASTA files, and must be
text files of the form

    @name [description...]
    ATGCATGCATGCATGCATGCATGCATGCAT
    GCATGCATGCATGCATGCATGCATGCATGC
    ATGCAT
    +
    FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
    HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
    8011<<
    @name [description...]
    ATGCATGCATGCAT
    GCATGCATGCATGC
    +
    IIA94445EEII==
    =>IIIIIIIIICCC
    [...]

Quality lines must be wrapped at the same length as the corresponding
sequence lines.

For example, given this FASTA file

    >one
    ATGCATGCATGCATGCATGCATGCATGCAT
    GCATGCATGCATGCATGCATGCATGCATGC
    ATGCAT
    >two another chromosome
    ATGCATGCATGCAT
    GCATGCATGCATGC

formatted with Unix-style (LF) line termination, the corresponding fai
index would be

    one     66      5       30      31
    two     28      98      14      15

If the FASTA file were formatted with Windows-style (CR-LF) line
termination, the fai index would be

    one     66      6       30      32
    two     28      103     14      16

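Both indexes above can be reproduced with a small indexer. The following is an illustrative sketch only, not the samtools faidx algorithm: build_fai is a hypothetical name, the input is held in memory, and the well-formatted check is deliberately strict (for instance, it rejects blank lines before a sequence's first line).

```python
def build_fai(data: bytes):
    """Return (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) rows for FASTA bytes."""
    rows = []
    rec = None          # [name, length, offset, linebases, linewidth] being built
    short_seen = False  # True once a shorter (final) line has been read
    pos = 0             # byte offset of the current line
    for line in data.splitlines(keepends=True):
        bases = len(line.rstrip(b"\r\n"))
        if line.startswith(b">"):
            if rec is not None:
                rows.append(tuple(rec))
            name = line[1:].split()[0].decode("ascii")
            rec = [name, 0, pos + len(line), 0, 0]
            short_seen = False
        elif rec is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            if short_seen and bases > 0:
                raise ValueError(f"{rec[0]}: sequence line after a short line")
            if rec[3] == 0 and bases > 0:
                rec[3], rec[4] = bases, len(line)  # first line sets the pattern
            elif bases > rec[3] or (bases == rec[3]
                                    and len(line) not in (rec[4], bases)):
                raise ValueError(f"{rec[0]}: inconsistent line length or terminator")
            if bases < rec[3] or bases == 0:
                short_seen = True
            rec[1] += bases
        pos += len(line)
    if rec is not None:
        rows.append(tuple(rec))
    return rows


example_lines = [
    ">one",
    "ATGCATGCATGCATGCATGCATGCATGCAT",
    "GCATGCATGCATGCATGCATGCATGCATGC",
    "ATGCAT",
    ">two another chromosome",
    "ATGCATGCATGCAT",
    "GCATGCATGCATGC",
]
for eol in ("\n", "\r\n"):  # Unix-style, then Windows-style
    fasta = "".join(text + eol for text in example_lines).encode("ascii")
    for row in build_fai(fasta):
        print(*row, sep="\t")
```

Tracking the byte position of every line, rather than counting characters, is what makes the CR-LF offsets come out differently from the LF ones.
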
Similarly, this example FASTQ file

    @fastq1
    ATGCATGCATGCATGCATGCATGCATGCAT
    GCATGCATGCATGCATGCATGCATGCATGC
    ATGCAT
    +
    FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
    HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
    8011<<
    @fastq2
    ATGCATGCATGCAT
    GCATGCATGCATGC
    +
    IIA94445EEII==
    =>IIIIIIIIICCC

formatted with Unix-style (LF) line termination would give this fai
index

    fastq1  66      8       30      31      79
    fastq2  28      156     14      15      188

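With this index, the bases and the quality scores of a region can be read with the same arithmetic, starting from OFFSET for bases and from QUALOFFSET for qualities. The sketch below works on an in-memory copy of the example file; the name fetch and its 0-based, end-exclusive coordinates are assumptions made for illustration.

```python
def fetch(data: bytes, first: int, linebases: int, linewidth: int,
          begin: int, end: int) -> bytes:
    """Return symbols [begin, end) of a record whose first symbol is at byte `first`."""
    def at(pos: int) -> int:
        full_lines, column = divmod(pos, linebases)
        return first + full_lines * linewidth + column

    if begin >= end:
        return b""
    raw = data[at(begin):at(end - 1) + 1]
    # Drop the line terminators that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"")


fastq = "\n".join([
    "@fastq1",
    "ATGCATGCATGCATGCATGCATGCATGCAT",
    "GCATGCATGCATGCATGCATGCATGCATGC",
    "ATGCAT",
    "+",
    "FFFA@@FFFFFFFFFFHHB:::@BFFFFGG",
    "HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF",
    "8011<<",
    "@fastq2",
    "ATGCATGCATGCAT",
    "GCATGCATGCATGC",
    "+",
    "IIA94445EEII==",
    "=>IIIIIIIIICCC",
    "",
]).encode("ascii")

# Index line for fastq1: LENGTH 66, OFFSET 8, LINEBASES 30, LINEWIDTH 31, QUALOFFSET 79
print(fetch(fastq, 8, 30, 31, 28, 34))   # bases spanning the first line break
print(fetch(fastq, 79, 30, 31, 28, 34))  # the matching quality scores
```

Since quality lines are wrapped at the same length as sequence lines, LINEBASES and LINEWIDTH apply unchanged to the quality block.
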
samtools(1)

https://en.wikipedia.org/wiki/FASTA_format

https://en.wikipedia.org/wiki/FASTQ_format

    Further description of the FASTA and FASTQ formats

htslib                             June 2018                             faidx(5)