1FS(5) File Formats Manual FS(5)
2
3
4
6 fs, inode - format of file system volume (2BSD)
7
9 #include <sys/types.h>
10 #include <sys/fs.h>
11 #include <sys/inode.h>
12
14 Every file system storage volume (e.g. disk) has a common format for
15 certain vital information. Every such volume is divided into a certain
16 number of blocks. The block size is DEV_BSIZE bytes; specified in
17 <sys/param.h> - currently 1024.
18
19 Each disk drive contains some number of file systems each laid out on a
20 contiguous partition of the disk. A file system consists of a boot
21 block, followed by a super block, followed by an inode area, followed
22 by a data block area which takes up the remainder of the disk parti‐
23 tion. The layout of the super block as defined in <sys/fs.h> is:
24
25 #define MAXMNTLEN 12
26
27 /*
28 * Structure of the super-block
29 */
30 struct fs
31 {
32 u_short fs_isize; /* first block after i-list */
33 daddr_t fs_fsize; /* size in blocks of entire volume */
34 short fs_nfree; /* number of addresses in fs_free */
35 daddr_t fs_free[NICFREE]; /* free block list */
36 short fs_ninode; /* number of inodes in fs_inode */
37 ino_t fs_inode[NICINOD]; /* free inode list */
38 char fs_flock; /* lock during free list manipulation */
39 char fs_ilock; /* lock during i-list manipulation */
40 char fs_fmod; /* super block modified flag */
41 char fs_ronly; /* mounted read-only flag */
42 time_t fs_time; /* last super block update */
43 daddr_t fs_tfree; /* total free blocks */
44 ino_t fs_tinode; /* total free inodes */
45 short fs_step; /* optimal step in free list pattern */
46 short fs_cyl; /* number of blocks per pattern */
47 char fs_fsmnt[MAXMNTLEN]; /* ordinary file mounted on */
48 ino_t fs_lasti; /* start place for circular search */
49 ino_t fs_nbehind; /* est # free inodes before s_lasti */
50 u_short fs_flags; /* mount time flags */
51 };
52
53 File system: A file system is described by its super-block. Block 0 of
54 each file system partition is unused and is available to contain a
55 bootstrap program, pack label, or other information. Block 1 (SUPERB)
56 is the super block. The inode area starts immediately after the super-
57 block, in block 2. Fs_isize is the address of the first block after
58 the inode area. Thus the inode area is fs_isize-2 blocks long.
59 Fs_fsize is the address of the first block not potentially available
60 for allocation to a file. Thus the data block area is fs_fsize -
61 fs_isize blocks long.
62
63 Super block: The path name on which the file system is mounted is main‐
64 tained in fs_fsmnt. Fs_flock, fs_ilock, fs_fmod, fs_ronly and fs_flags
65 are flags maintained in the in core copy of the super block while its
66 file system is mounted and their values on disk are immaterial.
67 Fs_fmod is used as a flag to indicate that the super-block has changed
68 and should be copied to the disk during the next periodic update of
69 file system information. Fs_ronly is a write-protection indicator. It
70 is a copy of the mount flags fs_flags anded with
71 MNT_RDONLY(see/sys/h/mount.h).
72
73 Fs_time is the last time the super-block of the file system was
74 changed. During a reboot, the fs_time of the super-block for the root
75 file system is used to set the system's idea of the time.
76
77 Inode: The inode is the focus of all file activity in the UNIX file
78 system. There is a unique inode allocated for each active file, each
79 current directory, each mounted-on file, text file, and the root. An
80 inode is `named' by its device/i-number pair.
81
82 Inodes are 64 bytes long, so 16 of them fit into a block if DEV_BSIZE
83 is 1024. The root inode is the root of the file system. Inode 0 can't
84 be used for normal purposes and historically bad blocks were linked to
85 inode 1, thus the root inode is 2 (inode 1 is no longer used for this
86 purpose, however numerous dump tapes make this assumption, so we are
87 stuck with it). No other i-number has a built-in meaning.
88
89 The format of an inode as given in <sys/inode.h> is:
90
91 /*
92 * Inode structure as it appears on
93 * a disk block.
94 */
95 struct dinode {
96 u_short di_mode; /* mode and type of file */
97 short di_nlink; /* number of links to file */
98 uid_t di_uid; /* owner's user id */
99 gid_t di_gid; /* owner's group id */
100 off_t di_size; /* number of bytes in file */
101 daddr_t di_addr[7]; /* 7 block addresses 4 bytes each */
102 u_short di_reserved[5];/* pad of 10 to make total size 64 */
103 u_short di_flags;
104 time_t di_atime; /* time last accessed */
105 time_t di_mtime; /* time last modified */
106 time_t di_ctime; /* time created */
107 };
108
109 /*
110 * 28 of the di_addr address bytes are used; 7 addresses of 4
111 * bytes each: 4 direct (4Kb directly accessible) and 3 indirect.
112 */
113 #define NADDR 7
114
115 /* modes */
116
117 #define IFMT 0170000 /* type of file */
118 #define IFCHR 0020000 /* character special */
119 #define IFDIR 0040000 /* directory */
120 #define IFBLK 0060000 /* block special */
121 #define IFREG 0100000 /* regular */
122 #define IFLNK 0120000 /* symbolic link */
123 #define IFSOCK 0140000 /* socket */
124 #define ISUID 04000 /* set user id on execution */
125 #define ISGID 02000 /* set group id on execution */
126 #define ISVTX 01000 /* save swapped text even after use */
127 #define IREAD 0400 /* read, write, execute permissions */
128 #define IWRITE 0200
129 #define IEXEC 0100
130
131 Di_mode identifies the type of file the inode represents; it is encoded
132 identically to the st_mode field of stat(2). Di_nlink is the number of
133 directory entries (links) that refer to this inode. Di_uid and di_gid
134 are the owner's user and group IDs. Di_size is the number of bytes in
135 the file. Di_atime and di_mtime are the times of last access and modi‐
136 fication of the file contents (read, write or create); Di_ctime records
137 the time of last modification to the inode or to the file, and is used
138 to determine whether it should be dumped by dump(8).
139
140 Special files are recognized by their modes. A block-type special file
141 is one which can potentially be mounted as a file system; a character-
142 type special file cannot, though it is not necessarily character-ori‐
143 ented. For special files, the first two bytes of the di_addr field are
144 occupied by the device code (see types(5)). The device codes of block
145 and character special files overlap.
146
147 Disk addresses of plain files and directories are kept in the array
148 di_addr. For a DEV_BSIZE of 1K bytes, 7 addresses are kept in di_addr
149 using 28 of the 40 bytes. The first 4 addresses specify device blocks
150 directly. The last 3 addresses are singly, doubly and triply indirect
151 and point to blocks containing 256 further block pointers. There are 3
152 block addresses reserved as a pad to bring the total size of an inode
153 to 64 bytes. All block addresses are of type daddr_t (see types(5)).
154
155 For block b in a file to exist, it is not necessary that all blocks
156 less than b exist. A zero block number indicates that the correspond‐
157 ing block has never been allocated. Such a missing block reads as if
158 it contained all zero bytes.
159
160 Free block list: The free data block list for each volume is maintained
161 as follows. Fs_free[1], ... , fs_free[fs_nfree-1], contain up to
162 NICFREE free block numbers (NICFREE is a configuration constant defined
163 in <sys/param.h>). Fs_free[0] is the block address of the head of a
164 chain of blocks constituting the free list. The layout of each block
165 of the free chain as defined in <sys/fs.h> is:
166
167 struct fblk
168 {
169 short df_nfree; /* number of addresses in df_free */
170 daddr_t df_free[NICFREE]; /* free block list */
171 };
172
173 The fields df_nfree and df_free in a free block are used exactly like
174 fs_nfree and fs_free in the super block.
175
176 The algorithm used to allocate a block is: decrement fs_nfree, and the
177 new block number is fs_free[fs_nfree]. If the new block address is 0,
178 there are no blocks left, so give an error. If fs_nfree became 0, read
179 the new block into fs_nfree and fs_free.
180
181 To free a block: check if fs_nfree is NICFREE; if so, copy fs_nfree and
182 the fs_free array into the newly freed block, write it out, and set
183 fs_nfree to 0. In any event set fs_free[fs_nfree] to the freed block's
184 address and increment fs_nfree.
185
186 Fs_isize and fs_fsize are used by the system to check for bad block
187 addresses; if an `impossible' block address is allocated from or
188 returned to the free list, a diagnostic is written on the console.
189 Moreover, the free array is cleared, to prevent further allocation from
190 a presumably corrupted free list.
191
192 Fs_step and fs_cyl determine the block interleaving of files for
193 fastest access; traditionally these were referred to as s_m and s_n
194 respectively. Fs_step is the distance between successive blocks and
195 fs_cyl is the number of blocks before the pattern repeats. A file sys‐
196 tem's interleaving factors are determined when it is first created by
197 mkfs(8). Mkfs lays out the initial free list with these parameters and
198 fsck(8) can be used to rebuild the free list optimally (and assign new
199 interleaving factors if necessary).
200
201 Free inode list: Fs_ninode is the number of free inode numbers in the
202 fs_inode array.
203
204 To allocate an inode: if fs_ninode is greater than 0, decrement it and
205 return fs_inode[fs_ninode]. If it was 0, read through the inode area
206 and place the numbers of all free inodes (up to NICINOD) into the
207 fs_inode array, then try again. If a search for free inodes is neces‐
208 sary, the search will start at the beginning of the inode area if
209 fs_nbehind >= 4 × NICINOD, otherwise starting at fs_lasti and continu‐
210 ing at the beginning of the inode area if NICINOD free inodes aren't
211 found when the end of the inode area is reached. When a search com‐
212 pletes the i-number of the first inode of the last block scanned in the
213 search is left in fs_lasti.
214
215 To free an inode, provided fs_ninode is less than NICINODE, place its
216 number into fs_inode[fs_ninode] and increment fs_ninode. If fs_ninode
217 is already NICINODE, don't bother to enter the freed inode into any ta‐
218 ble (fs_inode is only to speed up the allocation process; the informa‐
219 tion as to whether the inode is really free or not is maintained in the
220 inode itself). If the i-number of the freed inode is less than
221 fs_lasti increment fs_nbehind.
222
224 stat(2), dir(5), types(5), dcheck(8), fsck(8), icheck(8), mkfs(8),
225 mount(8)
226
228 It isn't the 4BSD fast file system. The 2BSD file system is a direct
229 descendent of the V7 file system and exists little changed from that
230 ancestor. There are many performance holes in the file system.
231
232 Some changes from the original V7 file system have resulted in better
233 performance: The larger block size (1Kb as opposed to the 512 byte
234 block size of V7) cuts the average number of system calls necessary to
235 access a file by a factor of two; the smaller (in core) inodes allowed
236 by the smaller number of direct links kept in inodes saves valuable
237 kernel data space allowing the kernel buffer cache to be made larger
238 while sacrificing only 1Kb of direct file accessing; and starting free
239 inode searches at the position the last search ended cuts the time to
240 gather free inodes significantly.
241
242 However, the separation of inodes and data blocks into completely dif‐
243 ferent areas of the disk, the handling of the free list, the lack of
244 any file allocation layout policy encouraging locality such as that
245 found in the 4BSD file system and the still too small block size often
246 leads to extremely poor performance.
247
248 The separation of inodes and data blocks in the file system means that
249 to access a file a seek will have to be made to the beginning of the
250 disk partition containing the file system followed another to the the
251 actual data blocks of the file (often quite distant from the inode
252 area).
253
254 The free list which is laid out at file system creation for optimal
255 file block allocation, becomes scrambled over time on an active file
256 system. This process is slowed down by the kernel which always frees
257 blocks from unlink'ed or truncated files in reverse order thereby main‐
258 taining strings of optimally laid out free blocks in the free list.
259 Eventually, however, since both freed and allocated blocks use the head
260 of the free list, it's possible (and quite probable) to have most of
261 the free list laid out optimally with the first portion totally scram‐
262 bled. As a trade off, a file system's free list may be rebuilt fairly
263 frequently via icheck -s or fsck -s and most blocks allocated will be
264 localized as close to the the inode area as possible. Because of this
265 problem, files are sometimes scattered across a file system generating
266 an unpleasant amount of disk arm movement. A nasty oscillation also
267 occurs in the free block list when fs_nfree hovers around NICFREE and 0
268 causing the free array to be constantly written out and read back in as
269 blocks are freed and allocated.
270
271 For a more in depth analysis of the 2BSD file system, its shortcomings,
272 and a description of the changes made for the 4BSD file system see “A
273 Fast File System for UNIX” by M. McKusick; W. Joy; S. Leffler; and R.
274 Fabry.
275
276
277
2783rd Berkeley Distribution January 27, 1996 FS(5)