CRASH(8)                 System Manager's Manual                 CRASH(8)

NAME
crash - what to do when the system crashes

DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes. It can't pretend to be complete.

Bringing it back up. If the reason for the crash is not evident (see
below for guidance on `evident') you may want to try to dump the system
if you feel up to debugging. At the moment a dump can be taken only on
magtape. With a tape mounted and ready, stop the machine, load address
44, and start. This should write a copy of all of core on the tape
with an EOF mark. Caution: any error is taken to mean the end of core
has been reached. This means that you must be sure the ring is in, the
tape is ready, and the tape is clean and new. If the dump fails, you
can try again, but some of the registers will be lost. See below for
what to do with the tape.

In restarting after a crash, always bring up the system single-user.
This is accomplished by following the directions in boot(8) as modified
for your particular installation; a single-user system is indicated by
having a particular value in the switches (173030 unless you've changed
init) as the system starts executing. When it is running, run
dcheck(1) and icheck(1) on all file systems which could have been in
use at the time of the crash. If any serious file system problems are
found, they should be repaired. When you are satisfied with the health
of your disks, check and set the date if necessary, then come up
multi-user. This is most easily accomplished by changing the
single-user value in the switches to something else, then logging out
by typing an EOT.

To boot UNIX at all, three files (and the directories leading to
them) must be intact. First, the initialization program /etc/init must
be present and executable. If it is not, the CPU will loop in user
mode at location 6. For init to work correctly, /dev/tty8 and /bin/sh
must be present. If either does not exist, the symptom is best
described as thrashing: init will go into a fork/exec loop trying to
create a Shell with proper standard input and output.

If you cannot get the system to boot, a runnable system must be
obtained from a backup medium. The root file system may then be
doctored as a mounted file system as described below. If there are any
problems with the root file system, it is probably prudent to go to a
backup system to avoid working on a mounted file system.

Repairing disks. The first rule to keep in mind is that an addled disk
should be treated gently; it shouldn't be mounted unless necessary, and
if it is very valuable yet in quite bad shape, perhaps it should be
dumped before trying surgery on it. This is an area where experience
and informed courage count for much.

The problems reported by icheck are typically of two kinds. There
can be problems with the free list: duplicates in the free list, or
free blocks also in files. These can be cured easily with an
icheck -s. If the same block appears in more than one file or if a file
contains bad blocks, the files should be deleted, and the free list
reconstructed. The best way to delete such a file is to use clri(1),
then remove its directory entries. If any of the affected files is
really precious, you can try to copy it to another device first.

Dcheck may report files which have more directory entries than links.
Such situations are potentially dangerous; clri(1) discusses a special
case of the problem. All the directory entries for the file should be
removed. If on the other hand there are more links than directory
entries, there is no danger of spreading infection, but merely some
disk space that is lost to use. It is sufficient to copy the file (if
it has any entries and is useful) then use clri on its inode and remove
any directory entries that do exist.

Finally, there may be inodes reported by dcheck that have 0 links and 0
entries. These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system stops with
files that have been deleted while still open. A clri will free the
inode, and an icheck -s will recover any missing blocks.
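
The dcheck cases above reduce to a comparison of two counts per inode:
the number of directory entries found and the link count recorded in
the inode. As a rough sketch only, the decision can be written out in
Python. The function name dcheck_advice and the wording of the advice
strings are invented here for illustration; they are not part of
dcheck or of any other program.

```python
def dcheck_advice(entries: int, links: int) -> str:
    """Suggest a repair for one inode from dcheck's two counts.

    entries: directory entries found that refer to the inode
    links:   link count stored in the inode itself
    (Hypothetical helper; it only restates the rules in the text.)
    """
    if entries == 0 and links == 0:
        # Pipes or deleted-but-open files left over from the crash.
        return "clri the inode, then icheck -s to recover missing blocks"
    if entries > links:
        # The dangerous case: the inode could be freed while names remain.
        return "dangerous: remove all directory entries for the file"
    if links > entries:
        # Harmless but wasteful: the space can never be released.
        return ("lost space: copy the file if useful, clri the inode, "
                "remove any remaining directory entries")
    return "consistent: nothing to do"
```

For example, dcheck_advice(3, 2) falls into the dangerous case, since
three directory entries refer to an inode that records only two links.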

Why did it crash? UNIX types a message on the console typewriter when
it voluntarily crashes. Here is the current list of such messages,
with enough information to offer at least a hope of a remedy. The
message has the form `panic: ...', possibly accompanied by other
information. Left unstated in all cases is the possibility that
hardware or software error produced the message in some unexpected way.

blkdev
    The getblk routine was called with a nonexistent major device as
    argument. Definitely hardware or software error.

devtab
    Null device table entry for the major device used as argument to
    getblk. Definitely hardware or software error.

iinit
    An I/O error reading the super-block for the root file system
    during initialization.

out of inodes
    A mounted file system has no more i-nodes when creating a file.
    Sorry, the message does not say which device; icheck should tell
    you.

no fs
    A device has disappeared from the mounted-device table.
    Definitely hardware or software error.

no imt
    Like `no fs', but produced elsewhere.

no inodes
    The in-core inode table is full. Try increasing NINODE in
    param.h. Shouldn't be a panic, just a user error.

no clock
    During initialization, neither the line nor programmable clock was
    found to exist.

swap error
    An unrecoverable I/O error during a swap. Really shouldn't be a
    panic, but it is hard to fix.

unlink - iget
    The directory containing a file being deleted can't be found.
    Hardware or software.

out of swap space
    A program needs to be swapped out, and there is no more swap
    space. It has to be increased. This really shouldn't be a panic,
    but there is no easy fix.

out of text
    A pure procedure program is being executed, and the table for such
    things is full. This shouldn't be a panic.

trap
    An unexpected trap has occurred within the system. This is
    accompanied by three numbers: a `ka6', which is the contents of the
    segmentation register for the area in which the system's stack is
    kept; `aps', which is the location where the hardware stored the
    program status word during the trap; and a `trap type' which
    encodes which trap occurred. The trap types are:

         0   bus error
         1   illegal instruction
         2   BPT/trace
         3   IOT
         4   power fail
         5   EMT
         6   recursive system call (TRAP instruction)
         7   11/70 cache parity, or programmed interrupt
        10   floating point trap
        11   segmentation violation

In some of these cases it is possible for octal 20 to be added into the
trap type; this indicates that the processor was in user mode when the
trap occurred. If you wish to examine the stack after such a trap,
either dump the system, or use the console switches to examine core;
the required address mapping is described below.
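
Decoding the trap type by hand is easy to get wrong when the user-mode
value has been added in. The following Python sketch shows one way to
split the two parts. It assumes that the numbers in the table are
octal (they run 0 to 7 and then 10, 11) and that the added octal 20
can be treated as a separate flag bit. The names TRAP_NAMES, USER_MODE
and decode_trap are illustrative only and do not come from the system
sources.

```python
from typing import Tuple

# Trap types from the table above, read as octal (an assumption).
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(trap_type: int) -> Tuple[str, bool]:
    """Return (trap name, was the processor in user mode?)."""
    from_user = bool(trap_type & USER_MODE)
    # Strip the user-mode flag before looking the type up.
    name = TRAP_NAMES.get(trap_type & ~USER_MODE, "unknown trap type")
    return name, from_user
```

A printed trap type of 31 would be passed as decode_trap(0o31), which
separates it into type 11 with the user-mode value added.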

Interpreting dumps. All file system problems should be taken care of
before attempting to look at dumps. The dump should be read into the
file /usr/sys/core; cp(1) will do. At this point, you should execute
ps -alxk and who to print the process table and the users who were on
at the time of the crash. You should dump (od(1)) the first 30 bytes
of /usr/sys/core. Starting at location 4, the registers R0, R1, R2,
R3, R4, R5, SP and KDSA6 (KISA6 for 11/40s) are stored. If the dump
had to be restarted, R0 will not be correct. Next, take the value of
KA6 (location 022(8) in the dump) multiplied by 0100(8) and dump
01000(8) words (02000(8) bytes) starting from there. This is the
per-process data associated with the process running at the time of
the crash. Relabel the addresses 140000 to 141776. R5 is C's frame or
display pointer. Stored at (R5) is the old R5 pointing to the previous
stack frame. At (R5)+2 is the saved PC of the calling procedure. Trace
this calling chain until you obtain an R5 value of 141756, which is
where the user's R5 is stored. If the chain is broken, you have to look
for a plausible R5, PC pair and continue from there. Each PC should be
looked up in the system's name list using adb(1) and its `:' command,
to get a reverse calling order. In most cases this procedure will give
an idea of what is wrong. A more complete discussion of system
debugging is impossible here.
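
The address arithmetic and the walk along the R5 chain can be
mechanized. The Python sketch below is illustrative only: it assumes
the dump is a raw image of core held in memory as bytes, that words
are 16 bits stored low byte first (as on the PDP-11), and that the
per-process area spans the relabelled range 140000 through 141776.
All function and constant names are invented for this example. It
returns the saved PCs in the order found; looking them up in the name
list is still left to adb(1).

```python
import struct

REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")
REG_BASE = 0o4             # registers are stored from location 4
U_BASE = 0o140000          # first relabelled address
U_LAST = 0o141776          # last relabelled word address
USER_R5_SLOT = 0o141756    # the chain ends when R5 reaches this value


def word(core: bytes, offset: int) -> int:
    """Read one 16-bit little-endian word at a byte offset in the dump."""
    return struct.unpack_from("<H", core, offset)[0]


def saved_registers(core: bytes) -> dict:
    """Registers saved at locations 4 through 022; KA6 is the last."""
    return {name: word(core, REG_BASE + 2 * i)
            for i, name in enumerate(REG_NAMES)}


def u_area_offset(ka6: int) -> int:
    """Byte offset of the per-process data: KA6 times 0100 (octal)."""
    return ka6 * 0o100


def call_chain(core: bytes) -> list:
    """Follow old-R5 links, collecting the saved PC at (R5)+2."""
    regs = saved_registers(core)
    base = u_area_offset(regs["KA6"])
    r5 = regs["R5"]
    pcs, seen = [], set()
    while r5 != USER_R5_SLOT:
        # Outside the per-process area, or a loop: the chain is broken.
        if not (U_BASE <= r5 <= U_LAST - 2) or r5 in seen:
            break
        seen.add(r5)
        offset = base + (r5 - U_BASE)   # relabelled address -> dump offset
        pcs.append(word(core, offset + 2))
        r5 = word(core, offset)
    return pcs
```

With the dump read by open("/usr/sys/core", "rb").read(), call_chain
gives the PCs from the innermost call outward. If the chain is broken
the list simply stops early, and a plausible R5, PC pair must still be
found by eye as described above.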

SEE ALSO
clri(1), icheck(1), dcheck(1), boot(8)