CRASH(8V)                                                            CRASH(8V)

NAME
crash - what happens when the system crashes

DESCRIPTION
This section explains what happens when the system crashes and (very
briefly) how to analyze crash dumps.

When the system crashes voluntarily it prints a message of the form

        panic: why i gave up the ghost

on the console, takes a dump on a mass storage peripheral, and then
invokes an automatic reboot procedure as described in reboot(8).
Unless some unexpected inconsistency is encountered in the state of the
file systems due to hardware or software failure, the system will then
resume multi-user operations. If the automatic file system check
fails, the file systems should be checked and repaired with fsck(8)
before continuing.

The system has a large number of internal consistency checks; if one of
these fails, then it will panic with a very short message indicating
which one failed. In many instances, this will be the name of the
routine which detected the error, or a two-word description of the
inconsistency. A full understanding of most panic messages requires
perusal of the source code for the system.

The most common cause of system failures is hardware failure, which can
manifest itself in different ways. Here are the messages which are most
likely, with some hints as to causes. Left unstated in all cases is
the possibility that hardware or software error produced the message in
some unexpected way.

iinit   This cryptic panic message results from a failure to mount the
        root filesystem during the bootstrap process. Either the root
        filesystem has been corrupted, or the system is attempting to
        use the wrong device as root filesystem. Usually, an alternate
        copy of the system binary or an alternate root filesystem can be
        used to bring up the system to investigate.

Can't exec /etc/init
        This is not a panic message, as reboots are likely to be futile.
        Late in the bootstrap procedure, the system was unable to locate
        and execute the initialization process, init(8). The root
        filesystem is incorrect or has been corrupted, or the mode or
        type of /etc/init forbids execution.

hard IO err in swap
        The system encountered an error trying to write to the swap
        device or an error in reading critical information from a disk
        drive. The offending disk should be fixed if it is broken or
        unreliable.

timeout table overflow
        This really shouldn't be a panic, but until the data structure
        involved is made to be extensible, running out of entries causes
        a crash. If this happens, make the timeout table bigger (NCALL
        in param.c).

trap type %o
        An unexpected trap has occurred within the system; the trap
        types (printed in octal) are:

        0       bus error
        1       illegal instruction trap
        2       BPT/trace trap
        3       IOT
        4       power fail trap (if autoreboot fails)
        5       EMT
        6       recursive system call (TRAP instruction)
        7       programmed interrupt request
        11      protection fault (segmentation violation)
        12      parity trap

        In some of these cases it is possible for octal 020 to be added
        into the trap type; this indicates that the processor was in user
        mode when the trap occurred.
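
        As an illustration only, the decoding described above can be
        sketched in Python. The table is copied from the list above; the
        function and constant names are hypothetical, and it is assumed
        that the number is read off the console in octal, as the %o in
        the message suggests.

```python
# Sketch only: names below are illustrative, not part of the kernel.
# Trap types are keyed by their octal value, as listed on this page.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction trap",
    0o2: "BPT/trace trap",
    0o3: "IOT",
    0o4: "power fail trap (if autoreboot fails)",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "programmed interrupt request",
    0o11: "protection fault (segmentation violation)",
    0o12: "parity trap",
}

USER_MODE_BIT = 0o20  # added into the trap type when the CPU was in user mode


def decode_trap(trap_type):
    """Return (description, was_user_mode) for a numeric trap type."""
    was_user_mode = bool(trap_type & USER_MODE_BIT)
    base_type = trap_type & ~USER_MODE_BIT
    description = TRAP_NAMES.get(base_type, "unknown trap type")
    return description, was_user_mode


if __name__ == "__main__":
    # Assumption: the console showed "trap type 31"; parse it as octal.
    print(decode_trap(int("31", 8)))
```

        Under these assumptions a console value of 31 decodes to a
        protection fault taken in user mode, and 0 to a bus error taken
        in kernel mode.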

        In addition to the trap type, the system will have printed out
        three (or four) other numbers: ka6, which is the contents of the
        segmentation register for the area in which the system's stack
        is kept; aps, which is the location where the hardware stored
        the program status word during the trap; pc, which was the
        system's program counter when it faulted (already incremented to
        the next word); and, when a fourth number is printed, __ovno,
        the overlay number of the currently loaded kernel overlay when
        the trap occurred.

        The favorite trap types in system crashes are trap types 0 and
        11, indicating a wild reference. The code is the referenced
        address, and the pc at the time of the fault is printed. These
        problems tend to be easy to track down if they are kernel bugs
        since the processor stops cold, but random flakiness seems to
        cause this sometimes. The debugger can be used to locate the
        instruction and subroutine corresponding to the PC value. If
        that is insufficient to suggest the nature of the problem, more
        detailed examination of the system status at the time of the
        trap usually can produce an explanation.

init died
        The system initialization process has exited. This is bad news,
        as no new users will then be able to log in. Rebooting is the
        only fix, so the system just does it right away.

out of mbufs: map full
        The network has exhausted its private page map for network
        buffers. This usually indicates that buffers are being lost, and
        rather than allow the system to slowly degrade, it reboots
        immediately. The map may be made larger if necessary.

out of swap space
        This really shouldn't be a panic, but there's no other
        satisfactory solution. The size of the swap area must be
        increased. The system attempts to avoid running out of swap by
        refusing to start new processes when short of swap space
        (resulting in ``No more processes'' messages from the shell).

&remap_area > SEG5
_end > SEG5
        The kernel detected at boot time that an unacceptable portion of
        its data space extended into the region controlled by KDSA5. In
        the case of the first message, the size of the kernel's data
        segment (excluding the file, proc, and text tables) must be
        decreased. In the case of the second message, there are two
        possibilities: if &remap_area is not greater than 0120000, the
        kernel must be recompiled without defining the option NOKA5.
        Otherwise, as above, the size of the kernel's data segment must
        be decreased.

That completes the list of panic types you are likely to see. There
are many other panic messages which are less likely to occur; most of
them detect logical inconsistencies within the kernel and thus ``cannot
happen'' unless some part of the kernel has been modified.

If the system stops or hangs without a panic, it is possible to stop it
and take a dump of memory before rebooting. A panic can be forced from
the console, which will allow a dump, automatic reboot and file system
check. This is accomplished by halting the CPU, putting the processor
in kernel mode, loading the PC with 40, and continuing without a reset
(use continue, not start). To put the processor in kernel mode, make
sure the two high bits in the processor status word are zero. (You'll
need to consult the processor handbook describing your processor to
determine how to access the PC and PS ...) The message ``panic:
forced from console'' should print, and the automatic reboot will
start.
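
The kernel-mode condition above can be sketched in Python. This page
only requires that the two high bits of the 16-bit processor status
word be zero; the names given to the other encodings come from general
PDP-11 architecture rather than from this page, and the function names
are hypothetical.

```python
# Sketch only: bits 15-14 of the 16-bit PS select the current mode.
# The mode names other than "kernel" are an assumption drawn from
# general PDP-11 documentation, not from this manual page.
MODE_NAMES = {0b00: "kernel", 0b01: "supervisor", 0b11: "user"}

MODE_SHIFT = 14
LOW_14_BITS = 0o037777  # mask that clears the two high bits


def current_mode(ps):
    """Name the mode selected by the two high bits of the PS."""
    return MODE_NAMES.get((ps >> MODE_SHIFT) & 0b11, "reserved")


def force_kernel_mode(ps):
    """Return the PS value with its two high bits cleared."""
    return ps & LOW_14_BITS


if __name__ == "__main__":
    ps = 0o170340  # hypothetical PS value read from the console
    print(current_mode(ps), oct(force_kernel_mode(ps)))
```

The value to deposit in the PS is whatever force_kernel_mode returns
for the value read from the console; how to read and deposit it still
depends on the processor.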

If this fails, a dump of memory can be made on magtape: mount a tape
(with write ring!), halt the CPU, load address 044, and perform a start
(which does a reset). This should write a copy of all of core on the
tape with an EOF mark. Caution: Any error is taken to mean the end of
core has been reached. This means that you must be sure the ring is
in, the tape is ready, and the tape is clean and new. If the dump
fails, you can try again, but some of the registers will be lost.
After this completes, halt again and reboot.

After rebooting, or after an automatic file system check fails, check
and fix the file systems with fsck. If the system will not reboot, a
runnable system must be obtained from a backup medium after verifying
that the hardware is functioning normally. A damaged root file system
should be patched while running with an alternate root if possible.

When the system crashes, if crash dumping was enabled, it writes (or at
least attempts to write) an image of memory into the back end of the
dump device, usually the same as the primary swap area. After the
system is rebooted, the program savecore(8) runs and preserves a copy
of this core image and the current system in a specified directory for
later perusal. See savecore(8) for details. A magtape dump can be
read onto disk with dd(1).

To analyze a dump you should begin by running adb(1) with the -k flag
on the system load image and core dump. If the core image is the
result of a panic, the panic message is printed. Normally the command
``$c'' or ``$C'' will provide a stack trace from the point of the crash
and this will provide a clue as to what went wrong. ps(1) and
pstat(8) can also be used to print the process table at the time of the
crash via: ps -alxk and pstat -p.

If the mapping or the stack frame is incorrect, the following magic
locations may be examined in an attempt to find out what went wrong.
The registers R0, R1, R2, R3, R4, R5, SP, and KDSA6 (or KISA6 for
machines without separate instruction and data) are saved at location
04. If the core dump was taken on disk, these values also appear at
0300. The value of KDSA6 (KISA6) multiplied by 0100 (octal) gives the
address of the user structure and kernel stack for the running process.
Relabel these addresses 0140000 through 0142000. R5 is C's frame or
display pointer. Stored at (R5) is the old R5 pointing to the previous
stack frame. At (R5)+2 is the saved PC of the calling procedure. Trace
this calling chain to an R5 value of 0141756 (0141754 for overlaid
kernels), which is where the user's R5 is stored. If the chain is
broken, look for a plausible R5, PC pair and continue from there. In
most cases this procedure will give an idea of what is wrong.
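
The manual procedure above can be sketched in Python as a rough aid,
not as a replacement for adb(1). Several things here are assumptions
rather than statements from this page: that the dump is a raw image in
which the file offset equals the physical address, that the eight saved
registers are consecutive 16-bit little-endian words in the order
listed, and all function and constant names.

```python
import struct

# Sketch only: every name here is hypothetical.
REG_NAMES = ("r0", "r1", "r2", "r3", "r4", "r5", "sp", "kdsa6")
U_VIRT_START = 0o140000   # relabelled start of user structure and stack
U_VIRT_END = 0o142000     # relabelled end
USER_R5_SLOT = 0o141756   # use 0o141754 for overlaid kernels


def saved_registers(image, location=0o4):
    """Read the saved registers (assumed order) from the dump image."""
    values = struct.unpack_from("<8H", image, location)
    return dict(zip(REG_NAMES, values))


def u_area_base(kdsa6):
    """KDSA6 (or KISA6) times 0100 octal gives the address in the dump."""
    return kdsa6 * 0o100


def walk_frames(image, r5, kdsa6, end=USER_R5_SLOT, limit=64):
    """Follow the R5 chain, returning a list of (r5, saved_pc) pairs."""
    base = u_area_base(kdsa6)
    frames = []
    while r5 != end and len(frames) < limit:
        # Stop if the chain is broken: R5 must lie in the relabelled range.
        if not (U_VIRT_START <= r5 <= U_VIRT_END - 4):
            break
        offset = base + (r5 - U_VIRT_START)
        if offset + 4 > len(image):
            break
        old_r5, saved_pc = struct.unpack_from("<2H", image, offset)
        frames.append((r5, saved_pc))
        r5 = old_r5
    return frames


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as dump:
        core = dump.read()
    regs = saved_registers(core)
    for frame_r5, frame_pc in walk_frames(core, regs["r5"], regs["kdsa6"]):
        print("r5=%o pc=%o" % (frame_r5, frame_pc))
```

If the walk stops early, the chain is broken at the last R5 printed,
and a plausible R5, PC pair has to be found by hand as described above.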

A more complete discussion of system debugging is impossible here.
See, however, ``Using ADB to Debug the UNIX Kernel''.

SEE ALSO
adb(1), ps(1), pstat(8), boot(8), fsck(8), reboot(8), savecore(8)
PDP-11 Processor Handbook for various processors for more information
about PDP-11 memory management and general architecture.
Using ADB to Debug the UNIX Kernel

3rd Berkeley Distribution        July 11, 1987                      CRASH(8V)