crash(8) - bsd211

CRASH(8V)                                                            CRASH(8V)

NAME

       crash - what happens when the system crashes

DESCRIPTION

       This section explains what happens when the system crashes and (very
       briefly) how to analyze crash dumps.
       
       When the system crashes voluntarily, it prints a message of the form

              panic: why i gave up the ghost

       on the console, takes a dump on a mass storage peripheral, and then
       invokes an automatic reboot procedure as described in reboot(8).
       Unless some unexpected inconsistency is encountered in the state of the
       file systems due to hardware or software failure, the system will then
       resume multi-user operations.  If the automatic file system check
       fails, the file systems should be checked and repaired with fsck(8)
       before continuing.
       
       The system has a large number of internal consistency checks; if one of
       these fails, then it will panic with a very short message indicating
       which one failed.  In many instances, this will be the name of the
       routine which detected the error, or a two-word description of the
       inconsistency.  A full understanding of most panic messages requires
       perusal of the source code for the system.
       
       The most common cause of system failures is hardware failure, which can
       manifest itself in different ways.  Here are the messages which are most
       likely, with some hints as to causes.  Left unstated in all cases is
       the possibility that hardware or software error produced the message in
       some unexpected way.
       
       iinit  This cryptic panic message results from a failure to mount the
              root filesystem during the bootstrap process.  Either the root
              filesystem has been corrupted, or the system is attempting to
              use the wrong device as root filesystem.  Usually, an alternate
              copy of the system binary or an alternate root filesystem can be
              used to bring up the system to investigate.
       
       Can't exec /etc/init
              This is not a panic message, as reboots are likely to be futile.
              Late in the bootstrap procedure, the system was unable to locate
              and execute the initialization process, init(8).  The root
              filesystem is incorrect or has been corrupted, or the mode or
              type of /etc/init forbids execution.
       
       hard IO err in swap
              The system encountered an error trying to write to the swap
              device or an error in reading critical information from a disk
              drive.  The offending disk should be fixed if it is broken or
              unreliable.
       
       timeout table overflow
              This really shouldn't be a panic, but until the data structure
              involved is made to be extensible, running out of entries causes
              a crash.  If this happens, make the timeout table bigger (NCALL
              in param.c).
       
       trap type %o
              An unexpected trap has occurred within the system; the trap
              types are:

              0    bus error
              1    illegal instruction trap
              2    BPT/trace trap
              3    IOT
              4    power fail trap (if autoreboot fails)
              5    EMT
              6    recursive system call (TRAP instruction)
              7    programmed interrupt request
              11   protection fault (segmentation violation)
              12   parity trap

       In some of these cases it is possible for octal 020 to be added into
       the trap type; this indicates that the processor was in user mode when
       the trap occurred.
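       
       As an illustration of the table and the user-mode bit, here is a
       minimal C sketch that decodes the number printed after ``trap type''.
       It assumes the values in the table are octal, since the message is
       printed with %o.  The names TRAP_USER, trap_name and trap_from_user
       are invented for this example and are not kernel identifiers.

```c
#include <stddef.h>

/* Assumption: octal 020 is the "was in user mode" bit described above. */
#define TRAP_USER 020u

/* Map a console trap type to the description from the table above.
 * The table values are taken to be octal (assumption: printed with %o). */
const char *trap_name(unsigned type)
{
    switch (type & ~TRAP_USER) {          /* ignore the user-mode bit */
    case 00:  return "bus error";
    case 01:  return "illegal instruction trap";
    case 02:  return "BPT/trace trap";
    case 03:  return "IOT";
    case 04:  return "power fail trap";
    case 05:  return "EMT";
    case 06:  return "recursive system call (TRAP instruction)";
    case 07:  return "programmed interrupt request";
    case 011: return "protection fault (segmentation violation)";
    case 012: return "parity trap";
    default:  return "not listed";
    }
}

/* Nonzero if the trap type says the processor was in user mode. */
int trap_from_user(unsigned type)
{
    return (type & TRAP_USER) != 0;
}
```

       Under these assumptions, a console value of 31 would be passed as
       trap_name(031) and decodes as a protection fault taken in user mode.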
       
       In addition to the trap type, the system will have printed out three
       (or four) other numbers: ka6, which is the contents of the segmentation
       register for the area in which the system's stack is kept; aps, which
       is the location where the hardware stored the program status word
       during the trap; pc, which was the system's program counter when it
       faulted (already incremented to the next word); and __ovno, the overlay
       number of the currently loaded kernel overlay when the trap occurred.
       
       The most common trap types in system crashes are 0 and 11, indicating
       a wild reference.  The code is the referenced address, and the pc at
       the time of the fault is printed.  These problems tend to be easy to
       track down if they are kernel bugs, since the processor stops cold,
       but random flakiness seems to cause this sometimes.  The debugger
       can be used to locate the instruction and subroutine corresponding to
       the PC value.  If that is insufficient to suggest the nature of the
       problem, more detailed examination of the system status at the time of
       the trap usually can produce an explanation.
       
       init died
              The system initialization process has exited.  This is bad news,
              as no new users will then be able to log in.  Rebooting is the
              only fix, so the system just does it right away.
       
       out of mbufs: map full
              The network has exhausted its private page map for network
              buffers.  This usually indicates that buffers are being lost,
              and rather than allow the system to slowly degrade, it reboots
              immediately.  The map may be made larger if necessary.
       
       out of swap space
              This really shouldn't be a panic, but there's no other
              satisfactory solution.  The size of the swap area must be
              increased.  The system attempts to avoid running out of swap by
              refusing to start new processes when short of swap space
              (resulting in ``No more processes'' messages from the shell).
       
       &remap_area > SEG5
       _end > SEG5
              The kernel detected at boot time that an unacceptable portion of
              its data space extended into the region controlled by KDSA5.  In
              the case of the first message, the size of the kernel's data
              segment (excluding the file, proc, and text tables) must be
              decreased.  In the latter case, there are two possibilities: if
              &remap_area is not greater than 0120000, the kernel must be
              recompiled without defining the option NOKA5.  Otherwise, as
              above, the size of the kernel's data segment must be decreased.
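       
       The advice for these two messages can be summarized as a small
       decision function.  This is only a sketch of the remedies described
       above, not the kernel's boot-time check.  It assumes SEG5 is 0120000,
       the start of the 8 Kbyte page mapped by KDSA5; the enum and function
       names are hypothetical.

```c
#include <stdint.h>

/* Assumption: SEG5 is the start of page 5 (5 * 020000 bytes). */
#define SEG5 0120000u

enum seg5_advice {
    SEG5_OK,                    /* neither message would be printed */
    SEG5_SHRINK_DATA,           /* decrease the kernel's data segment */
    SEG5_REBUILD_WITHOUT_NOKA5  /* recompile without the NOKA5 option */
};

/* remap_area and end stand for the addresses &remap_area and _end. */
enum seg5_advice seg5_advice(uint16_t remap_area, uint16_t end)
{
    if (remap_area > SEG5)      /* first message, or "otherwise" case */
        return SEG5_SHRINK_DATA;
    if (end > SEG5)             /* second message, &remap_area <= 0120000 */
        return SEG5_REBUILD_WITHOUT_NOKA5;
    return SEG5_OK;
}
```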
       
       That completes the list of panic types you are likely to see.  There
       are many other panic messages which are less likely to occur; most of
       them detect logical inconsistencies within the kernel and thus ``cannot
       happen'' unless some part of the kernel has been modified.
       
       If the system stops or hangs without a panic, it is possible to stop it
       and take a dump of memory before rebooting.  A panic can be forced from
       the console, which will allow a dump, automatic reboot and file system
       check.  This is accomplished by halting the CPU, putting the processor
       in kernel mode, loading the PC with 40, and continuing without a reset
       (use continue, not start).  To put the processor in kernel mode, make
       sure the two high bits in the processor status word are zero.  (You
       will need to consult the processor handbook for your processor to
       determine how to access the PC and PS.)  The message ``panic: forced
       from console'' should print, and the automatic reboot will start.
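       
       The kernel-mode condition is a test on the top two bits of the 16-bit
       PS.  A minimal C sketch follows; the macro and function names are
       made up for this example, and it only shows the bit arithmetic, not
       how to deposit the value from a particular console.

```c
#include <stdint.h>

/* The two high bits (15 and 14) of the processor status word. */
#define PS_MODE_BITS 0140000u

/* Nonzero if the two high bits of ps are zero, i.e. kernel mode. */
int ps_is_kernel(uint16_t ps)
{
    return (ps & PS_MODE_BITS) == 0;
}

/* Return ps with the two high bits cleared and all other bits kept. */
uint16_t ps_force_kernel(uint16_t ps)
{
    return (uint16_t)(ps & ~PS_MODE_BITS);
}
```

       For instance, a PS of 0140340 fails ps_is_kernel(), and
       ps_force_kernel() turns it into 0000340.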
       
       If this fails, a dump of memory can be made on magtape: mount a tape
       (with write ring!), halt the CPU, load address 044, and perform a start
       (which does a reset).  This should write a copy of all of core on the
       tape with an EOF mark.  Caution: Any error is taken to mean the end of
       core has been reached.  This means that you must be sure the ring is
       in, the tape is ready, and the tape is clean and new.  If the dump
       fails, you can try again, but some of the registers will be lost.
       After this completes, halt again and reboot.
       
       After rebooting, or after an automatic file system check fails, check
       and fix the file systems with fsck(8).  If the system will not reboot,
       a runnable system must be obtained from a backup medium after verifying
       that the hardware is functioning normally.  A damaged root file system
       should be patched while running with an alternate root if possible.
       
       When the system crashes, if crash dumping was enabled, it writes (or at
       least attempts to write) an image of memory into the back end of the
       dump device, usually the same as the primary swap area.  After the
       system is rebooted, the program savecore(8) runs and preserves a copy
       of this core image and the current system in a specified directory for
       later perusal.  See savecore(8) for details.  A magtape dump can be
       read onto disk with dd(1).
       
       To analyze a dump you should begin by running adb(1) with the -k flag
       on the system load image and core dump.  If the core image is the
       result of a panic, the panic message is printed.  Normally the command
       ``$c'' or ``$C'' will provide a stack trace from the point of the crash
       and this will provide a clue as to what went wrong.  ps(1) and
       pstat(8) can also be used to print the process table at the time of the
       crash via: ps -alxk and pstat -p.  If the mapping or the stack frame
       is incorrect, the following magic locations may be examined in an
       attempt to find out what went wrong.  The registers R0, R1, R2, R3, R4,
       R5, SP, and KDSA6 (or KISA6 for machines without separate instruction
       and data) are saved at location 04.  If the core dump was taken on
       disk, these values also appear at 0300.  The value of KDSA6 (KISA6)
       multiplied by 0100 (octal) gives the address of the user structure and
       kernel stack for the running process.  Relabel these addresses 0140000
       through 0142000.  R5 is C's frame or display pointer.  Stored at (R5)
       is the old R5 pointing to the previous stack frame.  At (R5)+2 is the
       saved PC of the calling procedure.  Trace this calling chain to an R5
       value of 0141756 (0141754 for overlaid kernels), which is where the
       user's R5 is stored.  If the chain is broken, look for a plausible R5,
       PC pair and continue from there.  In most cases this procedure will
       give an idea of what is wrong.
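       
       The manual procedure above can be sketched in C.  This is an
       illustration only, not part of adb: it assumes the 02000 bytes
       starting at KDSA6 * 0100 have already been copied out of the dump into
       a buffer, that words are stored low byte first, and that caller frames
       sit at higher addresses than their callees.  All names here are
       hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UAREA_BASE 0140000u   /* relabelled start of u-area and stack */
#define UAREA_END  0142000u   /* relabelled end (exclusive) */
#define USER_R5    0141756u   /* 0141754 for overlaid kernels */

/* Byte address in the dump of the user structure and kernel stack. */
uint32_t uarea_phys(uint16_t kdsa6)
{
    return (uint32_t)kdsa6 * 0100u;
}

/* Read the 16-bit word at relabelled address va; -1 if out of range. */
static int fetch(const uint8_t *ua, uint32_t va, uint16_t *out)
{
    if (va < UAREA_BASE || va > UAREA_END - 2 || (va & 1))
        return -1;
    size_t off = va - UAREA_BASE;
    *out = (uint16_t)(ua[off] | (ua[off + 1] << 8));
    return 0;
}

/* Follow the R5 chain from r5 until it reaches stop, storing each saved
 * PC in pcs[].  Returns the number of frames, or -1 if the chain is
 * broken or longer than max frames. */
int walk_frames(const uint8_t *ua, uint16_t r5, uint16_t stop,
                uint16_t *pcs, int max)
{
    int n = 0;
    while (r5 != stop) {
        uint16_t old_r5, pc;
        if (n >= max)
            return -1;
        if (fetch(ua, r5, &old_r5) || fetch(ua, (uint32_t)r5 + 2, &pc))
            return -1;
        if (old_r5 <= r5)       /* sanity check: chain must move upward */
            return -1;
        pcs[n++] = pc;          /* saved PC of the calling procedure */
        r5 = old_r5;            /* step to the previous stack frame */
    }
    return n;
}
```

       A return value of -1 corresponds to the ``chain is broken'' case, at
       which point a plausible R5, PC pair has to be found by hand and the
       walk restarted from there.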
       
       A more complete discussion of system debugging is impossible here.
       See, however, ``Using ADB to Debug the UNIX Kernel''.
