crash(8) - unix7

CRASH(8)                    System Manager's Manual                   CRASH(8)

NAME

       crash - what to do when the system crashes

DESCRIPTION

       This section gives at least a few clues about how to proceed if the
       system crashes.  It can't pretend to be complete.

       Bringing it back up.  If the reason for the crash is not evident (see
       below for guidance on `evident') you may want to try to dump the system
       if you feel up to debugging.  At the moment a dump can be taken only on
       magtape.  With a tape mounted and ready, stop the machine, load address
       44, and start.  This should write a copy of all of core on the tape
       with an EOF mark.  Caution: Any error is taken to mean the end of core
       has been reached.  This means that you must be sure the ring is in, the
       tape is ready, and the tape is clean and new.  If the dump fails, you
       can try again, but some of the registers will be lost.  See below for
       what to do with the tape.

       In restarting after a crash, always bring up the system single-user.
       This is accomplished by following the directions in boot(8) as modified
       for your particular installation; a single-user system is indicated by
       having a particular value in the switches (173030 unless you've changed
       init) as the system starts executing.  When it is running, perform a
       dcheck and icheck(1) on all file systems which could have been in use
       at the time of the crash.  If any serious file system problems are
       found, they should be repaired.  When you are satisfied with the health
       of your disks, check and set the date if necessary, then come up
       multi-user.  This is most easily accomplished by changing the
       single-user value in the switches to something else, then logging out
       by typing an EOT.

       To even boot UNIX at all, three files (and the directories leading to
       them) must be intact.  First, the initialization program /etc/init must
       be present and executable.  If it is not, the CPU will loop in user
       mode at location 6.  For init to work correctly, /dev/tty8 and /bin/sh
       must be present.  If either does not exist, the symptom is best
       described as thrashing.  Init will go into a fork/exec loop trying to
       create a Shell with proper standard input and output.
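       
       As a modern illustration only, the check for the three files can be
       sketched as below.  Python is an assumption of this sketch and is not
       part of the system described here; the function name
       missing_boot_files and the idea of pointing it at a damaged root
       mounted under some directory are hypothetical.

```python
import os

# The three files the text says must be intact for UNIX to boot.
REQUIRED = ("etc/init", "dev/tty8", "bin/sh")


def missing_boot_files(root):
    """Report problems with the boot-critical files under `root`.

    `root` is a hypothetical directory where the suspect root file
    system is visible (for example, mounted from a backup system).
    """
    problems = []
    for rel in REQUIRED:
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            problems.append(rel + ": missing")
        elif rel == "etc/init" and (os.stat(path).st_mode & 0o111) == 0:
            # init must be executable as well as present.
            problems.append(rel + ": not executable")
    return problems
```

       An empty list means only that the three files are present and that
       init has an execute bit; it says nothing about the health of the rest
       of the file system.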
       
       If you cannot get the system to boot, a runnable system must be
       obtained from a backup medium.  The root file system may then be
       doctored as a mounted file system as described below.  If there are any
       problems with the root file system, it is probably prudent to go to a
       backup system to avoid working on a mounted file system.

       Repairing disks.  The first rule to keep in mind is that an addled disk
       should be treated gently; it shouldn't be mounted unless necessary, and
       if it is very valuable yet in quite bad shape, perhaps it should be
       dumped before trying surgery on it.  This is an area where experience
       and informed courage count for much.

       The problems reported by icheck typically fall into two kinds.  There
       can be problems with the free list: duplicates in the free list, or
       free blocks also in files.  These can be cured easily with an icheck
       -s.  If the same block appears in more than one file or if a file
       contains bad blocks, the files should be deleted, and the free list
       reconstructed.  The best way to delete such a file is to use clri(1),
       then remove its directory entries.  If any of the affected files is
       really precious, you can try to copy it to another device first.

       Dcheck may report files which have more directory entries than links.
       Such situations are potentially dangerous; clri discusses a special
       case of the problem.  All the directory entries for the file should be
       removed.  If on the other hand there are more links than directory
       entries, there is no danger of spreading infection, but merely some
       disk space that is lost for use.  It is sufficient to copy the file (if
       it has any entries and is useful) then use clri on its inode and remove
       any directory entries that do exist.

       Finally, there may be inodes reported by dcheck that have 0 links and 0
       entries.  These occur on the root device when the system is stopped
       with pipes open, and on other file systems when the system stops with
       files that have been deleted while still open.  A clri will free the
       inode, and an icheck -s will recover any missing blocks.
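       
       The three dcheck cases above can be summarized as a small decision
       function.  This is a sketch in Python, chosen here only for
       illustration; the function name dcheck_advice and the wording of the
       returned strings are hypothetical, and the link and entry counts are
       assumed to have been read off the dcheck report by hand.

```python
def dcheck_advice(links, entries):
    """Map an inode's link count and directory-entry count to the
    repair suggested in the text above."""
    if links == 0 and entries == 0:
        # Open pipes or deleted-but-open files at the time of the stop.
        return "clri the inode, then icheck -s to recover missing blocks"
    if entries > links:
        # The potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # Only disk space is lost.
        return ("copy the file if useful, clri its inode, "
                "remove any directory entries that exist")
    return "consistent: nothing to do"
```

       For example, dcheck_advice(1, 2) selects the dangerous case and
       dcheck_advice(2, 1) the lost-space case.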
       
       Why did it crash?  UNIX types a message on the console typewriter when
       it voluntarily crashes.  Here is the current list of such messages,
       with enough information to provide a hope at least of the remedy.  The
       message has the form `panic: ...', possibly accompanied by other
       information.  Left unstated in all cases is the possibility that
       hardware or software error produced the message in some unexpected way.

       blkdev
            The getblk routine was called with a nonexistent major device as
            argument.  Definitely hardware or software error.

       devtab
            Null device table entry for the major device used as argument to
            getblk.  Definitely hardware or software error.

       iinit
            An I/O error reading the super-block for the root file system
            during initialization.

       out of inodes
            A mounted file system has no more i-nodes when creating a file.
            Sorry, the device isn't available; the icheck should tell you.

       no fs
            A device has disappeared from the mounted-device table.
            Definitely hardware or software error.

       no imt
            Like `no fs', but produced elsewhere.

       no inodes
            The in-core inode table is full.  Try increasing NINODE in
            param.h.  Shouldn't be a panic, just a user error.

       no clock
            During initialization, neither the line nor programmable clock was
            found to exist.

       swap error
            An unrecoverable I/O error during a swap.  Really shouldn't be a
            panic, but it is hard to fix.

       unlink - iget
            The directory containing a file being deleted can't be found.
            Hardware or software.

       out of swap space
            A program needs to be swapped out, and there is no more swap
            space.  It has to be increased.  This really shouldn't be a panic,
            but there is no easy fix.

       out of text
            A pure procedure program is being executed, and the table for such
            things is full.  This shouldn't be a panic.

       trap
            An unexpected trap has occurred within the system.  This is
            accompanied by three numbers: a `ka6', which is the contents of
            the segmentation register for the area in which the system's stack
            is kept; `aps', which is the location where the hardware stored
            the program status word during the trap; and a `trap type' which
            encodes which trap occurred.  The trap types (in octal) are:

            0         bus error
            1         illegal instruction
            2         BPT/trace
            3         IOT
            4         power fail
            5         EMT
            6         recursive system call (TRAP instruction)
            7         11/70 cache parity, or programmed interrupt
            10        floating point trap
            11        segmentation violation

       In some of these cases it is possible for octal 20 to be added into the
       trap type; this indicates that the processor was in user mode when the
       trap occurred.  If you wish to examine the stack after such a trap,
       either dump the system, or use the console switches to examine core;
       the required address mapping is described below.
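       
       Decoding a trap type by hand means separating the octal 20 flag from
       the type proper.  The sketch below does this in Python, used here only
       for illustration; the names TRAP_NAMES, USER_MODE and decode_trap are
       hypothetical, and the table is copied from the list above.

```python
# Trap types from the table above; keys are written in octal.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(text):
    """Decode a trap type as printed (an octal string such as "31").

    Returns (name, user_mode).  Unknown types yield the name "unknown".
    """
    code = int(text, 8)
    user_mode = bool(code & USER_MODE)
    name = TRAP_NAMES.get(code & ~USER_MODE, "unknown")
    return name, user_mode
```

       Since every listed type is below octal 20, masking the flag off is
       equivalent to subtracting it; decode_trap("31") gives a segmentation
       violation taken in user mode.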
       
       Interpreting dumps.  All file system problems should be taken care of
       before attempting to look at dumps.  The dump should be read into the
       file /usr/sys/core; cp(1) will do.  At this point, you should execute
       ps -alxk and who to print the process table and the users who were on
       at the time of the crash.  You should dump (od(1)) the first 30 bytes
       of /usr/sys/core.  Starting at location 4, the registers R0, R1, R2,
       R3, R4, R5, SP and KDSA6 (KISA6 for 11/40s) are stored.  If the dump
       had to be restarted, R0 will not be correct.  Next, take the value of
       KA6 (location 022(8) in the dump) multiplied by 0100(8) and dump
       01000(8) bytes starting from there.  This is the per-process data
       associated with the process running at the time of the crash.  Relabel
       the addresses 140000 to 141776.  R5 is C's frame or display pointer.
       Stored at (R5) is the old R5 pointing to the previous stack frame.  At
       (R5)+2 is the saved PC of the calling procedure.  Trace this calling
       chain until you obtain an R5 value of 141756, which is where the user's
       R5 is stored.  If the chain is broken, you have to look for a plausible
       R5, PC pair and continue from there.  Each PC should be looked up in
       the system's name list using adb(1) and its `:' command, to get a
       reverse calling order.  In most cases this procedure will give an idea
       of what is wrong.  A more complete discussion of system debugging is
       impossible here.
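       
       The arithmetic of the procedure above can be sketched as follows.
       This is an illustration in Python, not a tool that exists on the
       system; the function and constant names are hypothetical.  It assumes
       the dump is held in memory as bytes, that words are stored low byte
       first as on the PDP-11, and that the register at location 022 is the
       KA6 value.  It only collects the saved PCs; looking them up in the
       name list is still left to adb(1).

```python
import struct

U_LOW = 0o140000         # first relabelled address of the per-process area
U_HIGH = 0o141776        # last relabelled word address given in the text
USER_R5_SLOT = 0o141756  # R5 value that ends the calling chain
REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")


def read_word(core, offset):
    """Read one 16-bit word, low byte first."""
    return struct.unpack_from("<H", core, offset)[0]


def saved_registers(core):
    """Registers stored from location 4 onward; KA6 lands at 022 octal."""
    return {name: read_word(core, 4 + 2 * i)
            for i, name in enumerate(REG_NAMES)}


def walk_frames(core, r5, ka6, limit=64):
    """Follow the R5 chain and return the saved PCs, innermost first.

    Stops at USER_R5_SLOT, at a pointer outside the relabelled area
    (a broken chain), or after `limit` frames as a guard against loops.
    """
    base = ka6 * 0o100   # dump offset of the per-process data
    pcs = []
    while r5 != USER_R5_SLOT and len(pcs) < limit:
        if not (U_LOW <= r5 and r5 + 2 <= U_HIGH):
            break
        offset = base + (r5 - U_LOW)
        if offset + 4 > len(core):
            break
        pcs.append(read_word(core, offset + 2))  # saved PC at (R5)+2
        r5 = read_word(core, offset)             # old R5 at (R5)
    return pcs
```

       A typical use would be regs = saved_registers(core) followed by
       walk_frames(core, regs["R5"], regs["KA6"]).  If the walk stops early,
       the chain is broken and a plausible R5, PC pair has to be found by
       eye, as described above.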
