In-Depth
Tales from the Trenches: Critical State
Sometimes you need to troubleshoot the people before the technology.
While at a client site one day, an alpha page came in
from a natural gas trading firm we support: “John, our
NT server running our Oracle 8.0 database and supporting
our trading floor is down. It’s stopped at a blue-screen,
and on every reboot, it comes back to the same blue screen.
It says something about an inaccessible boot device. Call
me.”
I got on the phone immediately, knowing that for every
hour down, the company was losing many thousands of dollars.
After talking to the customer for a couple of minutes,
I determined this wasn’t something we could fix over the
phone. I immediately excused myself from the current job
and headed over to the trading firm.
The server I was going to work on was a Hewlett-Packard
NetServer with dual Pentium Pro processors and 256M of
RAM. Additionally, it had four 4G SCSI hard drives, in
a RAID 5 configuration, with a hardware RAID controller.
A powerful advantage of a hardware RAID controller is that it not only gives you a fault-tolerant system but also lets you work with system partitions from Windows NT's setup routine. If you use software fault tolerance, such as the mirroring that comes with NT, you can't work with system partitions in the setup routine without breaking the fault tolerance.
The Sordid Details
The system partition was formatted with NTFS, so I knew
we couldn’t boot from a DOS disk to examine the files
on the hard disk. The server also had a 12/24G tape backup
running Computer Associates’ ARCserve backup software,
so I was hoping we had a good backup to restore with,
should that be necessary.
The server's state was critical when I arrived: it sat at the "blue screen of death," displaying a STOP 0x0000007b "Inaccessible Boot Device" error. I decided to do a parallel
install of NT Server, allowing me to examine the integrity
of the start-up environment. Before beginning the parallel
installation, I gathered the proper driver for the RAID
controller. The moment the setup started, I began pressing
the F6 key. This allowed me to install the RAID driver
and have NT recognize the controller on restart. This
avoided another instance of the “Inaccessible Boot Device”
message. While inspecting the startup environment, I found
the boot.ini file to be pointing to the wrong disk. A
normal boot.ini file looks like this:
[boot loader]
timeout=10
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00 [VGA mode]" /basevideo /sos
Ours was like this:
[boot loader]
timeout=10
default=multi(0)disk(4)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(4)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00"
multi(0)disk(4)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00 [VGA mode]" /basevideo /sos
Notice the difference in the disk callout on the third line? I quickly corrected the disk callouts in the boot.ini file and restarted the server. The server immediately booted into the existing operating system!
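If you ever have to make the same fix yourself, remember that boot.ini is a read-only, hidden system file in the root of the boot partition, so you must clear its attributes before you can edit it. A minimal sketch from a command prompt, assuming the boot partition is C:, looks like this:
attrib -r -s -h C:\boot.ini
notepad C:\boot.ini
attrib +r +s +h C:\boot.ini
Correct the disk callout in the default= line and in each [operating systems] entry, save the file, and restore the attributes before restarting.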
We celebrated briefly and continued into the OS. Then we had some services fail and saw several shared folders that no longer existed on the server. It turned out the customer had tried an emergency repair on the server, restored a registry from an old emergency repair disk, and hadn't told me!
Our only hope at this point was to extract an updated
catalog from the backup tape and restore our server, including
the registries. It turned out we had a good backup from
the previous night.
The restore operation reported success, so we restarted
the server and held our breath yet again. The server came
back up! We rechecked the integrity of the restoration
and boot.ini file. Everything looked good so far. We then
tested the integrity of the Oracle database. The clients
were able to attach to it successfully! We took the clients
back off-line, installed the OS service packs, and reapplied
the Y2K updates. Again, we had the server restart. Everything
started successfully, and clients were able to attach
to the database.
Time is of the Essence
Things could have gone differently. If my customer had
had an up-to-date emergency repair disk or hadn’t restored
a registry during an emergency repair, we could have repaired
the boot.ini file and had the system up faster. Every time you change your disks or partitions or upgrade the service pack on your server, be sure to update your emergency repair disk.
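On NT, refreshing that repair information is a one-line job. A minimal sketch, assuming you run it at a command prompt on the server itself:
rdisk /s
The /s switch saves the complete registry, including the SAM and Security hives, to the %systemroot%\repair directory and then offers to create or update the emergency repair disk.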
Additionally, if you do need to do an emergency repair
of a server, you should only restore a registry from your
emergency repair disk as a last resort. It turned out
the problem with the boot.ini file had been created when
the customer had moved some partitions around to better
use the RAID for the Oracle server.
Next, be sure to communicate all of the details of any repair attempts to every member of the team attacking the problem. Some attempts to repair a downed server aren't directed at the right source of the problem. When I'm working with our junior network people, I have to remind them constantly that I need all of the data to formulate a plan.
Last, always have a good backup, and monitor and audit
it regularly. Take it from me: There’s no worse feeling
than not having a good backup as your last line of defense.
About the Author
John T. Kruizenga, MCSE, has worked with computers and
networking since 1988. He has designed and managed networks
that incorporate VoIP and QoS, remote management, WAN
integration, collaborative software, and Web integration.