Tales from the Trenches: All Through the Night
Wherein our hero spends the wee hours staring down servers, rustling up disks, reaching out for help, and pondering his own stupidity.
It was one of those straightforward “I’ve-done-this-a-million-times-before/should-only-take-a-couple-of-hours”
kinds of projects. The task at hand: Install industry-standard
backup software (ARCserveIT) onto industry-standard servers
(Compaq ProLiants) running an industry-standard operating
system (Windows NT 4.0, SP 5) equipped with industry-standard
tape drives (Exabyte DATs). I had it all figured out.
At about 8 p.m. on a Friday evening, I’d head on over
to the co-location facility where our e-commerce servers
were housed, pop in the ARCserveIT CD, and make a few
mouse clicks. Next, I’d make a test backup, configure
the backup schedules, and bring our production Web site
back up. I told my wife I’d be home by 11.
When I finally walked out of the data center the next
morning, almost 12 hours later, and got in my car to drive
home, I reflected on the night’s events.
8 p.m., Friday Evening
Sitting in my office, I go over the game plan. Six NT
servers run the company’s Internet e-commerce site. Three
of them have internal SCSI tape drives and are currently
running NT Backup. I have three ARCserveIT Enterprise
licenses, along with a few open file and SQL Server agents.
Management has agreed to having our 24x7 site down for
a few hours so I can install and configure ARCserveIT.
I collect the license certificates, grab the ARCserveIT
CD, and head out the door for the co-location facility.
8:45 p.m.
After proving to the data center tech that I’m authorized
to access our equipment, I’m escorted to our server rack.
I’ve been here before, but just to take stock of the servers.
Tonight will be the first time I actually get my fingernails
dirty. I’m pleased that my predecessor, who was primarily
a programmer who also wore the hat of a network administrator,
chose to use Compaq servers. But as I struggle to connect
the monitor, keyboard, and mouse cables to the keyboard/video/mouse
switch (KVM) in the rack, I start to wonder if he really
knew what he was doing. After finally getting everything
connected, I redirect our dot-com address to point to
a maintenance page on another server and get to work.
9:30 p.m.
Because I want to minimize the down time of our primary
Web server, I decide to install to it first. I pop in
ARCserveIT and launch the setup program. I accept most
of the default options, and after just a few minutes the
system is restarting. I call my wife and tell her that
I’m only 10 minutes behind schedule—and that I’ll see
her in a couple of hours.
While it restarts, I start the same installation on a second
server. Once the file copy portion has started up there, I switch
back to the first server, which is still initializing its
RAID controller.
Suddenly, it offers up a lovely rendition of the infamous
BSOD—the Blue Screen of Death! A fatal stop error, involving
Compaq’s SCSI controller driver. As I scramble for a piece
of paper and something to write with so I can record the
particulars of the error, the system reboots itself. That’s
when I remember that the Automatic Server Recovery (ASR)
was set to restart on errors like this.
I watch the screen as the system restarts. It gets to
the RAID initialization again and sure enough, the BSOD
returns. It now starts its memory dump. Thankfully, it
takes a good 45 seconds or so to write the contents of
the 512MB of RAM installed on this system, giving me enough
time to jot down the significant error information.
11 p.m.
I pause for a moment to figure out how I’m going to fix
this. Let’s see, what’s the textbook way to approach this
kind of problem? Oh, yeah, the Emergency Repair Disk (ERD)!
Only one slight problem. There is no ERD. But, hey, I’m
an MCSE—I can fix this.
That’s when it hits me. One of the other servers I’m
installing ARCserveIT onto has the same configuration
as the one that’s crashed. I’ll just make an ERD from
that machine, then use it on the first one. I track down
the data center tech and ask her if I can borrow a floppy
disk. We both forage through desk drawers and file cabinets
and eventually locate a precious disk.
As I eject the newly created ERD, the bubble of elation
surrounding me bursts—I realize that I’m also going to
need an NT Server CD to launch setup, which runs the
repair utility that reads the ERD and fixes my server.
Another panicked call to the data center tech!
12:15 a.m., Saturday
“Hi, this is Kevin again. You wouldn’t happen to have
an NT Server CD I could borrow?”
“Wow, um, I can call the guys upstairs and see if they
know where one is. I’ll get back to you in a bit.” The
guys upstairs are the CCIE types—router gurus who keep
the backbone of this co-location facility up and running.
I’m sure they won’t know anything about an NT Server CD.
But 20 minutes later the tech comes in with a CD. She
says the only reason they had one on hand is that another
customer of theirs accidentally left it behind earlier
that day after installing some new systems.
I pop the CD in, start up the system, and wait for that
RAID controller to initialize. The bubble of elation begins
to form again. It gets to the point where it prompts for
the ERD. I put it in and restore the Compaq SCSI driver.
Just like clockwork.
Another restart, another round of waiting for that RAID
controller to initialize, and another BSOD! Another burst
bubble.
2 a.m.
I then realize that the best tool available to me at
this point is the telephone. Time to call tech support.
But which tech support? Computer Associates for ARCserveIT
support? Compaq? Microsoft? As I think about this, I call
my wife to let her know it’s going to be a late one. Funny—she
doesn’t seem the least bit surprised.
I decide to call ARCserveIT support first. I mean, everything
worked before I installed that software, right? It takes
me at least 30 minutes to find the correct number to reach
after-hours tech support. Once I finally have the number,
I’m pleasantly surprised at how quickly I get a live voice
on the other end. After asking me all the usual stuff
about software version, serial number, platform and more,
the support rep asks for a callback number. The way it
works, she explains, is that an on-call support engineer
will be paged with my callback number. I ask her how long
it takes them to respond. Her reply: Anywhere from 10
minutes to two hours. I wait about half an hour, then
decide to bark up a different tree.
4 a.m.
Now it’s Compaq’s turn. Maybe they can help me bring
my server back to life. In less than 10 minutes I’m on
the phone with an actual support engineer. Way to go,
Compaq! I describe the situation and, after a pregnant
pause, I’m put on hold while the techie does some research.
Ten minutes and she’s back. She wants me to run through
the startup routine, describing each step as it occurs.
Oh, boy! I get to wait while that RAID controller initializes
yet another time. Of course, the BSOD appears, and I read
off the pertinent information.
More holding time while she researches again. This time
it’s only five minutes. She wants me to describe the system
configuration again, especially how the various SCSI devices
are set up. There’s the built-in SCSI controller, to which
the tape drive is connected, and there’s the Smart 2DH
RAID controller. Connected to the RAID controller are
three 9GB hard drives, configured as one logical drive
with two partitions. The CD drive is connected to the
built-in IDE controller.
Potential good news: She’s fairly confident that if we
disable the tape drive, we should be able to get past
the BSOD. I remind her that it’s an internal tape drive,
and that the server is stuffed into a rack without sliding
rack rails. It could take hours to get to the drive. No
problem, she says. Just go into the server configuration
utility and disable the SCSI controller. Sounds like a
plan!
I reboot the server and press F10 to access the system
utilities. What’s this? No system partition can be found!
Did the programmer-cum-network administrator not use SmartStart
to configure these servers? I check the other servers
and, sure enough, none of them has a Compaq system partition.
And, of course, I don’t happen to have a SmartStart CD
with me either. I’m starting to get really mad at the
programmer who set these systems up in the first place.
5 a.m.
I ask the Compaq engineer to hold on a moment while I
download the system configuration utility from Compaq’s
Web site. I switch the KVM to another server, get the
file, and then start the process to create the floppy.
But it actually needs several floppies (four to be exact).
All I have is the one we managed to scrounge earlier in
the evening. And from that earlier search I know we won’t
find any more.
I come up with a plan. After the first floppy is finished
I eject it, then switch back to the dead server. I boot
from the floppy, and when it prompts for disk number 2,
I switch back to the other server, then use that same
floppy over again for the second disk. I continue this
process for disks 3 and 4, and finally the utility is
up and running on the dead server.
I disable the SCSI controller and reboot the server.
There’s that RAID initialization again…and no BSOD! The
GUI loads, and shortly I’m greeted with the logon screen.
As I log in, the system advises me that at least one service
has failed to load. I check the event logs and see that
the tape driver hasn’t loaded, along with all the ARCserveIT
services. That’s fine; that’s what’s supposed to happen
with the disabled SCSI controller.
6:15 a.m.
Now we can attempt to rectify the root problem. Just
then, the ARCserveIT tech calls, finally responding to
the page of four hours ago. I have him call back on the
land line, then conference him in with the Compaq engineer.
We bring him up to date and discuss the next steps.
The ARCserveIT guy wants to know if we’ve tried the generic
SCSI driver instead of the Compaq version. The Compaq
tech reluctantly informs us that the on-board SCSI controller
is an LSI Logic Symbios board and, yes, there is a standard
driver available, but not from Compaq. I go to the LSI
Logic Web site and download the generic driver.
I try installing the driver on another server configured
like the one that crashed—same model ProLiant, same OS,
same tape drive. Works! Then I install ARCserveIT. Still
works! Of course, Compaq Insight Manager can’t get all
the information from the controller now; to top it off,
the Compaq tech tells me her company doesn’t support this
driver.
So we’ve narrowed the problem down to the particular
combination of the Compaq SCSI controller (with the Compaq
driver), ARCserveIT, and the Exabyte Digital Audio Tape
drive—they just don’t work together. I uninstall ARCserveIT
on both machines, reinstall the Compaq SCSI driver, and
bring the Web site back up, knowing that I’ll have to
come back some other time to enable the SCSI controller
again (this time with a SmartStart CD!).
7:30 a.m.
I call my wife and tell her I’m on my way home. On the
following Monday, I sit in the weekly development meeting
explaining what happened over the weekend. After I hold
the group spellbound for 20 minutes, the product VP asks,
“What have you learned from all this?”
“Well,” I reply, “two basic truths: Don’t leave home
without installation software and blank media, and programmers
make lousy network admins!”