In-Depth
Survive Chaos
How one company applies its understanding of the psychological issues surrounding troubleshooting to make it an efficient and painless process.
Every engineer’s worst nightmare goes something like
this: You arrive at work early Monday morning only to
discover no fewer than three Post-its and five voicemail
messages indicating that nobody can access email. To make
matters worse, tape backups failed over the weekend, and
you’ve just moved into a new physical location, complete
with new server equipment in a new and unfamiliar ATM/WAN
environment. On top of it all, the two weeks that management
promised for equipment burn-in never materialized.
System logs show the failure sequence, but after looking
up the error references in TechNet, you’re no closer to
determining the cause of the failure. In the first hour,
you’ve narrowed the problem down to five possibilities:
hardware failure (SCSI, HDD, NIC); internal Exchange database
corruption; third-party applications; a WAN-related transaction
or infrastructure issue; or any combination of the above.
Another Methodical Approach: For another perspective
on a methodical approach to your work,
read Thomas Eck’s article “White-Coat
Computer Science.”
Fortunately, you have two engineers available to work
on the problem. Management has the good sense to give
you up to 10 days to resolve the problem, and pledges
to keep users off your back as you work hard to fix this
thing. (This is quite rare—usually it’s hours, not days.)
You also get authorization to contact Microsoft’s Premier
Support for help.
True Story
The scenario I just described recently occurred on a
client site I was supporting. I consider myself a good
troubleshooter, as does the other engineer who worked
on the problem. But it took us over 200 hours to resolve
the problem. Why so long, you ask? Several reasons, but
primarily the following:
- The problem was the result of a combination of a
faulty SCSI cable, a corrupted database due to IMS site
connector problems, and the intrusion of a third-party
application (InocuLan 4.5).
- The mental and physical fatigue of “round-the-clock”
efforts.
- The need for 24-hour wait cycles to certify resolution
(representing 25 percent of total hours).
- The added complication of operating in a new network
infrastructure (ATM), with new server hardware and
connectivity requirements.
As a result of our ordeal, I decided to share some of
the techniques we used to survive and work through these
issues. I’ll skip the obvious: the importance of system
logs and monitoring and TechNet KnowledgeBase queries—these
are basic engineering tools that anyone reading MCP
Magazine already uses.
Determine Your Resources Before Troubleshooting the Problem
In our case we had two Windows NT engineers, sympathetic
management, TechNet, Internet access to newsgroups, Microsoft’s
KnowledgeBase, and Microsoft Technical Support’s connectivity
engineers. If you have four or more technically competent
engineers, you can split into two teams. With only one team
working on the problem at a time, you avoid burning out
everyone at once from mental stress and fatigue. You can
also break your troubleshooting into stages, using the
second team to escalate while the first team recovers.
Be Wary of the Clock
It’s vital that you keep track of the time you spend
at every stage of the problem. Signs of fatigue show up
like this:
- You notice that you become more willing to try things
that have higher and higher risk, or you become more
risk-averse as your confidence in solving the problem
erodes.
- You become careless, running utilities at the wrong
time and/or with the wrong switches (potentially fatal).
- You become impatient.
- You begin to lose the ability to do basic mathematical
calculations, reasoning, and logic.
- You experience difficulty remembering locations of
data, utilities, the next or last step taken, etc.
- You begin to go in “circles,” revisiting previously
attempted efforts.
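The time tracking recommended at the start of this section doesn't need to be elaborate. What follows is a minimal sketch in Python, assuming you record each stage by hand as you enter it. The class name `StageClock` and its methods are hypothetical, invented here for illustration.

```python
from datetime import datetime, timedelta


class StageClock:
    """Record how long each troubleshooting stage takes (illustrative only)."""

    def __init__(self):
        self.totals = {}      # stage name -> accumulated timedelta
        self.current = None   # (stage name, start time), or None if idle

    def start(self, stage, now=None):
        now = now or datetime.now()
        self.stop(now)        # close any stage that is still running
        self.current = (stage, now)

    def stop(self, now=None):
        if self.current is None:
            return
        stage, started = self.current
        elapsed = (now or datetime.now()) - started
        self.totals[stage] = self.totals.get(stage, timedelta()) + elapsed
        self.current = None

    def hours(self, stage):
        return self.totals.get(stage, timedelta()).total_seconds() / 3600

    def total_hours(self):
        return sum(t.total_seconds() for t in self.totals.values()) / 3600
```

Calling `start()` with a new stage name closes the previous one, so the running total is always available when you need to decide whether it's time to hand off or rest.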
As a general rule, the more time that passes before a
solution is discovered, the higher the probability that
the problem won’t be resolved. In the scenario previously
described, we changed many of the variables involved with
the problem: service pack levels, physical disk arrays,
and SCSI cabling. We also made modifications specific
to NT, Exchange, InocuLan, and ArcServe.
These steps were necessary because we couldn’t
isolate a single cause for the problem.
Our approach involved verifying the integrity of each
component to eliminate as many components as possible
from our list of possible causes. Unfortunately, we began
this process some 80 hours into the problem. Mental fatigue
had already begun to set in, making decision-making difficult
and logical thought nearly impossible. What saved us?
From the beginning we made and stuck to three critical
decisions, which follow.
Document, Document, Document!
This document isn’t meant for management; it’s your own
lifeline. This vital tool contains your checklist, lists
your results with dates and times, and answers
the questions “What have we done?” and “What do
we do next?”
If you begin to go long on troubleshooting the same problem
hour after hour, you’ll consider yourself both lucky and
a genius for having the foresight to create this document.
See “A Troubleshooting Documentation Sample” for what
we recorded.
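If you prefer to keep this document as a script instead of on paper, the following is a minimal sketch in Python of a checklist with timestamped results. It isn't tied to any particular tool, and every name in it (`TroubleshootingLog`, `add_step`, `check_off`, `note`) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TroubleshootingLog:
    """A checklist plus timestamped notes (illustrative only)."""
    steps: list = field(default_factory=list)    # [description, done] pairs
    entries: list = field(default_factory=list)  # (timestamp, text) tuples

    def add_step(self, description):
        self.steps.append([description, False])

    def check_off(self, description, when=None):
        for step in self.steps:
            if step[0] == description:
                step[1] = True
                self.note(f"Done: {description}", when)
                return
        raise KeyError(description)

    def note(self, text, when=None):
        self.entries.append((when or datetime.now(), text))

    def done(self):
        """Answers 'What have we done?'"""
        return [d for d, finished in self.steps if finished]

    def next_step(self):
        """Answers 'What do we do next?'"""
        for d, finished in self.steps:
            if not finished:
                return d
        return None

    def render(self):
        lines = [f"[{'x' if finished else ' '}] {d}" for d, finished in self.steps]
        lines += [f"{ts:%m/%d %I:%M%p}: {text}" for ts, text in self.entries]
        return "\n".join(lines)
```

The two query methods map directly onto the two questions the document has to answer; `render()` produces a checklist followed by the dated notes.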
Decide Who’ll Drive and Who’ll Navigate
This is a vital step, no matter how involved the problem
is or how many IT resources you have to throw at the problem.
Only one person should be “driving,” or executing console/command
line parameters, launching utilities, and so on. The so-called
navigator makes sure each agreed-upon step is executed,
maintains the document as you go, and takes the conservative
position during risky phases. In general, you want the
more aggressive of the two engineers driving the process.
But remember, if there’s no mutual trust, competence,
or respect, these roles won’t work.
For example, the driver on our team determined that
ISINTEG (the Exchange information store integrity utility)
had run without reporting errors.
The next agreed-upon step was to do an online backup of
the “clean” database before re-creating the site connector.
The driver asked if we should go ahead and create the
connector, but the navigator said no and forced us to
stick with the planned step of doing the tape backup.
Although both driver and navigator are responsible for
keeping the team on course with all established strategies,
the burden usually falls on the navigator—and rightly
so. If you’re succeeding in your troubleshooting, there’s
a tendency to rush to the end steps. Remember, seeing
progress doesn’t mean the issue is resolved. Resist taking
shortcuts, and execute your plan to the end. It’s the
only way to be certain.
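The navigator's job in the example above, refusing any step that isn't the next agreed-upon one, can be sketched in a few lines of Python. This is an illustration only: the `Plan` class and the step names are hypothetical and simply restate the backup-before-connector order from the example.

```python
class Plan:
    """Ordered list of agreed-upon steps; only the next one may be executed."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.position = 0  # index of the next step to execute

    def next_step(self):
        if self.position < len(self.steps):
            return self.steps[self.position]
        return None

    def approve(self, requested):
        """True only if the driver asks for the step that is actually next."""
        return requested == self.next_step()

    def complete(self, step):
        if not self.approve(step):
            raise ValueError(f"Out of order: next agreed step is {self.next_step()!r}")
        self.position += 1


plan = Plan([
    "Run ISINTEG",
    "Online backup of clean database",
    "Re-create site connector",
])
plan.complete("Run ISINTEG")
print(plan.approve("Re-create site connector"))         # False: the backup comes first
print(plan.approve("Online backup of clean database"))  # True
```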
A Troubleshooting Documentation Sample

Phase III Troubleshooting
[x] Removed IMS site connector.
[x] Discussed pros/cons of X.400 vs. MTA site connector. X.400 was
    ruled out, since using it would require Exchange to do data
    conversion to and from other Exchange sites.
[x] Replication is scheduled at 3:30pm and 4:30pm.
[x] IS Maintenance scheduled to run from noon to 4:00pm.
[x] MTA site connector built; replication objects were received
    successfully.
[x] Exchange users were removed from Exchange server access.
[ ] Applied SP3 hotfix (consulting w/Microsoft as to hotfix vs. SP4).
[x] ISINTEG -test alltests run against pub database.
[x] ISINTEG -test alltests run against priv database.
[x] Online backup.
[x] All MSExchng services started (to run w/out Inoculan enabled).

Phase IV Troubleshooting
[x] Scenario A - If no errors occur:
    Hardware is eliminated as a cause.
    Inoculan could still be a cause.
    IMS connector could still be a cause.
    [ ] Inoculan is started inbound/outbound.
        If no errors occur: IMS connector is probably the cause.
        If errors occur: Inoculan is probably the cause.
[ ] Scenario B - If errors occur:
    Hardware could still be a cause.
    Problem is independent of Inoculan.
    Problem is possibly outside of Exchange (Infrastructure).

5/11
8:46am: No errors occurred overnight. ISINTEG verify on the private
    and public databases indicated no errors or warnings. ESEUTIL
    verify on the DS indicated no errors or warnings.
8:50am: Started Inoculan and stressed Exchange through forced DS
    and IS maintenance. Anticipated runtime 2-3 hours.
9:15am: DS replication occurs. Event successful.
9:30am: IS maintenance request occurs. Event successful.
10:00am: Running Exchange Optimizer. Services shut down cleanly. 15min.
10:15am: Performed on-line backup.
10:30am: Created IMS site connector; checked integrity.
11:45am: If no errors have occurred, starting IMS site connector, and
    allowing users to work in email. (Exchange is at 100%
    functionality.) If no errors occur, problem will have been
    isolated to the IMS site connector.
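The Phase IV reasoning in the sample is a small decision tree, and writing it out as code makes the branches easy to check. The following Python sketch only restates the scenarios from the sample; the function name `probable_cause` and its parameters are hypothetical.

```python
def probable_cause(errors_without_inoculan, errors_with_inoculan=None):
    """Map Phase IV observations to the suspects named in the sample.

    errors_without_inoculan: did errors occur while Inoculan was disabled?
    errors_with_inoculan: did errors occur after Inoculan was started?
        None means that test has not been run yet.
    """
    if errors_without_inoculan:
        # Scenario B: the problem is independent of Inoculan.
        return ["hardware", "infrastructure outside Exchange"]

    # Scenario A: hardware is eliminated as a cause.
    if errors_with_inoculan is None:
        return ["Inoculan", "IMS connector"]
    if errors_with_inoculan:
        return ["Inoculan"]
    return ["IMS connector"]


# A clean run without Inoculan, then a clean run with it started,
# leaves the IMS connector as the probable cause.
print(probable_cause(False, False))  # ['IMS connector']
```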
Resist Changing More than One Variable or Condition at a Time
Your best chance of resolving a problem is within a static
environment, where the problem recurs in the same
sequence, under the same conditions, every time. Symptoms
don’t change on you, and you have a chance to attack the
issue logically and eventually resolve it.
The typical solution that vendors give their customers
is a patch, upgrade, or service pack. These fixes, when
applied intelligently, can be very powerful tools for
resolving many technical issues—but you should consider
the larger picture outside of the vendor’s product line.
You probably have very specific technical requirements
in your LAN/WAN environment, third-party or proprietary
applications, and infrastructure issues that may require
you to be at a certain hardware or software version or
patch level.
The time to apply a patch or other upgrade isn’t
during a crisis. Ideally, it should be done
in a test environment. But since we’re talking about the
real world, if you’re going to apply a patch during a
troubleshooting process, be sure you can reverse the process
in the event that it makes matters worse or has no effect
whatsoever.
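The two rules in this section, one change at a time and every change reversible, can be enforced mechanically. Below is a minimal sketch in Python, assuming each change can be expressed as a pair of actions: one that applies it and one that undoes it. The `ChangeControl` class is hypothetical, and the dictionary merely stands in for real server state.

```python
class ChangeControl:
    """Allow one open change at a time, each with a recorded way to undo it."""

    def __init__(self):
        self.open_change = None  # (description, rollback callable) or None
        self.history = []        # (description, outcome) tuples

    def apply(self, description, do, rollback):
        if self.open_change is not None:
            raise RuntimeError(
                f"Resolve {self.open_change[0]!r} before changing anything else")
        do()
        self.open_change = (description, rollback)

    def _close(self, outcome):
        if self.open_change is None:
            raise RuntimeError("No change is open")
        description, rollback = self.open_change
        if outcome == "reverted":
            rollback()
        self.history.append((description, outcome))
        self.open_change = None

    def keep(self):
        self._close("kept")

    def revert(self):
        self._close("reverted")


state = {"patch_level": "SP3"}  # stand-in for real server state
control = ChangeControl()
control.apply(
    "Apply hotfix",
    do=lambda: state.update(patch_level="SP3 + hotfix"),
    rollback=lambda: state.update(patch_level="SP3"),
)
control.revert()             # the hotfix had no effect, so back it out
print(state["patch_level"])  # SP3
```

Requiring a rollback action up front forces the question "Can we reverse this?" to be answered before the change is made, not after it has made matters worse.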
In the end, we spent more than 200 hours troubleshooting
these problems. The organization lost five business days’
worth of data and approximately $75,000 to $100,000 in
productivity. It could be argued that, had we been given the
two-week burn-in period that was originally requested,
this failure might never have reached the
production environment.
Troubleshooting isn’t an exact science. By applying solid,
logical approaches and establishing a clear plan of attack,
you can work through even the most severe issues.