In-Depth
MSRC: Emergency Response, Microsoft Style
When malware strikes Microsoft code, Stephen Toulouse and the MSRC team rush in to extinguish the fire.
By name alone, you'd expect the Microsoft Security Response Center (MSRC) to look like something out of a Gene Kranz memoir -- an amphitheater of workstations like those arrayed before the legendary NASA flight director. In fact, the MSRC is a benign-looking, oversized conference room buried in the heart of Building 27 on Microsoft's Redmond campus.
Stephen Toulouse sits at a long table in the MSRC, a bank of wide-screen flat panel displays behind him. An MSRC security program manager since 2002, Toulouse came to Redmond to help Microsoft establish a more robust response to the security events plaguing the software giant.
It's been an up and down ride.
The MSRC was established in 1998, around the time the CIH virus (also known as Chernobyl) started wiping out files on users' hard drives. A year later, the Melissa worm brought down networks across the globe. In rapid succession, attacks with names like VBS/Loveletter, Sircam, Code Red, Nimda, and Klez piled up.
As it turned out, none of these prepared Microsoft for the hard lessons it would learn at the end of January 2003.
Jan. 25, 2003
SQL Slammer
Toulouse will never forget the moment he first heard of SQL Slammer. It was a Saturday morning, and the freshly appointed MSRC manager was at a local auto shop, having a new stereo system installed in his Jeep.
"I'm at the shop and over the radio I hear: 'The Internet was taken down today by a worm affecting SQL Server,'" recalls Toulouse. "That was the first I heard of it."
A few moments later, Toulouse was racing toward Redmond, the interior of his Jeep still torn open from the half-finished installation. He would spend the next two weeks struggling to investigate and remediate a malware infection that completely overwhelmed his team.
"Our internal network was impacted," Toulouse says. "We had guys walking CDs over to microsoft.com servers to get things to the right places, because we had to rely on that rather than the network that night. It took close to two weeks to stabilize the situation."
Toulouse was tasked with cooking up a packaged update tool that would automatically let users know if their systems were vulnerable. The orders he was given that day were simple -- don't stop working, no matter what. "'Even if Bill Gates himself comes over and tells you to stop, you tell him to talk to me,'" Toulouse remembers being told.
Over the next six months, the MSRC would release four separate fixes for SQL Slammer. Toulouse singles out a few key lessons from that early challenge. Among them:
- The recovery effort must start from a central core of first responders.
- All key stakeholders must be brought together. "Get all the smart people in one room," says Toulouse. "Let's work together so everybody is really steeped in it."
- Updates must be packaged for automatic delivery and execution to ensure remediation.
Perhaps most important, Microsoft management realized there had to be a coherent, predictable and well-documented process. The initial response to Slammer was sloppy. Critical stakeholders were scattered across the Redmond campus. Managers scrambled to produce code updates. Staffers struggled to maintain communications and Internet access throughout the event.
Microsoft customers struggled as well. They had no idea what to expect from the MSRC in terms of guidance and communication. Those struggles led to a lot of soul searching at Microsoft.
After the event had passed, then-MSRC Director Mike Nash went on a months-long road tour, talking to customers about Slammer and learning what they needed for the next such event.
"I credit actually our customers with a lot of our response process," Toulouse says.
That process today is called the Software Security Incident Response Process, or SSIRP. SSIRP documents and codifies MSRC operations, replacing ad hoc improvisation with clearly defined roles and milestones, and it would quickly become the foundation of all MSRC response activities.
Says Toulouse: "Because at Microsoft we turn nouns into verbs, you hear, 'Are we SSIRPing?' or 'Have we SSIRPed?'"
Mike Reavey is the operations manager at the MSRC and the one who's responsible for managing Microsoft's monthly Patch Tuesday releases. He's the guy who helps pull the switch that causes a scheduled update to jump the tracks and be handled as a SSIRP event.
"If the train is on the track and is moving along, we know the product team and will pull them in," says Reavey, who describes an escalation that affected a patch designed to fix the CreateTextRange flaw in Internet Explorer. "We had an IE update in path, going through its weeks of testing. We see an issue that gets posted on one of the [hacker] lists. We see this. We alert to it. We actually knew about CreateTextRange and were working on it already. This was just a change in threat level."
Using processes evolved out of the panic of SQL Slammer, the MSRC today is able to pull in affected product teams and partners to assess the threat and respond. In the case of CreateTextRange, the patch was able to launch as scheduled, on Patch Tuesday, says Reavey.
Of course, not every flaw is so accommodating.
Aug. 11, 2003
Blaster
"2003. That year really marks a huge amount of information consumption, looking at best practices, and dealing with incidents and learning from them to create the processes we're using today," says Toulouse.
The Blaster worm was really the first test of the lessons learned from SQL Slammer six months before. Blaster tapped a flaw in Remote Procedure Call (RPC)-DCOM present in Windows XP and 2000, directing infected systems to flood Microsoft's Windows Update site with traffic.
"From mobilization to execution, we were able to move much more quickly than Slammer, in a much more disciplined way. We had several contingency plans and a number of things in place to blunt that attack. We had no interruption at all."
Toulouse credits the four-stage MSRC process, which follows the steps below (a rough sketch of the cycle in code appears after the list):
- Watch Phase: The MSRC constantly monitors mailing lists, newsgroups, MSN traffic and input from security researchers. Often, reports come in via the secure@microsoft.com e-mail, which MSRC staffers monitor constantly for hints of trouble.
- Alert Phase: The MSRC alerts product teams, security program managers and third parties such as the Global Infrastructure Alliance for Internet Safety (GIAIS) group of ISPs to help mobilize to a possible threat.
- Assess and Stabilize Phase: This is the process of judging the threat and crafting the remediation. A threat affecting very few users may be elevated to a SSIRP event if the payload is destructive enough, for example.
- Resolve Phase: The final phase includes the release of security bulletins, patch code, systems guidance, and other remediation content. Once the resolution is complete, the team returns to the watch phase, looking specifically for issues with or related to recently released updates or bulletins.
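Read end to end, the four phases form a loop that always returns to watching once a resolution ships. As a purely illustrative aid -- nothing here reflects Microsoft's actual tooling -- the cycle can be sketched in a few lines of Python, with the phase names taken straight from the list above:

    from enum import Enum, auto

    class Phase(Enum):
        """The four SSIRP phases, as described above."""
        WATCH = auto()    # monitor lists, newsgroups and secure@microsoft.com reports
        ALERT = auto()    # mobilize product teams, program managers and partners
        ASSESS = auto()   # judge the threat and craft the remediation
        RESOLVE = auto()  # ship bulletins, patches and guidance

    # Once a resolution ships, the team returns to watching.
    NEXT_PHASE = {
        Phase.WATCH: Phase.ALERT,
        Phase.ALERT: Phase.ASSESS,
        Phase.ASSESS: Phase.RESOLVE,
        Phase.RESOLVE: Phase.WATCH,
    }

    def advance(current: Phase) -> Phase:
        """Move an incident to the next phase of the cycle."""
        return NEXT_PHASE[current]

The only point the sketch is meant to make is the loop itself: resolve feeds straight back into watch, where the team looks for problems with the updates it just shipped.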
With Blaster, the MSRC significantly stepped up communications -- a key lesson from the Slammer event -- launching a series of webcasts and more detailed security bulletins. The effort would soon extend to e-mail alerts, RSS feeds and blogs, and would ultimately give rise to the formalized monthly updates known as Patch Tuesday.
"Five years ago I used to say we wrote the best bulletins no one ever read. And now, everyone reads the bulletins," says Christopher Budd, security program manager in the MSRC. "It's a mainstream thing. To meet that broader audience we've had to step up with broader communications."
Despite the successful response, the stakes were high. Blaster aimed to disrupt the Windows Update service, using a distributed denial of service (DDoS) attack to prevent Microsoft from pushing patches out to millions of PCs and servers. A botched implementation in the malware made it easy for Microsoft to sidestep the attack. Still, the vulnerability forced Microsoft to look closely at the behavior of its own software -- in this case, RPC-DCOM -- and ask some hard questions.
"Are you listening on the network? Why are you listening on the network? Do you need to be listening on the network?" asks Toulouse in rapid succession. "Are you anonymous? Why are you anonymous? Do you need to be anonymous? Blaster forced them fundamentally to rethink some assumptions."
Blaster motivated Microsoft to introduce a malware removal tool as part of its response. It was the first time Microsoft had taken such a step, and foretold broader solutions from Microsoft such as Microsoft AntiSpyware (now called Windows Defender) and Microsoft OneCare.
It also led to one other Microsoft innovation, says Toulouse. "Blaster was one of the key things in the decision to enable the firewall by default in [Windows] XP SP2."
April 30, 2004
Sasser
By the time the Sasser worm emerged, about eight months after Blaster, the MSRC was in full stride.
The group had moved into its current digs -- an expansive conference area outfitted with redundant communications, dedicated servers and workstations, and unfiltered connections to the Internet. Changes were also reaching far beyond the walls of the MSRC conference area.
"There are dedicated security program managers with product teams now. Their whole job is to work with
the MSRC," says Toulouse. "To me, these changes are really partly responsible for making the process work as efficiently as it does today."
In fact, it was this efficiency that helped Microsoft stave off the worst effects of the Sasser worm, when it struck on the last day of April 2004. Sasser was based on a known vulnerability that had been patched just two weeks earlier.
"We had the same things with Sasser as we did with Blaster," says Toulouse, "but they all occurred orders of magnitude sooner."
But Sasser also confirmed a troubling fact. When Microsoft releases a bulletin or patch, malware writers are watching. Closely.
"Actually creating the fix for a specific issue that comes in usually doesn't take that long," says Budd. "But then it widens. You fix the issue and then you fix surrounding or similar issues. We know that when we release a security update for an issue in component XYZ, that draws attention to that area."
That's exactly what happened with the April 13 patch, which was part of security bulletin MS04-011. It's widely believed Sasser was produced by reverse engineering the patch to locate the vulnerability. Anyone who had failed to deploy the MS04-011 patch was left in the crosshairs of the worm.
Making matters more difficult, patch coders must contend with almost outrageous complexity. "Ten versions of Windows, 27 different languages," says Budd. "That's 270 different Windows updates."
Testing that many permutations can take weeks, or even months. The MSRC works with Microsoft product teams to expedite and scale that testing using tightly automated scripts. But all it takes is a single failure to send the coders scrambling to fix the fix.
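The arithmetic behind Budd's numbers is simple multiplication, but it shows why testing dominates the schedule. The hypothetical sketch below enumerates such a matrix; the version and language labels are stand-ins, not Microsoft's real build list:

    from itertools import product

    # Hypothetical stand-ins -- not the actual Microsoft build matrix.
    windows_versions = [f"windows-sku-{n}" for n in range(1, 11)]  # 10 versions
    languages = [f"lang-{n:02d}" for n in range(1, 28)]            # 27 languages

    # Every (version, language) pair is a distinct update package to build and test.
    test_matrix = list(product(windows_versions, languages))
    print(len(test_matrix))  # 270 -- before counting service packs or architectures

Add service packs, architectures and deployment paths and the matrix multiplies again, which is why a single failed test pass is so costly.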
"When you look at the breadth of people running Windows and you look at the infinite software combinations, the law of large numbers starts to take affect," says Toulouse. "A million people -- that is still a big number no matter how you put it from a percentage standpoint. So now you're sunk. That's why the goal and the focus have to be around quality, and that takes time. There have been updates that have taken many test passes."
And even after release, the work is ongoing.
"There is also the post-release monitoring for customer issues," explains Reavey. "It honestly never ends … when you think about it."
Sasser also proved out the need for Microsoft's Security Development Lifecycle (SDL) program, which fundamentally changed the way code is written at Microsoft. Mike Howard, senior security program manager at Microsoft, says SDL is a critical foundation for secure systems.
"You can have all the established definitions you want -- encryption, firewalls -- and all it takes is a bad implementation or bug in the code, and all that was laid bare."
Howard, who co-authored the book "Writing Secure Code," says his group acts like an internal consulting organization, working with different product teams to deliver programmer training, specs, code review and testing, and other services.
Asked how big the change was for coders at Microsoft, Howard smiles. "Just a little."
The rigorous training and review -- including automated fuzz testing that helps find buffer overflow weaknesses -- has paid huge dividends. The number of security bulletins for SDL-enabled products like Windows Server 2003 and SQL Server 2005 is significantly lower than for earlier versions.
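Fuzz testing, at its simplest, feeds a parser thousands of randomly mutated inputs and watches for crashes that point to memory-safety bugs such as buffer overflows. The toy sketch below shows the idea against a deliberately buggy parser; it is only an illustration of the technique, not a reflection of the SDL's actual fuzzing tools:

    import random

    def toy_parser(data: bytes) -> None:
        """A deliberately buggy parser: it trusts a length field it shouldn't."""
        if len(data) < 2:
            return
        declared_len = data[0]   # attacker-controlled length byte
        payload = data[1:]
        for i in range(declared_len):
            _ = payload[i]       # IndexError here stands in for a buffer overrun

    def mutate(seed: bytes) -> bytes:
        """Flip one to three random bytes in a known-good sample."""
        data = bytearray(seed)
        for _ in range(random.randint(1, 3)):
            data[random.randrange(len(data))] = random.randrange(256)
        return bytes(data)

    if __name__ == "__main__":
        random.seed(0)
        seed = bytes([4]) + b"ABCD"   # well-formed: length byte matches payload
        for trial in range(1000):
            sample = mutate(seed)
            try:
                toy_parser(sample)
            except IndexError:
                print(f"trial {trial}: crash on input {sample!r}")
                break

In native code, the IndexError stand-in would be a memory corruption or access violation -- the class of defect automated fuzzing is meant to shake out before release.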
Dec. 27, 2005
WMF Zero-Day Exploit
SQL Slammer, Blaster and Sasser all shared a common thread: They exploited previously known flaws in Microsoft code -- flaws that had already been patched. The WMF Zero-Day exploit attacked from an unforeseen direction, infecting any system that so much as displayed a malformed WMF graphics file, whether in a Web browser, an e-mail message, or even the Windows image editing program.
Microsoft had no warning that the exploit was coming, and the sneak attack plunged the MSRC into brief disarray. The MSRC initially said a patch would be released on Patch Tuesday -- two weeks away -- then reversed direction and said a patch would come early. It arrived on Jan. 5, 2006, the Thursday before Patch Tuesday.
In fact, WMF had IT professionals clamoring for the bad old days, when Microsoft would release a patch as soon as it was ready, rather than on a predictable, monthly schedule.
Recalls Budd: "We would build the updates and write the bulletin, and when they were ready, we posted them. We heard from customers. The randomness of the process -- we were just throwing a hand grenade into their inbox."
But when Microsoft announced that a WMF fix would arrive on Patch Tuesday, the industry howled. Budd, however, says Microsoft moved the WMF fix forward ('out of band' in Microsoft parlance) when the code came together more quickly than expected.
"That was a case of where, due to the targeted nature of the fix and relatively esoteric nature of the functionality, we were able to … achieve confidence more quickly than we thought," says Budd.
The early release did little to stem criticism, which reached a crescendo in the days after Microsoft's initial pronouncement.
"We face a lot of opinion around timing. There's nobody more dedicated and more driven about getting these updates out than the MSRC," says Toulouse, who points to the bigger picture issue with patches. "We cannot introduce a new problem into customer systems. They'll distrust the updates -- they will not apply them."
It's a real concern. Yet the MSRC faced the issue -- for the first time -- of a third-party authored patch gaining the recommendation of respected security organizations like The SANS Institute. For Johannes Ullrich, CTO of the SANS Institute, the critical nature of the flaw left his organization little choice.
"The WMF thing -- it was bad, people were exploited," says Ullrich. "If the exploit is already known and out there I don't see harm in [releasing a beta patch]. Do it and at least be able to help people."
Toulouse says the MSRC was on top of the threat, releasing bulletins, blog posts and guidance to help customers sidestep the exploit in advance of a patch. Still, the WMF event revived some of the historical antagonisms between Microsoft and the security community.
"Our relationships with security researchers have not always been pleasant -- there were times when it was a little rocky," Toulouse admits. But he's quick to point out that the level of collaboration with researchers, hackers and others has improved dramatically over the years.
Ullrich agrees, though he looks for more progress going forward.
"The thing that struck me during the WMF episode was that they didn't
really seem to have the hacker mind. They approach it with kind of the attitude
of, 'as long as it's not yet done in the wild it doesn't exist,'" says
Ullrich. "There obviously seems to be quite a bit of confusion in their
organization when something like WMF comes out."
Managed Mayhem: Lessons from the MSRC
- Come Together: Whether it's crisis management or code writing or community relations, there is a consistent effort made to, as Toulouse put it, "put all the smartest people into one room." That effort has paid major dividends at the MSRC.
- Equip and Prepare: The MSRC isn't an elaborate setup, but it does come equipped with redundant Internet connections, ample communications, and its own fleet of servers and workstations.
- Know Who's Watching: When a bulletin or patch is released, Microsoft knows malware authors are watching. The MSRC limits the detail in security bulletins to avoid enabling an early attack, and watches for exploits built from previously published patch code.
- Get Cultured: It took a famous 2002 Gates memo -- and the eruption of the SQL Slammer exploit -- to change the culture at Microsoft. The result has been a remarkable transformation, leading to the development of programs like the Security Development Lifecycle (SDL).
- Seek Structure: Patch Tuesday changed everything for Microsoft and IT managers alike. By scheduling releases, Microsoft is better able to manage the process, while IT managers are better able to plan around it.
- Seek Advice: To help it deal with and anticipate future threats, Microsoft began sending reps to Black Hat hacker events to glean insights. Later, the company established the Blue Hat Conference -- an annual gathering of security professionals and hackers.
- Get Friendly: For years, Microsoft was known for its stormy relationship with security organizations, decrying criticisms of its software and offering an opaque window to researchers reporting flaws to the company. Today, Microsoft is more open and collaborative, even if friction still exists.
-- M.D.
A Whole New World
If the MSRC has done one thing since its inception, it's to impose order on a chaotic environment.
"We've eliminated as much of the surprise as we can," says Budd, singling out the bulletins that detail upcoming patch activity the Thursday before release. "We give them as much information as we can, for high-level planning, without jeopardizing security.
"The regularity lets us de-emergency-ify the process. In this arena, boring actually is a virtue. We want to make it as boring as possible. The regularity lets us make it as boring as possible."
But as the MSRC evolved, so did Microsoft. A company that once pushed deep application and OS integration at every turn is today obsessed with securing code and ensuring the integrity of programmatic links.
"If you had a developer at Microsoft 10 years ago, that developer was going 'cool feature, cool feature, cool feature,'" Toulouse says. "Now that developer is thinking, 'I've got a cool feature, used correctly it could do this. But now I have to consider what could happen if it's used another way.'"
The SDL program is the most dramatic symptom of this change. In fact, the effort has been so successful that malware writers are shifting to softer targets. Specifically, end users.
"You are probably going to see fewer and fewer Internet-wide attacks. I think what you're seeing now is a move from the operating system to the application layer, with really targeted attacks," says Toulouse. "What we're starting to see are more and more targeted attacks and more social engineering."
New challenges lie ahead. Zero-day exploits. Sophisticated phishing- and social engineering-based attacks. Toulouse has no doubt that security concerns and events will have him racing into Redmond in the middle of the night many more times.
"I can tell you how long it takes to get to this room from my home at three in the morning, hitting all the green lights," Toulouse laughs.
"In the end, it's a journey, not a destination. We will continue to make mistakes and we will continue to learn from our mistakes."