
Microsoft Hit by Huge Service Outage

Telemetry shows that this incident had an impact on about 300,000 calls.

This week's six-hour global Microsoft 365 outage was caused by a flawed Enterprise Configuration Service (ECS) deployment, according to a preliminary post-incident review. The deployment triggered cascading failures and degraded availability across multiple regions.

ECS is an internal central configuration repository that lets Microsoft services make targeted updates, such as tenant-specific or user-specific configurations, as well as broad dynamic changes that affect many services and features.

According to Microsoft, a recent deployment that featured a "broken link to an internal storage service" was the most likely reason for an outage that prevented many customers from accessing or using a variety of Microsoft 365 products for several hours.

The service issues began on the evening of Wednesday, July 20, and persisted into Thursday morning, degrading access to several Microsoft services, including Microsoft Teams, Exchange Server, Microsoft 365 admin center, Microsoft Word, and other Office programs. Microsoft Managed Desktop and other services were also unable to auto-patch because of the problem.

Overview of the outage

In its public Twitter statements, Microsoft did not say where the disruptions occurred. Judging by the replies to those statements, the Teams outage appears to have affected users in Los Angeles, Dallas, New York City, Hong Kong, and Eastern Australia.

Microsoft's cloud services are covered by a complex service level agreement. Under it, the only compensation an organization can receive for downtime is a service credit. The credit is not applied automatically, so affected organizations must request it.

"Telemetry shows that this incident had an impact on about 300,000 calls. Due to business hours falling inside the effect timeframe, the Asia Pacific (APAC) region was the most impacted. Direct Routing and Skype MFA were also significantly affected," the company explained.


What sparked the outage?

Ultimately, the incident affected users trying to use one or more Microsoft 365 apps and services, according to Bleeping Computer.

The botched Enterprise Configuration Service (ECS) deployment was the root cause of the outage, Redmond stated in its incident report. "Backward compatibility with services that use ECS was impacted by a deployment of the ECS service that had a code flaw. The end result was that it would send inaccurate configurations to all of its partners for services using ECS," the firm stated.

As a result, downstream services received a response with status code 200, suggesting that the pull had succeeded, but the body contained a malformed JSON object. The severity of the impact depended on how each Microsoft service handled the flawed configuration supplied by ECS. It ranged from services crashing outright, as Teams did, to little or no impact on others.
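A 200 status only says the transport succeeded, so a consumer still has to check the payload itself. The following is a minimal Python sketch of that idea, not Microsoft's implementation: the function name `parse_config` and the required keys `version` and `settings` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: keys a consumer might insist on before trusting a config.
REQUIRED_KEYS = {"version", "settings"}


def parse_config(status_code: int, body: str) -> dict:
    """Return the parsed config, or raise ValueError if it is unusable."""
    if status_code != 200:
        raise ValueError(f"unexpected status {status_code}")
    try:
        config = json.loads(body)
    except json.JSONDecodeError as exc:
        # The case described above: HTTP 200, but the body is not valid JSON.
        raise ValueError(f"malformed JSON in 200 response: {exc}") from exc
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    return config
```

A caller that treats `ValueError` as "do not apply this configuration" avoids acting on a payload that merely arrived with a success code.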

Microsoft says that, in response to this incident, it is working to strengthen the resilience of the Microsoft Teams service so that it can fall back to a previous version of the ECS configuration if ECS fails again.
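The fallback behaviour described here is often called a "last known good" configuration. The sketch below shows one way it could look in Python. It is an illustration under stated assumptions, not how Teams or ECS actually work: the `ConfigStore` class and its `fetch` callable are hypothetical names.

```python
import json
from typing import Callable


class ConfigStore:
    """Keeps the most recent valid config and serves it when a refresh fails."""

    def __init__(self, fetch: Callable[[], str], initial: dict):
        self._fetch = fetch        # hypothetical callable returning the raw response body
        self._last_good = initial  # config known to work, e.g. one shipped with the service

    def refresh(self) -> dict:
        try:
            candidate = json.loads(self._fetch())
            if not isinstance(candidate, dict):
                raise ValueError("config must be a JSON object")
        except (ValueError, OSError):
            # Malformed payload or fetch failure: keep serving the previous version.
            return self._last_good
        self._last_good = candidate
        return candidate
```

For example, if one refresh returns a valid object and the next returns truncated JSON, the second call to `refresh()` returns the same configuration as the first instead of raising or applying the broken payload.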

