Azure Active Directory outage & RCA for Azure Cloud hiccups

Born's Tech and Windows World

[German] Today, February 1, 2019, Azure Active Directory experienced a temporary outage in Europe. While researching it, I found some information on why Microsoft has been suffering from cloud failures since January 29, 2019.

Customers of the Microsoft cloud (Azure, Office 365, Microsoft 365) have been suffering from sporadic failures for days, during which services were not available. I had reported on this (see the article links at the end of the article). Even the Windows Update service was not reachable. Since about 14:30 today, my clients can retrieve updates (Defender, Windows Update, Microsoft Store) again. So much for the preliminary remarks.

Azure Active Directory outage (February 1, 2019)

Browsing my Twitter stream, I came across a new incident reported by Tero Alhonen: an Azure Active Directory outage. On February 1, 2019, Azure Active Directory failed between 8:05 am and 10:00 am UTC (9:05 am and 11:00 am German time).

Issues we had this morning accessing Azure Active Directory have been fixed https://t.co/1w75mhxxaL pic.twitter.com/Hn8fb0phGK

– Tero Alhonen (@teroalhonen), February 1, 2019

Here are the preliminary details of this incident.

Active Directory – Mitigated

Between 08:05 and 10:00 UTC on 01st Feb 2019, a small subset of users in certain countries in Europe including France, Netherlands, Hungary, Czech Republic may have experienced intermittent issues while accessing functionality in Azure Portal, Azure Active Directory B2C, Azure Active Directory Privileged Identity Management, Managed Service Identity, Azure RBAC and Microsoft Teams.

Engineering team have applied mitigation and all impacted services have been recovered 10:00 UTC.

Engineers are continuing to monitor the health of all impacted services to ensure full mitigation.

This issue has now been fixed, and the Azure services should work as usual again.

Cause of failures (January 29/30, 2019)

The Azure status page now contains some entries explaining the Azure hiccups of the past week.

1/30 Intermittent Network Related Timeouts – West US – Mitigated

The issues of January 30, 2019 in the western United States were caused by network timeouts. The preliminary cause was the failure of network devices after routine network maintenance. These had an intermittent effect on Azure Services, which explains why people only noticed sporadic issues.

1/29 RCA – Network Infrastructure Event

From 21:00 UTC on January 29, 2019 until 00:10 UTC on January 30, 2019 (one hour later in German time), there were problems accessing Microsoft cloud resources. In addition, there were intermittent authentication issues with several Azure services, affecting the Azure Cloud, the US Government Cloud and the German Cloud. Here is the Microsoft text of the Root Cause Analysis (RCA):

Summary of impact: Between 21:00 UTC on 29 Jan 2019 and 00:10 UTC on 30 Jan 2019, customers may have experienced issues accessing Microsoft Cloud resources, as well as intermittent authentication issues across multiple Azure services affecting the Azure Cloud, US Government Cloud, and German Cloud. This issue was mitigated for the Azure Cloud at 23:05 UTC on 29 Jan 2019. Residual impact to the Azure Government Cloud was mitigated at 00:00 UTC on 30 Jan 2019 and to the German Cloud at 00:10 UTC on 30 Jan 2019.

Azure AD authentication in the Azure Cloud was impacted between 21:07 – 21:48 UTC, and MFA was impacted between 21:07 – 22:25 UTC.
Customers using a subset of other Azure services including: Microsoft Azure portal, Azure Data Lake Store, Azure Data Lake Analytics, Application Insights, Azure Log Analytics, Azure DevOps, Azure Resource Graph, Azure Container Registries, and Azure Machine Learning may have experienced intermittent inability to access service resources during the incident. A limited subset of customers using SQL transparent data encryption with bring your own key support may have had their SQL database dropped after Azure Key Vault was not reachable. The SQL service restored all databases.

For customers using Azure Monitor, there was a period of time where alerts, including Service Health alerts, did not fire. Azure internal communications tooling was also affected by the external DNS incident. As a result, we were delayed in publishing communications to the Service health status on the Azure Status Dashboard. Customers may have also been unable to log into their Azure Management Portal to view portal communications.

This issue was resolved for the Azure Cloud at 23:05 UTC on January 29, 2019. The remaining effects were mitigated at 00:00 UTC on January 30, 2019 for the Azure Government Cloud and at 00:10 UTC on January 30, 2019 for the German Cloud.
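To put the time windows from the summary above into perspective, here is a small sketch that computes how long each impact window lasted and converts it to German time. The timestamps are taken from the Microsoft summary. The helper names (`utc`, `duration_minutes`, `describe`) are my own, and the fixed UTC+1 offset is an assumption that only holds for winter dates, when Germany is on CET.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC+1 offset (CET). Valid for January, not for summer time.
CET = timezone(timedelta(hours=1), "CET")


def utc(day, hour, minute):
    """Build a UTC timestamp in January 2019."""
    return datetime(2019, 1, day, hour, minute, tzinfo=UTC)


# Impact windows as stated in the Microsoft summary (all times UTC).
WINDOWS = {
    "Overall incident": (utc(29, 21, 0), utc(30, 0, 10)),
    "Azure AD authentication": (utc(29, 21, 7), utc(29, 21, 48)),
    "MFA": (utc(29, 21, 7), utc(29, 22, 25)),
}


def duration_minutes(start, end):
    """Length of a window in whole minutes."""
    return int((end - start).total_seconds() // 60)


def describe(name):
    """One line per window: duration plus start and end in German time."""
    start, end = WINDOWS[name]
    local_start = start.astimezone(CET)
    local_end = end.astimezone(CET)
    return (
        f"{name}: {duration_minutes(start, end)} min, "
        f"{local_start:%d.%m. %H:%M} to {local_end:%d.%m. %H:%M} CET"
    )


if __name__ == "__main__":
    for name in WINDOWS:
        print(describe(name))
```

The overall incident thus lasted a bit more than three hours, while Azure AD authentication itself was affected for well under an hour.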

What is interesting is why this disruption occurred. An external DNS provider experienced a global failure of its DNS infrastructure after a software update. In connection with the Windows Update issues, the name Comcast came to my attention, although this provider is not mentioned in the Microsoft status report. Here is the Microsoft root cause statement:

Root Cause: An external DNS service provider experienced a global outage after rolling out a software update which exposed a data corruption issue in their primary servers and affected their secondary servers, impacting network traffic.

A subset of Azure services, including Azure Active Directory were leveraging the external DNS provider at the time of the incident and subsequently experienced a downstream service impact as a result of the external DNS provider incident.

Azure services that leverage Azure DNS were not impacted, however, customers may have been unable to access these services because of an inability to authenticate through Azure Active Directory.

An extremely limited subset of SQL databases using "bring your own key support" were dropped after losing connectivity to Azure Key Vault. As a result of losing connectivity to Azure Key Vault, the key is revoked, and the SQL database dropped.

I had also written about the deleted SQL databases. According to Microsoft, the DNS services were switched to an alternative DNS provider, which at least mitigated the issue. Microsoft also states that the Azure SQL engineers have recovered all SQL databases dropped as a result of this incident.

Authentication requests made before the routing changes could have failed as a result of this error. This happened whenever DNS requests were routed through the affected DNS provider. Although Azure Active Directory (AAD) uses multiple DNS providers, manual intervention was required to redirect part of the AAD traffic to a secondary provider.
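The RCA does not describe how the redirect to the secondary provider was carried out, only that it required manual intervention. As a rough illustration of the general idea of falling back to a second DNS provider automatically, here is a minimal sketch. It is not Microsoft's implementation: the function `resolve_with_failover`, the provider names, the stub resolvers and the hostname are all hypothetical, and the IP address comes from the documentation range 192.0.2.0/24.

```python
class AllProvidersFailed(Exception):
    """Raised when no configured provider returned a usable answer."""


def resolve_with_failover(hostname, providers):
    """Try each (name, resolver) pair in order and return the first answer.

    A resolver is any callable that takes a hostname and returns a list of
    IP address strings, or raises an exception if the provider is down.
    """
    errors = []
    for name, resolver in providers:
        try:
            addresses = resolver(hostname)
        except Exception as exc:
            # Provider unreachable or returning errors: note it and move on.
            errors.append(f"{name}: {exc}")
            continue
        if addresses:
            return name, addresses
        errors.append(f"{name}: empty answer")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two external DNS providers.
def broken_primary(hostname):
    raise TimeoutError("no response from primary provider")


def healthy_secondary(hostname):
    return ["192.0.2.10"]


if __name__ == "__main__":
    used, addresses = resolve_with_failover(
        "login.example.com",
        [("primary", broken_primary), ("secondary", healthy_secondary)],
    )
    print(f"answered by {used}: {addresses}")
```

In this sketch the failing primary is skipped without any operator action. Whether such automatic fallback is practical at the scale of Azure Active Directory is a different question, which the RCA does not address.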

According to Microsoft's report, the external DNS service provider has fixed the broken DNS servers. In addition, this provider has taken precautions to reduce the likelihood of such an error recurring.

Note: I had also published articles about these events in a German news magazine. Readers discussed extensively whether failures occur more frequently or last longer with on-premises solutions run in-house or in the cloud. The essence of the above incidents, however, is how critical even small disruptions in the infrastructure of the cloud or the Internet are. And this was not even a large-scale cyber attack by government hackers on this infrastructure. Rather, the installation of simple updates, which of course had been extensively tested beforehand, resulted in a worldwide hiccup of the Microsoft cloud. That doesn't sound really good, imho.
