Microsoft Azure Major Outage
Incident Report for Progress MOVEit Cloud
Postmortem

RCA - DNS issue impacting multiple Microsoft services (Tracking ID GVY5-TZZ)

Summary of Impact: Between 21:21 UTC and 22:00 UTC on 1 Apr 2021, Azure DNS experienced a service availability issue. This resulted in customers being unable to resolve domain names for services they use, which resulted in intermittent failures accessing or managing Azure and Microsoft services. Due to the nature of DNS, the impact of the issue was observed across multiple regions. Recovery time varied by service, but the majority of services recovered by 22:30 UTC.

Root Cause: Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.
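The retry amplification described above is commonly reduced on the client side with capped exponential backoff plus random jitter, so that failing clients spread their retries out instead of re-sending immediately and in lockstep. The following is a minimal sketch of that general technique, not a description of how any Microsoft or MOVEit Cloud client behaves. The function name `resolve_with_backoff` and its parameters (`max_retries`, `base`, `cap`) are hypothetical and chosen only for illustration.

```python
import random
import time


def resolve_with_backoff(resolve, name, max_retries=5, base=0.5, cap=30.0,
                         sleep=time.sleep, rng=None):
    """Call resolve(name), retrying on OSError with capped exponential
    backoff and full jitter.

    All names and defaults here are illustrative assumptions.
    `resolve` is any callable that takes a hostname and raises OSError
    (socket.gaierror is a subclass of OSError) on lookup failure.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return resolve(name)
        except OSError:
            if attempt == max_retries:
                # Retry budget exhausted: surface the failure to the caller.
                raise
            # The ceiling doubles on every attempt but never exceeds `cap`.
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients failing at once do not all retry at the same instant.
            sleep(rng.uniform(0.0, ceiling))
```

A caller could use it as `resolve_with_backoff(socket.gethostbyname, "example.com")` after `import socket`. The jitter matters as much as the backoff: without it, clients that failed together would retry together and recreate the spike.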

Mitigation: The decrease in service availability triggered our monitoring systems and engaged our engineers. Our DNS services automatically recovered themselves by 22:00 UTC. This recovery time exceeded our design goal, and our engineers prepared additional serving capacity and the ability to answer DNS queries from the volumetric spike mitigation system in case further mitigation steps were needed. The majority of services were fully recovered by 22:30 UTC. Immediately after the incident, we updated the logic on the volumetric spike mitigation system to protect the DNS service from excessive retries.
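The RCA does not describe the updated mitigation logic. As a hedged illustration only, one common way to protect a service from excessive retries is a token bucket kept per client: each client may send a short burst, after which requests are admitted only at a steady rate. The class name `TokenBucket` and the `rate` and `burst` parameters below are assumptions made for this sketch and are not taken from Microsoft's system.

```python
class TokenBucket:
    """Illustrative per-client limiter (hypothetical, not Azure's logic).

    rate  -- tokens added per second
    burst -- maximum number of tokens the bucket can hold
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start full so a first burst is allowed
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        # Refill for the time elapsed since the last call, up to `burst`.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Bucket is empty: the request would be dropped or answered from cache.
        return False
```

In such a design a front end would keep one bucket per client address and pass a monotonic clock reading as `now`, so that a client retrying in a tight loop is throttled while well-behaved clients are unaffected.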

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Repair the code defect so that all requests can be efficiently handled in cache.
  • Improve the automatic detection and mitigation of anomalous traffic patterns.
Posted Apr 06, 2021 - 08:34 CDT

Resolved
Issue resolved update.

Microsoft is reporting that the issue has been updated, so we are going to close out this alert for now. We will send out an RCA on this issue within the next 24 to 48 hours.
Posted Apr 01, 2021 - 19:22 CDT
Monitoring
Quick update:

MOVEit Cloud has been back online for most of the reported outage, but Microsoft is still reporting issues, so we will continue to monitor this and provide an update in an hour or sooner as events warrant.

Here is the latest update from Microsoft:

DOWNSTREAM SERVICES: Due to downstream impact to a number of Azure services, recovery times may vary by service. We are attempting to assess downstream impact and will report the additional services as we know. The next update will be provided in 30 minutes or as events warrant.

This message was last updated at 23:40 UTC on 01 April 2021
Posted Apr 01, 2021 - 18:49 CDT
Identified
Microsoft has posted the following update:

DNS issue impacting multiple Microsoft services - Recovery in progress

SUMMARY OF IMPACT: Starting at approximately 21:30 UTC on 01 Apr 2021, customers may experience intermittent issues accessing Microsoft services, including Azure, Dynamics, and Xbox Live.

CURRENT STATUS: Microsoft rerouted traffic to our resilient DNS capabilities and are seeing improvement in service availability. We are continuing to investigate the cause of the DNS issue. The next update will be provided in 30 minutes or as events warrant.

This message was last updated at 22:36 UTC on 01 April 2021

We are seeing that most of the services appear to be online, so we will continue to monitor this and send out another update in an hour or sooner as events warrant.
Posted Apr 01, 2021 - 17:48 CDT
Investigating
MOVEit Cloud is currently experiencing a major outage that impacts all users. Our team is working to restore the service to normal performance levels as quickly as possible.

We will post another update in an hour or as soon as we learn more.
Posted Apr 01, 2021 - 16:47 CDT
This incident affected: North America - Cluster 1, North America - Cluster 2, North America - Cluster 3, and Europe - Cluster 1.