The year 2021 was marked by many uncertainties and the return of the coronavirus infection wave. Since the outbreak of the pandemic at the latest, terms such as Next Normal and digital user experience have entered many people's vocabulary. In addition, changing social and economic conditions have brought many online services and digital platforms a huge increase in user numbers. But what happens when the Internet connection is interrupted and everything is forced offline? After all, we have seen outages at Facebook, Amazon Web Services (AWS) and others.
For many companies, downtime means lost revenue and reputation, as well as a potential waste of resources to respond to incidents. Yet it is possible to circumvent or minimize the impact of such incidents by learning from the experiences of others. The network intelligence company Cisco ThousandEyes has observed and analyzed all such incidents. Reason enough to recap the most far-reaching and significant disruptions from 2021.
- Amazon Web Services – December 15, 2021: A brief Amazon Web Services (AWS) outage affected services and applications in the US-WEST-1 and US-WEST-2 regions. The incident lasted about 45 minutes and occurred early in the workday on the U.S. West Coast. It disrupted access to authentication and collaboration platforms that rely on AWS, including Okta, Workday and Slack. AWS confirmed ThousandEyes' observation that network connectivity issues, specifically packet loss caused by congestion, were responsible.
- Amazon Web Services – December 7, 2021: AWS, the largest provider of cloud computing services in the U.S., had already suffered an even larger outage in early December. It lasted over an hour and caused issues for users of several key services, including AWS Console, Amazon Prime Now and Amazon Pharmacy. Many services that rely on AWS, such as consumer IoT devices like Roomba and Ring, were also affected. Major streaming services such as Disney+ and Netflix were unavailable as well.
This outage had a particularly significant impact on enterprise customer applications and services. For example, many concerned IT specialists in companies had to wait for more than an hour for the provider's status page to display the background of the incident.
- Facebook – October 4, 2021: On October 4, the Facebook, Instagram and WhatsApp services could no longer be accessed. The outage affected hundreds of millions, if not billions, of users worldwide. In addition, there were reports of problems at service providers, which were affected as a side effect of the high volume of Internet traffic associated with Facebook.
Regular operations were restored for all three messaging platforms seven hours later. Understandably, this outage raises some questions. How could it have happened? Why did it take so long for the social media company's experienced network operations team to restore services?
Facebook's outage was a significant disruption in terms of scope and duration, and it also had a monetary impact: according to Forbes, the outage resulted in $60 million to $100 million in lost revenue and a $47.3 billion drop in market capitalization. I reported on the outage several times on this blog; see the links at the end of the article.
- Akamai DNS – July 22, 2021: Akamai experienced a widespread outage in late July. As a result, users worldwide could no longer reach the websites of the company's customers. The outage lasted over an hour and had a significant impact on many websites and applications used in banking, air travel and gaming, among others.
Akamai DNS is a critical service that redirects users to Akamai's CDN edge. Users who tried to access websites hosted by Akamai received an error message during the outage. The reason: the domain they requested in each case could not be resolved to a valid IP address.
The outage was particularly significant because it affected not only Akamai customers, but also those who rely on Akamai services. Companies that use a multi-CDN approach, such as Amazon, were largely spared the impact of this incident.
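The failure mode described above (a domain that no longer resolves to an IP address) and the multi-CDN mitigation can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not how Akamai or any of its customers implement failover: the hostnames are hypothetical placeholders, and the `resolver` parameter and `fake_resolver` helper are assumptions added so the logic can be tried out without network access.

```python
import socket


def system_resolver(hostname):
    """Resolve a hostname through the operating system's resolver."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def first_resolvable(hostnames, resolver=system_resolver):
    """Return (hostname, addresses) for the first hostname that resolves, else None."""
    for hostname in hostnames:
        try:
            addresses = resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue  # resolution failed, try the next provider
        if addresses:
            return hostname, addresses
    return None


def fake_resolver(hostname):
    """Hypothetical stand-in that simulates a DNS outage at the primary provider."""
    table = {"www.secondary-cdn.example": ["192.0.2.10"]}
    if hostname not in table:
        raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")
    return table[hostname]


# Hypothetical hostnames served by two different CDN/DNS providers.
candidates = ["www.primary-cdn.example", "www.secondary-cdn.example"]
print(first_resolvable(candidates, resolver=fake_resolver))
```

With a single provider in the list, the same call returns `None` during the simulated outage, which corresponds to the error message users saw. In practice, multi-CDN setups usually steer traffic at the DNS or load-balancer level rather than in client code; the sketch only shows the principle.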
- Akamai Prolexic Routed – June 16, 2021: For Internet users in Australia and the wider Asia-Pacific region, June 16, 2021 was a particularly frustrating day. Prolexic Routed, Akamai's DDoS defense service, experienced a service disruption that left some customers' websites unavailable for varying lengths of time.
To protect its customers from DDoS attacks, Prolexic Routed scrubs incoming traffic. It does this by announcing customer prefixes (with permission), so that incoming requests reach Akamai first and are then forwarded to the respective customer network. The cause of this incident was a routing table limit that was inadvertently exceeded.
The outage lasted over four hours, with the most severe impact occurring in the first few minutes. Different services were affected differently depending on their location, time of day, and previously created backup plans. Certain services had failover systems that enabled them to restore connectivity – in some cases within minutes.
- Fastly – June 8, 2021: In June, Fastly experienced a massive outage that affected 85 percent of its services worldwide. A hidden software bug triggered the hour-long outage when a customer performed a routine update to their CDN configuration. Anyone who tried to reach the affected websites or applications likely received a 503 Service Unavailable error message.
The outage affected many major websites, including Reddit and the New York Times website. Even Amazon and eBay were affected in places because they also rely on Fastly's services. Notably, the impact differed considerably among these media and e-commerce providers, even though the cause of the outage was the same.
The cases above show that outages of popular cloud services have a massive impact. And I have not even addressed Azure outages or the security issues in this area.
Measures for a more resilient 2022
From the 2021 outages, ThousandEyes draws some basic lessons on how organizations can become more resilient to outages like those described above.
- Fall back on practical redundancy concepts. Consider using more than one provider for critical services such as CDN and DNS.
- Analyze how your service delivery chain works. This may rely on multiple dependencies. Therefore, it is important to know all dependencies, including indirect or "hidden" ones as well as external services.
- Ensure proactive visibility into your sites, applications, and key dependencies. This lets you determine quickly when a service issue has occurred and which strategy will resolve the incident with minimal impact on your users.
- Create a contingency plan. Even if you have implemented best practices and redundant service architectures, unforeseen outages can still occur. With a backup plan for outage scenarios, you can minimize downtime and performance degradation for your services.
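The advice to know all dependencies, including indirect or "hidden" ones, can be made concrete with a small Python sketch. It is a minimal illustration, not a ThousandEyes method: the service names and the `DEPENDENCIES` map are hypothetical, and a real inventory would come from your own architecture documentation or monitoring data.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on directly.
DEPENDENCIES = {
    "webshop": ["cdn-a", "login"],
    "login": ["identity-provider"],
    "identity-provider": ["cloud-region-west"],
    "cdn-a": ["dns-a"],
    "status-page": ["cloud-region-west"],
}


def all_dependencies(service, graph):
    """Return every direct and indirect dependency of `service` (breadth-first)."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; this also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen


def affected_by(failed, graph):
    """Return the services that depend, directly or indirectly, on `failed`."""
    return {svc for svc in graph if failed in all_dependencies(svc, graph)}


print(sorted(all_dependencies("webshop", DEPENDENCIES)))
print(sorted(affected_by("cloud-region-west", DEPENDENCIES)))
```

In this made-up example, the webshop never references the cloud region directly, yet it fails with it because the login service relies on an identity provider hosted there. The status page sits in the same region, so it would go down together with the services it is supposed to report on.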
The 2021 outages are a powerful reminder that even the most advanced infrastructure can certainly be affected by errors and failures. Even though outages are inevitable, you should have certain measures implemented to survive them without damage. IT teams can use analysis and insights from this year's biggest outages to develop better processes, redundancies and failover systems to control and minimize expected downtime in 2022.
Amazon AWS down (Nov. 25, 2020)
Amazon AWS cloud outage causes chaos (2021/12/08)
Facebook, Instagram and WhatsApp worldwide down
Facebook, Instagram and WhatsApp have problems again (2021/10/08)
Facebook explains the causes of the big outage
Facebook's outage and identity management dependencies