On which the Internet (or a large part of it) stopped
On August 30th, CenturyLink/Level 3, a major global Internet service provider (ISP), suffered a network incident lasting nearly five hours. The outage affected not just its own customers but also the networks of other ISPs and services connected to it, directly or indirectly.
The impact was especially severe because CenturyLink/Level 3 has peering agreements with many application providers and enterprises, including Google and Cloudflare. The result was a widespread shutdown of a large portion of Internet traffic around the world.
The outage was caused by a faulty Border Gateway Protocol (BGP) configuration, leading routers to block traffic in a snowball effect. According to their root cause analysis report, the faulty configuration was in a BGP extension named Flowspec, intended to push firewall-like rules to routers.
Flowspec is a BGP extension (essentially a feature addition to the BGP-4 specification) that distributes firewall-like rules via BGP updates. It’s considered a powerful tool for quickly pushing filter rules to a large number of routers and, historically, it’s been used to mitigate Distributed Denial of Service (DDoS) attacks. It functions similarly to Access Control Lists (ACLs), but unlike ACLs, which are static, Flowspec is dynamic because it is part of BGP. That dynamic nature means that, while powerful, Flowspec can cause significant issues if improperly configured.
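To make the "firewall-like rule" idea concrete, here is a minimal sketch in Python. It is not a Flowspec implementation and it does not reproduce the rule involved in the incident. The `FlowRule` class, the `apply_rules` function, and the example addresses (taken from documentation ranges) are all hypothetical, chosen only to show how a narrow rule differs from an overly broad one.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class FlowRule:
    """Hypothetical, simplified stand-in for a Flowspec-style filter rule."""
    dst_prefix: str                 # destination prefix to match
    protocol: Optional[str] = None  # None matches any protocol
    dst_port: Optional[int] = None  # None matches any destination port
    action: str = "discard"         # what to do with matching traffic

    def matches(self, dst_ip: str, protocol: str, dst_port: int) -> bool:
        if ip_address(dst_ip) not in ip_network(self.dst_prefix):
            return False
        if self.protocol is not None and self.protocol != protocol:
            return False
        if self.dst_port is not None and self.dst_port != dst_port:
            return False
        return True


def apply_rules(rules, dst_ip: str, protocol: str, dst_port: int) -> str:
    """Return the action of the first matching rule, or 'forward' if none match."""
    for rule in rules:
        if rule.matches(dst_ip, protocol, dst_port):
            return rule.action
    return "forward"


# A narrow rule: drop only UDP port 53 traffic aimed at one host under attack.
narrow = [FlowRule("203.0.113.10/32", protocol="udp", dst_port=53)]

# An overly broad rule (assumed for illustration): matches every destination.
broad = [FlowRule("0.0.0.0/0")]

print(apply_rules(narrow, "203.0.113.10", "udp", 53))   # attack traffic
print(apply_rules(narrow, "198.51.100.7", "tcp", 179))  # unrelated traffic
print(apply_rules(broad, "198.51.100.7", "tcp", 179))   # same traffic, broad rule
```

With the narrow rule, only the targeted traffic is discarded and everything else is forwarded. With the broad rule, unrelated traffic is discarded too, including traffic to TCP port 179, the port BGP sessions themselves use. Because such rules propagate through BGP updates, a mistake of this kind can spread to many routers quickly.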
Meanwhile, on the Teridion network
Teridion has thousands of probes spread across more than 500 points of presence worldwide. They monitor Internet performance every 3 seconds and report the results to Teridion’s management system, feeding our machine-learning-based algorithms.
From the start of the outage and throughout the event, Teridion’s probes reported extremely high levels of packet loss on the Internet. They showed that many Internet links were underperforming and were best avoided. Traffic routed over the public Internet follows the route dictated by peering agreements, regardless of that route’s state. Teridion, by contrast, selects routes based only on performance, and so avoids broken links that would cause traffic loss. During the CenturyLink/Level 3 outage, 5% of the potential links were not operable. Teridion responded automatically by rerouting traffic through alternative, unaffected cloud providers to maintain a high level of performance. In fact, our customers weren’t even aware of the huge event happening on the public Internet. This was not a fluke. It follows from Teridion’s design, which allows the network to protect its customers against future outages as well.
The multi-cloud, smart network solution
Teridion has a multi-cloud infrastructure with hundreds of points of presence around the world. The failure of a specific network provider does not impact our ability to provide high-performance connectivity to our customers. Moreover, as a smart overlay network, Teridion is not affected by the root cause of an infrastructure issue. As long as point A has some access to the network and point B has some access to the Internet, Teridion’s smart routing algorithm will find a way to connect A and B. For example, during the event we saw our system reroute traffic from a congested datacenter of one cloud vendor to the datacenter of a different cloud provider, allowing the traffic to flow.
Our data from the time of the event shows its magnitude.
Here are some example links, with average packet loss and latency values during the five-hour event:
Seattle – London: on the public Internet (direct), 40% packet loss and 140 ms latency; with Teridion (rerouted via Chicago), 0% packet loss and 127 ms latency.
Amsterdam – Toronto: on the public Internet (direct), 37% packet loss and 97 ms latency; with Teridion (rerouted via London), 0% packet loss and 86 ms latency.
Toronto – NY: on the public Internet (direct, with cloud provider A), 35% packet loss and 18 ms latency; with Teridion (rerouted to a different cloud provider), 0% packet loss and 18 ms latency.
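As a rough illustration of what performance-based route selection means, the sketch below picks between candidate routes using the Seattle – London figures above. It is not Teridion’s algorithm. The `RouteSample` class, the `pick_route` function, and the 1% loss threshold are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteSample:
    """Hypothetical probe measurement for one candidate route."""
    name: str
    packet_loss_pct: float
    latency_ms: float


def pick_route(samples, max_loss_pct=1.0):
    """Pick the lowest-latency route among those with acceptable loss.

    max_loss_pct is an assumed threshold. If no route meets it,
    fall back to the route with the least loss (latency breaks ties).
    """
    healthy = [s for s in samples if s.packet_loss_pct <= max_loss_pct]
    if healthy:
        return min(healthy, key=lambda s: s.latency_ms)
    return min(samples, key=lambda s: (s.packet_loss_pct, s.latency_ms))


# Values taken from the Seattle – London measurements in this post.
seattle_london = [
    RouteSample("direct (public Internet)", packet_loss_pct=40.0, latency_ms=140.0),
    RouteSample("via Chicago", packet_loss_pct=0.0, latency_ms=127.0),
]

best = pick_route(seattle_london)
print(f"{best.name}: {best.packet_loss_pct}% loss, {best.latency_ms} ms")
```

The point of the sketch is the decision criterion: the route is chosen from measured loss and latency rather than from a fixed peering path, so a link that starts dropping packets stops being selected as soon as the measurements reflect it.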
Is your organization protected against future outages?
Understanding your risk factors requires an understanding of who is in your wider circle of dependencies. Be clear on how their performance and availability could impact your business if something were to go wrong. Maintaining visibility into the routing, availability, and performance of your critical providers is also extremely important.
Teridion guarantees superior, reliable performance. Contact us to learn more.