How to Avoid a Mass Outage – Facebook’s Use Case

Tomer Avisar

Avoiding a Single Point of Failure

Billions of people and businesses around the world were recently affected by an outage across all of Facebook’s platforms. The outage affected not only Facebook but also WhatsApp, Instagram, Messenger, and even Oculus. These vital communications platforms, used by huge numbers of people and service providers globally, were inaccessible for hours. Many Facebook employees were cut off from their workplaces, and users of Facebook’s platforms couldn’t log in. This had a profound effect on thousands of businesses, advertisers, content creators and influencers, who were cut off from platforms that serve as significant sources of income. The inability to access such vital services can have a lasting, damaging effect.

The main risk of relying on a global platform is that users become dependent on a centralized network offering a wide range of services. While this may seem convenient, it also lays the ground for a single point of failure, which can create a domino effect. If servers fail within Facebook’s infrastructure, the failure is reflected in all of its applications and services, which in turn creates a negative experience for users globally. In the case of Facebook, configuration changes on the backbone routers that coordinate network traffic between its data centers caused issues that interrupted this communication. This disruption to network traffic affected the data centers’ functionality, bringing all services to a halt.

Start Decentralizing

Since a single point of failure can have a devastating effect on businesses, the appropriate solution is to change how we think about the networks and infrastructure behind online services. Decentralized networks and systems are the key to better connectivity and more reliable services. The solution is not to trust one major source for the network’s infrastructure, but to adopt decentralization as a philosophy that leads to better connectivity and a better user experience.

Teridion presents a multi-cloud strategy based on decentralized methodology, which aims to prevent these issues. The company’s platform can identify failures and resolve them automatically. For instance, if a certain Teridion virtual PoP (Point of Presence) is unavailable, the system’s self-healing mechanism resolves the issue by locating the optimal available Teridion virtual PoP. Thanks to its SLA-based platform, Teridion’s mechanism is designed to avoid a single point of failure, structured to prevent potential outages, and orchestrated to provide optimal, stable traffic routing.
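To make the self-healing idea concrete, here is a minimal Python sketch of failover between points of presence. It is an illustration only and does not describe Teridion’s actual implementation: the `PoP` class, its fields, and the `select_pop` function are hypothetical names, and "optimal" is assumed here to mean the lowest measured latency among healthy PoPs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoP:
    """Hypothetical record for one virtual point of presence."""
    name: str
    provider: str      # cloud provider hosting this PoP
    healthy: bool      # result of the most recent health check
    latency_ms: float  # most recent measured latency


def select_pop(pops: List[PoP], current: Optional[PoP] = None) -> Optional[PoP]:
    """Keep the current PoP while it is healthy; otherwise fail over.

    Assumption: the "optimal" replacement is the healthy PoP with the
    lowest latency. Returns None if no healthy PoP is available.
    """
    if current is not None and current.healthy:
        return current
    candidates = [p for p in pops if p.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.latency_ms)


if __name__ == "__main__":
    pops = [
        PoP("pop-a", "cloud-1", healthy=False, latency_ms=12.0),
        PoP("pop-b", "cloud-2", healthy=True, latency_ms=30.0),
        PoP("pop-c", "cloud-3", healthy=True, latency_ms=18.0),
    ]
    # pop-a has failed, so traffic moves to the best healthy alternative.
    chosen = select_pop(pops, current=pops[0])
    print(chosen.name if chosen else "no healthy PoP")
```

Because the candidates in this sketch sit on different cloud providers, the failure of one PoP, or of one provider, still leaves a usable route, which is the property the decentralized approach is meant to provide.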

Since multi-cloud platforms do not depend on a single cloud provider, they can guarantee effective connectivity. To achieve maximum connectivity reach, Teridion uses 25 cloud providers, more than 500 connection points, thousands of sensors sampling data center connectivity, and many other features. This strategy helps orchestrate the optimal Internet route. A synergetic, distributed network effectively minimizes the chance of hitting a single point of failure that disrupts the entire functionality of online businesses.

Would you like to learn more? Contact us today, or email us at sales@teridion.com, to get a free demo.