If you’re in the habit of sending HTTP requests from the US to destinations in Asia, you’ve probably encountered major packet loss. Packet loss in Asia can happen for a number of reasons, including:
- Inferior routes to the destination
- Peering issues between the host and the destination
- Government-implemented firewalls
- High latency, resulting in timeouts
While government firewalls can hinder network performance or result in high latency, the worst culprits are often routing and peering issues, which can cause sustained packet loss lasting anywhere from several minutes to several hours.
This type of severe packet loss can be a huge operational headache. We see the painful results of this all the time when we talk to prospects. In fact, this type of Internet inefficiency is exactly what Teridion is built to eliminate, and so irony was definitely in the air when we experienced the problem with one of our own internal processes!
Routing SaaS Traffic To Asia
You see, Teridion works by mapping the fastest paths across the Internet, and then constructing dynamic virtual networks for our customers across about 20 public cloud providers like AWS and GCP. When we find a faster route, we instantiate new virtual routers, called TCRs, wherever needed to use the fastest route. When our system spins up these routers, we use automated routines for creating VMs, sending REST API requests to the cloud provider’s API endpoint. Each cloud provider has its own API endpoints, and depending on the provider, the nearest endpoint could be in the US or could be somewhere else in the world.
Each routine consists of several requests, such as getting the proper image, retrieving available machine profiles, and, most important of all, creating an instance. These calls are sent one after the other. The last call is the machine-creation request, and its response contains the newly created machine's properties, which are used for later deployment.
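To make the shape of that routine concrete, here is a minimal sketch in Python. This is not Teridion's actual automation code; the class, method names, and IDs are all hypothetical stand-ins for a real provider's REST API, with a fake client so the example is self-contained.

```python
class FakeCloudAPI:
    """Hypothetical stand-in for a cloud provider's REST API endpoint."""

    def get_image(self, family):
        # 1. Look up the proper image to boot from.
        return {"id": "img-tcr-001"}

    def get_profiles(self, region):
        # 2. Retrieve the machine profiles available in this region.
        return [{"id": "small"}, {"id": "large"}]

    def create_instance(self, image_id, profile_id, region):
        # 3. The critical final call: the response carries the new
        #    machine's properties, needed for later deployment.
        return {"id": "vm-42", "image": image_id, "profile": profile_id,
                "region": region, "state": "running"}


def create_tcr(api, region):
    """Sequential creation routine: each call depends on the one before it."""
    image = api.get_image(family="tcr")
    profiles = api.get_profiles(region=region)
    return api.create_instance(image["id"], profiles[0]["id"], region)
```

Because the calls are strictly sequential, a single lost request or response anywhere in the chain stalls the whole routine, which matters in the failure scenario described next.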
Before we expanded our stable of cloud provider partners to include Asian cloud providers, this TCR creation process worked flawlessly. Whenever we needed to create a router, we would send our API calls to the nearest Google Cloud or IBM Softlayer endpoint, and the result would be as predictable as clockwork. But during our expansion into Asia, we encountered our first packet loss storm, and that created some real problems for us.
Let’s take a closer look at what happened. Any HTTP request or response in the creation routine can be lost in transit, and a single lost packet can cause the automation process to time out. That timeout, in turn, triggers a retry of the entire automation process. This might not seem like a big deal; after all, we’re only talking about a process measured in milliseconds. But it all depends on which packet is lost. The truly painful scenario is when the confirmation response to an instance-creation request gets lost.
In this case, the lost response results in a retry of the entire automation process and the creation of a new machine, even though the first VM was created successfully.
The first machine is unknown to our system, since no confirmation of its creation ever arrived, and it never enters service on a customer's network. Despite this, Teridion is billed for it by the provider, since it is in fact up and running. And because a packet loss episode can persist for hours, it can leave behind dozens or even hundreds of unused machines.
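The failure mode above can be reproduced in a few lines. The sketch below, with entirely hypothetical names, simulates a provider that creates the VM but loses the confirmation response, paired with a naive retry loop like the one we were running: the automation ends up knowing about one machine while the provider has created, and is billing for, two.

```python
class FlakyProviderAPI:
    """Stand-in provider API whose confirmation responses can be lost in transit."""

    def __init__(self):
        self.machines = []            # VMs the provider actually created (and bills for)
        self.drop_next_response = False

    def create_instance(self):
        vm = {"id": f"vm-{len(self.machines)}"}
        self.machines.append(vm)      # the VM comes up on the provider side regardless...
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("confirmation response lost")  # ...but we never hear about it
        return vm


def create_with_retry(api, attempts=3):
    """Naive retry: on timeout, rerun the whole creation routine."""
    for _ in range(attempts):
        try:
            return api.create_instance()
        except TimeoutError:
            continue
    raise RuntimeError("all attempts timed out")


api = FlakyProviderAPI()
api.drop_next_response = True    # simulate packet loss eating one response
vm = create_with_retry(api)
# Our automation now tracks a single machine, while the provider
# holds two running VMs and bills for both.
```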
Solving Packet Loss In Asia
Like any other SaaS provider suffering from performance issues in hard-to-serve geographies, we had a number of unpalatable options for improving reliability and performance:
- Moving our machine creation client closer to Asia: Impractical. The client communicates with other infrastructure components and we would simply be shifting the problem rather than solving it.
- Moving the entire infrastructure closer to Asia: Not acceptable, as the infrastructure interacts with other services located closer to the US than to Asia. This is still how many providers address the problem.
- Creating a cleanup routine: This would compare the machines in our internal database against a report of existing machines from the cloud provider, then delete any machine the provider reports that our database doesn’t know about. It’s an acceptable way to terminate unused machines, but it does nothing to prevent the initial billing for each one.
- Using a proxy machine closer to Asia: Acceptable performance, but inevitably would add other requirements in resource-intensive areas like security and peering.
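The cleanup-routine option amounts to a set difference between the provider's inventory and ours. A minimal sketch, with hypothetical IDs and report shapes, of how such a reconciliation pass might look:

```python
def find_orphans(db_ids, provider_report):
    """Return provider machines that our internal database doesn't know about."""
    known = set(db_ids)
    return [vm for vm in provider_report if vm["id"] not in known]


# Hypothetical data: what our system tracks vs. what the provider bills for.
db_ids = ["vm-1", "vm-3"]
provider_report = [{"id": "vm-1"}, {"id": "vm-2"}, {"id": "vm-3"}]

orphans = find_orphans(db_ids, provider_report)
# orphans == [{"id": "vm-2"}] -> these are the phantom machines to terminate
```

Note that this only cleans up after the fact: by the time the reconciliation runs, the phantom machine has already been running and billed.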
Physician, Heal Thyself!
It didn’t take us long to realize that the most effective solution was right in front of us: our own service. It ensures that the machine-creation client always takes the best available route to the cloud provider's API endpoint, reducing packet loss, and it sidesteps peering problems by choosing routes with better peering, improving the reliability of every request from the machine-creation client to the endpoint.
Like our customers, we benefited from how quickly and simply we could set up the Virtual Backbone Network (VBN) used for the transaction. In just a few minutes we had created the VBN from our machine-creation server in Iowa to the cloud provider's API endpoint. We created a CNAME as an entry point, and instead of sending API requests directly to the cloud provider's API CNAME, we directed all requests to the CNAME we generated, ensuring that every API request traveled over our own network.
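In practice, that redirection is just a hostname swap on the client side. The sketch below uses invented hostnames (neither is a real Teridion or provider domain) to show the idea: requests are built against the VBN entry-point CNAME rather than the provider's own endpoint, and DNS plus the VBN handle the rest.

```python
PROVIDER_API_HOST = "api.cloudprovider.example.com"   # provider's own endpoint (hypothetical)
VBN_ENTRY_HOST = "provider-api.vbn.example.net"       # generated CNAME into the VBN (hypothetical)


def api_url(path, via_vbn=True):
    """Build the request URL, steering traffic through the VBN entry point by default."""
    host = VBN_ENTRY_HOST if via_vbn else PROVIDER_API_HOST
    return f"https://{host}{path}"


# All automation calls now resolve through the VBN entry point:
url = api_url("/v1/instances")
```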
After that, we could sit back and let the system automatically find the best route between the two points.
The results were predictably impressive (predictable at least for our engineers, who see these kinds of results in customer networks all the time). Packet loss dropped significantly, which eliminated the phantom-server billing problem and delivered real cost savings. We overcame the connectivity issues while leaving our infrastructure intact, a much simpler, faster, and easier solution than the other possibilities I mentioned earlier. Beyond the empirical benefits, it was also fun and gratifying to use our own infrastructure to optimize itself!
If you’d like to learn more about Teridion, take a deep dive into our whitepapers and case studies, read about how crucial good performance is for SaaS customer retention, or if you only have a few minutes to spare, have a look at our overview video.