Retailers lose 4% of a day’s sales each hour a website is down, according to Channel Advisor, a software platform for retailers. H&M, Home Depot, and Nordstrom Rack all experienced either downtime or intermittent outages during the 2019 holiday season. One thing we know for sure is that there is no such thing as 100% uptime. But there has to be a way to make our systems resilient and prevent outages for eCommerce sites.
We at Unbxd pay the utmost attention to the growth of our customers and ensure that they are up and running most of the time and do not suffer any business loss. We do this by making our ecosystem robust, reliable, and resilient.
“Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.” ~Wikipedia
Ever since the term “services”, and more recently “microservices”, came into use, application developers have been converting monolithic APIs into simple, single-function microservices. However, such conversions come at a cost: it becomes harder to ensure consistent response times and resiliency when certain dependencies become unavailable.
For example, a monolithic web application that performs a retry for every call is potentially resilient to some extent, as it can recover when certain dependencies (such as databases or other services) are unavailable. This resilience comes without any additional network or code complexity.
For a service that orchestrates numerous dependencies, each invocation is costly, and a failure can lead to diminished user experience as well as to higher stress on the underlying system that is attempting to recover from the failure. And that is what we at Unbxd work towards – providing a seamless shopping experience for our customers across verticals.
Let us consider a typical use case: an e-commerce site is overloaded with requests on Black Friday, and the vendor providing the payment operations goes offline for a few seconds due to heavy traffic.
The users then begin to see long wait times for their checkouts due to the high concurrency of requests. These conditions also keep all of the application servers clogged with the threads that are waiting to receive a response from the vendor. After a long wait time, the eventual result is a failure.
This leads either to abandoned carts or to users refreshing and retrying their checkouts, which increases the load on application servers that already have long-waiting threads and leads to network congestion. This is where the circuit breaker pattern comes in handy!
A circuit breaker is a simple structure that constantly remains vigilant, monitoring for faults. In the above-mentioned scenario, the circuit breaker identifies long waiting times among the calls to the vendor and fails fast, returning an error response to the user instead of making the threads wait. Thus, the circuit breaker spares users a long wait that would end in failure anyway.
And that is what keeps me and my team excited most of the time: finding a better and even more efficient circuit breaker pattern, one that can create an ecosystem that survives outages and downtime with little or no impact.
Martin Fowler says, “The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually, you’ll also want some kind of monitor alert if the circuit breaker trips.”
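The idea in the quote above can be illustrated with a minimal sketch. It is not taken from any particular library: the class name `CircuitBreaker`, the `failure_threshold` parameter, and the choice to count consecutive failures are all assumptions made for illustration.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the function."""


class CircuitBreaker:
    """Wraps a protected function and trips after repeated failures.

    Hypothetical sketch: counts consecutive failures and, once tripped,
    stays open (recovery is not handled here).
    """

    def __init__(self, func, failure_threshold=3):
        self.func = func
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, *args, **kwargs):
        if self.open:
            # Fail fast: the protected call is not made at all.
            raise CircuitOpenError("circuit is open; call not attempted")
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # trip the breaker
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

A caller would wrap the vendor call once, for example `breaker = CircuitBreaker(charge_card)` with a hypothetical `charge_card` function, and then invoke `breaker.call(...)` in place of the function itself. A production breaker would also need a way to close again and would normally raise a monitoring alert when it trips.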
Recovery time is crucial for the underlying resource, and having a circuit breaker that fails fast without overloading the system ensures that the vendor can recover quickly.
A circuit breaker is an always-live system keeping watch over dependency invocations. In case of a high failure rate, the circuit breaker stops the calls from going through for a small amount of time and responds with a standard error instead.
We at Unbxd are always working towards building the nearest version of an Ideal Circuit Breaker. A harmonious system is one where we have an ideal circuit breaker, real-time monitoring, and a fast recovery variable setup, making the application truly resilient.
And that is what we are creating for our customers.
Unbxd has many client-facing APIs. They talk to a few downstream services, one of the most essential being the Catalog Service. A failure of this service implies a failure for the client services as well. A failure need not always be an error: an inability to serve the request in a timely manner is equivalent to failure for all practical purposes. The problem for the client services has been to make them resilient to failures of this service, and also to avoid bombarding it when it is already down. We identified a Circuit Breaker as an ideal solution for the problem our customers were facing.
We zeroed in on Hystrix, an open-source implementation of the Circuit Breaker pattern by Netflix. All the calls to the Catalog Service are wrapped in Hystrix functions. Any timeout or error from the downstream service forces the request to be served by an alternate fallback strategy. Our team identified that the problem was to build an alternate service that could get us the catalog. Clearly, a cache was needed to serve the purpose. This can be seen in the sequence of images below:
We can clearly see that cache hits rise once the circuit breaker trips, while responses continue to succeed. Once the system is restored, cache hits start decreasing and the circuit breaker returns to the closed state.
An LRU (Least Recently Used) cache was implemented, backed by Aerospike. LRU was chosen on the basis of the 80-20 rule (80% of the requests are for 20% of the products). Aerospike does not offer an LRU implementation out of the box, so to create LRU behavior, data entry and retrieval in Aerospike were done through Lua scripts that run on the Aerospike nodes.
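To make the LRU behavior concrete, here is a small in-process sketch. It is not the Aerospike and Lua implementation described above; it only illustrates the eviction policy, and the class name `LRUCache` and its `capacity` parameter are assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """In-memory LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)  # a write counts as a use
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("shoe", {"price": 40})
cache.put("shirt", {"price": 15})
cache.get("shoe")                  # "shoe" is now the most recently used
cache.put("hat", {"price": 9})     # evicts "shirt"
```

Under the 80-20 assumption, frequently requested products keep getting touched and therefore stay in the cache, while rarely requested ones are the first to be evicted.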
All successful responses are now also cached in Aerospike, and when there is a failure (either a timeout or an error) requests are served from the cache. If the failure persists for a few seconds for more than a threshold percentage of the requests, the circuit opens and all requests are served from the cache only. After a sleep window, the system actively checks whether the downstream service is stable again; once it is, the circuit closes and the overall system returns to its normal state.
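The flow described above can be sketched roughly as follows. This is not Hystrix and not our production code: it is a simplified illustration that uses a count-based window of recent outcomes instead of a time-based one, and every name in it (`CacheFallbackBreaker`, `error_threshold_pct`, `min_requests`, `window_size`, `sleep_window_s`, `clock`) is an assumption. The cache can be any object with dict-style `get` and item assignment.

```python
import time
from collections import deque


class CacheFallbackBreaker:
    """Serve from `fetch` when healthy, from `cache` when failing or open."""

    def __init__(self, fetch, cache, error_threshold_pct=50, min_requests=4,
                 window_size=20, sleep_window_s=5.0, clock=time.monotonic):
        self.fetch = fetch
        self.cache = cache
        self.error_threshold_pct = error_threshold_pct
        self.min_requests = min_requests
        self.sleep_window_s = sleep_window_s
        self.clock = clock
        self.outcomes = deque(maxlen=window_size)  # True means failure
        self.opened_at = None  # None means the circuit is closed

    def _failure_pct(self):
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def get(self, key):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.sleep_window_s:
                return self.cache.get(key)  # open: cache only, no downstream call
            # Sleep window elapsed: let this request probe the downstream.
        try:
            value = self.fetch(key)
        except Exception:
            self.outcomes.append(True)
            if self.opened_at is not None:
                self.opened_at = self.clock()  # probe failed: restart sleep window
            elif (len(self.outcomes) >= self.min_requests
                  and self._failure_pct() >= self.error_threshold_pct):
                self.opened_at = self.clock()  # trip the circuit
            return self.cache.get(key)  # fallback for this request
        if self.opened_at is not None:
            self.opened_at = None  # a single successful probe closes the circuit
            self.outcomes.clear()
        self.outcomes.append(False)
        self.cache[key] = value  # cache every successful response
        return value
```

Note that the fallback returns `None` on a cache miss; a real service would have to decide what to return to the client in that case. The `clock` parameter exists only so that the sleep window can be exercised in tests without waiting.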
Using the example of the e-commerce site from above, with a resilient system in place, the circuit breaker keeps an ongoing evaluation of the faults from the payments processor and identifies long wait times or errors from the vendor. On such occurrences, it breaks the circuit, failing fast. As a result, users are notified of the problem and the vendor has enough time to recover.
In the meantime, the circuit breaker also keeps sending one request at regular intervals to evaluate if the vendor system is back again. If so, the circuit breaker closes the circuit immediately, allowing the rest of the calls to go through successfully, thereby effectively removing the problem of network congestion and long wait times.
And this is how we are building a resilient system for our eCommerce customers and preventing cascading downstream failures. In the future, we aim to improve and further fine-tune our cache admission strategy. We plan to use the frequency information of an LRU to implement the same. Right now we use a single successful execution to close the circuit, but we intend to use a configurable number of data points to make a more intelligent decision. Our ideal vision is a fully robust and resilient system that prevents any kind of outage or downtime and reduces the occurrence of such incidents to zero.