NATS is an open-source, cloud-native messaging system. It delivers high performance and availability while staying very lean. It is designed to be always on and uses a fire-and-forget messaging pattern. Its simplicity, focus, and lightweight footprint make it a prime candidate for the microservices ecosystem.
- Highly performant and lightweight
- Clustered servers
- Cluster aware clients
- Text-based simple protocol
- No external dependencies such as ZooKeeper
- Unlike many other messaging systems, no need to create topics and subscriptions before use.
- Easy to extend and expand clusters using Gateways.
- Wildcard support on subscription topics.
NATS supports all three communication patterns described earlier.
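To give a feel for how simple the text-based protocol is, the sketch below builds the two frames a client writes to the server socket to publish and to subscribe. The helper names `pub_frame` and `sub_frame` are hypothetical and exist only for illustration; real client libraries build these frames for you.

```python
def pub_frame(subject: str, payload: bytes) -> bytes:
    # Frame layout: PUB <subject> <#bytes> CRLF <payload> CRLF
    header = f"PUB {subject} {len(payload)}\r\n".encode("ascii")
    return header + payload + b"\r\n"


def sub_frame(subject: str, sid: str) -> bytes:
    # Frame layout: SUB <subject> <sid> CRLF (no queue group in this sketch)
    return f"SUB {subject} {sid}\r\n".encode("ascii")


if __name__ == "__main__":
    print(pub_frame("service.service-a.property-x", b"on"))
    print(sub_frame("service.service-a.*", "1"))
```

Note that nothing has to be declared up front: the subject in the PUB frame is just a string, which is why no topics need to be created before use.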
Use-Cases at Unbxd
Here we discuss a couple of use cases for leveraging NATS as a central messaging and communication platform at Unbxd.
We have a central configuration service responsible for managing all customer configurations. Various other services consume these configurations. Configuration updates need to be available to all consumer services in real time. Consumer services should also be able to subscribe to updates on only a subset of configurations.
NATS provides all the capabilities required to meet these requirements.
Configuration updates are published on the NATS notifier with subjects built from service names, for example service.service-a.property-x or service.service-b.property-y, or from a particular configuration type, such as property.type-a or property.type-b. Consumer services subscribe to the subjects they are interested in and receive updates in real time. The NATS wildcard subscription feature greatly simplifies the subscription logic: subscribing to service.service-a.* delivers all updates to service-a properties. A consumer can pick up new configurations simply by adding the new subject to its list of subscriptions.
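The wildcard rules are easy to state in code. The following is a minimal sketch of NATS-style subject matching, where `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens. The function `subject_matches` is a hypothetical helper written for this article, not part of any NATS client; the server performs this matching itself.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches a NATS-style subscription `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # `>` must be the last token and needs at least one token to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without `>`, the pattern and subject must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    print(subject_matches("service.service-a.*", "service.service-a.property-x"))
    print(subject_matches("service.service-a.*", "service.service-b.property-y"))
    print(subject_matches("service.>", "service.service-b.property-y"))
```

With this rule, a consumer that cares about everything under service-a needs a single subscription rather than one per property.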
Here is what happens, step by step:
- The client publishes a configuration update to the config service.
- The config service updates its local store and publishes the message to the NATS cluster on the relevant subjects.
- NATS pushes the data to the cluster nodes connected to the consumers interested in these subjects.
- Consumer service receives the message and updates its local store.
Cross-region synchronization and replication
Unbxd serves customers globally, with its services deployed across multiple AWS regions. A customer's data resides in a specific region and is served only from that region. When a request lands in the wrong AWS region for routing reasons, an application-level mechanism redirects it back to the customer's home region. This redirect adds considerable latency for the end user. Also, because a request cannot be served from any other region, a region or service failure can cause an outage.
Cross-region replication thus becomes essential, both to serve requests from any region and to make the system fault tolerant.
To be able to serve requests from any region, we need to replicate all our data, configurations, and services across multiple regions. Most of our services are stateless; hence, it becomes relatively easy to replicate services across regions with proper deployment practices. The major challenge lies in replicating configuration and data, which need to be replicated in near real time.
When we started with the design, we decided that each region would be an isolated entity, and there would be no inter-region service calls. Any request will be served completely by a single region. We also decided that strong consistency is not a requirement. We rely on an eventual consistency model for replication, where differences across regions are acceptable for a short period. For example, it's okay if two regions occasionally have slightly different configuration or data. These assumptions simplify the design, which does not have to deal with global locking, transactional updates, inter-region reads, rollbacks, and so on.
The diagram above shows the high-level design of cross-region infrastructure.
The design involves building a layer on top of the existing stack in each region, which can intercept the actions, publish them on a message broker across the regions, and replay the action.
The bridge acts as a proxy between the actual service and the client. It proxies HTTP requests (actions) to downstream services. On a successful response, it converts the request into an event entity, decides the topic on which the request needs to be published, and publishes the request on a message broker. Eventually, this service will also be responsible for handling concurrency and retries.
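A minimal sketch of this conversion step follows. The `Event` fields, the `actions.<region>.<service>` subject scheme, and the treatment of any 2xx status as "successful" are all assumptions made for illustration; the article does not specify the actual event schema or subject naming.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class Event:
    event_id: str
    origin_region: str
    service: str   # logical downstream name, not a region-specific host
    method: str
    path: str
    headers: Dict[str, str]
    body: str


def publishable(status: int) -> bool:
    """Assumption: only requests the downstream answered with 2xx become events."""
    return 200 <= status < 300


def to_event(origin_region: str, service: str, method: str, path: str,
             headers: Dict[str, str], body: str) -> Event:
    """Wrap a successfully proxied HTTP request as a replayable event."""
    return Event(str(uuid.uuid4()), origin_region, service,
                 method.upper(), path, dict(headers), body)


def subject_for(event: Event) -> str:
    """Choose the subject to publish on (hypothetical naming scheme)."""
    return f"actions.{event.origin_region}.{event.service}"


def encode(event: Event) -> bytes:
    """Serialize the event as JSON for the message payload."""
    return json.dumps(asdict(event)).encode("utf-8")


if __name__ == "__main__":
    if publishable(200):
        ev = to_event("region-1", "config-service", "put", "/v1/config/x",
                      {"Content-Type": "application/json"}, '{"value": 42}')
        print(subject_for(ev), encode(ev))
```

Storing a logical service name instead of a host keeps the event region-agnostic, so the receiving side can resolve it to its own local endpoint.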
The broadcaster subscribes to the messages that the bridge publishes on the broker. Based on message metadata, it decides which other regions the event needs to be published to and publishes it on the message broker for those regions.
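The region-selection logic could look like the sketch below. The metadata keys (`origin_region`, `replicate_to`, `service`), the region names, and the `replay.<region>.<service>` subject scheme are hypothetical; they only illustrate the idea of fanning an event out to every region except the one it came from.

```python
from typing import Any, Dict, List

ALL_REGIONS = ["region-1", "region-2", "region-3"]  # hypothetical region names


def target_regions(metadata: Dict[str, Any]) -> List[str]:
    """Pick the regions an event should be replayed in, excluding its origin."""
    origin = metadata["origin_region"]
    # Assumption: an optional `replicate_to` list narrows the fan-out.
    wanted = metadata.get("replicate_to") or ALL_REGIONS
    return [r for r in wanted if r != origin and r in ALL_REGIONS]


def replay_subjects(metadata: Dict[str, Any]) -> List[str]:
    """One subject per target region, each consumed by that region's subscriber."""
    service = metadata["service"]
    return [f"replay.{region}.{service}" for region in target_regions(metadata)]


if __name__ == "__main__":
    print(replay_subjects({"origin_region": "region-1", "service": "config"}))
```

Excluding the origin region matters: without it, an event replayed in its home region would be applied twice.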
The replayer listens to the messages on the broker. Based on message metadata, it converts the event into an HTTP request and makes the appropriate downstream HTTP call.
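Converting an event back into an HTTP request might look like the following sketch, which uses only the Python standard library. The event fields and the `LOCAL_SERVICES` lookup table are assumptions; the request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical per-region lookup from logical service name to a local base URL.
LOCAL_SERVICES = {"config-service": "http://config-service.internal:8080"}


def to_request(raw: bytes) -> urllib.request.Request:
    """Rebuild the HTTP request described by an event, targeting the local region."""
    event = json.loads(raw)
    url = LOCAL_SERVICES[event["service"]] + event["path"]
    body = event["body"].encode("utf-8") if event.get("body") else None
    return urllib.request.Request(
        url, data=body, headers=event.get("headers", {}), method=event["method"]
    )


raw = json.dumps({
    "service": "config-service",
    "method": "PUT",
    "path": "/v1/config/x",
    "headers": {"Content-Type": "application/json"},
    "body": '{"value": 42}',
}).encode("utf-8")

req = to_request(raw)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would perform the downstream call; omitted here.
```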
NATS cluster and gateway
NATS is responsible for asynchronous communication between the bridge, the broadcaster, and the replayer. NATS clusters are deployed independently in each AWS region; the NATS gateway forms a communication link between clusters in different regions. When a message is published on the NATS cluster on a given topic, if a subscription exists for that topic in another region, the NATS gateway is responsible for making that message available to that cross-region subscriber.
The diagram above shows how a single Action in Region 1 is replicated in Region 2.
- The client requests an action in Region 1 by submitting the request to the bridge.
- The bridge proxies the request to downstream services in the same region.
- Upon a successful response, the bridge converts the action into an event and publishes it on the NATS cluster in the same region.
- The broadcaster in the same region receives the event, reads the metadata, and determines which other regions the event needs to be published to.
- The broadcaster publishes the message on the subjects (topics) of those regions.
- NATS identifies subscribers in the other region and sends the data across regions via the gateway.
- Replayer in Region 2 receives the event, converts it back into action, and calls the downstream (downstream info is embedded into the event in a region-agnostic manner).
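The whole flow can be simulated in a few lines. This is a toy model under stated assumptions, not the production design: `Cluster` is a hypothetical in-memory stand-in for a regional NATS cluster, subjects are matched exactly, the downstream call is assumed to succeed, and the gateway is reduced to "forward only if the remote cluster has a subscription on the subject".

```python
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class Cluster:
    """In-memory stand-in for one region's NATS cluster."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.subs: Dict[str, List[Handler]] = {}
        self.peers: List["Cluster"] = []  # clusters reachable through a gateway

    def subscribe(self, subject: str, handler: Handler) -> None:
        self.subs.setdefault(subject, []).append(handler)

    def publish(self, subject: str, event: dict) -> None:
        self._deliver(subject, event)
        for peer in self.peers:
            # Gateway: forward only when the remote cluster has interest.
            if subject in peer.subs:
                peer._deliver(subject, event)

    def _deliver(self, subject: str, event: dict) -> None:
        for handler in list(self.subs.get(subject, [])):
            handler(subject, event)


r1, r2 = Cluster("region-1"), Cluster("region-2")
r1.peers.append(r2)
r2.peers.append(r1)

replayed: List[dict] = []  # stands in for Region 2's downstream service


def bridge(action: dict) -> None:
    # Steps 1-3: proxy the action (assumed successful), then publish an event.
    event = {"origin": "region-1", "targets": ["region-2"], "action": action}
    r1.publish("events.region-1", event)


def broadcaster(subject: str, event: dict) -> None:
    # Steps 4-5: read the metadata and republish on each target region's subject.
    for region in event["targets"]:
        r1.publish(f"replay.{region}", event)


def replayer(subject: str, event: dict) -> None:
    # Step 7: convert the event back into an action and call downstream.
    replayed.append(event["action"])


r1.subscribe("events.region-1", broadcaster)
r2.subscribe("replay.region-2", replayer)

bridge({"method": "PUT", "path": "/v1/config/x", "body": "42"})
print(replayed)  # [{'method': 'PUT', 'path': '/v1/config/x', 'body': '42'}]
```

Step 6 is the `publish` loop over `peers`: Region 1 has no local subscriber on `replay.region-2`, so the message only leaves the region because Region 2 registered interest in that subject.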
The NATS server, designed for high performance and simplicity, doesn't provide a persistent message store. NATS Streaming adds a persistent store that keeps a log of the messages published through the NATS server. To make the system built on top of NATS reliable and resilient, we will explore NATS Streaming as an alternative.