One of the challenges with microservices based applications lies in choosing the mechanism for inter-service communication. A microservices based application is a distributed system running on multiple processes or services, the interaction between services is more than simple function calls and need to use protocols such as HTTP, AMQP or a binary protocol like TCP. To achieve the full advantages of microservices based architecture, such as being able to individually scale, operate, and evolve each service, this communication must also be very loosely coupled.
Services typically interact with one of the 3 common patterns:
- Request-response messaging: A producer produces a message, expects a response from each message consumer who received that message to know what the result of all executions was.
- One-way messaging: A producer produces a message to a messaging channel and doesn’t expect or want a response from any consumer.
- Publish-Subscribe messaging: A message is published to a topic and is immediately received by all of the subscribers of the topic.
All three are very common communication patterns and all are applicable for various use cases at Unbxd. Without central platform-level support, every team/service end up building their own solution.
NATS is an open-source cloud-native messaging system. NATS solves the problems of performance and availability while staying incredibly lean. It is always on and available and uses a fire and forget messaging pattern. It’s simplicity, focus and lightweight characteristics make it a prime candidate for the microservices ecosystem.
- Highly performant and lightweight
- Clustered servers
- Cluster aware clients
- Text-based simple protocol
- No external dependency like zookeeper, etcd etc
- Unlike many other messaging systems, no need to create topics and subscriptions before use.
- Easy to extend and expand clusters using Gateways.
- Wildcard support on subscription topics.
NATS supports all three communication patterns described earlier.
Use-Cases at Unbxd
Here we discuss a couple of use cases as to how we leverage NATS as a central messaging and communication platform at Unbxd.
We have a central configuration service which is responsible for managing all customer configurations. These configurations are consumed by various other services. Configuration updates need to be available to all consumer services in real-time. Consumer services should also be able to subscribe to updates on only a subset of configurations.
NATS provides all the capabilities required to meet these requirements.
Configuration updates are published on NATS notifier with subjects as service names, for example, service.service-a.property-x, service.service-b.property-y or particular configuration like property.type-a, property.type-b. Consumer services subscribe to the interested topics and receive real-time updates. NATS wildcard subscription feature greatly simplifies the subscription logic by simply subscribing on service.service-a.*, this enables subscription on all updates on service-a properties. New configurations can be subscribed simply by adding the new topic to a list of interested topics.
Here is what happens in a step by step manner
- Client publishes a configuration update to config service.
- Config service updates the local store and publishes the message on NATS cluster on certain subjects.
- NATS pushes the data to cluster nodes that are connected to the consumer interested in these subjects.
- Consumer service receives the message and updates its local store.
Cross-region synchronization and replication
Unbxd serves customers across the globe with its services deployed across multiple AWS regions. Data for a particular customer resides in a specific region and is served only from that region. We have an application-level mechanism to redirect request that has landed in the wrong AWS region due to various routing reasons back to the customer’s home region. This obviously adds considerable latency to the end-user. Also, the inability to serve requests from a different region can cause outage due to a region/service failures.
Cross-region replication thus becomes a basic requirement to be able to serve requests from any region and a fault-tolerant system.
In order to be able to serve requests from any region, we need to replicate all our data, configurations, and services across multiple regions. Most of our services are stateless and hence it becomes relatively easy to replicate services across regions with proper deployment practices. The major challenge lies in replicating configuration and data which needs to be replicated in near real-time.
When we started with the design, we decided that each region will be an isolated entity and there will be no inter-region service calls. Any request will be served completely by a single region only. We also decided that strong consistency is not a requirement. We rely on the eventual consistency model for replication where differences are okay for a short period of time across regions. It’s okay if 2 different regions occasionally have slightly different configuration/data. These assumptions simplify the design to not deal with global locking, transactional update, inter-region reads, rollbacks, etc.
The diagram above shows the high level design of cross-region infrastructure.
The design involves building a layer on top of the existing stack in each region which can intercept the actions, can publish it on a message broker which spans across the regions, and can replay the action.
This service acts as a proxy between the actual service and the client. It proxies HTTP requests (Action) to downstream service. On successful response, It converts the request to an event entity, decides the topic on which request needs to be published, and publishes the request on a message broker. Eventually, this service will be responsible for handling concurrency and retries.
This service subscribes to the messages on the broker published by the bridge. Based on message metadata, it decides which other regions the event needs to be published and publishes it on message broker for different regions.
This service listens to the messages on the broker. Based on message metadata, it converts the event into an HTTP request and makes an appropriate downstream HTTP call.
NATS cluster and gateway
NATS is responsible for asynchronous communication between the bridge, the broadcaster, and the replayer. NATS clusters are deployed independently in each AWS region, NATS gateway forms a communication link between clusters in different regions. When a message is published on NATS cluster on a given topic, if a subscription exists for that topic in another region, NATS gateway is responsible to make that message available to that cross-region subscriber.
The diagram above shows how a single Action in Region 1 is replicated in Region 2.
- The client requests an action in region 1. It submits the request on the Bridge.
- Bridge proxies the requests to downstream service in the same region.
- Upon successful response, the bridge converts the action into an event and publishes it on the NATS cluster in the same region.
- Broadcaster in the same region receives the event, reads the metadata, determines which other regions event needs to be published.
- Broadcaster publishes the message on the subjects (topic) of those regions.
- NATS figures out that there are subscribers on the other region and sends the data across the region via gateway.
- Replayer in the Region 2 receives the event, converts it back into action and calls the downstream (downstream info is embedded into the event in a region agnostic manner).
NATS server which is designed for high performance and simplicity doesn’t provide a persistent store for the messages. NATS Streaming comes with a persistent store for having a log for the messages that publish over the NATS server. In order to make the system build on top of NATS to be reliable and resilient, we will be exploring NATS streaming as an alternative.