Building Resilient Applications in the Cloud

8 min readJan 15, 2024

When building an application for serving customers, one of the questions raised is how do I know if my application is resilient and will survive a failure?

In this blog post, we will review what it means to build resilient applications in the cloud, and we will review some of the common best practices for achieving resilient applications.

What does it mean resilient applications?

AWS provides us with the following definition for the term resiliency:

“The ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.”

(Source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/resiliency-and-the-components-of-reliability.html)

Resiliency is part of the Reliability pillar for cloud providers such as AWS, Azure, GCP, and Oracle Cloud.

AWS takes it one step further, and shows how resiliency is part of the shared responsibility model:

The cloud provider is responsible for the resilience of the cloud (i.e., hardware, software, computing, storage, networking, and anything related to their data centers)
The customer is responsible for the resilience in the cloud (i.e., selecting the services to use, building resilient architectures, backup strategies, data replication, and more).

Source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/shared-responsibility-model-for-resiliency.html

How do we build resilient applications?

This blog post assumes that you are building modern applications in the public cloud.

We have all heard of RTO (Recovery time objective).

Resilient workload (a combination of application, data, and the infrastructure that supports it), should not only recover automatically, but it must recover within a pre-defined RTO, agreed by the business owner.

Below are common best practices for building resilient applications:

Design for high-availability

The public cloud allows you to easily deploy infrastructure over multiple availability zones.

Examples of implementing high availability in the cloud:

Deploying multiple VMs behind an auto-scaling group and a front-end load-balancer
Spreading container load over multiple Kubernetes worker nodes, deploying in multiple AZs
Deploying a cluster of database instances in multiple AZs
Deploying global (or multi-regional) database services (such as Amazon Aurora Global Database, Azure Cosmos DB, Google Cloud Spanner, and Oracle Global Data Services (GDS)
Configure DNS routing rules to send customers’ traffic to more than a single region
Deploy global load-balancer (such as AWS Global Accelerator, Azure Cross-region Load Balancer, or Google Global external Application Load Balancer) to spread customers’ traffic across regions

Implement autoscaling

Autoscaling is one of the biggest advantages of the public cloud.

Assuming we built a stateless application, we can add or remove additional compute nodes using autoscaling capability, and adjust it to the actual load on our application.

In a cloud-native infrastructure, we will use a managed load-balancer service, to receive traffic from customers, and send an API call to an autoscaling group, to add or remove additional compute nodes.

Implement microservice architecture

Microservice architecture is meant to break a complex application into smaller parts, each responsible for certain functionality of the application.

By implementing microservice architecture, we are decreasing the impact of failed components on the rest of the application.

In case of high load on a specific component, it is possible to add more compute resources to the specific component, and in case we discover a bug in one of the microservices, we can roll back to a previous functioning version of the specific microservice, with minimal impact on the rest of the application.

Implement event-driven architecture

Event-driven architecture allows us to decouple our application components.

Resiliency can be achieved using event-driven architecture, by the fact that even if one component fails, the rest of the application continues to function.

Components are loosely coupled by using events that trigger actions.

Event-driven architectures are usually (but not always) based on services managed by cloud providers, who are responsible for the scale and maintenance of the managed infrastructure.

Event-driven architectures are based on models such as pub/sub model (services such as Amazon SQS, Azure Web PubSub, Google Cloud Pub/Sub, and OCI Queue service) or based on event delivery (services such as Amazon EventBridge, Azure Event Grid, Google Eventarc, and OCI Events service).

Implement API Gateways

If your application exposes APIs, use API Gateways (services such as Amazon API Gateway, Azure API Management, Google Apigee, or OCI API Gateway) to allow incoming traffic to your backend APIs, perform throttling to protect the APIs from spikes in traffic, and perform authorization on incoming requests from customers.

Implement immutable infrastructure

Immutable infrastructure (such as VMs or containers) are meant to run application components, without storing session information inside the compute nodes.

In case of a failed component, it is easy to replace the failed component with a new one, with minimal disruption to the entire application, allowing to achieve fast recovery.

Data Management

Find the most suitable data store for your workload.

A microservice architecture allows you to select different data stores (from object storage to backend databases) for each microservice, decreasing the risk of complete failure due to availability issues in one of the backend data stores.

Once you select a data store, replicate it across multiple AZs, and if the business requires it, replicate it across multiple regions, to allow better availability, closer to the customers.

Implement observability

By monitoring all workload components, and sending logs from both infrastructure and application components to a central logging system, it is possible to identify anomalies, anticipate failures before they impact customers, and act.

Examples of actions can be sending a command to restart a VM, deploying a new container instead of a failed one, and more.

It is important to keep track of measurements — for example, what is considered normal response time to a customer request, to be able to detect anomalies.

Implement chaos engineering

The base assumption is that everything will eventually fail.

Implementing chaos engineering, allows us to conduct controlled experiments, inject faults into our workloads, testing what will happen in case of failure.

This allows us to better understand if our workload will survive a failure.

Examples can be adding load on disk volumes, injecting timeout when an application tier connects to a backend database, and more.

Examples of services for implementing chaos engineering are AWS Fault Injection Simulator, Azure Chaos Studio, and Gremlin.

Create a failover plan

In an ideal world, your workload will be designed for self-healing, meaning, it will automatically detect a failure and recover from it, for example, replace failed components, restart services, or switch to another AZ or even another region.

In practice, you need to prepare a failover plan, keep it up to date, and make sure your team is trained to act in case of major failure.

A disaster recovery plan without proper and regular testing is worth nothing — your team must practice repeatedly, and adjust the plan, and hopefully, they will be able to execute the plan during an emergency with minimal impact on customers.

Resilient applications tradeoffs

Failure can happen in various ways, and when we design our workload, we need to limit the blast radius on our workload.

Below are some common failure scenarios, and possible solutions:

Failure in a specific component of the application — By designing a microservice architecture, we can limit the impact of a failed component to a specific area of our application (depending on the criticality of the component, as part of the entire application)
Failure or a single AZ — By deploying infrastructure over multiple AZs, we can decrease the chance of application failure and impact on our customers
Failure of an entire region — Although this scenario is rare, cloud regions also fail, and by designing a multi-region architecture, we can decrease the impact on our customers
DDoS attack — By implementing DDoS protection mechanisms, we can decrease the risk of impacting our application with a DDoS attack

Whatever solution we design for our workloads, we need to understand that there is a cost and there might be tradeoffs for the solution we design.

Multi-region architecture aspects

A multi-region architecture will allow the most highly available resilient solution for your workloads; however, multi-region adds high cost for cross-region egress traffic, most services are limited to a single region, and your staff needs to know to support such a complex architecture.

Another limitation of multi-region architecture is data residency — if your business or regulator demands that customers’ data be stored in a specific region, a multi-region architecture is not an option.

Service quota/service limits

When designing a highly resilient architecture, we must take into consideration service quotas or service limits.

Sometimes we are bound to a service quota on a specific AZ or region, an issue that we may need to resolve with the cloud provider’s support team.

Sometimes we need to understand there is a service limit in a specific region, such as a specific service that is not available in a specific region, or there is a shortage of hardware in a specific region.

Autoscaling considerations

Horizontal autoscale (the ability to add or remove compute nodes) is one of the fundamental capabilities of the cloud, however, it has its limitations.

Provisioning a new compute node (from a VM, container instance, or even database instance) may take a couple of minutes to spin up (which may impact customer experience) or to spin down (which may impact service cost).

Also, to support horizontal scaling, you need to make sure the compute nodes are stateless, and that the application supports such capability.

Failover considerations

One of the limitations of database failover is their ability to switch between the primary node and one of the secondary nodes, either in case of failure or in case of scheduled maintenance.

We need to take into consideration the data replication, making sure transactions were saved and moved from the primary to the read replica node.

Summary

In this blog post, we have covered many aspects of building resilient applications in the cloud.

When designing new applications, we need to understand the business expectations (in terms of application availability and customer impact).

We also need to understand the various architectural design considerations, and their tradeoffs, to be able to match the technology to the business requirements.

As I always recommend — do not stay on the theoretical side of the equation, begin designing and building modern and highly resilient applications to serve your customers — There is no replacement for actual hands-on experience.

References

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.

Opinions are his own and not the views of his employer.