Resiliency Testing - Use Case (Sample Scenarios)
| # | Component | Hypothesis | Test Scenarios |
|---|---|---|---|
| 1 | Network Load Balancer (NLB) | If one of the NLB instances fails, application availability will not be impacted. (Need to validate whether failover is in place.) | Shut down one of the NLB instances |
| 2 | Application Load Balancer (ALB) | If one of the ALB instances fails, application availability will not be impacted. (A second ALB instance for failover and the switching mechanism are still to be finalized.) | Shut down one of the ALB instances |
| 3 | AZ | If one of the AZs goes down, application availability will not be impacted | Bring down one of the AZs |
| 4 | EC2 | If any of the EC2 instances dies, availability will not be impacted | Shut down one EC2 instance |
| 5 | EC2 | High resource utilization on any EC2 instance will not impact application availability | CPU attack on an EC2 instance |
| 6 | EC2 | High resource utilization on any EC2 instance will not impact application availability | Memory attack on an EC2 instance |
| 7 | EC2 | High resource utilization on any EC2 instance will not impact application availability | High disk I/O on an EC2 instance |
| 8 | Aurora RDS | If the primary RDS instance dies, the steady state will be maintained | Reboot the primary DB instance |
| 9 | S3 | Failure of a single S3 instance will not impact the corresponding service availability. (Low priority, as it only stores Terraform-related metadata.) | Induce a blackhole attack on one of the S3 instances |
| 10 | Kubernetes Cluster (EKS) | The Kubernetes cluster continues operating when one of the control plane nodes fails | Shut down one of the control plane nodes |
| 11 | Kubernetes Cluster (EKS) | When one or more worker nodes go down, the remaining nodes will continue providing services or new nodes should spin up | Shut down one of the worker nodes |
| 12 | Kubernetes Cluster (EKS) | If an entire AZ fails, availability will not be impacted | Shut down all the worker nodes inside a single AZ |
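A minimal sketch of how two of the infrastructure scenarios above (row 4, shutting down one EC2 instance, and row 8, rebooting the primary DB instance) might be automated with boto3 is shown below. The instance ID, DB identifier, and health-check URL are hypothetical placeholders for this environment, and the steady state is assumed to be a simple HTTP 200 from an application health endpoint.

```python
"""Sketch of scenarios 4 and 8 from the table above: stop one EC2
instance, reboot the primary DB instance, then re-check the steady
state. All identifiers and URLs below are hypothetical placeholders."""
import time

import boto3
import requests

HEALTH_URL = "https://app.example.internal/health"  # hypothetical health endpoint
EC2_INSTANCE_ID = "i-0123456789abcdef0"             # hypothetical EC2 instance
DB_INSTANCE_ID = "aurora-primary-instance"          # hypothetical RDS instance id

ec2 = boto3.client("ec2")
rds = boto3.client("rds")


def steady_state_ok() -> bool:
    """Steady state here is simply: the health endpoint returns HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def shut_down_one_ec2_instance() -> None:
    """Scenario 4: shut down one EC2 instance and wait until it is stopped."""
    ec2.stop_instances(InstanceIds=[EC2_INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[EC2_INSTANCE_ID])


def reboot_primary_db_instance() -> None:
    """Scenario 8: reboot the primary DB instance. For a cluster-level
    Aurora failover test, rds.failover_db_cluster(DBClusterIdentifier=...)
    would be the alternative call."""
    rds.reboot_db_instance(DBInstanceIdentifier=DB_INSTANCE_ID)


if __name__ == "__main__":
    assert steady_state_ok(), "steady state not met before the experiment"
    shut_down_one_ec2_instance()
    reboot_primary_db_instance()
    time.sleep(60)  # give the platform time to fail over / converge
    assert steady_state_ok(), "hypothesis violated: application is not available"
```

In practice these steps would typically run through a fault-injection tool (AWS FIS, Gremlin, Litmus, etc.) so that stop conditions and rollback are handled automatically; the sketch only illustrates the shape of an experiment: check the steady state, inject the fault, then re-check the hypothesis.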
| # | Application | Hypothesis | Test Scenarios |
|---|---|---|---|
| 1 | Kafka | With increasing disk I/O on the broker nodes, the corresponding pipeline throughput will decrease | Introduce a disk I/O attack on the broker nodes |
| 2 | Kafka | If a broker node fails, messages are automatically routed to other brokers with no data loss | Introduce a fault in any broker node |
| 3 | Kafka | If the leader broker fails, Kafka will quickly elect a new leader and successfully replicate messages again | Block all traffic to and from the primary broker |
| 4 | Kafka | The Kafka cluster can tolerate a short-term ZooKeeper outage without crashing or corrupting data | Block all traffic to and from the ZooKeeper nodes for a few seconds |
| 5 | NGINX Proxy | Even if one NGINX proxy instance fails, requests will still be routed to downstream services; the 4-node cluster will be spread across 3 AZs | Shut down any of the NGINX proxy instances |
| 6 | Application Microservices (Acc. No gen, LiMs etc.) | Each service will continue functioning if one or more of its instances fail | Shut down one or more instances/pods of a service |
| 7 | Application Microservices (Acc. No gen, LiMs etc.) | In case of a very slow service response, the application should remain available with graceful degradation | Introduce high latency for the specific service |
| 8 | Enterprise Services | High latency of common services should not impact application availability; the application should degrade gracefully | Introduce high latency for specific services |
| 9 | Enterprise Services | The application should degrade gracefully if dependent services are unavailable | Introduce failures for specific services |
| 10 | 3rd Party Services | The application should degrade gracefully in case of failure of 3rd party services | Introduce a failure in a specific 3rd party service |
| 11 | 3rd Party Services | The application should degrade gracefully in case of failure of 3rd party services | Introduce high latency for a specific 3rd party service |
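Row 6 above (shutting down one or more pods of a service) can be scripted in a similar way against the EKS cluster with the official Kubernetes Python client; a minimal sketch under the same assumptions follows. The namespace, label selector, and health URL are hypothetical, and a working kubeconfig for the cluster is assumed.

```python
"""Sketch of application scenario 6 from the table above: delete one
pod of a target service and verify the service stays available.
Namespace, label selector, and URL are hypothetical placeholders."""
import random
import time

import requests
from kubernetes import client, config

NAMESPACE = "apps"                                   # hypothetical namespace
LABEL_SELECTOR = "app=account-number-generator"      # hypothetical service label
HEALTH_URL = "https://app.example.internal/health"   # hypothetical health endpoint


def service_ok() -> bool:
    """Availability check: the health endpoint returns HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def kill_one_pod() -> str:
    """Pick one pod of the target service at random and delete it,
    simulating the failure of a single instance."""
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    victim = random.choice(pods).metadata.name
    v1.delete_namespaced_pod(victim, NAMESPACE)
    return victim


if __name__ == "__main__":
    assert service_ok(), "steady state not met before the experiment"
    victim = kill_one_pod()
    print(f"deleted pod {victim}; waiting for the ReplicaSet to recover")
    time.sleep(30)
    assert service_ok(), "hypothesis violated: service did not stay available"
```

The latency and dependency-failure scenarios (rows 7-11) are usually injected closer to the network, for example with tc/netem on the nodes or a service-mesh fault-injection rule, rather than from a client-side script.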