Resiliency Testing - Use Case (Sample Scenarios)
| # | Component | Hypothesis | Test Scenarios |
|---|---|---|---|
| 1 | Network Load Balancer (NLB) | If one of the NLB instances fails, application availability will not be impacted. (Need to validate whether failover is in place.) | Shut down one of the NLB instances |
| 2 | Application Load Balancer (ALB) | If one of the ALB instances fails, application availability will not be impacted. (A second ALB instance for failover and the switching mechanism are still to be finalized.) | Shut down one of the ALB instances |
| 3 | AZ | If one of the AZs goes down, application availability will not be impacted | Bring down one of the AZs |
| 4 | EC2 | If any of the EC2 instances dies, availability will not be impacted | Shut down one EC2 instance |
| 5 | EC2 | High resource utilization on any EC2 instance will not impact application availability | CPU attack on an EC2 instance |
| 6 | EC2 | High resource utilization on any EC2 instance will not impact application availability | Memory attack on an EC2 instance |
| 7 | EC2 | High resource utilization on any EC2 instance will not impact application availability | High disk I/O on an EC2 instance |
| 8 | Aurora RDS | If the primary RDS instance dies, the steady state will be maintained | Reboot the primary DB instance |
| 9 | S3 | Failure of a single S3 instance will not impact the corresponding service availability. (Low priority, as it only stores Terraform-related metadata.) | Induce a blackhole attack on one of the S3 instances |
| 10 | Kubernetes Cluster (EKS) | The Kubernetes cluster continues operating when one of the control plane nodes fails | Shut down one of the control plane nodes |
| 11 | Kubernetes Cluster (EKS) | When one or more worker nodes go down, the remaining nodes will continue providing services or new nodes should spin up | Shut down one of the worker nodes |
| 12 | Kubernetes Cluster (EKS) | If an entire AZ fails, availability will not be impacted | Shut down all the worker nodes inside a single AZ |
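A minimal sketch of how two of the infrastructure scenarios above (row 4, shutting down one EC2 instance, and row 8, rebooting the primary DB instance) might be automated with boto3 is shown below. The instance ID, DB identifier, and health-check URL are hypothetical placeholders for this environment, and the steady state is assumed to be a simple HTTP 200 from an application health endpoint.

```python
"""Sketch of scenarios 4 and 8 from the table above: stop one EC2
instance, reboot the primary DB instance, then re-check the steady
state. All identifiers and URLs below are hypothetical placeholders."""
import time

import boto3
import requests

HEALTH_URL = "https://app.example.internal/health"  # hypothetical health endpoint
EC2_INSTANCE_ID = "i-0123456789abcdef0"             # hypothetical EC2 instance
DB_INSTANCE_ID = "aurora-primary-instance"          # hypothetical RDS instance id

ec2 = boto3.client("ec2")
rds = boto3.client("rds")


def steady_state_ok() -> bool:
    """Steady state here is simply: the health endpoint returns HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def shut_down_one_ec2_instance() -> None:
    """Scenario 4: shut down one EC2 instance and wait until it is stopped."""
    ec2.stop_instances(InstanceIds=[EC2_INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[EC2_INSTANCE_ID])


def reboot_primary_db_instance() -> None:
    """Scenario 8: reboot the primary DB instance. For a cluster-level
    Aurora failover test, rds.failover_db_cluster(DBClusterIdentifier=...)
    would be the alternative call."""
    rds.reboot_db_instance(DBInstanceIdentifier=DB_INSTANCE_ID)


if __name__ == "__main__":
    assert steady_state_ok(), "steady state not met before the experiment"
    shut_down_one_ec2_instance()
    reboot_primary_db_instance()
    time.sleep(60)  # give the platform time to fail over / converge
    assert steady_state_ok(), "hypothesis violated: application is not available"
```

In practice these steps would typically run through a fault-injection tool (AWS FIS, Gremlin, Litmus, etc.) so that stop conditions and rollback are handled automatically; the sketch only illustrates the shape of an experiment: check the steady state, inject the fault, then re-check the hypothesis.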
| # | Application | Hypothesis | Test Scenarios |
|---|---|---|---|
| 1 | Kafka | With increasing disk I/O on the broker nodes, the corresponding pipeline throughput will decrease | Introduce a disk I/O attack on the broker nodes |
| 2 | Kafka | If a broker node fails, messages are automatically routed to other brokers with no data loss | Introduce a fault in any broker node |
| 3 | Kafka | If the leader broker fails, Kafka will quickly elect a new leader and successfully replicate messages again | Block all traffic to and from the primary broker |
| 4 | Kafka | The Kafka cluster can tolerate a short-term ZooKeeper outage without crashing or corrupting data | Block all traffic to and from the ZooKeeper nodes for a few seconds |
| 5 | NGINX Proxy | Even if one NGINX proxy instance fails, requests will still be routed to downstream services; the 4-node cluster will be spread across 3 AZs | Shut down any of the NGINX proxy instances |
| 6 | Application Microservices (Acc. No gen, LiMs etc.) | Each service will continue functioning if one or more of its instances fail | Shut down one or more instances/pods of a service |
| 7 | Application Microservices (Acc. No gen, LiMs etc.) | In case of a very slow service response, the application should remain available with graceful degradation | Introduce high latency for the specific service |
| 8 | Enterprise Services | High latency of common services should not impact application availability; the application should degrade gracefully | Introduce high latency for specific services |
| 9 | Enterprise Services | The application should degrade gracefully if dependent services are unavailable | Introduce failures for specific services |
| 10 | 3rd Party Services | The application should degrade gracefully in case of failure of 3rd party services | Introduce a failure in a specific 3rd party service |
| 11 | 3rd Party Services | The application should degrade gracefully in case of failure of 3rd party services | Introduce high latency for a specific 3rd party service |
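Row 6 above (shutting down one or more pods of a service) can be scripted in a similar way against the EKS cluster with the official Kubernetes Python client; a minimal sketch under the same assumptions follows. The namespace, label selector, and health URL are hypothetical, and a working kubeconfig for the cluster is assumed.

```python
"""Sketch of application scenario 6 from the table above: delete one
pod of a target service and verify the service stays available.
Namespace, label selector, and URL are hypothetical placeholders."""
import random
import time

import requests
from kubernetes import client, config

NAMESPACE = "apps"                                   # hypothetical namespace
LABEL_SELECTOR = "app=account-number-generator"      # hypothetical service label
HEALTH_URL = "https://app.example.internal/health"   # hypothetical health endpoint


def service_ok() -> bool:
    """Availability check: the health endpoint returns HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def kill_one_pod() -> str:
    """Pick one pod of the target service at random and delete it,
    simulating the failure of a single instance."""
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    victim = random.choice(pods).metadata.name
    v1.delete_namespaced_pod(victim, NAMESPACE)
    return victim


if __name__ == "__main__":
    assert service_ok(), "steady state not met before the experiment"
    victim = kill_one_pod()
    print(f"deleted pod {victim}; waiting for the ReplicaSet to recover")
    time.sleep(30)
    assert service_ok(), "hypothesis violated: service did not stay available"
```

The latency and dependency-failure scenarios (rows 7-11) are usually injected closer to the network, for example with tc/netem on the nodes or a service-mesh fault-injection rule, rather than from a client-side script.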