Chaos Engineering Basics for QE
Introduction to chaos engineering and resilience testing
Introduction
Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions. Where traditional testing verifies that the system works when everything goes right, chaos engineering builds evidence that it keeps working when things go wrong.
What is Chaos Engineering?
Definition: The practice of intentionally injecting failures into your system to identify weaknesses before they cause outages.
Think of it like earthquake testing for buildings—you stress the system before real disasters strike.
Why Chaos Engineering?
Modern systems are complex, and failure is inevitable:
- Microservices have many failure points
- Cloud infrastructure can fail unpredictably
- Networks are unreliable
- Third-party services go down
- Traffic spikes happen unexpectedly
Goal: Make sure your system degrades gracefully, not catastrophically.
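As a concrete sketch of what graceful degradation can look like in application code, a read path might fall back to stale cached data when its dependency fails instead of surfacing an error. The function and cache names here are illustrative, not from any specific framework:

```javascript
// Sketch: a read path that degrades to stale cached data when its
// dependency fails (the function and cache names are illustrative).
async function getProducts(fetchFromDb, cache) {
  try {
    const products = await fetchFromDb();
    cache.set('products', products); // refresh the cache on the happy path
    return { products, degraded: false };
  } catch (err) {
    const stale = cache.get('products');
    if (stale !== undefined) {
      // Degraded but functional: stale data instead of an error page
      return { products: stale, degraded: true };
    }
    throw err; // no fallback available -- fail loudly
  }
}
```

A chaos experiment then checks that this fallback actually fires: kill the database and assert that responses are still 200s, flagged as degraded.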
The Principles of Chaos Engineering
1. Start with Steady State
Define what "normal" looks like:
// Define steady state metrics
const steadyState = {
// Business metrics
ordersPerMinute: { min: 50, max: 200 },
checkoutSuccessRate: { min: 98 },
// System metrics
apiResponseTime: { p95: 200, p99: 500 }, // milliseconds
errorRate: { max: 1 }, // percent
// Infrastructure
cpuUsage: { max: 70 }, // percent
memoryUsage: { max: 80 }
};
async function verifySteadyState() {
const current = await getCurrentMetrics();
return {
isHealthy:
current.ordersPerMinute >= steadyState.ordersPerMinute.min &&
current.checkoutSuccessRate >= steadyState.checkoutSuccessRate.min &&
current.apiResponseTime.p95 <= steadyState.apiResponseTime.p95 &&
current.errorRate <= steadyState.errorRate.max,
metrics: current
};
}
2. Hypothesize About Steady State
Form testable hypotheses:
## Hypothesis: System Handles Database Failure
**Given**: The primary database becomes unavailable
**When**: We continue to receive normal traffic
**Then**:
- The system fails over to read replica within 5 seconds
- 95% of read requests still succeed
- Write requests are queued and processed when DB recovers
- Users see degraded functionality message, not errors
**Success Criteria**:
- API availability stays above 90%
- P95 latency increases by < 50%
- No data loss occurs
- System recovers automatically within 60 seconds
3. Introduce Real-World Events
Simulate realistic failures:
Types of Chaos:
- Resource Failures
  - Server crashes
  - Container restarts
  - Pod evictions
- Network Issues
  - Latency injection
  - Packet loss
  - Bandwidth limits
  - DNS failures
- Application Failures
  - Service crashes
  - Memory leaks
  - CPU spikes
  - Dependency failures
- Infrastructure Chaos
  - Zone outages
  - Cloud provider issues
  - Database failures
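Several of these failure modes can be reproduced in-process during tests before reaching for a dedicated tool. For example, latency injection can be sketched as a wrapper that delays any async call; this helper is hypothetical, not part of any tool discussed below:

```javascript
// Sketch: wrap an async function so every call is delayed by a fixed
// latency plus optional random jitter (a hypothetical in-process
// fault injector, not from any specific chaos tool).
function withLatency(fn, { latencyMs = 1000, jitterMs = 0 } = {}) {
  return async (...args) => {
    const delayMs = latencyMs + Math.random() * jitterMs;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return fn(...args);
  };
}
```

Wrapping a client method this way lets a test assert that callers time out or fall back correctly without touching the real network.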
4. Disprove the Hypothesis
Try to break your system:
// Chaos experiment workflow
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)); // used when waiting for recovery
async function runChaosExperiment(experiment) {
// 1. Verify steady state
console.log('📊 Verifying steady state...');
const baseline = await verifySteadyState();
if (!baseline.isHealthy) {
throw new Error('System not in steady state - abort experiment');
}
// 2. Inject chaos
console.log('💥 Injecting chaos:', experiment.name);
const chaos = await injectChaos(experiment.chaos);
// 3. Monitor during chaos
console.log('👀 Monitoring system behavior...');
const duringChaos = await monitorDuring(experiment.duration);
// 4. Remove chaos
console.log('🔧 Removing chaos...');
await removeChaos(chaos);
// 5. Verify recovery
console.log('✅ Verifying recovery...');
await sleep(experiment.recoveryTime);
const afterChaos = await verifySteadyState();
// 6. Analyze results
return analyzeResults({
hypothesis: experiment.hypothesis,
baseline,
duringChaos,
afterChaos,
expectedImpact: experiment.expectedImpact
});
}
5. Automate and Minimize Blast Radius
Start small, expand carefully:
const chaosConfig = {
// Start with non-prod
environments: ['dev', 'staging'], // Not 'prod' yet!
// Limit scope
targetPercentage: 10, // Only affect 10% of instances
// Safety controls
maxDuration: 300, // 5 minutes max
abortOnMetricThreshold: {
errorRate: 5, // Abort if errors > 5%
latencyP95: 5000 // Abort if p95 > 5s
},
// Schedule
runDuring: {
days: ['Tuesday', 'Wednesday', 'Thursday'],
hours: [10, 11, 14, 15], // Business hours with team present
excludeHolidays: true
}
};
Chaos Engineering Tools
1. Chaos Monkey (Netflix)
Randomly terminates instances:
# chaos-monkey-config.yml
chaos:
monkey:
enabled: true
schedule:
enabled: true
frequency: 30 # Every 30 minutes
# Termination settings
termination:
probability: 0.1 # 10% chance
enabled: true
# Safety settings
excludeDays:
- Saturday
- Sunday
excludeHours:
start: 0
end: 9 # Don't run 12am-9am
# Target groups
grouping:
type: app
apps: ['api-service', 'worker-service']
excludeApps: ['payment-service'] # Too critical
Running Chaos Monkey:
# Install
npm install @netflix/chaos-monkey
# Configure
export CHAOS_MONKEY_ENABLED=true
export CHAOS_MONKEY_PROBABILITY=0.1
# Run
chaos-monkey start
2. Chaos Toolkit
Framework for chaos experiments:
# experiment.yaml
version: 1.0.0
title: Database Failover Test
description: Verify system handles primary database failure gracefully
steady-state-hypothesis:
title: System is healthy
probes:
- name: api-is-available
type: probe
tolerance: 200
provider:
type: http
url: https://api.example.com/health
- name: error-rate-is-low
type: probe
tolerance:
type: range
range: [0, 1.0]
provider:
type: prometheus
query: rate(http_errors_total[1m])
method:
- type: action
name: kill-primary-database
provider:
type: process
path: kubectl
arguments:
- delete
- pod
- -l
- app=postgres,role=primary
- type: probe
name: check-failover-time
provider:
type: python
module: chaoslib.probes.http
func: get
arguments:
url: https://api.example.com/health
timeout: 10
- type: action
name: wait-for-recovery
provider:
type: python
module: time
func: sleep
arguments:
seconds: 30
rollbacks:
- type: action
name: restore-primary-database
provider:
type: process
path: kubectl
arguments:
- apply
- -f
- postgres-primary.yaml
Run experiment:
# Install
pip install chaostoolkit
# Validate experiment
chaos validate experiment.yaml
# Run experiment
chaos run experiment.yaml
# Run with report
chaos run experiment.yaml --rollback-strategy=always --journal-path=journal.json
3. Gremlin
Commercial chaos engineering platform:
// Gremlin Node.js SDK
const Gremlin = require('gremlin-client');
const gremlin = new Gremlin({
teamId: process.env.GREMLIN_TEAM_ID,
apiKey: process.env.GREMLIN_API_KEY
});
// CPU attack
async function cpuAttack() {
const attack = await gremlin.attacks.create({
target: {
type: 'Random',
exact: 1,
tags: {
service: 'api-service',
env: 'staging'
}
},
impact: {
type: 'cpu',
percent: 80, // Use 80% CPU
length: 300 // 5 minutes
}
});
console.log('Attack started:', attack.id);
return attack;
}
// Latency attack
async function latencyAttack() {
const attack = await gremlin.attacks.create({
target: {
type: 'Random',
percent: 50, // Affect 50% of instances
tags: { service: 'payment-service' }
},
impact: {
type: 'latency',
ms: 2000, // Add 2 second delay
length: 600
}
});
return attack;
}
// Network blackhole
async function networkBlackhole() {
const attack = await gremlin.attacks.create({
target: {
type: 'Exact',
exact: 1,
tags: { service: 'database' }
},
impact: {
type: 'blackhole',
port: 5432, // PostgreSQL port
length: 120
}
});
return attack;
}
4. Litmus (Kubernetes)
Chaos engineering for Kubernetes:
# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: api-service-chaos
namespace: default
spec:
appinfo:
appns: default
applabel: 'app=api-service'
appkind: deployment
engineState: 'active'
chaosServiceAccount: litmus-admin
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: '60'
- name: CHAOS_INTERVAL
value: '10'
- name: FORCE
value: 'false'
probe:
- name: check-api-availability
type: httpProbe
httpProbe/inputs:
url: http://api-service/health
insecureSkipVerify: false
method:
get:
criteria: ==
responseCode: "200"
mode: Continuous
runProperties:
probeTimeout: 5
interval: 2
retry: 1
Deploy and run:
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml
# Install experiments
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.8?file=charts/generic/experiments.yaml
# Run experiment
kubectl apply -f pod-delete-experiment.yaml
# Check status
kubectl describe chaosengine api-service-chaos
# View results
kubectl logs -l name=chaos-runner
5. Toxiproxy
Simulate network conditions:
// toxiproxy-client.js
const Toxiproxy = require('toxiproxy-client');
const toxiproxy = new Toxiproxy('http://localhost:8474');
// Create proxy for database
async function setupDatabaseProxy() {
const proxy = await toxiproxy.create({
name: 'postgres_master',
listen: '0.0.0.0:5433',
upstream: 'postgres:5432'
});
return proxy;
}
// Add latency
async function addLatency(proxyName, latency = 1000) {
const proxy = await toxiproxy.get(proxyName);
await proxy.addToxic({
type: 'latency',
name: 'slow_network',
attributes: {
latency: latency, // milliseconds
jitter: 100
}
});
console.log(`Added ${latency}ms latency to ${proxyName}`);
}
// Simulate packet loss
async function addPacketLoss(proxyName, percent = 10) {
const proxy = await toxiproxy.get(proxyName);
await proxy.addToxic({
type: 'timeout',
name: 'packet_loss',
attributes: {
timeout: 0
},
toxicity: percent / 100 // 0.1 = 10%
});
console.log(`Added ${percent}% packet loss to ${proxyName}`);
}
// Simulate connection limit
async function limitConnections(proxyName, maxConnections = 10) {
const proxy = await toxiproxy.get(proxyName);
await proxy.addToxic({
type: 'limit_data',
name: 'connection_limit',
attributes: {
bytes: maxConnections * 1024
}
});
}
// Remove all toxics
async function cleanup(proxyName) {
const proxy = await toxiproxy.get(proxyName);
const toxics = await proxy.toxics();
for (const toxic of toxics) {
await toxic.remove();
}
console.log(`Removed all toxics from ${proxyName}`);
}
Docker Compose with Toxiproxy:
version: '3.8'
services:
toxiproxy:
image: ghcr.io/shopify/toxiproxy:latest
ports:
- "8474:8474" # Toxiproxy API
- "5433:5433" # Proxied postgres
networks:
- test-net
postgres:
image: postgres:14
environment:
POSTGRES_PASSWORD: password
networks:
- test-net
api-service:
build: .
environment:
# Point to proxy instead of direct postgres
DATABASE_URL: postgres://postgres:password@toxiproxy:5433/mydb
depends_on:
- toxiproxy
networks:
- test-net
networks:
test-net:
Writing Chaos Experiments
Example: API Service Resilience
// chaos-experiments/api-resilience.js
const { expect } = require('chai');
const axios = require('axios');
const Toxiproxy = require('toxiproxy-client');
describe('API Service Resilience Tests', () => {
let toxiproxy;
let databaseProxy;
before(async () => {
toxiproxy = new Toxiproxy('http://localhost:8474');
databaseProxy = await toxiproxy.get('postgres_master');
});
afterEach(async () => {
// Clean up toxics after each test
const toxics = await databaseProxy.toxics();
for (const toxic of toxics) {
await toxic.remove();
}
});
it('should handle database latency gracefully', async () => {
// Add 2 second database latency
await databaseProxy.addToxic({
type: 'latency',
attributes: { latency: 2000 }
});
// Make API request
const start = Date.now();
const response = await axios.get('http://localhost:3000/api/products');
const duration = Date.now() - start;
// Verify response is still successful
expect(response.status).to.equal(200);
// Verify it used cache or handled timeout
expect(duration).to.be.lessThan(3000); // Should timeout/cache before 3s
// Verify data is reasonable (cached or partial)
expect(response.data.products).to.be.an('array');
});
it('should retry on database connection failure', async () => {
let attempts = 0;
// Intercept database calls
const originalQuery = db.query;
db.query = async (...args) => {
attempts++;
if (attempts <= 2) {
throw new Error('Connection refused');
}
return originalQuery(...args);
};
// Should succeed after retries
const response = await axios.get('http://localhost:3000/api/users/123');
expect(response.status).to.equal(200);
expect(attempts).to.equal(3); // Failed twice, succeeded third time
// Cleanup
db.query = originalQuery;
});
it('should degrade gracefully when cache is unavailable', async () => {
// Kill Redis
await toxiproxy.get('redis_master').disable();
try {
const response = await axios.get('http://localhost:3000/api/products');
// Should still work, just slower
expect(response.status).to.equal(200);
expect(response.data.products).to.be.an('array');
// Check for degraded mode indicator
expect(response.headers['x-cache-status']).to.equal('miss');
} finally {
await toxiproxy.get('redis_master').enable();
}
});
it('should handle partial service outage', async () => {
// Kill 50% of API instances (in real scenario)
// For this test, we'll simulate by making service slow
await databaseProxy.addToxic({
type: 'latency',
attributes: { latency: 5000 },
toxicity: 0.5 // Affect 50% of requests
});
// Make multiple requests
const requests = Array(20).fill().map(() =>
axios.get('http://localhost:3000/api/health', { timeout: 6000 })
);
const results = await Promise.allSettled(requests);
const successful = results.filter(r => r.status === 'fulfilled');
// At least 50% should succeed
expect(successful.length).to.be.at.least(10);
// All successful requests should be healthy
successful.forEach(result => {
expect(result.value.data.status).to.equal('healthy');
});
});
});
Example: Payment Service Chaos
// chaos-experiments/payment-resilience.js
describe('Payment Service Chaos Tests', () => {
it('should handle payment gateway timeout', async () => {
// Mock payment gateway with timeout
nock('https://payment-gateway.com')
.post('/charge')
.delayConnection(31000) // 31 second delay (exceeds timeout)
.reply(200);
const paymentRequest = {
amount: 99.99,
cardNumber: '4242424242424242'
};
// Should timeout and return proper error
const response = await axios.post('/api/payments', paymentRequest, {
validateStatus: () => true
});
expect(response.status).to.equal(504); // Gateway timeout
expect(response.data.error).to.include('timeout');
// Verify payment wasn't partially processed
const order = await db.query('SELECT * FROM orders WHERE id = ?', [orderId]);
expect(order.status).to.equal('pending'); // Not charged
});
it('should queue payments when gateway is down', async () => {
// Mock gateway completely down
nock('https://payment-gateway.com')
.post('/charge')
.replyWithError('Service unavailable');
const response = await axios.post('/api/payments', paymentRequest);
// Should queue for later processing
expect(response.status).to.equal(202); // Accepted
expect(response.data.status).to.equal('queued');
// Verify it's in the queue
const queued = await redis.get(`payment:${response.data.id}`);
expect(queued).to.not.be.null;
});
it('should not lose money on network partition', async () => {
// Simulate network split during payment
let gatewayResponded = false;
nock('https://payment-gateway.com')
.post('/charge')
.reply(() => {
gatewayResponded = true;
// Simulate response lost in network
throw new Error('ECONNRESET');
});
try {
await axios.post('/api/payments', paymentRequest);
} catch (error) {
// Expected to fail
}
// Gateway processed it (debited card)
expect(gatewayResponded).to.be.true;
// Our system should mark as "uncertain" not "failed"
const payment = await db.query('SELECT * FROM payments WHERE order_id = ?', [orderId]);
expect(payment.status).to.equal('pending_verification');
// Should have reconciliation job to verify
const job = await queue.getJob(payment.id);
expect(job.name).to.equal('verify_payment_status');
});
});
GameDays and Fire Drills
Planning a GameDay
## Payment Service GameDay - February 15, 2026
**Objective**: Verify payment system handles various failure scenarios
**Time**: 10:00 AM - 2:00 PM PST
**Team**:
- Chaos Engineer: Alice (leader)
- SRE: Bob (observer)
- Backend Dev: Carol (responder)
- QE: Dave (metrics)
**Scenarios**:
1. **Database Failover** (10:00-10:30)
- Kill primary database
- Expect: Auto-failover to replica within 30s
- Monitor: Payment success rate, latency
2. **Payment Gateway Timeout** (11:00-11:30)
- Add 10s latency to gateway
- Expect: Requests timeout gracefully, retry mechanism works
- Monitor: Timeout rate, retry success rate
3. **Traffic Spike** (1:00-1:30)
- 10x normal traffic
- Expect: Auto-scaling triggers, no errors
- Monitor: Instance count, CPU, error rate
4. **Multi-Failure** (1:30-2:00)
- Combine: DB latency + cache failure
- Expect: Degraded but functional
- Monitor: Overall system health
**Success Criteria**:
- Payment success rate stays > 95%
- No data loss or corruption
- All failures detected by monitoring
- Team responds appropriately
**Rollback Plan**:
- Abort if payment success < 90%
- Kill switch: restore all systems immediately
- Escalation: Page on-call if needed
GameDay Execution Checklist
## Before GameDay
- [ ] Get approval from stakeholders
- [ ] Notify customers of potential issues (optional)
- [ ] Set up extra monitoring
- [ ] Prepare rollback procedures
- [ ] Dry run in staging
- [ ] Ensure team availability
## During GameDay
- [ ] Start with steady state verification
- [ ] Document everything (notes, screenshots, metrics)
- [ ] Have war room / video call active
- [ ] Monitor customer impact
- [ ] Be ready to abort
## After GameDay
- [ ] Verify system recovery
- [ ] Collect all data and logs
- [ ] Debrief meeting within 24 hours
- [ ] Document lessons learned
- [ ] Create action items for issues found
- [ ] Share results with broader team
Safety and Best Practices
1. Start Small
const chaosProgression = {
week1: {
environment: 'dev',
scope: 'single service',
impact: 'low',
duration: '1 minute'
},
week2: {
environment: 'staging',
scope: 'single service',
impact: 'medium',
duration: '5 minutes'
},
week4: {
environment: 'staging',
scope: 'multiple services',
impact: 'medium',
duration: '15 minutes'
},
week8: {
environment: 'production',
scope: 'canary deployment',
impact: 'low',
percentage: '1%',
duration: '5 minutes'
}
};
2. Implement Kill Switches
// Chaos abort conditions
const abortConditions = {
errorRate: async () => {
const rate = await getErrorRate();
return rate > 5; // Abort if > 5% errors
},
customerComplaints: async () => {
const complaints = await getRecentComplaints();
return complaints.length > 10; // Abort if >10 complaints
},
revenueImpact: async () => {
const revenue = await getCurrentRevenue();
const baseline = await getBaselineRevenue();
const drop = (baseline - revenue) / baseline * 100;
return drop > 10; // Abort if revenue down >10%
}
};
// Monitor and abort if needed
async function monitorChaos(experimentId) {
const interval = setInterval(async () => {
for (const [condition, check] of Object.entries(abortConditions)) {
if (await check()) {
console.error(`❌ Abort condition triggered: ${condition}`);
await abortExperiment(experimentId);
clearInterval(interval);
break;
}
}
}, 5000); // Check every 5 seconds
}
3. Communicate
// Notify before chaos
async function notifyTeam(experiment) {
await slack.send({
channel: '#chaos-engineering',
text: `🔬 Starting chaos experiment: ${experiment.name}`,
attachments: [{
color: 'warning',
fields: [
{ title: 'Environment', value: experiment.env },
{ title: 'Duration', value: `${experiment.duration}s` },
{ title: 'Target', value: experiment.target },
{ title: 'Dashboard', value: experiment.dashboardUrl }
]
}]
});
}
// Report results
async function reportResults(results) {
const status = results.success ? '✅ Success' : '❌ Failed';
await slack.send({
channel: '#chaos-engineering',
text: `${status} Chaos experiment complete: ${results.name}`,
attachments: [{
color: results.success ? 'good' : 'danger',
fields: [
{ title: 'Hypothesis', value: results.hypothesis },
{ title: 'Result', value: results.outcome },
{ title: 'Impact', value: results.impact },
{ title: 'Action Items', value: results.actionItems.join('\n') }
]
}]
});
}
Common Patterns
Circuit Breaker
class CircuitBreaker {
constructor(options = {}) {
this.failureThreshold = options.failureThreshold || 5;
this.successThreshold = options.successThreshold || 2;
this.timeout = options.timeout || 60000;
this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
this.failures = 0;
this.successes = 0;
this.nextAttempt = Date.now();
}
async execute(fn) {
if (this.state === 'OPEN') {
if (Date.now() < this.nextAttempt) {
throw new Error('Circuit breaker is OPEN');
}
this.state = 'HALF_OPEN';
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
onSuccess() {
this.failures = 0;
if (this.state === 'HALF_OPEN') {
this.successes++;
if (this.successes >= this.successThreshold) {
this.state = 'CLOSED';
this.successes = 0;
}
}
}
onFailure() {
this.failures++;
this.successes = 0;
if (this.failures >= this.failureThreshold) {
this.state = 'OPEN';
this.nextAttempt = Date.now() + this.timeout;
}
}
}
// Usage
const paymentGatewayBreaker = new CircuitBreaker({
failureThreshold: 5,
timeout: 60000
});
async function callPaymentGateway(data) {
try {
return await paymentGatewayBreaker.execute(async () => {
return await paymentGateway.charge(data);
});
} catch (error) {
// Circuit is open, use fallback
return await queueForLater(data);
}
}
Retry with Exponential Backoff
async function retryWithBackoff(fn, options = {}) {
const maxRetries = options.maxRetries || 3;
const baseDelay = options.baseDelay || 1000;
const maxDelay = options.maxDelay || 30000;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxRetries) {
throw error;
}
// Exponential backoff with jitter
const delay = Math.min(
baseDelay * Math.pow(2, attempt) + Math.random() * 1000,
maxDelay
);
console.log(`Retry ${attempt + 1}/${maxRetries} after ${delay}ms`);
await sleep(delay);
}
}
}
// Usage
const data = await retryWithBackoff(
() => database.query('SELECT * FROM users'),
{ maxRetries: 3, baseDelay: 1000 }
);
Measuring Success
Track chaos engineering effectiveness:
const chaosMetrics = {
// Discovery metrics
issuesFound: 12, // Bugs/weaknesses discovered
criticalIssues: 3,
issuesFixed: 10,
// Coverage metrics
experimentsRun: 45,
servicesTestedPercentage: 78,
failureScenariosCovered: 23,
// Impact metrics
mttrImprovement: -35, // 35% faster recovery
incidentsPrevented: 5, // Issues caught before production
confidenceScore: 8.5, // Team confidence (1-10)
// Organizational metrics
teamMembers: 8,
gamedays: 4,
documentedPlaybooks: 12
};
Next Steps
- Start with monitoring - Can't do chaos without observability
- Read "Chaos Engineering" book by Netflix engineers
- Run first experiment in dev - Kill a container, observe
- Expand gradually - More services, longer duration, higher impact
- Schedule a GameDay - Practice as a team
- Automate experiments - Run regularly in CI/CD
- Build a chaos champions program - Spread the practice
Related Articles
- "Monitoring & Observability for QE" - Essential prerequisite
- "Resilience Testing Strategies" - Broader resilience concepts
- "Performance Testing" - Chaos under load
- "Production Testing" - Testing in prod safely
- "Incident Response" - What to do when chaos finds real issues
Conclusion
Chaos Engineering isn't about breaking things—it's about building confidence. By deliberately introducing failures in controlled ways, you:
- Find weaknesses before they cause outages
- Build resilient systems that degrade gracefully
- Develop team skills in incident response
- Sleep better knowing your system can handle failure
Start small, be safe, and remember: The best time to find out your system can't handle a database failure is during a chaos experiment, not during Black Friday.
Remember: If you haven't tested it failing, you don't know it works!