
Chaos Engineering Basics for QE

Introduction to chaos engineering and resilience testing

Tags: chaos-engineering, resilience, testing, advanced

Introduction

Chaos Engineering is the discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions. Unlike traditional testing that proves the system works when things go right, chaos engineering proves it works when things go wrong.

What is Chaos Engineering?

Definition: The practice of intentionally injecting failures into your system to identify weaknesses before they cause outages.

Think of it like earthquake testing for buildings—you stress the system before real disasters strike.

Why Chaos Engineering?

Modern systems are complex and failure is inevitable:

  • Microservices have many failure points
  • Cloud infrastructure can fail unpredictably
  • Networks are unreliable
  • Third-party services go down
  • Traffic spikes happen unexpectedly

Goal: Make sure your system degrades gracefully, not catastrophically.
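Graceful degradation usually means a chain of fallbacks rather than a hard failure. A minimal sketch, with the data sources passed in as hypothetical functions:

```javascript
// Sketch: degrade through fallbacks instead of failing outright.
// getFromDb, getFromCache, and defaults are hypothetical dependencies.
async function getProducts(getFromDb, getFromCache, defaults) {
  try {
    return { source: 'db', products: await getFromDb() };
  } catch {
    try {
      // Database failed: serve possibly-stale cached data
      return { source: 'cache', products: await getFromCache() };
    } catch {
      // Last resort: a static default, flagged as degraded
      return { source: 'default', degraded: true, products: defaults };
    }
  }
}
```

The `degraded` flag matters: it lets the UI show a "limited functionality" banner instead of an error page.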

The Principles of Chaos Engineering

1. Start with Steady State

Define what "normal" looks like:

// Define steady state metrics
const steadyState = {
  // Business metrics
  ordersPerMinute: { min: 50, max: 200 },
  checkoutSuccessRate: { min: 98 },
  
  // System metrics
  apiResponseTime: { p95: 200, p99: 500 }, // milliseconds
  errorRate: { max: 1 }, // percent
  
  // Infrastructure
  cpuUsage: { max: 70 }, // percent
  memoryUsage: { max: 80 }
};
 
async function verifySteadyState() {
  const current = await getCurrentMetrics();
  
  return {
    isHealthy: 
      current.ordersPerMinute >= steadyState.ordersPerMinute.min &&
      current.checkoutSuccessRate >= steadyState.checkoutSuccessRate.min &&
      current.apiResponseTime.p95 <= steadyState.apiResponseTime.p95 &&
      current.errorRate <= steadyState.errorRate.max,
    metrics: current
  };
}

2. Hypothesize About Steady State

Form testable hypotheses:

## Hypothesis: System Handles Database Failure
 
**Given**: The primary database becomes unavailable
**When**: We continue to receive normal traffic
**Then**: 
- The system fails over to read replica within 5 seconds
- 95% of read requests still succeed
- Write requests are queued and processed when DB recovers
- Users see degraded functionality message, not errors
 
**Success Criteria**:
- API availability stays above 90%
- P95 latency increases by < 50%
- No data loss occurs
- System recovers automatically within 60 seconds
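Success criteria like these can be encoded as an executable check. A sketch, assuming a hypothetical shape for the observed metrics:

```javascript
// Sketch: evaluate the hypothesis success criteria against observed
// metrics. The metric names (availability, latencyP95, etc.) are
// hypothetical and would come from your monitoring stack.
function evaluateHypothesis(baseline, observed) {
  const latencyIncrease =
    ((observed.latencyP95 - baseline.latencyP95) / baseline.latencyP95) * 100;

  const checks = {
    availabilityAbove90: observed.availability > 90,
    latencyIncreaseUnder50: latencyIncrease < 50,
    noDataLoss: observed.recordsLost === 0,
    recoveredWithin60s: observed.recoverySeconds <= 60
  };

  return {
    passed: Object.values(checks).every(Boolean),
    checks
  };
}
```

Returning the per-criterion breakdown (not just pass/fail) makes the post-experiment debrief much easier.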

3. Introduce Real-World Events

Simulate realistic failures:

Types of Chaos:

  1. Resource Failures

    • Server crashes
    • Container restarts
    • Pod evictions
  2. Network Issues

    • Latency injection
    • Packet loss
    • Bandwidth limits
    • DNS failures
  3. Application Failures

    • Service crashes
    • Memory leaks
    • CPU spikes
    • Dependency failures
  4. Infrastructure Chaos

    • Zone outages
    • Cloud provider issues
    • Database failures
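One way to keep these categories actionable is a small experiment catalog mapping each category to named injections. The command strings below are illustrative placeholders, not real injectors:

```javascript
// Sketch: a catalog mapping chaos categories to named injections.
// The command strings are hypothetical placeholders.
const chaosCatalog = {
  resource: {
    'kill-pod': (t) => `kubectl delete pod -l app=${t}`
  },
  network: {
    'add-latency': (t) => `inject 500ms latency toward ${t}`,
    'packet-loss': (t) => `drop 10% of packets toward ${t}`
  },
  application: {
    'cpu-spike': (t) => `spin CPU to 80% on ${t}`
  },
  infrastructure: {
    'zone-outage': (t) => `stop routing to one zone for ${t}`
  }
};

function pickExperiment(category, name, target) {
  const group = chaosCatalog[category];
  if (!group || !group[name]) {
    throw new Error(`Unknown experiment: ${category}/${name}`);
  }
  return { category, name, command: group[name](target) };
}
```

A catalog like this also doubles as documentation of which failure scenarios you have (and have not yet) covered.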

4. Disprove the Hypothesis

Try to break your system:

// Chaos experiment workflow
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)); // helper used below

async function runChaosExperiment(experiment) {
  // 1. Verify steady state
  console.log('📊 Verifying steady state...');
  const baseline = await verifySteadyState();
  if (!baseline.isHealthy) {
    throw new Error('System not in steady state - abort experiment');
  }
  
  // 2. Inject chaos
  console.log('💥 Injecting chaos:', experiment.name);
  const chaos = await injectChaos(experiment.chaos);
  
  // 3. Monitor during chaos
  console.log('👀 Monitoring system behavior...');
  const duringChaos = await monitorDuring(experiment.duration);
  
  // 4. Remove chaos
  console.log('🔧 Removing chaos...');
  await removeChaos(chaos);
  
  // 5. Verify recovery
  console.log('✅ Verifying recovery...');
  await sleep(experiment.recoveryTime);
  const afterChaos = await verifySteadyState();
  
  // 6. Analyze results
  return analyzeResults({
    hypothesis: experiment.hypothesis,
    baseline,
    duringChaos,
    afterChaos,
    expectedImpact: experiment.expectedImpact
  });
}

5. Automate and Minimize Blast Radius

Start small, expand carefully:

const chaosConfig = {
  // Start with non-prod
  environments: ['dev', 'staging'], // Not 'prod' yet!
  
  // Limit scope
  targetPercentage: 10, // Only affect 10% of instances
  
  // Safety controls
  maxDuration: 300, // 5 minutes max
  abortOnMetricThreshold: {
    errorRate: 5, // Abort if errors > 5%
    latencyP95: 5000 // Abort if p95 > 5s
  },
  
  // Schedule
  runDuring: {
    days: ['Tuesday', 'Wednesday', 'Thursday'],
    hours: [10, 11, 14, 15], // Business hours with team present
    excludeHolidays: true
  }
};
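The schedule above is only useful if it is enforced before any injection. A minimal guard, assuming the `chaosConfig` shape shown:

```javascript
// Sketch: refuse to start an experiment outside the configured window.
// Assumes the runDuring shape from the chaosConfig example above.
function isRunAllowed(config, now = new Date()) {
  const day = now.toLocaleDateString('en-US', { weekday: 'long' });
  const hour = now.getHours();
  return config.runDuring.days.includes(day) &&
         config.runDuring.hours.includes(hour);
}
```

Wiring this check in front of the injection step prevents accidental off-hours runs even if a scheduler misfires.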

Chaos Engineering Tools

1. Chaos Monkey (Netflix)

Randomly terminates instances:

# chaos-monkey-config.yml
chaos:
  monkey:
    enabled: true
    schedule:
      enabled: true
      frequency: 30 # Every 30 minutes
    
    # Termination settings
    termination:
      probability: 0.1 # 10% chance
      enabled: true
      
    # Safety settings
    excludeDays:
      - Saturday
      - Sunday
    excludeHours:
      start: 0
      end: 9 # Don't run 12am-9am
      
    # Target groups
    grouping:
      type: app
      apps: ['api-service', 'worker-service']
      excludeApps: ['payment-service'] # Too critical

Running Chaos Monkey:

Note: Netflix's Chaos Monkey is a Go service that integrates with Spinnaker and MySQL; it is not distributed as an npm package. A typical setup looks roughly like:

# Build the binary (requires a Go toolchain)
go install github.com/Netflix/chaosmonkey/cmd/chaosmonkey@latest
 
# Configure via chaosmonkey.toml, then schedule today's terminations
chaosmonkey schedule
 
# Terminate an instance of a specific app on demand
chaosmonkey terminate <app> <account>

2. Chaos Toolkit

Framework for chaos experiments:

# experiment.yaml
version: 1.0.0
title: Database Failover Test
 
description: Verify system handles primary database failure gracefully
 
steady-state-hypothesis:
  title: System is healthy
  probes:
    - name: api-is-available
      type: probe
      tolerance: 200
      provider:
        type: http
        url: https://api.example.com/health
        
    - name: error-rate-is-low
      type: probe
      tolerance: 
        type: range
        range: [0, 1.0]
      provider:
        type: prometheus
        query: rate(http_errors_total[1m])
 
method:
  - type: action
    name: kill-primary-database
    provider:
      type: process
      path: kubectl
      arguments: 
        - delete
        - pod
        - -l
        - app=postgres,role=primary
        
  - type: probe
    name: check-failover-time
    provider:
      type: http
      url: https://api.example.com/health
      timeout: 10
        
  - type: action
    name: wait-for-recovery
    provider:
      type: process
      path: sleep
      arguments: ["30"]
 
rollbacks:
  - type: action
    name: restore-primary-database
    provider:
      type: process
      path: kubectl
      arguments:
        - apply
        - -f
        - postgres-primary.yaml

Run experiment:

# Install
pip install chaostoolkit
 
# Validate experiment
chaos validate experiment.yaml
 
# Run experiment
chaos run experiment.yaml
 
# Run with report
chaos run experiment.yaml --rollback-strategy=always --journal-path=journal.json

3. Gremlin

Commercial chaos engineering platform. Gremlin is driven through its REST API; the snippet below sketches attacks through a hypothetical Node client wrapper, not an official SDK:

// Gremlin attacks via a hypothetical wrapper around the REST API
const Gremlin = require('gremlin-client'); // illustrative client, not an official package
 
const gremlin = new Gremlin({
  teamId: process.env.GREMLIN_TEAM_ID,
  apiKey: process.env.GREMLIN_API_KEY
});
 
// CPU attack
async function cpuAttack() {
  const attack = await gremlin.attacks.create({
    target: {
      type: 'Random',
      exact: 1,
      tags: {
        service: 'api-service',
        env: 'staging'
      }
    },
    impact: {
      type: 'cpu',
      percent: 80, // Use 80% CPU
      length: 300 // 5 minutes
    }
  });
  
  console.log('Attack started:', attack.id);
  return attack;
}
 
// Latency attack
async function latencyAttack() {
  const attack = await gremlin.attacks.create({
    target: {
      type: 'Random',
      percent: 50, // Affect 50% of instances
      tags: { service: 'payment-service' }
    },
    impact: {
      type: 'latency',
      ms: 2000, // Add 2 second delay
      length: 600
    }
  });
  
  return attack;
}
 
// Network blackhole
async function networkBlackhole() {
  const attack = await gremlin.attacks.create({
    target: {
      type: 'Exact',
      exact: 1,
      tags: { service: 'database' }
    },
    impact: {
      type: 'blackhole',
      port: 5432, // PostgreSQL port
      length: 120
    }
  });
  
  return attack;
}

4. Litmus (Kubernetes)

Chaos engineering for Kubernetes:

# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-service-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: 'app=api-service'
    appkind: deployment
    
  engineState: 'active'
  
  chaosServiceAccount: litmus-admin
  
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
              
            - name: CHAOS_INTERVAL
              value: '10'
              
            - name: FORCE
              value: 'false'
              
        probe:
          - name: check-api-availability
            type: httpProbe
            httpProbe/inputs:
              url: http://api-service/health
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1

Deploy and run:

# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml
 
# Install experiments
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.8?file=charts/generic/experiments.yaml
 
# Run experiment
kubectl apply -f pod-delete-experiment.yaml
 
# Check status
kubectl describe chaosengine api-service-chaos
 
# View results
kubectl logs -l name=chaos-runner

5. Toxiproxy

Simulate network conditions. The snippet below sketches Toxiproxy's HTTP API through a hypothetical Node client; method names are illustrative:

// toxiproxy-client.js — illustrative client wrapper, not an official package
const Toxiproxy = require('toxiproxy-client');
const toxiproxy = new Toxiproxy('http://localhost:8474');
 
// Create proxy for database
async function setupDatabaseProxy() {
  const proxy = await toxiproxy.create({
    name: 'postgres_master',
    listen: '0.0.0.0:5433',
    upstream: 'postgres:5432'
  });
  
  return proxy;
}
 
// Add latency
async function addLatency(proxyName, latency = 1000) {
  const proxy = await toxiproxy.get(proxyName);
  
  await proxy.addToxic({
    type: 'latency',
    name: 'slow_network',
    attributes: {
      latency: latency, // milliseconds
      jitter: 100
    }
  });
  
  console.log(`Added ${latency}ms latency to ${proxyName}`);
}
 
// Approximate packet loss: the timeout toxic with timeout=0 stalls all
// data on the affected fraction of connections
async function addPacketLoss(proxyName, percent = 10) {
  const proxy = await toxiproxy.get(proxyName);
  
  await proxy.addToxic({
    type: 'timeout',
    name: 'packet_loss',
    attributes: {
      timeout: 0
    },
    toxicity: percent / 100 // 0.1 = 10%
  });
  
  console.log(`Added ${percent}% packet loss to ${proxyName}`);
}
 
// Limit data per connection (the limit_data toxic closes a connection
// after the given number of bytes has passed through it)
async function limitData(proxyName, maxKilobytes = 10) {
  const proxy = await toxiproxy.get(proxyName);
  
  await proxy.addToxic({
    type: 'limit_data',
    name: 'data_limit',
    attributes: {
      bytes: maxKilobytes * 1024
    }
  });
}
 
// Remove all toxics
async function cleanup(proxyName) {
  const proxy = await toxiproxy.get(proxyName);
  const toxics = await proxy.toxics();
  
  for (const toxic of toxics) {
    await toxic.remove();
  }
  
  console.log(`Removed all toxics from ${proxyName}`);
}

Docker Compose with Toxiproxy:

version: '3.8'
 
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:latest
    ports:
      - "8474:8474" # Toxiproxy API
      - "5433:5433" # Proxied postgres
    networks:
      - test-net
      
  postgres:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: password
    networks:
      - test-net
      
  api-service:
    build: .
    environment:
      # Point to proxy instead of direct postgres
      DATABASE_URL: postgres://postgres:password@toxiproxy:5433/mydb
    depends_on:
      - toxiproxy
    networks:
      - test-net
 
networks:
  test-net:

Writing Chaos Experiments

Example: API Service Resilience

// chaos-experiments/api-resilience.js
const { expect } = require('chai');
const axios = require('axios');
const Toxiproxy = require('toxiproxy-client');
 
describe('API Service Resilience Tests', () => {
  let toxiproxy;
  let databaseProxy;
  
  before(async () => {
    toxiproxy = new Toxiproxy('http://localhost:8474');
    databaseProxy = await toxiproxy.get('postgres_master');
  });
  
  afterEach(async () => {
    // Clean up toxics after each test
    const toxics = await databaseProxy.toxics();
    for (const toxic of toxics) {
      await toxic.remove();
    }
  });
  
  it('should handle database latency gracefully', async () => {
    // Add 2 second database latency
    await databaseProxy.addToxic({
      type: 'latency',
      attributes: { latency: 2000 }
    });
    
    // Make API request
    const start = Date.now();
    const response = await axios.get('http://localhost:3000/api/products');
    const duration = Date.now() - start;
    
    // Verify response is still successful
    expect(response.status).to.equal(200);
    
    // Verify it used cache or handled timeout
    expect(duration).to.be.lessThan(3000); // Should timeout/cache before 3s
    
    // Verify data is reasonable (cached or partial)
    expect(response.data.products).to.be.an('array');
  });
  
  it('should retry on database connection failure', async () => {
    let attempts = 0;
    
    // Stub the app's database module (this only works when the test runs
    // in-process with the app, e.g. via supertest, so the stub is hit)
    const originalQuery = db.query;
    db.query = async (...args) => {
      attempts++;
      if (attempts <= 2) {
        throw new Error('Connection refused');
      }
      return originalQuery(...args);
    };
    
    // Should succeed after retries
    const response = await axios.get('http://localhost:3000/api/users/123');
    
    expect(response.status).to.equal(200);
    expect(attempts).to.equal(3); // Failed twice, succeeded third time
    
    // Cleanup
    db.query = originalQuery;
  });
  
  it('should degrade gracefully when cache is unavailable', async () => {
    // Kill Redis
    await toxiproxy.get('redis_master').disable();
    
    try {
      const response = await axios.get('http://localhost:3000/api/products');
      
      // Should still work, just slower
      expect(response.status).to.equal(200);
      expect(response.data.products).to.be.an('array');
      
      // Check for degraded mode indicator
      expect(response.headers['x-cache-status']).to.equal('miss');
      
    } finally {
      await toxiproxy.get('redis_master').enable();
    }
  });
  
  it('should handle partial service outage', async () => {
    // Kill 50% of API instances (in real scenario)
    // For this test, we'll simulate by making service slow
    
    await databaseProxy.addToxic({
      type: 'latency',
      attributes: { latency: 5000 },
      toxicity: 0.5 // Affect 50% of requests
    });
    
    // Make multiple requests
    const requests = Array(20).fill().map(() => 
      axios.get('http://localhost:3000/api/health', { timeout: 6000 })
    );
    
    const results = await Promise.allSettled(requests);
    const successful = results.filter(r => r.status === 'fulfilled');
    
    // At least 50% should succeed
    expect(successful.length).to.be.at.least(10);
    
    // All successful requests should be healthy
    successful.forEach(result => {
      expect(result.value.data.status).to.equal('healthy');
    });
  });
});

Example: Payment Service Chaos

// chaos-experiments/payment-resilience.js
// Assumes nock, db, redis, queue, and fixtures (paymentRequest, orderId)
// are provided by a shared test harness
describe('Payment Service Chaos Tests', () => {
  
  it('should handle payment gateway timeout', async () => {
    // Mock payment gateway with timeout
    nock('https://payment-gateway.com')
      .post('/charge')
      .delayConnection(31000) // 31 second delay (exceeds timeout)
      .reply(200);
    
    const paymentRequest = {
      amount: 99.99,
      cardNumber: '4242424242424242'
    };
    
    // Should timeout and return proper error
    const response = await axios.post('/api/payments', paymentRequest, {
      validateStatus: () => true
    });
    
    expect(response.status).to.equal(504); // Gateway timeout
    expect(response.data.error).to.include('timeout');
    
    // Verify payment wasn't partially processed
    const order = await db.query('SELECT * FROM orders WHERE id = ?', [orderId]);
    expect(order.status).to.equal('pending'); // Not charged
  });
  
  it('should queue payments when gateway is down', async () => {
    // Mock gateway completely down
    nock('https://payment-gateway.com')
      .post('/charge')
      .replyWithError('Service unavailable');
    
    const response = await axios.post('/api/payments', paymentRequest);
    
    // Should queue for later processing
    expect(response.status).to.equal(202); // Accepted
    expect(response.data.status).to.equal('queued');
    
    // Verify it's in the queue
    const queued = await redis.get(`payment:${response.data.id}`);
    expect(queued).to.not.be.null;
  });
  
  it('should not lose money on network partition', async () => {
    // Simulate network split during payment
    let gatewayResponded = false;
    
    nock('https://payment-gateway.com')
      .post('/charge')
      .reply(() => {
        gatewayResponded = true;
        // Simulate response lost in network
        throw new Error('ECONNRESET');
      });
    
    try {
      await axios.post('/api/payments', paymentRequest);
    } catch (error) {
      // Expected to fail
    }
    
    // Gateway processed it (debited card)
    expect(gatewayResponded).to.be.true;
    
    // Our system should mark as "uncertain" not "failed"
    const payment = await db.query('SELECT * FROM payments WHERE order_id = ?', [orderId]);
    expect(payment.status).to.equal('pending_verification');
    
    // Should have reconciliation job to verify
    const job = await queue.getJob(payment.id);
    expect(job.name).to.equal('verify_payment_status');
  });
});

GameDays and Fire Drills

Planning a GameDay

## Payment Service GameDay - February 15, 2026
 
**Objective**: Verify payment system handles various failure scenarios
 
**Time**: 10:00 AM - 2:00 PM PST
 
**Team**:
- Chaos Engineer: Alice (leader)
- SRE: Bob (observer)
- Backend Dev: Carol (responder)
- QE: Dave (metrics)
 
**Scenarios**:
 
1. **Database Failover** (10:00-10:30)
   - Kill primary database
   - Expect: Auto-failover to replica within 30s
   - Monitor: Payment success rate, latency
 
2. **Payment Gateway Timeout** (11:00-11:30)
   - Add 10s latency to gateway
   - Expect: Requests timeout gracefully, retry mechanism works
   - Monitor: Timeout rate, retry success rate
 
3. **Traffic Spike** (1:00-1:30)
   - 10x normal traffic
   - Expect: Auto-scaling triggers, no errors
   - Monitor: Instance count, CPU, error rate
 
4. **Multi-Failure** (1:30-2:00)
   - Combine: DB latency + cache failure
   - Expect: Degraded but functional
   - Monitor: Overall system health
 
**Success Criteria**:
- Payment success rate stays > 95%
- No data loss or corruption
- All failures detected by monitoring
- Team responds appropriately
 
**Rollback Plan**:
- Abort if payment success < 90%
- Kill switch: restore all systems immediately
- Escalation: Page on-call if needed

GameDay Execution Checklist

## Before GameDay
- [ ] Get approval from stakeholders
- [ ] Notify customers of potential issues (optional)
- [ ] Set up extra monitoring
- [ ] Prepare rollback procedures
- [ ] Dry run in staging
- [ ] Ensure team availability
 
## During GameDay
- [ ] Start with steady state verification
- [ ] Document everything (notes, screenshots, metrics)
- [ ] Have war room / video call active
- [ ] Monitor customer impact
- [ ] Be ready to abort
 
## After GameDay
- [ ] Verify system recovery
- [ ] Collect all data and logs
- [ ] Debrief meeting within 24 hours
- [ ] Document lessons learned
- [ ] Create action items for issues found
- [ ] Share results with broader team
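Parts of the pre-GameDay checklist lend themselves to an automated preflight. A sketch where the individual checks are hypothetical async functions injected by the caller:

```javascript
// Sketch: automated preflight before a GameDay. Each check is a
// hypothetical async function returning true/false; a thrown error
// counts as a failed check.
async function preflight(checks) {
  const results = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => ({
      name,
      ok: await check().catch(() => false)
    }))
  );
  return { ready: results.every((r) => r.ok), results };
}
```

Running this an hour before the GameDay start gives the team an unambiguous go/no-go signal.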

Safety and Best Practices

1. Start Small

const chaosProgression = {
  week1: {
    environment: 'dev',
    scope: 'single service',
    impact: 'low',
    duration: '1 minute'
  },
  week2: {
    environment: 'staging',
    scope: 'single service',
    impact: 'medium',
    duration: '5 minutes'
  },
  week4: {
    environment: 'staging',
    scope: 'multiple services',
    impact: 'medium',
    duration: '15 minutes'
  },
  week8: {
    environment: 'production',
    scope: 'canary deployment',
    impact: 'low',
    percentage: '1%',
    duration: '5 minutes'
  }
};

2. Implement Kill Switches

// Chaos abort conditions
const abortConditions = {
  errorRate: async () => {
    const rate = await getErrorRate();
    return rate > 5; // Abort if > 5% errors
  },
  
  customerComplaints: async () => {
    const complaints = await getRecentComplaints();
    return complaints.length > 10; // Abort if >10 complaints
  },
  
  revenueImpact: async () => {
    const revenue = await getCurrentRevenue();
    const baseline = await getBaselineRevenue();
    const drop = (baseline - revenue) / baseline * 100;
    return drop > 10; // Abort if revenue down >10%
  }
};
 
// Monitor and abort if needed
async function monitorChaos(experimentId) {
  const interval = setInterval(async () => {
    for (const [condition, check] of Object.entries(abortConditions)) {
      if (await check()) {
        console.error(`❌ Abort condition triggered: ${condition}`);
        await abortExperiment(experimentId);
        clearInterval(interval);
        break;
      }
    }
  }, 5000); // Check every 5 seconds
}

3. Communicate

// Notify before chaos
async function notifyTeam(experiment) {
  await slack.send({
    channel: '#chaos-engineering',
    text: `🔬 Starting chaos experiment: ${experiment.name}`,
    attachments: [{
      color: 'warning',
      fields: [
        { title: 'Environment', value: experiment.env },
        { title: 'Duration', value: `${experiment.duration}s` },
        { title: 'Target', value: experiment.target },
        { title: 'Dashboard', value: experiment.dashboardUrl }
      ]
    }]
  });
}
 
// Report results
async function reportResults(results) {
  const status = results.success ? '✅ Success' : '❌ Failed';
  
  await slack.send({
    channel: '#chaos-engineering',
    text: `${status} Chaos experiment complete: ${results.name}`,
    attachments: [{
      color: results.success ? 'good' : 'danger',
      fields: [
        { title: 'Hypothesis', value: results.hypothesis },
        { title: 'Result', value: results.outcome },
        { title: 'Impact', value: results.impact },
        { title: 'Action Items', value: results.actionItems.join('\n') }
      ]
    }]
  });
}

Common Patterns

Circuit Breaker

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.successThreshold = options.successThreshold || 2;
    this.timeout = options.timeout || 60000;
    
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.failures = 0;
    this.successes = 0;
    this.nextAttempt = Date.now();
  }
  
  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  onSuccess() {
    this.failures = 0;
    
    if (this.state === 'HALF_OPEN') {
      this.successes++;
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
        this.successes = 0;
      }
    }
  }
  
  onFailure() {
    this.failures++;
    this.successes = 0;
    
    // A single failure while HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
 
// Usage
const paymentGatewayBreaker = new CircuitBreaker({
  failureThreshold: 5,
  timeout: 60000
});
 
async function callPaymentGateway(data) {
  try {
    return await paymentGatewayBreaker.execute(async () => {
      return await paymentGateway.charge(data);
    });
  } catch (error) {
    // Circuit is open, use fallback
    return await queueForLater(data);
  }
}

Retry with Exponential Backoff

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
 
async function retryWithBackoff(fn, options = {}) {
  const maxRetries = options.maxRetries || 3;
  const baseDelay = options.baseDelay || 1000;
  const maxDelay = options.maxDelay || 30000;
  
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) {
        throw error;
      }
      
      // Exponential backoff with jitter
      const delay = Math.min(
        baseDelay * Math.pow(2, attempt) + Math.random() * 1000,
        maxDelay
      );
      
      console.log(`Retry ${attempt + 1}/${maxRetries} after ${delay}ms`);
      await sleep(delay);
    }
  }
}
 
// Usage
const data = await retryWithBackoff(
  () => database.query('SELECT * FROM users'),
  { maxRetries: 3, baseDelay: 1000 }
);

Measuring Success

Track chaos engineering effectiveness:

const chaosMetrics = {
  // Discovery metrics
  issuesFound: 12, // Bugs/weaknesses discovered
  criticalIssues: 3,
  issuesFixed: 10,
  
  // Coverage metrics
  experimentsRun: 45,
  servicesTestedPercentage: 78,
  failureScenariosCovered: 23,
  
  // Impact metrics
  mttrImprovement: -35, // 35% faster recovery
  incidentsPrevented: 5, // Issues caught before production
  confidenceScore: 8.5, // Team confidence (1-10)
  
  // Organizational metrics
  teamMembers: 8,
  gamedays: 4,
  documentedPlaybooks: 12
};

Next Steps

  1. Start with monitoring - Can't do chaos without observability
  2. Read "Chaos Engineering" book by Netflix engineers
  3. Run first experiment in dev - Kill a container, observe
  4. Expand gradually - More services, longer duration, higher impact
  5. Schedule a GameDay - Practice as a team
  6. Automate experiments - Run regularly in CI/CD
  7. Build a chaos champions program - Spread the practice

Related Reading

  • "Monitoring & Observability for QE" - Essential prerequisite
  • "Resilience Testing Strategies" - Broader resilience concepts
  • "Performance Testing" - Chaos under load
  • "Production Testing" - Testing in prod safely
  • "Incident Response" - What to do when chaos finds real issues

Conclusion

Chaos Engineering isn't about breaking things—it's about building confidence. By deliberately introducing failures in controlled ways, you:

  • Find weaknesses before they cause outages
  • Build resilient systems that degrade gracefully
  • Develop team skills in incident response
  • Sleep better knowing your system can handle failure

Start small, be safe, and remember: The best time to find out your system can't handle a database failure is during a chaos experiment, not during Black Friday.

Remember: If you haven't tested it failing, you don't know it works!
