Monitoring & Observability for QE
Learn to use monitoring and observability tools to catch issues before users do
Introduction
As a Quality Engineer, your job doesn't end when tests pass in CI/CD. Modern QE involves understanding how your application behaves in production through monitoring and observability. This guide will teach you how to catch issues before users report them.
Why Monitoring Matters for QE
Traditional testing happens in controlled environments. But production is unpredictable:
- Real user traffic patterns are different from test scenarios
- Edge cases emerge that weren't covered in tests
- Infrastructure issues appear only under real load
- Third-party services fail unexpectedly
Monitoring tells you that something happened. Observability helps you understand why it happened.
The Three Pillars of Observability
1. Metrics (What is happening?)
Metrics are numerical measurements over time:
// Example: Track API response times
const responseTime = Date.now() - startTime;
metrics.histogram('api.response_time', responseTime, {
  endpoint: '/api/products',
  method: 'GET',
  status: response.status
});

// Example: Count events
metrics.increment('checkout.completed', {
  payment_method: 'credit_card'
});

Key Metrics to Track:
- Request rate (requests per second)
- Error rate (errors per minute)
- Response time (p50, p95, p99)
- Resource utilization (CPU, memory, disk)
- Business metrics (checkouts, logins, searches)
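To make the percentile metrics above concrete, here is a minimal sketch of what p50/p95/p99 actually compute, using the nearest-rank method on raw samples. In practice your metrics library (Prometheus histograms, StatsD timers) does this for you; the function and sample values are illustrative only.

```javascript
// Minimal sketch: computing a latency percentile from raw samples
// (nearest-rank method). Illustrative only - metrics backends do this for you.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Index of the smallest value at or above the p-th percentile
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [12, 15, 14, 13, 210, 16, 15, 14, 13, 950]; // ms, hypothetical
console.log(percentile(latencies, 50)); // 14  - the typical request
console.log(percentile(latencies, 95)); // 950 - the tail a few users hit
```

Note how the p50 looks healthy while the p95 exposes the slow outliers; this is why the list above tracks several percentiles rather than one average.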
2. Logs (What happened in detail?)
Logs are timestamped event records:
{
  "timestamp": "2026-01-30T10:15:30Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "user-123",
  "order_id": "order-456",
  "error": "Gateway timeout",
  "trace_id": "abc123xyz"
}

Structured Logging Best Practices:
// Good: Structured logging
logger.error('Payment processing failed', {
  user_id: userId,
  order_id: orderId,
  amount: amount,
  gateway: 'stripe',
  error_code: error.code
});

// Bad: Unstructured logging
logger.error(`Payment failed for user ${userId}`);

3. Traces (How did it flow?)
Traces show request flow through distributed systems:
// Distributed tracing example with OpenTelemetry
const opentelemetry = require('@opentelemetry/api');
const { SpanStatusCode } = opentelemetry;

const tracer = opentelemetry.trace.getTracer('payment-service');

async function processPayment(orderId) {
  const span = tracer.startSpan('process_payment');
  span.setAttribute('order_id', orderId);
  try {
    // With auto-instrumentation, these downstream calls appear as child spans
    const payment = await validatePayment(orderId);
    await chargeCard(payment);
    await sendReceipt(payment);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}

A trace shows the complete journey:
Request: POST /api/checkout
├─ API Gateway (15ms)
├─ Auth Service (25ms)
├─ Order Service (150ms)
│  ├─ Inventory Check (50ms)
│  └─ Database Insert (100ms)
└─ Payment Service (500ms)
   ├─ Validate Card (200ms)
   └─ Charge Card (300ms) ← SLOW!

Essential Monitoring Tools
Prometheus (Metrics)
Prometheus is the industry standard for metrics collection:
# prometheus.yml configuration
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Exposing Metrics in Your App:
// Node.js with prom-client
const client = require('prom-client');

// Define metrics. Buckets are specified in ms to match the recorded values
// (prom-client's default buckets assume seconds).
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000]
});

// Instrument your code
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

PromQL Queries (QE Essentials):
# Error rate (percentage)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))

# Requests per second by endpoint
sum by (endpoint) (rate(http_requests_total[1m]))

# Alert if error rate > 1%
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01

ELK Stack (Logs)
Elasticsearch, Logstash, Kibana for centralized logging:
// Send logs to Elasticsearch
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');

const logger = winston.createLogger({
  transports: [
    new ElasticsearchTransport({
      level: 'info',
      clientOpts: { node: 'http://localhost:9200' },
      index: 'app-logs'
    })
  ]
});

logger.info('User logged in', {
  user_id: 'user-123',
  ip_address: req.ip,
  user_agent: req.headers['user-agent']
});

Kibana Query Examples:
# Find all errors in last hour
level:ERROR AND @timestamp:[now-1h TO now]
# Search for specific user's activity
user_id:"user-123"
# Payment failures
service:"payment-service" AND message:"failed"
# Slow queries (>1 second)
query_time:>1000

Grafana (Visualization)
Grafana creates dashboards from Prometheus metrics:
Example Dashboard JSON:
{
  "title": "API Health Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "targets": [{
        "expr": "rate(http_requests_total[5m])"
      }]
    },
    {
      "title": "Error Rate",
      "targets": [{
        "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
      }]
    },
    {
      "title": "Response Time (p95)",
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))"
      }]
    }
  ]
}

Datadog (All-in-One)
Commercial solution combining metrics, logs, and traces:
// Datadog APM
const tracer = require('dd-trace').init();

// Automatic instrumentation
const express = require('express');
const app = express();

// Custom metrics
const StatsD = require('node-dogstatsd').StatsD;
const dogstatsd = new StatsD();

app.post('/api/checkout', async (req, res) => {
  dogstatsd.increment('checkout.attempt');
  try {
    await processCheckout(req.body);
    dogstatsd.increment('checkout.success');
    res.json({ success: true });
  } catch (error) {
    // DogStatsD tags are passed as an array of "key:value" strings
    dogstatsd.increment('checkout.failure', 1, [`error_type:${error.type}`]);
    res.status(500).json({ error: error.message });
  }
});

Setting Up Alerts
Alerts notify you when things go wrong. Be strategic—too many alerts lead to alert fatigue.
Alert Best Practices
Good Alerts (Actionable):
# High error rate
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 10
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value }} requests/sec"

# API response time degradation
- alert: SlowAPIResponse
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le)) > 1000
  for: 10m
  annotations:
    summary: "API response time degraded"
    description: "95th percentile response time is {{ $value }}ms"

Bad Alerts (Noisy):
# Too sensitive - fires constantly
- alert: AnyError
  expr: http_requests_total{status=~"5.."} > 0

# Not actionable - what do you do?
- alert: CPUHigh
  expr: cpu_usage > 50

Alert Fatigue Prevention
1. Use the 4 Golden Signals:
   - Latency (response time)
   - Traffic (request rate)
   - Errors (error rate)
   - Saturation (resource usage)
2. Set Appropriate Thresholds:
   - Use percentiles (p95, p99), not averages
   - Account for time-of-day variations
   - Base thresholds on historical baselines
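A quick illustration of why percentiles beat averages for thresholds, using hypothetical latency numbers: a small slow tail barely moves the mean but dominates the p95.

```javascript
// Hypothetical latency sample (ms): 94 fast requests plus a slow tail.
const samples = [...Array(94).fill(20), ...Array(6).fill(2000)];

// Average: the tail is diluted by the fast majority
const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;

// p95 (nearest-rank): the value roughly 1 in 20 users actually experiences
const sorted = [...samples].sort((a, b) => a - b);
const p95 = sorted[Math.ceil(0.95 * sorted.length) - 1];

console.log(`mean=${mean}ms p95=${p95}ms`); // mean=138.8ms p95=2000ms
```

An alert thresholded on the mean (~139ms) would stay quiet while 6% of requests take 2 seconds; a p95 threshold fires immediately.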
3. Alert Routing:
# PagerDuty routing
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  routes:
    # Critical alerts - page on-call
    - match:
        severity: critical
      receiver: pagerduty
    # Warnings - Slack only
    - match:
        severity: warning
      receiver: slack
    # Info - email digest
    - match:
        severity: info
      receiver: email

QE Integration Patterns
1. Synthetic Monitoring
Run automated tests against production:
// Synthetic test with Playwright
const { chromium } = require('playwright');

async function syntheticTest() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const start = Date.now();
  try {
    // Navigate to login
    await page.goto('https://app.example.com/login');

    // Fill credentials
    await page.fill('[name="email"]', 'test@example.com');
    await page.fill('[name="password"]', process.env.TEST_PASSWORD);
    await page.click('button[type="submit"]');

    // Verify logged in
    await page.waitForSelector('[data-testid="dashboard"]');

    const duration = Date.now() - start;

    // Report metrics
    metrics.gauge('synthetic.login_flow.duration', duration);
    metrics.increment('synthetic.login_flow.success');
  } catch (error) {
    metrics.increment('synthetic.login_flow.failure');
    logger.error('Synthetic test failed', { error: error.message });
  } finally {
    await browser.close();
  }
}

// Run every 5 minutes
setInterval(syntheticTest, 5 * 60 * 1000);

2. Test Environment Monitoring
Monitor test environments to catch flakiness:
// Track test execution metrics (Mocha afterEach hook)
afterEach(function() {
  const testName = this.currentTest.title;
  const duration = this.currentTest.duration;
  const status = this.currentTest.state; // 'passed' or 'failed'

  metrics.histogram('test.duration', duration, {
    test_name: testName,
    status: status
  });

  if (status === 'failed') {
    logger.error('Test failed', {
      test_name: testName,
      error: this.currentTest.err?.message,
      environment: process.env.TEST_ENV
    });
  }
});

3. Production Verification Tests
Run read-only tests against production:
// Verify production data integrity
async function verifyProductionHealth() {
  try {
    // Check API health endpoint
    const health = await fetch('https://api.example.com/health');
    metrics.gauge('production.health_check.status', health.ok ? 1 : 0);

    // Verify database connectivity
    const dbCheck = await db.query('SELECT 1');
    metrics.gauge('production.db_check.status', dbCheck ? 1 : 0);

    // Check cache
    const cacheCheck = await redis.ping();
    metrics.gauge('production.cache_check.status', cacheCheck === 'PONG' ? 1 : 0);
  } catch (error) {
    logger.error('Production health check failed', { error: error.message });
    metrics.increment('production.health_check.failure');
  }
}

Debugging with Observability
Scenario: API is Slow
Step 1: Check Metrics
# What's the p95 response time for this endpoint?
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{endpoint="/api/products"}[5m])) by (le))

# Is it a specific endpoint? (p95 broken out per endpoint)
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le, endpoint))

# Did it just start? (inspect the raw samples over the last hour)
http_request_duration_ms_count{endpoint="/api/products"}[1h]

Step 2: Check Traces
- Find slow traces in Jaeger/Datadog
- Identify which service is slow
- Look at span durations
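Once you have the spans, finding the culprit is just a comparison of durations. A minimal sketch over a hypothetical flattened span export (real APM UIs like Jaeger or Datadog show this visually; the span names mirror the trace diagram earlier in this guide):

```javascript
// Hypothetical simplified span export from a single trace
const spans = [
  { name: 'api-gateway', durationMs: 15 },
  { name: 'auth-service', durationMs: 25 },
  { name: 'inventory-check', durationMs: 50 },
  { name: 'db-insert', durationMs: 100 },
  { name: 'validate-card', durationMs: 200 },
  { name: 'charge-card', durationMs: 300 },
];

// Slowest span and its share of the total request time
const slowest = spans.reduce((max, s) => (s.durationMs > max.durationMs ? s : max));
const total = spans.reduce((sum, s) => sum + s.durationMs, 0);

console.log(`Slowest span: ${slowest.name} (${slowest.durationMs}ms, ~43% of ${total}ms)`);
```

Here charge-card accounts for the largest share of the request, which is exactly the "← SLOW!" marker from the trace diagram.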
Step 3: Check Logs
service:"product-service" AND @timestamp:[now-1h TO now]
level:ERROR OR level:WARN

Step 4: Correlate
- Same trace_id across metrics, logs, traces
- Timeline: When did it start?
- What changed: Recent deployments?
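The correlation step above boils down to grouping events from different services by their shared trace_id. A sketch with hypothetical log events, assuming every service emits the same trace_id field (as in the structured log example earlier):

```javascript
// Hypothetical log events collected from several services
const events = [
  { trace_id: 'abc123', service: 'api-gateway', level: 'INFO' },
  { trace_id: 'abc123', service: 'payment-service', level: 'ERROR' },
  { trace_id: 'def456', service: 'api-gateway', level: 'INFO' },
];

// Everything that happened during one request, across all services
function eventsForTrace(events, traceId) {
  return events.filter((e) => e.trace_id === traceId);
}

eventsForTrace(events, 'abc123')
  .forEach((e) => console.log(`${e.service}: ${e.level}`));
```

This is what Kibana or an APM does when you filter on trace_id: the failing request's full journey appears, and unrelated traffic drops away.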
Scenario: Users Report Errors
Step 1: Search Logs
level:ERROR AND user_id:"affected-user" AND @timestamp:[now-1d TO now]

Step 2: Check Error Rate
# Are others affected?
rate(http_requests_total{status=~"5.."}[5m])

# Which endpoints?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

Step 3: Find the Trace
- Search for user's request in APM
- Follow the trace to find failure point
- Check error details in span
Best Practices
1. Correlation IDs
Add correlation IDs to link related events:
const { v4: uuidv4 } = require('uuid');

app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});

// Use in logs
logger.info('Processing request', {
  correlation_id: req.correlationId,
  endpoint: req.path
});

2. Consistent Labeling
Use consistent label names across metrics:
// Good: Consistent labels
metrics.increment('http.requests', {
  method: 'GET',
  endpoint: '/api/products',
  status_code: 200
});

// Bad: Inconsistent labels
metrics.increment('requests', { verb: 'GET', path: '/api/products', code: 200 });

3. Don't Log Sensitive Data
// Bad: Logging passwords
logger.info('User login attempt', { email, password });

// Good: Redact sensitive data
logger.info('User login attempt', {
  email,
  password: '[REDACTED]'
});

// Better: Don't log it at all
logger.info('User login attempt', { email });

4. Monitor Your Tests
Track test metrics in CI/CD:
# GitHub Actions example
- name: Run Tests
  run: |
    npm test -- --reporter=json > test-results.json

- name: Report Metrics
  run: |
    PASSED=$(jq '.stats.passes' test-results.json)
    FAILED=$(jq '.stats.failures' test-results.json)
    DURATION=$(jq '.stats.duration' test-results.json)
    curl -X POST https://metrics.example.com/api/metrics \
      -d "ci.tests.passed=${PASSED}" \
      -d "ci.tests.failed=${FAILED}" \
      -d "ci.tests.duration=${DURATION}"

Common Pitfalls
- Too Many Metrics: Creates noise. Focus on what matters.
- High Cardinality: Avoid labels with many unique values (user IDs, order IDs)
- No Context: Logs without correlation IDs are hard to trace
- Alert Fatigue: Too many non-actionable alerts get ignored
- No Baselines: Can't tell if metrics are abnormal without baselines
- Ignoring Trends: Don't just alert on thresholds, watch for trends
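The high-cardinality pitfall is easy to check before shipping a metric: count the distinct values a candidate label takes in a traffic sample. A sketch with hypothetical request data; the label names match the examples earlier in this guide.

```javascript
// Sketch: estimating label cardinality from a traffic sample.
// Each distinct label value becomes its own time series in Prometheus,
// so an unbounded label like user_id multiplies storage and query cost.
function cardinality(events, label) {
  return new Set(events.map((e) => e[label])).size;
}

const sample = [
  { endpoint: '/api/products', user_id: 'u1' },
  { endpoint: '/api/products', user_id: 'u2' },
  { endpoint: '/api/checkout', user_id: 'u3' },
];

console.log(cardinality(sample, 'endpoint')); // small, bounded: safe as a label
console.log(cardinality(sample, 'user_id'));  // grows with users: keep it in logs, not metrics
```

A bounded label set (endpoints, status codes, methods) stays cheap; anything that grows with your user base belongs in logs or traces, where per-event fields are expected.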
Tools Comparison
| Tool | Best For | Pros | Cons |
|---|---|---|---|
| Prometheus + Grafana | Metrics, self-hosted | Open source, powerful, flexible | Setup complexity |
| ELK Stack | Logs, search | Powerful search, scalable | Resource intensive |
| Datadog | All-in-one | Easy setup, great UX | Expensive |
| New Relic | APM, full-stack | Comprehensive, good support | Expensive |
| Splunk | Enterprise logs | Powerful, mature | Very expensive |
| Jaeger | Distributed tracing | Open source, CNCF | Logs/metrics separate |
Next Steps
- Set up basic monitoring on your test environment
- Add structured logging to your test framework
- Create a dashboard showing test run metrics
- Configure alerts for test failures
- Implement synthetic monitoring for critical user flows
- Practice debugging using metrics, logs, and traces together
Related Articles
- "Test Reporting & Dashboards" - Visualize test results
- "CI/CD Pipeline Testing" - Integrate monitoring into pipelines
- "API Testing Strategies" - Test what you monitor
- "Performance Testing Basics" - Monitor performance metrics
- "Production Testing Strategies" - Safe testing in production
Conclusion
Monitoring and observability are not just ops concerns—they're essential QE skills. By understanding how your application behaves in production, you can:
- Catch issues before users report them
- Debug production issues faster
- Validate that features work in the real world
- Improve test coverage based on production data
Start small: add basic metrics and logging. As you get comfortable, expand to tracing and advanced analysis. The investment in observability pays dividends in faster debugging and higher confidence in production.
Remember: The best bug fix is the one you deploy before users notice the bug!