Tools & Skills · Intermediate

Monitoring & Observability for QE

Learn to use monitoring and observability tools to catch issues before users do

Tags: monitoring, observability, logging, metrics, alerts

Introduction

As a Quality Engineer, your job doesn't end when tests pass in CI/CD. Modern QE involves understanding how your application behaves in production through monitoring and observability. This guide will teach you how to catch issues before users report them.

Why Monitoring Matters for QE

Traditional testing happens in controlled environments. But production is unpredictable:

  • Real user traffic patterns are different from test scenarios
  • Edge cases emerge that weren't covered in tests
  • Infrastructure issues appear only under real load
  • Third-party services fail unexpectedly

Monitoring tells you that something happened. Observability helps you understand why it happened.

The Three Pillars of Observability

1. Metrics (What is happening?)

Metrics are numerical measurements over time:

// Example: Track API response times
const responseTime = Date.now() - startTime;
metrics.histogram('api.response_time', responseTime, {
  endpoint: '/api/products',
  method: 'GET',
  status: response.status
});
 
// Example: Count events
metrics.increment('checkout.completed', {
  payment_method: 'credit_card'
});

Key Metrics to Track:

  • Request rate (requests per second)
  • Error rate (errors per minute)
  • Response time (p50, p95, p99)
  • Resource utilization (CPU, memory, disk)
  • Business metrics (checkouts, logins, searches)
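Percentiles matter because averages hide tail latency. A minimal sketch of why (the `percentile` helper below is illustrative, not from any particular metrics library):

```javascript
// Compute a percentile from raw samples (nearest-rank method).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Nine fast requests and one very slow one: the average looks
// acceptable, but p99 exposes the outlier.
const times = [100, 100, 100, 100, 100, 100, 100, 100, 100, 5000];
const avg = times.reduce((a, b) => a + b, 0) / times.length;

console.log(avg);                    // 590
console.log(percentile(times, 50));  // 100
console.log(percentile(times, 99));  // 5000
```

This is the reason the list above recommends tracking p50, p95, and p99 rather than a single mean.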

2. Logs (What happened in detail?)

Logs are timestamped event records:

{
  "timestamp": "2026-01-30T10:15:30Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "user-123",
  "order_id": "order-456",
  "error": "Gateway timeout",
  "trace_id": "abc123xyz"
}

Structured Logging Best Practices:

// Good: Structured logging
logger.error('Payment processing failed', {
  user_id: userId,
  order_id: orderId,
  amount: amount,
  gateway: 'stripe',
  error_code: error.code
});
 
// Bad: Unstructured logging
logger.error(`Payment failed for user ${userId}`);

3. Traces (How did it flow?)

Traces show request flow through distributed systems:

// Distributed tracing example with OpenTelemetry
const opentelemetry = require('@opentelemetry/api');
const { SpanStatusCode } = opentelemetry;

const tracer = opentelemetry.trace.getTracer('payment-service');

async function processPayment(orderId) {
  const span = tracer.startSpan('process_payment');
  span.setAttribute('order_id', orderId);

  try {
    // Instrumented downstream calls appear as child spans
    // when context propagation is set up
    const payment = await validatePayment(orderId);
    await chargeCard(payment);
    await sendReceipt(payment);
    
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}

A trace shows the complete journey:

Request: POST /api/checkout
├─ API Gateway (15ms)
├─ Auth Service (25ms)
├─ Order Service (150ms)
│  ├─ Inventory Check (50ms)
│  └─ Database Insert (100ms)
└─ Payment Service (500ms)
   ├─ Validate Card (200ms)
   └─ Charge Card (300ms) ← SLOW!
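Reading a trace like this is mostly a search for the slowest span. The same step can be done programmatically over a flat span list (the span data below is hypothetical, mirroring the trace above):

```javascript
// Leaf spans from the trace above as flat records (durations in ms).
const spans = [
  { name: 'API Gateway', duration: 15 },
  { name: 'Auth Service', duration: 25 },
  { name: 'Inventory Check', duration: 50 },
  { name: 'Database Insert', duration: 100 },
  { name: 'Validate Card', duration: 200 },
  { name: 'Charge Card', duration: 300 }
];

// The slowest leaf span is usually the first place to look.
const slowest = spans.reduce((max, s) => (s.duration > max.duration ? s : max));
console.log(`${slowest.name}: ${slowest.duration}ms`); // Charge Card: 300ms
```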

Essential Monitoring Tools

Prometheus (Metrics)

Prometheus is the industry standard for metrics collection:

# prometheus.yml configuration
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Exposing Metrics in Your App:

// Node.js with prom-client
const client = require('prom-client');
 
// Define metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [50, 100, 300, 500, 1000, 3000] // default buckets assume seconds
});
 
// Instrument your code
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration.labels(req.method, req.route?.path || req.path, res.statusCode).observe(duration);
  });
  next();
});
 
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

PromQL Queries (QE Essentials):

# Error rate (percentage)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
 
# 95th percentile response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m])))
 
# Requests per second by endpoint
sum by (endpoint) (rate(http_requests_total[1m]))
 
# Alert if error rate > 1%
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
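The first query above is just counter arithmetic: two per-second rates, divided. The same computation in plain JavaScript, using two hypothetical counter snapshots taken five minutes apart:

```javascript
// Cumulative counter snapshots, 5 minutes (300 seconds) apart.
const before = { total: 10000, errors: 40 };
const after  = { total: 13000, errors: 70 };

// rate() in PromQL is the per-second increase over the window.
const totalRate = (after.total - before.total) / 300;   // 10 req/s
const errorRate = (after.errors - before.errors) / 300; // 0.1 err/s

// Error percentage, as in the first PromQL query above.
const errorPct = (errorRate / totalRate) * 100;
console.log(errorPct); // ~1% error rate
```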

ELK Stack (Logs)

Elasticsearch, Logstash, Kibana for centralized logging:

// Send logs to Elasticsearch
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');
 
const logger = winston.createLogger({
  transports: [
    new ElasticsearchTransport({
      level: 'info',
      clientOpts: { node: 'http://localhost:9200' },
      index: 'app-logs'
    })
  ]
});
 
logger.info('User logged in', {
  user_id: 'user-123',
  ip_address: req.ip,
  user_agent: req.headers['user-agent']
});

Kibana Query Examples:

# Find all errors in last hour
level:ERROR AND @timestamp:[now-1h TO now]
 
# Search for specific user's activity
user_id:"user-123"
 
# Payment failures
service:"payment-service" AND message:"failed"
 
# Slow queries (>1 second)
query_time:>1000

Grafana (Visualization)

Grafana creates dashboards from Prometheus metrics:

Example Dashboard JSON:

{
  "title": "API Health Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "targets": [{
        "expr": "rate(http_requests_total[5m])"
      }]
    },
    {
      "title": "Error Rate",
      "targets": [{
        "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
      }]
    },
    {
      "title": "Response Time (p95)",
      "targets": [{
        "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m])))"
      }]
    }
  ]
}

Datadog (All-in-One)

Commercial solution combining metrics, logs, and traces:

// Datadog APM
const tracer = require('dd-trace').init();
 
// Automatic instrumentation
const express = require('express');
const app = express();
 
// Custom metrics
const StatsD = require('node-dogstatsd').StatsD;
const dogstatsd = new StatsD();
 
app.post('/api/checkout', async (req, res) => {
  dogstatsd.increment('checkout.attempt');
  
  try {
    await processCheckout(req.body);
    dogstatsd.increment('checkout.success');
    res.json({ success: true });
  } catch (error) {
    dogstatsd.increment('checkout.failure', 1, ['error_type:' + error.type]); // tags as "key:value" strings
    res.status(500).json({ error: error.message });
  }
});

Setting Up Alerts

Alerts notify you when things go wrong. Be strategic—too many alerts lead to alert fatigue.

Alert Best Practices

Good Alerts (Actionable):

# High error rate
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 10
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value }} requests/sec"
 
# API response time degradation
- alert: SlowAPIResponse
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) > 1000
  for: 10m
  annotations:
    summary: "API response time degraded"
    description: "95th percentile response time is {{ $value }}ms"

Bad Alerts (Noisy):

# Too sensitive - fires constantly
- alert: AnyError
  expr: http_requests_total{status=~"5.."} > 0
  
# Not actionable - what do you do?
- alert: CPUHigh
  expr: cpu_usage > 50

Alert Fatigue Prevention

  1. Use the 4 Golden Signals:

    • Latency (response time)
    • Traffic (request rate)
    • Errors (error rate)
    • Saturation (resource usage)
  2. Set Appropriate Thresholds:

    • Use percentiles (p95, p99) not averages
    • Account for time of day variations
    • Base on historical baselines
  3. Alert Routing:

# PagerDuty routing
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  routes:
    # Critical alerts - page on-call
    - match:
        severity: critical
      receiver: pagerduty
    
    # Warnings - Slack only
    - match:
        severity: warning
      receiver: slack
    
    # Info - email digest
    - match:
        severity: info
      receiver: email
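Point 2 above recommends basing thresholds on historical baselines rather than fixed numbers. A minimal sketch of that idea: flag the current error rate only when it is well outside the historical distribution (the data and the three-sigma cutoff are illustrative, not from any specific tool):

```javascript
// Hourly error rates (errors/min) observed over the previous day.
const history = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2];

const mean = history.reduce((a, b) => a + b, 0) / history.length;
const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
const stddev = Math.sqrt(variance);

// Alert only on a clear deviation from the baseline.
function isAnomalous(current) {
  return current > mean + 3 * stddev;
}

console.log(isAnomalous(4));  // false: within normal variation
console.log(isAnomalous(12)); // true: well above baseline
```

A fixed threshold of, say, 5 errors/min would page at quiet hours and miss gradual degradation; a baseline-relative rule adapts to what is normal for the service.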

QE Integration Patterns

1. Synthetic Monitoring

Run automated tests against production:

// Synthetic test with Playwright
const { chromium } = require('playwright');
 
async function syntheticTest() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  
  const start = Date.now();
  
  try {
    // Navigate to login
    await page.goto('https://app.example.com/login');
    
    // Fill credentials
    await page.fill('[name="email"]', 'test@example.com');
    await page.fill('[name="password"]', process.env.TEST_PASSWORD);
    await page.click('button[type="submit"]');
    
    // Verify logged in
    await page.waitForSelector('[data-testid="dashboard"]');
    
    const duration = Date.now() - start;
    
    // Report metrics
    metrics.gauge('synthetic.login_flow.duration', duration);
    metrics.increment('synthetic.login_flow.success');
    
  } catch (error) {
    metrics.increment('synthetic.login_flow.failure');
    logger.error('Synthetic test failed', { error: error.message });
  } finally {
    await browser.close();
  }
}
 
// Run every 5 minutes
setInterval(syntheticTest, 5 * 60 * 1000);

2. Test Environment Monitoring

Monitor test environments to catch flakiness:

// Track test execution metrics
afterEach(function() {
  const testName = this.currentTest.title;
  const duration = this.currentTest.duration;
  const status = this.currentTest.state; // passed, failed
  
  metrics.histogram('test.duration', duration, {
    test_name: testName,
    status: status
  });
  
  if (status === 'failed') {
    logger.error('Test failed', {
      test_name: testName,
      error: this.currentTest.err?.message,
      environment: process.env.TEST_ENV
    });
  }
});

3. Production Verification Tests

Run read-only tests against production:

// Verify production data integrity
async function verifyProductionHealth() {
  try {
    // Check API health endpoint
    const health = await fetch('https://api.example.com/health');
    metrics.gauge('production.health_check.status', health.ok ? 1 : 0);
    
    // Verify database connectivity
    const dbCheck = await db.query('SELECT 1');
    metrics.gauge('production.db_check.status', dbCheck ? 1 : 0);
    
    // Check cache
    const cacheCheck = await redis.ping();
    metrics.gauge('production.cache_check.status', cacheCheck === 'PONG' ? 1 : 0);
    
  } catch (error) {
    logger.error('Production health check failed', { error: error.message });
    metrics.increment('production.health_check.failure');
  }
}

Debugging with Observability

Scenario: API is Slow

Step 1: Check Metrics

# What's the average response time right now?
rate(http_request_duration_ms_sum{endpoint="/api/products"}[5m])
  / rate(http_request_duration_ms_count{endpoint="/api/products"}[5m])
 
# Is it a specific endpoint?
topk(5, sum by (endpoint) (rate(http_request_duration_ms_sum[5m]))
  / sum by (endpoint) (rate(http_request_duration_ms_count[5m])))
 
# Did it just start? Graph the average over the last hour
(rate(http_request_duration_ms_sum[5m]) / rate(http_request_duration_ms_count[5m]))[1h:]

Step 2: Check Traces

  • Find slow traces in Jaeger/Datadog
  • Identify which service is slow
  • Look at span durations

Step 3: Check Logs

service:"product-service" AND (level:ERROR OR level:WARN) AND @timestamp:[now-1h TO now]

Step 4: Correlate

  • Same trace_id across metrics, logs, traces
  • Timeline: When did it start?
  • What changed: Recent deployments?
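With a shared trace_id, the correlation step becomes mechanical: filter every log record to one trace and sort by timestamp to rebuild the timeline. A sketch over in-memory records (the log data is illustrative):

```javascript
// Log records pulled from several services (illustrative data).
const logs = [
  { timestamp: '2026-01-30T10:15:31Z', service: 'payment-service', trace_id: 'abc123xyz', message: 'Gateway timeout' },
  { timestamp: '2026-01-30T10:15:29Z', service: 'api-gateway',     trace_id: 'abc123xyz', message: 'Request received' },
  { timestamp: '2026-01-30T10:15:30Z', service: 'order-service',   trace_id: 'other999',  message: 'Order created' }
];

// Rebuild the timeline for one trace.
const timeline = logs
  .filter((l) => l.trace_id === 'abc123xyz')
  .sort((a, b) => a.timestamp.localeCompare(b.timestamp));

timeline.forEach((l) => console.log(`${l.timestamp} ${l.service}: ${l.message}`));
```

In practice the filter runs in Kibana or your APM tool rather than in code, but the operation is the same.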

Scenario: Users Report Errors

Step 1: Search Logs

level:ERROR AND user_id:"affected-user" AND @timestamp:[now-1d TO now]

Step 2: Check Error Rate

# Are others affected?
rate(http_requests_total{status=~"5.."}[5m])
 
# Which endpoints?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

Step 3: Find the Trace

  • Search for user's request in APM
  • Follow the trace to find failure point
  • Check error details in span

Best Practices

1. Correlation IDs

Add correlation IDs to link related events:

const { v4: uuidv4 } = require('uuid');
 
app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});
 
// Use in logs
logger.info('Processing request', {
  correlation_id: req.correlationId,
  endpoint: req.path
});

2. Consistent Labeling

Use consistent label names across metrics:

// Good: Consistent labels
metrics.increment('http.requests', {
  method: 'GET',
  endpoint: '/api/products',
  status_code: 200
});
 
// Bad: Inconsistent labels
metrics.increment('requests', { verb: 'GET', path: '/api/products', code: 200 });

3. Don't Log Sensitive Data

// Bad: Logging passwords
logger.info('User login attempt', { email, password });
 
// Good: Redact sensitive data
logger.info('User login attempt', { 
  email,
  password: '[REDACTED]'
});
 
// Better: Don't log it at all
logger.info('User login attempt', { email });
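Redacting field by field is easy to forget. A small helper that scrubs known sensitive keys before anything reaches the logger is safer; a sketch (the key list is an assumption, extend it for your own data):

```javascript
// Keys that must never reach the logs (illustrative list).
const SENSITIVE_KEYS = ['password', 'card_number', 'ssn', 'token'];

// Return a copy of the fields with sensitive values masked.
function redact(fields) {
  const safe = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.includes(key) ? '[REDACTED]' : value;
  }
  return safe;
}

console.log(redact({ email: 'a@example.com', password: 'hunter2' }));
// { email: 'a@example.com', password: '[REDACTED]' }
```

Wrapping your logger so every call passes through `redact` turns "don't log sensitive data" from a convention into a guarantee.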

4. Monitor Your Tests

Track test metrics in CI/CD:

# GitHub Actions example
- name: Run Tests
  run: |
    npm test -- --reporter=json > test-results.json
    
- name: Report Metrics
  run: |
    PASSED=$(jq '.stats.passes' test-results.json)
    FAILED=$(jq '.stats.failures' test-results.json)
    DURATION=$(jq '.stats.duration' test-results.json)
    
    curl -X POST https://metrics.example.com/api/metrics \
      -d "ci.tests.passed=${PASSED}" \
      -d "ci.tests.failed=${FAILED}" \
      -d "ci.tests.duration=${DURATION}"

Common Pitfalls

  1. Too Many Metrics: Creates noise. Focus on what matters.
  2. High Cardinality: Avoid labels with many unique values (user IDs, order IDs)
  3. No Context: Logs without correlation IDs are hard to trace
  4. Alert Fatigue: Too many non-actionable alerts get ignored
  5. No Baselines: Can't tell if metrics are abnormal without baselines
  6. Ignoring Trends: Don't just alert on thresholds, watch for trends
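For pitfall 2, the usual fix is to label with a small, bounded attribute instead of an unbounded identifier. A sketch (the field names are illustrative):

```javascript
// BAD: user_id as a metric label creates one time series per user,
// which can explode cardinality in Prometheus or Datadog:
// metrics.increment('checkout.completed', { user_id: user.id });

// GOOD: label with bounded attributes; put the user_id in a log
// line instead, where high cardinality is fine.
function checkoutLabels(user) {
  return { plan: user.plan, payment_method: user.paymentMethod };
}

console.log(checkoutLabels({ id: 'user-123', plan: 'pro', paymentMethod: 'credit_card' }));
// { plan: 'pro', payment_method: 'credit_card' }
```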

Tools Comparison

Tool                 | Best For             | Pros                            | Cons
Prometheus + Grafana | Metrics, self-hosted | Open source, powerful, flexible | Setup complexity
ELK Stack            | Logs, search         | Powerful search, scalable       | Resource intensive
Datadog              | All-in-one           | Easy setup, great UX            | Expensive
New Relic            | APM, full-stack      | Comprehensive, good support     | Expensive
Splunk               | Enterprise logs      | Powerful, mature                | Very expensive
Jaeger               | Distributed tracing  | Open source, CNCF               | Logs/metrics separate

Next Steps

  1. Set up basic monitoring on your test environment
  2. Add structured logging to your test framework
  3. Create a dashboard showing test run metrics
  4. Configure alerts for test failures
  5. Implement synthetic monitoring for critical user flows
  6. Practice debugging using metrics, logs, and traces together

Related Articles

  • "Test Reporting & Dashboards" - Visualize test results
  • "CI/CD Pipeline Testing" - Integrate monitoring into pipelines
  • "API Testing Strategies" - Test what you monitor
  • "Performance Testing Basics" - Monitor performance metrics
  • "Production Testing Strategies" - Safe testing in production

Conclusion

Monitoring and observability are not just ops concerns—they're essential QE skills. By understanding how your application behaves in production, you can:

  • Catch issues before users report them
  • Debug production issues faster
  • Validate that features work in the real world
  • Improve test coverage based on production data

Start small: add basic metrics and logging. As you get comfortable, expand to tracing and advanced analysis. The investment in observability pays dividends in faster debugging and higher confidence in production.

Remember: The best bug fix is the one you deploy before users notice the bug!
