Monitoring & Observability for QE
Learn to use monitoring and observability tools to catch issues before users do
Introduction
As a Quality Engineer, your job doesn't end when tests pass in CI/CD. Modern QE involves understanding how your application behaves in production through monitoring and observability. This guide will teach you how to catch issues before users report them.
Why Monitoring Matters for QE
Traditional testing happens in controlled environments. But production is unpredictable:
- Real user traffic patterns are different from test scenarios
- Edge cases emerge that weren't covered in tests
- Infrastructure issues appear only under real load
- Third-party services fail unexpectedly
Monitoring tells you that something happened. Observability helps you understand why it happened.
The Three Pillars of Observability
1. Metrics (What is happening?)
Metrics are numerical measurements over time:
// Example: Track API response times
const responseTime = Date.now() - startTime;
metrics.histogram('api.response_time', responseTime, {
  endpoint: '/api/products',
  method: 'GET',
  status: response.status
});

// Example: Count events
metrics.increment('checkout.completed', {
  payment_method: 'credit_card'
});

Key Metrics to Track:
- Request rate (requests per second)
- Error rate (errors per minute)
- Response time (p50, p95, p99)
- Resource utilization (CPU, memory, disk)
- Business metrics (checkouts, logins, searches)
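To make the percentile metrics above concrete, here is a minimal sketch of what p50/p95/p99 actually compute, using the nearest-rank method on raw samples. In practice your metrics library (Prometheus histograms, StatsD timers) does this for you; the function and sample values are illustrative only.

```javascript
// Minimal sketch: computing a latency percentile from raw samples
// (nearest-rank method). Illustrative only - metrics backends do this for you.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Index of the smallest value at or above the p-th percentile
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [12, 15, 14, 13, 210, 16, 15, 14, 13, 950]; // ms, hypothetical
console.log(percentile(latencies, 50)); // 14  - the typical request
console.log(percentile(latencies, 95)); // 950 - the tail a few users hit
```

Note how the p50 looks healthy while the p95 exposes the slow outliers; this is why the list above tracks several percentiles rather than one average.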
2. Logs (What happened in detail?)
Logs are timestamped event records:
{
  "timestamp": "2026-01-30T10:15:30Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "user-123",
  "order_id": "order-456",
  "error": "Gateway timeout",
  "trace_id": "abc123xyz"
}

Structured Logging Best Practices:
// Good: Structured logging
logger.error('Payment processing failed', {
  user_id: userId,
  order_id: orderId,
  amount: amount,
  gateway: 'stripe',
  error_code: error.code
});

// Bad: Unstructured logging
logger.error(`Payment failed for user ${userId}`);

3. Traces (How did it flow?)
Traces show request flow through distributed systems:
// Distributed tracing example with OpenTelemetry
const opentelemetry = require('@opentelemetry/api');
const { SpanStatusCode } = opentelemetry;

const tracer = opentelemetry.trace.getTracer('payment-service');

async function processPayment(orderId) {
  const span = tracer.startSpan('process_payment');
  span.setAttribute('order_id', orderId);
  try {
    // With auto-instrumentation, these downstream calls appear as child spans
    const payment = await validatePayment(orderId);
    await chargeCard(payment);
    await sendReceipt(payment);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}

A trace shows the complete journey:
Request: POST /api/checkout
├─ API Gateway (15ms)
├─ Auth Service (25ms)
├─ Order Service (150ms)
│  ├─ Inventory Check (50ms)
│  └─ Database Insert (100ms)
└─ Payment Service (500ms)
   ├─ Validate Card (200ms)
   └─ Charge Card (300ms) ← SLOW!

Essential Monitoring Tools
Prometheus (Metrics)
Prometheus is the industry standard for metrics collection:
# prometheus.yml configuration
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Exposing Metrics in Your App:
// Node.js with prom-client
const client = require('prom-client');

// Define metrics. Buckets are specified in ms to match the recorded values
// (prom-client's default buckets assume seconds).
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000]
});

// Instrument your code
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

PromQL Queries (QE Essentials):
# Error rate (percentage)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))

# Requests per second by endpoint
sum by (endpoint) (rate(http_requests_total[1m]))

# Alert if error rate > 1%
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01

ELK Stack (Logs)
Elasticsearch, Logstash, Kibana for centralized logging:
// Send logs to Elasticsearch
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');

const logger = winston.createLogger({
  transports: [
    new ElasticsearchTransport({
      level: 'info',
      clientOpts: { node: 'http://localhost:9200' },
      index: 'app-logs'
    })
  ]
});

logger.info('User logged in', {
  user_id: 'user-123',
  ip_address: req.ip,
  user_agent: req.headers['user-agent']
});

Kibana Query Examples:
# Find all errors in last hour
level:ERROR AND @timestamp:[now-1h TO now]
# Search for specific user's activity
user_id:"user-123"
# Payment failures
service:"payment-service" AND message:"failed"
# Slow queries (>1 second)
query_time:>1000

Grafana (Visualization)
Grafana creates dashboards from Prometheus metrics:
Example Dashboard JSON:
{
  "title": "API Health Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "targets": [{
        "expr": "rate(http_requests_total[5m])"
      }]
    },
    {
      "title": "Error Rate",
      "targets": [{
        "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
      }]
    },
    {
      "title": "Response Time (p95)",
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))"
      }]
    }
  ]
}

Datadog (All-in-One)
Commercial solution combining metrics, logs, and traces:
// Datadog APM
const tracer = require('dd-trace').init();

// Automatic instrumentation
const express = require('express');
const app = express();

// Custom metrics
const StatsD = require('node-dogstatsd').StatsD;
const dogstatsd = new StatsD();

app.post('/api/checkout', async (req, res) => {
  dogstatsd.increment('checkout.attempt');
  try {
    await processCheckout(req.body);
    dogstatsd.increment('checkout.success');
    res.json({ success: true });
  } catch (error) {
    // DogStatsD tags are passed as an array of "key:value" strings
    dogstatsd.increment('checkout.failure', 1, [`error_type:${error.type}`]);
    res.status(500).json({ error: error.message });
  }
});

Setting Up Alerts
Alerts notify you when things go wrong. Be strategic—too many alerts lead to alert fatigue.
Alert Best Practices
Good Alerts (Actionable):
# High error rate
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 10
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value }} requests/sec"

# API response time degradation
- alert: SlowAPIResponse
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le)) > 1000
  for: 10m
  annotations:
    summary: "API response time degraded"
    description: "95th percentile response time is {{ $value }}ms"

Bad Alerts (Noisy):
# Too sensitive - fires constantly
- alert: AnyError
  expr: http_requests_total{status=~"5.."} > 0

# Not actionable - what do you do?
- alert: CPUHigh
  expr: cpu_usage > 50

Alert Fatigue Prevention
1. Use the 4 Golden Signals:
   - Latency (response time)
   - Traffic (request rate)
   - Errors (error rate)
   - Saturation (resource usage)
2. Set Appropriate Thresholds:
   - Use percentiles (p95, p99), not averages
   - Account for time-of-day variations
   - Base thresholds on historical baselines
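A quick illustration of why percentiles beat averages for thresholds, using hypothetical latency numbers: a small slow tail barely moves the mean but dominates the p95.

```javascript
// Hypothetical latency sample (ms): 94 fast requests plus a slow tail.
const samples = [...Array(94).fill(20), ...Array(6).fill(2000)];

// Average: the tail is diluted by the fast majority
const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;

// p95 (nearest-rank): the value roughly 1 in 20 users actually experiences
const sorted = [...samples].sort((a, b) => a - b);
const p95 = sorted[Math.ceil(0.95 * sorted.length) - 1];

console.log(`mean=${mean}ms p95=${p95}ms`); // mean=138.8ms p95=2000ms
```

An alert thresholded on the mean (~139ms) would stay quiet while 6% of requests take 2 seconds; a p95 threshold fires immediately.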
3. Alert Routing:
# PagerDuty routing
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  routes:
    # Critical alerts - page on-call
    - match:
        severity: critical
      receiver: pagerduty
    # Warnings - Slack only
    - match:
        severity: warning
      receiver: slack
    # Info - email digest
    - match:
        severity: info
      receiver: email

QE Integration Patterns
1. Synthetic Monitoring
Run automated tests against production:
// Synthetic test with Playwright
const { chromium } = require('playwright');

async function syntheticTest() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const start = Date.now();
  try {
    // Navigate to login
    await page.goto('https://app.example.com/login');

    // Fill credentials
    await page.fill('[name="email"]', 'test@example.com');
    await page.fill('[name="password"]', process.env.TEST_PASSWORD);
    await page.click('button[type="submit"]');

    // Verify logged in
    await page.waitForSelector('[data-testid="dashboard"]');

    const duration = Date.now() - start;

    // Report metrics
    metrics.gauge('synthetic.login_flow.duration', duration);
    metrics.increment('synthetic.login_flow.success');
  } catch (error) {
    metrics.increment('synthetic.login_flow.failure');
    logger.error('Synthetic test failed', { error: error.message });
  } finally {
    await browser.close();
  }
}

// Run every 5 minutes
setInterval(syntheticTest, 5 * 60 * 1000);

2. Test Environment Monitoring
Monitor test environments to catch flakiness:
// Track test execution metrics (Mocha afterEach hook)
afterEach(function() {
  const testName = this.currentTest.title;
  const duration = this.currentTest.duration;
  const status = this.currentTest.state; // 'passed' or 'failed'

  metrics.histogram('test.duration', duration, {
    test_name: testName,
    status: status
  });

  if (status === 'failed') {
    logger.error('Test failed', {
      test_name: testName,
      error: this.currentTest.err?.message,
      environment: process.env.TEST_ENV
    });
  }
});

3. Production Verification Tests
Run read-only tests against production:
// Verify production data integrity
async function verifyProductionHealth() {
  try {
    // Check API health endpoint
    const health = await fetch('https://api.example.com/health');
    metrics.gauge('production.health_check.status', health.ok ? 1 : 0);

    // Verify database connectivity
    const dbCheck = await db.query('SELECT 1');
    metrics.gauge('production.db_check.status', dbCheck ? 1 : 0);

    // Check cache
    const cacheCheck = await redis.ping();
    metrics.gauge('production.cache_check.status', cacheCheck === 'PONG' ? 1 : 0);
  } catch (error) {
    logger.error('Production health check failed', { error: error.message });
    metrics.increment('production.health_check.failure');
  }
}

Debugging with Observability
Scenario: API is Slow
Step 1: Check Metrics
# What's the p95 response time for this endpoint?
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{endpoint="/api/products"}[5m])) by (le))

# Is it a specific endpoint? (p95 broken out per endpoint)
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le, endpoint))

# Did it just start? (inspect the raw samples over the last hour)
http_request_duration_ms_count{endpoint="/api/products"}[1h]

Step 2: Check Traces
- Find slow traces in Jaeger/Datadog
- Identify which service is slow
- Look at span durations
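Once you have the spans, finding the culprit is just a comparison of durations. A minimal sketch over a hypothetical flattened span export (real APM UIs like Jaeger or Datadog show this visually; the span names mirror the trace diagram earlier in this guide):

```javascript
// Hypothetical simplified span export from a single trace
const spans = [
  { name: 'api-gateway', durationMs: 15 },
  { name: 'auth-service', durationMs: 25 },
  { name: 'inventory-check', durationMs: 50 },
  { name: 'db-insert', durationMs: 100 },
  { name: 'validate-card', durationMs: 200 },
  { name: 'charge-card', durationMs: 300 },
];

// Slowest span and its share of the total request time
const slowest = spans.reduce((max, s) => (s.durationMs > max.durationMs ? s : max));
const total = spans.reduce((sum, s) => sum + s.durationMs, 0);

console.log(`Slowest span: ${slowest.name} (${slowest.durationMs}ms, ~43% of ${total}ms)`);
```

Here charge-card accounts for the largest share of the request, which is exactly the "← SLOW!" marker from the trace diagram.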
Step 3: Check Logs
service:"product-service" AND @timestamp:[now-1h TO now]
level:ERROR OR level:WARN

Step 4: Correlate
- Same trace_id across metrics, logs, traces
- Timeline: When did it start?
- What changed: Recent deployments?
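The correlation step above boils down to grouping events from different services by their shared trace_id. A sketch with hypothetical log events, assuming every service emits the same trace_id field (as in the structured log example earlier):

```javascript
// Hypothetical log events collected from several services
const events = [
  { trace_id: 'abc123', service: 'api-gateway', level: 'INFO' },
  { trace_id: 'abc123', service: 'payment-service', level: 'ERROR' },
  { trace_id: 'def456', service: 'api-gateway', level: 'INFO' },
];

// Everything that happened during one request, across all services
function eventsForTrace(events, traceId) {
  return events.filter((e) => e.trace_id === traceId);
}

eventsForTrace(events, 'abc123')
  .forEach((e) => console.log(`${e.service}: ${e.level}`));
```

This is what Kibana or an APM does when you filter on trace_id: the failing request's full journey appears, and unrelated traffic drops away.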
Scenario: Users Report Errors
Step 1: Search Logs
level:ERROR AND user_id:"affected-user" AND @timestamp:[now-1d TO now]

Step 2: Check Error Rate
# Are others affected?
rate(http_requests_total{status=~"5.."}[5m])

# Which endpoints?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

Step 3: Find the Trace
- Search for user's request in APM
- Follow the trace to find failure point
- Check error details in span
Best Practices
1. Correlation IDs
Add correlation IDs to link related events:
const { v4: uuidv4 } = require('uuid');

app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});

// Use in logs
logger.info('Processing request', {
  correlation_id: req.correlationId,
  endpoint: req.path
});

2. Consistent Labeling
Use consistent label names across metrics:
// Good: Consistent labels
metrics.increment('http.requests', {
  method: 'GET',
  endpoint: '/api/products',
  status_code: 200
});

// Bad: Inconsistent labels
metrics.increment('requests', { verb: 'GET', path: '/api/products', code: 200 });

3. Don't Log Sensitive Data
// Bad: Logging passwords
logger.info('User login attempt', { email, password });

// Good: Redact sensitive data
logger.info('User login attempt', {
  email,
  password: '[REDACTED]'
});

// Better: Don't log it at all
logger.info('User login attempt', { email });

4. Monitor Your Tests
Track test metrics in CI/CD:
# GitHub Actions example
- name: Run Tests
  run: |
    npm test -- --reporter=json > test-results.json

- name: Report Metrics
  run: |
    PASSED=$(jq '.stats.passes' test-results.json)
    FAILED=$(jq '.stats.failures' test-results.json)
    DURATION=$(jq '.stats.duration' test-results.json)
    curl -X POST https://metrics.example.com/api/metrics \
      -d "ci.tests.passed=${PASSED}" \
      -d "ci.tests.failed=${FAILED}" \
      -d "ci.tests.duration=${DURATION}"

Common Pitfalls
- Too Many Metrics: Creates noise. Focus on what matters.
- High Cardinality: Avoid labels with many unique values (user IDs, order IDs)
- No Context: Logs without correlation IDs are hard to trace
- Alert Fatigue: Too many non-actionable alerts get ignored
- No Baselines: Can't tell if metrics are abnormal without baselines
- Ignoring Trends: Don't just alert on thresholds, watch for trends
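The high-cardinality pitfall is easy to check before shipping a metric: count the distinct values a candidate label takes in a traffic sample. A sketch with hypothetical request data; the label names match the examples earlier in this guide.

```javascript
// Sketch: estimating label cardinality from a traffic sample.
// Each distinct label value becomes its own time series in Prometheus,
// so an unbounded label like user_id multiplies storage and query cost.
function cardinality(events, label) {
  return new Set(events.map((e) => e[label])).size;
}

const sample = [
  { endpoint: '/api/products', user_id: 'u1' },
  { endpoint: '/api/products', user_id: 'u2' },
  { endpoint: '/api/checkout', user_id: 'u3' },
];

console.log(cardinality(sample, 'endpoint')); // small, bounded: safe as a label
console.log(cardinality(sample, 'user_id'));  // grows with users: keep it in logs, not metrics
```

A bounded label set (endpoints, status codes, methods) stays cheap; anything that grows with your user base belongs in logs or traces, where per-event fields are expected.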
Tools Comparison
| Tool | Best For | Pros | Cons |
|---|---|---|---|
| Prometheus + Grafana | Metrics, self-hosted | Open source, powerful, flexible | Setup complexity |
| ELK Stack | Logs, search | Powerful search, scalable | Resource intensive |
| Datadog | All-in-one | Easy setup, great UX | Expensive |
| New Relic | APM, full-stack | Comprehensive, good support | Expensive |
| Splunk | Enterprise logs | Powerful, mature | Very expensive |
| Jaeger | Distributed tracing | Open source, CNCF | Logs/metrics separate |
Next Steps
- Set up basic monitoring on your test environment
- Add structured logging to your test framework
- Create a dashboard showing test run metrics
- Configure alerts for test failures
- Implement synthetic monitoring for critical user flows
- Practice debugging using metrics, logs, and traces together
Related Articles
- "Test Reporting & Dashboards" - Visualize test results
- "CI/CD Pipeline Testing" - Integrate monitoring into pipelines
- "API Testing Strategies" - Test what you monitor
- "Performance Testing Basics" - Monitor performance metrics
- "Production Testing Strategies" - Safe testing in production
Conclusion
Monitoring and observability are not just ops concerns—they're essential QE skills. By understanding how your application behaves in production, you can:
- Catch issues before users report them
- Debug production issues faster
- Validate that features work in the real world
- Improve test coverage based on production data
Start small: add basic metrics and logging. As you get comfortable, expand to tracing and advanced analysis. The investment in observability pays dividends in faster debugging and higher confidence in production.
Remember: The best bug fix is the one you deploy before users notice the bug!