Monitoring and Observability - Production Systems Guide for Modern Applications
Master monitoring and observability with metrics collection, distributed tracing, log aggregation, alerting strategies, and SLO/SLI frameworks for production systems at scale.
Introduction
Monitoring and observability determine mean time to detection (MTTD) and resolution (MTTR) for production incidents, with observability-mature organizations detecting issues 10x faster (3 minutes vs 30 minutes average) and resolving 5x quicker (15 minutes vs 75 minutes) according to DORA metrics, directly impacting availability SLAs and customer satisfaction. Traditional monitoring—collecting predefined metrics like CPU, memory, request count—proves insufficient for modern distributed systems where failures emerge from complex interactions between microservices, with 73% of production incidents requiring correlation across multiple data sources (metrics, traces, logs) that traditional dashboards fail to provide. Observability extends monitoring by enabling teams to ask arbitrary questions about system behavior without predicting failure modes in advance, supporting investigation of novel issues through high-cardinality data, distributed tracing revealing request paths across 10-50 services, and structured logging enabling rich querying of application context.
Organizations implementing comprehensive observability report 60-80% reduction in MTTR through faster root cause identification, 40-50% decrease in alert fatigue by replacing symptom-based alerts with SLO violations, and 20-30% reduction in infrastructure costs by identifying underutilized resources and optimization opportunities. Companies like Netflix achieve 99.99% availability serving 230+ million subscribers through observability-driven incident response correlating metrics, traces, and logs in real-time, Uber processes 100+ million distributed traces daily debugging latency issues across 2,000+ microservices, and Shopify monitors 10+ million metrics per second detecting anomalies before customer impact through machine learning-based alerting. This guide explores production-proven observability practices including metrics collection and visualization (Prometheus, Grafana), distributed tracing (Jaeger, Tempo, OpenTelemetry), log aggregation (Elasticsearch, Loki), alerting strategies (SLO-based alerting, alert routing), service level objectives (SLIs, SLOs, error budgets), and observability-driven development integrating monitoring into application code.
Metrics Collection and Visualization
Prometheus for Metrics Collection
Prometheus provides a time-series database optimized for operational metrics, with pull-based scraping and a powerful query language (PromQL).
Instrumenting Applications:
// Go application instrumentation with Prometheus client
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: Monotonically increasing value (requests, errors)
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"}, // Labels for filtering
	)
	// Histogram: Distribution of values (latency, request size)
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)
	// Gauge: Value that can increase or decrease (active connections, queue depth)
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// responseWriter wraps http.ResponseWriter to capture the response status code
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, r.URL.Path))
		defer timer.ObserveDuration()
		// Track active connections
		activeConnections.Inc()
		defer activeConnections.Dec()
		// Capture response status
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(wrapped, r)
		// Record request with the numeric status code as a label ("200", not "OK")
		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			strconv.Itoa(wrapped.statusCode),
		).Inc()
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler()) // Expose metrics endpoint
	http.Handle("/", metricsMiddleware(http.HandlerFunc(handler)))
	http.ListenAndServe(":8080", nil)
}
Prometheus Scrape Configuration:
# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alerting rules

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          environment: 'production'
          region: 'us-east-1'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
PromQL Queries:
# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Resource utilization by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Alert on high error rate
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.05  # > 5% error rate
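The P95 query above relies on PromQL's `histogram_quantile()`, which estimates quantiles from cumulative bucket counts. A simplified Python sketch of that estimation (linear interpolation inside the bucket where the target rank falls; the bucket data is illustrative, and Prometheus handles the lowest and `+Inf` buckets with extra rules omitted here):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs; q in (0, 1)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le=0.1 -> 600 requests, le=0.25 -> 900, le=0.5 -> 980, le=1.0 -> 1000
buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000)]
p95 = histogram_quantile(0.95, buckets)  # rank 950 lands in the (0.25, 0.5] bucket
```

This is why bucket boundaries matter: the estimate is only as precise as the bucket containing the quantile.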
Grafana Dashboards
Grafana provides a visualization layer for metrics from Prometheus, Elasticsearch, and other data sources.
Dashboard Design Principles:
Dashboard Structure:
- Overview row (top-level metrics visible without scrolling)
  - Request rate, error rate, P95 latency, availability
- Service health row (per-service breakdown)
  - Error rates by endpoint, latency distribution, throughput
- Resource utilization row (infrastructure metrics)
  - CPU, memory, disk I/O, network
- Business metrics row (application-specific KPIs)
  - Orders/second, revenue, active users

Variables for filtering:
- $environment (production, staging, development)
- $service (api, worker, frontend)
- $region (us-east-1, eu-west-1)

Time ranges:
- Default: Last 6 hours
- Quick ranges: 5m, 15m, 1h, 6h, 24h, 7d
- Refresh: Every 30s for production dashboards
Effective Panel Configuration:
{
  "title": "Request Rate and Error Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m]))",
      "legendFormat": "Request Rate (req/s)",
      "refId": "A"
    },
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m])) * 100",
      "legendFormat": "Error Rate (%)",
      "refId": "B"
    }
  ],
  "yaxes": [
    {
      "label": "req/s",
      "format": "short"
    },
    {
      "label": "%",
      "format": "percent",
      "max": 100
    }
  ],
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "query": {
          "params": ["B", "5m", "now"]
        },
        "type": "query"
      }
    ],
    "frequency": "60s",
    "handler": 1,
    "name": "High Error Rate",
    "notifications": [
      {"uid": "slack-alerts"}
    ]
  }
}
Distributed Tracing
OpenTelemetry for Instrumentation
OpenTelemetry provides vendor-neutral observability framework for traces, metrics, and logs.
Automatic Instrumentation (Node.js Example):
// Auto-instrumentation for Express, HTTP, database clients
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');

// Initialize tracer provider with service identity
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-service',
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  })
});

// Configure exporter (Jaeger backend)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation()
  ]
});

// Application code requires no changes - tracing is automatic
const express = require('express');
const app = express();

app.get('/users/:id', async (req, res) => {
  // HTTP/Express instrumentation creates a span for this request
  const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  // Pg instrumentation creates a child span for the database query
  res.json(user);
});
Manual Span Creation for Custom Operations:
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('api-service');

app.post('/orders', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('process_order', {
    attributes: {
      'order.id': req.body.id,
      'order.amount': req.body.total,
      'user.id': req.user.id
    }
  });
  // Make `span` the parent context for the child spans created below
  const ctx = trace.setSpan(context.active(), span);
  try {
    // Validate order
    const validationSpan = tracer.startSpan('validate_order', {}, ctx);
    await validateOrder(req.body);
    validationSpan.end();
    // Check inventory
    const inventorySpan = tracer.startSpan('check_inventory', {}, ctx);
    const available = await checkInventory(req.body.items);
    inventorySpan.setAttribute('inventory.available', available);
    inventorySpan.end();
    if (!available) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'Out of stock' });
      return res.status(409).json({ error: 'Out of stock' });
    }
    // Create order
    const order = await createOrder(req.body);
    span.setAttribute('order.created', true);
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(order);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    res.status(500).json({ error: 'Failed to process order' });
  } finally {
    span.end();
  }
});
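Spans from different services join into one trace because trace context travels with each request, typically via the W3C `traceparent` HTTP header, which OpenTelemetry's HTTP instrumentation injects and extracts automatically. A minimal Python sketch of that header format (`make_traceparent` and `parse_traceparent` are hypothetical helpers for illustration):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Extract trace context from an incoming request header."""
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

hdr = make_traceparent(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
# hdr == "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

The downstream service parses this header and uses the incoming span as the parent of its own spans, which is what stitches the cross-service tree together.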
Analyzing Traces with Jaeger
Distributed traces reveal request flow across microservices, identifying latency bottlenecks.
Trace Structure:
Trace: Order Creation (Total Duration: 1.2s)
├─ Span: HTTP POST /orders (1.2s)
│ ├─ Span: validate_order (50ms)
│ ├─ Span: check_inventory (800ms) # ← BOTTLENECK!
│ │ ├─ Span: HTTP GET inventory-service/check (780ms)
│ │ │ ├─ Span: postgres.query SELECT (750ms) # ← Slow query!
│ │ │ └─ Span: serialize_response (30ms)
│ │ └─ Span: parse_response (20ms)
│ ├─ Span: create_order (300ms)
│ │ ├─ Span: postgres.query INSERT (250ms)
│ │ └─ Span: publish_event (50ms)
│ └─ Span: send_confirmation (50ms)
Findings:
- 800ms of 1.2s (67%) spent on inventory check
- Inventory service database query taking 750ms (slow index?)
- Opportunity: Cache inventory status, add database index
Query Patterns in Jaeger UI:
# Find slow traces (P95+ latency)
Service: api-service
Operation: POST /orders
Min Duration: 1s
Tags: http.status_code=200

# Find error traces
Service: api-service
Tags: error=true OR http.status_code>=500

# Find traces touching a specific service
Service: inventory-service
Operation: *

# Correlate with other signals
Tags: user.id=12345   # Find all requests for a user
Tags: order.id=67890  # Track a specific order through the system
Log Aggregation and Analysis
Structured Logging
Structured logs in JSON format enable rich querying and correlation with traces.
Structured Logging Example:
const winston = require('winston');
const { trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Log with trace context
function log(level, message, metadata = {}) {
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();
  logger.log({
    level,
    message,
    ...metadata,
    // Include trace context for correlation
    trace_id: spanContext?.traceId,
    span_id: spanContext?.spanId,
    service: 'api-service',
    environment: process.env.NODE_ENV
  });
}

// Usage
app.post('/orders', async (req, res) => {
  const startTime = Date.now();
  log('info', 'Order creation started', {
    order_id: req.body.id,
    user_id: req.user.id,
    items_count: req.body.items.length
  });
  try {
    const order = await createOrder(req.body);
    log('info', 'Order created successfully', {
      order_id: order.id,
      total: order.total,
      processing_time_ms: Date.now() - startTime
    });
    res.json(order);
  } catch (error) {
    log('error', 'Order creation failed', {
      order_id: req.body.id,
      error_message: error.message,
      error_stack: error.stack
    });
    res.status(500).json({ error: 'Failed to create order' });
  }
});
// Output:
// {
// "level": "info",
// "message": "Order created successfully",
// "order_id": "order_123",
// "total": 99.99,
// "processing_time_ms": 234,
// "trace_id": "f47ac10b58cc4372a567",
// "span_id": "8a2d3b4e5f6a7890",
// "service": "api-service",
// "environment": "production",
// "timestamp": "2026-02-15T12:00:00.000Z"
// }
Grafana Loki for Log Aggregation
Loki provides lightweight log aggregation designed for Kubernetes and cloud-native environments.
Loki Configuration:
# promtail.yaml (log shipper)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            message: message
            trace_id: trace_id
      # Extract labels
      - labels:
          level:
          trace_id:
LogQL Queries:
# Filter logs by service and level
{app="api-service", level="error"}

# Find logs for a specific trace
{app="api-service"} | json | trace_id="f47ac10b58cc4372a567"

# Count errors per service over the last hour
sum(count_over_time({level="error"}[1h])) by (app)

# Find slow requests
{app="api-service"} | json | processing_time_ms > 1000

# Pattern matching
{app="api-service"} |= "Order creation failed"

# Metrics from logs (log-based alerting)
rate({app="api-service"} |= "Order creation failed" [5m]) > 0.1
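The trace-correlation query above is conceptually simple: parse each JSON log line and keep the entries matching a trace ID. A Python sketch of that pipeline (log lines are illustrative; `logs_for_trace` is a hypothetical helper):

```python
import json

# Illustrative structured log lines, as emitted by the logger above
raw_logs = [
    '{"level":"info","message":"Order creation started","trace_id":"abc123"}',
    '{"level":"error","message":"Order creation failed","trace_id":"abc123"}',
    '{"level":"info","message":"Healthcheck","trace_id":"def456"}',
]

def logs_for_trace(lines, trace_id):
    """Parse JSON log lines and keep entries for one trace."""
    parsed = (json.loads(line) for line in lines)
    return [entry for entry in parsed if entry.get("trace_id") == trace_id]

matched = logs_for_trace(raw_logs, "abc123")
# matched holds both entries for trace abc123, including the error
```

Loki does the same work at query time across all shipped logs, which is why embedding `trace_id` in every log line pays off during incidents.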
Service Level Objectives (SLOs)
Defining SLIs and SLOs
Service Level Indicators (SLIs) measure service health; Service Level Objectives (SLOs) define acceptable thresholds for those measurements.
Common SLIs:
Availability SLI:
Definition: Percentage of successful requests
Measurement: (successful_requests / total_requests) * 100
Target: 99.9% (3 nines)
Latency SLI:
Definition: Percentage of requests completing within threshold
Measurement: (requests_under_threshold / total_requests) * 100
Threshold: 95% of requests < 200ms
Target: 99.0%
Error Rate SLI:
Definition: Percentage of requests without errors
Measurement: ((total_requests - error_requests) / total_requests) * 100
Target: 99.5% (error rate < 0.5%)
Throughput SLI:
Definition: Requests processed per second
Measurement: sum(rate(http_requests_total[5m]))
Target: > 1000 req/s
Data Freshness SLI:
Definition: Percentage of data updated within SLA
Measurement: (records_updated_on_time / total_records) * 100
Target: 99.0% of data < 5 minutes old
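The SLI definitions above all reduce to ratios of good events to total events. A minimal sketch with illustrative request counts (`availability_sli` and `latency_sli` are hypothetical helpers, not part of any library):

```python
def availability_sli(successful_requests, total_requests):
    """Percentage of requests that succeeded."""
    return successful_requests / total_requests * 100

def latency_sli(requests_under_threshold, total_requests):
    """Percentage of requests completing under the latency threshold."""
    return requests_under_threshold / total_requests * 100

# Illustrative counts over a measurement window
availability = availability_sli(999_500, 1_000_000)  # 99.95%, meets a 99.9% target
latency = latency_sli(991_200, 1_000_000)            # 99.12%, meets a 99.0% target
```

Framing every SLI as good/total keeps the error-budget math uniform regardless of which user-facing behavior the SLI measures.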
SLO Configuration:
# SLO definition (illustrative schema, similar to tools like Sloth or OpenSLO)
apiVersion: v1
kind: SLO
metadata:
  name: api-availability
spec:
  service: api-service
  description: API requests should succeed 99.9% of the time
  target: 99.9
  window: 30d  # 30-day rolling window
  indicator:
    # Good events: successful requests (status < 500)
    good:
      promql: sum(rate(http_requests_total{status!~"5.."}[5m]))
    # Total events: all requests
    total:
      promql: sum(rate(http_requests_total[5m]))
  alerting:
    # Burn rate: how fast the error budget is consumed
    # Fast burn (2% of budget in 1 hour) → page
    - severity: critical
      burn_rate: 14.4
      window: 1h
    # Slow burn (5% of budget in 6 hours) → ticket
    - severity: warning
      burn_rate: 6
      window: 6h
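The burn-rate thresholds of 14.4 and 6 are not arbitrary: each one is chosen so that a sustained burn over the alert window consumes a fixed slice of the 30-day budget. A sketch of that arithmetic (`budget_fraction_consumed` is a hypothetical helper):

```python
def budget_fraction_consumed(burn_rate, window_hours, slo_window_days=30):
    """Fraction of the SLO window's error budget consumed if the given
    burn rate is sustained for the given number of hours."""
    return burn_rate * window_hours / (slo_window_days * 24)

fast_burn = budget_fraction_consumed(14.4, 1)  # 14.4 * 1 / 720 = 0.02 -> 2% in 1 hour
slow_burn = budget_fraction_consumed(6, 6)     # 6 * 6 / 720 = 0.05 -> 5% in 6 hours
```

Inverting the formula gives the threshold for any policy: to page when X% of the budget would vanish in W hours, set burn_rate = (X/100) * 720 / W.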
Error Budgets and Burn Rate
Error budgets quantify the acceptable failure rate; burn rate measures how quickly that budget is being consumed.
Error Budget Calculation:
# 99.9% availability SLO = 0.1% error budget
slo_target = 99.9
error_budget_percent = 100 - slo_target  # 0.1%

# Monthly error budget (30 days)
total_minutes = 30 * 24 * 60  # 43,200 minutes
allowed_downtime_minutes = total_minutes * (error_budget_percent / 100)
# 43,200 * 0.001 = 43.2 minutes of allowed downtime per month

# Current error rate
current_errors = 50      # errors in the last hour
total_requests = 10000   # requests in the last hour
error_rate = (current_errors / total_requests) * 100  # 0.5%

# Burn rate (how many times faster than budget)
burn_rate = error_rate / error_budget_percent
# 0.5 / 0.1 = 5x burn rate

# Time to exhaust the budget at this rate
hours_to_exhaust = (30 * 24) / burn_rate
# 720 / 5 = 144 hours (6 days)

# Alerting thresholds:
# Critical: burn rate > 14.4 for 1 hour → 2% of the monthly budget gone → page on-call
# Warning:  burn rate > 6 for 6 hours → 5% of the monthly budget gone → create ticket
Alerting Strategies
Alert Design Principles
Effective alerts are actionable, specific, and correlated with user impact.
Alert Tiers:
Critical (Page):
Trigger: Service down, SLO violation imminent
Example: Error rate > 5% for 5 minutes (users affected NOW)
Response: Immediate action required, wake engineer
Frequency: Rare (< 1 per week ideal)
Warning (Ticket):
Trigger: Degraded performance, trend toward SLO violation
Example: P95 latency > 500ms for 30 minutes
Response: Investigate during business hours
Frequency: Occasional (< 1 per day ideal)
Info (Dashboard):
Trigger: Interesting but not urgent
Example: Disk usage > 70%
Response: Monitor, plan capacity
Frequency: Common, no notification
Alert Template:
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m  # Must be true for 5 minutes (reduces flapping)
labels:
  severity: critical
  service: api-service
annotations:
  summary: "High error rate detected"
  description: |
    Error rate is {{ $value | humanizePercentage }} (threshold: 5%)
    Service: {{ $labels.service }}
    Environment: {{ $labels.environment }}
    Impact: Users experiencing 500 errors
    Likely causes:
    - Database connectivity issues
    - Dependency service failure
    - Code deployment regression
    Runbook: https://wiki.company.com/runbooks/high-error-rate
    Metrics: https://grafana.company.com/d/api-service
    Traces: https://jaeger.company.com/?service=api-service&lookback=1h
  dashboard: https://grafana.company.com/d/api-overview
  runbook: https://wiki.company.com/runbooks/high-error-rate
Alert Routing and On-Call
Route alerts based on severity, time, and service ownership.
AlertManager Configuration:
# alertmanager.yml
route:
  receiver: default
  group_by: [alertname, service]
  group_wait: 30s       # Wait for more alerts before sending
  group_interval: 5m    # How often to send grouped notifications
  repeat_interval: 4h   # Repeat notification if still firing
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true  # Also send to Slack
    # Critical alerts → Slack #incidents
    - match:
        severity: critical
      receiver: slack-incidents
    # Warning alerts → Slack #alerts
    - match:
        severity: warning
      receiver: slack-alerts
    # Service-specific routing
    - match:
        service: payment-service
      receiver: payments-team
      routes:
        - match:
            severity: critical
          receiver: payments-oncall

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <key>
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
  - name: slack-incidents
    slack_configs:
      - api_url: <webhook>
        channel: '#incidents'
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"
        actions:
          - type: button
            text: "View Dashboard"
            url: "{{ .CommonAnnotations.dashboard }}"
          - type: button
            text: "Runbook"
            url: "{{ .CommonAnnotations.runbook }}"
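Beyond routing, AlertManager can suppress redundant notifications with inhibition rules. A minimal sketch, assuming the same severity labels as the routes above (field names follow recent AlertManager versions): while a critical alert fires for a service, warning alerts for that same service are silenced, so on-call sees one page instead of a cascade.

```yaml
inhibit_rules:
  # Suppress warnings for a service while a critical alert for it is firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [service]  # Only inhibit when both alerts share the same service label
```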
Conclusion
Monitoring and observability separate high-performing engineering teams from reactive firefighters—observability-mature organizations detect incidents 10x faster (3 minutes vs 30 minutes), resolve them 5x quicker (15 minutes vs 75 minutes), and reduce alert fatigue 40-50% by replacing symptom-based threshold alerts with SLO-based alerting. A comprehensive observability strategy spanning metrics collection (Prometheus, PromQL), distributed tracing (OpenTelemetry, Jaeger), structured logging (Loki, LogQL), service level objectives (SLIs, SLOs, error budgets), and intelligent alerting (burn-rate alerts, contextual routing) enables teams to achieve 99.99% availability and a mean time to resolution under 15 minutes.
Production-proven practices demonstrate concrete impact: Netflix serves 230+ million subscribers with 99.99% availability through observability-driven incident response, Uber debugs latency across 2,000+ microservices while processing 100+ million traces daily, and Shopify monitors 10+ million metrics per second to detect anomalies before customer impact. Core principles—instrument everything but alert on user impact, correlate signals (metrics + traces + logs) for faster diagnosis, define SLOs from the user's perspective rather than infrastructure thresholds, and design actionable alerts with clear runbooks—separate observability that enables proactive optimization from monitoring that merely generates alert fatigue and constant firefighting.
Observability is ultimately neither a monitoring tool selection nor a dashboard-creation exercise but a cultural shift that treats system understanding as a first-class engineering requirement, demanding instrumentation in application code, correlation across telemetry signals, and continuous refinement based on incident learnings. Teams that embed observability culture—reviewing SLO compliance weekly, conducting blameless postmortems that improve instrumentation, maintaining runbooks linking alerts to remediation steps, celebrating reduced MTTR as a key performance metric—prevent production crises through proactive detection rather than reactive escalation when customers report outages. Whether operating a monolithic application serving thousands or distributed systems supporting millions, treating observability as a product requirement determines whether an engineering team spends its time building features or debugging production incidents.
Written by StaticBlock Editorial
StaticBlock Editorial is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.