Monitoring and Observability - Production Systems Guide for Modern Applications


Master monitoring and observability with metrics collection, distributed tracing, log aggregation, alerting strategies, and SLO/SLI frameworks for production systems at scale.


Introduction

Monitoring and observability determine mean time to detection (MTTD) and resolution (MTTR) for production incidents, with observability-mature organizations detecting issues 10x faster (3 minutes vs 30 minutes average) and resolving 5x quicker (15 minutes vs 75 minutes) according to DORA metrics, directly impacting availability SLAs and customer satisfaction. Traditional monitoring—collecting predefined metrics like CPU, memory, request count—proves insufficient for modern distributed systems where failures emerge from complex interactions between microservices, with 73% of production incidents requiring correlation across multiple data sources (metrics, traces, logs) that traditional dashboards fail to provide. Observability extends monitoring by enabling teams to ask arbitrary questions about system behavior without predicting failure modes in advance, supporting investigation of novel issues through high-cardinality data, distributed tracing revealing request paths across 10-50 services, and structured logging enabling rich querying of application context.

Organizations implementing comprehensive observability report 60-80% reduction in MTTR through faster root cause identification, 40-50% decrease in alert fatigue by replacing symptom-based alerts with SLO violations, and 20-30% reduction in infrastructure costs by identifying underutilized resources and optimization opportunities. Companies like Netflix achieve 99.99% availability serving 230+ million subscribers through observability-driven incident response correlating metrics, traces, and logs in real time; Uber processes 100+ million distributed traces daily to debug latency issues across 2,000+ microservices; and Shopify monitors 10+ million metrics per second, detecting anomalies before customer impact through machine learning-based alerting. This guide explores production-proven observability practices including metrics collection and visualization (Prometheus, Grafana), distributed tracing (Jaeger, Tempo, OpenTelemetry), log aggregation (Elasticsearch, Loki), alerting strategies (SLO-based alerting, alert routing), service level objectives (SLIs, SLOs, error budgets), and observability-driven development integrating monitoring into application code.

Metrics Collection and Visualization

Prometheus for Metrics Collection

Prometheus provides a time-series database optimized for operational metrics, with pull-based scraping and a powerful query language.

Instrumenting Applications:

// Go application instrumentation with Prometheus client
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: monotonically increasing value (requests, errors)
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"}, // Labels for filtering
	)

	// Histogram: distribution of values (latency, request size)
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge: value that can increase or decrease (active connections, queue depth)
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, r.URL.Path))
		defer timer.ObserveDuration()

		// Track active connections
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Capture response status
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(wrapped, r)

		// Record request (numeric status code so queries like status=~"5.." work)
		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			strconv.Itoa(wrapped.statusCode),
		).Inc()
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler()) // Expose metrics endpoint
	http.Handle("/", metricsMiddleware(http.HandlerFunc(handler)))
	http.ListenAndServe(":8080", nil)
}
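
Once the service is running, Prometheus scrapes the /metrics endpoint and receives the plain-text exposition format. Abbreviated output for the instruments above (all values are illustrative):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status="200"} 1027
# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/",le="0.005",method="GET"} 871
http_request_duration_seconds_bucket{endpoint="/",le="+Inf",method="GET"} 1027
http_request_duration_seconds_sum{endpoint="/",method="GET"} 4.32
http_request_duration_seconds_count{endpoint="/",method="GET"} 1027
# HELP active_connections Number of active connections
# TYPE active_connections gauge
active_connections 12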

Prometheus Scrape Configuration:

# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alerting rules

scrape_configs:

  - job_name: 'api-service'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          environment: 'production'
          region: 'us-east-1'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

PromQL Queries:

# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Resource utilization by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Alert on high error rate (> 5%)
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > 0.05

Grafana Dashboards

Grafana provides the visualization layer for metrics from Prometheus, Elasticsearch, and other data sources.

Dashboard Design Principles:

Dashboard Structure:
  - Overview row (top-level metrics visible without scrolling)
    - Request rate, error rate, P95 latency, availability
  - Service health row (per-service breakdown)
    - Error rates by endpoint, latency distribution, throughput
  - Resource utilization row (infrastructure metrics)
    - CPU, memory, disk I/O, network
  - Business metrics row (application-specific KPIs)
    - Orders/second, revenue, active users

Variables for filtering:

  • $environment (production, staging, development)
  • $service (api, worker, frontend)
  • $region (us-east-1, eu-west-1)

Time ranges:

  • Default: Last 6 hours
  • Quick ranges: 5m, 15m, 1h, 6h, 24h, 7d
  • Refresh: Every 30s for production dashboards

Effective Panel Configuration:

{
  "title": "Request Rate and Error Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m]))",
      "legendFormat": "Request Rate (req/s)",
      "refId": "A"
    },
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m])) * 100",
      "legendFormat": "Error Rate (%)",
      "refId": "B"
    }
  ],
  "yaxes": [
    {
      "label": "req/s",
      "format": "short"
    },
    {
      "label": "%",
      "format": "percent",
      "max": 100
    }
  ],
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "query": {
          "params": ["B", "5m", "now"]
        },
        "type": "query"
      }
    ],
    "frequency": "60s",
    "handler": 1,
    "name": "High Error Rate",
    "notifications": [
      {"uid": "slack-alerts"}
    ]
  }
}

Distributed Tracing

OpenTelemetry for Instrumentation

OpenTelemetry provides a vendor-neutral observability framework for traces, metrics, and logs.

Automatic Instrumentation (Node.js Example):

// Auto-instrumentation for Express, HTTP, database clients
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

// Initialize tracer provider with service metadata
const { Resource } = require('@opentelemetry/resources');

const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-service',
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  })
});

// Configure exporter (Jaeger backend)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation()
  ]
});

// Application code requires no changes - tracing is automatic
const express = require('express');
const app = express();

app.get('/users/:id', async (req, res) => {
  // Span for the HTTP request is created automatically
  const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  // Span for the database query is created automatically
  res.json(user);
});

Manual Span Creation for Custom Operations:

const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('api-service');

app.post('/orders', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('process_order', {
    attributes: {
      'order.id': req.body.id,
      'order.amount': req.body.total,
      'user.id': req.user.id
    }
  });
  // Make this span the parent of the child spans created below
  const ctx = trace.setSpan(context.active(), span);

  try {
    // Validate order
    const validationSpan = tracer.startSpan('validate_order', {}, ctx);
    await validateOrder(req.body);
    validationSpan.end();

    // Check inventory
    const inventorySpan = tracer.startSpan('check_inventory', {}, ctx);
    const available = await checkInventory(req.body.items);
    inventorySpan.setAttribute('inventory.available', available);
    inventorySpan.end();

    if (!available) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'Out of stock' });
      return res.status(409).json({ error: 'Out of stock' });
    }

    // Create order
    const order = await createOrder(req.body);
    span.setAttribute('order.created', true);
    span.setStatus({ code: SpanStatusCode.OK });

    res.json(order);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    res.status(500).json({ error: 'Failed to process order' });
  } finally {
    span.end();
  }
});
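
Traces only stitch together across services when the span context travels on the wire. In a Go service like the one from the metrics section, the otelhttp contrib package covers both directions; a minimal sketch, assuming a tracer provider has already been registered globally (service names and the /check endpoint are illustrative):

package main

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
)

// callInventory makes an outbound call that carries the current trace.
// otelhttp's transport wraps the request in a client span and injects
// W3C traceparent headers so the downstream service joins the same trace.
func callInventory(ctx context.Context) error {
	ctx, span := otel.Tracer("api-service").Start(ctx, "check_inventory")
	defer span.End()

	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://inventory-service/check", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Inbound: otelhttp.NewHandler extracts the incoming traceparent header
	// and starts a server span, so r.Context() continues the caller's trace.
	http.Handle("/check", otelhttp.NewHandler(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "ok")
		}),
		"inventory.check",
	))
	http.ListenAndServe(":8081", nil)
}

Both sides then share a single trace ID, which is what lets Jaeger render cross-service trees like the one in the next section.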

Analyzing Traces with Jaeger

Distributed traces reveal request flow across microservices, pinpointing latency bottlenecks.

Trace Structure:

Trace: Order Creation (Total Duration: 1.2s)
  ├─ Span: HTTP POST /orders (1.2s)
  │   ├─ Span: validate_order (50ms)
  │   ├─ Span: check_inventory (800ms)  # ← BOTTLENECK!
  │   │   ├─ Span: HTTP GET inventory-service/check (780ms)
  │   │   │   ├─ Span: postgres.query SELECT (750ms)  # ← Slow query!
  │   │   │   └─ Span: serialize_response (30ms)
  │   │   └─ Span: parse_response (20ms)
  │   ├─ Span: create_order (300ms)
  │   │   ├─ Span: postgres.query INSERT (250ms)
  │   │   └─ Span: publish_event (50ms)
  │   └─ Span: send_confirmation (50ms)

Findings:

  • 800ms of 1.2s (67%) spent on inventory check
  • Inventory service database query taking 750ms (slow index?)
  • Opportunity: Cache inventory status and add a database index (see the caching sketch below)
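
A minimal sketch of the caching remediation, under stated assumptions: the fetch callback stands in for the real inventory-service call, and the TTL and locking choices are illustrative rather than a production design:

package inventory

import (
	"sync"
	"time"
)

type cachedResult struct {
	available bool
	expires   time.Time
}

// InventoryCache memoizes inventory lookups for a short TTL so repeated
// checks skip the slow downstream query seen in the trace above.
type InventoryCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]cachedResult
}

func NewInventoryCache(ttl time.Duration) *InventoryCache {
	return &InventoryCache{ttl: ttl, m: make(map[string]cachedResult)}
}

// Check returns a cached answer when fresh, otherwise calls fetch
// (the real inventory-service lookup) and stores the result.
func (c *InventoryCache) Check(sku string, fetch func(string) bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.m[sku]; ok && time.Now().Before(r.expires) {
		return r.available // Cache hit: avoids the ~780ms remote call
	}
	available := fetch(sku)
	c.m[sku] = cachedResult{available: available, expires: time.Now().Add(c.ttl)}
	return available
}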

Query Patterns in Jaeger UI:

# Find slow traces (P95+ latency)
Service: api-service
Operation: POST /orders
Min Duration: 1s
Tags: http.status_code=200

# Find error traces
Service: api-service
Tags: error=true OR http.status_code>=500

# Find traces touching a specific service
Service: inventory-service
Operation: *

# Correlate with other signals
Tags: user.id=12345   # Find all requests for a user
Tags: order.id=67890  # Track a specific order through the system

Log Aggregation and Analysis

Structured Logging

Structured logs in JSON format enable rich querying and correlation with traces.

Structured Logging Example:

const winston = require('winston');
const { trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Log with trace context
function log(level, message, metadata = {}) {
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();

  logger.log({
    level,
    message,
    ...metadata,
    // Include trace context for correlation
    trace_id: spanContext?.traceId,
    span_id: spanContext?.spanId,
    service: 'api-service',
    environment: process.env.NODE_ENV
  });
}

// Usage
app.post('/orders', async (req, res) => {
  const startTime = Date.now();

  log('info', 'Order creation started', {
    order_id: req.body.id,
    user_id: req.user.id,
    items_count: req.body.items.length
  });

  try {
    const order = await createOrder(req.body);

    log('info', 'Order created successfully', {
      order_id: order.id,
      total: order.total,
      processing_time_ms: Date.now() - startTime
    });

    res.json(order);
  } catch (error) {
    log('error', 'Order creation failed', {
      order_id: req.body.id,
      error_message: error.message,
      error_stack: error.stack
    });

    res.status(500).json({ error: 'Failed to create order' });
  }
});

// Output:
// {
//   "level": "info",
//   "message": "Order created successfully",
//   "order_id": "order_123",
//   "total": 99.99,
//   "processing_time_ms": 234,
//   "trace_id": "f47ac10b58cc4372a567",
//   "span_id": "8a2d3b4e5f6a7890",
//   "service": "api-service",
//   "environment": "production",
//   "timestamp": "2026-02-15T12:00:00.000Z"
// }
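
The same pattern ports directly to Go using the standard library's log/slog; a sketch where the trace_id and span_id fields mirror the JSON output above:

package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace attaches trace_id and span_id from the active span so log
// lines can be joined to distributed traces in Loki or Elasticsearch.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, attrs ...slog.Attr) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		attrs = append(attrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.LogAttrs(ctx, slog.LevelInfo, msg, attrs...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logWithTrace(context.Background(), logger, "Order created successfully",
		slog.String("order_id", "order_123"),
		slog.Float64("total", 99.99),
	)
}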

Grafana Loki for Log Aggregation

Loki provides lightweight log aggregation designed for Kubernetes and cloud-native environments.

Loki Configuration:

# promtail.yaml (log shipper)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            message: message
            trace_id: trace_id
      # Extract labels
      - labels:
          level:
          trace_id:

LogQL Queries:

# Filter logs by service and level
{app="api-service", level="error"}

# Find logs for a specific trace
{app="api-service"} | json | trace_id="f47ac10b58cc4372a567"

# Count errors per service
sum(count_over_time({level="error"}[1h])) by (app)

# Find slow requests
{app="api-service"} | json | processing_time_ms > 1000

# Pattern matching
{app="api-service"} |= "Order creation failed"

# Metrics from logs (log-based alerting)
rate({app="api-service", level="error"}[5m]) > 0.1

Service Level Objectives (SLOs)

Defining SLIs and SLOs

Service Level Indicators (SLIs) measure service health; Service Level Objectives (SLOs) define acceptable thresholds for those indicators.

Common SLIs:

Availability SLI:
  Definition: Percentage of successful requests
  Measurement: (successful_requests / total_requests) * 100
  Target: 99.9% (3 nines)

Latency SLI:
  Definition: Percentage of requests completing within threshold
  Measurement: (requests_under_threshold / total_requests) * 100
  Threshold: 95% of requests < 200ms
  Target: 99.0%

Error Rate SLI:
  Definition: Percentage of requests without errors
  Measurement: ((total_requests - error_requests) / total_requests) * 100
  Target: 99.5% (error rate < 0.5%)

Throughput SLI:
  Definition: Requests processed per second
  Measurement: sum(rate(http_requests_total[5m]))
  Target: > 1000 req/s

Data Freshness SLI:
  Definition: Percentage of data updated within SLA
  Measurement: (records_updated_on_time / total_records) * 100
  Target: 99.0% of data < 5 minutes old
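
One way to make such SLIs directly measurable is to export explicit good/total counters from the application, extending the Go instrumentation shown earlier; the counter names and the 200ms cut-off below are illustrative:

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// SLI counters: a "good" event both succeeds and beats the latency threshold.
var (
	sliTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "sli_requests_total",
		Help: "All requests counted toward the SLI",
	})
	sliGood = promauto.NewCounter(prometheus.CounterOpts{
		Name: "sli_requests_good_total",
		Help: "Requests that succeeded within the 200ms latency threshold",
	})
)

// recordSLI is called once per request, e.g. from the metrics middleware.
func recordSLI(statusCode int, duration time.Duration) {
	sliTotal.Inc()
	if statusCode < 500 && duration < 200*time.Millisecond {
		sliGood.Inc()
	}
}

The SLI ratio is then sum(rate(sli_requests_good_total[5m])) / sum(rate(sli_requests_total[5m])), which plugs directly into the SLO definition below.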

SLO Configuration:

# SLO definition (Prometheus)
apiVersion: v1
kind: SLO
metadata:
  name: api-availability
spec:
  service: api-service
  description: API requests should succeed 99.9% of the time
  target: 99.9
  window: 30d  # 30-day rolling window

  indicator:
    # Good events: successful requests (status < 500)
    good:
      promql: sum(rate(http_requests_total{status!~"5.."}[5m]))
    # Total events: all requests
    total:
      promql: sum(rate(http_requests_total[5m]))

  alerting:
    # Burn rate: how fast the error budget is consumed
    # Fast burn (2% of budget in 1 hour) → page
    - severity: critical
      burn_rate: 14.4
      window: 1h
    # Slow burn (5% of budget in 6 hours) → ticket
    - severity: warning
      burn_rate: 6
      window: 6h

Error Budgets and Burn Rate

Error budgets quantify the acceptable failure rate; burn rate measures how fast that budget is being consumed.

Error Budget Calculation:

# 99.9% availability SLO = 0.1% error budget
error_budget_percent = 100 - slo_target  # 100 - 99.9 = 0.1%

# Monthly error budget (30 days)
total_minutes = 30 * 24 * 60  # 43,200 minutes
allowed_downtime_minutes = total_minutes * (error_budget_percent / 100)
# 43,200 * 0.001 = 43.2 minutes of allowed downtime per month

# Current error rate
current_errors = 50      # errors in last hour
total_requests = 10000   # requests in last hour
error_rate = (current_errors / total_requests) * 100  # 0.5%

# Burn rate (how many times faster than budget)
burn_rate = error_rate / error_budget_percent
# 0.5 / 0.1 = 5x burn rate

# Time to exhaust budget
hours_to_exhaust = (30 * 24) / burn_rate
# 720 / 5 = 144 hours (6 days)

# Alert thresholds (matching the SLO config above):
#   Critical: burn rate > 14.4 for 1 hour exhausts 2% of the monthly budget
#             -> page the on-call engineer
#   Warning:  burn rate > 6 for 6 hours exhausts 5% of the monthly budget
#             -> create a ticket for investigation
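
The same arithmetic as a small runnable Go helper, using the illustrative numbers above:

package main

import "fmt"

// burnRate returns how many times faster than budget errors are arriving.
// sloTarget and errorRate are percentages, e.g. 99.9 and 0.5.
func burnRate(sloTarget, errorRate float64) float64 {
	errorBudget := 100 - sloTarget // 0.1% budget for a 99.9% SLO
	return errorRate / errorBudget
}

// hoursToExhaust projects when a 30-day error budget runs out
// if the current burn rate is sustained.
func hoursToExhaust(rate float64) float64 {
	return (30 * 24) / rate
}

func main() {
	rate := burnRate(99.9, 0.5)
	fmt.Printf("burn rate: %.1fx, budget exhausted in %.0f hours\n",
		rate, hoursToExhaust(rate))
	// Prints: burn rate: 5.0x, budget exhausted in 144 hours
}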

Alerting Strategies

Alert Design Principles

Effective alerts are actionable, specific, and correlated with user impact.

Alert Tiers:

Critical (Page):
  Trigger: Service down, SLO violation imminent
  Example: Error rate > 5% for 5 minutes (users affected NOW)
  Response: Immediate action required, wake engineer
  Frequency: Rare (< 1 per week ideal)

Warning (Ticket):
  Trigger: Degraded performance, trend toward SLO violation
  Example: P95 latency > 500ms for 30 minutes
  Response: Investigate during business hours
  Frequency: Occasional (< 1 per day ideal)

Info (Dashboard):
  Trigger: Interesting but not urgent
  Example: Disk usage > 70%
  Response: Monitor, plan capacity
  Frequency: Common, no notification

Alert Template:

alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m  # Must be true for 5 minutes (reduce flapping)
labels:
  severity: critical
  service: api-service
annotations:
  summary: "High error rate detected"
  description: |
    Error rate is {{ $value | humanizePercentage }} (threshold: 5%)
    Service: {{ $labels.service }}
    Environment: {{ $labels.environment }}

    Impact: Users experiencing 500 errors

    Likely causes:
    - Database connectivity issues
    - Dependency service failure
    - Code deployment regression
  runbook: https://wiki.company.com/runbooks/high-error-rate
  dashboard: https://grafana.company.com/d/api-overview
  metrics: https://grafana.company.com/d/api-service
  traces: https://jaeger.company.com/?service=api-service&lookback=1h

Alert Routing and On-Call

Route alerts based on severity, time, and service ownership.

AlertManager Configuration:

# alertmanager.yml
route:
  receiver: default
  group_by: [alertname, service]
  group_wait: 30s       # Wait for more alerts before sending
  group_interval: 5m    # How often to send grouped notifications
  repeat_interval: 4h   # Repeat notification if still firing

  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true  # Also send to Slack

    # Critical alerts → Slack #incidents
    - match:
        severity: critical
      receiver: slack-incidents

    # Warning alerts → Slack #alerts
    - match:
        severity: warning
      receiver: slack-alerts

    # Service-specific routing
    - match:
        service: payment-service
      receiver: payments-team
      routes:
        - match:
            severity: critical
          receiver: payments-oncall

# Templates below use standard AlertManager notification data
# (GroupLabels, CommonLabels, CommonAnnotations)
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <key>
        description: "{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}"

  - name: slack-incidents
    slack_configs:
      - api_url: <webhook>
        channel: '#incidents'
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"
        actions:
          - type: button
            text: "View Dashboard"
            url: "{{ .CommonAnnotations.dashboard }}"
          - type: button
            text: "Runbook"
            url: "{{ .CommonAnnotations.runbook }}"

Conclusion

Monitoring and observability separate high-performing engineering teams from reactive firefighters—observability-mature organizations detect incidents 10x faster (3 minutes vs 30 minutes), resolve them 5x quicker (15 minutes vs 75 minutes), and cut alert fatigue 40-50% by replacing symptom-based threshold alerts with SLO-based alerting. A comprehensive observability strategy spanning metrics collection (Prometheus, PromQL), distributed tracing (OpenTelemetry, Jaeger), structured logging (Loki, LogQL), service level objectives (SLIs, SLOs, error budgets), and intelligent alerting (burn rate alerts, contextual routing) enables teams to achieve 99.99% availability and mean time to resolution under 15 minutes.

Production-proven practices demonstrate concrete impact: Netflix serves 230+ million subscribers with 99.99% availability through observability-driven incident response, Uber debugs latency across 2,000+ microservices while processing 100+ million traces daily, and Shopify monitors 10+ million metrics per second, detecting anomalies before customers notice. Core principles—instrument everything but alert on user impact, correlate signals (metrics + traces + logs) for faster diagnosis, define SLOs from the user's perspective rather than infrastructure thresholds, and design actionable alerts with clear runbooks—separate observability that enables proactive optimization from monitoring that generates alert fatigue and constant firefighting.

Observability is ultimately neither a monitoring tool selection nor a dashboard-creation exercise but a cultural shift that treats system understanding as a first-class engineering requirement, demanding instrumentation in application code, correlation across telemetry signals, and continuous refinement based on incident learnings. Teams that embed this culture—reviewing SLO compliance weekly, conducting blameless postmortems that improve instrumentation, maintaining runbooks linking alerts to remediation steps, and celebrating reduced MTTR as a key performance metric—prevent production crises through proactive detection rather than reactive escalation when customers report outages. Whether operating a monolithic application serving thousands or distributed systems supporting millions, treating observability as a product requirement determines whether an engineering team spends its time building features or debugging production incidents.
