Monitoring and Observability - Production Systems Guide for Modern Applications
Master monitoring and observability with metrics collection, distributed tracing, log aggregation, alerting strategies, and SLO/SLI frameworks for production systems at scale.
Introduction
Monitoring and observability determine mean time to detection (MTTD) and resolution (MTTR) for production incidents, with observability-mature organizations detecting issues 10x faster (3 minutes vs 30 minutes average) and resolving 5x quicker (15 minutes vs 75 minutes) according to DORA metrics, directly impacting availability SLAs and customer satisfaction. Traditional monitoring—collecting predefined metrics like CPU, memory, request count—proves insufficient for modern distributed systems where failures emerge from complex interactions between microservices, with 73% of production incidents requiring correlation across multiple data sources (metrics, traces, logs) that traditional dashboards fail to provide. Observability extends monitoring by enabling teams to ask arbitrary questions about system behavior without predicting failure modes in advance, supporting investigation of novel issues through high-cardinality data, distributed tracing revealing request paths across 10-50 services, and structured logging enabling rich querying of application context.
Organizations implementing comprehensive observability report 60-80% reduction in MTTR through faster root cause identification, 40-50% decrease in alert fatigue by replacing symptom-based alerts with SLO violations, and 20-30% reduction in infrastructure costs by identifying underutilized resources and optimization opportunities. Companies like Netflix achieve 99.99% availability serving 230+ million subscribers through observability-driven incident response correlating metrics, traces, and logs in real-time, Uber processes 100+ million distributed traces daily debugging latency issues across 2,000+ microservices, and Shopify monitors 10+ million metrics per second detecting anomalies before customer impact through machine learning-based alerting. This guide explores production-proven observability practices including metrics collection and visualization (Prometheus, Grafana), distributed tracing (Jaeger, Tempo, OpenTelemetry), log aggregation (Elasticsearch, Loki), alerting strategies (SLO-based alerting, alert routing), service level objectives (SLIs, SLOs, error budgets), and observability-driven development integrating monitoring into application code.
Metrics Collection and Visualization
Prometheus for Metrics Collection
Prometheus provides a time-series database optimized for operational metrics, with pull-based scraping and a powerful query language (PromQL).
Instrumenting Applications:
// Go application instrumentation with Prometheus client
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: Monotonically increasing value (requests, errors)
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"}, // Labels for filtering
	)
	// Histogram: Distribution of values (latency, request size)
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)
	// Gauge: Value that can increase or decrease (active connections, queue depth)
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// responseWriter wraps http.ResponseWriter to capture the response status code
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, r.URL.Path))
		defer timer.ObserveDuration()
		// Track active connections
		activeConnections.Inc()
		defer activeConnections.Dec()
		// Capture response status
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(wrapped, r)
		// Record request with the numeric status code as a label ("200", not "OK")
		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			strconv.Itoa(wrapped.statusCode),
		).Inc()
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler()) // Expose metrics endpoint
	http.Handle("/", metricsMiddleware(http.HandlerFunc(handler)))
	http.ListenAndServe(":8080", nil)
}
Prometheus Scrape Configuration:
# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alerting rules

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          environment: 'production'
          region: 'us-east-1'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
PromQL Queries:
# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Resource utilization by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Alert on high error rate
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.05  # > 5% error rate
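The P95 query above relies on PromQL's `histogram_quantile()`, which estimates quantiles from cumulative bucket counts. A simplified Python sketch of that estimation (linear interpolation inside the bucket where the target rank falls; the bucket data is illustrative, and Prometheus handles the lowest and `+Inf` buckets with extra rules omitted here):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs; q in (0, 1)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le=0.1 -> 600 requests, le=0.25 -> 900, le=0.5 -> 980, le=1.0 -> 1000
buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000)]
p95 = histogram_quantile(0.95, buckets)  # rank 950 lands in the (0.25, 0.5] bucket
```

This is why bucket boundaries matter: the estimate is only as precise as the bucket containing the quantile.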
Grafana Dashboards
Grafana provides a visualization layer for metrics from Prometheus, Elasticsearch, and other data sources.
Dashboard Design Principles:
Dashboard Structure:
- Overview row (top-level metrics visible without scrolling)
  - Request rate, error rate, P95 latency, availability
- Service health row (per-service breakdown)
  - Error rates by endpoint, latency distribution, throughput
- Resource utilization row (infrastructure metrics)
  - CPU, memory, disk I/O, network
- Business metrics row (application-specific KPIs)
  - Orders/second, revenue, active users

Variables for filtering:
- $environment (production, staging, development)
- $service (api, worker, frontend)
- $region (us-east-1, eu-west-1)

Time ranges:
- Default: Last 6 hours
- Quick ranges: 5m, 15m, 1h, 6h, 24h, 7d
- Refresh: Every 30s for production dashboards
Effective Panel Configuration:
{
  "title": "Request Rate and Error Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m]))",
      "legendFormat": "Request Rate (req/s)",
      "refId": "A"
    },
    {
      "expr": "sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\",service=\"$service\"}[5m])) * 100",
      "legendFormat": "Error Rate (%)",
      "refId": "B"
    }
  ],
  "yaxes": [
    {
      "label": "req/s",
      "format": "short"
    },
    {
      "label": "%",
      "format": "percent",
      "max": 100
    }
  ],
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "query": {
          "params": ["B", "5m", "now"]
        },
        "type": "query"
      }
    ],
    "frequency": "60s",
    "handler": 1,
    "name": "High Error Rate",
    "notifications": [
      {"uid": "slack-alerts"}
    ]
  }
}
Distributed Tracing
OpenTelemetry for Instrumentation
OpenTelemetry provides vendor-neutral observability framework for traces, metrics, and logs.
Automatic Instrumentation (Node.js Example):
// Auto-instrumentation for Express, HTTP, database clients
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');

// Initialize tracer provider with service identity
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-service',
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  })
});

// Configure exporter (Jaeger backend)
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation()
  ]
});

// Application code requires no changes - tracing is automatic
const express = require('express');
const app = express();

app.get('/users/:id', async (req, res) => {
  // HTTP/Express instrumentation creates a span for this request
  const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  // Pg instrumentation creates a child span for the database query
  res.json(user);
});
Manual Span Creation for Custom Operations:
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('api-service');

app.post('/orders', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('process_order', {
    attributes: {
      'order.id': req.body.id,
      'order.amount': req.body.total,
      'user.id': req.user.id
    }
  });
  // Make `span` the parent context for the child spans created below
  const ctx = trace.setSpan(context.active(), span);
  try {
    // Validate order
    const validationSpan = tracer.startSpan('validate_order', {}, ctx);
    await validateOrder(req.body);
    validationSpan.end();
    // Check inventory
    const inventorySpan = tracer.startSpan('check_inventory', {}, ctx);
    const available = await checkInventory(req.body.items);
    inventorySpan.setAttribute('inventory.available', available);
    inventorySpan.end();
    if (!available) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'Out of stock' });
      return res.status(409).json({ error: 'Out of stock' });
    }
    // Create order
    const order = await createOrder(req.body);
    span.setAttribute('order.created', true);
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(order);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    res.status(500).json({ error: 'Failed to process order' });
  } finally {
    span.end();
  }
});
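Spans from different services join into one trace because trace context travels with each request, typically via the W3C `traceparent` HTTP header, which OpenTelemetry's HTTP instrumentation injects and extracts automatically. A minimal Python sketch of that header format (`make_traceparent` and `parse_traceparent` are hypothetical helpers for illustration):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Extract trace context from an incoming request header."""
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

hdr = make_traceparent(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
# hdr == "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

The downstream service parses this header and uses the incoming span as the parent of its own spans, which is what stitches the cross-service tree together.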
Analyzing Traces with Jaeger
Distributed traces reveal request flow across microservices, identifying latency bottlenecks.
Trace Structure:
Trace: Order Creation (Total Duration: 1.2s)
├─ Span: HTTP POST /orders (1.2s)
│ ├─ Span: validate_order (50ms)
│ ├─ Span: check_inventory (800ms) # ← BOTTLENECK!
│ │ ├─ Span: HTTP GET inventory-service/check (780ms)
│ │ │ ├─ Span: postgres.query SELECT (750ms) # ← Slow query!
│ │ │ └─ Span: serialize_response (30ms)
│ │ └─ Span: parse_response (20ms)
│ ├─ Span: create_order (300ms)
│ │ ├─ Span: postgres.query INSERT (250ms)
│ │ └─ Span: publish_event (50ms)
│ └─ Span: send_confirmation (50ms)
Findings:
- 800ms of 1.2s (67%) spent on inventory check
- Inventory service database query taking 750ms (slow index?)
- Opportunity: Cache inventory status, add database index
Query Patterns in Jaeger UI:
# Find slow traces (P95+ latency)
Service: api-service
Operation: POST /orders
Min Duration: 1s
Tags: http.status_code=200

# Find error traces
Service: api-service
Tags: error=true OR http.status_code>=500

# Find traces touching a specific service
Service: inventory-service
Operation: *

# Correlate with other signals
Tags: user.id=12345   # Find all requests for a user
Tags: order.id=67890  # Track a specific order through the system
Log Aggregation and Analysis
Structured Logging
Structured logs in JSON format enable rich querying and correlation with traces.
Structured Logging Example:
const winston = require('winston');
const { trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Log with trace context
function log(level, message, metadata = {}) {
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();
  logger.log({
    level,
    message,
    ...metadata,
    // Include trace context for correlation
    trace_id: spanContext?.traceId,
    span_id: spanContext?.spanId,
    service: 'api-service',
    environment: process.env.NODE_ENV
  });
}

// Usage
app.post('/orders', async (req, res) => {
  const startTime = Date.now();
  log('info', 'Order creation started', {
    order_id: req.body.id,
    user_id: req.user.id,
    items_count: req.body.items.length
  });
  try {
    const order = await createOrder(req.body);
    log('info', 'Order created successfully', {
      order_id: order.id,
      total: order.total,
      processing_time_ms: Date.now() - startTime
    });
    res.json(order);
  } catch (error) {
    log('error', 'Order creation failed', {
      order_id: req.body.id,
      error_message: error.message,
      error_stack: error.stack
    });
    res.status(500).json({ error: 'Failed to create order' });
  }
});
// Output:
// {
// "level": "info",
// "message": "Order created successfully",
// "order_id": "order_123",
// "total": 99.99,
// "processing_time_ms": 234,
// "trace_id": "f47ac10b58cc4372a567",
// "span_id": "8a2d3b4e5f6a7890",
// "service": "api-service",
// "environment": "production",
// "timestamp": "2026-02-15T12:00:00.000Z"
// }
Grafana Loki for Log Aggregation
Loki provides lightweight log aggregation designed for Kubernetes and cloud-native environments.
Loki Configuration:
# promtail.yaml (log shipper)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            message: message
            trace_id: trace_id
      # Extract labels
      - labels:
          level:
          trace_id:
LogQL Queries:
# Filter logs by service and level
{app="api-service", level="error"}

# Find logs for a specific trace
{app="api-service"} | json | trace_id="f47ac10b58cc4372a567"

# Count errors per service over the last hour
sum(count_over_time({level="error"}[1h])) by (app)

# Find slow requests
{app="api-service"} | json | processing_time_ms > 1000

# Pattern matching
{app="api-service"} |= "Order creation failed"

# Metrics from logs (log-based alerting)
rate({app="api-service"} |= "Order creation failed" [5m]) > 0.1
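The trace-correlation query above is conceptually simple: parse each JSON log line and keep the entries matching a trace ID. A Python sketch of that pipeline (log lines are illustrative; `logs_for_trace` is a hypothetical helper):

```python
import json

# Illustrative structured log lines, as emitted by the logger above
raw_logs = [
    '{"level":"info","message":"Order creation started","trace_id":"abc123"}',
    '{"level":"error","message":"Order creation failed","trace_id":"abc123"}',
    '{"level":"info","message":"Healthcheck","trace_id":"def456"}',
]

def logs_for_trace(lines, trace_id):
    """Parse JSON log lines and keep entries for one trace."""
    parsed = (json.loads(line) for line in lines)
    return [entry for entry in parsed if entry.get("trace_id") == trace_id]

matched = logs_for_trace(raw_logs, "abc123")
# matched holds both entries for trace abc123, including the error
```

Loki does the same work at query time across all shipped logs, which is why embedding `trace_id` in every log line pays off during incidents.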
Service Level Objectives (SLOs)
Defining SLIs and SLOs
Service Level Indicators (SLIs) measure service health; Service Level Objectives (SLOs) define acceptable thresholds for those measurements.
Common SLIs:
Availability SLI:
Definition: Percentage of successful requests
Measurement: (successful_requests / total_requests) * 100
Target: 99.9% (3 nines)
Latency SLI:
Definition: Percentage of requests completing within threshold
Measurement: (requests_under_threshold / total_requests) * 100
Threshold: 95% of requests < 200ms
Target: 99.0%
Error Rate SLI:
Definition: Percentage of requests without errors
Measurement: ((total_requests - error_requests) / total_requests) * 100
Target: 99.5% (error rate < 0.5%)
Throughput SLI:
Definition: Requests processed per second
Measurement: sum(rate(http_requests_total[5m]))
Target: > 1000 req/s
Data Freshness SLI:
Definition: Percentage of data updated within SLA
Measurement: (records_updated_on_time / total_records) * 100
Target: 99.0% of data < 5 minutes old
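The SLI definitions above all reduce to ratios of good events to total events. A minimal sketch with illustrative request counts (`availability_sli` and `latency_sli` are hypothetical helpers, not part of any library):

```python
def availability_sli(successful_requests, total_requests):
    """Percentage of requests that succeeded."""
    return successful_requests / total_requests * 100

def latency_sli(requests_under_threshold, total_requests):
    """Percentage of requests completing under the latency threshold."""
    return requests_under_threshold / total_requests * 100

# Illustrative counts over a measurement window
availability = availability_sli(999_500, 1_000_000)  # 99.95%, meets a 99.9% target
latency = latency_sli(991_200, 1_000_000)            # 99.12%, meets a 99.0% target
```

Framing every SLI as good/total keeps the error-budget math uniform regardless of which user-facing behavior the SLI measures.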
SLO Configuration:
# SLO definition (illustrative schema, similar to tools like Sloth or OpenSLO)
apiVersion: v1
kind: SLO
metadata:
  name: api-availability
spec:
  service: api-service
  description: API requests should succeed 99.9% of the time
  target: 99.9
  window: 30d  # 30-day rolling window
  indicator:
    # Good events: successful requests (status < 500)
    good:
      promql: sum(rate(http_requests_total{status!~"5.."}[5m]))
    # Total events: all requests
    total:
      promql: sum(rate(http_requests_total[5m]))
  alerting:
    # Burn rate: how fast the error budget is consumed
    # Fast burn (2% of budget in 1 hour) → page
    - severity: critical
      burn_rate: 14.4
      window: 1h
    # Slow burn (5% of budget in 6 hours) → ticket
    - severity: warning
      burn_rate: 6
      window: 6h
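The burn-rate thresholds of 14.4 and 6 are not arbitrary: each one is chosen so that a sustained burn over the alert window consumes a fixed slice of the 30-day budget. A sketch of that arithmetic (`budget_fraction_consumed` is a hypothetical helper):

```python
def budget_fraction_consumed(burn_rate, window_hours, slo_window_days=30):
    """Fraction of the SLO window's error budget consumed if the given
    burn rate is sustained for the given number of hours."""
    return burn_rate * window_hours / (slo_window_days * 24)

fast_burn = budget_fraction_consumed(14.4, 1)  # 14.4 * 1 / 720 = 0.02 -> 2% in 1 hour
slow_burn = budget_fraction_consumed(6, 6)     # 6 * 6 / 720 = 0.05 -> 5% in 6 hours
```

Inverting the formula gives the threshold for any policy: to page when X% of the budget would vanish in W hours, set burn_rate = (X/100) * 720 / W.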
Error Budgets and Burn Rate
Error budgets quantify the acceptable failure rate; burn rate measures how quickly that budget is being consumed.
Error Budget Calculation:
# 99.9% availability SLO = 0.1% error budget
slo_target = 99.9
error_budget_percent = 100 - slo_target  # 0.1%

# Monthly error budget (30 days)
total_minutes = 30 * 24 * 60  # 43,200 minutes
allowed_downtime_minutes = total_minutes * (error_budget_percent / 100)
# 43,200 * 0.001 = 43.2 minutes of allowed downtime per month

# Current error rate
current_errors = 50      # errors in the last hour
total_requests = 10000   # requests in the last hour
error_rate = (current_errors / total_requests) * 100  # 0.5%

# Burn rate (how many times faster than budget)
burn_rate = error_rate / error_budget_percent
# 0.5 / 0.1 = 5x burn rate

# Time to exhaust the budget at this rate
hours_to_exhaust = (30 * 24) / burn_rate
# 720 / 5 = 144 hours (6 days)

# Alerting thresholds:
# Critical: burn rate > 14.4 for 1 hour → 2% of the monthly budget gone → page on-call
# Warning:  burn rate > 6 for 6 hours → 5% of the monthly budget gone → create ticket
Alerting Strategies
Alert Design Principles
Effective alerts are actionable, specific, and correlated with user impact.
Alert Tiers:
Critical (Page):
Trigger: Service down, SLO violation imminent
Example: Error rate > 5% for 5 minutes (users affected NOW)
Response: Immediate action required, wake engineer
Frequency: Rare (< 1 per week ideal)
Warning (Ticket):
Trigger: Degraded performance, trend toward SLO violation
Example: P95 latency > 500ms for 30 minutes
Response: Investigate during business hours
Frequency: Occasional (< 1 per day ideal)
Info (Dashboard):
Trigger: Interesting but not urgent
Example: Disk usage > 70%
Response: Monitor, plan capacity
Frequency: Common, no notification
Alert Template:
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m  # Must be true for 5 minutes (reduces flapping)
labels:
  severity: critical
  service: api-service
annotations:
  summary: "High error rate detected"
  description: |
    Error rate is {{ $value | humanizePercentage }} (threshold: 5%)
    Service: {{ $labels.service }}
    Environment: {{ $labels.environment }}
    Impact: Users experiencing 500 errors
    Likely causes:
    - Database connectivity issues
    - Dependency service failure
    - Code deployment regression
    Runbook: https://wiki.company.com/runbooks/high-error-rate
    Metrics: https://grafana.company.com/d/api-service
    Traces: https://jaeger.company.com/?service=api-service&lookback=1h
  dashboard: https://grafana.company.com/d/api-overview
  runbook: https://wiki.company.com/runbooks/high-error-rate
Alert Routing and On-Call
Route alerts based on severity, time, and service ownership.
AlertManager Configuration:
# alertmanager.yml
route:
  receiver: default
  group_by: [alertname, service]
  group_wait: 30s       # Wait for more alerts before sending
  group_interval: 5m    # How often to send grouped notifications
  repeat_interval: 4h   # Repeat notification if still firing
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true  # Also send to Slack
    # Critical alerts → Slack #incidents
    - match:
        severity: critical
      receiver: slack-incidents
    # Warning alerts → Slack #alerts
    - match:
        severity: warning
      receiver: slack-alerts
    # Service-specific routing
    - match:
        service: payment-service
      receiver: payments-team
      routes:
        - match:
            severity: critical
          receiver: payments-oncall

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <key>
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
  - name: slack-incidents
    slack_configs:
      - api_url: <webhook>
        channel: '#incidents'
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"
        actions:
          - type: button
            text: "View Dashboard"
            url: "{{ .CommonAnnotations.dashboard }}"
          - type: button
            text: "Runbook"
            url: "{{ .CommonAnnotations.runbook }}"
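Beyond routing, AlertManager can suppress redundant notifications with inhibition rules. A minimal sketch, assuming the same severity labels as the routes above (field names follow recent AlertManager versions): while a critical alert fires for a service, warning alerts for that same service are silenced, so on-call sees one page instead of a cascade.

```yaml
inhibit_rules:
  # Suppress warnings for a service while a critical alert for it is firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [service]  # Only inhibit when both alerts share the same service label
```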
Conclusion
Monitoring and observability separate high-performing engineering teams from reactive firefighters—observability-mature organizations detect incidents 10x faster (3 minutes vs 30 minutes), resolve them 5x quicker (15 minutes vs 75 minutes), and reduce alert fatigue 40-50% by replacing symptom-based threshold alerts with SLO-based alerting. A comprehensive observability strategy spanning metrics collection (Prometheus, PromQL), distributed tracing (OpenTelemetry, Jaeger), structured logging (Loki, LogQL), service level objectives (SLIs, SLOs, error budgets), and intelligent alerting (burn-rate alerts, contextual routing) enables teams to achieve 99.99% availability and a mean time to resolution under 15 minutes.
Production-proven practices demonstrate concrete impact: Netflix serves 230+ million subscribers with 99.99% availability through observability-driven incident response, Uber debugs latency across 2,000+ microservices while processing 100+ million traces daily, and Shopify monitors 10+ million metrics per second to detect anomalies before customer impact. Core principles—instrument everything but alert on user impact, correlate signals (metrics + traces + logs) for faster diagnosis, define SLOs from the user's perspective rather than infrastructure thresholds, and design actionable alerts with clear runbooks—separate observability that enables proactive optimization from monitoring that merely generates alert fatigue and constant firefighting.
Observability is ultimately neither a monitoring tool selection nor a dashboard-creation exercise but a cultural shift that treats system understanding as a first-class engineering requirement, demanding instrumentation in application code, correlation across telemetry signals, and continuous refinement based on incident learnings. Teams that embed observability culture—reviewing SLO compliance weekly, conducting blameless postmortems that improve instrumentation, maintaining runbooks linking alerts to remediation steps, celebrating reduced MTTR as a key performance metric—prevent production crises through proactive detection rather than reactive escalation when customers report outages. Whether operating a monolithic application serving thousands or distributed systems supporting millions, treating observability as a product requirement determines whether an engineering team spends its time building features or debugging production incidents.
Written by StaticBlock Editorial
StaticBlock Editorial is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.