Production Observability: A Complete Guide to OpenTelemetry, Prometheus, and Grafana
Build comprehensive production observability with OpenTelemetry, Prometheus, and Grafana. Learn instrumentation strategies, metrics collection, distributed tracing, log aggregation, custom dashboards, alerting rules, and performance optimization, with real-world Node.js and Python examples, Kubernetes deployment patterns, and SLO/SLI monitoring.
Introduction
Observability is the ability to understand what's happening inside your systems by examining their outputs. Unlike traditional monitoring, which answers "is the system working?", observability answers "why isn't it working?" and "what changed?"
The modern observability stack—OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization—has become the de facto standard for production systems in 2025. This combination provides comprehensive insights into distributed applications while remaining vendor-neutral and open-source.
This guide covers everything from basic instrumentation to advanced distributed tracing, custom metrics, alerting strategies, and performance optimization for high-traffic production environments.
The Three Pillars of Observability
1. Metrics (What's happening?)
Definition: Numerical values measured over time (e.g., request rate, error count, latency)
Use cases:
- Real-time dashboards
- Alerting on thresholds
- Capacity planning
- SLO/SLI tracking
Example: HTTP request duration histogram, memory usage gauge, database query count
2. Logs (What happened?)
Definition: Discrete event records with timestamp and contextual information
Use cases:
- Debugging specific issues
- Audit trails
- Error investigation
- User behavior analysis
Example: "User 12345 failed login attempt from IP 192.168.1.1 at 2025-11-23T10:15:30Z"
3. Traces (How did it flow?)
Definition: End-to-end journey of a request through distributed systems
Use cases:
- Understanding service dependencies
- Identifying bottlenecks
- Root cause analysis
- Performance optimization
Example: Web request → API Gateway → Auth Service → Database → Cache → Response (with timing for each step)
Architecture Overview
┌─────────────┐
│ Application │
│ (Your Code) │
└──────┬──────┘
       │ OpenTelemetry SDK
       │ (Auto/Manual Instrumentation)
       ▼
┌──────────────────────┐
│   OpenTelemetry      │
│     Collector        │
│  (Receive/Process/   │
│      Export)         │
└─────┬────────────────┘
      │
      ├─────────────┬──────────────┐
      ▼             ▼              ▼
┌──────────┐   ┌─────────┐   ┌────────────┐
│Prometheus│   │ Jaeger  │   │    Loki    │
│(Metrics) │   │(Traces) │   │   (Logs)   │
└────┬─────┘   └────┬────┘   └─────┬──────┘
     │              │              │
     └──────────────┼──────────────┘
                    ▼
            ┌──────────────┐
            │   Grafana    │
            │(Visualization│
            │ & Alerting)  │
            └──────────────┘
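The Collector in the middle of this topology needs its own configuration wiring receivers to backends. A minimal sketch is shown below; the endpoint addresses are assumptions for a typical cluster, and the Prometheus and Loki exporters require the Collector contrib distribution (`otelcol-contrib`):

```yaml
# otel-collector-config.yml (sketch — endpoints assume in-cluster DNS names)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # scraped by Prometheus
  otlp/jaeger:
    endpoint: jaeger:4317    # recent Jaeger versions accept OTLP natively
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```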
Setting Up OpenTelemetry
Installation and Basic Configuration
Node.js/TypeScript Example
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-prometheus \
  @opentelemetry/exporter-jaeger \
  @opentelemetry/api
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Define service metadata
const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  })
);

// Prometheus metrics exporter
const prometheusExporter = new PrometheusExporter({
  port: 9464, // Metrics exposed on :9464/metrics
  endpoint: '/metrics',
});

// Jaeger trace exporter (the config keys are host/port, not agentHost/agentPort)
const jaegerExporter = new JaegerExporter({
  host: process.env.JAEGER_AGENT_HOST || 'localhost',
  port: parseInt(process.env.JAEGER_AGENT_PORT || '6832'),
});

// Initialize SDK with auto-instrumentation
const sdk = new NodeSDK({
  resource,
  metricReader: prometheusExporter,
  traceExporter: jaegerExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Fine-tune auto-instrumentation
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/metrics'],
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true, // PostgreSQL instrumentation
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
    }),
  ],
});

// Start SDK
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

export default sdk;
// app.ts
import './tracing'; // Must be first import!
import express from 'express';
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const app = express();
app.use(express.json()); // Needed so req.body is populated
const tracer = trace.getTracer('payment-service');

app.post('/api/checkout', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('checkout.process');

  try {
    // Add custom attributes
    span.setAttribute('user.id', req.body.userId);
    span.setAttribute('cart.total', req.body.total);
    span.setAttribute('payment.method', req.body.paymentMethod);

    // Business logic with automatic child spans from auto-instrumentation
    const payment = await processPayment(req.body);
    const order = await createOrder(payment);

    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('order.id', order.id);

    res.json({ success: true, orderId: order.id });
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    res.status(500).json({ success: false, error: error.message });
  } finally {
    span.end();
  }
});

app.listen(3000);
Python Example with FastAPI
# instrumentation.py
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

def setup_observability(app):
    # Define resource attributes
    resource = Resource.create({
        "service.name": "user-service",
        "service.version": "2.1.0",
        "deployment.environment": os.getenv("ENV", "production"),
    })

    # Set up tracer provider
    trace.set_tracer_provider(TracerProvider(resource=resource))
    tracer_provider = trace.get_tracer_provider()

    # Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("JAEGER_AGENT_HOST", "localhost"),
        agent_port=int(os.getenv("JAEGER_AGENT_PORT", "6831")),
    )
    tracer_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))

    # Instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Instrument SQLAlchemy
    SQLAlchemyInstrumentor().instrument()

    # Instrument Redis
    RedisInstrumentor().instrument()

    return trace.get_tracer(__name__)
# main.py
from fastapi import FastAPI, HTTPException
from opentelemetry import trace

from instrumentation import setup_observability

app = FastAPI()
tracer = setup_observability(app)

@app.post("/api/users")
async def create_user(user: UserCreate):
    with tracer.start_as_current_span("create_user") as span:
        span.set_attribute("user.email", user.email)
        span.set_attribute("user.role", user.role)
        try:
            # Validate email
            with tracer.start_as_current_span("validate_email"):
                if not is_valid_email(user.email):
                    raise ValueError("Invalid email format")

            # Check existing user
            with tracer.start_as_current_span("check_duplicate"):
                existing = await db.query(User).filter_by(email=user.email).first()
                if existing:
                    raise HTTPException(400, "User already exists")

            # Create user
            with tracer.start_as_current_span("database.insert"):
                new_user = User(**user.dict())
                db.add(new_user)
                await db.commit()

            span.set_attribute("user.id", new_user.id)
            return {"id": new_user.id, "email": new_user.email}
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise
Custom Metrics with Prometheus
Metric Types
1. Counter: Monotonically increasing value (e.g., total requests)
2. Gauge: Value that can go up or down (e.g., active connections)
3. Histogram: Distribution of values (e.g., request duration)
4. Summary: Similar to histogram but with quantiles
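On the scrape endpoint each type has a distinct exposition format. A histogram, for example, is exported as cumulative `_bucket` series plus `_sum` and `_count` (sample values illustrative):

```
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1"} 325
http_request_duration_seconds_bucket{le="+Inf"} 330
http_request_duration_seconds_sum 98.4
http_request_duration_seconds_count 330
```

Because buckets are cumulative, quantiles are computed at query time with `histogram_quantile()` and can be aggregated across instances — the main reason histograms are usually preferred over summaries, whose pre-computed quantiles cannot be meaningfully averaged.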
Implementing Custom Metrics
// metrics.ts
import { metrics } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// PrometheusExporter is itself a pull-based MetricReader, so it is passed
// directly as a reader — no PeriodicExportingMetricReader wrapper is needed
const prometheusExporter = new PrometheusExporter({ port: 9464 });

const meterProvider = new MeterProvider({
  readers: [prometheusExporter],
});

metrics.setGlobalMeterProvider(meterProvider);
const meter = metrics.getMeter('payment-service');

// Counter: Total processed payments
export const paymentsProcessed = meter.createCounter('payments_processed_total', {
  description: 'Total number of payment transactions processed',
  unit: 'transactions',
});

// Counter with labels
export const paymentsProcessedByMethod = meter.createCounter(
  'payments_processed_by_method_total',
  { description: 'Payments processed by payment method' }
);

// Gauge: Active payment processing
export const activePayments = meter.createUpDownCounter('payments_active', {
  description: 'Number of payments currently being processed',
  unit: 'payments',
});

// Histogram: Payment processing duration
export const paymentDuration = meter.createHistogram('payment_duration_seconds', {
  description: 'Time taken to process payments',
  unit: 'seconds',
});

// Histogram: Payment amounts (custom buckets can be configured with a View on the MeterProvider)
export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Distribution of payment amounts',
  unit: 'dollars',
});

// Observable Gauge: System metrics
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage',
  unit: 'bytes',
});

memoryUsage.addCallback((observableResult) => {
  const usage = process.memoryUsage();
  observableResult.observe(usage.heapUsed, { type: 'heap_used' });
  observableResult.observe(usage.heapTotal, { type: 'heap_total' });
  observableResult.observe(usage.rss, { type: 'rss' });
});
Using Metrics in Application Code
// payment-service.ts
import {
  paymentsProcessed,
  paymentsProcessedByMethod,
  activePayments,
  paymentDuration,
  paymentAmount,
} from './metrics';

export async function processPayment(paymentData: PaymentRequest) {
  const startTime = Date.now();

  // Increment active payments
  activePayments.add(1);

  try {
    // Process payment
    const result = await chargeCard(paymentData);

    // Record successful payment
    paymentsProcessed.add(1, {
      status: 'success',
      currency: paymentData.currency,
    });
    paymentsProcessedByMethod.add(1, {
      method: paymentData.method, // 'credit_card', 'paypal', etc.
      status: 'success',
    });

    // Record payment amount
    paymentAmount.record(paymentData.amount, {
      currency: paymentData.currency,
      method: paymentData.method,
    });

    return result;
  } catch (error) {
    // Record failed payment
    paymentsProcessed.add(1, {
      status: 'failed',
      error_type: error.code,
    });
    paymentsProcessedByMethod.add(1, {
      method: paymentData.method,
      status: 'failed',
    });
    throw error;
  } finally {
    // Record duration
    const duration = (Date.now() - startTime) / 1000;
    paymentDuration.record(duration, {
      method: paymentData.method,
    });

    // Decrement active payments
    activePayments.add(-1);
  }
}
Distributed Tracing Patterns
Context Propagation Across Services
// api-gateway.ts (Service A)
import { context, propagation, trace } from '@opentelemetry/api';
import axios from 'axios';

app.post('/api/orders', async (req, res) => {
  const span = trace.getActiveSpan();

  // Prepare headers for downstream services
  const headers: Record<string, string> = {};

  // Inject trace context into headers
  propagation.inject(context.active(), headers);

  try {
    // Call payment service with trace context
    const paymentResult = await axios.post(
      'http://payment-service:3001/api/charge',
      req.body.payment,
      { headers }
    );

    // Call inventory service with same trace context
    const inventoryResult = await axios.post(
      'http://inventory-service:3002/api/reserve',
      req.body.items,
      { headers }
    );

    res.json({ success: true, data: { paymentResult, inventoryResult } });
  } catch (error) {
    span?.recordException(error);
    res.status(500).json({ error: error.message });
  }
});
// payment-service.ts (Service B)
import { context, propagation, trace } from '@opentelemetry/api';

app.post('/api/charge', (req, res) => {
  // Extract trace context from incoming headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Continue trace from parent span
  context.with(extractedContext, () => {
    const span = trace.getActiveSpan();

    // This span will be a child of the gateway's span
    span?.setAttribute('payment.amount', req.body.amount);
    span?.setAttribute('payment.method', req.body.method);

    // Process payment...
  });
});
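What `propagation.inject` actually writes (with the default W3C Trace Context propagator) is a `traceparent` header, which is what lets Service B reconstruct the parent span context:

```
# version - trace-id (32 hex chars) - parent span-id (16 hex chars) - flags (01 = sampled)
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Any service that receives this header — regardless of language — can continue the same trace, which is why all services in a call chain should use the same propagation format.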
Trace Sampling Strategies
// sampling-config.ts
import {
  AlwaysOnSampler,
  ParentBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Sample 10% of traces in production
const productionSampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Sample 100% of error traces
class ErrorSampler implements Sampler {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample if error attribute is present
    if (attributes['error'] || attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 10% of normal traces
    return new TraceIdRatioBasedSampler(0.1).shouldSample(
      context,
      traceId,
      spanName,
      spanKind,
      attributes,
      links
    );
  }

  toString() {
    return 'ErrorSampler';
  }
}

const sdk = new NodeSDK({
  sampler: process.env.NODE_ENV === 'production'
    ? new ErrorSampler()
    : new ParentBasedSampler({ root: new AlwaysOnSampler() }),
});
Prometheus Configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production-us-east-1'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configs
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application services (Kubernetes service discovery)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Add pod metadata as labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      - source_labels: [__meta_kubernetes_pod_label_version]
        action: replace
        target_label: version

  # Static scrape for specific services
  - job_name: 'payment-service'
    static_configs:
      - targets:
          - 'payment-service:9464'
        labels:
          service: 'payment'
          team: 'checkout'

  - job_name: 'user-service'
    static_configs:
      - targets:
          - 'user-service:9464'
        labels:
          service: 'user'
          team: 'identity'
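For the `kubernetes-pods` job above to discover a service, its pod template must carry the matching annotations — for example (values illustrative):

```yaml
# Pod template metadata, e.g. inside a Deployment
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9464'
    prometheus.io/path: '/metrics'
  labels:
    app: payment-service
    version: '1.2.0'
```

The relabel rules then translate these annotations into the scrape address and path, and copy the `app` and `version` labels onto every ingested series.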
Alert Rules
# /etc/prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          team: '{{ $labels.team }}'
        annotations:
          summary: 'High error rate on {{ $labels.service }}'
          description: '{{ $labels.service }} has {{ $value | humanizePercentage }} error rate over the last 5 minutes'

      # High latency (p95)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High latency on {{ $labels.service }}'
          description: 'P95 latency is {{ $value }}s on {{ $labels.service }}'

      # Service down
      - alert: ServiceDown
        expr: up{job="kubernetes-pods"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Service {{ $labels.kubernetes_pod_name }} is down'
          description: 'Service has been down for more than 2 minutes'

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            process_memory_bytes{type="heap_used"}
            /
            process_memory_bytes{type="heap_total"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High memory usage on {{ $labels.service }}'
          description: 'Memory usage is at {{ $value | humanizePercentage }}'

      # Database connection pool exhaustion
      - alert: DatabasePoolExhaustion
        expr: |
          (
            database_connections_active
            /
            database_connections_max
          ) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Database connection pool nearly exhausted'
          description: '{{ $value | humanizePercentage }} of database connections in use'

  - name: slo_alerts
    interval: 1m
    rules:
      # SLO: 99.9% availability
      - alert: SLOViolation
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          ) < 0.999
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: 'SLO violation: Availability below 99.9%'
          description: 'Current availability: {{ $value | humanizePercentage }}'

      # Error budget burn rate
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (0.001 * 14.4)
        labels:
          severity: warning
        annotations:
          summary: 'Burning error budget too quickly'
          description: 'At current rate, will exhaust monthly error budget in < 2 days'
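These rules fire into Alertmanager, which decides who gets notified. A minimal routing sketch (receiver names, the Slack URL, and the PagerDuty key are placeholders):

```yaml
# alertmanager.yml (sketch — receivers and credentials are placeholders)
route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Page on-call only for critical alerts
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall

receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/placeholder
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: placeholder-integration-key
```

Grouping by `alertname` and `service` keeps a cascading failure from paging once per pod, while the severity-based route ensures warnings stay in chat and only critical alerts wake someone up.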
Grafana Dashboards
Kubernetes Deployment for Observability Stack
# grafana-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.2.0
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: 'grafana-piechart-panel,grafana-clock-panel'
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: grafana-dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
      - name: Jaeger
        type: jaeger
        access: proxy
        url: http://jaeger-query:16686
        editable: false
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false
Sample Dashboard JSON (Service Overview)
{
  "dashboard": {
    "title": "Service Overview - Payment Service",
    "tags": ["payment", "microservices"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.05], "type": "gt" },
              "query": { "params": ["A", "5m", "now"] }
            }
          ]
        }
      },
      {
        "title": "Latency (p50, p95, p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      },
      {
        "title": "Active Payments",
        "type": "stat",
        "targets": [
          {
            "expr": "payments_active{service=\"payment-service\"}"
          }
        ],
        "gridPos": { "x": 0, "y": 16, "w": 6, "h": 4 }
      },
      {
        "title": "Payment Methods Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(rate(payments_processed_by_method_total[5m])) by (method)",
            "legendFormat": "{{ method }}"
          }
        ],
        "gridPos": { "x": 6, "y": 16, "w": 6, "h": 8 }
      }
    ]
  }
}
Advanced Observability Patterns
RED Method (Request, Error, Duration)
# Request Rate
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Error Rate (5xx responses over all responses)
sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Duration (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
USE Method (Utilization, Saturation, Errors)
# CPU Utilization
rate(process_cpu_seconds_total[5m])

# Memory Saturation (heap used over heap total)
process_memory_bytes{type="heap_used"} / process_memory_bytes{type="heap_total"}

# Database Connection Saturation
database_connections_active / database_connections_max

# Error Rate
rate(database_errors_total[5m])
Golden Signals (Latency, Traffic, Errors, Saturation)
# Latency (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Traffic (requests/second)
sum(rate(http_requests_total[5m])) by (service)

# Errors (error rate: 5xx responses over all responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# Saturation (resource utilization)
avg(rate(process_cpu_seconds_total[5m])) by (service)
Service Level Objectives (SLOs)
Defining SLIs and SLOs
# slo-config.yml
slos:
  - name: api_availability
    description: 'API endpoint availability'
    sli:
      ratio_query: |
        sum(rate(http_requests_total{job="api",status!~"5.."}[30d]))
        /
        sum(rate(http_requests_total{job="api"}[30d]))
    target: 0.999 # 99.9%
    window: 30d
    error_budget: 0.001 # 0.1%

  - name: api_latency
    description: 'API latency p95 < 500ms'
    sli:
      threshold_query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < 0.5
    target: 0.995 # 99.5% of requests under 500ms
    window: 30d
Error Budget Calculation
// error-budget.ts
interface SLO {
  target: number; // e.g., 0.999 (99.9%)
  windowDays: number; // e.g., 30
}

function calculateErrorBudget(slo: SLO): {
  allowedFailures: number;
  allowedFailurePercentage: number;
} {
  const allowedFailurePercentage = 1 - slo.target;

  // Assuming 1M requests per month
  const totalRequests = 1_000_000;
  const allowedFailures = totalRequests * allowedFailurePercentage;

  return {
    allowedFailures: Math.floor(allowedFailures),
    allowedFailurePercentage: allowedFailurePercentage * 100,
  };
}

// Example: 99.9% SLO over 30 days
const budget = calculateErrorBudget({ target: 0.999, windowDays: 30 });
console.log(`Allowed failures: ${budget.allowedFailures}`);
// Output: Allowed failures: 1000 (0.1% of 1M requests)
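The `0.001 * 14.4` threshold in the HighErrorBudgetBurnRate alert earlier is a burn rate: how fast the budget is being spent relative to a rate that would exhaust it exactly at the end of the window. A sketch of the arithmetic (function names are illustrative, not from any library):

```typescript
// Burn rate: observed error rate relative to the error budget (1 - SLO target).
// A burn rate of 1 spends the budget exactly over the full window.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Hours until the budget is exhausted at the current burn rate.
function hoursToExhaustion(burn: number, windowDays: number): number {
  return (windowDays * 24) / burn;
}

// Example: 99.9% SLO, 1.44% of requests failing over the last hour
const br = burnRate(0.0144, 0.999);    // 0.0144 / 0.001 = 14.4
const hrs = hoursToExhaustion(br, 30); // 720 / 14.4 = 50 hours, about 2 days
```

This is where the alert's "< 2 days" annotation comes from: a sustained burn rate above 14.4 spends a 30-day budget in roughly 50 hours.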
Performance Optimization
Reducing Metric Cardinality
// ❌ BAD: High cardinality (user_id as label)
paymentsProcessed.add(1, {
  user_id: '12345', // Millions of unique values!
  status: 'success',
});

// ✅ GOOD: Low cardinality (user_tier instead)
paymentsProcessed.add(1, {
  user_tier: 'premium', // Limited values: 'free', 'premium', 'enterprise'
  status: 'success',
});

// ❌ BAD: Timestamp as label
requestCounter.add(1, {
  timestamp: Date.now().toString(), // Infinite cardinality!
});

// ✅ GOOD: Use metric timestamp, not labels
requestCounter.add(1, {
  method: 'POST',
  endpoint: '/api/checkout',
});
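The cost of a label set is multiplicative: a metric produces one time series per combination of label values. A quick back-of-the-envelope check makes the explosion concrete (`seriesCount` is an illustrative helper, not a library function):

```typescript
// Series count for one metric = product of each label's distinct-value count.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((total, n) => total * n, 1);
}

// payments_processed_total with status (3), currency (10), method (4) values:
const fine = seriesCount([3, 10, 4]); // 120 series — cheap

// The same metric with a user_id label (1M users) explodes:
const explosion = seriesCount([3, 10, 4, 1_000_000]); // 120,000,000 series
```

Every one of those series costs memory and disk in Prometheus, which is why bounded labels like `user_tier` are safe and unbounded identifiers like `user_id` belong in logs and traces instead.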
Sampling High-Volume Traces
// Dynamic sampling based on error status
class SmartSampler implements Sampler {
  private baseSampleRate = 0.01; // 1% base rate

  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample errors
    if (attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (> 1s)
    if (attributes['http.duration'] > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample critical endpoints at a higher rate
    if (attributes['http.route'] === '/api/checkout') {
      return this.sampleAtRate(0.1); // 10% for checkout
    }

    // Sample everything else at the base rate
    return this.sampleAtRate(this.baseSampleRate);
  }

  private sampleAtRate(rate: number): SamplingResult {
    return Math.random() < rate
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
}
Efficient Log Aggregation
// Use structured logging with levels
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Only log to file in production
  ...(process.env.NODE_ENV === 'production' && {
    transport: {
      target: 'pino/file',
      options: { destination: '/var/log/app.log' },
    },
  }),
});

// Add trace context to logs automatically
app.use((req, res, next) => {
  const span = trace.getActiveSpan();
  if (span) {
    req.log = logger.child({
      trace_id: span.spanContext().traceId,
      span_id: span.spanContext().spanId,
    });
  }
  next();
});

// Structured log with correct level
logger.info(
  {
    user_id: user.id,
    action: 'checkout',
    amount: payment.amount,
    duration_ms: performance.now() - startTime,
  },
  'Payment processed successfully'
);
Troubleshooting Common Issues
Missing Traces
Problem: Traces not appearing in Jaeger
Solutions:
- Check OpenTelemetry Collector is running:
kubectl get pods -n monitoring | grep otel-collector
- Verify trace export configuration:
// Ensure the exporter points at the right agent
const jaegerExporter = new JaegerExporter({
  host: 'jaeger-agent', // Correct hostname
  port: 6831,
});
- Check sampling rate:
// Verify you're not sampling out all traces
const sampler = new TraceIdRatioBasedSampler(1.0); // 100% for debugging
- Enable debug logging:
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
High Memory Usage
Problem: Prometheus consuming excessive memory
Solutions:
- Reduce retention period:
# Retention is set via Prometheus startup flags, not prometheus.yml
--storage.tsdb.retention.time=15d   # Reduce from 30d
--storage.tsdb.retention.size=50GB
- Decrease scrape frequency for low-priority services:
scrape_configs:
  - job_name: 'background-jobs'
    scrape_interval: 60s # Less frequent than default 15s
- Limit metric cardinality with relabeling:
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop the high-cardinality user_id label
      # (labeldrop matches label *names* via regex; it takes no source_labels)
      - regex: user_id
        action: labeldrop
Dashboard Loading Slowly
Problem: Grafana dashboards take 30+ seconds to load
Solutions:
- Reduce query time range:
# Instead of a 24h range vector
rate(http_requests_total[24h])

# Use a shorter range
rate(http_requests_total[5m])
- Use recording rules for complex queries:
# /etc/prometheus/rules/recording.yml
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
- Use dashboard variables to limit scope:
{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "multi": false
      }
    ]
  }
}
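With the `service` variable defined, panel queries reference it so Grafana fetches only the selected service's series rather than every series behind the metric:

```
sum(rate(http_requests_total{service="$service"}[5m])) by (status)
```

This keeps each dashboard load scoped to one service, which is usually the single biggest win for load time on high-cardinality metrics.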
Best Practices
- Start with auto-instrumentation, add custom spans only where needed
- Use consistent naming across metrics, traces, and logs
- Tag resources with environment, service, version
- Implement health checks separate from business metrics
- Set up alerting early, refine thresholds over time
- Monitor the monitors - ensure Prometheus/Grafana are healthy
- Document SLOs and share with stakeholders
- Review dashboards regularly - remove stale panels
- Use exemplars to link metrics to traces
- Test observability in staging before production
Conclusion
Comprehensive observability is essential for operating modern distributed systems. The combination of OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization provides a powerful, vendor-neutral stack that scales from small applications to large microservices architectures.
By implementing the patterns in this guide—proper instrumentation, meaningful metrics, distributed tracing, effective alerting, and SLO monitoring—you gain deep insights into system behavior, enabling faster debugging, proactive issue detection, and data-driven capacity planning.
Key Takeaways:
- Instrument comprehensively - metrics, traces, and logs working together
- Keep cardinality low - avoid high-cardinality labels in metrics
- Sample intelligently - higher rates for errors and slow requests
- Define SLOs early - measure what matters to users
- Alert on symptoms, not causes - focus on user impact
- Optimize queries - use recording rules for expensive calculations
- Maintain dashboards - keep them focused and actionable
- Trace context matters - propagate across all services
With robust observability, you transform from reactive firefighting to proactive system management.
Additional Resources
- OpenTelemetry: https://opentelemetry.io/
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Jaeger: https://www.jaegertracing.io/
- SLO Best Practices: https://sre.google/workbook/implementing-slos/
- RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- USE Method: http://www.brendangregg.com/usemethod.html
Written by StaticBlock Editorial
StaticBlock Editorial is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.