Production Observability: A Complete Guide to OpenTelemetry, Prometheus, and Grafana
Build comprehensive production observability with OpenTelemetry, Prometheus, and Grafana. Learn instrumentation strategies, metrics collection, distributed tracing, log aggregation, custom dashboards, alerting rules, and performance optimization, with real-world Node.js and Python examples, Kubernetes deployment patterns, and SLO/SLI monitoring.
Introduction
Observability is the ability to understand what's happening inside your systems by examining their outputs. Unlike traditional monitoring, which answers "is the system working?", observability answers "why isn't it working?" and "what changed?"
The modern observability stack—OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization—has become the de facto standard for production systems in 2025. This combination provides comprehensive insights into distributed applications while remaining vendor-neutral and open-source.
This guide covers everything from basic instrumentation to advanced distributed tracing, custom metrics, alerting strategies, and performance optimization for high-traffic production environments.
The Three Pillars of Observability
1. Metrics (What's happening?)
Definition: Numerical values measured over time (e.g., request rate, error count, latency)
Use cases:
- Real-time dashboards
- Alerting on thresholds
- Capacity planning
- SLO/SLI tracking
Example: HTTP request duration histogram, memory usage gauge, database query count
2. Logs (What happened?)
Definition: Discrete event records with timestamp and contextual information
Use cases:
- Debugging specific issues
- Audit trails
- Error investigation
- User behavior analysis
Example: "User 12345 failed login attempt from IP 192.168.1.1 at 2025-11-23T10:15:30Z"
3. Traces (How did it flow?)
Definition: End-to-end journey of a request through distributed systems
Use cases:
- Understanding service dependencies
- Identifying bottlenecks
- Root cause analysis
- Performance optimization
Example: Web request → API Gateway → Auth Service → Database → Cache → Response (with timing for each step)
Architecture Overview
┌─────────────┐
│ Application │
│ (Your Code) │
└──────┬──────┘
       │ OpenTelemetry SDK
       │ (Auto/Manual Instrumentation)
       ▼
┌──────────────────────┐
│   OpenTelemetry      │
│     Collector        │
│  (Receive/Process/   │
│      Export)         │
└─────┬────────────────┘
      │
      ├─────────────┬──────────────┐
      ▼             ▼              ▼
┌──────────┐   ┌─────────┐   ┌────────────┐
│Prometheus│   │ Jaeger  │   │    Loki    │
│(Metrics) │   │(Traces) │   │   (Logs)   │
└────┬─────┘   └────┬────┘   └─────┬──────┘
     │              │              │
     └──────────────┼──────────────┘
                    ▼
            ┌──────────────┐
            │   Grafana    │
            │(Visualization│
            │ & Alerting)  │
            └──────────────┘
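The Collector in the middle of this topology needs its own configuration wiring receivers to backends. A minimal sketch is shown below; the endpoint addresses are assumptions for a typical cluster, and the Prometheus and Loki exporters require the Collector contrib distribution (`otelcol-contrib`):

```yaml
# otel-collector-config.yml (sketch — endpoints assume in-cluster DNS names)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # scraped by Prometheus
  otlp/jaeger:
    endpoint: jaeger:4317    # recent Jaeger versions accept OTLP natively
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```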
Setting Up OpenTelemetry
Installation and Basic Configuration
Node.js/TypeScript Example
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-prometheus \
  @opentelemetry/exporter-jaeger \
  @opentelemetry/api
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Define service metadata
const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  })
);

// Prometheus metrics exporter
const prometheusExporter = new PrometheusExporter({
  port: 9464, // Metrics exposed on :9464/metrics
  endpoint: '/metrics',
});

// Jaeger trace exporter (the config keys are host/port, not agentHost/agentPort)
const jaegerExporter = new JaegerExporter({
  host: process.env.JAEGER_AGENT_HOST || 'localhost',
  port: parseInt(process.env.JAEGER_AGENT_PORT || '6832'),
});

// Initialize SDK with auto-instrumentation
const sdk = new NodeSDK({
  resource,
  metricReader: prometheusExporter,
  traceExporter: jaegerExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Fine-tune auto-instrumentation
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/metrics'],
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true, // PostgreSQL instrumentation
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
    }),
  ],
});

// Start SDK
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

export default sdk;
// app.ts
import './tracing'; // Must be first import!
import express from 'express';
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const app = express();
app.use(express.json()); // Needed so req.body is populated
const tracer = trace.getTracer('payment-service');

app.post('/api/checkout', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('checkout.process');

  try {
    // Add custom attributes
    span.setAttribute('user.id', req.body.userId);
    span.setAttribute('cart.total', req.body.total);
    span.setAttribute('payment.method', req.body.paymentMethod);

    // Business logic with automatic child spans from auto-instrumentation
    const payment = await processPayment(req.body);
    const order = await createOrder(payment);

    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('order.id', order.id);

    res.json({ success: true, orderId: order.id });
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    res.status(500).json({ success: false, error: error.message });
  } finally {
    span.end();
  }
});

app.listen(3000);
Python Example with FastAPI
# instrumentation.py
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

def setup_observability(app):
    # Define resource attributes
    resource = Resource.create({
        "service.name": "user-service",
        "service.version": "2.1.0",
        "deployment.environment": os.getenv("ENV", "production"),
    })

    # Set up tracer provider
    trace.set_tracer_provider(TracerProvider(resource=resource))
    tracer_provider = trace.get_tracer_provider()

    # Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("JAEGER_AGENT_HOST", "localhost"),
        agent_port=int(os.getenv("JAEGER_AGENT_PORT", "6831")),
    )
    tracer_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))

    # Instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Instrument SQLAlchemy
    SQLAlchemyInstrumentor().instrument()

    # Instrument Redis
    RedisInstrumentor().instrument()

    return trace.get_tracer(__name__)
# main.py
from fastapi import FastAPI, HTTPException
from opentelemetry import trace

from instrumentation import setup_observability

app = FastAPI()
tracer = setup_observability(app)

@app.post("/api/users")
async def create_user(user: UserCreate):
    with tracer.start_as_current_span("create_user") as span:
        span.set_attribute("user.email", user.email)
        span.set_attribute("user.role", user.role)
        try:
            # Validate email
            with tracer.start_as_current_span("validate_email"):
                if not is_valid_email(user.email):
                    raise ValueError("Invalid email format")

            # Check existing user
            with tracer.start_as_current_span("check_duplicate"):
                existing = await db.query(User).filter_by(email=user.email).first()
                if existing:
                    raise HTTPException(400, "User already exists")

            # Create user
            with tracer.start_as_current_span("database.insert"):
                new_user = User(**user.dict())
                db.add(new_user)
                await db.commit()

            span.set_attribute("user.id", new_user.id)
            return {"id": new_user.id, "email": new_user.email}
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise
Custom Metrics with Prometheus
Metric Types
1. Counter: Monotonically increasing value (e.g., total requests)
2. Gauge: Value that can go up or down (e.g., active connections)
3. Histogram: Distribution of values (e.g., request duration)
4. Summary: Similar to histogram but with quantiles
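On the scrape endpoint each type has a distinct exposition format. A histogram, for example, is exported as cumulative `_bucket` series plus `_sum` and `_count` (sample values illustrative):

```
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1"} 325
http_request_duration_seconds_bucket{le="+Inf"} 330
http_request_duration_seconds_sum 98.4
http_request_duration_seconds_count 330
```

Because buckets are cumulative, quantiles are computed at query time with `histogram_quantile()` and can be aggregated across instances — the main reason histograms are usually preferred over summaries, whose pre-computed quantiles cannot be meaningfully averaged.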
Implementing Custom Metrics
// metrics.ts
import { metrics } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// PrometheusExporter is itself a pull-based MetricReader, so it is passed
// directly as a reader — no PeriodicExportingMetricReader wrapper is needed
const prometheusExporter = new PrometheusExporter({ port: 9464 });

const meterProvider = new MeterProvider({
  readers: [prometheusExporter],
});

metrics.setGlobalMeterProvider(meterProvider);
const meter = metrics.getMeter('payment-service');

// Counter: Total processed payments
export const paymentsProcessed = meter.createCounter('payments_processed_total', {
  description: 'Total number of payment transactions processed',
  unit: 'transactions',
});

// Counter with labels
export const paymentsProcessedByMethod = meter.createCounter(
  'payments_processed_by_method_total',
  { description: 'Payments processed by payment method' }
);

// Gauge: Active payment processing
export const activePayments = meter.createUpDownCounter('payments_active', {
  description: 'Number of payments currently being processed',
  unit: 'payments',
});

// Histogram: Payment processing duration
export const paymentDuration = meter.createHistogram('payment_duration_seconds', {
  description: 'Time taken to process payments',
  unit: 'seconds',
});

// Histogram: Payment amounts (custom buckets can be configured with a View on the MeterProvider)
export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Distribution of payment amounts',
  unit: 'dollars',
});

// Observable Gauge: System metrics
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage',
  unit: 'bytes',
});

memoryUsage.addCallback((observableResult) => {
  const usage = process.memoryUsage();
  observableResult.observe(usage.heapUsed, { type: 'heap_used' });
  observableResult.observe(usage.heapTotal, { type: 'heap_total' });
  observableResult.observe(usage.rss, { type: 'rss' });
});
Using Metrics in Application Code
// payment-service.ts
import {
  paymentsProcessed,
  paymentsProcessedByMethod,
  activePayments,
  paymentDuration,
  paymentAmount,
} from './metrics';

export async function processPayment(paymentData: PaymentRequest) {
  const startTime = Date.now();

  // Increment active payments
  activePayments.add(1);

  try {
    // Process payment
    const result = await chargeCard(paymentData);

    // Record successful payment
    paymentsProcessed.add(1, {
      status: 'success',
      currency: paymentData.currency,
    });
    paymentsProcessedByMethod.add(1, {
      method: paymentData.method, // 'credit_card', 'paypal', etc.
      status: 'success',
    });

    // Record payment amount
    paymentAmount.record(paymentData.amount, {
      currency: paymentData.currency,
      method: paymentData.method,
    });

    return result;
  } catch (error) {
    // Record failed payment
    paymentsProcessed.add(1, {
      status: 'failed',
      error_type: error.code,
    });
    paymentsProcessedByMethod.add(1, {
      method: paymentData.method,
      status: 'failed',
    });
    throw error;
  } finally {
    // Record duration
    const duration = (Date.now() - startTime) / 1000;
    paymentDuration.record(duration, {
      method: paymentData.method,
    });

    // Decrement active payments
    activePayments.add(-1);
  }
}
Distributed Tracing Patterns
Context Propagation Across Services
// api-gateway.ts (Service A)
import { context, propagation, trace } from '@opentelemetry/api';
import axios from 'axios';

app.post('/api/orders', async (req, res) => {
  const span = trace.getActiveSpan();

  // Prepare headers for downstream services
  const headers: Record<string, string> = {};

  // Inject trace context into headers
  propagation.inject(context.active(), headers);

  try {
    // Call payment service with trace context
    const paymentResult = await axios.post(
      'http://payment-service:3001/api/charge',
      req.body.payment,
      { headers }
    );

    // Call inventory service with same trace context
    const inventoryResult = await axios.post(
      'http://inventory-service:3002/api/reserve',
      req.body.items,
      { headers }
    );

    res.json({ success: true, data: { paymentResult, inventoryResult } });
  } catch (error) {
    span?.recordException(error);
    res.status(500).json({ error: error.message });
  }
});
// payment-service.ts (Service B)
import { context, propagation, trace } from '@opentelemetry/api';

app.post('/api/charge', (req, res) => {
  // Extract trace context from incoming headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Continue trace from parent span
  context.with(extractedContext, () => {
    const span = trace.getActiveSpan();

    // This span will be a child of the gateway's span
    span?.setAttribute('payment.amount', req.body.amount);
    span?.setAttribute('payment.method', req.body.method);

    // Process payment...
  });
});
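What `propagation.inject` actually writes (with the default W3C Trace Context propagator) is a `traceparent` header, which is what lets Service B reconstruct the parent span context:

```
# version - trace-id (32 hex chars) - parent span-id (16 hex chars) - flags (01 = sampled)
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Any service that receives this header — regardless of language — can continue the same trace, which is why all services in a call chain should use the same propagation format.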
Trace Sampling Strategies
// sampling-config.ts
import {
  AlwaysOnSampler,
  ParentBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Sample 10% of traces in production
const productionSampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Sample 100% of error traces
class ErrorSampler implements Sampler {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample if error attribute is present
    if (attributes['error'] || attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 10% of normal traces
    return new TraceIdRatioBasedSampler(0.1).shouldSample(
      context,
      traceId,
      spanName,
      spanKind,
      attributes,
      links
    );
  }

  toString() {
    return 'ErrorSampler';
  }
}

const sdk = new NodeSDK({
  sampler: process.env.NODE_ENV === 'production'
    ? new ErrorSampler()
    : new ParentBasedSampler({ root: new AlwaysOnSampler() }),
});
Prometheus Configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production-us-east-1'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configs
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application services (Kubernetes service discovery)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Add pod metadata as labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      - source_labels: [__meta_kubernetes_pod_label_version]
        action: replace
        target_label: version

  # Static scrape for specific services
  - job_name: 'payment-service'
    static_configs:
      - targets:
          - 'payment-service:9464'
        labels:
          service: 'payment'
          team: 'checkout'

  - job_name: 'user-service'
    static_configs:
      - targets:
          - 'user-service:9464'
        labels:
          service: 'user'
          team: 'identity'
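For the `kubernetes-pods` job above to discover a service, its pod template must carry the matching annotations — for example (values illustrative):

```yaml
# Pod template metadata, e.g. inside a Deployment
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9464'
    prometheus.io/path: '/metrics'
  labels:
    app: payment-service
    version: '1.2.0'
```

The relabel rules then translate these annotations into the scrape address and path, and copy the `app` and `version` labels onto every ingested series.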
Alert Rules
# /etc/prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          team: '{{ $labels.team }}'
        annotations:
          summary: 'High error rate on {{ $labels.service }}'
          description: '{{ $labels.service }} has {{ $value | humanizePercentage }} error rate over the last 5 minutes'

      # High latency (p95)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High latency on {{ $labels.service }}'
          description: 'P95 latency is {{ $value }}s on {{ $labels.service }}'

      # Service down
      - alert: ServiceDown
        expr: up{job="kubernetes-pods"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Service {{ $labels.kubernetes_pod_name }} is down'
          description: 'Service has been down for more than 2 minutes'

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            process_memory_bytes{type="heap_used"}
            /
            process_memory_bytes{type="heap_total"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High memory usage on {{ $labels.service }}'
          description: 'Memory usage is at {{ $value | humanizePercentage }}'

      # Database connection pool exhaustion
      - alert: DatabasePoolExhaustion
        expr: |
          (
            database_connections_active
            /
            database_connections_max
          ) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Database connection pool nearly exhausted'
          description: '{{ $value | humanizePercentage }} of database connections in use'

  - name: slo_alerts
    interval: 1m
    rules:
      # SLO: 99.9% availability
      - alert: SLOViolation
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          ) < 0.999
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: 'SLO violation: Availability below 99.9%'
          description: 'Current availability: {{ $value | humanizePercentage }}'

      # Error budget burn rate
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (0.001 * 14.4)
        labels:
          severity: warning
        annotations:
          summary: 'Burning error budget too quickly'
          description: 'At current rate, will exhaust monthly error budget in < 2 days'
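These rules fire into Alertmanager, which decides who gets notified. A minimal routing sketch (receiver names, the Slack URL, and the PagerDuty key are placeholders):

```yaml
# alertmanager.yml (sketch — receivers and credentials are placeholders)
route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Page on-call only for critical alerts
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall

receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/placeholder
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: placeholder-integration-key
```

Grouping by `alertname` and `service` keeps a cascading failure from paging once per pod, while the severity-based route ensures warnings stay in chat and only critical alerts wake someone up.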
Grafana Dashboards
Kubernetes Deployment for Observability Stack
# grafana-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.2.0
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: 'grafana-piechart-panel,grafana-clock-panel'
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: grafana-dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
      - name: Jaeger
        type: jaeger
        access: proxy
        url: http://jaeger-query:16686
        editable: false
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false
Sample Dashboard JSON (Service Overview)
{
  "dashboard": {
    "title": "Service Overview - Payment Service",
    "tags": ["payment", "microservices"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.05], "type": "gt" },
              "query": { "params": ["A", "5m", "now"] }
            }
          ]
        }
      },
      {
        "title": "Latency (p50, p95, p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      },
      {
        "title": "Active Payments",
        "type": "stat",
        "targets": [
          {
            "expr": "payments_active{service=\"payment-service\"}"
          }
        ],
        "gridPos": { "x": 0, "y": 16, "w": 6, "h": 4 }
      },
      {
        "title": "Payment Methods Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(rate(payments_processed_by_method_total[5m])) by (method)",
            "legendFormat": "{{ method }}"
          }
        ],
        "gridPos": { "x": 6, "y": 16, "w": 6, "h": 8 }
      }
    ]
  }
}
Advanced Observability Patterns
RED Method (Request, Error, Duration)
# Request Rate
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Error Rate (5xx responses over all responses)
sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Duration (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
USE Method (Utilization, Saturation, Errors)
# CPU Utilization
rate(process_cpu_seconds_total[5m])

# Memory Saturation (heap used over heap total)
process_memory_bytes{type="heap_used"} / process_memory_bytes{type="heap_total"}

# Database Connection Saturation
database_connections_active / database_connections_max

# Error Rate
rate(database_errors_total[5m])
Golden Signals (Latency, Traffic, Errors, Saturation)
# Latency (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Traffic (requests/second)
sum(rate(http_requests_total[5m])) by (service)

# Errors (error rate: 5xx responses over all responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# Saturation (resource utilization)
avg(rate(process_cpu_seconds_total[5m])) by (service)
Service Level Objectives (SLOs)
Defining SLIs and SLOs
# slo-config.yml
slos:
  - name: api_availability
    description: 'API endpoint availability'
    sli:
      ratio_query: |
        sum(rate(http_requests_total{job="api",status!~"5.."}[30d]))
        /
        sum(rate(http_requests_total{job="api"}[30d]))
    target: 0.999 # 99.9%
    window: 30d
    error_budget: 0.001 # 0.1%

  - name: api_latency
    description: 'API latency p95 < 500ms'
    sli:
      threshold_query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < 0.5
    target: 0.995 # 99.5% of requests under 500ms
    window: 30d
Error Budget Calculation
// error-budget.ts
interface SLO {
  target: number; // e.g., 0.999 (99.9%)
  windowDays: number; // e.g., 30
}

function calculateErrorBudget(slo: SLO): {
  allowedFailures: number;
  allowedFailurePercentage: number;
} {
  const allowedFailurePercentage = 1 - slo.target;

  // Assuming 1M requests per month
  const totalRequests = 1_000_000;
  const allowedFailures = totalRequests * allowedFailurePercentage;

  return {
    allowedFailures: Math.floor(allowedFailures),
    allowedFailurePercentage: allowedFailurePercentage * 100,
  };
}

// Example: 99.9% SLO over 30 days
const budget = calculateErrorBudget({ target: 0.999, windowDays: 30 });
console.log(`Allowed failures: ${budget.allowedFailures}`);
// Output: Allowed failures: 1000 (0.1% of 1M requests)
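The `0.001 * 14.4` threshold in the HighErrorBudgetBurnRate alert earlier is a burn rate: how fast the budget is being spent relative to a rate that would exhaust it exactly at the end of the window. A sketch of the arithmetic (function names are illustrative, not from any library):

```typescript
// Burn rate: observed error rate relative to the error budget (1 - SLO target).
// A burn rate of 1 spends the budget exactly over the full window.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Hours until the budget is exhausted at the current burn rate.
function hoursToExhaustion(burn: number, windowDays: number): number {
  return (windowDays * 24) / burn;
}

// Example: 99.9% SLO, 1.44% of requests failing over the last hour
const br = burnRate(0.0144, 0.999);    // 0.0144 / 0.001 = 14.4
const hrs = hoursToExhaustion(br, 30); // 720 / 14.4 = 50 hours, about 2 days
```

This is where the alert's "< 2 days" annotation comes from: a sustained burn rate above 14.4 spends a 30-day budget in roughly 50 hours.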
Performance Optimization
Reducing Metric Cardinality
// ❌ BAD: High cardinality (user_id as label)
paymentsProcessed.add(1, {
  user_id: '12345', // Millions of unique values!
  status: 'success',
});

// ✅ GOOD: Low cardinality (user_tier instead)
paymentsProcessed.add(1, {
  user_tier: 'premium', // Limited values: 'free', 'premium', 'enterprise'
  status: 'success',
});

// ❌ BAD: Timestamp as label
requestCounter.add(1, {
  timestamp: Date.now().toString(), // Infinite cardinality!
});

// ✅ GOOD: Use metric timestamp, not labels
requestCounter.add(1, {
  method: 'POST',
  endpoint: '/api/checkout',
});
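The cost of a label set is multiplicative: a metric produces one time series per combination of label values. A quick back-of-the-envelope check makes the explosion concrete (`seriesCount` is an illustrative helper, not a library function):

```typescript
// Series count for one metric = product of each label's distinct-value count.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((total, n) => total * n, 1);
}

// payments_processed_total with status (3), currency (10), method (4) values:
const fine = seriesCount([3, 10, 4]); // 120 series — cheap

// The same metric with a user_id label (1M users) explodes:
const explosion = seriesCount([3, 10, 4, 1_000_000]); // 120,000,000 series
```

Every one of those series costs memory and disk in Prometheus, which is why bounded labels like `user_tier` are safe and unbounded identifiers like `user_id` belong in logs and traces instead.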
Sampling High-Volume Traces
// Dynamic sampling based on error status
class SmartSampler implements Sampler {
  private baseSampleRate = 0.01; // 1% base rate

  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample errors
    if (attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (> 1s)
    if (attributes['http.duration'] > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample critical endpoints at a higher rate
    if (attributes['http.route'] === '/api/checkout') {
      return this.sampleAtRate(0.1); // 10% for checkout
    }

    // Sample everything else at the base rate
    return this.sampleAtRate(this.baseSampleRate);
  }

  private sampleAtRate(rate: number): SamplingResult {
    return Math.random() < rate
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
}
Efficient Log Aggregation
// Use structured logging with levels
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Only log to file in production
  ...(process.env.NODE_ENV === 'production' && {
    transport: {
      target: 'pino/file',
      options: { destination: '/var/log/app.log' },
    },
  }),
});

// Add trace context to logs automatically
app.use((req, res, next) => {
  const span = trace.getActiveSpan();
  if (span) {
    req.log = logger.child({
      trace_id: span.spanContext().traceId,
      span_id: span.spanContext().spanId,
    });
  }
  next();
});

// Structured log with correct level
logger.info(
  {
    user_id: user.id,
    action: 'checkout',
    amount: payment.amount,
    duration_ms: performance.now() - startTime,
  },
  'Payment processed successfully'
);
Troubleshooting Common Issues
Missing Traces
Problem: Traces not appearing in Jaeger
Solutions:
- Check OpenTelemetry Collector is running:
kubectl get pods -n monitoring | grep otel-collector
- Verify trace export configuration:
// Ensure the exporter points at the right agent
const jaegerExporter = new JaegerExporter({
  host: 'jaeger-agent', // Correct hostname
  port: 6831,
});
- Check sampling rate:
// Verify you're not sampling out all traces
const sampler = new TraceIdRatioBasedSampler(1.0); // 100% for debugging
- Enable debug logging:
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
High Memory Usage
Problem: Prometheus consuming excessive memory
Solutions:
- Reduce retention period:
# Retention is set via Prometheus startup flags, not prometheus.yml
--storage.tsdb.retention.time=15d   # Reduce from 30d
--storage.tsdb.retention.size=50GB
- Decrease scrape frequency for low-priority services:
scrape_configs:
  - job_name: 'background-jobs'
    scrape_interval: 60s # Less frequent than default 15s
- Limit metric cardinality with relabeling:
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop the high-cardinality user_id label
      # (labeldrop matches label *names* via regex; it takes no source_labels)
      - regex: user_id
        action: labeldrop
Dashboard Loading Slowly
Problem: Grafana dashboards take 30+ seconds to load
Solutions:
- Reduce query time range:
# Instead of a 24h range vector
rate(http_requests_total[24h])

# Use a shorter range
rate(http_requests_total[5m])
- Use recording rules for complex queries:
# /etc/prometheus/rules/recording.yml
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
- Use dashboard variables to limit scope:
{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "multi": false
      }
    ]
  }
}
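With the `service` variable defined, panel queries reference it so Grafana fetches only the selected service's series rather than every series behind the metric:

```
sum(rate(http_requests_total{service="$service"}[5m])) by (status)
```

This keeps each dashboard load scoped to one service, which is usually the single biggest win for load time on high-cardinality metrics.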
Best Practices
- Start with auto-instrumentation, add custom spans only where needed
- Use consistent naming across metrics, traces, and logs
- Tag resources with environment, service, version
- Implement health checks separate from business metrics
- Set up alerting early, refine thresholds over time
- Monitor the monitors - ensure Prometheus/Grafana are healthy
- Document SLOs and share with stakeholders
- Review dashboards regularly - remove stale panels
- Use exemplars to link metrics to traces
- Test observability in staging before production
Conclusion
Comprehensive observability is essential for operating modern distributed systems. The combination of OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization provides a powerful, vendor-neutral stack that scales from small applications to large microservices architectures.
By implementing the patterns in this guide—proper instrumentation, meaningful metrics, distributed tracing, effective alerting, and SLO monitoring—you gain deep insights into system behavior, enabling faster debugging, proactive issue detection, and data-driven capacity planning.
Key Takeaways:
- Instrument comprehensively - metrics, traces, and logs working together
- Keep cardinality low - avoid high-cardinality labels in metrics
- Sample intelligently - higher rates for errors and slow requests
- Define SLOs early - measure what matters to users
- Alert on symptoms, not causes - focus on user impact
- Optimize queries - use recording rules for expensive calculations
- Maintain dashboards - keep them focused and actionable
- Trace context matters - propagate across all services
With robust observability, you transform from reactive firefighting to proactive system management.
Additional Resources
- OpenTelemetry: https://opentelemetry.io/
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Jaeger: https://www.jaegertracing.io/
- SLO Best Practices: https://sre.google/workbook/implementing-slos/
- RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- USE Method: http://www.brendangregg.com/usemethod.html
Written by StaticBlock Editorial
StaticBlock Editorial is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.