Production Observability: A Complete Guide to OpenTelemetry, Prometheus, and Grafana

Build comprehensive production observability with OpenTelemetry, Prometheus, and Grafana. Learn instrumentation strategies, metrics collection, distributed tracing, log aggregation, custom dashboards, alerting rules, and performance optimization. Includes real-world examples for Node.js, Python, Go, and Java microservices with Kubernetes deployment patterns and SLO/SLI monitoring.

StaticBlock Editorial
21 min read

Introduction

Observability is the ability to understand what's happening inside your systems by examining their outputs. Unlike traditional monitoring, which answers "is the system working?", observability answers "why isn't it working?" and "what changed?"

The modern observability stack—OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization—has become the de facto standard for production systems in 2025. This combination provides comprehensive insights into distributed applications while remaining vendor-neutral and open-source.

This guide covers everything from basic instrumentation to advanced distributed tracing, custom metrics, alerting strategies, and performance optimization for high-traffic production environments.

The Three Pillars of Observability

1. Metrics (What's happening?)

Definition: Numerical values measured over time (e.g., request rate, error count, latency)

Use cases:

  • Real-time dashboards
  • Alerting on thresholds
  • Capacity planning
  • SLO/SLI tracking

Example: HTTP request duration histogram, memory usage gauge, database query count
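Counters like these only ever increase; what dashboards actually show is their rate of change between scrapes. A minimal sketch of that derivation (hypothetical helper, mirroring how Prometheus's `rate()` treats a counter that drops to a lower value as a process restart, not a negative rate):

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Per-second rate between two scrapes of a monotonic counter.
function counterRate(prev: Sample, curr: Sample): number {
  const elapsedSec = (curr.timestampMs - prev.timestampMs) / 1000;
  if (elapsedSec <= 0) return 0;

  // On reset, the counter restarted from zero, so the increase since the
  // previous scrape is at least the current value.
  const increase =
    curr.value >= prev.value ? curr.value - prev.value : curr.value;

  return increase / elapsedSec;
}

// 150 requests counted over a 15s scrape interval → 10 req/s
console.log(
  counterRate({ timestampMs: 0, value: 1000 }, { timestampMs: 15000, value: 1150 })
); // 10
```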

2. Logs (What happened?)

Definition: Discrete event records with timestamp and contextual information

Use cases:

  • Debugging specific issues
  • Audit trails
  • Error investigation
  • User behavior analysis

Example: "User 12345 failed login attempt from IP 192.168.1.1 at 2025-11-23T10:15:30Z"

3. Traces (How did it flow?)

Definition: End-to-end journey of a request through distributed systems

Use cases:

  • Understanding service dependencies
  • Identifying bottlenecks
  • Root cause analysis
  • Performance optimization

Example: Web request → API Gateway → Auth Service → Database → Cache → Response (with timing for each step)
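The value of that per-step timing is that it makes the bottleneck computable. A toy sketch (hypothetical span records, shaped roughly like what a trace backend such as Jaeger returns):

```typescript
interface SpanRecord {
  name: string;
  startMs: number;
  endMs: number;
}

// Rank child spans by duration to find where the request spent its time.
function bottleneck(spans: SpanRecord[]): string {
  return spans
    .map((s) => ({ name: s.name, dur: s.endMs - s.startMs }))
    .sort((a, b) => b.dur - a.dur)[0].name;
}

// Child spans of a hypothetical checkout request (times in ms):
const childSpans: SpanRecord[] = [
  { name: 'auth-service', startMs: 10, endMs: 45 },
  { name: 'database.query', startMs: 60, endMs: 360 },
  { name: 'cache.get', startMs: 370, endMs: 375 },
];

console.log(bottleneck(childSpans)); // database.query
```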

Architecture Overview

┌─────────────┐
│ Application │
│  (Your Code)│
└──────┬──────┘
       │ OpenTelemetry SDK
       │ (Auto/Manual Instrumentation)
       ▼
┌──────────────────────┐
│ OpenTelemetry        │
│ Collector            │
│ (Receive/Process/    │
│  Export)             │
└─────┬────────────────┘
      │
      ├─────────────┬──────────────┐
      ▼             ▼              ▼
┌──────────┐  ┌─────────┐  ┌────────────┐
│Prometheus│  │ Jaeger  │  │ Loki       │
│(Metrics) │  │(Traces) │  │(Logs)      │
└────┬─────┘  └────┬────┘  └─────┬──────┘
     │             │              │
     └─────────────┼──────────────┘
                   ▼
            ┌──────────────┐
            │   Grafana    │
            │(Visualization│
            │ & Alerting)  │
            └──────────────┘

Setting Up OpenTelemetry

Installation and Basic Configuration

Node.js/TypeScript Example

npm install @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-prometheus \
            @opentelemetry/exporter-jaeger \
            @opentelemetry/api
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Define service metadata
const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  })
);

// Prometheus metrics exporter
const prometheusExporter = new PrometheusExporter({
  port: 9464, // Metrics exposed on :9464/metrics
  endpoint: '/metrics',
});

// Jaeger trace exporter
const jaegerExporter = new JaegerExporter({
  agentHost: process.env.JAEGER_AGENT_HOST || 'localhost',
  agentPort: parseInt(process.env.JAEGER_AGENT_PORT || '6832'),
});

// Initialize SDK with auto-instrumentation
const sdk = new NodeSDK({
  resource,
  metricReader: prometheusExporter,
  traceExporter: jaegerExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Fine-tune auto-instrumentation
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/metrics'],
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true, // PostgreSQL instrumentation
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
    }),
  ],
});

// Start SDK
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

export default sdk;

// app.ts
import './tracing'; // Must be first import!
import express from 'express';
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const app = express();
const tracer = trace.getTracer('payment-service');

app.post('/api/checkout', async (req, res) => {
  // Create custom span for business logic
  const span = tracer.startSpan('checkout.process');

  try {
    // Add custom attributes
    span.setAttribute('user.id', req.body.userId);
    span.setAttribute('cart.total', req.body.total);
    span.setAttribute('payment.method', req.body.paymentMethod);

    // Business logic with automatic child spans from auto-instrumentation
    const payment = await processPayment(req.body);
    const order = await createOrder(payment);

    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('order.id', order.id);

    res.json({ success: true, orderId: order.id });
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);

    res.status(500).json({ success: false, error: error.message });
  } finally {
    span.end();
  }
});

app.listen(3000);

Python Example with FastAPI

# instrumentation.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
import os

def setup_observability(app):
    # Define resource attributes
    resource = Resource.create({
        "service.name": "user-service",
        "service.version": "2.1.0",
        "deployment.environment": os.getenv("ENV", "production"),
    })

    # Set up tracer provider
    trace.set_tracer_provider(TracerProvider(resource=resource))
    tracer_provider = trace.get_tracer_provider()

    # Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("JAEGER_AGENT_HOST", "localhost"),
        agent_port=int(os.getenv("JAEGER_AGENT_PORT", "6831")),
    )

    tracer_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))

    # Instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Instrument SQLAlchemy
    SQLAlchemyInstrumentor().instrument()

    # Instrument Redis
    RedisInstrumentor().instrument()

    return trace.get_tracer(__name__)

# main.py
from fastapi import FastAPI, HTTPException
from opentelemetry import trace

app = FastAPI()
tracer = setup_observability(app)

@app.post("/api/users")
async def create_user(user: UserCreate):
    with tracer.start_as_current_span("create_user") as span:
        span.set_attribute("user.email", user.email)
        span.set_attribute("user.role", user.role)

        try:
            # Validate email
            with tracer.start_as_current_span("validate_email"):
                if not is_valid_email(user.email):
                    raise ValueError("Invalid email format")

            # Check existing user
            with tracer.start_as_current_span("check_duplicate"):
                existing = await db.query(User).filter_by(email=user.email).first()
                if existing:
                    raise HTTPException(400, "User already exists")

            # Create user
            with tracer.start_as_current_span("database.insert"):
                new_user = User(**user.dict())
                db.add(new_user)
                await db.commit()

            span.set_attribute("user.id", new_user.id)
            return {"id": new_user.id, "email": new_user.email}

        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise

Custom Metrics with Prometheus

Metric Types

1. Counter: Monotonically increasing value (e.g., total requests)
2. Gauge: Value that can go up or down (e.g., active connections)
3. Histogram: Distribution of values (e.g., request duration)
4. Summary: Similar to histogram but with quantiles

Implementing Custom Metrics

// metrics.ts
import { metrics } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

const prometheusExporter = new PrometheusExporter({ port: 9464 });

// PrometheusExporter is itself a pull-based MetricReader, so it is passed
// to the provider directly rather than wrapped in a periodic reader.
const meterProvider = new MeterProvider({
  readers: [prometheusExporter],
});

metrics.setGlobalMeterProvider(meterProvider);

const meter = metrics.getMeter('payment-service');

// Counter: Total processed payments
export const paymentsProcessed = meter.createCounter('payments_processed_total', {
  description: 'Total number of payment transactions processed',
  unit: 'transactions',
});

// Counter with labels
export const paymentsProcessedByMethod = meter.createCounter(
  'payments_processed_by_method_total',
  { description: 'Payments processed by payment method' }
);

// Gauge: Active payment processing
export const activePayments = meter.createUpDownCounter('payments_active', {
  description: 'Number of payments currently being processed',
  unit: 'payments',
});

// Histogram: Payment processing duration
export const paymentDuration = meter.createHistogram('payment_duration_seconds', {
  description: 'Time taken to process payments',
  unit: 'seconds',
});

// Histogram with custom buckets
export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Distribution of payment amounts',
  unit: 'dollars',
});

// Observable Gauge: System metrics
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage',
  unit: 'bytes',
});

memoryUsage.addCallback((observableResult) => {
  const usage = process.memoryUsage();
  observableResult.observe(usage.heapUsed, { type: 'heap_used' });
  observableResult.observe(usage.heapTotal, { type: 'heap_total' });
  observableResult.observe(usage.rss, { type: 'rss' });
});

Using Metrics in Application Code

// payment-service.ts
import {
  paymentsProcessed,
  paymentsProcessedByMethod,
  activePayments,
  paymentDuration,
  paymentAmount,
} from './metrics';

export async function processPayment(paymentData: PaymentRequest) {
  const startTime = Date.now();

  // Increment active payments
  activePayments.add(1);

  try {
    // Process payment
    const result = await chargeCard(paymentData);

    // Record successful payment
    paymentsProcessed.add(1, {
      status: 'success',
      currency: paymentData.currency,
    });

    paymentsProcessedByMethod.add(1, {
      method: paymentData.method, // 'credit_card', 'paypal', etc.
      status: 'success',
    });

    // Record payment amount
    paymentAmount.record(paymentData.amount, {
      currency: paymentData.currency,
      method: paymentData.method,
    });

    return result;
  } catch (error) {
    // Record failed payment
    paymentsProcessed.add(1, {
      status: 'failed',
      error_type: error.code,
    });

    paymentsProcessedByMethod.add(1, {
      method: paymentData.method,
      status: 'failed',
    });

    throw error;
  } finally {
    // Record duration
    const duration = (Date.now() - startTime) / 1000;
    paymentDuration.record(duration, {
      method: paymentData.method,
    });

    // Decrement active payments
    activePayments.add(-1);
  }
}

Distributed Tracing Patterns

Context Propagation Across Services

// api-gateway.ts (Service A)
import { context, propagation, trace } from '@opentelemetry/api';
import axios from 'axios';

app.post('/api/orders', async (req, res) => {
  const span = trace.getActiveSpan();

  // Prepare headers for downstream service
  const headers: Record<string, string> = {};

  // Inject trace context into headers
  propagation.inject(context.active(), headers);

  try {
    // Call payment service with trace context
    const paymentResult = await axios.post(
      'http://payment-service:3001/api/charge',
      req.body.payment,
      { headers }
    );

    // Call inventory service with same trace context
    const inventoryResult = await axios.post(
      'http://inventory-service:3002/api/reserve',
      req.body.items,
      { headers }
    );

    res.json({ success: true, data: { paymentResult, inventoryResult } });
  } catch (error) {
    span?.recordException(error);
    res.status(500).json({ error: error.message });
  }
});

// payment-service.ts (Service B)
import { context, propagation, trace } from '@opentelemetry/api';

app.post('/api/charge', (req, res) => {
  // Extract trace context from incoming headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Continue trace from parent span
  context.with(extractedContext, () => {
    const span = trace.getActiveSpan();

    // This span will be a child of the gateway's span
    span?.setAttribute('payment.amount', req.body.amount);
    span?.setAttribute('payment.method', req.body.method);

    // Process payment...
  });
});
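Under the hood, the default OpenTelemetry propagator carries this context in a W3C `traceparent` header of the form `version-traceId-spanId-flags`. A minimal parser, for illustration only (the SDK's `propagation.extract` does this for you):

```typescript
interface TraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// Parse a W3C Trace Context `traceparent` header, e.g.
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header
  );
  if (!m) return null;
  return {
    traceId: m[2],
    spanId: m[3],
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(m[4], 16) & 0x01) === 1,
  };
}

const tp = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(tp?.traceId, tp?.sampled); // 4bf92f3577b34da6a3ce929d0e0e4736 true
```

If that header is dropped anywhere (a proxy that strips unknown headers, a message queue without header support), the trace breaks into disconnected fragments, which is the most common cause of "missing" child spans.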

Trace Sampling Strategies

// sampling-config.ts
import {
  AlwaysOnSampler,
  ParentBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Sample 10% of traces in production
const productionSampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Sample 100% of error traces
class ErrorSampler implements Sampler {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample if error attribute is present
    if (attributes['error'] || attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample 10% of normal traces
    return new TraceIdRatioBasedSampler(0.1).shouldSample(
      context,
      traceId,
      spanName,
      spanKind,
      attributes,
      links
    );
  }
}

const sdk = new NodeSDK({
  sampler:
    process.env.NODE_ENV === 'production'
      ? new ErrorSampler()
      : new ParentBasedSampler({ root: new AlwaysOnSampler() }),
});

Prometheus Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production-us-east-1'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configs
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application services (Kubernetes service discovery)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Add pod metadata as labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      - source_labels: [__meta_kubernetes_pod_label_version]
        action: replace
        target_label: version

  # Static scrape for specific services
  - job_name: 'payment-service'
    static_configs:
      - targets:
          - 'payment-service:9464'
        labels:
          service: 'payment'
          team: 'checkout'

  - job_name: 'user-service'
    static_configs:
      - targets:
          - 'user-service:9464'
        labels:
          service: 'user'
          team: 'identity'

Alert Rules

# /etc/prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          team: '{{ $labels.team }}'
        annotations:
          summary: 'High error rate on {{ $labels.service }}'
          description: '{{ $labels.service }} has {{ $value | humanizePercentage }} error rate over the last 5 minutes'
      # High latency (p95)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High latency on {{ $labels.service }}'
          description: 'P95 latency is {{ $value }}s on {{ $labels.service }}'

      # Service down
      - alert: ServiceDown
        expr: up{job="kubernetes-pods"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Service {{ $labels.kubernetes_pod_name }} is down'
          description: 'Service has been down for more than 2 minutes'

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            process_memory_bytes{type="heap_used"}
            /
            process_memory_bytes{type="heap_total"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High memory usage on {{ $labels.service }}'
          description: 'Memory usage is at {{ $value | humanizePercentage }}'

      # Database connection pool exhaustion
      - alert: DatabasePoolExhaustion
        expr: |
          (
            database_connections_active
            /
            database_connections_max
          ) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Database connection pool nearly exhausted'
          description: '{{ $value | humanizePercentage }} of database connections in use'

  - name: slo_alerts
    interval: 1m
    rules:
      # SLO: 99.9% availability
      - alert: SLOViolation
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          ) < 0.999
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: 'SLO violation: Availability below 99.9%'
          description: 'Current availability: {{ $value | humanizePercentage }}'

      # Error budget burn rate
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (0.001 * 14.4)
        labels:
          severity: warning
        annotations:
          summary: 'Burning error budget too quickly'
          description: 'At current rate, will exhaust monthly error budget in < 2 days'
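Where does the `0.001 * 14.4` threshold come from? The error budget is 0.1% (for a 99.9% SLO), and a sustained 14.4x burn rate consumes a 30-day budget in 720h / 14.4 = 50 hours, roughly two days. A sketch of that arithmetic, with hypothetical helper names (the multi-window approach follows the Google SRE workbook):

```typescript
// Hours until the error budget is fully spent at a constant burn rate.
function hoursToBudgetExhaustion(windowDays: number, burnRate: number): number {
  return (windowDays * 24) / burnRate;
}

// The short-window error ratio at which the burn-rate alert should fire.
function burnRateThreshold(errorBudget: number, burnRate: number): number {
  return errorBudget * burnRate;
}

console.log(hoursToBudgetExhaustion(30, 14.4)); // 50
console.log(burnRateThreshold(0.001, 14.4)); // ~0.0144
```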

Grafana Dashboards

Kubernetes Deployment for Observability Stack

# grafana-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.2.0
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: 'grafana-piechart-panel,grafana-clock-panel'
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: grafana-dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
      - name: Jaeger
        type: jaeger
        access: proxy
        url: http://jaeger-query:16686
        editable: false

      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: false

Sample Dashboard JSON (Service Overview)

{
  "dashboard": {
    "title": "Service Overview - Payment Service",
    "tags": ["payment", "microservices"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.05], "type": "gt" },
              "query": { "params": ["A", "5m", "now"] }
            }
          ]
        }
      },
      {
        "title": "Latency (p50, p95, p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      },
      {
        "title": "Active Payments",
        "type": "stat",
        "targets": [
          {
            "expr": "payments_active{service=\"payment-service\"}"
          }
        ],
        "gridPos": { "x": 0, "y": 16, "w": 6, "h": 4 }
      },
      {
        "title": "Payment Methods Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(rate(payments_processed_by_method_total[5m])) by (method)",
            "legendFormat": "{{ method }}"
          }
        ],
        "gridPos": { "x": 6, "y": 16, "w": 6, "h": 8 }
      }
    ]
  }
}

Advanced Observability Patterns

RED Method (Request, Error, Duration)

# Request Rate
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
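It helps to know what `histogram_quantile` actually computes: it locates the bucket containing the requested rank and linearly interpolates within it. A simplified sketch of that interpolation (assuming sorted cumulative buckets ending in +Inf, as Prometheus histograms provide):

```typescript
interface Bucket {
  le: number; // upper bound of the bucket
  cumulativeCount: number; // observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.cumulativeCount >= rank) {
      // Values beyond the last finite bound can't be interpolated.
      if (b.le === Infinity) return prevLe;
      const bucketCount = b.cumulativeCount - prevCount;
      if (bucketCount === 0) return b.le;
      // Linear interpolation inside the bucket.
      return prevLe + ((rank - prevCount) / bucketCount) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.cumulativeCount;
  }
  return prevLe;
}

// 100 observations: 60 under 0.1s, 90 under 0.5s, all under 1s.
const buckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 60 },
  { le: 0.5, cumulativeCount: 90 },
  { le: 1.0, cumulativeCount: 100 },
  { le: Infinity, cumulativeCount: 100 },
];

// p95 falls in the (0.5, 1.0] bucket: 0.5 + ((95-90)/10)*0.5
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is why bucket boundaries matter: the reported quantile is only as precise as the bucket it lands in.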

USE Method (Utilization, Saturation, Errors)

# CPU Utilization
rate(process_cpu_seconds_total[5m])

# Memory Saturation
process_memory_bytes{type="heap_used"} / process_memory_bytes{type="heap_total"}

# Database Connection Saturation
database_connections_active / database_connections_max

# Error Rate
rate(database_errors_total[5m])

Golden Signals (Latency, Traffic, Errors, Saturation)

# Latency (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Traffic (requests/second)
sum(rate(http_requests_total[5m])) by (service)

# Errors (error rate)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service)

# Saturation (resource utilization)
avg(rate(process_cpu_seconds_total[5m])) by (service)

Service Level Objectives (SLOs)

Defining SLIs and SLOs

# slo-config.yml
slos:
  - name: api_availability
    description: 'API endpoint availability'
    sli:
      ratio_query: |
        sum(rate(http_requests_total{job="api",status!~"5.."}[30d]))
        /
        sum(rate(http_requests_total{job="api"}[30d]))
    target: 0.999  # 99.9%
    window: 30d
    error_budget: 0.001  # 0.1%
  - name: api_latency
    description: 'API latency p95 < 500ms'
    sli:
      threshold_query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < 0.5
    target: 0.995  # 99.5% of requests under 500ms
    window: 30d

Error Budget Calculation

// error-budget.ts
interface SLO {
  target: number;      // e.g., 0.999 (99.9%)
  windowDays: number;  // e.g., 30
}

function calculateErrorBudget(slo: SLO): {
  allowedFailures: number;
  allowedFailurePercentage: number;
} {
  const allowedFailurePercentage = 1 - slo.target;

  // Assuming 1M requests per month
  const totalRequests = 1_000_000;
  const allowedFailures = totalRequests * allowedFailurePercentage;

  return {
    allowedFailures: Math.floor(allowedFailures),
    allowedFailurePercentage: allowedFailurePercentage * 100,
  };
}

// Example: 99.9% SLO over 30 days
const budget = calculateErrorBudget({ target: 0.999, windowDays: 30 });
console.log(`Allowed failures: ${budget.allowedFailures}`);
// Output: Allowed failures: 1000 (0.1% of 1M requests)
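The same budget can be expressed as an allowance of downtime rather than failed requests, which is often easier to communicate to stakeholders. A small sketch (hypothetical helper name):

```typescript
// An availability SLO read as allowed full-outage time: 99.9% over 30 days
// permits 0.1% of 43,200 minutes, i.e. about 43.2 minutes per window.
function allowedDowntimeMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

console.log(allowedDowntimeMinutes(0.999, 30).toFixed(1)); // 43.2
```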

Performance Optimization

Reducing Metric Cardinality

// ❌ BAD: High cardinality (user_id as label)
paymentsProcessed.add(1, {
  user_id: '12345',  // Millions of unique values!
  status: 'success',
});

// ✅ GOOD: Low cardinality (user_tier instead)
paymentsProcessed.add(1, {
  user_tier: 'premium', // Limited values: 'free', 'premium', 'enterprise'
  status: 'success',
});

// ❌ BAD: Timestamp as label
requestCounter.add(1, {
  timestamp: Date.now().toString(), // Infinite cardinality!
});

// ✅ GOOD: Use metric timestamp, not labels
requestCounter.add(1, {
  method: 'POST',
  endpoint: '/api/checkout',
});
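A defensive variant of the same idea is to clamp label values to a known set before recording, so a bug or unexpected input can never explode a metric's cardinality. This is a hypothetical helper, not part of the OpenTelemetry API:

```typescript
// Collapse any value outside the allowlist into a single fallback label.
function safeLabel(
  value: string,
  allowed: Set<string>,
  fallback = 'other'
): string {
  return allowed.has(value) ? value : fallback;
}

const PAYMENT_METHODS = new Set(['credit_card', 'paypal', 'bank_transfer']);

// Known value passes through; anything unexpected collapses to 'other'.
console.log(safeLabel('paypal', PAYMENT_METHODS)); // paypal
console.log(safeLabel('user-98341-session', PAYMENT_METHODS)); // other
```

Recording with `safeLabel(...)` bounds the label's value set up front, whatever the callers send.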

Sampling High-Volume Traces

// Dynamic sampling based on error status
class SmartSampler implements Sampler {
  private baseSampleRate = 0.01; // 1% base rate

  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample errors
    if (attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (> 1s)
    if (attributes['http.duration'] > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample critical endpoints at higher rate
    if (attributes['http.route'] === '/api/checkout') {
      return this.sampleAtRate(0.1); // 10% for checkout
    }

    // Sample everything else at base rate
    return this.sampleAtRate(this.baseSampleRate);
  }

  private sampleAtRate(rate: number): SamplingResult {
    return Math.random() < rate
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
}

Efficient Log Aggregation

// Use structured logging with levels
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Only log to file in production
  ...(process.env.NODE_ENV === 'production' && {
    transport: {
      target: 'pino/file',
      options: { destination: '/var/log/app.log' },
    },
  }),
});

// Add trace context to logs automatically
app.use((req, res, next) => {
  const span = trace.getActiveSpan();
  if (span) {
    req.log = logger.child({
      trace_id: span.spanContext().traceId,
      span_id: span.spanContext().spanId,
    });
  }
  next();
});

// Structured log with correct level
logger.info(
  {
    user_id: user.id,
    action: 'checkout',
    amount: payment.amount,
    duration_ms: performance.now() - startTime,
  },
  'Payment processed successfully'
);

Troubleshooting Common Issues

Missing Traces

Problem: Traces not appearing in Jaeger

Solutions:

  1. Check OpenTelemetry Collector is running:
kubectl get pods -n monitoring | grep otel-collector
  2. Verify trace export configuration:
// Ensure exporter is configured
const jaegerExporter = new JaegerExporter({
  agentHost: 'jaeger-agent',  // Correct hostname
  agentPort: 6831,
});
  3. Check sampling rate:
// Verify you're not sampling out all traces
const sampler = new TraceIdRatioBasedSampler(1.0);  // 100% for debugging
  4. Enable debug logging:
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

High Memory Usage

Problem: Prometheus consuming excessive memory

Solutions:

  1. Reduce retention period:
# Retention is configured via Prometheus launch flags, not prometheus.yml
--storage.tsdb.retention.time=15d   # Reduce from 30d
--storage.tsdb.retention.size=50GB
  2. Decrease scrape frequency for low-priority services:
scrape_configs:
  - job_name: 'background-jobs'
    scrape_interval: 60s  # Less frequent than default 15s
  3. Limit metric cardinality with relabeling:
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop high-cardinality labels (labeldrop matches label names by regex)
      - regex: 'user_id'
        action: labeldrop

Dashboard Loading Slowly

Problem: Grafana dashboards take 30+ seconds to load

Solutions:

  1. Reduce query time range:
# Instead of a 24h range
rate(http_requests_total[24h])

# Use a shorter range
rate(http_requests_total[5m])

  2. Use recording rules for complex queries:
# /etc/prometheus/rules/recording.yml
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

  3. Use dashboard variables to limit scope:
{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "multi": false
      }
    ]
  }
}

Best Practices

  1. Start with auto-instrumentation, add custom spans only where needed
  2. Use consistent naming across metrics, traces, and logs
  3. Tag resources with environment, service, version
  4. Implement health checks separate from business metrics
  5. Set up alerting early, refine thresholds over time
  6. Monitor the monitors - ensure Prometheus/Grafana are healthy
  7. Document SLOs and share with stakeholders
  8. Review dashboards regularly - remove stale panels
  9. Use exemplars to link metrics to traces
  10. Test observability in staging before production

Conclusion

Comprehensive observability is essential for operating modern distributed systems. The combination of OpenTelemetry for instrumentation, Prometheus for metrics, and Grafana for visualization provides a powerful, vendor-neutral stack that scales from small applications to large microservices architectures.

By implementing the patterns in this guide—proper instrumentation, meaningful metrics, distributed tracing, effective alerting, and SLO monitoring—you gain deep insights into system behavior, enabling faster debugging, proactive issue detection, and data-driven capacity planning.

Key Takeaways:

  1. Instrument comprehensively - metrics, traces, and logs working together
  2. Keep cardinality low - avoid high-cardinality labels in metrics
  3. Sample intelligently - higher rates for errors and slow requests
  4. Define SLOs early - measure what matters to users
  5. Alert on symptoms, not causes - focus on user impact
  6. Optimize queries - use recording rules for expensive calculations
  7. Maintain dashboards - keep them focused and actionable
  8. Trace context matters - propagate across all services

With robust observability, you transform from reactive firefighting to proactive system management.

Additional Resources

  • OpenTelemetry: https://opentelemetry.io/
  • Prometheus: https://prometheus.io/docs/
  • Grafana: https://grafana.com/docs/
  • Jaeger: https://www.jaegertracing.io/
  • SLO Best Practices: https://sre.google/workbook/implementing-slos/
  • RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
  • USE Method: http://www.brendangregg.com/usemethod.html

Written by StaticBlock Editorial

StaticBlock Editorial is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.