Message Queues and Event Streaming - RabbitMQ and Kafka Patterns for Production Systems
Message queues are the backbone of scalable, resilient distributed systems. They decouple services, absorb traffic spikes, and guarantee delivery when downstream components fail. Uber pushes 100 trillion Kafka messages per year across thousands of microservices, while Robinhood relies on RabbitMQ for guaranteed, low-latency trade execution.
This guide covers why message queues matter, when to choose RabbitMQ versus Kafka, production patterns (work queues, pub/sub, topic routing, dead letter queues, retries with exponential backoff), delivery guarantees, performance tuning, and monitoring for queue-based systems.
Why Message Queues?
Message queues solve critical distributed systems challenges:
1. Decoupling Services
// Without Message Queue (Tight Coupling)
// Order service directly calls inventory service
async function createOrder(order) {
await inventoryService.reserveStock(order.items); // Fails if inventory service down
await paymentService.charge(order.payment); // Fails if payment service down
await emailService.sendConfirmation(order.email); // Blocks on email sending
return { orderId: order.id, status: 'completed' };
}
// Problem: All services must be available simultaneously
// Order creation fails if ANY downstream service is unavailable
// With Message Queue (Loose Coupling)
async function createOrder(order) {
// Publish events, don't wait for processing
await queue.publish('order.created', order);
return { orderId: order.id, status: 'pending' };
}
// Separate consumers handle each responsibility
queue.consume('order.created', async (order) => {
await inventoryService.reserveStock(order.items);
});
queue.consume('order.created', async (order) => {
await paymentService.charge(order.payment);
});
queue.consume('order.created', async (order) => {
await emailService.sendConfirmation(order.email);
});
// Benefits:
// - Order creation succeeds even if downstream services are down
// - Each service processes independently at its own pace
// - Failed operations can retry without affecting others
2. Load Leveling
// Without Queue: Traffic spike overwhelms system
// 10,000 requests/second → Database crashes
// With Queue: Absorbs spike, processes at sustainable rate
// 10,000 requests/second → Queue buffers → Process 1,000/second
// Queue depth grows temporarily, then drains when traffic subsides
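The comments above can be made concrete with a toy drain simulation (illustrative numbers only, matching the 10,000-in / 1,000-per-second figures):

```javascript
// Simulate load leveling: a burst of 10,000 jobs arrives at once and
// workers drain the queue at a sustainable 1,000 jobs per tick.
const simulateDrain = (burst, ratePerTick) => {
  const depths = [];
  let depth = burst;
  while (depth > 0) {
    depth = Math.max(0, depth - ratePerTick);
    depths.push(depth); // Queue depth after each tick
  }
  return depths;
};

console.log(simulateDrain(10000, 1000));
// → [9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000, 0]
```

The queue depth climbs during the burst, then drains to zero once traffic subsides — no component ever sees more than the sustainable rate.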
3. Guaranteed Delivery
// Without Queue: Message lost if recipient is down
await downstreamService.notify(data); // Throws error, data lost
// With Queue: Message persisted until successfully processed
await queue.publish('notifications', data); // Persisted to disk
// Consumer processes when available, retries on failure
RabbitMQ vs Kafka: When to Use Each
| Feature | RabbitMQ | Kafka |
|---|---|---|
| Architecture | Traditional message broker | Distributed commit log |
| Message Ordering | Per-queue ordering | Per-partition ordering (strict) |
| Message Retention | Deleted after acknowledgment | Retained for configured period (7 days default) |
| Throughput | 50K-100K messages/sec | 1M+ messages/sec |
| Latency | <10ms | 5-15ms |
| Use Cases | Task queues, RPC, routing | Event streaming, logs, analytics |
| Message Size | <128 KB optimal | <1 MB optimal |
| Consumer Model | Push (broker pushes to consumers) | Pull (consumers pull from broker) |
Choose RabbitMQ for:
- Task/work queues with complex routing
- RPC (request-reply) patterns
- Low-latency message delivery (<10ms)
- Per-message acknowledgments
- Priority queues
Choose Kafka for:
- Event streaming and event sourcing
- High-throughput data pipelines (>100K msg/sec)
- Log aggregation
- Real-time analytics
- Message replay (consumers can re-read old messages)
RabbitMQ Production Patterns
Work Queue Pattern
Use Case: Distribute time-consuming tasks among multiple workers
// producer.js - Send tasks to queue
import amqp from 'amqplib';
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
const queue = 'tasks';
await channel.assertQueue(queue, {
durable: true // Queue survives broker restart
});
// Send task
const task = {
type: 'image-resize',
imageUrl: 'https://example.com/image.jpg',
sizes: [300, 600, 1200]
};
channel.sendToQueue(queue, Buffer.from(JSON.stringify(task)), {
persistent: true // Message survives broker restart
});
console.log('Task sent:', task);
// worker.js - Process tasks from queue
import amqp from 'amqplib';
import sharp from 'sharp';
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
const queue = 'tasks';
await channel.assertQueue(queue, { durable: true });
// Prefetch 1 message at a time (fair dispatch)
channel.prefetch(1);
console.log('Worker waiting for tasks...');
channel.consume(queue, async (msg) => {
const task = JSON.parse(msg.content.toString());
console.log('Processing task:', task.type);
try {
// Simulate image processing
for (const size of task.sizes) {
await sharp(task.imageUrl)
.resize(size, size)
.toFile(`output-${size}.jpg`);
}
// Acknowledge successful processing
channel.ack(msg);
console.log('Task completed');
} catch (error) {
console.error('Task failed:', error);
// Reject and requeue (retry)
channel.nack(msg, false, true);
}
}, {
noAck: false // Manual acknowledgment required
});
Production Configuration:
// Ensure durability and fair dispatch
await channel.assertQueue('tasks', {
durable: true, // Queue survives restart
arguments: {
'x-max-priority': 10 // Enable priority queue (0-10)
}
});
channel.prefetch(1); // Process 1 message at a time (prevents overwhelming workers)
// Send high-priority task
channel.sendToQueue('tasks', Buffer.from(JSON.stringify(urgentTask)), {
persistent: true,
priority: 9 // Higher priority processed first
});
Pub/Sub Pattern (Fanout Exchange)
Use Case: Broadcast messages to multiple consumers
// publisher.js - Broadcast events
const exchange = 'events';
await channel.assertExchange(exchange, 'fanout', {
durable: true
});
// Publish user registration event
const event = {
type: 'user.registered',
userId: 'user-123',
email: 'alice@example.com',
timestamp: Date.now()
};
channel.publish(exchange, '', Buffer.from(JSON.stringify(event)), {
persistent: true
});
console.log('Event published:', event.type);
// subscriber-email.js - Send welcome email
const exchange = 'events';
const queue = 'email-service';
await channel.assertExchange(exchange, 'fanout', { durable: true });
await channel.assertQueue(queue, { durable: true });
await channel.bindQueue(queue, exchange, ''); // Bind to exchange
channel.consume(queue, async (msg) => {
const event = JSON.parse(msg.content.toString());
if (event.type === 'user.registered') {
await sendWelcomeEmail(event.email);
channel.ack(msg);
}
});
// subscriber-analytics.js - Track registration
const exchange = 'events';
const queue = 'analytics-service';
await channel.assertExchange(exchange, 'fanout', { durable: true });
await channel.assertQueue(queue, { durable: true });
await channel.bindQueue(queue, exchange, '');
channel.consume(queue, async (msg) => {
const event = JSON.parse(msg.content.toString());
if (event.type === 'user.registered') {
await analytics.track('User Registered', event);
channel.ack(msg);
}
});
Result: Single event published → Multiple independent consumers process it simultaneously
Topic Exchange (Routing Pattern)
Use Case: Route messages based on routing keys (e.g., log levels, event types)
// publisher.js - Publish logs with routing keys
const exchange = 'logs';
await channel.assertExchange(exchange, 'topic', { durable: true });
// Routing keys follow pattern: <service>.<level>
channel.publish(exchange, 'auth.error', Buffer.from('Authentication failed'));
channel.publish(exchange, 'auth.info', Buffer.from('User logged in'));
channel.publish(exchange, 'payment.error', Buffer.from('Payment declined'));
channel.publish(exchange, 'payment.info', Buffer.from('Payment successful'));
// consumer-errors.js - Subscribe to all errors
const exchange = 'logs';
const queue = 'error-handler';
await channel.assertExchange(exchange, 'topic', { durable: true });
await channel.assertQueue(queue, { durable: true });
// Routing pattern: *.error (all error logs from any service)
await channel.bindQueue(queue, exchange, '*.error');
channel.consume(queue, async (msg) => {
const routingKey = msg.fields.routingKey;
const log = msg.content.toString();
console.log(`[ERROR] ${routingKey}: ${log}`);
// Alert team on errors
await alertTeam(`Error in ${routingKey.split('.')[0]}: ${log}`);
channel.ack(msg);
});
// consumer-auth-logs.js - Subscribe to all auth logs
const queue = 'auth-logger';
await channel.bindQueue(queue, exchange, 'auth.*'); // All auth logs
channel.consume(queue, async (msg) => {
const routingKey = msg.fields.routingKey;
const log = msg.content.toString();
await saveToDatabase({
service: 'auth',
level: routingKey.split('.')[1],
message: log,
timestamp: Date.now()
});
channel.ack(msg);
});
Routing Key Patterns:
- `*` matches exactly one word
- `#` matches zero or more words
Examples:
- `auth.*` matches `auth.error`, `auth.info`
- `*.error` matches `auth.error`, `payment.error`
- `auth.#` matches `auth.error`, `auth.user.login`, `auth.user.logout`
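A small matcher makes these semantics concrete. This is an illustrative sketch, not RabbitMQ's actual implementation:

```javascript
// Check whether a routing key matches an AMQP-style binding pattern.
// '*' matches exactly one word; '#' matches zero or more words.
const matches = (pattern, key) => {
  const match = (p, k) => {
    if (p.length === 0) return k.length === 0;
    if (p[0] === '#') {
      // '#' absorbs zero or more words
      return match(p.slice(1), k) || (k.length > 0 && match(p, k.slice(1)));
    }
    if (k.length === 0) return false;
    if (p[0] === '*' || p[0] === k[0]) return match(p.slice(1), k.slice(1));
    return false;
  };
  return match(pattern.split('.'), key.split('.'));
};

console.log(matches('*.error', 'auth.error'));      // → true
console.log(matches('auth.#', 'auth.user.login'));  // → true
console.log(matches('*.error', 'auth.user.error')); // → false ('*' is one word)
```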
Dead Letter Queue (DLQ)
Use Case: Handle messages that fail processing after multiple retries
// Setup main queue with DLQ
const mainQueue = 'orders';
const dlqExchange = 'dlx';
const dlq = 'orders-dlq';
// Create DLQ
await channel.assertExchange(dlqExchange, 'fanout', { durable: true });
await channel.assertQueue(dlq, { durable: true });
await channel.bindQueue(dlq, dlqExchange, '');
// Create main queue with DLX
await channel.assertQueue(mainQueue, {
durable: true,
arguments: {
'x-dead-letter-exchange': dlqExchange, // Send failed messages here
'x-message-ttl': 60000 // 60 second TTL
}
});
// Consumer for main queue
channel.consume(mainQueue, async (msg) => {
const order = JSON.parse(msg.content.toString());
try {
await processOrder(order);
channel.ack(msg);
} catch (error) {
console.error('Processing failed:', error);
// Reject without requeue → Goes to DLQ
channel.nack(msg, false, false);
}
});
// Monitor DLQ for failed orders
channel.consume(dlq, async (msg) => {
const failedOrder = JSON.parse(msg.content.toString());
const failureCount = (msg.properties.headers['x-death']?.[0]?.count || 0);
console.error(`Order ${failedOrder.id} failed ${failureCount} times`);
// Alert team
await alertOps(`Order processing failed: ${failedOrder.id}`);
// Store for manual review
await saveFailedOrder(failedOrder, {
failureCount,
originalQueue: msg.properties.headers['x-death']?.[0]?.queue,
reason: msg.properties.headers['x-first-death-reason']
});
channel.ack(msg);
});
Retry with Exponential Backoff
// Retry queue with increasing delays
const createRetryQueue = async (attempt) => {
const delay = Math.min(1000 * Math.pow(2, attempt), 60000); // Cap at 60s
const queueName = `orders-retry-${attempt}`;
await channel.assertQueue(queueName, {
durable: true,
arguments: {
'x-dead-letter-exchange': 'orders-exchange',
'x-dead-letter-routing-key': 'orders',
'x-message-ttl': delay // Delay before retry
}
});
return queueName;
};
// Main consumer with retry logic
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
const retryCount = (msg.properties.headers?.['x-retry-count'] || 0);
try {
await processOrder(order);
channel.ack(msg);
} catch (error) {
if (retryCount < 5) {
// Retry with exponential backoff
const retryQueue = await createRetryQueue(retryCount);
channel.sendToQueue(retryQueue, msg.content, {
persistent: true,
headers: {
'x-retry-count': retryCount + 1
}
});
channel.ack(msg); // Remove from main queue
console.log(`Order ${order.id} retrying (attempt ${retryCount + 1})`);
} else {
// Max retries exceeded → DLQ
channel.nack(msg, false, false);
console.error(`Order ${order.id} failed after 5 retries`);
}
}
});
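For reference, the schedule the backoff formula above produces for attempts 0 through 6:

```javascript
// Delay before each retry attempt: doubles each time, capped at 60 seconds
const backoffDelay = (attempt) => Math.min(1000 * Math.pow(2, attempt), 60000);

console.log([0, 1, 2, 3, 4, 5, 6].map(backoffDelay));
// → [1000, 2000, 4000, 8000, 16000, 32000, 60000]
```

The cap matters: without it, attempt 10 would wait over 17 minutes, leaving messages parked far longer than most SLAs allow.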
Kafka Production Patterns
Producer with Idempotence and Transactions
// producer.js - Idempotent producer
import { Kafka } from 'kafkajs';
const kafka = new Kafka({
clientId: 'order-service',
brokers: ['localhost:9092']
});
const producer = kafka.producer({
idempotent: true, // Exactly-once semantics per partition
maxInFlightRequests: 1, // kafkajs requires 1 when idempotence is enabled
transactionalId: 'order-transactions'
});
await producer.connect();
// Send a keyed message (messages with the same key land on the same partition)
await producer.send({
topic: 'orders',
messages: [
{
key: 'order-123', // Messages with same key go to same partition
value: JSON.stringify({
orderId: 'order-123',
userId: 'user-456',
total: 99.99,
items: [...]
}),
headers: {
'trace-id': 'abc-123',
'timestamp': Date.now().toString()
}
}
]
});
// Transactional writes (all-or-nothing)
const transaction = await producer.transaction();
try {
await transaction.send({
topic: 'orders',
messages: [{ key: 'order-123', value: orderData }]
});
await transaction.send({
topic: 'inventory',
messages: [{ key: 'item-789', value: inventoryUpdate }]
});
await transaction.commit(); // Both succeed or both fail
} catch (error) {
await transaction.abort();
console.error('Transaction failed:', error);
}
Consumer Groups with Offset Management
// consumer.js - Consumer group
const consumer = kafka.consumer({
groupId: 'order-processing-group',
sessionTimeout: 30000,
heartbeatInterval: 3000
});
await consumer.connect();
await consumer.subscribe({
topic: 'orders',
fromBeginning: false // Start from latest
});
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const order = JSON.parse(message.value.toString());
console.log({
partition,
offset: message.offset,
order: order.orderId
});
try {
await processOrder(order);
// Commit offset manually after successful processing
await consumer.commitOffsets([{
topic,
partition,
offset: (parseInt(message.offset) + 1).toString()
}]);
} catch (error) {
console.error('Processing failed:', error);
// Don't commit offset → Will retry from this message
}
}
});
Consumer Group Behavior:
- Multiple consumers in same group share partitions
- Each partition consumed by exactly one consumer in group
- If consumer fails, partition reassigned to another consumer
Topic: orders (3 partitions)
Consumer Group: order-processors (2 consumers)
Consumer A: [Partition 0, Partition 1]
Consumer B: [Partition 2]
If Consumer A fails:
Consumer B: [Partition 0, Partition 1, Partition 2]
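The reassignment sketched above can be illustrated with a toy assignor. This is a simplification — real Kafka clients negotiate range, round-robin, or sticky assignment at rebalance — but it shows the even spread and the failover:

```javascript
// Spread partitions as evenly as possible across live consumers in a group
const assignPartitions = (partitions, consumers) => {
  const assignment = Object.fromEntries(consumers.map(c => [c, []]));
  partitions.forEach((p, i) => {
    // Round-robin: partition i goes to consumer i mod N
    assignment[consumers[i % consumers.length]].push(p);
  });
  return assignment;
};

console.log(assignPartitions([0, 1, 2], ['A', 'B']));
// → { A: [ 0, 2 ], B: [ 1 ] }

// After consumer A fails, a rebalance reassigns its partitions:
console.log(assignPartitions([0, 1, 2], ['B']));
// → { B: [ 0, 1, 2 ] }
```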
Kafka Streams for Real-Time Processing
// stream-processor.js - Real-time order analytics
import { Kafka } from 'kafkajs';
const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'analytics-group' });
const producer = kafka.producer(); // Publishes the aggregated metrics below
await consumer.connect();
await producer.connect();
await consumer.subscribe({ topic: 'orders' });
// In-memory aggregation (windowed)
const orderStats = new Map();
await consumer.run({
eachMessage: async ({ message }) => {
const order = JSON.parse(message.value.toString());
const hour = new Date(order.timestamp).setMinutes(0, 0, 0);
// Update stats for current hour
const stats = orderStats.get(hour) || { count: 0, total: 0 };
stats.count += 1;
stats.total += order.total;
orderStats.set(hour, stats);
// Emit aggregated stats
if (stats.count % 100 === 0) {
console.log(`Hour ${new Date(hour).toISOString()}:`, {
orders: stats.count,
revenue: stats.total,
averageOrderValue: stats.total / stats.count
});
// Publish aggregated metrics
await producer.send({
topic: 'order-metrics',
messages: [{
value: JSON.stringify({
window: hour,
metrics: stats
})
}]
});
}
}
});
Guaranteed Delivery Patterns
At-Least-Once Delivery
Pattern: Message delivered at least once, possibly multiple times (requires idempotent processing)
// RabbitMQ: Manual acknowledgment
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
try {
await processOrder(order); // Must be idempotent
channel.ack(msg); // Only ack after successful processing
} catch (error) {
// Don't ack → Message redelivered
console.error('Failed, will retry:', error);
}
});
// Kafka: Commit offset after processing
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
await processOrder(JSON.parse(message.value.toString()));
// Only commit offset after successful processing
await consumer.commitOffsets([{
topic,
partition,
offset: (parseInt(message.offset) + 1).toString()
}]);
}
});
Exactly-Once Delivery (Kafka)
Pattern: Message delivered exactly once using idempotent producers and transactional consumers
// Producer: Enable idempotence
const producer = kafka.producer({
idempotent: true,
transactionalId: 'my-transactional-producer'
});
// Consumer: Process with transactions
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const transaction = await producer.transaction();
try {
// Process message
const result = await processOrder(JSON.parse(message.value.toString()));
// Write result to output topic
await transaction.send({
topic: 'order-results',
messages: [{ value: JSON.stringify(result) }]
});
// Commit input offset
await transaction.sendOffsets({
consumerGroupId: 'order-processors',
topics: [{
topic,
partitions: [{
partition,
offset: (parseInt(message.offset) + 1).toString()
}]
}]
});
await transaction.commit();
} catch (error) {
await transaction.abort();
throw error;
}
}
});
Idempotency with Deduplication
// idempotency.js - Deduplicate messages using Redis
import Redis from 'ioredis';
const redis = new Redis();
const processIdempotent = async (messageId, processFn) => {
const key = `processed:${messageId}`;
// Check if already processed
const exists = await redis.get(key);
if (exists) {
console.log('Message already processed:', messageId);
return JSON.parse(exists);
}
// Process message
const result = await processFn();
// Store result with 24h TTL
await redis.setex(key, 86400, JSON.stringify(result));
return result;
};
// Usage in consumer
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
const messageId = msg.properties.messageId;
await processIdempotent(messageId, async () => {
return await processOrder(order);
});
channel.ack(msg);
});
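One caveat: the get-then-setex sequence above has a race window when two consumers receive the same message at once — both can pass the `get` check before either writes. An atomic claim closes it. Here is a minimal in-memory sketch of the flow; a real deployment would replace the Map with a single Redis `SET key value NX EX 86400`, which makes the check and the claim one atomic step:

```javascript
// In-memory stand-in for Redis (illustration only)
const store = new Map();

// Atomically claim a message ID; returns false if already claimed
const claim = (messageId) => {
  if (store.has(messageId)) return false;
  store.set(messageId, 'pending');
  return true;
};

const processOnce = async (messageId, processFn) => {
  if (!claim(messageId)) return 'duplicate'; // Another worker got here first
  const result = await processFn();
  store.set(messageId, result); // Cache the result for replayed deliveries
  return result;
};
```

Called twice with the same `messageId`, the second call short-circuits without re-running `processFn`.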
Performance Optimization
RabbitMQ Performance Tuning
1. Batch Publishing
// ❌ Slow: Confirm each publish individually
const confirmChannel = await connection.createConfirmChannel();
for (const message of messages) {
confirmChannel.sendToQueue('tasks', Buffer.from(JSON.stringify(message)));
await confirmChannel.waitForConfirms(); // One round-trip per message
}
// ✅ Fast: Publish the whole batch, then confirm once
for (const message of messages) {
confirmChannel.sendToQueue('tasks', Buffer.from(JSON.stringify(message)));
}
await confirmChannel.waitForConfirms(); // One round-trip for the batch
2. Prefetch Configuration
// ❌ Bad: No prefetch limit (consumer overwhelmed)
channel.consume('tasks', handler);
// ✅ Good: Limit in-flight messages
channel.prefetch(10); // Process max 10 messages concurrently
channel.consume('tasks', handler);
3. Connection Pooling
// connection-pool.js
class RabbitMQPool {
constructor(url, poolSize = 10) {
this.url = url;
this.poolSize = poolSize;
this.connections = [];
this.channels = [];
}
async init() {
for (let i = 0; i < this.poolSize; i++) {
const connection = await amqp.connect(this.url);
const channel = await connection.createChannel();
this.connections.push(connection);
this.channels.push(channel);
}
}
getChannel() {
// Round-robin channel selection
this.nextIndex = ((this.nextIndex ?? -1) + 1) % this.channels.length;
return this.channels[this.nextIndex];
}
}
const pool = new RabbitMQPool('amqp://localhost', 10);
await pool.init();
// Use pooled channel
const channel = pool.getChannel();
channel.sendToQueue('tasks', Buffer.from(data));
Kafka Performance Tuning
1. Batching and Compression
// kafkajs batches every message passed to a single send() call;
// batch.size and linger.ms are Java-client settings, not kafkajs options
import { CompressionTypes } from 'kafkajs';
await producer.send({
topic: 'orders',
compression: CompressionTypes.GZIP, // Compressed batches are 2-5x smaller
messages: batch // One send() call → one batch on the wire
});
2. Partitioning Strategy
// ❌ Bad: All messages to same partition (no parallelism)
await producer.send({
topic: 'orders',
messages: [{ value: JSON.stringify(order) }]
});
// ✅ Good: Partition by user (related orders in same partition, preserves order)
await producer.send({
topic: 'orders',
messages: [{
key: order.userId, // Same user → same partition
value: JSON.stringify(order)
}]
});
// ✅ Good: Custom partitioner for load balancing
const hashCode = (str) =>
Math.abs([...str].reduce((h, c) => (h * 31 + c.charCodeAt(0)) | 0, 0));
const producer = kafka.producer({
createPartitioner: () => {
return ({ topic, partitionMetadata, message }) => {
// Custom logic: Hash the message key onto an available partition
return hashCode(message.key.toString()) % partitionMetadata.length;
};
}
});
3. Consumer Parallelism
Topic: orders (12 partitions)
Consumer Group: processors (12 consumers)
Max parallelism: 12 concurrent consumers
Adding more consumers won't increase parallelism
→ Increase partitions instead
Monitoring and Observability
RabbitMQ Metrics
// Monitor queue depth and consumer count
const getQueueStats = async (queueName) => {
const stats = await channel.assertQueue(queueName, { durable: true });
console.log({
queue: queueName,
messages: stats.messageCount, // Messages in queue
consumers: stats.consumerCount, // Active consumers
messagesPerConsumer: stats.messageCount / (stats.consumerCount || 1)
});
// Alert if queue backing up
if (stats.messageCount > 10000) {
await alertOps(`Queue ${queueName} has ${stats.messageCount} messages`);
}
};
setInterval(() => getQueueStats('orders'), 60000); // Check every minute
Kafka Monitoring
// Monitor consumer lag
const admin = kafka.admin();
await admin.connect();
// Committed offsets for the consumer group
const groupOffsets = await admin.fetchOffsets({
groupId: 'order-processors',
topics: ['orders']
});
// Latest (high-watermark) offsets for the topic
const latestOffsets = await admin.fetchTopicOffsets('orders');
for (const topic of groupOffsets) {
for (const partition of topic.partitions) {
const latest = latestOffsets.find(p => p.partition === partition.partition);
const lag = parseInt(latest.high) - parseInt(partition.offset);
console.log({
topic: topic.topic,
partition: partition.partition,
consumerOffset: partition.offset,
highWaterMark: latest.high,
lag // Messages behind
});
if (lag > 100000) {
await alertOps(`Consumer lag: ${lag} messages on partition ${partition.partition}`);
}
}
}
Real-World Examples
Uber: 100 Trillion Kafka Messages/Year
Architecture:
- 4,000+ microservices
- 1,000+ Kafka clusters
- 100 trillion messages annually
- Sub-10ms P99 latency
Use Cases:
- Trip events (request, accept, start, complete)
- Real-time location updates
- Surge pricing calculations
- Driver matching
Key Patterns:
- Strict message ordering per rider/driver (partition by user ID)
- Exactly-once processing for billing
- Real-time stream processing for pricing
Robinhood: RabbitMQ for Financial Transactions
Architecture:
- RabbitMQ for order execution
- Guaranteed delivery for trades
- <10ms latency requirements
Use Cases:
- Stock trade orders
- Account deposits/withdrawals
- Margin call notifications
Key Patterns:
- Priority queues (market orders > limit orders)
- Dead letter queues for failed trades
- Idempotent processing (duplicate trades prevented)
Conclusion
Message queues are essential for building scalable, resilient distributed systems. Key takeaways:
RabbitMQ Best Practices:
- Use durable queues and persistent messages for reliability
- Implement dead letter queues for failed messages
- Configure prefetch limits to prevent consumer overload
- Use topic exchanges for flexible routing
Kafka Best Practices:
- Enable idempotence for exactly-once delivery
- Partition strategically (by user ID, tenant ID, etc.)
- Monitor consumer lag and rebalance as needed
- Use compression for high-throughput scenarios
General Best Practices:
- Design for idempotency (handle duplicate messages safely)
- Implement exponential backoff for retries
- Monitor queue depth and consumer lag
- Use structured logging with trace IDs for debugging
Choose RabbitMQ for complex routing and low latency, Kafka for high throughput and event streaming. Both excel when configured correctly for your specific use case.
Written by StaticBlock
StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.