Message Queues and Event Streaming - RabbitMQ and Kafka Patterns for Production Systems
Message queues are the backbone of scalable, resilient distributed systems. They decouple services, absorb traffic spikes, and guarantee delivery when downstream components fail. Uber pushes 100 trillion Kafka messages per year across thousands of microservices, while Robinhood relies on RabbitMQ for guaranteed, low-latency trade execution.
This guide covers why message queues matter, when to choose RabbitMQ versus Kafka, production patterns (work queues, pub/sub, topic routing, dead letter queues, retries with exponential backoff), delivery guarantees, performance tuning, and monitoring for queue-based systems.
Why Message Queues?
Message queues solve critical distributed systems challenges:
1. Decoupling Services
// Without Message Queue (Tight Coupling)
// Order service directly calls inventory service
async function createOrder(order) {
await inventoryService.reserveStock(order.items); // Fails if inventory service down
await paymentService.charge(order.payment); // Fails if payment service down
await emailService.sendConfirmation(order.email); // Blocks on email sending
return { orderId: order.id, status: 'completed' };
}
// Problem: All services must be available simultaneously
// Order creation fails if ANY downstream service is unavailable
// With Message Queue (Loose Coupling)
async function createOrder(order) {
// Publish events, don't wait for processing
await queue.publish('order.created', order);
return { orderId: order.id, status: 'pending' };
}
// Separate consumers handle each responsibility
queue.consume('order.created', async (order) => {
await inventoryService.reserveStock(order.items);
});
queue.consume('order.created', async (order) => {
await paymentService.charge(order.payment);
});
queue.consume('order.created', async (order) => {
await emailService.sendConfirmation(order.email);
});
// Benefits:
// - Order creation succeeds even if downstream services are down
// - Each service processes independently at its own pace
// - Failed operations can retry without affecting others
2. Load Leveling
// Without Queue: Traffic spike overwhelms system
// 10,000 requests/second → Database crashes
// With Queue: Absorbs spike, processes at sustainable rate
// 10,000 requests/second → Queue buffers → Process 1,000/second
// Queue depth grows temporarily, then drains when traffic subsides
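The comments above can be made concrete with a toy drain simulation (illustrative numbers only, matching the 10,000-in / 1,000-per-second figures):

```javascript
// Simulate load leveling: a burst of 10,000 jobs arrives at once and
// workers drain the queue at a sustainable 1,000 jobs per tick.
const simulateDrain = (burst, ratePerTick) => {
  const depths = [];
  let depth = burst;
  while (depth > 0) {
    depth = Math.max(0, depth - ratePerTick);
    depths.push(depth); // Queue depth after each tick
  }
  return depths;
};

console.log(simulateDrain(10000, 1000));
// → [9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000, 0]
```

The queue depth climbs during the burst, then drains to zero once traffic subsides — no component ever sees more than the sustainable rate.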
3. Guaranteed Delivery
// Without Queue: Message lost if recipient is down
await downstreamService.notify(data); // Throws error, data lost
// With Queue: Message persisted until successfully processed
await queue.publish('notifications', data); // Persisted to disk
// Consumer processes when available, retries on failure
RabbitMQ vs Kafka: When to Use Each
| Feature | RabbitMQ | Kafka |
|---|---|---|
| Architecture | Traditional message broker | Distributed commit log |
| Message Ordering | Per-queue ordering | Per-partition ordering (strict) |
| Message Retention | Deleted after acknowledgment | Retained for configured period (7 days default) |
| Throughput | 50K-100K messages/sec | 1M+ messages/sec |
| Latency | <10ms | 5-15ms |
| Use Cases | Task queues, RPC, routing | Event streaming, logs, analytics |
| Message Size | <128 KB optimal | <1 MB optimal |
| Consumer Model | Push (broker pushes to consumers) | Pull (consumers pull from broker) |
Choose RabbitMQ for:
- Task/work queues with complex routing
- RPC (request-reply) patterns
- Low-latency message delivery (<10ms)
- Per-message acknowledgments
- Priority queues
Choose Kafka for:
- Event streaming and event sourcing
- High-throughput data pipelines (>100K msg/sec)
- Log aggregation
- Real-time analytics
- Message replay (consumers can re-read old messages)
RabbitMQ Production Patterns
Work Queue Pattern
Use Case: Distribute time-consuming tasks among multiple workers
// producer.js - Send tasks to queue
import amqp from 'amqplib';
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
const queue = 'tasks';
await channel.assertQueue(queue, {
durable: true // Queue survives broker restart
});
// Send task
const task = {
type: 'image-resize',
imageUrl: 'https://example.com/image.jpg',
sizes: [300, 600, 1200]
};
channel.sendToQueue(queue, Buffer.from(JSON.stringify(task)), {
persistent: true // Message survives broker restart
});
console.log('Task sent:', task);
// worker.js - Process tasks from queue
import amqp from 'amqplib';
import sharp from 'sharp';
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
const queue = 'tasks';
await channel.assertQueue(queue, { durable: true });
// Prefetch 1 message at a time (fair dispatch)
channel.prefetch(1);
console.log('Worker waiting for tasks...');
channel.consume(queue, async (msg) => {
const task = JSON.parse(msg.content.toString());
console.log('Processing task:', task.type);
try {
// Simulate image processing
for (const size of task.sizes) {
await sharp(task.imageUrl)
.resize(size, size)
.toFile(`output-${size}.jpg`);
}
// Acknowledge successful processing
channel.ack(msg);
console.log('Task completed');
} catch (error) {
console.error('Task failed:', error);
// Reject and requeue (retry)
channel.nack(msg, false, true);
}
}, {
noAck: false // Manual acknowledgment required
});
Production Configuration:
// Ensure durability and fair dispatch
await channel.assertQueue('tasks', {
durable: true, // Queue survives restart
arguments: {
'x-max-priority': 10 // Enable priority queue (0-10)
}
});
channel.prefetch(1); // Process 1 message at a time (prevents overwhelming workers)
// Send high-priority task
channel.sendToQueue('tasks', Buffer.from(JSON.stringify(urgentTask)), {
persistent: true,
priority: 9 // Higher priority processed first
});
Pub/Sub Pattern (Fanout Exchange)
Use Case: Broadcast messages to multiple consumers
// publisher.js - Broadcast events
const exchange = 'events';
await channel.assertExchange(exchange, 'fanout', {
durable: true
});
// Publish user registration event
const event = {
type: 'user.registered',
userId: 'user-123',
email: 'alice@example.com',
timestamp: Date.now()
};
channel.publish(exchange, '', Buffer.from(JSON.stringify(event)), {
persistent: true
});
console.log('Event published:', event.type);
// subscriber-email.js - Send welcome email
const exchange = 'events';
const queue = 'email-service';
await channel.assertExchange(exchange, 'fanout', { durable: true });
await channel.assertQueue(queue, { durable: true });
await channel.bindQueue(queue, exchange, ''); // Bind to exchange
channel.consume(queue, async (msg) => {
const event = JSON.parse(msg.content.toString());
if (event.type === 'user.registered') {
await sendWelcomeEmail(event.email);
channel.ack(msg);
}
});
// subscriber-analytics.js - Track registration
const exchange = 'events';
const queue = 'analytics-service';
await channel.assertExchange(exchange, 'fanout', { durable: true });
await channel.assertQueue(queue, { durable: true });
await channel.bindQueue(queue, exchange, '');
channel.consume(queue, async (msg) => {
const event = JSON.parse(msg.content.toString());
if (event.type === 'user.registered') {
await analytics.track('User Registered', event);
channel.ack(msg);
}
});
Result: Single event published → Multiple independent consumers process it simultaneously
Topic Exchange (Routing Pattern)
Use Case: Route messages based on routing keys (e.g., log levels, event types)
// publisher.js - Publish logs with routing keys
const exchange = 'logs';
await channel.assertExchange(exchange, 'topic', { durable: true });
// Routing keys follow pattern: <service>.<level>
channel.publish(exchange, 'auth.error', Buffer.from('Authentication failed'));
channel.publish(exchange, 'auth.info', Buffer.from('User logged in'));
channel.publish(exchange, 'payment.error', Buffer.from('Payment declined'));
channel.publish(exchange, 'payment.info', Buffer.from('Payment successful'));
// consumer-errors.js - Subscribe to all errors
const exchange = 'logs';
const queue = 'error-handler';
await channel.assertExchange(exchange, 'topic', { durable: true });
await channel.assertQueue(queue, { durable: true });
// Routing pattern: *.error (all error logs from any service)
await channel.bindQueue(queue, exchange, '*.error');
channel.consume(queue, async (msg) => {
const routingKey = msg.fields.routingKey;
const log = msg.content.toString();
console.log(`[ERROR] ${routingKey}: ${log}`);
// Alert team on errors
await alertTeam(`Error in ${routingKey.split('.')[0]}: ${log}`);
channel.ack(msg);
});
// consumer-auth-logs.js - Subscribe to all auth logs
const queue = 'auth-logger';
await channel.bindQueue(queue, exchange, 'auth.*'); // All auth logs
channel.consume(queue, async (msg) => {
const routingKey = msg.fields.routingKey;
const log = msg.content.toString();
await saveToDatabase({
service: 'auth',
level: routingKey.split('.')[1],
message: log,
timestamp: Date.now()
});
channel.ack(msg);
});
Routing Key Patterns:
- `*` matches exactly one word
- `#` matches zero or more words
Examples:
- `auth.*` matches `auth.error`, `auth.info`
- `*.error` matches `auth.error`, `payment.error`
- `auth.#` matches `auth.error`, `auth.user.login`, `auth.user.logout`
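A small matcher makes these semantics concrete. This is an illustrative sketch, not RabbitMQ's actual implementation:

```javascript
// Check whether a routing key matches an AMQP-style binding pattern.
// '*' matches exactly one word; '#' matches zero or more words.
const matches = (pattern, key) => {
  const match = (p, k) => {
    if (p.length === 0) return k.length === 0;
    if (p[0] === '#') {
      // '#' absorbs zero or more words
      return match(p.slice(1), k) || (k.length > 0 && match(p, k.slice(1)));
    }
    if (k.length === 0) return false;
    if (p[0] === '*' || p[0] === k[0]) return match(p.slice(1), k.slice(1));
    return false;
  };
  return match(pattern.split('.'), key.split('.'));
};

console.log(matches('*.error', 'auth.error'));      // → true
console.log(matches('auth.#', 'auth.user.login'));  // → true
console.log(matches('*.error', 'auth.user.error')); // → false ('*' is one word)
```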
Dead Letter Queue (DLQ)
Use Case: Handle messages that fail processing after multiple retries
// Setup main queue with DLQ
const mainQueue = 'orders';
const dlqExchange = 'dlx';
const dlq = 'orders-dlq';
// Create DLQ
await channel.assertExchange(dlqExchange, 'fanout', { durable: true });
await channel.assertQueue(dlq, { durable: true });
await channel.bindQueue(dlq, dlqExchange, '');
// Create main queue with DLX
await channel.assertQueue(mainQueue, {
durable: true,
arguments: {
'x-dead-letter-exchange': dlqExchange, // Send failed messages here
'x-message-ttl': 60000 // 60 second TTL
}
});
// Consumer for main queue
channel.consume(mainQueue, async (msg) => {
const order = JSON.parse(msg.content.toString());
try {
await processOrder(order);
channel.ack(msg);
} catch (error) {
console.error('Processing failed:', error);
// Reject without requeue → Goes to DLQ
channel.nack(msg, false, false);
}
});
// Monitor DLQ for failed orders
channel.consume(dlq, async (msg) => {
const failedOrder = JSON.parse(msg.content.toString());
const failureCount = (msg.properties.headers['x-death']?.[0]?.count || 0);
console.error(`Order ${failedOrder.id} failed ${failureCount} times`);
// Alert team
await alertOps(`Order processing failed: ${failedOrder.id}`);
// Store for manual review
await saveFailedOrder(failedOrder, {
failureCount,
originalQueue: msg.properties.headers['x-death']?.[0]?.queue,
reason: msg.properties.headers['x-first-death-reason']
});
channel.ack(msg);
});
Retry with Exponential Backoff
// Retry queue with increasing delays
const createRetryQueue = async (attempt) => {
const delay = Math.min(1000 * Math.pow(2, attempt), 60000); // Cap at 60s
const queueName = `orders-retry-${attempt}`;
await channel.assertQueue(queueName, {
durable: true,
arguments: {
'x-dead-letter-exchange': 'orders-exchange',
'x-dead-letter-routing-key': 'orders',
'x-message-ttl': delay // Delay before retry
}
});
return queueName;
};
// Main consumer with retry logic
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
const retryCount = (msg.properties.headers?.['x-retry-count'] || 0);
try {
await processOrder(order);
channel.ack(msg);
} catch (error) {
if (retryCount < 5) {
// Retry with exponential backoff
const retryQueue = await createRetryQueue(retryCount);
channel.sendToQueue(retryQueue, msg.content, {
persistent: true,
headers: {
'x-retry-count': retryCount + 1
}
});
channel.ack(msg); // Remove from main queue
console.log(`Order ${order.id} retrying (attempt ${retryCount + 1})`);
} else {
// Max retries exceeded → DLQ
channel.nack(msg, false, false);
console.error(`Order ${order.id} failed after 5 retries`);
}
}
});
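For reference, the schedule the backoff formula above produces for attempts 0 through 6:

```javascript
// Delay before each retry attempt: doubles each time, capped at 60 seconds
const backoffDelay = (attempt) => Math.min(1000 * Math.pow(2, attempt), 60000);

console.log([0, 1, 2, 3, 4, 5, 6].map(backoffDelay));
// → [1000, 2000, 4000, 8000, 16000, 32000, 60000]
```

The cap matters: without it, attempt 10 would wait over 17 minutes, leaving messages parked far longer than most SLAs allow.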
Kafka Production Patterns
Producer with Idempotence and Transactions
// producer.js - Idempotent producer
import { Kafka } from 'kafkajs';
const kafka = new Kafka({
clientId: 'order-service',
brokers: ['localhost:9092']
});
const producer = kafka.producer({
idempotent: true, // Exactly-once semantics per partition
maxInFlightRequests: 1, // kafkajs requires 1 when idempotence is enabled
transactionalId: 'order-transactions'
});
await producer.connect();
// Send a keyed message (messages with the same key land on the same partition)
await producer.send({
topic: 'orders',
messages: [
{
key: 'order-123', // Messages with same key go to same partition
value: JSON.stringify({
orderId: 'order-123',
userId: 'user-456',
total: 99.99,
items: [...]
}),
headers: {
'trace-id': 'abc-123',
'timestamp': Date.now().toString()
}
}
]
});
// Transactional writes (all-or-nothing)
const transaction = await producer.transaction();
try {
await transaction.send({
topic: 'orders',
messages: [{ key: 'order-123', value: orderData }]
});
await transaction.send({
topic: 'inventory',
messages: [{ key: 'item-789', value: inventoryUpdate }]
});
await transaction.commit(); // Both succeed or both fail
} catch (error) {
await transaction.abort();
console.error('Transaction failed:', error);
}
Consumer Groups with Offset Management
// consumer.js - Consumer group
const consumer = kafka.consumer({
groupId: 'order-processing-group',
sessionTimeout: 30000,
heartbeatInterval: 3000
});
await consumer.connect();
await consumer.subscribe({
topic: 'orders',
fromBeginning: false // Start from latest
});
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const order = JSON.parse(message.value.toString());
console.log({
partition,
offset: message.offset,
order: order.orderId
});
try {
await processOrder(order);
// Commit offset manually after successful processing
await consumer.commitOffsets([{
topic,
partition,
offset: (parseInt(message.offset) + 1).toString()
}]);
} catch (error) {
console.error('Processing failed:', error);
// Don't commit offset → Will retry from this message
}
}
});
Consumer Group Behavior:
- Multiple consumers in same group share partitions
- Each partition consumed by exactly one consumer in group
- If consumer fails, partition reassigned to another consumer
Topic: orders (3 partitions)
Consumer Group: order-processors (2 consumers)
Consumer A: [Partition 0, Partition 1]
Consumer B: [Partition 2]
If Consumer A fails:
Consumer B: [Partition 0, Partition 1, Partition 2]
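The reassignment sketched above can be illustrated with a toy assignor. This is a simplification — real Kafka clients negotiate range, round-robin, or sticky assignment at rebalance — but it shows the even spread and the failover:

```javascript
// Spread partitions as evenly as possible across live consumers in a group
const assignPartitions = (partitions, consumers) => {
  const assignment = Object.fromEntries(consumers.map(c => [c, []]));
  partitions.forEach((p, i) => {
    // Round-robin: partition i goes to consumer i mod N
    assignment[consumers[i % consumers.length]].push(p);
  });
  return assignment;
};

console.log(assignPartitions([0, 1, 2], ['A', 'B']));
// → { A: [ 0, 2 ], B: [ 1 ] }

// After consumer A fails, a rebalance reassigns its partitions:
console.log(assignPartitions([0, 1, 2], ['B']));
// → { B: [ 0, 1, 2 ] }
```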
Kafka Streams for Real-Time Processing
// stream-processor.js - Real-time order analytics
import { Kafka } from 'kafkajs';
const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'analytics-group' });
const producer = kafka.producer(); // Publishes the aggregated metrics below
await consumer.connect();
await producer.connect();
await consumer.subscribe({ topic: 'orders' });
// In-memory aggregation (windowed)
const orderStats = new Map();
await consumer.run({
eachMessage: async ({ message }) => {
const order = JSON.parse(message.value.toString());
const hour = new Date(order.timestamp).setMinutes(0, 0, 0);
// Update stats for current hour
const stats = orderStats.get(hour) || { count: 0, total: 0 };
stats.count += 1;
stats.total += order.total;
orderStats.set(hour, stats);
// Emit aggregated stats
if (stats.count % 100 === 0) {
console.log(`Hour ${new Date(hour).toISOString()}:`, {
orders: stats.count,
revenue: stats.total,
averageOrderValue: stats.total / stats.count
});
// Publish aggregated metrics
await producer.send({
topic: 'order-metrics',
messages: [{
value: JSON.stringify({
window: hour,
metrics: stats
})
}]
});
}
}
});
Guaranteed Delivery Patterns
At-Least-Once Delivery
Pattern: Message delivered at least once, possibly multiple times (requires idempotent processing)
// RabbitMQ: Manual acknowledgment
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
try {
await processOrder(order); // Must be idempotent
channel.ack(msg); // Only ack after successful processing
} catch (error) {
// Don't ack → Message redelivered
console.error('Failed, will retry:', error);
}
});
// Kafka: Commit offset after processing
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
await processOrder(JSON.parse(message.value.toString()));
// Only commit offset after successful processing
await consumer.commitOffsets([{
topic,
partition,
offset: (parseInt(message.offset) + 1).toString()
}]);
}
});
Exactly-Once Delivery (Kafka)
Pattern: Message delivered exactly once using idempotent producers and transactional consumers
// Producer: Enable idempotence
const producer = kafka.producer({
idempotent: true,
transactionalId: 'my-transactional-producer'
});
// Consumer: Process with transactions
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const transaction = await producer.transaction();
try {
// Process message
const result = await processOrder(JSON.parse(message.value.toString()));
// Write result to output topic
await transaction.send({
topic: 'order-results',
messages: [{ value: JSON.stringify(result) }]
});
// Commit input offset
await transaction.sendOffsets({
consumerGroupId: 'order-processors',
topics: [{
topic,
partitions: [{
partition,
offset: (parseInt(message.offset) + 1).toString()
}]
}]
});
await transaction.commit();
} catch (error) {
await transaction.abort();
throw error;
}
}
});
Idempotency with Deduplication
// idempotency.js - Deduplicate messages using Redis
import Redis from 'ioredis';
const redis = new Redis();
const processIdempotent = async (messageId, processFn) => {
const key = `processed:${messageId}`;
// Check if already processed
const exists = await redis.get(key);
if (exists) {
console.log('Message already processed:', messageId);
return JSON.parse(exists);
}
// Process message
const result = await processFn();
// Store result with 24h TTL
await redis.setex(key, 86400, JSON.stringify(result));
return result;
};
// Usage in consumer
channel.consume('orders', async (msg) => {
const order = JSON.parse(msg.content.toString());
const messageId = msg.properties.messageId;
await processIdempotent(messageId, async () => {
return await processOrder(order);
});
channel.ack(msg);
});
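One caveat: the get-then-setex sequence above has a race window when two consumers receive the same message at once — both can pass the `get` check before either writes. An atomic claim closes it. Here is a minimal in-memory sketch of the flow; a real deployment would replace the Map with a single Redis `SET key value NX EX 86400`, which makes the check and the claim one atomic step:

```javascript
// In-memory stand-in for Redis (illustration only)
const store = new Map();

// Atomically claim a message ID; returns false if already claimed
const claim = (messageId) => {
  if (store.has(messageId)) return false;
  store.set(messageId, 'pending');
  return true;
};

const processOnce = async (messageId, processFn) => {
  if (!claim(messageId)) return 'duplicate'; // Another worker got here first
  const result = await processFn();
  store.set(messageId, result); // Cache the result for replayed deliveries
  return result;
};
```

Called twice with the same `messageId`, the second call short-circuits without re-running `processFn`.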
Performance Optimization
RabbitMQ Performance Tuning
1. Batch Publishing
// ❌ Slow: Confirm each publish individually
const confirmChannel = await connection.createConfirmChannel();
for (const message of messages) {
confirmChannel.sendToQueue('tasks', Buffer.from(JSON.stringify(message)));
await confirmChannel.waitForConfirms(); // One round-trip per message
}
// ✅ Fast: Publish the whole batch, then confirm once
for (const message of messages) {
confirmChannel.sendToQueue('tasks', Buffer.from(JSON.stringify(message)));
}
await confirmChannel.waitForConfirms(); // One round-trip for the batch
2. Prefetch Configuration
// ❌ Bad: No prefetch limit (consumer overwhelmed)
channel.consume('tasks', handler);
// ✅ Good: Limit in-flight messages
channel.prefetch(10); // Process max 10 messages concurrently
channel.consume('tasks', handler);
3. Connection Pooling
// connection-pool.js
class RabbitMQPool {
constructor(url, poolSize = 10) {
this.url = url;
this.poolSize = poolSize;
this.connections = [];
this.channels = [];
}
async init() {
for (let i = 0; i < this.poolSize; i++) {
const connection = await amqp.connect(this.url);
const channel = await connection.createChannel();
this.connections.push(connection);
this.channels.push(channel);
}
}
getChannel() {
// Round-robin channel selection
this.nextIndex = ((this.nextIndex ?? -1) + 1) % this.channels.length;
return this.channels[this.nextIndex];
}
}
const pool = new RabbitMQPool('amqp://localhost', 10);
await pool.init();
// Use pooled channel
const channel = pool.getChannel();
channel.sendToQueue('tasks', Buffer.from(data));
Kafka Performance Tuning
1. Batching and Compression
// kafkajs batches every message passed to a single send() call;
// batch.size and linger.ms are Java-client settings, not kafkajs options
import { CompressionTypes } from 'kafkajs';
await producer.send({
topic: 'orders',
compression: CompressionTypes.GZIP, // Compressed batches are 2-5x smaller
messages: batch // One send() call → one batch on the wire
});
2. Partitioning Strategy
// ❌ Bad: All messages to same partition (no parallelism)
await producer.send({
topic: 'orders',
messages: [{ value: JSON.stringify(order) }]
});
// ✅ Good: Partition by user (related orders in same partition, preserves order)
await producer.send({
topic: 'orders',
messages: [{
key: order.userId, // Same user → same partition
value: JSON.stringify(order)
}]
});
// ✅ Good: Custom partitioner for load balancing
const hashCode = (str) =>
Math.abs([...str].reduce((h, c) => (h * 31 + c.charCodeAt(0)) | 0, 0));
const producer = kafka.producer({
createPartitioner: () => {
return ({ topic, partitionMetadata, message }) => {
// Custom logic: Hash the message key onto an available partition
return hashCode(message.key.toString()) % partitionMetadata.length;
};
}
});
3. Consumer Parallelism
Topic: orders (12 partitions)
Consumer Group: processors (12 consumers)
Max parallelism: 12 concurrent consumers
Adding more consumers won't increase parallelism
→ Increase partitions instead
Monitoring and Observability
RabbitMQ Metrics
// Monitor queue depth and consumer count
const getQueueStats = async (queueName) => {
const stats = await channel.assertQueue(queueName, { durable: true });
console.log({
queue: queueName,
messages: stats.messageCount, // Messages in queue
consumers: stats.consumerCount, // Active consumers
messagesPerConsumer: stats.messageCount / (stats.consumerCount || 1)
});
// Alert if queue backing up
if (stats.messageCount > 10000) {
await alertOps(`Queue ${queueName} has ${stats.messageCount} messages`);
}
};
setInterval(() => getQueueStats('orders'), 60000); // Check every minute
Kafka Monitoring
// Monitor consumer lag
const admin = kafka.admin();
await admin.connect();
// Committed offsets for the consumer group
const groupOffsets = await admin.fetchOffsets({
groupId: 'order-processors',
topics: ['orders']
});
// Latest (high-watermark) offsets for the topic
const latestOffsets = await admin.fetchTopicOffsets('orders');
for (const topic of groupOffsets) {
for (const partition of topic.partitions) {
const latest = latestOffsets.find(p => p.partition === partition.partition);
const lag = parseInt(latest.high) - parseInt(partition.offset);
console.log({
topic: topic.topic,
partition: partition.partition,
consumerOffset: partition.offset,
highWaterMark: latest.high,
lag // Messages behind
});
if (lag > 100000) {
await alertOps(`Consumer lag: ${lag} messages on partition ${partition.partition}`);
}
}
}
Real-World Examples
Uber: 100 Trillion Kafka Messages/Year
Architecture:
- 4,000+ microservices
- 1,000+ Kafka clusters
- 100 trillion messages annually
- Sub-10ms P99 latency
Use Cases:
- Trip events (request, accept, start, complete)
- Real-time location updates
- Surge pricing calculations
- Driver matching
Key Patterns:
- Strict message ordering per rider/driver (partition by user ID)
- Exactly-once processing for billing
- Real-time stream processing for pricing
Robinhood: RabbitMQ for Financial Transactions
Architecture:
- RabbitMQ for order execution
- Guaranteed delivery for trades
- <10ms latency requirements
Use Cases:
- Stock trade orders
- Account deposits/withdrawals
- Margin call notifications
Key Patterns:
- Priority queues (market orders > limit orders)
- Dead letter queues for failed trades
- Idempotent processing (duplicate trades prevented)
Conclusion
Message queues are essential for building scalable, resilient distributed systems. Key takeaways:
RabbitMQ Best Practices:
- Use durable queues and persistent messages for reliability
- Implement dead letter queues for failed messages
- Configure prefetch limits to prevent consumer overload
- Use topic exchanges for flexible routing
Kafka Best Practices:
- Enable idempotence for exactly-once delivery
- Partition strategically (by user ID, tenant ID, etc.)
- Monitor consumer lag and rebalance as needed
- Use compression for high-throughput scenarios
General Best Practices:
- Design for idempotency (handle duplicate messages safely)
- Implement exponential backoff for retries
- Monitor queue depth and consumer lag
- Use structured logging with trace IDs for debugging
Choose RabbitMQ for complex routing and low latency, Kafka for high throughput and event streaming. Both excel when configured correctly for your specific use case.
Written by StaticBlock
StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.