Event-driven architecture is one of those patterns that sounds simple but gets complex fast. After running Kafka in production for years, here's what I've learned about doing it right.
Why Events?
Traditional request-response architectures create tight coupling:
Order Service → Payment Service → Inventory Service → Notification Service
If any service is down, the whole chain breaks. Events decouple this:
Order Service → [OrderCreated] → Kafka
                                 ├→ Payment Service
                                 ├→ Inventory Service
                                 └→ Notification Service
Each consumer processes independently. If notifications are down, orders still process.
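To make the decoupling concrete, here is a minimal in-process sketch of the fan-out. It is not Kafka: `EventBus`, `subscribe`, and `publish` are hypothetical names chosen for illustration, and a real broker would also persist the event so that a failed consumer can catch up later.

```typescript
// Minimal in-memory fan-out. A stand-in for a broker, not Kafka itself.
interface OrderCreatedEvent {
  orderId: string;
}

interface Subscriber<E> {
  name: string;
  handle: (event: E) => void;
}

class EventBus<E> {
  private subscribers: Subscriber<E>[] = [];

  subscribe(name: string, handle: (event: E) => void): void {
    this.subscribers.push({ name, handle });
  }

  // Deliver to every subscriber. A failure in one does not stop the others.
  // Returns the names of the subscribers that failed.
  publish(event: E): string[] {
    const failed: string[] = [];
    this.subscribers.forEach((s) => {
      try {
        s.handle(event);
      } catch (err) {
        failed.push(s.name);
      }
    });
    return failed;
  }
}

const bus = new EventBus<OrderCreatedEvent>();
const handled: string[] = [];

// The failing consumer is registered first to show it does not block the rest.
bus.subscribe("notification", () => {
  throw new Error("notification service is down");
});
bus.subscribe("payment", (e) => {
  handled.push(`payment:${e.orderId}`);
});
bus.subscribe("inventory", (e) => {
  handled.push(`inventory:${e.orderId}`);
});

const failedConsumers = bus.publish({ orderId: "o-1" });
```

Here `failedConsumers` contains only the notification subscriber, while payment and inventory have both recorded the order. In the request-response chain, the same outage would have failed the whole request.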
Event Schema Design
The schema is your contract. Get it wrong, and you'll pay for it forever:
// Bad: Anemic event
interface OrderCreated {
  orderId: string;
  timestamp: string;
}
// Good: Rich event with all context needed for consumers
interface OrderCreated {
  // Metadata
  eventId: string; // Idempotency key
  eventType: "order.created";
  version: 1;
  timestamp: string; // ISO 8601
  source: "order-service";
  correlationId: string; // Request tracing
  // Payload
  data: {
    orderId: string;
    customerId: string;
    items: Array<{
      productId: string;
      quantity: number;
      unitPrice: number;
      currency: string;
    }>;
    totalAmount: number;
    currency: string;
    shippingAddress: Address;
    metadata: Record<string, string>;
  };
}
Topic Design
Topics are your event streams. Design them thoughtfully:
// By domain aggregate (recommended for most cases)
orders.events → All order lifecycle events
payments.events → Payment processing events
inventory.events → Stock level changes
// By event type (when consumers need specific events)
order.created → Only creation events
order.shipped → Only shipment events
// By priority
notifications.high → Urgent alerts
notifications.low → Digest emails
Producer: Exactly-Once Semantics
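Kafka 0.11+ supports idempotent producers, which prevent duplicate writes when a send is retried, and transactions, which make writes to several topics atomic. Enable idempotence on every producer: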
import { Kafka, Partitioners } from "kafkajs";
import { v4 as uuid } from "uuid";
// Order is assumed to be the service's own domain type (id, customerId, total, correlationId)
const kafka = new Kafka({
  clientId: "order-service",
  brokers: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"],
});
const producer = kafka.producer({
  idempotent: true, // Prevent duplicates on retry
  maxInFlightRequests: 1, // The KafkaJS docs use 1 for exactly-once guarantees
  transactionalId: "order-service-tx", // Needed for producer.transaction(); example value, keep it unique per producer instance
  createPartitioner: Partitioners.DefaultPartitioner,
});
await producer.connect();
// Transactional produce: atomic multi-topic writes
async function createOrder(order: Order): Promise<void> {
  const transaction = await producer.transaction();
  try {
    // Write to orders topic
    await transaction.send({
      topic: "orders.events",
      messages: [{
        key: order.id, // Partition by order ID
        value: JSON.stringify({
          eventType: "order.created",
          eventId: uuid(),
          timestamp: new Date().toISOString(),
          data: order,
        }),
        headers: {
          "correlation-id": order.correlationId,
          "content-type": "application/json",
        },
      }],
    });
    // Write to analytics topic atomically
    await transaction.send({
      topic: "analytics.events",
      messages: [{
        key: order.customerId,
        value: JSON.stringify({
          eventType: "customer.ordered",
          data: { customerId: order.customerId, amount: order.total },
        }),
      }],
    });
    await transaction.commit();
  } catch (error) {
    // Roll back both writes so consumers never see a partial result
    await transaction.abort();
    throw error;
  }
}
Consumer: Handling Failures
The consumer is where most complexity lives:
// redis (an ioredis-style client), processPayment, isRetryable, and sendToDeadLetter are assumed to be defined elsewhere
const consumer = kafka.consumer({
  groupId: "payment-processor",
  sessionTimeout: 30000,
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // 1 MiB
});
await consumer.connect();
await consumer.subscribe({ topic: "orders.events", fromBeginning: false });
await consumer.run({
  autoCommit: false, // Manual commits give at-least-once delivery; the idempotency check below makes processing effectively-once
  eachMessage: async ({ topic, partition, message }) => {
    const event = JSON.parse(message.value!.toString());
    // The committed offset is the next offset to read, hence +1
    const nextOffset = (BigInt(message.offset) + 1n).toString();
    // Idempotency check: skip events that were already handled
    const processed = await redis.get(`processed:${event.eventId}`);
    if (processed) {
      await consumer.commitOffsets([{ topic, partition, offset: nextOffset }]);
      return;
    }
    try {
      // Process the event
      await processPayment(event);
      // Mark as processed (with a 7-day TTL for cleanup)
      await redis.set(`processed:${event.eventId}`, "1", "EX", 86400 * 7);
      // Commit offset only after successful processing
      await consumer.commitOffsets([{ topic, partition, offset: nextOffset }]);
    } catch (error) {
      if (isRetryable(error)) {
        throw error; // KafkaJS will retry
      } else {
        // Dead letter queue for permanently failed messages; commit so the partition is not blocked
        await sendToDeadLetter(event, error);
        await consumer.commitOffsets([{ topic, partition, offset: nextOffset }]);
      }
    }
  },
});
Event Sourcing Pattern
Instead of storing current state, store the sequence of events that led to it:
// Event store
interface EventStore {
  append(streamId: string, events: DomainEvent[], expectedVersion: number): Promise<void>;
  read(streamId: string): Promise<DomainEvent[]>;
}
// Rebuild state from events
function rebuildOrder(events: DomainEvent[]): Order {
  return events.reduce((state, event) => {
    switch (event.type) {
      case "OrderCreated":
        return { ...event.data, status: "pending", version: 1 };
      case "PaymentReceived":
        return { ...state, status: "paid", version: state.version + 1 };
      case "OrderShipped":
        return { ...state, status: "shipped", trackingId: event.data.trackingId, version: state.version + 1 };
      case "OrderCancelled":
        return { ...state, status: "cancelled", reason: event.data.reason, version: state.version + 1 };
      default:
        return state;
    }
  }, {} as Order);
}
Monitoring & Observability
You can't run Kafka blind. Key metrics to watch:
| Metric | Alert Threshold | Why |
|---|---|---|
| Consumer lag | > 10,000 messages | Consumers can't keep up |
| Under-replicated partitions | > 0 | Data loss risk |
| Request latency p99 | > 100ms | Broker overload |
| Active controller count | != 1 | No controller (0) or split brain (> 1) |
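Consumer lag is the one metric in this table you usually compute yourself: for each partition it is the log end offset minus the group's committed offset. The sketch below shows that arithmetic against the 10,000-message threshold. `PartitionOffsets`, `partitionLag`, `totalLag`, and `shouldAlert` are hypothetical names, the sample offsets are made up, and summing lag across partitions before comparing is an assumption (you may prefer a per-partition alert).

```typescript
// Hypothetical shape: per-partition offsets as reported by your admin tooling.
// Offsets are plain numbers here, assuming they fit in a safe integer.
interface PartitionOffsets {
  partition: number;
  endOffset: number; // Offset the next produced message will receive
  committedOffset: number; // Next offset the consumer group will read
}

// Lag for one partition: messages written but not yet consumed.
function partitionLag(p: PartitionOffsets): number {
  return Math.max(0, p.endOffset - p.committedOffset);
}

// Lag for the whole consumer group on one topic.
function totalLag(partitions: PartitionOffsets[]): number {
  return partitions.reduce((sum, p) => sum + partitionLag(p), 0);
}

const LAG_ALERT_THRESHOLD = 10000;

function shouldAlert(
  partitions: PartitionOffsets[],
  threshold: number = LAG_ALERT_THRESHOLD,
): boolean {
  return totalLag(partitions) > threshold;
}

// Made-up sample values for illustration.
const sample: PartitionOffsets[] = [
  { partition: 0, endOffset: 120000, committedOffset: 118500 }, // lag 1,500
  { partition: 1, endOffset: 98000, committedOffset: 88000 }, // lag 10,000
  { partition: 2, endOffset: 50000, committedOffset: 50000 }, // lag 0
];

const lag = totalLag(sample); // 11,500
const alert = shouldAlert(sample); // true, because 11,500 > 10,000
```

A single snapshot matters less than the trend: lag that keeps growing means consumers are falling behind, while a spike that drains is usually just a burst of traffic.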
Key Lessons
- Events are facts, not commands. "OrderCreated" not "CreateOrder"
- Schema evolution is hard. Use a schema registry from day one
- Ordering matters. Use partition keys wisely (entity ID, not random)
- Consumers will fail. Build idempotency into every consumer
- Monitor consumer lag. It's your canary in the coal mine
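The ordering lesson is easier to see in code. Kafka only guarantees order within a partition, and a keyed message goes to a partition derived from a hash of its key, so every event for one entity lands in the same partition. The sketch below uses a toy string hash purely for illustration (Kafka's Java client uses murmur2); `partitionFor` and the sample events are hypothetical.

```typescript
// Toy partitioner: a simplified stand-in, not the hash Kafka actually uses.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // Multiply-and-add hash, truncated to an unsigned 32-bit integer
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % numPartitions;
}

const NUM_PARTITIONS = 6;

const sampleEvents = [
  { orderId: "order-42", type: "OrderCreated" },
  { orderId: "order-7", type: "OrderCreated" },
  { orderId: "order-42", type: "PaymentReceived" },
  { orderId: "order-42", type: "OrderShipped" },
];

// Keyed by entity ID: all events for order-42 map to one partition,
// so a consumer reads them in the order they were produced.
const assignments = sampleEvents.map((e) => ({
  orderId: e.orderId,
  type: e.type,
  partition: partitionFor(e.orderId, NUM_PARTITIONS),
}));
```

With a random key, the three `order-42` events could land in different partitions and be consumed out of order, for example `OrderShipped` before `PaymentReceived`.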
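Event-driven systems trade immediate consistency for availability and scalability: consumers see each change only after they process the event, so reads can be briefly stale. Understand the tradeoffs before committing. When done right, they're the backbone of every large-scale system I've built.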