Explain how you would design a fault-tolerant, eventually-consistent service for processing high-volume events (millions/day). Include choices for data storage, messaging, idempotency, and monitoring.

Question

VirtualInterview.AI · Accepted Answer

For millions of events per day I’d build a pipeline around durable, partitioned messaging (Kafka) to absorb bursts and allow replay. Consumers are stateless, horizontally scaled workers that write current state to a highly available primary store (Cassandra or DynamoDB) and every event to an append-only event store for audit. Delivery is at-least-once, so I’d enforce idempotency in the handlers: each event carries a client-generated UUID, and a dedupe table with TTLs records the IDs already processed. For fault tolerance I’d rely on consumer groups, topic partitioning, and multi-AZ clusters. Monitoring covers consumer lag, processing-latency histograms, and end-to-end latency against the SLA, with SLO-driven alerts; we also track business-level KPIs and automate replays from Kafka. In a previous role this architecture handled 5M events/day with 99.95% availability and sub-second median processing.
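The idempotency part of the answer can be illustrated with a minimal sketch. It assumes an in-memory dictionary standing in for the dedupe table; the names `DedupeTable`, `Event`, and `BalanceHandler` are hypothetical and are not part of Kafka, Cassandra, or DynamoDB. The handler applies an event only if its client-generated ID has not been claimed within the TTL, so a redelivered message has no second effect.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DedupeTable:
    """In-memory stand-in (hypothetical) for a dedupe table with per-key TTL."""
    ttl_seconds: float
    clock: Callable[[], float] = time.monotonic
    _expires_at: Dict[str, float] = field(default_factory=dict, init=False)

    def claim(self, event_id: str) -> bool:
        """Return True if event_id is new or its entry has expired, else False."""
        now = self.clock()
        expiry = self._expires_at.get(event_id)
        if expiry is not None and expiry > now:
            return False  # already processed within the TTL window
        self._expires_at[event_id] = now + self.ttl_seconds
        return True


@dataclass
class Event:
    event_id: str  # client-generated UUID in the design described above
    amount: int


class BalanceHandler:
    """Example handler whose side effect must not be applied twice."""

    def __init__(self, dedupe: DedupeTable) -> None:
        self.dedupe = dedupe
        self.balance = 0

    def handle(self, event: Event) -> bool:
        """Apply the event once; return False for a duplicate delivery."""
        if not self.dedupe.claim(event.event_id):
            return False
        self.balance += event.amount
        return True
```

In a real deployment the claim would have to be a conditional write in the store itself (for example a DynamoDB put with an `attribute_not_exists` condition, or a Cassandra `INSERT ... IF NOT EXISTS`), and the claim and the side effect would need to be made atomic or ordered so that a crash between them cannot lose the event. The TTL should exceed the longest window in which a redelivery or replay can occur.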

Sample Answer
