Intermediate
Technical
Explain how you would design a fault-tolerant, eventually-consistent service for processing high-volume events (millions/day). Include choices for data storage, messaging, idempotency, and monitoring.
Backend Developer
General

Sample Answer

For millions of events per day I’d build a pipeline around durable, partitioned messaging (Kafka) to absorb bursts and allow replay. Consumers are horizontally scaled, stateless workers that write to a highly available, write-optimized primary store (Cassandra or DynamoDB) and to an append-only event store for audit. Delivery is at-least-once, so I’d enforce idempotency with client-generated UUIDs as event IDs, a dedupe table with TTLs, and idempotent handlers. For fault tolerance I’d rely on consumer groups, topic partitioning, and multi-AZ clusters. Monitoring covers consumer lag, processing-latency histograms, end-to-end SLA tracking, and SLO-driven alerts; I’d also track business-level KPIs and automate replays. In a previous role this architecture handled 5M events/day with 99.95% availability and sub-second median processing.
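The idempotency piece can be sketched as follows. This is a minimal illustration, not a production implementation: `DedupeTable`, `IdempotentHandler`, and `Event` are hypothetical names, and the in-memory dict stands in for a real dedupe table, where the claim would typically be a conditional write (for example a DynamoDB conditional put or a Cassandra `INSERT ... IF NOT EXISTS`).

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Event:
    event_id: str   # client-generated UUID, used as the dedupe key
    payload: dict


class DedupeTable:
    """In-memory stand-in (assumption) for a dedupe table with per-entry TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expiry: Dict[str, float] = {}

    def claim(self, event_id: str) -> bool:
        """Record event_id and return True if it is new or its entry has expired."""
        now = self.clock()
        expires_at = self._expiry.get(event_id)
        if expires_at is not None and expires_at > now:
            return False  # seen within the TTL window: duplicate
        self._expiry[event_id] = now + self.ttl_seconds
        return True

    def release(self, event_id: str) -> None:
        """Forget event_id so that a redelivery can be processed again."""
        self._expiry.pop(event_id, None)


class IdempotentHandler:
    """Wraps a side-effecting function so redelivered events are applied once."""

    def __init__(self, dedupe: DedupeTable, apply: Callable[[Event], None]):
        self.dedupe = dedupe
        self.apply = apply

    def handle(self, event: Event) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if not self.dedupe.claim(event.event_id):
            return False
        try:
            self.apply(event)
        except Exception:
            # Release the claim so the at-least-once redelivery is not dropped.
            self.dedupe.release(event.event_id)
            raise
        return True


if __name__ == "__main__":
    applied = []
    handler = IdempotentHandler(DedupeTable(ttl_seconds=3600), applied.append)
    event = Event(event_id=str(uuid.uuid4()), payload={"amount": 10})
    handler.handle(event)   # first delivery: applied
    handler.handle(event)   # redelivery: skipped
    print(len(applied))
```

The TTL should be longer than the longest window in which a redelivery or replay can occur, otherwise an expired entry lets a duplicate through. The sketch also does not cover a worker crashing between the claim and the write; one way to handle that is to make the write to the primary store itself conditional on the event ID.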

Keywords

Use Kafka for durable, replayable messaging and partitioning
Choose Cassandra/DynamoDB for high write throughput and availability
Idempotency via UUIDs + dedupe table and at-least-once semantics
Comprehensive monitoring: lag, latency, SLOs, business KPIs