I'd build the pipeline on Spark Structured Streaming reading from Kafka, with durable checkpointing on S3/HDFS and transactional or idempotent sinks for the outputs. I used this pattern in practice for a system handling ~50k events/s: Spark reads from the Kafka source, stores offsets in the checkpoint, and foreachBatch writes atomically, either via Kafka transactions (enable.idempotence=true plus a transactional.id on the producer) or a database sink that deduplicates on a key.

Cluster resilience comes from a Kafka replication factor of 3 with min.insync.replicas=2, plus Spark checkpointing and dynamic allocation with retries, so executor failures simply restart tasks. For schema evolution I used Avro with Confluent Schema Registry, enforcing backward/forward compatibility and adding new fields with defaults.

Operational checks: consumer lag, under-replicated partitions, end-to-end latency (SLO: 99th percentile < 200 ms), and automated alerts via Prometheus/Grafana. Code lives in Git with CI, is containerized with Docker, and orchestrated with Airflow.
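On the schema-evolution point: under Schema Registry compatibility checks, new fields need defaults so consumers on either schema version can read records written with the other. A hypothetical Avro record showing an added optional field (all names are illustrative; the `region` field is the evolution, with a null default):

```json
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "user_id",  "type": "string"},
    {"name": "ts",       "type": "long"},
    {"name": "region",   "type": ["null", "string"], "default": null}
  ]
}
```

Because `region` is nullable with a default, old readers ignore it and new readers fill it in when it is absent, which keeps both backward and forward compatibility.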
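The consumer-lag alert can be expressed as a Prometheus rule. A sketch, assuming the widely used kafka_exporter metric `kafka_consumergroup_lag`; the group name, threshold, and duration are illustrative, not from the original system:

```yaml
# Prometheus alerting rule for Kafka consumer lag (illustrative thresholds).
groups:
  - name: streaming-pipeline
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 100000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag above 100k for 5m"
```

The same structure covers the other checks mentioned, e.g. alerting on under-replicated partitions or on the 99th-percentile end-to-end latency crossing the 200 ms SLO.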
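The idempotent-sink half of the foreachBatch pattern can be sketched without a cluster: every event carries a dedup key, and the sink upserts by that key so replaying a micro-batch after a failure does not double-count. A minimal sketch in plain Python, using a dict to stand in for the database (the function name, event shape, and `event_id` key are all hypothetical):

```python
# Minimal sketch of an idempotent sink: writes are keyed, so replaying
# a micro-batch after a failure leaves the store unchanged.
# The "database" here is a plain dict keyed by event_id (hypothetical shape).

def upsert_batch(store: dict, batch: list) -> int:
    """Upsert events by their dedup key; returns how many were new."""
    new = 0
    for event in batch:
        key = event["event_id"]      # dedup key carried with each event
        if key not in store:
            new += 1
        store[key] = event           # last-writer-wins upsert
    return new

store = {}
batch = [
    {"event_id": "e1", "value": 10},
    {"event_id": "e2", "value": 20},
]

first = upsert_batch(store, batch)   # initial delivery: 2 new rows
replay = upsert_batch(store, batch)  # redelivery after a simulated failure: 0 new rows
```

In the real pipeline this body runs inside foreachBatch with a transactional or keyed database write in place of the dict; the property that matters is the same, that a replayed batch is a no-op.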