I'd build a Kafka-fronted ingestion pipeline with Spark Structured Streaming consumers that write to Elasticsearch via the bulk API. Kafka provides back-pressure and replay; Spark handles JSON flattening, enrichment, and field pruning so we index only high-value fields (typically 10–20% of raw keys). I'd run micro-batches of 5–30s with bulk payloads of ~5–15MB (or 5k–15k docs) to sustain steady throughput (target: 5–10k events/s).

On the index side, I'd use index templates with explicit mappings (keyword for aggregations, text+keyword multi-fields, avoiding nested types where possible) and ILM with hot-warm-frozen tiers, rolling over at ~50GB or daily.

For backfills, I'd run Spark batch jobs against a new write-optimized index with replicas=0 and refresh_interval=-1, then restore those settings and swap aliases once the backfill completes. For schema changes, I'd route new fields through an ingest pipeline and create versioned indices with an alias swap to avoid runtime mapping conflicts.

I led a 3-person team through a similar design in 6 weeks, cutting ES storage by ~50% while keeping 95th-percentile search latency under 200ms.
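The backfill flow (write-optimized settings, then restore and alias swap) would look roughly like this in Elasticsearch's console syntax. Index names `logs-v1`/`logs-v2` and the alias `logs` are hypothetical; the restored `refresh_interval` of 1s is the ES default, shown for illustration.

```
# Create the backfill target with indexing-optimized settings
PUT logs-v2
{
  "settings": { "number_of_replicas": 0, "refresh_interval": "-1" }
}

# ... run the Spark batch backfill job against logs-v2 ...

# Restore replicas and refresh once the backfill completes
PUT logs-v2/_settings
{ "number_of_replicas": 1, "refresh_interval": "1s" }

# Atomically repoint the read/write alias at the new index
POST _aliases
{
  "actions": [
    { "remove": { "index": "logs-v1", "alias": "logs" } },
    { "add":    { "index": "logs-v2", "alias": "logs" } }
  ]
}
```

Because the `_aliases` call applies both actions atomically, searches never see a window where the alias points at zero or two indices.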
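The bulk-sizing logic above can be sketched in a few lines: group outgoing documents into `_bulk` payloads capped by both byte size and document count, flushing whichever limit is hit first. This is a minimal illustration, not production code; the function name `chunk_bulk` and the 10MB/10k defaults are my own choices within the ranges mentioned.

```python
import json

def chunk_bulk(docs, max_bytes=10 * 1024 * 1024, max_docs=10_000):
    """Yield lists of docs whose serialized size and count stay under the caps."""
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        # Flush the current batch before this doc would exceed either cap.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

In the real pipeline this sits inside the Spark `foreachBatch` sink, with each yielded batch serialized into one bulk request; capping on bytes as well as doc count protects the ES coordinating nodes from oversized payloads when document sizes vary.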