I’d build a hybrid pipeline. Client-side instrumentation streams events (design operations, selection context) over WebSocket to a message broker (Kafka), which feeds both real-time feature computation and a feature store for offline training. Nightly training jobs (PyTorch/TensorFlow) produce a distilled model for server inference and a compact on-device model (<30 MB) for ultra-low-latency cases. Serving runs on gRPC/TorchServe with autoscaling behind a gateway, targeting p95 latency of 100–150 ms for server suggestions and <50 ms on-device. Monitoring combines Prometheus/Datadog for latency, data-drift detectors, and automatic canary evaluation. For fallbacks, deterministic heuristics or cached suggestions keep the 99.99% availability target intact when the model path is slow or down (sketched below). The main trade-offs are cost vs. latency (more edge compute raises cost), model complexity vs. interpretability, and freshness vs. consistency, which I’d manage with feature expiration and a retrain cadence (daily or weekly) tuned to drift signals.
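As a minimal sketch of the ingestion step, assuming the kafka-python client and an illustrative `design-events` topic and broker address (none of these names are prescribed by the design above), the server-side WebSocket handler could forward each client event like this:

```python
import json
from kafka import KafkaProducer  # kafka-python client, one possible choice

# Broker address and topic name are assumptions for illustration.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_design_event(user_id: str, op: str, selection: dict) -> None:
    """Forward one client design operation, received over WebSocket, to Kafka."""
    event = {"user_id": user_id, "op": op, "selection": selection}
    # send() is asynchronous; the client batches and flushes in the background.
    producer.send("design-events", value=event)
```

And for the fallback path, here is a rough sketch of a serving wrapper that enforces the latency budget and degrades to cached suggestions, then deterministic heuristics. The `model_client`, `cache`, and `heuristic` collaborators are hypothetical placeholders, not a specific library's API:

```python
import concurrent.futures

# 0.15 s matches the 100-150 ms p95 target for server suggestions.
MODEL_BUDGET_S = 0.15

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=32)

def suggest(context, cache_key, model_client, cache, heuristic):
    """Return model suggestions within the latency budget, else degrade."""
    future = _executor.submit(model_client.predict, context)
    try:
        # Primary path: server-side inference, bounded by the budget.
        return future.result(timeout=MODEL_BUDGET_S)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; the worker may already be running
    except Exception:
        pass  # model or transport errors fall through to the same fallbacks
    # Fallback 1: a cached suggestion from an earlier, similar request.
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    # Fallback 2: deterministic heuristics, always available.
    return heuristic(context)
```

The point of the wrapper is that every request returns *something* within a bounded time, which is what makes the availability target achievable even when inference is degraded.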