Intermediate | SITUATIONAL
A critical service in production is experiencing increased latency and intermittent 5xx errors. Walk me through how you would triage the incident using monitoring and logging data (Prometheus, Grafana, ELK/CloudWatch) and what immediate mitigations you might apply to restore service while preserving data for postmortem.
DevOps Engineer
General

Sample Answer

When I see latency and 5xx spikes, I start by correlating metrics and logs. I open the Grafana dashboards backed by Prometheus and check request latency percentiles, error rates, and resource metrics (CPU, memory, I/O) over the last 30–60 minutes. I then tail application logs in ELK or CloudWatch to find error patterns and stack traces, focusing on the timestamps of the first errors. In a recent incident affecting ~12% of users, this showed a surge in DB connection wait time and a few downstream timeouts. The immediate mitigations were to enable a degraded route (circuit breaker) so that noncritical features fail fast, scale the service horizontally by adding 3 replicas (applied through the autoscaler rules), and temporarily increase the DB connection pool size. To preserve data for the postmortem, I snapshotted logs and metrics, set retention and exported the data, and wrote a brief timeline with root-cause candidates for the on-call rotation.
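To make the fail-fast mitigation concrete, here is a minimal circuit-breaker sketch in Python. It is an illustration only, not the implementation used in the incident: the class name `CircuitBreaker`, its parameters, and the thresholds are all hypothetical, and a production service would more likely rely on its framework or service mesh for this.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # cooldown in seconds, hypothetical default
        self.clock = clock                          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown elapsed: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A noncritical feature would be wrapped as `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` stands in for any optional downstream call. Once the circuit opens, requests get the degraded response immediately instead of waiting on a saturated dependency.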

Keywords

Correlate Prometheus/Grafana metrics with ELK/CloudWatch logs to identify first-failure time
Immediate mitigations: fail-fast/circuit-breaker, horizontal scaling, DB pool tuning
Preserve artifacts: snapshots/exports of logs and metrics and timeline for postmortem
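
The first keyword, identifying the first-failure time from logs, can be sketched in Python as follows. This assumes exported log lines are JSON objects with a `timestamp` field (ISO 8601 with an explicit offset) and a numeric `status` field. Both field names and the function name `first_5xx` are assumptions for illustration; real ELK or CloudWatch exports will use their own schema.

```python
import json
from datetime import datetime


def first_5xx(log_lines):
    """Return (timestamp, record) for the earliest 5xx entry, or None if there is none."""
    earliest = None
    for line in log_lines:
        try:
            record = json.loads(line)
            ts = datetime.fromisoformat(record["timestamp"])  # assumed field name
            status = int(record["status"])                    # assumed field name
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that do not match the assumed format
        # Assumes all timestamps carry the same kind of timezone information.
        if 500 <= status <= 599 and (earliest is None or ts < earliest[0]):
            earliest = (ts, record)
    return earliest
```

The returned timestamp is the point to line up against the Grafana latency and error-rate panels, and it gives the postmortem timeline its starting entry.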