Root Cause Analysis
67 metrics ranked · 58 deseasonalized · 9 raw (no baseline)
Every metric around the alert, brightness scaled by how far it left its seasonal normal. Most barely moved — a few scream. This page is our opinion on which ones, and the reasoning behind it.
The opinion
What to look at next
Two independent methods rank every metric — one asks which deviated most from its seasonal normal, the other which deviated first. Where they agree is the strongest signal the data offers.
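The two rankings and their overlap can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes deviation is measured in standard deviations from a flat pre-alert baseline (a stand-in for the seasonal normal), that onset is the first bucket at or beyond 3 sigma, and that consensus means appearing in the top `top_k` of both lists. The function names, the `top_k` parameter, and the demo metric names are all hypothetical.

```python
from statistics import mean, stdev

def sigma_shift(values, baseline):
    """Distance of each value from the baseline mean, in baseline standard deviations."""
    mu = mean(baseline)
    sd = stdev(baseline) or 1e-9  # guard against a perfectly flat baseline
    return [(v - mu) / sd for v in values]

def onset_bucket(shifts, threshold=3.0):
    """Index of the first bucket whose absolute shift reaches the threshold, else None."""
    for i, z in enumerate(shifts):
        if abs(z) >= threshold:
            return i
    return None

def consensus(metrics, top_k=3):
    """metrics maps name -> (baseline, post_alert). Returns names in the top_k of both rankings."""
    shifts = {n: sigma_shift(post, base) for n, (base, post) in metrics.items()}
    peak = {n: max(abs(z) for z in s) for n, s in shifts.items()}
    onset = {n: onset_bucket(s) for n, s in shifts.items()}
    crossed = [n for n in metrics if onset[n] is not None]  # never-crossed metrics are not ranked
    by_magnitude = sorted(crossed, key=lambda n: -peak[n])[:top_k]  # which deviated most
    by_lead_time = sorted(crossed, key=lambda n: onset[n])[:top_k]  # which deviated first
    return [n for n in by_magnitude if n in by_lead_time]

# Hypothetical toy data, not taken from this incident.
demo = {
    "a_err": ([0, 0, 1, 0, 0, 1], [0, 5, 9, 4]),
    "b_lat": ([10, 11, 10, 9, 10, 10], [10, 10, 30, 12]),
    "c_cnt": ([5, 5, 6, 5, 4, 5], [5, 5, 5, 6]),
}
agreed = consensus(demo, top_k=2)
```

Shrinking `top_k` makes agreement harder to reach: with `top_k=1` the toy data yields no consensus, because the largest mover and the earliest mover are different metrics.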
Both methods independently point at these 2 metrics:
Judged blind: names and method rankings were withheld. The AI ranked 3 of 8 candidates; 2 match the methods' consensus and 1 dissents.
- Crossed 3 sigma at bucket 4, the earliest onset among candidates with operationally meaningful magnitude.
- Its smoothed series ramps and peaks at 54 sigma, far above background noise.
- It led StockInOutKSNewInventoryViewpost_lat's onset by 3 buckets, consistent with an origin position.
- Raw peak deviation of 75 against a near-zero baseline marks a large, real error-rate jump.
- Pre-alert mean was effectively zero (0.167), so the post-alert rise is a genuine departure, not noise.
- The series shows a clean rise-peak-decay shape, rising from 4.7 at bucket 4 to its peak.
- Crossed 3 sigma at bucket 7, an early onset shortly after CreateParentOrderViewpostInternal_err.
- Spiked to 59.5 sigma, the largest deviation in the dossier.
- The spike is narrow: it jumps to 59.5 at bucket 7 and collapses to 0.8 by bucket 10.
- Raw peak deviation of 1880.6 against a small baseline is a large, meaningful excursion.
- Crossed 3 sigma at bucket 8, an early onset.
- Raw peak deviation of 999.8 against a pre-alert mean of 981.2 is a near-doubling, which is operationally meaningful.
- Pre-alert baseline of 981.2 confirms the post-alert level roughly doubled.
- Its series sits in a sustained plateau, reaching 3.6 sigma at bucket 10.
14 of 14 claims verified against the data.
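Two of the claims above reduce to arithmetic that can be rechecked from the quoted figures alone. The sketch below assumes that the 3-bucket lead is the gap between the onsets at bucket 4 and bucket 7, and that "raw peak deviation" means peak level minus pre-alert mean; neither definition is spelled out on this page, and the variable names are illustrative.

```python
# Figures quoted in the claims above.
first_onset = 4             # bucket where the first candidate crossed 3 sigma
second_onset = 7            # bucket where the second candidate crossed 3 sigma
pre_alert_mean = 981.2      # third candidate's pre-alert baseline
raw_peak_deviation = 999.8  # third candidate's raw peak deviation

# Lead time between the two onsets, in buckets.
lead_in_buckets = second_onset - first_onset

# Assumption: deviation is measured from the pre-alert mean,
# so the peak level is baseline plus deviation.
peak_level = pre_alert_mean + raw_peak_deviation
ratio = peak_level / pre_alert_mean

print(f"lead: {lead_in_buckets} buckets, peak/baseline ratio: {ratio:.2f}")
```

Under these assumptions the lead is 3 buckets and the peak is about 2.02 times the baseline, which agrees with the "led by 3 buckets" and "roughly doubled" wording.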
The reasoning
How the methods voted
Each candidate is placed by both verdicts at once: further left = crossed the 3σ band sooner after the alert (more root-like timing), higher = bigger peak deviation from its own seasonal normal. Top-left is where causes live; big-but-late points read as cascade victims.
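One way to read a point on this map in code, as a rough sketch: the cutoffs `early_cutoff` and `big_cutoff` are hypothetical and are not thresholds used by this page, and the page names only the early-and-big and big-but-late regions, so the other two labels are assumptions added for completeness.

```python
def read_position(onset_bucket, peak_sigma, early_cutoff=5, big_cutoff=10.0):
    """Label a candidate by where it falls on the onset-vs-magnitude map.

    early_cutoff and big_cutoff are illustrative values, not the page's own.
    """
    early = onset_bucket <= early_cutoff  # further left on the map
    big = peak_sigma >= big_cutoff        # higher on the map
    if early and big:
        return "cause-like"               # top-left: where causes live
    if big:
        return "cascade-victim-like"      # big but late
    if early:
        return "early but small"          # label not named by the page
    return "background"                   # label not named by the page

# Hypothetical points, not candidates from this incident.
labels = [read_position(2, 40.0), read_position(9, 40.0), read_position(9, 1.0)]
```

Hard cutoffs are a simplification: the map itself is continuous, so a point near a boundary deserves a look at its full series rather than a label.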
The evidence
How each candidate's deviation evolved
The same σ-shift the map summarizes as one point, minute by minute. A spike right after the alert reads differently from a slow ramp or a late surge — consensus candidates are highlighted.
The proof
Inspect every metric
The full ranking, one metric at a time: what it actually did against its seasonal normal. The visible gap is the anomaly.
magnitude #1 · lead time #2 · CreateParentOrderViewpostInternal_err
The fine print
What this page does — and doesn't — claim
It does claim:
Starting from raw telemetry and only an alert timestamp, the pipeline narrowed 67 metrics to a candidate set of at most 10 — with onset times and magnitude evidence for each.
It does not claim:
- that any one metric caused the incident;
- that magnitude or lead-time is “right” — without ground truth there is no right;
- to handle drift-shaped incidents or capacity creep — the methods target step-shaped shifts against quiet baselines.
Only your team can validate this:
- Which (if any) of these candidates was the actual root cause your team identified during the incident?
- Are there metrics your team flagged that the ranking missed?
- Do the onset times above line up with your incident timeline?