Root Cause Analysis

67 metrics ranked · 58 deseasonalized · 9 raw (no baseline)

Every metric around the alert is shown here, with brightness scaled by how far it left its seasonal normal. Most barely moved; a few scream. This page is our opinion on which ones, and the reasoning behind it.

The opinion

What to look at next

Two independent methods rank every metric — one asks which deviated most from its seasonal normal, the other which deviated first. Where they agree is the strongest signal the data offers.
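As a rough illustration of what "where they agree" could mean, the sketch below ranks metrics by peak deviation and by onset, then keeps those in the top few of both lists. The function name, the `top_k` cutoff, and the metric names and values are all hypothetical; the page does not say how the pipeline actually combines the two rankings.

```python
from typing import Dict, List, Optional


def consensus(peak_sigma: Dict[str, float],
              onset_bucket: Dict[str, Optional[int]],
              top_k: int = 3) -> List[str]:
    """Return metrics that sit in the top_k of both rankings (assumed rule)."""
    # Magnitude ranking: largest peak deviation first.
    by_magnitude = sorted(peak_sigma, key=lambda m: -peak_sigma[m])
    # Lead-time ranking: earliest onset first. Metrics that never crossed
    # the band have no onset (None) and cannot take part in this ranking.
    crossed = [m for m in onset_bucket if onset_bucket[m] is not None]
    by_lead = sorted(crossed, key=lambda m: onset_bucket[m])
    top_magnitude = set(by_magnitude[:top_k])
    # Keep lead-time order so the earliest mover is listed first.
    return [m for m in by_lead[:top_k] if m in top_magnitude]


# Made-up values for four hypothetical metrics.
peaks = {"metric_a": 40.0, "metric_b": 55.0, "metric_c": 3.5, "metric_d": 2.0}
onsets = {"metric_a": 4, "metric_b": 7, "metric_c": 8, "metric_d": None}
agreed = consensus(peaks, onsets, top_k=2)
```

With these made-up inputs, `metric_a` and `metric_b` are in the top two of both rankings, so they form the consensus set.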

Both methods independently point at these 2 metrics:

AI verdict

Judged blind: names and method rankings withheld. The AI ranked 3 of 8 candidates; 2 match the methods' consensus, 1 dissents.

    • Crossed 3 sigma at bucket 4, the earliest onset among candidates with operationally meaningful magnitude.
    • Its smoothed series ramps and peaks at 54 sigma, far above background noise.
    • It led StockInOutKSNewInventoryViewpost_lat's onset by 3 buckets, consistent with an origin position.
    • Raw peak deviation of 75 against a near-zero baseline marks a large, real error-rate jump.
    • Pre-alert mean was effectively zero (0.167), so the post-alert rise is a genuine departure, not noise.
    • The series shows a clean rise-peak-decay shape, rising from 4.7 at bucket 4 to its peak.

    • Crossed 3 sigma at bucket 7, an early onset shortly after CreateParentOrderViewpostInternal_err.
    • Spiked to 59.5 sigma, the largest deviation in the dossier.
    • The spike is narrow: it jumps to 59.5 at bucket 7 and collapses to 0.8 by bucket 10.
    • Raw peak deviation of 1880.6 against a small baseline is a large, meaningful excursion.

    • Crossed 3 sigma at bucket 8, an early onset.
    • Raw peak deviation of 999.8 against a pre-alert mean of 981.2 represents a near-doubling, operationally meaningful.
    • Pre-alert baseline of 981.2 confirms the post-alert level roughly doubled.
    • Its series sits in a sustained plateau, reaching 3.6 sigma at bucket 10.

14 of 14 claims verified against the data.
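Claims of the kind listed above ("crossed 3 sigma at bucket 4", "led ... by 3 buckets", "roughly doubled") are simple enough to re-check by hand. The sketch below shows one way to do that. It assumes buckets are 0-indexed, that onset means the first bucket whose absolute sigma reaches the threshold, and that "raw peak deviation" means peak minus pre-alert mean. The helper names and sample series are illustrative, not the pipeline's verifier or its data.

```python
from typing import Optional, Sequence


def first_crossing(sigma: Sequence[float], threshold: float = 3.0) -> Optional[int]:
    """Index of the first bucket with |sigma| >= threshold, or None."""
    for bucket, value in enumerate(sigma):
        if abs(value) >= threshold:
            return bucket
    return None


def lead_in_buckets(leader: Sequence[float], follower: Sequence[float],
                    threshold: float = 3.0) -> Optional[int]:
    """How many buckets the leader's onset precedes the follower's onset."""
    a = first_crossing(leader, threshold)
    b = first_crossing(follower, threshold)
    if a is None or b is None:
        return None  # no timing claim possible
    return b - a


def level_ratio(pre_alert_mean: float, raw_peak_deviation: float) -> float:
    """Peak level relative to the pre-alert mean (assumes deviation = peak - mean)."""
    return (pre_alert_mean + raw_peak_deviation) / pre_alert_mean


# Illustrative series only: one crosses at bucket 4, the other at bucket 7.
early = [0.2, 0.1, 0.4, 0.9, 4.7, 20.0, 54.0]
late = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0]
lead = lead_in_buckets(early, late)

# The "near-doubling" arithmetic from the list: (981.2 + 999.8) / 981.2.
ratio = level_ratio(981.2, 999.8)
```

Under the stated assumption about what raw peak deviation means, the last line gives a ratio just above 2, which is what "near-doubling" describes.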

The reasoning

How the methods voted

Each candidate is placed by both verdicts at once: further left = crossed the 3σ band sooner after the alert (more root-like timing), higher = bigger peak deviation from its own seasonal normal. Top-left is where causes live; big-but-late points read as cascade victims.

Quadrant labels: moved first & hardest (prime suspects) · early but weak
Legend: consensus · crossed 3σ · never crossed (no timing claim)
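The placement logic of the map can be sketched as a small classifier over onset and peak deviation. The cutoffs `early_cutoff` and `strong_cutoff` are hypothetical (the page gives no numeric boundaries for the quadrants), and the "late and weak" label is our own name for the quadrant the page leaves unlabeled.

```python
from typing import Optional


def quadrant(onset_bucket: Optional[int], peak_sigma: float,
             early_cutoff: int = 5, strong_cutoff: float = 10.0) -> str:
    """Place a candidate on the timing/magnitude map (cutoffs are assumptions)."""
    if onset_bucket is None:
        # Never crossed the 3-sigma band, so nothing can be said about timing.
        return "never crossed (no timing claim)"
    early = onset_bucket <= early_cutoff   # further left on the map
    strong = peak_sigma >= strong_cutoff   # higher on the map
    if early and strong:
        return "prime suspect"
    if early:
        return "early but weak"
    if strong:
        return "big but late (possible cascade victim)"
    return "late and weak"  # label not taken from the page


label = quadrant(onset_bucket=2, peak_sigma=40.0)
```

Moving either cutoff changes which candidates land in the top-left quadrant, so the labels are only as meaningful as the boundaries chosen.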

The evidence

How each candidate's deviation evolved

The same σ-shift the map summarizes as one point, minute by minute. A spike right after the alert reads differently from a slow ramp or a late surge — consensus candidates are highlighted.

The proof

Inspect every metric

The full ranking, one metric at a time: what it actually did against its seasonal normal. The visible gap is the anomaly.

Create Parent Order Viewpost Internal
errors · matched

magnitude #1 · lead time #2 · CreateParentOrderViewpostInternal_err

Legend: actual · expected · gap = anomaly
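To make "gap = anomaly" concrete, here is a minimal sketch of one common way to turn actual-versus-expected into a sigma shift: take the expected value for each bucket as the mean of past values at the same point in the seasonal cycle, and divide the gap by the spread of past residuals. The page does not describe how its seasonal normal is computed, so the method, the function names, the `period` parameter, and the sample numbers are all assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def seasonal_profile(history: Sequence[float], period: int) -> List[float]:
    """Mean of past values at each phase of the seasonal cycle."""
    return [mean(history[phase::period]) for phase in range(period)]


def sigma_shift(actual: Sequence[float], history: Sequence[float],
                period: int) -> List[float]:
    """Gap between actual and expected, in units of historical residual spread."""
    profile = seasonal_profile(history, period)
    # How far history itself strayed from its own seasonal profile.
    residuals = [h - profile[i % period] for i, h in enumerate(history)]
    spread = pstdev(residuals) or 1.0  # assumption: avoid dividing by zero
    start = len(history)  # actual is assumed to continue right after history
    return [(a - profile[(start + i) % period]) / spread
            for i, a in enumerate(actual)]


# Made-up data: a two-bucket cycle around 10 and 20, then a jump to 30.
past = [10, 20, 12, 22, 8, 18, 10, 20]
shifts = sigma_shift([10, 20, 30], past, period=2)
```

In this made-up example the first two buckets match the seasonal profile exactly, while the third sits about 14 sigma above its expected value of 10. A metric with no usable history would have no baseline to subtract, which is presumably why some metrics are ranked raw.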

The fine print

What this page does — and doesn't — claim

It does claim:

Starting from raw telemetry and only an alert timestamp, the pipeline narrowed 67 metrics to a candidate set of at most 10 — with onset times and magnitude evidence for each.

It does not claim:

  • that any one metric caused the incident;
  • that magnitude or lead-time is “right” — without ground truth there is no right;
  • to handle drift-shaped incidents or capacity creep — the methods target step-shaped shifts against quiet baselines.

Only your team can validate this:

  1. Which (if any) of these candidates was the actual root cause your team identified during the incident?
  2. Are there metrics your team flagged that the ranking missed?
  3. Do the onset times above line up with your incident timeline?