Data Reliability · November 20, 2025

The State of Data Incidents in 2025

What's actually going wrong in modern data stacks right now — and how long it takes teams to notice.

By Pallisade Team

We looked at the patterns across data incidents we've helped teams respond to this year. A few things stood out — most of them uncomfortable.

Time to Detection Is Still the Problem

The median time from an incident starting to a human noticing it is measured in days, not minutes. The culprit is almost always the same: the failure mode was not one the pipeline was watching for.

Detection trigger                               % of incidents
An internal user noticed a weird dashboard      46%
An external stakeholder noticed                 18%
A scheduled quality check fired                 22%
A pipeline hard-failure                         14%

More than half the time, a human caught the problem before the system did.

The Most Common Root Causes

Four categories account for the overwhelming majority of incidents:

1. Upstream Schema Changes

A vendor renamed a field, added a nullable column, or changed a type. Ingestion succeeded. Downstream joins silently dropped rows.
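A minimal sketch of the kind of check that catches this before ingestion: diff the observed columns against a stored snapshot of the last-known schema. The column names and types here are illustrative, not from any real vendor feed.

```python
# Hypothetical sketch: detect upstream schema drift by diffing the observed
# column set against a saved snapshot of the last-known schema.
EXPECTED = {"order_id": "int", "customer_id": "int", "amount": "float"}

def diff_schema(expected: dict, observed: dict) -> dict:
    """Return columns that were added, removed, or changed type."""
    added = {c: t for c, t in observed.items() if c not in expected}
    removed = {c: t for c, t in expected.items() if c not in observed}
    retyped = {
        c: (expected[c], observed[c])
        for c in expected.keys() & observed.keys()
        if expected[c] != observed[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}

# A vendor rename shows up as one removal plus one addition —
# exactly the pattern that makes downstream joins silently drop rows.
drift = diff_schema(EXPECTED, {"order_id": "int", "cust_id": "int", "amount": "float"})
```

Running this on every sync turns "ingestion succeeded" into "ingestion succeeded *and* the shape is what we expected."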

2. Stale Data

The pipeline ran, but read from a source that hadn't updated. Downstream looks current. It isn't.
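A freshness check is a one-liner once you have the source's latest record timestamp. A hedged sketch, with an invented SLA and timestamps purely for illustration:

```python
# Hypothetical sketch: flag a source as stale when its latest record
# timestamp lags "now" by more than a freshness SLA.
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, sla, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) > sla

now = datetime(2025, 11, 20, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2025, 11, 20, 11, 30, tzinfo=timezone.utc)   # 30 min old
old = datetime(2025, 11, 18, 12, 0, tzinfo=timezone.utc)      # 2 days old
sla = timedelta(hours=6)
# is_stale(old, sla, now) fires; is_stale(fresh, sla, now) does not
```

The point is that the check runs against the *source's* clock, not the pipeline's: a pipeline that ran on time over stale data still fails this check.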

3. Duplicate Rows From Retries

An ingestion job retried after a transient failure and wrote both attempts. Aggregates double-count until someone notices.
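The standard fix is to make writes idempotent: deduplicate on a stable event key so a retry overwrites rather than double-counts. A minimal sketch, assuming each row carries a unique `event_id` (an illustrative name):

```python
# Hypothetical sketch: deduplicate retried writes on a stable event key,
# keeping the latest attempt per key so retries overwrite, not append.
def dedupe(rows, key="event_id"):
    latest = {}
    for row in rows:            # later attempts replace earlier ones
        latest[row[key]] = row
    return list(latest.values())

rows = [
    {"event_id": "a1", "amount": 10, "attempt": 1},
    {"event_id": "a1", "amount": 10, "attempt": 2},  # retry wrote a second copy
    {"event_id": "b2", "amount": 5, "attempt": 1},
]
total = sum(r["amount"] for r in dedupe(rows))  # 15, not the double-counted 25
```

In a warehouse the same idea is a `MERGE`/upsert keyed on the event ID instead of a blind `INSERT`.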

4. Logic Drift in Transformations

A dbt model was updated to "fix" one metric but broke another. No regression test caught it because no regression test existed for the downstream metric.
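The cheapest defense is a test that pins *both* metrics against a fixed fixture, so a change that "fixes" one cannot silently move the other. A sketch with invented metric functions and fixture data, standing in for whatever the dbt model computes:

```python
# Hypothetical sketch: pin two related metrics against one fixture, so a
# "fix" to the first metric fails CI if it shifts the second.
def revenue(rows):
    return sum(r["amount"] for r in rows if r["status"] == "paid")

def refund_rate(rows):
    refunds = sum(1 for r in rows if r["status"] == "refunded")
    return refunds / len(rows)

FIXTURE = [
    {"amount": 100, "status": "paid"},
    {"amount": 50, "status": "refunded"},
    {"amount": 25, "status": "paid"},
]

def test_metrics_pinned():
    assert revenue(FIXTURE) == 125
    assert abs(refund_rate(FIXTURE) - 1 / 3) < 1e-9
```

In dbt terms this is a singular test over a seed: the fixture is versioned alongside the model, and the expected values are asserted, not eyeballed.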

Time to Resolution

Once detected, resolution time varies wildly based on one thing: whether the team has lineage.

  • Teams with automated lineage: median 47 minutes
  • Teams without: median 6 hours

The difference is not talent. It is time spent answering the question "what depends on this column?"
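With column-level lineage, that question is a graph traversal instead of an afternoon of grepping SQL. A sketch over an invented lineage graph (the column names are illustrative):

```python
# Hypothetical sketch: answer "what depends on this column?" with a
# breadth-first walk over a column-level lineage graph.
from collections import deque

LINEAGE = {  # edges point downstream: column -> columns derived from it
    "raw.orders.amount": ["stg.orders.amount_usd"],
    "stg.orders.amount_usd": ["mart.revenue.daily_total", "mart.ltv.customer_ltv"],
}

def downstream(column, edges):
    """All columns transitively derived from `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

`downstream("raw.orders.amount", LINEAGE)` returns the full blast radius in milliseconds, which is the entire 47-minutes-versus-6-hours difference.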

What Actually Works

Across the teams with the best numbers, a few practices kept showing up:

  1. Freshness checks on every source, not just the obvious ones
  2. Schema change alerts from upstream vendors, not discovered at query time
  3. Row-count anomaly detection with adaptive thresholds
  4. Lineage at the column level, not just table level
  5. A shared incident channel where every data issue, no matter how small, gets posted
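Practice 3 above is the least familiar of the five, so here is one simple way it can work: treat a day's row count as anomalous when it falls outside a few standard deviations of a trailing window. The window size and threshold are assumptions, not a recommendation:

```python
# Hypothetical sketch: adaptive row-count anomaly detection using a
# trailing window's mean and standard deviation as the baseline.
from statistics import mean, stdev

def is_anomalous(history, today, k=3.0):
    """Flag `today` if it deviates more than k sigma from the trailing window."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > k * max(sigma, 1.0)  # floor sigma so flat history still alerts sanely

history = [10_000, 10_200, 9_900, 10_100, 10_050]
# a normal day passes; a day that loaded 4,000 rows gets flagged
```

Because the threshold is derived from recent history rather than hard-coded, it adapts as volume grows, which is what keeps it from becoming the alert everyone mutes.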

None of these are exotic. The teams that struggled were not missing sophistication — they were missing consistency.

Looking Ahead

The bar for data reliability is rising. As more business decisions get automated on top of data, the cost of a silent wrong answer goes up. The teams that invest in detection and lineage now are the ones that will still be trusted in 18 months.


Want help closing the detection gap? Talk to us.

Tags: incidents, data reliability, observability
