The State of Data Incidents in 2025
What's actually going wrong in modern data stacks right now — and how long it takes teams to notice.
By Pallisade Team
We looked at the patterns across data incidents we've helped teams respond to this year. A few things stood out — most of them uncomfortable.
Time to Detection Is Still the Problem
The median time from an incident starting to a human noticing it is measured in days, not minutes. The culprit is almost always the same: the failure mode was not one the pipeline was watching for.
| Detection trigger | % of incidents |
|---|---|
| An internal user noticed a weird dashboard | 46% |
| An external stakeholder noticed | 18% |
| A scheduled quality check fired | 22% |
| A pipeline hard-failure | 14% |
More than half the time, a human caught the problem before the system did.
The Most Common Root Causes
Four categories account for the overwhelming majority of incidents:
1. Upstream Schema Changes
A vendor renamed a field, added a nullable column, or changed a type. Ingestion succeeded. Downstream joins silently dropped rows.
2. Stale Data
The pipeline ran, but read from a source that hadn't updated. Downstream looks current. It isn't.
3. Duplicate Rows From Retries
An ingestion job retried after a transient failure and wrote both attempts. Aggregates double-count until someone notices (see the dedup sketch after this list).
4. Logic Drift in Transformations
A dbt model was updated to "fix" one metric but broke another. No regression test caught it because no regression test existed for the downstream metric.
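To make cause 3 concrete, here is a minimal sketch of one common guard: deduplicating an ingestion batch on a natural key before it lands, so a retried job cannot write the same rows twice. The `event_id` key and `loaded_at` column are illustrative assumptions, not details from any specific incident.

```python
import pandas as pd

def dedupe_ingestion_batch(batch: pd.DataFrame, key: str = "event_id") -> pd.DataFrame:
    """Keep only the latest copy of each record so a retried job that
    re-emits the same rows cannot double-count downstream aggregates.
    Column names here are illustrative, not from the report."""
    return (
        batch.sort_values("loaded_at")              # latest attempt wins
             .drop_duplicates(subset=key, keep="last")
    )
```

The same idea works server-side as a MERGE/upsert keyed on the natural key; the point is that the write path, not the analyst, absorbs the retry.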
Time to Resolution
Once detected, resolution time varies wildly based on one thing: whether the team has lineage.
- Teams with automated lineage: median 47 minutes
- Teams without: median 6 hours
The difference is not talent. It is time spent answering the question "what depends on this column?"
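The value of lineage is easiest to see in code. Below is a minimal sketch, using an illustrative dependency graph built with networkx, of how "what depends on this column?" collapses into a single traversal once column-level lineage exists as data. The node names are made up for the example.

```python
import networkx as nx

# Illustrative column-level lineage graph: edges point downstream.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.daily_revenue"),
    ("marts.revenue.daily_revenue", "dashboards.exec_kpis"),
])

# "What depends on this column?" becomes a one-line question.
impacted = nx.descendants(lineage, "raw.orders.amount")
print(sorted(impacted))
```

Without that graph, the same answer comes from grepping SQL and asking around in Slack, which is where the missing hours go.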
What Actually Works
Across the teams with the best numbers, a few practices kept showing up:
- Freshness checks on every source, not just the obvious ones
- Schema change alerts from upstream vendors, not discovered at query time
- Row-count anomaly detection with adaptive thresholds (see the sketch after this list)
- Lineage at the column level, not just table level
- A shared incident channel where every data issue, no matter how small, gets posted
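As one example of the third practice, here is a minimal sketch of a row-count check with an adaptive threshold based on the recent median and median absolute deviation. The window size and the `k` multiplier are illustrative defaults, not recommendations drawn from the incident data.

```python
from statistics import median

def row_count_is_anomalous(history: list[int], today: int, k: float = 5.0) -> bool:
    """Flag today's row count if it sits more than k median-absolute-deviations
    away from the recent median. `history` is a list of recent daily counts."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1  # guard against a zero-width band
    return abs(today - med) > k * mad

# Usage: pass the last few weeks of daily counts for a table, plus today's count.
print(row_count_is_anomalous([10_250, 10_310, 9_980, 10_120, 10_400], today=21_400))
```

A static threshold breaks the first time volume grows; deriving the band from recent history keeps the check useful without constant retuning.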
None of these are exotic. The teams that struggled were not missing sophistication — they were missing consistency.
Looking Ahead
The bar for data reliability is rising. As more business decisions get automated on top of data, the cost of a silent wrong answer goes up. The teams that invest in detection and lineage now are the ones that will still be trusted in 18 months.
Want help closing the detection gap? Talk to us.