Data Reliability · November 20, 2025

The State of Data Incidents in 2025

What's actually going wrong in modern data stacks right now — and how long it takes teams to notice.

By Pallisade Team

We looked at the patterns across data incidents we've helped teams respond to this year. A few things stood out — most of them uncomfortable.

Time to Detection Is Still the Problem

The median time from an incident starting to a human noticing it is measured in days, not minutes. The culprit is almost always the same: the failure mode was not one the pipeline was watching for.

Detection trigger                               % of incidents
An internal user noticed a weird dashboard      46%
An external stakeholder noticed                 18%
A scheduled quality check fired                 22%
A pipeline hard-failure                         14%

More than half the time, a human caught the problem before the system did.

The Most Common Root Causes

Four categories account for the overwhelming majority of incidents:

1. Upstream Schema Changes

A vendor renamed a field, added a nullable column, or changed a type. Ingestion succeeded. Downstream joins silently dropped rows.
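A minimal sketch of the kind of check that catches this before ingestion: diff the observed columns against a stored snapshot of the last-known schema. The column names and types here are illustrative, not from any real vendor feed.

```python
# Hypothetical sketch: detect upstream schema drift by diffing the observed
# column set against a saved snapshot of the last-known schema.
EXPECTED = {"order_id": "int", "customer_id": "int", "amount": "float"}

def diff_schema(expected: dict, observed: dict) -> dict:
    """Return columns that were added, removed, or changed type."""
    added = {c: t for c, t in observed.items() if c not in expected}
    removed = {c: t for c, t in expected.items() if c not in observed}
    retyped = {
        c: (expected[c], observed[c])
        for c in expected.keys() & observed.keys()
        if expected[c] != observed[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}

# A vendor rename shows up as one removal plus one addition —
# exactly the pattern that makes downstream joins silently drop rows.
drift = diff_schema(EXPECTED, {"order_id": "int", "cust_id": "int", "amount": "float"})
```

Running this on every sync turns "ingestion succeeded" into "ingestion succeeded *and* the shape is what we expected."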

2. Stale Data

The pipeline ran, but read from a source that hadn't updated. Downstream looks current. It isn't.
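A freshness check is a one-liner once you have the source's latest record timestamp. A hedged sketch, with an invented SLA and timestamps purely for illustration:

```python
# Hypothetical sketch: flag a source as stale when its latest record
# timestamp lags "now" by more than a freshness SLA.
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, sla, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) > sla

now = datetime(2025, 11, 20, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2025, 11, 20, 11, 30, tzinfo=timezone.utc)   # 30 min old
old = datetime(2025, 11, 18, 12, 0, tzinfo=timezone.utc)      # 2 days old
sla = timedelta(hours=6)
# is_stale(old, sla, now) fires; is_stale(fresh, sla, now) does not
```

The point is that the check runs against the *source's* clock, not the pipeline's: a pipeline that ran on time over stale data still fails this check.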

3. Duplicate Rows From Retries

An ingestion job retried after a transient failure and wrote both attempts. Aggregates double-count until someone notices.
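The standard fix is to make writes idempotent: deduplicate on a stable event key so a retry overwrites rather than double-counts. A minimal sketch, assuming each row carries a unique `event_id` (an illustrative name):

```python
# Hypothetical sketch: deduplicate retried writes on a stable event key,
# keeping the latest attempt per key so retries overwrite, not append.
def dedupe(rows, key="event_id"):
    latest = {}
    for row in rows:            # later attempts replace earlier ones
        latest[row[key]] = row
    return list(latest.values())

rows = [
    {"event_id": "a1", "amount": 10, "attempt": 1},
    {"event_id": "a1", "amount": 10, "attempt": 2},  # retry wrote a second copy
    {"event_id": "b2", "amount": 5, "attempt": 1},
]
total = sum(r["amount"] for r in dedupe(rows))  # 15, not the double-counted 25
```

In a warehouse the same idea is a `MERGE`/upsert keyed on the event ID instead of a blind `INSERT`.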

4. Logic Drift in Transformations

A dbt model was updated to "fix" one metric but broke another. No regression test caught it because no regression test existed for the downstream metric.
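The cheapest defense is a test that pins *both* metrics against a fixed fixture, so a change that "fixes" one cannot silently move the other. A sketch with invented metric functions and fixture data, standing in for whatever the dbt model computes:

```python
# Hypothetical sketch: pin two related metrics against one fixture, so a
# "fix" to the first metric fails CI if it shifts the second.
def revenue(rows):
    return sum(r["amount"] for r in rows if r["status"] == "paid")

def refund_rate(rows):
    refunds = sum(1 for r in rows if r["status"] == "refunded")
    return refunds / len(rows)

FIXTURE = [
    {"amount": 100, "status": "paid"},
    {"amount": 50, "status": "refunded"},
    {"amount": 25, "status": "paid"},
]

def test_metrics_pinned():
    assert revenue(FIXTURE) == 125
    assert abs(refund_rate(FIXTURE) - 1 / 3) < 1e-9
```

In dbt terms this is a singular test over a seed: the fixture is versioned alongside the model, and the expected values are asserted, not eyeballed.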

Time to Resolution

Once detected, resolution time varies wildly based on one thing: whether the team has lineage.

  • Teams with automated lineage: median 47 minutes
  • Teams without: median 6 hours

The difference is not talent. It is time spent answering the question "what depends on this column?"
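With column-level lineage, that question is a graph traversal instead of an afternoon of grepping SQL. A sketch over an invented lineage graph (the column names are illustrative):

```python
# Hypothetical sketch: answer "what depends on this column?" with a
# breadth-first walk over a column-level lineage graph.
from collections import deque

LINEAGE = {  # edges point downstream: column -> columns derived from it
    "raw.orders.amount": ["stg.orders.amount_usd"],
    "stg.orders.amount_usd": ["mart.revenue.daily_total", "mart.ltv.customer_ltv"],
}

def downstream(column, edges):
    """All columns transitively derived from `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

`downstream("raw.orders.amount", LINEAGE)` returns the full blast radius in milliseconds, which is the entire 47-minutes-versus-6-hours difference.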

What Actually Works

Across the teams with the best numbers, a few practices kept showing up:

  1. Freshness checks on every source, not just the obvious ones
  2. Schema change alerts from upstream vendors, not discovered at query time
  3. Row-count anomaly detection with adaptive thresholds
  4. Lineage at the column level, not just table level
  5. A shared incident channel where every data issue, no matter how small, gets posted
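Practice 3 above is the least familiar of the five, so here is one simple way it can work: treat a day's row count as anomalous when it falls outside a few standard deviations of a trailing window. The window size and threshold are assumptions, not a recommendation:

```python
# Hypothetical sketch: adaptive row-count anomaly detection using a
# trailing window's mean and standard deviation as the baseline.
from statistics import mean, stdev

def is_anomalous(history, today, k=3.0):
    """Flag `today` if it deviates more than k sigma from the trailing window."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > k * max(sigma, 1.0)  # floor sigma so flat history still alerts sanely

history = [10_000, 10_200, 9_900, 10_100, 10_050]
# a normal day passes; a day that loaded 4,000 rows gets flagged
```

Because the threshold is derived from recent history rather than hard-coded, it adapts as volume grows, which is what keeps it from becoming the alert everyone mutes.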

None of these are exotic. The teams that struggled were not missing sophistication — they were missing consistency.

Looking Ahead

The bar for data reliability is rising. As more business decisions get automated on top of data, the cost of a silent wrong answer goes up. The teams that invest in detection and lineage now are the ones that will still be trusted in 18 months.


Want help closing the detection gap? Talk to us.

Tags: incidents, data reliability, observability
