Data Reliability · September 30, 2025

The Data Reliability Checklist for Growing Teams

A practical checklist for data teams that have outgrown ad-hoc monitoring but aren't ready for a full observability platform.

By Pallisade Team

If you are a data team of three to fifteen people, you are probably past the point where "we'll notice if something breaks" works, but not yet at the point where you have a dedicated platform team. This checklist is for you.

Work through it in order: each section builds on the one before it.

Foundations

  • [ ] Every critical table has a documented owner
  • [ ] Every critical table has a documented freshness SLA
  • [ ] Every critical table has a documented schema contract, even if informal
  • [ ] "Critical" is defined — which tables actually matter to the business
  • [ ] A single channel exists where all data incidents get posted

Detection

  • [ ] Freshness checks run on every critical source
  • [ ] Row-count anomaly checks with adaptive (not hard-coded) thresholds
  • [ ] Null-rate checks on columns used in joins or metrics
  • [ ] Uniqueness checks on anything used as a key
  • [ ] Schema drift detection on upstream sources
  • [ ] Checks run often enough to catch problems within your SLA window

Lineage

  • [ ] You can trace any dashboard number back to its source tables
  • [ ] You can list every downstream consumer of any given column
  • [ ] Lineage is generated automatically, not maintained by hand
  • [ ] Lineage includes dbt models, BI tool queries, and reverse-ETL syncs
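"List every downstream consumer" is just a graph traversal once lineage exists as data. A sketch, with a hypothetical hand-written graph standing in for what you would actually generate from dbt's `manifest.json` and your BI tool's metadata API:

```python
from collections import deque

# Hypothetical lineage graph: node -> direct downstream consumers.
# In practice this is generated automatically, not maintained by hand.
LINEAGE = {
    "raw.orders":       ["stg_orders"],
    "stg_orders":       ["fct_revenue", "fct_order_counts"],
    "fct_revenue":      ["dashboard.weekly_revenue", "reverse_etl.crm_sync"],
    "fct_order_counts": ["dashboard.ops_overview"],
}

def downstream(node: str) -> set[str]:
    """Every transitive downstream consumer of a node (BFS)."""
    seen: set[str] = set()
    queue = deque([node])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Tracing a dashboard number back to its sources is the same traversal over the reversed graph.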

Response

  • [ ] A written runbook exists for "a quality check failed"
  • [ ] A written runbook exists for "an upstream source is down"
  • [ ] A written runbook exists for "a table needs to be restored"
  • [ ] At least one person other than the on-call has executed each runbook
  • [ ] Incidents get a postmortem within one week

Testing

  • [ ] dbt tests cover every critical model
  • [ ] Metric definitions have regression tests
  • [ ] Schema changes run through CI before merging
  • [ ] A staging environment exists and is actually used
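A metric regression test is the least-used item on this list, so here is a sketch of the shape. The metric definition and fixture are hypothetical; the idea is simply that the definition is pinned against known inputs, so an unreviewed change to it fails CI before it reaches a dashboard:

```python
# Hypothetical metric definition: net revenue = gross minus refunds.
def net_revenue(rows: list[dict]) -> float:
    return sum(r["gross"] - r["refund"] for r in rows)

# Pinned fixture with a hand-checked expected value.
FIXTURE = [
    {"gross": 100.0, "refund": 10.0},
    {"gross": 250.0, "refund": 0.0},
]

def test_net_revenue_definition():
    # 90.0 + 250.0 = 340.0; if the definition changes, this fails.
    assert net_revenue(FIXTURE) == 340.0
```

In a dbt project the same idea lives as a unit test or a singular test against seeded data; the mechanism matters less than having a pinned expectation at all.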

Observability

  • [ ] Pipeline run history is retained for at least 90 days
  • [ ] Query cost and duration are tracked per model
  • [ ] A dashboard exists showing pipeline health at a glance
  • [ ] Alerts are tuned — no channel is muted by anyone on the team
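Per-model cost and duration tracking does not need a platform to start; a small aggregation over run logs already answers "where does the time go?". A sketch, assuming run records with hypothetical `model` and `seconds` fields:

```python
from collections import defaultdict

def slowest_models(runs: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Total runtime per model, slowest first, from pipeline run logs."""
    totals: dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run["model"]] += run["seconds"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The same aggregation over a cost column gives you spend per model, which is usually the first thing the "pipeline health at a glance" dashboard should show.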

Backups & Recovery

  • [ ] Time-travel windows on critical tables match your detection lag
  • [ ] Restore procedures are documented
  • [ ] A restore drill has been run in the last quarter
  • [ ] Quality baselines are stored outside the warehouse
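The first item here is worth checking mechanically: a table whose time-travel retention is shorter than your worst-case detection lag is effectively unrecoverable. A sketch with hypothetical per-table settings:

```python
from datetime import timedelta

# Hypothetical config: warehouse time-travel retention vs. the
# worst-case lag before a bad load would actually be noticed.
TABLES = {
    "fct_revenue":   {"time_travel": timedelta(days=1),
                      "detection_lag": timedelta(days=3)},
    "dim_customers": {"time_travel": timedelta(days=7),
                      "detection_lag": timedelta(days=1)},
}

def unrecoverable_tables(tables: dict) -> list[str]:
    """Tables where a problem could be detected only after the
    time-travel window has already expired."""
    return [name for name, t in tables.items()
            if t["time_travel"] < t["detection_lag"]]
```

In the example config, `fct_revenue` would fail the check: a bad load noticed on day three cannot be rolled back through a one-day window.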

Culture

  • [ ] Every incident has a named owner and a resolution date
  • [ ] "The dashboard looks weird" is treated as an incident, not a question
  • [ ] Producers know who their downstream consumers are
  • [ ] Breaking changes to metrics require a review

Scoring

  • Under 15 checked: Start with detection. Most of your incidents are currently invisible to you.
  • 15–24 checked: Focus on lineage and response. You see problems but take too long to fix them.
  • 25–34 checked: You are in good shape. Focus on automation and culture.
  • 35–36 checked: You are ahead of most teams twice your size.
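If you want to tally this without counting by hand, the checklist has 36 items across eight sections, and the bands map directly to thresholds:

```python
# Items per section, as listed above (36 total).
SECTIONS = {
    "Foundations": 5, "Detection": 6, "Lineage": 4, "Response": 5,
    "Testing": 4, "Observability": 4, "Backups & Recovery": 4, "Culture": 4,
}

def band(total_checked: int) -> str:
    """Map a total score to the guidance above."""
    if total_checked < 15:
        return "start with detection"
    if total_checked < 25:
        return "focus on lineage and response"
    if total_checked < 35:
        return "focus on automation and culture"
    return "ahead of most teams twice your size"
```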

Want a second pair of eyes on your reliability setup? Get in touch.

Tags: checklist, data reliability, best practices
