Data Reliability · September 30, 2025
The Data Reliability Checklist for Growing Teams
A practical checklist for data teams that have outgrown ad-hoc monitoring but aren't ready for a full observability platform.
By Pallisade Team
If you are a data team of three to fifteen people, you are probably past the point where "we'll notice if something breaks" works, but not yet at the point where you have a dedicated platform team. This checklist is for you.
Work through it in order. Each item builds on the previous.
Foundations
- [ ] Every critical table has a documented owner
- [ ] Every critical table has a documented freshness SLA
- [ ] Every critical table has a documented schema contract, even if informal
- [ ] "Critical" is defined — which tables actually matter to the business
- [ ] A single channel exists where all data incidents get posted
Detection
- [ ] Freshness checks run on every critical source
- [ ] Row-count anomaly checks with adaptive (not hard-coded) thresholds
- [ ] Null-rate checks on columns used in joins or metrics
- [ ] Uniqueness checks on anything used as a key
- [ ] Schema drift detection on upstream sources
- [ ] Checks run often enough to catch problems within your SLA window
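"Adaptive, not hard-coded" in the row-count item above can be as simple as a z-score against recent history. A sketch, assuming you already have a list of daily counts for the table:

```python
import statistics

def row_count_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than z_threshold standard
    deviations from recent history. A hard-coded floor like `count > 1000`
    breaks as tables grow; this adapts to the table's own behavior."""
    if len(history) < 7:   # not enough history to be meaningful
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold
```

The threshold and minimum-history values are judgment calls; start loose and tighten as you learn the table's normal variance.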
Lineage
- [ ] You can trace any dashboard number back to its source tables
- [ ] You can list every downstream consumer of any given column
- [ ] Lineage is generated automatically, not maintained by hand
- [ ] Lineage includes dbt models, BI tool queries, and reverse-ETL syncs
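If you use dbt, the "generated automatically" item is within reach: dbt's `manifest.json` includes a `child_map` from each node to its direct children. A sketch that walks it to list every transitive downstream consumer (the node IDs here are made up):

```python
import json

def downstream_of(manifest_path: str, node_id: str) -> set[str]:
    """Collect every transitive downstream consumer of a dbt node
    (e.g. 'model.my_project.stg_orders') from manifest.json's child_map."""
    with open(manifest_path) as f:
        child_map = json.load(f)["child_map"]
    seen: set[str] = set()
    stack = [node_id]
    while stack:
        for child in child_map.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

This covers dbt models; BI queries and reverse-ETL syncs need their own extractors or a lineage tool that parses query logs.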
Response
- [ ] A written runbook exists for "a quality check failed"
- [ ] A written runbook exists for "an upstream source is down"
- [ ] A written runbook exists for "a table needs to be restored"
- [ ] At least one person other than the on-call has executed each runbook
- [ ] Incidents get a postmortem within one week
Testing
- [ ] dbt tests cover every critical model
- [ ] Metric definitions have regression tests
- [ ] Schema changes run through CI before merging
- [ ] A staging environment exists and is actually used
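For the dbt test coverage item, the built-in `unique`, `not_null`, and `accepted_values` tests cover most key and metric columns with a few lines of YAML. A sketch (model and column names are placeholders):

```yaml
# models/schema.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```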
Observability
- [ ] Pipeline run history is retained for at least 90 days
- [ ] Query cost and duration are tracked per model
- [ ] A dashboard exists showing pipeline health at a glance
- [ ] Alerts are tuned — no channel is muted by anyone on the team
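Tracking duration per model doesn't require a platform; a timing wrapper that appends to a run log is enough to start. A minimal sketch (in practice you would write to a warehouse table or metrics store rather than an in-memory list, and `run_model` is hypothetical):

```python
import time
from contextlib import contextmanager

RUN_LOG: list[dict] = []  # stand-in for a warehouse table or metrics store

@contextmanager
def timed_run(model_name: str):
    """Record wall-clock duration for each model run so slow models and
    cost regressions show up on the pipeline-health dashboard."""
    start = time.monotonic()
    try:
        yield
    finally:
        RUN_LOG.append({"model": model_name,
                        "duration_s": round(time.monotonic() - start, 3)})

# usage:
# with timed_run("orders_daily"):
#     run_model("orders_daily")   # hypothetical model runner
```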
Backups & Recovery
- [ ] Time-travel windows on critical tables match your detection lag
- [ ] Restore procedures are documented
- [ ] A restore drill has been run in the last quarter
- [ ] Quality baselines are stored outside the warehouse
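The last item matters because a baseline stored in the warehouse can be wiped by the same bad restore it was meant to catch. A minimal sketch of persisting row-count baselines to a file outside the warehouse (the path and table names are illustrative):

```python
import json
from pathlib import Path

def save_baselines(path: str, baselines: dict[str, int]) -> None:
    """Persist per-table row-count baselines outside the warehouse, so a
    bad restore or time-travel mistake can't silently erase them too."""
    Path(path).write_text(json.dumps(baselines, indent=2))

def load_baselines(path: str) -> dict[str, int]:
    return json.loads(Path(path).read_text())
```

Object storage or a separate metrics database works equally well; the point is a failure domain separate from the warehouse itself.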
Culture
- [ ] Every incident has a named owner and a resolution date
- [ ] "The dashboard looks weird" is treated as an incident, not a question
- [ ] Producers know who their downstream consumers are
- [ ] Breaking changes to metrics require a review
Scoring
- Under 15 checked: start with detection. Most of your incidents are currently invisible to you.
- 15–24 checked: focus on lineage and response. You see problems but take too long to fix them.
- 25–34 checked: you are in good shape. Focus on automation and culture.
- 35 or more: you are ahead of most teams twice your size.
Want a second pair of eyes on your reliability setup? Get in touch.
Tags: checklist, data reliability, best practices