Backups Aren't Enough: Testing Data Restorability
Time travel, snapshots, and fail-safe windows give the illusion of safety. Restorability is what you actually care about — and most teams have never tested it.
By Pallisade Team
Most data teams assume their warehouse's time-travel feature is a backup. It is not. It is a safety net with a very specific shape, and if you have not tested it against the actual ways your data gets corrupted, you do not know whether it works.
The Illusion of Safety
Snowflake has time travel. BigQuery has snapshots. Redshift has automated snapshots. All three give a comforting sense that "we can always roll back." But:
- Time travel windows are typically 1–7 days. Silent corruption often goes unnoticed longer.
- Rolling back a single table is easy. Rolling back a coordinated set of tables without breaking referential integrity is not.
- Fail-safe recovery requires vendor support tickets and has no SLA.
- Backups of the source system do not help if the corruption happened during transformation.
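When the time-travel window has not expired, the mechanics of a single-table rollback really are simple, which is part of the illusion. A minimal sketch, assuming Snowflake and a hypothetical `analytics.orders` table (the function only builds the SQL string; executing it against a warehouse is left out):

```python
from datetime import datetime

def timetravel_clone_sql(table: str, as_of: datetime, suffix: str = "_restored") -> str:
    """Build a Snowflake statement that clones a table as it existed at a
    past timestamp, using the time-travel AT(TIMESTAMP => ...) clause.
    Table and column names here are illustrative, not prescriptive."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CREATE TABLE {table}{suffix} CLONE {table} "
        f"AT (TIMESTAMP => '{ts}'::timestamp_ntz);"
    )

sql = timetravel_clone_sql("analytics.orders", datetime(2024, 5, 1, 6, 0, 0))
print(sql)
```

Note that this only works while the retention window is open, and it restores exactly one table. Coordinating several clones to the same instant is where the real difficulty starts.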
Restorability Is Different From Backup
The question is not "do I have a copy?" The question is: can I, today, restore this specific table to a known-good state in under an hour, without breaking anything downstream?
Most teams have never once attempted this end-to-end.
The Restorability Drill
Once a quarter, pick a non-critical table and run a full restore drill:
1. Identify a known-good timestamp to restore to
2. Snapshot the current state before touching anything
3. Execute the restore using your documented runbook
4. Re-run downstream dbt models that depend on the table
5. Verify data quality checks pass
6. Compare a few sample metrics to the known-good state
7. Time the entire exercise
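The steps above can be sketched as a small timed harness. This is a sketch only; the lambda step bodies are hypothetical stand-ins for your real snapshot, restore, and verification logic:

```python
import time

def run_drill(steps):
    """Run named drill steps in order, timing each one.
    Each step is a (name, callable) pair; a callable should raise on
    failure, which aborts the drill at that step."""
    timings = {}
    for name, step in steps:
        start = time.monotonic()
        step()
        timings[name] = time.monotonic() - start
    return timings

# Stand-in steps; replace with real restore logic for your warehouse.
timings = run_drill([
    ("snapshot_current_state", lambda: None),
    ("execute_restore", lambda: None),
    ("rerun_downstream_models", lambda: None),
    ("verify_quality_checks", lambda: None),
])
total = sum(timings.values())
print(f"drill completed in {total:.1f}s")
```

The per-step timings are the useful output: they tell you which part of the hour budget the restore itself consumes versus downstream rebuilds and verification.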
If step 3 starts with "we'll figure out the runbook during the drill," that is the first problem to fix.
Common Failure Modes
What usually goes wrong during a drill:
Dependency tangle
Restoring table A requires table B to exist at the same historical point, but B has its own time-travel window that has already expired.
Permissions gap
The user who runs scheduled jobs cannot restore. The user who can restore does not have access from a non-production machine during an incident.
Downstream cache
BI tools or ML feature stores cached the bad data. Restoring the warehouse does not propagate the fix automatically.
No known-good state
Without a quality baseline, "restored" is indistinguishable from "currently broken." You need to know what the right answer looks like before you can verify the restore worked.
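A baseline does not need to be elaborate: a row count, a column sum, and an order-independent checksum already catch most corruption. A minimal sketch, using an in-memory SQLite table as a stand-in for the warehouse (table and column names are illustrative):

```python
import hashlib
import json
import sqlite3

def quality_baseline(conn, table: str, key_col: str, value_col: str) -> dict:
    """Capture a small quality baseline: row count, sum of a numeric
    column, and an order-independent checksum of (key, value) pairs."""
    rows = conn.execute(f"SELECT {key_col}, {value_col} FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so row order doesn't change the hash
        digest.update(json.dumps(row).encode())
    return {
        "row_count": len(rows),
        "value_sum": sum(r[1] for r in rows),
        "checksum": digest.hexdigest(),
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
baseline = quality_baseline(conn, "orders", "id", "amount")
print(baseline["row_count"], baseline["value_sum"])  # 2 15.5
```

Store the resulting dict somewhere outside the warehouse, so a warehouse-level incident cannot take the baseline down with the data it describes.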
Minimum Viable Restorability
If you do nothing else:
- Document the exact steps to restore your five most critical tables
- Extend time-travel windows on those tables to match your detection lag
- Store daily quality baselines (row counts, checksums, top aggregates) in a separate system
- Run the drill at least once, with a stopwatch
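With baselines stored, verifying a restore reduces to diffing current metrics against the saved snapshot. A minimal sketch (metric names are illustrative and match nothing specific):

```python
def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.0) -> list:
    """Return a list of human-readable discrepancies between a stored
    baseline and freshly computed metrics; an empty list means the
    restored table matches the known-good state."""
    problems = []
    for metric, expected in baseline.items():
        actual = current.get(metric)
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            if abs(actual - expected) > tolerance:
                problems.append(f"{metric}: expected {expected}, got {actual}")
        elif actual != expected:
            problems.append(f"{metric}: expected {expected!r}, got {actual!r}")
    return problems

stored = {"row_count": 1200, "checksum": "ab12"}
restored = {"row_count": 1200, "checksum": "ab12"}
print(compare_to_baseline(stored, restored))  # []
```

A nonzero `tolerance` is useful for metrics that legitimately drift a little, such as floating-point aggregates recomputed after a restore.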
An untested recovery plan is not a recovery plan. It is a hope.
Want help stress-testing your restore procedures? Get in touch.