Backups Aren't Enough: Testing Data Restorability
Time travel, snapshots, and fail-safe windows give the illusion of safety. Restorability is what you actually care about — and most teams have never tested it.
By Pallisade Team
Most data teams assume their warehouse's time-travel feature is a backup. It is not. It is a safety net with a very specific shape, and if you have not tested it against the actual ways your data gets corrupted, you do not know whether it works.
The Illusion of Safety
Snowflake has time travel. BigQuery has snapshots. Redshift has automated snapshots. All three give a comforting sense that "we can always roll back." But:
- Time travel windows are typically 1–7 days. Silent corruption often goes unnoticed longer.
- Rolling back a single table is easy. Rolling back a coordinated set of tables without breaking referential integrity is not.
- Fail-safe recovery requires vendor support tickets and has no SLA.
- Backups of the source system do not help if the corruption happened during transformation.
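When the time-travel window has not expired, the mechanics of a single-table rollback really are simple, which is part of the illusion. A minimal sketch, assuming Snowflake and a hypothetical `analytics.orders` table (the function only builds the SQL string; executing it against a warehouse is left out):

```python
from datetime import datetime

def timetravel_clone_sql(table: str, as_of: datetime, suffix: str = "_restored") -> str:
    """Build a Snowflake statement that clones a table as it existed at a
    past timestamp, using the time-travel AT(TIMESTAMP => ...) clause.
    Table and column names here are illustrative, not prescriptive."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CREATE TABLE {table}{suffix} CLONE {table} "
        f"AT (TIMESTAMP => '{ts}'::timestamp_ntz);"
    )

sql = timetravel_clone_sql("analytics.orders", datetime(2024, 5, 1, 6, 0, 0))
print(sql)
```

Note that this only works while the retention window is open, and it restores exactly one table. Coordinating several clones to the same instant is where the real difficulty starts.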
Restorability Is Different From Backup
The question is not "do I have a copy?" The question is: can I, today, restore this specific table to a known-good state in under an hour, without breaking anything downstream?
Most teams have never once attempted this end-to-end.
The Restorability Drill
Once a quarter, pick a non-critical table and run a full restore drill:
1. Identify a known-good timestamp to restore to
2. Snapshot the current state before touching anything
3. Execute the restore using your documented runbook
4. Re-run downstream dbt models that depend on the table
5. Verify data quality checks pass
6. Compare a few sample metrics to the known-good state
7. Time the entire exercise
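The steps above can be sketched as a small timed harness. This is a sketch only; the lambda step bodies are hypothetical stand-ins for your real snapshot, restore, and verification logic:

```python
import time

def run_drill(steps):
    """Run named drill steps in order, timing each one.
    Each step is a (name, callable) pair; a callable should raise on
    failure, which aborts the drill at that step."""
    timings = {}
    for name, step in steps:
        start = time.monotonic()
        step()
        timings[name] = time.monotonic() - start
    return timings

# Stand-in steps; replace with real restore logic for your warehouse.
timings = run_drill([
    ("snapshot_current_state", lambda: None),
    ("execute_restore", lambda: None),
    ("rerun_downstream_models", lambda: None),
    ("verify_quality_checks", lambda: None),
])
total = sum(timings.values())
print(f"drill completed in {total:.1f}s")
```

The per-step timings are the useful output: they tell you which part of the hour budget the restore itself consumes versus downstream rebuilds and verification.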
If step 3 starts with "we'll figure out the runbook during the drill," that is the first problem to fix.
Common Failure Modes
What usually goes wrong during a drill:
Dependency tangle
Restoring table A requires table B to exist at the same historical point, but B has its own time-travel window that has already expired.
Permissions gap
The user who runs scheduled jobs cannot restore. The user who can restore does not have access from a non-production machine during an incident.
Downstream cache
BI tools or ML feature stores cached the bad data. Restoring the warehouse does not propagate the fix automatically.
No known-good state
Without a quality baseline, "restored" is indistinguishable from "currently broken." You need to know what the right answer looks like before you can verify the restore worked.
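A baseline does not need to be elaborate: a row count, a column sum, and an order-independent checksum already catch most corruption. A minimal sketch, using an in-memory SQLite table as a stand-in for the warehouse (table and column names are illustrative):

```python
import hashlib
import json
import sqlite3

def quality_baseline(conn, table: str, key_col: str, value_col: str) -> dict:
    """Capture a small quality baseline: row count, sum of a numeric
    column, and an order-independent checksum of (key, value) pairs."""
    rows = conn.execute(f"SELECT {key_col}, {value_col} FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so row order doesn't change the hash
        digest.update(json.dumps(row).encode())
    return {
        "row_count": len(rows),
        "value_sum": sum(r[1] for r in rows),
        "checksum": digest.hexdigest(),
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
baseline = quality_baseline(conn, "orders", "id", "amount")
print(baseline["row_count"], baseline["value_sum"])  # 2 15.5
```

Store the resulting dict somewhere outside the warehouse, so a warehouse-level incident cannot take the baseline down with the data it describes.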
Minimum Viable Restorability
If you do nothing else:
- Document the exact steps to restore your five most critical tables
- Extend time-travel windows on those tables to match your detection lag
- Store daily quality baselines (row counts, checksums, top aggregates) in a separate system
- Run the drill at least once, with a stopwatch
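With baselines stored, verifying a restore reduces to diffing current metrics against the saved snapshot. A minimal sketch (metric names are illustrative and match nothing specific):

```python
def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.0) -> list:
    """Return a list of human-readable discrepancies between a stored
    baseline and freshly computed metrics; an empty list means the
    restored table matches the known-good state."""
    problems = []
    for metric, expected in baseline.items():
        actual = current.get(metric)
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            if abs(actual - expected) > tolerance:
                problems.append(f"{metric}: expected {expected}, got {actual}")
        elif actual != expected:
            problems.append(f"{metric}: expected {expected!r}, got {actual!r}")
    return problems

stored = {"row_count": 1200, "checksum": "ab12"}
restored = {"row_count": 1200, "checksum": "ab12"}
print(compare_to_baseline(stored, restored))  # []
```

A nonzero `tolerance` is useful for metrics that legitimately drift a little, such as floating-point aggregates recomputed after a restore.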
An untested recovery plan is not a recovery plan. It is a hope.
Want help stress-testing your restore procedures? Get in touch.