Data Engineering | November 2, 2025

Backups Aren't Enough: Testing Data Restorability

Time travel, snapshots, and fail-safe windows give the illusion of safety. Restorability is what you actually care about, and most teams have never tested it.

By Pallisade Team

Most data teams assume their warehouse's time-travel feature is a backup. It is not. It is a safety net with a very specific shape, and if you have not tested it against the actual ways your data gets corrupted, you do not know whether it works.

The Illusion of Safety

Snowflake has Time Travel. BigQuery has time travel and table snapshots. Redshift has automated snapshots. All three give a comforting sense that "we can always roll back." But:

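  • Time travel windows are short, typically 1–7 days: Snowflake defaults to 1 day (configurable up to 90 days on Enterprise Edition and above), and BigQuery allows 2 to 7 days. Silent corruption often goes unnoticed for longer than that.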
  • Rolling back a single table is easy. Rolling back a coordinated set of tables without breaking referential integrity is not.
  • Fail-safe recovery requires vendor support tickets and has no SLA.
  • Backups of the source system do not help if the corruption happened during transformation.
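
The first point can be checked mechanically. The following is a minimal sketch, assuming you can estimate a per-table detection lag (how long corruption typically goes unnoticed); the table names, retention values, and the 12-hour safety margin are hypothetical.

```python
from datetime import timedelta


def recoverable_via_time_travel(
    retention: timedelta,
    detection_lag: timedelta,
    safety_margin: timedelta = timedelta(hours=12),  # assumed time needed to run the restore
) -> bool:
    """True if corruption found after `detection_lag` is still inside the retention window."""
    return detection_lag + safety_margin <= retention


# Hypothetical inventory: table -> (time-travel retention, typical detection lag)
tables = {
    "orders": (timedelta(days=1), timedelta(days=3)),
    "customers": (timedelta(days=7), timedelta(days=2)),
}

at_risk = sorted(
    name
    for name, (retention, lag) in tables.items()
    if not recoverable_via_time_travel(retention, lag)
)
print(at_risk)  # tables whose window expires before anyone would notice
```

Any table in `at_risk` is one where time travel would likely have expired by the time the problem is discovered.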

Restorability Is Different From Backup

The question is not "do I have a copy?" The question is: can I, today, restore this specific table to a known-good state in under an hour, without breaking anything downstream?

Most teams have never once attempted this end-to-end.

The Restorability Drill

Once a quarter, pick a non-critical table and run a full restore drill:

  1. Identify a known-good timestamp to restore to
  2. Snapshot the current state before touching anything
  3. Execute the restore using your documented runbook
  4. Re-run downstream dbt models that depend on the table
  5. Verify data quality checks pass
  6. Compare a few sample metrics to the known-good state
  7. Time the entire exercise

If step 3 starts with "we'll figure out the runbook during the drill," that is the first problem to fix.
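
The drill steps above can be wrapped in a small timing harness so that the stopwatch (step 7) is never forgotten. This is a sketch only: the `run_drill` helper and the `todo` placeholder are hypothetical names, and each placeholder would be replaced by whatever your runbook actually does for that step.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class DrillReport:
    timings: List[Tuple[str, float]] = field(default_factory=list)
    failed_step: Optional[str] = None
    error: Optional[str] = None

    @property
    def total_seconds(self) -> float:
        return sum(seconds for _, seconds in self.timings)


def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> DrillReport:
    """Run steps in order, timing each one; stop at the first step that raises."""
    report = DrillReport()
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
        except Exception as exc:
            report.failed_step = name
            report.error = repr(exc)
        report.timings.append((name, time.monotonic() - started))
        if report.failed_step is not None:
            break
    return report


def todo() -> None:
    """Placeholder: replace with the real runbook action for this step."""


drill_steps = [
    ("identify known-good timestamp", todo),
    ("snapshot current state", todo),
    ("execute restore from runbook", todo),
    ("re-run downstream dbt models", todo),
    ("verify data quality checks", todo),
    ("compare sample metrics", todo),
]

report = run_drill(drill_steps)
print(report.failed_step, f"{report.total_seconds:.1f}s")
```

Stopping at the first failure is a deliberate choice here: a restore that fails halfway is exactly the situation the drill is meant to surface, and the per-step timings show where the hour went.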

Common Failure Modes

What usually goes wrong during a drill:

Dependency tangle

Restoring table A requires table B to exist at the same historical point, but B has its own time-travel window that has already expired.
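
A dependency check like this can run before the restore starts. The sketch below assumes you maintain a mapping from each table to its time-travel retention; the table names and values are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_dependencies(
    restore_point: datetime,
    retention_by_table: Dict[str, timedelta],
    now: datetime,
) -> List[str]:
    """Return tables whose time-travel window no longer reaches `restore_point`."""
    return sorted(
        table
        for table, retention in retention_by_table.items()
        if now - retention > restore_point  # earliest recoverable point is too recent
    )


now = datetime(2025, 11, 2, 12, 0, tzinfo=timezone.utc)
restore_point = now - timedelta(days=3)

# Hypothetical tables that must be restored together
retention_by_table = {
    "fct_orders": timedelta(days=7),
    "dim_customers": timedelta(days=1),
}

blocked = expired_dependencies(restore_point, retention_by_table, now)
print(blocked)
```

If `blocked` is non-empty, the coordinated restore cannot be done from time travel alone, and it is better to learn that before the first table has been rolled back.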

Permissions gap

The user who runs scheduled jobs cannot restore. The user who can restore does not have access from a non-production machine during an incident.

Downstream cache

BI tools or ML feature stores cached the bad data. Restoring the warehouse does not propagate the fix automatically.

No known-good state

Without a quality baseline, "restored" is indistinguishable from "currently broken." You need to know what the right answer looks like before you can verify the restore worked.
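
Verification against a baseline can be as simple as a dictionary comparison. This is a minimal sketch, assuming a baseline was recorded as a flat mapping of metric name to value; the metric names and figures are hypothetical.

```python
from math import isclose
from typing import Any, Dict, Tuple


def baseline_mismatches(
    baseline: Dict[str, Any],
    restored: Dict[str, Any],
    rel_tol: float = 0.0,
) -> Dict[str, Tuple[Any, Any]]:
    """Return {metric: (expected, actual)} for every metric that differs."""
    mismatches = {}
    for metric, expected in baseline.items():
        actual = restored.get(metric)
        both_numeric = isinstance(expected, (int, float)) and isinstance(actual, (int, float))
        if both_numeric:
            matches = isclose(actual, expected, rel_tol=rel_tol)
        else:
            matches = actual == expected
        if not matches:
            mismatches[metric] = (expected, actual)
    return mismatches


# Hypothetical baseline captured while the table was known to be good
baseline = {"row_count": 1000, "revenue_sum": 52340.75, "checksum": "ab12"}
# Hypothetical metrics measured on the restored table
restored = {"row_count": 1000, "revenue_sum": 52340.75, "checksum": "ff00"}

print(baseline_mismatches(baseline, restored))
```

An empty result is the only outcome that should count as "restored"; anything else tells you which metric to investigate first.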

Minimum Viable Restorability

If you do nothing else:

  • Document the exact steps to restore your five most critical tables
  • Extend time-travel windows on those tables to match your detection lag, as far as your platform and edition allow
  • Store daily quality baselines (row counts, checksums, top aggregates) in a separate system
  • Run the drill at least once, with a stopwatch
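
A daily baseline of the kind listed above (row counts, checksums, top aggregates) might be computed like this. It is a sketch under stated assumptions: rows arrive as dictionaries, the `table_baseline` helper and column names are hypothetical, and in practice the same figures would more likely be computed in SQL inside the warehouse and then written to a separate system.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def table_baseline(rows: Iterable[Dict[str, Any]], sum_columns: List[str]) -> Dict[str, Any]:
    """Compute row count, an order-independent checksum, and column sums."""
    row_count = 0
    checksum = 0
    sums = {column: 0 for column in sum_columns}
    for row in rows:
        row_count += 1
        # Canonical form: sort by column name so key order does not matter.
        canonical = repr(sorted(row.items())).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(canonical).digest(), "big")
        # Adding hashes modulo 2**256 makes the result independent of row order.
        checksum = (checksum + row_hash) % (1 << 256)
        for column in sum_columns:
            sums[column] += row[column]
    return {"row_count": row_count, "checksum": f"{checksum:064x}", "sums": sums}


# Hypothetical sample rows; amounts are integer cents to avoid float rounding drift
rows = [
    {"order_id": 1, "amount_cents": 1250},
    {"order_id": 2, "amount_cents": 900},
    {"order_id": 3, "amount_cents": 4300},
]

today = table_baseline(rows, ["amount_cents"])
print(today["row_count"], today["sums"], today["checksum"][:12])
```

The checksum sums per-row hashes rather than hashing the rows in sequence, so the same data returned in a different order produces the same baseline.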

An untested recovery plan is not a recovery plan. It is a hope.


Tags: backup, recovery, disaster recovery, data reliability
