Engineering · November 12, 2025

Anatomy of a Pipeline Failure: A Postmortem Template

How to run an effective postmortem when your data pipeline fails, with a complete template.

By Pallisade Team

It's 3:47 PM on a Friday. The CEO messages: "Why does the revenue dashboard show $0?"

Your heart rate doubles. You check Airflow. Failed. You check Slack. 47 unread messages. You check your calendar. The board meeting is Monday.

Welcome to the pipeline failure club. Membership is mandatory. How you respond defines your team.

Why Postmortems Matter

Bad teams: "It's fixed. Let's move on."

Good teams: "It's fixed. Let's make sure it never happens again."

A proper postmortem:

  • Prevents repeat incidents
  • Builds institutional knowledge
  • Improves on-call experience
  • Demonstrates maturity to stakeholders
  • Creates accountability without blame

The Postmortem Template

Part 1: Incident Summary

## Incident Summary

Title: Daily Revenue Pipeline Failure - 2025-11-12

Severity: SEV-2 (Major business impact, no data loss)

Duration: 4 hours 23 minutes from detection to recovery (14:12 - 18:35 UTC); the pipeline had been failing silently since 06:06 UTC

Impact:

  • Revenue dashboard showed $0 for 4+ hours
  • Finance team delayed month-end close
  • CFO escalated to Engineering leadership
  • 12 Slack threads, 3 executive inquiries

Root Cause: Expired Stripe API credentials

Part 2: Timeline

Be specific. Timestamps matter.

## Timeline (All times UTC)
| Time  | Event |
|-------|-------|
| 06:00 | Daily pipeline scheduled start |
| 06:01 | Stripe extraction job started |
| 06:03 | Stripe API returned 401 Unauthorized |
| 06:04 | Job retried (attempt 2/3) - failed |
| 06:05 | Job retried (attempt 3/3) - failed |
| 06:06 | Job marked as failed, no alert triggered |
| 14:12 | CFO opened revenue dashboard, saw $0 |
| 14:15 | #data-alerts Slack message from CFO |
| 14:23 | On-call engineer acknowledged |
| 14:45 | Root cause identified (Stripe API key expired) |
| 15:30 | New API key generated and deployed |
| 16:00 | Pipeline manually triggered |
| 18:35 | All downstream jobs completed |
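Once the timeline is written down, the headline metrics fall out of simple subtraction. The sketch below is one way to compute them, using the timestamps from the example timeline; the event labels (`failed`, `detected`, and so on) are names chosen for this illustration, not part of the template.

```python
from datetime import datetime, timedelta

# Key timestamps copied from the example timeline (UTC, all on the same day).
# The dictionary keys are illustrative labels, not template fields.
TIMELINE = {
    "failed": "06:06",        # job marked as failed, no alert triggered
    "detected": "14:12",      # CFO saw $0 on the dashboard
    "reported": "14:15",      # message posted in #data-alerts
    "acknowledged": "14:23",  # on-call engineer acknowledged
    "diagnosed": "14:45",     # root cause identified
    "resolved": "18:35",      # all downstream jobs completed
}


def elapsed(start: str, end: str, timeline: dict = TIMELINE) -> timedelta:
    """Return the time between two named events on the same day."""
    time_format = "%H:%M"
    begin = datetime.strptime(timeline[start], time_format)
    finish = datetime.strptime(timeline[end], time_format)
    return finish - begin


def format_delta(delta: timedelta) -> str:
    """Render a timedelta as 'Xh YYm'."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    print("Time to detect:      ", format_delta(elapsed("failed", "detected")))
    print("Time to acknowledge: ", format_delta(elapsed("reported", "acknowledged")))
    print("Time to diagnose:    ", format_delta(elapsed("acknowledged", "diagnosed")))
    print("Detection to recovery:", format_delta(elapsed("detected", "resolved")))
```

Running the numbers this way makes one thing obvious: the gap between failure and detection (06:06 to 14:12) is nearly twice as long as the gap between detection and recovery.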

Part 3: Root Cause Analysis

Use the 5 Whys technique:

## Root Cause Analysis

Why did the dashboard show $0? → The revenue pipeline didn't run successfully.

Why didn't the pipeline run successfully? → The Stripe extraction job failed.

Why did the Stripe extraction job fail? → The Stripe API returned 401 Unauthorized.

Why did the API return 401? → The API key had expired.

Why did the API key expire without warning? → We had no monitoring on credential expiration dates.

Root Cause: No credential lifecycle management. Stripe API key created 12 months ago with default expiration.
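The missing control here is small. As a minimal sketch, assume your team keeps an inventory of credentials with known expiry dates (the `Credential` class, the inventory entries, and the 30-day warning window below are all hypothetical). A daily check can then flag anything expired or about to expire:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    """One entry in a hypothetical, hand-maintained credential inventory."""
    name: str
    owner: str
    expires_on: date


def expiring_soon(credentials, today: date, warn_days: int = 30):
    """Return credentials already expired or expiring within warn_days, soonest first."""
    cutoff = today + timedelta(days=warn_days)
    due = [c for c in credentials if c.expires_on <= cutoff]
    return sorted(due, key=lambda c: c.expires_on)


if __name__ == "__main__":
    # Illustrative inventory; names, owners, and dates are made up.
    inventory = [
        Credential("stripe-api-key", "@alice", date(2025, 11, 12)),
        Credential("warehouse-service-account", "@bob", date(2026, 6, 1)),
    ]
    today = date(2025, 10, 20)
    for cred in expiring_soon(inventory, today):
        days_left = (cred.expires_on - today).days
        print(f"{cred.name} ({cred.owner}) expires in {days_left} days")
```

Where the expiry dates come from (a secrets manager, a spreadsheet, a provider API) depends on your stack; the check itself stays the same.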

Part 4: What Went Wrong

Be honest. This isn't about blame—it's about learning.

## What Went Wrong

Detection

  • [ ] No alert on pipeline failure - Job failed silently
  • [ ] 8-hour detection time - Job failed at 06:06, found by a user at 14:12, not by monitoring
  • [ ] No freshness alert - No SLO defined for revenue data

Response

  • [ ] On-call not paged - Slack notification only
  • [ ] Runbook outdated - Last updated 8 months ago
  • [ ] No escalation path - CFO found out before Engineering

Recovery

  • [ ] Manual key rotation - No automated rotation in place
  • [ ] Manual pipeline trigger - No "catch-up" automation
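The missing freshness alert listed under Detection is worth sketching, because it checks the data rather than the job. A job can fail silently; stale data cannot hide. The example below is a rough sketch only: the 26-hour threshold and the timestamp of the last successful load are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLO: revenue data must never be more than 26 hours old
# (a daily run plus a grace period). Pick a threshold that fits your schedule.
REVENUE_MAX_AGE = timedelta(hours=26)


def is_stale(last_loaded_at: datetime, now: datetime,
             max_age: timedelta = REVENUE_MAX_AGE) -> bool:
    """True when the newest successful load is older than the SLO allows."""
    return now - last_loaded_at > max_age


if __name__ == "__main__":
    # Assumed values: the previous day's load finished at 08:35 UTC,
    # and the check runs at 11:00 UTC on the day of the incident.
    last_good_load = datetime(2025, 11, 11, 8, 35, tzinfo=timezone.utc)
    check_time = datetime(2025, 11, 12, 11, 0, tzinfo=timezone.utc)
    if is_stale(last_good_load, check_time):
        print("PAGE: revenue data is stale, last load", last_good_load.isoformat())
```

In practice `last_loaded_at` would come from a query such as the maximum load timestamp in the target table, and the check would run on its own schedule, independent of the pipeline it watches.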

Part 5: What Went Right

Celebrate wins, even in failure.

## What Went Right
  • [x] On-call responded in 8 minutes once alerted
  • [x] Root cause found in 22 minutes
  • [x] No data loss - Pipeline is idempotent
  • [x] Clear communication - Stakeholders updated every 30 min
  • [x] Backfill worked - Historical data was recoverable

Part 6: Action Items

This is the most important part. Every action item should be specific, assigned to one owner, and tied to a deadline.

## Action Items
| Priority | Action | Owner | Deadline | Status |
|----------|--------|-------|----------|--------|
| P0 | Add PagerDuty alert on pipeline failure | @alice | Nov 14 | ✅ Done |
| P0 | Define freshness SLO for revenue data | @bob | Nov 15 | ✅ Done |
| P1 | Implement credential expiry monitoring | @alice | Nov 22 | 🔄 In Progress |
| P1 | Add Stripe API key rotation automation | @charlie | Nov 29 | 📅 Scheduled |
| P1 | Update on-call runbook | @bob | Nov 18 | 📅 Scheduled |
| P2 | Add "catch-up" mode for failed pipelines | @alice | Dec 6 | 📅 Scheduled |
| P2 | Quarterly credential audit process | @charlie | Dec 15 | 📅 Scheduled |
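Deadlines only help if someone checks them. A weekly review can be as simple as the sketch below, which flags open items past their deadline. The `ActionItem` class is hypothetical, the year 2025 is assumed from the incident date, and in a real setup the items would come from your tracker rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    """Hypothetical record mirroring a row of the action items table."""
    action: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items, today: date):
    """Return open items whose deadline has passed, oldest deadline first."""
    late = [item for item in items if not item.done and item.deadline < today]
    return sorted(late, key=lambda item: item.deadline)


if __name__ == "__main__":
    # Three rows from the example table; the review date is made up.
    items = [
        ActionItem("Add PagerDuty alert on pipeline failure", "@alice", date(2025, 11, 14), done=True),
        ActionItem("Implement credential expiry monitoring", "@alice", date(2025, 11, 22)),
        ActionItem("Update on-call runbook", "@bob", date(2025, 11, 18)),
    ]
    for item in overdue(items, today=date(2025, 11, 20)):
        print(f"OVERDUE: {item.action} ({item.owner}, due {item.deadline.isoformat()})")
```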

Part 7: Lessons Learned

Synthesize for future reference.

## Lessons Learned
  1. Credential expiration is a ticking time bomb. Every API key, service account, and token needs lifecycle management.
  2. Silent failures are the worst failures. If no one gets paged, the failure didn't happen, until it did.
  3. SLOs are detection mechanisms. A freshness SLO would have caught this in minutes, not hours.
  4. Runbooks rot. Quarterly review is mandatory.
  5. Pipeline failures should auto-escalate. 1 hour silent = Slack. 2 hours = PagerDuty. 4 hours = VP.
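The escalation tiers in the last lesson map onto a small lookup table. Here is a minimal sketch, assuming you already track how long a failure has gone unacknowledged; the target names are placeholder strings, and wiring them to real Slack or PagerDuty integrations is left out.

```python
from datetime import timedelta

# Tiers from the lesson above, ordered from most to least severe so the
# first match wins. The target names are placeholders, not integrations.
ESCALATION_TIERS = [
    (timedelta(hours=4), "vp"),
    (timedelta(hours=2), "pagerduty"),
    (timedelta(hours=1), "slack"),
]


def escalation_target(unacknowledged_for: timedelta):
    """Return the highest tier reached, or None if under one hour."""
    for threshold, target in ESCALATION_TIERS:
        if unacknowledged_for >= threshold:
            return target
    return None


if __name__ == "__main__":
    for hours in (0.5, 1, 3, 8):
        print(f"{hours}h silent ->", escalation_target(timedelta(hours=hours)))
```

In the example incident, the failure sat unacknowledged for over eight hours, which under this policy would have reached the top tier long before the CFO opened the dashboard.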

Postmortem Meeting Template

Duration: 45 minutes

Attendees: Incident responders, affected stakeholders, engineering leadership

Agenda

  1. Summary read-through (5 min)
       • Owner presents the summary
  2. Timeline review (10 min)
       • Walk through the timeline
       • Correct any inaccuracies
  3. Root cause discussion (10 min)
       • Validate the 5 Whys
       • Identify any missing causes
  4. Action item review (15 min)
       • Prioritize action items
       • Assign owners and deadlines
       • Identify blockers
  5. Close-out (5 min)
       • Schedule follow-up if needed
       • Confirm documentation location

Common Anti-Patterns

Anti-Pattern 1: The Blame Game

❌ "This happened because John pushed bad code."

✅ "This happened because our code review process didn't catch the issue."

Anti-Pattern 2: The Quick Fix

❌ "We fixed it, let's move on."

✅ "We fixed the symptom. Let's address the root cause."

Anti-Pattern 3: The Vanishing Action Items

❌ Action items that never get done.

✅ Track in Jira/Linear. Review weekly. Report on completion.

Anti-Pattern 4: The Novel

❌ 20-page document no one reads.

✅ 2-page summary with links to details.

Automate the Prevention

Most pipeline failures have predictable causes:

| Cause | Prevention |
|-------|------------|
| Credential expiration | Monitor expiry dates, auto-rotate |
| Schema changes | Schema validation, change detection |
| API rate limits | Backoff strategies, caching |
| Resource exhaustion | Capacity monitoring, auto-scaling |
| Upstream delays | Dependency tracking, alerting |
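To make one row of that table concrete, here is a rough sketch of a backoff strategy for rate-limited API calls: exponential backoff with full jitter. `RateLimitError` is a hypothetical exception standing in for whatever your client raises on HTTP 429, and the default delays are illustrative.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error a client raises when the API returns HTTP 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn, retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error so alerting can fire
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Only the rate-limit error is retried, and that is deliberate. An authentication failure like the 401 in the example incident will not fix itself on retry, so it should fail fast and page someone instead.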

Pallisade auto-detects these risks and generates fixes before they become incidents.


Postmortem Template Download

Want the full template as a Markdown file?

Download Postmortem Template →


Better yet: Prevent the postmortem entirely.

Pallisade monitors credential expiration, pipeline health, and data freshness—so you find issues before your CEO does.

Get Your Free DRR Score →

Tags: postmortem, incident response, pipelines, data engineering, on-call
