Anatomy of a Pipeline Failure: A Postmortem Template
How to run an effective postmortem when your data pipeline fails, with a complete template.
By Pallisade Team
It's 3:47 PM on a Friday. The CEO messages: "Why does the revenue dashboard show $0?"
Your heart rate doubles. You check Airflow. Failed. You check Slack. 47 unread messages. You check your calendar. The board meeting is Monday.
Welcome to the pipeline failure club. Membership is mandatory. How you respond defines your team.
Why Postmortems Matter
Bad teams: "It's fixed. Let's move on."
Good teams: "It's fixed. Let's make sure it never happens again."
A proper postmortem:
- Prevents repeat incidents
- Builds institutional knowledge
- Improves on-call experience
- Demonstrates maturity to stakeholders
- Creates accountability without blame
The Postmortem Template
Part 1: Incident Summary
## Incident Summary
Title: Daily Revenue Pipeline Failure - 2025-11-12
Severity: SEV-2 (Major business impact, no data loss)
Duration: 4 hours 23 minutes from detection to recovery (14:12 - 18:35 UTC); the job itself failed at 06:06 UTC
Impact:
- Revenue dashboard showed $0 for 4+ hours
- Finance team delayed month-end close
- CFO escalated to Engineering leadership
- 12 Slack threads, 3 executive inquiries
Root Cause: Expired Stripe API credentials
Part 2: Timeline
Be specific. Timestamps matter.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 06:00 | Daily pipeline scheduled start |
| 06:01 | Stripe extraction job started |
| 06:03 | Stripe API returned 401 Unauthorized |
| 06:04 | Job retried (attempt 2/3) - failed |
| 06:05 | Job retried (attempt 3/3) - failed |
| 06:06 | Job marked as failed, no alert triggered |
| 14:12 | CFO opened revenue dashboard, saw $0 |
| 14:15 | #data-alerts Slack message from CFO |
| 14:23 | On-call engineer acknowledged |
| 14:45 | Root cause identified (Stripe API key expired) |
| 15:30 | New API key generated and deployed |
| 16:00 | Pipeline manually triggered |
| 18:35 | All downstream jobs completed |
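Once the timeline is written down, the headline numbers for the rest of the postmortem fall out of simple subtraction. The sketch below uses only the timestamps from the example timeline; the metric names and the `at` and `fmt` helpers are illustrative, not part of any standard.

```python
from datetime import datetime, timedelta

# All timestamps come from the example timeline (2025-11-12, UTC).
DAY = "2025-11-12"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


failed = at("06:06")        # job marked as failed, no alert
detected = at("14:12")      # CFO saw $0 on the dashboard
reported = at("14:15")      # Slack message in #data-alerts
acknowledged = at("14:23")  # on-call engineer acknowledged
root_cause = at("14:45")    # expired key identified
recovered = at("18:35")     # all downstream jobs completed

metrics = {
    "time_to_detect": detected - failed,
    "time_to_acknowledge": acknowledged - reported,
    "time_to_root_cause": root_cause - acknowledged,
    "detection_to_recovery": recovered - detected,
    "failure_to_recovery": recovered - failed,
}


def fmt(delta: timedelta) -> str:
    """Render a timedelta as 'Hh MMm'."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


for name, delta in metrics.items():
    print(f"{name:>22}: {fmt(delta)}")
```

Computing these from the timeline, rather than estimating them from memory, keeps the summary, the "what went wrong" list, and the "what went right" list consistent with each other.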
Part 3: Root Cause Analysis
Use the 5 Whys technique:
## Root Cause Analysis
Why did the dashboard show $0?
→ The revenue pipeline didn't run successfully.
Why didn't the pipeline run successfully?
→ The Stripe extraction job failed.
Why did the Stripe extraction job fail?
→ The Stripe API returned 401 Unauthorized.
Why did the API return 401?
→ The API key had expired.
Why did the API key expire without warning?
→ We had no monitoring on credential expiration dates.
Root Cause: No credential lifecycle management. Stripe API key created 12 months ago with default expiration.
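The missing control here is small enough to sketch. The example below is a minimal illustration, assuming you keep an inventory of credentials with known expiry dates; the `Credential` class, the `expiring_soon` function, the 30-day warning window, and the inventory entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    name: str
    expires_on: date


def expiring_soon(credentials, today, warn_days=30):
    """Return (credential, days_left) pairs expiring within warn_days, soonest first.

    Already-expired credentials are included with a zero or negative days_left.
    """
    cutoff = today + timedelta(days=warn_days)
    due = [
        (cred, (cred.expires_on - today).days)
        for cred in credentials
        if cred.expires_on <= cutoff
    ]
    return sorted(due, key=lambda pair: pair[1])


# Hypothetical inventory; the dates are illustrative only.
inventory = [
    Credential("stripe_api_key", date(2025, 11, 12)),
    Credential("warehouse_service_account", date(2026, 6, 1)),
]

for cred, days_left in expiring_soon(inventory, today=date(2025, 10, 20)):
    print(f"WARNING: {cred.name} expires in {days_left} days ({cred.expires_on})")
```

Run daily from the scheduler, a check like this turns an expiry into a ticket weeks ahead instead of a 401 on the day.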
Part 4: What Went Wrong
Be honest. This isn't about blame—it's about learning.
## What Went Wrong
Detection
- [ ] No alert on pipeline failure - Job failed silently
- [ ] 8+ hour detection time - Job failed at 06:06, found by a user at 14:12, not by monitoring
- [ ] No freshness alert - No SLO defined for revenue data
Response
- [ ] On-call not paged - Slack notification only
- [ ] Runbook outdated - Last updated 8 months ago
- [ ] No escalation path - CFO found out before Engineering
Recovery
- [ ] Manual key rotation - No automated rotation in place
- [ ] Manual pipeline trigger - No "catch-up" automation
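The detection gaps above mostly reduce to one missing check: is the data older than we promised? A minimal sketch of a freshness check follows. The `freshness_breach` function, the 26-hour SLO, and the previous day's completion time are assumptions made for illustration; none of them come from the incident itself.

```python
from datetime import datetime, timedelta, timezone


def freshness_breach(last_success, now, max_age):
    """Return how far past the freshness SLO the data is, or None if within it."""
    age = now - last_success
    return age - max_age if age > max_age else None


# Assumed values: the previous day's load finished at 06:30 UTC and
# the SLO allows revenue data to be at most 26 hours old.
last_success = datetime(2025, 11, 11, 6, 30, tzinfo=timezone.utc)
slo = timedelta(hours=26)

check_time = datetime(2025, 11, 12, 9, 0, tzinfo=timezone.utc)
overage = freshness_breach(last_success, check_time, slo)
if overage is not None:
    print(f"Revenue data is {overage} past its freshness SLO: page on-call")
```

Under these assumed numbers the check would start firing shortly after 08:30 UTC, hours before anyone opened the dashboard. The key design choice is that it measures the data, not the job, so it still fires when the job fails silently or never starts.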
Part 5: What Went Right
Celebrate wins, even in failure.
## What Went Right
- [x] On-call responded in 8 minutes once alerted
- [x] Root cause found in 22 minutes
- [x] No data loss - Pipeline is idempotent
- [x] Clear communication - Stakeholders updated every 30 min
- [x] Backfill worked - Historical data was recoverable
Part 6: Action Items
This is the most important part. Specific, assigned, and deadline-driven.
## Action Items
| Priority | Action | Owner | Deadline | Status |
|---|---|---|---|---|
| P0 | Add PagerDuty alert on pipeline failure | @alice | Nov 14 | ✅ Done |
| P0 | Define freshness SLO for revenue data | @bob | Nov 15 | ✅ Done |
| P1 | Implement credential expiry monitoring | @alice | Nov 22 | 🔄 In Progress |
| P1 | Add Stripe API key rotation automation | @charlie | Nov 29 | 📅 Scheduled |
| P1 | Update on-call runbook | @bob | Nov 18 | 📅 Scheduled |
| P2 | Add "catch-up" mode for failed pipelines | @alice | Dec 6 | 📅 Scheduled |
| P2 | Quarterly credential audit process | @charlie | Dec 15 | 📅 Scheduled |
Part 7: Lessons Learned
Synthesize for future reference.
## Lessons Learned
- Credential expiration is a ticking time bomb. Every API key, service account, and token needs lifecycle management.
- Silent failures are the worst failures. If no one gets paged, the failure didn't happen—until it did.
- SLOs are detection mechanisms. A freshness SLO would have caught this in minutes, not hours.
- Runbooks rot. Quarterly review is mandatory.
- Pipeline failures should auto-escalate. 1 hour silent = Slack. 2 hours = PagerDuty. 4 hours = VP.
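The escalation ladder in the last lesson can be expressed as data plus one small function. This is a sketch only: the thresholds come from the lesson above, while the channel names and the `channels_due` function are hypothetical, and actually sending the notifications is left to whatever alerting tool you use.

```python
from datetime import timedelta

# Thresholds from the lesson above: 1 hour = Slack, 2 hours = PagerDuty, 4 hours = VP.
ESCALATION_LADDER = [
    (timedelta(hours=1), "slack"),
    (timedelta(hours=2), "pagerduty"),
    (timedelta(hours=4), "vp"),
]


def channels_due(unacknowledged_for):
    """Return every channel that should have been notified by now, in order."""
    return [
        channel
        for threshold, channel in ESCALATION_LADDER
        if unacknowledged_for >= threshold
    ]


print(channels_due(timedelta(minutes=30)))           # nothing due yet
print(channels_due(timedelta(hours=2, minutes=30)))  # Slack and PagerDuty
print(channels_due(timedelta(hours=8)))              # all three rungs
```

Keeping the ladder as a plain list makes the policy easy to review in the postmortem meeting and easy to change without touching the logic.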
Postmortem Meeting Template
Duration: 45 minutes
Attendees: Incident responders, affected stakeholders, engineering leadership
Agenda
- Summary read-through (5 min)
  - Owner presents the summary
- Timeline review (10 min)
  - Walk through the timeline
  - Correct any inaccuracies
- Root cause discussion (10 min)
  - Validate the 5 Whys
  - Identify any missing causes
- Action item review (15 min)
  - Prioritize action items
  - Assign owners and deadlines
  - Identify blockers
- Close-out (5 min)
  - Schedule follow-up if needed
  - Confirm documentation location
Common Anti-Patterns
Anti-Pattern 1: The Blame Game
❌ "This happened because John pushed bad code."
✅ "This happened because our code review process didn't catch the issue."
Anti-Pattern 2: The Quick Fix
❌ "We fixed it, let's move on."
✅ "We fixed the symptom. Let's address the root cause."
Anti-Pattern 3: The Vanishing Action Items
❌ Action items that never get done.
✅ Track in Jira/Linear. Review weekly. Report on completion.
Anti-Pattern 4: The Novel
❌ 20-page document no one reads.
✅ 2-page summary with links to details.
Automate the Prevention
Most pipeline failures have predictable causes:
| Cause | Prevention |
|---|---|
| Credential expiration | Monitor expiry dates, auto-rotate |
| Schema changes | Schema validation, change detection |
| API rate limits | Backoff strategies, caching |
| Resource exhaustion | Capacity monitoring, auto-scaling |
| Upstream delays | Dependency tracking, alerting |
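As one illustration of the "backoff strategies" row, here is a minimal retry sketch. It assumes a hypothetical `request` callable that returns an HTTP status code and a body; `call_with_backoff`, `PermanentError`, and the set of retryable status codes are illustrative choices, not any particular client library's API. The point it makes is that rate limits and server errors are worth retrying with growing delays, while a 401 will not fix itself and should fail fast so it can alert.

```python
import random
import time

# Assumed policy: rate limits and transient server errors are retryable.
RETRYABLE = {429, 500, 502, 503, 504}


class PermanentError(Exception):
    """Raised for failures that retrying cannot fix, such as 401 Unauthorized."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request() until it succeeds, backing off exponentially with jitter.

    request is a zero-argument callable returning (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = request()
        if status < 400:
            return body
        if status not in RETRYABLE:
            raise PermanentError(f"HTTP {status}: not retryable, alert instead")
        if attempt < max_attempts:
            # Full jitter: wait a random time up to the exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Demo with a fake endpoint that is rate limited twice, then succeeds.
# The no-op sleep keeps the demo instant.
fake_responses = iter([(429, None), (429, None), (200, "ok")])
print(call_with_backoff(lambda: next(fake_responses), sleep=lambda s: None))
```

In the example incident, three retries one minute apart against a 401 only delayed the failure; a split like this would have surfaced it on the first attempt.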
Pallisade auto-detects these risks and generates fixes before they become incidents.
Postmortem Template Download
Want the full template as a Markdown file?
Download Postmortem Template →
Better yet: Prevent the postmortem entirely.
Pallisade monitors credential expiration, pipeline health, and data freshness—so you find issues before your CEO does.