Anatomy of a Pipeline Failure: A Postmortem Template

It's 3:47 PM on a Friday. The CEO messages: "Why does the revenue dashboard show $0?"

Your heart rate doubles. You check Airflow. Failed. You check Slack. 47 unread messages. You check your calendar. The board meeting is Monday.

Welcome to the pipeline failure club. Membership is mandatory. How you respond defines your team.

Why Postmortems Matter

Bad teams: "It's fixed. Let's move on."

Good teams: "It's fixed. Let's make sure it never happens again."

A proper postmortem:

- Prevents repeat incidents
- Builds institutional knowledge
- Improves on-call experience
- Demonstrates maturity to stakeholders
- Creates accountability without blame

The Postmortem Template

Part 1: Incident Summary

## Incident Summary
Title: Daily Revenue Pipeline Failure - 2025-11-12
Severity: SEV-2 (Major business impact, no data loss)
Duration: 4 hours 23 minutes (14:12 - 18:35 UTC)
Impact:
- Revenue dashboard showed $0 for 4+ hours
- Finance team delayed month-end close
- CFO escalated to Engineering leadership
- 12 Slack threads, 3 executive inquiries

Root Cause: Expired Stripe API credentials

Part 2: Timeline

Be specific. Timestamps matter.

## Timeline (All times UTC)
Time	Event
06:00	Daily pipeline scheduled start
06:01	Stripe extraction job started
06:03	Stripe API returned 401 Unauthorized
06:04	Job retried (attempt 2/3) - failed
06:05	Job retried (attempt 3/3) - failed
06:06	Job marked as failed, no alert triggered
14:12	CFO opened revenue dashboard, saw $0
14:15	#data-alerts Slack message from CFO
14:23	On-call engineer acknowledged
14:45	Root cause identified (Stripe API key expired)
15:30	New API key generated and deployed
16:00	Pipeline manually triggered
18:35	All downstream jobs completed
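
Precise timestamps let you derive the response metrics by arithmetic instead of from memory. The sketch below is one way to do that, using only the times in the timeline above; the event names (`job_failed`, `user_noticed`, and so on) are labels made up for this example.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above (all UTC, all on the incident date).
INCIDENT_DATE = "2025-11-12"
TIMELINE = {
    "job_failed": "06:06",
    "user_noticed": "14:12",
    "reported": "14:15",
    "acknowledged": "14:23",
    "root_cause_found": "14:45",
    "recovered": "18:35",
}


def at(event: str) -> datetime:
    """Parse one timeline entry into a datetime."""
    return datetime.strptime(f"{INCIDENT_DATE} {TIMELINE[event]}", "%Y-%m-%d %H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named timeline events."""
    return at(end) - at(start)


time_to_detect = elapsed("job_failed", "user_noticed")
time_to_acknowledge = elapsed("reported", "acknowledged")
time_to_diagnose = elapsed("acknowledged", "root_cause_found")
visible_outage = elapsed("user_noticed", "recovered")

for name, value in [
    ("time to detect", time_to_detect),
    ("time to acknowledge", time_to_acknowledge),
    ("time to diagnose", time_to_diagnose),
    ("visible outage", visible_outage),
]:
    print(f"{name}: {value}")
```

The visible outage matches the 4 hours 23 minutes in the summary, but the job had already been failing silently for about eight hours before anyone noticed. Reporting both numbers keeps the detection gap from hiding behind the shorter one.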

Part 3: Root Cause Analysis

Use the 5 Whys technique:

## Root Cause Analysis

Why did the dashboard show $0?
→ The revenue pipeline didn't run successfully.


Why didn't the pipeline run successfully?
→ The Stripe extraction job failed.


Why did the Stripe extraction job fail?
→ The Stripe API returned 401 Unauthorized.


Why did the API return 401?
→ The API key had expired.


Why did the API key expire without warning?
→ We had no monitoring on credential expiration dates.


Root Cause: No credential lifecycle management. The Stripe API key was created 12 months ago with the default expiration, and nothing monitored its expiry date.
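
The missing control here is small. As a minimal sketch, assuming you keep an inventory of credentials with known expiry dates, a daily check could flag anything expiring within a warning window. The `Credential` class, the inventory entries, the owners, the dates, and the 30-day window are all hypothetical; the incident record above does not state the exact expiry date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    name: str
    owner: str
    expires_on: date


def expiring_soon(credentials, today, warn_within=timedelta(days=30)):
    """Return credentials already expired or expiring within warn_within, soonest first."""
    cutoff = today + warn_within
    due = [c for c in credentials if c.expires_on <= cutoff]
    return sorted(due, key=lambda c: c.expires_on)


# Hypothetical inventory; in practice this might come from a secrets manager or a spreadsheet.
inventory = [
    Credential("stripe-api-key", "@alice", date(2025, 11, 12)),
    Credential("warehouse-service-account", "@bob", date(2026, 6, 1)),
]

today = date(2025, 10, 20)
warnings = expiring_soon(inventory, today)
for cred in warnings:
    days_left = (cred.expires_on - today).days
    print(f"{cred.name} (owner {cred.owner}) expires in {days_left} days")
```

Run on a schedule, a check like this turns an expiry into a ticket weeks ahead rather than a 401 in production.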

Part 4: What Went Wrong

Be honest. This isn't about blame—it's about learning.

## What Went Wrong
Detection

[ ] No alert on pipeline failure - Job failed silently
[ ] 8+ hour detection time - Job failed at 06:06 UTC, found by a user at 14:12 UTC, not by monitoring
[ ] No freshness alert - No SLO defined for revenue data

Response

[ ] On-call not paged - Slack notification only
[ ] Runbook outdated - Last updated 8 months ago
[ ] No escalation path - CFO found out before Engineering

Recovery

[ ] Manual key rotation - No automated rotation in place
[ ] Manual pipeline trigger - No "catch-up" automation
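
The missing freshness alert is the easiest of these gaps to close. The sketch below shows the core comparison, assuming you can read a last-successful-load timestamp for the revenue table. The 25-hour threshold, the function name, and the example timestamps are assumptions for illustration, not values from the incident.

```python
from datetime import datetime, timedelta, timezone

# Assumed SLO: revenue data is never more than 25 hours old
# (a daily 06:00 UTC run plus a grace period; choose your own threshold).
MAX_AGE = timedelta(hours=25)


def breaches_slo(last_loaded_at: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True when the newest successful load is older than the SLO allows."""
    return now - last_loaded_at > max_age


# Hypothetical: the previous day's run finished loading at 06:30 UTC.
last_good_load = datetime(2025, 11, 11, 6, 30, tzinfo=timezone.utc)

for check_time in (
    datetime(2025, 11, 12, 7, 0, tzinfo=timezone.utc),
    datetime(2025, 11, 12, 8, 0, tzinfo=timezone.utc),
):
    status = "STALE" if breaches_slo(last_good_load, check_time) else "fresh"
    print(f"{check_time:%H:%M} UTC -> {status}")
```

With these assumed numbers the check turns stale during the morning of the incident, hours before the 14:12 discovery. It also fires whether the job failed, hung, or never started, so it does not depend on the scheduler reporting its own failure.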

Part 5: What Went Right

Celebrate wins, even in failure.

## What Went Right

[x] On-call responded in 8 minutes once alerted
[x] Root cause found in 22 minutes
[x] No data loss - Pipeline is idempotent
[x] Clear communication - Stakeholders updated every 30 min
[x] Backfill worked - Historical data was recoverable

Part 6: Action Items

This is the most important part. Specific, assigned, and deadline-driven.

## Action Items
Priority	Action	Owner	Deadline	Status
P0	Add PagerDuty alert on pipeline failure	@alice	Nov 14	✅ Done
P0	Define freshness SLO for revenue data	@bob	Nov 15	✅ Done
P1	Implement credential expiry monitoring	@alice	Nov 22	🔄 In Progress
P1	Add Stripe API key rotation automation	@charlie	Nov 29	📅 Scheduled
P1	Update on-call runbook	@bob	Nov 18	📅 Scheduled
P2	Add "catch-up" mode for failed pipelines	@alice	Dec 6	📅 Scheduled
P2	Quarterly credential audit process	@charlie	Dec 15	📅 Scheduled
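
Owners and deadlines only help if someone checks them. As a rough sketch, the table above can be held as data and filtered for overdue items. The `ActionItem` class is hypothetical, the year 2025 is assumed from the incident date, and in practice this data would live in your issue tracker, not in a script.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: str
    deadline: date
    done: bool = False


# Rows from the table above; the year is an assumption based on the incident date.
ITEMS = [
    ActionItem("P0", "Add PagerDuty alert on pipeline failure", "@alice", date(2025, 11, 14), done=True),
    ActionItem("P0", "Define freshness SLO for revenue data", "@bob", date(2025, 11, 15), done=True),
    ActionItem("P1", "Implement credential expiry monitoring", "@alice", date(2025, 11, 22)),
    ActionItem("P1", "Add Stripe API key rotation automation", "@charlie", date(2025, 11, 29)),
    ActionItem("P1", "Update on-call runbook", "@bob", date(2025, 11, 18)),
    ActionItem("P2", 'Add "catch-up" mode for failed pipelines', "@alice", date(2025, 12, 6)),
    ActionItem("P2", "Quarterly credential audit process", "@charlie", date(2025, 12, 15)),
]


def overdue(items, today):
    """Open items whose deadline has passed, highest priority first."""
    late = [i for i in items if not i.done and i.deadline < today]
    return sorted(late, key=lambda i: (i.priority, i.deadline))


# Example weekly review on a hypothetical date.
for item in overdue(ITEMS, today=date(2025, 11, 20)):
    print(f"{item.priority} {item.action} ({item.owner}, due {item.deadline:%b %d})")
```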

Part 7: Lessons Learned

Synthesize for future reference.

## Lessons Learned

Credential expiration is a ticking time bomb. Every API key, service account, and token needs lifecycle management.


Silent failures are the worst failures. If no one gets paged, the failure didn't happen—until it did.


SLOs are detection mechanisms. A freshness SLO would have caught this in minutes, not hours.


Runbooks rot. Quarterly review is mandatory.


Pipeline failures should auto-escalate. 1 hour silent = Slack. 2 hours = PagerDuty. 4 hours = VP.
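
The escalation ladder in the last lesson can be expressed as a small lookup. This is a sketch only: the thresholds come from the lesson above, while the target names and the function are placeholders for whatever notification integrations you use.

```python
from datetime import timedelta

# Thresholds from the lesson above; target names are placeholders.
ESCALATION_STEPS = [
    (timedelta(hours=1), "slack"),
    (timedelta(hours=2), "pagerduty"),
    (timedelta(hours=4), "vp"),
]


def escalation_targets(unacknowledged_for: timedelta) -> list:
    """Every channel that should have been notified after this much silence."""
    return [target for threshold, target in ESCALATION_STEPS if unacknowledged_for >= threshold]


print(escalation_targets(timedelta(minutes=30)))
print(escalation_targets(timedelta(hours=2, minutes=15)))
print(escalation_targets(timedelta(hours=8, minutes=6)))
```

A scheduler would call this on every open failure and notify any target that has not been contacted yet. In the incident above, the failure sat unacknowledged long enough to pass all three thresholds.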

Postmortem Meeting Template

Duration: 45 minutes

Attendees: Incident responders, affected stakeholders, engineering leadership

Agenda

Summary read-through (5 min)

- Owner presents the summary

Timeline review (10 min)

- Walk through the timeline
- Correct any inaccuracies

Root cause discussion (10 min)

- Validate the 5 Whys
- Identify any missing causes

Action item review (15 min)

- Prioritize action items
- Assign owners and deadlines
- Identify blockers

Close-out (5 min)

- Schedule follow-up if needed
- Confirm documentation location

Common Anti-Patterns

Anti-Pattern 1: The Blame Game

❌ "This happened because John pushed bad code."

✅ "This happened because our code review process didn't catch the issue."

Anti-Pattern 2: The Quick Fix

❌ "We fixed it, let's move on."

✅ "We fixed the symptom. Let's address the root cause."

Anti-Pattern 3: The Vanishing Action Items

❌ Action items that never get done.

✅ Track in Jira/Linear. Review weekly. Report on completion.

Anti-Pattern 4: The Novel

❌ 20-page document no one reads.

✅ 2-page summary with links to details.

Automate the Prevention

Most pipeline failures have predictable causes:

Cause	Prevention
Credential expiration	Monitor expiry dates, auto-rotate
Schema changes	Schema validation, change detection
API rate limits	Backoff strategies, caching
Resource exhaustion	Capacity monitoring, auto-scaling
Upstream delays	Dependency tracking, alerting
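
Backoff deserves a caveat: it helps with rate limits and transient errors, but retrying an authentication failure only delays the alert. The sketch below separates the two cases. The exception classes and `call_with_backoff` are hypothetical names for this example; a real extractor would map HTTP status codes (429 and 5xx versus 401 and 403) onto them.

```python
import time


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits (HTTP 429), 5xx responses."""


class AuthError(Exception):
    """Not worth retrying: a 401 or 403 will not fix itself."""


def call_with_backoff(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponential backoff.

    Any other exception, including AuthError, propagates on the first attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, then 2s, then 4s, ...


# Usage with a fake extractor that is rate limited twice, then succeeds.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("429 Too Many Requests")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_extract, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test. In production you would leave the default and usually add jitter to the delay.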

Pallisade auto-detects these risks and generates fixes before they become incidents.

Postmortem Template Download

Want the full template as a Markdown file?

Download Postmortem Template →

Better yet: Prevent the postmortem entirely.

Pallisade monitors credential expiration, pipeline health, and data freshness—so you find issues before your CEO does.

Get Your Free DRR Score →