JSON webhook debugging: an SRE-style runbook for on-call engineers
Structured triage for HTTP 4xx, signature failures, and poisoned payloads, without leaking secrets into random pastebins.
Start from receipts, not guesses
Capture raw headers and body checksums before you pretty-print anything: timestamps drift, signatures are computed over the exact bytes (newlines included), and gzip layers confuse juniors. Store redacted copies in the ticket, not at public URLs.
Confirm whether your consumer validates the HMAC against the raw body or against a parsed object; re-serializing reorders keys and breaks naive clients.
If retries storm, circuit-break upstream or raise pager thresholds temporarily while you fix the root cause, not the symptoms.
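The raw-body versus parsed-body point above, shown as an added example (not code from the original runbook). The body, secret and header names are constructed placeholders; the MAC scheme (HMAC-SHA256 over the exact bytes, hex-encoded) is an assumption that matches common providers but not all of them. The receipt() helper is what you attach to the ticket instead of the payload itself.

```python
import hashlib
import hmac
import json

# Constructed sample: the body exactly as it arrived on the wire (note the
# trailing newline) and a placeholder partner secret. Neither is real.
RAW_BODY = b'{"id": "evt_123", "amount": 1999, "currency": "usd"}\n'
SECRET = b"whsec_PLACEHOLDER_not_a_real_secret"
SENSITIVE_HEADERS = {"authorization", "x-signature", "x-webhook-secret"}


def sign(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the exact bytes, hex-encoded (assumed provider scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_raw(secret: bytes, body: bytes, presented: str) -> bool:
    """Correct: verify over the raw bytes with a constant-time comparison."""
    return hmac.compare_digest(sign(secret, body), presented)


def verify_parsed(secret: bytes, body: bytes, presented: str) -> bool:
    """Naive: parse and re-serialize first. Whitespace, the trailing newline and
    key order all change, so the MAC no longer matches even for a genuine event."""
    reserialized = json.dumps(json.loads(body), sort_keys=True).encode()
    return hmac.compare_digest(sign(secret, reserialized), presented)


def receipt(headers: dict, body: bytes) -> dict:
    """Ticket-safe record: body checksum and length, headers with secrets redacted.
    Attach this to the incident instead of the raw payload or a pastebin link."""
    redacted = {
        k: "<redacted>" if k.lower() in SENSITIVE_HEADERS else v
        for k, v in headers.items()
    }
    return {
        "body_sha256": hashlib.sha256(body).hexdigest(),
        "body_len": len(body),
        "headers": redacted,
    }


if __name__ == "__main__":
    presented = sign(SECRET, RAW_BODY)  # what the partner would have sent
    print("raw-body verify:   ", verify_raw(SECRET, RAW_BODY, presented))     # True
    print("parsed-body verify:", verify_parsed(SECRET, RAW_BODY, presented))  # False
    print(receipt({"X-Signature": presented, "Content-Type": "application/json"}, RAW_BODY))
```

The parsed-body check fails on a perfectly genuine event, which is exactly the class of "signature failure" that looks like a partner problem but is a consumer bug. Keep the raw bytes around until verification has run, then format.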
Formatter-first culture
Pretty JSON exposes missing commas and double-encoded strings faster than squinting at minified blobs. Pair formatters with JSON Schema validation in staging to catch drift before prod.
Teach support to reproduce with curl snippets referencing staging secrets, never production.
When vendors ship XML masquerading as JSON, fail loudly and file breakage tickets early.
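A triage routine for the failure modes this section lists (missing commas, double-encoded strings, schema drift, XML arriving as "JSON"), added here as an example rather than taken from the original runbook. The sample bodies are constructed, and the required-shape tables stand in for a real JSON Schema document; the point is the classification, which maps directly onto the error taxonomy used later for dashboards.

```python
import json

# Constructed sample bodies (not real partner traffic) covering the failure modes above.
SAMPLES = {
    "good": b'{"id":"evt_1","type":"invoice.paid","data":{"amount":1200}}',
    # The whole object serialized as a JSON string: the classic double-encoding bug.
    "double_encoded": b'"{\\"id\\":\\"evt_2\\",\\"type\\":\\"invoice.paid\\",\\"data\\":{\\"amount\\":5}}"',
    "missing_comma": b'{"id":"evt_3" "type":"invoice.paid","data":{}}',
    "xml": b'<?xml version="1.0"?><event><id>evt_4</id></event>',
    # Same shape, but a field changed type: the drift a schema check should catch.
    "drift": b'{"id":"evt_5","type":"invoice.paid","data":{"amount":"12.00"}}',
}

# Minimal required-shape description; stands in for a full JSON Schema in staging.
TOP_SCHEMA = {"id": str, "type": str, "data": dict}
DATA_SCHEMA = {"amount": int}


def _shape_errors(obj, schema, prefix=""):
    """Return human-readable type mismatches for the keys in `schema`."""
    if not isinstance(obj, dict):
        return [f"{prefix or '$'}: expected object"]
    return [
        f"{prefix}{k}: expected {t.__name__}, got {type(obj.get(k)).__name__}"
        for k, t in schema.items()
        if not isinstance(obj.get(k), t)
    ]


def triage(raw: bytes) -> dict:
    """Classify a webhook body as not_json / invalid_json / schema_drift / ok and
    return the pretty-printed form so the ticket shows structure, not a minified blob."""
    text = raw.decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        return {"status": "not_json", "reason": "body starts with '<'; likely XML. Fail loudly and file a ticket."}
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as e:
        return {"status": "invalid_json", "reason": f"{e.msg} at line {e.lineno} col {e.colno}"}

    double_encoded = False
    if isinstance(obj, str):  # a top-level string is almost always a double-encoded object
        try:
            obj = json.loads(obj)
            double_encoded = True
        except json.JSONDecodeError:
            return {"status": "invalid_json", "reason": "top-level JSON string that is not itself JSON"}

    errors = _shape_errors(obj, TOP_SCHEMA)
    if not errors:
        errors = _shape_errors(obj["data"], DATA_SCHEMA, prefix="data.")
    return {
        "status": "schema_drift" if errors else "ok",
        "double_encoded": double_encoded,
        "errors": errors,
        "pretty": json.dumps(obj, indent=2, sort_keys=True),
    }


if __name__ == "__main__":
    for name, raw in SAMPLES.items():
        result = triage(raw)
        print(f"{name:15} {result['status']:13} {result.get('reason') or result.get('errors') or ''}")
```

Run it against the exemplar payloads you save per partner: a "double_encoded" flag that flips on after a partner deploy is a clean signal to send them, and "not_json" is the fail-loudly case for vendors shipping XML.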
Idempotency and replay hygiene
Webhook handlers must tolerate duplicates—use idempotency keys and transactional outboxes. Log replay attempts with trace IDs shared across services.
Document replay windows: some providers allow seven-day lookbacks; others expire after an hour.
Never “fix” production by manually re-posting without understanding side effects on downstream accounting.
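Idempotency keys and replay windows from the two paragraphs above, as an added example (not the runbook's own code). The store is an in-memory stand-in for the table a transactional outbox would write to, the X-Event-Id header name is an assumption (providers differ), and the window values are the ones the text mentions; every redelivery is logged with the caller's trace ID so cross-service replay investigation has something to join on.

```python
import hashlib
from dataclasses import dataclass, field

HOUR = 3600.0
DAY = 24 * HOUR


def event_key(headers: dict, body: bytes) -> str:
    """Prefer the provider's event id; fall back to a body hash so duplicates that
    carry no id still dedupe. The header name is an assumption."""
    event_id = headers.get("X-Event-Id")
    return f"id:{event_id}" if event_id else "sha256:" + hashlib.sha256(body).hexdigest()


@dataclass
class IdempotencyStore:
    """In-memory stand-in for the idempotency table of a transactional outbox."""
    replay_window_s: float  # provider-documented lookback, e.g. 7 days or 1 hour
    _first_seen: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # replay attempts, each tagged with a trace id

    def decide(self, key: str, now: float, trace_id: str) -> str:
        """process: first sighting. duplicate: redelivery inside the replay window
        (ack it, skip side effects). stale_replay: outside the window, so something
        the provider should not be sending; log it instead of silently processing."""
        first = self._first_seen.get(key)
        if first is None:
            self._first_seen[key] = now
            return "process"
        decision = "duplicate" if now - first <= self.replay_window_s else "stale_replay"
        self.log.append({"trace_id": trace_id, "key": key, "decision": decision, "age_s": now - first})
        return decision

    def prune(self, now: float, retention_s: float) -> int:
        """Forget keys older than retention_s; returns how many were dropped.
        Keep retention longer than the replay window, otherwise a late replay is
        indistinguishable from a first delivery."""
        expired = [k for k, t in self._first_seen.items() if now - t > retention_s]
        for k in expired:
            del self._first_seen[k]
        return len(expired)


if __name__ == "__main__":
    store = IdempotencyStore(replay_window_s=HOUR)
    headers, body = {"X-Event-Id": "evt_9"}, b'{"id":"evt_9"}'
    key = event_key(headers, body)
    for t, trace in ((1000.0, "trace-a"), (1500.0, "trace-b"), (1000.0 + HOUR + 1, "trace-c")):
        print(trace, store.decide(key, t, trace))
    print(store.log)
```

Manual re-posting from a ticket goes through the same decide() path as partner traffic; that is what makes it safe to retry without double-booking downstream accounting.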
Observability tables that matter
Dashboard p95 processing time per partner, error taxonomy (schema vs signature vs timeout), and payload size histograms. Sudden size spikes often mean base64 blobs bloating events.
Alert on error budget burn, not every 400—noise kills pager trust.
Save exemplar failing payloads categorized by partner version strings.
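The per-partner summary this section describes (p95 processing time, error taxonomy, payload size histogram, exemplar failing lines keyed by partner version string), computed over constructed key=value log lines. This is an added example, not the runbook's own tooling; the log format and the size buckets are assumptions, and the percentile is nearest-rank, so it reports an observed value rather than an interpolated one.

```python
import math
import re
from collections import Counter, defaultdict

# Constructed sample log lines in key=value form (not real traffic).
LOG_LINES = [
    "ts=1 partner=acme ver=2024-01 status=200 err=- bytes=812 ms=100",
    "ts=2 partner=acme ver=2024-01 status=200 err=- bytes=901 ms=120",
    "ts=3 partner=acme ver=2024-01 status=400 err=schema bytes=950 ms=130",
    "ts=4 partner=acme ver=2024-01 status=401 err=signature bytes=880 ms=140",
    "ts=5 partner=acme ver=2024-01 status=200 err=- bytes=790 ms=150",
    "ts=6 partner=acme ver=2024-01 status=200 err=- bytes=830 ms=160",
    "ts=7 partner=acme ver=2024-01 status=504 err=timeout bytes=840 ms=170",
    "ts=8 partner=acme ver=2024-01 status=200 err=- bytes=860 ms=180",
    "ts=9 partner=acme ver=2024-01 status=200 err=- bytes=870 ms=190",
    "ts=10 partner=acme ver=2024-02 status=200 err=- bytes=262144 ms=2000",  # 256 KiB: base64 blob?
    "ts=11 partner=globex ver=1.3 status=200 err=- bytes=500 ms=50",
    "ts=12 partner=globex ver=1.3 status=400 err=schema bytes=510 ms=60",
    "ts=13 partner=globex ver=1.3 status=200 err=- bytes=520 ms=70",
    "ts=14 partner=globex ver=1.3 status=200 err=- bytes=530 ms=80",
]

KV = re.compile(r"(\w+)=(\S+)")
SIZE_BUCKETS = [(1024, "<1KiB"), (65536, "1-64KiB"), (math.inf, ">64KiB")]  # assumed histogram edges


def parse(line: str) -> dict:
    return dict(KV.findall(line))


def p95(values):
    """Nearest-rank 95th percentile: the value at position ceil(0.95 * n)."""
    s = sorted(values)
    if not s:
        return None
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]


def size_bucket(n: int) -> str:
    for limit, label in SIZE_BUCKETS:
        if n < limit:
            return label
    return SIZE_BUCKETS[-1][1]


def summarize(lines):
    """Per partner: p95 latency, error taxonomy, size histogram, and the first failing
    line for each (partner version, error class) pair as an exemplar."""
    per_partner = defaultdict(lambda: {"ms": [], "errors": Counter(), "sizes": Counter(), "exemplars": {}})
    for line in lines:
        r = parse(line)
        p = per_partner[r["partner"]]
        p["ms"].append(int(r["ms"]))
        p["sizes"][size_bucket(int(r["bytes"]))] += 1
        if r["err"] != "-":
            p["errors"][r["err"]] += 1
            p["exemplars"].setdefault((r["ver"], r["err"]), line)
    return {
        partner: {
            "p95_ms": p95(p["ms"]),
            "errors": dict(p["errors"]),
            "sizes": dict(p["sizes"]),
            "exemplars": p["exemplars"],
        }
        for partner, p in per_partner.items()
    }


if __name__ == "__main__":
    for partner, stats in summarize(LOG_LINES).items():
        print(partner, stats["p95_ms"], stats["errors"], stats["sizes"])
```

The single ">64KiB" event is what drags acme's p95 to 2000 ms; the histogram makes that visible where an average would hide it, and the exemplar keyed by ver=2024-02 tells you which partner release to ask about.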
Security without theatricality
Rotate signing secrets on schedule; dual-sign during rotation windows when possible. Ban personal pastebins—use corporate snippets with TTL.
Assume SSRF risks if webhooks fetch URLs provided in payloads—sandbox egress.
Run tabletop exercises where someone pastes a prod key into chat—practice revocation under stress.
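Two of the controls above in code: dual-signature acceptance during a rotation window, and a static egress guard for URLs that arrive inside payloads. This is an added example, not the runbook's own code; the secrets are placeholders, the allowlisted host is assumed per-partner configuration, and the HMAC scheme is the same assumed HMAC-SHA256-over-raw-bytes convention. The guard deliberately does no DNS resolution so it runs offline; the docstring says what a production version must add.

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import urlsplit

# Placeholder secrets for a rotation window (constructed, not real).
PREVIOUS_SECRET = b"whsec_old_PLACEHOLDER"
CURRENT_SECRET = b"whsec_new_PLACEHOLDER"

# Hosts a handler is allowed to fetch from; assumed to be configured per partner.
ALLOWED_HOSTS = {"cdn.partner.example"}


def _mac(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_during_rotation(body: bytes, presented: str, secrets) -> bool:
    """Accept a MAC under any secret still in the window (current listed first).
    Once the partner confirms cut-over, shrink the list back to [CURRENT_SECRET].
    Each candidate is checked with a constant-time comparison."""
    return any(hmac.compare_digest(_mac(s, body), presented) for s in secrets)


def egress_check(url: str) -> tuple:
    """Static checks before a handler fetches a payload-supplied URL. Returns
    (allowed, reason). No DNS here: in production, resolve the host, apply the
    same address test to every resolved IP, and connect to that pinned IP so a
    later re-resolution (DNS rebinding) cannot redirect the fetch."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, "scheme must be https"
    if parts.username or parts.password:
        return False, "credentials embedded in URL"
    host = (parts.hostname or "").lower()
    try:
        ip = ipaddress.ip_address(host)  # literal address? (IPv6 brackets already stripped)
    except ValueError:
        ip = None
    if ip is not None and not ip.is_global:
        return False, "literal address is not globally routable"
    if host not in ALLOWED_HOSTS:
        return False, "host not on partner allowlist"
    return True, "ok"


if __name__ == "__main__":
    body = b'{"id":"evt_1","asset_url":"https://cdn.partner.example/a.json"}'
    presented = _mac(PREVIOUS_SECRET, body)  # partner still signing with the old secret
    print(verify_during_rotation(body, presented, [CURRENT_SECRET, PREVIOUS_SECRET]))  # True
    print(verify_during_rotation(body, presented, [CURRENT_SECRET]))                   # False
    for candidate in ("https://cdn.partner.example/a.json", "http://cdn.partner.example/a.json",
                      "https://10.0.0.5/a.json", "https://[::1]/a.json", "https://other.example/a.json"):
        print(candidate, egress_check(candidate))
```

Log the reason string from egress_check with the event's trace ID; a burst of "not globally routable" rejections from one partner is worth a ticket even though nothing was fetched.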
Post-incident storytelling
Blameless summaries should include timeline graphs, missing monitors added, and training updates. Link runbooks that changed.
Reward teams that document weird edge cases—future you inherits the gift.
If customers saw impact, coordinate comms before closing tickets.
Developer tooling alignment
Merge AI JSON utilities support quick validation during incident bridges or sales-engineer demos; keep bookmarks to them in runbooks beside jq recipes.
Encourage pairing browser tools with local linters for defense in depth.
When rate limits bite, batch transforms offline rather than hammering shared formatters.