Build in Public · Operations Story
When "success" is a lie
In early May 2026 we discovered something unsettling: Synapse's Slack notifications had been silently failing for over 11 days while the system reported "success" the entire time.
The root cause was a classic trap: the caller saw HTTP 200 and assumed success, while Slack's real verdict — {"ok":false,"error":"not_authed"} — sat inside the response body that nobody was reading. Worse still, the health heartbeat ran through the very same path, so the "monitor of the monitor" failed along with it. That is the monitoring paradox: when the sentry and the gate share one lock, a broken lock silences the sentry too.
Why did 11 days pass unnoticed? Because every link in the pipeline glowed green. GHA run status: green. HTTP status code: 200. At the technical layer "everything was fine" — at the business layer, nothing was being delivered. Technical success had masked business failure.
We did not treat it as a one-off bug to patch and forget. We turned the incident into institutional rules:
- A P0 governance principle was written into the system's constitution, the Silent Fail Defense Principle: "HTTP 2xx / status=success ≠ confirmed business effect; every external service call must assert on response-body business fields (e.g. ok=true); assertion failure must throw and mark the upstream job failed; silent stdout warnings are prohibited; meta-monitoring must be independent of the monitored pipeline." (Source: CLAUDE.md, tagged [ADDED: 2026-05-05])
- Two weeks later we tightened it with a companion rule, the WF-09 Exclusive Notification Path: all Slack notifications must route through a single governed workflow; any bypass is a violation. (Source: CLAUDE.md [ADDED: 2026-05-19]; decision D-2026-05-19-002)
- The single exception is a meta-monitoring channel independent of the main pipeline, built precisely to carry an alert out by another road when the main path fails as a whole. (Source: an independent meta-monitoring script, authorized exemption)
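To make the first rule concrete, here is a minimal sketch of a response-body assertion in Python. It is an illustration, not Synapse's actual code: the names `SilentFailError` and `assert_slack_delivered` are hypothetical, and the only thing assumed about Slack is what the incident itself showed, namely that the JSON body carries an `ok` field and, on failure, an `error` field.

```python
import json


class SilentFailError(RuntimeError):
    """Hypothetical error type: the call did not confirm its business effect."""


def assert_slack_delivered(status_code: int, body_text: str) -> dict:
    """Return the parsed body only if Slack confirmed delivery; raise otherwise."""
    # A non-2xx status is a failure, but a 2xx status alone proves nothing.
    if not 200 <= status_code < 300:
        raise SilentFailError(f"HTTP {status_code}")

    try:
        body = json.loads(body_text)
    except json.JSONDecodeError as exc:
        raise SilentFailError("response body is not valid JSON") from exc

    # The business verdict lives in the body: require ok to be exactly true.
    if not isinstance(body, dict) or body.get("ok") is not True:
        error = body.get("error", "unknown") if isinstance(body, dict) else "unknown"
        raise SilentFailError(f"Slack rejected the message: {error}")

    return body
```

Called with the response from the incident, `assert_slack_delivered(200, '{"ok":false,"error":"not_authed"}')` raises instead of returning. If the calling script lets that exception propagate, the Python process exits non-zero, which is what turns the upstream job red rather than leaving a warning in stdout.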
This is what we mean by "the Harness is not a claim, it is a trace": what you see is not a slogan but a dated rule, a numbered decision, and an auditable log line. The incident was not buried — it was recorded, institutionalized, and it permanently changed how the system behaves.