2026-04-19 P2 ops-practical Incident LogAI EngineeringSynapse

How One Missing git init Brought Down an AI Team for 4 Days

Background: An Automation Pipeline That “Looked Like It Was Running”

In mid-April we shipped an AI intelligence-daily pipeline: every morning at 8:00 Dubai time, a remote Agent searches the frontier landscape, generates an HTML daily report, commits it, pushes to git, and triggers a Slack notification to the president. The first two weeks went smoothly. Until one day the president asked: “Where are the daily reports from the last few days?”

I checked git log. The most recent commit was four days old. No new Slack notifications either. But the remote agent’s task logs all showed the task starting on time, search returning results, HTML generated cleanly, git push “successful.” Every layer was green.

Diagnosis: Every Layer Said “Fine”

First reaction: the Slack webhook is broken. Manual curl test — works.

Second reaction: cron config is wrong. Pulled scheduler logs — fired on time, normal duration, file paths fine.

Digging into the lower layer revealed the truth: the agent’s working directory was a freshly-created worktree, and that directory had no .git at all. git add && git commit && git push was returning “fatal: not a git repository”, exit code 128. But the outer wrapper script only checked if [ $? -eq 0 ] — 128 isn’t 0, so the local log did record one line of “failure,” but the upstream scheduler only cared whether the agent process exited normally, and a clean exit was treated as success.

A missing git init orphaned 4 days of daily reports in a local directory no one was watching.

The Cascade: Single-Point Fragility in AI Teams

Once I had the root cause, I traced the full impact across those 4 days. It was sobering.

Daily report broken → the downstream intelligence-action Agent (runs at 10:00, reads the day’s report to generate improvement suggestions) had empty input, and for 4 days output “no new intelligence.”

Empty action suggestions → the brain-trust review queue had nothing in it for 4 days, recording “no items to review” — which got written off as “things have been quiet lately.”

No Slack notifications → from the president’s perspective, the AI team appeared to have done nothing for 4 days.

In other words, a configuration miss at the infrastructure layer (no git init) propagated up the automation chain, amplifying as it went: data layer cut off → decision layer starved → notification layer silent → management layer misled. The whole AI team “looked like it was working” while it was silently down for 96 hours.

Reflection: The Special Failure Modes of AI Automation

This incident reframed something for me. The primary failure points in AI automation systems are almost never in the AI model itself. Token generation is remarkably stable. The failures are always in the glue layer: directory permissions, environment variables, git repo state, expired API credentials, changed webhook URLs, filesystem quotas.

In traditional software, these issues “fail loudly”: users report bugs immediately, monitors alarm immediately, logs go red. AI automation pipelines have a special amplifier — every node is wrapped by an Agent, and Agents naturally tend to “do their best to complete and report success.” They auto-catch exceptions, retry, degrade, even use an LLM-generated description to “explain” what they did. The result: exceptions get digested at every hop, and what upstream sees is an unbroken chain of “completed successfully.”

Three Reusable Principles

1. The glue layer needs independent heartbeat checks. Don’t trust the agent’s self-reported “success.” Add a stateless checker at the pipeline tail: read the latest commit timestamp from git log and alert if it exceeds a threshold; verify the Slack channel received the day’s notification. Cost: tens of lines of code. Lifesaving.

2. Convert silent failure into loud failure. Wrapper scripts should always explicitly check specific exit codes and stderr; unknown errors should be raised, not swallowed into a log file. In AI pipelines especially, the “catch every exception → continue executing → report completion” pattern must be banned — it’s the core source of silent failure.

3. Make pre-flight validation the first step of any pipeline. When an automation script starts, run an environment self-check first: is the directory a git repo? Does the remote branch exist? Are credentials still valid? Is the downstream webhook reachable? Ten lines of code that can save you 4 days.

AI Agents make automation feel “smart,” but the side effect of smart is masking errors that should have been exposed. Truly reliable AI engineering teams aren’t bottlenecked by model capability — they’re bottlenecked by whether you’re willing to write explicit checks for the environment layer and place sensors on silent failure. Infrastructure doesn’t get more reliable just because there’s AI running on top of it.

If you’re building an AI engineering team, take a look at our open-source Synapse framework.