In March 2026, a set of autonomous LLM workflows consumed roughly $2.2k of prepaid API credits in a single overnight window and depleted credits that were also supporting customer-facing product work. The incident was self-inflicted, but it was not random. It followed a specific architecture pattern that looked productive in the short term and became unsafe at scale.
The system combined:
- continuous multi-agent loops
- stateful conversation chaining
- tool access
- Slack visibility
- weak stop conditions
The result was a familiar sequence:
- the system kept producing output after it had stopped making useful progress
- one workflow fell into repetitive but syntactically normal status updates
- another workflow degraded further and began leaking tool-like wrapper text into a public Slack thread
This case study argues that the incident is best understood as a systems-design failure that exposed known model failure modes rather than as an isolated “bad model” event.
Keywords: autonomous agents; long-context degradation; tool leakage; agent safety; multi-agent systems; prompt injection; LLM operations
Autonomous LLM systems often look strongest in their earliest stages: they can speak continuously, coordinate between roles, call tools, and produce visible artifacts in public channels. Those same characteristics can conceal structural fragility. This case study documents one such incident and focuses on how routine implementation decisions created a high-risk runtime pattern.
The purpose is not to generalize from one anecdote into a universal law. It is to provide a concrete, technically legible failure case for other teams building similar systems under startup constraints.
This case study is organized around four questions:
- Which design decisions materially increased the likelihood of runaway behavior?
- Which failure modes were visible in the resulting public artifacts?
- Is the incident better explained as a model problem or a code-and-architecture problem?
- What safeguards should have been present before these workflows were allowed to run continuously?
The analysis in this repository uses sanitized summaries derived from:
- three exported Slack thread artifacts
- local and remote implementation summaries
- public OpenAI documentation
- two research papers relevant to long-context degradation and degenerate generation
Raw private logs, raw shell histories, billing exports, identifiers, and secret-bearing artifacts are excluded from this repository. Because the underlying materials were redacted before publication, this is a qualitative case study rather than a reproducible benchmark.
Three concurrent activities mattered most.
First, a local two-agent workflow was explicitly instructed to work continuously, keep the conversation moving, and not end. It used the Responses API with stored conversation state and chained turn history through `previous_response_id`.
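The risky pattern in that first workflow can be sketched as follows. This is a reconstruction, not the original code: `FakeClient` stands in for the real OpenAI client so the sketch runs anywhere, and the call shape is simplified from `client.responses.create(...)`.

```python
# Reconstruction of the incident pattern: an unbounded loop that chains every
# turn through previous_response_id, so the stored conversation keeps growing.
# FakeClient is a stand-in so the sketch runs without the real OpenAI SDK.

class FakeResponse:
    def __init__(self, rid, text):
        self.id = rid
        self.output_text = text

class FakeClient:
    def __init__(self):
        self._n = 0

    def create_response(self, model, input, previous_response_id=None):
        # The real client call is client.responses.create(...); with stored
        # state, each response is linked server-side to the previous one.
        self._n += 1
        return FakeResponse(f"resp_{self._n}", f"status update {self._n}")

def run_continuous_loop(client, max_turns=None):
    """The original loop had no turn cap at all; max_turns=None reproduces that."""
    prev_id = None
    turn = 0
    outputs = []
    while max_turns is None or turn < max_turns:
        resp = client.create_response(
            model="gpt-model",            # placeholder model name
            input="continue the work",    # the "keep going, do not end" instruction
            previous_response_id=prev_id, # chains the full prior context forward
        )
        outputs.append(resp.output_text)
        prev_id = resp.id  # every turn extends the stored conversation
        turn += 1
    return outputs

# With max_turns=None this loop never exits on its own.
outputs = run_continuous_loop(FakeClient(), max_turns=5)
```

Note that the exit condition lives entirely outside the model: nothing in the loop asks whether the latest output added value.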
Second, a cloud-hosted workflow was initially configured for a fixed turn budget and was later extended from 96 turns to 960 turns while live.
Third, a separate resumed coding session spawned multiple worker agents in parallel, increasing the total number of active autonomous processes during the same overnight window.
None of these design choices is automatically fatal on its own. The problem was their combination.
One thread stayed coherent but unproductive. It repeatedly announced that the queue was blocked, that no trusted delta had appeared, and that no further action would be taken. This is the less dramatic failure mode, but still expensive. The model remained syntactically fine while producing almost no incremental value.
A second thread degraded further. Instead of only repeating status messages, the assistant began emitting fragments that looked like tool-call wrappers, malformed JSON, and mixed-language control text. That pattern strongly suggests an agent runtime allowing internal or tool-adjacent text to leak into the user-facing channel.
The most expensive part of the incident was not the visible Slack output by itself. The larger cost driver was repeated stateful API usage across multiple long-running loops. OpenAI’s conversation-state guidance explicitly notes that chained responses keep prior context in play for billing and runtime behavior.[^1]
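To see why stateful chaining dominates cost, assume (hypothetically) that each turn adds a fixed number of new tokens and that the full history is re-processed as input on every chained turn. Cumulative input tokens then grow quadratically with turn count, so the 96-to-960 turn extension described above implies roughly a 99x increase in billed input tokens under this model, not a 10x one:

```python
def cumulative_input_tokens(turns, tokens_per_turn):
    """If turn t carries the whole history (t * tokens_per_turn input tokens),
    total billed input over n turns is tokens_per_turn * n * (n + 1) / 2."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

# Illustrative numbers only: 500 new tokens per turn.
short = cumulative_input_tokens(96, 500)   # original bounded budget
long = cumulative_input_tokens(960, 500)   # after the live 10x extension

print(short, long, long / short)  # the ratio is ~99x, not 10x
```

The exact per-turn token count does not matter for the shape of the curve; any chained loop without compaction inherits the quadratic term.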
The incident was primarily a systems-design failure, not a pure model failure.
The model did show degraded behavior, but the runtime gave it too many chances to fail:
- the conversations were long-lived and stateful rather than regularly compacted or reset[^1]
- the agents processed arbitrary text that could influence later tool behavior[^2]
- the system relied on freeform assistant text instead of structured outputs between steps[^3]
- raw assistant text was forwarded into Slack, so malformed output became user-visible
- the loops were allowed to continue after they had clearly stopped producing new value
Research literature points in the same direction. Long-context setups degrade model performance in predictable ways, and text generation systems can fall into repetitive or degenerate modes under the wrong conditions.[^4][^5]
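One concrete form of "structured outputs between steps" is to validate each agent step against a schema before the next step, or any publishing path, consumes it. A minimal sketch, assuming a simple status schema (the field names here are illustrative, not from the incident system):

```python
import json

# Illustrative inter-step schema: every agent turn must report an action,
# whether it made progress, and a short summary.
REQUIRED_FIELDS = {"action": str, "made_progress": bool, "summary": str}

def parse_step_output(raw_text):
    """Reject freeform or malformed assistant text between agent steps.

    Returns a validated dict on success, raises ValueError otherwise, so a
    degenerate turn fails loudly instead of flowing downstream."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"step output is not JSON: {exc}") from exc
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data or not isinstance(data[field], ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

ok = parse_step_output(
    '{"action": "wait", "made_progress": false, "summary": "queue blocked"}'
)

try:
    parse_step_output("I will now check the queue again...")  # freeform text is rejected
except ValueError:
    pass
```

A `made_progress: false` field also gives the orchestrator a machine-readable signal for the no-op shutdown rule discussed below, instead of forcing it to infer progress from prose.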
This repository is intentionally sanitized, which limits exact reproducibility. The analysis therefore has at least four limitations:
- It does not expose raw request-level traces.
- It relies on qualitative interpretation of exported visible artifacts.
- It describes one incident rather than a population of incidents.
- It cannot fully separate model behavior from orchestration behavior at every individual turn.
These limits matter, but they do not undermine the central architectural conclusion.
The core fragilities were:
- continuous mode as a default
- no hard stop after repeated no-op turns
- stateful chaining without an explicit compaction or reset policy
- no budget kill switch at the workflow level
- raw assistant text treated as publishable output
- insufficient separation between tool execution traces and public-facing text
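The last two fragilities suggest an explicit sanitization boundary in front of any publishing path. A minimal sketch, where the leak patterns are guesses at what tool-call leakage tends to look like, not the exact strings from the incident:

```python
import re

# Heuristic patterns for text that looks like internal or tool-adjacent
# traces rather than a human-facing message. Illustrative, not exhaustive.
LEAK_PATTERNS = [
    re.compile(r"<\s*tool[_-]?call", re.IGNORECASE),  # tool-call wrapper tags
    re.compile(r'"function"\s*:\s*"'),                # raw function-call JSON
    re.compile(r"^\s*[{\[].*[}\]]\s*$", re.DOTALL),   # whole-message JSON blobs
]

def safe_to_publish(text):
    """Gate raw assistant text before it reaches a public channel."""
    return not any(p.search(text) for p in LEAK_PATTERNS)

def publish(text, post_fn):
    """post_fn is the real channel writer (e.g. a Slack webhook call)."""
    if safe_to_publish(text):
        post_fn(text)
    else:
        post_fn("[suppressed: message failed output sanitization]")
```

The important design point is the direction of the default: anything that trips a pattern is suppressed, rather than anything that trips a pattern being flagged after it is already public.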
A useful way to frame it is:
- the model supplied the failure expression
- the code supplied the failure opportunity
Minimum safeguards for this class of system:
- Hard turn caps.
- Hard budget caps per workflow and per key.
- Automatic shutdown after repeated no-op or near-duplicate turns.
- Structured outputs between agent steps.
- Separate internal traces from user-visible messages.
- Context compaction or reset rules for long-running sessions.
- Human approval for writes and for resuming paused long-lived sessions.
- Distinct API keys and projects per workflow for attribution and containment.
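The first three safeguards compose naturally into a single runtime check evaluated before every turn. A minimal sketch, with illustrative thresholds (not the values from the incident) and a deliberately crude near-duplicate signal:

```python
from collections import deque

class RunGuard:
    """Stops a workflow on any of: turn cap, spend cap, or repeated
    near-duplicate output (a proxy for no-op turns)."""

    def __init__(self, max_turns=50, max_spend_usd=5.0, max_repeats=3):
        self.max_turns = max_turns
        self.max_spend_usd = max_spend_usd
        self.max_repeats = max_repeats
        self.turns = 0
        self.spend_usd = 0.0
        self.recent = deque(maxlen=max_repeats)  # last N normalized outputs

    def record_turn(self, output_text, cost_usd):
        self.turns += 1
        self.spend_usd += cost_usd
        # Crude near-duplicate signal: whitespace-normalized, lowercased text.
        # A production version might use shingling or embedding distance.
        self.recent.append(" ".join(output_text.lower().split()))

    def should_stop(self):
        """Return a human-readable stop reason, or None to continue."""
        if self.turns >= self.max_turns:
            return "turn cap reached"
        if self.spend_usd >= self.max_spend_usd:
            return "budget cap reached"
        if (len(self.recent) == self.max_repeats
                and len(set(self.recent)) == 1):
            return "repeated no-op turns"
        return None

guard = RunGuard(max_turns=100, max_spend_usd=10.0, max_repeats=3)
for _ in range(3):
    guard.record_turn("Queue blocked. No trusted delta. Taking no action.", 0.02)

reason = guard.should_stop()  # "repeated no-op turns"
```

Applied to the incident, this check would have halted the coherent-but-unproductive thread after its third identical "queue blocked" announcement, long before the budget was exhausted.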
One uncomfortable lesson from this incident is that AI coding agents can make unsafe designs feel operationally normal. The system was not built by hand from scratch over weeks; it was assembled quickly from natural-language requests, which made it easier to cross from experiment to live autonomous runtime without a serious hazard review.
That matters because “the code runs” is not the same as “the system is safe to operate.” If you are using an AI coding agent to build autonomous workflows, require it to produce guardrails as first-class deliverables:
- explicit stop conditions
- budget caps
- output-sanitization boundaries
- human-approval points
- failure-mode warnings in plain language
If those things are missing, treat the implementation as incomplete even if the feature appears to work.
The main lesson is not “never build autonomous workflows.” It is “do not let a demo architecture become a production runtime pattern.”
It is easy to build a system that looks powerful because it talks continuously, calls tools, posts evidence, and appears to self-coordinate. That same system can be fragile if:
- it cannot measure novelty
- it cannot stop itself
- it cannot constrain what text is allowed to flow into downstream systems
The incident prompted five immediate remediation steps:
- all active continuous workflows were stopped
- credentials and tokens in the affected environment were rotated or queued for rotation
- autonomous systems were moved behind stricter turn, budget, and approval gates
- user-visible publishing paths were reviewed so raw assistant text would not be forwarded without stronger filtering
- the incident was documented in sanitized form for internal review and public learning
These steps do not erase the incident, but they do matter operationally. A public postmortem is more credible when it pairs diagnosis with concrete control changes.
This incident should be understood as a runaway agent-systems problem. The models contributed to the visible degradation, but the root cause was the architecture: long-lived self-talk loops, tool access, stateful chaining, public text forwarding, and missing operational guardrails.
If you are building agent systems, the safest rule is:
start with short-lived, typed, budgeted workflows and earn your way into autonomy.
Footnotes

[^1]: OpenAI, “Conversation State,” OpenAI API Docs, accessed March 23, 2026, https://developers.openai.com/api/docs/guides/conversation-state.
[^2]: OpenAI, “Safety in Building Agents,” OpenAI API Docs, accessed March 23, 2026, https://developers.openai.com/api/docs/guides/agent-builder-safety.
[^3]: OpenAI, “Safety in Building Agents.”
[^4]: Nelson F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” arXiv, July 6, 2023, https://doi.org/10.48550/arXiv.2307.03172.
[^5]: Ari Holtzman et al., “The Curious Case of Neural Text Degeneration,” arXiv, April 21, 2019, https://doi.org/10.48550/arXiv.1904.09751.