War Stories — 11 Failures That Built an OS

2026-03-28 · Resource Management

1. The OOM Massacre

10 containers, 68GB RAM, all dead by morning.

We deployed 10 agent containers overnight. Each spawned Docker containers for CRMs, databases, APIs. By morning, the host was dead. 68GB of RAM exhausted. Windows warned the user. The agents didn't notice.

Why? Agents are ephemeral. Resources are not. An agent spawns a container, finishes, exits. The container keeps running. Nobody watches it. 10 agents x 3-5 containers each = 30-50 orphaned containers, eating RAM for 8 hours.

The pre-Unix equivalent

Programs before Unix allocated memory directly. When a program crashed, memory leaked. No garbage collector. No process table tracking ownership.

What the OS evolved

GC daemon with memory pressure monitoring. Phase-scoped cleanup that kills the whole process group (the kill -- -PGID pattern). Hierarchical resource ownership: Goal → Phase → Resource. Shell auto-detection registers spawned containers. Memory limits on all spawned processes.
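The Goal → Phase → Resource ownership idea can be sketched in a few lines. This is a minimal illustration, not OpenSculpt's actual code: the class names, the register/close methods, and the release callback are all assumptions made for the example. In a real system the callback would stop a container or signal a process group.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Resource:
    name: str                     # e.g. a container name or a PID (hypothetical)
    release: Callable[[], None]   # how to free it: stop container, kill group, ...


@dataclass
class Phase:
    name: str
    resources: List[Resource] = field(default_factory=list)

    def register(self, resource: Resource) -> None:
        """Record that this phase owns the resource."""
        self.resources.append(resource)

    def close(self) -> List[str]:
        """Release everything this phase owns, newest first."""
        released = []
        while self.resources:
            res = self.resources.pop()
            res.release()
            released.append(res.name)
        return released


@dataclass
class Goal:
    name: str
    phases: List[Phase] = field(default_factory=list)

    def phase(self, name: str) -> Phase:
        """Create a phase owned by this goal."""
        p = Phase(name)
        self.phases.append(p)
        return p

    def close(self) -> List[str]:
        """Closing the goal closes every phase, so nothing is orphaned."""
        released = []
        for p in self.phases:
            released.extend(p.close())
        return released
```

The point of the hierarchy is that the agent exiting no longer matters: whoever closes the phase or the goal frees everything registered under it.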

2026-03-27 · Lifecycle

2. The Zombie Services

Deploy and walk away. Service dies silently. Nobody notices.

Agent deploys CRM on port 8081. Reports "done." Exits. Three hours later, CRM crashes silently. Dashboard still green. User tries to access — nothing.

In every agent framework, deployment = task complete. In an OS, deployment = the beginning of lifecycle management.

The pre-Unix equivalent

Daemons before init/systemd: if a background process died, nobody restarted it. No watchdog. No health checks.

What the OS evolved

DomainDaemons that outlive the agent. Service lifecycle: deploy → monitor → restart → alert. Like systemd for AI agents. Two-phase gating: cheap health check first, LLM only when needed.
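A rough sketch of the deploy → monitor → restart → alert loop, assuming a hypothetical ServiceSupervisor class with injected callbacks. The names and the max_restarts limit are illustrative, not the DomainDaemon API.

```python
from typing import Callable


class ServiceSupervisor:
    """Outlives the deploying agent and keeps one service healthy."""

    def __init__(self, name: str,
                 is_healthy: Callable[[], bool],
                 restart: Callable[[], None],
                 alert: Callable[[str], None],
                 max_restarts: int = 3):   # assumed limit before giving up
        self.name = name
        self.is_healthy = is_healthy
        self.restart = restart
        self.alert = alert
        self.max_restarts = max_restarts
        self.restarts = 0

    def tick(self) -> str:
        """One monitoring pass. Returns the action taken."""
        if self.is_healthy():
            self.restarts = 0          # a healthy pass resets the counter
            return "ok"
        if self.restarts < self.max_restarts:
            self.restarts += 1
            self.restart()
            return "restarted"
        # Restarts did not help: tell a human instead of staying green.
        self.alert(f"{self.name} is down after {self.restarts} restarts")
        return "alerted"
```

Calling tick() on a timer gives the service a watchdog that does not depend on the agent that deployed it still being alive.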

2026-03-27 · Memory

3. The Amnesiac Agents

Agent 1 learns the API. Agent 2 starts from zero. 200K tokens wasted.

Agent #1 discovers the CRM API at /api/v1/Lead with admin credentials. 15 minutes. 200K tokens. Agent #1 exits. Agent #2 needs to "add 3 leads." Starts from zero. Same discovery. Same cost. Every single time.

The pre-Unix equivalent

Programs before shared filesystems: every program had its own data format, its own storage. No way to share knowledge between processes.

What the OS evolved

TheLoom — three-layer persistent memory. Agents save SKILL.md files. Next agent reads the skill first. 5K tokens instead of 200K. 40x cheaper. The LLM reads markdown directly — no database, no indexing.
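The skill-file idea needs very little machinery. The sketch below assumes one SKILL.md per service under a root directory; that layout and the save_skill/load_skill helpers are assumptions for illustration, not TheLoom's real structure.

```python
from pathlib import Path
from typing import Optional


def save_skill(root: Path, service: str, notes: str) -> Path:
    """Write what an agent learned about a service as plain markdown."""
    skill_dir = root / service
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / "SKILL.md"
    path.write_text(f"# Skill: {service}\n\n{notes.strip()}\n", encoding="utf-8")
    return path


def load_skill(root: Path, service: str) -> Optional[str]:
    """Return the saved skill text, or None if no agent has written one yet."""
    path = root / service / "SKILL.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    return None
```

The next agent calls load_skill first and puts the text straight into its prompt; only a None result justifies paying for discovery again.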

2026-03-26 · Honesty

4. The Green Lies

Dashboard green. CRM dead. Database empty. API returning 500s.

Agent tries Docker (not available). Fails. Verify command also uses Docker. Also fails. GoalRunner retries forever. Meanwhile, the agent found a workaround — CRM is actually running via apt-get. Dashboard shows red. Reality is green. Or vice versa.

The LLM wrote the verify command. The LLM got it wrong. Nobody checks the checker.

The pre-Unix equivalent

Programs returning exit code 0 when they segfault. No distinction between "I succeeded" and "I think I succeeded."

What the OS evolved

Three-tier verification: auto (shell check), ask_user (human confirms), none (honest yellow). Smart diagnosis rewrites bad verify commands. Evidence-based: if verify fails but processes are running, investigate before declaring failure.
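The three tiers map naturally onto three return colours. A minimal sketch, assuming a hypothetical verify function; the mode names come from the text, everything else (signature, timeout, the question asked) is illustrative.

```python
import subprocess
from typing import Callable, Optional


def verify(mode: str,
           command: Optional[str] = None,
           ask: Optional[Callable[[str], bool]] = None) -> str:
    """Return 'green' (verified), 'red' (failed), or 'yellow' (unverified)."""
    if mode == "auto":
        if command is None:
            raise ValueError("auto verification needs a shell command")
        try:
            result = subprocess.run(command, shell=True,
                                    capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            return "red"
        return "green" if result.returncode == 0 else "red"
    if mode == "ask_user":
        if ask is None:
            raise ValueError("ask_user verification needs a prompt callback")
        return "green" if ask("Is the service working?") else "red"
    if mode == "none":
        # Nothing was checked, so do not claim success.
        return "yellow"
    raise ValueError(f"unknown verification mode: {mode}")
```

The important branch is the last tier: when nothing can be checked, the honest answer is yellow, not green.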

2026-03-27 · Evolution

5. The Infinite Evolution Loop

Evolution activates same broken tool. Forever. Burning tokens.

Demand: "capability_gap: docker_management." Evolution activates Docker tool. Docker doesn't work (no daemon). Demand fires again. Evolution activates same tool. Infinite loop. Every cycle burned tokens doing exactly the same thing.

The pre-Unix equivalent

Retry without backoff. No state tracking. No escalation path.

What the OS evolved

Demand lifecycle with states: active → attempting → escalated → resolved. Exponential backoff (1min, 2min, 4min...). Auto-escalation to user after 6 failed attempts. 5-pattern stuck detector (from OpenHands research), including repeating tool calls, identical outputs, error-only loops, and ping-pong oscillation.
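The demand lifecycle with backoff and escalation can be sketched as a small state holder. The state names, the 6-attempt limit, and the 1min/2min/4min schedule come from the text; the Demand class and its methods are hypothetical.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 6          # from the text: escalate after 6 failed attempts
BASE_DELAY_SECONDS = 60   # from the text: 1min, 2min, 4min, ...


@dataclass
class Demand:
    kind: str
    state: str = "active"   # active -> attempting -> escalated | resolved
    failures: int = 0

    def next_delay(self) -> int:
        """Seconds to wait before the next attempt; doubles after each failure."""
        if self.failures == 0:
            return 0
        return BASE_DELAY_SECONDS * 2 ** (self.failures - 1)

    def start_attempt(self) -> None:
        if self.state in ("escalated", "resolved"):
            raise RuntimeError(f"demand is {self.state}; no further attempts")
        self.state = "attempting"

    def record(self, success: bool) -> str:
        """Record the outcome of an attempt and return the new state."""
        if success:
            self.state = "resolved"
        else:
            self.failures += 1
            # After too many failures, stop retrying and hand over to the user.
            self.state = "escalated" if self.failures >= MAX_ATTEMPTS else "active"
        return self.state
```

Once a demand is escalated, start_attempt refuses to run, which is what breaks the loop: the same broken tool cannot be activated a seventh time.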

2026-03-28 · Budget

6. The Token Bonfire

10 daemons polling at 3am. Answer: nothing. Cost: $86/day.

DomainDaemons run LLM calls every 30 seconds. At 3am with no activity, the answer is always "no." 10 daemons x 120 calls/hour x $0.003 = $3.60/hour doing nothing.
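The headline $86/day follows from the hourly figure if the polling runs around the clock. A quick check of the arithmetic, using only the numbers quoted above (Decimal avoids floating-point rounding noise):

```python
from decimal import Decimal

DAEMONS = 10
POLL_INTERVAL_SECONDS = 30
COST_PER_CALL = Decimal("0.003")   # dollars per LLM call, as quoted above

calls_per_hour = 3600 // POLL_INTERVAL_SECONDS              # 120 calls per daemon
cost_per_hour = DAEMONS * calls_per_hour * COST_PER_CALL    # dollars per hour
cost_per_day = cost_per_hour * 24                           # dollars per day

print(f"${cost_per_hour:.2f}/hour, ${cost_per_day:.2f}/day")
# prints: $3.60/hour, $86.40/day
```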

The pre-Unix equivalent

Busy-waiting before interrupts. The CPU spins checking "anything new?" millions of times per second instead of sleeping until an event fires.

What the OS evolved

Two-phase gating. fast_check() is cheap (HTTP/API, zero LLM). Only if needs_attention=true does the expensive LLM tick fire. Sleep tiers: active (30s), idle (5min), dormant (1hr). $0.10/day instead of $86/day.
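Two-phase gating plus sleep tiers fits in one small class. This is a sketch under assumptions: fast_check, the needs_attention flag, and the three tier durations come from the text, while the GatedDaemon class and the thresholds for dropping to idle or dormant are invented for illustration.

```python
from typing import Callable

SLEEP_SECONDS = {"active": 30, "idle": 300, "dormant": 3600}  # tiers from the text


class GatedDaemon:
    def __init__(self,
                 fast_check: Callable[[], bool],
                 llm_tick: Callable[[], None],
                 idle_after: int = 10,       # assumed: quiet checks before "idle"
                 dormant_after: int = 100):  # assumed: quiet checks before "dormant"
        self.fast_check = fast_check
        self.llm_tick = llm_tick
        self.idle_after = idle_after
        self.dormant_after = dormant_after
        self.quiet_checks = 0
        self.llm_calls = 0

    def tier(self) -> str:
        if self.quiet_checks >= self.dormant_after:
            return "dormant"
        if self.quiet_checks >= self.idle_after:
            return "idle"
        return "active"

    def step(self) -> int:
        """Run one cycle and return how many seconds to sleep afterwards."""
        needs_attention = self.fast_check()   # cheap: HTTP/API probe, zero LLM
        if needs_attention:
            self.quiet_checks = 0
            self.llm_calls += 1
            self.llm_tick()                   # expensive: only fires when gated in
        else:
            self.quiet_checks += 1
        return SLEEP_SECONDS[self.tier()]
```

At 3am every step takes the cheap branch, the quiet counter climbs, and the daemon ends up waking once an hour without ever calling the LLM.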

2026-03-28 · Budget

7. The Token Hoarders

Agent burns 50K tokens on a 5K task. Nobody tracks spending.

A human employee who spent 10x their budget on a simple task would get fired. Agents? Nobody even knows what they spend. No per-agent accounting. No budgets. No "you're over budget, stop."

The pre-Unix equivalent

Programs before ulimit and cgroups: one runaway process could consume all system resources with no limit.

What the OS evolved

Per-agent token budgets (50K default). Session compaction. Observation masking (strip verbose tool output). Dynamic tool pruning (only show relevant tools per command). Cost-aware model routing (Haiku for diagnosis, Sonnet for planning). 200K → 5K for repeated tasks.
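Two of these mechanisms are easy to sketch: a hard per-agent budget and observation masking. The 50K default comes from the text; the TokenBudget class, the BudgetExceeded exception, and the head-and-tail masking strategy are assumptions for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend past its token budget."""


class TokenBudget:
    def __init__(self, limit: int = 50_000):   # 50K default, as in the text
        self.limit = limit
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.spent

    def charge(self, tokens: int) -> int:
        """Account for a call; refuse it if it would exceed the budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.spent + tokens > self.limit:
            raise BudgetExceeded(
                f"over budget: {self.spent} spent, {tokens} requested, limit {self.limit}")
        self.spent += tokens
        return self.remaining


def mask_observation(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of verbose tool output and drop the middle."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    dropped = len(text) - 2 * half
    tail = text[len(text) - half:]
    return f"{text[:half]}\n[... {dropped} characters omitted ...]\n{tail}"
```

Charging before each LLM call turns "you're over budget, stop" into an exception the runtime can act on, rather than a number nobody looks at.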

2026-03-28 · Architecture

8. Agents Aren't First-Class Citizens

The root cause of all 7 stories above.

In every framework — LangChain, CrewAI, AutoGen, OpenHands — agents are function calls. No identity, no lifecycle, no budget, no permissions, no persistence, no audit trail.

Agents in frameworks are like programs in DOS. They run. They exit. Nobody manages what happens between or after.

In OpenSculpt, agents are first-class citizens — like Unix processes: identity (agent ID), lifecycle state machine (spawned → running → blocked → done), resource ownership (containers, files, services), token budgets (per-agent accounting), capability gates (what tools you can use), supervision (parent tracks children), persistent memory (TheLoom), and immutable audit trail.
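A sketch of what "first-class" can mean in code: an agent with an ID, a lifecycle state machine, a parent, and an audit log. The four states come from the text. The allowed transitions, the Agent class, and the list-based audit trail are assumptions, not the AgentRuntime implementation.

```python
from enum import Enum
from typing import List, Optional, Tuple


class AgentState(Enum):
    SPAWNED = "spawned"
    RUNNING = "running"
    BLOCKED = "blocked"
    DONE = "done"


# Assumed transition table: blocked agents may resume, done is terminal.
ALLOWED = {
    AgentState.SPAWNED: {AgentState.RUNNING},
    AgentState.RUNNING: {AgentState.BLOCKED, AgentState.DONE},
    AgentState.BLOCKED: {AgentState.RUNNING},
    AgentState.DONE: set(),
}


class Agent:
    def __init__(self, agent_id: str, parent: Optional["Agent"] = None):
        self.agent_id = agent_id
        self.parent = parent
        self.children: List["Agent"] = []
        self.state = AgentState.SPAWNED
        self.audit: List[Tuple[str, str]] = []   # append-only (from, to) records
        if parent is not None:
            parent.children.append(self)          # supervision: parent tracks children

    def transition(self, new_state: AgentState) -> None:
        """Move to a new state, rejecting anything the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.agent_id}: {self.state.value} -> {new_state.value} not allowed")
        self.audit.append((self.state.value, new_state.value))
        self.state = new_state
```

A plain function call has none of this: once it returns, there is no ID to look up, no state to inspect, and no parent to ask what it left behind.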

The OS insight

You can't bolt process management onto a framework. You can't add lifecycle to function calls. You need a fundamentally different architecture: an operating system where agents are the unit of execution, not functions.

2026-03-27 · Testing

9. 814 Tests Pass. Product Completely Broken.

pytest green. Every real scenario fails. Unit tests mock everything.

814 unit tests passed. Then we ran the actual product: green ticks on failures, recursive goal spawning, evolution infinite loops, fake improvement counts. 814 tests. Zero real-world scenarios verified.

Unit tests mock the LLM. Mock the shell. Mock the filesystem. The mocks all return success. The real world doesn't.

The lesson

Unit tests are necessary but wildly insufficient for agentic systems. The OS now requires E2E scenario testing: deploy in Docker, send real commands via Playwright, verify the service actually works. If you can't curl it, it's not deployed.
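"If you can't curl it, it's not deployed" translates to a check that talks to the real service and mocks nothing. A minimal sketch using only the standard library; the is_deployed helper is hypothetical and is not the Playwright-based scenario runner the text describes.

```python
import http.client
import urllib.error
import urllib.request

# An empty ProxyHandler makes the check talk to the service directly,
# ignoring any proxy configured in the environment.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_deployed(url: str, timeout: float = 5.0) -> bool:
    """True only if the URL actually answers with a 2xx status."""
    try:
        with _DIRECT.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, 4xx/5xx, malformed URL: all "not deployed".
        return False
```

A scenario test would deploy the service for real and then assert on this function, so a mock that returns success cannot make it pass.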

2026-03-29 · Evolution

10. Evolution Spent 32 Cycles Achieving Nothing

14 evolved .py files. Zero used. Pure waste.

The evolution engine ran 32 cycles overnight. It generated 14 Python files. None were ever imported, called, or used. It scanned arxiv papers about video security while the user was trying to deploy a CRM.

Evolution always picked the easy path — writing documentation instead of code — because the prompt rewarded "any improvement" equally.

What the OS evolved

Complete rewrite: evolution is now demand-driven. No demands = no token burn. DemandSolver uses LLM reasoning: reads the codebase, reads the environment, reads past fixes, then decides what to build. MetaEvolver (random mutation) disabled. Only real user failures trigger evolution.

The self-evolution insight

Evolution without accountability is waste. Every evolved artifact must have a caller. Every cycle must address a real demand. The evolution engine itself had to evolve — from random paper scanning to demand-driven problem solving.
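"Every evolved artifact must have a caller" can be checked mechanically, at least roughly. The sketch below uses the standard ast module to list evolved modules that nothing outside the evolved directory imports. The function names and directory layout are assumptions, and matching on module name alone is a heuristic: it can miss dynamic imports and does not prove the imported code is ever executed.

```python
import ast
from pathlib import Path
from typing import List, Set


def imported_names(source: str) -> Set[str]:
    """Every module or name component mentioned in an import statement."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.update(alias.name.split("."))
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                names.update(node.module.split("."))
            names.update(alias.name for alias in node.names)
    return names


def uncalled_modules(evolved_dir: Path, codebase_dir: Path) -> List[str]:
    """Evolved modules that no file outside evolved_dir imports."""
    evolved_dir = evolved_dir.resolve()
    used: Set[str] = set()
    for path in codebase_dir.resolve().rglob("*.py"):
        if evolved_dir in path.parents:
            continue   # evolved files importing each other do not count as callers
        try:
            used |= imported_names(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            continue   # skip files that cannot be read or parsed
    return sorted(p.stem for p in evolved_dir.glob("*.py") if p.stem not in used)
```

Run after each evolution cycle, a non-empty result is a cheap signal that the cycle produced artifacts nobody uses.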

2026-03-30 · Dead Code

11. The Constraint Store That Nobody Called

Beautiful SQLite infrastructure. Zero callers. Dashboard shows "0 constraints." Nobody notices.

We built a SQLite constraint learning system. ConstraintStore class, ResolutionCache class, async methods, test coverage. Initialized on boot. Wired into architecture docs. Neither class was ever called from running code. Zero constraints stored. Zero resolutions cached. Federation synced nothing.

The dashboard showed "0 constraints, 0 resolutions" for days. Nobody questioned why the OS hadn't learned anything.

The lesson

Dead infrastructure is worse than no infrastructure. It creates the illusion of capability. Every component must have a CALLER. Audit: trace from user action → event → component → persistence. If any link is missing, it's dead code. Delete it.

The fix (24 hours later)

Deleted SQLite stores entirely. Replaced with .opensculpt/constraints.md and .opensculpt/resolutions.md — plain text the LLM reads directly. The LLM IS the search engine. No indexes, no queries, no cosine similarity. Wired actual WRITING: pre-flight failures write constraints. Successful retries write resolutions. Now the OS actually learns.
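The write path for the two markdown files can be as small as an append. The file paths come from the text; the helper names and the entry format (a heading plus two bullets) are assumptions for illustration.

```python
from datetime import date
from pathlib import Path


def append_note(path: Path, title: str, detail: str) -> None:
    """Append one markdown entry; the file is created on first write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"## {title}\n\n"
                 f"- Recorded: {date.today().isoformat()}\n"
                 f"- {detail}\n\n")


def record_constraint(root: Path, what_failed: str, why: str) -> None:
    """Called when a pre-flight check fails."""
    append_note(root / ".opensculpt" / "constraints.md", what_failed, why)


def record_resolution(root: Path, problem: str, fix: str) -> None:
    """Called when a retry succeeds."""
    append_note(root / ".opensculpt" / "resolutions.md", problem, fix)
```

What matters is less the format than the call sites: these functions only help if the pre-flight and retry code paths actually invoke them.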

The LLM-native insight

An agentic OS powered by LLMs should store knowledge as text the LLM can read — not as database rows that need a query layer. .md files beat SQLite when the reader is an LLM.

Every failure = a missing OS feature

The pattern is clear: agents need the same infrastructure that programs got from Unix.

Failure	Missing OS Feature	Linux Equivalent	OpenSculpt Fix
OOM Massacre	Resource lifecycle	Process table + GC	GC daemon + hierarchical ownership
Zombie Services	Service monitoring	init / systemd	DomainDaemons + auto-restart
Amnesiac Agents	Shared memory	Filesystem (ext4)	TheLoom + SKILL.md
Green Lies	Honest verification	Exit codes + signals	Smart diagnosis + 3-tier verify
Infinite Loop	Demand lifecycle	Retry + backoff	Demand states + escalation
Token Bonfire	Efficient scheduling	cron + interrupts	fast_check + sleep tiers
Token Hoarders	Resource quotas	ulimit + cgroups	Budgets + compaction
No Identity	Process model	PID + signals	AgentRuntime + lifecycle FSM
814 Tests	Integration testing	make check	Playwright E2E scenarios
Idle Evolution	Accountability	SLAs	Demand-driven + scoring
Dead Infra	Wiring audit	strace + lsof	Call-path tracing + .md files

The pattern is undeniable

Every framework gives you agents. None manage what happens after. That's not a feature gap — it's a missing operating system layer. And the OS builds itself: every failure is a chisel strike.
