From semiconductor yield to LLM agents: tooling, long-horizon evaluation, and system reliability.
Why this matters
If an agent is 95% accurate per step, a 10-step task succeeds only about 59.87% of the time (0.95¹⁰ ≈ 0.5987). Stretch to 100 steps and success drops to 0.592% (0.95¹⁰⁰ ≈ 0.00592). In other words: multi-step workflows magnify small errors.
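To see the compounding effect concretely, here is the arithmetic as a few lines of Python, including the per-step accuracy you would need to keep a 100-step chain at 90% yield:

```python
# Chain success under independent per-step accuracy: P(success) = p ** n
p_step = 0.95
for n in (1, 10, 100):
    print(f"{n:>3} steps -> {p_step ** n:.4%} chain success")

# Invert it: per-step accuracy required to hit a target yield over n steps
target, n = 0.90, 100
print(f"need {target ** (1 / n):.4%} per step for {target:.0%} over {n} steps")
```

Running this prints ~99.89% per step for the last line: long chains leave almost no room for per-step error.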
This is the same problem semiconductor manufacturing solved decades ago. A modern chip undergoes hundreds to thousands of process steps; fabs learned to keep every step very close to perfect and to catch drift before it snowballs. That discipline is what raised yield and carried the industry from brittle prototypes to predictable mass production.
This post proposes an Agent Yield Stack—a closed-loop, engineering-first approach that ports fab-proven ideas (SPC/APC/FDC, poka-yoke, DFT/BIST, redundancy, formal constraints, degradation, and RCA analytics) into LLM agent design. It’s aimed at research-engineer readers who want agents that are observable, controllable, self-correcting, and measurable.
The mental model: treat agent reliability as “yield”
Agent yield = fraction of tasks completed correctly under bounded budget (steps, tokens, wall-time).
Two complementary views:
- Stepwise accuracy: quality of each individual action.
- Chain success: multiplicative product of all steps’ accuracies (the compounding problem).
Metrics you can actually ship with
- Agent Yield (overall success rate under fixed budget)
- Slope of Decay (success vs. chain length)
- TTD / MTTR (time-to-detect / mean-time-to-repair an error)
- Rework Rate (fraction of chains requiring backtrack/redo)
- Goodput (useful tokens per total tokens)
- FAR / FRR (false alarm / false reject rates of anomaly detection)
- Stability Index (rolling signal that fuses contradictions, tool failures, verification misses)
No metrics -> no improvement. Yield engineering starts here.
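Several of these metrics fall straight out of per-chain trace records. A minimal sketch, assuming a hypothetical trace schema (the field names are placeholders, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class ChainTrace:
    """One record per task; field names are placeholder assumptions."""
    succeeded: bool      # task completed correctly under budget
    reworked: bool       # chain needed at least one backtrack/redo
    tokens_total: int    # all tokens spent on the chain
    tokens_useful: int   # tokens in steps that survived into the final answer

def yield_metrics(traces: list[ChainTrace]) -> dict[str, float]:
    n = len(traces)
    return {
        "agent_yield": sum(t.succeeded for t in traces) / n,
        "rework_rate": sum(t.reworked for t in traces) / n,
        "goodput": sum(t.tokens_useful for t in traces)
                   / max(sum(t.tokens_total for t in traces), 1),
    }
```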
The Agent Yield Stack (overview)
Goal: Turn open-loop agents into closed-loop systems: monitor -> control -> prevent -> self-test -> verify -> recover -> learn.
Layers (top to bottom):
- Telemetry — Structured traces for every step (inputs, tools, outputs, features).
- SPC (Statistical Process Control) — Control charts over step-level quality signals; trigger early warnings.
- APC/R2R (Run-to-Run Control) — Auto-tune strategy based on last step’s error (sampling, temperature, tool priority, step size).
- FDC (Fault Detection & Classification) — Lightweight detectors that label failure modes (planning drift, factual error, tool misuse, loops).
- Poka-Yoke (Error-Proofing/SOP) — Pre-action checklists, format contracts, two-man rules on risky ops.
- DFT/BIST (Design-for-Test / Built-In Self-Test) — Embedded micro-tests and assertions during runtime.
- Selective Redundancy + Soft Formal Constraints — Multi-path at critical gates; invariants/units/equations checks.
- Graceful Degradation & Human-in-the-Loop — Switch to safer, partial outputs when stability collapses.
- RCA & Yield Analytics — Multi-factor analysis across logs; defect knowledge base; continuous improvement.
Think of it as porting a fab’s process control loop into an agent runtime.
Layer-by-layer: what fabs do, how to translate, and what to prototype
1) Telemetry (make the process observable)
In fabs: sensors on every tool; traceability across steps. For agents: capture structured per-step traces:
- Evidence–claim consistency score
- Tool call status & latency
- Result structure validity (schema, units, ranges)
- Contradiction count vs. context
- Retrieval coverage & novelty
- Self-reported confidence (bounded, calibrated)
Prototype: a thin middleware that wraps each tool/model call, emits a JSON event per step, and computes 3–5 cheap features.
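A sketch of that middleware, assuming plain Python callables as tools and stdout as the trace sink (both stand-ins for your real runtime):

```python
import json, time, uuid

def traced(step_name, fn, *args, **kwargs):
    """Wrap one tool/model call and emit a JSON telemetry event for the step."""
    event = {"id": str(uuid.uuid4()), "step": step_name, "ts": time.time()}
    t0 = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        # Cheap features; stand-ins for the richer signals listed above
        event.update(status="ok", latency_s=round(time.perf_counter() - t0, 4),
                     output_len=len(str(result)))
        return result
    except Exception as exc:
        event.update(status="error", latency_s=round(time.perf_counter() - t0, 4),
                     error=type(exc).__name__)
        raise
    finally:
        print(json.dumps(event))  # in practice: append to a durable trace sink
```

Usage is just `traced("search", my_search_tool, query)`; the point is that every step leaves a structured record whether it succeeds or throws.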
2) SPC — statistical process control
In fabs: control charts and control limits flag drift before specs are violated. For agents: choose 1–2 signals (e.g., evidence-claim consistency, tool success rate). Maintain a rolling mean/variance; trigger out-of-control if a point crosses limits or exhibits runs/trends.
Action on breach: roll back one step, re-retrieve, or switch to a safer template. Prototype: a sliding-window Z-score with Shewhart-style rules; log "SPC breach" events and ensuing corrections.
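A minimal sketch of that monitor, with two Shewhart-style rules: a point beyond ±3σ, or a run of 7 consecutive points on one side of the rolling mean (window size and thresholds are tunable assumptions):

```python
from collections import deque

class SPCMonitor:
    """Sliding-window control chart over one step-level quality signal."""

    def __init__(self, window=50, z_limit=3.0, run_len=7):
        self.values = deque(maxlen=window)
        self.z_limit, self.run_len = z_limit, run_len
        self.run_side, self.run_count = 0, 0

    def update(self, x: float) -> bool:
        """Add one observation; return True if the process looks out of control."""
        breach = False
        if len(self.values) >= 10:  # wait for a baseline before judging
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = var ** 0.5 or 1e-9
            if abs(x - mean) / std > self.z_limit:  # Shewhart point rule
                breach = True
            side = 1 if x > mean else -1            # run rule
            self.run_count = self.run_count + 1 if side == self.run_side else 1
            self.run_side = side
            if self.run_count >= self.run_len:
                breach = True
        self.values.append(x)
        return breach
```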
3) APC / Run-to-Run control — auto-tune the next step
In fabs: after each wafer, tweak recipe parameters to cancel drift. For agents: adapt strategy parameters based on last step’s error:
- Increase self-consistency samples from 1 -> 3 on instability
- Lower temperature, shorten plan depth, or cap tool fan-out
- Reprioritize tools (e.g., switch to high-precision calculator, stricter retriever)
Prototype policy:
"If stability index drops for 3 consecutive steps -> [raise n-samples, lower temp, enforce strict template] for the next 5 steps."
4) FDC — fault detection & classification
In fabs: real-time anomaly detection with fault labels to speed repair. For agents: a small watchdog tags failures:
- Factual inconsistency (claim contradicts source)
- Planning drift (actions diverge from stated goal)
- Tool misfit (repeated API 4xx/empty results)
- Infinite loop (repeated intents within N steps)
- Format breach (schema/contract violations)
Each label maps to a remedy: re-retrieve, switch tool, summarize state and re-plan, or escalate.
Prototype: rule-first (cheap and explainable), then back it with a tiny classifier trained from your traces.
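A rule-first watchdog might look like this; the trace fields it inspects are assumptions about your telemetry schema:

```python
def classify_fault(step: dict, recent_steps: list[dict]) -> str | None:
    """Rule-first fault labels; each label maps to a remedy in a playbook."""
    if step.get("contradiction_count", 0) > 0:
        return "factual_inconsistency"   # remedy: re-retrieve
    if step.get("tool_status") in {"4xx", "empty"}:
        return "tool_misfit"             # remedy: switch tool
    intents = [s.get("intent") for s in recent_steps[-5:]]
    if step.get("intent") and intents.count(step["intent"]) >= 3:
        return "infinite_loop"           # remedy: summarize state, re-plan
    if not step.get("schema_valid", True):
        return "format_breach"           # remedy: enforce strict template
    return None
```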
5) Poka-Yoke — error-proofing with SOPs and checklists
In fabs: fixtures and SOPs make wrong actions hard/impossible. For agents: enforce pre-action and pre-submit gates:
- Before executing a destructive op: show risk summary + confirmation template.
- Before calling an API: check required params present and sane (types, ranges, units).
- Before final answer: run a semantic checklist (answer addresses question, cites sources, no contradictions).
Prototype: declarative “guard” functions. If a guard fails, the step can’t proceed.
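A sketch of declarative guards via a registry; the example guard and its field names are illustrative, not a prescribed action schema:

```python
GUARDS = {}

def guard(name):
    """Register a declarative pre-action check; any failing guard blocks the step."""
    def register(fn):
        GUARDS[name] = fn
        return fn
    return register

@guard("api_params_sane")
def api_params_sane(action: dict) -> bool:
    # Illustrative check: required param present, limit typed and in range
    params = action.get("params", {})
    limit = params.get("limit", 10)
    return "query" in params and isinstance(limit, int) and 0 < limit <= 100

def run_guards(action: dict) -> list[str]:
    """Return names of failed guards; an empty list means the step may proceed."""
    return [name for name, check in GUARDS.items() if not check(action)]
```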
6) DFT/BIST — build self-tests into the runtime
In fabs: scan chains and BIST catch defects early and cheaply. For agents: embed quick micro-tests at natural boundaries:
- Code: compile & run 3–5 smoke tests
- Math: recompute with a tool; verify identities/units
- SQL: lint & dry-run with limits
- Summarization: ask a verification sub-agent to re-derive key facts
If a micro-test fails -> auto-repair cycle instead of propagating bad state.
Prototype: a per-task “assertion registry.” Firing an assertion routes to a targeted fix-it prompt/tool.
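A minimal assertion registry; the example predicate and remedy are placeholders for task-specific micro-tests:

```python
class AssertionRegistry:
    """Per-task micro-tests; a firing assertion routes to a targeted fix-it step."""

    def __init__(self):
        self.checks = []  # (name, predicate, remedy) triples

    def register(self, name, predicate, remedy):
        self.checks.append((name, predicate, remedy))

    def run(self, state: dict) -> list[str]:
        """Run every micro-test against current state; apply remedies on failure."""
        fired = []
        for name, predicate, remedy in self.checks:
            if not predicate(state):
                fired.append(name)
                remedy(state)  # e.g., flag the state for a targeted fix-it prompt
        return fired

# Example: a cheap range check on a numeric answer (names are placeholders)
registry = AssertionRegistry()
registry.register(
    "answer_in_range",
    predicate=lambda s: 0 <= s.get("answer_kwh", -1) <= 1e6,
    remedy=lambda s: s.update(needs_recompute=True),
)
```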
7) Selective redundancy + soft formal constraints
In fabs: spare rows/cols in SRAM, TMR voting in safety-critical logic. For agents: apply redundancy only at critical gates (to control cost):
- Three independent computations for final numbers (LLM reasoning, Python, SQL) -> majority vote
- Two alternative retrieval chains for pivotal facts -> cross-check citations
- Soft formal checks: accounting identities, conservation laws, unit balances, domain invariants
Prototype: define two “critical gates” in your flow; fan-out and vote only there.
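A sketch of a numeric critical gate: bucket near-equal values, take the majority, and escalate when no majority exists (the tolerance and escalation policy are assumptions):

```python
def critical_gate(candidates: list[float], rel_tol: float = 1e-3):
    """Majority vote over independently computed values (e.g., LLM reasoning,
    Python, SQL). Near-equal values agree; returns the winner, or None to
    signal escalation."""
    buckets: list[list[float]] = []
    for v in candidates:
        for bucket in buckets:
            if abs(v - bucket[0]) <= rel_tol * max(abs(bucket[0]), 1.0):
                bucket.append(v)
                break
        else:
            buckets.append([v])
    best = max(buckets, key=len)
    return best[0] if len(best) > len(candidates) // 2 else None

print(critical_gate([1042.0, 1041.7, 998.0]))  # -> 1042.0 (two of three agree)
```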
8) Graceful degradation & human-in-the-loop
In fabs: ship binned parts, disable weak cores; don’t scrap everything. For agents: when the Stability Index falls below a threshold:
- Degrade to partial, verifiable output (facts + TODOs) instead of full synthesis
- Ask for help: surface missing inputs or ambiguities explicitly
- Switch mode: from creative to conservative templates
Prototype: a one-liner policy: If (SPC breach ∧ 2× FDC in last 5 steps) -> produce “verified-facts + request-inputs” instead of final answer.
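That policy as code, with an assumed stability threshold of 0.6 (the mode names are illustrative):

```python
def choose_output_mode(spc_breach: bool, fdc_faults_last5: int,
                       stability: float, threshold: float = 0.6) -> str:
    """Degrade gracefully instead of forcing a full synthesis."""
    if stability < threshold or (spc_breach and fdc_faults_last5 >= 2):
        return "verified_facts_plus_request_inputs"  # partial, checkable output
    return "full_synthesis"
```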
9) RCA & yield analytics — learn systematically
In fabs: multi-factor analysis to isolate root causes; yield ramps over months. For agents: build a Yield Dashboard and a Defect KB:
- Correlate failures with factors: context length, date queries, tool latency, prompt patterns
- Maintain a taxonomy of failure modes + remedies
- Run small DOE-style A/Bs (e.g., template vs. tool swap) to confirm causality
Prototype: weekly batch jobs that cluster failures, then append to a living “Agent Defect Playbook.”
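A sketch of such a batch job, grouping failures by fault label and one candidate factor (context length); the trace fields are schema assumptions:

```python
from collections import Counter, defaultdict

def weekly_rca(traces: list[dict]) -> dict:
    """Group failed chains by fault label and surface a correlated factor."""
    failures = [t for t in traces if not t["succeeded"]]
    by_label = Counter(t.get("fault_label", "unlabeled") for t in failures)
    ctx_by_label = defaultdict(list)
    for t in failures:
        ctx_by_label[t.get("fault_label", "unlabeled")].append(t["context_len"])
    return {
        "failure_counts": dict(by_label),
        "avg_context_len": {k: sum(v) / len(v) for k, v in ctx_by_label.items()},
    }
```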
Putting it together: an end-to-end flow
User task
-> Plan (telemetry on)
-> Step k: retrieve / call tool / reason
↳ Guards (poka-yoke) pass?
↳ Micro-tests (BIST) pass?
↳ SPC update: within control limits?
- If breach -> APC adjusts strategy; optional rollback/retrieval refresh
↳ Watchdog (FDC): any fault labels?
- If yes -> route to remedy playbook
-> Critical gate?
↳ Selective redundancy + invariants check
-> Stability index below threshold?
↳ Degrade: partial verified output + request for missing info
-> Finalize
-> Log to RCA/analytics
This is not a monolith; it’s a set of thin, composable interceptors around your agent’s steps.
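Concretely, the interceptor pattern can be as small as a function composer; the two example layers below are illustrative, not a fixed API:

```python
def compose(*layers):
    """Chain thin interceptors around a step function; each layer wraps the next."""
    def wrap(step_fn):
        for layer in reversed(layers):
            step_fn = layer(step_fn)
        return step_fn
    return wrap

def with_logging(step_fn):
    def run(state):
        print(f"step {state['k']}: start")
        out = step_fn(state)
        print(f"step {state['k']}: done")
        return out
    return run

def with_retry(step_fn, attempts=2):
    def run(state):
        for i in range(attempts):
            try:
                return step_fn(state)
            except RuntimeError:
                if i == attempts - 1:
                    raise
    return run

pipeline = compose(with_logging, with_retry)(lambda s: s["k"] * 2)
print(pipeline({"k": 3}))  # -> 6
```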
What to evaluate (and how)
Tasks: multi-hop retrieval QA, retrieval-compute-synthesis (reports), code-gen-and-fix, structured-data QA.
Design:
- Baseline: open-loop agent with no controls
- Yield Stack: enable layers incrementally (ablation): SPC -> +APC -> +FDC -> +BIST -> +Redundancy
Report:
- Agent Yield at chain lengths [5, 10, 20, 50]
- Slope of Decay (shallower is better)
- TTD/MTTR and Rework Rate
- Cost overhead vs. accuracy gain (esp. at redundant gates)
- FAR/FRR for FDC
Stress:
- Inject explicit faults: noisy retrieval, API intermittency, contradictory evidence, near-duplicates
- Observe recovery behavior and the point where degradation triggers
A good litmus test: after enabling just SPC+APC+one micro-test, can you cut the decay slope by >30% on your longest chains while keeping token overhead within 1.3–1.8×? If so, add a single redundant gate and watch tail errors collapse.
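One way to operationalize Slope of Decay: under the compounding model, log(yield) = n·log(p), so a through-origin fit of log-yield against chain length recovers an effective per-step accuracy p. A minimal sketch; the yields below are made-up numbers for illustration, not measured results:

```python
import math

# Hypothetical measured yields at the chain lengths from the report above
lengths = [5, 10, 20, 50]
yields = [0.88, 0.77, 0.60, 0.28]

# Through-origin least squares on log(yield) = slope * n
slope = sum(n * math.log(y) for n, y in zip(lengths, yields)) \
        / sum(n * n for n in lengths)
print(f"decay slope: {slope:.4f} -> effective per-step accuracy {math.exp(slope):.2%}")
```

A shallower (less negative) slope after enabling a layer is the direct, length-normalized evidence that the layer is paying for itself.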
What’s novel here vs. common agent tricks
Most agent posts stop at self-consistency or reflection. Those are helpful, but they’re still open-loop heuristics. The Agent Yield Stack is different in three ways:
- Process control: it treats the agent as a controlled process with telemetry, control limits, and run-to-run tuning.
- Error-proofing & self-test: it prevents obvious mistakes and tests during execution, not only at the end.
- Selective redundancy & degradation: it spends compute surgically (only where it matters) and fails gracefully when stability collapses.
That’s the difference between a one-off demo and a system you can operate.
Closing
Semiconductor yield engineering turned thousand-step processes into reliable mass production by making every step observable, controlled, and correctable. LLM agents face the same compounding failure physics. By importing SPC/APC/FDC, poka-yoke, DFT/BIST, selective redundancy, formal invariants, graceful degradation, and RCA analytics, we can raise agent yield from fragile demos to dependable systems.
This is the path from “it worked once on stage” to “it works every day in production.”