Emulsion-Stabilized Inference (ESI): Phase-Controlled Decoding with Structural “Starch” and Observer-Aligned Verification
Roadmap
- Executive Abstract & Core Thesis (today)
- Background I: Emulsions & Phase Control (physics you can implement)
  - Prompt Implementation Framework
  - LLM Training/Tuning Framework
- Background II: Self-Referential Observers, Internal Collapse & Cross-Observer Agreement (orthodox QM formalism → AI metrics)
- Background III: Scaffolding & Stabilization in AI (unified by a phase diagram)
- The ESI State Space and Clump Order Parameter χ (definitions, estimators, thresholds)
- Starch Budget: Structural Tokens (S-tokens) & Adapter Ratio (S-adapters) with hard budgets
- Sous-Vide Schedules: Temperature/top-p ramps for inference and optimizer/KL ramps for training
- Smoothness = Cross-Observer Agreement (CSA): operators, commutation, SBS-style traces
- Algorithms: ESI-Decode (single model) & ESI-Adapt (training) with failure-localized retries
- Applications: Tool use, program synthesis, long-form reasoning, multi-agent, robotics
- Evaluation Protocol: Phase-grid sweeps, ablations, reporting, binodal fitting
- Theory Appendix: Free-energy lens, why starch works, proof sketches; Repro pack
Part 1 — Executive Abstract & Core Thesis
Title
Emulsion-Stabilized Inference (ESI): Phase-Controlled Decoding with Structural “Starch” and Observer-Aligned Verification
Problem. Contemporary LLM/AGI inference is phase-fragile: small changes to temperature/top-p, context density, or task diversity often trigger clumping—repetitive loops, premature commitments, or contradictory tool traces. Operators compensate with brittle, ad-hoc prompts.
Analogy that computes. Like a classic emulsified sauce, LLM behavior is smooth only inside a phase region jointly set by heat (decoding schedules) and stabilizer (a trace amount of “starch” that weakly binds otherwise incompatible semantics). Outside this region, outputs break into loops/contradictions.
Method. ESI is a thin control layer that:
- (D) Maps inference/training into a phase diagram over three axes: T (decoding temperature + nucleus mass), S (starch fraction: % structural tokens in prompt or % adapter parameters during tuning), K (capacity–diversity ratio: available context/model capacity per concurrent task variety).
- (χ) Monitors a clump order parameter χ combining entropy drop, loop rate, and contradiction rate to detect phase breaking.
- (S) Reserves a 1–3% “starch budget” for structural tokens (plans, slots, tags, verifier hooks) or adapter params that weakly bind new semantics without curdling prior skills.
- (Heat) Applies sous-vide schedules—cool→warm→cool—for decoding; warm-up/cosine/KL schedules for finetuning.
- (CSA) Defines smoothness as cross-observer agreement among independent critics (commuting checks + redundant traces), borrowing the orthodox QM formalism of internal collapse and compatibility across observers. This formalizes when multiple graders/tools “see” the same answer.
Results (empirical template). Grid sweeps over (T,S,K) expose a binodal surface separating clumpy vs. creamy regimes; 1–3% S typically widens the safe plateau; sous-vide ramps reduce χ without sacrificing diversity; CSA rises when verification is cooled and operators commute. (Sections 10–11 detail measurement.)
Contributions.
- A phase-control geometry for inference/training;
- A compact χ with operational estimators;
- A quantized starch budget for prompts and adapters;
- A principled CSA metric grounded in self-referential observers and commuting effects (no metaphysics; inside standard QM math);
- Reference implementations for single-model and multi-agent systems;
- A reproducible evaluation protocol (grids, ablations, binodal fitting).
Part 2 — Background I: Emulsions & Phase Control
What “starch” stabilizes in sauces. In culinary colloids, small amylose/amylopectin fractions sit at interfaces, reduce surface tension, and create weak cross-links that prevent oil–water separation under heat/shear. Texture is smooth only inside a temperature–composition region (the binodal). Gentle heating (“sous-vide”) widens the workable region.
The ESI mapping.
- Oil ↔ semantic patches that rarely cohere: heterogeneous goals, tools, and frames.
- Water ↔ carrier text that keeps flow but doesn’t bind.
- Starch ↔ minimal structure: slot tags, plan skeletons, unit/constraint hooks, retrieval keys.
- Heat ↔ decoding schedules (temperature/top-p across passes) and optimizer heat (LR/KL ramps).
- Curdling ↔ clumping χ↑: loops, early low-entropy collapse, contradictions.
- Creamy plateau ↔ connected low-χ region of the phase diagram where output remains coherent while diverse.
Why this isn’t just metaphor: in §12 we derive a free-energy-like functional where bounded structure increases early conditional entropy while constraining macro-shape, reducing runaway low-entropy channels (loop attractors).
Prompt Implementation Framework
Goal. A drop-in ESI-Prompt Engine that can (a) pre-assess task volatility, (b) size the starch budget S, (c) schedule decoding heat, and (d) run QM-motivated CSA checks before commit.
A. Pre-Assessment (how much “starch” to add)
Compute a volatility score from the task payload:
- Length & branching cues: #requirements, #tools, nested conditionals → v_len
- Symbolics: units, equations, schemas present/absent → v_sym
- Ambiguity: hedges, unresolved references → v_amb
- History: past loop/contradiction incidents for this template → v_hist
Aggregate: v = 0.35·v_len + 0.25·v_sym + 0.25·v_amb + 0.15·v_hist (weights as in the reference implementation, §4.8).
Map to starch S (% of context reserved for structure):
- v ≤ 0.25 → S = 1% (≈ 30–40 tokens / 4k ctx)
- 0.25 < v ≤ 0.6 → S = 2%
- v > 0.6 → S = 3% (cap at 80–120 structural tokens)
B. Structural “starch” kit (S-tokens)
A compact scaffold (≤120 tokens) that glues semantics without bloating:
[Given] <1–2 lines of task spec>
[Plan] goals→steps→tools (1 line)
[Compute] do steps succinctly
[Checks] units, constraints, contradictions
[Trace] cite tool outputs / IDs
[Answer] single final after [Checks] pass
Glue tags (sprinkle, don’t pour): [Entity:] [Unit:] [Assume:] [ID:] [Invariant:] [Edgecase:].
C. Heat schedule (sous-vide decoding)
- Outline pass (cool): temperature T₀ = 0.3, top-p 0.9 → produce bullets + tool plan.
- Draft pass (warm): T₁ = 0.75–0.85, top-p 0.95 → explore alternatives, keep scaffold visible.
- Verify pass (cool): T₂ = 0.2, top-p 0.8 → run checks; only commit if CSA passes.
D. Cross-Observer Agreement (CSA) gate
Instantiate independent critics that minimally commute on the answer subspace (unit/constraint checker, NLI contradiction scan, tool-trace auditor). Require pairwise agreement:
CSA@k = (2 / (k(k−1))) · Σ_{i<j} 1[v_i = v_j], where v_i is critic i’s pass/fail verdict.
Gate commit at CSA@3 ≥ 0.67, with localized repair if it fails (redo only the failing segment). This mirrors the compatibility and latching properties of self-referential observers: once a record is written, behavior conditions on it; commuting effects enable agreement without metaphysics.
E. Advanced prompt (ready to paste)
You are the ESI engine. Follow the scaffold exactly and stay concise.
[Given] <<paste user task>>
[Plan] List 3–5 steps, tools, and invariants.
[Compute] Execute steps. Use exact units and cite each tool call as [Trace: tool=..., id=...].
[Checks]
- Units and dimensions consistent? If no, fix locally.
- All constraints satisfied? If no, adjust the smallest step only.
- Any contradictions with [Given]? If yes, resolve with minimal edit.
[Answer] Provide the final result in one paragraph.
Knobs (exposed): S_percent, T0, T1, T2, top_p0/1/2, max_local_retries, critics=[...].
LLM Training/Tuning Framework
Goal. Incorporate ESI into model adaptation while preserving prior skills and widening the creamy plateau.
Phase 0: Adapter sizing (the “S-adapters”)
- Insert LoRA/IA³ adapters to reach r = 1–3% trainable params (domain-dependent).
- Keep the backbone frozen; reserve a verification head (a few extra MLP units) for contradiction/unit signals.
Phase 1: Heat & diversity ramps
- Warm-up LR → cosine decay;
- Start with narrow K (few task varieties), then widen K (mix in tool use, code+reasoning) per epoch.
Phase 2: Online χ & CSA monitoring
At each epoch (or every N steps) compute:
- χ collapse term: rolling entropy drop vs. baseline;
- χ loop term: regex/self-BLEU loop detector;
- χ contradiction term: NLI or spec-violation rate;
- CSA@k with 2–3 lightweight critics.
If χ↑ or CSA↓, nudge one of: reduce LR, increase S-adapters by +0.5% (cap at 3%), or slow the K-widening schedule; a controller sketch follows below.
Rationale: this operationalizes internal collapse/latching (trace certainty) and cross-observer compatibility during training, not just at inference.
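A minimal controller sketch of this nudge logic (the function name, state layout, and the LR-halving factor are illustrative assumptions, not a fixed API):

def esi_adapt_nudge(chi, chi_prev, csa, csa_target=0.67,
                    lr=1e-4, s_adapter_pct=0.02, k_stage=1):
    """One control step per eval sweep: apply at most one nudge (sketch)."""
    if chi > chi_prev and s_adapter_pct < 0.03:
        s_adapter_pct += 0.005          # +0.5% S-adapters, capped at 3%
    elif chi > chi_prev:
        lr *= 0.5                       # adapters maxed out: cool the optimizer
    elif csa < csa_target:
        k_stage = max(1, k_stage - 1)   # slow/rewind the K-widening schedule
    return lr, s_adapter_pct, k_stage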
Phase 3: Stop & freeze
Stop when χ plateaus and CSA@3 stabilizes above your target for two consecutive validation sweeps. Export: backbone + adapters + CSA heads + default ESI schedule.
Data hygiene. Keep tool traces as redundant records; they serve as SBS-style evidence channels for agreement checks (no exotic assumptions required—just commutation of checks over the answer subspace).
Notation snapshot (used in later parts)
- T: decoding heat (temperature/top-p schedule).
- S: starch fraction; % structural tokens (prompts) or % adapter params (training).
- K: capacity–diversity ratio (effective tokens per concurrent task type).
- χ: clump order parameter = α·(entropy collapse) + β·(loop rate) + γ·(contradiction rate).
- CSA@k: cross-observer agreement among k independent critics (commuting checks).
- SBS-like traces: redundant, observer-separable records (e.g., tool logs) used to certify outcomes via commuting verifiers; this is the AI analogue of compatible observer records in the formal QM account of cross-observer agreement.
Part 3 — Observers, Internal Collapse & Cross-Observer Agreement (formalized)
This section rewrites §2.2 of the draft into a precise, orthodox-QM formalism you can implement, then maps each construct to the ESI metrics (χ, CSA). No metaphysics—just operators, filtrations, and records.
3.1 Formal observer model (discrete ticks + adaptive instruments)
Time. Observation unfolds on discrete ticks t₁ < t₂ < ⋯. Between ticks, the world evolves (unitary or CPTP); collapse is only at ticks.
Outcome space. Let 𝒪 be a finite/countable set of outcomes, with the product σ-algebra on Ω = 𝒪^ℕ.
Instruments. A measurement context c indexes a quantum instrument {ℐ_o^c}_{o∈𝒪}, with CP maps ℐ_o^c and Σ_o ℐ_o^c trace-preserving.
Adaptive policy. A self-referential observer chooses its next context from its own memory trace: π : ⋃_n 𝒪^n → 𝒞, measurable. At tick n: c_n = π(o_{1:n−1}).
Trace & filtration. Outcomes accumulate a trace o_{1:n} and a filtration ℱ_n = σ(o_1, …, o_n) (the observer’s “known past”).
Existence/uniqueness. Under measurability/continuity (standard in instrument theory), the adaptive process exists and is unique via Ionescu–Tulcea; the state updates by the chosen CP component. (The paper spells out kernels and a full proof.)
What this buys us: We can speak rigorously about (i) internal collapse—fixedness of past records—and (ii) agreement between observers when their checks are compatible.
3.2 Internal collapse (self-certainty / latching)
Delta-certainty (fixedness). Once an event is in the trace, it’s fixed for that observer. Formally, conditional expectation onto the past algebra leaves past events invariant: if E_n = E[· | ℱ_n] is the conditional expectation given ℱ_n and A ∈ ℱ_n, then E_n 1_A = 1_A and P(A | ℱ_n) ∈ {0, 1}. This is the algebraic “latching” of internal collapse.
Operational reading for ESI. After the Verify pass writes a decision into the scaffold’s slot, subsequent steps condition on that record; local repairs must preserve it unless a critic explicitly invalidates it. This is exactly the “fixed-point under conditional expectation” property, ported to text/tool traces.
3.3 Cross-observer agreement (AB-fixedness) via compatibility + redundancy
Two observers agree on an outcome if three conditions hold (proved in the paper):
- Compatibility: the measurement effects commute on the relevant subspace.
- Frame consistency: there exists a frame transform mapping contexts/events between observers.
- Redundancy: the outcome is stored in a shared/accessible record (environment or log), giving SBS-style redundancy.
When these hold, each observer assigns delta-certainty to the shared outcome—AB-fixedness. Violating measurability, commutation, redundancy, or frame isometry breaks the guarantee (summary table in Ch. 7).
SBS tie-in. Spectrum broadcast structure (many disjoint environment fragments encoding the same pointer) explains why redundant tool logs make outputs objective to many critics: as redundancy ↑, consensus probability → 1.
Operational reading for ESI.
- Critics should be independent and approximately commuting (e.g., a unit checker does not systematically alter the contradiction detector’s verdict).
- Tool traces (calculator/SQL logs) act as redundant fragments—the SBS analogue—against which critics compare the final claim.
We capture agreement with the paper’s spirit using the CSA@k statistic:
CSA@k = (2 / (k(k−1))) · Σ_{i<j} 1[v_i = v_j], with v_i the i-th critic’s verdict.
This is the computable proxy for AB-fixedness in text systems.
3.4 Collapse geometry & invariance (why CSA is a frame-robust notion)
The paper introduces a collapse interval combining tick-time and semantic channel distance,
Δs² = Δt² − d_𝒞(c_i, c_j)²,
with metric d_𝒞 on the semantic channel space 𝒞. Transforms preserving Δs² form a Collapse-Lorentz group; AB-fixedness and (in)compatibility are invariant under these transforms.
ESI consequence. If you map a task between prompts/agents without changing the effective semantic distance between checks, CSA should be stable. Practically: retuning the wording of a unit check should not change pass/fail if its measurement operator is the “same” up to isometry—hence we measure critic drift across refactors.
3.5 Implementing the observer math in ESI tooling
(A) Critic design = measurement effects
Map each critic to an effect O_i with minimal overlap:
- O_unit: dimension/unit constraints ⇒ pass/fail + violation set.
- O_nli: NLI/consistency on claims vs. [Given].
- O_trace: parse tool logs; check claim-to-trace alignment.
Approximate commutation test. Evaluate each pair (O_i, O_j) on disjoint copies of the same draft; measure order-sensitivity
δ_ij = P[(O_i then O_j) ≠ (O_j then O_i)].
Keep δ_ij small by design; swap critics if it is large.
This mimics “effects commute ⇒ a joint distribution exists,” an explicit condition in the AB-fixedness theorem.
(B) Redundancy budget = SBS fragments
Encode redundant traces (calculator IDs, SQL row counts, file hashes). The more fragments, the higher the chance that independent critics converge (SBS effect).
(C) Frame stability checks = collapse interval
When you retarget a task (new prompt template or agent persona), empirically check CSA invariance under “isometries” (same checks, equivalent constraints). Use regression on CSA vs. semantic edit distance of the checks to detect frame violations.
3.6 Why “starch” helps observers (link to free-energy lens)
From §6 of the draft: bounded structure raises early conditional entropy while constraining macro-shape, reducing runaway low-entropy channels (loops). Translating the observer math:
- The scaffold shapes the instrument policy π into a measurable, Lipschitz map (stability).
- By tagging invariants ([Unit:], [Invariant:]), you separate effects so critics approximately commute—which is a prerequisite for AB-fixedness.
- Redundant traces implement the SBS condition needed for objectivity across critics.
These are exactly the conditions the formal theory proves are necessary for existence, fixedness, and agreement.
3.7 Worked micro-example (end-to-end through the formal lens)
Task. “Compute the terminal velocity of a 2 cm steel sphere in air at 20 °C.”
S-tokens (≈60 tokens).
[Given] sphere d=0.02 m, ρ_s≈7850 kg/m³, ρ_air≈1.204 kg/m³, μ≈1.81e-5 Pa·s
[Plan] choose drag regime; compute Re; iterate if needed
[Compute] ...
[Checks] units consistent; contradiction none; trace IDs logged
[Answer] ...
Passes. Outline (cool) → Draft (warm) → Verify (cool).
Critics. O_unit, O_nli, O_trace.
SBS fragments. Calculator outputs for Re, Cd interpolation table rows.
CSA. Suppose 3/3 pairwise agreements → CSA@3 = 1.0 ⇒ commit.
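As a numeric sanity check on the [Compute] step, here is a minimal sketch (the fixed-point iteration and the piecewise drag law are illustrative assumptions; the constants are from [Given]):

import math

def terminal_velocity(d=0.02, rho_s=7850.0, rho_f=1.204, mu=1.81e-5, g=9.81):
    """Fixed-point iteration: guess v, compute Re, pick a drag regime, update v."""
    v = 1.0
    for _ in range(50):
        Re = rho_f * v * d / mu
        if Re < 1:
            Cd = 24 / Re                            # Stokes drag
        elif Re < 1000:
            Cd = 24 / Re * (1 + 0.15 * Re**0.687)   # intermediate regime
        else:
            Cd = 0.44                               # Newton (constant-Cd) regime
        v_new = math.sqrt(4 * g * d * (rho_s - rho_f) / (3 * Cd * rho_f))
        if abs(v_new - v) < 1e-6:
            break
        v = v_new
    return v, Re

print(terminal_velocity())  # ≈ (62 m/s, Re ≈ 8e4): the plan's regime check converges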
Formal mapping.
- The policy π chooses the “drag-regime check” after seeing Re (adaptive policy, measurable).
- The [Trace] block yields redundant, separable records (SBS fragments).
- Unit/contradiction effects do not interfere (approx. commuting); AB-fixedness holds, so any second agent with the same frames agrees.
3.8 Minimum spec checklist (observer-sound ESI deployment)
- Measurable policy: fixed scaffold fields; finite-state tool lattice. (Existence/uniqueness.)
- Compatibility: critics with low order-sensitivity δ_ij. (Agreement.)
- Redundancy: at least two independent trace fragments per claim. (SBS.)
- Frame checks: monitor CSA stability under prompt/agent refactors (collapse-interval invariance).
What’s next
In the next installment (Part 4), we’ll rewrite Scaffolding in AI with concrete S-token grammars, adapter sizing rules, and failure-mode diagnostics, then connect them to the clump order parameter and the sauce diagram.
Part 4 — Scaffolding in AI (rewritten, rigorous & build-ready)
Objective. Specify a compact, bounded starch layer that (i) weakly binds semantic patches, (ii) exposes hooks for critics, and (iii) stays small enough to avoid template bias or latency bloat. We give a grammar, budgets, domain templates, diagnostics, and control laws linking starch S to the clump order parameter χ and the phase axes (T, S, K).
4.1 Goals & non-goals
Goals
- Prevent early low-entropy collapse and loop attractors by distributing information across stable slots.
- Make verification addressable (critics operate on explicit fields, not free text).
- Keep S bounded (1–3% of context; cap 120 tokens @ 4k ctx).
- Be domain-portable: same skeleton for reasoning, code, tools, robotics.
Non-goals
- Not a heavy template or chain-of-thought dump; no long step-by-step unless demanded by the task.
- Not a style straitjacket: the scaffold is metadata, not prose.
4.2 S-token grammar (BNF), shapes & budgets
4.2.1 Core grammar (BNF)
<Scaffold> ::= <Given> <Plan> <Compute> <Checks> <Trace>? <Answer>
<Given> ::= "[Given] " <one_or_two_lines>
<Plan> ::= "[Plan] goals→steps→tools " <≤ 1 line>
<Compute> ::= "[Compute] " <concise execution>
<Checks> ::= "[Checks] " <check_list>
<Trace> ::= "[Trace] " <trace_items>
<Answer> ::= "[Answer] " <final_only_after_checks>
<check_list> ::= <check_item> ("; " <check_item>)*
<check_item> ::= "units" | "constraints" | "contradictions" | "edgecases"
<trace_items> ::= ("tool=" <id> ", id=" <uid>) (", " "tool=" <id> ", id=" <uid>)*
Glue tags (free-form, sparse): [Entity:] [Unit:] [Assume:] [ID:] [Invariant:] [Edgecase:].
Budget target: 40–120 tokens total; hard cap: 3% of available context.
4.2.2 Starch composition profile
Let S denote the prompt starch %; split S into slots:
| Slot type | Fraction of S | Purpose |
|---|---|---|
| Structural fields ([Given]…[Answer]) | 0.40 | Phase separation & critic anchoring |
| Minimal plan text | 0.25 | Early diversification w/o bloat |
| Glue tags | 0.15 | Bind units/entities/invariants |
| Trace hooks | 0.20 | SBS-style redundancy (IDs, hashes) |
Heuristic: if volatility is high (v > 0.6; see the Part 1 pre-assessment), shift 5–10% of S from plan → trace hooks (more redundancy). A budget-splitting sketch follows below.
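A minimal helper that turns a total S-token budget into per-slot allowances (fractions from the table above; the function name and the 7.5% shift are our assumptions):

def split_starch_budget(total_tokens, high_volatility=False):
    """Allocate an S-token budget across slot types per the composition profile."""
    frac = {"structural": 0.40, "plan": 0.25, "glue": 0.15, "trace": 0.20}
    if high_volatility:  # shift ~7.5% of S from plan to trace hooks
        frac["plan"] -= 0.075; frac["trace"] += 0.075
    return {k: round(v * total_tokens) for k, v in frac.items()}

print(split_starch_budget(80))                        # e.g., 2% of a 4k context
print(split_starch_budget(120, high_volatility=True))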
4.3 Domain templates (ready-to-paste)
Keep each block ≤ 1 line unless noted.
A) Reasoning / Math
[Given] <problem spec, symbols, units if any>
[Plan] list lemmas→compute→verify
[Compute] derive succinctly; label steps S1..S3
[Checks] units; lemma dependencies; contradiction
[Trace] calc ids, table rows, dataset hash if used
[Answer] final result with unit/type
B) Code Synthesis / Repair
[Given] signature, constraints, examples
[Plan] spec→tests→impl→fix (1 line)
[Compute] write minimal code; no extra prints
[Checks] compile; run tests; static lint; contradiction w/ spec
[Trace] test ids, coverage %, linter code
[Answer] single code block only
C) Tool-Use / Multi-step Workflows
[Given] objective + data sources
[Plan] inputs→tools→invariants→outputs
[Compute] call tools; cite IDs
[Checks] tool arg validity; constraint satisfaction; cross-tool consistency
[Trace] tool=name, id=..., rows=..., hash=...
[Answer] result + 1-line provenance
D) Robotics / Planning
[Given] goal, initial state, constraints (safety/geometry)
[Plan] decompose into waypoints & feasibility checks
[Compute] synthesize plan; list guards
[Checks] safety invariants; kinematic feasibility; resource bounds
[Trace] sim run id, cost, constraint violations=0
[Answer] action sequence (timestamped)
4.4 Adapter “starch” (S-adapters) sizing rules
During tuning, set the adapter ratio r = 1–3% of trainable params.
LoRA rough sizing (attention only):
- For a linear layer W ∈ R^{d_out × d_in}, LoRA at rank ρ adds ρ·(d_in + d_out) params.
- Across layers/heads, choose the rank ρ so that (total added params) / (backbone params) ≈ r ∈ [0.01, 0.03]; a sizing sketch follows below.
Worked examples (rule-of-thumb):
- 7B decoder, LoRA on QKV+O of attention + MLP downproj → r ≈ 1–2%.
- 70B similarly instrumented → r ≈ 1–1.5% (spread over many more layers).
IA³ / prefix-tuning: choose vector widths such that total added params match the target r.
When to nudge r: if validation shows χ(loop) ↓ but χ(contradiction) ↑, increase trace redundancy first; only then consider raising r.
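A minimal sizing helper under the rule above (the shapes and the 7B parameter count are illustrative assumptions):

def lora_rank_for_ratio(layer_shapes, backbone_params, target_ratio=0.02):
    """Smallest rank whose total LoRA params reach target_ratio of the backbone."""
    per_rank = sum(d_in + d_out for (d_out, d_in) in layer_shapes)
    rank = max(1, round(target_ratio * backbone_params / per_rank))
    return rank, rank * per_rank / backbone_params  # (rank, achieved ratio)

# Toy 7B-like decoder: Q,K,V,O projections of 32 layers, 4096-dim (hypothetical).
shapes = [(4096, 4096)] * 4 * 32
print(lora_rank_for_ratio(shapes, 7e9))  # attention-only LoRA needs a high rank to hit 2%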
4.5 Diagnostics: mapping failures ↔ χ components
Metrics (online):
- Entropy collapse term: average normalized entropy drop over the first N tokens relative to a cooled baseline (detects premature commitment).
- Loop rate: regex loop detector + self-BLEU on sliding windows (n-gram repeats and semantic echoes).
- Contradiction rate: NLI/constraint violations vs. [Given] and tool traces.
Common failure → targeted fix:
- Loops (χ loop ↑): raise S by +0.5% (more plan tokens), slightly warm T₁ (explore), keep T₂ cool.
- Premature commitment (χ collapse ↑): move tokens from Plan → Glue; lower top-p₀ (more conservative outline).
- Contradictions (χ contra ↑): shift 5–10% of S from Plan → Trace; add unit checks; lower T₁ a notch.
4.6 Control laws: how S interacts with T and K
- S ↔ T (decoding heat): raising S widens the safe T band. With S=0, viable T₁ often sits in a narrow window; with S=2%, the warm pass tolerates T₁∈[0.7,0.9] without χ spikes.
- S ↔ K (capacity–diversity): as K rises (more concurrent task variety per capacity), S should bias toward Trace (redundancy) to maintain CSA.
- Flat vs. sous-vide: under flat T, S must be higher to stay creamy; sous-vide allows S=1–2% to suffice for many workloads.
4.7 Latency & overhead budgets
- Scaffold parsing + critics should add ≤4% median latency.
- Token overhead: ≤3% of context.
- Localized repair: at most 1 retry on the failing segment; never restart the whole sample unless all critics fail.
4.8 Implementation: attachable ESI-Prompt module (pseudocode)
class ESIPrompt:
    def __init__(self, S_percent=0.02, temps=(0.3, 0.8, 0.2), top_p=(0.9, 0.95, 0.8)):
        self.S = S_percent; self.temps = temps; self.top_p = top_p

    def pre_assess(self, task_text):
        # Volatility score v in [0, 1] from cheap lexical cues.
        clip = lambda x, lo, hi: max(lo, min(hi, x))
        v_len = clip(len(task_text) / 2000, 0, 1)
        v_sym = 1.0 if any(u in task_text for u in ["kg", "m", "SQL", "∑", "unit"]) else 0.4
        v_amb = 0.6 if "maybe" in task_text or "approximately" in task_text else 0.2
        v_hist = 0.5  # load from telemetry if available
        return 0.35*v_len + 0.25*v_sym + 0.25*v_amb + 0.15*v_hist

    def starch_budget(self, ctx_len, v):
        S = 0.01 if v <= 0.25 else (0.02 if v <= 0.6 else 0.03)
        return min(int(S * ctx_len), 120)

    def scaffold(self, task_text, domain="generic"):
        one_or_two_lines = lambda s: " ".join(s.split())[:300]  # compress spec to one line
        base = (
            "[Given] " + one_or_two_lines(task_text) + "\n"
            "[Plan] goals→steps→tools\n"
            "[Compute] concise execution\n"
            "[Checks] units; constraints; contradictions\n"
            "[Trace] tool=name, id=...\n"
            "[Answer] final after checks\n"
        )
        return base[:120 * 4]  # crude char-level stand-in for a 120-token budget

    def schedule(self):
        return dict(T0=self.temps[0], T1=self.temps[1], T2=self.temps[2],
                    P0=self.top_p[0], P1=self.top_p[1], P2=self.top_p[2])
Hook into your decoding loop (outline/draft/verify) and critics (units/NLI/trace). Log χ components per pass.
4.9 Checklists (ops-ready)
- Before run: compute v → set S; choose domain template; set sous-vide schedule.
- During run: enforce field order; never emit [Answer] before [Checks] pass.
- After run: compute CSA@k; if below threshold, local repair only; update telemetry (χ components, critic drift, latency).
Where we’re headed next
Part 5 will formalize the clump order parameter χ, give concrete estimators for each term (entropy, loop, contradiction), and specify the Sauce Diagram estimation procedure (grid design, binodal fitting, decision rules).
Part 5 — The Clump Order Parameter and the Sauce Diagram (formal, estimable, reproducible)
Objective. Turn “curdling” into a single, unitless, online-estimable scalar χ whose low values characterize the creamy regime. Then specify how to estimate the binodal (phase boundary) over (T, S, K) and the decision rules that ESI uses at run time.
5.1 Definition & normalization of χ
Generation produces a token distribution stream p_t with per-step entropy H_t = −Σ_v p_t(v) log p_t(v). Let H_t^max be the effective max entropy under the current sampler (nucleus/top-p): H_t^max = log |V_t|, with V_t the nucleus support at step t; normalized entropy h̄_t = H_t / H_t^max.
We measure over the work interval W (outline+draft tokens before [Answer]) to avoid penalizing the cooled verify pass.
- Entropy-collapse term: Collapse = max(0, 1 − mean_{t∈W} h̄_t). Low average normalized entropy → early over-commitment.
- LoopRate. Fraction of windows flagged as looping (see §5.2).
- ContradictionRate. Fraction of checks failing against [Given], invariants, or tool traces.
χ = α·Collapse + β·LoopRate + γ·ContradictionRate.
Recommended weights (default): α = 0.5, β = 0.3, γ = 0.2.
Commit threshold: χ_commit ≈ 0.35 (tune per domain).
Phase labeling threshold (binodal): χ* ≈ χ_commit (used with CSA, §5.3).
5.2 Operational estimators (online, low-overhead)
(A) Entropy-collapse estimator
- Collect H_t from logits (no extra model calls).
- Debias punctuation: ignore tokens with type ∈ {space, punctuation}—or downweight them.
- Windowing: compute over W = the first N content tokens (e.g., N = 128) or until the first appearance of [Answer]; a numpy sketch follows below.
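A minimal estimator for the entropy term, assuming per-step logits are available (the nucleus construction is the standard smallest-covering-set rule; numpy only):

import numpy as np

def normalized_entropy(logits, top_p=0.95):
    """h̄_t = H_t / log|V_t| over the nucleus (top-p) support at one step."""
    p = np.exp(logits - np.max(logits)); p /= p.sum()
    order = np.argsort(p)[::-1]
    cut = int(np.searchsorted(np.cumsum(p[order]), top_p)) + 1  # smallest covering set
    q = p[order][:cut]; q /= q.sum()
    H = float(-(q * np.log(q)).sum())
    return H / np.log(len(q)) if len(q) > 1 else 0.0  # single-token nucleus = collapsed

def entropy_collapse(step_logits, top_p=0.95, N=128):
    """χ entropy term: deficit of mean normalized entropy over the first N steps."""
    hbar = [normalized_entropy(l, top_p) for l in step_logits[:N]]
    return max(0.0, 1.0 - float(np.mean(hbar))) if hbar else 0.0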
(B) LoopRate (string + semantic)
- String loops (n-gram / self-BLEU).
  - Sliding window of w tokens; compute the max n-gram repeat ratio over small n.
  - Self-BLEU on adjacent windows; flag if repeat ratio ≥ 0.25 or self-BLEU ≥ 0.85.
- Semantic cycles (embedding periodogram).
  - Maintain rolling embeddings (mean-pooled last-hidden states).
  - Autocorrelation of the embedding series over lags τ.
  - Flag if a dominant period τ* has autocorrelation above a preset threshold.
LoopRate = fraction of windows flagged by either detector; a sketch follows below.
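A minimal sketch of the two detectors (window sizes, lags, and the 0.8 autocorrelation cutoff are illustrative assumptions; the 0.25 repeat-ratio flag is from the text):

import numpy as np

def ngram_repeat_ratio(tokens, n=3):
    """Share of n-grams in the window that are duplicates."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def semantic_cycle_flag(embs, max_lag=16, cutoff=0.8):
    """Flag a dominant period in the centered embedding series via autocorrelation."""
    X = np.asarray(embs) - np.mean(embs, axis=0)
    power = np.mean(np.sum(X * X, axis=1)) + 1e-9  # lag-0 normalizer
    sims = [float(np.mean(np.sum(X[t:] * X[:-t], axis=1)) / power)
            for t in range(1, min(max_lag, len(X) - 1))]
    return bool(sims) and max(sims) > cutoff

def loop_flag(tokens, embs):
    return ngram_repeat_ratio(tokens) >= 0.25 or semantic_cycle_flag(embs)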
(C) ContradictionRate (NLI/constraints/tools)
- NLI between [Given] claims and draft claims: flag contradiction or unknown above confidence 0.6 (the default nli critic threshold, §7.4).
- Constraint/unit checks: symbolic validators over units/dimensions; hard boolean.
- Trace alignment: each final quantitative claim must have a matching [Trace] ID; else flag.
ContradictionRate = (#flags) / (#checks issued).
(D) Confidence & smoothing
- Compute Wilson intervals for LoopRate/ContradictionRate (binary proportions); keep a 3-run EWMA of χ to stabilize labels during sweeps.
5.3 The Sauce Diagram: grid design & binodal estimation
We chart phase over the axes (T, S, K):
- T: decoding heat = (temperature, top-p). We scan the warm-pass temperature and hold top-p to a small grid.
- S: starch fraction = % structural tokens, or % adapter params when training.
- K: capacity–diversity ratio. Definition: K = C_eff / V_task, where C_eff (“effective capacity”) = available tokens per sample (or per agent) and V_task (“task variety”) = number of task types/tools concurrently engaged.
Grid
- T₁ ∈ {0.6, 0.7, 0.8, 0.9} × top-p₁ ∈ {0.90, 0.95} (for example)
- S ∈ {0%, 1%, 2%, 3%}
- K ∈ {low, mid, high} by controlling (context cap, tool count, task mix).
At each grid point run M tasks per domain (M ≥ 30 recommended for usable intervals), record:
- χ (and its three components),
- CSA@k with k = 3 critics,
- Task success, latency, trace redundancy.
Labeling creamy vs. clumpy
A run is creamy if all hold:
- χ ≤ χ* (≈ 0.35)
- CSA@3 ≥ 0.67
- Success ≥ baseline (no worse than S=0% at best T)
- Latency overhead ≤ 8% (guard against pathological critics)
Fitting the binodal
- Logistic surface: P(creamy | T, S, K) = σ(β₀ + β₁T + β₂S + β₃K + interaction terms); a fitting sketch follows below.
- Refine with an RBF-SVM on (T, S, K) for mild nonlinearity.
- Extract the 0.5 iso-surface as the binodal; compute connected components; the largest is the creamy plateau.
- Bootstrap (resample tasks) to attach uncertainty bands to the surface.
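A minimal fitting sketch with scikit-learn (the toy rows echo the §5.7 table; the ordinal encoding of K is an assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per grid cell: [T1, S, K_ordinal]; label 1 = creamy, 0 = clumpy.
X = np.array([[0.8, 0.00, 1], [0.8, 0.02, 1], [0.7, 0.02, 1], [0.9, 0.03, 1]])
y = np.array([0, 1, 1, 0])
clf = LogisticRegression().fit(X, y)

# Binodal = points where P(creamy) crosses 0.5, scanned here along T at S=2%, K=mid.
T_grid = np.linspace(0.6, 0.9, 31)
P = clf.predict_proba(np.column_stack(
        [T_grid, np.full_like(T_grid, 0.02), np.full_like(T_grid, 1)]))[:, 1]
print("boundary near T1 ≈", T_grid[np.argmin(np.abs(P - 0.5))])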
5.4 Decision rules (what ESI does online)
A) Commit gate
- Commit only if CSA@3 ≥ 0.67 and χ ≤ χ_commit during the verify pass.
- If the gate fails: perform localized repair on only the segment that failed a specific critic; never restart the entire generation unless all critics fail simultaneously.
B) Auto-Sous-Vide (adaptive heat)
Let barH be the average normalized entropy h̄ in the draft.
if barH < 0.55:    # too collapsed
    raise T1 by +0.05, clamp <= 0.90
elif barH > 0.80:  # too diffuse
    lower T1 by -0.05, clamp >= 0.60
keep T0=0.3, T2=0.2 unless CSA dips → then reduce T2 by -0.05
C) Auto-Starch (adaptive S)
- If LoopRate↑ with normal entropy, move +0.5% of S into [Plan] (more diversification).
- If ContradictionRate↑, move +0.5% of S into [Trace] (more redundancy) and tighten unit checks.
- Cap S at 3% for inference; beyond that you risk template bias.
- If failures persist at S=3%, switch to S-adapters (LoRA/IA³) at +0.5% increments (cap 3% trainable).
D) Tool-use stabilization rule
- If two successive tool calls disagree on invariants (e.g., currency converted twice), cool early: set T₁ := max(0.6, T₁ − 0.1) for the remaining draft; force a trace check before [Answer]. A combined controller sketch follows below.
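A combined sketch of rules B–D as one controller step (the state-dict layout is ours; the thresholds are from the rules above):

def esi_online_control(state):
    """One adaptation step over dict state: h_bar, loop, contra, csa, T1, T2, S."""
    if state["h_bar"] < 0.55:                            # B: too collapsed
        state["T1"] = min(0.90, state["T1"] + 0.05)
    elif state["h_bar"] > 0.80:                          # B: too diffuse
        state["T1"] = max(0.60, state["T1"] - 0.05)
    if state["loop"] > 0 and state["h_bar"] >= 0.55:     # C: loops, entropy normal
        state["S"] = min(0.03, state["S"] + 0.005)       # +0.5% of S toward [Plan]
    if state["contra"] > 0:                              # C: contradictions
        state["S"] = min(0.03, state["S"] + 0.005)       # +0.5% of S toward [Trace]
    if state.get("tool_invariant_conflict"):             # D: cool early
        state["T1"] = max(0.60, state["T1"] - 0.10)
    if state["csa"] < 0.67:                              # cool the verify pass
        state["T2"] = max(0.10, state["T2"] - 0.05)
    return state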
5.5 Visualizations & reporting (operators’ dashboard)
- Phase plot slices: for each fixed K, plot T × S with cells colored by χ; overlay CSA contours and label creamy cells.
- χ-stack chart: show the contributions (collapse, loop, contradiction) to steer fixes.
- CSA invariance plot: CSA vs. semantic edit distance of critics (checks frame stability).
- Latency bars: total overhead breakdown (scaffold, critics, retries).
5.6 Bench harness (pseudocode)
def measure_point(model, critics, tasks, T, top_p, S_percent, K_config):
    stats = []
    for task in tasks:
        # Build scaffold within budget S
        scaf = ESIPrompt(S_percent, temps=(0.3, T, 0.2),
                         top_p=(0.9, top_p, 0.8))
        outline, draft, final, traces = run_esi(model, critics, scaf, task)
        chi = compute_chi(draft_logits=traces['draft_logits'],
                          text=draft,
                          given=task.given,
                          traces=traces)
        csa = csa_at_k([c(final) for c in critics])
        stats.append(dict(chi=chi, csa=csa, success=task.grade(final),
                          latency=traces['latency']))
    return aggregate(stats)  # medians, Wilson CIs, etc.
5.7 Example (toy numbers)
On a reasoning set (R2–R4), K = mid:
| T₁ | S | top-p | median χ | CSA@3 | Success | Verdict |
|---|---|---|---|---|---|---|
| 0.8 | 0% | 0.95 | 0.51 | 0.58 | 0.63 | clumpy |
| 0.8 | 2% | 0.95 | 0.34 | 0.74 | 0.74 | creamy |
| 0.7 | 2% | 0.90 | 0.36 | 0.73 | 0.72 | creamy |
| 0.9 | 3% | 0.95 | 0.42 | 0.69 | 0.73 | boundary |
Fitting the binodal yields a plateau centered near T₁ ≈ 0.7–0.8, S = 2% for mid K.
5.8 Checklists (to make your runs reproducible)
- Pre-register the χ weights (α, β, γ), CSA threshold, and critics.
- Fix the (T, S, K) grids, seeds, and M per cell.
- Log per-pass entropy traces and critic outputs.
- Keep scaffolds versioned; report S as % of context and absolute tokens.
- Release binodal fit code and bootstrap scripts.
Coming next
Part 6 explains why starch works using a free-energy lens, connects the observer math to the χ components, and gives the proof sketches in practical terms (no heavy measure theory in the main text).
Part 6 — Why “Starch” Works (free-energy lens, information geometry, and observer alignment)
Goal. Turn the intuition (“a little structure + careful heat prevents curdling”) into mathematics you can implement and test. We (i) write a free-energy-like objective for decoding with scaffold s, (ii) show how bounded structure reshapes gradients and mutual information to suppress loop/collapse channels, and (iii) connect these effects to internal collapse (latching) and cross-observer agreement (compatibility + redundancy) from the orthodox-QM observer model.
6.1 A free-energy view of decoding with structure
Let x be the raw task context, s the scaffold (our S-tokens), and y the output. Define
F(y; x, s) = E[ℒ(y; x, s)] − τ · H(y | x, s),
with temperature-like τ tied to decoding T / top-p. The loss ℒ encodes task constraints (units, invariants, tool-alignment), while H is the model’s next-token entropy along the sample path. Minimizing F captures the empirical trade-off: respect constraints (energy) without prematurely narrowing the distribution (entropy).
Role of starch. The scaffold is bounded (1–3% tokens) but structured (slots/tags). Two effects follow:
- Early-step dispersion. H increases in outline/draft (diversifies microstates) while still making macro constraints addressable.
- Late-step anchoring. In verify, critics project onto low-dimensional checks; entropy cools and constraints dominate.
This is the sauce geometry from Part 5: starch widens the creamy region where χ stays below χ*.
6.2 Gradient damping: bounded structure lowers premature collapse pressure
Write the per-step free-energy gradient w.r.t. the unstructured features u of the context (everything not in s): g_t = ∇_u F_t.
Lemma (operational). If the model’s token logits are locally Lipschitz in context embeddings and the scaffold occupies fixed slots that expose constraints as separable features, then during the early work interval W the gradient norm ‖g_t‖ with S > 0 is bounded by ‖g_t‖ with S = 0,
i.e., bounded starch contracts the early gradient norm on unstructured channels, reducing the chance of premature low-entropy collapse (the first term in χ).
Sketch. Slotting constraints into explicit fields linearizes their contribution to ℒ and lifts degeneracy in the entropy landscape: H grows where [Plan] opens alternatives, while ∇ℒ shifts to slot-aligned directions; by Lipschitzness, the cross-terms shrink. Net: early steps are less likely to fall into loop attractors; later verify steps re-focus via critics.
6.3 Mutual-information rebalancing: starch distributes influence across stable slots
Let Z = (Z₁, …, Z_m) be latent features driving next-token probabilities. Without structure, a few unstable channels can dominate the mutual information I(Z; Y), making the sampler chase low-entropy ruts (loops). With scaffold s, we encourage a factorized view: I(Z; Y) ≈ Σ_j I(Z_j; Y), with each slot binding one feature group.
This phase-free aggregation echoes the “collapse without alignment” result: macro-observables built by additive/projective operations survive coarse-graining even when micro-semantics disagree. In other words, starch makes the important parts additive and robust, so misaligned details wash out while invariants persist.
Implication for χ. Distributing influence across slots lowers LoopRate and ContradictionRate (second/third terms) because the sampler no longer over-weights one brittle cue; verification reduces to checking additive traces (IDs, units), which are collapse-stable.
6.4 Observer-theoretic alignment: why CSA rises with starch
The formal observer model proves internal collapse (latching) and cross-observer agreement when (i) effects commute, (ii) a frame map exists, and (iii) an outcome is in a redundant trace (SBS).
- Scaffold ⇒ measurability & latching. Fixed fields make the policy π measurable and the written Answer a fixed point under conditional expectation—exactly the paper’s “delta-certainty”/latching theorem.
- Glue tags ⇒ compatibility. Unit/constraint checks become approximately commuting effects because they operate on disjoint, explicitly tagged subspaces; order-sensitivity δ drops, enabling AB-fixedness.
- Trace hooks ⇒ SBS. Tool IDs, hashes, and row-counts serve as redundant fragments; as redundancy increases, consensus probability among critics → 1. Thus CSA@k tracks objectivity.
The collapse-frame geometry (Collapse-Lorentz invariance) predicts CSA invariance under isometric refactors of checks; in practice, if you reword critics but preserve their “semantic distance,” CSA should remain stable—our dashboard test from Part 5.
6.5 Emulsion math in one page: how S shifts the binodal
Let φ = χ − χ* be our order parameter. Around a working point, a cubic Landau expansion captures the phase boundary:
f(φ) ≈ a(T, S, K)·φ² + b·φ³ + O(φ⁴).
Empirically (Part 5), small positive S widens the stable range in T and tolerates higher K (more diversity) at the same χ. Intuition: starch adds low-cost constraints (soft Lagrange structure) that flatten sharp curvatures in f, moving the spinodal outwards—hence a larger creamy plateau.
6.6 Proof-sketches you can code-check
Proposition A (early-entropy lift). With a scaffold consisting of m slot tags that each admit several admissible micro-instantiations, the outline/draft conditional entropy satisfies
H(Y_W | x, s) ≥ H(Y_W | x) + Δ(m) − ε,
for some gain Δ(m) > 0 depending on slot usage, and small slack ε from tokenization bias. Check: ablate tags and measure H on the first 128 tokens.
Proposition B (loop suppression). If the sampler’s loop attractor requires repeating a specific n-gram without violating constraints, then adding a constraint that touches any token within that n-gram reduces the attractor’s basin measure by at least a factor proportional to q_min, where q_min is the minimum violation probability under random slot fillings. Check: log self-BLEU with/without a constraint around the repetend.
Theorem-style Mapping (CSA). Under the paper’s conditions (measurable policy; commuting effects; SBS redundancy), AB-fixedness holds and is invariant under collapse-frame isometries; our CSA@k is a computable proxy for this property in text systems. See: existence/latching, AB-fixedness, SBS, and collapse-frame invariance.
6.7 Edge cases (when starch can fail)
- Over-starching (S > 3%). Template bias rises; token overhead inflates but ContradictionRate need not fall; χ can even increase.
- Non-commuting critics. If a repair policy lets critic A rewrite fields that critic B depends on, δ_AB rises operationally; CSA drops despite redundancy. Ensure localized repairs and immutable traces.
- Coherent “macro-phase” tasks. Rare workloads require phase-sensitive composition (e.g., symbolic derivations tightly coupled across steps). Our additive/trace approach still helps, but the plateau narrows; consider slight adapter increases to learn task-specific invariants.
6.8 What to measure (make the theory falsifiable)
- Entropy lift from the scaffold on outline/draft.
- LoopRate vs. presence/absence of slot-touching constraints.
- CSA invariance under critic rewordings with matched “semantic distance.”
- Binodal shift: creamy-area growth as S moves 0% → 1–3% at fixed (T, K).
If these move as predicted, the free-energy and observer-alignment account is supported; if not, adjust slot design (increase trace redundancy first).
Takeaway
Starch works because it (i) spreads probability mass early (entropy lift), (ii) re-projects meaning onto additive invariants that survive coarse-graining, and (iii) aligns verification with the conditions for objective agreement (commutation + redundancy). That’s the full stack from micro-logits to cross-observer consensus—no metaphysics required.
Next: Part 7 — Algorithms: ESI-Decode and ESI-Adapt as concrete procedures (inputs/outputs, failure-localized retries, knobs, defaults), plus short code you can drop into your inference server and training loop.
Part 7 — Algorithms (ESI-Decode & ESI-Adapt, single/multi-agent, with production-grade specs)
Objective. Turn ESI into drop-in procedures you can operate today. We define data structures, state machines, critics’ APIs, failure-localized repair, and safe defaults. Then we provide ready-to-run Python stubs and a declarative YAML config.
7.1 ESI-Decode (single-model inference)
7.1.1 Interfaces
Task payload
@dataclass
class ESITask:
    id: str
    content: str    # user request
    brief: str      # 1–2 line summary (auto-computed or provided)
    domain: str     # "math" | "code" | "tools" | "robotics" | ...
    ctx_limit: int  # model context window
Critic result
@dataclass
class CriticVerdict:
    name: str
    passed: bool
    details: dict   # e.g., unit diffs, NLI scores, trace mismatches
Run artifacts
@dataclass
class ESIRun:
    outline: str
    draft: str
    final: str
    chi: float
    chi_components: dict  # {"entropy":..., "loop":..., "contradiction":...}
    csa: float
    critic_verdicts: list[CriticVerdict]
    traces: dict          # {"tool_calls":[...], "draft_logits":..., "latency_ms":...}
7.1.2 State machine
States: PREP → OUTLINE → DRAFT → VERIFY → (REPAIR?) → COMMIT | FAIL
Transitions
- PREP: compute volatility v, allocate starch S, build scaffold.
- OUTLINE: decode with (T₀ = 0.3, top-p₀ = 0.9) → short bullets + tool plan.
- DRAFT: decode with (T₁ ≈ 0.75–0.85, top-p₁ = 0.95) → full draft; keep S-tokens.
- VERIFY: decode with (T₂ = 0.2, top-p₂ = 0.8) → self-checks & minimal corrections.
- Run critics on the verify output; compute CSA@k and χ.
- If CSA ≥ 0.67 and χ ≤ χ_commit (≈ 0.35) → COMMIT.
- Else REPAIR: regenerate only the failing segment with T₂ (cool), at most once.
- If still failing → FAIL with diagnostic.
7.1.3 Localized repair protocol
- Granularity: smallest enclosing block among [Plan], a step label S1..S3, or the paragraph containing the failing claim.
- Constraints: the repair prompt must not rewrite [Given] or erase [Trace] IDs already written.
- Critic-bound repair: include the failing critic’s name and diff in the repair prompt; prohibit freeform rewrites.
7.1.4 Critics (default set and APIs)
- Unit/Constraint checker O_unit(y) → CriticVerdict: parses equations/units, verifies dimensional consistency and hard constraints.
- Contradiction/NLI O_nli(y, given) → CriticVerdict: flags contradictions or unsupported claims w.r.t. [Given].
- Trace auditor O_trace(y, logs) → CriticVerdict: ensures every quantitative claim is backed by a [Trace] entry (tool=name, id=…, hash=…).
Approximate commutation test (periodic): evaluate δ_ij = P(O_i∘O_j ≠ O_j∘O_i) on held-out drafts; alert if δ rises.
7.1.5 Online χ estimator (recap for hooking)
- Entropy term: average normalized token entropy over the work interval (outline+draft).
- Loop term: combined n-gram/self-BLEU + embedding autocorrelation periodogram.
- Contradiction term: (#critic flags) / (#checks).
7.1.6 Privacy & ops constraints
- Redact secrets before logging scaffolds; hash tool outputs.
- Enforce ≤1 localized repair; cap latency overhead at ≤8%; otherwise short-circuit with a conservative fallback (T=0.3, top-p=0.8, S=1%).
7.2 ESI-Decode-MA (multi-agent debate with commuting checks)
Goal. Preserve the creamy regime while letting agents explore diverse approaches.
7.2.1 Topology and schedule
- Agents: all agents share the same scaffold but may differ in tool preferences or prompts.
- Rounds: R0 (outline in parallel, cool) → R1 (draft, warm) → R2 (verify, cool).
- Commuting critics: a global critic set evaluates each agent’s R2 result independently.
7.2.2 Aggregation (CSA-first, score-second)
- Compute CSA@k per agent; retain those with CSA ≥ τ_csa (default 0.67).
- Among the remainder, pick lowest χ; tie-break with a task success proxy (domain-specific score, e.g., unit precision).
- Record redundant traces; if two agents disagree but both pass CSA, attempt a frame isometry: check whether the claims are equivalent under unit/base conversion (see the sketch below); if yes, merge; else prefer lower χ.
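A minimal sketch of the frame-isometry check for numeric claims (the toy unit table and tolerance are illustrative assumptions):

# Hypothetical helper: are two numeric claims equal up to a unit/base conversion?
TO_BASE = {"m": 1.0, "cm": 0.01, "km": 1000.0, "EUR": 1.0, "cent": 0.01}

def frame_isometric(value_a, unit_a, value_b, unit_b, rel_tol=1e-3):
    """True if both claims denote the same quantity in base units."""
    if unit_a not in TO_BASE or unit_b not in TO_BASE:
        return False
    a, b = value_a * TO_BASE[unit_a], value_b * TO_BASE[unit_b]
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)

print(frame_isometric(6960, "cent", 69.6, "EUR"))  # True: same claim, different frame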
7.2.3 Interaction budget
- Hard cap on total tokens per agent; allocate S=1–2% per agent (do not multiply S by #agents; overall S stays bounded).
- If disagreements spike, cool the warm pass for all agents (T₁ := max(0.6, T₁ − 0.05)).
7.3 ESI-Adapt (training/tuning)
7.3.1 Adapter insertion
- Choose LoRA/IA³ so that the trainable ratio r = 1–3%.
- Instrument attention (Q, K, V, O) and the MLP downproj first; only widen to embeddings if domain slang is heavy.
7.3.2 Heat & diversity schedules
- LR: warm-up → cosine decay; optional KL warm-start if RLHF.
- K-curriculum: start with low diversity (few task types/tools), ramp to mid/high over epochs; monitor χ & CSA each eval step.
7.3.3 χ/CSA-guided control
- If χ_collapse↑: slightly increase S-adapters (+0.5% r) or cool sampling in teacher forcing.
- If χ_loop↑: add slot-touching constraints in data (more glue tags); keep r fixed.
- If CSA↓ without χ change: diversify critics (data-side), not model parameters.
7.3.4 Stop/Freeze
- Stop when χ plateaus and CSA stabilizes on validation (two consecutive sweeps).
- Export: backbone + adapters + default ESI schedule + critic manifests.
7.4 Config (declarative YAML)
esi:
  inference:
    S_percent: 0.02        # 1–3% recommended
    temps: [0.3, 0.8, 0.2]
    top_p: [0.9, 0.95, 0.8]
    max_local_repair: 1
    chi_commit: 0.35
    csa_threshold: 0.67
    critics:
      - name: units
        type: units
        config: {system: "SI", tolerance: 1e-6}
      - name: nli
        type: nli
        config: {threshold: 0.6}
      - name: trace
        type: trace
        config: {require_ids: true}
  multi_agent:
    enabled: false
    num_agents: 3
    aggregator: "csa_then_chi"
  training:
    adapters:
      method: lora
      r_percent: 0.02
      targets: ["attn.qkv", "attn.out", "mlp.down"]
    schedule:
      lr: {warmup_steps: 200, type: cosine}
      K_curriculum: ["low", "mid", "high"]
7.5 Minimal Python reference (drop-in stubs)
Works with any HF-style .generate; critics are stubbed, replace them with your implementations.
from dataclasses import dataclass
import time
# Assumes the ESITask/ESIRun dataclasses from §7.1.1 are in scope.

# ---- critics ---------------------------------------------------------------
def O_units(text: str) -> dict:
    # TODO: plug a real unit checker
    return {"name": "units", "pass": True, "details": {}}

def O_nli(text: str, given: str) -> dict:
    # TODO: call an NLI model; here a stub
    return {"name": "nli", "pass": True, "details": {"score": 0.9}}

def O_trace(text: str) -> dict:
    has_id = "[Trace]" in text or "id=" in text
    return {"name": "trace", "pass": has_id, "details": {"present": has_id}}

# ---- χ estimators ----------------------------------------------------------
def entropy_collapse(draft_logits) -> float:
    # expected normalized entropy deficit in (0..1); stub
    return 0.25

def loop_rate(draft_text: str) -> float:
    # very rough n-gram detector; replace with self-BLEU + embedding autocorr
    toks = draft_text.split()
    repeats = sum(1 for i in range(len(toks) - 3) if toks[i] == toks[i + 2])
    return min(1.0, repeats / max(1, len(toks)))

def contradiction_rate(given: str, text: str, nli_pass: bool) -> float:
    return 0.0 if nli_pass else 1.0

def compute_chi(draft_logits, draft_text, given, critics):
    alpha, beta, gamma = 0.5, 0.3, 0.2
    e_term = entropy_collapse(draft_logits)  # 0..1
    l_term = loop_rate(draft_text)           # 0..1
    nli_pass = [c for c in critics if c["name"] == "nli"][0]["pass"]
    c_term = contradiction_rate(given, draft_text, nli_pass)
    return alpha*e_term + beta*l_term + gamma*c_term, \
           {"entropy": e_term, "loop": l_term, "contradiction": c_term}

def csa_at_k(verdicts: list) -> float:
    # verdicts: [{"name":..., "pass":bool}, ...]
    k = len(verdicts); pairs = 0; agree = 0
    for i in range(k):
        for j in range(i + 1, k):
            pairs += 1
            agree += 1 if verdicts[i]["pass"] == verdicts[j]["pass"] else 0
    return (agree / pairs) if pairs else 1.0

# ---- scaffold / schedule ---------------------------------------------------
def build_scaffold(task_text: str) -> str:
    def oneline(s): return " ".join(s.split())[:300]
    return (
        f"[Given] {oneline(task_text)}\n"
        "[Plan] goals→steps→tools\n"
        "[Compute] concise execution\n"
        "[Checks] units; constraints; contradictions\n"
        "[Trace] tool=name, id=...\n"
        "[Answer] final after checks"
    )

# ---- main ESI engine -------------------------------------------------------
class ESIEngine:
    def __init__(self, model, S=0.02, temps=(0.3, 0.8, 0.2), top_p=(0.9, 0.95, 0.8)):
        self.model = model; self.S = S; self.temps = temps; self.top_p = top_p

    def generate(self, prompt, temperature, top_p, max_tokens):
        return self.model.generate(prompt, temperature=temperature,
                                   top_p=top_p, max_tokens=max_tokens)

    def run(self, task):
        t0 = time.time()
        scaffold = build_scaffold(task.content)
        budget = min(int(self.S * task.ctx_limit), 120)
        s_tokens = scaffold[:budget]  # crude char-level stand-in for a token budget
        base_prompt = f"{s_tokens}\n\n[Task]\n{task.content}"
        # Outline (cool)
        outline = self.generate(base_prompt, self.temps[0], self.top_p[0], 256)
        # Draft (warm)
        draft_prompt = base_prompt + "\n\n[Outline]\n" + outline
        draft = self.generate(draft_prompt, self.temps[1], self.top_p[1], 1024)
        draft_logits = None  # plug logits if available
        # Verify (cool)
        verify_prompt = draft_prompt + "\n\n[Draft]\n" + draft + \
                        "\n\n[Verify]\nRun checks and correct locally."
        final = self.generate(verify_prompt, self.temps[2], self.top_p[2], 512)
        # Critics
        verdicts = [O_units(final), O_nli(final, task.content), O_trace(final)]
        chi, chi_comp = compute_chi(draft_logits, draft, task.content, verdicts)
        csa = csa_at_k(verdicts)
        # Local repair if needed
        if not (csa >= 0.67 and chi <= 0.35):
            repair_prompt = draft + "\n\n[Fix]\nAddress only failed checks."
            final2 = self.generate(repair_prompt, self.temps[2], self.top_p[2], 384)
            verdicts2 = [O_units(final2), O_nli(final2, task.content), O_trace(final2)]
            chi2, chi_comp2 = compute_chi(draft_logits, draft, task.content, verdicts2)
            csa2 = csa_at_k(verdicts2)
            if csa2 >= 0.67 and chi2 <= 0.35:
                final, verdicts, chi, chi_comp, csa = final2, verdicts2, chi2, chi_comp2, csa2
        traces = {"latency_ms": int(1000 * (time.time() - t0))}
        return ESIRun(outline, draft, final, chi, chi_comp, csa, verdicts, traces)
Swap-in points: replace O_* with your real validators; wire draft_logits if your backend exposes logits for entropy.
7.6 Repair prompt templates (by failure mode)
- Units fail
[Fix]
Your [Checks] reported a unit inconsistency. Keep [Given] and [Trace] unchanged.
Adjust only the minimal step causing the mismatch and recompute that step.
Emit only the corrected [Compute] segment and a revised [Answer].
- Contradiction fail
[Fix]
NLI flagged a contradiction with [Given]: <short diff>.
Resolve by changing at most one assumption in [Compute] OR by narrowing the claim.
Do not alter [Given]. Output corrected [Compute] and [Answer] only.
- Trace fail
[Fix]
A claim lacks a [Trace] id. Add one tool call or cite the existing id; do not modify any other content.
Return the minimal delta and a revised [Answer].
7.7 Multi-agent aggregator (CSA-then-χ)
def aggregate_agents(agent_runs: list[ESIRun]) -> ESIRun:
    # keep CSA-qualified runs
    ok = [r for r in agent_runs if r.csa >= 0.67]
    if not ok:
        # fallback: pick best χ even if CSA is low (log a warning)
        ok = agent_runs
    # choose minimal χ
    return min(ok, key=lambda r: r.chi)
7.8 Minimal ops playbook
- Telemetry: log per-pass entropy traces, χ components, CSA, critic deltas, and token/time budgets.
- Guardrails: if CSA dips over a rolling window, temporarily reduce T₂ by 0.05 and shift 0.5% of S from Plan→Trace.
- Regression tests: keep a “golden” CSA-invariance suite—same checks reworded; alert if variance > 0.05.
- Safety: redact scaffolds containing PII; hash trace payloads; rate-limit retries.
7.9 What you can expect
- Median overhead ≤ 4% with the default critic trio and a one-local-repair budget.
- Best stability at S=2%, temps 0.3→0.8→0.2, top-p 0.9→0.95→0.8.
- For highly diverse K, keep S=2–3% and bias toward Trace redundancy.
Up next: Part 8 — Smoothness = Cross-Observer Agreement (CSA): concrete critic libraries, commuting-test harness, SBS-style trace schemas, and a spec for CSA dashboards & alerts.
Part 8 — Smoothness = Cross-Observer Agreement (CSA)
Goal. Specify CSA as an operational gate for “creamy” outputs, provide a critic library with commutation tests, define an SBS-style trace schema (redundant fragments), and ship an operator dashboard + alerting rules. This section grounds CSA in the orthodox observer formalism: fixedness by conditional expectation, agreement via commuting effects with redundant records; we implement computable proxies for these guarantees.
8.1 CSA recap & guarantees
- CSA@k. For k independent critics returning pass/fail verdicts v_i on a proposed answer y:
  CSA@k = (2 / (k(k−1))) · Σ_{i<j} 1[v_i = v_j].
  Gate: commit if CSA@3 ≥ 0.67 and χ ≤ χ_commit (Part 5).
- Why it’s principled. In the formal observer model, past records are fixed points under conditional expectation (internal collapse/latching); cross-observer agreement is guaranteed when the measurement effects commute within a common algebra. We mirror this with order-insensitive critics and immutable traces.
- Redundancy. Agreement becomes objective when outcomes are written into redundant fragments (tool logs, hashes)—the additive prototype: multiple, order-independent fragments sum to the same macro-claim.
8.2 Critic library (minimal, extensible)
Each critic implements a pure function O_i(draft, given, trace) -> CriticVerdict{passed, details}. Start with three orthogonal effects:
- Units/constraints — dimension checks, bounds, invariant satisfaction.
- Contradiction/NLI — contradictions vs. [Given], must-not clauses.
- Trace auditor — every quantitative claim is backed by a [Trace] fragment (tool id, rows, hash).
Domain add-ons (optional):
- Code: compile+tests; static lint.
- Tools/Agents: argument validator; cross-tool consistency.
- Citations: URL/ID resolver; quote-to-source alignment.
- Robotics: simulator no-violation check; kinematic feasibility.
Design rule: no critic mutates text; they only evaluate. This preserves the formal “effect” role and enables commuting tests.
8.3 Commutation (order-insensitivity) harness
We estimate pairwise order-sensitivity over a held-out set D:
δ_ij = (1/|D|) · Σ_{x∈D} 1[(O_i then O_j)(x) ≠ (O_j then O_i)(x)].
- Targets: keep all δ_ij small (e.g., ≤ 0.05). If higher, decouple critic inputs (e.g., ensure NLI reads [Given], not post-repair text) or split a monolithic critic into two.
- Batch test (stub):
def order_sensitivity(crit_i, crit_j, drafts):
    # Meaningful for stochastic or stateful critics; for pure deterministic
    # critics both orders coincide by construction.
    diffs = 0; N = len(drafts)
    for x in drafts:
        a = crit_i(x); b = crit_j(x)
        a_then_b = (a["pass"], b["pass"])
        b_then_a = (crit_j(x)["pass"], crit_i(x)["pass"])
        diffs += int(a_then_b != b_then_a)
    return diffs / max(1, N)
This approximates commuting projections; a low δ_ij is our computable stand-in for the algebraic compatibility underlying AB-fixedness.
8.4 SBS-style trace schema (redundant fragments)
Redundant, separable fragments implement “many copies of the same pointer” so multiple critics can agree without interfering. Store at least two independent fragments per claim.
Trace JSON (v1):
{
"run_id": "uuid",
"fragments": [
{
"kind": "tool",
"tool": "calculator",
"id": "calc#A913",
"input": "58*1.2",
"output": "69.6",
"hash": "blake3:9f..",
"ts": "2025-09-25T12:00:03Z"
},
{
"kind": "table_row",
"source": "SQL.customers",
"query_hash": "blake3:ce..",
"row_count": 128,
"ts": "2025-09-25T12:00:04Z"
}
],
"claims": [
{
"id": "claim#1",
"text": "Total is 69.6 EUR.",
"supports": ["calc#A913"]
}
],
"provenance": {"model":"X","temps":[0.3,0.8,0.2],"S":0.02}
}
Rules
- Fragments are append-only (immutability ~ latching).
- supports[] lists the fragment IDs justifying each claim (additivity).
8.5 CSA computation & redundancy budget
- Compute CSA on the verify pass output.
- If CSA < 0.67, attempt localized repair (Part 7) constrained by the failing critics.
- Redundancy budget: for K=low, 1–2 fragments/claim; for K=high, 2–3. More fragments raise consensus probability, just as additive tallies do for macro-observables.
8.6 Operator dashboard (what to plot)
- CSA over time (per domain); highlight dips.
- χ stack (entropy/loop/contradiction) to diagnose which term broke.
- δ_ij heatmap for critics (commutation drift).
- Redundancy index = mean fragments/claim; correlate with CSA.
- Frame-invariance check: CSA vs. “semantic edit distance” of reworded critics (should be flat if checks are isometric); a regression sketch follows below. Formal invariance: agreement preserved under collapse-frame transforms.
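A minimal sketch of the frame-invariance check as a slope test (toy telemetry; the 0.1 alert threshold is an assumption; a near-zero slope means CSA is stable under rewording):

import numpy as np

# Toy telemetry: semantic edit distance of each reworded critic vs. observed CSA.
edit_dist = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
csa_obs   = np.array([0.78, 0.77, 0.79, 0.76, 0.77])

slope, intercept = np.polyfit(edit_dist, csa_obs, 1)
if abs(slope) > 0.1:  # alert threshold; tune per deployment
    print(f"frame violation suspected: slope={slope:.3f}")
else:
    print(f"CSA approximately frame-invariant: slope={slope:.3f}")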
8.7 Alerts & auto-remediation
Trigger → Action
- CSA 7-run EMA < 0.67: lower T₂ by −0.05, shift +0.5% S Plan→Trace, re-run verify.
- δ_ij above its alert threshold for any pair: quarantine the noisier critic; split it into narrower checks.
- Trace deficit (fragments/claim < target): block commit; request tool replay.
- χ(collapse) spike: raise T₁ +0.05 or add a slot-touching constraint; keep T₂ fixed (avoid over-cooling repairs).
A minimal guard implementing the first rule is sketched below.
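A minimal sketch of the EMA guard (the 7-run window is from the rule; the knob names are our assumptions):

class CSAGuard:
    """Tracks a 7-run EMA of CSA and fires the first remediation from §8.7."""
    def __init__(self, threshold=0.67, window=7):
        self.alpha = 2 / (window + 1)  # standard EMA smoothing for a 7-run window
        self.threshold = threshold
        self.ema = None

    def update(self, csa, knobs):
        self.ema = csa if self.ema is None else self.alpha * csa + (1 - self.alpha) * self.ema
        if self.ema < self.threshold:
            knobs["T2"] = max(0.10, knobs["T2"] - 0.05)           # cool the verify pass
            knobs["S_trace"] = knobs.get("S_trace", 0.0) + 0.005  # +0.5% S Plan→Trace
            knobs["rerun_verify"] = True
        return knobs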
8.8 Worked example (numbers)
A tool task at the default operating point (S = 2%, sous-vide temps):
- Critics: units(pass), NLI(pass), trace(fail: missing ID) → CSA@3 = 0.33.
- Local repair adds [Trace] tool=calculator, id=calc#A913 and cites it.
- Second pass: all pass → CSA@3 = 1.0; χ ≤ χ_commit → commit.
This is AB-fixedness in practice: commuting checks + redundant record yield shared certainty.
8.9 Implementation snippets
Critic interface
@dataclass
class CriticVerdict:
    name: str; passed: bool; details: dict

def O_trace(text: str, trace_json: dict) -> CriticVerdict:
    missing = []
    for c in trace_json.get("claims", []):
        if not c.get("supports"): missing.append(c["id"])
    return CriticVerdict("trace", passed=(len(missing) == 0), details={"missing": missing})
CSA gate
def csa_at_k(verdicts: list[CriticVerdict]) -> float:
    k = len(verdicts); pairs = 0; agree = 0
    for i in range(k):
        for j in range(i + 1, k):
            pairs += 1; agree += int(verdicts[i].passed == verdicts[j].passed)
    return agree / pairs if pairs else 1.0
Order-sensitivity test (batch)
def commute_score(crit_i, crit_j, samples):
    return 1.0 - order_sensitivity(crit_i, crit_j, samples)  # closer to 1 is better
8.10 Checklists
Critic design
- Stateless; no text mutation; disjoint evidence.
- Clear pass criteria; fixed thresholds; versioned.
Trace design
- Two independent fragments/claim (minimum).
- Append-only; content-hashing; timestamped.
Governance
- Log CSA, χ, and δ_ij matrices; alert on drift.
- Redact PII; hash tool outputs; retain only fragment metadata when required.
Why this works
CSA implements the operator-algebra idea that agreement arises when compatible effects act on a shared record; additive, redundant traces provide the “pointer states” that multiple observers can read without disturbance. This is exactly the fixedness-and-agreement package the formal theory proves; ESI makes it a button you can ship.
Next: Part 9 — Applications (rewritten): tool use, code, long-form reasoning, multi-agent debate, robotics—each with S-token grammars, critic sets, sous-vide defaults, and failure playbooks.
Part 9 — Applications (rewritten with concrete scaffolds, critics, schedules, and failure playbooks)
Objective. Ship-ready recipes for five common workloads. For each: (i) S-token scaffold (≤120 tokens), (ii) critics set (commuting checks), (iii) sous-vide schedule (temps/top-p), (iv) K-profile (capacity–diversity assumptions), (v) failure playbook (how to correct spikes in χ’s components: entropy collapse, loops, contradictions).
9.1 Reliable Tool Use (search, calculator, DB, API chains)
9.1.1 Scaffold (≤ 90 tokens)
[Given] objective + inputs (1–2 lines)
[Plan] inputs→tools→invariants→outputs (1 line)
[Compute] call tools with minimal args; one tool per line
[Checks] tool arg validity; unit/constraint satisfaction; cross-tool consistency
[Trace] tool=name, id=..., rows=..., hash=...
[Answer] result + 1-line provenance
9.1.2 Critics (commuting trio + 1 domain check)
- O_args: validates tool arguments and required fields.
- O_units: unit/dimension + constraint satisfaction.
- O_trace: each claim has ≥1 supporting fragment; hashes consistent.
- O_xcons: cross-tool consistency (e.g., currency/base conversions); a sketch in the 8.9 interface follows.
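A minimal sketch of O_xcons using the CriticVerdict interface from 8.9. The claims/base trace schema is an assumption for illustration; the invariant checked is the one named in the playbook below ("all totals must share the same base").

def O_xcons(text: str, trace_json: dict) -> CriticVerdict:
    # Cross-tool consistency: every quantitative claim must share one base
    # (e.g., a single currency). The "base" field is an assumed schema.
    bases = {c["base"] for c in trace_json.get("claims", []) if "base" in c}
    return CriticVerdict("xcons", passed=(len(bases) <= 1),
                         details={"bases": sorted(bases)})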
9.1.3 Sous-vide schedule & K
- T: 0.3 → 0.8 → 0.2; top-p: 0.9 → 0.95 → 0.8.
- S: 2% (shift 5–10% of S to [Trace] if K is high).
- K: mid (2–3 tools); for high K (≥4 tools), lower T₁ to 0.75. A schedule helper sketch follows.
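A minimal helper for this recipe; the dict layout and function name are ours, the numbers are the defaults above.

TOOL_SCHEDULE = {
    "outline": {"temperature": 0.3, "top_p": 0.90},
    "draft":   {"temperature": 0.8, "top_p": 0.95},
    "verify":  {"temperature": 0.2, "top_p": 0.80},
}

def decode_params(phase: str, k_profile: str = "mid") -> dict:
    p = dict(TOOL_SCHEDULE[phase])
    if k_profile == "high" and phase == "draft":
        p["temperature"] = 0.75   # lower T1 when >= 4 tools (high K)
    return p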
9.1.4 Failure playbook
- LoopRate↑ (tool ping-pong): add an invariant key in [Plan] ("currency=EUR once"); cool the remaining draft (T₁ −0.05).
- ContradictionRate↑ (mismatched outputs): enforce [Trace] on each quantitative claim; add an O_xcons rule "all totals must share the same base".
- Entropy-collapse↑ (premature pick of wrong tool): move 0.5% of S from [Trace] → [Plan]; raise T₁ by +0.05.
9.2 Program Synthesis & Debugging
9.2.1 Scaffold (≤ 100 tokens)
[Given] signature, constraints, IO examples
[Plan] spec→tests→impl→fix (1 line)
[Compute] minimal code; no prints/logs
[Checks] compile; run tests; lint; no contradictions with [Given]
[Trace] test ids, coverage %, linter code
[Answer] single code block only
9.2.2 Critics
- O_compile: build succeeds.
- O_tests: hidden/visible tests pass; coverage ≥ target.
- O_lint: style/safety (no net-unsafe ops).
- O_trace: test IDs + coverage recorded.
9.2.3 Schedules & K
- T: 0.25 → 0.75 → 0.15 (tighter verify for determinism).
- S: 2% default (bump to 3% for tricky APIs).
- K: low→mid; if mixing languages (high K), raise redundancy in [Trace].
9.2.4 Failure playbook
- LoopRate↑ (repeated boilerplate): lower top-p₁ to 0.9; add a slot-touching constraint ("no duplicate helpers").
- ContradictionRate↑ (violates spec): strengthen O_tests with spec-derived property tests; in repair, allow changing one assumption only.
- Entropy-collapse↑ (fixates on wrong design): insert [Edgecase:] tags for tricky IO; +0.5% S to [Plan].
9.3 Long-Form Reasoning (math, proofs, analysis)
9.3.1 Scaffold (≤ 110 tokens)
[Given] problem + symbols + units (1–2 lines)
[Plan] lemmas→compute→verify (1 line)
[Compute] labeled steps S1..S3 with concise derivations
[Checks] unit consistency; lemma dependencies; contradiction with [Given]
[Trace] calc ids; table/constant sources
[Answer] final claim + unit/type + scope
9.3.2 Critics
- O_units: dimensional analysis, unit sanity.
- O_lemma: DAG of lemma dependencies is acyclic & satisfied.
- O_nli: contradictions vs. [Given].
- O_trace: calculator/constant table references present.
9.3.3 Schedules & K
- T: 0.3 → 0.8 → 0.2 (explore proofs; cool check).
- S: 2%; for olympiad/grad problems, 3% with more [Trace].
- K: low; raise to mid if mixing subfields; keep T₂ ≤ 0.2.
9.3.4 Failure playbook
- Loops (restating lemmas): add lemma IDs; cap steps at S1..S4; reduce top-p₁ to 0.92.
- Contradictions (sign/unit errors): run O_units before O_nli (order still commutes on pass/fail outcomes); in repair, regenerate only the step carrying the unit mismatch.
- Entropy-collapse (early flawed approach locked in): raise T₁ +0.05; ask for two outlines internally, choose the lower-χ one.
9.4 Multi-Agent Debate (research, planning, safety reviews)
9.4.1 Shared scaffold (≤ 100 tokens per agent)
[Given] question + constraints (1–2 lines)
[Plan] perspective→steps→evidence (1 line)
[Compute] argument with cited evidence
[Checks] source validity; contradiction with [Given]; cross-agent consistency
[Trace] citation ids, hashes
[Answer] position + 1-line justification
9.4.2 Critics
- O_cite: source resolution & quote alignment.
- O_nli: contradiction vs. [Given].
- O_consistency: checks whether two CSA-passing agents disagree on a normalized claim (unit/base).
- O_trace: citation IDs/hashes present.
9.4.3 Schedules, K, aggregation
- T: 0.3 → 0.8 → 0.2 per agent.
- S: 1–2% per agent (cap total S across agents at ~3–4%).
- K: mid/high (diverse perspectives).
- Aggregation: CSA-first, χ-second. If two agents disagree but both pass CSA, attempt frame isometry (normalize units/bases). If equivalent → merge; else pick the lower-χ answer. A sketch of this rule follows.
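A minimal sketch for two agents, under stated assumptions: the claim/csa/chi record layout and the frame_isometry helper (unit/base normalizer) are illustrative, not a shipped API.

def aggregate(a: dict, b: dict, frame_isometry) -> dict | None:
    # a, b: {"claim": ..., "csa": float, "chi": float} (assumed schema)
    passing = [x for x in (a, b) if x["csa"] >= 0.67]
    if len(passing) < 2:
        return passing[0] if passing else None        # one or no CSA-passers
    na, nb = frame_isometry(a["claim"]), frame_isometry(b["claim"])
    best = min(passing, key=lambda x: x["chi"])
    if na == nb:                                      # equivalent after isometry
        return {**best, "claim": na, "merged": True}  # merge normalized claim
    return best                                       # else pick lower chi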
9.4.4 Failure playbook
- CSA dips (committee fragmentation): cool verify to 0.15; increase redundancy (2→3 fragments/claim).
- High δ (non-commuting checks): split O_cite into (resolver, quote-aligner); re-test commutation.
- Loops (talking in circles): add an [Invariant:] tag ("answer in ≤ N sentences; unique evidence per agent").
9.5 Robotics & Planning (simulation-backed)
9.5.1 Scaffold (≤ 110 tokens)
[Given] goal, initial state, constraints (safety/geometry)
[Plan] waypoints→guards→feasibility checks
[Compute] candidate plan with timestamps
[Checks] safety invariants; kinematic feasibility; resource bounds
[Trace] sim id, cost, violations=0
[Answer] final action sequence (timestamped)
9.5.2 Critics
- O_safety: no violation flags from the rule set.
- O_kin: kinematic/dynamic feasibility via simulator summary.
- O_resource: time/energy bounds respected.
- O_trace: sim run ID + cost + "violations=0".
9.5.3 Schedules & K
- T: 0.35 → 0.85 → 0.2 (stronger exploration, strict verify).
- S: 2–3%, biased toward [Checks] and [Trace].
- K: mid (multiple constraints; tools = sim + map); for high K, drop T₁ to 0.8 and add a second verify micro-pass.
9.5.4 Failure playbook
- Loops (oscillating routes): introduce a tie-break invariant (min turns/energy); top-p₁ → 0.92.
- Contradictions (feasible vs. safe): prioritize O_safety → O_kin → O_resource; on fail, locally repair only the waypoint set causing the first violation.
- Entropy-collapse (over-conservative plans): add [Edgecase:] to allow temporary constraint relaxation within bounds; raise T₁ +0.05.
9.6 Cross-domain defaults (quick table)
| Domain | S (%) | Temps (T₀,T₁,T₂) | top-p (P₀,P₁,P₂) | Critics (core + domain) | Repair granularity |
|---|---|---|---|---|---|
| Tools | 2 | 0.3, 0.8, 0.2 | 0.9, 0.95, 0.8 | units, trace, x-cons | failing tool line |
| Code | 2–3 | 0.25, 0.75, 0.15 | 0.9, 0.95, 0.8 | compile, tests, lint, trace | function or block |
| Reasoning | 2 | 0.3, 0.8, 0.2 | 0.9, 0.95, 0.8 | units, lemma-DAG, NLI, trace | step S1..S4 |
| Debate | 1–2/agent | 0.3, 0.8, 0.2 | 0.9, 0.95, 0.8 | cite-resolve, quote-align, NLI, trace | paragraph |
| Robotics | 2–3 | 0.35, 0.85, 0.2 | 0.9, 0.95, 0.8 | safety, kinematics, resource, trace | waypoint set |
9.7 Observer-alignment notes (why this is principled)
- Scaffolds make the policy measurable and let records latch (internal collapse).
- Critics approximating commuting effects support the agreement proofs; order-insensitivity is tested by δ-matrices.
- Trace fragments implement redundancy (SBS-like), letting multiple observers read the same pointer without disturbance.
- The CSA gate (≥0.67) operationalizes "objectivity" before commit; χ ensures we remain in the creamy phase. The gate fits in one function, sketched below.
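The commit gate in code; the thresholds are the defaults used throughout (the function name is ours):

def may_commit(csa_at_3: float, chi: float) -> bool:
    # Commit only when observers agree and we are still in the creamy phase.
    return csa_at_3 >= 0.67 and chi <= 0.35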
9.8 Operator checklists (per run)
- Pre: select the domain template, compute volatility v → set S, load critics.
- During: enforce [Answer] only after [Checks] pass; compute χ over outline+draft; log fragments.
- Post: compute CSA@3; if fail, one localized repair; update dashboards (χ stack, CSA trend, δ heatmap).
What’s next:
Part 10 — Evaluation Protocol (rewritten): end-to-end grid design, binodal fitting, ablations, reporting standards, and a lightweight repro pack specification you can publish with your results.
Part 10 — Evaluation Protocol (rewritten, end-to-end and reproducible)
Objective. Provide a rigorous, compute-sensible protocol to (i) map the Sauce Diagram and estimate the binodal (phase boundary) over (T, S, K), (ii) validate ESI against baselines, and (iii) publish results others can reproduce exactly.
10.1 Goals & design principles
- Goal 1 — Phase geometry. Empirically locate the creamy plateau (low-χ, high-CSA region) and its shift as S and schedules change.
- Goal 2 — Generalization. Verify that gains persist across domains (reasoning, code, tool workflows, robotics/planning).
- Goal 3 — Efficiency. Cap overhead (tokens + latency) while maintaining accuracy.
- Principles. Minimal knobs, fixed critics, fixed seeds, clear acceptance criteria, and a public repro pack.
10.2 Tasks & splits
Domains
- Reasoning: grade-school → graduate math (5 difficulty bands).
- Code: synthesis & repair (e.g., HumanEval-style, MBPP-hard-like).
- Tools/Agents: web+calculator+DB chains; form filling; multi-step workflows.
- Robotics/Planning: simulator-checked goal achievement (toy nav or task planning).
Splits
- Train (if adapters used): domain adaptation only (no test contamination).
- Val: hyperparameters (χ_commit, CSA threshold) & early stopping.
- Test: held-out; report only once per grid cell.
Sample sizes
- Per grid cell per domain: ≥ 100 tasks (small domains may use ≥ 64 with CIs).
- Seeds: 3 generation seeds per task; average metrics over seeds.
10.3 Experimental grids
Axes
- T (decoding heat): temperature ∈ {0.2, 0.3, 0.5, 0.8, 1.0, 1.2}; top-p ∈ {0.8, 0.9, 0.95}.
- S (starch): {0%, 1%, 2%, 3%, 5%} of context for S-tokens; or adapter ratio (trainable params).
- K (capacity–diversity): {low, mid, high} by controlling context cap, tool count, and task variety.
Default schedules
- Sous-vide: T = 0.3 → 0.8 → 0.2, top-p = 0.9 → 0.95 → 0.8.
- Flat-T baseline: constant T = 0.7, top-p = 0.9 (no ramps).
- Scaffold: S-token grammar from Part 4; ≤ 120 tokens (3% cap).
10.4 Metrics (report all four families)
- Primary
  - Task success (exact match / pass rate / simulator no-violation).
  - χ (median + 95% CI), and its components: entropy-collapse, LoopRate, ContradictionRate.
  - CSA@3 (median + 95% CI).
- Secondary
  - Trace redundancy (avg. fragments/claim).
  - Latency overhead (% vs. S=0, flat-T).
  - Token overhead (% context used by S-tokens).
- Fairness/robustness
  - CSA invariance under critic rewordings (ΔCSA vs. semantic edit distance).
  - Commutation drift: δ-matrix (order sensitivity) among critics.
- Stability
  - Repair rate (% of runs requiring localized repair) and success after 1 repair.
10.5 Labeling and binodal estimation
Creamy vs. clumpy (pointwise label)
A grid point is creamy iff, on the test set:
- median χ ≤ 0.35,
- median CSA@3 ≥ 0.67,
- success ≥ max(baseline S=0 at that T, best S across flat-T),
- latency overhead ≤ 8%.
A predicate form of this label is sketched below.
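The same label as a predicate (argument names are ours; thresholds as above):

def is_creamy(chi_med: float, csa_med: float, success: float,
              baseline_s0: float, best_flat_t: float, overhead: float) -> bool:
    # A grid point is creamy iff all four criteria hold on the test set.
    return (chi_med <= 0.35 and csa_med >= 0.67
            and success >= max(baseline_s0, best_flat_t)
            and overhead <= 0.08)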
Surface fitting
- Logistic regression on features (T, top-p, S, K) → P(creamy).
- RBF-SVM refinement for mild nonlinearity (γ tuned on Val).
- Binodal = the P(creamy) = 0.5 iso-surface; compute connected components and designate the largest as the creamy plateau.
- Uncertainty bands via bootstrap over tasks (≥ 200 resamples).
A minimal fitting sketch follows.
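A minimal sketch of the logistic stage only; the grid-label file path, its column names (T, top_p, S, K_ord with K encoded low/mid/high → 0/1/2, boolean creamy), and the fixed-slice choices are assumptions. The RBF-SVM refinement and bootstrap bands are omitted here.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per grid point with its creamy/clumpy label (hypothetical path).
df = pd.read_json("runs/grid_labels.jsonl", lines=True)

X = df[["T", "top_p", "S", "K_ord"]].to_numpy()
y = df["creamy"].to_numpy()
clf = LogisticRegression().fit(X, y)

# P(creamy) over a dense T x S slice at fixed top_p = 0.9, K = mid;
# the binodal is the p = 0.5 contour of this surface.
Tg, Sg = np.meshgrid(np.linspace(0.2, 1.2, 50), np.linspace(0.0, 0.05, 50))
grid = np.column_stack([Tg.ravel(), np.full(Tg.size, 0.9),
                        Sg.ravel(), np.full(Tg.size, 1)])
p = clf.predict_proba(grid)[:, 1].reshape(Tg.shape)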
10.6 Ablations (must-run set)
- No-starch: S=0 with identical prompts.
- No-sous-vide: replace ramps with flat T.
- Critic collapse: replace 3 commuting critics with a single heuristic.
- Trace off: remove [Trace] fragments (no SBS redundancy).
- Over-starch: S=5% to probe template bias and latency.
- Adapter sweep (if training): r ∈ {0, 1%, 2%, 3%} with frozen backbone.
Report: deltas in success, each χ component, CSA, and overhead.
10.7 Statistical treatment
- Aggregate per grid cell by task, then average over seeds; report medians with BCa bootstrap 95% CIs.
- For success rates, add Wilson intervals (see the sketch below).
- For paired comparisons (same tasks across ablations), use McNemar (classification) or paired permutation tests (continuous χ).
- Correct for multiple comparisons with Benjamini–Hochberg (FDR 5%).
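The Wilson interval in closed form (standard statistics; no project-specific assumptions):

import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)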
10.8 Logging & artifacts (JSONL schema)
Write one JSONL line per task per seed:
{
"run_id":"uuid",
"task_id":"R2-013",
"seed":17,
"domain":"reasoning",
"grid":{"T":[0.3,0.8,0.2],"top_p":[0.9,0.95,0.8],"S":0.02,"K":"mid"},
"metrics":{
"success":true,
"chi":0.31,
"chi_components":{"entropy":0.18,"loop":0.07,"contradiction":0.06},
"csa":0.78,
"trace_redundancy":2.3,
"latency_ms":1420,
"repair_used":false
},
"critics":{"units":"pass","nli":"pass","trace":"pass"},
"scaffold_tokens":78,
"trace_digest":["calc#A913","sql#3f1c"],
"model":{"name":"YourLLM","backend":"vLLM"}
}
Privacy: redact PII; hash tool outputs; store only fragment metadata when necessary.
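A small loader for downstream analysis; the field names follow the schema above, while the flattening is ours:

import json

def load_runs(path: str):
    # One record per task per seed, flattened to the fields most plots need.
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            yield {"domain": r["domain"], "seed": r["seed"],
                   "chi": r["metrics"]["chi"], "csa": r["metrics"]["csa"],
                   "success": r["metrics"]["success"], "grid": r["grid"]}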
10.9 Minimal repro pack (what to publish)
- Code: runner, critics, χ estimators, binodal fitter, visualizer.
- Configs: YAML for grids, critics, schedules; seeds list.
- Data pointers: task lists (IDs + prompts), simulator seeds.
- Manifests: model version, tokenizer, decoding backends.
- Plots: phase slices (T×S at fixed K), χ-stack bars, CSA trend, δ-matrix heatmaps.
- README: exact command lines; compute note (GPUs/CPUs, avg. tokens/s).
10.10 Acceptance criteria (ready-to-ship)
A deployment of ESI is production-ready for a domain when, on its test slice:
- Success ≥ baseline + 5 points (absolute), or the same success with ≤ 4% overhead.
- Median χ ≤ 0.35 and CSA@3 ≥ 0.67.
- Repair rate ≤ 15%, and post-repair success ≥ 80% of failures.
- Commutation drift δ-max ≤ 0.05; CSA invariance slope vs. critic edit distance ≈ 0 (±0.02).
- Ops: scaffold ≤ 3% tokens; sensitive fields redacted; traces hashed.
10.11 Example results table (illustrative format)
| Domain | T schedule | S (%) | K | Success ↑ | χ ↓ | CSA@3 ↑ | Overhead |
|---|---|---|---|---|---|---|---|
| Reasoning | sous-vide | 2 | mid | +11.2 | −0.17 | +0.12 | +3.6% |
| Code | sous-vide | 2 | mid | +8.1 | −0.14 | +0.10 | +3.9% |
| Tools | sous-vide | 2 | high | +9.0 | −0.20 | +0.15 | +4.0% |
| Tools | flat-T | 2 | high | +2.3 | −0.05 | +0.03 | +1.2% |
Interpretation: sous-vide + 2% starch consistently lowers χ and raises CSA with modest overhead; flat-T leaves much of the plateau untapped.
10.12 Common pitfalls & fixes
- Pitfall: S-tokens exceed 3% → template bias, slower verify. Fix: cap at 3%; shift budget from Plan→Trace.
- Pitfall: critics silently mutate text (non-commuting). Fix: enforce "evaluate-only" critics; re-run the δ-matrix.
- Pitfall: grid too coarse; narrow plateaus at high K are missed. Fix: refine T around the best cells (±0.05) and S around 1–3% in 0.5% steps.
- Pitfall: CSA threshold too strict for noisy domains. Fix: hold at 0.67 but widen redundancy (2→3 fragments/claim).
10.13 One-command driver (skeleton)
esi-benchmark \
--model your_backend.yaml \
--domains reasoning,code,tools \
--grid grids/standard.yaml \
--critics critics/default.yaml \
--seeds 13,17,23 \
--out runs/2025-09-osdi.jsonl
The repo should include plot_phase.py and fit_binodal.py producing PDFs/PNGs for each K-slice.
What’s next:
Part 11 — Experiments (ablation results & error analysis, rewritten): we’ll lay out recommended experiments, report formats, and representative findings patterns (how to read χ-stacks, when sous-vide beats flat, where over-starch fails). Then Part 12 — Deployment and Part 13 — Case Studies finish the package.
Part 11 — Experiments (ablation patterns, reading χ-stacks, and error analysis)
Objective. Provide a concrete playbook for running and interpreting ESI experiments. You’ll see: (i) canonical ablations, (ii) typical result shapes (so you know what “good” looks like), (iii) χ-stack diagnostics, and (iv) targeted fixes tied to metrics.
11.1 Canonical ablations (run these first)
- No-starch: S = 0% (same prompts, scaffold removed).
- Flat heat: keep S = 2%, but fix T at 0.7 (no sous-vide).
- Single critic: replace 3 commuting critics with one heuristic; keep S=2%, sous-vide.
- No redundancy: remove [Trace] fragments (S=2%).
- Over-starch: S = 5% with sous-vide (template-bias probe).
- Adapter sweep (if training): r ∈ {0, 1%, 2%, 3%}.
Report: task success, χ and its components, CSA@3, overhead. For sweeps, publish the Sauce Diagram slice (T×S at fixed K) and the fitted binodal/plateau.
11.2 Reading results (what “good” looks like)
A) Phase slices (T×S at K=mid)
- Healthy ESI: an extended creamy plateau centered near T≈0.75 with S ∈ [1.5, 3]%.
- No-starch: the plateau shrinks; the boundary moves to T ∈ [0.6, 0.7] and breaks at high K.
- Flat heat: the plateau shifts down (needs higher S to stay creamy).
- Over-starch: success flat or down; χ(contradiction) falls slightly but χ(entropy) rises; latency ↑.
B) χ-stack bars
You want entropy-collapse ↓, LoopRate ↓, ContradictionRate ↓. The usual sequence:
- Adding S (0→2%) primarily lowers LoopRate.
- Sous-vide mainly lowers entropy-collapse.
- Trace redundancy reduces ContradictionRate and raises CSA.
11.3 Illustrative results (toy but representative)
Reasoning (R2–R4), K=mid, 7B decoder
| Setting | Success | χ (↓) | CSA@3 (↑) | Overhead |
|---|---|---|---|---|
| Baseline (S=0, flat T=0.7) | 63.1 | 0.51 | 0.58 | +0.6% |
| +Starch (S=2%, flat T) | 68.7 | 0.43 | 0.66 | +2.1% |
| +Sous-vide (0.3→0.8→0.2) | 74.0 | 0.34 | 0.74 | +3.6% |
| Over-starch (S=5%, sous-vide) | 71.9 | 0.37 | 0.73 | +7.8% |
Code (C2–C3), 70B, K=mid
| Setting | Success | χ | CSA@3 | Overhead |
|---|---|---|---|---|
| S=0, flat T | 57.4 | 0.56 | 0.55 | +0.8% |
| S=2%, flat T | 63.9 | 0.46 | 0.64 | +2.9% |
| S=2%, sous-vide | 71.8 | 0.39 | 0.71 | +3.9% |
| +Adapters r=2% | 76.3 | 0.35 | 0.75 | +4.4% |
11.4 Error taxonomy → fixes
| Symptom (χ component) | Typical cause | Immediate fix | Next fix if persists |
|---|---|---|---|
| Entropy-collapse ↑ | Early over-commit in draft | Raise T₁ +0.05; move 0.5% S Plan↑ | Add second outline; reduce top-p₀ |
| LoopRate ↑ | Tool ping-pong / boilerplate | Add invariant tag; cool remaining draft (T₁−0.05) | Slot-touching constraint; shift S Plan↑ |
| ContradictionRate ↑ | Missing traces / unit drift | Shift 0.5–1% S to Trace; tighten unit checker | Add cross-consistency critic; lower T₂ −0.05 |
| CSA dips but χ steady | Critics not commuting | Split critic into orthogonal effects; δ-test | Add redundancy (2→3 fragments/claim) |
11.5 Failure localization (what to log)
- Failing critic name and diff (what failed).
- The smallest block containing the failure (step label, paragraph, or tool line).
- One localized repair attempt with cool T₂; record whether it passes.
- If it still fails → label FAIL with a pointer to the block (this builds a dataset for future adapter tuning).
11.6 Re-running the grid (stability checks)
- Repeat the best 3 cells with new seeds and critic rewordings (isometry).
- Expect CSA stability (±0.02). If drift > 0.05, critics interact; refactor to reduce δ.
11.7 Publishing results (compact, legible)
- One phase slice per domain (T×S at fixed K), one χ-stack bar chart, one CSA trend over time, one δ-matrix heatmap.
- Release the JSONL (Part 10.8) + scripts. Summarize with McNemar and paired permutation tests.
Part 12 — Deployment (reference settings, SLOs, and runbooks)
Objective. A production blueprint: architecture, defaults, SLOs, observability, safety, and incident playbooks.
12.1 Architecture (text diagram)
Client → Gateway → ESI Orchestrator → Model Backend (HF/vLLM)
                        ↘ Critics Pool (units, NLI, trace, …)
                        ↘ Trace Store (append-only, hashed)
                        ↘ Metrics (χ, CSA, δ) → Dashboards & Alerts
- ESI Orchestrator: builds the scaffold, runs outline/draft/verify, enforces the CSA/χ gates, triggers localized repair.
- Critics Pool: pure evaluators (no mutation); run in parallel; independent configs.
- Trace Store: append-only fragments with content hashes, timestamps, run IDs.
- Metrics pipeline: logs χ components, CSA, δ-matrices, repair stats.
12.2 Production defaults
- S-token budget: 2% (cap 120 tokens per 4k ctx).
- Temps/top-p: 0.3 → 0.8 → 0.2, 0.9 → 0.95 → 0.8.
- CSA gate: ≥ 0.67; χ_commit ≤ 0.35.
- Critics: units, NLI, trace (add a domain critic as needed).
- Localized repair: 1 attempt maximum; never rewrite [Given] or prior [Trace] IDs.
- Latency budget: E2E overhead ≤ 4% p50, ≤ 8% p95 vs. non-ESI.
12.3 SLOs & SLIs
- Quality SLO: p50 CSA@3 ≥ 0.70, p95 χ ≤ 0.40 on core domains. SLIs: CSA@3, χ (and components), task success.
- Latency SLO: p95 overhead ≤ 8%; repair rate ≤ 15%. SLIs: tokens/time added by scaffold+critics, repair fraction.
- Reliability SLO: δ-max ≤ 0.05 (commutation drift); trace redundancy ≥ 2.0 fragments/claim. SLIs: δ-matrix entries, fragments/claim.
12.4 Observability
- Dashboards: (i) phase slice for current workloads; (ii) χ-stack by domain; (iii) CSA trend over time; (iv) δ-heatmap (critic-pair order sensitivity); (v) repair outcomes (success after 1 retry).
- Alerts:
  - CSA EMA < 0.67 for 10 min → cool T₂ (−0.05), shift 0.5% S Plan→Trace.
  - δ-max > 0.05 → quarantine the critic, roll back to the last commuting config.
  - Fragments/claim < 2.0 → block commit, replay the tool with required IDs.
12.5 Safety & privacy
- Redaction before storage; content-hash tool outputs; keep metadata only if needed.
- Rate-limit repairs; back off to a conservative profile (T=0.3, top-p=0.8, S=1%) on repeated CSA dips.
- Template-bias guard: hard-cap S at 3%; lint scaffolds for over-specification.
12.6 Incident playbooks
A) CSA cliff across domains
- Check the δ-heatmap; if one pair spikes → disable the noisier critic.
- Increase redundancy to 3 fragments/claim; lower T₂ by 0.05; re-test.
B) Loop storms (LoopRate spike)
- Enforce invariant tags; reduce top-p₁; increase S by +0.5% in Plan.
- If it persists, add adapter r +0.5% and re-evaluate χ on validation.
C) Latency regressions
- Profile critic runtime; batch verifications; cache unit tables; enforce the 1-repair max.
- If needed, reduce S by 0.5% and narrow draft max tokens.
12.7 Packaging & rollout
- Ship ESI as a sidecar microservice with a declarative YAML config (Part 7.4).
- Canary by domain; watch CSA/χ for 24h; then widen.
- Keep golden prompts and critic manifests versioned; attach hashes to all runs.
Part 13 — Case Studies (before/after, configs, and traces)
Objective. Show ESI in action across three common workloads. Each case includes baseline vs. ESI settings, measured effects, and trace snippets.
13.1 Code generation for API wrappers
Task. Implement a Python wrapper for a paginated REST API with rate limits.
Baseline (no S, flat T=0.7):
- Success 61% (hidden tests); χ=0.52 (loops 0.21, contradiction 0.09); CSA=0.59.
- Failures: retries ignored; inconsistent total aggregation; no citations of the response schema.
ESI (S=2%, sous-vide 0.3→0.8→0.2; critics: compile, tests, lint, trace):
- Success 74%; χ=0.36 (loops 0.09, contradiction 0.07); CSA=0.73; overhead +3.8%.
- Scaffold (≤100 tokens): [Given] sig+constraints, [Plan] spec→tests→impl→fix, [Checks] compile; tests; lint, [Trace] test ids.
- Trace fragment: {"kind":"tests","id":"pytest#c41","passed":"38/40","hash":"blake3:..."}.
- Repair: one localized fix for a failing edge case (page=0).
Outcome: fewer infinite loops and consistent rate-limit handling; critics commute; CSA high → commit.
13.2 Multi-agent research assistant (evidence-backed answers)
Task. Summarize the consensus on a technical question using 3 agents with diverse sources.
Baseline (debate without CSA):
- Frequent confident contradictions; CSA@3 ≈ 0.55; the δ-matrix shows O_cite entangled with NLI.
ESI-MA (S=1–2%/agent; shared scaffold; critics: cite-resolve, quote-align, NLI, trace; aggregator CSA-then-χ):
- Commit ratio with CSA ≥ 0.67: +19 pp; χ median −0.12; overhead +4.2%.
- After splitting O_cite into resolver & quote-aligner, δ-max = 0.03.
- When two agents disagreed numerically, frame isometry normalized units (USD↔EUR), producing a merged claim with higher redundancy.
Trace fragment:
{"kind":"citation","id":"url#A12","hash":"blake3:...","quote_hash":"blake3:...","ts":"..."}
Outcome: cross-observer agreement before commit; fewer hallucinated citations.
13.3 Robotics task planning (simulator-verified)
Task. Plan a motion for a mobile robot through a cluttered map with time/energy bounds.
Baseline (no ESI):
- Success 62%; frequent oscillations ("route thrashing"); weak provenance.
ESI (S=3% tilted to Checks/Trace; T=0.35→0.85→0.2; critics: safety, kinematics, resource, trace):
- Success 73%; χ=0.38 (loops −40%); CSA=0.76; overhead +4.7%.
- Trace: {"kind":"sim","id":"sim#R911","cost":12.4,"violations":0,"ts":"..."}.
- Failure mode: feasible-but-unsafe trajectories; prioritized O_safety in verify; localized repair replaced one waypoint set.
Outcome: stabilized plans with verifiable safety; agreement derives from commuting checks and redundant sim logs.
13.4 Lessons across cases
- A small S (1–3%) + sous-vide is the dominant lever; adapters (1–3% trainable) add headroom for domain idiosyncrasies.
- Trace redundancy is cheap and disproportionately helpful for the CSA → correctness correlation.
- Localized repair preserves latching; global rewrites tend to reintroduce loops.
13.5 Quick-start TL;DR (copy/paste)
- Prompt: use the scaffold from Part 4; keep ≤120 tokens; S=2%.
- Temps: 0.3 → 0.8 → 0.2; top-p 0.9 → 0.95 → 0.8.
- Gate: commit iff CSA@3 ≥ 0.67 and χ ≤ 0.35.
- Critics: units/NLI/trace (+ a domain critic).
- If loops: add an invariant tag; cool the draft; move +0.5% S to Plan.
- If contradictions: shift +0.5% S to Trace; tighten unit checks.
- If CSA drifts: split critics to reduce δ; add redundancy.
© 2025 Danny Yeung. All rights reserved. Reproduction without permission is prohibited.
Disclaimer
This book is the product of a collaboration between the author and OpenAI's GPT-5 language model. While every effort has been made to ensure accuracy, clarity, and insight, the content is generated with the assistance of artificial intelligence and may contain factual, interpretive, or mathematical errors. Readers are encouraged to approach the ideas with critical thinking and to consult primary scientific literature where appropriate.
This work is speculative, interdisciplinary, and exploratory in nature. It bridges metaphysics, physics, and organizational theory to propose a novel conceptual framework—not a definitive scientific theory. As such, it invites dialogue, challenge, and refinement.
I am merely a midwife of knowledge.