The Science Behind AgentOps¶
TL;DR: One model of knowledge decay (Darr 1995) suggests ~17%/week without reinforcement. Knowledge compounds when retrieval × usage beats decay and scale friction. Early growth can be exponential; long-run growth requires active limits-to-growth controls.
The Journey¶
This wasn't designed in a vacuum. It came from years of connecting dots across fields:
- Knowledge OS — The insight that git + files can become a durable bookkeeping substrate
- DevOps — The Three Ways applied to knowledge, not just code
- Cognitive Science — Why 40% load is optimal (35 years of research)
- MemRL — Reinforcement learning for retrieval and utility-scoring systems (2026)
- Thermodynamics — The Brownian Ratchet as progress model
Each piece validated the intuition. The math fell out naturally.
Claim Status (So We Don't Overclaim)¶
To make this model durable under critique, we separate claim types:
| Tier | Claim Type | Standard of Evidence | Examples in this doc |
|---|---|---|---|
| A | Established external evidence | Peer-reviewed or widely replicated findings | Forgetting curves, cognitive load limits, lost-in-the-middle behavior |
| B | Internal empirical evidence | Reproducible internal measurements with clear methodology | Time-to-resolution deltas, token-cost deltas, reuse-rate trends |
| C | Working hypothesis | Mechanistic proposal under active testing | Ratchet model details, exact operating point for σ × ρ, cross-project transfer effects |
This document mixes all three tiers. The goal is to make each claim explicit, measurable, and falsifiable.
Part 1: The Problem (With Evidence)¶
Knowledge Decays. Fast.¶
Citation: Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations: Productivity in Franchises." Management Science, 41(11), 1750-1762.
Finding: Organizational knowledge depreciates at approximately 17% per week without active reinforcement.
Week 0: 100%
Week 1: 83% (lost 17%)
Week 2: 69% (lost another 17% of what remained)
Week 4: 47%
Week 8: 22%
Why this matters for AI: Every Claude session starts close to Week 0 unless the environment resurfaces prior work. Without bookkeeping plus retrieval, you're always on the left side of the decay curve.
The Forgetting Curve¶
Citation: Ebbinghaus, H. (1885). Über das Gedächtnis (Memory: A Contribution to Experimental Psychology).
Ebbinghaus discovered that memory decays exponentially without reinforcement, but each retrieval strengthens the memory and slows future decay.
Memory Strength
│
100%│╲
│ ╲
50%│ ╲______ retrieval here
│ ╲_____ slows decay
25%│ ╲____
└─────────────────────────
Time →
The insight: It's not about storing more. It's about retrieving at the right time.
Part 2: The Math (Plain English)¶
The Equation¶
dK/dt = I(t) - δ·K + σ·ρ·K
Don't panic. Here's what each piece means:
| Symbol | What It Is | Plain English | Example |
|---|---|---|---|
K |
Knowledge stock | How much useful stuff you've accumulated | "We have 156 learnings stored" |
dK/dt |
Rate of change | Is the pile growing or shrinking? | "+5 learnings this week" or "-10 lost to decay" |
I(t) |
Input rate | New knowledge coming in | "Forged 3 sessions today" |
δ |
Decay rate | How fast you forget (0.17/week) | "17% gone each week if unused" |
σ |
Retrieval coverage | How much of the useful stock are you actually surfacing? | "Surfaced 70% of retrievable artifacts" |
ρ |
Decision influence rate | Of what you surfaced, how much later had evidence-backed use? | "30% of surfaced artifacts were referenced or applied" |
Implementation note: In the CLI implementation (
metrics_health.go), sigma measures unique surfaced retrievable artifacts / total retrievable artifacts over the last 10 sessions, rho measures the fraction of surfaced artifacts later evidenced byreferenceorappliedcitations, and delta measures average age of active learnings in days. The escape velocity check usesσ × ρ > δ/100to normalize delta to a ratio comparable with sigma and rho.
Dimensional Check¶
To keep this scientifically defensible:
Kis measured in useful knowledge units (for example: validated learnings).I(t)is knowledge units per week.δ,σ, andρare rates/probabilities per week or per retrieval cycle.- All terms in
dK/dtresolve to knowledge units per week.
Breaking It Down¶
I(t) — The input. You forge transcripts, extract learnings, write retros. Knowledge goes in.
- δ·K — The decay. Every week, 17% of your knowledge becomes stale or forgotten. This is fighting against you.
+ σ·ρ·K — The compounding term. This is the magic.
When you retrieve knowledge (σ) and actually use it (ρ), two things happen:
1. That knowledge gets reinforced (Ebbinghaus)
2. New knowledge is created from the application
The ·K means it's proportional to how much you already have. More knowledge → more compounding.
System Dynamics Correction: Limits to Growth¶
The baseline equation is structurally correct, but idealized. In real systems, reinforcing loops hit balancing loops at scale.
A scale-aware form:
dK/dt = I(t) - δ·K + σ(K,t)·ρ·K - φ·K²
where:
σ(K,t) = σ_max / (1 + (K / Kσ)^n)
Interpretation:
- σ(K,t) declines as the corpus grows unless retrieval/index quality improves.
- φ·K² captures scale friction: indexing overhead, latency, noise, governance cost, and cognitive overhead.
- Kσ is the knowledge scale where retrieval starts degrading materially.
This adds the missing Limits to Growth balancing loop from System Dynamics.
The Escape Velocity Condition¶
Rearrange the equation at steady state:
General growth condition:
I(t) + K(σ·ρ - δ) > 0
Operational self-sustaining check: σ × ρ > δ/100
With scale friction:
ρ·σ(K,t) > δ/100 + φ·K - I(t)/K
Meaning: early growth can be exponential, but long-run growth plateaus unless you actively prevent σ collapse and friction growth.
If your retrieval effectiveness times your citation rate exceeds 0.17 in self-sustaining mode, you're compounding. If not, you either need fresh input I(t) or stronger controls to avoid stagnation.
| Scenario | σ | ρ | σ × ρ | vs δ | Result |
|---|---|---|---|---|---|
| No bookkeeping or retrieval | 0 | 0 | 0 | < 0.17 | Decaying |
| Store but don't retrieve | 0.1 | 0.1 | 0.01 | < 0.17 | Decaying |
| Retrieve but don't use | 0.8 | 0.1 | 0.08 | < 0.17 | Decaying |
| AgentOps target | 0.7 | 0.3 | 0.21 | > 0.17 | Compounding |
The 0.04 margin matters. Small edge, compounded over time, becomes massive.
Loop-Dominance Translation (System Dynamics)¶
This model is a stock-and-flow system with competing loops:
R1reinforcing loop: retrieval -> usage -> stronger priors -> better future retrieval.B1balancing loop: decay/staleness drains the stock.B2balancing loop: scale friction reduces retrieval and increases operating cost asKgrows.
Expected phases:
- Bootstrap phase:
R1 > B1, rapid gains. - Flywheel phase:
R1 > B1 + B2, compounding with healthy margins. - Saturation risk:
B2grows; gains flatten. - Renewal phase: pruning, re-indexing, tiering, and stronger feedback push
R1back above balancing loops.
Part 3: DevOps Foundation (The Three Ways)¶
Citation: Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.
DevOps isn't about tools. It's about three principles:
The First Way: Flow¶
Optimize the flow of work from left to right (dev → ops → customer).
In AgentOps: Knowledge flows from sessions → forge → store → inject → next session.
Session → Forge → Store → Inject → Session
↓
(no bottlenecks)
We don't batch. We stream. Every session feeds the next.
The Second Way: Feedback¶
Create feedback loops at every stage.
In AgentOps:
- /vibe validates code quality
- /pre-mortem catches failures before they happen
- ao feedback trains the utility scorer
- Citation tracking shows what's actually used
Action → Measurement → Learning → Adjustment
↑ │
└──────────────────────────────────┘
The Third Way: Continuous Learning¶
Create a culture of experimentation and learning from failure.
In AgentOps:
- /retro extracts learnings from every significant work
- /post-mortem closes the loop on epics
- Failures become learnings, not just incidents
Failure → Retro → Learning → Pattern → Skill
↓
(never make same mistake)
The connection: DevOps optimizes code flow. AgentOps optimizes knowledge flow. Same principles, different domain.
Part 4: Cognitive Science (Why 40%)¶
The Research Stack¶
| Researcher | Year | Finding | Application |
|---|---|---|---|
| Miller | 1956 | Working memory holds 7±2 chunks | Context windows have real limits |
| Cowan | 2001 | Core capacity is ~4 items | Optimal load is lower than max |
| Sweller | 1988 | Cognitive Load Theory | Three types of load compete |
| Paas & van Merriënboer | 2020 | Updated CLT | JIT loading reduces extraneous load |
| Barkley | 2015 | Executive function limits | Performance collapses at overload |
| Csikszentmihalyi | 1990 | Flow state | Optimal challenge zone |
| Yerkes & Dodson | 1908 | Inverted-U performance curve | Peak at moderate arousal |
| Liu et al. | 2023 | "Lost in the Middle" | LLMs can't retrieve from crowded contexts |
Citations:
- Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
- Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
- Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
- Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory: Methods to manage working memory load." Current Directions in Psychological Science, 29(4), 394-398.
- Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
The Pattern¶
Every study finds the same thing: performance peaks at moderate load.
Performance
│
100%│ ╭───╮
│ ╭─╯ ╰─╮
75%│ ╭─╯ ╰─╮
│ ╭─╯ ╰─╮ collapse
50%│ ╭─╯ ╰──────
│╭─╯
25%│
└────────────────────────────────
0% 20% 40% 60% 80% 100%
Context Utilization
40% isn't arbitrary. It's where decades of research say performance lives.
Why This Matters for LLMs¶
Liu et al. (2023) showed that LLMs have a "lost in the middle" problem. When context is crowded: - Information at the start: retrieved well - Information at the end: retrieved well - Information in the middle: lost
Retrieval Accuracy by Position:
Start │████████████████████│ High
│ │
Mid │████████░░░░░░░░░░░░│ Low ← "Lost in the middle"
│ │
End │████████████████████│ High
The fix: Don't fill context to 100%. Stay at 40%. The middle stays findable.
Part 5: MemRL (Scale-Control Mechanism)¶
Citation: Zhang, S., Wang, J., Zhou, R., et al. (2025). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192
The Problem MemRL Solves¶
Traditional retrieval uses recency or similarity. But not all knowledge is equally useful.
Traditional RAG:
Query → Find similar → Return top K → Hope it helps
Problem: Recent ≠ Useful. Similar ≠ Helpful.
The MemRL Solution¶
Use reinforcement learning to learn what's actually useful:
# Each piece of knowledge has a utility score
utility = 0.5 # Start neutral
# When retrieved and used successfully
utility = (1 - α) × utility + α × 1.0 # Reward
# When retrieved but not helpful
utility = (1 - α) × utility + α × 0.0 # Penalty
# Ranking combines freshness AND utility
score = z_norm(freshness) + λ × z_norm(utility)
The insight: The system learns from feedback. Over time, useful knowledge rises and noise sinks, which helps prevent σ from collapsing as K grows.
How We Use It¶
# User gives feedback
ao feedback L15 --reward 1.0 # "This learning was helpful"
ao feedback L12 --reward 0.0 # "This was irrelevant"
# System updates utility scores
# Future retrieval ranks by usefulness, not just recency
The math connection: MemRL is one practical control to keep σ(K,t) high enough to offset scale friction. It is a mechanism, not a guarantee.
Part 6: The Brownian Ratchet (Our Contribution)¶
The Physics¶
A Brownian ratchet is a thought experiment from thermodynamics:
- Molecules bounce randomly (thermal motion)
- A pawl allows motion in only one direction (one-way gate)
- Net result: forward movement from random chaos
Random Motion One-Way Gate Net Progress
↓ ↓ ↓
←→←→←→←→ ───────┤► ──────────►
(chaos) (filter) (ratchet)
The Software Analog¶
| Physics | Software | Example |
|---|---|---|
| Random motion | Multiple parallel attempts | 4 polecats trying different approaches |
| One-way gate | Validation gates | Tests, CI, /vibe, /pre-mortem |
| Net forward movement | Merged/locked progress | Code in main, issues closed, learnings stored |
Why This Model Matters¶
Traditional thinking: Minimize variance. One developer, one approach, careful steps.
Ratchet thinking: Maximize controlled variance. Many attempts, filter aggressively, lock successes.
Traditional:
───────────────────────────────────► (slow, fragile)
Ratchet:
═══╦═══╦═══╦═══╗
═══╬═══╬═══╬═══╬════════════════════► (fast, resilient)
═══╩═══╩═══╩═══╝
↑
some fail, most succeed
failures are cheap
The Key Property¶
You can always add more chaos. You can't un-ratchet.
- Failed experiment? Try another. (Chaos is cheap.)
- Merged code? It is hard to regress accidentally. (Ratchet holds.)
- Stored learning? It compounds if retrieval quality stays high. (Progress can lock, but scale can still add drag.)
This is why progress can be made one-way at the artifact level, while system-level growth still needs active scale management.
Part 7: Putting It All Together¶
The Full Picture¶
┌─────────────────────────────────────────────────────────────────┐
│ THE AGENTOPS SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEVOPS LAYER (The Three Ways) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Flow: Session → Forge → Store → Inject → Session │ │
│ │ Feedback: Vibe, Pre-mortem, Citations, Utility scores │ │
│ │ Learning: Retros, Post-mortems, Pattern extraction │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ COGNITIVE LAYER (40% Rule) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Context utilization: 35% checkpoint, 40% alert │ │
│ │ JIT loading: Load what's needed, when it's needed │ │
│ │ Lost-in-middle prevention: Don't crowd the context │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ MEMRL LAYER (Utility Learning) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Feedback loop: Use → Reward/Penalize → Update utility │ │
│ │ Retrieval: Freshness + Utility scoring │ │
│ │ Result: σ (retrieval effectiveness) improves over time │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ RATCHET LAYER (Progress Locking) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Chaos: Multiple attempts, parallel exploration │ │
│ │ Filter: Validation gates (tests, vibe, CI) │ │
│ │ Ratchet: Merge, close, store (permanent) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ THE GOAL (Escape Velocity) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ σ × ρ > δ/100 │ │
│ │ │ │
│ │ retrieval × evidence-backed use > aging threshold │ │
│ │ │ │
│ │ When true: KNOWLEDGE COMPOUNDS │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Why Each Piece Matters¶
| Layer | What It Does | Which Variable It Improves |
|---|---|---|
| DevOps | Flow, feedback, learning | I(t) — more knowledge in |
| Cognitive | Optimal load | σ — better retrieval |
| MemRL | Utility learning | σ — smarter retrieval |
| Scale controls (tiering/pruning/indexing) | Limits-to-growth mitigation | Holds σ(K,t) up and φ down |
| Ratchet | Lock progress | Prevents regression of K |
Every layer serves the equation. The added constraint is explicit: long-run growth needs scale controls, not just early flywheel activation.
Part 8: Evidence (Internal Pilots, Not Causal Proof Yet)¶
What We've Measured So Far¶
These are internal observations and should be read as Tier B evidence (operational telemetry, not randomized causal proof):
| Metric | Baseline (internal) | AgentOps condition (internal) | Direction |
|---|---|---|---|
| Same-issue resolution | 45 min | 3 min | Faster |
| Token cost per issue | $2.40 | $0.15 | Lower |
| Context collapse rate | ~65% at 60% load | 0% at 40% load | Lower |
| Knowledge reuse | ~0% | ~15% (growing) | Higher |
Why This Is Not Yet "Proof"¶
- Baselines are historical, not fully randomized.
- Team maturity and task mix can confound outcomes.
- Some gains can come from process discipline independent of memory quality.
Evaluation Design to Make It Bulletproof¶
For causal confidence, run a controlled protocol:
- Randomize comparable tasks across conditions (
memory-on,memory-off,memory-on + utility learning). - Pre-register primary metrics: resolution time, token cost, defect rate, reuse precision@k, and citation rate
ρ. - Track estimated decay
δ_tand retrieval effectivenessσ_tweekly. - Segment by corpus size (
Kbuckets) to detect limits-to-growth behavior. - Require out-of-sample replication across projects, not just one team.
Falsifiable Predictions¶
The model should be treated as wrong if repeated experiments show:
ρ·σ(K,t) <= δ + φ·K - I(t)/Kwhile performance still compounds.- Retrieval quality does not degrade with scale and no compensating controls are needed.
memory-on + utility learningdoes not outperformmemory-onasKincreases.- Gains disappear under simple task-randomized comparisons.
Part 9: Limits to Growth and Control Policy¶
System Dynamics says reinforcing loops eventually hit balancing loops. We model that explicitly and design around it.
Main Scale Risks¶
| Risk | Loop Effect | Observable Symptom |
|---|---|---|
| Corpus bloat | Lowers σ(K,t) |
Falling precision@k, more irrelevant recalls |
| Retrieval latency/cost | Raises effective φ |
Slower sessions, rising token burn |
| Quality drift | Raises effective δ |
More stale/contradictory learnings |
| Cognitive overload | Lowers ρ |
Retrieved items cited less in final outputs |
Control Actions (Operational)¶
| Control | Primary Variable | Practical Mechanism |
|---|---|---|
| Tiering + archival | φ down |
Keep hot set small, cold set searchable |
| Utility-based pruning | σ up, δ down |
Remove low-value or stale memories |
| Re-index + embedding refresh | σ up |
Improve retrieval quality as schema evolves |
| Citation incentives/UX | ρ up |
Make reuse cheaper than re-derivation |
| Drift audits | δ down |
Detect and repair stale knowledge clusters |
Operating Rule¶
Track loop dominance continuously:
health(t) = ρ·σ(K,t) - (δ + φ·K - I(t)/K)
If health(t) > 0, the system is in compounding mode.
If health(t) <= 0, growth has hit limits and controls must be tightened.
Conclusion: The Goal Is The Math¶
Everything in AgentOps exists to achieve one thing:
Operational check:
σ × ρ > δ/100
Scale-aware form:
ρ·σ(K,t) > δ + φ·K - I(t)/K
When this is true, knowledge compounds. When it's false, growth stalls or reverses.
This is a control problem, not a slogan. Reinforcing loops must stay stronger than balancing loops over time.
Every feature, every skill, every CLI command serves this inequality:
| Feature | How It Helps |
|---|---|
/forge |
Increases I(t) — more knowledge in |
/inject |
Increases σ — better retrieval |
/vibe, /pre-mortem |
Filter bad work before it wastes cycles |
ao feedback |
Improves σ via utility learning |
| Tiering/pruning/re-indexing | Prevents limits-to-growth collapse in σ and φ |
| Ratchet chain | Prevents K from regressing |
| 40% rule | Keeps σ high by avoiding lost-in-middle |
The goal is the math, with explicit scale limits. The system is only "bulletproof" if we measure loop dominance and adapt controls as K grows.
References¶
Knowledge Decay¶
- Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations." Management Science, 41(11), 1750-1762.
- Ebbinghaus, H. (1885). Über das Gedächtnis. Leipzig: Duncker & Humblot.
Cognitive Science¶
- Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
- Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
- Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
- Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory." Current Directions in Psychological Science, 29(4), 394-398.
- Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
- Yerkes, R. M., & Dodson, J. D. (1908). "The relation of strength of stimulus to rapidity of habit-formation." Journal of Comparative Neurology and Psychology, 18(5), 459-482.
LLM Context¶
- Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
Memory-Augmented Learning¶
- Zhang, S., Wang, J., Zhou, R., Liao, J., Feng, Y., Zhang, W., Wen, Y., Li, Z., Xiong, F., Qi, Y., Tang, B., & Wen, M. (2025). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192
DevOps¶
- Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.
Systems Dynamics¶
- Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing.
- Meadows, D. H., Meadows, D. L., Randers, J., & Behrens, W. W. (1972). The Limits to Growth. Universe Books.
"The goal is the math. Everything else is implementation."