The Science Behind AgentOps¶

TL;DR: One model of knowledge decay (Darr et al. 1995) suggests ~17%/week without reinforcement. Knowledge compounds when retrieval × usage beats decay and scale friction. Early growth can be exponential; long-run growth requires active limits-to-growth controls.

The Journey¶

This wasn't designed in a vacuum. It came from years of connecting dots across fields:

Knowledge OS — The insight that git + files can become a durable bookkeeping substrate
DevOps — The Three Ways applied to knowledge, not just code
Cognitive Science — Why 40% load is optimal (35 years of research)
MemRL — Reinforcement learning for retrieval and utility-scoring systems (2026)
Thermodynamics — The Brownian Ratchet as progress model

Each piece validated the intuition. The math fell out naturally.

Claim Status (So We Don't Overclaim)¶

To make this model durable under critique, we separate claim types:

Tier	Claim Type	Standard of Evidence	Examples in this doc
A	Established external evidence	Peer-reviewed or widely replicated findings	Forgetting curves, cognitive load limits, lost-in-the-middle behavior
B	Internal empirical evidence	Reproducible internal measurements with clear methodology	Time-to-resolution deltas, token-cost deltas, reuse-rate trends
C	Working hypothesis	Mechanistic proposal under active testing	Ratchet model details, exact operating point for `σ × ρ`, cross-project transfer effects

This document mixes all three tiers. The goal is to make each claim explicit, measurable, and falsifiable.

Part 1: The Problem (With Evidence)¶

Knowledge Decays. Fast.¶

Citation: Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations: Productivity in Franchises." Management Science, 41(11), 1750-1762.

Finding: In the franchises studied, organizational knowledge depreciated at approximately 17% per week without active reinforcement.

Text Only

Week 0: 100%
Week 1:  83%  (lost 17%)
Week 2:  69%  (lost another 17% of what remained)
Week 4:  47%
Week 8:  23%

Why this matters for AI: Every Claude session starts close to Week 0 unless the environment resurfaces prior work. Without bookkeeping plus retrieval, you're always on the left side of the decay curve.
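
The table above is plain compounding: each week keeps 83% of what the previous week had. A minimal sketch, assuming the 17% weekly rate stays constant and nothing is reinforced (the function names are illustrative, not part of the CLI):

```python
import math

# Assumption: a constant 17% weekly depreciation and no reinforcement at all.
WEEKLY_DECAY = 0.17


def retained_fraction(weeks: float, decay: float = WEEKLY_DECAY) -> float:
    """Fraction of the original knowledge stock left after `weeks` weeks."""
    return (1.0 - decay) ** weeks


def half_life_weeks(decay: float = WEEKLY_DECAY) -> float:
    """Weeks until half of the stock is gone under constant decay."""
    return math.log(0.5) / math.log(1.0 - decay)


for week in (0, 1, 2, 4, 8):
    print(f"Week {week}: {retained_fraction(week):.1%}")
print(f"Half-life: {half_life_weeks():.1f} weeks")
```

Under these assumptions the half-life works out to about 3.7 weeks, so more than half of an unreinforced stock is gone within a month.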

The Forgetting Curve¶

Citation: Ebbinghaus, H. (1885). Über das Gedächtnis (Memory: A Contribution to Experimental Psychology).

Ebbinghaus discovered that memory decays exponentially without reinforcement, but each retrieval strengthens the memory and slows future decay.

Text Only

Memory Strength
    │
100%│╲
    │ ╲
 50%│  ╲______ retrieval here
    │         ╲_____ slows decay
 25%│              ╲____
    └─────────────────────────
         Time →

The insight: It's not about storing more. It's about retrieving at the right time.

Part 2: The Math (Plain English)¶

The Equation¶

Text Only

dK/dt = I(t) - δ·K + σ·ρ·K

Don't panic. Here's what each piece means:

Symbol	What It Is	Plain English	Example
`K`	Knowledge stock	How much useful stuff you've accumulated	"We have 156 learnings stored"
`dK/dt`	Rate of change	Is the pile growing or shrinking?	"+5 learnings this week" or "-10 lost to decay"
`I(t)`	Input rate	New knowledge coming in	"Forged 3 sessions today"
`δ`	Decay rate	How fast you forget (0.17/week)	"17% gone each week if unused"
`σ`	Retrieval coverage	How much of the useful stock are you actually surfacing?	"Surfaced 70% of retrievable artifacts"
`ρ`	Decision influence rate	Of what you surfaced, how much later had evidence-backed use?	"30% of surfaced artifacts were referenced or applied"

Implementation note: In the CLI implementation (metrics_health.go), the three quantities are measured as operational proxies. sigma is the number of unique surfaced retrievable artifacts divided by the total retrievable artifacts over the last 10 sessions. rho is the fraction of surfaced artifacts later evidenced by reference or applied citations. delta is the average age of active learnings in days, not the weekly decay rate used in the table above. The escape velocity check uses σ × ρ > δ/100, which scales the age in days down to a ratio comparable with sigma and rho.
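
The note above can be read as three small ratio computations. This is a hypothetical Python sketch of that description, not the Go code in metrics_health.go; the `HealthInputs` fields and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthInputs:
    """Hypothetical inputs gathered over the last 10 sessions."""
    surfaced: int        # unique retrievable artifacts that were surfaced
    total: int           # all retrievable artifacts
    evidenced: int       # surfaced artifacts later referenced or applied
    avg_age_days: float  # average age of active learnings, in days


def sigma(m: HealthInputs) -> float:
    """Retrieval coverage: surfaced / total (0.0 when the store is empty)."""
    return m.surfaced / m.total if m.total else 0.0


def rho(m: HealthInputs) -> float:
    """Decision influence: evidenced / surfaced (0.0 when nothing surfaced)."""
    return m.evidenced / m.surfaced if m.surfaced else 0.0


def escape_velocity(m: HealthInputs) -> bool:
    """The operational check: sigma * rho > delta / 100, with delta in days."""
    return sigma(m) * rho(m) > m.avg_age_days / 100


sample = HealthInputs(surfaced=70, total=100, evidenced=21, avg_age_days=17.0)
print(sigma(sample), rho(sample), escape_velocity(sample))
```

Guarding the divisions keeps an empty store from raising an error; treating that case as 0.0 is a design choice of this sketch, not something the source specifies.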

Dimensional Check¶

To keep this scientifically defensible:

K is measured in useful knowledge units (for example: validated learnings).
I(t) is knowledge units per week.
δ is a rate per week. σ and ρ are dimensionless fractions measured per week or per retrieval cycle, so their product σ·ρ acts as a per-week reinforcement rate.
All terms in dK/dt resolve to knowledge units per week.

Breaking It Down¶

I(t) — The input. You forge transcripts, extract learnings, write retros. Knowledge goes in.

- δ·K — The decay. Every week, 17% of your knowledge becomes stale or forgotten. This is fighting against you.

+ σ·ρ·K — The compounding term. This is the magic.

When you retrieve knowledge (σ) and actually use it (ρ), two things happen:

1. That knowledge gets reinforced (Ebbinghaus)
2. New knowledge is created from the application

The ·K means it's proportional to how much you already have. More knowledge → more compounding.
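
To see the three terms interact, the baseline equation can be stepped forward numerically. A minimal sketch using forward Euler integration with a constant input rate; the starting stock, time step, and parameter values are illustrative assumptions, not measurements:

```python
import math


def simulate(k0, weeks, inflow, delta, sigma, rho, dt=0.01):
    """Integrate dK/dt = I - delta*K + sigma*rho*K with forward Euler steps."""
    k = k0
    for _ in range(round(weeks / dt)):
        k += (inflow - delta * k + sigma * rho * k) * dt
    return k


def closed_form_no_inflow(k0, weeks, delta, sigma, rho):
    """Exact solution when I(t) = 0: K(t) = K0 * exp((sigma*rho - delta) * t)."""
    return k0 * math.exp((sigma * rho - delta) * weeks)


# Illustrative values: 100 learnings, no new input, one year.
growing = simulate(100, 52, 0.0, delta=0.17, sigma=0.7, rho=0.3)
shrinking = simulate(100, 52, 0.0, delta=0.17, sigma=0.8, rho=0.1)
print(f"sigma*rho = 0.21: {growing:.0f} learnings")
print(f"sigma*rho = 0.08: {shrinking:.1f} learnings")
```

With no input, the sign of σ·ρ - δ alone decides whether the stock grows or shrinks, which is why the ·K factor matters.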

System Dynamics Correction: Limits to Growth¶

The baseline equation is structurally correct, but idealized. In real systems, reinforcing loops hit balancing loops at scale.

A scale-aware form:

Text Only

dK/dt = I(t) - δ·K + σ(K,t)·ρ·K - φ·K²

where:
σ(K,t) = σ_max / (1 + (K / Kσ)^n)

Interpretation:

σ(K,t) declines as the corpus grows unless retrieval/index quality improves.
φ·K² captures scale friction: indexing overhead, latency, noise, governance cost, and cognitive overhead.
Kσ is the knowledge scale where retrieval starts degrading materially.

This adds the missing Limits to Growth balancing loop from System Dynamics.
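
The scale-aware form is easy to probe numerically. The sketch below evaluates σ(K) and the right-hand side of the equation at a few corpus sizes; every parameter value (σ_max, Kσ, n, φ, and so on) is an illustrative assumption chosen only to show the sign flip:

```python
def sigma_at_scale(k, sigma_max, k_sigma, n):
    """sigma(K) = sigma_max / (1 + (K / K_sigma)^n); equals sigma_max / 2 at K = K_sigma."""
    return sigma_max / (1.0 + (k / k_sigma) ** n)


def dk_dt(k, inflow, delta, rho, phi, sigma_max, k_sigma, n):
    """Right-hand side of the scale-aware equation."""
    s = sigma_at_scale(k, sigma_max, k_sigma, n)
    return inflow - delta * k + s * rho * k - phi * k ** 2


# Illustrative parameters only; none of these values come from measurements.
params = dict(inflow=5.0, delta=0.17, rho=0.3, phi=1e-5,
              sigma_max=0.9, k_sigma=1000.0, n=2)
for k in (100.0, 500.0, 2000.0):
    print(f"K={k:>6.0f}  sigma={sigma_at_scale(k, 0.9, 1000.0, 2):.3f}  "
          f"dK/dt={dk_dt(k, **params):+.1f}")
```

With these numbers the stock grows at K = 100 and K = 500 but shrinks at K = 2000: the balancing loop has overtaken the reinforcing one somewhere in between.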

The Escape Velocity Condition¶

Require dK/dt > 0 and rearrange:

Text Only

General growth condition (baseline equation):
I(t) + K(σ·ρ - δ) > 0

Self-sustaining check (I(t) = 0): σ × ρ > δ
Operational (CLI) form, with δ measured as average age in days: σ × ρ > δ/100

With scale friction (divide by K):
ρ·σ(K,t) > δ + φ·K - I(t)/K

Meaning: early growth can be exponential, but long-run growth plateaus unless you actively prevent σ collapse and friction growth.

If σ × ρ exceeds δ (0.17 here) in self-sustaining mode, where I(t) = 0, you're compounding. If not, you either need fresh input I(t) or stronger controls to avoid stagnation.

Scenario	σ	ρ	σ × ρ	vs δ	Result
No bookkeeping or retrieval	0	0	0	< 0.17	Decaying
Store but don't retrieve	0.1	0.1	0.01	< 0.17	Decaying
Retrieve but don't use	0.8	0.1	0.08	< 0.17	Decaying
AgentOps target	0.7	0.3	0.21	> 0.17	Compounding

The 0.04 margin matters. Small edge, compounded over time, becomes massive.
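
The scenario table and the margin claim can be checked with a few lines. A sketch assuming I(t) = 0 and no scale friction, so the yearly multiple is an idealized upper bound rather than a forecast; `classify` and `yearly_factor` are names introduced here:

```python
import math

DELTA = 0.17  # weekly decay rate used in the table

SCENARIOS = {
    "No bookkeeping or retrieval": (0.0, 0.0),
    "Store but don't retrieve": (0.1, 0.1),
    "Retrieve but don't use": (0.8, 0.1),
    "AgentOps target": (0.7, 0.3),
}


def classify(sigma, rho, delta=DELTA):
    """Self-sustaining check with I(t) = 0: compounding only if sigma*rho > delta."""
    return "Compounding" if sigma * rho > delta else "Decaying"


def yearly_factor(sigma, rho, delta=DELTA, weeks=52):
    """Growth multiple after `weeks`, assuming I(t) = 0 and no scale friction."""
    return math.exp((sigma * rho - delta) * weeks)


for name, (s, r) in SCENARIOS.items():
    print(f"{name:30s} {s * r:.2f}  {classify(s, r):12s} x{yearly_factor(s, r):.2f}/yr")
```

At the target operating point the net rate is 0.04 per week, which compounds to roughly 8x over 52 weeks under these idealized assumptions.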

Loop-Dominance Translation (System Dynamics)¶

This model is a stock-and-flow system with competing loops:

R1 reinforcing loop: retrieval -> usage -> stronger priors -> better future retrieval.
B1 balancing loop: decay/staleness drains the stock.
B2 balancing loop: scale friction reduces retrieval and increases operating cost as K grows.

Expected phases:

Bootstrap phase: R1 > B1, rapid gains.
Flywheel phase: R1 > B1 + B2, compounding with healthy margins.
Saturation risk: B2 grows; gains flatten.
Renewal phase: pruning, re-indexing, tiering, and stronger feedback push R1 back above balancing loops.

Part 3: DevOps Foundation (The Three Ways)¶

Citation: Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.

DevOps isn't about tools. It's about three principles:

The First Way: Flow¶

Optimize the flow of work from left to right (dev → ops → customer).

In AgentOps: Knowledge flows from sessions → forge → store → inject → next session.

Text Only

Session → Forge → Store → Inject → Session
            ↓
      (no bottlenecks)

We don't batch. We stream. Every session feeds the next.

The Second Way: Feedback¶

Create feedback loops at every stage.

In AgentOps:

/vibe validates code quality
/pre-mortem catches failures before they happen
ao feedback trains the utility scorer
Citation tracking shows what's actually used

Text Only

Action → Measurement → Learning → Adjustment
   ↑                                  │
   └──────────────────────────────────┘

The Third Way: Continuous Learning¶

Create a culture of experimentation and learning from failure.

In AgentOps:

/retro extracts learnings from every significant piece of work
/post-mortem closes the loop on epics
Failures become learnings, not just incidents

Text Only

Failure → Retro → Learning → Pattern → Skill
                                         ↓
                               (never make same mistake)

The connection: DevOps optimizes code flow. AgentOps optimizes knowledge flow. Same principles, different domain.

Part 4: Cognitive Science (Why 40%)¶

The Research Stack¶

Researcher	Year	Finding	Application
Miller	1956	Working memory holds 7±2 chunks	Context windows have real limits
Cowan	2001	Core capacity is ~4 items	Optimal load is lower than max
Sweller	1988	Cognitive Load Theory	Three types of load compete
Paas & van Merriënboer	2020	Updated CLT	JIT loading reduces extraneous load
Barkley	2015	Executive function limits	Performance collapses at overload
Csikszentmihalyi	1990	Flow state	Optimal challenge zone
Yerkes & Dodson	1908	Inverted-U performance curve	Peak at moderate arousal
Liu et al.	2023	"Lost in the Middle"	LLMs retrieve mid-context information less reliably in long contexts

Citations:

Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory: Methods to manage working memory load." Current Directions in Psychological Science, 29(4), 394-398.
Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.

The Pattern¶

These studies point the same way: performance peaks at moderate load and degrades under overload.

Text Only

Performance
    │
100%│          ╭───╮
    │        ╭─╯   ╰─╮
 75%│      ╭─╯       ╰─╮
    │    ╭─╯           ╰─╮ collapse
 50%│  ╭─╯               ╰──────
    │╭─╯
 25%│
    └────────────────────────────────
    0%   20%   40%   60%   80%  100%
              Context Utilization

40% isn't arbitrary. It sits in the moderate-load zone that decades of research point to.

Why This Matters for LLMs¶

Liu et al. (2023) showed that LLMs have a "lost in the middle" problem. When context is crowded:

Information at the start: retrieved well
Information at the end: retrieved well
Information in the middle: lost

Text Only

Retrieval Accuracy by Position:

Start │████████████████████│ High
      │                    │
Mid   │████████░░░░░░░░░░░░│ Low  ← "Lost in the middle"
      │                    │
End   │████████████████████│ High

The fix: Don't fill context to 100%. Stay at 40%. The middle stays findable.
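
In practice the rule reduces to a budget check on the context window. A minimal sketch using the 35% checkpoint and 40% alert thresholds from the system diagram in Part 7; the function name and return labels are assumptions, not part of the CLI:

```python
CHECKPOINT = 0.35  # fraction of the window at which to checkpoint
ALERT = 0.40       # fraction of the window at which to alert


def context_status(tokens_used: int, window_size: int) -> str:
    """Classify context utilization as 'ok', 'checkpoint', or 'alert'."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    utilization = tokens_used / window_size
    if utilization >= ALERT:
        return "alert"
    if utilization >= CHECKPOINT:
        return "checkpoint"
    return "ok"


# Example with a hypothetical 200,000-token window.
for used in (50_000, 72_000, 90_000):
    print(used, context_status(used, 200_000))
```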

Part 5: MemRL (Scale-Control Mechanism)¶

Citation: Zhang, S., Wang, J., Zhou, R., et al. (2026). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192

The Problem MemRL Solves¶

Traditional retrieval uses recency or similarity. But not all knowledge is equally useful.

Text Only

Traditional RAG:
  Query → Find similar → Return top K → Hope it helps

Problem: Recent ≠ Useful. Similar ≠ Helpful.

The MemRL Solution¶

Use reinforcement learning to learn what's actually useful:

Python

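from statistics import mean, pstdev

ALPHA = 0.1   # learning rate α (illustrative value, not from the paper)
LAMBDA = 1.0  # utility weight λ in ranking (illustrative value)

def update_utility(utility, reward, alpha=ALPHA):
    """Move utility toward the reward: 1.0 = helpful, 0.0 = not helpful."""
    return (1 - alpha) * utility + alpha * reward

def z_norm(values):
    """Standardize to zero mean, unit variance (all zeros if no spread)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def rank_scores(freshness, utility, lam=LAMBDA):
    """Ranking combines freshness AND utility."""
    return [f + lam * u for f, u in zip(z_norm(freshness), z_norm(utility))]

utility = 0.5                           # each piece of knowledge starts neutral
utility = update_utility(utility, 1.0)  # retrieved and used successfully: reward
utility = update_utility(utility, 0.0)  # retrieved but not helpful: penalty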

The insight: The system learns from feedback. Over time, useful knowledge rises and noise sinks, which helps prevent σ from collapsing as K grows.

How We Use It¶

Bash

# User gives feedback
ao feedback L15 --reward 1.0   # "This learning was helpful"
ao feedback L12 --reward 0.0   # "This was irrelevant"

# System updates utility scores
# Future retrieval ranks by usefulness, not just recency

The math connection: MemRL is one practical control to keep σ(K,t) high enough to offset scale friction. It is a mechanism, not a guarantee.
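
A small simulation shows how repeated feedback can reorder retrieval. This is a sketch of the update and ranking rules described above, not the `ao` implementation; the learning rate, the weight λ = 2.0, and the freshness values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def ema_update(utility, reward, alpha=0.2):
    """Move utility toward the reward (alpha is an illustrative learning rate)."""
    return (1 - alpha) * utility + alpha * reward


def z_norm(values):
    """Standardize values; return all zeros when there is no spread."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def rank(freshness, utility, lam=2.0):
    """Return ids sorted best-first by z_norm(freshness) + lam * z_norm(utility)."""
    ids = list(freshness)
    f = z_norm([freshness[i] for i in ids])
    u = z_norm([utility[i] for i in ids])
    scores = {i: fi + lam * ui for i, fi, ui in zip(ids, f, u)}
    return sorted(ids, key=scores.get, reverse=True)


freshness = {"L12": 0.9, "L20": 0.5, "L15": 0.2}  # L12 is newest, L15 oldest
utility = {"L12": 0.5, "L20": 0.5, "L15": 0.5}    # everything starts neutral
before = rank(freshness, utility)

for _ in range(5):                                    # five rounds of feedback
    utility["L15"] = ema_update(utility["L15"], 1.0)  # helpful each time
    utility["L12"] = ema_update(utility["L12"], 0.0)  # irrelevant each time
after = rank(freshness, utility)
print(before, after)
```

Before any feedback the ranking follows freshness alone; after five rounds the old but useful learning moves to the top and the fresh but unhelpful one drops to the bottom.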

Part 6: The Brownian Ratchet (Our Contribution)¶

The Physics¶

A Brownian ratchet is a thought experiment from thermodynamics:

Molecules bounce randomly (thermal motion)
A pawl allows motion in only one direction (one-way gate)
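Net result: forward movement from random chaos (in the physical system this requires an outside energy source; at thermal equilibrium a ratchet yields no net motion)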

Text Only

    Random Motion          One-Way Gate           Net Progress
         ↓                      ↓                      ↓
    ←→←→←→←→              ───────┤►              ──────────►
    (chaos)               (filter)               (ratchet)

The Software Analog¶

Physics	Software	Example
Random motion	Multiple parallel attempts	4 polecats trying different approaches
One-way gate	Validation gates	Tests, CI, /vibe, /pre-mortem
Net forward movement	Merged/locked progress	Code in main, issues closed, learnings stored

Why This Model Matters¶

Traditional thinking: Minimize variance. One developer, one approach, careful steps.

Ratchet thinking: Maximize controlled variance. Many attempts, filter aggressively, lock successes.

Text Only

Traditional:
  ───────────────────────────────────► (slow, fragile)

Ratchet:
  ═══╦═══╦═══╦═══╗
  ═══╬═══╬═══╬═══╬════════════════════► (fast, resilient)
  ═══╩═══╩═══╩═══╝
       ↑
   some fail, most succeed
   failures are cheap

The Key Property¶

You can always add more chaos. You can't un-ratchet.

Failed experiment? Try another. (Chaos is cheap.)
Merged code? It is hard to regress accidentally. (Ratchet holds.)
Stored learning? It compounds if retrieval quality stays high. (Progress can lock, but scale can still add drag.)

This is why progress can be made one-way at the artifact level, while system-level growth still needs active scale management.
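
The chaos, filter, ratchet cycle can be sketched as a search loop that only ever locks in validated improvements. This is a toy illustration under assumed names (`ratchet`, `propose`, `passes_gate`, `score`); it stands in for parallel attempts and validation gates, not for any specific AgentOps command.

```python
import random


def ratchet(start, propose, passes_gate, score, rounds=20, attempts=4, seed=0):
    """Keep the best validated candidate; never accept a regression."""
    rng = random.Random(seed)
    locked = start
    history = [score(locked)]
    for _ in range(rounds):
        # Chaos: several independent attempts branching from the locked state.
        candidates = [propose(locked, rng) for _ in range(attempts)]
        # Filter: only candidates that pass the gate are eligible.
        valid = [c for c in candidates if passes_gate(c)]
        # Ratchet: lock a candidate only if it beats what is already locked.
        best = max(valid, key=score, default=locked)
        if score(best) > score(locked):
            locked = best
        history.append(score(locked))
    return locked, history


# Toy problem (purely illustrative): walk x toward 3.0 inside the range [0, 10].
final, history = ratchet(
    start=0.0,
    propose=lambda x, rng: x + rng.uniform(-1.0, 1.0),
    passes_gate=lambda x: 0.0 <= x <= 10.0,
    score=lambda x: -(x - 3.0) ** 2,
)
print(f"final x = {final:.2f}")
```

The score history never decreases: individual attempts can fail freely, but the locked state only moves forward.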

Part 7: Putting It All Together¶

The Full Picture¶

Text Only

┌─────────────────────────────────────────────────────────────────┐
│                     THE AGENTOPS SYSTEM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEVOPS LAYER (The Three Ways)                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Flow: Session → Forge → Store → Inject → Session        │    │
│  │ Feedback: Vibe, Pre-mortem, Citations, Utility scores   │    │
│  │ Learning: Retros, Post-mortems, Pattern extraction      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  COGNITIVE LAYER (40% Rule)                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Context utilization: 35% checkpoint, 40% alert          │    │
│  │ JIT loading: Load what's needed, when it's needed       │    │
│  │ Lost-in-middle prevention: Don't crowd the context      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  MEMRL LAYER (Utility Learning)                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Feedback loop: Use → Reward/Penalize → Update utility   │    │
│  │ Retrieval: Freshness + Utility scoring                  │    │
│  │ Result: σ (retrieval effectiveness) improves over time  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  RATCHET LAYER (Progress Locking)                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Chaos: Multiple attempts, parallel exploration          │    │
│  │ Filter: Validation gates (tests, vibe, CI)              │    │
│  │ Ratchet: Merge, close, store (permanent)                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  THE GOAL (Escape Velocity)                                     │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │              σ × ρ > δ/100                               │   │
│  │                                                          │   │
│  │    retrieval × evidence-backed use > aging threshold     │   │
│  │                                                          │   │
│  │    When true: KNOWLEDGE COMPOUNDS                        │   │
│  │                                                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Why Each Piece Matters¶

Layer	What It Does	Which Variable It Improves
DevOps	Flow, feedback, learning	`I(t)` — more knowledge in
Cognitive	Optimal load	`σ` — better retrieval
MemRL	Utility learning	`σ` — smarter retrieval
Scale controls (tiering/pruning/indexing)	Limits-to-growth mitigation	Holds `σ(K,t)` up and `φ` down
Ratchet	Lock progress	Prevents regression of `K`

Every layer serves the equation. The added constraint is explicit: long-run growth needs scale controls, not just early flywheel activation.

Part 8: Evidence (Internal Pilots, Not Causal Proof Yet)¶

What We've Measured So Far¶

These are internal observations and should be read as Tier B evidence (operational telemetry, not randomized causal proof):

Metric	Baseline (internal)	AgentOps condition (internal)	Direction
Same-issue resolution	45 min	3 min	Faster
Token cost per issue	$2.40	$0.15	Lower
Context collapse rate	~65% at 60% load	0% at 40% load	Lower
Knowledge reuse	~0%	~15% (growing)	Higher

Why This Is Not Yet "Proof"¶

Baselines are historical, not fully randomized.
Team maturity and task mix can confound outcomes.
Some gains can come from process discipline independent of memory quality.

Evaluation Design to Make It Bulletproof¶

For causal confidence, run a controlled protocol:

Randomize comparable tasks across conditions (memory-on, memory-off, memory-on + utility learning).
Pre-register primary metrics: resolution time, token cost, defect rate, reuse precision@k, and citation rate ρ.
Track estimated decay δ_t and retrieval effectiveness σ_t weekly.
Segment by corpus size (K buckets) to detect limits-to-growth behavior.
Require out-of-sample replication across projects, not just one team.
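
The first two steps of the protocol above can be sketched directly: reproducible random assignment across the three conditions, plus one of the pre-registered metrics. The helper names and the seed are assumptions for illustration.

```python
import random
from collections import Counter

CONDITIONS = ("memory-off", "memory-on", "memory-on + utility learning")


def assign_conditions(task_ids, seed):
    """Shuffle tasks, then deal them round-robin so group sizes differ by at most one."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    return {task: CONDITIONS[i % len(CONDITIONS)] for i, task in enumerate(shuffled)}


def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in retrieved[:k] if item in relevant) / k


assignment = assign_conditions([f"task-{n}" for n in range(12)], seed=42)
print(Counter(assignment.values()))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))
```

Dealing round-robin after the shuffle keeps the groups balanced, which simple per-task coin flips would not guarantee on small samples.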

Falsifiable Predictions¶

The model should be treated as wrong if repeated experiments show:

ρ·σ(K,t) <= δ + φ·K - I(t)/K while performance still compounds.
Retrieval quality does not degrade with scale and no compensating controls are needed.
memory-on + utility learning does not outperform memory-on as K increases.
Gains disappear under simple task-randomized comparisons.

Part 9: Limits to Growth and Control Policy¶

System Dynamics says reinforcing loops eventually hit balancing loops. We model that explicitly and design around it.

Main Scale Risks¶

Risk	Loop Effect	Observable Symptom
Corpus bloat	Lowers `σ(K,t)`	Falling precision@k, more irrelevant recalls
Retrieval latency/cost	Raises effective `φ`	Slower sessions, rising token burn
Quality drift	Raises effective `δ`	More stale/contradictory learnings
Cognitive overload	Lowers `ρ`	Retrieved items cited less in final outputs

Control Actions (Operational)¶

Control	Primary Variable	Practical Mechanism
Tiering + archival	`φ` down	Keep hot set small, cold set searchable
Utility-based pruning	`σ` up, `δ` down	Remove low-value or stale memories
Re-index + embedding refresh	`σ` up	Improve retrieval quality as schema evolves
Citation incentives/UX	`ρ` up	Make reuse cheaper than re-derivation
Drift audits	`δ` down	Detect and repair stale knowledge clusters
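
As one example of how the first two controls might combine, the sketch below splits a corpus into a small hot set, a searchable cold set, and a pruned set. The `Learning` record, the thresholds, and the rule "prune only when both low-utility and stale" are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Learning:
    id: str
    utility: float  # learned usefulness score in [0, 1]
    age_days: int


def tier(learnings, hot_size=2, prune_below=0.2, stale_after_days=90):
    """Split learnings into (hot, cold, pruned) lists."""
    pruned = [item for item in learnings
              if item.utility < prune_below and item.age_days > stale_after_days]
    kept = [item for item in learnings if item not in pruned]
    ranked = sorted(kept, key=lambda item: item.utility, reverse=True)
    return ranked[:hot_size], ranked[hot_size:], pruned


corpus = [
    Learning("L1", 0.9, 10),
    Learning("L2", 0.1, 200),  # low utility and stale: pruned
    Learning("L3", 0.6, 40),
    Learning("L4", 0.1, 5),    # low utility but recent: kept in the cold tier
]
hot, cold, pruned = tier(corpus)
print([x.id for x in hot], [x.id for x in cold], [x.id for x in pruned])
```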

Operating Rule¶

Track loop dominance continuously:

Text Only

health(t) = ρ·σ(K,t) - (δ + φ·K - I(t)/K)

If health(t) > 0, the system is in compounding mode. If health(t) <= 0, growth has hit limits and controls must be tightened.
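
The operating rule translates into a one-line function. A sketch with illustrative inputs; in the second snapshot σ is assumed lower because the corpus is ten times larger:

```python
def health(k, inflow, delta, rho, sigma, phi):
    """health(t) = rho*sigma - (delta + phi*K - I/K); requires K > 0."""
    if k <= 0:
        raise ValueError("K must be positive")
    return rho * sigma - (delta + phi * k - inflow / k)


def mode(h):
    """Positive health means compounding; otherwise controls must be tightened."""
    return "compounding" if h > 0 else "tighten controls"


# Illustrative snapshots; none of these values come from measurements.
small = health(k=500, inflow=5, delta=0.17, rho=0.3, sigma=0.7, phi=1e-5)
large = health(k=5000, inflow=5, delta=0.17, rho=0.3, sigma=0.5, phi=1e-5)
print(f"K=500:  {small:+.3f} -> {mode(small)}")
print(f"K=5000: {large:+.3f} -> {mode(large)}")
```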

Conclusion: The Goal Is The Math¶

Everything in AgentOps exists to achieve one thing:

Text Only

Operational check:
σ × ρ > δ/100

Scale-aware form:
ρ·σ(K,t) > δ + φ·K - I(t)/K

When this is true, knowledge compounds. When it's false, growth stalls or reverses.

This is a control problem, not a slogan. Reinforcing loops must stay stronger than balancing loops over time.

Every feature, every skill, every CLI command serves this inequality:

Feature	How It Helps
`/forge`	Increases `I(t)` — more knowledge in
`/inject`	Increases `σ` — better retrieval
`/vibe`, `/pre-mortem`	Filter bad work before it wastes cycles
`ao feedback`	Improves `σ` via utility learning
Tiering/pruning/re-indexing	Prevents limits-to-growth collapse in `σ` and `φ`
Ratchet chain	Prevents `K` from regressing
40% rule	Keeps `σ` high by avoiding lost-in-middle

The goal is the math, with explicit scale limits. The system is only "bulletproof" if we measure loop dominance and adapt controls as K grows.

References¶

Knowledge Decay¶

Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations: Productivity in Franchises." Management Science, 41(11), 1750-1762.
Ebbinghaus, H. (1885). Über das Gedächtnis. Leipzig: Duncker & Humblot.

Cognitive Science¶

Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory: Methods to manage working memory load." Current Directions in Psychological Science, 29(4), 394-398.
Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
Yerkes, R. M., & Dodson, J. D. (1908). "The relation of strength of stimulus to rapidity of habit-formation." Journal of Comparative Neurology and Psychology, 18(5), 459-482.

LLM Context¶

Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.

Memory-Augmented Learning¶

Zhang, S., Wang, J., Zhou, R., Liao, J., Feng, Y., Zhang, W., Wen, Y., Li, Z., Xiong, F., Qi, Y., Tang, B., & Wen, M. (2026). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192

DevOps¶

Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.

Systems Dynamics¶

Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing.
Meadows, D. H., Meadows, D. L., Randers, J., & Behrens, W. W. (1972). The Limits to Growth. Universe Books.

"The goal is the math. Everything else is implementation."