Skip to content

The Science Behind AgentOps

TL;DR: One model of knowledge decay (Darr 1995) suggests ~17%/week without reinforcement. Knowledge compounds when retrieval × usage beats decay and scale friction. Early growth can be exponential; long-run growth requires active limits-to-growth controls.


The Journey

This wasn't designed in a vacuum. It came from years of connecting dots across fields:

  1. Knowledge OS — The insight that git + files can become a durable bookkeeping substrate
  2. DevOps — The Three Ways applied to knowledge, not just code
  3. Cognitive Science — Why 40% load is optimal (35 years of research)
  4. MemRL — Reinforcement learning for retrieval and utility-scoring systems (2026)
  5. Thermodynamics — The Brownian Ratchet as progress model

Each piece validated the intuition. The math fell out naturally.

Claim Status (So We Don't Overclaim)

To make this model durable under critique, we separate claim types:

Tier Claim Type Standard of Evidence Examples in this doc
A Established external evidence Peer-reviewed or widely replicated findings Forgetting curves, cognitive load limits, lost-in-the-middle behavior
B Internal empirical evidence Reproducible internal measurements with clear methodology Time-to-resolution deltas, token-cost deltas, reuse-rate trends
C Working hypothesis Mechanistic proposal under active testing Ratchet model details, exact operating point for σ × ρ, cross-project transfer effects

This document mixes all three tiers. The goal is to make each claim explicit, measurable, and falsifiable.


Part 1: The Problem (With Evidence)

Knowledge Decays. Fast.

Citation: Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations: Productivity in Franchises." Management Science, 41(11), 1750-1762.

Finding: Organizational knowledge depreciates at approximately 17% per week without active reinforcement.

Text Only
Week 0: 100%
Week 1:  83%  (lost 17%)
Week 2:  69%  (lost another 17% of what remained)
Week 4:  47%
Week 8:  22%

Why this matters for AI: Every Claude session starts close to Week 0 unless the environment resurfaces prior work. Without bookkeeping plus retrieval, you're always on the left side of the decay curve.

The Forgetting Curve

Citation: Ebbinghaus, H. (1885). Über das Gedächtnis (Memory: A Contribution to Experimental Psychology).

Ebbinghaus discovered that memory decays exponentially without reinforcement, but each retrieval strengthens the memory and slows future decay.

Text Only
Memory Strength
    │
100%│╲
    │ ╲
 50%│  ╲______ retrieval here
    │         ╲_____ slows decay
 25%│              ╲____
    └─────────────────────────
         Time →

The insight: It's not about storing more. It's about retrieving at the right time.


Part 2: The Math (Plain English)

The Equation

Text Only
dK/dt = I(t) - δ·K + σ·ρ·K

Don't panic. Here's what each piece means:

Symbol What It Is Plain English Example
K Knowledge stock How much useful stuff you've accumulated "We have 156 learnings stored"
dK/dt Rate of change Is the pile growing or shrinking? "+5 learnings this week" or "-10 lost to decay"
I(t) Input rate New knowledge coming in "Forged 3 sessions today"
δ Decay rate How fast you forget (0.17/week) "17% gone each week if unused"
σ Retrieval coverage How much of the useful stock are you actually surfacing? "Surfaced 70% of retrievable artifacts"
ρ Decision influence rate Of what you surfaced, how much later had evidence-backed use? "30% of surfaced artifacts were referenced or applied"

Implementation note: In the CLI implementation (metrics_health.go), sigma measures unique surfaced retrievable artifacts / total retrievable artifacts over the last 10 sessions, rho measures the fraction of surfaced artifacts later evidenced by reference or applied citations, and delta measures average age of active learnings in days. The escape velocity check uses σ × ρ > δ/100 to normalize delta to a ratio comparable with sigma and rho.

Dimensional Check

To keep this scientifically defensible:

  • K is measured in useful knowledge units (for example: validated learnings).
  • I(t) is knowledge units per week.
  • δ, σ, and ρ are rates/probabilities per week or per retrieval cycle.
  • All terms in dK/dt resolve to knowledge units per week.

Breaking It Down

I(t) — The input. You forge transcripts, extract learnings, write retros. Knowledge goes in.

- δ·K — The decay. Every week, 17% of your knowledge becomes stale or forgotten. This is fighting against you.

+ σ·ρ·K — The compounding term. This is the magic.

When you retrieve knowledge (σ) and actually use it (ρ), two things happen: 1. That knowledge gets reinforced (Ebbinghaus) 2. New knowledge is created from the application

The ·K means it's proportional to how much you already have. More knowledge → more compounding.

System Dynamics Correction: Limits to Growth

The baseline equation is structurally correct, but idealized. In real systems, reinforcing loops hit balancing loops at scale.

A scale-aware form:

Text Only
dK/dt = I(t) - δ·K + σ(K,t)·ρ·K - φ·K²

where:
σ(K,t) = σ_max / (1 + (K / Kσ)^n)

Interpretation: - σ(K,t) declines as the corpus grows unless retrieval/index quality improves. - φ·K² captures scale friction: indexing overhead, latency, noise, governance cost, and cognitive overhead. - is the knowledge scale where retrieval starts degrading materially.

This adds the missing Limits to Growth balancing loop from System Dynamics.

The Escape Velocity Condition

Rearrange the equation at steady state:

Text Only
General growth condition:
I(t) + K(σ·ρ - δ) > 0

Operational self-sustaining check: σ × ρ > δ/100

With scale friction:
ρ·σ(K,t) > δ/100 + φ·K - I(t)/K

Meaning: early growth can be exponential, but long-run growth plateaus unless you actively prevent σ collapse and friction growth.

If your retrieval effectiveness times your citation rate exceeds 0.17 in self-sustaining mode, you're compounding. If not, you either need fresh input I(t) or stronger controls to avoid stagnation.

Scenario σ ρ σ × ρ vs δ Result
No bookkeeping or retrieval 0 0 0 < 0.17 Decaying
Store but don't retrieve 0.1 0.1 0.01 < 0.17 Decaying
Retrieve but don't use 0.8 0.1 0.08 < 0.17 Decaying
AgentOps target 0.7 0.3 0.21 > 0.17 Compounding

The 0.04 margin matters. Small edge, compounded over time, becomes massive.

Loop-Dominance Translation (System Dynamics)

This model is a stock-and-flow system with competing loops:

  • R1 reinforcing loop: retrieval -> usage -> stronger priors -> better future retrieval.
  • B1 balancing loop: decay/staleness drains the stock.
  • B2 balancing loop: scale friction reduces retrieval and increases operating cost as K grows.

Expected phases:

  1. Bootstrap phase: R1 > B1, rapid gains.
  2. Flywheel phase: R1 > B1 + B2, compounding with healthy margins.
  3. Saturation risk: B2 grows; gains flatten.
  4. Renewal phase: pruning, re-indexing, tiering, and stronger feedback push R1 back above balancing loops.

Part 3: DevOps Foundation (The Three Ways)

Citation: Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.

DevOps isn't about tools. It's about three principles:

The First Way: Flow

Optimize the flow of work from left to right (dev → ops → customer).

In AgentOps: Knowledge flows from sessions → forge → store → inject → next session.

Text Only
Session → Forge → Store → Inject → Session
            ↓
      (no bottlenecks)

We don't batch. We stream. Every session feeds the next.

The Second Way: Feedback

Create feedback loops at every stage.

In AgentOps: - /vibe validates code quality - /pre-mortem catches failures before they happen - ao feedback trains the utility scorer - Citation tracking shows what's actually used

Text Only
Action → Measurement → Learning → Adjustment
   ↑                                  │
   └──────────────────────────────────┘

The Third Way: Continuous Learning

Create a culture of experimentation and learning from failure.

In AgentOps: - /retro extracts learnings from every significant work - /post-mortem closes the loop on epics - Failures become learnings, not just incidents

Text Only
Failure → Retro → Learning → Pattern → Skill
                                         ↓
                               (never make same mistake)

The connection: DevOps optimizes code flow. AgentOps optimizes knowledge flow. Same principles, different domain.


Part 4: Cognitive Science (Why 40%)

The Research Stack

Researcher Year Finding Application
Miller 1956 Working memory holds 7±2 chunks Context windows have real limits
Cowan 2001 Core capacity is ~4 items Optimal load is lower than max
Sweller 1988 Cognitive Load Theory Three types of load compete
Paas & van Merriënboer 2020 Updated CLT JIT loading reduces extraneous load
Barkley 2015 Executive function limits Performance collapses at overload
Csikszentmihalyi 1990 Flow state Optimal challenge zone
Yerkes & Dodson 1908 Inverted-U performance curve Peak at moderate arousal
Liu et al. 2023 "Lost in the Middle" LLMs can't retrieve from crowded contexts

Citations:

  • Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
  • Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
  • Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
  • Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory: Methods to manage working memory load." Current Directions in Psychological Science, 29(4), 394-398.
  • Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.

The Pattern

Every study finds the same thing: performance peaks at moderate load.

Text Only
Performance
    │
100%│          ╭───╮
    │        ╭─╯   ╰─╮
 75%│      ╭─╯       ╰─╮
    │    ╭─╯           ╰─╮ collapse
 50%│  ╭─╯               ╰──────
    │╭─╯
 25%│
    └────────────────────────────────
    0%   20%   40%   60%   80%  100%
              Context Utilization

40% isn't arbitrary. It's where decades of research say performance lives.

Why This Matters for LLMs

Liu et al. (2023) showed that LLMs have a "lost in the middle" problem. When context is crowded: - Information at the start: retrieved well - Information at the end: retrieved well - Information in the middle: lost

Text Only
Retrieval Accuracy by Position:

Start │████████████████████│ High
      │                    │
Mid   │████████░░░░░░░░░░░░│ Low  ← "Lost in the middle"
      │                    │
End   │████████████████████│ High

The fix: Don't fill context to 100%. Stay at 40%. The middle stays findable.


Part 5: MemRL (Scale-Control Mechanism)

Citation: Zhang, S., Wang, J., Zhou, R., et al. (2025). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192

The Problem MemRL Solves

Traditional retrieval uses recency or similarity. But not all knowledge is equally useful.

Text Only
Traditional RAG:
  Query → Find similar → Return top K → Hope it helps

Problem: Recent ≠ Useful. Similar ≠ Helpful.

The MemRL Solution

Use reinforcement learning to learn what's actually useful:

Python
# Each piece of knowledge has a utility score
utility = 0.5  # Start neutral

# When retrieved and used successfully
utility = (1 - α) × utility + α × 1.0  # Reward

# When retrieved but not helpful
utility = (1 - α) × utility + α × 0.0  # Penalty

# Ranking combines freshness AND utility
score = z_norm(freshness) + λ × z_norm(utility)

The insight: The system learns from feedback. Over time, useful knowledge rises and noise sinks, which helps prevent σ from collapsing as K grows.

How We Use It

Bash
# User gives feedback
ao feedback L15 --reward 1.0   # "This learning was helpful"
ao feedback L12 --reward 0.0   # "This was irrelevant"

# System updates utility scores
# Future retrieval ranks by usefulness, not just recency

The math connection: MemRL is one practical control to keep σ(K,t) high enough to offset scale friction. It is a mechanism, not a guarantee.


Part 6: The Brownian Ratchet (Our Contribution)

The Physics

A Brownian ratchet is a thought experiment from thermodynamics:

  1. Molecules bounce randomly (thermal motion)
  2. A pawl allows motion in only one direction (one-way gate)
  3. Net result: forward movement from random chaos
Text Only
    Random Motion          One-Way Gate           Net Progress
         ↓                      ↓                      ↓
    ←→←→←→←→              ───────┤►              ──────────►
    (chaos)               (filter)               (ratchet)

The Software Analog

Physics Software Example
Random motion Multiple parallel attempts 4 polecats trying different approaches
One-way gate Validation gates Tests, CI, /vibe, /pre-mortem
Net forward movement Merged/locked progress Code in main, issues closed, learnings stored

Why This Model Matters

Traditional thinking: Minimize variance. One developer, one approach, careful steps.

Ratchet thinking: Maximize controlled variance. Many attempts, filter aggressively, lock successes.

Text Only
Traditional:
  ───────────────────────────────────► (slow, fragile)

Ratchet:
  ═══╦═══╦═══╦═══╗
  ═══╬═══╬═══╬═══╬════════════════════► (fast, resilient)
  ═══╩═══╩═══╩═══╝
       ↑
   some fail, most succeed
   failures are cheap

The Key Property

You can always add more chaos. You can't un-ratchet.

  • Failed experiment? Try another. (Chaos is cheap.)
  • Merged code? It is hard to regress accidentally. (Ratchet holds.)
  • Stored learning? It compounds if retrieval quality stays high. (Progress can lock, but scale can still add drag.)

This is why progress can be made one-way at the artifact level, while system-level growth still needs active scale management.


Part 7: Putting It All Together

The Full Picture

Text Only
┌─────────────────────────────────────────────────────────────────┐
│                     THE AGENTOPS SYSTEM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEVOPS LAYER (The Three Ways)                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Flow: Session → Forge → Store → Inject → Session        │    │
│  │ Feedback: Vibe, Pre-mortem, Citations, Utility scores   │    │
│  │ Learning: Retros, Post-mortems, Pattern extraction      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  COGNITIVE LAYER (40% Rule)                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Context utilization: 35% checkpoint, 40% alert          │    │
│  │ JIT loading: Load what's needed, when it's needed       │    │
│  │ Lost-in-middle prevention: Don't crowd the context      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  MEMRL LAYER (Utility Learning)                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Feedback loop: Use → Reward/Penalize → Update utility   │    │
│  │ Retrieval: Freshness + Utility scoring                  │    │
│  │ Result: σ (retrieval effectiveness) improves over time  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  RATCHET LAYER (Progress Locking)                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Chaos: Multiple attempts, parallel exploration          │    │
│  │ Filter: Validation gates (tests, vibe, CI)              │    │
│  │ Ratchet: Merge, close, store (permanent)                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  THE GOAL (Escape Velocity)                                     │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │              σ × ρ > δ/100                               │   │
│  │                                                          │   │
│  │    retrieval × evidence-backed use > aging threshold     │   │
│  │                                                          │   │
│  │    When true: KNOWLEDGE COMPOUNDS                        │   │
│  │                                                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Why Each Piece Matters

Layer What It Does Which Variable It Improves
DevOps Flow, feedback, learning I(t) — more knowledge in
Cognitive Optimal load σ — better retrieval
MemRL Utility learning σ — smarter retrieval
Scale controls (tiering/pruning/indexing) Limits-to-growth mitigation Holds σ(K,t) up and φ down
Ratchet Lock progress Prevents regression of K

Every layer serves the equation. The added constraint is explicit: long-run growth needs scale controls, not just early flywheel activation.


Part 8: Evidence (Internal Pilots, Not Causal Proof Yet)

What We've Measured So Far

These are internal observations and should be read as Tier B evidence (operational telemetry, not randomized causal proof):

Metric Baseline (internal) AgentOps condition (internal) Direction
Same-issue resolution 45 min 3 min Faster
Token cost per issue $2.40 $0.15 Lower
Context collapse rate ~65% at 60% load 0% at 40% load Lower
Knowledge reuse ~0% ~15% (growing) Higher

Why This Is Not Yet "Proof"

  • Baselines are historical, not fully randomized.
  • Team maturity and task mix can confound outcomes.
  • Some gains can come from process discipline independent of memory quality.

Evaluation Design to Make It Bulletproof

For causal confidence, run a controlled protocol:

  1. Randomize comparable tasks across conditions (memory-on, memory-off, memory-on + utility learning).
  2. Pre-register primary metrics: resolution time, token cost, defect rate, reuse precision@k, and citation rate ρ.
  3. Track estimated decay δ_t and retrieval effectiveness σ_t weekly.
  4. Segment by corpus size (K buckets) to detect limits-to-growth behavior.
  5. Require out-of-sample replication across projects, not just one team.

Falsifiable Predictions

The model should be treated as wrong if repeated experiments show:

  • ρ·σ(K,t) <= δ + φ·K - I(t)/K while performance still compounds.
  • Retrieval quality does not degrade with scale and no compensating controls are needed.
  • memory-on + utility learning does not outperform memory-on as K increases.
  • Gains disappear under simple task-randomized comparisons.

Part 9: Limits to Growth and Control Policy

System Dynamics says reinforcing loops eventually hit balancing loops. We model that explicitly and design around it.

Main Scale Risks

Risk Loop Effect Observable Symptom
Corpus bloat Lowers σ(K,t) Falling precision@k, more irrelevant recalls
Retrieval latency/cost Raises effective φ Slower sessions, rising token burn
Quality drift Raises effective δ More stale/contradictory learnings
Cognitive overload Lowers ρ Retrieved items cited less in final outputs

Control Actions (Operational)

Control Primary Variable Practical Mechanism
Tiering + archival φ down Keep hot set small, cold set searchable
Utility-based pruning σ up, δ down Remove low-value or stale memories
Re-index + embedding refresh σ up Improve retrieval quality as schema evolves
Citation incentives/UX ρ up Make reuse cheaper than re-derivation
Drift audits δ down Detect and repair stale knowledge clusters

Operating Rule

Track loop dominance continuously:

Text Only
health(t) = ρ·σ(K,t) - (δ + φ·K - I(t)/K)

If health(t) > 0, the system is in compounding mode. If health(t) <= 0, growth has hit limits and controls must be tightened.


Conclusion: The Goal Is The Math

Everything in AgentOps exists to achieve one thing:

Text Only
Operational check:
σ × ρ > δ/100

Scale-aware form:
ρ·σ(K,t) > δ + φ·K - I(t)/K

When this is true, knowledge compounds. When it's false, growth stalls or reverses.

This is a control problem, not a slogan. Reinforcing loops must stay stronger than balancing loops over time.

Every feature, every skill, every CLI command serves this inequality:

Feature How It Helps
/forge Increases I(t) — more knowledge in
/inject Increases σ — better retrieval
/vibe, /pre-mortem Filter bad work before it wastes cycles
ao feedback Improves σ via utility learning
Tiering/pruning/re-indexing Prevents limits-to-growth collapse in σ and φ
Ratchet chain Prevents K from regressing
40% rule Keeps σ high by avoiding lost-in-middle

The goal is the math, with explicit scale limits. The system is only "bulletproof" if we measure loop dominance and adapt controls as K grows.


References

Knowledge Decay

  • Darr, E. D., Argote, L., & Epple, D. (1995). "The Acquisition, Transfer, and Depreciation of Knowledge in Service Organizations." Management Science, 41(11), 1750-1762.
  • Ebbinghaus, H. (1885). Über das Gedächtnis. Leipzig: Duncker & Humblot.

Cognitive Science

  • Miller, G. A. (1956). "The magical number seven, plus or minus two." Psychological Review, 63(2), 81-97.
  • Cowan, N. (2001). "The magical number 4 in short-term memory." Behavioral and Brain Sciences, 24(1), 87-114.
  • Sweller, J. (1988). "Cognitive load during problem solving." Cognitive Science, 12(2), 257-285.
  • Paas, F., & van Merriënboer, J. J. (2020). "Cognitive-load theory." Current Directions in Psychological Science, 29(4), 394-398.
  • Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
  • Yerkes, R. M., & Dodson, J. D. (1908). "The relation of strength of stimulus to rapidity of habit-formation." Journal of Comparative Neurology and Psychology, 18(5), 459-482.

LLM Context

  • Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.

Memory-Augmented Learning

  • Zhang, S., Wang, J., Zhou, R., Liao, J., Feng, Y., Zhang, W., Wen, Y., Li, Z., Xiong, F., Qi, Y., Tang, B., & Wen, M. (2025). "MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory." arXiv:2601.03192. https://arxiv.org/abs/2601.03192

DevOps

  • Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press.

Systems Dynamics

  • Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing.
  • Meadows, D. H., Meadows, D. L., Randers, J., & Behrens, W. W. (1972). The Limits to Growth. Universe Books.

"The goal is the math. Everything else is implementation."