AgentOps Incident Runbook — Consumer Recovery¶

Audience: Anyone responding to a broken AgentOps installation. Assumption: You are stressed and need copy-pasteable commands. Each section is self-contained.

Table of Contents¶

Emergency Kill Switches
Scenario A: Broken Skills After Update
Scenario B: Evolve Pushed Bad Code to Main
Scenario C: Hook Scripts Fail at Session Start
Rollback Options
Root Cause Analysis
Prevention Checklist

1. Emergency Kill Switches¶

Do these FIRST if sessions are broken. Restore functionality, then investigate.

Bash

# Disable ALL AgentOps hooks globally (instant, affects all sessions)
export AGENTOPS_HOOKS_DISABLED=1

# Stop evolve from running (persistent across sessions)
mkdir -p ~/.config/evolve
echo "incident $(date -Iseconds)" > ~/.config/evolve/KILL

To make the hook disable persistent, add to your shell profile:

Bash

echo 'export AGENTOPS_HOOKS_DISABLED=1' >> ~/.zshrc

Per-Hook Kill Switches¶

If you know which hook is broken, disable just that one:

Hook	Kill Switch Env Var
session-start.sh	`AGENTOPS_SESSION_START_DISABLED=1`
pre-mortem-gate.sh	`AGENTOPS_SKIP_PRE_MORTEM_GATE=1`
task-validation-gate.sh	`AGENTOPS_TASK_VALIDATION_DISABLED=1`
pending-cleaner.sh	`AGENTOPS_PENDING_CLEANER_DISABLED=1`
precompact-snapshot.sh	`AGENTOPS_PRECOMPACT_DISABLED=1`
All hooks (global)	`AGENTOPS_HOOKS_DISABLED=1`

Every hook script checks AGENTOPS_HOOKS_DISABLED=1 at the top and exits immediately if set.

2. Scenario A: Broken Skills After Update¶

Symptom: Consumer ran the install script and now Claude sessions are broken — hooks error, skills don't load, or sessions hang at start.

Triage (< 5 min)¶

Bash

# 1. Kill hooks immediately
export AGENTOPS_HOOKS_DISABLED=1

# 2. Check what version was installed
cat ~/.claude/skills/agentops/plugin.json 2>/dev/null | jq -r '.version'
# Or check the marketplace cache
cat ~/.claude/plugins/marketplaces/agentops-marketplace/plugin.json 2>/dev/null | jq -r '.version'

# 3. Check if skills are symlinks (known failure mode)
ls -la ~/.claude/skills/ | head -20

# 4. Test a hook manually
bash ~/.claude/skills/agentops/hooks/session-start.sh

Fix: Reinstall from a known-good version¶

Bash

# Remove broken installation
rm -rf ~/.claude/skills/agentops

# Remove any symlinks (known failure: installer cannot write through symlinks)
find ~/.claude/skills -maxdepth 1 -type l -delete

# Reinstall from latest Claude plugin
claude plugin marketplace update agentops-marketplace
claude plugin install agentops@agentops-marketplace

# OR reinstall marketplace + plugin
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace

Fix: Nuke and reinstall (if pinning doesn't work)¶

Bash

# Nuclear option: remove everything and reinstall
rm -rf ~/.claude/skills/agentops
rm -rf ~/.claude/plugins/marketplaces/agentops-marketplace

# Reinstall plugin
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace

Re-enable hooks after fix¶

Bash

unset AGENTOPS_HOOKS_DISABLED
# Remove from shell profile if you added it there

3. Scenario B: Evolve Pushed Bad Code to Main¶

Symptom: /evolve ran autonomously, committed code that breaks builds, tests, or other skills. The regression gate failed to catch it, or evolve committed before the gate ran.

Triage (< 5 min)¶

Bash

# 1. Stop evolve immediately
mkdir -p ~/.config/evolve
echo "incident: bad code on main $(date -Iseconds)" > ~/.config/evolve/KILL

# Also set local stop in the repo
echo "emergency stop" > .agents/evolve/STOP

# 2. Check what evolve did
cat .agents/evolve/cycle-history.jsonl 2>/dev/null   # cycle outcomes
cat .agents/evolve/session-summary.md 2>/dev/null     # session wrap-up
ls -lt .agents/evolve/fitness-*.json 2>/dev/null      # fitness snapshots

# 3. Find evolve's commits
git log --oneline -20   # look for evolve/rpi commit messages

Revert evolve's changes¶

Bash

# Find the last good commit (before evolve ran)
# Look at fitness snapshots for session_start_sha
jq -r '.cycle_start_sha' .agents/evolve/fitness-0.json 2>/dev/null

# Or find it manually
git log --oneline -30 | less

# Revert everything after the known-good SHA
GOOD_SHA="<paste sha here>"
git revert --no-commit ${GOOD_SHA}..HEAD
git commit -m "revert: evolve incident — rolling back to ${GOOD_SHA}"

# Verify
cd cli && go build ./cmd/ao && go test ./...
./tests/run-all.sh

If evolve is mid-run (still executing)¶

Bash

# Kill switch stops it at the next cycle boundary
mkdir -p ~/.config/evolve
echo "emergency stop" > ~/.config/evolve/KILL

# If it's in a tmux session, also kill the process
# Find the session
tmux list-sessions | grep -i evolve
# Kill it
tmux kill-session -t <session-name>

Re-enable evolve after fix¶

Bash

rm ~/.config/evolve/KILL
rm .agents/evolve/STOP 2>/dev/null

4. Scenario C: Hook Scripts Fail at Session Start¶

Symptom: Claude session hangs, errors on start, or prints garbage JSON. The session-start hook or one of the other 11 hook scripts has a bug.

Triage (< 5 min)¶

Bash

# 1. Disable hooks to get a working session
export AGENTOPS_HOOKS_DISABLED=1

# 2. Check hook error log
cat .agents/ao/hook-errors.log 2>/dev/null | tail -20

# 3. Test each hook manually to find the broken one
for hook in ~/.claude/skills/agentops/hooks/*.sh; do
  echo "--- Testing: $(basename $hook) ---"
  timeout 5 bash "$hook" 2>&1 | tail -3
  echo "Exit: $?"
done

# 4. Check for syntax errors
for hook in ~/.claude/skills/agentops/hooks/*.sh; do
  bash -n "$hook" 2>&1 && echo "OK: $(basename $hook)" || echo "BROKEN: $(basename $hook)"
done

Common hook failures¶

Missing binary (ao, jq, shellcheck):

Bash

# Check if ao CLI is installed
which ao    # should be /opt/homebrew/bin/ao
which jq    # required by several hooks

# If ao is missing, hooks degrade gracefully (|| true pattern)
# But session-start.sh may still produce broken JSON

Infinite loop or timeout:

Bash

# hooks.json defines timeouts per hook (2-120s)
# If a hook exceeds its timeout, Claude kills it
# Check hooks.json for timeout values
jq '.hooks | to_entries[] | .value[].hooks[] | {command: .command[:60], timeout: .timeout}' \
  ~/.claude/skills/agentops/hooks/hooks.json

Bad JSON output from session-start.sh:

Bash

# session-start.sh must output valid JSON with hookSpecificOutput
# Test it and validate output
bash ~/.claude/skills/agentops/hooks/session-start.sh 2>/dev/null | jq .
# If jq fails, the output is malformed

Fix: Disable just the broken hook¶

Once you identify the broken hook, disable only that one instead of all hooks. Use the per-hook kill switches from Section 1.

If the broken hook doesn't have its own kill switch, you can patch it:

Bash

# Add a kill switch to the top of the hook (after the shebang)
# This is a temporary fix in the installed copy
HOOK_FILE=~/.claude/skills/agentops/hooks/<broken-hook>.sh
# Back it up first
cp "$HOOK_FILE" "${HOOK_FILE}.bak"
# Make it exit immediately
sed -i '' '2i\
exit 0  # TEMPORARY: disabled during incident\
' "$HOOK_FILE"

5. Rollback Options¶

Option A: Reinstall latest Claude plugin¶

Bash

# Refresh marketplace source and reinstall plugin
claude plugin marketplace update agentops-marketplace
claude plugin install agentops@agentops-marketplace

Option B: Pin to a specific commit¶

Bash

# Clone and install from a known-good commit
cd /tmp
git clone https://github.com/boshu2/agentops.git agentops-recovery
cd agentops-recovery
git checkout <known-good-sha>

# Copy skills manually
rm -rf ~/.claude/skills/agentops
cp -r . ~/.claude/skills/agentops

Option C: Sync from marketplace cache¶

Bash

# The marketplace cache may have a working version
cd ~/.claude/plugins/marketplaces/agentops-marketplace
git log --oneline -10   # find a good state
git checkout <good-sha>

# Then reinstall plugin from marketplace source
claude plugin install agentops@agentops-marketplace

Option D: Nuclear reinstall¶

Bash

# Remove everything AgentOps-related
rm -rf ~/.claude/skills/agentops
rm -rf ~/.claude/plugins/marketplaces/agentops-marketplace
find ~/.claude/skills -maxdepth 1 -type l -delete   # remove symlinks

# Clear any cached state
rm -rf ~/.config/evolve/KILL 2>/dev/null

# Fresh install (plugin path)
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace

# Verify
cat ~/.claude/skills/agentops/.claude-plugin/plugin.json | jq -r '.version'
bash ~/.claude/skills/agentops/hooks/session-start.sh 2>/dev/null | jq .

6. Root Cause Analysis¶

After restoring service, investigate what went wrong.

Was it an evolve regression?¶

Bash

# Check evolve history
cat .agents/evolve/cycle-history.jsonl | jq -s '.'

# Check fitness snapshots for regressions
for f in .agents/evolve/fitness-*-post.json; do
  echo "--- $f ---"
  jq '[.goals[] | select(.result == "fail") | .id]' "$f" 2>/dev/null
done

# Check GOALS.yaml for broken check commands
cat GOALS.yaml

Was it a bad commit?¶

Bash

# Use git bisect to find the breaking commit
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-sha>

# For each step, run the relevant test
./tests/run-all.sh && git bisect good || git bisect bad

# When done
git bisect reset

Was it a hook script bug?¶

Bash

# Shellcheck all hooks
shellcheck -x -P SCRIPTDIR hooks/*.sh

# Check for recent hook changes
git log --oneline -20 -- hooks/

# Run the hook preflight validation
./scripts/validate-hook-preflight.sh

# Run full test suite
./tests/run-all.sh

Was it a dependency issue?¶

Bash

# Check if ao CLI is working
ao status
ao flywheel status

# Check Go CLI builds
cd cli && go build ./cmd/ao && go test ./...

# Check for missing system tools
for cmd in jq shellcheck git ao; do
  which "$cmd" >/dev/null 2>&1 && echo "OK: $cmd" || echo "MISSING: $cmd"
done

7. Prevention Checklist¶

Before releasing a new version¶

./tests/run-all.sh passes (all tiers)
./tests/smoke-test.sh passes
cd cli && go build ./cmd/ao && go test -race ./... clean
shellcheck -x -P SCRIPTDIR hooks/*.sh clean
./scripts/validate-hook-preflight.sh passes
All hook scripts test manually: bash hooks/<name>.sh
Plugin and marketplace versions match: jq -r '.version' .claude-plugin/plugin.json equals jq -r '.metadata.version' .claude-plugin/marketplace.json
session-start.sh output is valid JSON: bash hooks/session-start.sh 2>/dev/null | jq .

Before running evolve¶

GOALS.yaml check commands all work: run each check: value manually
Evolve kill switch is clear: test ! -f ~/.config/evolve/KILL && echo "clear"
Git working tree is clean: git status --porcelain is empty
Know the current HEAD: git rev-parse HEAD (save this for revert)
Set a reasonable cycle cap: --max-cycles=3 for first run

After evolve completes¶

Review cycle-history.jsonl for any regressions
Run ./tests/run-all.sh manually (don't trust evolve's self-assessment)
Check git log --oneline -20 for reasonable commit messages
Run git diff <pre-evolve-sha>..HEAD --stat to see total scope of changes

Hook script authoring rules¶

Every hook MUST check AGENTOPS_HOOKS_DISABLED=1 at the top
Every hook MUST fail open (exit 0 on error, never set -e)
Every hook MUST respect its timeout in hooks.json
Guard all external commands: command -v <tool> >/dev/null 2>&1 && ...
JSON output must be valid — test with jq . before committing
Log failures to .agents/ao/hook-errors.log, never to stderr

Quick Reference Card¶

Text Only

STOP EVERYTHING:     export AGENTOPS_HOOKS_DISABLED=1
STOP EVOLVE:         echo "stop" > ~/.config/evolve/KILL
CHECK HOOK ERRORS:   cat .agents/ao/hook-errors.log | tail -20
TEST HOOKS:          for h in hooks/*.sh; do bash -n "$h"; done
REINSTALL:           claude plugin marketplace update agentops-marketplace && claude plugin install agentops@agentops-marketplace
NUCLEAR REINSTALL:   rm -rf ~/.claude/skills/agentops ~/.claude/plugins/marketplaces/agentops-marketplace && claude plugin marketplace add boshu2/agentops && claude plugin install agentops@agentops-marketplace
REVERT EVOLVE:       git revert --no-commit <good-sha>..HEAD && git commit -m "revert: evolve incident"
VERSION CHECK:       jq -r '.version' ~/.claude/skills/agentops/.claude-plugin/plugin.json