Example Session: Platform Operations¶

Profile: platform-ops Scenario: Respond to production incident Duration: ~45 min

Session Flow¶

Text Only

User: Alert: High error rate on order-service

Skills loaded: operations

Immediate Actions: - Assess severity (P1 - major functionality) - Check user impact (orders failing) - Start incident timeline

Text Only

User: Investigate the error spike

Skills loaded: operations, monitoring

Actions: - Check error logs for patterns - Review recent deployments - Analyze metrics dashboards - Identify root cause hypothesis

Findings: - Database connection pool exhausted - Started after 10am traffic spike - Pool sized for normal load, not peak

Text Only

User: Fix the connection pool issue

Skills loaded: operations, validation

Actions: - Increase connection pool size - Deploy configuration change - Monitor recovery - Validate tracer request succeeds

Result: Error rate back to normal

Text Only

User: Document this incident

Skills loaded: meta

Actions: - Create incident timeline - Document root cause - Write action items - Update runbook

Output: Incident postmortem in .agents/learnings/

Text Only

User: Add alerting for this

Skills loaded: monitoring

Actions: - Add connection pool utilization alert - Link to new runbook - Set appropriate threshold

Time: ~45 min (MTTR)