#13928: Classify session lock timeouts separately and improve lock diagnostics
agents
stale
Cluster: Rate Limit Management Enhancements
## Summary
This PR improves diagnostics for session lock contention and stops lock-timeout errors from being misclassified as provider/model failover signals.
## AI disclosure
- AI-assisted: **Yes** (OpenClaw + Codex).
- Testing level: **Fully tested for touched paths** (36 targeted tests passed).
- Prompts/session logs: available on request (sanitized excerpts can be shared).
- Code understanding: confirmed; changes were reviewed manually before submission.
- Guide code word: **lobster-biscuit**.
## What changed
- Added structured lock-timeout errors:
  - `SessionFileLockTimeoutError` (`SESSION_FILE_LOCK_TIMEOUT`)
  - `SessionStoreLockTimeoutError` (`SESSION_STORE_LOCK_TIMEOUT`)
- Added owner diagnostics to timeout messages:
  - `owner_alive=0|1`
  - `owner_age_ms=<n>`
- Updated failover classification to **not** treat session lock timeout errors as provider failover signals.
- Added regression tests:
  - failover classification excludes lock-timeout errors
  - model fallback does not continue on lock-timeout errors
  - session write lock timeout exposes structured diagnostics
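
For readers who want a concrete picture, the error shape and classification rule listed above might look roughly like the TypeScript sketch below. Only the class name `SessionFileLockTimeoutError`, the two error codes, and the `owner_alive` / `owner_age_ms` fields come from this PR. The constructor signature, the `LockOwnerDiagnostics` type, `isSessionLockTimeout`, `classifyFailover`, and the message-matching rules are assumptions made for illustration, not the actual implementation.

```typescript
// Illustrative sketch only: field names, signatures, and matching rules are hypothetical.
type LockOwnerDiagnostics = {
  ownerAlive: boolean; // rendered as owner_alive=0|1
  ownerAgeMs: number; // rendered as owner_age_ms=<n>
};

class SessionFileLockTimeoutError extends Error {
  readonly code = "SESSION_FILE_LOCK_TIMEOUT";
  readonly owner: LockOwnerDiagnostics;

  constructor(lockPath: string, timeoutMs: number, owner: LockOwnerDiagnostics) {
    super(
      `session file lock timeout after ${timeoutMs}ms: ${lockPath} ` +
        `owner_alive=${owner.ownerAlive ? 1 : 0} owner_age_ms=${owner.ownerAgeMs}`,
    );
    this.name = "SessionFileLockTimeoutError";
    this.owner = owner;
  }
}

// The two structured codes named in this PR.
const LOCK_TIMEOUT_CODES = new Set<string>([
  "SESSION_FILE_LOCK_TIMEOUT",
  "SESSION_STORE_LOCK_TIMEOUT",
]);

// True when the error carries one of the structured lock-timeout codes.
function isSessionLockTimeout(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && LOCK_TIMEOUT_CODES.has(code);
}

// Hypothetical classifier: returns a failover reason, or null when the
// error should not trigger provider/model failover.
function classifyFailover(err: unknown): "rate_limit" | "timeout" | null {
  // Check the structured code first. The lock message itself contains the
  // word "timeout", so text matching alone would misclassify it.
  if (isSessionLockTimeout(err)) return null;
  const message = err instanceof Error ? err.message : String(err);
  if (/rate.?limit|429/i.test(message)) return "rate_limit";
  if (/timeout|timed out/i.test(message)) return "timeout";
  return null;
}
```

Checking the structured `code` before any message matching is the design point here: a local lock timeout and a provider timeout can look alike as text, but only the latter should count as a failover signal.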
## Why
Previously, lock contention could appear in `All models failed ...` summaries together with provider cooldown/rate-limit failures. This mixed local lock failures with provider state and made incidents harder to triage.
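
To illustrate the intended behavior, here is a minimal sketch of a fallback loop that stops on a lock timeout instead of folding it into the aggregate summary. `runWithFallback`, its parameters, and the summary format are hypothetical. Only the `SessionStoreLockTimeoutError` name, its code, and the `All models failed` prefix come from this PR.

```typescript
// Illustrative sketch only: the loop, its signature, and the summary format are hypothetical.
class SessionStoreLockTimeoutError extends Error {
  readonly code = "SESSION_STORE_LOCK_TIMEOUT";

  constructor(message: string) {
    super(message);
    this.name = "SessionStoreLockTimeoutError";
  }
}

// True when the error carries a structured session lock-timeout code.
function isSessionLockTimeout(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return code === "SESSION_FILE_LOCK_TIMEOUT" || code === "SESSION_STORE_LOCK_TIMEOUT";
}

// Tries each model in order. Provider-side failures move on to the next
// candidate; a session lock timeout is local, so it is rethrown at once.
async function runWithFallback<T>(
  models: string[],
  attempt: (model: string) => Promise<T>,
): Promise<T> {
  const failures: string[] = [];
  for (const model of models) {
    try {
      return await attempt(model);
    } catch (err) {
      // Switching models cannot release a local lock, so do not continue.
      if (isSessionLockTimeout(err)) throw err;
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${model}: ${reason}`);
    }
  }
  // Only provider/model failures reach the aggregate summary.
  throw new Error(`All models failed (${failures.length}): ${failures.join(" | ")}`);
}
```

With this shape, a lock timeout surfaces as its own error with its own code, and the aggregate summary lists only provider-side failures.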
## Scope / risk
- Small and localized.
- No lock acquisition semantics changed.
- No config or migration changes.
## Test plan
```bash
pnpm vitest run src/agents/failover-error.test.ts src/agents/model-fallback.test.ts src/agents/session-write-lock.test.ts
```
Result on local run: all tests passed (36/36).
## Most Similar PRs
- #20431: fix(sessions): add session contamination guards and self-leak lock ... (by marcomarandiz · 2026-02-18 · 68.5%)
- #21828: fix: acquire session write lock in delivery mirror and gateway chat... (by inkolin · 2026-02-20 · 67.7%)
- #22359: fix(agents): classify overloaded service errors as timeout (by AIflow-Labs · 2026-02-21 · 67.6%)
- #16609: fix: resolve session store race condition and contextTokens updates (by battman21 · 2026-02-14 · 67.1%)
- #23210: fix: avoid cooldown on timeout/unknown failovers (by nydamon · 2026-02-22 · 66.3%)
- #11821: fix(auth): trigger failover on 401 status code from expired OAuth t... (by AnonO6 · 2026-02-08 · 66.1%)
- #17231: fix(failover): recognize model_cooldown as rate-limit for fallback (by thebtf · 2026-02-15 · 65.9%)
- #5031: fix: add network connection error codes to failover classifier (by shayan919293 · 2026-01-30 · 65.8%)
- #22368: fix: first-token timeout + provider-level skip for model fallback (by 88plug · 2026-02-21 · 65.8%)
- #21033: fix(failover): classify connection errors as timeout for model fail... (by zerone0x · 2026-02-19 · 65.7%)