#19614: fix: clear browser element refs on failover to prevent stale ref errors
size: XS
## Summary
- Problem: When an agent run times out and fails over to a fallback model, the new model inherits stale browser element refs (e.g. `e144`) from the previous run, causing "Unknown ref" / "Element not found" errors.
- Why it matters: Browser-based agent tasks become permanently broken after a model failover because the fallback model cannot interact with the page.
- What changed: Added `clearRoleRefsCache()` to `pw-session.ts`, called in both `forceDisconnectPlaywrightForTarget()` and `closePlaywrightBrowserConnection()` to clear the module-level `roleRefsByTarget` Map on teardown.
- What did NOT change (scope boundary): No changes to Playwright connection lifecycle, failover logic, or ref assignment. Only the cache cleanup on disconnect.
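
A minimal sketch of the cleanup described above, assuming a simplified shape for the cache. Only the names `clearRoleRefsCache`, `roleRefsByTarget`, `restoreRoleRefsForTarget`, `forceDisconnectPlaywrightForTarget`, and `closePlaywrightBrowserConnection` come from this PR. The `RoleRefs` type, the `rememberRoleRefsForTarget` helper, the synchronous signatures, and the elided teardown bodies are assumptions for illustration, not the actual code in `pw-session.ts`.

```typescript
// Hypothetical ref shape; the real type in pw-session.ts may differ.
type RoleRef = { role: string; name?: string };
type RoleRefs = Record<string, RoleRef>;

// Module-level cache keyed by target id. Because it lives at module scope,
// it outlives any single Playwright connection unless cleared explicitly.
const roleRefsByTarget = new Map<string, RoleRefs>();

// Hypothetical helper: stores a copy of the refs from the latest snapshot.
function rememberRoleRefsForTarget(targetId: string, refs: RoleRefs): void {
  roleRefsByTarget.set(targetId, { ...refs });
}

// Returns a copy of the cached refs for a target, or undefined if none exist.
function restoreRoleRefsForTarget(targetId: string): RoleRefs | undefined {
  const cached = roleRefsByTarget.get(targetId);
  return cached ? { ...cached } : undefined;
}

// The cleanup added by this PR: drop every cached ref.
function clearRoleRefsCache(): void {
  roleRefsByTarget.clear();
}

// Stand-ins for the two teardown paths; the real disconnect logic is elided.
function forceDisconnectPlaywrightForTarget(_targetId: string): void {
  // ... force-disconnect work would happen here (assumed) ...
  clearRoleRefsCache();
}

function closePlaywrightBrowserConnection(): void {
  // ... normal close work would happen here (assumed) ...
  clearRoleRefsCache();
}
```

Under these assumptions, calling either teardown function after `rememberRoleRefsForTarget("tab-1", ...)` makes a later `restoreRoleRefsForTarget("tab-1")` return `undefined`, so the next run has to take a fresh snapshot.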
## Change Type (select all)
- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [ ] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Fixes #20550
- Closes #19663
## User-visible / Behavior Changes
- After a model failover during browser interaction, the fallback model now takes a fresh DOM snapshot instead of reusing stale element refs from the timed-out run.
- No config changes required.
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: Any
- Runtime/container: Node.js 22+
- Model/provider: Any with model fallback configured
- Integration/channel: Browser/Playwright agent tasks
### Steps
1. Configure model fallbacks (primary + fallback model)
2. Start a browser-based agent task
3. Let the primary model time out mid-interaction (with element refs like `e144` in conversation)
4. Observe the fallback model attempt to use stale refs
### Expected
- Fallback model takes a fresh snapshot and interacts with current page state
### Actual (before fix)
- Fallback model gets "Unknown ref e144" / "Element e144 not found or not visible" errors because `restoreRoleRefsForTarget()` copies stale refs from the persisted cache
## Evidence
- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)
New unit test: `clearRoleRefsCache prevents stale refs from being restored after failover`. All 7 existing pw-session tests pass.
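
The property that the new test checks can be sketched without a test runner. This is an illustration only: the actual test presumably uses the repository's test framework and the real module, while the stand-ins below (`RoleRefs`, the simplified function bodies, the `check` helper, and the `target-1` id) are assumptions.

```typescript
// Hypothetical stand-ins for pw-session.ts internals; real signatures may differ.
type RoleRefs = Record<string, { role: string; name?: string }>;
const roleRefsByTarget = new Map<string, RoleRefs>();
const clearRoleRefsCache = (): void => roleRefsByTarget.clear();
const restoreRoleRefsForTarget = (targetId: string): RoleRefs | undefined =>
  roleRefsByTarget.get(targetId);

// Tiny assertion helper so the sketch needs no test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

function testClearPreventsStaleRestore(): boolean {
  // Arrange: the timed-out run left a ref such as e144 in the cache.
  roleRefsByTarget.set("target-1", { e144: { role: "button", name: "Submit" } });
  check(
    restoreRoleRefsForTarget("target-1")?.e144 !== undefined,
    "precondition: stale ref is restorable before the clear",
  );

  // Act: teardown clears the module-level cache.
  clearRoleRefsCache();

  // Assert: nothing is left for the fallback run to restore.
  check(
    restoreRoleRefsForTarget("target-1") === undefined,
    "stale refs must not be restored after the clear",
  );
  check(roleRefsByTarget.size === 0, "cache must be empty after the clear");
  return true;
}
```

The precondition check matters: without it, the test would also pass if refs were never cached in the first place.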
## Human Verification (required)
- Verified scenarios: Unit tests pass (7/7 existing + 1 new), build passes (`pnpm build`). Browser service initializes correctly with the patched code and handles disconnect/reconnect cycles during restarts without errors.
- Edge cases checked: Cache cleared in both disconnect paths (force disconnect and normal close), refs not restored after clearance
- What you did **not** verify: Full e2e failover with a live model timeout (requires specific timing conditions). The fix is a straightforward cache clear on teardown, and its correctness is validated by a unit test.
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly: Revert the `clearRoleRefsCache()` calls in `forceDisconnectPlaywrightForTarget()` and `closePlaywrightBrowserConnection()`
- Files/config to restore: `src/browser/pw-session.ts`
- Known bad symptoms reviewers should watch for: If the cache clear is too aggressive, models might re-snapshot unnecessarily after normal page navigations (performance cost, not correctness issue)
## Risks and Mitigations
- Risk: Clearing the cache too eagerly could cause unnecessary re-snapshots on normal disconnects (not just failovers).
- Mitigation: The cache is only cleared on connection teardown, which is the correct boundary — a new connection should never trust refs from a previous one.
## AI-assisted
This PR was AI-assisted. The code is understood and unit-tested.