
#19614: fix: clear browser element refs on failover to prevent stale ref errors

by 88plug, open, 2026-02-18 01:13
size: XS
## Summary

- Problem: When an agent run times out and fails over to a fallback model, the new model inherits stale browser element refs (e.g. `e144`) from the previous run, causing "Unknown ref" / "Element not found" errors.
- Why it matters: Browser-based agent tasks become permanently broken after a model failover because the fallback model cannot interact with the page.
- What changed: Added `clearRoleRefsCache()` to `pw-session.ts`, called in both `forceDisconnectPlaywrightForTarget()` and `closePlaywrightBrowserConnection()` to clear the module-level `roleRefsByTarget` Map on teardown.
- What did NOT change (scope boundary): No changes to Playwright connection lifecycle, failover logic, or ref assignment. Only the cache cleanup on disconnect.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Fixes #20550
- Closes #19663

## User-visible / Behavior Changes

- After a model failover during browser interaction, the fallback model now takes a fresh DOM snapshot instead of reusing stale element refs from the timed-out run.
- No config changes required.

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: Any
- Runtime/container: Node.js 22+
- Model/provider: Any with model fallback configured
- Integration/channel: Browser/Playwright agent tasks

### Steps

1. Configure model fallbacks (primary + fallback model)
2. Start a browser-based agent task
3. Let the primary model time out mid-interaction (with element refs like `e144` in the conversation)
4. Observe the fallback model attempt to use stale refs

### Expected

- Fallback model takes a fresh snapshot and interacts with the current page state

### Actual (before fix)

- Fallback model gets "Unknown ref e144" / "Element e144 not found or not visible" errors because `restoreRoleRefsForTarget()` copies stale refs from the persisted cache

## Evidence

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

New unit test: `clearRoleRefsCache prevents stale refs from being restored after failover`. All 7 existing pw-session tests pass.

## Human Verification (required)

- Verified scenarios: Unit tests pass (7/7 existing + 1 new), build passes (`pnpm build`). Browser service initializes correctly with the patched code and handles disconnect/reconnect cycles during restarts without errors.
- Edge cases checked: Cache is cleared in both disconnect paths (force disconnect and normal close); refs are not restored after clearance.
- What you did **not** verify: Full e2e failover with a live model timeout (requires specific timing conditions). The fix is a straightforward cache clear on teardown; correctness is validated by the unit test.

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Revert the `clearRoleRefsCache()` calls in `forceDisconnectPlaywrightForTarget()` and `closePlaywrightBrowserConnection()`
- Files/config to restore: `src/browser/pw-session.ts`
- Known bad symptoms reviewers should watch for: If the cache clear is too aggressive, models might re-snapshot unnecessarily after normal page navigations (a performance cost, not a correctness issue)

## Risks and Mitigations

- Risk: Clearing the cache too eagerly could cause unnecessary re-snapshots on normal disconnects (not just failovers).
- Mitigation: The cache is only cleared on connection teardown, which is the correct boundary: a new connection should never trust refs from a previous one.

## AI-assisted

This PR was AI-assisted. The code is understood and unit-tested.
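For reviewers who want the shape of the fix at a glance, here is a minimal sketch of clearing a module-level cache on teardown. It is not the actual diff. The names `roleRefsByTarget`, `clearRoleRefsCache`, `restoreRoleRefsForTarget`, `forceDisconnectPlaywrightForTarget`, and `closePlaywrightBrowserConnection` come from the description above; the `RoleRef` types, the `rememberRoleRefsForTarget` helper, and all function signatures are assumptions made for illustration, and the real Playwright teardown is elided.

```typescript
// Hypothetical types: the real ref shape in src/browser/pw-session.ts may differ.
type RoleRef = { role: string; name?: string };
type RoleRefs = Record<string, RoleRef>;

// Module-level cache keyed by target ID (name taken from the PR description).
const roleRefsByTarget = new Map<string, RoleRefs>();

// Hypothetical helper that records the refs produced by a snapshot.
function rememberRoleRefsForTarget(targetId: string, refs: RoleRefs): void {
  roleRefsByTarget.set(targetId, refs);
}

// Returns cached refs for a target, or undefined if none are cached.
// The signature is assumed; only the function name appears in the PR.
function restoreRoleRefsForTarget(targetId: string): RoleRefs | undefined {
  return roleRefsByTarget.get(targetId);
}

// The new function: drop every cached ref so nothing survives a teardown.
function clearRoleRefsCache(): void {
  roleRefsByTarget.clear();
}

// Stand-ins for the two teardown paths; the real connection handling is omitted.
function forceDisconnectPlaywrightForTarget(_targetId: string): void {
  // ... force-disconnect logic would run here ...
  clearRoleRefsCache();
}

function closePlaywrightBrowserConnection(): void {
  // ... normal close logic would run here ...
  clearRoleRefsCache();
}

// Usage: refs cached during the timed-out run are gone after teardown,
// so a fallback model has nothing stale to restore and must re-snapshot.
rememberRoleRefsForTarget("demo-target", { e144: { role: "button", name: "Submit" } });
closePlaywrightBrowserConnection();
console.log(restoreRoleRefsForTarget("demo-target")); // undefined
```

Clearing the whole Map, rather than only the entry for one target, matches the description's "clear the module-level `roleRefsByTarget` Map on teardown"; whether the force-disconnect path should instead clear a single target is a design question the sketch does not settle.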
