#22525: [Bug]: Session snapshot not reloading skills after gateway restart or VM reboot — requires /new to take effect

by zwffff open 2026-02-21 08:08 View on GitHub →

gateway agents size: S

Cluster: Skill and Session Management Fixes

## Summary - **Problem:** After gateway restart or VM reboot, the agent kept using old session skill snapshots; newly configured skills did not appear until the user ran `/new`. - **Why it matters:** Users expect a restart to pick up config/skill changes; having to type `/new` was non-obvious and looked like a skill config bug. - **What changed:** On gateway startup we call `bumpSkillsSnapshotVersion({ reason: "manual" })` so the in-memory version is higher than any persisted session snapshot. On the next message, each session sees `shouldRefreshSnapshot` true and rebuilds its skills snapshot from current config. - **What did NOT change (scope boundary):** No change to session store schema, `/new` behavior, or skill loading logic; only the startup bump that forces a refresh on first use after restart. ## Change Type (select all) - [x] Bug fix - [ ] Feature - [ ] Refactor - [ ] Docs - [ ] Security hardening - [ ] Chore/infra ## Scope (select all touched areas) - [x] Gateway / orchestration - [ ] Skills / tool execution - [ ] Auth / tokens - [ ] Memory / storage - [ ] Integrations - [ ] API / contracts - [ ] UI / DX - [ ] CI/CD / infra ## Linked Issue/PR - Closes #22517 - Related # ## User-visible / Behavior Changes - After gateway restart or VM reboot, the next message in each session triggers a skills snapshot refresh; new/updated skills (and config) take effect without requiring `/new`. - No config or CLI changes. ## Security Impact (required) - New permissions/capabilities? **No** - Secrets/tokens handling changed? **No** - New/changed network calls? **No** - Command/tool execution surface changed? **No** - Data access scope changed? **No** - If any Yes, explain risk + mitigation: N/A ## Repro + Verification ### Environment - OS: Linux (Ubuntu) - Runtime/container: Node 22+ - Model/provider: Any - Integration/channel (if any): Telegram, WhatsApp, Web UI - Relevant config (redacted): `openclaw.json` with a skill (e.g. `gog`) and env/metadata ### Steps 1. Configure a skill and confirm it shows ✓ ready in `openclaw skills list`. 2. Start a chat (create a session), then change skill config (e.g. fix metadata or env). 3. Restart the gateway (`openclaw gateway stop` / `openclaw gateway start`) or reboot the VM. 4. Send a message in the same chat (no `/new`) and try to use the skill. ### Expected - Skill is available and works with the updated config after restart, without `/new`. ### Actual (before fix) - Skill stayed unavailable until the user typed `/new`. ## Evidence - [x] Failing test/log before + passing after: Existing logic in `ensureSkillSnapshot` already refreshes when `snapshotVersion > (session.skillsSnapshot?.version ?? 0)`; the fix ensures `snapshotVersion` is > 0 after restart so that condition holds. - [ ] Trace/log snippets - [ ] Screenshot/recording - [ ] Perf numbers (if relevant) ## Human Verification (required) - **Verified scenarios:** Build and `src/agents/skills/refresh.test.ts` pass; startup path calls `bumpSkillsSnapshotVersion` once. - **Edge cases checked:** No change to session store format; `OPENCLAW_TEST_FAST` fast path unchanged; watcher/version logic unchanged except for the one startup bump. - **What you did not verify:** Full E2E with real gateway restart and multiple channels. ## Compatibility / Migration - Backward compatible? **Yes** - Config/env changes? **No** - Migration needed? **No** - If yes, exact upgrade steps: N/A ## Failure Recovery (if this breaks) - **How to disable/revert:** Revert this PR or remove the `bumpSkillsSnapshotVersion` call in `server-startup.ts`. - **Files/config to restore:** `src/gateway/server-startup.ts` - **Known bad symptoms reviewers should watch for:** None expected; if version bumps too often it could cause extra snapshot rebuilds on first message after start (one-time per session). ## Risks and Mitigations - **Risk:** None identified; single in-memory version bump at startup, no new I/O or secrets. - **Mitigation:** N/A  <h3>Greptile Summary</h3> This PR bundles two separate bug fixes that should be documented together in the PR description: 1. **Skills snapshot refresh after gateway restart** (fix #22517): Calls `bumpSkillsSnapshotVersion()` on gateway startup to ensure in-memory version is higher than persisted session snapshots, forcing refresh on next message 2. **Reasoning default based on model capability** (fix #22456): Adds `resolveReasoningDefault()` to automatically enable reasoning when a model has `reasoning: true` in the catalog, unless explicitly overridden by user Both fixes follow established patterns in the codebase and include test coverage. The skills fix solves the root cause where `globalVersion` started at 0 after restart, preventing `shouldRefreshSnapshot` from triggering. The reasoning fix implements the expected behavior where model capabilities inform default settings. <h3>Confidence Score: 4/5</h3> - This PR is safe to merge with minor documentation concern - Both fixes are technically sound and follow established patterns. The skills snapshot fix correctly addresses the root cause where `globalVersion` initialized to 0. The reasoning default fix implements expected behavior with proper fallback logic. Tests are included for both changes. Score reduced by 1 because the PR description only mentions fix #22517 but the PR actually includes an unrelated fix for #22456 - this creates confusion about the PR scope but doesn't affect code quality. - No files require special attention - all changes follow existing patterns <sub>Last reviewed commit: 59facef</sub>  <sub>(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!</sub>