#19878: fix: Handle compaction when fallback model has smaller context window

by gaurav10gg open 2026-02-18 08:07 View on GitHub →

agents size: S

-## Summary - **Problem:** When primary model exhausts quota and falls back to a smaller-context model, compaction fails if `inputTokens + outputTokens > fallbackContextWindow` - **Why it matters:** This causes total outage of all sessions (including cron jobs) until quota resets (1-2 hours) - **What changed:** Added pre-flight check in `compaction-safeguard.ts` to detect overflow and prune more aggressively before summarization - **What did NOT change:** No changes to summarization logic itself, only pruning strategy when context window is exceeded ## Change Type - [x] Bug fix ## Scope - [x] Gateway / orchestration - [x] Memory / storage ## Linked Issue/PR - Closes #19822 (Issue #1) ## User-visible / Behavior Changes - When falling back to smaller-context models during compaction, users will see new warning logs: ``` Compaction safeguard: aggressive pruning for fallback model (contextWindow=X, needed=Y); dropped Z messages to fit. ``` - Compaction will now succeed instead of failing with "context length exceeded" errors - Older messages may be pruned more aggressively to fit within fallback model limits ## Security Impact - New permissions/capabilities? **No** - Secrets/tokens handling changed? **No** - New/changed network calls? **No** - Command/tool execution surface changed? **No** - Data access scope changed? **No** ## Repro + Verification ### Environment - OS: Linux/WSL2 (from issue report) - Runtime/container: Node.js - Model/provider: `google-antigravity` (primary), `nvidia/z-ai/glm5` (fallback) - Relevant config: ```json { "agent": { "model": "google-antigravity/claude-opus-4-5-thinking", "fallback": ["nvidia/z-ai/glm5"] } } ``` ### Steps 1. Use `google-antigravity` as primary model with large context (e.g., claude-opus-4-5-thinking) 2. Allow session to grow to 150K+ tokens 3. Exhaust Antigravity quota (triggers 429) 4. Fallback model (`nvidia/z-ai/glm5`, 202K context) attempts compaction ### Expected - Compaction succeeds with aggressive pruning - Warning logs appear - Session continues working ### Actual (Before Fix) - Compaction fails with: ``` auto-compaction failed: You passed 186753 input tokens and requested 16000 output tokens. However, the model's context length is only 202752 tokens ``` - All sessions enter `FailoverError: No available auth profile` ## Evidence - [x] Logic verified via code review - [ ] Manual testing pending (requires reproducing quota exhaustion scenario) ## Human Verification **Verified scenarios:** - Code compiles without TypeScript errors - Logic review: pre-flight check correctly calculates `totalNeededTokens` and triggers pruning when needed - Iterative pruning reduces `maxHistoryShare` from 0.5 → 0.3 → 0.2 as needed **Edge cases checked:** - Already-small sessions (no pruning triggered) - First pruning pass sufficient (doesn't hit second pass) - Second pruning pass needed (very large sessions) **What I did NOT verify:** - End-to-end testing with actual quota exhaustion (requires production-like setup) - Performance impact of iterative pruning on very large sessions ## Compatibility / Migration - Backward compatible? **Yes** - Config/env changes? **No** - Migration needed? **No** ## Failure Recovery - **How to disable/revert:** Revert `src/agents/pi-extensions/compaction-safeguard.ts` to previous version - **Files to restore:** Only `compaction-safeguard.ts` - **Known bad symptoms:** If compaction starts failing with "context exceeded" errors again, revert this change ## Risks and Mitigations - **Risk:** Aggressive pruning may drop important context from older messages - **Mitigation:** Dropped messages are summarized separately and fed as `previousSummary` to preserve context - **Risk:** Iterative pruning may add latency to compaction - **Mitigation:** Only triggers when context overflow is detected (rare case); most compactions unchanged  <h3>Greptile Summary</h3> This PR adds a pre-flight context window check in `compaction-safeguard.ts` to handle the scenario where a fallback model has a smaller context window than the primary model. When `inputTokens + reserveTokens + estimatedOutputTokens` exceeds the fallback model's context window, the code applies iterative aggressive pruning (reducing `maxHistoryShare` from 0.3 down to 0.2) to fit within limits before attempting summarization. - The fix addresses a real production outage scenario (issue #19822) where compaction fails with "context length exceeded" and cascades into total session failure - The `maxChunkTokens` calculation was also improved to account for reserve + output tokens, preventing chunks from consuming the full context window - Two issues identified: `reserveTokens` appears double-counted in the budget calculation (both directly and via `estimatedOutputTokens`), and messages dropped by the new aggressive pruning are not summarized — unlike the existing pruning path which summarizes dropped messages to preserve context - No tests were added for the new fallback pruning behavior <h3>Confidence Score: 3/5</h3> - The PR fixes a real outage scenario but has a token budget double-counting issue that causes unnecessarily aggressive pruning, and lacks tests for the new code path. - The core concept is sound and addresses a genuine production failure. However, the double-counting of reserveTokens in the budget check means pruning triggers earlier than necessary — dropping context that could have been preserved. Additionally, the new pruning path does not summarize dropped messages (unlike the existing pruning path), which means context is silently lost in the exact fallback scenario this PR targets. No unit tests were added for the new behavior. - src/agents/pi-extensions/compaction-safeguard.ts — double-counted token budget and unsummarized dropped messages <sub>Last reviewed commit: cd50fd1</sub>  <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub>