#9418: Context budgeting: cap maxTokens + retry compaction safely
Labels: agents, stale
Cluster: Context Management Enhancements
## Summary
Implements context-aware `maxTokens` capping: input tokens for each request are estimated, and the requested completion budget is capped to the remaining context window, preventing overflow failures. Also adds bounded retry logic for compaction when overflow still occurs.
## Motivation
We need predictable maxTokens enforcement to avoid context overflow failures and to make compaction recovery more reliable under load.
## How It Works
1. Estimate input tokens (system + messages)
2. Compute remaining context budget
3. Cap the requested `maxTokens` to the minimum of the remaining budget and the model's `maxTokens` limit
4. If overflow still occurs, retry compaction (bounded attempts)
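The estimation step above can be sketched as follows, assuming a chars-per-token heuristic plus a fixed safety margin. The actual `estimateInputTokens()` may use a real tokenizer; the message shape and both constants here are assumptions:

```typescript
// Hypothetical sketch: count characters of system prompt + messages,
// convert via a chars-per-token heuristic, and add a fixed safety margin.
type ChatMessage = { role: string; content: string };

const CHARS_PER_TOKEN = 4;        // rough heuristic (assumption)
const SAFETY_MARGIN_TOKENS = 64;  // fixed headroom (assumption)

function estimateInputTokens(system: string, messages: ChatMessage[]): number {
  const chars =
    system.length + messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / CHARS_PER_TOKEN) + SAFETY_MARGIN_TOKENS;
}
```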
## Flow
```mermaid
graph TD;
A[Build Prompt] --> B[Estimate Input Tokens];
B --> C[Remaining Context];
C --> D[Cap maxTokens];
D --> E[Send Request];
E -->|overflow| F[Compaction Retry Loop];
F --> D;
```
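The overflow branch of the flow above can be sketched as a bounded retry loop. The helper signatures and the attempt bound are hypothetical stand-ins, not the actual `run.ts` implementation:

```typescript
// Sketch of the bounded compaction retry loop: on a context-overflow
// error, compact the context and retry, up to a fixed number of attempts.
const MAX_COMPACTION_ATTEMPTS = 3; // bound is an assumption

async function sendWithCompactionRetry(
  send: () => Promise<string>,
  compact: () => Promise<void>,
  isOverflow: (err: unknown) => boolean,
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (!isOverflow(err) || attempt >= MAX_COMPACTION_ATTEMPTS) throw err;
      await compact(); // shrink context, then retry with a re-capped budget
    }
  }
}
```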
## Key Code Changes
1. **Context Estimation (`src/agents/pi-embedded-runner/extra-params.ts`):**
- `estimateInputTokens()` - Counts tokens for system prompt + messages with safety margin
- `calculateCappedMaxTokens()` - Caps requested `maxTokens` against model limits and remaining context
- `resolveModelContextWindow()` / `resolveModelMaxTokens()` - Resolves model-specific limits (includes override for `zai/glm-4.7`)
2. **Stream Function Wrapper:**
- `createMaxTokensCapWrapper()` - Wraps `StreamFn` to dynamically cap `maxTokens` before each API call
- Logs when capping occurs (e.g., "capping maxTokens for zai/glm-4.7 from 500000 to 128000")
3. **Compaction Retry (`src/agents/pi-embedded-runner/run.ts`):**
- Updated overflow handling to retry compaction with bounded attempts when context overflow occurs
4. **Model-Specific Overrides:**
- `zai/glm-4.7`: contextWindow=200000, maxTokens=128000
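A hedged sketch of the wrapper idea from item 2, assuming a simplified `StreamFn` shape; the real pi-embedded-runner types and option names may differ:

```typescript
// Sketch: wrap a stream function so maxTokens is re-capped before each call.
type StreamOpts = { system: string; messages: { content: string }[]; maxTokens: number };
type StreamFn = (opts: StreamOpts) => Promise<unknown>;

function createMaxTokensCapWrapper(
  inner: StreamFn,
  limits: { contextWindow: number; maxTokens: number }, // model limits
  estimate: (opts: StreamOpts) => number,               // input-token estimator
): StreamFn {
  return (opts) => {
    const remaining = limits.contextWindow - estimate(opts);
    const capped = Math.min(opts.maxTokens, limits.maxTokens, remaining);
    if (capped < opts.maxTokens) {
      console.warn(`capping maxTokens from ${opts.maxTokens} to ${capped}`);
    }
    return inner({ ...opts, maxTokens: capped });
  };
}
```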
## Algorithm
```
remaining = contextWindow - inputTokens
cappedMaxTokens = min(requestedMaxTokens, modelMaxTokens, remaining)
```
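The same formula as a pure function; the clamp to zero for a non-positive remaining budget is an added assumption, not shown in the pseudocode above:

```typescript
// Cap the requested completion budget by the model limit and the
// remaining context window; never return a negative budget.
function calculateCappedMaxTokens(
  requested: number,
  modelMaxTokens: number,
  contextWindow: number,
  inputTokens: number,
): number {
  const remaining = contextWindow - inputTokens;
  return Math.max(0, Math.min(requested, modelMaxTokens, remaining));
}
```

With the `zai/glm-4.7` override above (contextWindow=200000, maxTokens=128000), a 500000-token request with 50000 input tokens would be capped to 128000.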
## Testing
- `pnpm test` (full)
- Targeted tests for maxTokens capping and overflow compaction
## Most Similar PRs
- #19878: fix: Handle compaction when fallback model has smaller context window (by gaurav10gg · 2026-02-18 · 65.4%)
- #19593: feat(compaction): proactive handover before context overflow (by qualiobra · 2026-02-18 · 64.0%)
- #17345: feat: Memory kernel rebuild with token budgeting, summary sidecar, ... (by markmusson · 2026-02-15 · 63.3%)
- #5360: fix(compaction): add emergency pruning for context overflow (by sgwannabe · 2026-01-31 · 63.0%)
- #15749: fix: improve context overflow error with diagnostic details (by superlowburn · 2026-02-13 · 61.8%)
- #18886: fix(status): prefer configured contextTokens over model metadata (by BinHPdev · 2026-02-17 · 60.7%)
- #10273: fix(agents): detect and auto-compact mid-run context overflow (by terryops · 2026-02-06 · 60.5%)
- #17414: fix(sessions): refresh contextTokens when model override changes (by michaelbship · 2026-02-15 · 60.3%)
- #10505: feat(compaction): add timeout, model override, and diagnostic logging (by thebtf · 2026-02-06 · 59.9%)
- #13895: fix(usage): exclude cache tokens from context-window accounting (by zerone0x · 2026-02-11 · 59.8%)