#9418: Context budgeting: cap maxTokens + retry compaction safely
Labels: agents, stale
Cluster: Context Management Enhancements
## Summary
Implements context-aware `maxTokens` capping: input tokens for each request are estimated, and the requested completion budget is capped to the remaining context window, preventing overflow failures. Also adds bounded retry logic for compaction when overflow still occurs.
## Motivation
We need predictable maxTokens enforcement to avoid context overflow failures and to make compaction recovery more reliable under load.
## How It Works
1. Estimate input tokens (system + messages)
2. Compute remaining context budget
3. Cap the requested `maxTokens` to the minimum of the remaining budget and the model's `maxTokens` limit
4. If overflow still occurs, retry compaction (bounded attempts)
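The estimation step above can be sketched as follows, assuming a chars-per-token heuristic plus a fixed safety margin. The actual `estimateInputTokens()` may use a real tokenizer; the message shape and both constants here are assumptions:

```typescript
// Hypothetical sketch: count characters of system prompt + messages,
// convert via a chars-per-token heuristic, and add a fixed safety margin.
type ChatMessage = { role: string; content: string };

const CHARS_PER_TOKEN = 4;        // rough heuristic (assumption)
const SAFETY_MARGIN_TOKENS = 64;  // fixed headroom (assumption)

function estimateInputTokens(system: string, messages: ChatMessage[]): number {
  const chars =
    system.length + messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / CHARS_PER_TOKEN) + SAFETY_MARGIN_TOKENS;
}
```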
## Flow
```mermaid
graph TD;
A[Build Prompt] --> B[Estimate Input Tokens];
B --> C[Remaining Context];
C --> D[Cap maxTokens];
D --> E[Send Request];
E -->|overflow| F[Compaction Retry Loop];
F --> D;
```
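The overflow branch of the flow above can be sketched as a bounded retry loop. The helper signatures and the attempt bound are hypothetical stand-ins, not the actual `run.ts` implementation:

```typescript
// Sketch of the bounded compaction retry loop: on a context-overflow
// error, compact the context and retry, up to a fixed number of attempts.
const MAX_COMPACTION_ATTEMPTS = 3; // bound is an assumption

async function sendWithCompactionRetry(
  send: () => Promise<string>,
  compact: () => Promise<void>,
  isOverflow: (err: unknown) => boolean,
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (!isOverflow(err) || attempt >= MAX_COMPACTION_ATTEMPTS) throw err;
      await compact(); // shrink context, then retry with a re-capped budget
    }
  }
}
```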
## Key Code Changes
1. **Context Estimation (`src/agents/pi-embedded-runner/extra-params.ts`):**
- `estimateInputTokens()` - Counts tokens for system prompt + messages with safety margin
- `calculateCappedMaxTokens()` - Caps requested `maxTokens` against model limits and remaining context
- `resolveModelContextWindow()` / `resolveModelMaxTokens()` - Resolves model-specific limits (includes override for `zai/glm-4.7`)
2. **Stream Function Wrapper:**
- `createMaxTokensCapWrapper()` - Wraps `StreamFn` to dynamically cap `maxTokens` before each API call
- Logs when capping occurs (e.g., "capping maxTokens for zai/glm-4.7 from 500000 to 128000")
3. **Compaction Retry (`src/agents/pi-embedded-runner/run.ts`):**
- Updated overflow handling to retry compaction with bounded attempts when context overflow occurs
4. **Model-Specific Overrides:**
- `zai/glm-4.7`: contextWindow=200000, maxTokens=128000
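A hedged sketch of the wrapper idea from item 2, assuming a simplified `StreamFn` shape; the real pi-embedded-runner types and option names may differ:

```typescript
// Sketch: wrap a stream function so maxTokens is re-capped before each call.
type StreamOpts = { system: string; messages: { content: string }[]; maxTokens: number };
type StreamFn = (opts: StreamOpts) => Promise<unknown>;

function createMaxTokensCapWrapper(
  inner: StreamFn,
  limits: { contextWindow: number; maxTokens: number }, // model limits
  estimate: (opts: StreamOpts) => number,               // input-token estimator
): StreamFn {
  return (opts) => {
    const remaining = limits.contextWindow - estimate(opts);
    const capped = Math.min(opts.maxTokens, limits.maxTokens, remaining);
    if (capped < opts.maxTokens) {
      console.warn(`capping maxTokens from ${opts.maxTokens} to ${capped}`);
    }
    return inner({ ...opts, maxTokens: capped });
  };
}
```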
## Algorithm
```
remaining = contextWindow - inputTokens
cappedMaxTokens = min(requestedMaxTokens, modelMaxTokens, remaining)
```
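The same formula as a pure function; the clamp to zero for a non-positive remaining budget is an added assumption, not shown in the pseudocode above:

```typescript
// Cap the requested completion budget by the model limit and the
// remaining context window; never return a negative budget.
function calculateCappedMaxTokens(
  requested: number,
  modelMaxTokens: number,
  contextWindow: number,
  inputTokens: number,
): number {
  const remaining = contextWindow - inputTokens;
  return Math.max(0, Math.min(requested, modelMaxTokens, remaining));
}
```

With the `zai/glm-4.7` override above (contextWindow=200000, maxTokens=128000), a 500000-token request with 50000 input tokens would be capped to 128000.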
## Testing
- `pnpm test` (full)
- Targeted tests for maxTokens capping and overflow compaction
## Most Similar PRs
- #19878: fix: Handle compaction when fallback model has smaller context window (by gaurav10gg · 2026-02-18 · 65.4%)
- #19593: feat(compaction): proactive handover before context overflow (by qualiobra · 2026-02-18 · 64.0%)
- #17345: feat: Memory kernel rebuild with token budgeting, summary sidecar, ... (by markmusson · 2026-02-15 · 63.3%)
- #5360: fix(compaction): add emergency pruning for context overflow (by sgwannabe · 2026-01-31 · 63.0%)
- #15749: fix: improve context overflow error with diagnostic details (by superlowburn · 2026-02-13 · 61.8%)
- #18886: fix(status): prefer configured contextTokens over model metadata (by BinHPdev · 2026-02-17 · 60.7%)
- #10273: fix(agents): detect and auto-compact mid-run context overflow (by terryops · 2026-02-06 · 60.5%)
- #17414: fix(sessions): refresh contextTokens when model override changes (by michaelbship · 2026-02-15 · 60.3%)
- #10505: feat(compaction): add timeout, model override, and diagnostic logging (by thebtf · 2026-02-06 · 59.9%)
- #13895: fix(usage): exclude cache tokens from context-window accounting (by zerone0x · 2026-02-11 · 59.8%)