#21832: feat(agent): add self-verification loop with full-context evaluation
## Summary
Adds an opt-in **self-verification loop** to the agent runner. After an agent completes a task, a separate verifier model evaluates whether the response actually addresses the user's request — using the **full conversation context**: user rules (AGENTS.md, SOUL.md), conversation history, and execution metadata. If verification fails, structured feedback is injected back into the conversation and the agent retries, up to a configurable maximum number of attempts.
This is an **"LLM-as-a-judge"** pattern integrated directly into `runReplyAgent()`, designed to improve response quality without manual re-prompting.
## How it works
### Core loop
1. Agent produces a response via the normal `runAgentTurnWithFallback()` flow
2. **Deterministic pre-checks** (`shouldSkipVerification()`) decide if verification applies — skips tool-call turns, already-streamed content, messaging tool sends, and empty responses
3. **Keyword trigger**, `verifyAll` mode, or `verifyHeartbeat` mode decides whether to invoke the verifier
4. `assembleVerifierContext()` builds a token-budgeted evaluation prompt (~32K char budget) with all available context
5. `verifyAgentResponse()` makes a standalone LLM call to a verifier model
6. The verifier uses **chain-of-thought** reasoning across 3 dimensions (Goal Achievement, Completeness, Rule Compliance)
7. Returns `PASS` or `FAIL [category]: <feedback>` with a structured fail category
8. On failure, categorized feedback is injected as a new user message and the agent retries
9. Loop continues until PASS or max attempts exhausted
10. **Fail-open**: if the verifier errors, times out, or returns malformed output, the original response is delivered — verification never blocks delivery
### Deterministic pre-checks (no LLM call needed)
These conditions skip verification entirely before any LLM call:
- `stopReason === "tool_calls"` or `pendingToolCalls` — intermediate turn, not final
- `didSendViaMessagingTool` — content already delivered (WhatsApp, Telegram, etc.)
- Block streaming already sent (`directlySentBlockKeys.size > 0`) — can't un-send
- Empty response text
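A minimal sketch of these pre-checks, assuming a flattened `TurnResult` shape (the real fields live on the runner's internal state, and the interface here is an assumption from this description):

```typescript
// Illustrative-only pre-check logic; field names are taken from the PR text.
interface TurnResult {
  stopReason: string;
  pendingToolCalls: boolean;
  didSendViaMessagingTool: boolean;
  directlySentBlockKeys: Set<string>;
  text: string;
}

function shouldSkipVerification(turn: TurnResult): boolean {
  if (turn.stopReason === "tool_calls" || turn.pendingToolCalls) return true; // intermediate turn
  if (turn.didSendViaMessagingTool) return true;           // content already delivered
  if (turn.directlySentBlockKeys.size > 0) return true;    // streamed blocks can't be un-sent
  if (turn.text.trim().length === 0) return true;          // nothing to evaluate
  return false;
}
```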
### Full-context evaluation
The verifier assembles context from multiple sources, ordered to mitigate the **"Lost in the Middle"** effect (critical context at START and END):
```
[SYSTEM PROMPT] ← evaluation instructions + 3-dimension rubric
[USER MESSAGE] ← what was asked (near start, never trimmed)
[USER RULES] ← AGENTS.md, SOUL.md excerpts (capped ~8K chars)
[CONVERSATION HISTORY] ← last 5 turns from InboundHistory (zero I/O)
[EXECUTION METADATA] ← stop_reason, duration
[PREVIOUS FEEDBACK] ← prior verification feedback (on retries)
[AGENT RESPONSE] ← what we're evaluating (at END for recency)
```
Token budget priority (trim from bottom up):
1. System prompt (~1K chars) — fixed
2. User message — never trimmed
3. Agent response — truncate tail if >12K chars
4. Execution metadata — tiny, always include
5. User rules — cap at ~8K chars
6. Conversation history — trim oldest first
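One way to sketch this bottom-up budgeting under a character budget — the section labels follow the priority list above, but the trimming strategy shown (truncate the tail of the lowest-priority trimmable section first) is a simplifying assumption, not the exact algorithm:

```typescript
// Hypothetical character-budget trimmer; sections are ordered by priority,
// so trimming walks from the end (lowest priority) toward the front.
interface Section {
  label: string;      // e.g. "user_rules", "history"
  text: string;
  trimmable: boolean; // user message and metadata would be marked false
}

function fitToBudget(sections: Section[], budgetChars: number): Section[] {
  const total = () => sections.reduce((n, s) => n + s.text.length, 0);
  for (let i = sections.length - 1; i >= 0 && total() > budgetChars; i--) {
    const s = sections[i];
    if (!s.trimmable) continue;
    const overflow = total() - budgetChars;
    s.text = s.text.slice(0, Math.max(0, s.text.length - overflow));
  }
  return sections;
}
```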
### Heartbeat verification
When `verifyHeartbeat: true`, the verifier also evaluates **heartbeat responses**. This catches lazy `HEARTBEAT_OK` replies when the agent should have taken proactive action based on `HEARTBEAT.md` tasks.
Without `verifyHeartbeat`, heartbeat runs are always skipped (default behavior — no overhead on periodic checks that don't need verification).
With `verifyHeartbeat: true`, **all** heartbeat responses are verified regardless of keyword triggers — this is intentional because the whole point is catching agents that respond `HEARTBEAT_OK` without doing real work.
Example: If `HEARTBEAT.md` says "Check for new Kaggle competitions and participate" but the agent responds `HEARTBEAT_OK` without checking, the verifier flags it as `goal_missed` and the agent retries with proper task execution.
### Structured failure categories
On failure, the verifier outputs a categorized response for smarter retries:
- `goal_missed` — response doesn't address the actual question
- `incomplete` — response is truncated or missing requested elements
- `rule_violation` — response violates user rules (AGENTS.md, SOUL.md)
- `tone_mismatch` — wrong tone, register, or communication style
- `refusal` — agent refused a legitimate request
The category is logged via `emitAgentEvent()` and included in the retry feedback prompt to help the agent focus its correction.
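A hypothetical parser for the `PASS` / `FAIL [category]: <feedback>` output format described above; the regex and the fail-open handling of malformed output are assumptions based on this description:

```typescript
// Illustrative parser for the verifier's verdict string (not the real code).
const FAIL_CATEGORIES = new Set([
  "goal_missed",
  "incomplete",
  "rule_violation",
  "tone_mismatch",
  "refusal",
]);

type ParsedVerdict =
  | { pass: true }
  | { pass: false; category: string; feedback: string };

function parseVerdict(raw: string): ParsedVerdict | null {
  const text = raw.trim();
  if (/^PASS\b/.test(text)) return { pass: true };
  const m = /^FAIL\s*\[([a-z_]+)\]:\s*(.+)$/s.exec(text);
  if (!m || !FAIL_CATEGORIES.has(m[1])) return null; // malformed → caller fails open
  return { pass: false, category: m[1], feedback: m[2].trim() };
}
```

Returning `null` for anything unrecognized keeps the fail-open guarantee: an out-of-vocabulary category or garbled output is treated the same as a verifier error.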
## Verification scope
The verification loop covers:
- **Main session responses** ✅ — all channel messages (WhatsApp, Telegram, Discord, etc.)
- **Sub-agent responses** ✅ — sub-agents spawned via `sessions_spawn` route through `runReplyAgent()` via the gateway
- **Heartbeat responses** ✅ — opt-in via `verifyHeartbeat: true`
- **CLI provider** ❌ — intentionally excluded (local development)
## Configuration
Add to `~/.openclaw/openclaw.json`:
```json
{
  "agents": {
    "defaults": {
      "verifier": {
        "enabled": true,
        "model": "anthropic/claude-sonnet-4-5",
        "verifyAll": false,
        "verifyHeartbeat": true,
        "maxAttempts": 3,
        "triggerKeywords": ["done", "completed", "finished", "ready", "here you go"],
        "timeoutSeconds": 30
      }
    }
  }
}
```
### Configuration reference
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | `boolean` | `false` | Enable the verification loop |
| `verifyAll` | `boolean` | `false` | Verify **every** response regardless of trigger keywords |
| `verifyHeartbeat` | `boolean` | `false` | Verify heartbeat responses — catches lazy `HEARTBEAT_OK` when tasks exist |
| `model` | `string` | `"anthropic/claude-sonnet-4-5"` | Verifier model (recommended: different model family than agent) |
| `maxAttempts` | `number` | `3` | Max attempts including original response |
| `triggerKeywords` | `string[]` | `["done", "completed", ...]` | Keywords that trigger verification when `verifyAll` is `false` |
| `timeoutSeconds` | `number` | `30` | Timeout for the verifier LLM call |
All fields are optional with sensible defaults. The feature is **off by default** (`enabled: false`).
### Trigger modes
- **Keyword mode** (default): verification only runs when the agent's response contains a trigger keyword (e.g., "done", "completed"). Lightweight — most responses skip verification entirely.
- **Verify-all mode** (`verifyAll: true`): verification runs on every response. Higher cost but catches more issues. Recommended when quality matters more than latency.
- **Heartbeat mode** (`verifyHeartbeat: true`): all heartbeat responses are verified regardless of keywords. Catches agents that reply `HEARTBEAT_OK` without checking their `HEARTBEAT.md` tasks.
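Keyword mode can be sketched as a simple case-insensitive substring check — the matching strategy here is an assumption; the actual `shouldVerifyResponse()` may match differently:

```typescript
// Illustrative keyword trigger (heartbeat handling is decided elsewhere).
function shouldVerifyResponse(
  response: string,
  opts: { verifyAll: boolean; triggerKeywords: string[] },
): boolean {
  if (opts.verifyAll) return true; // verify-all mode bypasses keywords
  const lower = response.toLowerCase();
  return opts.triggerKeywords.some((k) => lower.includes(k.toLowerCase()));
}
```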
### Recommended setup
Use a **different model family** for the verifier than the agent to avoid self-bias:
- Agent: `anthropic/claude-opus-4-6` → Verifier: `openai/gpt-4.1` or `kimi-coding/k2p5`
- Agent: `openai/gpt-4.1` → Verifier: `anthropic/claude-sonnet-4-5`
## Design decisions
- **Deterministic pre-checks**: Fast `shouldSkipVerification()` avoids LLM calls for cases that can be decided logically (tool calls, already-sent content, empty responses)
- **Verifier call**: Standalone `completeSimple()` call, not a full agent turn — no tools, no session writes, no transcript pollution
- **Full context assembly**: `assembleVerifierContext()` reads AGENTS.md/SOUL.md from workspace at verification time, includes conversation history from memory (zero additional I/O), and applies token budgeting
- **Chain-of-thought evaluation**: Verifier reasons through 3 dimensions before rendering a verdict — reduces snap-judgment failures
- **Structured FAIL categories**: Categories enable smarter retry prompts (e.g., "Your response was flagged for `incomplete` — specifically: missing error handling")
- **Retry mechanism**: Feedback injected into the same session conversation, not a new session — preserves full context
- **"Lost in the Middle" mitigation**: Context ordering places critical information (user message, agent response) at boundaries, less critical (rules, history) in the middle
- **Block streaming skip**: When block streaming has already sent content to the user, verification is skipped (can't un-send)
- **Heartbeat always-verify**: When `verifyHeartbeat` is true, heartbeat runs bypass keyword triggers entirely — every heartbeat response goes through verification. This is the correct behavior because the problem being solved is agents that lazily respond `HEARTBEAT_OK` without checking tasks.
- **No per-agent config in v1**: Only `agents.defaults.verifier` — per-agent overrides can be added later
- **System prompt is hardcoded**: A well-tested verification prompt with CoT rubric, not user-configurable in v1
- **Fail-open always**: Any verifier error (API failure, timeout, malformed response, model unavailable) silently delivers the original response
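One way the fail-open timeout could be wrapped is to race the verifier call against a timer and collapse every error path to "no verdict" — an assumed helper for illustration, not the PR's actual code:

```typescript
// Hypothetical fail-open wrapper: any error or timeout yields null,
// which the caller treats as "deliver the original response".
async function verifyFailOpen<T>(
  call: () => Promise<T>,
  timeoutMs: number,
): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    return await Promise.race([call(), timeout]);
  } catch {
    return null; // API failure, malformed response, model unavailable, ...
  } finally {
    clearTimeout(timer);
  }
}
```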
## Files changed
| File | Lines | Change |
|------|-------|--------|
| `src/auto-reply/reply/agent-verifier.ts` | 388 | Full-context verifier module: `assembleVerifierContext()`, `shouldSkipVerification()`, `verifyAgentResponse()`, structured FAIL parsing, workspace file reading |
| `src/auto-reply/reply/agent-verifier-trigger.ts` | 23 | Keyword trigger detector (`shouldVerifyResponse()`) |
| `src/auto-reply/reply/agent-runner.ts` | +218 | Verification loop integration in `runReplyAgent()`: pre-checks, context enrichment, retry with categorized feedback, heartbeat verification, lifecycle logging |
| `src/config/types.agent-defaults.ts` | +18 | `AgentVerifierConfig` type definition (includes `verifyHeartbeat`) |
| `src/config/zod-schema.agent-defaults.ts` | +12 | Zod runtime validation schema for verifier config |
| `src/auto-reply/reply/agent-verifier.test.ts` | 362 | 30 unit tests: verifier logic, context assembly, structured FAIL parsing, skip pre-checks |
| `src/auto-reply/reply/agent-verifier-trigger.test.ts` | 46 | 9 unit tests: keyword trigger detection |...