#20490: fix(queue): partition followup routing safety (rebased on latest main)

by Jackten open 2026-02-19 00:57

size: M

AI-assisted: Yes (Codex CLI).
Testing level: Focused + type-aware lint/format checks completed (see Repro + Verification).

## Summary

- Problem: Followup queue lane identity is session-centric (`queueKey = sessionKey ?? sessionIdFinal`), which is coarser than routing tuple identity.
- Why it matters: If routing metadata quality degrades, collect mode may still process mixed destinations in one lane, creating context/routing risk.
- What changed:
  - Derive queue key with routing fingerprint for routable channels.
  - Harden collect mode to fail safe (no collect) when routable metadata is missing/ambiguous.
  - Add regression tests for destination mismatch and missing metadata.
- What did NOT change (scope boundary):
  - No model/provider behavior changes.
  - No UI changes.
  - No steer-path redesign in this PR.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #20486
- Related #4380 #5615 #4344 #3102 #788 #3520

## User-visible / Behavior Changes

- Safer queue behavior under load: queued followups are never collected when routing identity is missing or differs.
- Same-destination collect behavior remains unchanged.

## Security Impact (required)

- New permissions/capabilities? (`No`)
- Secrets/tokens handling changed? (`No`)
- New/changed network calls? (`No`)
- Command/tool execution surface changed? (`No`)
- Data access scope changed? (`No`)
- If any `Yes`, explain risk + mitigation: `N/A`

## Repro + Verification

### Environment

- OS: Ubuntu 25.10
- Runtime/container: Node + pnpm dev checkout
- Model/provider: n/a (queue routing unit tests)
- Integration/channel (if any): WhatsApp/Slack routing semantics
- Relevant config (redacted): queue mode `collect`; default `session.dmScope`

### Steps

1. Enqueue two followup runs with different `originatingTo` in collect mode.
2. Drain queue.
3. Verify two independent runs are executed (no synthetic collect prompt).
4. Enqueue two runs with same full routing tuple.
5. Verify one collected synthetic run is executed.
6. Enqueue run(s) with missing routable metadata.
7. Verify fail-safe individual processing.

### Expected

- No collect across differing tuple.
- No collect with missing routable tuple fields.
- Collect only for exact tuple match.

### Actual

- Before: safety depended on metadata quality and session-centric lane partition.
- After: partition + collect checks enforce stricter invariant.

## Evidence

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

Planned/updated tests:

- `src/auto-reply/reply/reply-flow.test.ts`
  - `does not collect when destination differs`
  - `does not collect when routable metadata missing`
  - `collects when full routing tuple matches`
- `src/auto-reply/reply/get-reply-run*.test.ts` (or nearest queue-key unit location)
  - queue key differs for different routable destination tuple

## Human Verification (required)

What you personally verified (not just CI), and how:

- Verified scenarios:
  - Cross-destination no-collect
  - Same-destination collect preserved
  - Missing metadata fail-safe path
- Edge cases checked:
  - String thread IDs
  - Empty accountId fallback behavior
- What you did **not** verify:
  - Full multi-channel e2e in production-sized environment

## Compatibility / Migration

- Backward compatible? (`Yes`)
- Config/env changes? (`No`)
- Migration needed? (`No`)
- If yes, exact upgrade steps: `N/A`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly:
  - Revert PR-A commit.
- Files/config to restore:
  - `src/auto-reply/reply/get-reply-run.ts`
  - `src/auto-reply/reply/queue/drain.ts`
  - test files touched
- Known bad symptoms reviewers should watch for:
  - Over-fragmented queue behavior (too many individual followups)

## Risks and Mitigations

- Risk: Reduced batching could increase run count.
  - Mitigation: Restrict stricter behavior to routable tuple mismatch/missing only.
- Risk: Legacy paths missing metadata may appear noisier.
  - Mitigation: Add PR-C diagnostics for targeted follow-up.

<h3>Greptile Summary</h3>

This PR enhances queue safety by partitioning followup queues based on full routing tuple (channel, destination, account, thread) instead of just session identity, and hardens collect mode to fail-safe when routing metadata is incomplete.

- Added `buildFollowupQueueKey()` that appends routing fingerprint `::route:channel|to|accountId|threadId` to base session identifier for routable channels
- Modified collect drain logic to detect cross-channel items and force individual processing when routing metadata differs or is incomplete
- Added fallback to extract channel from `run.messageProvider` when `originatingChannel` is missing
- Comprehensive test coverage for: destination mismatch, missing metadata fail-safe, same-destination collect, thread ID preservation

The implementation correctly ensures messages with different routing tuples are never collected into a single synthetic prompt, preventing context/routing confusion.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- Changes are narrowly scoped to queue partitioning logic with clear fail-safe semantics. Test coverage is comprehensive for all key scenarios (destination mismatch, missing metadata, same-destination collect, thread preservation). The backward-compatible approach (falling back to base session identifier when routing metadata incomplete) ensures no regression for existing flows. Logic is defensive and designed to prevent incorrect message batching rather than optimize throughput.
- No files require special attention

<sub>Last reviewed commit: f0f4731</sub>

---

## Rebase Update (2026-02-19)

- Rebasing completed on latest upstream `main`:
  - Base SHA: `42d11a3ec5f43897ce8d4adbceb896389728f174`
  - Then refreshed to newest upstream `main`: `2c05cbb43e48ebad03626d3125746fb1b9a8520f`
- Rebase result: clean, no conflicts.
- Current PR head SHA: `2d3979c6059897b145c58522a97f0137b35851c9`

### Focused Verification (passed)

- `pnpm test src/auto-reply/reply/get-reply-run.media-only.test.ts src/auto-reply/reply/reply-flow.test.ts`
  - Result: `86/86` tests passed.
- `pnpm exec oxfmt --check src/auto-reply/reply/commands-status.ts src/auto-reply/reply/get-reply-run.media-only.test.ts src/auto-reply/reply/get-reply-run.ts src/auto-reply/reply/queue.ts src/auto-reply/reply/queue/drain.ts src/auto-reply/reply/queue/enqueue.ts src/auto-reply/reply/reply-flow.test.ts`
  - Result: format check passed.
- `pnpm exec oxlint --type-aware src/auto-reply/reply/commands-status.ts src/auto-reply/reply/get-reply-run.media-only.test.ts src/auto-reply/reply/get-reply-run.ts src/auto-reply/reply/queue.ts src/auto-reply/reply/queue/drain.ts src/auto-reply/reply/queue/enqueue.ts src/auto-reply/reply/reply-flow.test.ts`
  - Result: `0 warnings / 0 errors`.
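
### Illustrative sketch: queue key and collect guard

For reviewers who want the invariant in one place, the following is a minimal sketch of the queue-key partitioning and collect guard described in the summary above. It is not the code in this PR. The `RouteInfo` type and the functions `routeFingerprint`, `sketchFollowupQueueKey`, and `canCollect` are hypothetical names chosen for illustration; only the field names `originatingChannel` and `originatingTo`, the `accountId`/`threadId` tuple members, and the `::route:channel|to|accountId|threadId` suffix come from the PR text. The sketch also assumes that any item carrying both a channel and a destination counts as routable, which may differ from the actual channel check.

```typescript
// Hypothetical shape for illustration; the real run/queue item types may differ.
interface RouteInfo {
  originatingChannel?: string;
  originatingTo?: string;
  accountId?: string;
  threadId?: string | number;
}

// Returns "channel|to|accountId|threadId", or undefined when the routable
// part of the tuple (channel or destination) is missing or blank.
function routeFingerprint(route: RouteInfo): string | undefined {
  const channel = route.originatingChannel?.trim();
  const to = route.originatingTo?.trim();
  if (!channel || !to) {
    return undefined;
  }
  // Assumption: an absent accountId or threadId is encoded as an empty segment.
  const account = route.accountId ?? "";
  const thread = route.threadId === undefined ? "" : String(route.threadId);
  return [channel, to, account, thread].join("|");
}

// Appends the fingerprint to the session-based key, falling back to the
// base key alone when routing metadata is incomplete.
function sketchFollowupQueueKey(baseKey: string, route: RouteInfo): string {
  const fingerprint = routeFingerprint(route);
  return fingerprint === undefined ? baseKey : `${baseKey}::route:${fingerprint}`;
}

// Fail-safe collect guard: collect only when every item has a complete
// fingerprint and all fingerprints are identical.
function canCollect(items: RouteInfo[]): boolean {
  const fingerprints = items.map(routeFingerprint);
  const first = fingerprints[0];
  return first !== undefined && fingerprints.every((fp) => fp === first);
}
```

Under these assumptions, `sketchFollowupQueueKey("s1", { originatingChannel: "slack", originatingTo: "C1", threadId: 42 })` yields `s1::route:slack|C1||42`, and `canCollect` returns `false` as soon as one item lacks a destination or points at a different one, which mirrors the "no collect across differing tuple" and "no collect with missing routable tuple fields" expectations.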