
#18115: fix: prevent voice message loss during concurrent update processing

by AlekseyRodkin · open · 2026-02-16 14:21
labels: `channel: telegram` · `stale` · `size: S`
## Bug Description

When the Telegram polling loop (grammY runner) hits a transient network error, it restarts with backoff. There is a race window in which update offsets are persisted before voice-message processing completes: if the runner crashes mid-processing, those updates are acknowledged to Telegram but never fully handled. The race is narrow (~5% occurrence) and users can retry manually, so severity is low. The root cause likely involves the interaction between update-offset persistence in the runner's fetch cycle and the sink's processing pipeline.

**Severity:** low

## Root Cause

In bot.ts, `recordUpdateId` (line ~227) persists the highest `update_id` after each update's middleware completes, but updates from different chats run concurrently (`sequentialize` only serializes per chat). When a batch contains [update 100 (voice, chat A), update 101 (text, chat B)], update 101 can finish first and persist `lastUpdateId=101` to disk. If the runner then crashes (network error) before update 100 finishes processing, on restart `shouldSkipUpdate` sees `lastUpdateId=101` and skips update 100: the voice message is permanently lost.

The core issue is that offset persistence tracks the MAX completed `update_id` rather than ensuring ALL update IDs up to that point have been processed. Combined with the grammY runner's concurrent sink processing, any out-of-order completion creates a window in which lower-numbered updates can be silently dropped.

## Fix

Created `src/telegram/update-offset-watermark.ts` with a contiguous watermark tracker (TCP ACK window approach). Modified `src/telegram/bot.ts` to use the watermark: it tracks in-flight update IDs and only persists the offset as the highest contiguous completed `update_id`, preventing out-of-order completion from skipping lower-numbered updates.
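A minimal sketch of such a contiguous watermark tracker, assuming the `markStarted`/`markCompleted`/`getCurrentOffset` API named in the review summary; the internals here are an illustrative reconstruction, not the PR's actual implementation:

```typescript
// Contiguous-watermark offset tracker (TCP ACK window style), sketch only.
// The offset never advances past an update that is still in flight; gaps in
// Telegram's update_id sequence (IDs that were never delivered) are skipped.
class UpdateOffsetWatermark {
  private inFlight = new Set<number>();
  private highestSeen: number;
  private watermark: number;

  constructor(initialOffset = 0) {
    this.watermark = initialOffset;
    this.highestSeen = initialOffset;
  }

  markStarted(updateId: number): void {
    this.inFlight.add(updateId);
    if (updateId > this.highestSeen) this.highestSeen = updateId;
  }

  markCompleted(updateId: number): void {
    this.inFlight.delete(updateId);
    // Advance while the next ID is not in flight: either it completed,
    // or it was never delivered (a gap) and is safe to pass over.
    while (
      this.watermark < this.highestSeen &&
      !this.inFlight.has(this.watermark + 1)
    ) {
      this.watermark += 1;
    }
  }

  getCurrentOffset(): number {
    return this.watermark;
  }
}

// The exact race from the PR: 101 completes before 100.
const wm = new UpdateOffsetWatermark(99);
wm.markStarted(100);
wm.markStarted(101);
wm.markCompleted(101);
console.log(wm.getCurrentOffset()); // 99: update 100 is still in flight
wm.markCompleted(100);
console.log(wm.getCurrentOffset()); // 101: contiguous through 101
```

With the previous MAX-based persistence, completing 101 first would immediately persist offset=101; here the offset holds at 99 until update 100 also completes.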
## Regression Test

Added `src/telegram/update-offset-watermark.test.ts` with 4 tests: the offset does not advance past in-flight updates, in-order completion is handled, the initial offset is respected, and gaps in update IDs are handled.

## Verification

1. Diff matches claimed changes: 3 files touched (bot.ts, new watermark module, and its test).
2. Security: .gitignore exists; no sensitive files in the diff.
3. All tests pass (1 pre-existing failure on main, unrelated to this fix).
4. The regression test covers the exact race scenario: out-of-order completion does not advance the offset past in-flight updates.
5. The fix is minimal and targeted: the watermark tracker replaces direct MAX `update_id` persistence with contiguous-completion tracking (TCP ACK window approach).
6. The fix addresses the root cause: the offset only advances when all lower update IDs are complete, preventing skipped updates on crash.
7. The test would fail without the fix: without the watermark, completing update 101 before 100 would immediately persist offset=101.

### Greptile Summary

This PR introduces a contiguous watermark tracker (`update-offset-watermark.ts`) that replaces the previous "max completed update_id" approach with a TCP ACK window pattern, ensuring the persisted offset only advances once all lower-numbered updates have completed. This correctly addresses the race condition in which out-of-order completion could cause lower-numbered in-flight updates to be permanently skipped on crash/restart.

- **New module** `update-offset-watermark.ts` cleanly implements the watermark tracker with a `markStarted`/`markCompleted`/`getCurrentOffset` API and proper contiguous-completion logic.
- **Integration in `bot.ts`** wires the watermark into the existing middleware pipeline, calling `markStarted` before handler dispatch and `markCompleted` after.
- **Bug found:** `markStarted` is called unconditionally for every update entering the middleware, but `recordUpdateId` short-circuits (skips `markCompleted`) for re-delivered updates with ID ≤ `lastUpdateId`. This leaves phantom entries in the `inFlight` set that permanently block watermark advancement: effectively the same class of bug this PR aims to fix, but triggered by update re-delivery instead of out-of-order completion.
- **Test coverage** is solid for the watermark module in isolation (4 tests covering the core scenarios) but does not cover the integration-level bug with skipped updates.

### Confidence Score: 2/5

- The watermark logic is sound in isolation, but the bot.ts integration has a bug that can permanently block offset advancement after update re-delivery.
- The score of 2 reflects that while the core watermark module is well designed and tested, the integration in bot.ts has a logical bug: `markStarted` is called for re-delivered updates that will never receive a corresponding `markCompleted` call, leaving phantom in-flight entries that block all future watermark advancement. This is the same class of update-loss bug the PR aims to fix.
- `src/telegram/bot.ts`: the `markStarted`/`markCompleted` pairing in the middleware needs to be guarded against re-delivered (already-processed) update IDs.

<sub>Last reviewed commit: e120cd2</sub>
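One way to address the reviewer's pairing concern is to skip the watermark entirely for re-delivered updates and guarantee `markCompleted` in a `finally` block. This is a hypothetical sketch, not code from the PR; the `Watermark` interface, `handleUpdate` wrapper, and `next` callback are simplified stand-ins for the real grammY middleware wiring:

```typescript
// Minimal stand-in for the tracker's API as described in the review.
interface Watermark {
  markStarted(id: number): void;
  markCompleted(id: number): void;
}

// Hypothetical middleware wrapper: re-delivered updates never enter the
// watermark, so no phantom in-flight entry can block advancement, and
// markCompleted always fires even when the handler throws.
async function handleUpdate(
  updateId: number,
  lastUpdateId: number,
  watermark: Watermark,
  next: () => Promise<void>,
): Promise<void> {
  if (updateId <= lastUpdateId) {
    return; // already processed and persisted: skip, touching nothing
  }
  watermark.markStarted(updateId);
  try {
    await next();
  } finally {
    watermark.markCompleted(updateId); // guaranteed pairing, even on error
  }
}
```

Guarding at entry (rather than inside `recordUpdateId`) keeps the start/complete pairing invariant local to one function, which makes the phantom-entry bug structurally impossible rather than merely patched.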
