#11688: feat(telegram): add health check watchdog for long-polling

by rmfalco89 open 2026-02-08 05:11

channel: telegram stale

Cluster: Telegram Timeout and Recovery Improvements

## Summary

- Adds a periodic health check watchdog to the Telegram long-polling monitor
- Pings `bot.api.getMe()` every 2 minutes with a 10-second timeout
- If the check fails or times out, assumes the TCP connection is silently dead and restarts polling with a fresh connection
- Resets backoff counter on health-check-triggered restarts since they are controlled, not crash-induced

## Motivation

After periods of inactivity, NAT gateways and firewalls may silently drop idle TCP connections. When this happens, the long-polling socket hangs indefinitely: Grammy receives no error and no timeout, so the bot appears alive but never receives updates. This is especially common on headless servers or machines behind carrier-grade NAT. The watchdog detects this condition and recovers automatically.

## Test plan

- [ ] Verify normal polling still works without interference from the health check
- [ ] Simulate a stale connection (e.g. firewall rule drop) and confirm the watchdog detects it and restarts polling
- [ ] Confirm health check timer is properly cleaned up on shutdown

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds a watchdog around the Telegram long-polling runner that periodically pings `bot.api.getMe()` (2-minute interval, 10-second timeout). If the health check fails, it stops the current `@grammyjs/runner` instance and restarts polling, resetting the backoff counter for these controlled restarts.

The change is isolated to `src/telegram/monitor.ts` and integrates into the existing restart/backoff loop for polling mode, with cleanup in `finally` to clear the interval and remove the abort listener.

<h3>Confidence Score: 3/5</h3>

- Mostly safe to merge, but the watchdog implementation has real timer/concurrency issues to fix first.
- The overall approach (periodic getMe + restart runner) is sound and cleanup is attempted in a finally block, but the current implementation leaks per-check timeout timers and can run overlapping async interval callbacks on hung connections, leading to duplicated requests and stop/restart races.
- src/telegram/monitor.ts
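
The two issues flagged above (leaked per-check timeout timers and overlapping interval callbacks) could be addressed along the following lines. This is a minimal sketch, not the code in `src/telegram/monitor.ts`: `createHealthWatchdog`, `withTimeout`, and the option names are hypothetical, and the ping is passed in as a plain function so the example does not depend on any bot library. The defaults mirror the 2-minute interval and 10-second timeout described in the PR.

```typescript
// Hypothetical helper names; this sketch is not taken from the PR's diff.
type WatchdogOptions = {
  ping: () => Promise<unknown>; // assumed to wrap something like bot.api.getMe()
  onUnhealthy: (err: unknown) => void | Promise<void>; // e.g. restart polling
  intervalMs?: number; // defaults to 2 minutes, as in the PR description
  timeoutMs?: number; // defaults to 10 seconds, as in the PR description
};

// Races `work` against a timer and always clears the timer afterwards,
// so no per-check timeout is left pending once the check has settled.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`health check timed out after ${ms} ms`)),
      ms,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

function createHealthWatchdog(opts: WatchdogOptions): { stop: () => void } {
  const intervalMs = opts.intervalMs ?? 120_000;
  const timeoutMs = opts.timeoutMs ?? 10_000;
  let inFlight = false; // guards against overlapping checks
  let stopped = false;

  const tick = async (): Promise<void> => {
    // Skip this tick if the previous check has not settled yet.
    if (inFlight || stopped) return;
    inFlight = true;
    try {
      await withTimeout(opts.ping(), timeoutMs);
    } catch (err) {
      // A failed or timed-out check is treated as a dead connection.
      if (!stopped) await opts.onUnhealthy(err);
    } finally {
      inFlight = false;
    }
  };

  const handle = setInterval(() => {
    // Swallow errors thrown by onUnhealthy so they do not surface
    // as unhandled promise rejections.
    tick().catch(() => {});
  }, intervalMs);

  return {
    stop: () => {
      stopped = true;
      clearInterval(handle);
    },
  };
}
```

In the polling loop, `ping` would presumably be `() => bot.api.getMe()` and `onUnhealthy` would stop the current runner so the existing restart logic opens a fresh connection. Calling `stop()` from the `finally` block covers the shutdown case in the test plan. Because `inFlight` stays set while `onUnhealthy` is awaited, a restart that is still in progress also blocks further checks, which narrows the window for stop/restart races.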