#5561: fix(telegram): auto-restart on timeout + lower API timeout to 60s

by jesseproudman open 2026-01-31 17:07

channel: telegram gateway

Cluster: Telegram Timeout and Recovery Improvements

## Problem

The Telegram channel handler was getting wedged and not recovering when:

1. grammY's long-polling timed out (500s default)
2. Inference API was slow/unresponsive
3. Agent runs blocked the session lane for 10+ minutes

Once wedged, the channel stayed down until manual restart.

## Solution

### Auto-restart with exponential backoff

- Channels now automatically restart when they exit unexpectedly
- Backoff: 2s → 4s → 8s → 16s → 32s → 60s (capped), with 20% jitter
- Attempt counter resets after 5 minutes of successful operation
- Gives up after 10 consecutive failures to prevent infinite loops

### Lower API timeout

- Reduced grammY API timeout from 500s to 60s
- Allows faster detection and recovery from stuck requests

### Clean shutdown handling

- Deliberate `stopChannel` calls cancel pending restarts
- Reset attempt counters on deliberate stop

## Testing

Tested overnight with a bot that was previously freezing every 1-2 hours. The auto-restart kicked in successfully on timeouts and recovered within seconds.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR improves Telegram channel resilience by (1) adding auto-restart behavior to the gateway channel manager when a channel task exits unexpectedly (with exponential backoff, jitter, and a max-attempt cap), and (2) reducing grammY API call timeouts to 60s by default to detect stuck requests sooner.

Key integration points:

- `src/gateway/server-channels.ts` now tracks per-account restart attempts and pending timers and schedules restarts from the channel task’s `.finally()` when the abort signal was not triggered.
- `src/telegram/bot.ts` now always passes `client.timeoutSeconds` to grammY’s `Bot` so API calls don’t hang for ~500s by default.
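For illustration only (this is not code from the PR diff or from the review), the backoff schedule in the description could be computed roughly as below. The constant and function names are hypothetical. Two details are assumptions, since the description does not spell them out: jitter is applied symmetrically (±20%) after the 60s cap, and "gives up after 10 consecutive failures" means attempt 11 is never scheduled.

```typescript
// Hypothetical names; the values come from the PR description.
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 60_000;
const JITTER_RATIO = 0.2; // assumption: symmetric, i.e. +/-20% of the capped delay
const MAX_ATTEMPTS = 10;

/**
 * Delay before restart attempt `attempt` (1-based), or null once the
 * attempt cap is exceeded. `random` is injectable so the jitter can be
 * made deterministic in tests.
 */
function restartDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number | null {
  if (attempt > MAX_ATTEMPTS) return null; // give up

  // 2s, 4s, 8s, 16s, 32s, then capped at 60s.
  const base = Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), MAX_DELAY_MS);

  // Scale by a factor in [0.8, 1.2) so restarts do not synchronize.
  const jitterFactor = 1 + (random() * 2 - 1) * JITTER_RATIO;
  return Math.round(base * jitterFactor);
}
```

With the jitter neutralized (`random` returning `0.5`), attempts 1 through 6 yield 2000, 4000, 8000, 16000, 32000 and 60000 ms, matching the schedule listed in the description.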
Notable issues:

- The “reset attempts after 5 minutes of successful running” currently uses a timestamp set at start initiation rather than a confirmed healthy-running signal, and it isn’t cleared on deliberate stop; both can reset attempts in cases that don’t represent successful recovery.
- Restart timers aren’t cancelled when scheduling new restarts or when manual starts happen, which can create overlapping restart behavior.
- Telegram bot has two adjacent `bot.catch` handlers, which will double-log errors.

<h3>Confidence Score: 3/5</h3>

- Reasonably safe to merge, but restart bookkeeping has edge cases that can cause unexpected restart behavior and confusing logs.
- Core changes are localized and conceptually straightforward (restart on unexpected exit + lower API timeout). However, the restart-attempt reset uses a start timestamp rather than a proven healthy-running window and isn’t cleared on deliberate stop, and restart timers aren’t de-duplicated/cancelled outside of `stopChannel`. These can weaken the intended max-attempt protection and create overlapping restarts in certain manual-start/rapid-exit scenarios.
- `src/gateway/server-channels.ts`

**Context used:**

- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
- Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))
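As a rough sketch of one way the restart-bookkeeping issues noted in the review might be addressed, the class below keeps per-account state, resets the attempt counter only after a confirmed healthy window, and cancels any pending timer before scheduling another. It is not taken from `src/gateway/server-channels.ts`; `RestartTracker` and all of its methods are hypothetical, and what counts as a "healthy" signal is left to the caller.

```typescript
// Hypothetical sketch; none of these names come from the PR.
const HEALTHY_RESET_MS = 5 * 60_000;

interface RestartState {
  attempts: number;
  healthySince: number | null; // set only by a confirmed healthy signal
  timer: ReturnType<typeof setTimeout> | null;
}

class RestartTracker {
  private readonly states = new Map<string, RestartState>();

  private state(accountId: string): RestartState {
    let s = this.states.get(accountId);
    if (!s) {
      s = { attempts: 0, healthySince: null, timer: null };
      this.states.set(accountId, s);
    }
    return s;
  }

  /** Record a confirmed healthy signal (what qualifies is up to the caller). */
  markHealthy(accountId: string, now: number): void {
    const s = this.state(accountId);
    if (s.healthySince === null) s.healthySince = now;
  }

  /** Call on an unexpected exit; returns the 1-based attempt number. */
  recordExit(accountId: string, now: number): number {
    const s = this.state(accountId);
    // Reset only if the channel was confirmed healthy for the full window.
    if (s.healthySince !== null && now - s.healthySince >= HEALTHY_RESET_MS) {
      s.attempts = 0;
    }
    s.healthySince = null;
    s.attempts += 1;
    return s.attempts;
  }

  /** Schedule a restart, replacing any timer already pending for the account. */
  schedule(accountId: string, delayMs: number, restart: () => void): void {
    const s = this.state(accountId);
    if (s.timer !== null) clearTimeout(s.timer);
    s.timer = setTimeout(() => {
      s.timer = null;
      restart();
    }, delayMs);
  }

  /** Deliberate stop or manual start: drop the pending timer and counters. */
  reset(accountId: string): void {
    const s = this.states.get(accountId);
    if (s && s.timer !== null) clearTimeout(s.timer);
    this.states.delete(accountId);
  }

  hasPendingRestart(accountId: string): boolean {
    const s = this.states.get(accountId);
    return s !== undefined && s.timer !== null;
  }
}
```

Timestamps are passed in rather than read from `Date.now()` so the reset logic can be exercised without waiting five minutes. Calling `reset` from both the deliberate-stop and manual-start paths is what prevents a stale timer from firing alongside a new start.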