#11472: fix: retry media fetch on transient network errors
stale
Cluster: Error Resilience and Retry Logic
## Summary
Adds exponential backoff retry to `fetchRemoteMedia()` in `src/media/fetch.ts` for transient network failures.
## Problem
When fetching media from provider APIs (Telegram, Discord, Slack, etc.), a single transient `TypeError: fetch failed` causes the entire inbound message to be dropped. The agent never sees the message, and there is no re-delivery mechanism.
This is especially common in VM/container environments where network connectivity to provider APIs can be intermittent.
## Fix
- Retries up to **3 times** with exponential backoff (**1s → 2s → 4s**)
- Retries only on network-level fetch failures (the `catch` block)
- **Does not retry** deterministic errors: HTTP status errors (`http_error`) or size limit violations (`max_bytes`)
- Logs each retry attempt for observability
- Cleans up the `release` handle between retries
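
The retry policy in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `fetchWithRetry`, `backoffDelayMs`, `MEDIA_FETCH_BASE_DELAY_MS`, the injectable `sleep` and `log` parameters, and the shape of `MediaFetchError` are all hypothetical, and "3 times" is read here as three retries after the initial attempt.

```typescript
// Hypothetical sketch of the retry policy; names and shapes are assumptions.
type MediaFetchErrorCode = "fetch_failed" | "http_error" | "max_bytes";

class MediaFetchError extends Error {
  readonly code: MediaFetchErrorCode;
  constructor(code: MediaFetchErrorCode, message: string) {
    super(message);
    this.name = "MediaFetchError";
    this.code = code;
  }
}

const MEDIA_FETCH_MAX_RETRIES = 3;
const MEDIA_FETCH_BASE_DELAY_MS = 1000; // assumed constant name

// Delay before retry number `retry` (0-based): 1000, 2000, 4000 ms.
function backoffDelayMs(retry: number): number {
  return MEDIA_FETCH_BASE_DELAY_MS * 2 ** retry;
}

// Only network-level failures are retried; http_error and max_bytes are not.
function isTransient(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "fetch_failed"
  );
}

type Sleep = (ms: number) => Promise<void>;
const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry<T>(
  attempt: () => Promise<T>,
  sleep: Sleep = defaultSleep,
  log: (msg: string) => void = console.warn,
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await attempt();
    } catch (err) {
      // Give up on deterministic errors or once the retry budget is spent.
      if (!isTransient(err) || retry >= MEDIA_FETCH_MAX_RETRIES) throw err;
      const delayMs = backoffDelayMs(retry);
      log(
        `media fetch failed; retry ${retry + 1}/${MEDIA_FETCH_MAX_RETRIES} in ${delayMs}ms`,
      );
      await sleep(delayMs);
    }
  }
}
```

In this sketch, any per-attempt resource such as the `release` handle would need to be cleaned up inside the `attempt` callback before it throws, so that each retry starts from a clean state.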
## Testing
Verified locally by:
1. Observing a `MediaFetchError: fetch_failed` dropping a Telegram photo message (see logs in #11471)
2. Applying the patch to `dist/deliver-BIDW_mg2.js`
3. Restarting the gateway
4. Successfully receiving the same photo on retry
Fixes #11471
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds retry-with-exponential-backoff around `fetchWithSsrFGuard()` in `src/media/fetch.ts` to reduce dropped inbound messages when media fetch fails due to transient network issues. The retry loop logs each backoff attempt and keeps existing behavior for HTTP status errors and max-bytes enforcement in the response handling path.
Key things to double-check before merge:
- The retry loop currently performs one more attempt than `MEDIA_FETCH_MAX_RETRIES` suggests (off-by-one).
- The retry is applied to any error thrown by `fetchWithSsrFGuard`, including deterministic SSRF/URL/redirect errors, which adds delay/noise and diverges from the stated goal of retrying only transient fetch failures.
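
To make the two review points concrete, here is a small hypothetical sketch, not taken from the PR. It assumes `MEDIA_FETCH_MAX_RETRIES` counts retries after the first attempt, and that transient failures surface as `TypeError: fetch failed` (as in the issue description); `isTransientNetworkError` and `callWithBoundedRetries` are illustrative names. Backoff delays are left out so the attempt counting stays visible.

```typescript
// Hypothetical illustration of the two review points; not the PR's code.
const MEDIA_FETCH_MAX_RETRIES = 3;

// Treat only `TypeError: fetch failed` as transient. Anything else, such as
// a URL or redirect validation error from a guard, is not retried.
function isTransientNetworkError(err: unknown): boolean {
  return err instanceof TypeError && err.message === "fetch failed";
}

// Runs `call` once, then retries at most `maxRetries` times on transient
// errors, so the total number of attempts is at most `maxRetries + 1`.
function callWithBoundedRetries<T>(
  call: () => T,
  maxRetries: number = MEDIA_FETCH_MAX_RETRIES,
): T {
  let retriesUsed = 0;
  while (true) {
    try {
      return call();
    } catch (err) {
      if (!isTransientNetworkError(err) || retriesUsed >= maxRetries) {
        throw err;
      }
      retriesUsed++;
    }
  }
}
```

If the constant is instead meant as the total number of attempts, the bound would become `retriesUsed >= maxRetries - 1`; either way, naming the intended meaning in a comment avoids the ambiguity.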
<h3>Confidence Score: 3/5</h3>
- Mergeable after fixing retry semantics and error filtering
- The change is localized and the intent is clear, but the current loop bounds add an extra attempt, and the retry catches deterministic errors from `fetchWithSsrFGuard` (SSRF/URL/redirect validation), causing unnecessary delays and noisy logs in those scenarios.
- src/media/fetch.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #9232: Fix: Add automatic retry for network errors in message runs (by vishaltandale00 · 2026-02-05, 77.6%)
- #23497: feat(retry): add retryHttpAsync utility with comprehensive coverage (by thinstripe · 2026-02-22, 77.1%)
- #15585: fix: add retry/backoff for Gemini embedding API calls (by WalterSumbon · 2026-02-13, 76.6%)
- #16913: fix(agent): increase transient HTTP retry from 1 to 3 with escalati... (by hou-rong · 2026-02-15, 76.3%)
- #17435: fix(debounce): retry flush with exponential backoff to prevent sile... (by widingmarcus-cyber · 2026-02-15, 75.9%)
- #8677: fix: add retry logic to OAuth token refresh (by skyblue-will · 2026-02-04, 75.7%)
- #16239: fix: retry on transient API errors (overloaded, rate-limit, timeout) (by zerone0x · 2026-02-14, 74.9%)
- #12995: feat(infra): Add retry with exponential backoff for transient failures (by trevorgordon981 · 2026-02-10, 74.7%)
- #19540: feat: add timeout and exponential backoff retry for frontend API calls (by Mozzzaic · 2026-02-17, 74.3%)
- #19942: feat(telegram): configurable SSRF policy for media fetch (by onewesong · 2026-02-18, 74.0%)