#19421: feat(tools): apify social scrapers

by protoss70 open 2026-02-17 19:41

size: XL

## Summary

Describe the problem and fix in 2–5 bullets:

- **Problem:** OpenClaw had no native tool for scraping social media platforms; users had to rely on `web_fetch`, which returns unstructured HTML and can't handle platform-specific data (posts, profiles, comments, jobs).
- **Why it matters:** Structured social media data (engagement metrics, profile details, job listings) is essential for use cases like lead generation, competitor analysis, and content research. A dedicated tool gives the LLM reliable, formatted results instead of noisy HTML.
- **What changed:** Added a new `social_platforms` tool backed by Apify Actors that supports Instagram, TikTok, YouTube, and LinkedIn with a two-phase async start/collect pattern, per-platform formatters, external content security wrapping, caching, and comprehensive configuration.
- **What did NOT change (scope boundary):** No existing tools, config schemas, or security primitives were modified beyond the minimum needed to register the new tool. The `web_fetch` / `web_search` tools are untouched. No changes to the gateway, auth, or storage layers.

## AI-Assisted Contribution 🤖

This PR was built with heavy assistance from **Claude Code**. The code was personally tested on a VPS running OpenClaw for end-to-end functional verification. Comprehensive e2e tests with mocked Apify responses are included. I understand what the code does and have reviewed all generated output.

- [x] Mark as AI-assisted in the PR title or description
- [x] Note the degree of testing: **fully tested** (e2e test suite + manual VPS testing)
- [x] Confirm I understand what the code does

## Change Type (select all)

- [ ] Bug fix
- [x] Feature
- [ ] Refactor
- [x] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #10959
- Related #

## User-visible / Behavior Changes

- New `social_platforms` tool available to agents when `APIFY_API_KEY` is set (or `tools.social.apiKey` in config).
- New `tools.social` config section with `enabled`, `apiKey`, `baseUrl`, `cacheTtlMinutes`, `maxResults`, and `allowedPlatforms` options.
- New `APIFY_API_KEY` env var placeholder in `.env.example`.
- Tool is auto-enabled when an API key is present; no action required for existing users who don't set the key.

## Security Impact (required)

- New permissions/capabilities? `Yes`: new tool that can make outbound HTTP calls to Apify's API on behalf of the user.
- Secrets/tokens handling changed? `Yes`: reads `APIFY_API_KEY` env var or `tools.social.apiKey` config value; passed via `Authorization: Bearer` header to Apify API only.
- New/changed network calls? `Yes`: outbound HTTPS calls to `https://api.apify.com` (hardcoded prefix validation prevents SSRF to other hosts).
- Command/tool execution surface changed? `Yes`: adds one new tool (`social_platforms`) to the agent tool registry.
- Data access scope changed? `No`
- If any `Yes`, explain risk + mitigation:
- **Risk:** Apify API key leakage or misuse. **Mitigation:** Key is resolved via `normalizeSecretInput`, only sent to URLs matching the `https://api.apify.com` prefix (enforced by `resolveSocialBaseUrl` validation), and never included in tool output.
- **Risk:** Prompt injection from scraped social media content. **Mitigation:** All results are wrapped with `wrapExternalContent` markers (`<<<EXTERNAL_UNTRUSTED_CONTENT>>>`) to signal untrusted data to the model.
- **Risk:** SSRF via custom `baseUrl`. **Mitigation:** `baseUrl` is validated to start with `https://api.apify.com`; any other value throws an error.

## Repro + Verification

### Environment

- OS: macOS / Linux (VPS)
- Runtime/container: Node.js
- Model/provider: N/A (tool-level change)
- Integration/channel (if any): Apify API
- Relevant config (redacted): `APIFY_API_KEY=apify_api_***`

### Steps

1. Set `APIFY_API_KEY` in env or config.
2. Call `social_platforms` with `action: "start"` and a request array (e.g., YouTube search for "web scraping").
3. Call `social_platforms` with `action: "collect"` using the returned run references.

### Expected

- Start returns run IDs and dataset IDs for each request.
- Collect returns formatted markdown results wrapped in external content markers.

### Actual

- Works as expected: start fires concurrent Apify Actor runs, collect fetches and formats results.

## Evidence

Attach at least one:

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

## Human Verification (required)

What you personally verified (not just CI), and how:

- Verified scenarios: Deployed on personal VPS running OpenClaw and tested all four platforms (Instagram, TikTok, YouTube, LinkedIn) with live Apify API calls. Verified start/collect flow, result formatting, and caching behavior end-to-end. Also ran the full e2e test suite locally.
- Edge cases checked: Missing API key returns null tool, disabled platform rejects requests, API errors surface in errors array, pending runs reported correctly, cache returns on second collect, external content security wrapping present on all results, LinkedIn company action fires two concurrent runs (details + posts).
- What you did **not** verify: Exhaustive testing of every `actorInput` option combination across all platforms.

## Compatibility / Migration

- Backward compatible? `Yes`: purely additive; no existing behavior changed.
- Config/env changes? `Yes`: new optional `APIFY_API_KEY` env var and `tools.social` config section. No action required for users who don't want the feature.
- Migration needed? `No`
- If yes, exact upgrade steps: N/A

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Set `tools.social.enabled: false` in config, or remove `APIFY_API_KEY` from env. The tool will not be registered.
- Files/config to restore: Revert the commit; the tool is self-contained in `src/agents/tools/social-platforms.ts` and wired in via `src/agents/openclaw-tools.ts`.
- Known bad symptoms reviewers should watch for: Unexpected tool registration when no API key is set, API key appearing in tool output/logs, unhandled promise rejections from Apify API timeouts.

## Risks and Mitigations

- Risk: Apify API rate limits or downtime could cause tool calls to fail or hang.
  - Mitigation: 30-second HTTP timeout (`HTTP_TIMEOUT_MS`), errors are caught and surfaced in the response rather than crashing the agent. Results are cached for 15 minutes by default to reduce repeated calls.
- Risk: Scraped content could contain prompt injection attempts.
  - Mitigation: All results wrapped with `wrapExternalContent` security markers. Raw data is placed inside `<details>` blocks to reduce surface area.
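
For reviewers, here is a minimal sketch of how the timeout and caching mitigation could be structured. It is not the PR's actual code. Only `HTTP_TIMEOUT_MS` and the 30-second and 15-minute values come from the description; `fetchJsonWithTimeout`, `TtlCache`, and `FetchResult` are hypothetical names used for illustration.

```typescript
// Hypothetical sketch; the real implementation lives in
// src/agents/tools/social-platforms.ts and may differ.
const HTTP_TIMEOUT_MS = 30_000; // 30-second timeout named in the PR description
const DEFAULT_CACHE_TTL_MS = 15 * 60_000; // 15-minute default cache TTL

type FetchResult<T> = { ok: true; data: T } | { ok: false; error: string };

// Aborts the request after timeoutMs and converts every failure into a value,
// so a slow or failing API call never surfaces as an unhandled rejection.
async function fetchJsonWithTimeout<T>(
  url: string,
  headers: Record<string, string> = {},
  timeoutMs: number = HTTP_TIMEOUT_MS,
): Promise<FetchResult<T>> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { headers, signal: controller.signal });
    if (!res.ok) {
      return { ok: false, error: `HTTP ${res.status}` };
    }
    return { ok: true, data: (await res.json()) as T };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    clearTimeout(timer);
  }
}

// Small in-memory cache whose entries expire after ttlMs.
// The clock is injectable so expiry can be tested without waiting.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = DEFAULT_CACHE_TTL_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Returning a result object instead of throwing matches the described behavior of surfacing failures in an errors array, and the injectable clock keeps the "cache returns on second collect" case easy to cover in tests.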
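
The Security Impact section says `baseUrl` must start with `https://api.apify.com`. The description does not show how `resolveSocialBaseUrl` performs that check, so the sketch below is an assumption about one way to do it, not the PR's code. It compares the parsed origin because a raw string prefix test would also accept look-alike hosts such as `https://api.apify.com.evil.example`. `validateApifyBaseUrl` and `APIFY_ORIGIN` are hypothetical names.

```typescript
// Hypothetical sketch; not the PR's resolveSocialBaseUrl.
const APIFY_ORIGIN = "https://api.apify.com";

function validateApifyBaseUrl(raw: string | undefined): string {
  // Fall back to the default host when nothing is configured.
  if (raw === undefined || raw.trim() === "") {
    return APIFY_ORIGIN;
  }

  let parsed: URL;
  try {
    parsed = new URL(raw.trim());
  } catch {
    throw new Error(`Invalid baseUrl: ${raw}`);
  }

  // The origin covers scheme, host, and port, so look-alike hosts and
  // non-HTTPS schemes are rejected here.
  if (parsed.origin !== APIFY_ORIGIN) {
    throw new Error(`baseUrl must point at ${APIFY_ORIGIN}`);
  }

  // The origin ignores userinfo, so embedded credentials are rejected separately.
  if (parsed.username !== "" || parsed.password !== "") {
    throw new Error("baseUrl must not contain credentials");
  }

  // Normalize by stripping trailing slashes so paths can be appended uniformly.
  return parsed.href.replace(/\/+$/, "");
}
```

For example, `validateApifyBaseUrl("https://api.apify.com/v2/")` returns `https://api.apify.com/v2`, while `validateApifyBaseUrl("https://api.apify.com.evil.example")` throws.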