Thordata/scrape-apify-actors-with-developers

Scrape Apify Actors with Developer Info

This project scrapes all public Actors from the Apify Store (via the public Algolia index) and enriches them with developer profile information, then exports the result to CSV / Markdown / Excel.

The script is designed to be:

  • Accurate – uses Algolia’s structured fields for stats/pricing instead of brittle HTML parsing.
  • Robust – async I/O, retries with backoff, and range-based pagination to bypass Algolia’s 16k pagination limit.
  • Ready for analysis – clean, normalized columns that work well in Excel, BI tools, or Python notebooks.

Preview

[Image: Excel preview]

Data fields

Each row represents a single Actor and contains at least:

  • actor_url – full URL, e.g. https://apify.com/epctex/youtube-video-downloader
  • actor_name – Actor title from the Store (human‑readable)
  • pricing – human‑readable summary derived from currentPricingInfo, e.g.
    • FREE
    • FLAT_PRICE_PER_MONTH; 15 USD/month; trialMinutes=4320
    • PAY_PER_EVENT; minimalMaxTotalChargeUsd=0.5; primaryEvent=Scraped place; FREE=0.004 USD/event
    • PRICE_PER_DATASET_ITEM; unit=result; FREE=0.0005 USD/item
  • bookmarked – bookmark count (integer)
  • total_users – total users count (integer)
  • monthly_active_users – 30‑day active users (integer, when available)
  • developer_name – developer display name (falls back to userFullName / username)
  • developer_profile_url – https://apify.com/<username>
  • developer_joined – join date from the profile, e.g. Joined May 2023
  • developer_contacts – all external/contact links found on the profile main content, e.g.:
    • email addresses (user@example.com, mailto:user@example.com)
    • websites (https://example.com)
    • social profiles (LinkedIn, Twitter/X, GitHub, YouTube, etc.)

If a developer genuinely didn’t provide any external links, developer_contacts is empty for that row.

For transparency, developer profiles that consistently return HTTP errors (e.g. 404) are logged to:

  • developer_profile_failures.csv
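
The developer_joined and developer_contacts fields could be derived from a profile page roughly as sketched below. This is a minimal, dependency-free illustration and not the project's actual code: it uses the standard library's html.parser as a stand-in for BeautifulSoup, and the names MainContentParser and parse_profile, both regular expressions, and the rule "keep mailto: links and links that leave apify.com" are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed patterns; the real script may use different ones.
JOINED_RE = re.compile(r"Joined\s+[A-Z][a-z]+\s+\d{4}")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class MainContentParser(HTMLParser):
    """Collects text and link targets that appear inside <main> only."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside <main>
        self.text_parts = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "main" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text_parts.append(data)


def is_apify_host(host):
    return host == "apify.com" or host.endswith(".apify.com")


def parse_profile(html):
    """Return the join date and external contacts found in the main content."""
    parser = MainContentParser()
    parser.feed(html)
    text = " ".join(parser.text_parts)

    contacts = []
    for href in parser.hrefs:
        if href.startswith("mailto:"):
            contacts.append(href)
            continue
        host = urlparse(href).hostname or ""
        # Relative links have no host; apify.com links are site navigation.
        if host and not is_apify_host(host):
            contacts.append(href)
    contacts.extend(EMAIL_RE.findall(text))  # plain-text email addresses

    joined = JOINED_RE.search(text)
    return {
        "developer_joined": joined.group(0) if joined else "",
        # dict.fromkeys de-duplicates while keeping first-seen order
        "developer_contacts": list(dict.fromkeys(contacts)),
    }
```

Links in the page header or footer never reach the contact list here because they sit outside <main>, which is the same reason the project restricts parsing to the main content.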

How it works (high level)

  1. Discover Actors via Algolia

    • Uses the public Apify Store Algolia index prod_PUBLIC_STORE.
    • Applies numeric range splits on modifiedAt + paginated queries to bypass Algolia’s paginationLimitedTo (16k) limit.
    • Only keeps hits that have both username and name.
  2. Enrich with developer profiles

    • Deduplicates by username and fetches each profile page once (with retries and concurrency limits).
    • Parses the HTML using BeautifulSoup, focusing on the <main> section to avoid global footer/header noise.
    • Extracts:
      • Joined <Month> <Year> (regex over the full text blob).
      • All external/contact links and emails from the main content (with a conservative fallback if the page has no <main>).
  3. Export

    • CSV: apify_actors_with_developers.csv (full dataset).
    • Markdown: apify_actors_with_developers.md (first 200 rows for quick diff/preview).
    • Excel: apify_actors_with_developers.xlsx with:
      • Frozen header row.
      • Auto‑sized columns (capped to a reasonable width).
      • Clickable hyperlinks for actor_url and developer_profile_url.
      • Wrapped text for long fields like pricing and developer_contacts.
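
The range-based pagination in step 1 can be pictured as a recursive split: if a modifiedAt range holds more hits than the pagination limit, cut it in half and try again. The sketch below is one way to do that under stated assumptions, not the project's actual implementation: modifiedAt is assumed to be an integer timestamp, the count callback (hypothetical) stands in for an Algolia query that returns the number of hits in a range, and split_ranges and numeric_filter are illustrative names.

```python
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # inclusive [lo, hi] bounds on modifiedAt


def split_ranges(
    lo: int,
    hi: int,
    count: Callable[[int, int], int],
    limit: int = 16000,
) -> List[Range]:
    """Split [lo, hi] until every sub-range holds at most `limit` hits."""
    n = count(lo, hi)
    if n == 0:
        return []  # nothing to fetch in this range
    if n <= limit or lo == hi:
        # A single-value range cannot be split further, so it is
        # accepted even if it is still over the limit.
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split_ranges(lo, mid, count, limit) + split_ranges(mid + 1, hi, count, limit)


def numeric_filter(lo: int, hi: int) -> str:
    """Algolia filter string for an inclusive numeric range on modifiedAt."""
    return f"modifiedAt:{lo} TO {hi}"
```

Each returned range can then be paginated on its own, since it is known to fit under the limit. The sub-ranges never overlap, so no Actor is fetched twice.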

Tech stack

  • Python 3
  • httpx – async HTTP client
  • BeautifulSoup4 – HTML parsing
  • pandas – data wrangling & tabular exports
  • openpyxl – Excel (.xlsx) output
  • tenacity – retry with exponential backoff + jitter
  • tqdm – progress bars for long‑running jobs

All Python dependencies are listed in requirements.txt.
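
To make the "retry with exponential backoff + jitter" and "concurrency limits" ideas concrete, here is a dependency-free sketch built on asyncio alone. It is a hand-rolled stand-in for what tenacity and httpx provide in the project, not a copy of the script: with_retries, fetch_all, and their parameters are hypothetical names, and fetch_one is whatever coroutine actually performs the request.

```python
import asyncio
import random


async def with_retries(make_call, attempts=4, base_delay=0.5, max_delay=8.0):
    """Await make_call(); on failure wait with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so workers do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


async def fetch_all(keys, fetch_one, concurrency=30, **retry_options):
    """Run fetch_one(key) once per unique key, at most `concurrency` at a time.

    Returns (ok, failed): results by key, and error descriptions by key.
    """
    semaphore = asyncio.Semaphore(concurrency)
    unique = list(dict.fromkeys(keys))  # de-duplicate, keep first-seen order

    async def guarded(key):
        async with semaphore:
            try:
                value = await with_retries(lambda: fetch_one(key), **retry_options)
                return key, value, None
            except Exception as exc:
                return key, None, exc

    rows = await asyncio.gather(*(guarded(key) for key in unique))
    ok = {key: value for key, value, err in rows if err is None}
    failed = {key: repr(err) for key, value, err in rows if err is not None}
    return ok, failed
```

Keys that still fail after the last attempt end up in the second dictionary, which is the kind of record that could be written to a failures file.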

Setup

git clone https://github.com/Thordata/scrape-apify-actors-with-developers.git
cd scrape-apify-actors-with-developers

python -m venv .venv
source .venv/Scripts/activate  # on Windows Git Bash
# or: source .venv/bin/activate  # on Linux / macOS / WSL
# or: .venv\Scripts\activate   # on classic Windows CMD/PowerShell

pip install -r requirements.txt

Required environment variables

The script does not use your private Apify API token for listing Actors.
Instead, it relies on the public search‑only Algolia key that the Apify Store itself uses.

You need to provide:

  • APIFY_STORE_ALGOLIA_APP_ID – Algolia application id (from the Apify Store network requests)
  • APIFY_STORE_ALGOLIA_API_KEY – public search‑only key for prod_PUBLIC_STORE (from the Apify Store network requests)

Example (Git Bash / WSL):

export APIFY_STORE_ALGOLIA_APP_ID=<YOUR_APP_ID>
export APIFY_STORE_ALGOLIA_API_KEY=<YOUR_SEARCH_ONLY_KEY>

Important:

  • Do not commit your private Apify account tokens to Git.
  • The key used here should be the public search‑only key exposed by the Store UI, not your personal API token.

How to obtain the Algolia keys

  1. Open the Apify Store page (e.g. https://apify.com/store/categories)
  2. Open DevTools → Network
  3. Find a request to:
    • ...algolia.net/1/indexes/prod_PUBLIC_STORE/query
  4. Copy request headers:
    • x-algolia-application-id
    • x-algolia-api-key
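
Once both values are exported, a search request can be assembled along the following lines. This is a sketch under assumptions rather than the script's own code: build_query_request is a hypothetical helper, query_url is the full request URL you copied from the Network tab (the host is deliberately not guessed here), and the JSON body uses Algolia's standard query, page, hitsPerPage and filters search parameters.

```python
import json
import os
import urllib.request


def build_query_request(query_url, page=0, hits_per_page=100, filters=""):
    """Build (but do not send) a POST request for one page of search results."""
    app_id = os.environ["APIFY_STORE_ALGOLIA_APP_ID"]
    api_key = os.environ["APIFY_STORE_ALGOLIA_API_KEY"]

    # An empty query matches every record in the index.
    body = {"query": "", "page": page, "hitsPerPage": hits_per_page}
    if filters:
        body["filters"] = filters

    return urllib.request.Request(
        query_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it; the response is expected to be JSON with a hits array, though the exact hit fields are best confirmed against a real response.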

Optional tuning

You can tweak runtime behaviour via environment variables:

  • APIFY_STORE_LIMIT_ACTORS – limit number of Actors for faster test runs (default: 0 = no limit).
    • e.g. export APIFY_STORE_LIMIT_ACTORS=200 for a quick smoke test.
  • APIFY_STORE_HITS_PER_PAGE – Algolia hitsPerPage (default: 100).
  • APIFY_STORE_PAGE_CONCURRENCY – concurrent Algolia page fetches (default: 15).
  • APIFY_STORE_PROFILE_CONCURRENCY – concurrent developer profile fetches (default: 30).
  • APIFY_STORE_PAGINATION_LIMIT – expected Algolia pagination limit (default: 16000).
  • APIFY_STORE_OUTPUT_BASENAME – base name for output files (default: apify_actors_with_developers).
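
One way to read these variables with the documented defaults is a small settings object, sketched below. The Settings class, its field names, and the _int_env helper are illustrative assumptions; only the environment variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass

DEFAULT_BASENAME = "apify_actors_with_developers"


def _int_env(env, name, default):
    """Read an integer variable, treating unset or blank as the default."""
    raw = env.get(name, "").strip()
    return int(raw) if raw else default


@dataclass(frozen=True)
class Settings:
    limit_actors: int = 0  # 0 means no limit
    hits_per_page: int = 100
    page_concurrency: int = 15
    profile_concurrency: int = 30
    pagination_limit: int = 16000
    output_basename: str = DEFAULT_BASENAME

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            limit_actors=_int_env(env, "APIFY_STORE_LIMIT_ACTORS", 0),
            hits_per_page=_int_env(env, "APIFY_STORE_HITS_PER_PAGE", 100),
            page_concurrency=_int_env(env, "APIFY_STORE_PAGE_CONCURRENCY", 15),
            profile_concurrency=_int_env(env, "APIFY_STORE_PROFILE_CONCURRENCY", 30),
            pagination_limit=_int_env(env, "APIFY_STORE_PAGINATION_LIMIT", 16000),
            output_basename=env.get("APIFY_STORE_OUTPUT_BASENAME", "").strip()
            or DEFAULT_BASENAME,
        )
```

Accepting an optional mapping instead of always reading os.environ keeps the parsing easy to test without touching the real environment.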

Usage

1. Quick sample run (recommended)

Verify the pipeline and field correctness on a small sample:

export APIFY_STORE_LIMIT_ACTORS=200
python scrape_apify_actors.py

Check:

  • apify_actors_with_developers.csv and .xlsx – validate pricing, stats and contacts for a few Actors.
  • developer_profile_failures.csv – developer profiles that consistently returned HTTP errors (e.g. 404).

2. Full crawl (all public Actors)

Unset the limit (or set to 0) and run:

unset APIFY_STORE_LIMIT_ACTORS  # or: export APIFY_STORE_LIMIT_ACTORS=0
python scrape_apify_actors.py

Depending on your connection, the full crawl (≈ 22k Actors plus roughly 15 batches of developer profile fetches) usually completes within about 10–15 minutes.

Output files

After a successful run you should see:

  • apify_actors_with_developers.csv – full dataset.
  • apify_actors_with_developers.xlsx – full dataset, formatted for Excel.
  • apify_actors_with_developers.md – first 200 rows (Markdown table for quick inspection).
  • developer_profile_failures.csv – optional; only present if some developer profiles consistently failed (non‑200 HTTP).
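
For a sense of how the Markdown preview relates to the CSV, the sketch below renders the first rows of CSV text as a Markdown table using only the standard library. It is an illustration and not how the script itself produces the file (that is presumably done through pandas); markdown_preview and its max_rows parameter are hypothetical names.

```python
import csv
import io


def markdown_preview(csv_text, max_rows=200):
    """Render the header plus the first `max_rows` records as a Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:max_rows + 1]

    def cell(value):
        # Escape pipes and flatten newlines so each record stays on one row.
        return value.replace("|", "\\|").replace("\n", " ")

    def line(values):
        return "| " + " | ".join(values) + " |"

    lines = [line([cell(c) for c in header]), line(["---"] * len(header))]
    lines += [line([cell(c) for c in row]) for row in body]
    return "\n".join(lines)
```

Escaping matters for fields such as pricing and developer_contacts, which can contain separators or line breaks that would otherwise break the table layout.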

Notes & caveats

  • This project is unofficial and not affiliated with Apify.
  • Data comes from public sources (Apify Store + public developer profiles).
    Please respect Apify’s Terms of Service and any applicable rate limits.
  • Some developers genuinely don’t provide external contact info – those rows will have an empty developer_contacts.
  • A small number of developer profile URLs might be dead or private; those are recorded in developer_profile_failures.csv.

Contributing

Suggestions and pull requests are welcome. Some ideas:

  • Add richer pricing normalization (e.g. per‑month cost estimate for PAY_PER_EVENT models).
  • Add more advanced segmentation (e.g. by category, MAU buckets) into separate sheets.
  • Integrate with a dashboard (e.g. Superset, Metabase) for “Apify ecosystem analytics”.

If you find this useful, consider ⭐ starring or forking the repo – it helps others discover it. :)

About

Scrape Apify Actors and developer info from the Apify Store: efficiently crawls the full set of Actors in the Apify Store together with their developers. ✅ Extract: Actor URL/Name/Pricing/Users/Bookmarks + Developer Profile/Contacts/Join Date ✅ Output: Excel (styled)/CSV/Markdown ✅ Optimized: Async concurrency + cache + anti-blocking
