Open Source Actors Scraper collects open-source actors from a public store listing and exports them as clean, structured data. It solves the time-consuming work of manually browsing and tracking open-source tools by turning listings into a searchable dataset you can plug into your workflow.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for open-source-actors-scraper you've just found your team — Let’s Chat. 👆👆
This project crawls listing pages, extracts key details for each open-source actor, and saves results in a consistent JSON dataset. It’s built for developers, researchers, and product teams who want an always-reproducible way to discover, catalog, and analyze open-source actors at scale.
- Starts from one or more user-provided seed URLs (start URLs).
- Limits crawl scope using a configurable maximum pages setting.
- Extracts structured listing details from each visited page.
- Stores normalized records in a dataset for filtering, export, and analysis.
- Logs each saved record to make runs easy to audit and debug.
| Feature | Description |
|---|---|
| Configurable start URLs | Point the scraper at any supported listing or category page to begin crawling. |
| Page limit controls | Keep runs predictable by capping how many pages are scraped per execution. |
| Fast HTML parsing | Uses server-side HTML parsing for reliable extraction from static markup. |
| Structured dataset output | Saves consistent JSON records that are easy to search, export, or feed into pipelines. |
| Built-in logging | Prints each saved result so you can verify progress and spot issues quickly. |
| TypeScript-first codebase | Strong typing improves maintainability and reduces runtime mistakes. |
| Extensible extraction logic | Add more fields or custom parsing rules with minimal changes. |
| Field Name | Field Description |
|---|---|
| name | The actor’s display name as shown in listings. |
| url | The canonical URL to the actor’s detail page. |
| description | Short summary text describing what the actor does. |
| isOpenSource | Boolean flag indicating whether the listing is open-source. |
| author | Publisher or maintainer name (if present on the page). |
| categories | Category tags or grouping labels (if available). |
| updatedAt | Last updated date/time parsed from the listing/detail page (if available). |
| stats | Public counters such as runs, likes, or popularity indicators (if available). |
| sourceRepoUrl | Link to the source repository when exposed publicly (if available). |
| scrapedAt | Timestamp of when the record was collected for traceability. |
[
{
"name": "Example Open Source Actor",
"url": "https://example.com/actors/example-open-source-actor",
"description": "Collects structured data from a listing page.",
"isOpenSource": true,
"author": "Example Maintainer",
"categories": ["data", "automation"],
"updatedAt": "2025-12-10T14:22:11.000Z",
"stats": { "runs": 12450, "likes": 312 },
"sourceRepoUrl": "https://github.com/example/example-open-source-actor",
"scrapedAt": "2025-12-12T09:00:00.000Z"
}
]
Open Source Actors Scraper/
├── src/
│ ├── main.ts
│ ├── crawler/
│ │ ├── createCrawler.ts
│ │ └── handlers.ts
│ ├── extractors/
│ │ ├── actorListingExtractor.ts
│ │ ├── actorDetailExtractor.ts
│ │ └── normalize.ts
│ ├── config/
│ │ ├── inputSchema.ts
│ │ └── defaults.ts
│ ├── storage/
│ │ ├── dataset.ts
│ │ └── dedupe.ts
│ └── utils/
│ ├── urls.ts
│ ├── logger.ts
│ └── timing.ts
├── test/
│ ├── fixtures/
│ │ └── sample-page.html
│ └── extractors.spec.ts
├── .editorconfig
├── .gitignore
├── package.json
├── tsconfig.json
├── README.md
└── LICENSE
- Product teams use it to map the open-source actor landscape, so they can shortlist tools faster and avoid duplicate research.
- Developers use it to generate a searchable catalog, so they can find relevant open-source actors for automation tasks quickly.
- Data analysts use it to collect structured listings over time, so they can track trends and popularity shifts with real data.
- Open-source maintainers use it to audit discoverability of their projects, so they can spot missing metadata and improve listing quality.
- Agencies use it to build curated directories for clients, so they can deliver recommendations backed by structured evidence.
How do I control what pages get scraped?
Set the seed URLs in the input under startUrls. The crawler begins from those pages and follows links according to the routing logic in src/crawler/handlers.ts. If you want stricter targeting, update URL filtering and enqueue rules inside the request handler.
How do I limit the run so it doesn’t crawl too much?
Use the maxPagesPerCrawl input. The crawler will stop after it processes that many pages, making runs predictable and safe for CI/CD or scheduled jobs.
Why are some fields missing in the output?
Some listings don’t expose all metadata on every page. Fields like author, categories, stats, or sourceRepoUrl are captured when present; otherwise they are omitted or set to null depending on the normalizer logic in src/extractors/normalize.ts.
Can I add more data fields?
Yes. Extend the extractor(s) in src/extractors/actorListingExtractor.ts and/or src/extractors/actorDetailExtractor.ts, then update the normalizer to keep the output schema consistent. Add a fixture and test under test/ to prevent regressions.
Primary Metric: ~35–70 pages/min on typical listing pages (HTML-only), depending on network latency and page size.
Reliability Metric: 98–99.5% successful requests on stable targets across repeated runs with retry enabled.
Efficiency Metric: ~120–220 MB peak memory usage during runs capped at 1,000 pages, with CPU usage staying moderate due to non-browser parsing.
Quality Metric: 92–97% field completeness for core fields (name, url, description), with optional metadata varying by page availability.
