Releases: JustAzul/web-scrapper-stdio
v1.5.0
Cloudflare Bypass
Patchright CDP Evasion
- Replaced Playwright with Patchright — a drop-in fork with CDP-level anti-detection
- Most Cloudflare-protected sites are now scraped without triggering any challenge
- Removes dependency on `playwright-stealth` (CDP-level evasion is superior)
Turnstile Captcha Solving
- When `CAPTCHA_API_KEY` is set, Cloudflare Turnstile challenges are solved automatically via a third-party API
- Supports 2Captcha, CapSolver, and CapMonster (all wire-compatible)
- When no API key is set, CF-protected pages return a clear error (existing behavior preserved)
How It Works
- Patchright's CDP patching handles most sites passively (no challenge triggered)
- If a challenge IS triggered, live page detection catches it (title/URL/DOM markers)
- Turnstile script is intercepted via `page.route()` to capture the sitekey and params
- Token is solved via an external API and injected into the page
- Page redirects to real content, scraping continues normally
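The detection-and-solve steps above can be sketched as follows. The page markers are well-known Cloudflare challenge signatures, and the request payload follows 2captcha's public Turnstile API; the function names are illustrative, not this project's actual code.

```python
# Heuristic markers that identify a Cloudflare challenge page
# (title / URL / DOM markers, as described in the release notes).
CF_MARKERS = (
    "just a moment",                       # interstitial <title>
    "cdn-cgi/challenge-platform",          # challenge script path
    "challenges.cloudflare.com/turnstile", # Turnstile widget script
)

def looks_like_cf_challenge(title: str, url: str, html: str) -> bool:
    """Return True if any known challenge marker appears in the live page."""
    blob = f"{title} {url} {html}".lower()
    return any(marker in blob for marker in CF_MARKERS)

def build_solve_request(api_key: str, sitekey: str, page_url: str) -> dict:
    """Payload for 2captcha's task-submit endpoint (Turnstile method)."""
    return {
        "key": api_key,
        "method": "turnstile",
        "sitekey": sitekey,
        "pageurl": page_url,
        "json": 1,
    }
```

The returned token is then written into the page's `cf-turnstile-response` field, after which the page redirects to the real content.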
New Configuration
CAPTCHA_API_KEY=your_key # Required for bypass, empty = disabled
CAPTCHA_PROVIDER=2captcha # 2captcha | capsolver | capmonster
CAPTCHA_BASE_URL= # Optional custom endpoint
CAPTCHA_TIMEOUT=120 # Seconds to wait for solve
Performance
Zero regression: Patchright is marginally faster than Playwright:
- Cold start: 0.338s (was 0.350s)
- Static scrape: 1.996s (was 1.999s)
- JS-heavy: 2.280s (was 2.441s)
Testing
- 46 tests (18 new CF-specific tests)
- Validated against llm-stats.com, 2captcha Turnstile demo, discord.com, canva.com
v1.4.1
Bug Fixes
- Fix race condition in BrowserPool: concurrent coroutines could pick the same slot due to unlocked slot selection. Added `_pick_lock` to serialize slot selection + context creation.
- Fix stale default: `extract_text_from_url()` default grace period was still 2.0s instead of the configured 0.5s when called directly (not through MCP).
- Deduplicate stealth logic: pool now calls `apply_stealth()` instead of inlining the same 3 Playwright calls.
- Remove redundant validation: `ScrapeArgs` double-validation removed from the `scrape_web` tool (FastMCP handles this).
- Clean up: removed unused imports, updated stale comments.
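The BrowserPool race fix can be sketched minimally: without a lock, two coroutines awaiting between "find free slot" and "mark slot busy" can claim the same slot; an `asyncio.Lock` (named `_pick_lock` per the notes) makes select-and-claim atomic. The slot/context details below are assumptions, not the project's actual implementation.

```python
import asyncio

class BrowserPool:
    def __init__(self, size: int = 2):
        self._slots = [None] * size        # None means the slot is free
        self._pick_lock = asyncio.Lock()   # serializes selection + claiming

    async def acquire(self):
        async with self._pick_lock:        # select and claim atomically
            for i, ctx in enumerate(self._slots):
                if ctx is None:
                    # A yield point here is where the unlocked version raced.
                    await asyncio.sleep(0)
                    self._slots[i] = object()  # stand-in for a browser context
                    return i, self._slots[i]
        raise RuntimeError("pool exhausted")

    def release(self, i: int) -> None:
        self._slots[i] = None
```

Without the lock, two concurrent `acquire()` calls could both see slot 0 as free at the `await` and return the same index.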
v1.4.0
Performance Optimizations
- Persistent browser pool: Chromium stays alive across requests, eliminating ~350ms launch overhead per scrape
- Smart DOM wait: MutationObserver-based content stabilization replaces fixed 2s sleep — resolves in ~200ms on static pages
- Warm-on-startup: Browser pool pre-warms at server start
- Dual transport: stdio (default) and Streamable HTTP for shared service mode
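The smart DOM wait can be sketched as a small JS builder: a MutationObserver that resolves once the DOM has been quiet for a short window, with a hard timeout cap, replacing the fixed sleep. The helper name and parameter values are assumptions; the string would be handed to Playwright's `page.evaluate()`.

```python
def dom_stable_js(quiet_ms: int = 200, timeout_ms: int = 2000) -> str:
    """Build a JS expression that resolves when DOM mutations go quiet.

    quiet_ms:   how long the DOM must be mutation-free before resolving
    timeout_ms: hard cap so the wait never exceeds the old fixed sleep
    """
    return f"""
new Promise(resolve => {{
  const done = () => {{ obs.disconnect(); clearTimeout(cap); resolve(); }};
  let timer = setTimeout(done, {quiet_ms});     // resolves after a quiet window
  const cap = setTimeout(done, {timeout_ms});   // hard upper bound
  const obs = new MutationObserver(() => {{
    clearTimeout(timer);                        // activity: restart the window
    timer = setTimeout(done, {quiet_ms});
  }});
  obs.observe(document, {{ childList: true, subtree: true, attributes: true }});
}})"""
```

On static pages no mutations arrive, so the promise resolves after one quiet window (~200ms) instead of a fixed 2s sleep.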
Configuration
New environment variables:
- `BROWSER_POOL_ENABLED` (default: `true`): toggle persistent browser pool
- `BROWSER_POOL_SIZE` (default: `2`): number of Chromium instances
- `DEFAULT_GRACE_PERIOD_SECONDS` (default: `0.5`): grace period for JS rendering
- `MCP_TRANSPORT` (default: `stdio`): `stdio` or `streamable-http`
- `MCP_HTTP_PORT` (default: `8080`): HTTP server port
- `MCP_HTTP_HOST` (default: `0.0.0.0`): HTTP server bind address
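As an illustration, running the server in shared HTTP mode with a larger pool might look like the following (the image name and exposed port are assumptions, not taken from this release):

```shell
docker run -d \
  -e BROWSER_POOL_SIZE=4 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HTTP_HOST=0.0.0.0 \
  -e MCP_HTTP_PORT=8080 \
  -p 8080:8080 \
  justazul/web-scrapper-stdio   # image name is illustrative
```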
Breaking Changes
- Default `grace_period_seconds` changed from `2.0` to `0.5` (configurable via env)
- MCP server migrated from raw `mcp.Server` to the `FastMCP` wrapper
Benchmarks
| Metric | v1.3.0 | v1.4.0 | Improvement |
|---|---|---|---|
| Static page scrape | 3.2s | 1.3s | -59% |
| JS-heavy page (Wikipedia) | 4.6s | 2.4s | -47% |
| 3 consecutive scrapes | 13.8s | 4.3s | -69% |
v1.3.0
Summary:
This release delivers new scraping features, performance optimizations, improved test coverage, and major enhancements to CI/CD workflows and documentation. The codebase is now more robust, maintainable, and easier to extend, with a focus on reliability and developer experience.
Features
- Support for `custom_elements_to_remove` in API scrape arguments and extraction
- Commit: 851e38d
- Added `filter_none_values` utility with comprehensive tests
- Commit: 8c58e5a
- Added `click_selector` support to `extract_text_from_url`
- Commit: 558f513
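A plausible shape for the `filter_none_values` utility mentioned above (the actual signature may differ): drop `None`-valued keys from tool arguments so they don't override downstream defaults, while keeping falsy-but-set values like `0` or `""`.

```python
def filter_none_values(args: dict) -> dict:
    """Return a copy of args with all None-valued keys removed."""
    return {key: value for key, value in args.items() if value is not None}
```

Used before forwarding MCP tool arguments, this prevents explicit `None`s from shadowing configured defaults.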
Performance
- Reuse singleton browser instance for all scrapes, reducing resource usage
- Commit: b905f3d
- Avoid repeated BeautifulSoup parsing by reusing soup objects
- Commit: 096bb1b
Refactors
- Refactored `mcp_server` to use `filter_none_values` and support `click_selector`
- Commit: 045de45
- Merged dynamic article extraction tests into a single random-domain test
- Commit: db7d1e0
Documentation
- Updated README: removed roadmap/contact/robots.txt sections, improved badge clarity, and added usage examples
Chore
- Added MIT license file
- Commit: 9791b2b
- Ignored all files/directories under `cursor/`
- Commit: 8ed75e1
- Removed empty `__init__.py` files from `src` and `tests/helpers`
- Commit: 48e723a
- Ensured trailing newline in config files
- Commit: 3b5ddee
- Updated and cleaned up workflows (build, test, release, coverage)
- Standardized on master branch, removed legacy/duplicate workflow files
Tests
- Improved JS delay and user agent tests for robustness and coverage
- Added and expanded tests for new utilities and features
Fixes
- Update badge Gist URLs and workflow gistIDs for build, test, and coverage
- Commit: 09e6060
- Filter out None values from tool arguments to prevent type errors
- Commit: a4d6fed
- Ensure output_format string is converted to OutputFormat enum in get_prompt
- Commit: d25f2c0
- Refactor JS-delay test to use real demo page and improve reliability
- Commit: 0623313
- Restore per-scrape browser launch and cleanup for test reliability
- Commit: 2af0a3b
- Ensure browser is always closed using finally block in extract_text_from_url
- Commit: d779341
Full Changelog: 1.2.0...1.3.0
v1.2.0
Release Notes: 1.2.0
Summary:
This release introduces a new grace_period_seconds feature for improved JavaScript rendering support, significant refactoring for configuration and test logic, improved documentation, and enhanced test coverage. The codebase is now more robust, with clearer configuration and more reliable extraction logic.
Features
- Add `grace_period_seconds` parameter for JS rendering delay
- Allows fine-tuning of wait time for JavaScript-rendered content.
- Commit: eb3b841
Refactors
- Configuration and server logic improvements
- Removed unused content length and grace period options from config.
- Updated server and scraper logic to use new parameters and improve clarity.
- Commit: 61a9264
Documentation
- README and usage updates
- Updated documentation to reflect new parameters and configuration changes.
- Commit: c80fcd6
Tests
- Test updates and improvements
- Refactored tests for new result format and `grace_period_seconds` support.
- Improved test assertions and coverage for error handling and edge cases.
- Commit: 2c3de28
Affected Files
- Modified: `README.md`, `src/config.py`, `src/mcp_server.py`, `src/scraper/__init__.py`, `tests/test_mcp_server.py`, `tests/test_scraper.py`
- Full Changelog: 1.1.1...1.2.0
v1.1.1
Refactor
- refactor(config): add env-based config and type-safe parsing helpers
- refactor(mcp): adapt to new dict return type from extract_text_from_url
- refactor(scraper): return dict with metadata and error from extract_text_from_url
Docs
- docs(readme): rewrite and reorganize documentation for clarity, usage, and config
Test
- test(tests): add test_mcp_server.py for MCP server testing
CI
- ci(test): add separate test services for mcp and scraper in docker-compose
Fix
- fix(docker): format environment variable assignments in Dockerfile
- fix(mcp_server): remove hardcoded stdio to properly serve the mcp tool requests
v1.1.0
Release Notes: 1.1.0
Summary:
This release introduces a modular scraper architecture, significant refactoring for maintainability, improved documentation, and enhanced test coverage. Obsolete files and legacy code have been removed, and the codebase is now more consistent and easier to extend.
Features
- New modular scraper implementation and helpers
- Introduced a new scraper module with improved structure and extensibility.
- Added helpers for browser automation, content selection, error handling, HTML utilities, and rate limiting.
- Commit: 74c7852
Refactors
- Core and server logic improvements
- Centralized configuration, added a Logger, and improved debug/error handling.
- Reformatted and improved readability of server logic.
- Removed legacy and obsolete files (CLI, main, stdio_server, test runner, old scraper).
- Improved extraction logic and cleaned up dependencies.
- Commits: 3f6f876, fb95ddc, de5e0c7, c8167d9, 3752cf1, 01c145e
Style
- Code formatting and consistency
Documentation
- Documentation and configuration updates
Chore
- Cleanup and maintenance
- Removed obsolete files and updated project structure.
- Commit: c8167d9
Tests
- Test updates and improvements
Affected Files
- Added: `src/logger.py`, `src/mcp_server.py`, `src/scraper/__init__.py`, `src/scraper/helpers/*`, `tests/test_scraper.py`
- Modified: `.env.example`, `Dockerfile`, `README.md`, `docker-compose.yml`, `requirements.txt`, `src/__init__.py`, `src/config.py`, `tests/__init__.py`, `tests/test_helpers.py`
- Deleted: `CHANGELOG.md`, `src/main.py`, `src/scraper.py`, `src/stdio_server.py`, `tests/test_mcp.py`
- Full Changelog: 1.0.0...1.1.0
v1.0.0
1. Release Overview
- Version: 1.0.0
- Goal: First stable release with robust anti-bot scraping, Dockerized deployment, and automated testing.
2. Major Features
- Playwright-based Scraper:
- Async scraping with Playwright and BeautifulSoup.
- Domain-specific and generic content extraction.
- Anti-Bot Evasion:
- Integrated `playwright-stealth` for fingerprint evasion.
- Randomized user agent, viewport, and language per request.
- Navigator property spoofing.
- Rate limiting per domain.
- Robust Extraction Logic:
- Fallback to `<body>` text for edge cases.
- Handles redirects, 404s, and Cloudflare blocks gracefully.
- Dockerized Workflow:
- Dockerfile and docker-compose for reproducible builds and test runs.
- Automated Testing:
- Pytest suite with coverage for extraction, error handling, and edge cases.
- CI-ready test execution via Docker Compose.
- Documentation:
- Comprehensive `README.md` with setup, usage, and development workflow.
- Changelog and release notes.