Releases: JustAzul/web-scrapper-stdio

v1.5.0

12 Apr 19:06
0510577

Cloudflare Bypass

Patchright CDP Evasion

  • Replaced Playwright with Patchright — a drop-in fork with CDP-level anti-detection
  • Most Cloudflare-protected sites are now scraped without triggering any challenge
  • Removes dependency on playwright-stealth (CDP-level evasion is superior)

Turnstile Captcha Solving

  • When CAPTCHA_API_KEY is set, Cloudflare Turnstile challenges are solved automatically via third-party API
  • Supports 2Captcha, CapSolver, and CapMonster (all wire-compatible)
  • When no API key is set, CF-protected pages return a clear error (existing behavior preserved)

How It Works

  1. Patchright's CDP patching handles most sites passively (no challenge triggered)
  2. If a challenge IS triggered, live page detection catches it (title/URL/DOM markers)
  3. Turnstile script is intercepted via page.route() to capture sitekey and params
  4. Token is solved via external API and injected into the page
  5. Page redirects to real content, scraping continues normally
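The sitekey capture in step 3 can be pictured with a minimal sketch. This is illustrative only: the real implementation intercepts the Turnstile script via Patchright's page.route(), whereas this helper simply recovers a data-sitekey attribute from raw HTML.

```python
import re

# Hypothetical helper: shows the sitekey-recovery idea on raw markup,
# not the project's actual page.route() interception handler.
SITEKEY_RE = re.compile(r'data-sitekey=["\']([0-9A-Za-z_-]+)["\']')

def extract_sitekey(html: str):
    """Return the first Turnstile sitekey found in the markup, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None
```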

New Configuration

CAPTCHA_API_KEY=your_key      # Required for bypass, empty = disabled
CAPTCHA_PROVIDER=2captcha     # 2captcha | capsolver | capmonster
CAPTCHA_BASE_URL=             # Optional custom endpoint
CAPTCHA_TIMEOUT=120           # Seconds to wait for solve
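One way these variables might be consumed is sketched below. The variable names and defaults come from the listing above; the dataclass and loader are assumptions, not the project's actual config code.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CaptchaConfig:
    api_key: str   # empty string disables solving
    provider: str  # 2captcha | capsolver | capmonster
    base_url: str  # optional custom endpoint
    timeout: int   # seconds to wait for a solve

def load_captcha_config(env=os.environ) -> CaptchaConfig:
    """Read CAPTCHA_* variables, falling back to the documented defaults."""
    return CaptchaConfig(
        api_key=env.get("CAPTCHA_API_KEY", ""),
        provider=env.get("CAPTCHA_PROVIDER", "2captcha"),
        base_url=env.get("CAPTCHA_BASE_URL", ""),
        timeout=int(env.get("CAPTCHA_TIMEOUT", "120")),
    )
```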

Performance

No performance regression; Patchright is actually marginally faster than Playwright:

  • Cold start: 0.338s (was 0.350s)
  • Static scrape: 1.996s (was 1.999s)
  • JS-heavy: 2.280s (was 2.441s)

Testing

  • 46 tests (18 new CF-specific tests)
  • Validated against llm-stats.com, 2captcha Turnstile demo, discord.com, canva.com

v1.4.1

12 Apr 18:15
feb300d

Bug Fixes

  • Fix race condition in BrowserPool: Concurrent coroutines could pick the same slot due to unlocked slot selection. Added _pick_lock to serialize slot selection + context creation.
  • Fix stale default: extract_text_from_url() default grace period was still 2.0s instead of the configured 0.5s when called directly (not through MCP).
  • Deduplicate stealth logic: Pool now calls apply_stealth() instead of inlining the same 3 Playwright calls.
  • Remove redundant validation: ScrapeArgs double-validation removed from scrape_web tool (FastMCP handles this).
  • Clean up: Removed unused imports, updated stale comments.
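The BrowserPool race fix can be sketched as follows. This is a simplified model, not the project's actual pool; the class and slot representation are illustrative. The key point is that slot selection and context creation happen under one lock, so two coroutines can never claim the same free slot.

```python
import asyncio

class SlotPool:
    """Toy pool illustrating the _pick_lock fix: without the lock, two
    coroutines awaiting between 'find free slot' and 'mark busy' could
    claim the same slot."""

    def __init__(self, size: int):
        self._busy = [False] * size
        self._pick_lock = asyncio.Lock()

    async def acquire(self) -> int:
        async with self._pick_lock:  # serialize selection + creation
            for i, busy in enumerate(self._busy):
                if not busy:
                    self._busy[i] = True
                    await asyncio.sleep(0)  # context creation would await here
                    return i
        raise RuntimeError("pool exhausted")

    def release(self, slot: int) -> None:
        self._busy[slot] = False
```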

v1.4.0

12 Apr 18:01
cddb328

Performance Optimizations

  • Persistent browser pool: Chromium stays alive across requests, eliminating ~350ms launch overhead per scrape
  • Smart DOM wait: MutationObserver-based content stabilization replaces fixed 2s sleep — resolves in ~200ms on static pages
  • Warm-on-startup: Browser pool pre-warms at server start
  • Dual transport: stdio (default) and Streamable HTTP for shared service mode
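The smart DOM wait amounts to "resolve once the content stops changing, with a timeout as fallback". A plain-Python analogue of that stabilization idea is sketched below; the real implementation runs a MutationObserver inside the page rather than polling, so treat this only as a model of the behavior.

```python
import asyncio
import time

async def wait_until_stable(snapshot, quiet_period=0.2, timeout=2.0):
    """Resolve when snapshot() returns the same value for quiet_period
    seconds, or when timeout elapses (fallback, like the old fixed sleep)."""
    deadline = time.monotonic() + timeout
    last = snapshot()
    settled_at = time.monotonic()
    while time.monotonic() < deadline:
        await asyncio.sleep(0.02)
        current = snapshot()
        if current != last:
            last, settled_at = current, time.monotonic()
        elif time.monotonic() - settled_at >= quiet_period:
            return last  # content stopped changing: resolve early
    return last  # timeout reached: return whatever we have
```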

Configuration

New environment variables:

  • BROWSER_POOL_ENABLED (default: true) — toggle persistent browser pool
  • BROWSER_POOL_SIZE (default: 2) — number of Chromium instances
  • DEFAULT_GRACE_PERIOD_SECONDS (default: 0.5) — grace period for JS rendering
  • MCP_TRANSPORT (default: stdio) — stdio or streamable-http
  • MCP_HTTP_PORT (default: 8080) — HTTP server port
  • MCP_HTTP_HOST (default: 0.0.0.0) — HTTP server bind address

Breaking Changes

  • Default grace_period_seconds changed from 2.0 to 0.5 (configurable via env)
  • MCP server migrated from raw mcp.Server to FastMCP wrapper

Benchmarks

  Metric                      v1.3.0   v1.4.0   Improvement
  Static page scrape          3.2s     1.3s     -59%
  JS-heavy page (Wikipedia)   4.6s     2.4s     -47%
  3 consecutive scrapes       13.8s    4.3s     -69%

v1.3.0

14 Jun 22:10
70c668a

Summary:
This release delivers new scraping features, performance optimizations, improved test coverage, and major enhancements to CI/CD workflows and documentation. The codebase is now more robust, maintainable, and easier to extend, with a focus on reliability and developer experience.

Features

  • Support for custom_elements_to_remove in API scrape arguments and extraction
  • Added filter_none_values utility with comprehensive tests
  • Added click_selector support to extract_text_from_url
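As described, filter_none_values likely behaves along these lines (an assumed signature; the real utility may differ in detail):

```python
def filter_none_values(mapping: dict) -> dict:
    """Drop keys whose value is None, keeping falsy-but-set values
    such as 0, '' and False (only None is filtered out)."""
    return {k: v for k, v in mapping.items() if v is not None}
```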

Performance

  • Reuse singleton browser instance for all scrapes, reducing resource usage
  • Avoid repeated BeautifulSoup parsing by reusing soup objects

Refactors

  • Refactored mcp_server to use filter_none_values and support click_selector
  • Merged dynamic article extraction tests into a single random-domain test

Fixes

  • Update badge Gist URLs and workflow gistIDs for build, test, and coverage
  • Filter out None values from tool arguments to prevent type errors
  • Ensure output_format string is converted to OutputFormat enum in get_prompt
  • Refactor JS-delay test to use real demo page and improve reliability
  • Restore per-scrape browser launch and cleanup for test reliability
  • Ensure browser is always closed using finally block in extract_text_from_url

Full Changelog: 1.2.0...1.3.0

v1.2.0

12 Jun 16:11
2c3de28

Release Notes: 1.2.0

Summary:
This release introduces a new grace_period_seconds feature for improved JavaScript rendering support, significant refactoring for configuration and test logic, improved documentation, and enhanced test coverage. The codebase is now more robust, with clearer configuration and more reliable extraction logic.

Features

  • Add grace_period_seconds parameter for JS rendering delay
    • Allows fine-tuning of wait time for JavaScript-rendered content.
    • Commit: eb3b841

Refactors

  • Configuration and server logic improvements
    • Removed unused content length and grace period options from config.
    • Updated server and scraper logic to use new parameters and improve clarity.
    • Commit: 61a9264

Documentation

  • README and usage updates
    • Updated documentation to reflect new parameters and configuration changes.
    • Commit: c80fcd6

Tests

  • Test updates and improvements
    • Refactored tests for new result format and grace_period_seconds support.
    • Improved test assertions and coverage for error handling and edge cases.
    • Commit: 2c3de28

Affected Files

  • Modified: README.md, src/config.py, src/mcp_server.py, src/scraper/__init__.py, tests/test_mcp_server.py, tests/test_scraper.py

v1.1.1

12 Jun 15:15
3f6a953

Refactor

  • refactor(config): add env-based config and type-safe parsing helpers
  • refactor(mcp): adapt to new dict return type from extract_text_from_url
  • refactor(scraper): return dict with metadata and error from extract_text_from_url
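The exact shape of the new dict return type is not given here; a hypothetical envelope consistent with "metadata and error" might look like this (all key names are assumptions, not the project's actual schema):

```python
def make_result(text=None, url="", status=None, error=None):
    """Hypothetical result envelope: content plus metadata, with a
    machine-checkable error field instead of raising on soft failures."""
    return {
        "text": text,
        "metadata": {"url": url, "status": status},
        "error": error,  # None on success, message string on failure
    }
```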

Docs

  • docs(readme): rewrite and reorganize documentation for clarity, usage, and config

Test

  • test(tests): add test_mcp_server.py for MCP server testing

CI

  • ci(test): add separate test services for mcp and scraper in docker-compose

Fix

  • fix(docker): format environment variable assignments in Dockerfile
  • fix(mcp_server): remove hardcoded stdio to properly serve the mcp tool requests

v1.1.0

11 Jun 20:29
dd9d284

Release Notes: 1.1.0

Summary:
This release introduces a modular scraper architecture, significant refactoring for maintainability, improved documentation, and enhanced test coverage. Obsolete files and legacy code have been removed, and the codebase is now more consistent and easier to extend.

Features

  • New modular scraper implementation and helpers
    • Introduced a new scraper module with improved structure and extensibility.
    • Added helpers for browser automation, content selection, error handling, HTML utilities, and rate limiting.
    • Commit: 74c7852

Refactors

  • Core and server logic improvements
    • Centralized configuration, added a Logger, and improved debug/error handling.
    • Reformatted and improved readability of server logic.
    • Removed legacy and obsolete files (CLI, main, stdio_server, test runner, old scraper).
    • Improved extraction logic and cleaned up dependencies.
    • Commits: 3f6f876, fb95ddc, de5e0c7, c8167d9, 3752cf1, 01c145e

Style

  • Code formatting and consistency
    • Reformatted logger and test files for PEP8 compliance and readability.
    • Removed trailing whitespace from __init__.py files.
    • Commits: cd753ac, d272992, c053c4d

Documentation

  • Documentation and configuration updates
    • Updated README with new usage instructions, removed CLI references, and added Cursor IDE integration.
    • Updated environment, Docker, and documentation for new config structure.
    • Commits: dd9d284, 6fbba89

Chore

  • Cleanup and maintenance
    • Removed obsolete files and updated project structure.
    • Commit: c8167d9

Tests

  • Test updates and improvements
    • Updated and added tests for new config, timing constants, and scraper logic.
    • Reformatted test files for clarity.
    • Commits: cd753ac, 47c70d0

Affected Files

  • Added: src/logger.py, src/mcp_server.py, src/scraper/__init__.py, src/scraper/helpers/*, tests/test_scraper.py
  • Modified: .env.example, Dockerfile, README.md, docker-compose.yml, requirements.txt, src/__init__.py, src/config.py, tests/__init__.py, tests/test_helpers.py
  • Deleted: CHANGELOG.md, src/main.py, src/scraper.py, src/stdio_server.py, tests/test_mcp.py

v1.0.0

09 Jun 18:30
077e662

1. Release Overview

  • Version: 1.0.0
  • Goal: First stable release with robust anti-bot scraping, Dockerized deployment, and automated testing.

2. Major Features

  • Playwright-based Scraper:
    • Async scraping with Playwright and BeautifulSoup.
    • Domain-specific and generic content extraction.
  • Anti-Bot Evasion:
    • Integrated playwright-stealth for fingerprint evasion.
    • Randomized user agent, viewport, and language per request.
    • Navigator property spoofing.
    • Rate limiting per domain.
  • Robust Extraction Logic:
    • Fallback to <body> text for edge cases.
    • Handles redirects, 404s, and Cloudflare blocks gracefully.
  • Dockerized Workflow:
    • Dockerfile and docker-compose for reproducible builds and test runs.
  • Automated Testing:
    • Pytest suite with coverage for extraction, error handling, and edge cases.
    • CI-ready test execution via Docker Compose.
  • Documentation:
    • Comprehensive README.md with setup, usage, and development workflow.
    • Changelog and release notes.
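The per-request randomization described above can be sketched like this; the user-agent strings, viewport sizes, and locales are illustrative pools, not the project's actual lists.

```python
import random

USER_AGENTS = [  # illustrative pool, not the project's list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
VIEWPORTS = [(1366, 768), (1536, 864), (1920, 1080)]
LANGUAGES = ["en-US", "en-GB", "de-DE"]

def random_fingerprint() -> dict:
    """Pick a fresh UA/viewport/language combination for each request."""
    width, height = random.choice(VIEWPORTS)
    return {
        "user_agent": random.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
        "locale": random.choice(LANGUAGES),
    }
```

In Playwright, values like these would typically be passed to browser.new_context(user_agent=..., viewport=..., locale=...) so each scrape gets its own context with a distinct fingerprint.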