Releases: JustAzul/web-scrapper-stdio

v1.5.0

12 Apr 19:06
0510577

Cloudflare Bypass

Patchright CDP Evasion

  • Replaced Playwright with Patchright — a drop-in fork with CDP-level anti-detection
  • Most Cloudflare-protected sites are now scraped without triggering any challenge
  • Removes dependency on playwright-stealth (CDP-level evasion is superior)

Turnstile Captcha Solving

  • When CAPTCHA_API_KEY is set, Cloudflare Turnstile challenges are solved automatically via third-party API
  • Supports 2Captcha, CapSolver, and CapMonster (all wire-compatible)
  • When no API key is set, CF-protected pages return a clear error (existing behavior preserved)

How It Works

  1. Patchright's CDP patching handles most sites passively (no challenge triggered)
  2. If a challenge IS triggered, live page detection catches it (title/URL/DOM markers)
  3. Turnstile script is intercepted via page.route() to capture sitekey and params
  4. Token is solved via external API and injected into the page
  5. Page redirects to real content, scraping continues normally
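The sitekey capture in step 3 can be pictured with a minimal sketch. This is illustrative only: the real implementation intercepts the Turnstile script via Patchright's page.route(), whereas this helper simply recovers a data-sitekey attribute from raw HTML.

```python
import re

# Hypothetical helper: shows the sitekey-recovery idea on raw markup,
# not the project's actual page.route() interception handler.
SITEKEY_RE = re.compile(r'data-sitekey=["\']([0-9A-Za-z_-]+)["\']')

def extract_sitekey(html: str):
    """Return the first Turnstile sitekey found in the markup, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None
```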

New Configuration

CAPTCHA_API_KEY=your_key      # Required for bypass, empty = disabled
CAPTCHA_PROVIDER=2captcha     # 2captcha | capsolver | capmonster
CAPTCHA_BASE_URL=             # Optional custom endpoint
CAPTCHA_TIMEOUT=120           # Seconds to wait for solve
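One way these variables might be consumed is sketched below. The variable names and defaults come from the listing above; the dataclass and loader are assumptions, not the project's actual config code.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CaptchaConfig:
    api_key: str   # empty string disables solving
    provider: str  # 2captcha | capsolver | capmonster
    base_url: str  # optional custom endpoint
    timeout: int   # seconds to wait for a solve

def load_captcha_config(env=os.environ) -> CaptchaConfig:
    """Read CAPTCHA_* variables, falling back to the documented defaults."""
    return CaptchaConfig(
        api_key=env.get("CAPTCHA_API_KEY", ""),
        provider=env.get("CAPTCHA_PROVIDER", "2captcha"),
        base_url=env.get("CAPTCHA_BASE_URL", ""),
        timeout=int(env.get("CAPTCHA_TIMEOUT", "120")),
    )
```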

Performance

No performance regression; Patchright is actually marginally faster than Playwright:

  • Cold start: 0.338s (was 0.350s)
  • Static scrape: 1.996s (was 1.999s)
  • JS-heavy: 2.280s (was 2.441s)

Testing

  • 46 tests (18 new CF-specific tests)
  • Validated against llm-stats.com, 2captcha Turnstile demo, discord.com, canva.com

v1.4.1

12 Apr 18:15
feb300d

Bug Fixes

  • Fix race condition in BrowserPool: Concurrent coroutines could pick the same slot due to unlocked slot selection. Added _pick_lock to serialize slot selection + context creation.
  • Fix stale default: extract_text_from_url() default grace period was still 2.0s instead of the configured 0.5s when called directly (not through MCP).
  • Deduplicate stealth logic: Pool now calls apply_stealth() instead of inlining the same 3 Playwright calls.
  • Remove redundant validation: ScrapeArgs double-validation removed from scrape_web tool (FastMCP handles this).
  • Clean up: Removed unused imports, updated stale comments.
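The BrowserPool race fix can be sketched as follows. This is a simplified model, not the project's actual pool; the class and slot representation are illustrative. The key point is that slot selection and context creation happen under one lock, so two coroutines can never claim the same free slot.

```python
import asyncio

class SlotPool:
    """Toy pool illustrating the _pick_lock fix: without the lock, two
    coroutines awaiting between 'find free slot' and 'mark busy' could
    claim the same slot."""

    def __init__(self, size: int):
        self._busy = [False] * size
        self._pick_lock = asyncio.Lock()

    async def acquire(self) -> int:
        async with self._pick_lock:  # serialize selection + creation
            for i, busy in enumerate(self._busy):
                if not busy:
                    self._busy[i] = True
                    await asyncio.sleep(0)  # context creation would await here
                    return i
        raise RuntimeError("pool exhausted")

    def release(self, slot: int) -> None:
        self._busy[slot] = False
```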

v1.4.0

12 Apr 18:01
cddb328

Performance Optimizations

  • Persistent browser pool: Chromium stays alive across requests, eliminating ~350ms launch overhead per scrape
  • Smart DOM wait: MutationObserver-based content stabilization replaces fixed 2s sleep — resolves in ~200ms on static pages
  • Warm-on-startup: Browser pool pre-warms at server start
  • Dual transport: stdio (default) and Streamable HTTP for shared service mode
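The smart DOM wait amounts to "resolve once the content stops changing, with a timeout as fallback". A plain-Python analogue of that stabilization idea is sketched below; the real implementation runs a MutationObserver inside the page rather than polling, so treat this only as a model of the behavior.

```python
import asyncio
import time

async def wait_until_stable(snapshot, quiet_period=0.2, timeout=2.0):
    """Resolve when snapshot() returns the same value for quiet_period
    seconds, or when timeout elapses (fallback, like the old fixed sleep)."""
    deadline = time.monotonic() + timeout
    last = snapshot()
    settled_at = time.monotonic()
    while time.monotonic() < deadline:
        await asyncio.sleep(0.02)
        current = snapshot()
        if current != last:
            last, settled_at = current, time.monotonic()
        elif time.monotonic() - settled_at >= quiet_period:
            return last  # content stopped changing: resolve early
    return last  # timeout reached: return whatever we have
```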

Configuration

New environment variables:

  • BROWSER_POOL_ENABLED (default: true) — toggle persistent browser pool
  • BROWSER_POOL_SIZE (default: 2) — number of Chromium instances
  • DEFAULT_GRACE_PERIOD_SECONDS (default: 0.5) — grace period for JS rendering
  • MCP_TRANSPORT (default: stdio) — stdio or streamable-http
  • MCP_HTTP_PORT (default: 8080) — HTTP server port
  • MCP_HTTP_HOST (default: 0.0.0.0) — HTTP server bind address

Breaking Changes

  • Default grace_period_seconds changed from 2.0 to 0.5 (configurable via env)
  • MCP server migrated from raw mcp.Server to FastMCP wrapper

Benchmarks

  Metric                      v1.3.0   v1.4.0   Improvement
  Static page scrape          3.2s     1.3s     -59%
  JS-heavy page (Wikipedia)   4.6s     2.4s     -47%
  3 consecutive scrapes       13.8s    4.3s     -69%

v1.3.0

14 Jun 22:10
70c668a

Summary:
This release delivers new scraping features, performance optimizations, improved test coverage, and major enhancements to CI/CD workflows and documentation. The codebase is now more robust, maintainable, and easier to extend, with a focus on reliability and developer experience.

Features

  • Support for custom_elements_to_remove in API scrape arguments and extraction
  • Added filter_none_values utility with comprehensive tests
  • Added click_selector support to extract_text_from_url
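As described, filter_none_values likely behaves along these lines (an assumed signature; the real utility may differ in detail):

```python
def filter_none_values(mapping: dict) -> dict:
    """Drop keys whose value is None, keeping falsy-but-set values
    such as 0, '' and False (only None is filtered out)."""
    return {k: v for k, v in mapping.items() if v is not None}
```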

Performance

  • Reuse singleton browser instance for all scrapes, reducing resource usage
  • Avoid repeated BeautifulSoup parsing by reusing soup objects

Refactors

  • Refactored mcp_server to use filter_none_values and support click_selector
  • Merged dynamic article extraction tests into a single random-domain test

Fixes

  • Update badge Gist URLs and workflow gistIDs for build, test, and coverage
  • Filter out None values from tool arguments to prevent type errors
  • Ensure output_format string is converted to OutputFormat enum in get_prompt
  • Refactor JS-delay test to use real demo page and improve reliability
  • Restore per-scrape browser launch and cleanup for test reliability
  • Ensure browser is always closed using finally block in extract_text_from_url

Full Changelog: 1.2.0...1.3.0

v1.2.0

12 Jun 16:11
2c3de28

Release Notes: 1.2.0

Summary:
This release introduces a new grace_period_seconds feature for improved JavaScript rendering support, significant refactoring for configuration and test logic, improved documentation, and enhanced test coverage. The codebase is now more robust, with clearer configuration and more reliable extraction logic.

Features

  • Add grace_period_seconds parameter for JS rendering delay
    • Allows fine-tuning of wait time for JavaScript-rendered content.
    • Commit: eb3b841

Refactors

  • Configuration and server logic improvements
    • Removed unused content length and grace period options from config.
    • Updated server and scraper logic to use new parameters and improve clarity.
    • Commit: 61a9264

Documentation

  • README and usage updates
    • Updated documentation to reflect new parameters and configuration changes.
    • Commit: c80fcd6

Tests

  • Test updates and improvements
    • Refactored tests for new result format and grace_period_seconds support.
    • Improved test assertions and coverage for error handling and edge cases.
    • Commit: 2c3de28

Affected Files

  • Modified: README.md, src/config.py, src/mcp_server.py, src/scraper/__init__.py, tests/test_mcp_server.py, tests/test_scraper.py

v1.1.1

12 Jun 15:15
3f6a953

Refactor

  • refactor(config): add env-based config and type-safe parsing helpers
  • refactor(mcp): adapt to new dict return type from extract_text_from_url
  • refactor(scraper): return dict with metadata and error from extract_text_from_url
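The exact shape of the new dict return type is not given here; a hypothetical envelope consistent with "metadata and error" might look like this (all key names are assumptions, not the project's actual schema):

```python
def make_result(text=None, url="", status=None, error=None):
    """Hypothetical result envelope: content plus metadata, with a
    machine-checkable error field instead of raising on soft failures."""
    return {
        "text": text,
        "metadata": {"url": url, "status": status},
        "error": error,  # None on success, message string on failure
    }
```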

Docs

  • docs(readme): rewrite and reorganize documentation for clarity, usage, and config

Test

  • test(tests): add test_mcp_server.py for MCP server testing

CI

  • ci(test): add separate test services for mcp and scraper in docker-compose

Fix

  • fix(docker): format environment variable assignments in Dockerfile
  • fix(mcp_server): remove hardcoded stdio to properly serve the mcp tool requests

v1.1.0

11 Jun 20:29
dd9d284

Release Notes: 1.1.0

Summary:
This release introduces a modular scraper architecture, significant refactoring for maintainability, improved documentation, and enhanced test coverage. Obsolete files and legacy code have been removed, and the codebase is now more consistent and easier to extend.

Features

  • New modular scraper implementation and helpers
    • Introduced a new scraper module with improved structure and extensibility.
    • Added helpers for browser automation, content selection, error handling, HTML utilities, and rate limiting.
    • Commit: 74c7852

Refactors

  • Core and server logic improvements
    • Centralized configuration, added a Logger, and improved debug/error handling.
    • Reformatted and improved readability of server logic.
    • Removed legacy and obsolete files (CLI, main, stdio_server, test runner, old scraper).
    • Improved extraction logic and cleaned up dependencies.
    • Commits: 3f6f876, fb95ddc, de5e0c7, c8167d9, 3752cf1, 01c145e

Style

  • Code formatting and consistency
    • Reformatted logger and test files for PEP8 compliance and readability.
    • Removed trailing whitespace from __init__.py files.
    • Commits: cd753ac, d272992, c053c4d

Documentation

  • Documentation and configuration updates
    • Updated README with new usage instructions, removed CLI references, and added Cursor IDE integration.
    • Updated environment, Docker, and documentation for new config structure.
    • Commits: dd9d284, 6fbba89

Chore

  • Cleanup and maintenance
    • Removed obsolete files and updated project structure.
    • Commit: c8167d9

Tests

  • Test updates and improvements
    • Updated and added tests for new config, timing constants, and scraper logic.
    • Reformatted test files for clarity.
    • Commits: cd753ac, 47c70d0

Affected Files

  • Added: src/logger.py, src/mcp_server.py, src/scraper/__init__.py, src/scraper/helpers/*, tests/test_scraper.py
  • Modified: .env.example, Dockerfile, README.md, docker-compose.yml, requirements.txt, src/__init__.py, src/config.py, tests/__init__.py, tests/test_helpers.py
  • Deleted: CHANGELOG.md, src/main.py, src/scraper.py, src/stdio_server.py, tests/test_mcp.py

v1.0.0

09 Jun 18:30
077e662

1. Release Overview

  • Version: 1.0.0
  • Goal: First stable release with robust anti-bot scraping, Dockerized deployment, and automated testing.

2. Major Features

  • Playwright-based Scraper:
    • Async scraping with Playwright and BeautifulSoup.
    • Domain-specific and generic content extraction.
  • Anti-Bot Evasion:
    • Integrated playwright-stealth for fingerprint evasion.
    • Randomized user agent, viewport, and language per request.
    • Navigator property spoofing.
    • Rate limiting per domain.
  • Robust Extraction Logic:
    • Fallback to <body> text for edge cases.
    • Handles redirects, 404s, and Cloudflare blocks gracefully.
  • Dockerized Workflow:
    • Dockerfile and docker-compose for reproducible builds and test runs.
  • Automated Testing:
    • Pytest suite with coverage for extraction, error handling, and edge cases.
    • CI-ready test execution via Docker Compose.
  • Documentation:
    • Comprehensive README.md with setup, usage, and development workflow.
    • Changelog and release notes.
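The per-request randomization described above can be sketched like this; the user-agent strings, viewport sizes, and locales are illustrative pools, not the project's actual lists.

```python
import random

USER_AGENTS = [  # illustrative pool, not the project's list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
VIEWPORTS = [(1366, 768), (1536, 864), (1920, 1080)]
LANGUAGES = ["en-US", "en-GB", "de-DE"]

def random_fingerprint() -> dict:
    """Pick a fresh UA/viewport/language combination for each request."""
    width, height = random.choice(VIEWPORTS)
    return {
        "user_agent": random.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
        "locale": random.choice(LANGUAGES),
    }
```

In Playwright, values like these would typically be passed to browser.new_context(user_agent=..., viewport=..., locale=...) so each scrape gets its own context with a distinct fingerprint.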