
Conversation

@wakaka6
Contributor

@wakaka6 wakaka6 commented Mar 21, 2025

Summary

Support proxy when getting the SSL certificate

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    CrawlResult,
)
from crawl4ai.configs import ProxyConfig


async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            magic=True,
            fetch_ssl_certificate=True,
            proxy_config=ProxyConfig(server="socks5://127.0.0.1:1088"),
            markdown_generator=DefaultMarkdownGenerator(
                # content_filter=PruningContentFilter(
                #     threshold=0.48, threshold_type="fixed", min_word_threshold=0
                # )
            ),
        )
        result: CrawlResult = await crawler.arun(
            url="https://www.google.com", config=crawler_config
        )
        print("ssl:", result.ssl_certificate)
        print("markdown:", result.markdown[:500])


if __name__ == "__main__":
    asyncio.run(main())


List of files changed and why

ssl_certificate.py

  • Support proxy when getting the SSL certificate
  • Support exporting the certificate to Playwright format with ssl_certificate.to_playwright_format() (see the sketch below)
  • Support str(ssl_certificate)
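
A minimal sketch of the two helpers, building on the result object from the example above (only the names listed in this PR are assumed):

cert = result.ssl_certificate
if cert:
    # Human-readable summary via the new str() support
    print(str(cert))
    # Dict in the format Playwright expects, via the new helper
    print(cert.to_playwright_format())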

proxy_config.py

  • Support converting URLs with embedded credentials into a ProxyConfig. The username and password embedded in the URL override self.username and self.password (see the sketch below).
  • e.g.
    ProxyConfig(server="http://user:pass@proxy-server:1080", username="", password="")
    --(normalize)--> ProxyConfig(server="http://proxy-server:1080", username="user", password="pass")
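
A short sketch of that normalization, using only the constructor and fields shown above:

from crawl4ai.configs import ProxyConfig

proxy = ProxyConfig(server="http://user:pass@proxy-server:1080", username="", password="")
# Credentials embedded in the URL win over the explicit arguments:
print(proxy.server)    # http://proxy-server:1080
print(proxy.username)  # user
print(proxy.password)  # pass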
    

async_crawler_strategy.py

  • The crawler now sets up the proxy according to this configuration.

How Has This Been Tested?

  • In a network-restricted environment, HTTP, HTTPS, and SOCKS5 proxies were used against sites blocked by a firewall (such as the GFW); the SSL certificate was retrieved successfully with all of them (e.g. Google cannot be reached directly from China, so an external proxy is required).
  • In an environment without network restrictions, the certificate can also be retrieved without a proxy.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

unclecode added 14 commits March 7, 2025 20:55
…strategy

Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries

BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.

Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path

BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
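
For existing code this is a one-line import change, per this commit message:

from crawl4ai.proxy_strategy import ProxyConfig  # was: from crawl4ai.configs import ProxyConfig
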
Adds a new 'reverse' parameter to URLPatternFilter that allows inverting the filter's logic. When reverse=True, URLs that would normally match are rejected and vice versa.

Also removes unused 'scraped_html' from WebScrapingStrategy output to reduce memory usage.

BREAKING CHANGE: WebScrapingStrategy no longer returns 'scraped_html' in its output dictionary
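
A sketch of the inverted filter described above, assuming URLPatternFilter is importable from crawl4ai.deep_crawling.filters and takes a list of glob patterns:

from crawl4ai.deep_crawling.filters import URLPatternFilter

# Default behaviour: keep only URLs that match the patterns
include_blog = URLPatternFilter(patterns=["*/blog/*"])

# reverse=True (added in this change) inverts the logic: matching URLs are rejected
exclude_blog = URLPatternFilter(patterns=["*/blog/*"], reverse=True)
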
Add comprehensive table detection and extraction functionality to the web scraping system:
- Implement intelligent table detection algorithm with scoring system
- Add table extraction with support for headers, rows, captions
- Update models to include tables in Media class
- Add table_score_threshold configuration option
- Add documentation and examples for table extraction
- Include crypto analysis example demonstrating table usage

This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.
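
A sketch of how the new option and the extracted tables might be consumed, assuming tables surface as dictionaries under result.media["tables"] (the exact shape is not spelled out here):

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_tables():
    config = CrawlerRunConfig(table_score_threshold=7)  # stricter table detection
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/markets", config=config)
        for table in result.media.get("tables", []):
            print(table.get("caption"), table.get("headers"))
            print(len(table.get("rows", [])), "rows")


asyncio.run(extract_tables())
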
…traction

Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media.

Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py

BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
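
A sketch of the new parameter, using only the name and behaviour described above (selectors are illustrative):

from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # Markdown generation and extraction focus on these elements only,
    # while links and media are still collected from the whole page;
    # css_selector, by contrast, restricts the entire crawl to its match.
    target_elements=["article.main-content", "div.product-details"],
)
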
…nagement

Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management

This addition helps operators track crawler performance and manage memory usage more effectively.
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
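
A minimal sketch of the new utility, assuming it is importable from crawl4ai.utils and takes raw HTML:

from crawl4ai.utils import preprocess_html_for_schema

raw_html = "<html><body><div class='item' data-id='1'>Widget</div></body></html>"
cleaned = preprocess_html_for_schema(raw_html)  # cleaned HTML ready for schema generation
print(cleaned)
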
…rison

Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
…er handling

Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler

BREAKING CHANGE: Default cookie injection has been removed from page initialization
…atures, changes, fixes, and breaking changes
Extend LLMConfig class to support more fine-grained control over LLM behavior by adding:
- temperature control
- max tokens limit
- top_p sampling
- frequency and presence penalties
- stop sequences
- number of completions

These parameters allow for better customization of LLM responses.
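
A sketch of the extended config; the exact keyword names are assumptions based on the usual OpenAI-style names for these parameters:

from crawl4ai import LLMConfig

llm = LLMConfig(
    provider="openai/gpt-4o",
    temperature=0.2,         # temperature control
    max_tokens=1024,         # max tokens limit
    top_p=0.9,               # top_p sampling
    frequency_penalty=0.0,   # frequency penalty
    presence_penalty=0.0,    # presence penalty
    stop=["\n\n"],           # stop sequences
    n=1,                     # number of completions
)
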
Implements a persistent browser management system that allows running a single shared browser instance
that can be reused across multiple crawler sessions. Key changes include:

- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation

BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
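
A sketch of selecting the shared builtin browser, using only the option and mode names listed above:

from crawl4ai import AsyncWebCrawler, BrowserConfig

# Modes named in this change: 'builtin' (shared persistent browser), 'dedicated', 'custom'
browser_config = BrowserConfig(browser_mode="builtin", headless=True)
crawler = AsyncWebCrawler(config=browser_config)
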
@wakaka6 wakaka6 force-pushed the feat/support_proxy_for_ssl_certificate branch 2 times, most recently from c269249 to c841a6b Compare March 21, 2025 07:52
@wakaka6 wakaka6 force-pushed the feat/support_proxy_for_ssl_certificate branch from c841a6b to ddaa072 Compare March 21, 2025 07:56
…oxy handling

- Implement Strategy Pattern with ConnectionStrategy interface
- Create concrete strategies: Direct, HTTP, and SOCKS connections
- Add ConnectionStrategyFactory for strategy instantiation
- Extract certificate processing into a separate method
- Improve error handling with specific exception types and better logging
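
A schematic sketch of that refactor; only the strategy/factory names come from the commit, the method names and bodies are illustrative:

import socket
from abc import ABC, abstractmethod


class ConnectionStrategy(ABC):
    """Open a socket to the target host, either directly or through a proxy."""

    @abstractmethod
    def connect(self, host: str, port: int) -> socket.socket: ...


class DirectConnection(ConnectionStrategy):
    def connect(self, host: str, port: int) -> socket.socket:
        return socket.create_connection((host, port), timeout=10)


# The HTTP and SOCKS strategies would tunnel through the configured proxy, and
# ConnectionStrategyFactory would choose a strategy based on the proxy URL scheme.
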
@wakaka6 wakaka6 force-pushed the feat/support_proxy_for_ssl_certificate branch from aadce19 to 5a84854 Compare March 21, 2025 08:46
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
Remove PagePoolConfig in favor of direct page management in browser strategies.
Add get_pages() method for efficient parallel page creation.
Improve storage state handling and persistence.
Add comprehensive parallel crawling tests and performance analysis.

BREAKING CHANGE: Removed PagePoolConfig class and related functionality.
…ory support and improved storage state handling
Enhance storage state persistence mechanism in CDP browser strategy by:
- Explicitly saving storage state for each browser context
- Using proper file path for storage state
- Removing unnecessary sleep delay

Also includes test improvements:
- Simplified test configurations in playwright tests
- Temporarily disabled some CDP tests
@unclecode
Owner

@wakaka6 Thanks for submitting this PR! I've reviewed it and I'm impressed with the quality of your work. The implementation looks complete, well-tested, focused on the necessary changes without affecting unrelated code, follows our coding patterns, and addresses a real user need for accessing SSL certificates through proxies in restricted environments.

I've attached a comprehensive test script to verify all aspects of your implementation. Could you please run this script in your environment and share the results? The script tests:

  1. Basic certificate fetching without proxies
  2. Proxy configuration parsing (especially embedded credentials extraction)
  3. Certificate fetching with various proxy types
  4. Format conversion methods (to_playwright_format, str)
  5. Edge cases and error handling

When running the test, please pay special attention to:

  • Whether the proxy credentials are properly extracted from URLs
  • If both the direct SSLCertificate.from_url method and AsyncWebCrawler correctly use the proxy
  • The handling of edge cases (invalid proxies, unavailable sites, etc.)

You'll need to configure the PROXIES section in the script with your actual proxy servers for a complete test. If some tests fail, please update your PR to address the issues before we merge.

Looking forward to your test results!
ssl-proxy-test.py.md

unclecode and others added 5 commits March 24, 2025 21:36
Implements a new browser strategy that runs Chrome in Docker containers,
providing better isolation and cross-platform consistency. Features include:
- Connect and launch modes for different container configurations
- Persistent storage support for maintaining browser state
- Container registry for efficient reuse
- Comprehensive test suite for Docker browser functionality

This addition allows users to run browser automation workloads in isolated
containers, improving security and resource management.
Add DefaultMarkdownGenerator integration and automatic content filtering for markdown output formats. When using 'markdown-fit' or 'md-fit' output formats, automatically apply PruningContentFilter with default settings if no filter config is provided.

This change improves the user experience by providing sensible defaults for markdown generation while maintaining the ability to customize filtering behavior.
refactor(cli): remove unused import from FastAPI
unclecode and others added 7 commits March 25, 2025 14:51
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)

Breaking changes: None
@wakaka6
Contributor Author

wakaka6 commented Mar 26, 2025

(Quoted @unclecode's review comment above.)

I added additional edge-case handling. PTAL again :)

see https://discord.com/channels/1278297938551902308/1349221886143369257/1353992983292416010

The new usage:

from crawl4ai.ssl_certificate import SSLCertificate
from crawl4ai.configs import ProxyConfig

certification, err = SSLCertificate.from_url(
    url="https://www.baidu.com",
    proxy_config=ProxyConfig("https://127.0.0.1:8080"),
    verify_ssl=False,
)
if err:
    print("Runtime err:", err)

Updated test script:
ssl-proxy-test.py.md

@wakaka6
Contributor Author

wakaka6 commented Apr 9, 2025

Based on the commits in the next branch, this PR is being closed. See PR #961.

@wakaka6 wakaka6 closed this Apr 9, 2025
