feat(mcp): Migrate from SSE to HTTP transport and add configurable port #1519
Conversation
Replace deprecated SSE transport with FastMCP 2.12.3 HTTP (Streamable) for improved compatibility and maintainability.

## Key Changes

### MCP Bridge (deploy/docker/mcp_bridge.py)
- Implement FastMCP HTTP transport with proper lifespan integration
- Preserve Pydantic model schemas by unwrapping models to field-level parameters
- Add proper Pydantic model serialization in HTTP proxy
- Fix Depends() parameter filtering to exclude auth/DI params from tool schemas
- Send Pydantic data as root JSON body (not wrapped) for FastAPI compatibility

### Server Configuration (deploy/docker/server.py)
- Combine MCP and crawler lifespans for proper initialization order
- Mount MCP HTTP endpoint at /mcp
- Update port configuration for FastAPI server (11235)
- Register tools dynamically from @mcp_tool decorated routes

### Config Handling (crawl4ai/async_configs.py)
- Fix null handling in BrowserConfig and CrawlerRunConfig dump() methods
- Ensure None values don't cause serialization errors in MCP tool schemas

### Infrastructure
- Add fastmcp==2.12.3 to requirements
- Update docker-compose.yml port mapping to match new configuration

## Testing Results

All 7/7 MCP tools verified working:
- ✅ md - Markdown extraction with filtering
- ✅ html - Preprocessed HTML for schema extraction
- ✅ ask - Crawl4AI library context search
- ✅ screenshot - Full-page PNG capture (writes to container filesystem)
- ✅ pdf - PDF document generation (writes to container filesystem)
- ✅ execute_js - JavaScript execution on pages
- ✅ crawl - Multi-URL crawling with configs

## Known Limitations

**Docker Deployment - File Output:**
- Screenshot and PDF tools write files inside the container filesystem
- Files are not accessible to remote MCP clients without additional setup
- Workarounds: copy files from the container, or mount volumes (requires same-host)
- Recommended future enhancement: MinIO integration for distributed file storage

**Local Deployment:**
- All tools fully functional with direct filesystem access
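The Depends() filtering described above can be illustrated with a minimal, stdlib-only sketch. The `Depends` class and `crawl` handler here are hypothetical stand-ins for the FastAPI originals, not the actual bridge code:

```python
import inspect

class Depends:
    """Hypothetical stand-in for FastAPI's dependency-injection marker."""
    def __init__(self, dependency=None):
        self.dependency = dependency

def get_token():
    return "secret"

async def crawl(url: str, limit: int = 10, token: str = Depends(get_token)):
    """Example handler; only url and limit belong in the MCP tool schema."""
    return url, limit

def tool_params(fn):
    # Expose only parameters a remote MCP client should supply:
    # anything defaulted to a Depends(...) marker is server-side DI
    # and must be excluded from the generated tool schema.
    return [
        name
        for name, p in inspect.signature(fn).parameters.items()
        if not isinstance(p.default, Depends)
    ]

print(tool_params(crawl))  # expected: ['url', 'limit']
```

The real bridge additionally skips `Request` and other framework-specific parameters, but the default-value check above is the core idea.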
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Eliminate sse-starlette==2.2.1 from the Docker requirements to streamline dependencies following the migration to FastMCP HTTP transport.
Address all high-priority issues identified in the comprehensive code review:

## Critical Fixes

### Internal Proxy Connection (High Priority)
- Fix 0.0.0.0 base URL issue by mapping to 127.0.0.1 for internal HTTP proxy calls
- Prevents MCP tool failures when app.host is "0.0.0.0" (bind-only address)

### MCP Tool Schema Generation (High Priority)
- Fix Pydantic default value serialization causing syntax errors in wrapper functions
- Simplify all optional parameters to default to None, avoiding complex enum/object serialization
- Convert enum type annotations (FilterType) to base types (str) for better MCP client compatibility
- Resolves Gemini client "missing types in parameter schema" errors

### Response Processing (High Priority)
- Fix GET response double-encoding by attempting JSON parsing first, with fallback to text
- Ensures proper structured data return instead of JSON-encoded strings

## Code Quality Improvements

### Lifespan Management (Medium Priority)
- Fix lifespan closure to use the proper FastAPI app parameter instead of a global reference
- Improves code clarity and prevents potential issues

### Documentation (Medium Priority)
- Document JWT authentication limitation - MCP tools don't forward Authorization headers
- Add clear comments about the intentional MCP app mounting convention
- Provides guidance for JWT + MCP integration scenarios

## Verification

All 7/7 MCP tools now working correctly:
- ✅ md, html, ask, screenshot, pdf, execute_js, crawl
- ✅ No syntax errors in dynamic wrapper generation
- ✅ Proper type annotations for strict MCP clients (Gemini)
- ✅ All optional parameters properly handled with None defaults

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
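The 0.0.0.0 → 127.0.0.1 mapping for internal proxy calls is simple to express. A minimal sketch, with an illustrative helper name rather than the actual code:

```python
def internal_base_url(host: str, port: int) -> str:
    # "0.0.0.0" (and IPv6 "::") are bind-only addresses; an internal
    # HTTP proxy call must target a routable loopback address instead.
    if host in ("0.0.0.0", "::"):
        host = "127.0.0.1"
    return f"http://{host}:{port}"

print(internal_base_url("0.0.0.0", 11235))  # expected: http://127.0.0.1:11235
```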
- Update wrapper functions to preprocess arguments, filtering out empty strings and None values so FastAPI can apply its own defaults.
- Ensure proper handling of model data for both tool and query-parameter wrappers, improving compatibility with MCP client expectations.

This change addresses issues related to parameter serialization and makes the MCP tool integration more robust.
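A minimal sketch of that preprocessing step (the helper name is hypothetical):

```python
def preprocess_args(kwargs):
    # Drop None and empty-string values so the FastAPI endpoint
    # falls back to its own declared defaults instead of receiving
    # a client's "empty" placeholder values.
    return {k: v for k, v in kwargs.items() if v is not None and v != ""}

cleaned = preprocess_args({"url": "https://example.com", "f": "", "q": None})
print(cleaned)  # expected: {'url': 'https://example.com'}
```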
… binary output

- Add comprehensive parameter preprocessing in MCP bridge for Pydantic models
- Implement automatic type coercion for numeric fields (int/float) from strings
- Add list field normalization (single string to list conversion)
- Enhance binary output handling for screenshots and PDFs with configurable modes
- Update tool documentation with clearer parameter descriptions
- Add default-to-file mode configuration for binary exports
- Improve error handling and fallback mechanisms
- Remove deprecated version field from docker-compose.yml

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
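The type coercion and list normalization could look roughly like this. This is a sketch under the assumption that the bridge knows each field's declared type; the real implementation inspects Pydantic field metadata:

```python
def coerce_value(value, target_type):
    # Permissive MCP clients often send numbers as strings; coerce them
    # to the declared field type before forwarding to FastAPI.
    if target_type is int and isinstance(value, str):
        return int(value)
    if target_type is float and isinstance(value, str):
        return float(value)
    # A bare string destined for a list field becomes a one-element list.
    if target_type is list and isinstance(value, str):
        return [value]
    return value

print(coerce_value("42", int))            # expected: 42
print(coerce_value("0.5", float))         # expected: 0.5
print(coerce_value("https://a.com", list))  # expected: ['https://a.com']
```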
Make the Docker host port configurable via HOST_PORT environment variable
while keeping the container's internal port fixed at 11235.
Changes:
- Add .env.example with comprehensive documentation of all config options
- HOST_PORT for customizing host port mapping (default: 11235)
- INSTALL_TYPE options (default/all/torch/transformer) with feature descriptions
- ENABLE_GPU flag with platform compatibility notes
- Update docker-compose.yml to use ${HOST_PORT:-11235}:11235 port mapping
- Simplify supervisord.conf (remove dynamic port config, fixed at 11235)
- Update README with two-file environment pattern (.env vs .llm.env)
- Add docker run example with custom port mapping
Architecture:
- Container always runs on port 11235 internally (simple, consistent)
- Host port is configurable without rebuilding the image
- Defaults maintain backward compatibility (11235:11235)
Usage:
# Default port mapping
docker compose up -d
# Custom host port
echo "HOST_PORT=8080" > .env
docker compose up -d
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
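Compose's `${HOST_PORT:-11235}` interpolation falls back to the default when the variable is unset or empty. The same semantics in a Python sketch (the helper name is illustrative, not part of the project):

```python
import os

def host_port_mapping(default="11235"):
    # Mirrors compose's ${HOST_PORT:-11235}: unset OR empty -> default,
    # while the container side stays fixed at 11235.
    value = os.environ.get("HOST_PORT") or default
    return f"{value}:11235"

os.environ.pop("HOST_PORT", None)
print(host_port_mapping())  # expected: 11235:11235
os.environ["HOST_PORT"] = "8080"
print(host_port_mapping())  # expected: 8080:11235
```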
Walkthrough
Replaces SSE/WebSocket MCP with an HTTP FastMCP mounted at /mcp and route-driven tool registration; adds binary export configuration and file-write support; tightens None handling in async configs; enriches Pydantic schemas and result normalization; makes Docker host port configurable; updates docs, tests, and dependencies; and introduces structured error handling.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Client as MCP Client
participant FastMCP as FastMCP HTTP (/mcp)
participant Bridge as mcp_bridge
participant App as FastAPI App
participant Route as Route Handler
rect rgb(245,250,255)
note right of Client: Discovery
Client->>FastMCP: ListTools / ListResources
FastMCP-->>Client: Tools / Resources
end
rect rgb(250,255,245)
note right of Client: Tool invocation (CallTool)
Client->>FastMCP: CallTool(name, params)
FastMCP->>Bridge: map -> proxy wrapper
Bridge->>App: HTTP request (path, method, body)
App->>Route: invoke handler (Pydantic models / params)
alt success
Route-->>App: JSON / text / file-path
App-->>FastMCP: ToolResult (content blocks)
FastMCP-->>Client: Result
else error
Route-->>App: raise HTTPException
App-->>FastMCP: Error payload
FastMCP-->>Client: Error result
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
deploy/docker/config.yml (1)
24-31: Duplicate Redis TLS keys in YAML. Duplicate entries for ssl_cert_reqs, ssl_ca_certs, ssl_certfile, ssl_keyfile will shadow earlier values and confuse ops. Keep one set.
Apply this diff to remove the duplicates:
  ssl_cert_reqs: None
  ssl_ca_certs: None
  ssl_certfile: None
  ssl_keyfile: None
-
- ssl_cert_reqs: None
- ssl_ca_certs: None
- ssl_certfile: None
- ssl_keyfile: None

deploy/docker/README.md (1)
713-721: Update config.yml example port to 11235 to match the rest of the docs. The example still shows 8020, which contradicts the standardized 11235 elsewhere in the README and compose.
Apply this diff:
  host: "0.0.0.0"
- port: 8020   # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
+ port: 11235  # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).

deploy/docker/server.py (3)
311-333: Arbitrary file write via user-supplied output_path; add an allowlist and log file-write failures. Endpoints accept an absolute path and write to it without restriction. This enables arbitrary file writes inside the container. Also, except Exception as e: pass drops errors silently.
Apply this diff to restrict writes to a configured base directory and log failures:
- if body.output_path:
-     abs_path = os.path.abspath(body.output_path)
-     os.makedirs(os.path.dirname(abs_path), exist_ok=True)
-     with open(abs_path, "wb") as f:
-         f.write(base64.b64decode(screenshot_data))
-     return {"success": True, "path": abs_path}
+ if body.output_path:
+     abs_path = os.path.abspath(body.output_path)
+     base_dir = os.path.abspath(config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports"))
+     # Enforce that writes stay within the configured base directory
+     try:
+         common = os.path.commonpath([base_dir, abs_path])
+     except Exception:
+         common = ""
+     if common != base_dir:
+         raise HTTPException(400, f"output_path must be within {base_dir}")
+     os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+     with open(abs_path, "wb") as f:
+         f.write(base64.b64decode(screenshot_data))
+     return {"success": True, "path": abs_path}
@@
- except Exception as e:
-     # Fall through to inline on any file write error
-     pass
+ except OSError as e:
+     # Fall through to inline on any file write error
+     import logging; logging.getLogger(__name__).warning("screenshot file write failed: %s", e)

Optionally add a dedicated config key like binary.allowed_base_dir to avoid overloading default_dir. Based on learnings.
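The containment check proposed above can be distilled into a small, testable helper (name hypothetical; assumes POSIX paths):

```python
import os

def ensure_within(base_dir, candidate):
    # Resolve both paths, then require that candidate stays under base_dir.
    # os.path.commonpath defeats both "../" escapes (normalized away by
    # abspath) and prefix tricks like /tmp/x vs /tmp/xy.
    base = os.path.abspath(base_dir)
    path = os.path.abspath(candidate)
    if os.path.commonpath([base, path]) != base:
        raise ValueError(f"output_path must be within {base}")
    return path

print(ensure_within("/tmp/crawl4ai-exports", "/tmp/crawl4ai-exports/shot.png"))
```

Note that a plain `path.startswith(base)` check would wrongly accept `/tmp/crawl4ai-exports-evil`, which is why `commonpath` is the safer primitive here.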
357-379: Apply the same write restriction and logging to the PDF endpoint. Mirror the file-write safety for PDFs and avoid silent failures.
Apply this diff:
- if body.output_path:
-     abs_path = os.path.abspath(body.output_path)
-     os.makedirs(os.path.dirname(abs_path), exist_ok=True)
-     with open(abs_path, "wb") as f:
-         f.write(pdf_data)
-     return {"success": True, "path": abs_path}
+ if body.output_path:
+     abs_path = os.path.abspath(body.output_path)
+     base_dir = os.path.abspath(config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports"))
+     try:
+         common = os.path.commonpath([base_dir, abs_path])
+     except Exception:
+         common = ""
+     if common != base_dir:
+         raise HTTPException(400, f"output_path must be within {base_dir}")
+     os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+     with open(abs_path, "wb") as f:
+         f.write(pdf_data)
+     return {"success": True, "path": abs_path}
@@
- except Exception:
-     # Fall through to inline on any file write error
-     pass
+ except OSError as e:
+     # Fall through to inline on any file write error
+     import logging; logging.getLogger(__name__).warning("pdf file write failed: %s", e)
471-474: Avoid an infinite self-redirect on /metrics when Prometheus is disabled. This route redirects to itself and will loop if instrumentation is disabled. Either remove the route or return a 404/plain message when disabled.
Apply this diff:
-@app.get(config["observability"]["prometheus"]["endpoint"])
-async def metrics():
-    return RedirectResponse(config["observability"]["prometheus"]["endpoint"])
+@app.get(config["observability"]["prometheus"]["endpoint"])
+async def metrics():
+    if config["observability"]["prometheus"]["enabled"]:
+        # Instrumentator already exposes metrics at this path; let it handle the request.
+        return PlainTextResponse("OK", status_code=200)
+    return PlainTextResponse("Prometheus metrics disabled", status_code=404)
🧹 Nitpick comments (16)
deploy/docker/config.yml (1)
93-96: Binary export defaults: ensure the directory exists at runtime. If default_mode is "file", create default_dir on startup to avoid 5xx on first write; consider a writable Docker volume by default.
deploy/docker/mcp_bridge.py (6)
35-36: Method selection: avoid arbitrary list indexing. Set iteration is nondeterministic. Prefer next(iter(...)) as Ruff suggests.
Apply this diff:
- method = list(route.methods - {"HEAD", "OPTIONS"})[0]
+ method = next(iter(route.methods - {"HEAD", "OPTIONS"}))
67-83: HTTP client: add an explicit timeout and preserve exception chaining. Explicit timeouts improve resiliency; raising from the original error preserves context.
Apply this diff:
- async with httpx.AsyncClient() as client:
+ async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
@@
- raise HTTPException(e.response.status_code, e.response.text)
+ raise HTTPException(e.response.status_code, e.response.text) from e
93-151: Deduplicate logic: delegate to register_tools_from_routes(). create_mcp_server reimplements wrapper generation and skips Optional[...] params; delegate to the canonical path to avoid drift and bugs.
Apply this diff:
 def create_mcp_server(
 @@
 ) -> FastMCP:
-    """Create FastMCP server instance with registered tools."""
-    server_name = name or "FastAPI-MCP"
-
-    # Create FastMCP instance
-    mcp = FastMCP(name=server_name)
-
-    # Register tools by scanning decorated FastAPI routes
-    for route in routes:
-        fn = getattr(route, "endpoint", None)
-        kind = getattr(fn, "__mcp_kind__", None)
-
-        if kind != "tool":
-            continue
-
-        tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
-        proxy_fn = _make_http_proxy(base_url, route)
-        description = inspect.getdoc(fn) or f"Tool for {route.path}"
-
-        # Get function signature to create properly typed wrapper
-        sig = inspect.signature(fn)
-        params = []
-        for param_name, param in sig.parameters.items():
-            # Skip Request, Depends, and other FastAPI-specific params
-            if param.annotation in (inspect.Parameter.empty, type(None)):
-                continue
-            if hasattr(param.annotation, "__origin__"):  # Skip complex types like Depends
-                continue
-            params.append(param)
-
-        # Create function dynamically with explicit parameters
-        param_names = [p.name for p in params]
-        param_sig = ", ".join(param_names)
-
-        # Build the async function code
-        func_code = f"""
-async def {tool_name}_wrapper({param_sig}):
-    kwargs = {{{", ".join(f'"{name}": {name}' for name in param_names)}}}
-    result = await proxy_fn(**kwargs)
-    return json.dumps(result, default=str)
-"""
-
-        # Execute in a namespace that has access to proxy_fn and json
-        namespace = {"proxy_fn": proxy_fn, "json": json}
-        exec(func_code, namespace)
-        wrapper_fn = namespace[f"{tool_name}_wrapper"]
-
-        # Register the properly typed function
-        mcp.tool(name=tool_name, description=description)(wrapper_fn)
-
-    print(f"Registered {len([r for r in routes if getattr(getattr(r, 'endpoint', None), '__mcp_kind__', None) == 'tool'])} MCP tools")
-
-    return mcp
+    """Create FastMCP server instance with registered tools."""
+    server_name = name or "FastAPI-MCP"
+    mcp = FastMCP(name=server_name)
+    register_tools_from_routes(mcp, routes, base_url)
+    return mcp
176-180: Apply the same name sanitization here. Ensure wrapper generation never produces invalid identifiers.
Apply this diff:
- tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+ raw_name = fn.__mcp_name__ or route.path
+ tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+ if not re.match(r"[A-Za-z_]", tool_name):
+     tool_name = f"tool_{tool_name}"
152-161: Handle optional lifespans. Guard against None to avoid a TypeError in environments without an app lifespan.
Apply this diff:
- async def combined(app):
-     # Run both lifespans - MCP outer, existing inner
-     async with mcp_lifespan(app):
-         async with existing_lifespan(app):
-             yield
+ async def combined(app):
+     # Run both lifespans - MCP outer, existing inner
+     if mcp_lifespan is None and existing_lifespan is None:
+         yield
+         return
+     if mcp_lifespan is None:
+         async with existing_lifespan(app):
+             yield
+         return
+     if existing_lifespan is None:
+         async with mcp_lifespan(app):
+             yield
+         return
+     async with mcp_lifespan(app):
+         async with existing_lifespan(app):
+             yield
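An alternative that handles any number of optional lifespans without the branching above is `contextlib.AsyncExitStack`. This is a sketch, not the project's code; names are hypothetical:

```python
import asyncio
import contextlib

def combine_lifespans(*lifespans):
    # Nest any number of optional (possibly None) lifespan context
    # managers; earlier entries wrap later ones, mirroring the
    # "MCP outer, existing inner" ordering.
    @contextlib.asynccontextmanager
    async def combined(app):
        async with contextlib.AsyncExitStack() as stack:
            for ls in lifespans:
                if ls is not None:
                    await stack.enter_async_context(ls(app))
            yield
    return combined

events = []

@contextlib.asynccontextmanager
async def fake_lifespan(app):
    events.append("start")
    yield
    events.append("stop")

async def main():
    async with combine_lifespans(fake_lifespan, None)(object()):
        events.append("running")

asyncio.run(main())
print(events)  # expected: ['start', 'running', 'stop']
```

The exit stack unwinds in reverse order on both success and error, which is exactly the teardown ordering the nested `async with` version provides.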
132-148: Replace exec-based wrapper generation with a generic async wrapper
- Swap both exec calls (deploy/docker/mcp_bridge.py:142, 376) for a single generic async wrapper that builds its signature via inspect.Signature and sets __signature__, __annotations__, and __doc__, eliminating exec and the S102 violations.
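The inspect.Signature approach suggested above can be sketched as follows. This is an illustration of the technique, not the actual bridge code; `make_wrapper` and `fake_proxy` are hypothetical names:

```python
import asyncio
import inspect
import json

def make_wrapper(tool_name, proxy_fn, param_names):
    # Generic async wrapper: no exec() needed. The visible signature is
    # attached via __signature__ so introspection-based schema generators
    # see the declared parameters rather than **kwargs.
    async def wrapper(**kwargs):
        allowed = {k: v for k, v in kwargs.items() if k in param_names}
        result = await proxy_fn(**allowed)
        return json.dumps(result, default=str)

    wrapper.__name__ = f"{tool_name}_wrapper"
    wrapper.__signature__ = inspect.Signature(
        [inspect.Parameter(n, inspect.Parameter.KEYWORD_ONLY) for n in param_names]
    )
    return wrapper

async def fake_proxy(**kwargs):
    # Stand-in for the HTTP proxy call
    return kwargs

w = make_wrapper("md", fake_proxy, ["url"])
print(asyncio.run(w(url="https://example.com", extra="dropped")))
# expected: {"url": "https://example.com"}
```

Unlike the exec()-built wrapper, this version cannot produce syntax errors from odd tool names or defaults, and it is trivially debuggable.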
.env.example (1)
45-45: Add a trailing newline. Fixes the dotenv-linter EndingBlankLine warning.
Apply this diff:
-ENABLE_GPU=false
+ENABLE_GPU=false
+
11-11: Make MCP_URL configurable via environment to honor custom HOST_PORT. Hardcoding 11235 will break when users set a custom HOST_PORT. Read from MCP_URL/HOST_PORT with a sensible default.
Apply this diff:
+import os
@@
-MCP_URL = "http://127.0.0.1:11235/mcp"
+MCP_URL = os.getenv("MCP_URL", f"http://127.0.0.1:{os.getenv('HOST_PORT', '11235')}/mcp")

deploy/docker/README.md (1)
226-231: Clarify the URL when using a custom HOST_PORT with compose. Tighten the language so users don't miss the port change when HOST_PORT is set.
Apply this diff:
-> The server will be available at `http://localhost:11235` (or your custom `HOST_PORT`).
+> The server will be available at `http://localhost:<HOST_PORT>` (default `11235`).

deploy/docker/server.py (2)
318-323: S108: consider a safer default than /tmp and/or document the mount path. Using a world-writable default like /tmp is acceptable in containers but still flagged. Consider defaulting to a project-specific dir with restricted permissions, or document mounting a host volume and tighten directory perms on create (e.g., mode=0o750).
If you want, I can add mkdirs with explicit permissions and an env-configurable base dir (binary.allowed_base_dir) and open a follow-up PR.
Also applies to: 366-371
664-664: Nit: f-string without placeholders. Drop the f prefix.
Apply this diff:
-print(f"Mounted MCP app at / (MCP endpoint at /mcp)")
+print("Mounted MCP app at / (MCP endpoint at /mcp)")

tests/mcp/test_mcp_socket.py (4)
14-15: Honor custom HOST_PORT and allow override via MCP_URL. Align with docker-compose/README by reading env and defaulting to 11235.
Apply this diff:
-from fastmcp.client import Client
+from fastmcp.client import Client
+import os
@@
-MCP_URL = "http://localhost:11235/mcp"
+MCP_URL = os.getenv("MCP_URL", f"http://localhost:{os.getenv('HOST_PORT', '11235')}/mcp")
17-24: Guard _first_text_block with a clearer error type. Raising RuntimeError is fine; alternatively, ValueError keeps it in the "bad input" family. Optional.
No change required; just a note.
55-68: Robust handling of path vs base64 for binary tools. Nice ergonomics. Consider truncation-length constants to avoid magic numbers, but optional.
Also applies to: 70-82
84-97: Potential flakiness: execute_js against HN. Interacting with the HN UI can be brittle. Consider a more static target or add retries/timeouts, but fine for a smoke test.
If needed, I can switch to a static sandbox page in a follow-up commit.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
- .env.example (1 hunks)
- crawl4ai/async_configs.py (2 hunks)
- deploy/docker/README.md (4 hunks)
- deploy/docker/config.yml (2 hunks)
- deploy/docker/mcp_bridge.py (2 hunks)
- deploy/docker/requirements.txt (1 hunks)
- deploy/docker/schemas.py (2 hunks)
- deploy/docker/server.py (13 hunks)
- docker-compose.yml (2 hunks)
- tests/mcp/test_mcp_http_listing.py (1 hunks)
- tests/mcp/test_mcp_socket.py (1 hunks)
- tests/mcp/test_mcp_sse.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/mcp/test_mcp_sse.py
🧰 Additional context used
🧬 Code graph analysis (5)
deploy/docker/schemas.py (1)
deploy/docker/utils.py (1)
FilterType(16-20)
tests/mcp/test_mcp_socket.py (1)
tests/mcp/test_mcp_http_listing.py (1)
main(14-22)
crawl4ai/async_configs.py (1)
tests/docker/test_serialization.py (1)
from_serializable_dict(67-106)
tests/mcp/test_mcp_http_listing.py (1)
tests/mcp/test_mcp_socket.py (1)
main(123-134)
deploy/docker/server.py (6)
- deploy/docker/api.py (5): handle_crawl_request(418-510), handle_llm_qa(61-108), handle_markdown_request(175-240), handle_stream_crawl_request(512-561), stream_results(385-416)
- deploy/docker/auth.py (3): TokenRequest(54-55), create_access_token(23-29), get_token_dependency(45-51)
- deploy/docker/crawler_pool.py (3): close_all(47-50), get_crawler(22-46), janitor(52-60)
- deploy/docker/mcp_bridge.py (3): combine_lifespans(152-160), mcp_tool(27-31), register_tools_from_routes(162-392)
- deploy/docker/schemas.py (5): CrawlRequest(7-10), MarkdownRequest(12-27), PDFRequest(41-43), RawCode(30-31), ScreenshotRequest(36-39)
- deploy/docker/utils.py (4): FilterType(16-20), load_config(22-40), setup_logging(42-47), verify_email_domain(120-127)
🪛 Ruff (0.13.1)
deploy/docker/schemas.py
23-23: String contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF001)
deploy/docker/mcp_bridge.py
35-35: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice
Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))
(RUF015)
82-82: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
120-120: Loop control variable param_name not used within loop body
Rename unused param_name to _param_name
(B007)
142-142: Use of exec detected
(S102)
228-228: Do not catch blind exception: Exception
(BLE001)
236-236: Do not catch blind exception: Exception
(BLE001)
334-334: Do not catch blind exception: Exception
(BLE001)
376-376: Use of exec detected
(S102)
386-386: Do not catch blind exception: Exception
(BLE001)
tests/mcp/test_mcp_socket.py
23-23: Avoid specifying long messages outside the exception class
(TRY003)
deploy/docker/server.py
321-321: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
331-333: try-except-pass detected, consider logging the exception
(S110)
331-331: Do not catch blind exception: Exception
(BLE001)
331-331: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
366-366: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
376-378: try-except-pass detected, consider logging the exception
(S110)
376-376: Do not catch blind exception: Exception
(BLE001)
654-654: Possible binding to all interfaces
(S104)
664-664: f-string without any placeholders
Remove extraneous f prefix
(F541)
🪛 dotenv-linter (3.3.0)
.env.example
[warning] 45-45: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
🔇 Additional comments (13)
crawl4ai/async_configs.py (2)
617-623: Nice guard for missing browser config payloads. Returning a default BrowserConfig when the serialized blob is None avoids the old AttributeError path and keeps callers from having to special-case empty configs. Thanks for tightening this up.
1553-1559: Matching improvement for crawler config defaults. Mirroring the same None handling here means both loaders now behave consistently and stop tripping on absent payloads, exactly what we need for the new HTTP transport defaults.
deploy/docker/config.yml (1)
6-6: Port change LGTM. Consistent with docker-compose HOST_PORT and tests referencing 11235.
deploy/docker/schemas.py (2)
14-27: Field metadata improvements LGTM. Descriptions and explicit Field usage improve schema clarity and MCP tool docs.
38-43: Binary outputs: defaults make sense. The 2s wait and optional output_path align with file-based exports in this PR.
.env.example (1)
1-45: .env example LGTM. Clear docs for HOST_PORT/INSTALL_TYPE/ENABLE_GPU; matches docker-compose usage.
deploy/docker/requirements.txt (2)
7-7: Adjust JWT dependency and pin mcp
- Remove PyJWT==2.10.1 from deploy/docker/requirements.txt; code imports JWT and related utilities from python-jwt (jwt>=1.3.1).
- Pin mcp to a known-good version compatible with fastmcp==2.12.3 for reproducible builds.

Likely an incorrect or invalid review comment.

15-15: fastmcp pin approved
fastmcp==2.12.3 is the latest 2.x release and compatible with mcp>=1.6.0; continue monitoring compatibility on future upgrades.

docker-compose.yml (2)
4-4: LGTM: dynamic host port mapping via HOST_PORT. Keeps the container port stable while allowing flexible host binding. The healthcheck still targets the internal port, which is correct.
48-48: Compose anchor inheritance looks correct. The service correctly inherits the updated base config.
tests/mcp/test_mcp_socket.py (2)
26-37: Test payloads and printing look correct for crawl. Good coverage and concise output.
123-139: End-to-end flow over HTTP transport looks good. Lists tools, then exercises all tools. Matches the migration away from SSE.
deploy/docker/server.py (1)
651-659: MCP base_url host mapping looks good. Mapping 0.0.0.0 → 127.0.0.1 avoids invalid proxy targets. Tool registration after route creation is appropriate. No additional SSE or WebSocket endpoints require adjustment.
…d timeout

- Removed explicit environment variable definitions for API keys in docker-compose.yml to enhance security and streamline configuration.
- Increased timeout for LLM operations in mcp_bridge.py to 60 seconds to accommodate longer processing times.

This change improves the security of sensitive information and ensures better handling of LLM requests.
Actionable comments posted: 0
🧹 Nitpick comments (4)
deploy/docker/mcp_bridge.py (4)
83-83: Add exception chaining for better error context. Missing exception chaining makes debugging harder when HTTP errors occur.
Apply this diff:
- raise HTTPException(e.response.status_code, e.response.text)
+ raise HTTPException(e.response.status_code, e.response.text) from e
143-143: Consider security implications of exec() usage. Dynamic code execution via exec() poses security risks and makes debugging difficult. While the generated code appears safe in this context, consider using alternative approaches.
For improved security and maintainability, consider using factory functions or decorators instead of exec():
def create_wrapper(tool_name: str, proxy_fn, param_names: list):
    async def wrapper(**kwargs):
        filtered_kwargs = {k: v for k, v in kwargs.items() if k in param_names}
        result = await proxy_fn(**filtered_kwargs)
        return json.dumps(result, default=str)
    wrapper.__name__ = f"{tool_name}_wrapper"
    return wrapper

Also applies to: 377-377
229-229: Consider more specific exception handling. Multiple broad exception catches could mask important errors during type introspection and wrapper generation.
Replace broad exception handling with specific exceptions where possible:
- except Exception:
+ except (AttributeError, TypeError):
      origin = None

Similar changes should be applied to other broad exception catches for better error diagnostics.
Also applies to: 237-237, 335-335, 387-387
94-152: Use streamable-http transport with FastMCP v2.12.3
FastMCP 2.12.3 defaults to the new streamable-http transport (legacy http+sse is only a fallback). Ensure your _make_http_proxy calls and base_url target a plain HTTP (streamable) endpoint for tool registration.
📒 Files selected for processing (2)
- deploy/docker/mcp_bridge.py (2 hunks)
- docker-compose.yml (2 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py
35-35: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice
Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))
(RUF015)
83-83: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
121-121: Loop control variable param_name not used within loop body
Rename unused param_name to _param_name
(B007)
143-143: Use of exec detected
(S102)
229-229: Do not catch blind exception: Exception
(BLE001)
237-237: Do not catch blind exception: Exception
(BLE001)
335-335: Do not catch blind exception: Exception
(BLE001)
377-377: Use of exec detected
(S102)
387-387: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (7)
docker-compose.yml (1)
4-4: LGTM! Well-designed configurable port mapping. The dynamic host port mapping using ${HOST_PORT:-11235} allows for flexible deployment while maintaining a consistent container port. This is a best practice for Docker configurations.
deploy/docker/mcp_bridge.py (6)
49-56: Path parameter replacement fails for typed params. FastAPI paths like {file_path:path} won't match "{k}". Use a regex to replace typed variants.
Apply this diff:
- for k, v in list(kwargs.items()):
+ for k, v in list(kwargs.items()):
      placeholder = "{" + k + "}"
-     if placeholder in path:
-         # Path parameter - ensure safe URL encoding
-         encoded_value = urllib.parse.quote(str(v), safe="")
-         path = path.replace(placeholder, encoded_value)
-         kwargs.pop(k)
+     # Path parameter - support typed params e.g. {k:path}
+     encoded_value = urllib.parse.quote(str(v), safe="")
+     pattern = re.compile(r"\{" + re.escape(k) + r"(?:\:[^}]*)?\}")
+     if pattern.search(path):
+         path = pattern.sub(encoded_value, path)
+         kwargs.pop(k)
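The regex substitution suggested above can be exercised standalone. A self-contained sketch (the `fill_path` name is hypothetical):

```python
import re
import urllib.parse

def fill_path(path, kwargs):
    # Substitute both plain {name} and typed {name:path} placeholders,
    # URL-encoding each value; consumed keys are removed from kwargs so
    # the remainder can be sent as query or body parameters.
    for k in list(kwargs):
        pattern = re.compile(r"\{" + re.escape(k) + r"(?::[^}]*)?\}")
        if pattern.search(path):
            encoded = urllib.parse.quote(str(kwargs.pop(k)), safe="")
            path = pattern.sub(encoded, path)
    return path

params = {"file_path": "a/b.txt", "q": "1"}
print(fill_path("/files/{file_path:path}", params))  # expected: /files/a%2Fb.txt
print(params)  # expected: {'q': '1'}
```

A plain `path.replace("{" + k + "}", ...)` would leave the typed `{file_path:path}` placeholder untouched, which is exactly the bug the review flags.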
76-81: JSON parsing improvement looks good. The unified JSON parsing approach for both GET and other methods helps avoid double-encoding issues, with an appropriate fallback to text.
114-114: Sanitize tool names to valid identifiers. Route paths may contain characters invalid in function names (hyphens, dots), causing exec() syntax errors.
Apply this diff for both occurrences:
- tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+ raw_name = fn.__mcp_name__ or route.path
+ tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+ if not re.match(r"[A-Za-z_]", tool_name):
+     tool_name = f"tool_{tool_name}"

Also applies to: 178-178
153-162: Excellent lifespan composition pattern. The nested async context manager pattern correctly handles both MCP and existing lifespans, ensuring proper initialization order.
223-267: Robust Pydantic field handling with type coercion. The implementation correctly handles Pydantic model fields, enum base types, and optional fields with proper defaults. The type coercion for numeric fields addresses common MCP client compatibility issues.
276-302: Smart MCP client quirk handling. The preprocessing logic that handles empty strings, null values, list normalization, and numeric coercion addresses real-world MCP client compatibility issues effectively.
- Fix custom host port example in README (8080:11235 vs 11235:11235)
- Update MCP documentation to HTTP-only transport (remove SSE/WS refs)
- Fix Claude CLI command to use --transport http
- Implement regex-based typed path parameter handling in mcp_bridge
- Improve tool name sanitization for exec safety
- Replace non-breaking hyphen with ASCII hyphen in schemas
- Add comprehensive docstrings and comments

Addresses CodeRabbit review feedback for better code quality and documentation accuracy.
Actionable comments posted: 3
🧹 Nitpick comments (6)
deploy/docker/mcp_bridge.py (5)
66-66: Select HTTP method deterministically; avoid list()[0]. Use an iterator and guard against empty method sets. Also improves readability.
```diff
-    method = list(route.methods - {"HEAD", "OPTIONS"})[0]
+    methods = route.methods - {"HEAD", "OPTIONS"}
+    method = next(iter(methods)) if methods else "GET"
```
114-115: Preserve exception cause. Chain the original `httpx` error for better debuggability.

```diff
-    except httpx.HTTPStatusError as e:
-        raise HTTPException(e.response.status_code, e.response.text)
+    except httpx.HTTPStatusError as e:
+        raise HTTPException(e.response.status_code, e.response.text) from e
```
185-185: Prefer logging over print for production diagnostics. Route/tool counts should go to structured logs, not stdout.
Minimal change (outside selected lines, for illustration):
```python
# at top-level
import logging
logger = logging.getLogger(__name__)

# replace prints
logger.info("Registered %d MCP tools", ...)
```

Also applies to: 442-442
4-11: Remove unused imports. `BaseModel` and `Callable` are unused.

```diff
 from __future__ import annotations
 import inspect, json, re
@@
-from typing import Callable, List, get_origin, get_args
+from typing import List, get_origin, get_args
@@
-from pydantic import BaseModel
+# from pydantic import BaseModel  # Unused
```
100-101: Reuse httpx client to reduce connection churn. Consider injecting a shared `AsyncClient` or a factory for long-lived use.

Based on learnings
deploy/docker/README.md (1)
322-325: Align README transport wording

- The `--transport http` flag is already correct.
- In deploy/docker/README.md (line 346), change "MCP WebSocket connection" to "MCP HTTP connection".
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deploy/docker/README.md (4 hunks)
- deploy/docker/mcp_bridge.py (1 hunks)
- deploy/docker/schemas.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/docker/schemas.py
🧰 Additional context used
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py
66-66: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice
Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))
(RUF015)
81-81: Local variable placeholder is assigned to but never used
Remove assignment to unused variable placeholder
(F841)
115-115: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
179-179: Use of exec detected
(S102)
278-278: Do not catch blind exception: Exception
(BLE001)
286-286: Do not catch blind exception: Exception
(BLE001)
384-384: Do not catch blind exception: Exception
(BLE001)
426-426: Use of exec detected
(S102)
436-436: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (5)
deploy/docker/mcp_bridge.py (2)
147-151: Good: tool-name sanitization prevents exec-time syntax errors. Solid guard against invalid identifiers and code injection risks.
226-229: Sanitize tool names here too (exec safety + schema stability). Without sanitization, names containing hyphens/dots/slashes can cause SyntaxError in the generated wrapper. Mirror the logic used in `create_mcp_server`.

```diff
-    tool_count += 1
-    tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+    tool_count += 1
+    raw_name = fn.__mcp_name__ or route.path
+    tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+    if not re.match(r"[A-Za-z_]", tool_name):
+        tool_name = f"tool_{tool_name}"
```

This addresses the earlier review asking to sanitize tool names; it wasn't applied in this code path.
deploy/docker/README.md (3)
118-128: LGTM: Correct custom host port example. Using `-p 8080:11235` and updating the access URL removes confusion.
226-231: LGTM: Compose HOST_PORT example and access note. Clear guidance on overriding the host port.
308-317: LGTM: HTTP-only MCP endpoint and JWT limitation noted. The limitation is documented clearly to prevent auth surprises.
- Use deterministic HTTP method selection with next(iter())
- Fix safe parameter popping to prevent KeyError
- Improve typed path parameter handling with slash preservation
- Fix parameter filtering to include Optional/List types
- Add proper exception chaining for better debugging
- Replace print statements with structured logging
- Remove unused variable assignment
- Remove unnecessary f-prefix from string literals

Addresses latest CodeRabbit review feedback for production readiness.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/mcp_bridge.py (1 hunks)
- deploy/docker/server.py (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/server.py (6)
- deploy/docker/api.py (3): handle_crawl_request (418-510), handle_stream_crawl_request (512-561), stream_results (385-416)
- deploy/docker/auth.py (3): TokenRequest (54-55), create_access_token (23-29), get_token_dependency (45-51)
- deploy/docker/crawler_pool.py (3): close_all (47-50), get_crawler (22-46), janitor (52-60)
- deploy/docker/mcp_bridge.py (2): combine_lifespans (197-213), register_tools_from_routes (215-455)
- deploy/docker/schemas.py (6): CrawlRequest (7-10), HTMLRequest (33-34), JSEndpointRequest (46-51), MarkdownRequest (12-27), RawCode (30-31), ScreenshotRequest (36-39)
- deploy/docker/utils.py (4): FilterType (16-20), load_config (22-40), setup_logging (42-47), verify_email_domain (120-127)
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py
185-185: Use of exec detected
(S102)
289-289: Do not catch blind exception: Exception
(BLE001)
297-297: Do not catch blind exception: Exception
(BLE001)
395-395: Do not catch blind exception: Exception
(BLE001)
437-437: Use of exec detected
(S102)
447-447: Do not catch blind exception: Exception
(BLE001)
449-449: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
deploy/docker/server.py
321-321: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
331-333: try-except-pass detected, consider logging the exception
(S110)
331-331: Do not catch blind exception: Exception
(BLE001)
331-331: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
366-366: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
376-378: try-except-pass detected, consider logging the exception
(S110)
376-376: Do not catch blind exception: Exception
(BLE001)
654-654: Possible binding to all interfaces
(S104)
- Replace timestamp-based filenames with UUID-based unique identifiers
- Fixes race condition where concurrent requests to same host would overwrite files
- Add uuid import for collision-resistant filename generation

Addresses CodeRabbit review feedback for file export safety.
- Add dual parameter support for /md endpoint (body params override query params)
- Fix JavaScript execution to return actual values in js_execution_result.return_values
- Resolve async generator iteration errors causing batch crawl failures
- Fix JSON import scope issues preventing proper error responses
- Enable /crawl/stream endpoint functionality in playground UI
- Standardize parameter naming from f/q/c to filter/query/cache across all integration points
- Add comprehensive backward compatibility test suite with 4 test scenarios
- Enhance error handling with user-friendly network/browser error messages
- Add path traversal protection for file export operations
- Restore /mcp/schema endpoint for MCP tool introspection and documentation

BREAKING: None - maintains full backward compatibility while enabling clean parameter migration

Fixes multiple runtime crashes and API inconsistencies while preserving existing functionality. Major improvements to developer experience and system reliability.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/md_v2/core/docker-deployment.md (1)
286-301: Replace SSE/WS with HTTP FastMCP schema endpoint
Remove the SSE (/mcp/sse) and WebSocket (/mcp/ws) entries and instead document only the HTTP schema endpoint:
- Schema: `GET http://localhost:11235/mcp/schema`

Update the example to:
```bash
# Add the Crawl4AI server as an HTTP FastMCP provider
claude mcp add --transport http c4ai-http http://localhost:11235/mcp/schema

# List all MCP providers to verify it was added
claude mcp list
```
🧹 Nitpick comments (7)
tests/mcp/test_mcp_js_exec.py (4)
117-121: Add timeout and configurable base URL to the error-handling test. Mirror the same robustness here for consistency with the first test.
```diff
-    response = requests.post("http://localhost:11235/execute_js", json={
+    response = requests.post(f"{BASE_URL}/execute_js", json={
         "url": "https://httpbin.org/html",
         "scripts": scripts
-    })
+    }, timeout=30)
```
132-151: Strengthen error-object validation. Relying on `'error' in str(dict).lower()` is brittle. Check presence of an error key explicitly.
```diff
-    # Second should be an error object
-    if not isinstance(return_values[1], dict) or 'error' not in str(return_values[1]).lower():
-        print(f"❌ Second value should be error object, got {return_values[1]}")
-        return False
+    # Second should be an error object
+    assert isinstance(return_values[1], dict) and any(
+        k in return_values[1] for k in ("error", "message", "stack")
+    ), f"Second value should be error-like object, got {return_values[1]!r}"
```
61-61: Use f-string conversion flags instead of repr(). Cleaner and satisfies linters.
```diff
-    print(f"Script {i+1} returned: {repr(value)}")
+    print(f"Script {i+1} returned: {value!r}")
@@
-    print(f"❌ Expected 'test string', got {repr(test_str)}")
+    print(f"❌ Expected 'test string', got {test_str!r}")
@@
-    print(f"❌ Expected 42, got {repr(test_num)}")
+    print(f"❌ Expected 42, got {test_num!r}")
```

Also applies to: 92-92, 100-100
1-5: Mark as integration test or move out of pytest discovery. This depends on an external server and network. Mark with a pytest marker and skip by default, or move under an integration directory excluded from CI.
Would you like a follow-up patch to add a pytest marker and env-gated skip (e.g., RUN_HTTP_TESTS) so CI stays green when the server isn’t up?
deploy/docker/api.py (3)
425-431: Use `Optional[str]` for the output_path type. PEP 484: prefer `Optional[str] = None` for clarity and tooling.
```diff
     config: dict,
-    output_path: str = None
+    output_path: Optional[str] = None
```
563-593: Tighten file export handling and error chaining.
- Normalize base_dir before validation; create directories with restrictive perms.
- Chain write errors for clearer tracebacks.
```diff
     if output_path:
-        binary_base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
-        abs_path = os.path.abspath(output_path)
+        binary_base_dir = os.path.realpath(os.path.abspath(
+            config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
+        ))
+        abs_path = os.path.realpath(os.path.abspath(output_path))
@@
-        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+        os.makedirs(os.path.dirname(abs_path), mode=0o700, exist_ok=True)
@@
-        except (OSError, TypeError) as write_error:
+        except (OSError, TypeError) as write_error:
             logger.warning("crawl file write failed: %s", write_error, exc_info=True)
             raise HTTPException(
                 status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                detail=f"Failed to write to {abs_path}: {str(write_error)}"
-            )
+                detail=f"Failed to write to {abs_path}: {write_error!s}"
+            ) from write_error
```
80-86: Avoid local absolute import; move helper to utils to prevent cycles. `from server import ...` inside a function risks import issues. Prefer moving `_extract_user_friendly_error` to utils and importing it once at module top.
Want a follow-up patch relocating _extract_user_friendly_error to utils and adjusting imports here and in server.py?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- .env.example (1 hunks)
- deploy/docker/README.md (6 hunks)
- deploy/docker/api.py (9 hunks)
- deploy/docker/config.yml (2 hunks)
- deploy/docker/job.py (2 hunks)
- deploy/docker/mcp_bridge.py (1 hunks)
- deploy/docker/schemas.py (2 hunks)
- deploy/docker/server.py (17 hunks)
- deploy/docker/static/playground/index.html (2 hunks)
- deploy/docker/test_backward_compatibility.py (1 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/md_v2/assets/llm.txt/txt/docker.txt (1 hunks)
- docs/md_v2/core/docker-deployment.md (1 hunks)
- tests/mcp/test_mcp_http_listing.py (1 hunks)
- tests/mcp/test_mcp_js_exec.py (1 hunks)
- tests/mcp/test_mcp_socket.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .env.example
- tests/mcp/test_mcp_http_listing.py
- deploy/docker/schemas.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/mcp/test_mcp_socket.py (1)
- tests/mcp/test_mcp_http_listing.py (1): main (16-24)

deploy/docker/api.py (4)
- deploy/docker/utils.py (2): _ensure_within_base_dir (128-139), datetime_handler (57-61)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (2): success (36-37), model_dump (240-258)
- crawl4ai/async_webcrawler.py (1): arun (204-431)

deploy/docker/server.py (6)
- deploy/docker/api.py (5): handle_crawl_request (425-620), handle_llm_qa (62-112), handle_markdown_request (179-247), handle_stream_crawl_request (622-671), stream_results (392-423)
- crawl4ai/async_webcrawler.py (2): AsyncWebCrawler (53-852), arun (204-431)
- deploy/docker/crawler_pool.py (3): close_all (47-50), get_crawler (22-46), janitor (52-60)
- deploy/docker/mcp_bridge.py (6): combine_lifespans (153-175), create_mcp_server (138-149), mcp_resource (24-36), mcp_template (38-50), mcp_tool (52-64), register_tools_from_routes (179-211)
- deploy/docker/schemas.py (7): CrawlRequest (7-11), HTMLRequest (34-35), JSEndpointRequest (47-52), MarkdownRequest (13-28), PDFRequest (42-44), RawCode (31-32), ScreenshotRequest (37-40)
- deploy/docker/utils.py (4): FilterType (16-20), load_config (22-40), setup_logging (42-47), _ensure_within_base_dir (128-139)
🪛 Ruff (0.13.1)
tests/mcp/test_mcp_js_exec.py
1-1: Shebang is present but file is not executable
(EXE001)
26-26: Probable use of requests call without timeout
(S113)
61-61: Use explicit conversion flag
Replace with conversion flag
(RUF010)
92-92: Use explicit conversion flag
Replace with conversion flag
(RUF010)
100-100: Use explicit conversion flag
Replace with conversion flag
(RUF010)
117-117: Probable use of requests call without timeout
(S113)
tests/mcp/test_mcp_socket.py
25-25: Avoid specifying long messages outside the exception class
(TRY003)
deploy/docker/test_backward_compatibility.py
1-1: Shebang is present but file is not executable
(EXE001)
82-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
111-111: f-string without any placeholders
Remove extraneous f prefix
(F541)
137-137: Do not catch blind exception: Exception
(BLE001)
deploy/docker/api.py
430-430: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
442-446: Abstract raise to an inner function
(TRY301)
488-488: Do not catch blind exception: Exception
(BLE001)
499-499: Do not catch blind exception: Exception
(BLE001)
565-565: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
589-592: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
591-591: Use explicit conversion flag
Replace with conversion flag
(RUF010)
594-594: Consider moving this statement to an else block
(TRY300)
deploy/docker/server.py
199-199: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
284-284: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
284-284: Use explicit conversion flag
Replace with conversion flag
(RUF010)
285-285: Do not catch blind exception: Exception
(BLE001)
286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
286-286: Use explicit conversion flag
Replace with conversion flag
(RUF010)
321-321: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
840-840: Possible binding to all interfaces
(S104)
898-898: Do not catch blind exception: Exception
(BLE001)
898-898: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
971-972: try-except-pass detected, consider logging the exception
(S110)
971-971: Do not catch blind exception: Exception
(BLE001)
deploy/docker/mcp_bridge.py
229-229: Unused noqa directive (unused: E721)
Remove unused noqa directive
(RUF100)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
🪛 markdownlint-cli2 (0.18.1)
docs/md_v2/core/docker-deployment.md
398-398: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (3)
deploy/docker/api.py (3)
442-446: Good guard for headless-only Docker runs. Clear error and guidance; aligns with container constraints.
596-620: Keep structured error details, but ensure media type upstream. `detail` contains a JSON string. Ensure the FastAPI route uses JSONResponse or exception handlers decode this; otherwise, clients receive stringified JSON.
If desired, I can refactor to raise HTTPException(detail=payload_dict) and add a global handler to serialize consistently.
451-454: Direct lookup safe for rate_limiter.enabled. The default config (deploy/docker/config.yml) already includes `crawler.rate_limiter.enabled`, so using `config['crawler']['rate_limiter']['enabled']` won't KeyError.

Likely an incorrect or invalid review comment.
- Fix /mcp/schema endpoint to use proper FastMCP tool introspection
- Remove fallback 'route inspection failed' messages that hurt confidence
- Improve array detection by preserving List[str] typing in wrapper generation
- Replace unreliable route inspection with direct FastMCP tool access
- Add robust _create_param_schema() function for accurate type detection
- Clean error handling with proper HTTP 500 responses instead of fallback

The schema endpoint now provides professional, reliable tool schemas without misleading fallback messages.

Addresses feedback from unclecode#1519
- Fix crawler.arun() extend bug - single result should use append()
- Fix is_required() property access (was incorrectly called as function)
- Add backward compatibility for 'q' parameter in LlmJobPayload
- Remove empty f-strings to satisfy linter
- Previously addressed: UUID collision prevention, hyphen fix, docs
- Harden path validation against symlink escapes in utils.py
- Update JS execution docs to reflect return_values support
- Add missing http language specification to fenced code block
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- deploy/docker/api.py (9 hunks)
- deploy/docker/job.py (3 hunks)
- deploy/docker/mcp_bridge.py (1 hunks)
- deploy/docker/server.py (18 hunks)
- deploy/docker/test_backward_compatibility.py (1 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/md_v2/core/docker-deployment.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/md_v2/core/docker-deployment.md
- deploy/docker/job.py
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/docker/api.py (3)
- deploy/docker/utils.py (2): _ensure_within_base_dir (128-145), datetime_handler (57-61)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (1): model_dump (240-258)

deploy/docker/server.py (6)
- deploy/docker/api.py (5): handle_crawl_request (425-620), handle_llm_qa (62-112), handle_markdown_request (179-247), handle_stream_crawl_request (622-671), stream_results (392-423)
- deploy/docker/crawler_pool.py (3): close_all (47-50), get_crawler (22-46), janitor (52-60)
- deploy/docker/job.py (1): init_job_router (26-30)
- deploy/docker/mcp_bridge.py (5): combine_lifespans (153-175), mcp_resource (24-36), mcp_template (38-50), mcp_tool (52-64), register_tools_from_routes (179-211)
- deploy/docker/schemas.py (6): CrawlRequest (7-11), HTMLRequest (34-35), MarkdownRequest (13-28), PDFRequest (42-44), RawCode (31-32), ScreenshotRequest (37-40)
- deploy/docker/utils.py (3): FilterType (16-20), load_config (22-40), _ensure_within_base_dir (128-145)
🪛 Ruff (0.13.1)
deploy/docker/api.py
430-430: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
442-446: Abstract raise to an inner function
(TRY301)
488-488: Do not catch blind exception: Exception
(BLE001)
499-499: Do not catch blind exception: Exception
(BLE001)
565-565: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
589-592: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
591-591: Use explicit conversion flag
Replace with conversion flag
(RUF010)
594-594: Consider moving this statement to an else block
(TRY300)
deploy/docker/mcp_bridge.py
229-229: Unused noqa directive (unused: E721)
Remove unused noqa directive
(RUF100)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
deploy/docker/server.py
199-199: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
284-284: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
284-284: Use explicit conversion flag
Replace with conversion flag
(RUF010)
285-285: Do not catch blind exception: Exception
(BLE001)
286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
286-286: Use explicit conversion flag
Replace with conversion flag
(RUF010)
321-321: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
840-840: Possible binding to all interfaces
(S104)
893-893: Do not catch blind exception: Exception
(BLE001)
901-901: Do not catch blind exception: Exception
(BLE001)
903-903: Use explicit conversion flag
Replace with conversion flag
(RUF010)
965-965: Do not catch blind exception: Exception
(BLE001)
deploy/docker/test_backward_compatibility.py
1-1: Shebang is present but file is not executable
(EXE001)
137-137: Do not catch blind exception: Exception
(BLE001)
Previously, when a single CrawlResult object was returned (not wrapped in a list), the code would incorrectly treat it as "already a list" and iterate over its fields instead of treating it as a single result.

This fix adds proper type checking to:
- Detect actual sequences (list, tuple) vs single objects
- Wrap single CrawlResult objects in a list
- Handle None results gracefully
- Ensure downstream code always receives a proper list of results

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/api.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (3)
- deploy/docker/utils.py (2): _ensure_within_base_dir (128-145), datetime_handler (57-61)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (1): model_dump (240-258)
🪛 Ruff (0.13.1)
deploy/docker/api.py
430-430: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
442-446: Abstract raise to an inner function
(TRY301)
495-495: Do not catch blind exception: Exception
(BLE001)
506-506: Do not catch blind exception: Exception
(BLE001)
572-572: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
596-599: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
598-598: Use explicit conversion flag
Replace with conversion flag
(RUF010)
601-601: Consider moving this statement to an else block
(TRY300)
…jects

Previously, the cleanup loop would drop any value that json.dumps() couldn't serialize, causing datetime and Path-like fields to be logged and skipped.

This fix adds proper type detection and normalization:
- datetime/date objects: converted using existing datetime_handler
- os.PathLike/pathlib.Path: converted to strings using os.fspath()
- bytes: preserved with existing base64 encoding
- Unsupported types: better logging with actual values

Changes:
- Added pathlib import and date to datetime imports
- Added datetime_handler to utils import at module level
- Enhanced cleanup loop with proper isinstance checks
- Improved error logging to show actual values

This ensures important metadata like timestamps and file paths are properly preserved in API responses instead of being silently dropped during JSON serialization.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
- deploy/docker/utils.py (2): datetime_handler (57-61), _ensure_within_base_dir (128-145)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (2): success (36-37), model_dump (240-258)
- crawl4ai/async_webcrawler.py (1): arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py
432-432: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
444-448: Abstract raise to an inner function
(TRY301)
497-497: Do not catch blind exception: Exception
(BLE001)
508-508: Do not catch blind exception: Exception
(BLE001)
555-555: Local variable datetime_handler referenced before assignment
(F823)
581-581: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
605-608: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
607-607: Use explicit conversion flag
Replace with conversion flag
(RUF010)
610-610: Consider moving this statement to an else block
(TRY300)
…objects
CodeRabbit identified a critical issue: when Path/datetime objects are nested
inside collections (lists, dicts), the previous fix would still drop entire
fields because json.dumps() fails on the collection level.
This recursive approach fixes the issue by:
- Introducing _normalize_value() function that handles nested collections
- Recursively normalizing dict values: {k: _normalize_value(v) for k, v in dict.items()}
- Recursively normalizing list/tuple/set: [_normalize_value(v) for v in collection]
- Preserving existing special PDF handling logic
- Ensuring downloaded_files and other nested filesystem metadata is preserved
This prevents regression in binary-export flows where clients expect to see
saved file locations in nested data structures.
Addresses: unclecode#1519 (comment)...
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
deploy/docker/api.py (2)
588-616: Expand and validate the output base directory

`_ensure_within_base_dir` expects an absolute base directory, but the value in `config["binary"]["default_dir"]` may be provided as `~/exports` (per the docs) or include environment variables. Passing it verbatim causes `os.path.abspath` to treat the tilde literally (`/app/~/exports`), so every legitimate `output_path` under `$HOME` is rejected. Please normalize the base folder with `os.path.expanduser`/`os.path.expandvars` (and re-run `_ensure_within_base_dir` on the resolved path) before comparison.
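The normalization being asked for is a small chain of stdlib calls; a sketch (the helper name `resolve_base_dir` is illustrative, not from the PR):

```python
import os


def resolve_base_dir(raw: str) -> str:
    """Expand ~ and $VARS, then resolve symlinks to an absolute path,
    before any containment check compares it against output_path."""
    return os.path.realpath(os.path.abspath(
        os.path.expandvars(os.path.expanduser(raw))))
```

Order matters: `expanduser` must run before `abspath`, otherwise `~/exports` becomes `<cwd>/~/exports` and the containment check rejects valid paths.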
432-433: Type-hint `output_path` as `Optional[str]`

To comply with our typing guidelines, annotate the new parameter as `Optional[str] = None` so callers understand `None` is acceptable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
- deploy/docker/utils.py (3): validate_llm_provider (96-116), datetime_handler (57-61), _ensure_within_base_dir (128-145)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (2): success (36-37), model_dump (240-258)
- crawl4ai/async_webcrawler.py (1): arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py
432-432: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
444-448: Abstract raise to an inner function
(TRY301)
497-497: Do not catch blind exception: Exception
(BLE001)
508-508: Do not catch blind exception: Exception
(BLE001)
590-590: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
614-617: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
616-616: Use explicit conversion flag
Replace with conversion flag
(RUF010)
619-619: Consider moving this statement to an else block
(TRY300)
Previously, when a CrawlResultContainer was returned, the normalization logic would treat it as a single object and wrap it in a list, causing downstream serialization to see the container instead of its internal per-URL entries.

This fix adds detection for objects with a .results attribute:
- Checks hasattr(results_gen, 'results') before single-object fallback
- Extracts internal entries with list(results_gen.results)
- Preserves existing handling for None, list/tuple, and single-model cases
- Maintains async-generator collection and fallback-per-URL retry logic

This ensures that CrawlResultContainer objects properly expose their individual crawl results to downstream processing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
- deploy/docker/utils.py (3): validate_llm_provider (96-116), datetime_handler (57-61), _ensure_within_base_dir (128-145)
- deploy/docker/server.py (1): _extract_user_friendly_error (210-233)
- crawl4ai/models.py (2): success (36-37), model_dump (240-258)
- crawl4ai/async_webcrawler.py (1): arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py
432-432: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
444-448: Abstract raise to an inner function
(TRY301)
500-500: Do not catch blind exception: Exception
(BLE001)
511-511: Do not catch blind exception: Exception
(BLE001)
593-593: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
617-620: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
619-619: Use explicit conversion flag
Replace with conversion flag
(RUF010)
622-622: Consider moving this statement to an else block
(TRY300)
Previously, the fallback path would append crawler.arun() return values directly to fallback_results, but AsyncWebCrawler.arun returns container wrappers (CrawlResultContainer) which break downstream JSON serialization. This fix adds proper container detection and unwrapping:

- hasattr(result, 'results') → extend with container.results
- hasattr(result, '__iter__') (excluding str/bytes/dict) → extend with the iterable
- Non-iterable single results → append as-is

This ensures the retry/serialization path receives actual CrawlResult objects instead of container wrappers, maintaining consistency with the main success path and preventing JSON serialization errors.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deploy/docker/api.py (2)
148-172: Fix: unwrap container in LLM extraction path to avoid empty results. Same container issue here; use the inner result for status/error/content.
Apply this diff:

```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
+    async with AsyncWebCrawler() as crawler:
+        result_container = await crawler.arun(
             url=url,
             config=CrawlerRunConfig(
                 extraction_strategy=llm_strategy,
                 scraping_strategy=LXMLWebScrapingStrategy(),
                 cache_mode=cache_mode
             )
         )
-        if not result.success:
+        result_obj = next(iter(getattr(result_container, "results", [result_container])), None)
+        if not result_obj or not getattr(result_obj, "success", False):
             await redis.hset(f"task:{task_id}", mapping={
                 "status": TaskStatus.FAILED,
-                "error": result.error_message
+                "error": getattr(result_obj, "error_message", "")
             })
             return
         try:
-            content = json.loads(result.extracted_content)
+            content = json.loads(result_obj.extracted_content)
         except json.JSONDecodeError:
-            content = result.extracted_content
+            content = result_obj.extracted_content
```
221-243: Fix: unwrap container in markdown handler. Accessing `.success`/`.markdown` on the container will fail. Unwrap first.
Apply this diff:

```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
+    async with AsyncWebCrawler() as crawler:
+        result_container = await crawler.arun(
             url=decoded_url,
             config=CrawlerRunConfig(
                 markdown_generator=md_generator,
                 scraping_strategy=LXMLWebScrapingStrategy(),
                 cache_mode=cache_mode
             )
         )
-        if not result.success:
+        result_obj = next(iter(getattr(result_container, "results", [result_container])), None)
+        if not result_obj or not getattr(result_obj, "success", False):
             from server import _extract_user_friendly_error
-            error_msg = result.error_message or "Failed to fetch content"
+            error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
             friendly_error = _extract_user_friendly_error(error_msg, "Markdown generation")
             raise HTTPException(
                 status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                 detail=friendly_error
             )
-        return (result.markdown.raw_markdown
+        return (result_obj.markdown.raw_markdown
                 if filter_type == FilterType.RAW
-                else result.markdown.fit_markdown)
+                else result_obj.markdown.fit_markdown)
```
🧹 Nitpick comments (8)
deploy/docker/api.py (8)
431-433: Type hint: make output_path Optional[str]. Silences RUF013 and clarifies intent.
As per static analysis hints, apply:

```diff
-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None
```
500-502: Don’t swallow cancellations; re-raise CancelledError. Prevents masking task cancellations in the bulk path.
Apply:

```diff
-    except Exception as bulk_error:
+    except asyncio.CancelledError:
+        raise
+    except Exception as bulk_error:
         logger.warning("Bulk crawl failed, retrying per-URL: %s", bulk_error)
```
520-528: Don’t swallow cancellations in per-URL fallback. Same rationale as above.
Apply:

```diff
-    except Exception as url_error:
+    except asyncio.CancelledError:
+        raise
+    except Exception as url_error:
         logger.warning("Crawl failed for %s: %s", url, url_error)
         fallback_results.append(
             {
                 "url": url,
                 "success": False,
                 "error_message": str(url_error),
             }
         )
```
511-519: Avoid the generic `__iter__` check; it can misclassify Pydantic models. BaseModel may be iterable; extending on it corrupts results. Restrict to concrete sequences.
Apply:

```diff
-    elif hasattr(single_result, '__iter__') and not isinstance(single_result, (str, bytes, dict)):
-        # Other iterable wrapper - iterate and append each result
-        fallback_results.extend(single_result)
+    elif isinstance(single_result, (list, tuple)):
+        # Other concrete sequence wrapper - extend with items
+        fallback_results.extend(single_result)
```
609-616: Harden file write: secure dir perms, drop redundant import. Set restrictive permissions and use the already-imported datetime_handler.
Apply:

```diff
-    os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+    os.makedirs(os.path.dirname(abs_path), mode=0o700, exist_ok=True)
@@
-    try:
-        from utils import datetime_handler
+    try:
         with open(abs_path, 'w', encoding='utf-8') as f:
             json.dump(response_data, f, ensure_ascii=False, indent=2, default=datetime_handler)
```

As per static analysis hints (S108).
624-629: Preserve exception cause when file write fails. Use exception chaining and an explicit conversion flag.
As per static analysis hints (B904, RUF010), apply:

```diff
-    except (OSError, TypeError) as write_error:
+    except (OSError, TypeError) as write_error:
         logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-        raise HTTPException(
+        raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=f"Failed to write to {abs_path}: {str(write_error)}"
-        )
+            detail=f"Failed to write to {abs_path}: {write_error!s}"
+        ) from write_error
```
649-657: Return a structured error dict instead of a JSON string. Avoids double-encoding and lets FastAPI serialize the object.
Apply:

```diff
-    import json as json_module  # Import locally to avoid any scoping issues
     raise HTTPException(
         status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-        detail=json_module.dumps({  # Send structured error
-            "error": str(e),
-            "server_memory_delta_mb": mem_delta_mb,
-            "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
-        })
+        detail={
+            "error": str(e),
+            "server_memory_delta_mb": mem_delta_mb,
+            "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
+        }
     )
```
674-679: Honor rate_limiter.enabled in the streaming path for parity with non-streaming. Keeps behavior consistent with handle_crawl_request.
Apply:

```diff
-    dispatcher = MemoryAdaptiveDispatcher(
-        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-        rate_limiter=RateLimiter(
-            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-        )
-    )
+    dispatcher = MemoryAdaptiveDispatcher(
+        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+        rate_limiter=(
+            RateLimiter(base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"]))
+            if config["crawler"]["rate_limiter"].get("enabled")
+            else None
+        )
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deploy/docker/api.py(10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
deploy/docker/utils.py (3)
validate_llm_provider (96-116), datetime_handler (57-61), _ensure_within_base_dir (128-145)
deploy/docker/server.py (1)
_extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
success (36-37), model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py
432-432: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
444-448: Abstract raise to an inner function
(TRY301)
500-500: Do not catch blind exception: Exception
(BLE001)
520-520: Do not catch blind exception: Exception
(BLE001)
602-602: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
626-629: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
628-628: Use explicit conversion flag
Replace with conversion flag
(RUF010)
631-631: Consider moving this statement to an else block
(TRY300)
🔇 Additional comments (2)
deploy/docker/api.py (2)
444-448: Good guard: disallow non-headless sessions in Docker. Prevents unsupported configurations early with a clear error.
556-590: Nice: recursive normalization preserves datetime/path/bytes without data loss. This keeps new filesystem metadata and timestamps intact in responses.
…ion and error handling

• Replace 25+ hasattr() calls with @singledispatch protocol dispatch
• Add ResultContainer, SingleResult, AsyncResultGenerator protocols
• Implement NetworkError, ValidationError, SecurityError protocols
• Create ErrorResponse schema with structured error information
• Add centralized result_normalizer with recursive value normalization
• Fix path handling for quoted output_path parameters
• Maintain 100% backward compatibility with existing MCP endpoints
• Add comprehensive test coverage for new protocol systems
• Enable user-friendly error messages with actionable suggestions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
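The @singledispatch dispatch style mentioned above can be sketched as follows (illustrative only, not the actual crawl4ai result_normalizer):

```python
from functools import singledispatch

@singledispatch
def normalize(results):
    """Fallback: treat the input as a single result and wrap it in a list."""
    return [results]

@normalize.register(type(None))
def _(results):
    # No results at all -> empty list
    return []

@normalize.register(list)
@normalize.register(tuple)
def _(results):
    # Concrete sequences pass through as plain lists
    return list(results)
```

Compared with chains of hasattr() checks, each input shape gets its own registered handler, and new container types can be supported by registering another function.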
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deploy/docker/api.py (2)
156-177: Fix: unwrap arun() result in LLM extraction path. Same container-vs-result issue; without unwrapping, error checks and content parsing are wrong.
Apply this diff:

```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
+    async with AsyncWebCrawler() as crawler:
+        container = await crawler.arun(
             url=url,
             config=CrawlerRunConfig(
                 extraction_strategy=llm_strategy,
                 scraping_strategy=LXMLWebScrapingStrategy(),
                 cache_mode=cache_mode
             )
         )
-
-        if not result.success:
+        results = normalize_results_protocol_dispatch(container)
+        result_obj = results[0] if results else None
+
+        if not result_obj or not getattr(result_obj, "success", False):
             await redis.hset(f"task:{task_id}", mapping={
                 "status": TaskStatus.FAILED,
-                "error": result.error_message
+                "error": getattr(result_obj, "error_message", "")
             })
             return
         try:
-            content = json.loads(result.extracted_content)
+            content = json.loads(result_obj.extracted_content)
         except json.JSONDecodeError:
-            content = result.extracted_content
+            content = result_obj.extracted_content
```
229-251: Fix: unwrap arun() result in markdown handler. Accessing `.success`/`.markdown` directly on the container will fail. Unwrap before checks/return.
Apply this diff:

```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
+    async with AsyncWebCrawler() as crawler:
+        container = await crawler.arun(
             url=decoded_url,
             config=CrawlerRunConfig(
                 markdown_generator=md_generator,
                 scraping_strategy=LXMLWebScrapingStrategy(),
                 cache_mode=cache_mode
             )
         )
-    if not result.success:
+    results = normalize_results_protocol_dispatch(container)
+    result_obj = results[0] if results else None
+    if not result_obj or not getattr(result_obj, "success", False):
         from server import _extract_user_friendly_error
-        error_msg = result.error_message or "Failed to fetch content"
+        error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
         friendly_error = _extract_user_friendly_error(error_msg, "Markdown generation")
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
             detail=friendly_error
         )
-    return (result.markdown.raw_markdown
+    return (result_obj.markdown.raw_markdown
             if filter_type == FilterType.RAW
-            else result.markdown.fit_markdown)
+            else result_obj.markdown.fit_markdown)
```
🧹 Nitpick comments (22)
deploy/docker/result_protocols.py (1)
5-5: Use AsyncIterator/Awaitable in async protocol signatures. Typing-wise, `__aiter__` should return AsyncIterator[Any] and `__anext__` should return Awaitable[Any]. Update imports and method annotations accordingly.

```diff
-from typing import Protocol, runtime_checkable, Iterable, Any, AsyncGenerator
+from typing import Protocol, runtime_checkable, Iterable, Any, AsyncIterator, Awaitable
@@
-    def __aiter__(self) -> AsyncGenerator[Any, None]:
+    def __aiter__(self) -> AsyncIterator[Any]:
         """Async iterator interface."""
         ...
-    def __anext__(self) -> Any:
+    def __anext__(self) -> Awaitable[Any]:
         """Async next interface."""
         ...
```

Also applies to: 31-36
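The runtime-checkable protocol approach used by these modules can be sketched as follows (illustrative, not the actual contents of result_protocols.py):

```python
from typing import Any, Iterable, Protocol, runtime_checkable

@runtime_checkable
class ResultContainer(Protocol):
    """Structural type for wrappers that expose per-URL entries."""
    @property
    def results(self) -> Iterable[Any]: ...

class FakeContainer:
    """Satisfies the protocol structurally, without inheriting from it."""
    def __init__(self, items):
        self.results = items
```

isinstance() against a @runtime_checkable protocol only verifies member presence, which is why it can replace scattered hasattr() checks.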
deploy/docker/utils.py (1)
139-146: Prefer package-qualified imports to avoid module shadowing. If this code runs as a package, use absolute imports (e.g., from deploy.docker.error_handler import ...) to avoid ambiguity; keep the legacy fallback as-is.
deploy/docker/test_basic_normalization.py (3)
76-76: Drop unnecessary f-string. No interpolation here.

```diff
-    assert "url" in result, f"Expected 'url' key in result"
+    assert "url" in result, "Expected 'url' key in result"
```
6-9: Avoid hardcoded /tmp paths in tests. Use tempfile for portability and to satisfy linters.

```diff
 import os
 import pathlib
+import tempfile
 from datetime import datetime, date
 from typing import List, Any
@@
-    "path": pathlib.Path("/tmp/test"),
+    "path": pathlib.Path(tempfile.gettempdir()) / "test",
     "nested": {"date": date.today()}
```

Also applies to: 80-84
132-140: Narrow the broad exception catch in the test main. Catching Exception masks failures. Either let exceptions bubble or catch AssertionError explicitly.

```diff
-    except Exception as e:
+    except AssertionError as e:
         print(f"\n❌ Test failed: {e}")
         import traceback
         traceback.print_exc()
```

deploy/docker/test_error_handling.py (2)
134-134: Drop unnecessary f-string. No placeholders.

```diff
-    print(f"✓ JSON serialization successful")
+    print("✓ JSON serialization successful")
```
162-165: Avoid blind Exception catch in the test main. Let errors surface or catch specific exceptions to keep failures actionable.

```diff
-    except Exception as e:
+    except AssertionError as e:
         print(f"\n❌ Error handling test failed: {e}")
         import traceback
         traceback.print_exc()
```

deploy/docker/result_normalizer.py (5)
114-124: Honor the SingleResult protocol in normalize_single_result. Leverage the protocol you defined rather than only duck-typing Pydantic.

```diff
-    if hasattr(result, "model_dump"):
+    if isinstance(result, SingleResult):
+        result_dict = result.to_dict()
+    elif hasattr(result, "model_dump"):
         result_dict = result.model_dump()
```
69-73: Raise TypeError for unsupported async-generator input. TypeError better communicates an invalid input type.

```diff
-    raise ValueError("Async generators must be handled by caller with async collection")
+    raise TypeError("Async generators must be handled by caller with async collection")
```
51-54: Silence ARG001 by using an underscore for the unused singledispatch arg. Cleaner signature for the None branch.

```diff
-@normalize_results.register
-def _(results: type(None)) -> List[Any]:
+@normalize_results.register
+def _(_: type(None)) -> List[Any]:
```
5-11: Prefer collections.abc.Iterable over hasattr('__iter__'). A runtime isinstance check against collections.abc.Iterable avoids string matching and satisfies linters.

```diff
 import os
 import pathlib
 from base64 import b64encode
 from datetime import datetime, date
 from functools import singledispatch
 from typing import List, Any, Union, AsyncGenerator
+from collections.abc import Iterable as IterableABC
@@
-    if hasattr(results, '__iter__') and not isinstance(results, (str, bytes, dict)):
+    if isinstance(results, IterableABC) and not isinstance(results, (str, bytes, dict)):
         try:
             return list(results)
```

Also applies to: 75-79
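The isinstance-based check suggested above behaves like this (standalone sketch with an illustrative helper name, not the module's actual function):

```python
from collections.abc import Iterable

def is_sequence_like(obj) -> bool:
    """True for generic iterables, excluding atomic and mapping types."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes, dict))
```

Unlike hasattr(obj, '__iter__'), the ABC check also covers classes that implement iteration via __getitem__ registration and reads as an explicit type test.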
15-20: DRY: reuse datetime_handler from utils. Avoid duplicating datetime handling logic; import the existing helper for consistency.

```diff
-def datetime_handler(obj):
-    """Handle datetime serialization - imported from existing logic."""
-    if isinstance(obj, (datetime, date)):
-        return obj.isoformat()
-    return obj
+from utils import datetime_handler
```

deploy/docker/api.py (5)
439-441: Type hint: make output_path Optional[str]. Avoid implicit Optional per Ruff RUF013.
Apply this diff:

```diff
-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None
```
578-583: Propagate the original write error and use explicit f-string conversion. Raise from the caught exception and use {!s} per RUF010.
Apply this diff:

```diff
-    except (OSError, TypeError) as write_error:
+    except (OSError, TypeError) as write_error:
         logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-        raise HTTPException(
+        raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=f"Failed to write to {abs_path}: {str(write_error)}"
-        )
+            detail=f"Failed to write to {abs_path}: {write_error!s}"
+        ) from write_error
```
603-611: Avoid double JSON-encoding in HTTPException detail. Return a dict, not a JSON string. Aligns with the PR’s “remove double-JSON-encoding” goal.
Apply this diff:

```diff
-    import json as json_module  # Import locally to avoid any scoping issues
-    raise HTTPException(
-        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-        detail=json_module.dumps({  # Send structured error
-            "error": str(e),
-            "server_memory_delta_mb": mem_delta_mb,
-            "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
-        })
-    )
+    raise HTTPException(
+        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+        detail={
+            "error": str(e),
+            "server_memory_delta_mb": mem_delta_mb,
+            "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0),
+        },
+    )
```
498-498: Consider narrowing broad exception catches. Catching bare Exception hinders observability and can mask programming errors; prefer specific exceptions or re-raise after logging.
Also applies to: 511-511
628-633: Align rate limiter enablement with the non-streaming path. Non-streaming honors config["crawler"]["rate_limiter"]["enabled"]; the streaming path always enables it. Consider respecting the flag for consistency.
Apply this diff:

```diff
-    dispatcher = MemoryAdaptiveDispatcher(
-        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-        rate_limiter=RateLimiter(
-            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-        )
-    )
+    dispatcher = MemoryAdaptiveDispatcher(
+        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+        rate_limiter=(
+            RateLimiter(base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"]))
+            if config["crawler"]["rate_limiter"].get("enabled")
+            else None
+        ),
+    )
```

deploy/docker/error_handler.py (1)
76-82: Minor: use {!s} instead of str(...) in the f-string. Cleaner and satisfies RUF010.
Apply this diff:

```diff
 return ErrorResponse(
     error_type="unknown",
-    error_message=f"{operation} failed: {str(error)}",
+    error_message=f"{operation} failed: {error!s}",
     operation=operation,
     suggestions=["Please check your input and try again"]
 )
```
1-1: Nit: remove shebang from the test module (or make the file executable). Tests are run by pytest; the shebang is unnecessary and triggers EXE001.
Apply this diff:

```diff
-#!/usr/bin/env python3
```

deploy/docker/server.py (2)
909-968: Improve error handling in JSON schema generation. The _create_param_schema function catches all exceptions but doesn't log them, making debugging difficult. Also, the JSON serialization test could be more efficient.
Consider logging exceptions and improving the serialization check:

```diff
 def _create_param_schema(annotation, default_value):
     """Create JSON schema for a parameter based on its type annotation."""
     import json
     from typing import get_origin, get_args
     schema = {"type": "string"}  # Default fallback
     try:
         # ... existing type handling code ...
         # Add default value if present and JSON serializable
         if default_value != inspect.Parameter.empty and default_value is not None:
-            try:
-                json.dumps(default_value)  # Test if serializable
-                schema["default"] = default_value
-            except (TypeError, ValueError):
-                # If not serializable, convert to string representation
-                if isinstance(default_value, (str, int, float, bool)):
-                    schema["default"] = default_value
-                else:
-                    schema["default"] = str(default_value)
+            # Check if value is directly serializable
+            if isinstance(default_value, (str, int, float, bool, type(None))):
+                schema["default"] = default_value
+            else:
+                try:
+                    json.dumps(default_value)  # Test complex types
+                    schema["default"] = default_value
+                except (TypeError, ValueError):
+                    schema["default"] = str(default_value)
-    except Exception:
+    except Exception as e:
         # Fallback for any errors in schema generation
+        logger.debug(f"Failed to generate schema for {annotation}: {e}")
         schema = {"type": "string", "description": f"Type: {annotation}"}
     return schema
```
787-791: Consider extracting the truncation limit as a constant. The truncation function uses a hardcoded limit of 800 characters. Consider making this configurable or at least defining it as a named constant for maintainability.

```diff
+# At module level
+DEFAULT_CONTEXT_TRUNCATE_LIMIT = 800
+
 def _truncate(text: str, limit: int = 800) -> str:
     if len(text) <= limit:
         return text
     return text[:limit].rstrip() + "…"
```

Or even better, make it configurable:

```diff
-def _truncate(text: str, limit: int = 800) -> str:
+def _truncate(text: str, limit: int = None) -> str:
+    if limit is None:
+        limit = config.get("context", {}).get("truncate_limit", 800)
     if len(text) <= limit:
         return text
     return text[:limit].rstrip() + "…"
```

deploy/docker/error_protocols.py (1)
127-130: Consider documenting the context parameter usage. The `context` parameter in `CrawlError.__init__` is stored but never accessed through a property. Consider either adding a property for it or documenting when/how it should be used.

```diff
 class CrawlError:
     """Base implementation for crawl-related errors."""
     def __init__(self, message: str, operation: str = "operation", context: Optional[Dict[str, Any]] = None):
         self._message = message
         self._operation = operation
+        # Context can be used to store additional error details for logging/debugging
         self.context = context or {}
     @property
     def message(self) -> str:
         return self._message
     @property
     def operation(self) -> str:
         return self._operation
+
+    @property
+    def context(self) -> Dict[str, Any]:
+        """Additional error context for debugging."""
+        return self._context
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
deploy/docker/api.py (11 hunks)
deploy/docker/error_handler.py (1 hunks)
deploy/docker/error_protocols.py (1 hunks)
deploy/docker/result_normalizer.py (1 hunks)
deploy/docker/result_protocols.py (1 hunks)
deploy/docker/schemas.py (3 hunks)
deploy/docker/server.py (18 hunks)
deploy/docker/test_basic_normalization.py (1 hunks)
deploy/docker/test_error_handling.py (1 hunks)
deploy/docker/test_result_normalization.py (1 hunks)
deploy/docker/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
deploy/docker/result_protocols.py (2)
deploy/docker/test_basic_normalization.py (1)
results(47-48)deploy/docker/test_result_normalization.py (1)
results(55-56)
deploy/docker/result_normalizer.py (4)
deploy/docker/result_protocols.py (4)
ResultContainer(9-15)SingleResult(19-24)AsyncResultGenerator(28-37)results(13-15)deploy/docker/utils.py (1)
datetime_handler(57-61)deploy/docker/test_basic_normalization.py (2)
results(47-48)model_dump(30-37)deploy/docker/test_result_normalization.py (2)
results(55-56)model_dump(38-45)
deploy/docker/test_error_handling.py (3)
deploy/docker/error_handler.py (3)
handle_error(21-82)parse_network_error(170-206)parse_security_error(209-224)deploy/docker/error_protocols.py (21)
NetworkErrorImpl(141-150)ValidationErrorImpl(154-173)SecurityErrorImpl(176-195)error_code(15-17)error_code(149-150)message(21-23)message(133-134)operation(27-29)operation(137-138)field_name(38-40)field_name(164-165)invalid_value(44-46)invalid_value(168-169)expected_format(50-52)expected_format(172-173)violation_type(61-63)violation_type(186-187)attempted_action(67-69)attempted_action(190-191)allowed_scope(73-75)allowed_scope(194-195)deploy/docker/schemas.py (4)
ErrorResponse(96-199)network_error(107-121)validation_error(124-141)security_error(144-161)
deploy/docker/server.py (8)
deploy/docker/api.py (5)
handle_crawl_request(435-611)handle_llm_qa(72-122)handle_markdown_request(189-257)handle_stream_crawl_request(613-662)stream_results(402-433)crawl4ai/async_webcrawler.py (2)
AsyncWebCrawler(53-852)arun(204-431)deploy/docker/crawler_pool.py (3)
close_all(47-50)get_crawler(22-46)janitor(52-60)deploy/docker/mcp_bridge.py (6)
combine_lifespans(153-175)create_mcp_server(138-149)mcp_resource(24-36)mcp_template(38-50)mcp_tool(52-64)register_tools_from_routes(179-211)deploy/docker/schemas.py (9)
CrawlRequest(7-11)HTMLRequest(34-35)MarkdownRequest(13-28)ErrorResponse(96-199)PDFRequest(42-44)RawCode(31-32)ScreenshotRequest(37-40)network_error(107-121)error_response(82-93)deploy/docker/utils.py (2)
FilterType(16-20)_ensure_within_base_dir(128-179)deploy/docker/error_handler.py (4)
handle_error(21-82)parse_network_error(170-206)parse_security_error(209-224)create_http_exception_from_error(269-274)deploy/docker/error_protocols.py (4)
NetworkErrorImpl(141-150)SecurityErrorImpl(176-195)operation(27-29)operation(137-138)
deploy/docker/api.py (5)
deploy/docker/result_normalizer.py (4)
normalize_results_protocol_dispatch(57-82)normalize_single_result(104-145)normalize_value_recursive(85-101)datetime_handler(15-19)deploy/docker/schemas.py (1)
StandardizedResponse(55-93)deploy/docker/utils.py (3)
validate_llm_provider(96-116)datetime_handler(57-61)_ensure_within_base_dir(128-179)deploy/docker/server.py (1)
_extract_user_friendly_error(217-222)crawl4ai/async_webcrawler.py (1)
arun(204-431)
deploy/docker/schemas.py (3)
deploy/docker/utils.py (1)
FilterType(16-20)deploy/docker/result_protocols.py (1)
results(13-15)deploy/docker/error_protocols.py (30)
error_code(15-17)error_code(149-150)operation(27-29)operation(137-138)message(21-23)message(133-134)field_name(38-40)field_name(164-165)invalid_value(44-46)invalid_value(168-169)expected_format(50-52)expected_format(172-173)violation_type(61-63)violation_type(186-187)attempted_action(67-69)attempted_action(190-191)allowed_scope(73-75)allowed_scope(194-195)config_key(84-86)config_key(208-209)config_value(90-92)config_value(212-213)suggestion(96-98)suggestion(216-217)processing_stage(107-109)processing_stage(230-231)data_type(113-115)data_type(234-235)recovery_suggestion(119-121)recovery_suggestion(238-239)
deploy/docker/test_result_normalization.py (3)
deploy/docker/result_normalizer.py (4)
normalize_results(23-36)normalize_results_protocol_dispatch(57-82)normalize_single_result(104-145)normalize_value_recursive(85-101)deploy/docker/result_protocols.py (3)
ResultContainer(9-15)SingleResult(19-24)results(13-15)deploy/docker/schemas.py (3)
StandardizedResponse(55-93)success_response(65-79)error_response(82-93)
deploy/docker/test_basic_normalization.py (3)
deploy/docker/result_normalizer.py (4)
normalize_results(23-36)normalize_results_protocol_dispatch(57-82)normalize_single_result(104-145)normalize_value_recursive(85-101)deploy/docker/result_protocols.py (3)
ResultContainer(9-15)SingleResult(19-24)results(13-15)deploy/docker/schemas.py (2)
StandardizedResponse(55-93)success_response(65-79)
deploy/docker/error_handler.py (2)
deploy/docker/error_protocols.py (40)
NetworkError(10-29)ValidationError(33-52)SecurityError(56-75)ConfigurationError(79-98)ProcessingError(102-121)NetworkErrorImpl(141-150)ValidationErrorImpl(154-173)SecurityErrorImpl(176-195)ConfigurationErrorImpl(198-217)ProcessingErrorImpl(220-239)operation(27-29)operation(137-138)error_code(15-17)error_code(149-150)message(21-23)message(133-134)field_name(38-40)field_name(164-165)invalid_value(44-46)invalid_value(168-169)expected_format(50-52)expected_format(172-173)violation_type(61-63)violation_type(186-187)attempted_action(67-69)attempted_action(190-191)allowed_scope(73-75)allowed_scope(194-195)config_key(84-86)config_key(208-209)config_value(90-92)config_value(212-213)suggestion(96-98)suggestion(216-217)processing_stage(107-109)processing_stage(230-231)data_type(113-115)data_type(234-235)recovery_suggestion(119-121)recovery_suggestion(238-239)deploy/docker/schemas.py (6)
ErrorResponse(96-199)network_error(107-121)validation_error(124-141)security_error(144-161)configuration_error(164-179)processing_error(182-199)
deploy/docker/utils.py (3)
deploy/docker/error_handler.py (2)
handle_error(21-82)create_http_exception_from_error(269-274)deploy/docker/error_protocols.py (9)
SecurityErrorImpl(176-195)violation_type(61-63)violation_type(186-187)attempted_action(67-69)attempted_action(190-191)allowed_scope(73-75)allowed_scope(194-195)message(21-23)message(133-134)deploy/docker/schemas.py (2)
security_error(144-161)error_response(82-93)
🪛 Ruff (0.13.1)
deploy/docker/result_normalizer.py
52-52: Unused function argument: results
(ARG001)
72-72: Prefer TypeError exception for invalid type
(TRY004)
72-72: Avoid specifying long messages outside the exception class
(TRY003)
deploy/docker/test_error_handling.py
1-1: Shebang is present but file is not executable
(EXE001)
78-78: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
134-134: f-string without any placeholders
Remove extraneous f prefix
(F541)
162-162: Do not catch blind exception: Exception
(BLE001)
deploy/docker/server.py
204-204: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
219-219: Redefinition of unused parse_network_error from line 69
Remove definition: parse_network_error
(F811)
219-219: Redefinition of unused handle_error from line 69
Remove definition: handle_error
(F811)
273-273: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
273-273: Use explicit conversion flag
Replace with conversion flag
(RUF010)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Use explicit conversion flag
Replace with conversion flag
(RUF010)
310-310: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
839-839: Possible binding to all interfaces
(S104)
892-892: Do not catch blind exception: Exception
(BLE001)
900-900: Do not catch blind exception: Exception
(BLE001)
902-902: Use explicit conversion flag
Replace with conversion flag
(RUF010)
964-964: Do not catch blind exception: Exception
(BLE001)
deploy/docker/api.py
440-440: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
452-456: Abstract raise to an inner function
(TRY301)
498-498: Do not catch blind exception: Exception
(BLE001)
511-511: Do not catch blind exception: Exception
(BLE001)
548-548: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
580-583: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
582-582: Use explicit conversion flag
Replace with conversion flag
(RUF010)
585-585: Consider moving this statement to an else block
(TRY300)
deploy/docker/test_result_normalization.py
1-1: Shebang is present but file is not executable
(EXE001)
162-162: Probable insecure usage of temporary file or directory: "/tmp/test"
(S108)
182-182: Probable insecure usage of temporary file or directory: "/tmp"
(S108)
186-186: Probable insecure usage of temporary file or directory: "/tmp/file"
(S108)
201-201: Probable insecure usage of temporary file or directory: "/tmp"
(S108)
267-267: Probable insecure usage of temporary file or directory: "/tmp/test"
(S108)
deploy/docker/test_basic_normalization.py
1-1: Shebang is present but file is not executable
(EXE001)
76-76: f-string without any placeholders
Remove extraneous f prefix
(F541)
82-82: Probable insecure usage of temporary file or directory: "/tmp/test"
(S108)
137-137: Do not catch blind exception: Exception
(BLE001)
deploy/docker/error_handler.py
79-79: Use explicit conversion flag
Replace with conversion flag
(RUF010)
135-135: Unused function argument: operation
(ARG001)
157-157: Unused function argument: operation
(ARG001)
🔇 Additional comments (10)
deploy/docker/utils.py (1)
147-161: Also validate that base_dir exists and is a directory
Good normalization and symlink handling. Add an existence/type check to avoid confusing errors and align with prior guidance.
@@
-    # Validate absolute base_dir
+    # Validate absolute base_dir
     if not os.path.isabs(base_dir_real):
@@
         raise create_http_exception_from_error(error_response, 400)
+    # Validate base_dir exists and is a directory
+    if not os.path.isdir(base_dir_real):
+        security_error = SecurityErrorImpl(
+            violation_type="invalid_base_directory",
+            attempted_action=f"Set base directory to: {base_dir}",
+            allowed_scope="existing absolute directories only",
+            message=f"Security restriction: base directory does not exist or is not a directory: {base_dir_real}."
+        )
+        error_response = handle_error(security_error, "Path validation")
+        raise create_http_exception_from_error(error_response, 400)
@@
-    # Validate absolute base_dir
-    if not os.path.isabs(base_dir_real):
+    # Validate absolute base_dir
+    if not os.path.isabs(base_dir_real):
         raise HTTPException(400, "Security restriction: base directory must be absolute.")
+    # Validate base_dir exists and is a directory
+    if not os.path.isdir(base_dir_real):
+        raise HTTPException(
+            400,
+            f"Security restriction: base directory does not exist or is not a directory: {base_dir_real}."
+        )

Also applies to: 182-200
deploy/docker/api.py (1)
86-98: Fix: unwrap CrawlResultContainer from arun() before accessing fields.
AsyncWebCrawler.arun() returns a container; direct access to .success/.markdown will break. Unwrap the first result before checks and content extraction. Apply this diff:
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(url)
-        if not result.success:
-            from server import _extract_user_friendly_error
-            error_msg = result.error_message or "Failed to fetch content"
-            friendly_error = _extract_user_friendly_error(error_msg, "LLM QA")
-            raise HTTPException(
-                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                detail=friendly_error
-            )
-        content = result.markdown.fit_markdown or result.markdown.raw_markdown
+    async with AsyncWebCrawler() as crawler:
+        container = await crawler.arun(url)
+        results = normalize_results_protocol_dispatch(container)
+        result_obj = results[0] if results else None
+        if not result_obj or not getattr(result_obj, "success", False):
+            from server import _extract_user_friendly_error
+            error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
+            friendly_error = _extract_user_friendly_error(error_msg, "LLM QA")
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail=friendly_error
+            )
+        content = result_obj.markdown.fit_markdown or result_obj.markdown.raw_markdown

deploy/docker/server.py (4)
839-839: Security: Binding to all interfaces (0.0.0.0)
The code correctly maps 0.0.0.0 to 127.0.0.1 for internal proxy calls, but be aware that binding to 0.0.0.0 exposes the service to all network interfaces, which could be a security risk in production.
The mapping logic on line 839 is correct for internal proxy calls. Ensure that the Docker container networking and firewall rules properly restrict external access if the service shouldn't be publicly accessible.
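The bind-vs-connect distinction behind this mapping can be sketched as a tiny standalone helper; the function name and structure are illustrative, not the PR's actual code:

```python
def internal_base_url(bind_host: str, port: int) -> str:
    """Build a URL for the server to call itself.

    "0.0.0.0" is a bind address (listen on all interfaces), not a routable
    destination, so self-calls must target the loopback interface instead.
    """
    connect_host = "127.0.0.1" if bind_host == "0.0.0.0" else bind_host
    return f"http://{connect_host}:{port}"

print(internal_base_url("0.0.0.0", 11235))  # http://127.0.0.1:11235
```

Any concrete bind address (e.g. a LAN IP) passes through unchanged; only the wildcard address needs rewriting.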
972-976: JWT authentication limitation is correctly documented
Good that you've documented the JWT authentication limitation for MCP tools. This is an important known issue for users to be aware of.
425-426: Good fix for concurrent file collision issue
The use of UUID for generating unique filenames prevents race conditions when multiple requests for the same host occur simultaneously. This properly addresses the previously identified collision issue.
Based on learnings
Also applies to: 490-491
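The collision-avoidance idea can be shown with a minimal sketch; the helper name is hypothetical and the PR's actual implementation may differ:

```python
import uuid
from urllib.parse import urlparse

def unique_export_name(url: str, ext: str) -> str:
    # Host-only names collide when two requests for the same site race;
    # a uuid4 suffix makes each filename unique without any coordination.
    host = urlparse(url).netloc or "page"
    return f"{host}-{uuid.uuid4().hex}.{ext}"

a = unique_export_name("https://example.com/docs", "png")
b = unique_export_name("https://example.com/docs", "png")
print(a != b)  # True: same host, distinct files
```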
303-314: Good backward compatibility handling
The implementation thoughtfully preserves backward compatibility by supporting deprecated query parameters while encouraging migration to request body parameters.
deploy/docker/error_protocols.py (1)
1-239: Well-structured error handling system
The protocol-based error handling approach with runtime-checkable protocols and concrete implementations is excellent. The separation of protocols from implementations provides good flexibility and testability.
deploy/docker/schemas.py (3)
1-200: Clean schema definitions with comprehensive error handling
The schema definitions are well-structured with clear field descriptions. The standardized response classes with factory methods provide a consistent API surface. The integration with the error protocol system is well done.
11-11: Good addition of output_path fields for binary data
Adding output_path fields to CrawlRequest, ScreenshotRequest, and PDFRequest is an excellent design choice. This prevents token/context limit issues with large base64 payloads and provides better performance.
Also applies to: 40-40, 44-44
16-28: Well-documented field migrations in MarkdownRequest
The renaming from f/q/c to filter/query/cache with proper Field descriptions improves API clarity. Backward compatibility is maintained through query parameters, as seen in the server implementation.
    # Only use abspath if it's not already an absolute path
    if os.path.isabs(clean_path):
        abs_path = os.path.normpath(clean_path)
    else:
        abs_path = os.path.abspath(clean_path)

    # Validate path is within allowed directory
    _ensure_within_base_dir(abs_path, binary_base_dir)
Fix: resolve relative output_path under the allowed base directory.
Using abspath for relatives can resolve outside the allowed dir and then be rejected by _ensure_within_base_dir. Join with base dir for relatives.
Apply this diff:
- # Only use abspath if it's not already an absolute path
- if os.path.isabs(clean_path):
- abs_path = os.path.normpath(clean_path)
- else:
- abs_path = os.path.abspath(clean_path)
+ # Resolve relative paths inside the allowed base directory
+ if os.path.isabs(clean_path):
+ abs_path = os.path.normpath(clean_path)
+ else:
+ abs_path = os.path.normpath(os.path.join(binary_base_dir, clean_path))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Only use abspath if it's not already an absolute path | |
| if os.path.isabs(clean_path): | |
| abs_path = os.path.normpath(clean_path) | |
| else: | |
| abs_path = os.path.abspath(clean_path) | |
| # Validate path is within allowed directory | |
| _ensure_within_base_dir(abs_path, binary_base_dir) | |
| # Resolve relative paths inside the allowed base directory | |
| if os.path.isabs(clean_path): | |
| abs_path = os.path.normpath(clean_path) | |
| else: | |
| abs_path = os.path.normpath(os.path.join(binary_base_dir, clean_path)) | |
| # Validate path is within allowed directory | |
| _ensure_within_base_dir(abs_path, binary_base_dir) |
🤖 Prompt for AI Agents
In deploy/docker/api.py around lines 553 to 561, the code currently uses
os.path.abspath for relative clean_path which can escape the allowed base dir
and then be rejected; instead, if clean_path is absolute, normalize it with
os.path.normpath; if it is relative, join it with binary_base_dir
(os.path.join(binary_base_dir, clean_path)) and then normalize that result;
finally call _ensure_within_base_dir on the resulting abs_path so relative
outputs are resolved inside the allowed base directory.
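The suggested behavior can be sketched as a standalone helper; `resolve_under_base` and the `commonpath` containment check below are illustrative stand-ins for the PR's `_ensure_within_base_dir` logic, assuming POSIX paths:

```python
import os

def resolve_under_base(base_dir: str, user_path: str) -> str:
    """Resolve user_path so that relative inputs land inside base_dir."""
    if os.path.isabs(user_path):
        abs_path = os.path.normpath(user_path)
    else:
        # Join first, then normalize: "../" segments collapse before the check
        abs_path = os.path.normpath(os.path.join(base_dir, user_path))
    # Containment check, standing in for _ensure_within_base_dir
    if os.path.commonpath([abs_path, base_dir]) != os.path.normpath(base_dir):
        raise ValueError(f"path escapes base directory: {abs_path}")
    return abs_path

print(resolve_under_base("/srv/exports", "shots/page.png"))  # /srv/exports/shots/page.png
```

With this ordering, a relative input like `../../etc/passwd` normalizes to a path outside the base directory and is rejected instead of silently resolving against the process's working directory.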
    # Check protocols explicitly since singledispatch doesn't handle protocols
    if isinstance(error, NetworkError):
        return ErrorResponse.network_error(
            error_code=error.error_code,
            message=error.message,
            operation=operation,
            suggestions=_get_network_error_suggestions(error.error_code)
        )
    elif isinstance(error, ValidationError):
        return ErrorResponse.validation_error(
            field_name=error.field_name,
            invalid_value=error.invalid_value,
            expected_format=error.expected_format,
            suggestions=[
                f"Provide a valid {error.expected_format} for {error.field_name}",
                "Check the API documentation for valid input formats"
            ]
        )
    elif isinstance(error, SecurityError):
        return ErrorResponse.security_error(
            violation_type=error.violation_type,
            attempted_action=error.attempted_action,
            allowed_scope=error.allowed_scope,
            suggestions=[
                f"Ensure your action is within the allowed scope: {error.allowed_scope}",
                "Check the security restrictions in the API documentation"
            ]
        )
    elif isinstance(error, ConfigurationError):
        return ErrorResponse.configuration_error(
            config_key=error.config_key,
            config_value=error.config_value,
            suggestion=error.suggestion
        )
    elif isinstance(error, ProcessingError):
        return ErrorResponse.processing_error(
            processing_stage=error.processing_stage,
            data_type=error.data_type,
            recovery_suggestion=error.recovery_suggestion,
            operation=operation
        )

    # Fallback to generic error handling
Bug: isinstance() on Protocols requires @runtime_checkable.
These isinstance checks will raise TypeError unless the Protocols are marked runtime_checkable.
Add @runtime_checkable to the Protocols in deploy/docker/error_protocols.py:
+from typing import runtime_checkable
@@
- class NetworkError(Protocol):
+@runtime_checkable
+class NetworkError(Protocol):
@@
- class ValidationError(Protocol):
+@runtime_checkable
+class ValidationError(Protocol):
@@
- class SecurityError(Protocol):
+@runtime_checkable
+class SecurityError(Protocol):
@@
- class ConfigurationError(Protocol):
+@runtime_checkable
+class ConfigurationError(Protocol):
@@
- class ProcessingError(Protocol):
+@runtime_checkable
+class ProcessingError(Protocol):
Alternatively, replace isinstance checks here with attribute-based duck typing.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In deploy/docker/error_handler.py lines 34-76 the isinstance checks against
Protocol types will raise TypeError unless those Protocols are marked
runtime-checkable; update the Protocol definitions in
deploy/docker/error_protocols.py by adding from typing import runtime_checkable
and decorating each Protocol used here with @runtime_checkable so the isinstance
checks work at runtime, or alternatively replace these isinstance checks in this
file with explicit attribute-based duck-typing (e.g., check for the specific
attributes like error_code, field_name, violation_type, etc.) and then construct
the corresponding ErrorResponse branches.
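A self-contained illustration of the structural check that @runtime_checkable enables (the protocol and class names below are made up for the example). Without the decorator, the isinstance() call raises TypeError, which is exactly the bug the review flags:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class NetworkErrorLike(Protocol):
    # Structural protocol: any object exposing these attributes matches
    error_code: str
    message: str

class DnsFailure:
    # Note: no inheritance from the protocol, only matching attributes
    error_code = "DNS_FAILURE"
    message = "could not resolve host"

print(isinstance(DnsFailure(), NetworkErrorLike))  # True
```

Keep in mind that runtime_checkable isinstance() checks only verify member presence, not types or method signatures, so the attribute-based duck-typing alternative mentioned above is roughly equivalent in strictness.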
def _resolve_binary_base_dir() -> str:
    base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
    return os.path.abspath(base_dir)
Security concern: Insecure temporary directory usage
Using /tmp/crawl4ai-exports is a security risk as it's world-writable and shared among all users. This could lead to path traversal, symlink attacks, or unauthorized access to exported files.
Consider using a user-specific directory or configuring a secure location:
def _resolve_binary_base_dir() -> str:
- base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
+ # Use container's app directory or a secure user-specific location
+ base_dir = config.get("binary", {}).get("default_dir", "/app/exports")
return os.path.abspath(base_dir)
Alternatively, use Python's tempfile module for secure temporary directories:
import tempfile

def _resolve_binary_base_dir() -> str:
    base_dir = config.get("binary", {}).get("default_dir", None)
    if base_dir is None:
        # Create a secure user-specific temp directory
        base_dir = os.path.join(tempfile.gettempdir(), f"crawl4ai-exports-{os.getuid()}")
    return os.path.abspath(base_dir)

🧰 Tools
🪛 Ruff (0.13.1)
204-204: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"
(S108)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 203 to 206, the function currently
defaults to the world-writable path "/tmp/crawl4ai-exports" which is insecure;
change it to use a config override when provided and otherwise build a
user-specific temp directory (e.g., join tempfile.gettempdir() with a name
containing the current uid) instead of a fixed /tmp path, ensure the directory
is created securely with restricted permissions (mkdir with 0o700) and
verify/resolve its real path to avoid symlink attacks; add the tempfile import
and ensure the function returns the absolute, real path.
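A runnable sketch of that suggestion, assuming a Unix host (os.getuid is POSIX-only); the function name and config-override parameter are hypothetical:

```python
import os
import tempfile
from typing import Optional

def secure_export_dir(configured: Optional[str] = None) -> str:
    # A per-user directory name avoids clashes with other users' files in /tmp
    base = configured or os.path.join(
        tempfile.gettempdir(), f"crawl4ai-exports-{os.getuid()}"
    )
    base = os.path.realpath(base)  # resolve symlinks before trusting the path
    # mode applies only when the directory is created; an existing directory
    # keeps its permissions, so a chmod may be warranted in hardened setups
    os.makedirs(base, mode=0o700, exist_ok=True)
    return base

print(os.path.isdir(secure_export_dir()))  # True
```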
| """Legacy function - use error_handler.parse_network_error() instead.""" | ||
| from error_handler import parse_network_error, handle_error | ||
| network_error = parse_network_error(error_msg, operation) | ||
| error_response = handle_error(network_error, operation) | ||
| return error_response.error_message |
🛠️ Refactor suggestion | 🟠 Major
Remove shadow imports in legacy compatibility function
The function re-imports parse_network_error and handle_error which were already imported at the module level, causing shadowing.
Remove the redundant import statement:
def _extract_user_friendly_error(error_msg: str, operation: str = "operation") -> str:
"""Legacy function - use error_handler.parse_network_error() instead."""
- from error_handler import parse_network_error, handle_error
network_error = parse_network_error(error_msg, operation)
error_response = handle_error(network_error, operation)
return error_response.error_message
🧰 Tools
🪛 Ruff (0.13.1)
219-219: Redefinition of unused parse_network_error from line 69
Remove definition: parse_network_error
(F811)
219-219: Redefinition of unused handle_error from line 69
Remove definition: handle_error
(F811)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 218 to 222, the legacy compatibility
function re-imports parse_network_error and handle_error which shadows the
module-level imports; remove the redundant "from error_handler import
parse_network_error, handle_error" line so the function uses the
already-imported names, leaving the rest of the function unchanged and run
tests/lint to confirm no unresolved references.
| raise HTTPException(400, f"Configuration validation error: {str(e)}") | ||
| except Exception as e: | ||
| raise HTTPException(400, str(e)) | ||
| raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") | ||
|
|
🛠️ Refactor suggestion | 🟠 Major
Improve exception handling with chained exceptions
When re-raising exceptions, use from e to preserve the original exception chain for better debugging.
Apply this fix to preserve exception context:
try:
return JSONResponse(_safe_eval_config(raw.code.strip()))
except ValueError as e:
- raise HTTPException(400, f"Configuration validation error: {str(e)}")
+ raise HTTPException(400, f"Configuration validation error: {str(e)}") from e
except Exception as e:
- raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.")
+ raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") from e📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| raise HTTPException(400, f"Configuration validation error: {str(e)}") | |
| except Exception as e: | |
| raise HTTPException(400, str(e)) | |
| raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") | |
| try: | |
| return JSONResponse(_safe_eval_config(raw.code.strip())) | |
| except ValueError as e: | |
| raise HTTPException(400, f"Configuration validation error: {str(e)}") from e | |
| except Exception as e: | |
| raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") from e |
🧰 Tools
🪛 Ruff (0.13.1)
273-273: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
273-273: Use explicit conversion flag
Replace with conversion flag
(RUF010)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 273 to 276, the code re-raises new
HTTPException instances without chaining the original exception; update the
raise statements to use exception chaining (raise HTTPException(<status>,
<message>) from e) so the original exception context is preserved for
debugging—apply this to the Configuration validation except and the generic
except Exception block.
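A standalone illustration of what `from e` buys (the ConfigError class and parse_line function are invented for the example); it also uses the `!r` conversion flag that RUF010 recommends over calling str() inside f-strings:

```python
class ConfigError(Exception):
    pass

def parse_line(text: str) -> dict:
    try:
        key, value = text.split("=")
    except ValueError as e:
        # "from e" stores the original error on __cause__, so tracebacks read
        # "The above exception was the direct cause of ..." instead of the
        # misleading "During handling of another exception ..."
        raise ConfigError(f"bad config line: {text!r}") from e
    return {key.strip(): value.strip()}

try:
    parse_line("missing-equals")
except ConfigError as err:
    print(type(err.__cause__).__name__)  # ValueError
```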
👋 Update: Clean Architecture Approach
After getting the initial SSE→HTTP migration working in this PR, I've created a clean architectural refactor that delivers the same functionality with production-ready code quality.
New PR: #1525 - feat(mcp): migrate to modern HTTP transport with clean architecture refactor
Why a separate PR?
Same Result, Better Implementation:
This PR (#1519) represents the valuable iterative learning process that led to the clean solution in #1525. Both approaches solve the same problem - the new PR just presents it in a more maintainable, production-ready form. Thanks for understanding the approach! 🚀
Closes: #1316
Problem
MCP server suffered from persistent connection failures and "Unexpected message" errors due to unstable SSE transport. Issue remained unresolved for 2+ months affecting multiple users across versions 0.7.0-0.7.4.
Root cause:
AssertionError: Unexpected message: {'type': 'http.response.start'...} in Starlette middleware when handling SSE responses. The only workaround was rolling back to v0.6.0rc1-r2.
Solution
Core Transport Migration
- Replaced sse-starlette with stable HTTP-based communication
- mcp_bridge.py - dynamically exposes FastAPI routes as MCP tools via decorator pattern
- Exposed tools: md, html, screenshot, pdf, execute_js, crawl, ask
0.0.0.0→127.0.0.1for internal callsBinary Output Handling
Screenshot/PDF tools save to container filesystem to avoid base64 payload token limits.
File access options:
Potential future enhancement: S3/MinIO integration with pre-signed URLs could enable distributed deployments and LLM-friendly direct file access.
Port Configuration
HOST_PORTin.env(default: 11235).env.exampledocuments all build options (INSTALL_TYPE,ENABLE_GPU)Testing & Documentation
anyioand FastMCP clienttest_mcp_http_listing.py- Verifies tool/resource enumerationtest_mcp_socket.py- End-to-end testing of all 7 MCP tools/healthendpoint now returns server version (0.7.4)Technical Notes
mcp_bridge.pyintrospects FastAPI routes, generates typed wrappers dynamicallyAuthorizationheaders)Verification
Tested with Claude Code and custom MCP clients. All 7 tools functioning correctly on HTTP transport.
Summary by CodeRabbit
New Features
Changes
Bug Fixes
Documentation / Tests / Chores