@leoric-crown leoric-crown commented Sep 26, 2025

Closes: #1316

Problem

The MCP server suffered persistent connection failures and "Unexpected message" errors due to an unstable SSE transport. The issue went unresolved for over two months, affecting multiple users across versions 0.7.0-0.7.4.

Root cause: AssertionError: Unexpected message: {'type': 'http.response.start'...} raised in Starlette middleware while handling SSE responses. The only workaround was rolling back to v0.6.0rc1-r2.

Solution

Core Transport Migration

  • Replaced SSE with FastMCP HTTP transport - Removed sse-starlette, implemented stable HTTP-based communication
  • New mcp_bridge.py - Dynamically exposes FastAPI routes as MCP tools via decorator pattern
  • 7 MCP tools verified working: md, html, screenshot, pdf, execute_js, crawl, ask
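The decorator pattern can be pictured as tagging route handlers with MCP metadata, then discovering the tagged handlers by scanning the route list. A minimal sketch (illustrative names; the real implementation lives in deploy/docker/mcp_bridge.py):

```python
def mcp_tool(name=None):
    """Mark a route handler for exposure as an MCP tool."""
    def decorate(fn):
        fn.__mcp_kind__ = "tool"   # marker the bridge looks for
        fn.__mcp_name__ = name     # explicit tool name (or None to derive from the path)
        return fn
    return decorate

@mcp_tool(name="md")
async def get_markdown(url: str) -> str:
    # Stand-in for a real FastAPI handler.
    return f"# markdown for {url}"

def discover_tools(handlers):
    # The real bridge walks app.routes; here we scan plain functions.
    return {
        fn.__mcp_name__: fn
        for fn in handlers
        if getattr(fn, "__mcp_kind__", None) == "tool"
    }

tools = discover_tools([get_markdown, len])  # untagged callables are skipped
```

The bridge then wraps each discovered handler in an HTTP proxy and registers it with FastMCP under the tagged name.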

Critical Fixes

  1. Internal proxy addressing - Maps 0.0.0.0 → 127.0.0.1 for internal calls
  2. Parameter serialization - Fixed Pydantic default value handling, enum type conversions
  3. Response encoding - Eliminated double-JSON-encoding on GET responses
  4. Type coercion - Auto-converts string→int/float, normalizes single strings to lists
  5. Bridge hardening - Enhanced error handling and fallback mechanisms
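The type coercion in fix 4 can be sketched as a small helper that nudges string arguments from permissive MCP clients toward the route's annotated types (a simplified illustration, not the bridge's exact code):

```python
from typing import Union, get_args, get_origin

def coerce(value, annotation):
    """Best-effort coercion of an MCP client argument to a route's annotation."""
    origin = get_origin(annotation)
    if origin is Union:
        # Unwrap Optional[X] / Union[X, None] to the single concrete type.
        non_none = [a for a in get_args(annotation) if a is not type(None)]
        if len(non_none) == 1:
            annotation = non_none[0]
            origin = get_origin(annotation)
    if annotation in (int, float) and isinstance(value, str):
        return annotation(value)        # "30" -> 30, "0.5" -> 0.5
    if origin is list and isinstance(value, str):
        return [value]                  # single string -> one-element list
    return value
```

Values that already match (or cannot be coerced) pass through unchanged, so well-behaved clients are unaffected.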

Binary Output Handling

Screenshot/PDF tools save to the container filesystem to avoid exceeding base64 payload token limits.

File access options:

# Mount volume for direct access
docker run -v /host/path:/tmp/crawl4ai-exports ...

# Or configure default-to-file mode in config.yml
binary:
  default_mode: "file"
  default_dir: "/tmp/crawl4ai-exports"

Potential future enhancement: S3/MinIO integration with pre-signed URLs could enable distributed deployments and LLM-friendly direct file access.
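The inline-vs-file decision described above can be sketched as follows (an illustrative helper under assumed names, not the server's exact code; the real endpoints also restrict writes to a configured base directory):

```python
import base64
import os

def deliver_binary(data: bytes, output_path=None, default_mode="base64",
                   default_dir="/tmp/crawl4ai-exports"):
    """Return binary output either as a file path or as inline base64."""
    if output_path is None and default_mode == "file":
        # config.yml default-to-file mode: pick a path under default_dir.
        os.makedirs(default_dir, exist_ok=True)
        output_path = os.path.join(default_dir, "export.bin")
    if output_path:
        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
        with open(output_path, "wb") as f:
            f.write(data)
        return {"success": True, "path": output_path}
    # Inline fallback: base64 keeps the payload JSON-safe but costs tokens.
    return {"success": True, "data": base64.b64encode(data).decode()}
```

With the volume mount shown above, the returned path is directly readable on the host.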

Port Configuration

  • Host port now configurable via HOST_PORT in .env (default: 11235)
  • Container always runs on 11235 internally
  • New .env.example documents all build options (INSTALL_TYPE, ENABLE_GPU)

Testing & Documentation

  • New standalone test scripts using anyio and FastMCP client
    • test_mcp_http_listing.py - Verifies tool/resource enumeration
    • test_mcp_socket.py - End-to-end testing of all 7 MCP tools
  • Removed deprecated SSE tests
  • Updated README with JWT auth limitation documentation
  • /health endpoint now returns server version (0.7.4)

Technical Notes

  • mcp_bridge.py introspects FastAPI routes, generates typed wrappers dynamically
  • Preserves Pydantic schemas for strict MCP clients (Gemini compatibility)
  • Lifespan composition pattern for proper initialization order
  • JWT auth limitation documented (MCP tools don't forward Authorization headers)
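The lifespan composition pattern nests the two async context managers so the MCP session manager starts first and shuts down last, outliving the crawler resources it may call into. A runnable sketch (illustrative; the real combine_lifespans is in deploy/docker/mcp_bridge.py):

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def mcp_lifespan(app):
    events.append("mcp start")
    yield
    events.append("mcp stop")

@asynccontextmanager
async def crawler_lifespan(app):
    events.append("crawler start")
    yield
    events.append("crawler stop")

def combine_lifespans(outer, inner):
    # Outer enters first and exits last, guaranteeing initialization order.
    @asynccontextmanager
    async def combined(app):
        async with outer(app):
            async with inner(app):
                yield
    return combined

async def main():
    async with combine_lifespans(mcp_lifespan, crawler_lifespan)(None):
        events.append("serving")

asyncio.run(main())
```

FastAPI accepts the combined function as its single `lifespan=` argument, so both subsystems share one startup/shutdown cycle.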

Verification

Tested with Claude Code and custom MCP clients. All 7 tools functioning correctly on HTTP transport.

Summary by CodeRabbit

  • New Features

    • HTTP-accessible MCP at /mcp with tool/schema listing, JS-execution return values, and optional export of crawl/screenshot/PDF outputs to disk.
  • Changes

    • Internal port default bumped to 11235; host port configurable via HOST_PORT. Payload keys standardized (filter/query/cache). Dependency updates for MCP HTTP transport.
  • Bug Fixes

    • Better path/base-dir validation, friendlier error responses, safer handling of missing configs/defaults, and Docker headless-mode validation.
  • Documentation / Tests / Chores

    • Added .env.example, Docker docs updates, new HTTP MCP/JS/backward-compatibility and normalization tests; removed legacy SSE/WebSocket tests.

leoric-crown and others added 8 commits September 25, 2025 18:58
Replace deprecated SSE transport with FastMCP 2.12.3 HTTP (Streamable)
for improved compatibility and maintainability.

## Key Changes

### MCP Bridge (deploy/docker/mcp_bridge.py)
- Implement FastMCP HTTP transport with proper lifespan integration
- Preserve Pydantic model schemas by unwrapping models to field-level parameters
- Add proper Pydantic model serialization in HTTP proxy
- Fix Depends() parameter filtering to exclude auth/DI params from tool schemas
- Send Pydantic data as root JSON body (not wrapped) for FastAPI compatibility

### Server Configuration (deploy/docker/server.py)
- Combine MCP and crawler lifespans for proper initialization order
- Mount MCP HTTP endpoint at /mcp
- Update port configuration for FastAPI server (11235)
- Register tools dynamically from @mcp_tool decorated routes

### Config Handling (crawl4ai/async_configs.py)
- Fix null handling in BrowserConfig and CrawlerRunConfig dump() methods
- Ensure None values don't cause serialization errors in MCP tool schemas

### Infrastructure
- Add fastmcp==2.12.3 to requirements
- Update docker-compose.yml port mapping to match new configuration

## Testing Results

All 7/7 MCP tools verified working:
- ✅ md - Markdown extraction with filtering
- ✅ html - Preprocessed HTML for schema extraction
- ✅ ask - Crawl4AI library context search
- ✅ screenshot - Full-page PNG capture (writes to container filesystem)
- ✅ pdf - PDF document generation (writes to container filesystem)
- ✅ execute_js - JavaScript execution on pages
- ✅ crawl - Multi-URL crawling with configs

## Known Limitations

**Docker Deployment - File Output:**
- Screenshot and PDF tools write files inside the container filesystem
- Files are not accessible to remote MCP clients without additional setup
- Workarounds: Copy files from container, or mount volumes (requires same-host)
- Recommended future enhancement: MinIO integration for distributed file storage

**Local Deployment:**
- All tools fully functional with direct filesystem access

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Eliminate sse-starlette==2.2.1 from the Docker requirements to streamline dependencies following the migration to FastMCP HTTP transport.
Address all high-priority issues identified in the comprehensive code review:

## Critical Fixes

### Internal Proxy Connection (High Priority)
- Fix 0.0.0.0 base URL issue by mapping to 127.0.0.1 for internal HTTP proxy calls
- Prevents MCP tool failures when app.host is "0.0.0.0" (bind-only address)
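The fix amounts to rewriting the bind-only address before the bridge dials itself. A minimal sketch (illustrative helper name):

```python
def internal_base_url(host: str, port: int) -> str:
    """Build the base URL the MCP bridge uses for internal HTTP proxy calls.

    0.0.0.0 is a bind-only address: the server listens on all interfaces,
    but a client cannot connect to it, so loop back for self-calls.
    """
    connect_host = "127.0.0.1" if host == "0.0.0.0" else host
    return f"http://{connect_host}:{port}"
```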

### MCP Tool Schema Generation (High Priority)
- Fix Pydantic default value serialization causing syntax errors in wrapper functions
- Simplify all optional parameters to default to None, avoiding complex enum/object serialization
- Convert enum type annotations (FilterType) to base types (str) for better MCP client compatibility
- Resolves Gemini client "missing types in parameter schema" errors

### Response Processing (High Priority)
- Fix GET response double-encoding by attempting JSON parsing first, fallback to text
- Ensures proper structured data return instead of JSON-encoded strings
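The JSON-first parsing can be sketched as follows (illustrative; the point is to return the parsed object rather than re-serializing it, which is what produced the double-encoded strings):

```python
import json

def parse_response_text(text: str):
    """Parse a proxied GET response body: structured JSON when possible,
    raw text otherwise."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return text
```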

## Code Quality Improvements

### Lifespan Management (Medium Priority)
- Fix lifespan closure to use proper FastAPI app parameter instead of global reference
- Improves code clarity and prevents potential issues

### Documentation (Medium Priority)
- Document JWT authentication limitation - MCP tools don't forward Authorization headers
- Add clear comments about intentional MCP app mounting convention
- Provides guidance for JWT + MCP integration scenarios

## Verification

All 7/7 MCP tools now working correctly:
- ✅ md, html, ask, screenshot, pdf, execute_js, crawl
- ✅ No syntax errors in dynamic wrapper generation
- ✅ Proper type annotations for strict MCP clients (Gemini)
- ✅ All optional parameters properly handled with None defaults

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update wrapper functions to preprocess arguments, filtering out empty strings and None values to allow FastAPI to use its defaults.
- Ensure proper handling of model data for both tool and query parameter wrappers, improving compatibility with MCP client expectations.

This change addresses issues related to parameter serialization and enhances the robustness of the MCP tool integration.
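The preprocessing step can be sketched as a simple filter (illustrative; the idea is that omitted keys let FastAPI/Pydantic apply the route's own defaults instead of validating a bogus empty value):

```python
def preprocess_args(kwargs: dict) -> dict:
    """Drop None and empty-string arguments before proxying to FastAPI."""
    return {k: v for k, v in kwargs.items() if v is not None and v != ""}
```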
… binary output

- Add comprehensive parameter preprocessing in MCP bridge for Pydantic models
- Implement automatic type coercion for numeric fields (int/float) from strings
- Add list field normalization (single string to list conversion)
- Enhance binary output handling for screenshots and PDFs with configurable modes
- Update tool documentation with clearer parameter descriptions
- Add default-to-file mode configuration for binary exports
- Improve error handling and fallback mechanisms
- Remove deprecated version field from docker-compose.yml

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Make the Docker host port configurable via HOST_PORT environment variable
while keeping the container's internal port fixed at 11235.

Changes:
- Add .env.example with comprehensive documentation of all config options
  - HOST_PORT for customizing host port mapping (default: 11235)
  - INSTALL_TYPE options (default/all/torch/transformer) with feature descriptions
  - ENABLE_GPU flag with platform compatibility notes
- Update docker-compose.yml to use ${HOST_PORT:-11235}:11235 port mapping
- Simplify supervisord.conf (remove dynamic port config, fixed at 11235)
- Update README with two-file environment pattern (.env vs .llm.env)
- Add docker run example with custom port mapping

Architecture:
- Container always runs on port 11235 internally (simple, consistent)
- Host port is configurable without rebuilding the image
- Defaults maintain backward compatibility (11235:11235)

Usage:
  # Default port mapping
  docker compose up -d

  # Custom host port
  echo "HOST_PORT=8080" > .env
  docker compose up -d

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
coderabbitai bot commented Sep 26, 2025

Walkthrough

Replaces SSE/WebSocket MCP with an HTTP FastMCP mounted at /mcp and route-driven tool registration; adds binary export configuration and file-write support; tightens None handling in async configs; enriches Pydantic schemas and result normalization; makes Docker host port configurable; updates docs, tests, dependencies, and introduces structured error handling.

Changes

Cohort / File(s) Summary
Env & Docker config
/.env.example, docker-compose.yml, deploy/docker/README.md, deploy/docker/config.yml
Add .env.example (HOST_PORT, INSTALL_TYPE, ENABLE_GPU); make host port configurable (${HOST_PORT:-11235}); bump internal app.port to 11235; add binary.default_mode/binary.default_dir; update README/docs for .env/.llm.env, LLM provider selection, HOST_PORT usage, and MCP caveats.
MCP bridge & server
deploy/docker/mcp_bridge.py, deploy/docker/server.py, deploy/docker/requirements.txt
Replace SSE/WebSocket wiring with FastMCP HTTP: add create_mcp_server, register_tools_from_routes, lifespan combining, HTTP proxy wrappers, _sanitize_tool_name, mcp_tool/mcp_template decorators, mount MCP at /mcp, add /mcp/schema; swap dependency sse-starlette → fastmcp.
API, normalization & binary export
deploy/docker/api.py, deploy/docker/result_normalizer.py, deploy/docker/result_protocols.py, deploy/docker/utils.py
Add centralized result normalization, protocols, and path safety: support output_path for crawl and other outputs with _ensure_within_base_dir; normalize/serialize results (including binary handling), enrich error extraction, and update crawl handler signature to accept output_path.
Error protocols & handler
deploy/docker/error_protocols.py, deploy/docker/error_handler.py
Introduce protocol-based error types and concrete impls (Network/Validation/Security/Configuration/Processing) and a centralized handle_error + helpers to convert exceptions into structured ErrorResponse and HTTPExceptions.
Schemas & jobs
deploy/docker/schemas.py, deploy/docker/job.py
Enrich Pydantic Fields/descriptions; add CrawlRequest.output_path; rename Markdown fields to filter, query, cache; add screenshot/pdf output_path; add StandardizedResponse/ErrorResponse; LlmJobPayload uses query alias q and populate_by_name.
Server endpoints & lifecycle
deploy/docker/server.py, deploy/docker/mcp_bridge.py
Wire MCP lifecycle into FastAPI lifespan via combine_lifespans; expose /mcp/schema; add inline-binary size constants and base-dir helpers for binary exports; improved endpoint validation and user-facing errors.
Static UI & playground
deploy/docker/static/playground/index.html
Update client payload keys to filter, query, cache; re-enable /crawl/stream in endpoint map.
Tests: MCP & normalization & compatibility
tests/mcp/*, deploy/docker/test_*, deploy/docker/test_backward_compatibility.py
Add/migrate MCP tests from SSE/WebSocket to FastMCP HTTP Client, add MCP listing and JS-exec tests; add extensive normalization and error-handling unit tests; add backward-compatibility E2E tests for /md.
Docs & assets
docs/md_v2/core/docker-deployment.md, docs/md_v2/assets/llm.txt/txt/docker.txt, deploy/docker/README.md
Update /execute_js docs and examples (fresh page, js_execution_result.return_values), document MCP HTTP caveats (JWT header forwarding), port flexibility, provider config methods, and usage examples.
Async config loader
crawl4ai/async_configs.py
BrowserConfig.load and CrawlerRunConfig.load now return default instances when given None and guard deserialization with fallback kwargs.
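The tolerant load() behavior can be illustrated with a small stand-in class (names simplified; the real classes live in crawl4ai/async_configs.py):

```python
class BrowserConfigSketch:
    """Illustrative stand-in for the None-tolerant load() described above."""

    def __init__(self, headless: bool = True):
        self.headless = headless

    @classmethod
    def load(cls, data):
        if data is None:
            return cls()  # None -> default instance instead of raising
        try:
            return cls(**data)
        except TypeError:
            # Guarded deserialization: drop unknown keys, keep known kwargs.
            known = {k: v for k, v in data.items() if k in {"headless"}}
            return cls(**known)
```

This keeps MCP tool calls that omit a config (or send extra keys) from failing during deserialization.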

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client as MCP Client
  participant FastMCP as FastMCP HTTP (/mcp)
  participant Bridge as mcp_bridge
  participant App as FastAPI App
  participant Route as Route Handler

  rect rgb(245,250,255)
    note right of Client: Discovery
    Client->>FastMCP: ListTools / ListResources
    FastMCP-->>Client: Tools / Resources
  end

  rect rgb(250,255,245)
    note right of Client: Tool invocation (CallTool)
    Client->>FastMCP: CallTool(name, params)
    FastMCP->>Bridge: map -> proxy wrapper
    Bridge->>App: HTTP request (path, method, body)
    App->>Route: invoke handler (Pydantic models / params)
    alt success
      Route-->>App: JSON / text / file-path
      App-->>FastMCP: ToolResult (content blocks)
      FastMCP-->>Client: Result
    else error
      Route-->>App: raise HTTPException
      App-->>FastMCP: Error payload
      FastMCP-->>Client: Error result
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • unclecode

Poem

I mapped my ports to burrows neat,
11235 hums a steady beat.
FastMCP hops through /mcp gates,
tools answer with tidy crates.
Bunny bytes saved, file or base64 — I twitch my nose and rest a while. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)

  • Out of Scope Changes Check ⚠️ Warning: Besides the core SSE-to-HTTP transport migration and port configurability, the PR introduces a large set of new modules and tests (error_protocols, error_handler, result_normalizer, result_protocols, binary config blocks, job payload alias changes, static UI tweaks, and normalization/error-handling tests) that are not directly required to resolve the "Unexpected message" SSE bug described in issue #1316. Resolution: Consider removing or splitting the broad error-handling, result-normalization, job aliasing, UI, and binary configuration changes into separate PRs so that the bug fix remains focused and reviewable, and merge the transport migration independently.
  • Description Check ⚠️ Warning: The PR description does not match the repository's required template: it uses "Problem" and "Solution" headings instead of "## Summary", omits a dedicated "## List of files changed and why" section, lacks a standalone "## How Has This Been Tested?" heading, and does not include the required checklist. Resolution: Restructure the description to include a "## Summary" section, list modified files with reasons under "## List of files changed and why", add a "## How Has This Been Tested?" section detailing test steps and results, and include the template's checklist with completed items.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 69.12%, which is below the required threshold of 80.00%. Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The title clearly summarizes the core feature changes, namely migrating MCP from SSE to HTTP transport and introducing a configurable port, which aligns directly with the main changes in the PR.
  • Linked Issues Check ✅ Passed: The PR fully addresses issue #1316 by removing the unstable SSE transport and sse-starlette dependency, migrating to FastMCP HTTP-based routing with combined lifespans to fix ASGI middleware handling, mapping internal addresses, and adding tests to verify stable MCP tool functionality without the previous "Unexpected message" assertion or connection failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
deploy/docker/config.yml (1)

24-31: Duplicate Redis TLS keys in YAML.

Duplicate entries for ssl_cert_reqs, ssl_ca_certs, ssl_certfile, ssl_keyfile will shadow earlier values and confuse ops. Keep one set.

Apply this diff to remove the duplicates:

   ssl_cert_reqs: None
   ssl_ca_certs: None
   ssl_certfile: None
   ssl_keyfile: None
-
-  ssl_cert_reqs: None
-  ssl_ca_certs: None
-  ssl_certfile: None
-  ssl_keyfile: None
deploy/docker/README.md (1)

713-721: Update config.yml example port to 11235 to match the rest of the docs

The example still shows 8020 which contradicts the standardized 11235 elsewhere in the README and compose.

Apply this diff:

   host: "0.0.0.0"
-  port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
+  port: 11235 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
deploy/docker/server.py (3)

311-333: Arbitrary file write via user-supplied output_path; add allowlist and log file-write failures

Endpoints accept an absolute path and write to it without restriction. This enables arbitrary file writes inside the container. Also, except Exception as e: pass drops errors silently.

Apply this diff to restrict writes to a configured base directory and log failures:

-    if body.output_path:
-        abs_path = os.path.abspath(body.output_path)
-        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
-        with open(abs_path, "wb") as f:
-            f.write(base64.b64decode(screenshot_data))
-        return {"success": True, "path": abs_path}
+    if body.output_path:
+        abs_path = os.path.abspath(body.output_path)
+        base_dir = os.path.abspath(config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports"))
+        # Enforce that writes stay within the configured base directory
+        try:
+            common = os.path.commonpath([base_dir, abs_path])
+        except Exception:
+            common = ""
+        if common != base_dir:
+            raise HTTPException(400, f"output_path must be within {base_dir}")
+        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+        with open(abs_path, "wb") as f:
+            f.write(base64.b64decode(screenshot_data))
+        return {"success": True, "path": abs_path}
@@
-    except Exception as e:
-        # Fall through to inline on any file write error
-        pass
+    except OSError as e:
+        # Fall through to inline on any file write error
+        import logging; logging.getLogger(__name__).warning("screenshot file write failed: %s", e)

Optionally add a dedicated config key like binary.allowed_base_dir to avoid overloading default_dir. Based on learnings.


357-379: Apply the same write restriction and logging to PDF endpoint

Mirror the file-write safety for PDFs and avoid silent failures.

Apply this diff:

-    if body.output_path:
-        abs_path = os.path.abspath(body.output_path)
-        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
-        with open(abs_path, "wb") as f:
-            f.write(pdf_data)
-        return {"success": True, "path": abs_path}
+    if body.output_path:
+        abs_path = os.path.abspath(body.output_path)
+        base_dir = os.path.abspath(config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports"))
+        try:
+            common = os.path.commonpath([base_dir, abs_path])
+        except Exception:
+            common = ""
+        if common != base_dir:
+            raise HTTPException(400, f"output_path must be within {base_dir}")
+        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+        with open(abs_path, "wb") as f:
+            f.write(pdf_data)
+        return {"success": True, "path": abs_path}
@@
-    except Exception:
-        # Fall through to inline on any file write error
-        pass
+    except OSError as e:
+        # Fall through to inline on any file write error
+        import logging; logging.getLogger(__name__).warning("pdf file write failed: %s", e)

471-474: Avoid infinite self-redirect on /metrics when Prometheus is disabled

This route redirects to itself and will loop if instrumentation is disabled. Either remove the route or return a 404/plain message when disabled.

Apply this diff:

-@app.get(config["observability"]["prometheus"]["endpoint"])
-async def metrics():
-    return RedirectResponse(config["observability"]["prometheus"]["endpoint"])
+@app.get(config["observability"]["prometheus"]["endpoint"])
+async def metrics():
+    if config["observability"]["prometheus"]["enabled"]:
+        # Instrumentator already exposes metrics at this path; let it handle the request.
+        return PlainTextResponse("OK", status_code=200)
+    return PlainTextResponse("Prometheus metrics disabled", status_code=404)
🧹 Nitpick comments (16)
deploy/docker/config.yml (1)

93-96: Binary export defaults: ensure directory exists at runtime.

If default_mode is "file", create default_dir on startup to avoid 5xx on first write; consider a writable Docker volume by default.

deploy/docker/mcp_bridge.py (6)

35-36: Method selection: avoid arbitrary list indexing.

Set iteration is nondeterministic. Prefer next(iter(...)) as Ruff suggests.

Apply this diff:

-    method = list(route.methods - {"HEAD", "OPTIONS"})[0]
+    method = next(iter(route.methods - {"HEAD", "OPTIONS"}))

67-83: HTTP client: add explicit timeout and preserve exception chaining.

Explicit timeouts improve resiliency; raising from the original error preserves context.

Apply this diff:

-        async with httpx.AsyncClient() as client:
+        async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
@@
-                raise HTTPException(e.response.status_code, e.response.text)
+                raise HTTPException(e.response.status_code, e.response.text) from e

93-151: Deduplicate logic: delegate to register_tools_from_routes().

create_mcp_server reimplements wrapper generation and skips Optional[...] params; delegate to the canonical path to avoid drift and bugs.

Apply this diff:

 def create_mcp_server(
@@
 ) -> FastMCP:
-    """Create FastMCP server instance with registered tools."""
-    server_name = name or "FastAPI-MCP"
-
-    # Create FastMCP instance
-    mcp = FastMCP(name=server_name)
-
-    # Register tools by scanning decorated FastAPI routes
-    for route in routes:
-        fn = getattr(route, "endpoint", None)
-        kind = getattr(fn, "__mcp_kind__", None)
-
-        if kind != "tool":
-            continue
-
-        tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
-        proxy_fn = _make_http_proxy(base_url, route)
-        description = inspect.getdoc(fn) or f"Tool for {route.path}"
-
-        # Get function signature to create properly typed wrapper
-        sig = inspect.signature(fn)
-        params = []
-        for param_name, param in sig.parameters.items():
-            # Skip Request, Depends, and other FastAPI-specific params
-            if param.annotation in (inspect.Parameter.empty, type(None)):
-                continue
-            if hasattr(param.annotation, "__origin__"):  # Skip complex types like Depends
-                continue
-            params.append(param)
-
-        # Create function dynamically with explicit parameters
-        param_names = [p.name for p in params]
-        param_sig = ", ".join(param_names)
-
-        # Build the async function code
-        func_code = f"""
-async def {tool_name}_wrapper({param_sig}):
-    kwargs = {{{", ".join(f'"{name}": {name}' for name in param_names)}}}
-    result = await proxy_fn(**kwargs)
-    return json.dumps(result, default=str)
-"""
-
-        # Execute in a namespace that has access to proxy_fn and json
-        namespace = {"proxy_fn": proxy_fn, "json": json}
-        exec(func_code, namespace)
-        wrapper_fn = namespace[f"{tool_name}_wrapper"]
-
-        # Register the properly typed function
-        mcp.tool(name=tool_name, description=description)(wrapper_fn)
-
-    print(f"Registered {len([r for r in routes if getattr(getattr(r, 'endpoint', None), '__mcp_kind__', None) == 'tool'])} MCP tools")
-
-    return mcp
+    """Create FastMCP server instance with registered tools."""
+    server_name = name or "FastAPI-MCP"
+    mcp = FastMCP(name=server_name)
+    register_tools_from_routes(mcp, routes, base_url)
+    return mcp

176-180: Apply the same name sanitization here.

Ensure wrapper generation never produces invalid identifiers.

Apply this diff:

-        tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+        raw_name = fn.__mcp_name__ or route.path
+        tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+        if not re.match(r"[A-Za-z_]", tool_name):
+            tool_name = f"tool_{tool_name}"

152-161: Handle optional lifespans.

Guard against None to avoid TypeError in environments without an app lifespan.

Apply this diff:

-    async def combined(app):
-        # Run both lifespans - MCP outer, existing inner
-        async with mcp_lifespan(app):
-            async with existing_lifespan(app):
-                yield
+    async def combined(app):
+        # Run both lifespans - MCP outer, existing inner
+        if mcp_lifespan is None and existing_lifespan is None:
+            yield
+            return
+        if mcp_lifespan is None:
+            async with existing_lifespan(app):
+                yield
+            return
+        if existing_lifespan is None:
+            async with mcp_lifespan(app):
+                yield
+            return
+        async with mcp_lifespan(app):
+            async with existing_lifespan(app):
+                yield

132-148: Replace exec-based wrapper generation with a generic async wrapper

  • Swap both exec calls (deploy/docker/mcp_bridge.py:142, 376) for a single generic async wrapper that builds its signature via inspect.Signature and sets signature, annotations, and doc, eliminating exec and S102 violations.
.env.example (1)

45-45: Add trailing newline.

Fixes dotenv-linter EndingBlankLine warning.

Apply this diff:

-ENABLE_GPU=false
+ENABLE_GPU=false
+
tests/mcp/test_mcp_http_listing.py (1)

11-11: Make MCP_URL configurable via environment to honor custom HOST_PORT

Hardcoding 11235 will break when users set a custom HOST_PORT. Read from MCP_URL/HOST_PORT with a sensible default.

Apply this diff:

+import os
@@
-MCP_URL = "http://127.0.0.1:11235/mcp"
+MCP_URL = os.getenv("MCP_URL", f"http://127.0.0.1:{os.getenv('HOST_PORT', '11235')}/mcp")
deploy/docker/README.md (1)

226-231: Clarify URL when using custom HOST_PORT with compose

Tighten language so users don’t miss the port change when HOST_PORT is set.

Apply this diff:

-> The server will be available at `http://localhost:11235` (or your custom `HOST_PORT`).
+> The server will be available at `http://localhost:<HOST_PORT>` (default `11235`).
deploy/docker/server.py (2)

318-323: S108: consider safer default than /tmp and/or document mount path

Using a world-writable default like /tmp is acceptable in containers but still flagged. Consider defaulting to a project-specific dir with restricted permissions or ensure you document mounting a host volume and tighten directory perms on create (e.g., mode=0o750).

If you want, I can add mkdirs with explicit permissions and an env-configurable base dir (binary.allowed_base_dir) and open a follow-up PR.

Also applies to: 366-371


664-664: Nit: f-string without placeholders

Drop the f prefix.

Apply this diff:

-print(f"Mounted MCP app at / (MCP endpoint at /mcp)")
+print("Mounted MCP app at / (MCP endpoint at /mcp)")
tests/mcp/test_mcp_socket.py (4)

14-15: Honor custom HOST_PORT and allow override via MCP_URL

Align with docker-compose/README by reading env and defaulting to 11235.

Apply this diff:

-from fastmcp.client import Client
+from fastmcp.client import Client
+import os
@@
-MCP_URL = "http://localhost:11235/mcp"
+MCP_URL = os.getenv("MCP_URL", f"http://localhost:{os.getenv('HOST_PORT', '11235')}/mcp")

17-24: Guard _first_text_block with a clearer error type

Raising RuntimeError is fine; alternatively, ValueError keeps it in the “bad input” family. Optional.

No change required; just a note.


55-68: Robust handling of path vs base64 for binary tools

Nice ergonomics. Consider truncation length constants to avoid magic numbers, but optional.

Also applies to: 70-82


84-97: Potential flakiness: execute_js against HN

Interacting with HN UI can be brittle. Consider a more static target or add retries/timeouts, but fine for a smoke test.

If needed, I can switch to a static sandbox page in a follow-up commit.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e651e04 and b18fe57.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .env.example (1 hunks)
  • crawl4ai/async_configs.py (2 hunks)
  • deploy/docker/README.md (4 hunks)
  • deploy/docker/config.yml (2 hunks)
  • deploy/docker/mcp_bridge.py (2 hunks)
  • deploy/docker/requirements.txt (1 hunks)
  • deploy/docker/schemas.py (2 hunks)
  • deploy/docker/server.py (13 hunks)
  • docker-compose.yml (2 hunks)
  • tests/mcp/test_mcp_http_listing.py (1 hunks)
  • tests/mcp/test_mcp_socket.py (1 hunks)
  • tests/mcp/test_mcp_sse.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/mcp/test_mcp_sse.py
🧰 Additional context used
🧬 Code graph analysis (5)
deploy/docker/schemas.py (1)
deploy/docker/utils.py (1)
  • FilterType (16-20)
tests/mcp/test_mcp_socket.py (1)
tests/mcp/test_mcp_http_listing.py (1)
  • main (14-22)
crawl4ai/async_configs.py (1)
tests/docker/test_serialization.py (1)
  • from_serializable_dict (67-106)
tests/mcp/test_mcp_http_listing.py (1)
tests/mcp/test_mcp_socket.py (1)
  • main (123-134)
deploy/docker/server.py (6)
deploy/docker/api.py (5)
  • handle_crawl_request (418-510)
  • handle_llm_qa (61-108)
  • handle_markdown_request (175-240)
  • handle_stream_crawl_request (512-561)
  • stream_results (385-416)
deploy/docker/auth.py (3)
  • TokenRequest (54-55)
  • create_access_token (23-29)
  • get_token_dependency (45-51)
deploy/docker/crawler_pool.py (3)
  • close_all (47-50)
  • get_crawler (22-46)
  • janitor (52-60)
deploy/docker/mcp_bridge.py (3)
  • combine_lifespans (152-160)
  • mcp_tool (27-31)
  • register_tools_from_routes (162-392)
deploy/docker/schemas.py (5)
  • CrawlRequest (7-10)
  • MarkdownRequest (12-27)
  • PDFRequest (41-43)
  • RawCode (30-31)
  • ScreenshotRequest (36-39)
deploy/docker/utils.py (4)
  • FilterType (16-20)
  • load_config (22-40)
  • setup_logging (42-47)
  • verify_email_domain (120-127)
🪛 Ruff (0.13.1)
deploy/docker/schemas.py

23-23: String contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF001)

deploy/docker/mcp_bridge.py

35-35: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice

Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))

(RUF015)


82-82: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


120-120: Loop control variable param_name not used within loop body

Rename unused param_name to _param_name

(B007)


142-142: Use of exec detected

(S102)


228-228: Do not catch blind exception: Exception

(BLE001)


236-236: Do not catch blind exception: Exception

(BLE001)


334-334: Do not catch blind exception: Exception

(BLE001)


376-376: Use of exec detected

(S102)


386-386: Do not catch blind exception: Exception

(BLE001)

tests/mcp/test_mcp_socket.py

23-23: Avoid specifying long messages outside the exception class

(TRY003)

deploy/docker/server.py

321-321: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


331-333: try-except-pass detected, consider logging the exception

(S110)


331-331: Do not catch blind exception: Exception

(BLE001)


331-331: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


366-366: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


376-378: try-except-pass detected, consider logging the exception

(S110)


376-376: Do not catch blind exception: Exception

(BLE001)


654-654: Possible binding to all interfaces

(S104)


664-664: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 dotenv-linter (3.3.0)
.env.example

[warning] 45-45: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🔇 Additional comments (13)
crawl4ai/async_configs.py (2)

617-623: Nice guard for missing browser config payloads

Returning a default BrowserConfig when the serialized blob is None avoids the old AttributeError path and keeps callers from having to special-case empty configs. Thanks for tightening this up.


1553-1559: Matching improvement for crawler config defaults

Mirroring the same None handling here means both loaders now behave consistently and stop tripping on absent payloads—exactly what we need for the new HTTP transport defaults.
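The guard pattern both loaders now share can be sketched as follows (a standalone illustration with a stand-in dataclass, not the actual crawl4ai classes):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BrowserConfig:  # stand-in for the real crawl4ai config class
    headless: bool = True
    extra_args: List[str] = field(default_factory=list)

def load_browser_config(payload: Optional[dict]) -> BrowserConfig:
    # Guard: an absent/None serialized blob yields safe defaults
    # instead of raising AttributeError when accessing the payload.
    if payload is None:
        return BrowserConfig()
    return BrowserConfig(**payload)
```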

deploy/docker/config.yml (1)

6-6: Port change LGTM.

Consistent with docker-compose HOST_PORT and tests referencing 11235.

deploy/docker/schemas.py (2)

14-27: Field metadata improvements LGTM.

Descriptions and explicit Field usage improve schema clarity and MCP tool docs.


38-43: Binary outputs: defaults make sense.

2s wait and optional output_path align with file-based exports in this PR.

.env.example (1)

1-45: .env example LGTM.

Clear docs for HOST_PORT/INSTALL_TYPE/ENABLE_GPU; matches docker-compose usage.

deploy/docker/requirements.txt (2)

7-7: Adjust JWT dependency and pin mcp

  • Remove PyJWT==2.10.1 from deploy/docker/requirements.txt; code imports JWT and related utilities from python-jwt (jwt>=1.3.1).
  • Pin mcp to a known-good version compatible with fastmcp==2.12.3 for reproducible builds.

Likely an incorrect or invalid review comment.


15-15: fastmcp pin approved

fastmcp==2.12.3 is the latest 2.x release and compatible with mcp>=1.6.0; continue monitoring compatibility on future upgrades.

docker-compose.yml (2)

4-4: LGTM: dynamic host port mapping via HOST_PORT

Keeps container port stable while allowing flexible host binding. Healthcheck still targets the internal port, which is correct.


48-48: Compose anchor inheritance looks correct

Service correctly inherits the updated base-config.

tests/mcp/test_mcp_socket.py (2)

26-37: Test payloads and printing look correct for crawl

Good coverage and concise output.


123-139: End-to-end flow over HTTP transport looks good

Lists tools then exercises all tools. Matches the migration away from SSE.

deploy/docker/server.py (1)

651-659: MCP base_url host mapping looks good

Mapping 0.0.0.0 → 127.0.0.1 avoids invalid proxy targets. Tool registration after route creation is appropriate. No additional SSE or WebSocket endpoints require adjustment.
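For illustration, the mapping could be as simple as (function name is hypothetical; 0.0.0.0 is a listen address, not a connectable one):

```python
def internal_base_url(bind_host: str, port: int) -> str:
    """Return a connectable base URL for internal self-proxy calls,
    mapping the wildcard bind address to loopback."""
    connect_host = "127.0.0.1" if bind_host == "0.0.0.0" else bind_host
    return f"http://{connect_host}:{port}"
```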

…d timeout

- Removed explicit environment variable definitions for API keys in docker-compose.yml to enhance security and streamline configuration.
- Increased timeout for LLM operations in mcp_bridge.py to 60 seconds to accommodate longer processing times.

This change improves the security of sensitive information and ensures better handling of LLM requests.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
deploy/docker/mcp_bridge.py (4)

83-83: Add exception chaining for better error context.

Missing exception chaining makes debugging harder when HTTP errors occur.

Apply this diff:

-                raise HTTPException(e.response.status_code, e.response.text)
+                raise HTTPException(e.response.status_code, e.response.text) from e

143-143: Consider security implications of exec() usage.

Dynamic code execution via exec() poses security risks and makes debugging difficult. While the generated code appears safe in this context, consider using alternative approaches.

For improved security and maintainability, consider using factory functions or decorators instead of exec():

def create_wrapper(tool_name: str, proxy_fn, param_names: list):
    async def wrapper(**kwargs):
        filtered_kwargs = {k: v for k, v in kwargs.items() if k in param_names}
        result = await proxy_fn(**filtered_kwargs)
        return json.dumps(result, default=str)
    
    wrapper.__name__ = f"{tool_name}_wrapper"
    return wrapper

Also applies to: 377-377


229-229: Consider more specific exception handling.

Multiple broad exception catches could mask important errors during type introspection and wrapper generation.

Replace broad exception handling with specific exceptions where possible:

-                    except Exception:
+                    except (AttributeError, TypeError):
                        origin = None

Similar changes should be applied to other broad exception catches for better error diagnostics.

Also applies to: 237-237, 335-335, 387-387


94-152: Use streamable-http transport with FastMCP v2.12.3
FastMCP 2.12.3 defaults to the new streamable-http transport (legacy http+sse is only a fallback). Ensure your _make_http_proxy calls and base_url target a plain HTTP (streamable) endpoint for tool registration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b18fe57 and d9fd23c.

📒 Files selected for processing (2)
  • deploy/docker/mcp_bridge.py (2 hunks)
  • docker-compose.yml (2 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py

35-35: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice

Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))

(RUF015)


83-83: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


121-121: Loop control variable param_name not used within loop body

Rename unused param_name to _param_name

(B007)


143-143: Use of exec detected

(S102)


229-229: Do not catch blind exception: Exception

(BLE001)


237-237: Do not catch blind exception: Exception

(BLE001)


335-335: Do not catch blind exception: Exception

(BLE001)


377-377: Use of exec detected

(S102)


387-387: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (7)
docker-compose.yml (1)

4-4: LGTM! Well-designed configurable port mapping.

The dynamic host port mapping using ${HOST_PORT:-11235} allows for flexible deployment while maintaining a consistent container port. This is a best practice for Docker configurations.

deploy/docker/mcp_bridge.py (6)

49-56: Path parameter replacement fails for typed params.

FastAPI paths like {file_path:path} won't match "{k}". Use a regex to replace typed variants.

Apply this diff:

-            for k, v in list(kwargs.items()):
+            for k, v in list(kwargs.items()):
                 placeholder = "{" + k + "}"
-                if placeholder in path:
-                    # Path parameter - ensure safe URL encoding
-                    encoded_value = urllib.parse.quote(str(v), safe="")
-                    path = path.replace(placeholder, encoded_value)
-                    kwargs.pop(k)
+                # Path parameter - support typed params e.g. {k:path}
+                encoded_value = urllib.parse.quote(str(v), safe="")
+                pattern = re.compile(r"\{" + re.escape(k) + r"(?:\:[^}]*)?\}")
+                if pattern.search(path):
+                    path = pattern.sub(encoded_value, path)
+                    kwargs.pop(k)

76-81: JSON parsing improvement looks good.

The unified JSON parsing approach for both GET and other methods helps avoid double-encoding issues, with appropriate fallback to text.


114-114: Sanitize tool names to valid identifiers.

Route paths may contain characters invalid in function names (hyphens, dots), causing exec() syntax errors.

Apply this diff for both occurrences:

-        tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+        raw_name = fn.__mcp_name__ or route.path
+        tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+        if not re.match(r"[A-Za-z_]", tool_name):
+            tool_name = f"tool_{tool_name}"

Also applies to: 178-178


153-162: Excellent lifespan composition pattern.

The nested async context manager pattern correctly handles both MCP and existing lifespans, ensuring proper initialization order.


223-267: Robust Pydantic field handling with type coercion.

The implementation correctly handles Pydantic model fields, enum base types, and optional fields with proper defaults. The type coercion for numeric fields addresses common MCP client compatibility issues.


276-302: Smart MCP client quirk handling.

The preprocessing logic that handles empty strings, null values, list normalization, and numeric coercion addresses real-world MCP client compatibility issues effectively.
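That preprocessing can be sketched as a single coercion pass (a simplified illustration; the real bridge derives the expected types from the Pydantic field annotations rather than an explicit mapping):

```python
def coerce_params(raw: dict, expected: dict) -> dict:
    """Normalize quirky MCP-client inputs before model validation.
    `expected` maps param name -> target Python type."""
    cleaned = {}
    for key, value in raw.items():
        target = expected.get(key)
        if value is None or value == "":
            continue  # drop empties so Pydantic defaults apply
        if target is list and not isinstance(value, list):
            value = [value]  # single string -> one-element list
        elif target in (int, float) and isinstance(value, str):
            value = target(value)  # "42" -> 42, "1.5" -> 1.5
        cleaned[key] = value
    return cleaned
```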

- Fix custom host port example in README (8080:11235 vs 11235:11235)
- Update MCP documentation to HTTP-only transport (remove SSE/WS refs)
- Fix Claude CLI command to use --transport http
- Implement regex-based typed path parameter handling in mcp_bridge
- Improve tool name sanitization for exec safety
- Replace non-breaking hyphen with ASCII hyphen in schemas
- Add comprehensive docstrings and comments

Addresses CodeRabbit review feedback for better code quality and documentation accuracy.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
deploy/docker/mcp_bridge.py (5)

66-66: Select HTTP method deterministically; avoid list()[0].

Use an iterator and guard against empty method sets. Also improves readability.

-    method = list(route.methods - {"HEAD", "OPTIONS"})[0]
+    methods = route.methods - {"HEAD", "OPTIONS"}
+    method = next(iter(methods)) if methods else "GET"

114-115: Preserve exception cause.

Chain the original httpx error for better debuggability.

-            except httpx.HTTPStatusError as e:
-                raise HTTPException(e.response.status_code, e.response.text)
+            except httpx.HTTPStatusError as e:
+                raise HTTPException(e.response.status_code, e.response.text) from e

185-185: Prefer logging over print for production diagnostics.

Route/tool counts should go to structured logs, not stdout.

Minimal change (outside selected lines, for illustration):

# at top-level
import logging
logger = logging.getLogger(__name__)

# replace prints
logger.info("Registered %d MCP tools", ...)

Also applies to: 442-442


4-11: Remove unused imports.

BaseModel and Callable are unused.

-from __future__ import annotations
-import inspect, json, re
+from __future__ import annotations
+import inspect, json, re
@@
-from typing import Callable, List, get_origin, get_args
+from typing import List, get_origin, get_args
@@
-from pydantic import BaseModel
+# from pydantic import BaseModel  # Unused

100-101: Reuse httpx client to reduce connection churn.

Consider injecting a shared AsyncClient or a factory for long-lived use.

Based on learnings
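One shape for that: a lazily initialized pool owned by the lifespan (sketch only; in the bridge the factory would build an httpx.AsyncClient and `aclose` would run at shutdown):

```python
import asyncio

class ClientPool:
    """Hold one long-lived client for all proxied calls instead
    of opening a new connection per request."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None
        self._lock = asyncio.Lock()

    async def get(self):
        async with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client

    async def aclose(self):
        client, self._client = self._client, None
        if client is not None and hasattr(client, "aclose"):
            await client.aclose()
```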

deploy/docker/README.md (1)

322-325: Align README transport wording

  • The --transport http flag is already correct.
  • In deploy/docker/README.md (line 346), change “MCP WebSocket connection” to “MCP HTTP connection”.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9fd23c and 26cbc45.

📒 Files selected for processing (3)
  • deploy/docker/README.md (4 hunks)
  • deploy/docker/mcp_bridge.py (1 hunks)
  • deploy/docker/schemas.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/docker/schemas.py
🧰 Additional context used
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py

66-66: Prefer next(iter(route.methods - {"HEAD", "OPTIONS"})) over single element slice

Replace with next(iter(route.methods - {"HEAD", "OPTIONS"}))

(RUF015)


81-81: Local variable placeholder is assigned to but never used

Remove assignment to unused variable placeholder

(F841)


115-115: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


179-179: Use of exec detected

(S102)


278-278: Do not catch blind exception: Exception

(BLE001)


286-286: Do not catch blind exception: Exception

(BLE001)


384-384: Do not catch blind exception: Exception

(BLE001)


426-426: Use of exec detected

(S102)


436-436: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (5)
deploy/docker/mcp_bridge.py (2)

147-151: Good: tool-name sanitization prevents exec-time syntax errors.

Solid guard against invalid identifiers and code injection risks.


226-229: Sanitize tool names here too (exec safety + schema stability).

Without sanitization, names containing hyphens/dots/slashes can cause SyntaxError in the generated wrapper. Mirror the logic used in create_mcp_server.

-        tool_count += 1
-        tool_name = fn.__mcp_name__ or re.sub(r"[/{}}]", "_", route.path).strip("_")
+        tool_count += 1
+        raw_name = fn.__mcp_name__ or route.path
+        tool_name = re.sub(r"[^0-9a-zA-Z_]", "_", raw_name).strip("_")
+        if not re.match(r"[A-Za-z_]", tool_name):
+            tool_name = f"tool_{tool_name}"

This addresses the earlier review asking to sanitize tool names; it wasn’t applied in this code path.

deploy/docker/README.md (3)

118-128: LGTM: Correct custom host port example.

Using -p 8080:11235 and updating the access URL removes confusion.


226-231: LGTM: Compose HOST_PORT example and access note.

Clear guidance on overriding the host port.


308-317: LGTM: HTTP-only MCP endpoint and JWT limitation noted.

The limitation is documented clearly to prevent auth surprises.

- Use deterministic HTTP method selection with next(iter())
- Fix safe parameter popping to prevent KeyError
- Improve typed path parameter handling with slash preservation
- Fix parameter filtering to include Optional/List types
- Add proper exception chaining for better debugging
- Replace print statements with structured logging
- Remove unused variable assignment
- Remove unnecessary f-prefix from string literals

Addresses latest CodeRabbit review feedback for production readiness.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26cbc45 and cb8e095.

📒 Files selected for processing (2)
  • deploy/docker/mcp_bridge.py (1 hunks)
  • deploy/docker/server.py (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/server.py (6)
deploy/docker/api.py (3)
  • handle_crawl_request (418-510)
  • handle_stream_crawl_request (512-561)
  • stream_results (385-416)
deploy/docker/auth.py (3)
  • TokenRequest (54-55)
  • create_access_token (23-29)
  • get_token_dependency (45-51)
deploy/docker/crawler_pool.py (3)
  • close_all (47-50)
  • get_crawler (22-46)
  • janitor (52-60)
deploy/docker/mcp_bridge.py (2)
  • combine_lifespans (197-213)
  • register_tools_from_routes (215-455)
deploy/docker/schemas.py (6)
  • CrawlRequest (7-10)
  • HTMLRequest (33-34)
  • JSEndpointRequest (46-51)
  • MarkdownRequest (12-27)
  • RawCode (30-31)
  • ScreenshotRequest (36-39)
deploy/docker/utils.py (4)
  • FilterType (16-20)
  • load_config (22-40)
  • setup_logging (42-47)
  • verify_email_domain (120-127)
🪛 Ruff (0.13.1)
deploy/docker/mcp_bridge.py

185-185: Use of exec detected

(S102)


289-289: Do not catch blind exception: Exception

(BLE001)


297-297: Do not catch blind exception: Exception

(BLE001)


395-395: Do not catch blind exception: Exception

(BLE001)


437-437: Use of exec detected

(S102)


447-447: Do not catch blind exception: Exception

(BLE001)


449-449: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

deploy/docker/server.py

321-321: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


331-333: try-except-pass detected, consider logging the exception

(S110)


331-331: Do not catch blind exception: Exception

(BLE001)


331-331: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


366-366: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


376-378: try-except-pass detected, consider logging the exception

(S110)


376-376: Do not catch blind exception: Exception

(BLE001)


654-654: Possible binding to all interfaces

(S104)

leoric-crown and others added 2 commits September 26, 2025 10:45
- Replace timestamp-based filenames with UUID-based unique identifiers
- Fixes race condition where concurrent requests to same host would overwrite files
- Add uuid import for collision-resistant filename generation

Addresses CodeRabbit review feedback for file export safety.
- Add dual parameter support for /md endpoint (body params override query params)
- Fix JavaScript execution to return actual values in js_execution_result.return_values
- Resolve async generator iteration errors causing batch crawl failures
- Fix JSON import scope issues preventing proper error responses
- Enable /crawl/stream endpoint functionality in playground UI
- Standardize parameter naming from f/q/c to filter/query/cache across all integration points
- Add comprehensive backward compatibility test suite with 4 test scenarios
- Enhance error handling with user-friendly network/browser error messages
- Add path traversal protection for file export operations
- Restore /mcp/schema endpoint for MCP tool introspection and documentation

BREAKING: None - maintains full backward compatibility while enabling clean parameter migration

Fixes multiple runtime crashes and API inconsistencies while preserving existing functionality.
Major improvements to developer experience and system reliability.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/md_v2/core/docker-deployment.md (1)

286-301: Replace SSE/WS with HTTP FastMCP schema endpoint

Remove the SSE (/mcp/sse) and WebSocket (/mcp/ws) entries and instead document only the HTTP schema endpoint:

  • Schema: GET http://localhost:11235/mcp/schema

Update the example to:

# Add the Crawl4AI server as an HTTP FastMCP provider
claude mcp add --transport http c4ai-http http://localhost:11235/mcp/schema

# List all MCP providers to verify it was added
claude mcp list
🧹 Nitpick comments (7)
tests/mcp/test_mcp_js_exec.py (4)

117-121: Add timeout and configurable base URL to the error-handling test.

Mirror the same robustness here for consistency with the first test.

-    response = requests.post("http://localhost:11235/execute_js", json={
+    response = requests.post(f"{BASE_URL}/execute_js", json={
         "url": "https://httpbin.org/html",
         "scripts": scripts
-    })
+    }, timeout=30)

132-151: Strengthen error-object validation.

Relying on 'error' in str(dict).lower() is brittle. Check presence of an error key explicitly.

-    # Second should be an error object
-    if not isinstance(return_values[1], dict) or 'error' not in str(return_values[1]).lower():
-        print(f"❌ Second value should be error object, got {return_values[1]}")
-        return False
+    # Second should be an error object
+    assert isinstance(return_values[1], dict) and any(
+        k in return_values[1] for k in ("error", "message", "stack")
+    ), f"Second value should be error-like object, got {return_values[1]!r}"

61-61: Use f-string conversion flags instead of repr().

Cleaner and satisfies linters.

-        print(f"Script {i+1} returned: {repr(value)}")
+        print(f"Script {i+1} returned: {value!r}")
@@
-        print(f"❌ Expected 'test string', got {repr(test_str)}")
+        print(f"❌ Expected 'test string', got {test_str!r}")
@@
-        print(f"❌ Expected 42, got {repr(test_num)}")
+        print(f"❌ Expected 42, got {test_num!r}")

Also applies to: 92-92, 100-100


1-5: Mark as integration test or move out of pytest discovery.

This depends on an external server and network. Mark with a pytest marker and skip by default, or move under an integration directory excluded from CI.

Would you like a follow-up patch to add a pytest marker and env-gated skip (e.g., RUN_HTTP_TESTS) so CI stays green when the server isn’t up?
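Such a gate could look like this (RUN_HTTP_TESTS is a suggested name, not an existing project convention, and the test body is a placeholder):

```python
import os

import pytest

# Skip server-dependent HTTP tests unless explicitly enabled,
# so CI stays green when the crawl4ai server isn't running.
requires_server = pytest.mark.skipif(
    not os.getenv("RUN_HTTP_TESTS"),
    reason="set RUN_HTTP_TESTS=1 with the server on :11235 to run",
)

@requires_server
def test_execute_js_smoke():
    # ... existing request/assert logic goes here ...
    pass
```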

deploy/docker/api.py (3)

425-431: Use Optional[str] for output_path type.

PEP 484: prefer Optional[str] = None for clarity and tooling.

-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None

563-593: Tighten file export handling and error chaining.

  • Normalize base_dir before validation; create directories with restrictive perms.
  • Chain write errors for clearer tracebacks.
-        if output_path:
-            binary_base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
-            abs_path = os.path.abspath(output_path)
+        if output_path:
+            binary_base_dir = os.path.realpath(os.path.abspath(
+                config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
+            ))
+            abs_path = os.path.realpath(os.path.abspath(output_path))
@@
-            os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+            os.makedirs(os.path.dirname(abs_path), mode=0o700, exist_ok=True)
@@
-            except (OSError, TypeError) as write_error:
+            except (OSError, TypeError) as write_error:
                 logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-                raise HTTPException(
+                raise HTTPException(
                     status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=f"Failed to write to {abs_path}: {str(write_error)}"
-                )
+                    detail=f"Failed to write to {abs_path}: {write_error!s}"
+                ) from write_error
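The traversal guard this relies on can be sketched with stdlib calls only (a hypothetical standalone version; the real helper is `_ensure_within_base_dir` in deploy/docker/utils.py):

```python
import os

def ensure_within_base_dir(path: str, base_dir: str) -> str:
    """Resolve symlinks and reject any export path that would
    escape base_dir, returning the resolved absolute path."""
    base = os.path.realpath(os.path.abspath(base_dir))
    target = os.path.realpath(os.path.abspath(path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"{path!r} escapes export dir {base_dir!r}")
    return target
```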

80-86: Avoid local absolute import; move helper to utils to prevent cycles.

from server import ... inside function risks import issues. Prefer moving _extract_user_friendly_error to utils and import once at module top.

Want a follow-up patch relocating _extract_user_friendly_error to utils and adjusting imports here and in server.py?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb8e095 and ca84054.

📒 Files selected for processing (16)
  • .env.example (1 hunks)
  • deploy/docker/README.md (6 hunks)
  • deploy/docker/api.py (9 hunks)
  • deploy/docker/config.yml (2 hunks)
  • deploy/docker/job.py (2 hunks)
  • deploy/docker/mcp_bridge.py (1 hunks)
  • deploy/docker/schemas.py (2 hunks)
  • deploy/docker/server.py (17 hunks)
  • deploy/docker/static/playground/index.html (2 hunks)
  • deploy/docker/test_backward_compatibility.py (1 hunks)
  • deploy/docker/utils.py (2 hunks)
  • docs/md_v2/assets/llm.txt/txt/docker.txt (1 hunks)
  • docs/md_v2/core/docker-deployment.md (1 hunks)
  • tests/mcp/test_mcp_http_listing.py (1 hunks)
  • tests/mcp/test_mcp_js_exec.py (1 hunks)
  • tests/mcp/test_mcp_socket.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • .env.example
  • tests/mcp/test_mcp_http_listing.py
  • deploy/docker/schemas.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/mcp/test_mcp_socket.py (1)
tests/mcp/test_mcp_http_listing.py (1)
  • main (16-24)
deploy/docker/api.py (4)
deploy/docker/utils.py (2)
  • _ensure_within_base_dir (128-139)
  • datetime_handler (57-61)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
  • success (36-37)
  • model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
deploy/docker/server.py (6)
deploy/docker/api.py (5)
  • handle_crawl_request (425-620)
  • handle_llm_qa (62-112)
  • handle_markdown_request (179-247)
  • handle_stream_crawl_request (622-671)
  • stream_results (392-423)
crawl4ai/async_webcrawler.py (2)
  • AsyncWebCrawler (53-852)
  • arun (204-431)
deploy/docker/crawler_pool.py (3)
  • close_all (47-50)
  • get_crawler (22-46)
  • janitor (52-60)
deploy/docker/mcp_bridge.py (6)
  • combine_lifespans (153-175)
  • create_mcp_server (138-149)
  • mcp_resource (24-36)
  • mcp_template (38-50)
  • mcp_tool (52-64)
  • register_tools_from_routes (179-211)
deploy/docker/schemas.py (7)
  • CrawlRequest (7-11)
  • HTMLRequest (34-35)
  • JSEndpointRequest (47-52)
  • MarkdownRequest (13-28)
  • PDFRequest (42-44)
  • RawCode (31-32)
  • ScreenshotRequest (37-40)
deploy/docker/utils.py (4)
  • FilterType (16-20)
  • load_config (22-40)
  • setup_logging (42-47)
  • _ensure_within_base_dir (128-139)
🪛 Ruff (0.13.1)
tests/mcp/test_mcp_js_exec.py

1-1: Shebang is present but file is not executable

(EXE001)


26-26: Probable use of requests call without timeout

(S113)


61-61: Use explicit conversion flag

Replace with conversion flag

(RUF010)


92-92: Use explicit conversion flag

Replace with conversion flag

(RUF010)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


117-117: Probable use of requests call without timeout

(S113)

tests/mcp/test_mcp_socket.py

25-25: Avoid specifying long messages outside the exception class

(TRY003)

deploy/docker/test_backward_compatibility.py

1-1: Shebang is present but file is not executable

(EXE001)


82-82: f-string without any placeholders

Remove extraneous f prefix

(F541)


111-111: f-string without any placeholders

Remove extraneous f prefix

(F541)


137-137: Do not catch blind exception: Exception

(BLE001)

deploy/docker/api.py

430-430: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


442-446: Abstract raise to an inner function

(TRY301)


488-488: Do not catch blind exception: Exception

(BLE001)


499-499: Do not catch blind exception: Exception

(BLE001)


565-565: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


589-592: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


591-591: Use explicit conversion flag

Replace with conversion flag

(RUF010)


594-594: Consider moving this statement to an else block

(TRY300)

deploy/docker/server.py

199-199: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


284-284: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


284-284: Use explicit conversion flag

Replace with conversion flag

(RUF010)


285-285: Do not catch blind exception: Exception

(BLE001)


286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


286-286: Use explicit conversion flag

Replace with conversion flag

(RUF010)


321-321: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


840-840: Possible binding to all interfaces

(S104)


898-898: Do not catch blind exception: Exception

(BLE001)


898-898: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


971-972: try-except-pass detected, consider logging the exception

(S110)


971-971: Do not catch blind exception: Exception

(BLE001)

deploy/docker/mcp_bridge.py

229-229: Unused noqa directive (unused: E721)

Remove unused noqa directive

(RUF100)


249-249: Avoid specifying long messages outside the exception class

(TRY003)

🪛 markdownlint-cli2 (0.18.1)
docs/md_v2/core/docker-deployment.md

398-398: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)
deploy/docker/api.py (3)

442-446: Good guard for headless-only Docker runs.

Clear error and guidance; aligns with container constraints.


596-620: Keep structured error details, but ensure media type upstream.

detail contains JSON string. Ensure the FastAPI route uses JSONResponse or exception handlers decode this. Otherwise, clients receive a stringified JSON.

If desired, I can refactor to raise HTTPException(detail=payload_dict) and add a global handler to serialize consistently.


451-454: Direct lookup safe for rate_limiter.enabled

The default config (deploy/docker/config.yml) already includes crawler.rate_limiter.enabled, so using config['crawler']['rate_limiter']['enabled'] won’t raise a KeyError.

Likely an incorrect or invalid review comment.

- Fix /mcp/schema endpoint to use proper FastMCP tool introspection
- Remove fallback 'route inspection failed' messages that hurt confidence
- Improve array detection by preserving List[str] typing in wrapper generation
- Replace unreliable route inspection with direct FastMCP tool access
- Add robust _create_param_schema() function for accurate type detection
- Clean error handling with proper HTTP 500 responses instead of fallback

The schema endpoint now provides professional, reliable tool schemas
without misleading fallback messages.

Addresses feedback from unclecode#1519
- Fix crawler.arun() extend bug - single result should use append()
- Fix is_required() property access (was incorrectly called as function)
- Add backward compatibility for 'q' parameter in LlmJobPayload
- Remove empty f-strings to satisfy linter
- Previously addressed: UUID collision prevention, hyphen fix, docs
- Harden path validation against symlink escapes in utils.py
- Update JS execution docs to reflect return_values support
- Add missing http language specification to fenced code block
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca84054 and 7142db8.

📒 Files selected for processing (7)
  • deploy/docker/api.py (9 hunks)
  • deploy/docker/job.py (3 hunks)
  • deploy/docker/mcp_bridge.py (1 hunks)
  • deploy/docker/server.py (18 hunks)
  • deploy/docker/test_backward_compatibility.py (1 hunks)
  • deploy/docker/utils.py (2 hunks)
  • docs/md_v2/core/docker-deployment.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/md_v2/core/docker-deployment.md
  • deploy/docker/job.py
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/docker/api.py (3)
deploy/docker/utils.py (2)
  • _ensure_within_base_dir (128-145)
  • datetime_handler (57-61)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (1)
  • model_dump (240-258)
deploy/docker/server.py (6)
deploy/docker/api.py (5)
  • handle_crawl_request (425-620)
  • handle_llm_qa (62-112)
  • handle_markdown_request (179-247)
  • handle_stream_crawl_request (622-671)
  • stream_results (392-423)
deploy/docker/crawler_pool.py (3)
  • close_all (47-50)
  • get_crawler (22-46)
  • janitor (52-60)
deploy/docker/job.py (1)
  • init_job_router (26-30)
deploy/docker/mcp_bridge.py (5)
  • combine_lifespans (153-175)
  • mcp_resource (24-36)
  • mcp_template (38-50)
  • mcp_tool (52-64)
  • register_tools_from_routes (179-211)
deploy/docker/schemas.py (6)
  • CrawlRequest (7-11)
  • HTMLRequest (34-35)
  • MarkdownRequest (13-28)
  • PDFRequest (42-44)
  • RawCode (31-32)
  • ScreenshotRequest (37-40)
deploy/docker/utils.py (3)
  • FilterType (16-20)
  • load_config (22-40)
  • _ensure_within_base_dir (128-145)
🪛 Ruff (0.13.1)
deploy/docker/api.py

430-430: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


442-446: Abstract raise to an inner function

(TRY301)


488-488: Do not catch blind exception: Exception

(BLE001)


499-499: Do not catch blind exception: Exception

(BLE001)


565-565: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


589-592: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


591-591: Use explicit conversion flag

Replace with conversion flag

(RUF010)


594-594: Consider moving this statement to an else block

(TRY300)

deploy/docker/mcp_bridge.py

229-229: Unused noqa directive (unused: E721)

Remove unused noqa directive

(RUF100)


249-249: Avoid specifying long messages outside the exception class

(TRY003)

deploy/docker/server.py

199-199: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


284-284: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


284-284: Use explicit conversion flag

Replace with conversion flag

(RUF010)


285-285: Do not catch blind exception: Exception

(BLE001)


286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


286-286: Use explicit conversion flag

Replace with conversion flag

(RUF010)


321-321: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


840-840: Possible binding to all interfaces

(S104)


893-893: Do not catch blind exception: Exception

(BLE001)


901-901: Do not catch blind exception: Exception

(BLE001)


903-903: Use explicit conversion flag

Replace with conversion flag

(RUF010)


965-965: Do not catch blind exception: Exception

(BLE001)

deploy/docker/test_backward_compatibility.py

1-1: Shebang is present but file is not executable

(EXE001)


137-137: Do not catch blind exception: Exception

(BLE001)

Previously, when a single CrawlResult object was returned (not wrapped in a list),
the code would incorrectly treat it as "already a list" and iterate over its fields
instead of treating it as a single result.

This fix adds proper type checking to:
- Detect actual sequences (list, tuple) vs single objects
- Wrap single CrawlResult objects in a list
- Handle None results gracefully
- Ensure downstream code always receives a proper list of results
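
A minimal sketch of the type check described above (CrawlResult here is a stand-in dataclass for illustration, not the real crawl4ai model):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class CrawlResult:  # stand-in for crawl4ai's model, illustration only
    url: str
    success: bool = True

def ensure_result_list(results: Any) -> List[Any]:
    """Always hand downstream code a proper list of results."""
    if results is None:
        return []                      # handle None gracefully
    if isinstance(results, (list, tuple)):
        return list(results)           # actual sequence: keep entries
    return [results]                   # single CrawlResult: wrap it
```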

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7142db8 and 7eb7d9e.

📒 Files selected for processing (1)
  • deploy/docker/api.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (3)
deploy/docker/utils.py (2)
  • _ensure_within_base_dir (128-145)
  • datetime_handler (57-61)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (1)
  • model_dump (240-258)
🪛 Ruff (0.13.1)
deploy/docker/api.py

430-430: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


442-446: Abstract raise to an inner function

(TRY301)


495-495: Do not catch blind exception: Exception

(BLE001)


506-506: Do not catch blind exception: Exception

(BLE001)


572-572: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


596-599: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


598-598: Use explicit conversion flag

Replace with conversion flag

(RUF010)


601-601: Consider moving this statement to an else block

(TRY300)

…jects

Previously, the cleanup loop would drop any value that json.dumps() couldn't
serialize, causing datetime and Path-like fields to be logged and skipped.

This fix adds proper type detection and normalization:
- datetime/date objects: converted using existing datetime_handler
- os.PathLike/pathlib.Path: converted to strings using os.fspath()
- bytes: preserved with existing base64 encoding
- Unsupported types: better logging with actual values

Changes:
- Added pathlib import and date to datetime imports
- Added datetime_handler to utils import at module level
- Enhanced cleanup loop with proper isinstance checks
- Improved error logging to show actual values

This ensures important metadata like timestamps and file paths are properly
preserved in API responses instead of being silently dropped during JSON
serialization.
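
The normalization steps above can be sketched as follows (a simplified stand-in; the real cleanup loop routes datetimes through utils.datetime_handler):

```python
import json
import os
import pathlib
from base64 import b64encode
from datetime import date, datetime

def to_jsonable(value):
    # Per-value normalization as described above (simplified stand-in).
    if isinstance(value, (datetime, date)):
        return value.isoformat()          # datetime/date -> ISO string
    if isinstance(value, os.PathLike):
        return os.fspath(value)           # Path-like -> plain string
    if isinstance(value, bytes):
        return b64encode(value).decode()  # keep existing base64 behavior
    return value

record = {"ts": datetime(2025, 9, 26), "path": pathlib.Path("/tmp/x"), "raw": b"\x00"}
clean = {k: to_jsonable(v) for k, v in record.items()}
json.dumps(clean)  # no longer raises TypeError on these fields
```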

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7eb7d9e and 6c7e833.

📒 Files selected for processing (1)
  • deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
deploy/docker/utils.py (2)
  • datetime_handler (57-61)
  • _ensure_within_base_dir (128-145)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
  • success (36-37)
  • model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py

432-432: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


444-448: Abstract raise to an inner function

(TRY301)


497-497: Do not catch blind exception: Exception

(BLE001)


508-508: Do not catch blind exception: Exception

(BLE001)


555-555: Local variable datetime_handler referenced before assignment

(F823)


581-581: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


605-608: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


607-607: Use explicit conversion flag

Replace with conversion flag

(RUF010)


610-610: Consider moving this statement to an else block

(TRY300)

…objects

CodeRabbit identified a critical issue: when Path/datetime objects are nested
inside collections (lists, dicts), the previous fix would still drop entire
fields because json.dumps() fails on the collection level.

This recursive approach fixes the issue by:
- Introducing _normalize_value() function that handles nested collections
- Recursively normalizing dict values: {k: _normalize_value(v) for k, v in dict.items()}
- Recursively normalizing list/tuple/set: [_normalize_value(v) for v in collection]
- Preserving existing special PDF handling logic
- Ensuring downloaded_files and other nested filesystem metadata is preserved

This prevents regression in binary-export flows where clients expect to see
saved file locations in nested data structures.

Addresses: unclecode#1519 (comment)...

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
deploy/docker/api.py (2)

588-616: Expand and validate the output base directory

_ensure_within_base_dir expects an absolute base directory, but the value in config["binary"]["default_dir"] may be provided as ~/exports (per the docs) or include environment variables. Passing it verbatim causes os.path.abspath to treat the tilde literally (/app/~/exports), so every legitimate output_path under $HOME is rejected. Please normalize the base folder with os.path.expanduser/os.path.expandvars (and re-run _ensure_within_base_dir on the resolved path) before comparison.


432-433: Type-hint output_path as Optional[str]

To comply with our typing guidelines, annotate the new parameter as Optional[str] = None so callers understand None is acceptable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c7e833 and 487ccf0.

📒 Files selected for processing (1)
  • deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
deploy/docker/utils.py (3)
  • validate_llm_provider (96-116)
  • datetime_handler (57-61)
  • _ensure_within_base_dir (128-145)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
  • success (36-37)
  • model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py

432-432: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


444-448: Abstract raise to an inner function

(TRY301)


497-497: Do not catch blind exception: Exception

(BLE001)


508-508: Do not catch blind exception: Exception

(BLE001)


590-590: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


614-617: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


616-616: Use explicit conversion flag

Replace with conversion flag

(RUF010)


619-619: Consider moving this statement to an else block

(TRY300)

Previously, when a CrawlResultContainer was returned, the normalization logic
would treat it as a single object and wrap it in a list, causing downstream
serialization to see the container instead of its internal per-URL entries.

This fix adds detection for objects with a .results attribute:
- Checks hasattr(results_gen, 'results') before single-object fallback
- Extracts internal entries with list(results_gen.results)
- Preserves existing handling for None, list/tuple, and single-model cases
- Maintains async-generator collection and fallback-per-URL retry logic

This ensures that CrawlResultContainer objects properly expose their
individual crawl results to downstream processing.
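
Sketched with a stand-in container class (the real CrawlResultContainer lives in crawl4ai):

```python
class FakeContainer:  # stand-in for crawl4ai's CrawlResultContainer
    def __init__(self, results):
        self.results = results

def unwrap_results(results_gen):
    # Detection order described above: None, container with .results,
    # concrete sequence, then single-model fallback.
    if results_gen is None:
        return []
    if hasattr(results_gen, "results"):
        return list(results_gen.results)   # expose per-URL entries
    if isinstance(results_gen, (list, tuple)):
        return list(results_gen)
    return [results_gen]                   # single result: wrap it
```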

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 487ccf0 and 3e82fad.

📒 Files selected for processing (1)
  • deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
deploy/docker/utils.py (3)
  • validate_llm_provider (96-116)
  • datetime_handler (57-61)
  • _ensure_within_base_dir (128-145)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
  • success (36-37)
  • model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py

432-432: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


444-448: Abstract raise to an inner function

(TRY301)


500-500: Do not catch blind exception: Exception

(BLE001)


511-511: Do not catch blind exception: Exception

(BLE001)


593-593: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


617-620: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


619-619: Use explicit conversion flag

Replace with conversion flag

(RUF010)


622-622: Consider moving this statement to an else block

(TRY300)

Previously, the fallback path would append crawler.arun() return values
directly to fallback_results, but AsyncWebCrawler.arun returns container
wrappers (CrawlResultContainer), which break downstream JSON serialization.

This fix adds proper container detection and unwrapping:
- hasattr(result, 'results') → extend with container.results
- hasattr(result, '__iter__') (excluding str/bytes/dict) → extend with iterable
- Non-iterable single results → append as-is

This ensures the retry/serialization path receives actual CrawlResult objects
instead of container wrappers, maintaining consistency with the main success
path and preventing JSON serialization errors.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deploy/docker/api.py (2)

148-172: Fix: unwrap container in LLM extraction path to avoid empty results.

Same container issue here; use inner result for status/error/content.

Apply this diff:

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(
+        async with AsyncWebCrawler() as crawler:
+            result_container = await crawler.arun(
                 url=url,
                 config=CrawlerRunConfig(
                     extraction_strategy=llm_strategy,
                     scraping_strategy=LXMLWebScrapingStrategy(),
                     cache_mode=cache_mode
                 )
             )
 
-        if not result.success:
+        result_obj = next(iter(getattr(result_container, "results", [result_container])), None)
+        if not result_obj or not getattr(result_obj, "success", False):
             await redis.hset(f"task:{task_id}", mapping={
                 "status": TaskStatus.FAILED,
-                "error": result.error_message
+                "error": getattr(result_obj, "error_message", "")
             })
             return
 
         try:
-            content = json.loads(result.extracted_content)
+            content = json.loads(result_obj.extracted_content)
         except json.JSONDecodeError:
-            content = result.extracted_content
+            content = result_obj.extracted_content

221-243: Fix: unwrap container in markdown handler.

Accessing .success/.markdown on the container will fail. Unwrap first.

Apply this diff:

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(
+        async with AsyncWebCrawler() as crawler:
+            result_container = await crawler.arun(
                 url=decoded_url,
                 config=CrawlerRunConfig(
                     markdown_generator=md_generator,
                     scraping_strategy=LXMLWebScrapingStrategy(),
                     cache_mode=cache_mode
                 )
             )
 
-            if not result.success:
+            result_obj = next(iter(getattr(result_container, "results", [result_container])), None)
+            if not result_obj or not getattr(result_obj, "success", False):
                 from server import _extract_user_friendly_error
-                error_msg = result.error_message or "Failed to fetch content"
+                error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
                 friendly_error = _extract_user_friendly_error(error_msg, "Markdown generation")
                 raise HTTPException(
                     status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                     detail=friendly_error
                 )
 
-            return (result.markdown.raw_markdown 
+            return (result_obj.markdown.raw_markdown 
                    if filter_type == FilterType.RAW 
-                   else result.markdown.fit_markdown)
+                   else result_obj.markdown.fit_markdown)
🧹 Nitpick comments (8)
deploy/docker/api.py (8)

431-433: Type hint: make output_path Optional[str].

Silences RUF013 and clarifies intent.

As per static analysis hints, apply:

-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None

500-502: Don’t swallow cancellations; re-raise CancelledError.

Prevents masking task cancellations in bulk path.

Apply:

-        except Exception as bulk_error:
+        except asyncio.CancelledError:
+            raise
+        except Exception as bulk_error:
             logger.warning("Bulk crawl failed, retrying per-URL: %s", bulk_error)

520-528: Don’t swallow cancellations in per-URL fallback.

Same rationale as above.

Apply:

-                except Exception as url_error:
+                except asyncio.CancelledError:
+                    raise
+                except Exception as url_error:
                     logger.warning("Crawl failed for %s: %s", url, url_error)
                     fallback_results.append(
                         {
                             "url": url,
                             "success": False,
                             "error_message": str(url_error),
                         }
                     )

511-519: Avoid a generic __iter__ check; it can misclassify Pydantic models.

BaseModel may be iterable; extending on it corrupts results. Restrict to concrete sequences.

Apply:

-                    elif hasattr(single_result, '__iter__') and not isinstance(single_result, (str, bytes, dict)):
-                        # Other iterable wrapper - iterate and append each result
-                        fallback_results.extend(single_result)
+                    elif isinstance(single_result, (list, tuple)):
+                        # Other concrete sequence wrapper - extend with items
+                        fallback_results.extend(single_result)

609-616: Harden file write: secure dir perms, drop redundant import.

Set restrictive permissions and use the already-imported datetime_handler.

Apply:

-            os.makedirs(os.path.dirname(abs_path), exist_ok=True)
+            os.makedirs(os.path.dirname(abs_path), mode=0o700, exist_ok=True)
@@
-            try:
-                from utils import datetime_handler
+            try:
                 with open(abs_path, 'w', encoding='utf-8') as f:
                     json.dump(response_data, f, ensure_ascii=False, indent=2, default=datetime_handler)

As per static analysis hints (S108).


624-629: Preserve exception cause when file write fails.

Use exception chaining and explicit conversion flag.

As per static analysis hints (B904, RUF010), apply:

-            except (OSError, TypeError) as write_error:
+            except (OSError, TypeError) as write_error:
                 logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-                raise HTTPException(
+                raise HTTPException(
                     status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=f"Failed to write to {abs_path}: {str(write_error)}"
-                )
+                    detail=f"Failed to write to {abs_path}: {write_error!s}"
+                ) from write_error

649-657: Return structured error dict instead of JSON string.

Avoids double-encoding and lets FastAPI serialize the object.

Apply:

-        import json as json_module  # Import locally to avoid any scoping issues
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=json_module.dumps({ # Send structured error
-                "error": str(e),
-                "server_memory_delta_mb": mem_delta_mb,
-                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
-            })
+            detail={
+                "error": str(e),
+                "server_memory_delta_mb": mem_delta_mb,
+                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
+            }
         )

674-679: Honor rate_limiter.enabled in streaming path for parity with non-streaming.

Keeps behavior consistent with handle_crawl_request.

Apply:

-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-            rate_limiter=RateLimiter(
-                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-            )
-        )
+        dispatcher = MemoryAdaptiveDispatcher(
+            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+            rate_limiter=(
+                RateLimiter(base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"]))
+                if config["crawler"]["rate_limiter"].get("enabled")
+                else None
+            )
+        )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e82fad and f60e6ae.

📒 Files selected for processing (1)
  • deploy/docker/api.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/api.py (4)
deploy/docker/utils.py (3)
  • validate_llm_provider (96-116)
  • datetime_handler (57-61)
  • _ensure_within_base_dir (128-145)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (210-233)
crawl4ai/models.py (2)
  • success (36-37)
  • model_dump (240-258)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
🪛 Ruff (0.13.1)
deploy/docker/api.py

432-432: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


444-448: Abstract raise to an inner function

(TRY301)


500-500: Do not catch blind exception: Exception

(BLE001)


520-520: Do not catch blind exception: Exception

(BLE001)


602-602: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


626-629: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


628-628: Use explicit conversion flag

Replace with conversion flag

(RUF010)


631-631: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (2)
deploy/docker/api.py (2)

444-448: Good guard: disallow non-headless sessions in Docker.

Prevents unsupported configurations early with a clear error.


556-590: Nice: recursive normalization preserves datetime/path/bytes without data loss.

This keeps new filesystem metadata and timestamps intact in responses.

…ion and error handling

• Replace 25+ hasattr() calls with @singledispatch protocol dispatch
• Add ResultContainer, SingleResult, AsyncResultGenerator protocols
• Implement NetworkError, ValidationError, SecurityError protocols
• Create ErrorResponse schema with structured error information
• Add centralized result_normalizer with recursive value normalization
• Fix path handling for quoted output_path parameters
• Maintain 100% backward compatibility with existing MCP endpoints
• Add comprehensive test coverage for new protocol systems
• Enable user-friendly error messages with actionable suggestions
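
The dispatch pattern can be sketched like this (names assumed; the real module is deploy/docker/result_normalizer.py):

```python
from functools import singledispatch
from typing import Any, List

@singledispatch
def normalize_results(results: Any) -> List[Any]:
    # Fallback: duck-typed container, else wrap the single object.
    if hasattr(results, "results"):
        return list(results.results)
    return [results]

@normalize_results.register(list)
def _(results):
    return results                     # already a list: pass through

@normalize_results.register(type(None))
def _(results):
    return []                          # None: empty result set
```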

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deploy/docker/api.py (2)

156-177: Fix: unwrap arun() result in LLM extraction path.

Same container-vs-result issue; without unwrapping, error checks and content parsing are wrong.

Apply this diff:

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(
+        async with AsyncWebCrawler() as crawler:
+            container = await crawler.arun(
                 url=url,
                 config=CrawlerRunConfig(
                     extraction_strategy=llm_strategy,
                     scraping_strategy=LXMLWebScrapingStrategy(),
                     cache_mode=cache_mode
                 )
             )
-
-        if not result.success:
+        results = normalize_results_protocol_dispatch(container)
+        result_obj = results[0] if results else None
+
+        if not result_obj or not getattr(result_obj, "success", False):
             await redis.hset(f"task:{task_id}", mapping={
                 "status": TaskStatus.FAILED,
-                "error": result.error_message
+                "error": getattr(result_obj, "error_message", "")
             })
             return
-
-        try:
-            content = json.loads(result.extracted_content)
+        try:
+            content = json.loads(result_obj.extracted_content)
         except json.JSONDecodeError:
-            content = result.extracted_content
+            content = result_obj.extracted_content

229-251: Fix: unwrap arun() result in markdown handler.

Accessing .success/.markdown directly on the container will fail. Unwrap before checks/return.

Apply this diff:

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(
+        async with AsyncWebCrawler() as crawler:
+            container = await crawler.arun(
                 url=decoded_url,
                 config=CrawlerRunConfig(
                     markdown_generator=md_generator,
                     scraping_strategy=LXMLWebScrapingStrategy(),
                     cache_mode=cache_mode
                 )
             )
 
-            if not result.success:
+            results = normalize_results_protocol_dispatch(container)
+            result_obj = results[0] if results else None
+            if not result_obj or not getattr(result_obj, "success", False):
                 from server import _extract_user_friendly_error
-                error_msg = result.error_message or "Failed to fetch content"
+                error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
                 friendly_error = _extract_user_friendly_error(error_msg, "Markdown generation")
                 raise HTTPException(
                     status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                     detail=friendly_error
                 )
 
-            return (result.markdown.raw_markdown 
+            return (result_obj.markdown.raw_markdown 
                    if filter_type == FilterType.RAW 
-                   else result.markdown.fit_markdown)
+                   else result_obj.markdown.fit_markdown)
🧹 Nitpick comments (22)
deploy/docker/result_protocols.py (1)

5-5: Use AsyncIterator/Awaitable in async protocol signatures

Typing-wise, __aiter__ should return AsyncIterator[Any] and __anext__ should return Awaitable[Any]. Update imports and method annotations accordingly.

-from typing import Protocol, runtime_checkable, Iterable, Any, AsyncGenerator
+from typing import Protocol, runtime_checkable, Iterable, Any, AsyncIterator, Awaitable
@@
-    def __aiter__(self) -> AsyncGenerator[Any, None]:
+    def __aiter__(self) -> AsyncIterator[Any]:
         """Async iterator interface."""
         ...
 
-    def __anext__(self) -> Any:
+    def __anext__(self) -> Awaitable[Any]:
         """Async next interface."""
         ...

Also applies to: 31-36

deploy/docker/utils.py (1)

139-146: Prefer package-qualified imports to avoid module shadowing

If this code runs as a package, use absolute imports (e.g., from deploy.docker.error_handler import ...) to avoid ambiguity; keep the legacy fallback as-is.

deploy/docker/test_basic_normalization.py (3)

76-76: Drop unnecessary f-string

No interpolation here.

-    assert "url" in result, f"Expected 'url' key in result"
+    assert "url" in result, "Expected 'url' key in result"

6-9: Avoid hardcoded /tmp path in tests

Use tempfile for portability and to satisfy linters.

 import os
 import pathlib
+import tempfile
 from datetime import datetime, date
 from typing import List, Any
@@
-        "path": pathlib.Path("/tmp/test"),
+        "path": pathlib.Path(tempfile.gettempdir()) / "test",
         "nested": {"date": date.today()}

Also applies to: 80-84


132-140: Narrow the broad exception catch in test main

Catching Exception masks failures. Either let exceptions bubble or catch AssertionError explicitly.

-    except Exception as e:
+    except AssertionError as e:
         print(f"\n❌ Test failed: {e}")
         import traceback
         traceback.print_exc()
deploy/docker/test_error_handling.py (2)

134-134: Drop unnecessary f-string

No placeholders.

-    print(f"✓ JSON serialization successful")
+    print("✓ JSON serialization successful")

162-165: Avoid blind Exception catch in test main

Let errors surface or catch specific exceptions to keep failures actionable.

-    except Exception as e:
+    except AssertionError as e:
         print(f"\n❌ Error handling test failed: {e}")
         import traceback
         traceback.print_exc()
deploy/docker/result_normalizer.py (5)

114-124: Honor SingleResult protocol in normalize_single_result

Leverage the protocol you defined rather than only duck-typing Pydantic.

-    if hasattr(result, "model_dump"):
+    if isinstance(result, SingleResult):
+        result_dict = result.to_dict()
+    elif hasattr(result, "model_dump"):
         result_dict = result.model_dump()

69-73: Raise TypeError for unsupported async-generator input

TypeError better communicates invalid input type.

-        raise ValueError("Async generators must be handled by caller with async collection")
+        raise TypeError("Async generators must be handled by caller with async collection")

51-54: Silence ARG001 by using underscore for unused singledispatch arg

Cleaner signature for the None-branch.

-@normalize_results.register
-def _(results: type(None)) -> List[Any]:
+@normalize_results.register
+def _(_: type(None)) -> List[Any]:

5-11: Prefer collections.abc.Iterable over hasattr('__iter__')

Runtime isinstance against collections.abc.Iterable avoids string matching and satisfies linters.

 import os
 import pathlib
 from base64 import b64encode
 from datetime import datetime, date
 from functools import singledispatch
 from typing import List, Any, Union, AsyncGenerator
+from collections.abc import Iterable as IterableABC
@@
-    if hasattr(results, '__iter__') and not isinstance(results, (str, bytes, dict)):
+    if isinstance(results, IterableABC) and not isinstance(results, (str, bytes, dict)):
         try:
             return list(results)

Also applies to: 75-79
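A quick sanity check of the isinstance-based guard (values are illustrative):

```python
from collections.abc import Iterable


def should_listify(results) -> bool:
    # Strings, bytes, and dicts are iterable but should not be exploded into lists
    return isinstance(results, Iterable) and not isinstance(results, (str, bytes, dict))


print(should_listify([1, 2]))    # True
print(should_listify("abc"))     # False
print(should_listify({"a": 1}))  # False
print(should_listify(42))        # False
```

One caveat: objects iterable only via the legacy `__getitem__` protocol are not instances of `collections.abc.Iterable`, so this check is slightly stricter than the hasattr version.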


15-20: DRY: reuse datetime_handler from utils

Avoid duplicating datetime handling logic; import the existing helper for consistency.

-def datetime_handler(obj):
-    """Handle datetime serialization - imported from existing logic."""
-    if isinstance(obj, (datetime, date)):
-        return obj.isoformat()
-    return obj
+from utils import datetime_handler
deploy/docker/api.py (5)

439-441: Type hint: make output_path Optional[str].

Avoid implicit Optional per Ruff RUF013.

Apply this diff:

-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None

578-583: Propagate original write error and use explicit f-string conversion.

Raise from the caught exception and use {!s} per RUF010.

Apply this diff:

-            except (OSError, TypeError) as write_error:
-                logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-                raise HTTPException(
-                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=f"Failed to write to {abs_path}: {str(write_error)}"
-                )
+            except (OSError, TypeError) as write_error:
+                logger.warning("crawl file write failed: %s", write_error, exc_info=True)
+                raise HTTPException(
+                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                    detail=f"Failed to write to {abs_path}: {write_error!s}"
+                ) from write_error

603-611: Avoid double JSON-encoding in HTTPException detail.

Return a dict, not a JSON string. Aligns with the PR’s “remove double-JSON-encoding” goal.

Apply this diff:

-        import json as json_module  # Import locally to avoid any scoping issues
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=json_module.dumps({ # Send structured error
-                "error": str(e),
-                "server_memory_delta_mb": mem_delta_mb,
-                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
-            })
-        )
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail={
+                "error": str(e),
+                "server_memory_delta_mb": mem_delta_mb,
+                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0),
+            },
+        )

498-498: Consider narrowing broad exception catches.

Catching bare Exception hinders observability and can mask programming errors; prefer specific exceptions or re-raise after logging.

Also applies to: 511-511


628-633: Align rate limiter enablement with non-streaming path.

Non-streaming honors config["crawler"]["rate_limiter"]["enabled"]; streaming path always enables it. Consider respecting the flag for consistency.

Apply this diff:

-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-            rate_limiter=RateLimiter(
-                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-            )
-        )
+        dispatcher = MemoryAdaptiveDispatcher(
+            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+            rate_limiter=(
+                RateLimiter(base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"]))
+                if config["crawler"]["rate_limiter"].get("enabled")
+                else None
+            ),
+        )
deploy/docker/error_handler.py (1)

76-82: Minor: use {!s} instead of str(...) in f-string.

Cleaner and satisfies RUF010.

Apply this diff:

-    return ErrorResponse(
-        error_type="unknown",
-        error_message=f"{operation} failed: {str(error)}",
+    return ErrorResponse(
+        error_type="unknown",
+        error_message=f"{operation} failed: {error!s}",
         operation=operation,
         suggestions=["Please check your input and try again"]
     )
deploy/docker/test_result_normalization.py (1)

1-1: Nit: remove shebang from test module (or make file executable).

Tests are run by pytest; the shebang is unnecessary and triggers EXE001.

Apply this diff:

-#!/usr/bin/env python3
deploy/docker/server.py (2)

909-968: Improve error handling in JSON schema generation

The _create_param_schema function catches all exceptions but doesn't log them, making debugging difficult. Also, the JSON serialization test could be more efficient.

Consider logging exceptions and improving the serialization check:

 def _create_param_schema(annotation, default_value):
     """Create JSON schema for a parameter based on its type annotation."""
     import json
     from typing import get_origin, get_args
     
     schema = {"type": "string"}  # Default fallback
     
     try:
         # ... existing type handling code ...
         
         # Add default value if present and JSON serializable
         if default_value != inspect.Parameter.empty and default_value is not None:
-            try:
-                json.dumps(default_value)  # Test if serializable
-                schema["default"] = default_value
-            except (TypeError, ValueError):
-                # If not serializable, convert to string representation
-                if isinstance(default_value, (str, int, float, bool)):
-                    schema["default"] = default_value
-                else:
-                    schema["default"] = str(default_value)
+            # Check if value is directly serializable
+            if isinstance(default_value, (str, int, float, bool, type(None))):
+                schema["default"] = default_value
+            else:
+                try:
+                    json.dumps(default_value)  # Test complex types
+                    schema["default"] = default_value
+                except (TypeError, ValueError):
+                    schema["default"] = str(default_value)
                     
-    except Exception:
+    except Exception as e:
         # Fallback for any errors in schema generation
+        logger.debug(f"Failed to generate schema for {annotation}: {e}")
         schema = {"type": "string", "description": f"Type: {annotation}"}
     
     return schema
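For context, the annotation-to-schema mapping with typing.get_origin/get_args can be sketched as follows (a simplified stand-in for illustration, not the actual _create_param_schema):

```python
from typing import List, Optional, Union, get_args, get_origin


def annotation_to_schema(annotation) -> dict:
    """Map a handful of common annotations to JSON-schema fragments."""
    primitives = {str: "string", int: "integer", float: "number", bool: "boolean"}
    if annotation in primitives:
        return {"type": primitives[annotation]}
    origin = get_origin(annotation)
    if origin is list:
        # List[X] -> array of the schema for X; bare list falls back to string items
        args = get_args(annotation)
        item_type = args[0] if args else str
        return {"type": "array", "items": annotation_to_schema(item_type)}
    if origin is Union:
        # Optional[X] is Union[X, None]; describe the non-None member
        non_none = [a for a in get_args(annotation) if a is not type(None)]
        if len(non_none) == 1:
            return annotation_to_schema(non_none[0])
    return {"type": "string"}  # fallback, mirroring the reviewed code


print(annotation_to_schema(List[int]))       # {'type': 'array', 'items': {'type': 'integer'}}
print(annotation_to_schema(Optional[float]))  # {'type': 'number'}
```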

787-791: Consider extracting truncation limit as a constant

The truncation function uses a hardcoded limit of 800 characters. Consider making this configurable or at least defining it as a named constant for maintainability.

+# At module level
+DEFAULT_CONTEXT_TRUNCATE_LIMIT = 800
+
 def _truncate(text: str, limit: int = 800) -> str:
     if len(text) <= limit:
         return text
     return text[:limit].rstrip() + "…"

Or even better, make it configurable:

-def _truncate(text: str, limit: int = 800) -> str:
+def _truncate(text: str, limit: Optional[int] = None) -> str:
+    if limit is None:
+        limit = config.get("context", {}).get("truncate_limit", 800)
     if len(text) <= limit:
         return text
     return text[:limit].rstrip() + "…"
deploy/docker/error_protocols.py (1)

127-130: Consider documenting the context parameter usage

The context parameter in CrawlError.__init__ is stored but never accessed through a property. Consider either adding a property for it or documenting when/how it should be used.

 class CrawlError:
     """Base implementation for crawl-related errors."""

     def __init__(self, message: str, operation: str = "operation", context: Optional[Dict[str, Any]] = None):
         self._message = message
         self._operation = operation
-        self.context = context or {}
+        # Context can be used to store additional error details for logging/debugging
+        self._context = context or {}

     @property
     def message(self) -> str:
         return self._message

     @property
     def operation(self) -> str:
         return self._operation
+
+    @property
+    def context(self) -> Dict[str, Any]:
+        """Additional error context for debugging."""
+        return self._context
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f60e6ae and ae348c4.

📒 Files selected for processing (11)
  • deploy/docker/api.py (11 hunks)
  • deploy/docker/error_handler.py (1 hunks)
  • deploy/docker/error_protocols.py (1 hunks)
  • deploy/docker/result_normalizer.py (1 hunks)
  • deploy/docker/result_protocols.py (1 hunks)
  • deploy/docker/schemas.py (3 hunks)
  • deploy/docker/server.py (18 hunks)
  • deploy/docker/test_basic_normalization.py (1 hunks)
  • deploy/docker/test_error_handling.py (1 hunks)
  • deploy/docker/test_result_normalization.py (1 hunks)
  • deploy/docker/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
deploy/docker/result_protocols.py (2)
deploy/docker/test_basic_normalization.py (1)
  • results (47-48)
deploy/docker/test_result_normalization.py (1)
  • results (55-56)
deploy/docker/result_normalizer.py (4)
deploy/docker/result_protocols.py (4)
  • ResultContainer (9-15)
  • SingleResult (19-24)
  • AsyncResultGenerator (28-37)
  • results (13-15)
deploy/docker/utils.py (1)
  • datetime_handler (57-61)
deploy/docker/test_basic_normalization.py (2)
  • results (47-48)
  • model_dump (30-37)
deploy/docker/test_result_normalization.py (2)
  • results (55-56)
  • model_dump (38-45)
deploy/docker/test_error_handling.py (3)
deploy/docker/error_handler.py (3)
  • handle_error (21-82)
  • parse_network_error (170-206)
  • parse_security_error (209-224)
deploy/docker/error_protocols.py (21)
  • NetworkErrorImpl (141-150)
  • ValidationErrorImpl (154-173)
  • SecurityErrorImpl (176-195)
  • error_code (15-17)
  • error_code (149-150)
  • message (21-23)
  • message (133-134)
  • operation (27-29)
  • operation (137-138)
  • field_name (38-40)
  • field_name (164-165)
  • invalid_value (44-46)
  • invalid_value (168-169)
  • expected_format (50-52)
  • expected_format (172-173)
  • violation_type (61-63)
  • violation_type (186-187)
  • attempted_action (67-69)
  • attempted_action (190-191)
  • allowed_scope (73-75)
  • allowed_scope (194-195)
deploy/docker/schemas.py (4)
  • ErrorResponse (96-199)
  • network_error (107-121)
  • validation_error (124-141)
  • security_error (144-161)
deploy/docker/server.py (8)
deploy/docker/api.py (5)
  • handle_crawl_request (435-611)
  • handle_llm_qa (72-122)
  • handle_markdown_request (189-257)
  • handle_stream_crawl_request (613-662)
  • stream_results (402-433)
crawl4ai/async_webcrawler.py (2)
  • AsyncWebCrawler (53-852)
  • arun (204-431)
deploy/docker/crawler_pool.py (3)
  • close_all (47-50)
  • get_crawler (22-46)
  • janitor (52-60)
deploy/docker/mcp_bridge.py (6)
  • combine_lifespans (153-175)
  • create_mcp_server (138-149)
  • mcp_resource (24-36)
  • mcp_template (38-50)
  • mcp_tool (52-64)
  • register_tools_from_routes (179-211)
deploy/docker/schemas.py (9)
  • CrawlRequest (7-11)
  • HTMLRequest (34-35)
  • MarkdownRequest (13-28)
  • ErrorResponse (96-199)
  • PDFRequest (42-44)
  • RawCode (31-32)
  • ScreenshotRequest (37-40)
  • network_error (107-121)
  • error_response (82-93)
deploy/docker/utils.py (2)
  • FilterType (16-20)
  • _ensure_within_base_dir (128-179)
deploy/docker/error_handler.py (4)
  • handle_error (21-82)
  • parse_network_error (170-206)
  • parse_security_error (209-224)
  • create_http_exception_from_error (269-274)
deploy/docker/error_protocols.py (4)
  • NetworkErrorImpl (141-150)
  • SecurityErrorImpl (176-195)
  • operation (27-29)
  • operation (137-138)
deploy/docker/api.py (5)
deploy/docker/result_normalizer.py (4)
  • normalize_results_protocol_dispatch (57-82)
  • normalize_single_result (104-145)
  • normalize_value_recursive (85-101)
  • datetime_handler (15-19)
deploy/docker/schemas.py (1)
  • StandardizedResponse (55-93)
deploy/docker/utils.py (3)
  • validate_llm_provider (96-116)
  • datetime_handler (57-61)
  • _ensure_within_base_dir (128-179)
deploy/docker/server.py (1)
  • _extract_user_friendly_error (217-222)
crawl4ai/async_webcrawler.py (1)
  • arun (204-431)
deploy/docker/schemas.py (3)
deploy/docker/utils.py (1)
  • FilterType (16-20)
deploy/docker/result_protocols.py (1)
  • results (13-15)
deploy/docker/error_protocols.py (30)
  • error_code (15-17)
  • error_code (149-150)
  • operation (27-29)
  • operation (137-138)
  • message (21-23)
  • message (133-134)
  • field_name (38-40)
  • field_name (164-165)
  • invalid_value (44-46)
  • invalid_value (168-169)
  • expected_format (50-52)
  • expected_format (172-173)
  • violation_type (61-63)
  • violation_type (186-187)
  • attempted_action (67-69)
  • attempted_action (190-191)
  • allowed_scope (73-75)
  • allowed_scope (194-195)
  • config_key (84-86)
  • config_key (208-209)
  • config_value (90-92)
  • config_value (212-213)
  • suggestion (96-98)
  • suggestion (216-217)
  • processing_stage (107-109)
  • processing_stage (230-231)
  • data_type (113-115)
  • data_type (234-235)
  • recovery_suggestion (119-121)
  • recovery_suggestion (238-239)
deploy/docker/test_result_normalization.py (3)
deploy/docker/result_normalizer.py (4)
  • normalize_results (23-36)
  • normalize_results_protocol_dispatch (57-82)
  • normalize_single_result (104-145)
  • normalize_value_recursive (85-101)
deploy/docker/result_protocols.py (3)
  • ResultContainer (9-15)
  • SingleResult (19-24)
  • results (13-15)
deploy/docker/schemas.py (3)
  • StandardizedResponse (55-93)
  • success_response (65-79)
  • error_response (82-93)
deploy/docker/test_basic_normalization.py (3)
deploy/docker/result_normalizer.py (4)
  • normalize_results (23-36)
  • normalize_results_protocol_dispatch (57-82)
  • normalize_single_result (104-145)
  • normalize_value_recursive (85-101)
deploy/docker/result_protocols.py (3)
  • ResultContainer (9-15)
  • SingleResult (19-24)
  • results (13-15)
deploy/docker/schemas.py (2)
  • StandardizedResponse (55-93)
  • success_response (65-79)
deploy/docker/error_handler.py (2)
deploy/docker/error_protocols.py (40)
  • NetworkError (10-29)
  • ValidationError (33-52)
  • SecurityError (56-75)
  • ConfigurationError (79-98)
  • ProcessingError (102-121)
  • NetworkErrorImpl (141-150)
  • ValidationErrorImpl (154-173)
  • SecurityErrorImpl (176-195)
  • ConfigurationErrorImpl (198-217)
  • ProcessingErrorImpl (220-239)
  • operation (27-29)
  • operation (137-138)
  • error_code (15-17)
  • error_code (149-150)
  • message (21-23)
  • message (133-134)
  • field_name (38-40)
  • field_name (164-165)
  • invalid_value (44-46)
  • invalid_value (168-169)
  • expected_format (50-52)
  • expected_format (172-173)
  • violation_type (61-63)
  • violation_type (186-187)
  • attempted_action (67-69)
  • attempted_action (190-191)
  • allowed_scope (73-75)
  • allowed_scope (194-195)
  • config_key (84-86)
  • config_key (208-209)
  • config_value (90-92)
  • config_value (212-213)
  • suggestion (96-98)
  • suggestion (216-217)
  • processing_stage (107-109)
  • processing_stage (230-231)
  • data_type (113-115)
  • data_type (234-235)
  • recovery_suggestion (119-121)
  • recovery_suggestion (238-239)
deploy/docker/schemas.py (6)
  • ErrorResponse (96-199)
  • network_error (107-121)
  • validation_error (124-141)
  • security_error (144-161)
  • configuration_error (164-179)
  • processing_error (182-199)
deploy/docker/utils.py (3)
deploy/docker/error_handler.py (2)
  • handle_error (21-82)
  • create_http_exception_from_error (269-274)
deploy/docker/error_protocols.py (9)
  • SecurityErrorImpl (176-195)
  • violation_type (61-63)
  • violation_type (186-187)
  • attempted_action (67-69)
  • attempted_action (190-191)
  • allowed_scope (73-75)
  • allowed_scope (194-195)
  • message (21-23)
  • message (133-134)
deploy/docker/schemas.py (2)
  • security_error (144-161)
  • error_response (82-93)
🪛 Ruff (0.13.1)
deploy/docker/result_normalizer.py

52-52: Unused function argument: results

(ARG001)


72-72: Prefer TypeError exception for invalid type

(TRY004)


72-72: Avoid specifying long messages outside the exception class

(TRY003)

deploy/docker/test_error_handling.py

1-1: Shebang is present but file is not executable

(EXE001)


78-78: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


134-134: f-string without any placeholders

Remove extraneous f prefix

(F541)


162-162: Do not catch blind exception: Exception

(BLE001)

deploy/docker/server.py

204-204: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


219-219: Redefinition of unused parse_network_error from line 69

Remove definition: parse_network_error

(F811)


219-219: Redefinition of unused handle_error from line 69

Remove definition: handle_error

(F811)


273-273: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


273-273: Use explicit conversion flag

Replace with conversion flag

(RUF010)


274-274: Do not catch blind exception: Exception

(BLE001)


275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


275-275: Use explicit conversion flag

Replace with conversion flag

(RUF010)


310-310: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


839-839: Possible binding to all interfaces

(S104)


892-892: Do not catch blind exception: Exception

(BLE001)


900-900: Do not catch blind exception: Exception

(BLE001)


902-902: Use explicit conversion flag

Replace with conversion flag

(RUF010)


964-964: Do not catch blind exception: Exception

(BLE001)

deploy/docker/api.py

440-440: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


452-456: Abstract raise to an inner function

(TRY301)


498-498: Do not catch blind exception: Exception

(BLE001)


511-511: Do not catch blind exception: Exception

(BLE001)


548-548: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)


580-583: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


582-582: Use explicit conversion flag

Replace with conversion flag

(RUF010)


585-585: Consider moving this statement to an else block

(TRY300)

deploy/docker/test_result_normalization.py

1-1: Shebang is present but file is not executable

(EXE001)


162-162: Probable insecure usage of temporary file or directory: "/tmp/test"

(S108)


182-182: Probable insecure usage of temporary file or directory: "/tmp"

(S108)


186-186: Probable insecure usage of temporary file or directory: "/tmp/file"

(S108)


201-201: Probable insecure usage of temporary file or directory: "/tmp"

(S108)


267-267: Probable insecure usage of temporary file or directory: "/tmp/test"

(S108)

deploy/docker/test_basic_normalization.py

1-1: Shebang is present but file is not executable

(EXE001)


76-76: f-string without any placeholders

Remove extraneous f prefix

(F541)


82-82: Probable insecure usage of temporary file or directory: "/tmp/test"

(S108)


137-137: Do not catch blind exception: Exception

(BLE001)

deploy/docker/error_handler.py

79-79: Use explicit conversion flag

Replace with conversion flag

(RUF010)


135-135: Unused function argument: operation

(ARG001)


157-157: Unused function argument: operation

(ARG001)

🔇 Additional comments (10)
deploy/docker/utils.py (1)

147-161: Also validate that base_dir exists and is a directory

Good normalization and symlink handling. Add an existence/type check to avoid confusing errors and align with prior guidance.

@@
-    # Validate absolute base_dir
+    # Validate absolute base_dir
     if not os.path.isabs(base_dir_real):
@@
         raise create_http_exception_from_error(error_response, 400)
 
+    # Validate base_dir exists and is a directory
+    if not os.path.isdir(base_dir_real):
+        security_error = SecurityErrorImpl(
+            violation_type="invalid_base_directory",
+            attempted_action=f"Set base directory to: {base_dir}",
+            allowed_scope="existing absolute directories only",
+            message=f"Security restriction: base directory does not exist or is not a directory: {base_dir_real}."
+        )
+        error_response = handle_error(security_error, "Path validation")
+        raise create_http_exception_from_error(error_response, 400)
@@
-    # Validate absolute base_dir
-    if not os.path.isabs(base_dir_real):
+    # Validate absolute base_dir
+    if not os.path.isabs(base_dir_real):
         raise HTTPException(400, "Security restriction: base directory must be absolute.")
+    # Validate base_dir exists and is a directory
+    if not os.path.isdir(base_dir_real):
+        raise HTTPException(
+            400,
+            f"Security restriction: base directory does not exist or is not a directory: {base_dir_real}."
+        )

Also applies to: 182-200

deploy/docker/api.py (1)

86-98: Fix: unwrap CrawlResultContainer from arun() before accessing fields.

AsyncWebCrawler.arun() returns a container; direct access to .success/.markdown will break. Unwrap the first result before checks and content extraction.

Apply this diff:

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(url)
-            if not result.success:
-                from server import _extract_user_friendly_error
-                error_msg = result.error_message or "Failed to fetch content"
-                friendly_error = _extract_user_friendly_error(error_msg, "LLM QA")
-                raise HTTPException(
-                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=friendly_error
-                )
-            content = result.markdown.fit_markdown or result.markdown.raw_markdown
+        async with AsyncWebCrawler() as crawler:
+            container = await crawler.arun(url)
+            results = normalize_results_protocol_dispatch(container)
+            result_obj = results[0] if results else None
+            if not result_obj or not getattr(result_obj, "success", False):
+                from server import _extract_user_friendly_error
+                error_msg = getattr(result_obj, "error_message", None) or "Failed to fetch content"
+                friendly_error = _extract_user_friendly_error(error_msg, "LLM QA")
+                raise HTTPException(
+                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                    detail=friendly_error
+                )
+            content = result_obj.markdown.fit_markdown or result_obj.markdown.raw_markdown
deploy/docker/server.py (4)

839-839: Security: Binding to all interfaces (0.0.0.0)

The code correctly maps 0.0.0.0 to 127.0.0.1 for internal proxy calls, but be aware that binding to 0.0.0.0 exposes the service to all network interfaces, which could be a security risk in production.

The mapping logic on line 839 is correct for internal proxy calls. Ensure that the Docker container networking and firewall rules properly restrict external access if the service shouldn't be publicly accessible.


972-976: JWT authentication limitation is correctly documented

Good that you've documented the JWT authentication limitation for MCP tools. This is an important known issue for users to be aware of.


425-426: Good fix for concurrent file collision issue

The use of UUID for generating unique filenames prevents race conditions when multiple requests for the same host occur simultaneously. This properly addresses the previously identified collision issue.

Based on learnings

Also applies to: 490-491


303-314: Good backward compatibility handling

The implementation thoughtfully preserves backward compatibility by supporting deprecated query parameters while encouraging migration to request body parameters.

deploy/docker/error_protocols.py (1)

1-239: Well-structured error handling system

The protocol-based error handling approach with runtime-checkable protocols and concrete implementations is excellent. The separation of protocols from implementations provides good flexibility and testability.

deploy/docker/schemas.py (3)

1-200: Clean schema definitions with comprehensive error handling

The schema definitions are well-structured with clear field descriptions. The standardized response classes with factory methods provide a consistent API surface. The integration with the error protocol system is well done.


11-11: Good addition of output_path fields for binary data

Adding output_path fields to CrawlRequest, ScreenshotRequest, and PDFRequest is an excellent design choice. This prevents token/context limit issues with large base64 payloads and provides better performance.

Also applies to: 40-40, 44-44


16-28: Well-documented field migrations in MarkdownRequest

The renaming from f/q/c to filter/query/cache with proper Field descriptions improves API clarity. The backward compatibility is maintained through query parameters as seen in the server implementation.

Comment on lines +553 to +561
# Only use abspath if it's not already an absolute path
if os.path.isabs(clean_path):
abs_path = os.path.normpath(clean_path)
else:
abs_path = os.path.abspath(clean_path)

# Validate path is within allowed directory
_ensure_within_base_dir(abs_path, binary_base_dir)


⚠️ Potential issue | 🟠 Major

Fix: resolve relative output_path under the allowed base directory.

Using abspath for relatives can resolve outside the allowed dir and then be rejected by _ensure_within_base_dir. Join with base dir for relatives.

Apply this diff:

-            # Only use abspath if it's not already an absolute path
-            if os.path.isabs(clean_path):
-                abs_path = os.path.normpath(clean_path)
-            else:
-                abs_path = os.path.abspath(clean_path)
+            # Resolve relative paths inside the allowed base directory
+            if os.path.isabs(clean_path):
+                abs_path = os.path.normpath(clean_path)
+            else:
+                abs_path = os.path.normpath(os.path.join(binary_base_dir, clean_path))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Only use abspath if it's not already an absolute path
if os.path.isabs(clean_path):
abs_path = os.path.normpath(clean_path)
else:
abs_path = os.path.abspath(clean_path)
# Validate path is within allowed directory
_ensure_within_base_dir(abs_path, binary_base_dir)
# Resolve relative paths inside the allowed base directory
if os.path.isabs(clean_path):
abs_path = os.path.normpath(clean_path)
else:
abs_path = os.path.normpath(os.path.join(binary_base_dir, clean_path))
# Validate path is within allowed directory
_ensure_within_base_dir(abs_path, binary_base_dir)
🤖 Prompt for AI Agents
In deploy/docker/api.py around lines 553 to 561, the code currently uses
os.path.abspath for relative clean_path which can escape the allowed base dir
and then be rejected; instead, if clean_path is absolute, normalize it with
os.path.normpath; if it is relative, join it with binary_base_dir
(os.path.join(binary_base_dir, clean_path)) and then normalize that result;
finally call _ensure_within_base_dir on the resulting abs_path so relative
outputs are resolved inside the allowed base directory.

Comment on lines +34 to +76
# Check protocols explicitly since singledispatch doesn't handle protocols
if isinstance(error, NetworkError):
return ErrorResponse.network_error(
error_code=error.error_code,
message=error.message,
operation=operation,
suggestions=_get_network_error_suggestions(error.error_code)
)
elif isinstance(error, ValidationError):
return ErrorResponse.validation_error(
field_name=error.field_name,
invalid_value=error.invalid_value,
expected_format=error.expected_format,
suggestions=[
f"Provide a valid {error.expected_format} for {error.field_name}",
"Check the API documentation for valid input formats"
]
)
elif isinstance(error, SecurityError):
return ErrorResponse.security_error(
violation_type=error.violation_type,
attempted_action=error.attempted_action,
allowed_scope=error.allowed_scope,
suggestions=[
f"Ensure your action is within the allowed scope: {error.allowed_scope}",
"Check the security restrictions in the API documentation"
]
)
elif isinstance(error, ConfigurationError):
return ErrorResponse.configuration_error(
config_key=error.config_key,
config_value=error.config_value,
suggestion=error.suggestion
)
elif isinstance(error, ProcessingError):
return ErrorResponse.processing_error(
processing_stage=error.processing_stage,
data_type=error.data_type,
recovery_suggestion=error.recovery_suggestion,
operation=operation
)

# Fallback to generic error handling

⚠️ Potential issue | 🔴 Critical

Bug: isinstance() on Protocols requires @runtime_checkable.

These isinstance checks will raise TypeError unless the Protocols are marked runtime_checkable.

Add @runtime_checkable to the Protocols in deploy/docker/error_protocols.py:

+from typing import runtime_checkable
@@
-class NetworkError(Protocol):
+@runtime_checkable
+class NetworkError(Protocol):
@@
-class ValidationError(Protocol):
+@runtime_checkable
+class ValidationError(Protocol):
@@
-class SecurityError(Protocol):
+@runtime_checkable
+class SecurityError(Protocol):
@@
-class ConfigurationError(Protocol):
+@runtime_checkable
+class ConfigurationError(Protocol):
@@
-class ProcessingError(Protocol):
+@runtime_checkable
+class ProcessingError(Protocol):

Alternatively, replace isinstance checks here with attribute-based duck typing.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In deploy/docker/error_handler.py lines 34-76 the isinstance checks against
Protocol types will raise TypeError unless those Protocols are marked
runtime-checkable; update the Protocol definitions in
deploy/docker/error_protocols.py by adding from typing import runtime_checkable
and decorating each Protocol used here with @runtime_checkable so the isinstance
checks work at runtime, or alternatively replace these isinstance checks in this
file with explicit attribute-based duck-typing (e.g., check for the specific
attributes like error_code, field_name, violation_type, etc.) and then construct
the corresponding ErrorResponse branches.
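To illustrate why the decorator matters, here is a minimal, self-contained sketch; the `NetworkError` protocol and `DnsFailure` class below are hypothetical stand-ins, not the project's actual definitions:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class NetworkError(Protocol):
    # Structural type: any object exposing these attributes matches.
    error_code: str
    message: str

class DnsFailure:
    def __init__(self) -> None:
        self.error_code = "DNS_FAIL"
        self.message = "hostname lookup failed"

# With @runtime_checkable, isinstance() checks attribute presence at runtime.
print(isinstance(DnsFailure(), NetworkError))  # True

# Without the decorator, the same isinstance() call raises:
# TypeError: Instance and class checks can only be used with
# @runtime_checkable protocols
```

Note that the runtime check only verifies that the attributes exist; it does not validate their types.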

Comment on lines +203 to +206
```python
def _resolve_binary_base_dir() -> str:
    base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
    return os.path.abspath(base_dir)
```
⚠️ Potential issue | 🔴 Critical

Security concern: Insecure temporary directory usage

Using /tmp/crawl4ai-exports is a security risk as it's world-writable and shared among all users. This could lead to path traversal, symlink attacks, or unauthorized access to exported files.

Consider using a user-specific directory or configuring a secure location:

```diff
 def _resolve_binary_base_dir() -> str:
-    base_dir = config.get("binary", {}).get("default_dir", "/tmp/crawl4ai-exports")
+    # Use container's app directory or a secure user-specific location
+    base_dir = config.get("binary", {}).get("default_dir", "/app/exports")
     return os.path.abspath(base_dir)
```

Alternatively, use Python's tempfile module for secure temporary directories:

```python
import tempfile

def _resolve_binary_base_dir() -> str:
    base_dir = config.get("binary", {}).get("default_dir", None)
    if base_dir is None:
        # Create a secure user-specific temp directory
        base_dir = os.path.join(tempfile.gettempdir(), f"crawl4ai-exports-{os.getuid()}")
    return os.path.abspath(base_dir)
```
🧰 Tools
🪛 Ruff (0.13.1)

204-204: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports"

(S108)

🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 203 to 206, the function currently
defaults to the world-writable path "/tmp/crawl4ai-exports" which is insecure;
change it to use a config override when provided and otherwise build a
user-specific temp directory (e.g., join tempfile.gettempdir() with a name
containing the current uid) instead of a fixed /tmp path, ensure the directory
is created securely with restricted permissions (mkdir with 0o700) and
verify/resolve its real path to avoid symlink attacks; add the tempfile import
and ensure the function returns the absolute, real path.

Comment on lines +218 to +222
"""Legacy function - use error_handler.parse_network_error() instead."""
from error_handler import parse_network_error, handle_error
network_error = parse_network_error(error_msg, operation)
error_response = handle_error(network_error, operation)
return error_response.error_message
🛠️ Refactor suggestion | 🟠 Major

Remove shadow imports in legacy compatibility function

The function re-imports `parse_network_error` and `handle_error`, which are already imported at the module level, shadowing those names.

Remove the redundant import statement:

```diff
 def _extract_user_friendly_error(error_msg: str, operation: str = "operation") -> str:
     """Legacy function - use error_handler.parse_network_error() instead."""
-    from error_handler import parse_network_error, handle_error
     network_error = parse_network_error(error_msg, operation)
     error_response = handle_error(network_error, operation)
     return error_response.error_message
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

```diff
-    """Legacy function - use error_handler.parse_network_error() instead."""
-    from error_handler import parse_network_error, handle_error
-    network_error = parse_network_error(error_msg, operation)
-    error_response = handle_error(network_error, operation)
-    return error_response.error_message
+def _extract_user_friendly_error(error_msg: str, operation: str = "operation") -> str:
+    """Legacy function - use error_handler.parse_network_error() instead."""
+    network_error = parse_network_error(error_msg, operation)
+    error_response = handle_error(network_error, operation)
+    return error_response.error_message
```
🧰 Tools
🪛 Ruff (0.13.1)

219-219: Redefinition of unused parse_network_error from line 69

Remove definition: parse_network_error

(F811)


219-219: Redefinition of unused handle_error from line 69

Remove definition: handle_error

(F811)

🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 218 to 222, the legacy compatibility
function re-imports parse_network_error and handle_error which shadows the
module-level imports; remove the redundant "from error_handler import
parse_network_error, handle_error" line so the function uses the
already-imported names, leaving the rest of the function unchanged and run
tests/lint to confirm no unresolved references.

Comment on lines +273 to 276
```diff
+        raise HTTPException(400, f"Configuration validation error: {str(e)}")
     except Exception as e:
-        raise HTTPException(400, str(e))
+        raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.")
```
🛠️ Refactor suggestion | 🟠 Major

Improve exception handling with chained exceptions

When re-raising inside an `except` clause, use `raise ... from e` so the original exception is preserved on `__cause__` for debugging.

Apply this fix to preserve exception context:

```diff
     try:
         return JSONResponse(_safe_eval_config(raw.code.strip()))
     except ValueError as e:
-        raise HTTPException(400, f"Configuration validation error: {str(e)}")
+        raise HTTPException(400, f"Configuration validation error: {str(e)}") from e
     except Exception as e:
-        raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.")
+        raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") from e
```
📝 Committable suggestion (review before committing):

```diff
-        raise HTTPException(400, f"Configuration validation error: {str(e)}")
-    except Exception as e:
-        raise HTTPException(400, str(e))
-        raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.")
+    try:
+        return JSONResponse(_safe_eval_config(raw.code.strip()))
+    except ValueError as e:
+        raise HTTPException(400, f"Configuration validation error: {str(e)}") from e
+    except Exception as e:
+        raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") from e
```
🧰 Tools
🪛 Ruff (0.13.1)

273-273: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


273-273: Use explicit conversion flag

Replace with conversion flag

(RUF010)


274-274: Do not catch blind exception: Exception

(BLE001)


275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


275-275: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 273 to 276, the code re-raises new
HTTPException instances without chaining the original exception; update the
raise statements to use exception chaining (raise HTTPException(<status>,
<message>) from e) so the original exception context is preserved for
debugging—apply this to the Configuration validation except and the generic
except Exception block.
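A small self-contained demo of what `from e` buys you: the original exception survives on `__cause__`, so tracebacks show the real failure instead of only the wrapper. The `HTTPException` here is a local stand-in for FastAPI's, defined only for illustration:

```python
class HTTPException(Exception):
    # Minimal stand-in for fastapi.HTTPException, for illustration only.
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code

def parse_config(code: str):
    try:
        # Placeholder for _safe_eval_config(); compile() fails on bad syntax.
        return compile(code, "<config>", "eval")
    except Exception as e:
        # "from e" chains the original error onto the new one.
        raise HTTPException(400, f"Failed to parse configuration: {e}") from e

try:
    parse_config("CrawlerRunConfig(")  # unbalanced paren -> SyntaxError
except HTTPException as exc:
    print(type(exc.__cause__).__name__)  # SyntaxError
```

Without `from e`, Python would still record the original error on `__context__`, but `from e` makes the chaining explicit and survives `raise ... from None` auditing.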

@leoric-crown
Author

👋 Update: Clean Architecture Approach

After getting the initial SSE→HTTP migration working in this PR, I've created a clean architectural refactor that delivers the same functionality with production-ready code quality.

New PR: #1525 - feat(mcp): migrate to modern HTTP transport with clean architecture refactor

Why a separate PR?

  • Clean History: 6 focused commits vs 22 iterative commits here
  • Architecture Focus: Service-oriented design with proper separation of concerns
  • Production Ready: 35 new tests with 1,507 lines of comprehensive coverage
  • Review Efficiency: Easier for maintainers to assess the refined implementation

Same Result, Better Implementation:

  • ✅ All 7 MCP tools working with HTTP transport
  • ✅ Zero breaking changes and full backward compatibility
  • ✅ Comprehensive error handling with correlation IDs
  • ✅ Modern FastMCP 2.12.3 integration

This PR (#1519) represents the valuable iterative learning process that led to the clean solution in #1525. Both approaches solve the same problem - the new PR just presents it in a more maintainable, production-ready form.

Thanks for understanding the approach! 🚀


Successfully merging this pull request may close these issues.

[Bug]: MCP SSE is failing with "Unexpected message"
