
Conversation

@RobuRishabh (Contributor) commented Jan 8, 2026

Description

Adds a configurable accelerator_device parameter to both the Standard and VLM Docling pipelines, allowing users to switch between CPU and GPU for document conversion based on their infrastructure resources.

Changes:

  • Added an accelerator_device parameter (auto, cpu, gpu) to the docling_convert_standard and docling_convert_vlm components
  • Added a docling_accelerator_device pipeline parameter to both standard_convert_pipeline.py and vlm_convert_pipeline.py
  • Added input validation with clear error messages for invalid device values (a hedged sketch of this validation appears after this list)
  • Updated local_run.py for both pipelines with a comprehensive test suite, including:
      • Normal functionality tests (CPU, auto-detect, different configs)
      • Failure scenario tests (validation of invalid parameters)
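
For reference, a minimal sketch of the validation and mapping inside the converter components (the accelerator_device_map name and the "gpu" to CUDA mapping come from the review below; the Docling import path and the surrounding code are assumptions and may not match the actual components line for line):

# Assumed import path; recent Docling versions also expose these under
# docling.datamodel.accelerator_options.
from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions

accelerator_device = "gpu"  # component parameter: "auto", "cpu", or "gpu"
num_threads = 4

accelerator_device_map = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,  # user-facing "gpu" maps to CUDA internally
}

if accelerator_device not in accelerator_device_map:
    raise ValueError(
        f"Invalid accelerator_device '{accelerator_device}'. "
        f"Valid options: {sorted(accelerator_device_map)}"
    )

accelerator_options = AcceleratorOptions(
    num_threads=num_threads,
    device=accelerator_device_map[accelerator_device],
)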

How Has This Been Tested?

Ran the local test suite (requires Docker); the tests cover the scenarios below (a minimal local-run sketch follows the list):

  1. baseline_cpu - CPU-only processing
  2. auto_detect - Auto device detection
  3. gpu_test - GPU/CUDA acceleration
  4. Various backend/table mode combinations
  5. Validation failure scenarios (invalid device, backend, table mode, etc.)
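
For reference, a minimal sketch of a single local run (this assumes the KFP SDK 2.x local Docker runner and the convert_pipeline_test entry point described in the walkthrough; the exact signature and parameter set may differ):

from kfp import local

# convert_pipeline_test and the docling_* parameter names are taken from this
# PR's local_run.py; treat the exact signature as an assumption.
from local_run import convert_pipeline_test

local.init(runner=local.DockerRunner())

# One CPU run; switch docling_accelerator_device to "auto" or "gpu" as needed.
convert_pipeline_test(
    docling_accelerator_device="cpu",
    docling_num_threads=4,
    docling_image_export_mode="embedded",
)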

Merge criteria:
[x] The commits are squashed in a cohesive manner and have meaningful messages.
[x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
[x] The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added manual accelerator device selection (auto/CPU/GPU) to conversion components and pipelines.
    • Introduced parameterized test-run capabilities with configurable scenarios, failure cases, and per-test orchestration.
  • Tests

    • Added comprehensive test-run scaffolding: execution, output validation, timing, and aggregated reporting with exit status.
  • Documentation

    • Clarified docstrings for key pipeline and test entry points.


@RobuRishabh requested a review from a team as a code owner on January 8, 2026 at 20:17.
coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Added test orchestration and validation scaffolding to docling-standard and docling-vlm, plus a new accelerator device parameter propagated through components, pipeline signatures, and compiled YAMLs to allow selecting auto/cpu/gpu.

Changes

  • Test Framework Infrastructure (kubeflow-pipelines/docling-standard/local_run.py, kubeflow-pipelines/docling-vlm/local_run.py): Added TEST_CONFIGS and FAILURE_SCENARIOS. Added a take_first_split() component, validate_output(), run_test_scenario(), and a parameterized convert_pipeline_test(). Reworked main() to initialize a Docker local runner, execute normal and failure test phases, aggregate results, and emit a summary and exit code.
  • Standard Pipeline Components & Pipeline (kubeflow-pipelines/docling-standard/standard_components.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml): Added an accelerator_device: str = "auto" parameter to docling_convert_standard() and propagated it through convert_pipeline. Introduced accelerator_device_map and validation; replaced the hardcoded AcceleratorDevice with the mapped value. Updated the compiled YAML to add a docling_accelerator_device top-level input and wire accelerator_device into the component and for-loop channels (see the wiring sketch after this list).
  • VLM Pipeline Components & Pipeline (kubeflow-pipelines/docling-vlm/vlm_components.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml): Added an accelerator_device: str = "auto" parameter to docling_convert_vlm() and propagated it through convert_pipeline. Added accelerator_device_map with validation and used the mapped AcceleratorDevice in AcceleratorOptions. Updated the compiled YAML to add a docling_accelerator_device root input, component input, and for-loop wiring.
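
A sketch of how the new parameter is threaded from the pipeline signature into the component call (the pipeline, component, and parameter names come from the files above; other parameters are elided and the exact signatures may differ):

from kfp import dsl

# Component defined in standard_components.py in this PR.
from standard_components import docling_convert_standard

@dsl.pipeline
def convert_pipeline(
    docling_accelerator_device: str = "auto",
    # ... other docling_* pipeline parameters elided ...
):
    # The pipeline-level docling_accelerator_device value is forwarded to the
    # component under its shorter name, accelerator_device.
    docling_convert_standard(
        accelerator_device=docling_accelerator_device,
        # ... other component inputs elided ...
    )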

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LocalRunner
    participant TestOrch as Test Orchestrator
    participant Pipeline
    participant Validator
    participant Reporter

    User->>LocalRunner: main()
    LocalRunner->>TestOrch: initialize runner
    TestOrch->>Pipeline: run_test_scenario(name, config)
    Pipeline->>Pipeline: convert_pipeline_test(...) with accelerator param
    Pipeline-->>TestOrch: pipeline execution result
    TestOrch->>Validator: validate_output(output_dir)
    Validator-->>TestOrch: validation result
    TestOrch->>Reporter: collect result
    Reporter-->>User: print summary & exit
sequenceDiagram
    participant Client
    participant Pipeline
    participant Component
    participant Converter

    Client->>Pipeline: convert_pipeline(docling_accelerator_device="gpu")
    Pipeline->>Component: docling_convert_*(accelerator_device="gpu")
    Component->>Component: validate & map accelerator_device
    Component->>Converter: init AcceleratorOptions(device=CUDA)
    Converter-->>Component: conversion result
    Component-->>Pipeline: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through tests with configs bright,

I split and validate by moonlight,
Auto, CPU, or GPU,
I pick the fastest path for you,
Pipelines hum — a carrot treat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 78.57%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'added accelerator devices' is directly related to the main change: adding configurable accelerator_device parameters to both Standard and VLM Docling pipelines to enable CPU/GPU device selection.


coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @kubeflow-pipelines/docling-standard/local_run.py:
- Around line 1-3: File local_run.py has formatting issues preventing CI; run
the formatter and commit the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.

In @kubeflow-pipelines/docling-vlm/local_run.py:
- Around line 124-197: The validate_output function is implemented but never
invoked; call validate_output(output_dir, test_name) from run_test_scenario
after a successful pipeline run (i.e., when should_fail is False and
artifacts/output_dir exist), capture the returned validation_results, set test
status to FAIL if validation_results["passed"] is False, and include
validation_results in the run_test_scenario return payload; alternatively, if
you prefer removal, delete validate_output and any imports it requires. Ensure
you reference the validate_output function, run_test_scenario, should_fail,
output_dir, and test_name when making the change.
- Line 1: The file local_run.py is failing CI due to formatting; run the
auto-formatter and commit the result: execute `ruff format
kubeflow-pipelines/docling-vlm/local_run.py` (or run your project's
pre-commit/formatter) to fix whitespace/formatting issues and then stage/commit
the updated file so the import and any other code in local_run.py conforms to
the project's ruff style rules.
- Around line 42-47: TEST_CONFIGS currently includes an active "gpu_test" entry
which contradicts the "Uncomment if you have GPU available" comment and will
fail on CPU-only machines; either remove or comment out the "gpu_test" dict from
TEST_CONFIGS, or replace the static entry with conditional inclusion that checks
GPU availability (e.g., using torch.cuda.is_available() or an environment flag)
before adding the "gpu_test" config so the GPU-only test is only present when a
GPU is detected.
🧹 Nitpick comments (3)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

87-96: LGTM! Consider extracting shared validation logic.

The implementation is correct and consistent with standard_components.py. The accelerator_device_map and validation logic are duplicated between both component files.

For maintainability, you could consider extracting this shared logic to a common utility in the common module, but this is optional given the current scope.
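
If that extraction were done, the shared helper could look something like this (the module location under kubeflow-pipelines/common and the function name are hypothetical; the enum import should match whatever the components already use):

# kubeflow-pipelines/common/accelerator.py (hypothetical location)
from docling.datamodel.pipeline_options import AcceleratorDevice

_ACCELERATOR_DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,
}

def resolve_accelerator_device(name: str) -> AcceleratorDevice:
    """Translate the user-facing device name into Docling's enum, or raise ValueError."""
    try:
        return _ACCELERATOR_DEVICE_MAP[name.lower()]
    except KeyError:
        raise ValueError(
            f"Invalid accelerator_device '{name}'. "
            f"Valid options: {sorted(_ACCELERATOR_DEVICE_MAP)}"
        ) from None

If the converters are defined as lightweight Python components, the helper would also need to be importable inside the component container (for example by packaging the common module into the image), which is part of why keeping the duplication may be acceptable for now.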

kubeflow-pipelines/docling-standard/local_run.py (1)

301-326: Simplify by removing unnecessary else after return.

The static analysis hint suggests removing the else block since the if branch returns. This improves readability.

♻️ Suggested simplification
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-vlm/local_run.py (1)

254-279: Consider removing unnecessary else after return.

Per static analysis, the else block at line 269 is redundant since the if block returns. De-indenting improves readability.

♻️ Proposed refactor
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37b97a0 and 45dd82b.

📒 Files selected for processing (8)
  • kubeflow-pipelines/docling-standard/local_run.py
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/local_run.py
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (5)
  • take_first_split (71-73)
  • convert_pipeline_test (80-117)
  • validate_output (124-197)
  • run_test_scenario (204-287)
  • main (294-356)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (5)
  • take_first_split (106-108)
  • convert_pipeline_test (115-166)
  • validate_output (172-245)
  • run_test_scenario (251-334)
  • main (340-402)
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-standard/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

kubeflow-pipelines/docling-vlm/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 115-115: Too many positional arguments (8/5)

(R0917)


[refactor] 304-326: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 257-279: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (12)
kubeflow-pipelines/docling-standard/standard_components.py (1)

95-104: LGTM! Clean validation and mapping implementation.

The accelerator_device_map provides a user-friendly abstraction over the internal AcceleratorDevice enum, and the validation with descriptive error messages is well implemented. The mapping of "gpu" to AcceleratorDevice.CUDA is a sensible UX choice.

Also applies to: 167-169

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

29-29: LGTM!

The parameter is correctly added to the pipeline signature and properly forwarded to the VLM converter component. The naming convention with docling_ prefix at the pipeline level and shorter name at the component level is consistent with other parameters in both pipelines.

Also applies to: 77-77

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

29-29: LGTM!

The accelerator device parameter is correctly added and threaded through to the standard converter component, maintaining consistency with the VLM pipeline implementation.

Also applies to: 90-90

kubeflow-pipelines/docling-standard/local_run.py (3)

172-245: validate_output function is defined but never called.

This function validates output quality but isn't invoked in the test runner flow. Was this intended for future use, or should it be integrated into run_test_scenario for successful test cases?


23-59: LGTM! Comprehensive test configurations.

The test matrix covers key scenarios: CPU-only, auto-detect, GPU/CUDA, different backends, and table modes. The structure aligns well with the VLM pipeline's test configs.


61-99: LGTM! Good coverage of failure scenarios.

The failure scenarios properly test validation for all configurable parameters: device, backend, table mode, image mode, and OCR engine. The expected_error patterns will correctly match the ValueError messages from the components.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

107-111: LGTM! Pipeline wiring is correctly generated.

The compiled YAML correctly propagates docling_accelerator_device through all pipeline layers:

  • Top-level pipeline input (lines 840-843)
  • For-loop component input (lines 235-236, 799-800)
  • Converter component parameter binding (lines 203-204)
  • Component input definition (lines 107-111)

The embedded component code reflects the source changes accurately.

Also applies to: 203-204, 235-236, 799-800, 840-843

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

116-120: LGTM! Standard pipeline compiled YAML is correctly updated.

The docling_accelerator_device parameter is properly wired through the pipeline:

  • Top-level input definition (lines 947-950)
  • For-loop input binding (lines 299-300, 888-889)
  • Component invocation (lines 252-253)
  • Component parameter schema (lines 116-120)

The embedded component code accurately reflects the source changes from standard_components.py.

Also applies to: 252-253, 299-300, 888-889, 947-950

kubeflow-pipelines/docling-vlm/local_run.py (4)

50-65: LGTM!

The failure scenarios are well-structured with clear expected error messages for validation testing.


70-73: LGTM!

The helper component is consistent with the standard pipeline implementation.


79-117: LGTM!

The test pipeline correctly wires the accelerator device and other parameters to the VLM converter component.


294-356: LGTM!

The main entry point correctly orchestrates test execution, collects results, and provides a clear summary with appropriate exit codes.

Comment on lines +1 to +3
import json
import sys
import time

⚠️ Potential issue | 🟡 Minor

Fix formatting to pass CI pipeline.

The pipeline failure indicates this file needs formatting. Run the suggested command to fix:

ruff format kubeflow-pipelines/docling-standard/local_run.py
🧰 Tools
🪛 GitHub Actions: Python Formatting Validation

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-standard/local_run.py around lines 1 - 3, File
local_run.py has formatting issues preventing CI; run the formatter and commit
the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.

@@ -1,8 +1,9 @@
import json

⚠️ Potential issue | 🟡 Minor

Fix formatting to pass CI.

The pipeline failure indicates this file needs formatting. Run ruff format kubeflow-pipelines/docling-vlm/local_run.py to fix.

🧰 Tools
🪛 GitHub Actions: Python Formatting Validation

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py at line 1, The file local_run.py
is failing CI due to formatting; run the auto-formatter and commit the result:
execute `ruff format kubeflow-pipelines/docling-vlm/local_run.py` (or run your
project's pre-commit/formatter) to fix whitespace/formatting issues and then
stage/commit the updated file so the import and any other code in local_run.py
conforms to the project's ruff style rules.

Comment on lines +42 to +47
    # Uncomment if you have GPU available
    "gpu_test": {
        "docling_accelerator_device": "gpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
    },

⚠️ Potential issue | 🟡 Minor

Comment says to uncomment but gpu_test is already active.

The comment suggests uncommenting gpu_test if GPU is available, but it's already included in TEST_CONFIGS. This will cause test failures on systems without GPU. Either comment out the test case or remove the misleading comment and add GPU availability detection.

Option 1: Comment out as the instruction suggests
     # Uncomment if you have GPU available
-    "gpu_test": {
-        "docling_accelerator_device": "gpu",
-        "docling_num_threads": 4,
-        "docling_image_export_mode": "embedded",
-    },
+    # "gpu_test": {
+    #     "docling_accelerator_device": "gpu",
+    #     "docling_num_threads": 4,
+    #     "docling_image_export_mode": "embedded",
+    # },
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 42 - 47,
TEST_CONFIGS currently includes an active "gpu_test" entry which contradicts the
"Uncomment if you have GPU available" comment and will fail on CPU-only
machines; either remove or comment out the "gpu_test" dict from TEST_CONFIGS, or
replace the static entry with conditional inclusion that checks GPU availability
(e.g., using torch.cuda.is_available() or an environment flag) before adding the
"gpu_test" config so the GPU-only test is only present when a GPU is detected.

Comment on lines +124 to +199
def validate_output(output_dir: Path, test_name: str) -> dict:
    """
    Validate the output quality and format.

    Args:
        output_dir: Path to the output directory containing converted files.
        test_name: Name of the test for reporting.

    Returns:
        Dictionary containing validation results and checks.
    """
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing_error"] = str(e)
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing_error"] = str(e)
            validation_results["passed"] = False

    # Overall pass/fail
    validation_results["passed"] = all(
        [
            validation_results["checks"].get("json_exists", False),
            validation_results["checks"].get("md_exists", False),
            validation_results["checks"].get("has_content", False),
        ]
    )

    return validation_results

⚠️ Potential issue | 🟡 Minor

validate_output is defined but never called.

This function implements output validation but is never invoked in run_test_scenario or main. Either integrate it into the test flow to validate outputs after successful pipeline runs, or remove it if not needed.

Example integration into run_test_scenario
# After successful pipeline run in run_test_scenario:
if not should_fail:
    # Optionally validate output quality
    # validation = validate_output(output_dir, test_name)
    # if not validation["passed"]:
    #     return {"test_name": test_name, "status": "FAIL", ...}
    pass

Would you like me to help integrate validate_output into the test flow, or should this function be removed?
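
A slightly fuller sketch of that integration, using the names already present in run_test_scenario (the extra "validation" key in the result dict is an assumption):

# Inside run_test_scenario, after a successful pipeline run:
if not should_fail:
    validation = validate_output(output_dir, test_name)
    if not validation["passed"]:
        return {
            "test_name": test_name,
            "status": "FAIL",
            "reason": "Output validation failed",
            "validation": validation,
            "elapsed_time": elapsed,
        }
    return {
        "test_name": test_name,
        "status": "PASS",
        "reason": "Converted and validated successfully",
        "validation": validation,
        "elapsed_time": elapsed,
    }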

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 124 - 197, The
validate_output function is implemented but never invoked; call
validate_output(output_dir, test_name) from run_test_scenario after a successful
pipeline run (i.e., when should_fail is False and artifacts/output_dir exist),
capture the returned validation_results, set test status to FAIL if
validation_results["passed"] is False, and include validation_results in the
run_test_scenario return payload; alternatively, if you prefer removal, delete
validate_output and any imports it requires. Ensure you reference the
validate_output function, run_test_scenario, should_fail, output_dir, and
test_name when making the change.

Signed-off-by: roburishabh <roburishabh@outlook.com>
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
kubeflow-pipelines/docling-vlm/local_run.py (2)

126-199: Remove unused validation function.

The validate_output function is defined but never called anywhere in the code. Either remove it or integrate it into the test flow.

🧹 Option 1: Remove the unused function
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-

257-282: Simplify control flow by removing unnecessary else.

The else block after Line 271 is unnecessary since the preceding if block returns early.

♻️ Simplify the control flow
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-standard/local_run.py (2)

177-250: Remove unused validation function.

The validate_output function is defined but never called. Either remove it or integrate it into the test workflow (same issue as in the VLM local_run.py).

🧹 Remove the unused function
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-

308-333: Simplify control flow by removing unnecessary else.

The else block after Line 322 is unnecessary since the preceding if block returns early.

♻️ Simplify the control flow
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45dd82b and a3e41f4.

📒 Files selected for processing (8)
  • kubeflow-pipelines/docling-standard/local_run.py
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/local_run.py
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 118-118: Too many positional arguments (8/5)

(R0917)


[refactor] 311-333: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 260-282: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (13)
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

29-29: LGTM! Clean parameter addition and propagation.

The docling_accelerator_device parameter is correctly added to the pipeline signature and properly propagated to the docling_convert_standard component.

Also applies to: 90-90

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

29-29: LGTM! Consistent with standard pipeline implementation.

The docling_accelerator_device parameter is correctly added and propagated, following the same pattern as the standard pipeline.

Also applies to: 77-77

kubeflow-pipelines/docling-vlm/vlm_components.py (3)

23-23: LGTM! Well-documented parameter addition.

The accelerator_device parameter is properly added with a sensible default and clear docstring.

Also applies to: 38-38


87-96: LGTM! Robust validation with clear error messages.

The validation logic correctly maps user-friendly device names to AcceleratorDevice enum values. The "gpu" → CUDA mapping is intuitive and the error message provides helpful guidance.


178-178: LGTM! Correct usage of validated device mapping.

The mapped accelerator device is properly applied to the AcceleratorOptions configuration.

kubeflow-pipelines/docling-vlm/local_run.py (4)

18-65: LGTM! Comprehensive test coverage.

The test configurations cover normal functionality (CPU, auto-detect, GPU, various modes) and failure scenarios (invalid device, invalid image mode). This provides solid validation of the new parameter.


71-74: LGTM! Clear and simple helper component.

The take_first_split component is well-documented and handles the edge case of empty splits.


81-118: LGTM! Well-structured test pipeline.

The parameterized test pipeline enables flexible testing of different accelerator device configurations.


207-290: LGTM! Well-designed test orchestration.

The test framework effectively validates both successful conversions and proper error handling for invalid parameters. The two-phase approach (normal tests + failure scenarios) provides comprehensive coverage.

Also applies to: 298-360

kubeflow-pipelines/docling-standard/local_run.py (4)

23-99: LGTM! Excellent test coverage for standard pipeline.

The test configurations comprehensively cover different backends (dlparse_v4, pypdfium2), table modes (accurate, fast), and failure scenarios (invalid device, backend, table mode, image mode, OCR engine). This provides thorough validation of the new accelerator_device parameter and existing options.


106-109: LGTM! Clear helper component.

The take_first_split helper is well-documented and handles edge cases properly.


117-169: LGTM! Comprehensive test pipeline.

The parameterized test pipeline enables thorough testing of accelerator device configurations alongside other conversion parameters.


258-341: LGTM! Robust test orchestration with excellent coverage.

The test framework comprehensively validates the standard pipeline's accelerator device parameter alongside other configuration options. The two-phase approach effectively catches both successful conversions and validation errors.

Also applies to: 349-411
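
For context, the two-phase orchestration being described is roughly the following shape (a sketch assembled from the names in the walkthrough, not the literal code from local_run.py):

import sys

def run_all() -> None:
    """Hypothetical driver: run normal tests, then failure scenarios, then summarize."""
    results = []
    # Phase 1: scenarios expected to convert successfully.
    for name, config in TEST_CONFIGS.items():
        results.append(run_test_scenario(name, config))
    # Phase 2: scenarios expected to be rejected by parameter validation.
    for name, config in FAILURE_SCENARIOS.items():
        results.append(run_test_scenario(name, config))

    failed = [r for r in results if r["status"] != "PASS"]
    print(f"{len(results) - len(failed)}/{len(results)} tests passed")
    sys.exit(1 if failed else 0)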

Comment on lines +27 to +41
"auto_detect": {
"docling_accelerator_device": "auto",
"docling_num_threads": 4,
"docling_image_export_mode": "embedded",
},
"minimal_threads": {
"docling_accelerator_device": "cpu",
"docling_num_threads": 2,
"docling_image_export_mode": "embedded",
},
"placeholder_images": {
"docling_accelerator_device": "cpu",
"docling_num_threads": 4,
"docling_image_export_mode": "placeholder",
},
Contributor

I think we can remove these test cases; one with CPU and one with GPU is sufficient. The idea here is to test our pipeline code, not to cover the Docling functionality comprehensively.

Comment on lines +50 to +65
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
@shruthis4 (Contributor) commented Jan 13, 2026

What is the idea behind these test cases?

@RobuRishabh (Contributor, Author) commented Jan 13, 2026

Just different failure scenarios: I wanted to check that the pipeline fails when it is required to fail for the different invalid arguments.

Contributor

As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, since the standard test already covers the overlapping cases.

Comment on lines +38 to +51
"pypdfium2_backend": {
"docling_accelerator_device": "cpu",
"docling_pdf_backend": "pypdfium2",
"docling_table_mode": "fast",
"docling_num_threads": 2,
"docling_ocr": False,
},
"fast_table_mode": {
"docling_accelerator_device": "cpu",
"docling_pdf_backend": "dlparse_v4",
"docling_table_mode": "fast",
"docling_num_threads": 4,
"docling_ocr": True,
},
Contributor

Again, we should not go into this level of detail for such a workflow-level test. You could merge these cases with the previous ones. For this change, I see the cases with the device set to "auto", "cpu", and "gpu" as critical.

Comment on lines +62 to +67
"invalid_device": {
"docling_accelerator_device": "invalid_device",
"docling_pdf_backend": "dlparse_v4",
"docling_table_mode": "accurate",
"should_fail": True,
"expected_error": "Invalid accelerator_device",
Contributor

We are just passing the parameter on to Docling to trigger the error, correct? In that case, testing the Docling error should not be this test's responsibility.

Comment on lines +68 to +98
    },
    "invalid_backend": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "invalid_backend",
        "docling_table_mode": "accurate",
        "should_fail": True,
        "expected_error": "Invalid pdf_backend",
    },
    "invalid_table_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid table_mode",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
    "invalid_ocr_engine": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_ocr_engine": "invalid_engine",
        "should_fail": True,
        "expected_error": "Invalid ocr_engine",
    },
Contributor

My comment above applies to all of these test cases.

Comment on lines +50 to +65
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
Contributor

As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, since the standard test already covers the overlapping cases.

mergify bot commented Jan 15, 2026

🔀 Merge Conflict Detected

This pull request has merge conflicts that must be resolved before it can be merged.
@RobuRishabh please rebase your branch.

Syncing a Fork Guide
