
Conversation

@RobuRishabh (Contributor) commented Jan 8, 2026

Description

Adds a configurable accelerator_device parameter to both the Standard and VLM Docling pipelines, allowing users to switch between CPU and GPU for document conversion based on their infrastructure resources.

Changes:

  • Added an accelerator_device parameter (auto, cpu, gpu) to the docling_convert_standard and docling_convert_vlm components
  • Added a docling_accelerator_device pipeline parameter to both standard_convert_pipeline.py and vlm_convert_pipeline.py
  • Added input validation with clear error messages for invalid device values (a hedged sketch of this validation appears after this list)
  • Updated local_run.py for both pipelines with a comprehensive test suite, including:
      • Normal functionality tests (CPU, auto-detect, different configs)
      • Failure scenario tests (validation of invalid parameters)
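
For reference, a minimal sketch of the validation and mapping inside the converter components (the accelerator_device_map name and the "gpu" to CUDA mapping come from the review below; the Docling import path and the surrounding code are assumptions and may not match the actual components line for line):

# Assumed import path; recent Docling versions also expose these under
# docling.datamodel.accelerator_options.
from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions

accelerator_device = "gpu"  # component parameter: "auto", "cpu", or "gpu"
num_threads = 4

accelerator_device_map = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,  # user-facing "gpu" maps to CUDA internally
}

if accelerator_device not in accelerator_device_map:
    raise ValueError(
        f"Invalid accelerator_device '{accelerator_device}'. "
        f"Valid options: {sorted(accelerator_device_map)}"
    )

accelerator_options = AcceleratorOptions(
    num_threads=num_threads,
    device=accelerator_device_map[accelerator_device],
)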

How Has This Been Tested?

Ran the local test suite (requires Docker); the tests cover the scenarios below (a minimal local-run sketch follows the list):

  1. baseline_cpu - CPU-only processing
  2. auto_detect - Auto device detection
  3. gpu_test - GPU/CUDA acceleration
  4. Various backend/table mode combinations
  5. Validation failure scenarios (invalid device, backend, table mode, etc.)
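
For reference, a minimal sketch of a single local run (this assumes the KFP SDK 2.x local Docker runner and the convert_pipeline_test entry point described in the walkthrough; the exact signature and parameter set may differ):

from kfp import local

# convert_pipeline_test and the docling_* parameter names are taken from this
# PR's local_run.py; treat the exact signature as an assumption.
from local_run import convert_pipeline_test

local.init(runner=local.DockerRunner())

# One CPU run; switch docling_accelerator_device to "auto" or "gpu" as needed.
convert_pipeline_test(
    docling_accelerator_device="cpu",
    docling_num_threads=4,
    docling_image_export_mode="embedded",
)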

Merge criteria:
[x] The commits are squashed in a cohesive manner and have meaningful messages.
[x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
[x] The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added manual accelerator device selection (auto/CPU/GPU) to conversion components and pipelines.
    • Introduced parameterized test-run capabilities with configurable scenarios, failure cases, and per-test orchestration.
  • Tests

    • Added comprehensive test-run scaffolding: execution, output validation, timing, and aggregated reporting with exit status.
  • Documentation

    • Clarified docstrings for key pipeline and test entry points.


@RobuRishabh requested a review from a team as a code owner on January 8, 2026 at 20:17.
coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Added test orchestration and validation scaffolding to docling-standard and docling-vlm, plus a new accelerator device parameter propagated through components, pipeline signatures, and compiled YAMLs to allow selecting auto/cpu/gpu.

Changes

  • Test Framework Infrastructure (kubeflow-pipelines/docling-standard/local_run.py, kubeflow-pipelines/docling-vlm/local_run.py): Added TEST_CONFIGS and FAILURE_SCENARIOS. Added a take_first_split() component, validate_output(), run_test_scenario(), and a parameterized convert_pipeline_test(). Reworked main() to initialize a Docker local runner, execute normal and failure test phases, aggregate results, and emit a summary and exit code.
  • Standard Pipeline Components & Pipeline (kubeflow-pipelines/docling-standard/standard_components.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml): Added an accelerator_device: str = "auto" parameter to docling_convert_standard() and propagated it through convert_pipeline. Introduced accelerator_device_map and validation; replaced the hardcoded AcceleratorDevice with the mapped value. Updated the compiled YAML to add a docling_accelerator_device top-level input and wire accelerator_device into the component and for-loop channels (see the wiring sketch after this list).
  • VLM Pipeline Components & Pipeline (kubeflow-pipelines/docling-vlm/vlm_components.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml): Added an accelerator_device: str = "auto" parameter to docling_convert_vlm() and propagated it through convert_pipeline. Added accelerator_device_map with validation and used the mapped AcceleratorDevice in AcceleratorOptions. Updated the compiled YAML to add a docling_accelerator_device root input, component input, and for-loop wiring.
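
A sketch of how the new parameter is threaded from the pipeline signature into the component call (the pipeline, component, and parameter names come from the files above; other parameters are elided and the exact signatures may differ):

from kfp import dsl

# Component defined in standard_components.py in this PR.
from standard_components import docling_convert_standard

@dsl.pipeline
def convert_pipeline(
    docling_accelerator_device: str = "auto",
    # ... other docling_* pipeline parameters elided ...
):
    # The pipeline-level docling_accelerator_device value is forwarded to the
    # component under its shorter name, accelerator_device.
    docling_convert_standard(
        accelerator_device=docling_accelerator_device,
        # ... other component inputs elided ...
    )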

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LocalRunner
    participant TestOrch as Test Orchestrator
    participant Pipeline
    participant Validator
    participant Reporter

    User->>LocalRunner: main()
    LocalRunner->>TestOrch: initialize runner
    TestOrch->>Pipeline: run_test_scenario(name, config)
    Pipeline->>Pipeline: convert_pipeline_test(...) with accelerator param
    Pipeline-->>TestOrch: pipeline execution result
    TestOrch->>Validator: validate_output(output_dir)
    Validator-->>TestOrch: validation result
    TestOrch->>Reporter: collect result
    Reporter-->>User: print summary & exit
sequenceDiagram
    participant Client
    participant Pipeline
    participant Component
    participant Converter

    Client->>Pipeline: convert_pipeline(docling_accelerator_device="gpu")
    Pipeline->>Component: docling_convert_*(accelerator_device="gpu")
    Component->>Component: validate & map accelerator_device
    Component->>Converter: init AcceleratorOptions(device=CUDA)
    Converter-->>Component: conversion result
    Component-->>Pipeline: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through tests with configs bright,

I split and validate by moonlight,
Auto, CPU, or GPU,
I pick the fastest path for you,
Pipelines hum — a carrot treat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 78.57%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'added accelerator devices' is directly related to the main change: adding configurable accelerator_device parameters to both Standard and VLM Docling pipelines to enable CPU/GPU device selection.


coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @kubeflow-pipelines/docling-standard/local_run.py:
- Around line 1-3: File local_run.py has formatting issues preventing CI; run
the formatter and commit the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.

In @kubeflow-pipelines/docling-vlm/local_run.py:
- Around line 124-197: The validate_output function is implemented but never
invoked; call validate_output(output_dir, test_name) from run_test_scenario
after a successful pipeline run (i.e., when should_fail is False and
artifacts/output_dir exist), capture the returned validation_results, set test
status to FAIL if validation_results["passed"] is False, and include
validation_results in the run_test_scenario return payload; alternatively, if
you prefer removal, delete validate_output and any imports it requires. Ensure
you reference the validate_output function, run_test_scenario, should_fail,
output_dir, and test_name when making the change.
- Line 1: The file local_run.py is failing CI due to formatting; run the
auto-formatter and commit the result: execute `ruff format
kubeflow-pipelines/docling-vlm/local_run.py` (or run your project's
pre-commit/formatter) to fix whitespace/formatting issues and then stage/commit
the updated file so the import and any other code in local_run.py conforms to
the project's ruff style rules.
- Around line 42-47: TEST_CONFIGS currently includes an active "gpu_test" entry
which contradicts the "Uncomment if you have GPU available" comment and will
fail on CPU-only machines; either remove or comment out the "gpu_test" dict from
TEST_CONFIGS, or replace the static entry with conditional inclusion that checks
GPU availability (e.g., using torch.cuda.is_available() or an environment flag)
before adding the "gpu_test" config so the GPU-only test is only present when a
GPU is detected.
🧹 Nitpick comments (3)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

87-96: LGTM! Consider extracting shared validation logic.

The implementation is correct and consistent with standard_components.py. The accelerator_device_map and validation logic are duplicated between both component files.

For maintainability, you could consider extracting this shared logic to a common utility in the common module, but this is optional given the current scope.
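
If that extraction were done, the shared helper could look something like this (the module location under kubeflow-pipelines/common and the function name are hypothetical; the enum import should match whatever the components already use):

# kubeflow-pipelines/common/accelerator.py (hypothetical location)
from docling.datamodel.pipeline_options import AcceleratorDevice

_ACCELERATOR_DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,
}

def resolve_accelerator_device(name: str) -> AcceleratorDevice:
    """Translate the user-facing device name into Docling's enum, or raise ValueError."""
    try:
        return _ACCELERATOR_DEVICE_MAP[name.lower()]
    except KeyError:
        raise ValueError(
            f"Invalid accelerator_device '{name}'. "
            f"Valid options: {sorted(_ACCELERATOR_DEVICE_MAP)}"
        ) from None

If the converters are defined as lightweight Python components, the helper would also need to be importable inside the component container (for example by packaging the common module into the image), which is part of why keeping the duplication may be acceptable for now.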

kubeflow-pipelines/docling-standard/local_run.py (1)

301-326: Simplify by removing unnecessary else after return.

The static analysis hint suggests removing the else block since the if branch returns. This improves readability.

♻️ Suggested simplification
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-vlm/local_run.py (1)

254-279: Consider removing unnecessary else after return.

Per static analysis, the else block at line 269 is redundant since the if block returns. De-indenting improves readability.

♻️ Proposed refactor
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37b97a0 and 45dd82b.

📒 Files selected for processing (8)
  • kubeflow-pipelines/docling-standard/local_run.py
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/local_run.py
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (5)
  • take_first_split (71-73)
  • convert_pipeline_test (80-117)
  • validate_output (124-197)
  • run_test_scenario (204-287)
  • main (294-356)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (5)
  • take_first_split (106-108)
  • convert_pipeline_test (115-166)
  • validate_output (172-245)
  • run_test_scenario (251-334)
  • main (340-402)
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-standard/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

kubeflow-pipelines/docling-vlm/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 115-115: Too many positional arguments (8/5)

(R0917)


[refactor] 304-326: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 257-279: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (12)
kubeflow-pipelines/docling-standard/standard_components.py (1)

95-104: LGTM! Clean validation and mapping implementation.

The accelerator_device_map provides a user-friendly abstraction over the internal AcceleratorDevice enum, and the validation with descriptive error messages is well implemented. The mapping of "gpu" to AcceleratorDevice.CUDA is a sensible UX choice.

Also applies to: 167-169

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

29-29: LGTM!

The parameter is correctly added to the pipeline signature and properly forwarded to the VLM converter component. The naming convention with docling_ prefix at the pipeline level and shorter name at the component level is consistent with other parameters in both pipelines.

Also applies to: 77-77

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

29-29: LGTM!

The accelerator device parameter is correctly added and threaded through to the standard converter component, maintaining consistency with the VLM pipeline implementation.

Also applies to: 90-90

kubeflow-pipelines/docling-standard/local_run.py (3)

172-245: validate_output function is defined but never called.

This function validates output quality but isn't invoked in the test runner flow. Was this intended for future use, or should it be integrated into run_test_scenario for successful test cases?


23-59: LGTM! Comprehensive test configurations.

The test matrix covers key scenarios: CPU-only, auto-detect, GPU/CUDA, different backends, and table modes. The structure aligns well with the VLM pipeline's test configs.


61-99: LGTM! Good coverage of failure scenarios.

The failure scenarios properly test validation for all configurable parameters: device, backend, table mode, image mode, and OCR engine. The expected_error patterns will correctly match the ValueError messages from the components.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

107-111: LGTM! Pipeline wiring is correctly generated.

The compiled YAML correctly propagates docling_accelerator_device through all pipeline layers:

  • Top-level pipeline input (lines 840-843)
  • For-loop component input (lines 235-236, 799-800)
  • Converter component parameter binding (lines 203-204)
  • Component input definition (lines 107-111)

The embedded component code reflects the source changes accurately.

Also applies to: 203-204, 235-236, 799-800, 840-843

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

116-120: LGTM! Standard pipeline compiled YAML is correctly updated.

The docling_accelerator_device parameter is properly wired through the pipeline:

  • Top-level input definition (lines 947-950)
  • For-loop input binding (lines 299-300, 888-889)
  • Component invocation (lines 252-253)
  • Component parameter schema (lines 116-120)

The embedded component code accurately reflects the source changes from standard_components.py.

Also applies to: 252-253, 299-300, 888-889, 947-950

kubeflow-pipelines/docling-vlm/local_run.py (4)

50-65: LGTM!

The failure scenarios are well-structured with clear expected error messages for validation testing.


70-73: LGTM!

The helper component is consistent with the standard pipeline implementation.


79-117: LGTM!

The test pipeline correctly wires the accelerator device and other parameters to the VLM converter component.


294-356: LGTM!

The main entry point correctly orchestrates test execution, collects results, and provides a clear summary with appropriate exit codes.

Comment on lines +1 to +3
import json
import sys
import time

⚠️ Potential issue | 🟡 Minor

Fix formatting to pass CI pipeline.

The pipeline failure indicates this file needs formatting. Run the suggested command to fix:

ruff format kubeflow-pipelines/docling-standard/local_run.py
🧰 Tools
🪛 GitHub Actions: Python Formatting Validation

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-standard/local_run.py around lines 1 - 3, File
local_run.py has formatting issues preventing CI; run the formatter and commit
the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.

@@ -1,8 +1,9 @@
import json

⚠️ Potential issue | 🟡 Minor

Fix formatting to pass CI.

The pipeline failure indicates this file needs formatting. Run ruff format kubeflow-pipelines/docling-vlm/local_run.py to fix.

🧰 Tools
🪛 GitHub Actions: Python Formatting Validation

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/__init__.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py at line 1, The file local_run.py
is failing CI due to formatting; run the auto-formatter and commit the result:
execute `ruff format kubeflow-pipelines/docling-vlm/local_run.py` (or run your
project's pre-commit/formatter) to fix whitespace/formatting issues and then
stage/commit the updated file so the import and any other code in local_run.py
conforms to the project's ruff style rules.

Comment on lines +42 to +47
    # Uncomment if you have GPU available
    "gpu_test": {
        "docling_accelerator_device": "gpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
    },

⚠️ Potential issue | 🟡 Minor

Comment says to uncomment but gpu_test is already active.

The comment suggests uncommenting gpu_test if GPU is available, but it's already included in TEST_CONFIGS. This will cause test failures on systems without GPU. Either comment out the test case or remove the misleading comment and add GPU availability detection.

Option 1: Comment out as the instruction suggests
     # Uncomment if you have GPU available
-    "gpu_test": {
-        "docling_accelerator_device": "gpu",
-        "docling_num_threads": 4,
-        "docling_image_export_mode": "embedded",
-    },
+    # "gpu_test": {
+    #     "docling_accelerator_device": "gpu",
+    #     "docling_num_threads": 4,
+    #     "docling_image_export_mode": "embedded",
+    # },
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 42 - 47,
TEST_CONFIGS currently includes an active "gpu_test" entry which contradicts the
"Uncomment if you have GPU available" comment and will fail on CPU-only
machines; either remove or comment out the "gpu_test" dict from TEST_CONFIGS, or
replace the static entry with conditional inclusion that checks GPU availability
(e.g., using torch.cuda.is_available() or an environment flag) before adding the
"gpu_test" config so the GPU-only test is only present when a GPU is detected.

Comment on lines +124 to +199
def validate_output(output_dir: Path, test_name: str) -> dict:
    """
    Validate the output quality and format.

    Args:
        output_dir: Path to the output directory containing converted files.
        test_name: Name of the test for reporting.

    Returns:
        Dictionary containing validation results and checks.
    """
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing_error"] = str(e)
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing_error"] = str(e)
            validation_results["passed"] = False

    # Overall pass/fail
    validation_results["passed"] = all(
        [
            validation_results["checks"].get("json_exists", False),
            validation_results["checks"].get("md_exists", False),
            validation_results["checks"].get("has_content", False),
        ]
    )

    return validation_results

⚠️ Potential issue | 🟡 Minor

validate_output is defined but never called.

This function implements output validation but is never invoked in run_test_scenario or main. Either integrate it into the test flow to validate outputs after successful pipeline runs, or remove it if not needed.

Example integration into run_test_scenario
# After successful pipeline run in run_test_scenario:
if not should_fail:
    # Optionally validate output quality
    # validation = validate_output(output_dir, test_name)
    # if not validation["passed"]:
    #     return {"test_name": test_name, "status": "FAIL", ...}
    pass

Would you like me to help integrate validate_output into the test flow, or should this function be removed?
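
A slightly fuller sketch of that integration, using the names already present in run_test_scenario (the extra "validation" key in the result dict is an assumption):

# Inside run_test_scenario, after a successful pipeline run:
if not should_fail:
    validation = validate_output(output_dir, test_name)
    if not validation["passed"]:
        return {
            "test_name": test_name,
            "status": "FAIL",
            "reason": "Output validation failed",
            "validation": validation,
            "elapsed_time": elapsed,
        }
    return {
        "test_name": test_name,
        "status": "PASS",
        "reason": "Converted and validated successfully",
        "validation": validation,
        "elapsed_time": elapsed,
    }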

🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 124 - 197, The
validate_output function is implemented but never invoked; call
validate_output(output_dir, test_name) from run_test_scenario after a successful
pipeline run (i.e., when should_fail is False and artifacts/output_dir exist),
capture the returned validation_results, set test status to FAIL if
validation_results["passed"] is False, and include validation_results in the
run_test_scenario return payload; alternatively, if you prefer removal, delete
validate_output and any imports it requires. Ensure you reference the
validate_output function, run_test_scenario, should_fail, output_dir, and
test_name when making the change.

Signed-off-by: roburishabh <roburishabh@outlook.com>
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
kubeflow-pipelines/docling-vlm/local_run.py (2)

126-199: Remove unused validation function.

The validate_output function is defined but never called anywhere in the code. Either remove it or integrate it into the test flow.

🧹 Option 1: Remove the unused function
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-

257-282: Simplify control flow by removing unnecessary else.

The else block after Line 271 is unnecessary since the preceding if block returns early.

♻️ Simplify the control flow
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-standard/local_run.py (2)

177-250: Remove unused validation function.

The validate_output function is defined but never called. Either remove it or integrate it into the test workflow (same issue as in the VLM local_run.py).

🧹 Remove the unused function
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-

308-333: Simplify control flow by removing unnecessary else.

The else block after Line 322 is unnecessary since the preceding if block returns early.

♻️ Simplify the control flow
         if should_fail:
             expected_error = config.get("expected_error", "")
             # Check if pipeline failed as expected (validation worked)
             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print("   Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"   Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"   Expected error containing: '{expected_error}'")
-                print(f"   Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": "Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"   Expected error containing: '{expected_error}'")
+            print(f"   Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": "Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45dd82b and a3e41f4.

📒 Files selected for processing (8)
  • kubeflow-pipelines/docling-standard/local_run.py
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/local_run.py
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 118-118: Too many positional arguments (8/5)

(R0917)


[refactor] 311-333: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 260-282: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (13)
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

29-29: LGTM! Clean parameter addition and propagation.

The docling_accelerator_device parameter is correctly added to the pipeline signature and properly propagated to the docling_convert_standard component.

Also applies to: 90-90

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

29-29: LGTM! Consistent with standard pipeline implementation.

The docling_accelerator_device parameter is correctly added and propagated, following the same pattern as the standard pipeline.

Also applies to: 77-77

kubeflow-pipelines/docling-vlm/vlm_components.py (3)

23-23: LGTM! Well-documented parameter addition.

The accelerator_device parameter is properly added with a sensible default and clear docstring.

Also applies to: 38-38


87-96: LGTM! Robust validation with clear error messages.

The validation logic correctly maps user-friendly device names to AcceleratorDevice enum values. The "gpu" → CUDA mapping is intuitive and the error message provides helpful guidance.


178-178: LGTM! Correct usage of validated device mapping.

The mapped accelerator device is properly applied to the AcceleratorOptions configuration.

kubeflow-pipelines/docling-vlm/local_run.py (4)

18-65: LGTM! Comprehensive test coverage.

The test configurations cover normal functionality (CPU, auto-detect, GPU, various modes) and failure scenarios (invalid device, invalid image mode). This provides solid validation of the new parameter.


71-74: LGTM! Clear and simple helper component.

The take_first_split component is well-documented and handles the edge case of empty splits.


81-118: LGTM! Well-structured test pipeline.

The parameterized test pipeline enables flexible testing of different accelerator device configurations.


207-290: LGTM! Well-designed test orchestration.

The test framework effectively validates both successful conversions and proper error handling for invalid parameters. The two-phase approach (normal tests + failure scenarios) provides comprehensive coverage.

Also applies to: 298-360

kubeflow-pipelines/docling-standard/local_run.py (4)

23-99: LGTM! Excellent test coverage for standard pipeline.

The test configurations comprehensively cover different backends (dlparse_v4, pypdfium2), table modes (accurate, fast), and failure scenarios (invalid device, backend, table mode, image mode, OCR engine). This provides thorough validation of the new accelerator_device parameter and existing options.


106-109: LGTM! Clear helper component.

The take_first_split helper is well-documented and handles edge cases properly.


117-169: LGTM! Comprehensive test pipeline.

The parameterized test pipeline enables thorough testing of accelerator device configurations alongside other conversion parameters.


258-341: LGTM! Robust test orchestration with excellent coverage.

The test framework comprehensively validates the standard pipeline's accelerator device parameter alongside other configuration options. The two-phase approach effectively catches both successful conversions and validation errors.

Also applies to: 349-411
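
For context, the two-phase orchestration being described is roughly the following shape (a sketch assembled from the names in the walkthrough, not the literal code from local_run.py):

import sys

def run_all() -> None:
    """Hypothetical driver: run normal tests, then failure scenarios, then summarize."""
    results = []
    # Phase 1: scenarios expected to convert successfully.
    for name, config in TEST_CONFIGS.items():
        results.append(run_test_scenario(name, config))
    # Phase 2: scenarios expected to be rejected by parameter validation.
    for name, config in FAILURE_SCENARIOS.items():
        results.append(run_test_scenario(name, config))

    failed = [r for r in results if r["status"] != "PASS"]
    print(f"{len(results) - len(failed)}/{len(results)} tests passed")
    sys.exit(1 if failed else 0)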

Comment on lines +27 to +41
"auto_detect": {
"docling_accelerator_device": "auto",
"docling_num_threads": 4,
"docling_image_export_mode": "embedded",
},
"minimal_threads": {
"docling_accelerator_device": "cpu",
"docling_num_threads": 2,
"docling_image_export_mode": "embedded",
},
"placeholder_images": {
"docling_accelerator_device": "cpu",
"docling_num_threads": 4,
"docling_image_export_mode": "placeholder",
},
Contributor

I think we can remove these test cases; one with CPU and one with GPU is sufficient. The idea here is to test our pipeline code, not to cover the Docling functionality comprehensively.

Comment on lines +50 to +65
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
@shruthis4 (Contributor) commented Jan 13, 2026

What is the idea behind these test cases?

@RobuRishabh (Contributor, Author) commented Jan 13, 2026

Just different failure scenarios: I wanted to check that the pipeline fails when it is required to fail for the different invalid arguments.

Contributor

As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, since the standard test already covers the overlapping cases.

Comment on lines +38 to +51
"pypdfium2_backend": {
"docling_accelerator_device": "cpu",
"docling_pdf_backend": "pypdfium2",
"docling_table_mode": "fast",
"docling_num_threads": 2,
"docling_ocr": False,
},
"fast_table_mode": {
"docling_accelerator_device": "cpu",
"docling_pdf_backend": "dlparse_v4",
"docling_table_mode": "fast",
"docling_num_threads": 4,
"docling_ocr": True,
},
Contributor

Again, we should not go into this level of detail for such a workflow-level test. You could merge these cases with the previous ones. For this change, I see the cases with the device set to "auto", "cpu", and "gpu" as critical.

Comment on lines +62 to +67
"invalid_device": {
"docling_accelerator_device": "invalid_device",
"docling_pdf_backend": "dlparse_v4",
"docling_table_mode": "accurate",
"should_fail": True,
"expected_error": "Invalid accelerator_device",
Contributor

We are just passing the parameter on to Docling to trigger the error, correct? In that case, testing the Docling error should not be this test's responsibility.

Comment on lines +68 to +98
    },
    "invalid_backend": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "invalid_backend",
        "docling_table_mode": "accurate",
        "should_fail": True,
        "expected_error": "Invalid pdf_backend",
    },
    "invalid_table_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid table_mode",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
    "invalid_ocr_engine": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_ocr_engine": "invalid_engine",
        "should_fail": True,
        "expected_error": "Invalid ocr_engine",
    },
Contributor

My comment above applies to all of these test cases.

Comment on lines +50 to +65
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
Contributor

As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, since the standard test already covers the overlapping cases.

mergify bot commented Jan 15, 2026

🔀 Merge Conflict Detected

This pull request has merge conflicts that must be resolved before it can be merged.
@RobuRishabh please rebase your branch.

Syncing a Fork Guide
