added accelerator devices #79
base: main
Conversation
📝 Walkthrough
Added test orchestration and validation scaffolding to docling-standard and docling-vlm, plus a new accelerator device parameter propagated through components, pipeline signatures, and compiled YAMLs to allow selecting auto/cpu/gpu.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant LocalRunner
participant TestOrch as Test Orchestrator
participant Pipeline
participant Validator
participant Reporter
User->>LocalRunner: main()
LocalRunner->>TestOrch: initialize runner
TestOrch->>Pipeline: run_test_scenario(name, config)
Pipeline->>Pipeline: convert_pipeline_test(...) with accelerator param
Pipeline-->>TestOrch: pipeline execution result
TestOrch->>Validator: validate_output(output_dir)
Validator-->>TestOrch: validation result
TestOrch->>Reporter: collect result
Reporter-->>User: print summary & exit
sequenceDiagram
participant Client
participant Pipeline
participant Component
participant Converter
Client->>Pipeline: convert_pipeline(docling_accelerator_device="gpu")
Pipeline->>Component: docling_convert_*(accelerator_device="gpu")
Component->>Component: validate & map accelerator_device
Component->>Converter: init AcceleratorOptions(device=CUDA)
Converter-->>Component: conversion result
Component-->>Pipeline: result
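The "validate & map" and "init AcceleratorOptions" steps in the diagram correspond, roughly, to a mapping like the sketch below. The map name comes from the review comments; the helper name and the exact import path are illustrative and may differ from the PR (the Docling import location varies by version):

```python
from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions

# User-facing strings accepted by the pipeline parameter, mapped to Docling's enum.
# "gpu" is aliased to CUDA for a friendlier interface.
accelerator_device_map = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,
}


def build_accelerator_options(accelerator_device: str, num_threads: int = 4) -> AcceleratorOptions:
    """Validate the user string and build AcceleratorOptions for the converter."""
    if accelerator_device not in accelerator_device_map:
        raise ValueError(
            f"Invalid accelerator_device '{accelerator_device}'. "
            f"Valid options: {', '.join(accelerator_device_map)}"
        )
    return AcceleratorOptions(
        device=accelerator_device_map[accelerator_device],
        num_threads=num_threads,
    )
```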
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @kubeflow-pipelines/docling-standard/local_run.py:
- Around line 1-3: File local_run.py has formatting issues preventing CI; run
the formatter and commit the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.
In @kubeflow-pipelines/docling-vlm/local_run.py:
- Around line 124-197: The validate_output function is implemented but never
invoked; call validate_output(output_dir, test_name) from run_test_scenario
after a successful pipeline run (i.e., when should_fail is False and
artifacts/output_dir exist), capture the returned validation_results, set test
status to FAIL if validation_results["passed"] is False, and include
validation_results in the run_test_scenario return payload; alternatively, if
you prefer removal, delete validate_output and any imports it requires. Ensure
you reference the validate_output function, run_test_scenario, should_fail,
output_dir, and test_name when making the change.
- Line 1: The file local_run.py is failing CI due to formatting; run the
auto-formatter and commit the result: execute `ruff format
kubeflow-pipelines/docling-vlm/local_run.py` (or run your project's
pre-commit/formatter) to fix whitespace/formatting issues and then stage/commit
the updated file so the import and any other code in local_run.py conforms to
the project's ruff style rules.
- Around line 42-47: TEST_CONFIGS currently includes an active "gpu_test" entry
which contradicts the "Uncomment if you have GPU available" comment and will
fail on CPU-only machines; either remove or comment out the "gpu_test" dict from
TEST_CONFIGS, or replace the static entry with conditional inclusion that checks
GPU availability (e.g., using torch.cuda.is_available() or an environment flag)
before adding the "gpu_test" config so the GPU-only test is only present when a
GPU is detected.
🧹 Nitpick comments (3)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
87-96: LGTM! Consider extracting shared validation logic.
The implementation is correct and consistent with standard_components.py. The accelerator_device_map and validation logic are duplicated between both component files. For maintainability, you could consider extracting this shared logic to a common utility in the common module, but this is optional given the current scope.
kubeflow-pipelines/docling-standard/local_run.py (1)
301-326: Simplify by removing the unnecessary else after return.
The static analysis hint suggests removing the else block since the if branch returns. This improves readability.
♻️ Suggested simplification
```diff
    if should_fail:
        expected_error = config.get("expected_error", "")
        # Check if pipeline failed as expected (validation worked)
        if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
            print(f"✅ TEST PASSED: {test_name} - Failed as expected")
            print(" Pipeline correctly rejected invalid parameter")
            if expected_error:
                print(f" Expected error type: '{expected_error}'")
            return {
                "test_name": test_name,
                "status": "PASS",
                "reason": "Failed as expected - validation working",
                "error": error_msg[:200],
                "elapsed_time": elapsed,
            }
-        else:
-            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-            print(f" Expected error containing: '{expected_error}'")
-            print(f" Got: {error_msg[:200]}")
-            return {
-                "test_name": test_name,
-                "status": "FAIL",
-                "reason": "Wrong error type",
-                "error": error_msg,
-                "elapsed_time": elapsed,
-            }
+        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+        print(f" Expected error containing: '{expected_error}'")
+        print(f" Got: {error_msg[:200]}")
+        return {
+            "test_name": test_name,
+            "status": "FAIL",
+            "reason": "Wrong error type",
+            "error": error_msg,
+            "elapsed_time": elapsed,
+        }
```
kubeflow-pipelines/docling-vlm/local_run.py (1)
254-279: Consider removing the unnecessary else after return.
Per static analysis, the else block at line 269 is redundant since the if block returns. De-indenting improves readability.
♻️ Proposed refactor
```diff
    if should_fail:
        expected_error = config.get("expected_error", "")
        # Check if pipeline failed as expected (validation worked)
        if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
            print(f"✅ TEST PASSED: {test_name} - Failed as expected")
            print(" Pipeline correctly rejected invalid parameter")
            if expected_error:
                print(f" Expected error type: '{expected_error}'")
            return {
                "test_name": test_name,
                "status": "PASS",
                "reason": "Failed as expected - validation working",
                "error": error_msg[:200],
                "elapsed_time": elapsed,
            }
-        else:
-            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-            print(f" Expected error containing: '{expected_error}'")
-            print(f" Got: {error_msg[:200]}")
-            return {
-                "test_name": test_name,
-                "status": "FAIL",
-                "reason": "Wrong error type",
-                "error": error_msg,
-                "elapsed_time": elapsed,
-            }
+
+        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+        print(f" Expected error containing: '{expected_error}'")
+        print(f" Got: {error_msg[:200]}")
+        return {
+            "test_name": test_name,
+            "status": "FAIL",
+            "reason": "Wrong error type",
+            "error": error_msg,
+            "elapsed_time": elapsed,
+        }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- kubeflow-pipelines/docling-standard/local_run.py
- kubeflow-pipelines/docling-standard/standard_components.py
- kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
- kubeflow-pipelines/docling-vlm/local_run.py
- kubeflow-pipelines/docling-vlm/vlm_components.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (5)
take_first_split (71-73), convert_pipeline_test (80-117), validate_output (124-197), run_test_scenario (204-287), main (294-356)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (5)
take_first_split (106-108), convert_pipeline_test (115-166), validate_output (172-245), run_test_scenario (251-334), main (340-402)
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-standard/local_run.py
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/init.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'
kubeflow-pipelines/docling-vlm/local_run.py
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/init.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 115-115: Too many positional arguments (8/5)
(R0917)
[refactor] 304-326: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 257-279: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: Summary
🔇 Additional comments (12)
kubeflow-pipelines/docling-standard/standard_components.py (1)
95-104: LGTM! Clean validation and mapping implementation.
The accelerator_device_map provides a user-friendly abstraction over the internal AcceleratorDevice enum, and the validation with descriptive error messages is well implemented. The mapping of "gpu" to AcceleratorDevice.CUDA is a sensible UX choice.
Also applies to: 167-169
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)
29-29: LGTM!
The parameter is correctly added to the pipeline signature and properly forwarded to the VLM converter component. The naming convention with the docling_ prefix at the pipeline level and the shorter name at the component level is consistent with other parameters in both pipelines.
Also applies to: 77-77
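The naming convention described above can be sketched with a stripped-down KFP pipeline. This is only an illustration of how the docling_-prefixed pipeline input is forwarded to the component under its shorter name; the real component has many more parameters, and the component name, base image, and return value here are assumptions:

```python
from kfp import dsl


@dsl.component(base_image="python:3.11")
def docling_convert_vlm(accelerator_device: str = "auto") -> str:
    # The component body would validate and map the device here.
    return accelerator_device


@dsl.pipeline(name="vlm-convert-sketch")
def convert_pipeline(docling_accelerator_device: str = "auto"):
    # The pipeline-level parameter carries the docling_ prefix; the component
    # keeps the shorter name, matching the convention noted in the review.
    docling_convert_vlm(accelerator_device=docling_accelerator_device)
```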
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)
29-29: LGTM!
The accelerator device parameter is correctly added and threaded through to the standard converter component, maintaining consistency with the VLM pipeline implementation.
Also applies to: 90-90
kubeflow-pipelines/docling-standard/local_run.py (3)
172-245: validate_output function is defined but never called.
This function validates output quality but isn't invoked in the test runner flow. Was this intended for future use, or should it be integrated into run_test_scenario for successful test cases?
23-59: LGTM! Comprehensive test configurations.
The test matrix covers key scenarios: CPU-only, auto-detect, GPU/CUDA, different backends, and table modes. The structure aligns well with the VLM pipeline's test configs.
61-99: LGTM! Good coverage of failure scenarios.
The failure scenarios properly test validation for all configurable parameters: device, backend, table mode, image mode, and OCR engine. The expected_error patterns will correctly match the ValueError messages from the components.
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
107-111: LGTM! Pipeline wiring is correctly generated.
The compiled YAML correctly propagates docling_accelerator_device through all pipeline layers:
- Top-level pipeline input (lines 840-843)
- For-loop component input (lines 235-236, 799-800)
- Converter component parameter binding (lines 203-204)
- Component input definition (lines 107-111)
The embedded component code reflects the source changes accurately.
Also applies to: 203-204, 235-236, 799-800, 840-843
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
116-120: LGTM! Standard pipeline compiled YAML is correctly updated.
The docling_accelerator_device parameter is properly wired through the pipeline:
- Top-level input definition (lines 947-950)
- For-loop input binding (lines 299-300, 888-889)
- Component invocation (lines 252-253)
- Component parameter schema (lines 116-120)
The embedded component code accurately reflects the source changes from standard_components.py.
Also applies to: 252-253, 299-300, 888-889, 947-950
kubeflow-pipelines/docling-vlm/local_run.py (4)
50-65: LGTM!
The failure scenarios are well-structured with clear expected error messages for validation testing.
70-73: LGTM!
The helper component is consistent with the standard pipeline implementation.
79-117: LGTM!
The test pipeline correctly wires the accelerator device and other parameters to the VLM converter component.
294-356: LGTM!
The main entry point correctly orchestrates test execution, collects results, and provides a clear summary with appropriate exit codes.
```python
import json
import sys
import time
```
Fix formatting to pass CI pipeline.
The pipeline failure indicates this file needs formatting. Run the suggested command to fix:
ruff format kubeflow-pipelines/docling-standard/local_run.py
🧰 Tools
🪛 GitHub Actions: Python Formatting Validation
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-standard/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/init.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-standard/local_run.py around lines 1 - 3, File
local_run.py has formatting issues preventing CI; run the formatter and commit
the changes. Run the suggested command (ruff format
kubeflow-pipelines/docling-standard/local_run.py) or apply your project's
formatter to the module local_run.py, verify the imports and file pass
ruff/flake8 checks, then add/commit the formatted file and push to the branch.
```
@@ -1,8 +1,9 @@
import json
```
Fix formatting to pass CI.
The pipeline failure indicates this file needs formatting. Run ruff format kubeflow-pipelines/docling-vlm/local_run.py to fix.
🧰 Tools
🪛 GitHub Actions: Python Formatting Validation
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format kubeflow-pipelines/docling-vlm/local_run.py' or 'ruff format --write' to fix formatting. Command: 'ruff format --check kubeflow-pipelines/common/init.py kubeflow-pipelines/common/components.py kubeflow-pipelines/common/constants.py kubeflow-pipelines/docling-standard/local_run.py kubeflow-pipelines/docling-standard/standard_components.py kubeflow-pipelines/docling-standard/standard_convert_pipeline.py kubeflow-pipelines/docling-vlm/local_run.py kubeflow-pipelines/docling-vlm/vlm_components.py kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py'
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py at line 1, The file local_run.py
is failing CI due to formatting; run the auto-formatter and commit the result:
execute `ruff format kubeflow-pipelines/docling-vlm/local_run.py` (or run your
project's pre-commit/formatter) to fix whitespace/formatting issues and then
stage/commit the updated file so the import and any other code in local_run.py
conforms to the project's ruff style rules.
```python
    # Uncomment if you have GPU available
    "gpu_test": {
        "docling_accelerator_device": "gpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
    },
```
Comment says to uncomment but gpu_test is already active.
The comment suggests uncommenting gpu_test if GPU is available, but it's already included in TEST_CONFIGS. This will cause test failures on systems without GPU. Either comment out the test case or remove the misleading comment and add GPU availability detection.
Option 1: Comment out as the instruction suggests
# Uncomment if you have GPU available
- "gpu_test": {
- "docling_accelerator_device": "gpu",
- "docling_num_threads": 4,
- "docling_image_export_mode": "embedded",
- },
+ # "gpu_test": {
+ # "docling_accelerator_device": "gpu",
+ # "docling_num_threads": 4,
+ # "docling_image_export_mode": "embedded",
+ # },
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
  # Uncomment if you have GPU available
- "gpu_test": {
-     "docling_accelerator_device": "gpu",
-     "docling_num_threads": 4,
-     "docling_image_export_mode": "embedded",
- },
+ # "gpu_test": {
+ #     "docling_accelerator_device": "gpu",
+ #     "docling_num_threads": 4,
+ #     "docling_image_export_mode": "embedded",
+ # },
```
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 42 - 47,
TEST_CONFIGS currently includes an active "gpu_test" entry which contradicts the
"Uncomment if you have GPU available" comment and will fail on CPU-only
machines; either remove or comment out the "gpu_test" dict from TEST_CONFIGS, or
replace the static entry with conditional inclusion that checks GPU availability
(e.g., using torch.cuda.is_available() or an environment flag) before adding the
"gpu_test" config so the GPU-only test is only present when a GPU is detected.
```python
def validate_output(output_dir: Path, test_name: str) -> dict:
    """
    Validate the output quality and format.
    Args:
        output_dir: Path to the output directory containing converted files.
        test_name: Name of the test for reporting.
    Returns:
        Dictionary containing validation results and checks.
    """
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing_error"] = str(e)
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing_error"] = str(e)
            validation_results["passed"] = False

    # Overall pass/fail
    validation_results["passed"] = all(
        [
            validation_results["checks"].get("json_exists", False),
            validation_results["checks"].get("md_exists", False),
            validation_results["checks"].get("has_content", False),
        ]
    )

    return validation_results
```
validate_output is defined but never called.
This function implements output validation but is never invoked in run_test_scenario or main. Either integrate it into the test flow to validate outputs after successful pipeline runs, or remove it if not needed.
Example integration into run_test_scenario
```python
# After successful pipeline run in run_test_scenario:
if not should_fail:
    # Optionally validate output quality
    # validation = validate_output(output_dir, test_name)
    # if not validation["passed"]:
    #     return {"test_name": test_name, "status": "FAIL", ...}
    pass
```
Would you like me to help integrate validate_output into the test flow, or should this function be removed?
🤖 Prompt for AI Agents
In @kubeflow-pipelines/docling-vlm/local_run.py around lines 124 - 197, The
validate_output function is implemented but never invoked; call
validate_output(output_dir, test_name) from run_test_scenario after a successful
pipeline run (i.e., when should_fail is False and artifacts/output_dir exist),
capture the returned validation_results, set test status to FAIL if
validation_results["passed"] is False, and include validation_results in the
run_test_scenario return payload; alternatively, if you prefer removal, delete
validate_output and any imports it requires. Ensure you reference the
validate_output function, run_test_scenario, should_fail, output_dir, and
test_name when making the change.
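A rough sketch of that integration, reusing the names from the prompt above; treat it as illustrative, since the surrounding run_test_scenario code (and how elapsed and output_dir are computed) may differ in the actual file:

```python
# Hypothetical placement inside run_test_scenario, after the pipeline call
# succeeds and artifacts have been written to output_dir.
if not should_fail:
    validation_results = validate_output(output_dir, test_name)
    status = "PASS" if validation_results["passed"] else "FAIL"
    return {
        "test_name": test_name,
        "status": status,
        "validation": validation_results,
        "elapsed_time": elapsed,
    }
```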
Signed-off-by: roburishabh <roburishabh@outlook.com>
Force-pushed from 45dd82b to a3e41f4
Actionable comments posted: 0
🧹 Nitpick comments (4)
kubeflow-pipelines/docling-vlm/local_run.py (2)
126-199: Remove unused validation function.
The validate_output function is defined but never called anywhere in the code. Either remove it or integrate it into the test flow.
🧹 Option 1: Remove the unused function
```diff
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-
```
257-282: Simplify control flow by removing unnecessary else.
The else block after line 271 is unnecessary since the preceding if block returns early.
♻️ Simplify the control flow
```diff
    if should_fail:
        expected_error = config.get("expected_error", "")
        # Check if pipeline failed as expected (validation worked)
        if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
            print(f"✅ TEST PASSED: {test_name} - Failed as expected")
            print(" Pipeline correctly rejected invalid parameter")
            if expected_error:
                print(f" Expected error type: '{expected_error}'")
            return {
                "test_name": test_name,
                "status": "PASS",
                "reason": "Failed as expected - validation working",
                "error": error_msg[:200],
                "elapsed_time": elapsed,
            }
-        else:
-            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-            print(f" Expected error containing: '{expected_error}'")
-            print(f" Got: {error_msg[:200]}")
-            return {
-                "test_name": test_name,
-                "status": "FAIL",
-                "reason": "Wrong error type",
-                "error": error_msg,
-                "elapsed_time": elapsed,
-            }
+
+        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+        print(f" Expected error containing: '{expected_error}'")
+        print(f" Got: {error_msg[:200]}")
+        return {
+            "test_name": test_name,
+            "status": "FAIL",
+            "reason": "Wrong error type",
+            "error": error_msg,
+            "elapsed_time": elapsed,
+        }
```
kubeflow-pipelines/docling-standard/local_run.py (2)
177-250: Remove unused validation function.
The validate_output function is defined but never called. Either remove it or integrate it into the test workflow (same issue as in the VLM local_run.py).
🧹 Remove the unused function
```diff
-def validate_output(output_dir: Path, test_name: str) -> dict:
-    """
-    Validate the output quality and format.
-
-    Args:
-        output_dir: Path to the output directory containing converted files.
-        test_name: Name of the test for reporting.
-
-    Returns:
-        Dictionary containing validation results and checks.
-    """
-    validation_results = {
-        "test_name": test_name,
-        "passed": True,
-        "checks": {},
-    }
-
-    # Check for JSON output
-    json_files = list(output_dir.glob("*.json"))
-    validation_results["checks"]["json_exists"] = len(json_files) > 0
-
-    # Check for Markdown output
-    md_files = list(output_dir.glob("*.md"))
-    validation_results["checks"]["md_exists"] = len(md_files) > 0
-
-    if json_files:
-        json_path = json_files[0]
-        try:
-            with open(json_path) as f:
-                data = json.load(f)
-
-            # Required fields validation
-            validation_results["checks"]["has_name"] = "name" in data
-            validation_results["checks"]["has_pages"] = "pages" in data
-
-            if "pages" in data:
-                validation_results["checks"]["pages_count"] = len(data["pages"])
-                validation_results["checks"]["has_content"] = len(data["pages"]) > 0
-
-                # Check page structure
-                if data["pages"]:
-                    first_page = data["pages"][0]
-                    validation_results["checks"]["page_has_number"] = (
-                        "page_number" in first_page
-                    )
-                    validation_results["checks"]["page_has_size"] = "size" in first_page
-        except Exception as e:
-            validation_results["checks"]["json_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    if md_files:
-        md_path = md_files[0]
-        try:
-            content = md_path.read_text()
-            validation_results["checks"]["md_length"] = len(content)
-            validation_results["checks"]["md_not_empty"] = len(content) > 100
-            validation_results["checks"]["md_has_headers"] = "#" in content
-            validation_results["checks"]["md_no_error"] = not content.startswith(
-                "Error"
-            )
-        except Exception as e:
-            validation_results["checks"]["md_parsing_error"] = str(e)
-            validation_results["passed"] = False
-
-    # Overall pass/fail
-    validation_results["passed"] = all(
-        [
-            validation_results["checks"].get("json_exists", False),
-            validation_results["checks"].get("md_exists", False),
-            validation_results["checks"].get("has_content", False),
-        ]
-    )
-
-    return validation_results
-
```
308-333: Simplify control flow by removing unnecessary else.
The else block after line 322 is unnecessary since the preceding if block returns early.
♻️ Simplify the control flow
```diff
    if should_fail:
        expected_error = config.get("expected_error", "")
        # Check if pipeline failed as expected (validation worked)
        if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
            print(f"✅ TEST PASSED: {test_name} - Failed as expected")
            print(" Pipeline correctly rejected invalid parameter")
            if expected_error:
                print(f" Expected error type: '{expected_error}'")
            return {
                "test_name": test_name,
                "status": "PASS",
                "reason": "Failed as expected - validation working",
                "error": error_msg[:200],
                "elapsed_time": elapsed,
            }
-        else:
-            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-            print(f" Expected error containing: '{expected_error}'")
-            print(f" Got: {error_msg[:200]}")
-            return {
-                "test_name": test_name,
-                "status": "FAIL",
-                "reason": "Wrong error type",
-                "error": error_msg,
-                "elapsed_time": elapsed,
-            }
+
+        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+        print(f" Expected error containing: '{expected_error}'")
+        print(f" Got: {error_msg[:200]}")
+        return {
+            "test_name": test_name,
+            "status": "FAIL",
+            "reason": "Wrong error type",
+            "error": error_msg,
+            "elapsed_time": elapsed,
+        }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- kubeflow-pipelines/docling-standard/local_run.py
- kubeflow-pipelines/docling-standard/standard_components.py
- kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
- kubeflow-pipelines/docling-vlm/local_run.py
- kubeflow-pipelines/docling-vlm/vlm_components.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
- kubeflow-pipelines/docling-standard/standard_components.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🧰 Additional context used
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 118-118: Too many positional arguments (8/5)
(R0917)
[refactor] 311-333: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 260-282: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: Summary
🔇 Additional comments (13)
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)
29-29: LGTM! Clean parameter addition and propagation.
The docling_accelerator_device parameter is correctly added to the pipeline signature and properly propagated to the docling_convert_standard component.
Also applies to: 90-90
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)
29-29: LGTM! Consistent with standard pipeline implementation.
The docling_accelerator_device parameter is correctly added and propagated, following the same pattern as the standard pipeline.
Also applies to: 77-77
kubeflow-pipelines/docling-vlm/vlm_components.py (3)
23-23: LGTM! Well-documented parameter addition.
The accelerator_device parameter is properly added with a sensible default and clear docstring.
Also applies to: 38-38
87-96: LGTM! Robust validation with clear error messages.
The validation logic correctly maps user-friendly device names to AcceleratorDevice enum values. The "gpu" → CUDA mapping is intuitive and the error message provides helpful guidance.
178-178: LGTM! Correct usage of validated device mapping.
The mapped accelerator device is properly applied to the AcceleratorOptions configuration.
kubeflow-pipelines/docling-vlm/local_run.py (4)
18-65: LGTM! Comprehensive test coverage.
The test configurations cover normal functionality (CPU, auto-detect, GPU, various modes) and failure scenarios (invalid device, invalid image mode). This provides solid validation of the new parameter.
71-74: LGTM! Clear and simple helper component.
The take_first_split component is well-documented and handles the edge case of empty splits.
81-118: LGTM! Well-structured test pipeline.
The parameterized test pipeline enables flexible testing of different accelerator device configurations.
207-290: LGTM! Well-designed test orchestration.
The test framework effectively validates both successful conversions and proper error handling for invalid parameters. The two-phase approach (normal tests + failure scenarios) provides comprehensive coverage.
Also applies to: 298-360
kubeflow-pipelines/docling-standard/local_run.py (4)
23-99: LGTM! Excellent test coverage for standard pipeline.
The test configurations comprehensively cover different backends (dlparse_v4, pypdfium2), table modes (accurate, fast), and failure scenarios (invalid device, backend, table mode, image mode, OCR engine). This provides thorough validation of the new accelerator_device parameter and existing options.
106-109: LGTM! Clear helper component.
The take_first_split helper is well-documented and handles edge cases properly.
117-169: LGTM! Comprehensive test pipeline.
The parameterized test pipeline enables thorough testing of accelerator device configurations alongside other conversion parameters.
258-341: LGTM! Robust test orchestration with excellent coverage.
The test framework comprehensively validates the standard pipeline's accelerator device parameter alongside other configuration options. The two-phase approach effectively catches both successful conversions and validation errors.
Also applies to: 349-411
| "auto_detect": { | ||
| "docling_accelerator_device": "auto", | ||
| "docling_num_threads": 4, | ||
| "docling_image_export_mode": "embedded", | ||
| }, | ||
| "minimal_threads": { | ||
| "docling_accelerator_device": "cpu", | ||
| "docling_num_threads": 2, | ||
| "docling_image_export_mode": "embedded", | ||
| }, | ||
| "placeholder_images": { | ||
| "docling_accelerator_device": "cpu", | ||
| "docling_num_threads": 4, | ||
| "docling_image_export_mode": "placeholder", | ||
| }, |
I think we can remove these test cases; one with CPU and another with GPU is sufficient. The idea here is to test our pipeline code, not to test the Docling functionality comprehensively.
```python
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
```
What is the idea behind these test cases?
Just different failure scenarios; I wanted to check whether they fail when they are required to fail for different arguments.
As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, as the standard test already covers the overlapping cases.
| "pypdfium2_backend": { | ||
| "docling_accelerator_device": "cpu", | ||
| "docling_pdf_backend": "pypdfium2", | ||
| "docling_table_mode": "fast", | ||
| "docling_num_threads": 2, | ||
| "docling_ocr": False, | ||
| }, | ||
| "fast_table_mode": { | ||
| "docling_accelerator_device": "cpu", | ||
| "docling_pdf_backend": "dlparse_v4", | ||
| "docling_table_mode": "fast", | ||
| "docling_num_threads": 4, | ||
| "docling_ocr": True, | ||
| }, |
Again, we should not go into this level of detail for such a workflow-level test. You could merge these cases with the previous ones. For this change I see the cases with device set to "auto", "cpu", and "gpu" as critical.
| "invalid_device": { | ||
| "docling_accelerator_device": "invalid_device", | ||
| "docling_pdf_backend": "dlparse_v4", | ||
| "docling_table_mode": "accurate", | ||
| "should_fail": True, | ||
| "expected_error": "Invalid accelerator_device", |
We are just passing the parameter on to Docling to trigger the error, correct? In that case, testing the Docling error should not be this test's responsibility.
```python
    },
    "invalid_backend": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "invalid_backend",
        "docling_table_mode": "accurate",
        "should_fail": True,
        "expected_error": "Invalid pdf_backend",
    },
    "invalid_table_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid table_mode",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
    "invalid_ocr_engine": {
        "docling_accelerator_device": "cpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_ocr_engine": "invalid_engine",
        "should_fail": True,
        "expected_error": "Invalid ocr_engine",
    },
```
My above comment applies to all these test cases
```python
FAILURE_SCENARIOS = {
    "invalid_device": {
        "docling_accelerator_device": "invalid_device",
        "docling_num_threads": 4,
        "docling_image_export_mode": "embedded",
        "should_fail": True,
        "expected_error": "Invalid accelerator_device",
    },
    "invalid_image_mode": {
        "docling_accelerator_device": "cpu",
        "docling_num_threads": 4,
        "docling_image_export_mode": "invalid_mode",
        "should_fail": True,
        "expected_error": "Invalid image_export_mode",
    },
}
```
As mentioned in my comment above, these should not be tested here. Also, keep the VLM test to what is absolutely needed, as the standard test already covers the overlapping cases.
🔀 Merge Conflict Detected
This pull request has merge conflicts that must be resolved before it can be merged.
Description
Adds a configurable accelerator_device parameter to both the Standard and VLM Docling pipelines, allowing users to switch between CPU and GPU for document conversion based on their infrastructure resources.
Changes:
How Has This Been Tested?
Run the local test suite (requires Docker); the tests cover:
Merge criteria:
[x] The commits are squashed in a cohesive manner and have meaningful messages.
[x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
[x] The developer has manually tested the changes and verified that the changes work
Summary by CodeRabbit
New Features
Tests
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.