docs(sagemaker): add missing entry.sh, sm_job_runner.py, and llm_ocr module for deepseek-ocr-sagemaker #2561
Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.
Reviewed by Cursor Bugbot for commit a370db8.
    x1 = int(raw_box[0] / 999 * width)
    y1 = int(raw_box[1] / 999 * height)
    x2 = int(raw_box[2] / 999 * width)
    y2 = int(raw_box[3] / 999 * height)
Missing coordinates null guard
High Severity
build_document_markdown always indexes block["coordinates"][0] even when extract_grounding_blocks left coordinates as None or parsing failed. A grounding block with missing or invalid detection data then raises during extract instead of skipping or degrading that block.
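One way to degrade gracefully is to validate each box before scaling it and to skip blocks that have none. The sketch below is illustrative only: `scale_box` and `usable_blocks` are hypothetical helpers, not functions from `llm_ocr`, and it assumes `coordinates` is either `None` or a list of four-number boxes on the 0-999 grid shown in the diff.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Box = Tuple[int, int, int, int]


def scale_box(raw_box: Any, width: int, height: int) -> Optional[Box]:
    """Map a 0-999 grid box to pixel coordinates; return None if it is unusable."""
    try:
        if len(raw_box) != 4:
            return None
        x1 = int(raw_box[0] / 999 * width)
        y1 = int(raw_box[1] / 999 * height)
        x2 = int(raw_box[2] / 999 * width)
        y2 = int(raw_box[3] / 999 * height)
    except (TypeError, ValueError):
        # None, non-sequences, and non-numeric entries all end up here.
        return None
    return (x1, y1, x2, y2)


def usable_blocks(
    blocks: Iterable[Dict[str, Any]], width: int, height: int
) -> Iterator[Tuple[Dict[str, Any], Box]]:
    """Yield (block, pixel_box) pairs, skipping blocks without a valid first box."""
    for block in blocks:
        coords = block.get("coordinates")
        if not isinstance(coords, (list, tuple)) or not coords:
            continue  # coordinates missing, None, or empty: skip this block
        box = scale_box(coords[0], width, height)
        if box is not None:
            yield block, box
```

Skipping is only one policy; a caller could instead keep the block's text and drop just the crop.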
    # Get storage backend and save
    storage = get_storage(repo_id=settings.hub.repo_id)
    storage.save_dataset(ds, "dataset")
Ignored dataset save failures
High Severity
Pipeline stages call storage.save_dataset but never check its boolean result. S3Storage and GCSStorage can return False after errors or missing output URIs, yet the stage still logs success and sm_job_runner writes _SUCCESS, so jobs can finish with no dataset in S3.
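A small wrapper could turn a falsy return value into an exception so the stage fails loudly and the runner never reaches the point of writing `_SUCCESS`. This is a sketch under assumptions: `save_or_raise` and `DatasetSaveError` are hypothetical names, and it assumes every backend returns a truthy value on success, which the comment implies but does not state for the HF Hub backend.

```python
from typing import Any


class DatasetSaveError(RuntimeError):
    """Raised when a storage backend reports that a dataset was not persisted."""


def save_or_raise(storage: Any, dataset: Any, name: str) -> None:
    """Call storage.save_dataset and raise if the backend reports failure."""
    ok = storage.save_dataset(dataset, name)
    if not ok:
        # Assumption: a falsy result means nothing usable was written.
        raise DatasetSaveError(
            f"{type(storage).__name__}.save_dataset({name!r}) reported failure"
        )
```

With this in place, a stage would call `save_or_raise(storage, ds, "dataset")` instead of discarding the result.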
        )
    except Exception as exc:
        LOGGER.error("DeepSeek request failed: %s", exc)
        raise
Retry settings never applied
Medium Severity
DeepSeekClient accepts max_retries and backoff settings from InferenceSettings, stores them on the instance, but _async_completion performs a single OpenAI request and re-raises on failure. Transient vLLM or network errors are not retried despite configuration.
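A retry loop with exponential backoff is one way the stored settings could be honored. The helper below is a generic sketch, not the client's code: `with_retries` and its parameter names are assumptions, and it retries on any `Exception` for brevity, whereas a real client would likely narrow this to transient error types.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    backoff_base: float = 0.5,
    backoff_cap: float = 8.0,
) -> T:
    """Await call(); on failure retry up to max_retries times with backoff."""
    attempt = 0
    while True:
        try:
            return await call()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Exponential delay, capped, plus jitter to avoid synchronized retries.
            delay = min(backoff_cap, backoff_base * (2 ** attempt))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
            attempt += 1
```

A caller would pass a zero-argument callable that builds a fresh request each time, for example `lambda: self._async_completion(p, t)`.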
    if not lookup:
        LOGGER.info("No descriptions generated")
        return
Describe exits without output
Medium Severity
After queuing figures (pending > 0), if no description lines are collected into lookup, run_stage_describe logs and returns without saving an updated dataset. The job can succeed while leaving the describe stage output missing for assemble.
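One possible shape for the fix is to always persist the dataset, applying descriptions only when some were collected. The sketch below is hypothetical: `finish_describe` and the `apply_descriptions` callable are illustrative names, and only `save_dataset(ds, "dataset")` mirrors a call shown in this review.

```python
import logging
from typing import Any, Callable, Dict

LOGGER = logging.getLogger(__name__)


def finish_describe(
    ds: Any,
    lookup: Dict[str, str],
    storage: Any,
    apply_descriptions: Callable[[Any, Dict[str, str]], Any],
) -> None:
    """Persist the describe-stage output even when no descriptions were produced."""
    if lookup:
        ds = apply_descriptions(ds, lookup)
    else:
        # Save the dataset unchanged so the assemble stage still finds its input.
        LOGGER.warning("No descriptions generated; saving dataset unchanged")
    if not storage.save_dataset(ds, "dataset"):
        raise RuntimeError("describe stage could not persist its dataset")
```

Whether an empty `lookup` should instead fail the job is a policy choice for the maintainers; the point is that the stage should not return silently.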
        asyncio.create_task(self._async_completion(p, t))
        for p, t in zip(payloads, timeouts)
    ]
    return await asyncio.gather(*tasks)
Max concurrency setting unused
Medium Severity
InferenceSettings.max_concurrency is loaded from env but never used. _async_infer_batch schedules one asyncio task per request in the batch with asyncio.gather, so concurrency equals batch size only and cannot be capped by EXTRACT_MAX_CONCURRENCY / DESCRIBE_MAX_CONCURRENCY.
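An `asyncio.Semaphore` is a common way to cap in-flight requests while still using `asyncio.gather`. The sketch below is generic: `gather_bounded` is a hypothetical helper, and wiring it to `max_concurrency` from the settings is an assumption about how the fix might look.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: Iterable[Callable[[], Awaitable[T]]],
    max_concurrency: int,
) -> List[T]:
    """Run all coroutines, with at most max_concurrency in flight at once."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    sem = asyncio.Semaphore(max_concurrency)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        # Each task waits for a free slot before starting its request.
        async with sem:
            return await factory()

    # gather returns results in the same order as the input factories.
    return await asyncio.gather(*(run(f) for f in factories))
```

In the batch method, each factory could be `lambda p=p, t=t: self._async_completion(p, t)`, so the coroutine is created only once a slot is available.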


This PR resolves #2553.
The sagemaker-notebook.ipynb notebook under docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ uses !tar to bundle helper scripts (entry.sh, sm_job_runner.py) and a local library (../llm_ocr) to package the SageMaker Training Job. However, these assets were never committed to this repository, causing the source-bundling step to fail and making it impossible for users to run or follow the notebook.
These files were obtained from the original repository: fgbelidji/llm-lab under batch-ocr-inference/.
Proposed Changes:
Added entry.sh and sm_job_runner.py into the docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ folder next to the notebook.
Added the core llm_ocr/ library folder into docs/sagemaker/notebooks/sagemaker-sdk/ so that the notebook can successfully resolve ../llm_ocr during bundling.
Note
Low Risk
Documentation and example-only assets under docs/; no changes to core library runtime, auth, or production services. Operational risk is limited to users running the SageMaker notebook with cloud credentials they configure.
Overview
Commits the SageMaker job bootstrap and llm_ocr pipeline that the deepseek-ocr-sagemaker notebook already expects when bundling source for ModelTrainer, fixing broken tar/copy steps that referenced files not in the repo.
entry.sh installs uv at job start and runs sm_job_runner.py from /opt/ml/input/data/code. sm_job_runner.py loads SageMaker hyperparameters into env vars, then delegates to llm_ocr.cli.main(), writing _SUCCESS or a failure file under /opt/ml/output.
The new llm_ocr package implements a three-stage OCR pipeline (PIPELINE_STAGE: extract, describe, assemble): optional vLLM subprocess + DeepSeekClient batch inference, grounding-based markdown/figure extraction, caption enrichment, and dataset persistence via a single-backend abstraction (HF Hub, S3 via sm_io, or GCS via cloudrun_io).
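The runner behavior summarized above (hyperparameters to env vars, delegate, then write a marker file) can be sketched roughly as follows. This is not the committed `sm_job_runner.py`: `run_job` is a hypothetical function, upper-casing the keys and the marker file contents are assumptions, and the default paths follow the standard SageMaker training container layout.

```python
import json
import os
from pathlib import Path
from typing import Callable

# Standard SageMaker training container locations (defaults, overridable).
HP_PATH = Path("/opt/ml/input/config/hyperparameters.json")
OUTPUT_DIR = Path("/opt/ml/output")


def run_job(
    main: Callable[[], None],
    hp_path: Path = HP_PATH,
    output_dir: Path = OUTPUT_DIR,
) -> int:
    """Export hyperparameters as env vars, run main(), and write a marker file."""
    if hp_path.exists():
        for key, value in json.loads(hp_path.read_text()).items():
            # Assumption: keys are upper-cased to form env var names.
            os.environ[key.upper()] = str(value)
    output_dir.mkdir(parents=True, exist_ok=True)
    try:
        main()
    except Exception as exc:
        # SageMaker surfaces the contents of this file as the failure reason.
        (output_dir / "failure").write_text(f"{type(exc).__name__}: {exc}")
        return 1
    (output_dir / "_SUCCESS").write_text("")
    return 0
```

Because `_SUCCESS` is written whenever `main()` returns normally, any stage that swallows an error, such as an unchecked `save_dataset` result, will still be reported as successful.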