Skip to content

docs(sagemaker): add missing entry.sh, sm_job_runner.py, and llm_ocr module for deepseek-ocr-sagemaker#2561

Open
yyouretoast wants to merge 1 commit into
huggingface:mainfrom
yyouretoast:fix-deepseek-ocr-sagemaker-missing-files
Open

docs(sagemaker): add missing entry.sh, sm_job_runner.py, and llm_ocr module for deepseek-ocr-sagemaker#2561
yyouretoast wants to merge 1 commit into
huggingface:mainfrom
yyouretoast:fix-deepseek-ocr-sagemaker-missing-files

Conversation

@yyouretoast

@yyouretoast yyouretoast commented Jun 13, 2026

Copy link
Copy Markdown

This PR resolves #2553.

The sagemaker-notebook.ipynb notebook under docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ uses !tar to bundle helper scripts (entry.sh, sm_job_runner.py) and a local library (../llm_ocr) to package the SageMaker Training Job. However, these assets were never committed to this repository, causing the source-bundling step to fail and making it impossible for users to run or follow the notebook.

These files were obtained from the original repository: fgbelidji/llm-lab under batch-ocr-inference/.

Proposed Changes:

Added entry.sh and sm_job_runner.py into the docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ folder next to the notebook.

Added the core llm_ocr/ library folder into docs/sagemaker/notebooks/sagemaker-sdk/ so that the notebook can successfully resolve ../llm_ocr during bundling.


Note

Low Risk
Documentation and example-only assets under docs/; no changes to core library runtime, auth, or production services. Operational risk is limited to users running the SageMaker notebook with cloud credentials they configure.

Overview
Commits the SageMaker job bootstrap and llm_ocr pipeline that the deepseek-ocr-sagemaker notebook already expects when bundling source for ModelTrainer, fixing broken tar/copy steps that referenced files not in the repo.

entry.sh installs uv at job start and runs sm_job_runner.py from /opt/ml/input/data/code. sm_job_runner.py loads SageMaker hyperparameters into env vars, then delegates to llm_ocr.cli.main(), writing _SUCCESS or a failure file under /opt/ml/output.

The new llm_ocr package implements a three-stage OCR pipeline (PIPELINE_STAGE: extract, describe, assemble): optional vLLM subprocess + DeepSeekClient batch inference, grounding-based markdown/figure extraction, caption enrichment, and dataset persistence via a single-backend abstraction (HF Hub, S3 via sm_io, or GCS via cloudrun_io).

Reviewed by Cursor Bugbot for commit a370db8. Bugbot is set up for automated code reviews on this repo. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a370db8. Configure here.

x1 = int(raw_box[0] / 999 * width)
y1 = int(raw_box[1] / 999 * height)
x2 = int(raw_box[2] / 999 * width)
y2 = int(raw_box[3] / 999 * height)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing coordinates null guard

High Severity

build_document_markdown always indexes block["coordinates"][0] even when extract_grounding_blocks left coordinates as None or parsing failed. A grounding block with missing or invalid detection data then raises during extract instead of skipping or degrading that block.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a370db8. Configure here.


# Get storage backend and save
storage = get_storage(repo_id=settings.hub.repo_id)
storage.save_dataset(ds, "dataset")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignored dataset save failures

High Severity

Pipeline stages call storage.save_dataset but never check its boolean result. S3Storage and GCSStorage can return False after errors or missing output URIs, yet the stage still logs success and sm_job_runner writes _SUCCESS, so jobs can finish with no dataset in S3.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a370db8. Configure here.

)
except Exception as exc:
LOGGER.error("DeepSeek request failed: %s", exc)
raise

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry settings never applied

Medium Severity

DeepSeekClient accepts max_retries and backoff settings from InferenceSettings, stores them on the instance, but _async_completion performs a single OpenAI request and re-raises on failure. Transient vLLM or network errors are not retried despite configuration.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a370db8. Configure here.


if not lookup:
LOGGER.info("No descriptions generated")
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe exits without output

Medium Severity

After queuing figures (pending > 0), if no description lines are collected into lookup, run_stage_describe logs and returns without saving an updated dataset. The job can succeed while leaving the describe stage output missing for assemble.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a370db8. Configure here.

asyncio.create_task(self._async_completion(p, t))
for p, t in zip(payloads, timeouts)
]
return await asyncio.gather(*tasks)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max concurrency setting unused

Medium Severity

InferenceSettings.max_concurrency is loaded from env but never used. _async_infer_batch schedules one asyncio task per request in the batch with asyncio.gather, so concurrency equals batch size only and cannot be capped by EXTRACT_MAX_CONCURRENCY / DESCRIBE_MAX_CONCURRENCY.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a370db8. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deepseek-ocr-sagemaker notebook references uncommitted code (entry.sh, sm_job_runner.py, llm_ocr/)

1 participant