docs(sagemaker): add missing entry.sh, sm_job_runner.py, and llm_ocr module for deepseek-ocr-sagemaker by yyouretoast · Pull Request #2561 · huggingface/hub-docs

yyouretoast · 2026-06-13T10:54:00Z

This PR resolves #2553.

The sagemaker-notebook.ipynb notebook under docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ uses !tar to bundle helper scripts (entry.sh, sm_job_runner.py) and a local library (../llm_ocr) into the source package for the SageMaker Training Job. These assets were never committed to this repository, so the source-bundling step fails and users cannot run or follow the notebook.

The files were copied from the original repository, fgbelidji/llm-lab, under batch-ocr-inference/.

Proposed Changes:

Added entry.sh and sm_job_runner.py into the docs/sagemaker/notebooks/sagemaker-sdk/deepseek-ocr-sagemaker/ folder next to the notebook.

Added the core llm_ocr/ library folder into docs/sagemaker/notebooks/sagemaker-sdk/ so that the notebook can successfully resolve ../llm_ocr during bundling.

Note

Low Risk
Documentation and example-only assets under docs/; no changes to core library runtime, auth, or production services. Operational risk is limited to users running the SageMaker notebook with cloud credentials they configure.

Overview
Commits the SageMaker job bootstrap and llm_ocr pipeline that the deepseek-ocr-sagemaker notebook already expects when bundling source for ModelTrainer, fixing broken tar/copy steps that referenced files not in the repo.

entry.sh installs uv at job start and runs sm_job_runner.py from /opt/ml/input/data/code. sm_job_runner.py loads SageMaker hyperparameters into env vars, then delegates to llm_ocr.cli.main(), writing _SUCCESS or a failure file under /opt/ml/output.
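The runner flow described above can be sketched roughly as follows. This is a minimal illustration, not the committed sm_job_runner.py: the function names, the upper-casing of hyperparameter keys, and the exact failure file name (`failure`) are assumptions, and `main` stands in for `llm_ocr.cli.main`. In a real job the hyperparameters would typically be read from /opt/ml/input/config/hyperparameters.json.

```python
import json
import os
from pathlib import Path
from typing import Callable


def load_hyperparameters(path: Path) -> dict[str, str]:
    """Export SageMaker hyperparameters as environment variables.

    Assumption: keys are upper-cased and dashes become underscores; the
    real runner may map names differently.
    """
    if not path.exists():
        return {}
    exported: dict[str, str] = {}
    for key, value in json.loads(path.read_text()).items():
        name = key.upper().replace("-", "_")
        # Values arrive as strings and are sometimes JSON-quoted a second time.
        exported[name] = str(value).strip('"')
        os.environ[name] = exported[name]
    return exported


def run_job(main: Callable[[], None], hp_path: Path, output_dir: Path) -> int:
    """Run `main` and record the outcome as a marker file in `output_dir`."""
    output_dir.mkdir(parents=True, exist_ok=True)
    load_hyperparameters(hp_path)
    try:
        main()
    except Exception as exc:
        # Hypothetical failure file name; the source only says "a failure file".
        (output_dir / "failure").write_text(f"{type(exc).__name__}: {exc}")
        return 1
    (output_dir / "_SUCCESS").write_text("")
    return 0
```

Returning a non-zero code on failure lets entry.sh propagate the error to SageMaker instead of reporting a completed job.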

The new llm_ocr package implements a three-stage OCR pipeline (PIPELINE_STAGE: extract, describe, assemble): optional vLLM subprocess + DeepSeekClient batch inference, grounding-based markdown/figure extraction, caption enrichment, and dataset persistence via a single-backend abstraction (HF Hub, S3 via sm_io, or GCS via cloudrun_io).

^{Reviewed by Cursor Bugbot for commit a370db8. Bugbot is set up for automated code reviews on this repo. Configure here.}


cursor

Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.


cursor · 2026-06-13T10:55:34Z

+        x1 = int(raw_box[0] / 999 * width)
+        y1 = int(raw_box[1] / 999 * height)
+        x2 = int(raw_box[2] / 999 * width)
+        y2 = int(raw_box[3] / 999 * height)


Missing coordinates null guard

High Severity

build_document_markdown always indexes block["coordinates"][0] even when extract_grounding_blocks left coordinates as None or parsing failed. A grounding block with missing or invalid detection data then raises during extract instead of skipping or degrading that block.
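One possible shape for the guard, as a sketch only: `scale_box` and `first_box` are hypothetical helper names, and the 0 to 999 grid is taken from the diff above. Blocks without usable coordinates return `None` so the caller can skip or degrade them rather than raise.

```python
from typing import Optional, Sequence

Box = tuple[int, int, int, int]


def scale_box(raw_box: object, width: int, height: int, grid: int = 999) -> Optional[Box]:
    """Scale a normalized [x1, y1, x2, y2] box to pixels, or return None if invalid."""
    try:
        x1, y1, x2, y2 = (float(v) for v in raw_box)  # type: ignore[union-attr]
    except (TypeError, ValueError):
        # Covers None, non-iterables, wrong length, and non-numeric entries.
        return None
    return (
        int(x1 / grid * width),
        int(y1 / grid * height),
        int(x2 / grid * width),
        int(y2 / grid * height),
    )


def first_box(block: dict, width: int, height: int) -> Optional[Box]:
    """Return the first scaled box of a grounding block, or None if it has none."""
    coordinates: Optional[Sequence] = block.get("coordinates")
    if not coordinates:
        return None
    return scale_box(coordinates[0], width, height)
```

A caller in build_document_markdown could then `continue` past any block for which `first_box` returns `None`.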


cursor · 2026-06-13T10:55:34Z

+
+    # Get storage backend and save
+    storage = get_storage(repo_id=settings.hub.repo_id)
+    storage.save_dataset(ds, "dataset")


Ignored dataset save failures

High Severity

Pipeline stages call storage.save_dataset but never check its boolean result. S3Storage and GCSStorage can return False after errors or missing output URIs, yet the stage still logs success and sm_job_runner writes _SUCCESS, so jobs can finish with no dataset in S3.

Additional Locations (2)

docs/sagemaker/notebooks/sagemaker-sdk/llm_ocr/stages.py#L397-L398

docs/sagemaker/notebooks/sagemaker-sdk/llm_ocr/storage.py#L120-L136
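A small wrapper could make the failure visible. The sketch below assumes every backend's `save_dataset(dataset, name)` returns `True` on success and a falsy value otherwise, as the finding describes for S3Storage and GCSStorage; `DatasetSaveError` and `save_or_raise` are hypothetical names.

```python
from typing import Any, Protocol


class SupportsSaveDataset(Protocol):
    def save_dataset(self, dataset: Any, name: str) -> bool: ...


class DatasetSaveError(RuntimeError):
    """Raised when a storage backend reports that a dataset was not saved."""


def save_or_raise(storage: SupportsSaveDataset, dataset: Any, name: str) -> None:
    """Save a dataset and fail the stage if the backend reports failure."""
    # Assumption: a falsy return value always means the save did not happen.
    if not storage.save_dataset(dataset, name):
        raise DatasetSaveError(
            f"{type(storage).__name__} failed to save dataset {name!r}"
        )
```

Because the exception propagates out of the stage, the job runner would write its failure file instead of _SUCCESS.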


cursor · 2026-06-13T10:55:34Z

+            )
+        except Exception as exc:
+            LOGGER.error("DeepSeek request failed: %s", exc)
+            raise


Retry settings never applied

Medium Severity

DeepSeekClient accepts max_retries and backoff settings from InferenceSettings, stores them on the instance, but _async_completion performs a single OpenAI request and re-raises on failure. Transient vLLM or network errors are not retried despite configuration.
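For illustration, retries with exponential backoff could wrap the request along these lines. This is a generic sketch, not the client's code: `with_retries`, `backoff_base`, and the doubling schedule are assumptions, and a real implementation would likely retry only on transient error types rather than on every `Exception`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

LOGGER = logging.getLogger(__name__)
T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    max_retries: int = 3,
    backoff_base: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await `call()`, retrying up to `max_retries` times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return await call()
        except Exception as exc:
            if attempt >= max_retries:
                LOGGER.error("Request failed after %d retries: %s", attempt, exc)
                raise
            delay = backoff_base * (2 ** attempt)
            attempt += 1
            LOGGER.warning("Request failed (%s); retry %d in %.1fs", exc, attempt, delay)
            await sleep(delay)
```

Inside `_async_completion`, the single OpenAI request would become the `call` argument, with `max_retries` and the backoff value read from the settings already stored on the instance.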


cursor · 2026-06-13T10:55:35Z

+
+    if not lookup:
+        LOGGER.info("No descriptions generated")
+        return


Describe exits without output

Medium Severity

After queuing figures (pending > 0), if no description lines are collected into lookup, run_stage_describe logs and returns without saving an updated dataset. The job can succeed while leaving the describe stage output missing for assemble.
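One way to avoid the silent exit, sketched under stated assumptions: rows are plain dicts keyed by a hypothetical `figure_id`, `save` stands in for the storage call, and the policy chosen here (fail when figures were queued but nothing came back, otherwise always persist) is only one reasonable option.

```python
from typing import Callable, Optional


def finalize_describe(
    rows: list[dict],
    lookup: dict[str, str],
    pending: int,
    save: Callable[[list[dict]], None],
) -> list[dict]:
    """Merge descriptions into rows and always persist the result."""
    if pending > 0 and not lookup:
        # Figures were queued but no description was collected: fail loudly
        # so the job is not reported as successful.
        raise RuntimeError(
            f"{pending} figures were queued but no descriptions were produced"
        )
    updated = []
    for row in rows:
        description: Optional[str] = lookup.get(row["figure_id"], row.get("description"))
        updated.append({**row, "description": description})
    # Saving even when nothing changed keeps the describe output present for assemble.
    save(updated)
    return updated
```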


cursor · 2026-06-13T10:55:35Z

+            asyncio.create_task(self._async_completion(p, t))
+            for p, t in zip(payloads, timeouts)
+        ]
+        return await asyncio.gather(*tasks)


Max concurrency setting unused

Medium Severity

InferenceSettings.max_concurrency is loaded from env but never used. _async_infer_batch schedules one asyncio task per request in the batch with asyncio.gather, so concurrency equals batch size only and cannot be capped by EXTRACT_MAX_CONCURRENCY / DESCRIBE_MAX_CONCURRENCY.

Additional Locations (1)

docs/sagemaker/notebooks/sagemaker-sdk/llm_ocr/config.py#L42-L54
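A common way to cap concurrency with asyncio is a semaphore around each request. The sketch below is generic: `gather_bounded` is a hypothetical helper, and each element of `factories` is a zero-argument callable returning a coroutine (for example a lambda wrapping `_async_completion`).

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: Sequence[Callable[[], Awaitable[T]]],
    max_concurrency: int,
) -> list[T]:
    """Run all coroutines, with at most `max_concurrency` in flight at once."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        # The coroutine is only created and awaited once a slot is free.
        async with semaphore:
            return await factory()

    # gather preserves input order, so results line up with the payloads.
    return list(await asyncio.gather(*(run(f) for f in factories)))
```

Passing factories rather than already-created tasks matters: `asyncio.create_task` schedules work immediately, which is what lets the whole batch run at once today.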


docs(sagemaker): add missing entry.sh, sm_job_runner.py, and llm_ocr …

a370db8

…for deepseek-ocr-sagemaker

yyouretoast wants to merge 1 commit into huggingface:main from yyouretoast:fix-deepseek-ocr-sagemaker-missing-files
