Release 0.58.3 #3069

Merged: odlbot merged 4 commits into release from release-candidate on Mar 19, 2026

odlbot commented Mar 19, 2026

Shankar Ambady

Carey P Gumaer

gumaerc and others added 4 commits March 18, 2026 15:24
* Update everything to use the v3 enrollments endpoint

* upgrade api temporarily to branch build and use new upgrade product fields

* if a course run is passed in, get the b2b contract ID directly from the v3 run data

* fix typecheck issue with missing upgrade product props

* fix is_upgradable check

* properly handle upgrade deadline

* switch back to release api client

* fix test mock after rebase

* copilot suggestion regarding checking product id before rendering upgrade banner

* address feedback
* logic fix

* adding test

* fixing test

* move check outside of ocr method

* Revert "move check outside of ocr method"

This reverts commit ced5536.

* move check outside of ocr

* fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* catch all pdf errors

* fix test

* fix test

* fix check

* fix tests

* switch logging statements

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +594 to +600

    page_count = len(PdfReader(file_path).pages)
    if page_count > settings.OCR_PDF_MAX_PAGE_THRESHOLD and not is_tutor_problem:
        return None
    return {
        "content": _pdf_to_markdown(file_path),
        "content_title": "",

Bug: The _extract_content_with_ocr function lacks exception handling. A PDF that is valid on its first page but corrupted on a later page will cause an unhandled exception.
Severity: HIGH

Suggested Fix

Wrap the PDF processing logic within the _extract_content_with_ocr function in a try/except block to catch potential exceptions from pypdf, such as FileNotDecryptedError or PdfReadError. This will prevent a single malformed file from crashing the entire ETL task and allow it to be skipped gracefully.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: learning_resources/etl/utils.py#L594-L600

Potential issue: The `pdf_is_valid` function only checks the first page of a PDF. A file
with a valid first page but a corrupted or encrypted subsequent page will pass this
initial validation. When `_extract_content_with_ocr` is later called, it attempts to
count all pages via `len(PdfReader(file_path).pages)`. Because a `try/except` block was
removed in this function, an exception raised by `pypdf` (e.g., `FileNotDecryptedError`,
`PdfReadError`) on a subsequent bad page will be unhandled. This will crash the entire
ETL ingestion task for that file, whereas previously it would have been gracefully
skipped.
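
A minimal sketch of the suggested fix, not the code that was merged. To stay self-contained it does not import pypdf or Django settings: `extract_content_safely`, `count_pages`, `to_markdown`, and `max_pages` are hypothetical names introduced here. In the reviewed code they would presumably correspond to `len(PdfReader(file_path).pages)`, `_pdf_to_markdown`, and `settings.OCR_PDF_MAX_PAGE_THRESHOLD`. The broad `except Exception` is an assumption that follows the "catch all pdf errors" commit above; a narrower catch on the pypdf error classes may be preferable.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


def extract_content_safely(
    file_path: str,
    count_pages: Callable[[str], int],  # hypothetical stand-in for the pypdf page count
    to_markdown: Callable[[str], str],  # hypothetical stand-in for _pdf_to_markdown
    max_pages: int,  # hypothetical stand-in for OCR_PDF_MAX_PAGE_THRESHOLD
    is_tutor_problem: bool = False,
) -> Optional[dict]:
    """Return extracted content, or None if the PDF is too long or unreadable."""
    try:
        page_count = count_pages(file_path)
        if page_count > max_pages and not is_tutor_problem:
            # Too many pages for OCR and not a tutor problem: skip the file.
            return None
        return {"content": to_markdown(file_path), "content_title": ""}
    except Exception:
        # A page that fails to parse (corrupted or encrypted) is logged and
        # the file is skipped, so one bad PDF does not abort the whole task.
        log.exception("Skipping unreadable PDF %s", file_path)
        return None
```

With this shape, a failure while counting pages or converting to markdown returns `None`, the same result as the over-threshold case, so the caller needs no extra branch to skip the file.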


@odlbot odlbot merged commit 833c2cb into release Mar 19, 2026
17 checks passed