Merged
* Update everything to use the v3 enrollments endpoint
* upgrade api temporarily to branch build and use new upgrade product fields
* if a course run is passed in, get the b2b contract ID directly from the v3 run data
* fix typecheck issue with missing upgrade product props
* fix is_upgradable check
* properly handle upgrade deadline
* switch back to release api client
* fix test mock after rebase
* copilot suggestion regarding checking product id before rendering upgrade banner
* address feedback
* logic fix
* adding test
* fixing test
* move check outside of ocr method
* Revert "move check outside of ocr method" (this reverts commit ced5536)
* move check outside of ocr
* fix for pull request finding
* catch all pdf errors
* fix test
* fix test
* fix check
* fix tests
* switch logging statements

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +594 to +600

    page_count = len(PdfReader(file_path).pages)
    if page_count > settings.OCR_PDF_MAX_PAGE_THRESHOLD and not is_tutor_problem:
        return None
    return {
        "content": _pdf_to_markdown(file_path),
        "content_title": "",
Bug: The _extract_content_with_ocr function lacks exception handling. A PDF that is valid on its first page but corrupted on a later page will cause an unhandled exception.
Severity: HIGH
Suggested Fix
Wrap the PDF processing logic within the _extract_content_with_ocr function in a try/except block to catch potential exceptions from pypdf, such as FileNotDecryptedError or PdfReadError. This will prevent a single malformed file from crashing the entire ETL task and allow it to be skipped gracefully.
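A minimal sketch of what that wrapper could look like, not the code that was merged. The function name `extract_content_safely` and its injected `count_pages` / `to_markdown` callables are hypothetical, chosen so the example runs without pypdf installed. In the real module these would presumably be `len(PdfReader(file_path).pages)` and `_pdf_to_markdown(file_path)`, and the returned dict may carry more keys than the two visible in the diff excerpt.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


def extract_content_safely(
    file_path: str,
    count_pages: Callable[[str], int],   # hypothetical stand-in for len(PdfReader(path).pages)
    to_markdown: Callable[[str], str],   # hypothetical stand-in for _pdf_to_markdown
    max_pages: int,                      # stand-in for settings.OCR_PDF_MAX_PAGE_THRESHOLD
    is_tutor_problem: bool = False,
) -> Optional[dict]:
    """Return extracted content, or None if the PDF is too long or unreadable."""
    try:
        page_count = count_pages(file_path)
        if page_count > max_pages and not is_tutor_problem:
            return None
        return {
            "content": to_markdown(file_path),
            "content_title": "",
        }
    except Exception:
        # Broad on purpose: a corrupted or encrypted later page can raise
        # different error types, and one bad file should be skipped rather
        # than abort the whole ETL task.
        log.exception("Skipping PDF that could not be processed: %s", file_path)
        return None
```

Catching `Exception` broadly matches the "catch all pdf errors" commit in this PR. A narrower alternative would list the specific pypdf error classes named above, at the cost of missing any error type not anticipated.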
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: learning_resources/etl/utils.py#L594-L600
Potential issue: The `pdf_is_valid` function only checks the first page of a PDF. A file
with a valid first page but a corrupted or encrypted subsequent page will pass this
initial validation. When `_extract_content_with_ocr` is later called, it attempts to
count all pages via `len(PdfReader(file_path).pages)`. Because a `try/except` block was
removed in this function, an exception raised by `pypdf` (e.g., `FileNotDecryptedError`,
`PdfReadError`) on a subsequent bad page will be unhandled. This will crash the entire
ETL ingestion task for that file, whereas previously it would have been gracefully
skipped.
Shankar Ambady
Carey P Gumaer