AskTim canvas ai contentfile ingestion issue #3059
Conversation
Pull request overview
Updates the learning resource ETL text-extraction flow so that when OCR is enabled but yields no result, extraction can fall back to Apache Tika instead of returning None early.
Changes:
- Adjust `_extract_content` to only return the OCR result when OCR returns a non-None content dict; otherwise fall back to Tika extraction.
- Update the encrypted-PDF test setup to prevent real Tika parsing calls.
- Add a new unit test to verify OCR→Tika fallback when OCR returns None.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| learning_resources/etl/utils.py | Changes OCR branch to conditionally fall back to Tika when OCR returns None. |
| learning_resources/etl/utils_test.py | Adds/adjusts tests around OCR fallback and encrypted PDF handling. |
```diff
         file_extension=file_extension, file_path=file_path, use_ocr=use_ocr
     ):
-        return _extract_content_with_ocr(file_path, is_tutor_problem)
+        content_dict = _extract_content_with_ocr(file_path, is_tutor_problem)
+        if content_dict:
+            return content_dict
```
Probably worth handling this. Maybe we can let `FileNotDecryptedError` propagate out of `_extract_content_with_ocr` and catch it in `_extract_content` instead to return `None` from there, before trying to process it with Tika?
👍 I extracted the logic to catch encrypted docs into its own method that gets called beforehand
mbertrand
left a comment
Works great but I think it might be worth addressing the sentry issues to handle encrypted pdfs
This reverts commit ced5536.
```python
        if content_dict:
            return content_dict
```
Bug: The check `if content_dict:` is truthy for a dictionary with empty content strings, preventing the intended fallback to Tika extraction when OCR yields no text.
Severity: MEDIUM
Suggested Fix
Modify the conditional check to also verify that the `content` key in the returned dictionary is not empty. Change `if content_dict:` to `if content_dict and content_dict.get("content"):` to ensure the fallback to Tika occurs when OCR returns a dictionary with empty content.
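The truthiness trap behind this suggestion can be demonstrated in plain Python; the small predicate below is illustrative, not code from the PR:

```python
# A dict whose values are empty strings is still a non-empty dict,
# so `if content_dict:` alone does not trigger the Tika fallback.
content_dict = {"content": "", "content_title": ""}
assert bool(content_dict) is True


def should_use_ocr_result(content_dict):
    # Also require a non-empty "content" value, per the suggested fix.
    return bool(content_dict and content_dict.get("content"))


assert should_use_ocr_result({"content": "", "content_title": ""}) is False
assert should_use_ocr_result({"content": "real text"}) is True
assert should_use_ocr_result(None) is False
```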
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: learning_resources/etl/utils.py#L651-L652
Potential issue: When the OCR process in `_extract_content_with_ocr` fails to extract
text from a PDF, it can return a dictionary with an empty content string, such as
`{"content": "", "content_title": ""}`. The subsequent check `if content_dict:`
evaluates to true for this dictionary because a non-empty dictionary is truthy in
Python. This causes the function to return the dictionary with empty content
immediately, incorrectly skipping the intended fallback to the Tika extraction process.
As a result, a `ContentFile` with empty content may be ingested instead of being
re-processed by Tika, which might have successfully extracted the content.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
mbertrand
left a comment
Just a couple logging nitpicks
learning_resources/etl/utils.py (Outdated)

```python
    file_path = Path(olx_path) / Path(source_path)

    if file_extension == ".pdf" and file_path.is_file() and not pdf_is_valid(file_path):
        log.exception("Skipping invalid pdf %s", file_path)
```
There's no exception/stacktrace here, so `log.warning` would probably be better; `log.exception` can be used in `pdf_is_valid` if an exception occurs there.
learning_resources/etl/utils.py (Outdated)

```python
        reader.pages[0].extract_text()
        return True
    except Exception as e:  # noqa: BLE001
        log.warning("PDF validation error for %s: %s", pdf_path, e)
```
Change to `log.exception` here, and use `log.warning` in the calling function.
```python
    except FileNotDecryptedError:
        log.exception("Skipping encrypted pdf %s", file_path)

    page_count = len(PdfReader(file_path).pages)
```
Bug: The function `_extract_content_with_ocr` lacks exception handling. Errors from `PdfReader` or `_pdf_to_markdown` will crash the ETL process for the file.
Severity: HIGH
Suggested Fix
Wrap the calls within `_extract_content_with_ocr` in a `try...except` block to catch potential exceptions from both `PdfReader` and `_pdf_to_markdown`. Upon catching an exception, log the error and return `None` to allow the ETL process to gracefully skip the problematic file instead of crashing.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: learning_resources/etl/utils.py#L595
Potential issue: The function `_extract_content_with_ocr` has had its exception handling
removed. While a new `pdf_is_valid` check is performed beforehand, this check does not
guarantee that subsequent processing will succeed. Specifically, the call to
`_pdf_to_markdown` inside `_extract_content_with_ocr` is unprotected. Any exceptions
from `_pdf_to_markdown`—which can be caused by LLM API errors, image processing
failures, or other PDF parsing issues—will propagate unhandled, crashing the ETL process
for the specific file being processed.
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/10526
Description (What does it do?)
This PR resolves an issue with ingesting Canvas contentfiles (specifically PDFs of more than 10 pages) that started happening after a refactor of `learning_resources/etl/utils.py::_extract_content`. Essentially, PDF files would not fall back to being parsed via Tika if their page count exceeded `settings.OCR_PDF_MAX_PAGE_THRESHOLD`.
How can this be tested?