AskTim canvas ai contentfile ingestion issue #3059

Merged: shanbady merged 13 commits into main from shanbady/asktim-canvas-ai-issue on Mar 19, 2026

@shanbady
Contributor

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/10526

Description (What does it do?)

This PR resolves an issue with ingesting canvas contentfiles (specifically PDFs longer than 10 pages) that started after a refactor of learning_resources/etl/utils.py::_extract_content. PDF files that exceeded settings.OCR_PDF_MAX_PAGE_THRESHOLD would not fall back to being parsed via Tika.
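
The intended behavior described above can be illustrated with a minimal sketch. It is not the code in learning_resources/etl/utils.py: `extract_pdf`, `run_ocr`, and `run_tika` are hypothetical names, and the threshold value of 10 is only an assumption based on the "10 pages" mentioned in the description.

```python
# Assumed value; the real setting is settings.OCR_PDF_MAX_PAGE_THRESHOLD.
OCR_PDF_MAX_PAGE_THRESHOLD = 10


def extract_pdf(page_count, run_ocr, run_tika, max_pages=OCR_PDF_MAX_PAGE_THRESHOLD):
    """Use OCR for short PDFs; fall back to Tika for long PDFs or empty OCR output.

    run_ocr and run_tika are hypothetical zero-argument callables standing in
    for the OCR and Tika extraction paths.
    """
    if page_count <= max_pages:
        content = run_ocr()
        if content:
            return content
    # Either the PDF is over the threshold or OCR produced nothing usable.
    return run_tika()


short_pdf = extract_pdf(3, lambda: {"content": "ocr"}, lambda: {"content": "tika"})
long_pdf = extract_pdf(25, lambda: {"content": "ocr"}, lambda: {"content": "tika"})
```

In this sketch the 25-page document never reaches the OCR callable and is handed to the Tika callable instead, which is the fallback the refactor had dropped.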

How can this be tested?

  1. Check out main.
  2. Make sure OPENAI_API_KEY is set locally. Also make sure settings.OCR_MODEL is set to "gpt-5-nano-2025-08-07" and settings.COURSE_ARCHIVE_BUCKET_NAME is set to "ol-data-lake-landing-zone-production".
  3. Open a Django shell and run the following to ingest this specific canvas course:
     from learning_resources.models import LearningResource
     from learning_resources.tasks import ingest_canvas_course
     ingest_canvas_course('canvas/course_content/37842/7225ecfcef27b0368f8dca21492c663c93c9ecc304c9f45fc3d6d62659fd0ea5.imscc', True)
     LearningResource.objects.filter(readable_id__icontains="37842").first().runs.first().content_files.all()
  4. Note that 0 contentfiles appear for the course.
  5. Check out this branch.
  6. Exit and restart the Django shell, then re-run the script.
  7. Note that contentfiles now appear.

@shanbady shanbady marked this pull request as ready for review March 17, 2026 14:46
Copilot AI review requested due to automatic review settings March 17, 2026 14:46
@shanbady shanbady added the Needs Review label Mar 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

Updates the learning resource ETL text-extraction flow so that when OCR is enabled but yields no result, extraction can fall back to Apache Tika instead of returning None early.

Changes:

  • Adjust _extract_content to only return the OCR result when OCR returns a non-None content dict; otherwise fall back to Tika extraction.
  • Update the encrypted-PDF test setup to prevent real Tika parsing calls.
  • Add a new unit test to verify OCR→Tika fallback when OCR returns None.
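
A test of the OCR→Tika fallback along the lines of the last bullet could be sketched as follows. This is not the test added in learning_resources/etl/utils_test.py: `extract_text` is a hypothetical stand-in for `_extract_content`, and both extractors are replaced by `unittest.mock.Mock` objects so that no real OCR or Tika call is made.

```python
from unittest import mock


def extract_text(path, ocr, tika):
    """Hypothetical stand-in for _extract_content: try OCR, then fall back."""
    result = ocr(path)
    if result is not None:
        return result
    return tika(path)


def check_fallback_when_ocr_returns_none():
    # OCR yields nothing, so the Tika mock should be consulted exactly once.
    ocr = mock.Mock(return_value=None)
    tika = mock.Mock(return_value={"content": "tika text"})

    result = extract_text("big.pdf", ocr, tika)

    ocr.assert_called_once_with("big.pdf")
    tika.assert_called_once_with("big.pdf")
    return result


fallback_result = check_fallback_when_ocr_returns_none()
```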

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
learning_resources/etl/utils.py | Changes OCR branch to conditionally fall back to Tika when OCR returns None.
learning_resources/etl/utils_test.py | Adds/adjusts tests around OCR fallback and encrypted PDF handling.

Comment on lines 648 to +652
         file_extension=file_extension, file_path=file_path, use_ocr=use_ocr
     ):
-        return _extract_content_with_ocr(file_path, is_tutor_problem)
+        content_dict = _extract_content_with_ocr(file_path, is_tutor_problem)
+        if content_dict:
+            return content_dict
Member

Probably worth handling this. Maybe we can let FileNotDecryptedError propagate out of _extract_content_with_ocr and catch it in _extract_content instead, returning None from there before trying to process the file with Tika?

Contributor Author

👍 I extracted the logic to catch encrypted docs into its own method that gets called beforehand

@mbertrand mbertrand self-assigned this Mar 18, 2026
Member

@mbertrand mbertrand left a comment

Works great, but I think it might be worth addressing the Sentry issues to handle encrypted PDFs.

@mbertrand mbertrand added the Waiting on author label and removed the Needs Review label Mar 18, 2026
Comment on lines +651 to +652
+        if content_dict:
+            return content_dict

Bug: The check if content_dict: is truthy for a dictionary with empty content strings, preventing the intended fallback to Tika extraction when OCR yields no text.
Severity: MEDIUM

Suggested Fix

Modify the conditional check to also verify that the content key in the returned dictionary is not empty. Change if content_dict: to if content_dict and content_dict.get("content"): to ensure the fallback to Tika occurs when OCR returns a dictionary with empty content.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: learning_resources/etl/utils.py#L651-L652

Potential issue: When the OCR process in `_extract_content_with_ocr` fails to extract
text from a PDF, it can return a dictionary with an empty content string, such as
`{"content": "", "content_title": ""}`. The subsequent check `if content_dict:`
evaluates to true for this dictionary because a non-empty dictionary is truthy in
Python. This causes the function to return the dictionary with empty content
immediately, incorrectly skipping the intended fallback to the Tika extraction process.
As a result, a `ContentFile` with empty content may be ingested instead of being
re-processed by Tika, which might have successfully extracted the content.
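
The truthiness point can be checked in isolation. The snippet below is a small sketch rather than the project's code: `ocr_result_is_usable` is a hypothetical helper that applies the suggested `content_dict and content_dict.get("content")` check.

```python
def ocr_result_is_usable(content_dict):
    """Return True only when OCR produced a dict with non-empty content."""
    return bool(content_dict and content_dict.get("content"))


empty_result = {"content": "", "content_title": ""}

# A dict with keys is truthy even when every value is an empty string.
print(bool(empty_result))                    # True
# The stricter check treats it as "no OCR output", allowing a fallback.
print(ocr_result_is_usable(empty_result))    # False
print(ocr_result_is_usable(None))            # False
print(ocr_result_is_usable({"content": "some text", "content_title": ""}))  # True
```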

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@shanbady shanbady requested a review from mbertrand March 18, 2026 19:30
Member

@mbertrand mbertrand left a comment

Just a couple logging nitpicks

    file_path = Path(olx_path) / Path(source_path)

    if file_extension == ".pdf" and file_path.is_file() and not pdf_is_valid(file_path):
        log.exception("Skipping invalid pdf %s", file_path)
Member

There's no exception/stacktrace here, so log.warning would probably be better. log.exception can be used in pdf_is_valid if an exception occurs there.

        reader.pages[0].extract_text()
        return True
    except Exception as e:  # noqa: BLE001
        log.warning("PDF validation error for %s: %s", pdf_path, e)
Member

Change to log.exception here, and use log.warning in the calling function.
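
The difference between the two logging calls can be seen in a small standalone sketch. It uses only the standard library; `pdf_is_valid` here takes a hypothetical `read_first_page` callable instead of a real PDF reader, and the logger name and `broken_reader` are illustrative assumptions.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
log = logging.getLogger("pdf_validation_sketch")
log.addHandler(logging.StreamHandler(stream))
log.setLevel(logging.WARNING)
log.propagate = False


def pdf_is_valid(pdf_path, read_first_page):
    """Return False if reading the first page raises; log the traceback here."""
    try:
        read_first_page(pdf_path)
    except Exception:
        # Inside an except block, log.exception attaches the active traceback.
        log.exception("PDF validation error for %s", pdf_path)
        return False
    return True


def broken_reader(pdf_path):
    """Hypothetical reader that always fails, standing in for a bad PDF."""
    raise ValueError("cannot parse")


if not pdf_is_valid("bad.pdf", broken_reader):
    # No exception is active in the caller, so a plain warning is enough.
    log.warning("Skipping invalid pdf %s", "bad.pdf")

output = stream.getvalue()
```

The captured output contains one traceback, attached to the validation error, followed by a single warning line from the caller.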

    except FileNotDecryptedError:
        log.exception("Skipping encrypted pdf %s", file_path)

    page_count = len(PdfReader(file_path).pages)

Bug: The function _extract_content_with_ocr lacks exception handling. Errors from PdfReader or _pdf_to_markdown will crash the ETL process for the file.
Severity: HIGH

Suggested Fix

Wrap the calls within _extract_content_with_ocr in a try...except block to catch potential exceptions from both PdfReader and _pdf_to_markdown. Upon catching an exception, log the error and return None to allow the ETL process to gracefully skip the problematic file instead of crashing.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: learning_resources/etl/utils.py#L595

Potential issue: The function `_extract_content_with_ocr` has had its exception handling
removed. While a new `pdf_is_valid` check is performed beforehand, this check does not
guarantee that subsequent processing will succeed. Specifically, the call to
`_pdf_to_markdown` inside `_extract_content_with_ocr` is unprotected. Any exceptions
from `_pdf_to_markdown`—which can be caused by LLM API errors, image processing
failures, or other PDF parsing issues—will propagate unhandled, crashing the ETL process
for the specific file being processed.
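
The suggested fix could look roughly like the sketch below. It is only an illustration of the try/except pattern, not the merged code: `extract_with_ocr_safely` and the injected `pdf_to_markdown` callable are hypothetical stand-ins for `_extract_content_with_ocr` and `_pdf_to_markdown`, and the returned dict shape is assumed from the `{"content": ..., "content_title": ...}` example in the earlier comment.

```python
import logging

log = logging.getLogger(__name__)


def extract_with_ocr_safely(file_path, pdf_to_markdown):
    """Run OCR conversion, returning None instead of raising on failure.

    pdf_to_markdown is a hypothetical callable that converts a PDF path to
    markdown text and may raise (API errors, image or PDF parsing failures).
    """
    try:
        markdown = pdf_to_markdown(file_path)
    except Exception:
        # Log with traceback and signal "no OCR result" to the caller.
        log.exception("OCR extraction failed for %s", file_path)
        return None
    return {"content": markdown, "content_title": ""}


def failing_converter(file_path):
    """Hypothetical converter that simulates an upstream API error."""
    raise RuntimeError("simulated LLM API error")


failed = extract_with_ocr_safely("broken.pdf", failing_converter)
succeeded = extract_with_ocr_safely("ok.pdf", lambda path: "# Title")
```

Returning None on failure lets a caller treat the file as having no OCR output, so one bad file does not abort processing of the others.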

@shanbady shanbady merged commit 03e59af into main Mar 19, 2026
14 checks passed
@shanbady shanbady deleted the shanbady/asktim-canvas-ai-issue branch March 19, 2026 16:28
@odlbot odlbot mentioned this pull request Mar 19, 2026
2 tasks