AskTim canvas ai contentfile ingestion issue #3059
Conversation
Pull request overview
Updates the learning resource ETL text-extraction flow so that when OCR is enabled but yields no result, extraction can fall back to Apache Tika instead of returning None early.
Changes:
- Adjust `_extract_content` to only return the OCR result when OCR returns a non-None content dict; otherwise fall back to Tika extraction.
- Update the encrypted-PDF test setup to prevent real Tika parsing calls.
- Add a new unit test to verify OCR→Tika fallback when OCR returns None.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| learning_resources/etl/utils.py | Changes OCR branch to conditionally fall back to Tika when OCR returns None. |
| learning_resources/etl/utils_test.py | Adds/adjusts tests around OCR fallback and encrypted PDF handling. |
```diff
         file_extension=file_extension, file_path=file_path, use_ocr=use_ocr
     ):
-        return _extract_content_with_ocr(file_path, is_tutor_problem)
+        content_dict = _extract_content_with_ocr(file_path, is_tutor_problem)
+        if content_dict:
+            return content_dict
```
Probably worth handling this. Maybe we can let `FileNotDecryptedError` propagate out of `_extract_content_with_ocr` and catch it in `_extract_content` instead to return `None` from there, before trying to process it with Tika?
👍 I extracted the logic to catch encrypted docs into its own method that gets called beforehand
mbertrand
left a comment
Works great but I think it might be worth addressing the sentry issues to handle encrypted pdfs
This reverts commit ced5536.
```python
        if content_dict:
            return content_dict
```
Bug: The check `if content_dict:` is truthy for a dictionary with empty content strings, preventing the intended fallback to Tika extraction when OCR yields no text.
Severity: MEDIUM
Suggested Fix
Modify the conditional check to also verify that the `content` key in the returned dictionary is not empty. Change `if content_dict:` to `if content_dict and content_dict.get("content"):` to ensure the fallback to Tika occurs when OCR returns a dictionary with empty content.
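The truthiness trap behind this suggestion can be demonstrated in plain Python; the small predicate below is illustrative, not code from the PR:

```python
# A dict whose values are empty strings is still a non-empty dict,
# so `if content_dict:` alone does not trigger the Tika fallback.
content_dict = {"content": "", "content_title": ""}
assert bool(content_dict) is True


def should_use_ocr_result(content_dict):
    # Also require a non-empty "content" value, per the suggested fix.
    return bool(content_dict and content_dict.get("content"))


assert should_use_ocr_result({"content": "", "content_title": ""}) is False
assert should_use_ocr_result({"content": "real text"}) is True
assert should_use_ocr_result(None) is False
```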
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: learning_resources/etl/utils.py#L651-L652
Potential issue: When the OCR process in `_extract_content_with_ocr` fails to extract
text from a PDF, it can return a dictionary with an empty content string, such as
`{"content": "", "content_title": ""}`. The subsequent check `if content_dict:`
evaluates to true for this dictionary because a non-empty dictionary is truthy in
Python. This causes the function to return the dictionary with empty content
immediately, incorrectly skipping the intended fallback to the Tika extraction process.
As a result, a `ContentFile` with empty content may be ingested instead of being
re-processed by Tika, which might have successfully extracted the content.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
mbertrand
left a comment
Just a couple logging nitpicks
learning_resources/etl/utils.py (Outdated)

```python
    file_path = Path(olx_path) / Path(source_path)

    if file_extension == ".pdf" and file_path.is_file() and not pdf_is_valid(file_path):
        log.exception("Skipping invalid pdf %s", file_path)
```
There's no exception/stacktrace here, so `log.warning` would probably be better; `log.exception` can be used in `pdf_is_valid` if an exception occurs there.
learning_resources/etl/utils.py (Outdated)

```python
        reader.pages[0].extract_text()
        return True
    except Exception as e:  # noqa: BLE001
        log.warning("PDF validation error for %s: %s", pdf_path, e)
```
Change to `log.exception` here, and use `log.warning` in the calling function.
```python
    except FileNotDecryptedError:
        log.exception("Skipping encrypted pdf %s", file_path)

    page_count = len(PdfReader(file_path).pages)
```
Bug: The function `_extract_content_with_ocr` lacks exception handling. Errors from `PdfReader` or `_pdf_to_markdown` will crash the ETL process for the file.
Severity: HIGH
Suggested Fix
Wrap the calls within `_extract_content_with_ocr` in a `try...except` block to catch potential exceptions from both `PdfReader` and `_pdf_to_markdown`. Upon catching an exception, log the error and return `None` to allow the ETL process to gracefully skip the problematic file instead of crashing.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: learning_resources/etl/utils.py#L595
Potential issue: The function `_extract_content_with_ocr` has had its exception handling
removed. While a new `pdf_is_valid` check is performed beforehand, this check does not
guarantee that subsequent processing will succeed. Specifically, the call to
`_pdf_to_markdown` inside `_extract_content_with_ocr` is unprotected. Any exceptions
from `_pdf_to_markdown`—which can be caused by LLM API errors, image processing
failures, or other PDF parsing issues—will propagate unhandled, crashing the ETL process
for the specific file being processed.
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/10526
Description (What does it do?)
This PR resolves an issue with ingesting Canvas contentfiles (specifically PDFs of more than 10 pages) that started happening after a refactor of `learning_resources/etl/utils.py::_extract_content`. Essentially, PDF files would not fall back to being parsed via Tika if their page count exceeded `settings.OCR_PDF_MAX_PAGE_THRESHOLD`.
How can this be tested?