Remove camelot #2624

Merged
symroe merged 11 commits into master from remove-camelot
Mar 4, 2026

symroe commented Nov 20, 2025

In this PR I do a couple of things:

  1. Remove the SOPN testing baseline code. This was a one-off feature that we added to compare the existing SOPN tooling against AWS Textract. We wanted a way to validate that Textract was at least as good as our existing code. We validated this to the point where we're getting rid of the existing code (the next commit). It might be useful to be able to generate a baseline in future, but this code is already quite stale, so I think it's better to get rid of it and write new code if we want something like this later.
  2. We kept both old and new systems for parsing PDFs into DataFrames. We don't want to use Camelot any more, partly because Textract is much better (it deals with images, for one thing), but also because it's not maintained any more. In the second commit, I remove everything I can find to do with Camelot.

A couple of extra points:

  1. I've not removed all local PDF libraries. This is because we have some tooling to split up PDFs by page that relies on these libraries. There are changes I'd like to make to this system over time, but I consider that out of scope for this PR. This means we still need to install some local libraries at the system level.
  2. I've removed some of the parse tables tests because they were tied to Camelot not Textract. We want something like these tests in future, but they would be testing code that is likely to change again in the near future, so I've decided to delete them with an eye to re-making them as part of future work. Again, the refactor wasn't worth the time for code that's about to change anyway.

@symroe symroe requested a review from chris48s November 20, 2025 13:11

chris48s commented Nov 21, 2025

Having deleted all this code, are we now in a position to remove camelot-py[cv] from pyproject.toml, even if we have to keep the other python/system libs for PDF parsing?


One thing we lose by removing the Camelot backend is the ability to have a "working" local application without connecting it to AWS. That doesn't mean we should keep Camelot, but do you think there is any mileage in having a "mock" PDF parsing backend? I am thinking we write a "parser" where you feed it any PDF and it throws the PDF away and just returns the same hard-coded list of people/parties. Worth bothering with?
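
For illustration, a minimal sketch of what such a mock backend could look like. Everything named here (`ParsedCandidate`, `PDFParser`, `MockPDFParser`, `get_parser`, and the fixture rows) is hypothetical and not taken from the codebase; the real parsers produce DataFrames, which this sketch replaces with plain records to stay dependency-free.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ParsedCandidate:
    # Hypothetical record type; the real project parses PDFs into DataFrames.
    name: str
    party: str


class PDFParser(Protocol):
    # Hypothetical interface that a Textract-backed parser could also satisfy.
    def parse(self, pdf_bytes: bytes) -> list[ParsedCandidate]: ...


# Made-up fixture data, not taken from any real SOPN.
FIXED_CANDIDATES = (
    ParsedCandidate(name="Alex Example", party="Example Party"),
    ParsedCandidate(name="Sam Sample", party="Independent"),
)


class MockPDFParser:
    """Ignores the PDF entirely and returns the same hard-coded rows."""

    def parse(self, pdf_bytes: bytes) -> list[ParsedCandidate]:
        del pdf_bytes  # deliberately discarded; no AWS call is made
        return list(FIXED_CANDIDATES)


def get_parser(use_mock: bool) -> PDFParser:
    # Hypothetical switch; in practice this might be driven by a setting.
    if use_mock:
        return MockPDFParser()
    raise NotImplementedError("Textract-backed parser not shown in this sketch")
```

With something along these lines, local development could select the mock parser and exercise the rest of the flow without AWS credentials, at the cost of never testing real table extraction locally.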


There's a code formatting error to resolve with ruff to get the build passing.

Comment thread Makefile

chris48s commented Dec 2, 2025

Quick thing I noticed while working on something else.
There are some tests in the test suite with a:

@skipIf(should_skip_conversion_tests(), "Required conversion libs not installed")

decorator on them. Can you double-check whether that is something we can get rid of as part of this PR?
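
As a reminder of why this decorator matters, here is a small self-contained sketch of how `unittest.skipIf` behaves when its condition is true. The body of `should_skip_conversion_tests` below is a hypothetical stand-in (the real helper's implementation is not shown in this thread), and `hypothetical_conversion_lib` is a made-up module name chosen so that it is never installed.

```python
import importlib.util
import unittest
from unittest import skipIf

# Made-up module name, used only so the skip condition is true in this sketch.
REQUIRED_MODULE = "hypothetical_conversion_lib"


def should_skip_conversion_tests() -> bool:
    # Hypothetical stand-in: skip when the required library cannot be imported.
    return importlib.util.find_spec(REQUIRED_MODULE) is None


class ConversionTests(unittest.TestCase):
    @skipIf(should_skip_conversion_tests(), "Required conversion libs not installed")
    def test_conversion(self):
        # Never reached while the library is missing.
        self.fail("only runs when the library is available")


def run_suite() -> unittest.TestResult:
    # Run the test case programmatically and return the collected result.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConversionTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

In this sketch the run reports the test as skipped and the overall result still counts as successful, so a missing dependency hides the test rather than failing the build. Dropping the decorator would turn that silent skip into a visible failure.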

Comment thread ynr/apps/sopn_parsing/tests/test_parse_tables.py

symroe commented Feb 9, 2026

I force-pushed the commit where I removed too much code, restoring some files. This should make it easier to merge with #2639.

chris48s left a comment

I've left a few more comments.

Can you have a look at getting the build passing?

Also, the comment about deleting candidates_import_from_live_site still stands.

Comment thread ynr/apps/bulk_adding/views/sopns.py Outdated
Comment thread ynr/apps/sopn_parsing/management/commands/sopn_parsing_process_unparsed.py Outdated
Comment thread ynr/apps/sopn_parsing/helpers/textract_helpers.py
This was used to capture baselines for Camelot. We no longer need this
code
@symroe symroe force-pushed the remove-camelot branch 4 times, most recently from 9170df9 to 64a87e2 March 3, 2026 21:03
@symroe symroe requested a review from chris48s March 3, 2026 21:10
Comment thread .env.example
Comment thread ynr/apps/bulk_adding/templates/bulk_add/sopns/add_form.html
Comment thread ynr/apps/sopn_parsing/tests/__init__.py Outdated
symroe and others added 10 commits March 4, 2026 13:25
This is a bit of a tangle to remove cleanly. I think I've removed too
much, especially some of the tests that really should be converted to
use AWS Textract rather than just removing them.

The plan is to build these up again with a bit of a rethink / redesign
of the whole system. Maintaining these tests while refactoring wouldn't
be a good idea, so I suggest revisiting them later.
Now we're in a container, we don't need to skip these tests
This hasn't been useful for a while and we now have a system to import
from a DB backup
This was only used by Camelot and removing it should help us catch code
paths that still try to use camelot
@symroe symroe merged commit 893126c into master Mar 4, 2026
6 checks passed