build(medcat and medcat-den): CU-869ddh1jv Avoid test resources in releases #503

Draft
mart-r wants to merge 9 commits into main from build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases

mart-r (Collaborator) commented May 22, 2026

The underlying issue

medcat-den source distributions are pushed to TestPyPI on every commit. Because they include test-time resources (test / fake models), they are rather large (~32MB). Over time this has meant we've reached PyPI's per-project storage limit of 10GB. As a result, medcat-den workflows on the main branch are now failing because the TestPyPI uploads are failing.

Caveats to consider

Packaging your tests (along with the resources required to run them) is quite common for source distributions. In fact, the default behaviour seems to be to include everything that is tracked by git. There are a number of ways to get around this (e.g. removing the files before building, or pruning in MANIFEST.in), but they seem to run counter to open source principles or don't really follow modern package building standards.

The proposed plan

To make excluding the test resources a viable option, I plan to store test-time models centrally in the repo. This means that they won't be included in the builds, since they're outside the scope of each project's source. It also has the added benefit of allowing us to reuse the same test models across multiple projects within the repo (e.g. medcat and medcat-den, but why not medcat-service as well).

On top of that, there needs to be a way to access these files from a source distribution. Because the source distribution no longer includes these test-time resources, they need to be fetched. The plan uses pooch to do the fetching from the relevant version on GitHub, but the logic defaults to local files if available. This will involve including these files in the relevant releases as well. Along the way we also need to make some changes to the exact paths that are used to interact with these models in the test suite (but that shouldn't be extensive).
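As a rough illustration of the "local files first, otherwise fetch with pooch" logic described above, here is a minimal sketch. It is not the implementation in this PR: the function names, the `test-resources` directory name, the cache directory name, and the assumption that resources are attached as GitHub release assets are all hypothetical. Only `pooch.retrieve` and `pooch.os_cache` are real pooch calls.

```python
from pathlib import Path
from typing import Optional, Union


def release_asset_url(repo: str, tag: str, name: str) -> str:
    """Build a GitHub release-asset URL.

    Assumption: test resources are uploaded as assets of the release `tag`.
    """
    return f"https://github.com/{repo}/releases/download/{tag}/{name}"


def resolve_test_resource(
    name: str,
    local_root: Union[str, Path],
    repo: str,
    tag: str,
    resource_dir: str = "test-resources",  # hypothetical central directory
    known_hash: Optional[str] = None,
) -> Path:
    """Return a path to a test resource, preferring a local copy."""
    # 1. In a git checkout, the centrally stored file is already on disk.
    local_candidate = Path(local_root) / resource_dir / name
    if local_candidate.exists():
        return local_candidate

    # 2. From a source distribution the file is missing, so fetch it.
    #    pooch is imported lazily so that a checkout never needs it.
    import pooch

    fetched = pooch.retrieve(
        url=release_asset_url(repo, tag, name),
        known_hash=known_hash,  # None skips hash verification
        fname=name,
        path=pooch.os_cache("medcat-test-resources"),  # hypothetical cache name
    )
    return Path(fetched)
```

In a test suite this could be called once per resource, for example from a fixture, passing the repository root as `local_root` and the installed package version as `tag`. Supplying a `known_hash` would guard against a corrupted or mismatched download.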

This is the plan:

  • Store test-time models centrally
  • Implement for medcat
  • Implement for medcat-den
  • Implement workflow to add resources to relevant releases

mart-r marked this pull request as draft May 22, 2026 11:53