feat(sayt): Add SAYTBuilder that constructs runtime artefacts for later use#71

Merged
Tom-Owen-ONS merged 11 commits into main from SA-694-load-vector-store-db-from-parquet-on-runtime-v2
Jun 19, 2026
@Tom-Owen-ONS

Contributor

📌 Pull Request Template

Please complete all sections

✨ Summary

This PR adds a SAYTBuilder class. It takes a set of configured RetrieverSpecs and a corpus, and constructs an artefact folder containing the built vector DB parquet files, the corpus, and a manifest that locates every file needed to run a SAYTSuggester. The artefacts can therefore be built at the image build stage and loaded later from the artefact directory via a SAYTSuggester.from_artefact() method.
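
The build-then-load flow described above can be illustrated with a minimal sketch. It is not the SAYTBuilder implementation: `build_artefact`, `load_artefact`, the file names and the manifest layout are all hypothetical, and small JSON files stand in for the parquet vector stores so that the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def build_artefact(out_dir: Path, corpus: list[str], retriever_names: list[str]) -> Path:
    """Write the corpus, one placeholder index per retriever, and a manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "corpus.json").write_text(json.dumps(corpus), encoding="utf-8")

    index_files = {}
    for name in retriever_names:
        # Placeholder for a built vector store (the PR stores these as parquet).
        index_path = out_dir / f"{name}_index.json"
        index_path.write_text(
            json.dumps({"retriever": name, "size": len(corpus)}), encoding="utf-8"
        )
        index_files[name] = index_path.name

    # The manifest records relative file names so the folder can be relocated.
    manifest = {"corpus": "corpus.json", "indexes": index_files}
    manifest_path = out_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def load_artefact(artefact_dir: Path) -> dict:
    """Locate every file through the manifest, without rebuilding anything."""
    manifest = json.loads((artefact_dir / "manifest.json").read_text(encoding="utf-8"))
    corpus = json.loads((artefact_dir / manifest["corpus"]).read_text(encoding="utf-8"))
    indexes = {
        name: json.loads((artefact_dir / filename).read_text(encoding="utf-8"))
        for name, filename in manifest["indexes"].items()
    }
    return {"corpus": corpus, "indexes": indexes}


with tempfile.TemporaryDirectory() as tmp:
    artefact_dir = Path(tmp) / "sayt_artefact"
    # Build step (for example at image build time), then the runtime load step.
    build_artefact(artefact_dir, ["bakery", "dairy farming"], ["prefix", "semantic"])
    loaded = load_artefact(artefact_dir)
    print(sorted(loaded["indexes"]))
```

Keeping only relative paths in the manifest is a design choice of this sketch; it lets the artefact folder be copied into an image without rewriting the manifest.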

📜 Changes Introduced

  • Add SAYTBuilder to build an artefact directory, so that a SAYTSuggester can be constructed quickly from a pre-built vector DB.

  • Add an example notebook to test this functionality.

  • Expands SAYTSuggester.get_config() to return a SaytConfiguration Pydantic model, which contains information about global suggester settings, the corpus, the retrievers and the artefact provenance (file locations etc).

  • Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)

  • Updates to tests and/or documentation

  • Terraform changes (if applicable)
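
As a rough illustration of the four areas that the expanded get_config() result is described as covering (global settings, corpus, retrievers, artefact provenance), here is a sketch of a nested configuration object. Standard-library dataclasses stand in for the Pydantic model, and every class and field name below is an assumption made for illustration, not the actual SaytConfiguration schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RetrieverInfoSketch:
    """Hypothetical per-retriever entry."""
    name: str
    kind: str


@dataclass
class ProvenanceSketch:
    """Hypothetical record of where the artefact files live."""
    artefact_dir: str
    manifest_file: str


@dataclass
class ConfigurationSketch:
    """Hypothetical top-level configuration grouping the four areas."""
    max_suggestions: int  # global suggester setting
    corpus_size: int  # corpus information
    retrievers: list[RetrieverInfoSketch] = field(default_factory=list)
    provenance: Optional[ProvenanceSketch] = None  # None when built in memory


config = ConfigurationSketch(
    max_suggestions=10,
    corpus_size=2,
    retrievers=[RetrieverInfoSketch("prefix", "lexical")],
    provenance=ProvenanceSketch("sayt_artefact", "manifest.json"),
)

# asdict() recurses into nested dataclasses, giving a plain serialisable dict.
config_dict = asdict(config)
print(config_dict["provenance"]["manifest_file"])
```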

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

A notebook has been provided in demos/sayt/sayt_artifact_example.py to show the basic usage of SAYTBuilder.

@Tom-Owen-ONS Tom-Owen-ONS requested a review from ivyONS June 9, 2026 14:12
@Tom-Owen-ONS Tom-Owen-ONS changed the title from "feat(sayt): Add SAYTBuilder that constructs runtime artifacts for later use" to "feat(sayt): Add SAYTBuilder that constructs runtime artefacts for later use" Jun 9, 2026
Comment thread src/industrial_classification_utils/sayt/retriever_specs.py Outdated
@Tom-Owen-ONS

Contributor Author

For consideration: remove the sayt_* prefixes from the modules in src/industrial_classification_utils/sayt. They already live in a sub-package called 'sayt', so the prefix should be implied.

@Tom-Owen-ONS

Tom-Owen-ONS commented Jun 9, 2026

Contributor Author

This does not directly implement per-retriever min_chars: the PR was already getting quite big, and it would be nice to add that as a separate feature with its own example notebooks etc. Happy to be challenged on this and add it here, though.
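
For context on what such a follow-up might look like, here is a small sketch of per-retriever gating. It assumes that min_chars means the minimum number of typed characters before a retriever is queried, which the comment above does not spell out; `RetrieverSketch` and `active_retrievers` are hypothetical names, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrieverSketch:
    """Hypothetical retriever description with its own activation threshold."""
    name: str
    min_chars: int


def active_retrievers(query: str, retrievers: list[RetrieverSketch]) -> list[str]:
    """Return the names of retrievers whose min_chars threshold the query meets."""
    typed = len(query.strip())
    return [r.name for r in retrievers if typed >= r.min_chars]


retrievers = [RetrieverSketch("prefix", 1), RetrieverSketch("semantic", 4)]
print(active_retrievers("ba", retrievers))
print(active_retrievers("bake", retrievers))
```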

Comment thread src/industrial_classification_utils/sayt/suggester.py Outdated
Comment thread src/industrial_classification_utils/sayt/suggester.py Outdated
Comment thread src/industrial_classification_utils/sayt/storage.py Outdated
Comment thread src/industrial_classification_utils/sayt/storage.py Outdated
Comment thread src/industrial_classification_utils/sayt/storage.py Outdated
Comment thread src/industrial_classification_utils/sayt/core.py Outdated
Comment thread src/industrial_classification_utils/sayt/core.py
Comment thread src/industrial_classification_utils/sayt/builder.py Outdated
Comment thread src/industrial_classification_utils/sayt/core.py Outdated
Comment thread src/industrial_classification_utils/sayt/suggester.py
@ivyONS

ivyONS commented Jun 15, 2026

Contributor

Works as expected and gives a good speed-up on load.

Next steps:
  • Async: consider building, loading and querying the different retrievers in parallel.
  • Caching: can we add it? As the user types more characters, the initial part of the string will be searched repeatedly.
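
Both suggestions can be sketched together in a few lines. This is only an illustration under stated assumptions: the two toy retrievers, `CORPUS` and `suggest` are hypothetical and do not use the SAYTSuggester API. A thread pool stands in for the parallel querying, and `functools.lru_cache` stands in for the cache on repeated query strings.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

CORPUS = ("bakery", "baker's shop", "dairy farming")
CALLS = {"count": 0}  # counts real (uncached) lookups, for demonstration only


def prefix_retriever(query: str) -> list[str]:
    """Toy retriever: corpus entries that start with the query."""
    return [text for text in CORPUS if text.startswith(query)]


def substring_retriever(query: str) -> list[str]:
    """Toy retriever: corpus entries that contain the query anywhere."""
    return [text for text in CORPUS if query in text]


RETRIEVERS = (prefix_retriever, substring_retriever)


@lru_cache(maxsize=1024)
def suggest(query: str) -> tuple[str, ...]:
    """Query all retrievers in parallel and merge results, keeping first-seen order."""
    CALLS["count"] += 1
    with ThreadPoolExecutor(max_workers=len(RETRIEVERS)) as pool:
        # pool.map preserves the order of RETRIEVERS in its results.
        per_retriever = list(pool.map(lambda retriever: retriever(query), RETRIEVERS))
    merged = dict.fromkeys(text for results in per_retriever for text in results)
    return tuple(merged)  # tuples are immutable, so cached values cannot be altered


first = suggest("bak")
second = suggest("bak")  # served from the cache, no retriever is called
print(first, CALLS["count"])
```

Threads help most when retrievers release the GIL or wait on I/O; for CPU-bound pure-Python retrievers a process pool or native code would be the more likely route.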

Comment thread .pre-commit-config.yaml
Comment thread src/industrial_classification_utils/sayt/_base.py

@ivyONS ivyONS left a comment

Contributor

LGTM

@Tom-Owen-ONS Tom-Owen-ONS merged commit 54735a0 into main Jun 19, 2026
5 checks passed
@Tom-Owen-ONS Tom-Owen-ONS deleted the SA-694-load-vector-store-db-from-parquet-on-runtime-v2 branch June 19, 2026 10:12
2 participants