Warning
Repository retirement notice
This repository will be retired in the coming months as the functionality is being moved elsewhere. Existing code remains available for reference, but new development should be directed to the replacement repositories (survey-assist-embed-core and/or survey-assist-classification-core).
Standard Occupational Classification (SOC) utilities, initially developed for the Survey Assist API, that complement the SOC Classification Library code.
SOC classification utilities used in the classification of occupations. This repository contains core code used by the SOC Classification Library.
- Embeddings. Functionality for embedding SOC hierarchy data, managing vector stores, and performing similarity searches
- Data Access. Functions to load CSV data files related to SOC.
- LLM flows:
  - Survey Assist classify (two-step): `unambiguous_soc_code`, then `formulate_open_question` when not codable (mirrors SIC).
  - Legacy / other callers: `sa_rag_soc_code` (single-shot RAG; not used on the classify route).
  - Planned (SA-700): `reranker_soc` and `SOC_PROMPT_RERANKER`, mirroring `reranker_sic` / `SIC_PROMPT_RERANKER` in `sic-classification-utils`. Neither SIC nor SOC classify calls a reranker today. Until SOC reranker exists, `survey-assist-api` config v3 lists `SA_SOC_PROMPT_RAG` as the third SOC prompt instead of `SOC_PROMPT_RERANKER` (SIC v3 lists the reranker prompt).
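The two-step classify flow listed above can be sketched as follows. This is a minimal illustration of the control flow only, not the library's implementation: `classify_two_step`, `ClassifyResult`, and both stand-in functions are hypothetical names, the real signatures of `unambiguous_soc_code` and `formulate_open_question` are not shown in this README, and the code `"0000"` is a placeholder rather than a real SOC code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClassifyResult:
    """Outcome of the two-step flow (hypothetical structure)."""

    codable: bool
    soc_code: Optional[str] = None
    follow_up_question: Optional[str] = None


def classify_two_step(
    job_title: str,
    job_description: str,
    assign_code: Callable[[str, str], Optional[str]],
    ask_question: Callable[[str, str], str],
) -> ClassifyResult:
    """Try for an unambiguous code first; otherwise produce an open question."""
    # Step 1: attempt an unambiguous assignment.
    code = assign_code(job_title, job_description)
    if code is not None:
        return ClassifyResult(codable=True, soc_code=code)
    # Step 2: not codable, so formulate a follow-up question instead.
    question = ask_question(job_title, job_description)
    return ClassifyResult(codable=False, follow_up_question=question)


# Stand-ins for the LLM-backed steps (assumptions, for illustration only).
def fake_assign_code(title: str, description: str) -> Optional[str]:
    known = {"primary school teacher": "0000"}  # placeholder, not a real SOC code
    return known.get(title.lower())


def fake_ask_question(title: str, description: str) -> str:
    return f"What are the main tasks in your job as '{title}'?"


codable = classify_two_step(
    "Primary school teacher", "Teaches children", fake_assign_code, fake_ask_question
)
not_codable = classify_two_step("Manager", "", fake_assign_code, fake_ask_question)
```

Passing the two steps in as callables keeps the sketch independent of any particular LLM client, so the branching can be exercised without network access.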
Ensure you have the following installed on your local machine:
- Python 3.12 (Recommended: use `pyenv` to manage versions)
- `poetry` (for dependency management)
- Colima (if running locally with containers)
- Terraform (for infrastructure management)
- Google Cloud SDK (`gcloud`) with appropriate permissions
The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.
git clone https://github.com/ONSdigital/soc-classification-utils.git
cd soc-classification-utils
poetry install
Git hooks can be used to check code before commit. To install run:
pre-commit install
docs - documentation as code using mkdocs
scripts - location of any supporting scripts (e.g. data cleansing)
src/occupational_classification_utils/data - example data and SOC classification data used for embeddings
src/occupational_classification_utils/embed - ChromaDB vector store and embedding code, includes an example use of the store.
src/occupational_classification_utils/models - common data structures that need to be shared
src/occupational_classification_utils/utils - common utility functions, such as reading xls files for embeddings.
tests - PyTest unit testing for code base, aim is for 80% coverage.
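As a rough illustration of the similarity search that the embed package provides, the sketch below ranks a few descriptions against a query by cosine similarity. It is a toy under stated assumptions: it uses a bag-of-words vector in place of a real embedding model, a plain dictionary in place of the ChromaDB vector store, and placeholder entries rather than real SOC data. The names `embed`, `cosine`, and `search` are hypothetical and are not the package's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real store would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, entries: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(key, cosine(query_vec, embed(text))) for key, text in entries.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder entries keyed by made-up identifiers (not real SOC codes).
entries = {
    "A": "teaching professionals in primary education",
    "B": "software developers and programmers",
    "C": "nurses and midwives",
}
results = search("primary school teaching", entries, k=2)
```

Here the query shares the tokens "primary" and "teaching" with entry "A", so that entry ranks first; the other entries share no tokens and score zero.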
Code quality and static analysis will be enforced using ruff and mypy. Security checking will be enhanced by running bandit.
To check the code quality, but only report any errors without auto-fix run:
make check-python-nofix
To check the code quality and automatically fix errors where possible run:
make check-python
Documentation is available in the docs folder and can be viewed using mkdocs:
make run-docs