SOC Classification Utils

Status: Retiring

Warning

Repository retirement notice

This repository will be retired in the coming months as the functionality is being moved elsewhere. Existing code remains available for reference, but new development should be directed to the replacement repositories (survey-assist-embed-core and/or survey-assist-classification-core).

Standard Occupational Classification (SOC) utilities, initially developed for the Survey Assist API, that complement the SOC Classification Library code.

Overview

Utilities for classifying occupations against the SOC. This repository contains core code used by the SOC Classification Library.

Features

  • Embeddings. Functionality for embedding SOC hierarchy data, managing vector stores, and performing similarity searches.
  • Data Access. Functions to load CSV data files related to SOC.
  • LLM flows:
    • Survey Assist classify (two-step): unambiguous_soc_code, then formulate_open_question when not codable (mirrors SIC).
    • Legacy / other callers: sa_rag_soc_code (single-shot RAG; not used on the classify route).
    • Planned (SA-700): reranker_soc and SOC_PROMPT_RERANKER, mirroring reranker_sic / SIC_PROMPT_RERANKER in sic-classification-utils. Neither the SIC nor the SOC classify route calls a reranker today. Until the SOC reranker exists, survey-assist-api config v3 lists SA_SOC_PROMPT_RAG as the third SOC prompt instead of SOC_PROMPT_RERANKER (SIC v3 lists the reranker prompt).
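
The two-step classify flow described above can be sketched as plain control flow. This is a minimal illustration, not the library's implementation: ClassifyResult, classify_two_step and the two stand-in callables are hypothetical names, the real unambiguous_soc_code and formulate_open_question steps call an LLM and may have different signatures, and "0000" is a placeholder rather than a real SOC code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClassifyResult:
    """Hypothetical result container; not the library's actual model."""

    codable: bool
    soc_code: Optional[str] = None
    follow_up_question: Optional[str] = None


def classify_two_step(
    job_title: str,
    unambiguous_step: Callable[[str], Optional[str]],
    open_question_step: Callable[[str], str],
) -> ClassifyResult:
    """Run step one; fall back to a follow-up question when no code is found."""
    soc_code = unambiguous_step(job_title)
    if soc_code is not None:
        return ClassifyResult(codable=True, soc_code=soc_code)
    return ClassifyResult(
        codable=False,
        follow_up_question=open_question_step(job_title),
    )


# Stand-in callables for illustration only (assumption: no LLM involved).
def fake_unambiguous(job_title: str) -> Optional[str]:
    # "0000" is a placeholder, not a real SOC code.
    lookup = {"primary school teacher": "0000"}
    return lookup.get(job_title.lower())


def fake_open_question(job_title: str) -> str:
    return f"Can you describe the main tasks you do as a '{job_title}'?"


coded = classify_two_step("Primary school teacher", fake_unambiguous, fake_open_question)
uncoded = classify_two_step("Manager", fake_unambiguous, fake_open_question)
```

Passing the two steps in as callables keeps the fallback logic testable without a model; in this sketch, the second step runs only when the first returns no code.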

Prerequisites

Ensure you have the following installed on your local machine:

  • Python 3.12 (Recommended: use pyenv to manage versions)
  • poetry (for dependency management)
  • Colima (if running locally with containers)
  • Terraform (for infrastructure management)
  • Google Cloud SDK (gcloud) with appropriate permissions

Local Development Setup

The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.

Clone the repository

git clone https://github.com/ONSdigital/soc-classification-utils.git
cd soc-classification-utils

Install Dependencies

poetry install

Add Git Hooks

Git hooks can be used to check code before each commit. To install them, run:

pre-commit install

Run Locally

TODO

Structure

docs - documentation as code using mkdocs

scripts - location of any supporting scripts (e.g. data cleansing)

TODO

src/occupational_classification_utils/data - example data and SOC classification data used for embeddings

src/occupational_classification_utils/embed - ChromaDB vector store and embedding code, includes an example use of the store.

src/occupational_classification_utils/models - common data structures that need to be shared

src/occupational_classification_utils/utils - common utility functions, such as reading xls files for embeddings.

tests - pytest unit tests for the code base; the aim is 80% coverage.
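
To make the roles of the data and embed packages concrete, the sketch below loads a few rows from CSV text, turns each description into a vector, and ranks rows by similarity to a query. It is a toy under stated assumptions: the rows, codes and column names are made up, the bag-of-words "embedding" stands in for a real embedding model, and the in-memory search stands in for the ChromaDB vector store. None of the function names come from this repository.

```python
import csv
import io
import math
from collections import Counter

# Made-up rows; codes and column names are placeholders, not real SOC data.
SAMPLE_CSV = """code,description
A1,teaching children in primary schools
A2,driving buses and coaches
A3,nursing patients in hospitals
"""


def load_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(
    query: str, rows: list[dict[str, str]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k codes whose descriptions are most similar to the query."""
    q = embed(query)
    scored = [(row["code"], cosine(q, embed(row["description"]))) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


rows = load_rows(SAMPLE_CSV)
results = search("teaching in schools", rows)
```

Here the query shares three words with row A1 and one with row A3, so A1 ranks first. A real embedding model would also match descriptions that share meaning but not exact words.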

GCP Setup

TODO

Code Quality

Code quality and static analysis will be enforced using ruff and mypy. Security checking will be enhanced by running bandit.

To check code quality and report errors without auto-fixing them, run:

make check-python-nofix

To check code quality and automatically fix errors where possible, run:

make check-python

Documentation

Documentation is available in the docs folder and can be viewed using mkdocs:

make run-docs

Testing

TODO

Environment Variables

TODO
