SA-653/migrate sic data access to library#12

Merged
dstewartons merged 6 commits into main from SA-653/migrate-sic-data-access-to-library
Jun 15, 2026

@dstewartons dstewartons commented Jun 5, 2026

📌 Pull Request Template

Please complete all sections

✨ Summary

Moves SIC workbook data access into sic-classification-library, mirroring the completed SA-649 SOC merge (soc_data_access in soc-classification-library). Adds industrial_classification.data_access.sic_data_access with load_sic_index, load_sic_structure, load_sic_hierarchy, and load_text_from_config so sic-classification-utils can import from the library and delete its duplicate sic_data_access.py.

Depends on: This PR merges first. Follow-up PRs on branch SA-653/migrate-sic-data-access-to-library in sic-classification-utils (import rewiring, remove utils module), then lockfile bumps in sic-classification-vector-store and survey-assist-api.

Note
Packaged .xlsx workbooks remain in industrial_classification_utils.data.sic_index for now (same pattern as SOC post–SA-649). Config tuples are unchanged; only the loader implementation moves to the library.

CI note: CI intentionally fails here because, until the dependency tags exist in the remote merged repos, pyproject.toml temporarily uses local editable paths.

📜 Changes Introduced

  • Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)

  • Updates to tests and/or documentation

  • Terraform changes (if applicable) — N/A

  • src/industrial_classification/data_access/sic_data_access.py (new): workbook loaders and load_sic_hierarchy wrapper around sic_hierarchy.load_hierarchy; behaviour matches the former utils module (sheet names, columns, normalisation).

  • tests/test_data_access.py (new): mocked workbook tests for load_sic_index, load_sic_structure, and load_sic_hierarchy.

  • pyproject.toml: version 0.1.5; add runtime dependency openpyxl ^3.1.5.

  • poetry.lock: refreshed for openpyxl.
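
For reviewers unfamiliar with the mocked-workbook approach mentioned for tests/test_data_access.py, the following is a minimal sketch of the general technique, not the actual tests or loader in this PR. The function `load_index_rows`, the helper `make_fake_opener`, the sheet name, the column layout, and the strip/lowercase normalisation are all hypothetical. The mock only mimics the openpyxl call shape `load_workbook(path)[sheet].iter_rows(values_only=True)`.

```python
from unittest.mock import MagicMock


def load_index_rows(open_workbook, path, sheet_name):
    """Hypothetical loader: read (code, title) rows from one sheet and normalise them."""
    workbook = open_workbook(path)
    sheet = workbook[sheet_name]
    rows = []
    for code, title in sheet.iter_rows(values_only=True):
        # Assumed normalisation: trim whitespace, lowercase the title.
        rows.append({"code": str(code).strip(), "title": str(title).strip().lower()})
    return rows


def make_fake_opener(cell_rows):
    """Build a mock that stands in for a workbook opener returning one sheet."""
    fake_sheet = MagicMock()
    fake_sheet.iter_rows.return_value = cell_rows
    fake_workbook = MagicMock()
    fake_workbook.__getitem__.return_value = fake_sheet
    return MagicMock(return_value=fake_workbook)


# Made-up cell values for illustration, not real SIC index data.
opener = make_fake_opener([(" 00001 ", "Example Activity "), ("00002", "Another Example")])
rows = load_index_rows(opener, "unused.xlsx", "ExampleSheet")
```

Passing the opener in as an argument keeps the sketch free of file I/O; in a real test the same effect is usually achieved by patching the workbook-reading call inside the module under test.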

Out of scope in this repo

  • SICLookup CSV lookup behaviour unchanged.
  • No move of packaged .xlsx assets from utils into the library wheel (optional follow-up).

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed (library module docstring only; utils docs updated in sibling PR)

🔍 How to Test

Prerequisites

Sibling checkouts on branch SA-653/migrate-sic-data-access-to-library (in the same parent directory so path deps resolve), plus review of the following PRs:

  1. PR sic-classification-library - this repo
  2. PR sic-classification-utils
  3. PR sic-classification-vector-store
  4. PR survey-assist-api

Unit tests (this repo only)

From sic-classification-library root:

poetry install
poetry run pytest tests/ -q

Verified on this branch: 13 passed (includes 3 new test_data_access tests plus existing lookup/meta tests).

Focused data-access tests:

poetry run pytest tests/test_data_access.py -v

Lint (if you use project conventions):

poetry run pylint --recursive=y src/industrial_classification tests

Note

  • If imports fail with No module named 'industrial_classification.data_access', reinstall this library (poetry install in library, then refresh utils/API venv).
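
A quick way to confirm whether the active virtual environment can see the new subpackage before running the wider test suites is sketched below. The helper name `module_available` is hypothetical; it relies only on the standard-library `importlib.util.find_spec`, which raises `ModuleNotFoundError` when the parent package of a dotted name is missing.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is not installed.
        return False


if __name__ == "__main__":
    target = "industrial_classification.data_access"
    print(f"{target}: {'found' if module_available(target) else 'missing'}")
```

If this reports the module as missing inside the utils or API venv, that environment is most likely still pointing at an older install of the library.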

- load SIC index, structure and hierarchy from packaged workbooks like the former utils module
@dstewartons dstewartons changed the title from "Sa 653/migrate sic data access to library" to "SA-653/migrate sic data access to library" Jun 5, 2026
@dstewartons dstewartons force-pushed the SA-653/migrate-sic-data-access-to-library branch from 6e2f05a to 92fb9a7 Compare June 5, 2026 12:06

@gibbardsteve gibbardsteve left a comment

I tested all PRs in one shot. I am happy to approve these PRs, but there is one observation we should fix as a follow up JIRA ticket please.

I have:
manually reviewed code
run unit tests across all PRs and repositories
tested sic vector store can be built and loaded
manually tested search-index and status
(observation - vector store can no longer be built from xlsx input, I think this is as per data science classifai changes and not an artifact of this PR)
(observation - /status endpoint has a null field for index. Given the merging of code and changes coming, I don't think we want to worry about fixing this yet)

{
    "embedding_model_name": "all-MiniLM-L6-v2",
    "db_dir": "src/sic_classification_vector_store/data/vector_store",
    "index_source_file": null,
    "k_matches": 20,
    "status": "ready",
    "index_size": 34663
}

manually tested Survey Assist api
sic-lookup is successful for match and no match scenarios
classify is successful for ambiguous and classified scenarios
(observation - the SIC and SOC vector store clients are being created each time a classify request comes in. The clients should only be instantiated once in the lifecycle of the app and both share the same http client. This must be fixed in a separate ticket please).

Example:

shared_http_client = httpx.AsyncClient()

fastapi_app.state.sic_vector_store_client = SICVectorStoreClient(
    base_url=sic_url,
    http_client=shared_http_client,
)

fastapi_app.state.soc_vector_store_client = SOCVectorStoreClient(
    base_url=soc_url,
    http_client=shared_http_client,
)

And the BaseVectorStoreClient needs to use the shared http client.
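
One way the base client could accept the shared client is sketched below. This is an assumption about the shape of the fix, not the existing implementation: the constructor signature, the `attach_clients` and `get_sic_client` helpers, and the example URLs are hypothetical, and `StubHttpClient` stands in for `httpx.AsyncClient` so the sketch runs without third-party packages.

```python
from types import SimpleNamespace


class StubHttpClient:
    """Stand-in for a shared async HTTP client (for example httpx.AsyncClient)."""


class BaseVectorStoreClient:
    def __init__(self, base_url, http_client):
        self.base_url = base_url.rstrip("/")
        # Injected once at startup; the base class never creates its own client.
        self.http_client = http_client


class SICVectorStoreClient(BaseVectorStoreClient):
    pass


class SOCVectorStoreClient(BaseVectorStoreClient):
    pass


def attach_clients(state, sic_url, soc_url, http_client):
    """Create both clients once and store them on the app state object."""
    state.sic_vector_store_client = SICVectorStoreClient(
        base_url=sic_url, http_client=http_client
    )
    state.soc_vector_store_client = SOCVectorStoreClient(
        base_url=soc_url, http_client=http_client
    )
    return state


def get_sic_client(state):
    """A request handler would read the existing client instead of building one."""
    return state.sic_vector_store_client


# SimpleNamespace plays the role of fastapi_app.state in this sketch.
state = attach_clients(
    SimpleNamespace(), "http://sic.example/", "http://soc.example/", StubHttpClient()
)
```

In the real app the construction would sit in the startup path and the shared client would be closed on shutdown, so that classify requests only ever look the clients up.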

dstewartons commented Jun 15, 2026

(observation - vector store can no longer be built from xlsx input, I think this is as per data science classifai changes and not an artifact of this PR)

Yes, out of scope: this PR only moves the SIC workbook loaders into the library; it does not change how the vector store is built.

(observation - /status endpoint has a null field for index. Given the merging of code and changes coming, I don't think we want to worry about fixing this yet)

I’ve seen the same and I’m treating it as a pre-existing follow-on from the status model work, not something this PR introduced. Created SA-742 on the board for this.

observation - the SIC and SOC vector store clients are being created each time a classify request comes in.

Created SA-741 to address that

@dstewartons dstewartons merged commit 92fb9a7 into main Jun 15, 2026
5 checks passed
@dstewartons dstewartons deleted the SA-653/migrate-sic-data-access-to-library branch June 15, 2026 12:28