SA-653/migrate sic data access to library#12
Conversation
- load SIC index, structure and hierarchy from packaged workbooks like the former utils module
- cover load_sic_index, load_sic_structure and load_sic_hierarchy with mocked workbook loads
6e2f05a to
92fb9a7
Compare
gibbardsteve
left a comment
There was a problem hiding this comment.
I tested all PRs in one shot. I am happy to approve these PRs, but there is one observation we should fix as a follow up JIRA ticket please.
I have:
manually reviewed code
run unit tests across all PRs and repositories
tested sic vector store can be built and loaded
manually tested search-index and status
(observation - vector store can no longer be built from xlsx input, I think this is as per data science classifai changes and not an artifact of this PR)
(observation - /status endpoint has a null field for index. Given the merging of code, changes coming I don't thin we want to worry about fixing this yet)
{
"embedding_model_name": "all-MiniLM-L6-v2",
"db_dir": "src/sic_classification_vector_store/data/vector_store",
"index_source_file": null,
"k_matches": 20,
"status": "ready",
"index_size": 34663
}manually tested Survey Assist api
sic-lookup is successful for match and no match scenarios
classify is successful for ambiguous and classified scenarios
(observation - the SIC and SOC vector store clients are being created each time a classify request comes in. The clients should only be instantiated once in the lifecycle of the app and both share the same http client. This must be fixed in a separate ticket please).
Example:
shared_http_client = httpx.AsyncClient()
fastapi_app.state.sic_vector_store_client = SICVectorStoreClient(
base_url=sic_url,
http_client=shared_http_client,
)
fastapi_app.state.soc_vector_store_client = SOCVectorStoreClient(
base_url=soc_url,
http_client=shared_http_client,
)And the BaseVectorStoreClient needs to use the shared http client.
Yes out of scope PR only moves SIC workbook loaders into the library, it does not change how the vector store is built.
I’ve seen the same and I’m treating it as pre-existing follow-on from the status model work, not something this PR introduced - created SA-742 on the board for this
Created SA-741 to address that |
📌 Pull Request Template
✨ Summary
Moves SIC workbook data access into
sic-classification-library, mirroring the completed SA-649 SOC merge (soc_data_accessinsoc-classification-library). Addsindustrial_classification.data_access.sic_data_accesswithload_sic_index,load_sic_structure,load_sic_hierarchy, andload_text_from_configsosic-classification-utilscan import from the library and delete its duplicatesic_data_access.py.Depends on: This PR merges first. Follow-up PRs on branch
SA-653/migrate-sic-data-access-to-libraryinsic-classification-utils(import rewiring, remove utils module), then lockfile bumps insic-classification-vector-storeandsurvey-assist-api.📜 Changes Introduced
Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)
Updates to tests and/or documentation
Terraform changes (if applicable) — N/A
src/industrial_classification/data_access/sic_data_access.py(new): workbook loaders andload_sic_hierarchywrapper aroundsic_hierarchy.load_hierarchy; behaviour matches the former utils module (sheet names, columns, normalisation).tests/test_data_access.py(new): mocked workbook tests forload_sic_index,load_sic_structure, andload_sic_hierarchy.pyproject.toml: version 0.1.5; add runtime dependencyopenpyxl ^3.1.5.poetry.lock: refreshed foropenpyxl.Out of scope in this repo
SICLookupCSV lookup behaviour unchanged..xlsxassets from utils into the library wheel (optional follow-up).✅ Checklist
terraform fmt&terraform validate)🔍 How to Test
Prerequisites
Sibling checkouts on branch
SA-653/migrate-sic-data-access-to-library(same parent directory so path deps resolve) and review of the followingsic-classification-library- this reposic-classification-utilssic-classification-vector-storesurvey-assist-apiUnit tests (this repo only)
From
sic-classification-libraryroot:Verified on this branch:
13 passed(includes 3 newtest_data_accesstests plus existing lookup/meta tests).Focused data-access tests:
Lint (if you use project conventions):
Note
No module named 'industrial_classification.data_access', reinstall this library (poetry installin library, then refresh utils/API venv).