SA-674/soc lookup csv compatibilty#52
…ests - verify missing descriptions return a null code instead of raising
- add explicit soc hierarchy pairs for minor sub major and major levels
…ichment - populate more static SOCmeta entries from soc 2020 volume 1 descriptions data
…criptions only and removing strip on ingest
gibbardsteve
left a comment
The changes look good and testing is working as expected. The relevant release tags need applying.
Running SOC lookup and SIC lookup gives us all we need for initial testing I think.
I ran with similarity = true, searched for strings that matched and strings that didn’t, and spot-checked the SOC metadata; all looks OK.
I have some comments below, but these are all separate tickets that can go through refinement and prioritisation.
The metadata is now expanded. I see we added an array for tasks, but it currently holds a single long string containing all the tasks:
"tasks": [
"~analyses economic, social, legal and other data, and plans, formulates and directs at strategic level the operation of a company or organisation ~consults with subordinates to formulate, implement and review company/organisation policy, authorises funding for policy implementation programmes and institutes reporting, auditing and control systems ~prepares, or arranges for the preparation of, reports, budgets, forecasts or other information ~plans and controls the allocation of resources and the selection of senior staff ~evaluates government/local authority departmental activities, discusses problems with government/local authority officials and administrators and formulates departmental policy ~negotiates and monitors contracted out services provided to the local authority by the private sector ~studies and acts upon any legislation that may affect the local authority ~stimulates public interest by providing publicity, giving lectures and interviews and organising appeals for a variety of causes ~directs or undertakes the preparation, publication and dissemination of reports and other information of interest to members and other interested parties"
]
It would be nice to split on ‘~’ and have one array item per task. This is a nice-to-have, though, so it could be a separate ticket that goes on the backlog to be prioritised.
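A minimal sketch of the suggested split, assuming each task in the source string is introduced by a '~' and that '~' never appears inside a task. The helper names `split_tasks` and `normalise_tasks` are hypothetical and are not part of the library.

```python
def split_tasks(raw: str) -> list[str]:
    """Split a '~'-delimited string into individual, whitespace-trimmed tasks."""
    # Empty fragments (e.g. the one before the leading '~') are dropped.
    return [part.strip() for part in raw.split("~") if part.strip()]


def normalise_tasks(tasks: list[str]) -> list[str]:
    """Flatten a tasks array whose items may each hold several '~'-prefixed tasks."""
    return [task for item in tasks for task in split_tasks(item)]


# Shortened example based on the tasks string quoted in the comment above.
example = [
    "~prepares, or arranges for the preparation of, reports, budgets, "
    "forecasts or other information ~plans and controls the allocation "
    "of resources and the selection of senior staff"
]
for task in normalise_tasks(example):
    print(task)
```

Applying this when the static metadata is built, rather than on every lookup, would keep the response path unchanged; that placement is a design suggestion only.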
We do not return the Related Job Titles data or the Groups Classified Within Sub-Groups. This is also a nice-to-have and should be a separate ticket that goes on the backlog to be prioritised.
For similarity (potential matches), the SOC code only expands the major group (1 digit). SIC gives the Division (2 digits), so we may want to expand SOC to also return the sub-major-group information in the lookup for potential matches. Again, this is a nice-to-have and should be a separate ticket that goes on the backlog to be prioritised.
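As a rough illustration of what returning the sub-major group would involve: SOC 2020 codes are hierarchical, so the parent codes of a four-digit unit group are its leading digits. The sketch below assumes that structure. The function name `soc_parent_codes` and the key `code_major_group` are hypothetical; `code_sub_major_group` and `code_minor_group` follow the field names used in this PR.

```python
def soc_parent_codes(unit_code: str) -> dict[str, str]:
    """Derive major, sub-major and minor group codes from a 4-digit SOC unit code."""
    if len(unit_code) != 4 or not (unit_code.isascii() and unit_code.isdigit()):
        raise ValueError(f"Expected a four-digit SOC unit code, got {unit_code!r}")
    return {
        "code_major_group": unit_code[:1],      # 1 digit (key name assumed)
        "code_sub_major_group": unit_code[:2],  # 2 digits
        "code_minor_group": unit_code[:3],      # 3 digits
    }


# 2136 is the unit code used in the test fixture mentioned in this PR.
print(soc_parent_codes("2136"))
```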
Another ticket for the backlog (low priority). When similarity is set to false, the JSON format should be the same when no match is found: it should have empty values for code, code_meta, etc. and should not include potential_matches. Currently, for both SIC and SOC, if similarity is false and we cannot find a direct match, we send:
{
"detail": "No SIC code found for description: pubs"
}
{
"detail": "No SOC code found for description: manager"
}
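One possible shape for the uniform no-match response is sketched below. It assumes the response is a flat JSON object. The field names come from this PR's description, while the `description` key, the `build_response` helper and the exact schema are assumptions for illustration only.

```python
import json
from typing import Any, Optional

# Field names taken from the PR description; the real schema may differ.
RESULT_FIELDS = (
    "code",
    "code_meta",
    "code_minor_group",
    "code_minor_group_meta",
    "code_sub_major_group",
    "code_sub_major_group_meta",
)


def build_response(
    description: str, match: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Return the same keys whether or not a direct match was found."""
    response: dict[str, Any] = {"description": description}  # key name assumed
    for field in RESULT_FIELDS:
        # With no match, every result field is present but empty (None -> null).
        response[field] = match.get(field) if match else None
    return response


# No direct match: same keys as a hit, null values, no potential_matches.
print(json.dumps(build_response("manager"), indent=2))
```

Keeping the key set identical for hits and misses would let clients parse both cases with one code path instead of branching on a separate "detail" payload.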
📌 Pull Request Template
✨ Summary
Hardens SOCLookup for production-style SOC CSVs (description, label only, invalid headers rejected clearly, SICLookup-matching normalisation, explicit tests for present and absent descriptions, SA-674) and returns SIC-like layered SOC metadata on hits: minor, sub-major, and major group fields with SocMeta, expanded static lookup data, and updated tests (SA-672).
📜 Changes Introduced
Compared to main, branch SA-674/soc-lookup-csv-compatibilty changes exactly these paths (see git diff main...HEAD --stat):
- src/occupational_classification/lookup/soc_lookup.py: on a successful exact match, populate code_minor_group, code_minor_group_meta, code_sub_major_group, and code_sub_major_group_meta (with existing major-group and unit code_meta fields); normalise loaded CSV text like SICLookup (description lower-cased only, no .strip() on description or label); small lookup refactor to satisfy Pylint on local variables.
- src/occupational_classification/meta/soc_meta.py: large expansion of the static in-memory SOCmeta map (SOC 2020 Volume 1-style copy) and SocMeta lookups so minor, sub-major, and major levels can be enriched; public name SOC_META → SOCmeta (SIC-style identifier).
- tests/test_lookup.py: assert derived minor and sub-major codes and *_meta on hits; assert those fields are None when there is no match; rename and correct the unit-meta test so code_meta["code"] is the full four-digit unit (2136 in the fixture), not a single-digit parent.
- tests/test_soc_lookup_example_data.py: add test_soc_lookup_example_absent_description_returns_null_code, which loads the packaged example CSV, calls lookup("orchard planner"), and expects code to be None without an exception.
- notebooks/soc_2025_05_01.py: update imports and demo cells to use SOCmeta instead of SOC_META.
Feature implementation / bug fix / testing
Updates to tests and/or documentation
Terraform changes (not applicable)
✅ Checklist
black . --check
isort . --check
ruff check ., pylint --verbose ., mypy --follow-untyped-imports src
bandit -r src/occupational_classification
poetry run pytest (57 passed locally; make check-python-nofix was used for lint + hygiene)
SA-674-soc-lookup-compatability-PR.md
🔍 How to Test
From soc-classification-library after poetry install: