SA-674/soc lookup csv compatibilty#52

Merged
dstewartons merged 7 commits into main from SA-674/soc-lookup-csv-compatibilty
May 13, 2026

Conversation

@dstewartons dstewartons commented Apr 29, 2026

Contributor

📌 Pull Request Template

Please complete all sections

✨ Summary

Hardens SOCLookup for production-style SOC CSVs (SA-674): only the description,label header is accepted, invalid headers are rejected clearly, normalisation matches SICLookup, and there are explicit tests for present and absent descriptions. On hits it also returns SIC-like layered SOC metadata (SA-672): minor, sub-major, and major group fields with SocMeta, backed by expanded static lookup data and updated tests.
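As a rough illustration of the loading rules summarised above (strict description,label header, description lower-cased only, nothing stripped), here is a minimal sketch. It is not the SOCLookup implementation: `load_soc_rows`, `lookup_code`, the choice to skip malformed rows, and lower-casing the query are all assumptions made for the example, and the CSV row is made-up data.

```python
import csv
import io

EXPECTED_HEADER = ["description", "label"]


def load_soc_rows(text: str) -> dict[str, str]:
    """Hypothetical loader: map lower-cased description to label."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        # Reject anything other than the exact two-column header.
        raise ValueError(f"Expected header {EXPECTED_HEADER}, got {header}")
    table: dict[str, str] = {}
    for row in reader:
        if len(row) != 2:
            continue  # assumption: silently skip blank or malformed rows
        description, label = row
        # Lower-case the description only; no .strip() on either field.
        table[description.lower()] = label
    return table


def lookup_code(table: dict[str, str], description: str) -> dict:
    """Hypothetical exact-match lookup; code is None when nothing matches."""
    # Assumption: the query is lower-cased the same way as the stored text.
    return {"description": description, "code": table.get(description.lower())}


table = load_soc_rows("description,label\nExample Job,2136\n")
print(lookup_code(table, "example job"))
print(lookup_code(table, "orchard planner"))
```

An absent description yields a result with `code` set to `None` rather than raising, which mirrors the behaviour the new example-data test expects.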

📜 Changes Introduced

Compared to main, branch SA-674/soc-lookup-csv-compatibilty changes exactly these paths (see git diff main...HEAD --stat):

  • src/occupational_classification/lookup/soc_lookup.py: on a successful exact match, populate code_minor_group, code_minor_group_meta, code_sub_major_group, and code_sub_major_group_meta (alongside the existing major-group and unit code_meta fields); normalise loaded CSV text like SICLookup (description lower-cased only, no .strip() on description or label); small lookup refactor to satisfy Pylint on local variables.

  • src/occupational_classification/meta/soc_meta.py: large expansion of the static in-memory SOCmeta map (SOC 2020 Volume 1-style copy) and SocMeta lookups so minor, sub-major, and major levels can be enriched; public name SOC_META renamed to SOCmeta (SIC-style identifier).

  • tests/test_lookup.py: assert derived minor and sub-major codes and *_meta on hits; assert those fields are None when there is no match; rename and correct the unit-meta test so code_meta["code"] is the full four-digit unit (2136 in the fixture), not a single-digit parent.

  • tests/test_soc_lookup_example_data.py: add test_soc_lookup_example_absent_description_returns_null_code, which loads the packaged example CSV, calls lookup("orchard planner"), and expects code to be None without an exception.

  • notebooks/soc_2025_05_01.py: update imports and demo cells to use SOCmeta instead of SOC_META.

  • Feature implementation / bug fix / testing

  • Updates to tests and/or documentation

  • Terraform changes (not applicable)
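
The derived minor and sub-major codes mentioned in the changes above follow from the SOC digit hierarchy: the first digit of a four-digit unit code is the major group, the first two the sub-major group, and the first three the minor group. A minimal sketch of that derivation is below; `soc_parent_codes` and the `code_major_group` key are illustrative names, not taken from soc_lookup.py.

```python
def soc_parent_codes(unit_code: str) -> dict[str, str]:
    """Hypothetical helper: derive parent group codes from a 4-digit unit code."""
    if len(unit_code) != 4 or not (unit_code.isascii() and unit_code.isdigit()):
        raise ValueError(f"Expected a four-digit SOC unit code, got {unit_code!r}")
    return {
        "code": unit_code,                       # unit group, 4 digits
        "code_minor_group": unit_code[:3],       # minor group, 3 digits
        "code_sub_major_group": unit_code[:2],   # sub-major group, 2 digits
        "code_major_group": unit_code[:1],       # major group, 1 digit (assumed key name)
    }


print(soc_parent_codes("2136"))
```

Each derived code can then be used as a key into the static metadata map to fill the matching `*_meta` field.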

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black (black . --check)
  • Imports are sorted using isort (isort . --check)
  • Code passes linting with Ruff, Pylint, and Mypy (ruff check ., pylint --verbose ., mypy --follow-untyped-imports src)
  • Security checks pass using Bandit (bandit -r src/occupational_classification)
  • Unit tests pass using poetry run pytest (57 passed locally; make check-python-nofix was used for lint + hygiene)
  • Terraform files (not applicable)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed (SA-674-soc-lookup-compatability-PR.md)

🔍 How to Test

From soc-classification-library after poetry install:

poetry run pytest tests/test_lookup.py -v

@dstewartons dstewartons changed the title from "Sa 674/soc lookup csv compatibilty" to "SA-674/soc lookup csv compatibilty" Apr 29, 2026
@gibbardsteve gibbardsteve self-requested a review May 13, 2026 12:38

@gibbardsteve gibbardsteve left a comment

Contributor

The changes look good and testing is working as expected. The relevant release tags need applying.

I think running SOC lookup and SIC lookup gives us all we need for initial testing.

Ran with similarity = true and searched for strings that matched and strings that didn't; spot-checked metadata for SOC and all looks OK.

I have some comments below, but these are all separate tickets that can go through refinement and prioritisation.

The metadata is now expanded. I see we added an array for tasks, but currently it holds a single long string containing all tasks:

"tasks": [
            "~analyses economic, social, legal and other data, and plans, formulates and directs at strategic level the operation of a company or organisation ~consults with subordinates to formulate, implement and review company/organisation policy, authorises funding for policy implementation programmes and institutes reporting, auditing and control systems ~prepares, or arranges for the preparation of, reports, budgets, forecasts or other information ~plans and controls the allocation of resources and the selection of senior staff ~evaluates government/local authority departmental activities, discusses problems with government/local authority officials and administrators and formulates departmental policy ~negotiates and monitors contracted out services provided to the local authority by the private sector ~studies and acts upon any legislation that may affect the local authority ~stimulates public interest by providing publicity, giving lectures and interviews and organising appeals for a variety of causes ~directs or undertakes the preparation, publication and dissemination of reports and other information of interest to members and other interested parties"
        ]

It would be nice to split on '~' and have an array item for each individual task. This is a nice-to-have, so it could be a separate ticket that goes on the backlog to be prioritised.
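
A possible shape for that split, as a sketch only: `split_tasks` and `normalise_tasks` are hypothetical helpers, and trimming whitespace around each task is an assumption about the desired output.

```python
def split_tasks(raw: str) -> list[str]:
    """Split a '~'-delimited task string into individual, trimmed tasks."""
    # The leading '~' produces an empty first piece, which is dropped here.
    return [part.strip() for part in raw.split("~") if part.strip()]


def normalise_tasks(tasks: list[str]) -> list[str]:
    """Flatten a tasks array whose entries may each hold several tasks."""
    flattened: list[str] = []
    for entry in tasks:
        flattened.extend(split_tasks(entry))
    return flattened


raw = (
    "~prepares, or arranges for the preparation of, reports, budgets, "
    "forecasts or other information "
    "~plans and controls the allocation of resources and the selection of senior staff"
)
for task in normalise_tasks([raw]):
    print(task)
```

Commas inside a task are left untouched because only '~' is treated as a delimiter.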

We do not return the Related Job Titles data or the Groups Classified Within Sub-Groups. This is also a nice-to-have and should be a separate ticket that goes on the backlog to be prioritised.

For similarity (potential matches), the SOC code only expands the major group (1 digit). SIC gives the Division (2 digits), so we may want to expand SOC to also return the sub-major group information in the lookup for potential matches. Again, this is a nice-to-have and should be a separate ticket that goes on the backlog to be prioritised.

Another ticket for the backlog (low priority): when similarity is set to false, the JSON format should be the same for a result where no match was found (it should have empty values for code, code_meta etc. and not include potential_matches). Currently, for both SIC and SOC, if similarity is false and we cannot find a direct match, we send:

{
  "detail": "No SIC code found for description: pubs"
}
{
    "detail": "No SOC code found for description: manager"
}

@dstewartons dstewartons merged commit 40cd9f0 into main May 13, 2026
5 checks passed
@dstewartons dstewartons deleted the SA-674/soc-lookup-csv-compatibilty branch May 13, 2026 19:15