SA617 - Cleaning ASHE strings and preparing SOC knowledgebase and SOC DIRECT LOOKUP #23

Draft
peter-spencer-ons wants to merge 62 commits into main from SA617_soc_knowledgebase

@peter-spencer-ons

Contributor

✨ Summary

Prepare the datasets for the SOC knowledgebase and SOC Direct Lookup, using data from ASHE.

The changes adapt prompts and LLM methods for use exclusively in SA617. They are not meant to be used in the main branch, so this PR is not intended to be merged.

📜 Changes Introduced

  • Adapt prompts and methods within src to be usable for the task.
  • Add a prompt and method to correct misspelled words in the ASHE dataset.
  • Prepare scripts for:
    • correcting misspellings
    • creating knowledgebase
    • assigning SOC codes with LLM use
    • selecting unambiguous codes for Direct Lookup
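
The last step, selecting unambiguous codes for Direct Lookup, can be illustrated with a minimal sketch. It assumes that "unambiguous" means a cleaned job-title string was only ever assigned a single SOC code. The function name, the input shape and the example codes are hypothetical and are not taken from the scripts in this PR.

```python
from collections import defaultdict


def select_unambiguous(pairs):
    """Return {title: soc_code} for titles that map to exactly one code.

    `pairs` is an iterable of (job_title, soc_code) tuples. Hypothetical
    helper for illustration; not the implementation used in this PR.
    """
    codes_by_title = defaultdict(set)
    for title, code in pairs:
        # Normalise case and whitespace so trivial variants collapse together.
        key = " ".join(title.lower().split())
        codes_by_title[key].add(code)

    # Keep only titles whose set of observed codes has a single member.
    return {
        title: next(iter(codes))
        for title, codes in codes_by_title.items()
        if len(codes) == 1
    }


# Illustrative data only; the codes are placeholders, not checked against SOC.
example = [
    ("Staff Nurse", "2231"),
    ("staff  nurse", "2231"),
    ("Manager", "1121"),
    ("manager", "1150"),
]
lookup = select_unambiguous(example)
print(lookup)
```

Under this assumption, a title seen with two different codes ("manager" above) is dropped from the lookup entirely rather than resolved by majority vote, which keeps the Direct Lookup conservative.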

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • Docstrings follow Google style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

Select a subset of the ASHE dataset.
Create a .env file (please reach out for details).
Run:

  • ashe_clean_2026_04.py
  • assign_soc_code_2026_03.py
  • soc_kb_2026_04.ipynb
  • create_soc_lookup_2026_04.ipynb
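
The PR does not say how the ASHE subset should be chosen. One option, sketched below as an assumption rather than the author's method, is a seeded random sample so that reviewers test against the same rows. The helper name and the `job_title` field are hypothetical.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen at random, reproducibly for a given seed.

    Hypothetical helper; the real ASHE loading code is not shown in this PR.
    """
    rows = list(rows)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(rows, min(n, len(rows)))


# Stand-in records; real ASHE rows would be loaded from the source file.
rows = [{"job_title": f"title {i}"} for i in range(100)]
subset = sample_rows(rows, 10)
print(len(subset))
```

Fixing the seed means that two people running the sketch on the same input get the same subset, which makes differences in the downstream outputs easier to attribute to code changes.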

peter-spencer-ons requested a review from ivyONS on May 6, 2026, 09:49
Comment thread notebooks/assign_soc_code_2026_03.py Outdated
Comment thread notebooks/assign_soc_code_2026_03.py Outdated
Comment thread notebooks/soc_kb_2026_04.py Outdated
Comment thread notebooks/ashe_clean_2026_04.py
Comment thread notebooks/soc_kb_2026_04.py Outdated
Comment thread notebooks/soc_kb_2026_04.py
Comment thread notebooks/soc_kb_2026_04.py Outdated
Comment thread poetry.lock
Comment thread notebooks/create_soc_lookup_2026_04.py Outdated
Comment thread notebooks/create_soc_lookup_2026_04.py
Comment thread notebooks/create_soc_lookup_2026_04.py
Comment thread notebooks/create_soc_lookup_2026_04.py Outdated
Comment thread notebooks/create_soc_lookup_2026_04.py
Comment thread notebooks/create_soc_lookup_2026_04.py Outdated
3 participants