Improve quantity exploration and cleaning notebooks #5

Open
TheSanjBot wants to merge 1 commit into offCanada:main from TheSanjBot:issue-2-quantity-field-cleaning

@TheSanjBot

This PR:

  • explores the product_quantity_unit, product_quantity, and quantity fields
  • uses that exploration to understand how the fields vary and where values are missing
  • adds cleaning logic for OCR cleanup, unit normalization, multipack parsing, and conflict handling
  • keeps true contradictions as conflicts and leaves ambiguous cases unresolved
  • helps prepare quantity data for more reliable downstream product consolidation
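
As a rough illustration of the multipack parsing and unit normalization steps, here is a minimal sketch. It is not the notebook's actual code: the `UNIT_FACTORS` table, the `QUANTITY_RE` pattern, and the `parse_quantity` helper are hypothetical names, and the table covers only a handful of metric units.

```python
import re

# Hypothetical alias table: unit spelling -> (base unit, factor to base unit).
UNIT_FACTORS = {
    "g": ("g", 1.0),
    "kg": ("g", 1000.0),
    "mg": ("g", 0.001),
    "ml": ("ml", 1.0),
    "cl": ("ml", 10.0),
    "l": ("ml", 1000.0),
}

# Optional "<count> x" prefix, then a number (dot or comma decimal), then a unit.
QUANTITY_RE = re.compile(
    r"^\s*(?:(?P<count>\d+)\s*[x×]\s*)?"
    r"(?P<value>\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>[a-z]+)\s*$",
    re.IGNORECASE,
)


def parse_quantity(text):
    """Return a normalized quantity dict, or None if the text is not understood."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    unit = match.group("unit").lower()
    if unit not in UNIT_FACTORS:
        # Unknown units (for example household units) are left unparsed.
        return None
    base_unit, factor = UNIT_FACTORS[unit]
    count = int(match.group("count") or 1)
    value = float(match.group("value").replace(",", "."))
    unit_size = value * factor
    return {
        "count": count,
        "unit_size": unit_size,
        "total": count * unit_size,
        "unit": base_unit,
    }
```

Under these assumptions, `parse_quantity("12 x 355 ml")` yields a count of 12 and a unit size of 355 ml, while text such as `"2 cups"` returns `None` rather than a guess, in keeping with the conservative approach described here.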

Each row is assigned one of four output statuses:

  • resolved
    • used when the quantity can be interpreted confidently in a comparable normalized form
    • examples: 500 g, 12 x 355 ml, 24 g (0.85 oz)
  • conflict
    • used when quantity signals materially disagree, so the pipeline should not guess
    • example: 25 kg / 5 lbs
  • partial
    • used when the row contains useful quantity-related information, but not a safely comparable package size
    • example: rows that carry packaging-only information
  • unresolved
    • used when the text remains ambiguous, broken, or unsupported
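
One way to picture how the four statuses could be assigned is the sketch below. It is an assumption-laden illustration, not the notebook's logic: the `classify` function, the `TO_GRAMS` table, the `has_packaging_info` flag, and the 5% relative tolerance are all hypothetical, and only mass units are handled.

```python
# Hypothetical conversion factors to grams (mass units only, for brevity).
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "oz": 28.3495, "lb": 453.592, "lbs": 453.592}


def classify(measurements, has_packaging_info=False, rel_tol=0.05):
    """Assign a status to one row.

    measurements: list of (value, unit) pairs already extracted from the row.
    has_packaging_info: True if the row mentions packaging but no usable size.
    rel_tol: assumed relative tolerance for treating two sizes as the same.
    """
    if not measurements:
        # Nothing comparable; keep the row as partial if it still says something.
        return "partial" if has_packaging_info else "unresolved"

    grams = []
    for value, unit in measurements:
        if unit not in TO_GRAMS:
            # Unsupported unit: do not guess.
            return "unresolved"
        grams.append(value * TO_GRAMS[unit])

    lo, hi = min(grams), max(grams)
    if hi - lo > rel_tol * hi:
        # The signals materially disagree.
        return "conflict"
    return "resolved"
```

With these assumed factors, `classify([(24, "g"), (0.85, "oz")])` returns `"resolved"` because the two values agree within tolerance, whereas `classify([(25, "kg"), (5, "lbs")])` returns `"conflict"`.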

The current workflow is useful but incomplete:

  • many rows are still unresolved because the logic is intentionally conservative
  • some mixed expressions still need attention
  • multilingual alias coverage is helpful but does not cover all languages
  • some rows are only rescued because structured fields are reliable, not because the text parser fully understands them
  • household-unit expressions are intentionally excluded from consolidation
  • a small set of remaining conflicts may still be recoverable with additional safe rules
