Improve Translation Quality: Eliminate AI_MUST_REPLACE Leakage & Strengthen Validation Pipeline #1656

@pethers

Description

📋 Issue Type

Bug Fix / Quality Improvement

🎯 Objective

Eliminate the persistent AI_MUST_REPLACE marker leakage in translated (non-EN) articles and strengthen the translation validation pipeline so that content quality issues are caught before articles reach production. Currently 35.8% of April 2026 articles (146 of 408) still contain unresolved markers.

📊 Current State

  • AI_MUST_REPLACE leakage: 146 of 408 April 2026 articles (35.8%) contain AI_MUST_REPLACE markers
  • EN articles: clean, with 0 markers in 2026-04-10-committee-reports-en.html and 2026-04-09-opposition-motions-en.html
  • Translated articles: ES and FI articles have 2 markers each (e.g., 2026-04-09-opposition-motions-es.html, 2026-04-09-opposition-motions-fi.html)
  • Pattern: EN articles are clean; AI_MUST_REPLACE markers leak into translations when AI agents translate the template comments instead of replacing them
  • Translation validation: validate-news-translations.ts v3.0 warns on EN/SV body-content leakage but does NOT fail CI (exit 0)
  • Swedish leakage detector: detect-swedish-leakage.ts (366 lines) detects SV content in non-SV articles
  • Translation dictionary: translation-dictionary.ts (3,673 lines) is a single large file that is hard to maintain
  • Banned patterns: check-banned-patterns.ts checks for banned generic template text, taking its patterns from BANNED_PATTERNS in shared.ts
  • shared.ts markers: 17 AI_MUST_REPLACE comment-only HTML markers in content generator functions; they are designed to be invisible, but translations sometimes expose them
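The difference between a marker that is still hidden in a comment and one that a translation has exposed as visible text can be sketched as follows. This is a minimal illustration, not code from the repository: the `countMarkers` helper, the `MarkerReport` shape, and the sample HTML are all hypothetical.

```typescript
// Hypothetical helper for illustration; not taken from the repository's scripts.
const MARKER = "AI_MUST_REPLACE";

interface MarkerReport {
  inComments: number;    // markers still hidden inside <!-- ... -->
  inVisibleText: number; // markers found outside any HTML comment
}

function countMarkers(html: string): MarkerReport {
  const commentPattern = /<!--[\s\S]*?-->/g;

  // Count marker occurrences inside each HTML comment.
  let inComments = 0;
  for (const comment of html.match(commentPattern) ?? []) {
    inComments += comment.split(MARKER).length - 1;
  }

  // Remove the comments, then count whatever markers remain.
  const withoutComments = html.replace(commentPattern, "");
  const inVisibleText = withoutComments.split(MARKER).length - 1;

  return { inComments, inVisibleText };
}

// Invented sample: one marker kept as a comment, one exposed by translation.
const translated = [
  "<p>Resumen</p>",
  "<!-- AI_MUST_REPLACE: write analysis -->",
  "<p>AI_MUST_REPLACE: escribir análisis</p>",
].join("\n");

const report = countMarkers(translated);
// report is { inComments: 1, inVisibleText: 1 }
```

Reporting the two counts separately would make it easier to tell whether a given article has merely unresolved template comments or markers that readers can actually see.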

🚀 Desired State

  1. Zero AI_MUST_REPLACE markers in all published articles (EN and translated)
  2. CI-enforced validation: validate-news-translations.ts should fail CI (exit 1) when AI_MUST_REPLACE markers are detected, not just warn
  3. Improved translation workflow: the news-translate.md workflow prompt should explicitly instruct the AI to detect and replace ALL AI_MUST_REPLACE comments during translation
  4. Better marker detection: Strengthen check-banned-patterns.ts to catch markers inside HTML comments (<!-- AI_MUST_REPLACE -->), not just in visible text
  5. Translation dictionary improvements: Better coverage for political terminology across all 14 languages
  6. Content leakage prevention: Strengthen EN→non-EN content leakage detection to catch English text in translated articles
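The CI-enforcement goal in item 2 amounts to turning a warning into a non-zero exit code. A minimal sketch of that gate is shown below; the `Article` and `Violation` shapes and both function names are assumptions, since the internals of validate-news-translations.ts are not shown in this issue. The file names come from the examples above, but the HTML content is invented.

```typescript
// Hypothetical shapes; the real validator's types are not shown in this issue.
const MARKER_NAME = "AI_MUST_REPLACE";

interface Article {
  file: string;
  html: string;
}

interface Violation {
  file: string;
  markerCount: number;
}

// Collect every article that still contains at least one marker.
function findViolations(articles: Article[]): Violation[] {
  return articles
    .map((a) => ({
      file: a.file,
      markerCount: a.html.split(MARKER_NAME).length - 1,
    }))
    .filter((v) => v.markerCount > 0);
}

// Return the exit code instead of exiting, so the logic stays testable.
function exitCodeFor(violations: Violation[]): 0 | 1 {
  return violations.length > 0 ? 1 : 0;
}

// Invented article bodies for illustration.
const sample: Article[] = [
  { file: "2026-04-09-opposition-motions-en.html", html: "<p>Clean article</p>" },
  { file: "2026-04-09-opposition-motions-es.html", html: "<p>Texto</p><!-- AI_MUST_REPLACE -->" },
];

const violations = findViolations(sample);
const exitCode = exitCodeFor(violations);
// exitCode is 1 because the ES article still contains a marker.
// A CLI entry point could assign this value to process.exitCode.
```

Keeping the decision in a pure function means the pass/fail rule can be unit-tested without spawning the script.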

🔧 Implementation Approach

Files to modify (NO overlap with other issues):

  • scripts/validate-news-translations.ts: enforce CI failure on AI_MUST_REPLACE detection
  • scripts/detect-swedish-leakage.ts: extend to detect any source-language leakage patterns
  • scripts/translation-dictionary.ts: improve political terminology coverage, split into manageable sections
  • scripts/check-banned-patterns.ts: extend to detect markers inside HTML comments
  • scripts/validate-translations.ts: improve validation coverage
  • scripts/statistical-claims-detector.ts: validate that statistical claims survive translation
  • .github/workflows/news-translate.md: improve translation prompt to explicitly handle AI_MUST_REPLACE markers
  • .github/workflows/news-translate.lock.yml: recompile after prompt update
  • scripts/validate-news-generation.sh: update Check 15 for stricter enforcement
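One possible way to split translation-dictionary.ts while keeping a single lookup for existing callers is sketched below. The `Dictionary` shape (term, then language code, then translation) is an assumption, as are the `mergeDictionaries` helper and the sample entries; the real file's structure may differ.

```typescript
// Assumed shape: term -> language code -> translation. The real file may differ.
type Dictionary = Record<string, Record<string, string>>;

// Stand-ins for the contents of two domain-specific files.
const politicalTerms: Dictionary = {
  motion: { es: "moción", fi: "aloite" },
};
const committeeTerms: Dictionary = {
  committee: { es: "comisión", fi: "valiokunta" },
};

// Merge domain dictionaries into one lookup and reject duplicate terms,
// so a term cannot silently be defined in two files.
function mergeDictionaries(parts: Record<string, Dictionary>): Dictionary {
  const merged: Dictionary = {};
  const owner: Record<string, string> = {};
  for (const partName of Object.keys(parts)) {
    for (const term of Object.keys(parts[partName])) {
      if (Object.prototype.hasOwnProperty.call(merged, term)) {
        throw new Error(
          `Duplicate term "${term}" in ${owner[term]} and ${partName}`
        );
      }
      merged[term] = parts[partName][term];
      owner[term] = partName;
    }
  }
  return merged;
}

const dictionary = mergeDictionaries({
  "political-terms": politicalTerms,
  "committee-names": committeeTerms,
});
```

If translation-dictionary.ts re-exported a merged object like `dictionary`, existing imports could keep working while the entries themselves live in smaller files.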

Key improvements:

  1. Enforce CI failure: Change validate-news-translations.ts to exit with code 1 when any article contains AI_MUST_REPLACE markers (currently warns only, exit 0)
  2. Extend banned pattern detection: Update check-banned-patterns.ts to scan HTML comments for AI_MUST_REPLACE markers using regex <!--[^>]*AI_MUST_REPLACE[^>]*-->
  3. Improve translation prompt: Add explicit instruction in news-translate.md: "SCAN every HTML comment in the source article. If any contains 'AI_MUST_REPLACE', you MUST generate replacement content in the target language."
  4. Split translation dictionary: Break translation-dictionary.ts (3,673 lines) into domain-specific files: political-terms.ts, committee-names.ts, party-names.ts, general-terms.ts
  5. Add marker-stripping step: Add a post-translation cleanup step that strips any remaining AI_MUST_REPLACE markers with a warning, preventing them from reaching production
  6. Recompile workflow: After updating news-translate.md, run gh aw compile news-translate
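The comment scan (item 2) and the cleanup step (item 5) could be combined roughly as follows. This is a sketch under assumptions: `stripMarkerComments` and `CleanupResult` are hypothetical names, and the sample HTML is invented. One thing to watch with the pattern in item 2 is that `[^>]*` cannot cross a `>` character, so a marker comment that itself contains markup would not match; the sketch therefore finds whole comments first and then checks each one for the marker.

```typescript
// The pattern quoted in item 2, without the global flag so .test() is stateless.
const ISSUE_PATTERN = /<!--[^>]*AI_MUST_REPLACE[^>]*-->/;

interface CleanupResult {
  html: string;
  warnings: string[];
}

// Hypothetical cleanup step: remove any comment that contains the marker
// and record one warning per removed comment. Other comments are kept.
function stripMarkerComments(file: string, html: string): CleanupResult {
  const warnings: string[] = [];
  const cleaned = html.replace(/<!--[\s\S]*?-->/g, (comment) => {
    if (comment.indexOf("AI_MUST_REPLACE") === -1) {
      return comment;
    }
    warnings.push(`${file}: stripped unresolved AI_MUST_REPLACE comment`);
    return "";
  });
  return { html: cleaned, warnings };
}

// Invented sample: the comment contains a '>' before its closing '-->'.
const tricky = "<p>Hola</p><!-- AI_MUST_REPLACE: add <p> summary -->";

const matchedByIssuePattern = ISSUE_PATTERN.test(tricky);
// matchedByIssuePattern is false: [^>]* stops at the '>' inside the comment.

const result = stripMarkerComments("example-es.html", tricky);
// result.html is "<p>Hola</p>" and result.warnings has one entry.
```

This sketch only removes markers that are still inside comments. Markers exposed as visible text are left in place so that the validator can fail on them rather than having content silently deleted.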

🤖 Recommended Agent

quality-engineer: best expertise in validation pipelines, content quality gates, and testing

✅ Acceptance Criteria

  • Zero AI_MUST_REPLACE markers in all newly generated articles (EN + all translations)
  • validate-news-translations.ts exits with code 1 when markers detected
  • check-banned-patterns.ts detects markers in HTML comments
  • Translation prompt explicitly handles AI_MUST_REPLACE replacement
  • translation-dictionary.ts split into ≤4 manageable domain files
  • news-translate.md updated and recompiled to .lock.yml
  • All existing tests pass (npx vitest run)
  • Validation scripts correctly flag existing articles with markers

📚 References

  • Translation Validator: scripts/validate-news-translations.ts (v3.0)
  • Swedish Leakage: scripts/detect-swedish-leakage.ts
  • Banned Patterns: scripts/check-banned-patterns.ts
  • Translation Dict: scripts/translation-dictionary.ts
  • Shared Markers: scripts/data-transformers/content-generators/shared.ts (17 markers)
  • Validation Shell: scripts/validate-news-generation.sh (Check 15)
  • Architecture: ARCHITECTURE.md

🏷️ Labels

translation, validation, bug, code-quality
