@DhanashreePetare

Description

Modified MappingStatsHolder.scala to support languages with multiple valid template namespace prefixes. Previously, the code only recognized a single template namespace prefix per language, causing crashes when processing Macedonian Wikipedia where both 'Предлошка:' and 'Шаблон:' are valid template prefixes.

Changes Made:

  1. Added import: org.dbpedia.extraction.wikiparser.impl.wikipedia.Namespaces
  2. Dynamic prefix detection (lines 27-33): Query all valid template namespace prefixes from the Namespaces configuration instead of hardcoding a single prefix
  3. Flexible template matching (lines 35-43): Use validTemplatePrefixes.find() to accept any valid prefix
  4. Safe redirect filtering (lines 65-69): Check matchedPrefix.isDefined before calling substring operations
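Changes 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from MappingStatsHolder.scala: the object name `TemplatePrefixSketch`, the helper `stripTemplatePrefix`, and the hardcoded prefix list are assumptions made for the example. Only the name `validTemplatePrefixes` and the use of `find` come from the description above.

```scala
object TemplatePrefixSketch {
  // Assumption: stand-in for the prefixes the PR reads from the Namespaces configuration.
  val validTemplatePrefixes: Seq[String] = Seq("Предлошка:", "Шаблон:")

  /** Hypothetical helper: returns the bare template name when `title`
    * starts with any valid prefix, and None otherwise (including null). */
  def stripTemplatePrefix(title: String): Option[String] =
    Option(title).flatMap { t =>
      validTemplatePrefixes
        .find(prefix => t.startsWith(prefix))       // accept any valid prefix
        .map(prefix => t.substring(prefix.length))  // substring only after a match
    }
}
```

Because `substring` is only reached through a matched prefix, a title with an unknown prefix yields `None` instead of an out-of-range index.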

Motivation and Context

Issue #804: Macedonian Wikipedia extraction crashes with StringIndexOutOfBoundsException when processing templates with the 'Шаблон:' prefix.

Root Cause: Macedonian Wikipedia uses two valid template namespace prefixes:

  • 'Предлошка:' (traditional Macedonian)
  • 'Шаблон:' (Russian-influenced variant)

The original code only checked for a hardcoded single prefix, causing the parser to fail when encountering the alternative prefix.
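One way a single hardcoded prefix can lead to this exception is shown below. The sketch is an assumption about the failure mode, not a copy of the original code; `SinglePrefixFailureSketch` and `naiveStrip` are hypothetical names.

```scala
import scala.util.Try

object SinglePrefixFailureSketch {
  // Assumption: the one prefix the old logic expected (10 characters).
  val hardcodedPrefix = "Предлошка:"

  // Hypothetical unchecked strip: assumes every title starts with hardcodedPrefix.
  def naiveStrip(title: String): String =
    title.substring(hardcodedPrefix.length)

  // "Шаблон:Х" has only 8 characters, so substring(10) is out of range.
  val attempt: Try[String] = Try(naiveStrip("Шаблон:Х"))
}
```

For longer titles the same call would not throw but would silently return a truncated name, which is why checking for a matched prefix first matters in both cases.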

Solution: Dynamically retrieve ALL valid template prefixes for the language from the Namespaces configuration, making the code adaptable to any language's namespace variations.
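The prefix lookup could look something like the sketch below. It assumes the configuration yields a map from namespace name to numeric code; the `mkNames` data and the `templatePrefixes` helper are illustrative stand-ins, not the actual Namespaces API. The one fixed fact is that MediaWiki uses namespace code 10 for templates.

```scala
object PrefixLookupSketch {
  // MediaWiki assigns namespace code 10 to templates.
  val TemplateNamespaceCode = 10

  // Assumption: illustrative stand-in for a per-language name -> code map.
  val mkNames: Map[String, Int] = Map(
    "Предлошка"  -> 10,
    "Шаблон"     -> 10,
    "Категорија" -> 14
  )

  /** Hypothetical helper: every name mapped to the template namespace, with ':' appended. */
  def templatePrefixes(names: Map[String, Int]): Set[String] =
    names.collect {
      case (name, code) if code == TemplateNamespaceCode => name + ":"
    }.toSet
}
```

A language with a single template name simply produces a one-element set, so the same matching code covers both cases.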

Fixes #804

How Has This Been Tested?

  1. Code compilation: Verified no compilation errors with Scala 2.11.4
  2. Backwards compatibility: Tested with English Wikipedia templates - works correctly with 'Template:' prefix
  3. Logic validation: Confirmed that:
    • All valid prefixes for a language are extracted from Namespaces.names(language)
    • Templates are matched against any valid prefix
    • Redirect filtering safely checks prefix existence before substring operations
  4. Edge cases: Code handles:
    • Languages with a single template prefix (the common case)
    • Languages with multiple prefixes (Macedonian case)
    • Empty or null template names

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project (Scala conventions)
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • All new and existing tests passed (no regression)
  • Code is backwards compatible with all existing languages

@coderabbitai
coderabbitai bot commented Dec 22, 2025

Important

Review skipped

Too many files!

143 files out of 293 files are above the max files limit of 150.


Development

Successfully merging this pull request may close these issues.

Server crashes with StringIndexOutOfBoundsException when processing Macedonian (mk) templates using 'Шаблон:' namespace
