6 changes: 6 additions & 0 deletions core/pom.xml
@@ -109,6 +109,12 @@
       <artifactId>jena-arq</artifactId>
     </dependency>

+    <dependency>
+      <groupId>com.github.tototoshi</groupId>
+      <artifactId>scala-csv_2.11</artifactId>
+      <version>1.3.10</version>
+    </dependency>
+
     <dependency>
       <groupId>com.google.guava</groupId>
       <artifactId>guava</artifactId>
82 changes: 53 additions & 29 deletions core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala
@@ -1,8 +1,12 @@
 package org.dbpedia.extraction.util

+import com.github.tototoshi.csv.CSVReader
+
 import java.io.File
 import java.net.URL
 import java.util.logging.Logger
+import java.io.StringReader
+import java.io.StringReader
Comment on lines +8 to +9

@coderabbitai coderabbitai bot Nov 2, 2025


⚠️ Potential issue | 🟡 Minor

Remove duplicate import.

StringReader is imported twice on lines 8 and 9. Remove one of the duplicate imports.

Apply this diff to fix the duplicate import:

 import java.io.File
 import java.net.URL
 import java.util.logging.Logger
 import java.io.StringReader
-import java.io.StringReader
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-import java.io.StringReader
-import java.io.StringReader
+import java.io.File
+import java.net.URL
+import java.util.logging.Logger
+import java.io.StringReader
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala around lines 8
to 9, there is a duplicate import of java.io.StringReader; remove the redundant
import line so StringReader is imported only once (delete one of the two
identical import statements).

Contributor Author


what does this mean
Failed to execute goal org.sonatype.plugins:nexus-staging-maven-plugin:1.6.8:deploy (injected-nexus-deploy) on project server: Failed to deploy artifacts: Could not transfer artifact org.dbpedia.extraction:core:jar:javadoc:pr-806-20251102.073958-1 from/to github (https://maven.pkg.github.com/dbpedia/extraction-framework): status code: 403, reason phrase: Forbidden (403) -> [Help 1]

coderabbitai bot

🧩 Analysis chain

🌐 Web query:

GitHub Packages Maven 403 Forbidden error authentication

💡 Result:

Quick checklist to fix GitHub Packages (Maven) 403 Forbidden authentication:

  1. Use a valid token:

    • Use a personal access token (classic) or GITHUB_TOKEN (in Actions). PAT must include read:packages to download and read:packages plus write:packages to publish; private-package access may also need repo. Renew/replace expired tokens. [1][2]
  2. Ensure settings.xml server id matches POM:

    • The <id> in ~/.m2/settings.xml must exactly match the <id> used in your pom.xml repository. Username = your GitHub username (or GITHUB_ACTOR in Actions); password = the token. Run mvn -X to confirm which settings.xml is used. [1][3]
  3. Use the correct repository URL for your GitHub instance:

    • For GitHub.com use https://maven.pkg.github.com/OWNER/REPOSITORY (or the documented URL pattern for enterprise/self-hosted instances). If your instance uses subdomain isolation use the maven.HOSTNAME form. URL must match the repo owner and name. [1][3]
  4. Check token scope vs. package visibility:

    • Public packages may need fewer scopes; private/internal packages require the write/read scopes (and repo for classic tokens depending on registry permissions). [2]
  5. CI / Actions specifics:

    • If using GITHUB_TOKEN in Actions, ensure the workflow has permissions: packages: read/write (or set permissions in workflow). For PATs used in CI, store as secrets and reference in settings.xml or MAVEN_SETTINGS. [2]
  6. Debug tips:

    • Run mvn -X to view auth attempts and which server id/credentials Maven uses.
    • Confirm there are no typos, uppercase owner names (must be lowercase in repository URL), and the PAT isn’t rate-limited or revoked. [3]

If you want, share your relevant settings.xml server block and your pom.xml distributionManagement snippet (redact the token) and I’ll spot the mismatch.

Sources: GitHub Packages Maven docs and community guide. [1] [2] [3]

References
[1] Working with the Apache Maven registry — GitHub Docs. (docs.github.com)
[2] REST API / About permissions for GitHub Packages — GitHub Docs. (docs.github.com)
[3] Practical guide: private GitHub Maven packages (example & settings.xml tips). (joe-mccarthy.github.io)


@ghost-2362003, this error means Maven failed to authenticate when trying to deploy your package to GitHub Packages. A 403 Forbidden status indicates an authentication or authorization issue.

Common causes and solutions:

  1. Missing or invalid GitHub token:

    • You need a personal access token (PAT) with write:packages scope to publish packages
    • If the token expired or was revoked, generate a new one at: https://github.com/settings/tokens
    • For CI/CD workflows, ensure GITHUB_TOKEN has proper permissions
  2. Maven settings.xml configuration:

    • The <server><id> in your ~/.m2/settings.xml must exactly match the <id> in your pom.xml <distributionManagement> section
    • Your settings.xml should contain:
      <server>
        <id>github</id>
        <username>YOUR_GITHUB_USERNAME</username>
        <password>YOUR_GITHUB_TOKEN</password>
      </server>
  3. Repository URL format:

    • The URL must follow the pattern https://maven.pkg.github.com/OWNER/REPOSITORY (owner must be lowercase)
    • Verify your pom.xml has the correct repository URL
  4. Debug the issue:

    • Run mvn -X deploy to see which credentials Maven is using and identify mismatches

Since this appears to be happening in a CI/CD pipeline (based on the artifact name with timestamp), make sure your GitHub Actions workflow has the necessary permissions and token configured. Let me know if you need help checking your Maven configuration!
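
The URL rule in point 3 can be checked mechanically. The following is a minimal Scala sketch under stated assumptions: `PackagesUrlSketch` and `check` are hypothetical names introduced here for illustration and are not part of Maven, GitHub, or this repository. It only compares a configured URL against the `https://maven.pkg.github.com/OWNER/REPOSITORY` pattern described above, so a passing check says nothing about the token or permission causes in points 1 and 2.

```scala
object PackagesUrlSketch {
  // Pattern taken from the thread: https://maven.pkg.github.com/OWNER/REPOSITORY
  private val RepoUrl = """https://maven\.pkg\.github\.com/([^/]+)/([^/]+)""".r

  /** Hypothetical helper: returns Left(reason) when the URL looks misconfigured. */
  def check(url: String, expectedOwner: String, expectedRepo: String): Either[String, Unit] =
    url match {
      // The thread states that the owner segment must be lowercase.
      case RepoUrl(owner, _) if owner != owner.toLowerCase =>
        Left("owner '" + owner + "' is not lowercase")
      case RepoUrl(owner, repo) if owner != expectedOwner || repo != expectedRepo =>
        Left("URL points at " + owner + "/" + repo + ", expected " + expectedOwner + "/" + expectedRepo)
      case RepoUrl(_, _) =>
        Right(())
      case _ =>
        Left("not a GitHub Packages Maven URL")
    }
}
```

For example, `PackagesUrlSketch.check(url, "dbpedia", "extraction-framework")` could run as a quick pre-deploy sanity check on whatever URL the build resolves.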

Contributor Author


what changes can i make from my end for this

coderabbitai bot

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all pom.xml files to check distribution management configuration
fd -e xml pom.xml --exec cat {}

Length of output: 78895


@ghost-2362003, I found the issue! Looking at your parent pom.xml, there's a repository URL mismatch that's causing the 403 error.

The Problem:
In your parent pom.xml (around line 48), the github.repo.url property points to:

<github.repo.url>https://maven.pkg.github.com/DBpedia-Enterprise/extraction-framework</github.repo.url>

But your actual repository is dbpedia/extraction-framework (not DBpedia-Enterprise/extraction-framework).

What you need to change:

Update the github.repo.url property in your parent pom.xml to match your actual repository:

<github.repo.url>https://maven.pkg.github.com/dbpedia/extraction-framework</github.repo.url>

Location to edit:

  • File: pom.xml (parent/root pom.xml)
  • Line: ~48 (in the <properties> section)

This mismatch is causing Maven to try deploying to the wrong repository URL, which results in the 403 Forbidden error because you don't have permissions to deploy to DBpedia-Enterprise/extraction-framework.

After making this change, the deployment should work correctly (assuming you have the proper GitHub token configured in your CI/CD environment or local Maven settings).


🧬 Code Graph Analysis Results

core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala

<possibly_relevant_code_snippets>core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala</possibly_relevant_code_snippets>

Snippet (lines 22-211)

object ConfigUtils {

  /**
    * Simple regex matching Wikipedia language codes.
    * Language codes have at least two characters, start with a lower-case letter and contain only
    * lower-case letters and dash, but there are also dumps for "wikimania2005wiki" etc.
    */
  val LanguageRegex: Regex = """([a-z][a-z0-9-]+)""".r

  /**
    * Regex used for excluding languages from the import.
    */
  val ExcludedLanguageRegex: Regex = """!([a-z][a-z0-9-]+)""".r

  /**
    * Regex for numeric range, both limits optional
    */
  val RangeRegex: Regex = """(\d*)-(\d*)""".r

  //val baseDir = getValue(universalConfig , "base-dir", true){
   // x => new File(x)
      //if (! dir.exists) throw error("dir "+dir+" does not exist")
      //dir
  //}

  def loadConfig(filePath: String, charset: String = "UTF-8"): Properties = {
    val file = new File(filePath)
    loadFromStream(new FileInputStream(file), charset)
  }

  def loadConfig(url: URL): Object = {

    url match {
      case selection =>
        if(selection.getFile.endsWith(".json"))
          loadJsonComfig(url)
        else
          loadFromStream(url.openStream())
    }
  }

  def loadJsonComfig(url: URL): JsonNode ={
    val objectMapper = new ObjectMapper(new JsonFactory())
    val objectReader: ObjectReader = objectMapper.reader()
    val inputStream = url.openStream()
    val res = objectReader.readTree(inputStream)
    inputStream.close()
    res
  }

  private def loadFromStream(file: InputStream, charset: String = "UTF-8"): Properties ={
    val config = new Properties()
    try config.load(new InputStreamReader(file, charset))
    finally file.close()
    config
  }


  def getValues[T](config: Properties, key: String, sep: String, required: Boolean = false)(map: String => T): Seq[T] = {
    getStrings(config, key, sep, required).map(map(_))
  }

  def getStrings(config: Properties, key: String, sep: String, required: Boolean = false): Seq[String] = {
    val string = getString(config, key, required)
    if (string == null) Seq.empty
    else string.trimSplit(sep)
  }

  def getStringMap(config: Properties, key: String, sep: String, required: Boolean = false): Map[String, String] = {
    getStrings(config, key, sep, required).map(x => x.split("->")).map( y => y(0) -> y(1)).toMap
  }

  def getValue[T](config: Properties, key: String, required: Boolean = false)(map: String => T): T = {
    val string = getString(config, key, required)
    if (string == null) null.asInstanceOf[T]
    else map(string)
  }
  
  def getString(config: Properties, key: String, required: Boolean = false): String = {
    val string = config.getProperty(key)
    if (string != null) string
    else if (! required) null
    else throw new IllegalArgumentException("property '"+key+"' not defined.")
  }
  
  /**
   * @param baseDir directory of wikipedia.csv, needed to resolve article count ranges
   * @param args array of space- or comma-separated language codes or article count ranges
   * @return languages, sorted by language code
   */
  def parseLanguages(baseDir: File, args: Seq[String], wikiPostFix: String = "wiki"): Array[Language] = {
    if(!baseDir.exists())
      throw new IllegalArgumentException("Base directory does not exist yet: " + baseDir)
    
    val keys = for(arg <- args; key <- arg.split("[,\\s]"); if key.nonEmpty) yield key
        
    var languages = SortedSet[Language]()
    var excludedLanguages = SortedSet[Language]()
    
    val ranges = new mutable.HashSet[(Int,Int)]
  
    for (key <- keys) key match {
      case "@mappings" => languages ++= Namespace.mappingLanguages
      case "@chapters" => languages ++= Namespace.chapterLanguages
      case "@downloaded" => languages ++= downloadedLanguages(baseDir, wikiPostFix)
      case "@all" => languages ++= Language.map.values
      case "@abstracts" =>
        //@downloaded - Commons & Wikidata
        languages ++= downloadedLanguages(baseDir, wikiPostFix)
        excludedLanguages += Language.Commons
        excludedLanguages += Language.Wikidata
      case RangeRegex(from, to) => ranges += toRange(from, to)
      case LanguageRegex(language) => languages += Language(language)
      case ExcludedLanguageRegex(language) => excludedLanguages += Language(language)
      case other => throw new IllegalArgumentException("Invalid language / range '"+other+"'")
    }
    
    // resolve page count ranges to languages
    if (ranges.nonEmpty)
    {
      // Note: the file is in ASCII, any non-ASCII chars are XML-encoded like '&#231;'. 
      // There is no Codec.ASCII, but UTF-8 also works for ASCII. Luckily we don't use 
      // these non-ASCII chars anyway, so we don't have to unescape them.
      
      // for all wikis in one of the desired ranges...
      languages ++= (for ((from, to) <- ranges; lang <- Language.map.values; if from <= lang.pages && lang.pages <= to) yield lang)
    }

    languages --= excludedLanguages
    languages.toArray
  }

  private def downloadedLanguages(baseDir: File, wikiPostFix: String = "wiki"): Array[Language] = {
    (for (file <- baseDir.listFiles().filter(x => x.isDirectory)) yield
      Language.get(file.getName.replaceAll(wikiPostFix + "$", "").replace("_", "-")) match{
        case Some(l) => l
        case None => null
      }).filter(x => x != null)
  }

  def toRange(from: String, to: String): (Int, Int) = {
    val lo: Int = if (from.isEmpty) 0 else from.toInt
    val hi: Int = if (to.isEmpty) Int.MaxValue else to.toInt
    if (lo > hi) throw new NumberFormatException
    (lo, hi)
  }

  def parseVersionString(str: String): Try[String] =Try {
    Option(str) match {
      case Some(v) => "2\\d{3}-\\d{2}".r.findFirstMatchIn(v.trim) match {
        case Some(y) => if (y.end == 7) v.trim else throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
        case None => throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
      }
      case None => throw new IllegalArgumentException("No version string was provided.")
    }
  }

  /**
    * This function was extracted from the ImageExtractor object, since
    *  the free & nonfree images are now extracted before starting the extraction jobs
    * @param source pages_articles of a given language
    * @param wikiCode the wikicode of a given language
    * @return two lists: ._1: list of free images, ._2: list of nonfree images
    */
  def loadImages(source: Source, wikiCode: String, extractionRecorder: ExtractionRecorder[WikiPage] = null): (Seq[String], Seq[String]) =
  {
    val freeImages = new mutable.HashSet[String]()
    val nonFreeImages = new mutable.HashSet[String]()

    for(page <- source if page.title.namespace == Namespace.File;
        ImageExtractorConfig.ImageLinkRegex() <- List(page.title.encoded) )
    {
      if(extractionRecorder != null) {
        val records = page.getExtractionRecords() match {
          case seq: Seq[RecordEntry[WikiPage]] if seq.nonEmpty => seq
          case _ => Seq(new RecordEntry[WikiPage](page, page.uri, RecordSeverity.Info, page.title.language))
        }
        //forward all records to the recorder
        extractionRecorder.record(records:_*)
      }
      ImageExtractorConfig.NonFreeRegex(wikiCode).findFirstIn(page.source) match
      {
        case Some(_) => nonFreeImages += page.title.encoded
        case None => if (freeImages != null) freeImages += page.title.encoded
      }
    }

    (freeImages.toSeq, nonFreeImages.toSeq)
  }
}

</possibly_relevant_code_snippets>

Collaborator


We tried CodeRabbit’s suggestion to update the repository URL, but the deployment still fails with a 403 error. It looks like this might be a permissions issue affecting contributor PRs.

coderabbitai bot

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author


I talked to jfery and he said that it needs to be fixed from his end.


import org.dbpedia.extraction.config.ConfigUtils

@@ -47,17 +51,31 @@ object WikiInfo
    *
    */
   def fromLines(lines: Iterator[String]): Seq[WikiInfo] = {
-    val info = new ArrayBuffer[WikiInfo]
-
-    if (! lines.hasNext) throw new Exception("empty file")
-    lines.next // skip first line (headers)
-
-    for (line <- lines)
-      if (line.nonEmpty)
-        fromLine(line) match{
-          case Some(x) => info += x
-          case None =>
-        }
+    val info = new ArrayBuffer[WikiInfo]
+
+    // Join all lines back into a single string for proper CSV parsing
+    val content = lines.mkString("\n")
+    val reader = CSVReader.open(new StringReader(content))
+
+    try {
+      val allRows = reader.iterator.toSeq
+
+      if (allRows.isEmpty) throw new Exception("empty file")
+
+      // Skip header row
+      for (row <- allRows.tail) {
+        if (row.nonEmpty && row.length >= 15) {
+          val pages = try row(4).toInt catch { case _: NumberFormatException => 0 }
+          val wikiCode = row(2)
+
+          if (ConfigUtils.LanguageRegex.pattern.matcher(wikiCode).matches) {
+            info += new WikiInfo(wikiCode, pages)
+          }
+        }
+      }
+    } finally {
+      reader.close()
+    }

     info
   }
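
The hunk above replaces per-line `split(",")` handling with a CSV reader. The sketch below illustrates why that matters when a quoted field contains a comma. It is a stand-in written for this note, not scala-csv and not code from the PR: `CsvSplitSketch`, `splitQuoted`, and the sample row are all hypothetical, and the splitter only handles double-quoted fields with `""` as an escaped quote within a single line.

```scala
object CsvSplitSketch {
  /** Hypothetical helper: splits one CSV record on commas, honouring double-quoted fields. */
  def splitQuoted(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          // A doubled quote inside a quoted field is an escaped quote character.
          if (i + 1 < line.length && line.charAt(i + 1) == '"') { current.append('"'); i += 1 }
          else inQuotes = false
        } else current.append(c)
      } else c match {
        case '"' => inQuotes = true
        case ',' => fields += current.toString; current.clear()
        case _   => current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field has no trailing comma
    fields.result()
  }

  // Made-up sample row: the second field contains a comma inside quotes.
  val sample = "1,\"Example, with comma\",en,x,42"

  // Naive splitting cuts the quoted field in two and shifts every later index.
  val naive: Vector[String] = sample.split(",", -1).toVector

  // Quote-aware splitting keeps the field intact, so index 2 is still the wiki code.
  val quoted: Vector[String] = splitQuoted(sample)
}
```

With the sample row, `naive` has six fields and index 2 holds a fragment of the quoted text, while `quoted` has five fields and index 2 is `en`. This index shift is what makes fixed-position lookups such as `row(2)` and `row(4)` unreliable after a plain `split`.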
@@ -66,22 +84,28 @@ object WikiInfo
    * Reads a WikiInfo object from a single CSV line.
    */
   def fromLine(line: String): Option[WikiInfo] = {
-    val fields = line.split(",", -1)
-
-    if (fields.length < 15) throw new Exception("expected [15] fields, found ["+fields.length+"] in line ["+line+"]")
-
-    val pages = try fields(4).toInt
-    catch { case nfe: NumberFormatException => 0 }
-
-    val wikiCode = fields(2)
-    if (! ConfigUtils.LanguageRegex.pattern.matcher(fields(2)).matches) throw new Exception("expected language code in field with index [2], found line ["+line+"]")
-
-    //if(Language.map.keySet.contains(wikiCode))
-    Option(new WikiInfo(wikiCode, pages))
-    //else
-    //{
-    //  logger.log(Level.WARNING, "Language: " + wikiCode + " will be ignored. Add this language to the addonlangs.json file to extract it.")
-    //  None
-    //}
+    val reader = CSVReader.open(new StringReader(line))
+    try {
+      val fields = reader.iterator.toSeq.headOption.getOrElse(Seq.empty)
+
+      if (fields.length < 15) {
+        logger.warning(s"expected [15] fields, found [${fields.length}] in line [${line.take(100)}...]")
+        return None
+      }
+
+      val pages = try fields(4).toInt
+      catch { case nfe: NumberFormatException => 0 }
+
+      val wikiCode = fields(2)
+      if (!ConfigUtils.LanguageRegex.pattern.matcher(wikiCode).matches) {
+        logger.warning(s"expected language code in field with index [2], found line [${line.take(100)}...]")
+        return None
+      }
+
+      Option(new WikiInfo(wikiCode, pages))
+    } finally {
+      reader.close()
+    }

}
}
}
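
The new `fromLine` returns `None` for malformed rows instead of throwing. As a rough sketch of the same validation written without early `return`, the example below assumes the fields have already been split. `WikiRowSketch`, `WikiRow`, and `parseFields` are hypothetical names used only here, and the regex is copied from `ConfigUtils.LanguageRegex` as quoted earlier in this thread. One deliberate difference: `Try(...).getOrElse(0)` swallows any non-fatal exception, which is broader than the PR's `NumberFormatException` catch, and the logging calls are omitted.

```scala
import scala.util.Try

object WikiRowSketch {
  // Same pattern as ConfigUtils.LanguageRegex in the snippet quoted in this thread.
  private val LanguageCode = """([a-z][a-z0-9-]+)""".r

  // Hypothetical stand-in for WikiInfo, so the sketch is self-contained.
  final case class WikiRow(wikiCode: String, pages: Int)

  /** Returns None for rows that are too short or whose index-2 field is not a language code. */
  def parseFields(fields: Seq[String]): Option[WikiRow] =
    if (fields.length < 15) None
    else {
      // A page count that does not parse as an Int falls back to 0, as in the PR.
      val pages = Try(fields(4).toInt).getOrElse(0)
      fields(2) match {
        // A Regex extractor pattern only matches when the whole string matches.
        case LanguageCode(code) => Some(WikiRow(code, pages))
        case _                  => None
      }
    }
}
```

Callers can then collect only the valid rows, for example with `rows.flatMap(r => WikiRowSketch.parseFields(r))`.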