Explore using Wiktionary dumps to derive translation data #650

@andrewtavis

Description

The Scribe community needs translation data for its projects. One way to get it would be to derive it from Wiktionary. The benefits are that the data is extensive and that it covers the different senses of a word (for example, "book" as a noun and as a verb). This issue would entail looking into the following:

{
  "book": {
    "noun": {
      "1": {
        "_description": "collection of sheets of paper bound together containing printed or written material",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      "2": {
        "_description": "another_description",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    },
    "verb": {
      "1": {
        "_description": "to reserve",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    }
  }
}
  • The new process would be based on Wiktionary dumps (English Wiktionary only)
  • We'd want a function that, when called, gets all translations for all words in a Wiktionary dump
    • Inputs would be an ISO-2 code for the language and an optional dump ID (in case we need to run it on a specific dump; defaults to the latest)
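
As a starting point for the optional dump ID input, here is a minimal sketch of how "defaults to latest" could be resolved to a download location. The helper name `resolve_dump_url` is hypothetical, and the sketch assumes that dump IDs are `YYYYMMDD` date strings and that the English Wiktionary dumps follow the usual dumps.wikimedia.org directory layout; both should be confirmed before relying on it.

```python
import re
from typing import Optional

# Assumption: English Wiktionary dumps live under the standard Wikimedia layout.
DUMP_BASE = "https://dumps.wikimedia.org/enwiktionary"


def resolve_dump_url(dump_id: Optional[str] = None) -> str:
    """Return the pages-articles dump URL for a dump ID (hypothetical helper).

    dump_id is assumed to be a date string such as "20240101".
    None (the default) falls back to the latest dump.
    """
    dump_id = dump_id or "latest"
    # Reject anything that is neither "latest" nor an eight-digit date.
    if dump_id != "latest" and not re.fullmatch(r"\d{8}", dump_id):
        raise ValueError(
            f"Expected a YYYYMMDD dump ID or 'latest', got {dump_id!r}"
        )
    return f"{DUMP_BASE}/{dump_id}/enwiktionary-{dump_id}-pages-articles.xml.bz2"
```

Keeping the URL resolution separate from parsing would let the extraction function be tested against a small local file without downloading a full dump.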

Note: We could consider using wiktextract for this.
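
If wiktextract were used, its JSON Lines output could be folded into the nested word / part of speech / sense structure shown above. The following is only a sketch: `build_translations` is a hypothetical name, and the field names it reads (`word`, `pos`, `translations`, and per-translation `code`, `word`, `sense`) are assumptions about the wiktextract output that would need to be checked against its documentation. The `iso` filter is optional here because the issue does not yet pin down how the language input should be applied.

```python
import json
from typing import Dict, Iterable, Optional


def build_translations(lines: Iterable[str], iso: Optional[str] = None) -> Dict:
    """Group translations by word, part of speech, and sense description.

    lines: JSON Lines records (one entry per line), assumed wiktextract-style.
    iso:   optional two-letter code; when given, only that language is kept.
    """
    result: Dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word, pos = entry.get("word"), entry.get("pos")
        if not word or not pos:
            continue
        for tr in entry.get("translations", []):
            code, text = tr.get("code"), tr.get("word")
            if not code or not text or (iso is not None and code != iso):
                continue
            desc = tr.get("sense", "")
            senses = result.setdefault(word, {}).setdefault(pos, {})
            # Reuse the sense number if this description was already seen.
            key = next(
                (k for k, s in senses.items() if s["_description"] == desc),
                None,
            )
            if key is None:
                key = str(len(senses) + 1)
                senses[key] = {"_description": desc}
            # Keep only the first translation per language and sense.
            senses[key].setdefault(code, text)
    return result
```

Sense numbers here are assigned in order of first appearance, which is a simplification. A real implementation would need to decide how to handle several translations for one language within a sense.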

Contribution

Happy to explore how to proceed here and also help with coding/review :)
