Explore using Wiktionary dumps to derive translation data

### Terms

- [x] I have searched [open and closed feature requests](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aissue+label%3Afeature)
- [x] I agree to follow Scribe-Data's [Code of Conduct](https://github.com/scribe-org/Scribe-Data/blob/main/.github/CODE_OF_CONDUCT.md)

### Description

The Scribe community needs translation data for its projects, and one way to get it is from Wiktionary. The benefits are that the data is expansive and that it covers the different versions of a word (for example its noun and verb senses). This issue would entail looking into the following:

- First exploring the current API-based process: https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wiktionary/parse_mediaWiki.py
- The output of this process is appropriate for what we need and should be modeled in the new process
    - We want a dictionary where the top-level keys are words from Wiktionary and the first sub-key is the word type, like noun or verb
    - Within each word type we'd have numbered sub-dictionaries, one per sense of the word, with the ISO-2 code of the translation language as a key and the translation as the value
    - We'd also want the description of each sense from Wiktionary, stored under `_description`
        - Example: https://en.wiktionary.org/wiki/book/translations#Noun
        - We'd want "collection of sheets of paper bound together containing printed or written material"

```json
{
  "book": {
    "noun": {
      "1": {
        "_description": "collection of sheets of paper bound together containing printed or written material",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      "2": {
        "_description": "another_description",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    },
    "verb": {
      "1": {
        "_description": "to reserve",
        "iso_2_1": "translation",
        "iso_2_2": "translation"
      },
      ...
    }
  }
}
```
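
A minimal Python sketch of how the nested structure above could be built up one translation at a time. The helper name `add_translation` is hypothetical, the sample translations are only illustrative, and numbering senses in the order their descriptions are first seen is an assumption rather than something decided in this issue.

```python
def add_translation(data, word, word_type, description, iso, translation):
    """Insert one translation into the word -> type -> sense layout.

    Hypothetical helper: senses are numbered from "1" in the order that
    their descriptions are first encountered for a given word and type.
    """
    senses = data.setdefault(word, {}).setdefault(word_type, {})

    # Reuse the sense whose description matches, if it already exists.
    for sense in senses.values():
        if sense["_description"] == description:
            sense[iso] = translation
            return data

    # Otherwise open a new numbered sense for this description.
    senses[str(len(senses) + 1)] = {"_description": description, iso: translation}
    return data


BOOK_NOUN = (
    "collection of sheets of paper bound together "
    "containing printed or written material"
)

translations = {}
add_translation(translations, "book", "noun", BOOK_NOUN, "de", "Buch")
add_translation(translations, "book", "noun", BOOK_NOUN, "fr", "livre")
add_translation(translations, "book", "verb", "to reserve", "de", "buchen")
```

Keeping `_description` inside each sense means a consumer can tell senses apart without relying on the numeric keys being stable between dumps.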

- The new process would be based on the [Wiktionary dumps](https://dumps.wikimedia.org/enwiktionary/) (English Wiktionary only)
- We'd want a function that, when called, gets all translations for all words from a Wiktionary dump
    - Inputs would be an ISO-2 code for the language and an optional dump ID (in case we need to run it on a specific dump; defaults to the latest)
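
One possible shape for that entry point, as a rough sketch. The names `build_dump_url` and `get_translations` are hypothetical, the dump ID `"20240101"` is only an illustrative date string, and the file name pattern is an assumption based on how the `pages-articles` files are laid out on the dumps site. The parsing itself is left unimplemented.

```python
from typing import Optional

DUMPS_BASE_URL = "https://dumps.wikimedia.org/enwiktionary"


def build_dump_url(dump_id: Optional[str] = None) -> str:
    """Return the assumed URL of the pages-articles file for a dump.

    dump_id is a date string such as "20240101"; None means "latest".
    """
    dump = dump_id or "latest"
    return f"{DUMPS_BASE_URL}/{dump}/enwiktionary-{dump}-pages-articles.xml.bz2"


def get_translations(iso: str, dump_id: Optional[str] = None) -> dict:
    """Hypothetical entry point: all translations into `iso` from one dump."""
    dump_url = build_dump_url(dump_id)
    # Downloading and parsing the dump is the open question of this issue.
    raise NotImplementedError(f"Parsing {dump_url} for '{iso}' is not implemented.")


latest_url = build_dump_url()
dated_url = build_dump_url("20240101")
```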

Note: We could consider using [wiktextract](https://github.com/tatuylonen/wiktextract) for this
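
If wiktextract were used, its JSON-lines output could be reshaped into the structure described above. The sketch below is only that: the function name `collect_translations` is hypothetical, and the record fields it reads (`word`, `pos`, and a `translations` list whose items carry `code` or `lang_code`, `word` and `sense`) are assumptions that would need checking against the wiktextract version actually used. The sample records are hand-written for illustration, not taken from a dump.

```python
import json


def collect_translations(jsonl_lines, iso):
    """Collect translations into `iso` from wiktextract-style JSON lines.

    Assumed record shape (to be verified against real wiktextract output):
    {"word": ..., "pos": ..., "translations": [{"code": ..., "word": ...,
    "sense": ...}, ...]}
    """
    result = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word, word_type = entry.get("word"), entry.get("pos")
        if not word or not word_type:
            continue

        for tr in entry.get("translations", []):
            code = tr.get("lang_code") or tr.get("code")
            if code != iso or not tr.get("word"):
                continue
            senses = result.setdefault(word, {}).setdefault(word_type, {})
            description = tr.get("sense", "")

            # Find the numbered sense with this description, or create it.
            key = next(
                (k for k, s in senses.items() if s["_description"] == description),
                None,
            )
            if key is None:
                key = str(len(senses) + 1)
                senses[key] = {"_description": description}

            # Keep the first translation seen for this sense.
            senses[key].setdefault(iso, tr["word"])
    return result


sample = [
    json.dumps({"word": "book", "pos": "noun", "translations": [
        {"code": "de", "word": "Buch", "sense": "collection of sheets of paper"},
        {"code": "fr", "word": "livre", "sense": "collection of sheets of paper"},
    ]}),
    json.dumps({"word": "book", "pos": "verb", "translations": [
        {"code": "de", "word": "buchen", "sense": "to reserve"},
    ]}),
]
german = collect_translations(sample, "de")
```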

### Contribution

Happy to explore how to proceed here and also help with coding/review :)

Explore using Wiktionary dumps to derive translation data #650

Terms

Description

Contribution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Explore using Wiktionary dumps to derive translation data #650

Description

Terms

Description

Contribution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions