
Versioning Guidelines for ELTeC

CarolinOdebrecht edited this page Sep 5, 2018 · 27 revisions

Versioning Guidelines for ELTeC – First Draft / open for discussion!

ELTeC

  • ELTeC is defined as a collection of sub-corpora. Each sub-corpus contains novels in a different language.
  • ELTeC is stored in its own organization (cf. https://github.com/COST-ELTeC).
  • Each ELTeC sub-corpus (language collection) is stored in its own repository.
  • Metadata for the whole ELTeC collection is provided by a teiHeader. Metadata for individual texts is provided by a teiHeader in each text.
  • ELTeC metadata is stored and updated in a repository along with the texts.
  • All ELTeC texts are stored in a TEI format. For details, see textFeatures.

Sub-corpora for Language Collection

  • A language collection is defined as a set of 100 novels in a certain language.

  • Each text of a language collection gets a unique ID that starts with a three-letter ISO 639-2 language code, e.g., “eng001”, “deu001” (https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes).

  • Each language collection will be stored, changed and updated in its own repository, e.g., “ELTeC-deu”, “ELTeC-eng”. In addition to the languages proposed in the MoU of the Action, ELTeC can also contain language collections for texts of different cultural contexts and language variants in Europe, e.g., Swiss German or Catalan. Repository names use the ISO 639-2 codes (which cover more languages than the ISO 639-1 codes), for example ELTeC-deu, ELTeC-gsw, ELTeC-spa, ELTeC-cat.

  • For selection guidelines see here*

  • Each language collection repository contains separate folders for the encoding levels. A text is stored in either the level0 or the level1 folder, depending on its encoding level. (If a text gets a richer encoding, e.g., from level0 to level1, it will be moved to level1. We will not store a text in both encodings at the same time.)

i. Orig: for the original data (all available input formats)

ii. level0: basic TEI Encoding

iii. level1: richer TEI Encoding

iv. level2: richer TEI Encoding with tokenization and linguistic annotation

Please note: If the repository is still empty and does not yet contain these folders, you can simply create them (Orig, level1, level2).

Each schema will be stored in a Schema repository.
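
The naming conventions above can be checked mechanically. The following is a minimal sketch, not an agreed ELTeC tool: it assumes that a text ID is exactly three lowercase letters followed by three digits (as in the examples “eng001” and “deu001”), and the function names and the `<ID>.xml` file name are hypothetical choices made only for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumption: three lowercase letters (ISO 639-2 code) plus a three-digit number.
TEXT_ID = re.compile(r"([a-z]{3})(\d{3})")

# Folder names as listed in the guidelines above.
LEVEL_FOLDERS = ("Orig", "level0", "level1", "level2")


def parse_text_id(text_id):
    """Split an ID such as 'deu001' into ('deu', 1); raise ValueError otherwise."""
    match = TEXT_ID.fullmatch(text_id)
    if match is None:
        raise ValueError(f"not a valid text ID: {text_id!r}")
    return match.group(1), int(match.group(2))


def repo_name(lang):
    """Repository name for a language collection, e.g. 'ELTeC-deu'."""
    return f"ELTeC-{lang}"


def text_path(text_id, folder):
    """Hypothetical location of a text inside its language repository."""
    if folder not in LEVEL_FOLDERS:
        raise ValueError(f"unknown folder: {folder!r}")
    lang, _ = parse_text_id(text_id)
    # The file name '<ID>.xml' is an assumption, not part of the guidelines.
    return PurePosixPath(repo_name(lang)) / folder / f"{text_id}.xml"
```

A check like this could run before a commit to catch IDs that do not match the language code of the repository they are placed in.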

Versioning

  • GitHub provides version control for each editing step (adding data, changing data, etc.).

Release

  • Separately from version control via GitHub, we will publish identified releases of ELTeC (including all sub-corpora) on Zenodo.
  • Each release will be published on Zenodo using a GitHub-Zenodo-Bridge.
  • Each release will get a DOI in Zenodo for referencing purposes.
  • Zenodo provides a DOI for each version.
  • Zenodo provides a DOI for all versions together: “This DOI represents all versions, and will always resolve to the latest one”
  • ELTeC will be citable, including a list of corpus editors.

Advantages:

  • On GitHub we can create, change and update the corpus.
  • For long-term archiving and referencing purposes, we can use Zenodo. This addresses all important issues concerning data management.
  • Additionally, ELTeC will be more visible to and findable by other researchers.
  • Via a GitHub-Zenodo-Bridge, we can easily publish the current version of ELTeC without much effort (see e.g. https://guides.github.com/activities/citable-code/).
  • Archiving requires a stable institution that guarantees its services. Zenodo’s perspective: “Built and developed by researchers, to ensure that everyone can join in Open Science.” (http://about.zenodo.org/) GitHub is, strictly speaking, a company that offers no such guarantees to researchers (and has recently been acquired by Microsoft).
  • Zenodo provides DOIs for stable referencing. GitHub does not provide DOIs.
  • We are findable on Zenodo, which increases visibility.

Licence

The overall goal of the licence should be to encourage re-use, to make sure the creators of the collections receive proper academic credit, and to ensure our COST Action serves as a model for open access to textual resources. We talked about many options and scenarios and have come to the conclusion that the best licence would be a Creative Commons Attribution ("CC BY" for short) licence. Find out what this licence means here: https://creativecommons.org/licenses/by/4.0/

First of all, it should be clear that this licence does not apply to the text itself, as all texts we use are in the public domain. The licence also does not apply to the metadata, as (largely) factual data cannot and should not be copyrighted. This means the licence only applies to the TEI markup in the files or to any other annotations we may add.

The CC BY licence will ensure that other people can do almost anything with these texts, such as download them, annotate them, modify them, analyse them, integrate them into their own text collection, and republish any of these new versions. The only condition is that they mention the licence of the source (CC BY) and attribute the creators of the text collection (ideally by mentioning both the ELTeC URL and the names of the editors).

We have decided not to apply any more restrictive conditions on re-use, such as the non-commercial condition (NC), the no-derivatives condition (ND) or the "share-alike" condition (SA), as they restrict the flexibility with which our texts can be re-used. Such additional restrictions may also discourage potential users from using the texts in situations where they are unsure about the exact scope of the restriction.

One more licensing issue is compatibility. Suppose we come across a text that is relevant to an ELTeC collection according to the selection criteria and that is available in a full-text digital format, maybe even with XML-TEI markup, but under a licence such as "CC BY-SA", "CC BY-ND" or "CC BY-NC". What do we do? Again, the licence can only apply to the markup, not the text. Strictly speaking, however, we would not be allowed to reuse the text with its markup and give it the less restrictive "CC BY" licence. One way out would be to assume that basic TEI markup cannot be copyrighted and to remove any existing, more elaborate markup or replace it with our own. However, if such a situation occurs, we recommend contacting the original editors of the text and asking them for permission to use and re-license it under a "CC BY" licence.

To sum up: We propose to license all ELTeC collections under a Creative Commons Attribution (CC BY) licence, which applies to the markup, not the text itself.

Release-Definition

Content

  • Each text collection contains several files in different formats (encoding levels). Each release on Zenodo contains all data files, following the structure of the repository.
  • We will set up independent releases for each language collection. This means we need to coordinate the different developments in each language collection.

Version numbering

During the Action, we might want to publish the corpus iteratively. In addition to a DOI, a version number might be useful. A consistent version numbering scheme enables data users to track whether a collection has changed and whether a new version is available, to determine specifically which version they used before and which version they are working with now, and to set expectations about how each version will differ (https://www.ands.org.au/working-with-data/data-management/data-versioning). With consistent numbering, we can easily refer to each release, e.g., in a conference abstract where we can refer to “ELTeC Version 1.0.1” and additionally use the DOI for publication references.

Options:

  • ELTeC as a super corpus gets a version number.
  • Each subcollection as a sub-corpus gets a version number too.

Kind of version number

We can use a Major.Minor numbering system, where a major revision means, for example, that substantial data is added or that the data model is changed (1.0 -> 2.0). A minor revision may indicate that we renamed or corrected data (2.0 -> 2.1).
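
To make the Major.Minor proposal concrete, here is a small sketch of how such version numbers could be parsed, incremented and compared. It is only an illustration under the assumption of a strict two-part scheme; the class and method names are hypothetical and not part of any agreed ELTeC tooling.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A two-part Major.Minor version number."""
    major: int
    minor: int

    @classmethod
    def parse(cls, text):
        """Parse a string such as '2.1' into Version(2, 1)."""
        major, minor = text.split(".")
        return cls(int(major), int(minor))

    def bump_major(self):
        """Substantial data added or data model changed: 1.0 -> 2.0."""
        return Version(self.major + 1, 0)

    def bump_minor(self):
        """Data renamed or corrected: 2.0 -> 2.1."""
        return Version(self.major, self.minor + 1)

    def __str__(self):
        return f"{self.major}.{self.minor}"
```

Because the parts are stored as integers, versions compare numerically rather than as text, so 2.10 is correctly treated as newer than 2.9.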

Open Questions

  1. For referencing purposes (e.g. on Zenodo), we should provide names of the corpus editors of ELTeC and annotators of ELTeC.
  • We do not need to create and update such a list: it can be automatically generated from the metadata supplied with each text or with the metadata for ELTeC as a corpus.
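
As a rough illustration of how such a list could be generated, the sketch below collects names from the teiHeader of each text. It assumes that contributors are recorded as `name` elements inside `respStmt` in the standard TEI namespace; the actual ELTeC header layout may differ (see teiHeaders), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the prefix 'tei' is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def editors_in(tei_xml):
    """Return the names found in respStmt/name elements of one TEI document."""
    root = ET.fromstring(tei_xml)
    # Assumption: contributors are encoded as <respStmt><name>...</name></respStmt>.
    names = root.findall(".//tei:respStmt/tei:name", TEI_NS)
    # Normalise internal whitespace and skip empty elements.
    return [" ".join(n.text.split()) for n in names if n.text and n.text.strip()]


def editor_list(documents):
    """Merge the names from several TEI documents into one sorted, de-duplicated list."""
    seen = set()
    for doc in documents:
        seen.update(editors_in(doc))
    return sorted(seen)
```

Running `editor_list` over all files of a language collection would yield the names to supply as creators when a release is published.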

