
Filenames and identifiers

Lou edited this page Mar 6, 2020 · 7 revisions

This document is a proposal regarding naming conventions to observe when creating identifiers and filenames for texts in an ELTeC collection.

Now that many of us have gained some experience with these, but before everything is done and settled, it may be a good time to try and find a consensus. Although the following is written as a guideline, it is really a proposal. For each proposal, the explanation or reasons are given in parentheses.

First of all, I believe we should agree that a shared, coherent filename and identifier scheme is desirable. I believe it is, because it makes writing scripts that will work with all collections much easier.

(1) Identifiers

Basic assumptions: The function of the identifier is to allow us to refer to the texts across the entire ELTeC, so it needs to be unique. Also, it should be short so that it is convenient. The identifier does not have to make much sense to humans.

  1. The identifier should be documented in the "xml:id" attribute of the "TEI" element. (This is where it should be taken from for all other purposes.)
  2. The identifier should start with an uppercase version of the language identifier of the collection, so "FRA" in the case of the "ELTeC-fra" collection or "POR" in the case of the ELTeC-por collection. (I believe it makes sense to link any text to its collection in this way. Also, it makes it much easier to be sure the identifiers are unique across ELTeC.)

An earlier version of this document proposed lowercase, but in practice most repository creators have chosen uppercase. The choice is arbitrary.

  1. After the language identifier, there should be a numerical part to the identifier. This numerical part should have at least 4 digits and no more than 6 digits. The digits can be arbitrary values or can carry some meaning (such as the year), but should not impose undue or arbitrary limits on the number of possible texts in the collection. (Note that the extended ELTeC collections may theoretically contain more than 1000 novels, but most likely not more than 9999, and we should not need to change the identifier scheme in that case. So four digits would be enough, except when four digits are used for the year, in which case additional digits are needed.)

In summary, I believe this means valid identifiers would look like this: "FRA1234", "ENG12345" or even "HUN123456". Or "ENG18780", of course.
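The identifier rules above can be checked mechanically. The following is a minimal sketch, not an official ELTeC tool: it assumes that every language identifier is exactly three uppercase ASCII letters (as in the examples "FRA", "POR", "ENG" and "HUN") and that the numerical part has 4 to 6 digits. The names `is_valid_identifier` and `find_duplicates` are hypothetical.

```python
import re
from collections import Counter

# Assumption: three uppercase letters for the language, then 4 to 6 digits.
IDENTIFIER_PATTERN = re.compile(r"[A-Z]{3}[0-9]{4,6}")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the whole string matches the proposed scheme."""
    return IDENTIFIER_PATTERN.fullmatch(identifier) is not None


def find_duplicates(identifiers):
    """Return a sorted list of identifiers that occur more than once."""
    counts = Counter(identifiers)
    return sorted(ident for ident, n in counts.items() if n > 1)
```

For instance, `is_valid_identifier("ENG18780")` accepts the year-based example given above, while a lowercase form such as "fra1234" is rejected. Running `find_duplicates` over the identifiers of all collections would be one way to confirm that they are unique across ELTeC.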

(2) Filenames

Assumptions: Filenames should be unique, should not be overly long and should make (some) sense to humans.

  1. The filename should include the identifier. (This is important as a guarantee for the filenames to be unique across the collection and for any automatic renaming based on metadata to work.) As a consequence, the rest of the filename, if any, on its own, does not need to be unique.
  2. Filenames do not need to remain the same at all times, as long as they include the identifier. We are defining a default filename here. (Based on the metadata contained in the teiHeader for each collection, either directly or from a metadata table generated from the teiHeader, renaming the files automatically will be easy as long as the identifier is included. However, a standard default filename will be convenient for users.)
  3. The identifier should be added to the filename in a way that allows us to easily find it automatically. This means it should be placed in the same place in all collections and be delimited from any other information in the filename in a clear way. (It is important that the identifier can be easily and consistently identified in the filename for automated treatment.) I believe it is preferable to have the identifier at the end. In this way, the information placed before the identifier can be used for sorting, and whatever someone chooses to use for sorting, the identifier can still always be easily found, using a standard pattern, even in non-default filenames.

After discussion, it has been agreed that we will place the identifier at the START of the filename, rather than the end as originally suggested here.

  1. In order to be human-readable, the filename can include information in addition to the identifier. (This allows humans to quickly recognize what kind of novel they are dealing with when they see the filename, without having to look at a metadata table. Some tools, like "stylo" or TXM, use the filename by default when displaying results or graphs.) Candidates for this additional information are (a short version of) the author name, (a short version of) the title, the year of publication (first publication or year of the edition being used as the copy text?), or a combination of these items. The major limiting factor here is the length of the filename. (Again, because tools use the filename as a label, these labels should not be too long. Some tools may truncate the labels; how can we make sure they still make sense to humans and still uniquely identify the texts?) This means that only (a short version of) the author name should be added to the default filenames.

  2. The extension of the files should be ".xml". (We are dealing with XML files and should let everyone see this at once.)

  3. This means that the default filenames should follow the pattern "author_identifier.xml". (Update: we are now using the pattern "identifier_author.xml".)

  4. Alternative filenames derived from the default filenames could be: "author-title_identifier.xml" or "year-author_identifier.xml". (These should not be used in the default folders.)
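As an illustration of how the agreed pattern "identifier_author.xml" supports automated treatment, here is a minimal sketch for building a default filename and for recovering the identifier from a filename. It assumes the identifier comes first, is separated from the rest by an underscore, and follows the three-letters-plus-digits shape described in section (1). The function names and the author name used in the usage note are hypothetical.

```python
import re
from typing import Optional

# Assumption: identifier at the start, optional "_rest", then ".xml".
FILENAME_PATTERN = re.compile(
    r"(?P<identifier>[A-Z]{3}[0-9]{4,6})(?:_(?P<rest>.+))?\.xml"
)


def default_filename(identifier: str, author: str) -> str:
    """Build a default filename of the form identifier_author.xml."""
    return f"{identifier}_{author}.xml"


def identifier_from_filename(filename: str) -> Optional[str]:
    """Return the identifier at the start of a filename, or None."""
    match = FILENAME_PATTERN.fullmatch(filename)
    return match.group("identifier") if match else None
```

With these definitions, `default_filename("FRA1234", "Balzac")` yields "FRA1234_Balzac.xml", and `identifier_from_filename` recovers "FRA1234" from it. Because only the leading identifier is inspected, a non-default name that keeps the identifier at the start can still be processed by the same pattern.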

(3) Other remarks

A metadata table, rather than the list of filenames, should be used to get an overview of the collection. It is a much more flexible way of doing this, especially as it allows sorting and grouping, as needed, by various corpus composition criteria, such as canonicity, text size, time period and author gender, in addition to year of the copy text, year of first publication, author name, title, etc. (We already have two scripts that create such a metadata table, one written by Lou using XSLT and one written by me using Python.)
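To make the idea concrete, the following is a minimal sketch of how one row of such a table could be extracted from a TEI file. It is not either of the existing scripts mentioned above. It assumes that the identifier sits in the "xml:id" attribute of the "TEI" element, as proposed in section (1), and that title and author can be found under teiHeader/fileDesc/titleStmt. The function name `metadata_row` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the standard XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def _text(element):
    """Return stripped element text, or None if the element is missing."""
    if element is None or element.text is None:
        return None
    return element.text.strip()


def metadata_row(tei_xml: str) -> dict:
    """Extract identifier, title and author from a TEI document string."""
    root = ET.fromstring(tei_xml)
    stmt = "tei:teiHeader/tei:fileDesc/tei:titleStmt/"
    return {
        "identifier": root.get(XML_ID),
        "title": _text(root.find(stmt + "tei:title", TEI_NS)),
        "author": _text(root.find(stmt + "tei:author", TEI_NS)),
    }


# Hypothetical sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ENG18780">
  <teiHeader><fileDesc><titleStmt>
    <title>Example Title</title>
    <author>Example Author</author>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""
```

Collecting one such row per file gives a list of dictionaries that can be sorted or grouped by any column, which the bare list of filenames does not allow.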
