
TS Budapest

Martina Scholger edited this page Sep 23, 2019 · 20 revisions

TS - W1 Creating ELTeC

Topic: Corpus design and text contribution for ELTeC

The aim of the Training School is to give participants hands-on experience in creating ELTeC TEI-XML versions of source texts, starting either from scanned page images or from a pre-existing HTML version. We will supply a set of raw materials for participants to work on, along with detailed instructions. By the end of the Training School, each participant should be able to contribute new TEI-encoded texts to the ELTeC GitHub repository. Participants will work in pairs with guidance from Martina, Carolin, Christian and Lou.

On Monday, participants will learn how to use Oxygen for encoding texts in TEI XML, and will be introduced to the sampling and balancing criteria as well as other metadata requirements for ELTeC. On Tuesday, they will work with two different transformation scenarios for creating TEI XML texts, firstly starting from page images, and secondly starting with HTML. We will also discuss the different encoding levels provided for ELTeC texts, including proposals for encoding linguistic analysis. In the closing sessions on Wednesday, participants will be asked to present what they have achieved, and discuss the outcomes of this and other parallel workshops.

The Training School is intended for any researcher interested in contributing texts to the ELTeC. Some previous experience of computer use is needed, but no knowledge of TEI XML is assumed.
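
For readers who have not yet seen TEI XML, the following is a minimal sketch of the kind of document the practical sessions build towards: a header describing the text, followed by the text itself. It uses only generic TEI elements (`teiHeader`, `fileDesc`, `text`, `body`) and does not attempt to reproduce the ELTeC-specific header requirements, which are covered in the teiHeaders page. The helper name `minimal_tei` and the sample title and paragraphs are hypothetical, chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the standard TEI namespace


def minimal_tei(title, source_note, paragraphs):
    """Build a generic minimal TEI document.

    Hypothetical helper for illustration only; not an ELTeC tool and
    not a substitute for the ELTeC header instructions.
    """
    # Serialise TEI elements without a prefix (default namespace).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None):
        child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        child.text = text
        return child

    root = ET.Element(f"{{{TEI_NS}}}TEI")

    # The header: title, publication and source statements.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished training example")
    el(el(file_desc, "sourceDesc"), "p", source_note)

    # The text: one <p> element per paragraph.
    body = el(el(root, "text"), "body")
    for para in paragraphs:
        el(body, "p", para)
    return root


doc = minimal_tei(
    "An Example Novel",
    "Transcribed from scanned page images",
    ["First paragraph.", "Second paragraph."],
)
ET.indent(doc)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(doc, encoding="unicode"))
```

In the Training School itself this structure is written by hand in Oxygen rather than generated; the script is only meant to make the header/text split visible.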

Schedule

| Monday | Time | Topics | Learned | Trainer |
|---|---|---|---|---|
| 13:30-14:00 | 30 min | Joint Opening Session | What our Action is about and the tasks of the WGs | Carolin |
| 14:00-15:45 | 105 min | Introduction to Oxygen and TEI XML | Basic XML structure (elements, attributes, hierarchy) and how to use Oxygen. Slides: http://distantreading.github.io/Training/Budapest/introduction_XMLTEI.html | Martina |
| 16:15-16:45 | 30 min | Introduction to ELTeC | What ELTeC is about, its design and goals, and a first introduction to its components (metadata, encoding) | Carolin |
| 16:45-17:45 | 60 min | First practical session | Everyone chooses a text (not necessarily one of ours) and creates a header for it. Slides: http://distantreading.github.io/Training/Budapest/tutorial-hdr.html | Lou |
| 17:45-18:00 | 15 min | GitHub practice | Everyone adds their chosen header to their own GitHub repo | |
| Tuesday | | | | |
| 9:00-10:45 | 105 min | Introduction to OCR with OCR4all | Practical session with the theoretical background kept to a minimum: introduce the workflow, briefly explain the individual modules, then spend the rest of the time experimenting. Short discussion of the advantages and disadvantages of ABBYY and OCR4all (not a presentation of ABBYY). Generate plain text files from the material | Christian |
| 11:15-13:00 | 105 min | Working with OCR | Second practical: everyone chooses a different Polish text. Slides: http://distantreading.github.io/Training/Budapest/tutorial-pol.html | Lou |
| 13:00-14:00 | 60 min | LUNCH | | |
| 14:00-14:45 | 45 min | Varieties of markup and conversion tools (talk/discussion) | | Lou |
| 14:45-16:30 | 105 min | Working with other OCR outputs (ABBYY XML - docx - HTML) | Third practical: everyone chooses a different text. Slides: http://distantreading.github.io/Training/Budapest/tutorial-abbyy.html | Lou |
| 16:30-17:00 | 30 min | Summary: what have we learned? | | Carolin |
| 17:15-18:45 | | Katherine Bode Lecture | | |
| Wednesday | | | | |
| 9:00-10:45 | 105 min | Concluding session | Create a presentation of the outputs of the workshop, upload data | Lou, Martina, Christian |
| 11:15-12:45 | 90 min | Joint final session | | Lou, Martina, Christian, Carolin |

What's here

E5C-discussion-paper ELTeC Corpus Composition Criteria Compliance Calculations: draft for discussion

Challenges-on-text-selection Reports on challenges regarding text selection and balancing

Workflow Step-by-step introduction to contributing texts to ELTeC.

Uploading-files-on-GitHub-Step-by-Step How to upload texts to GitHub

textFeatures Table of textual features and their encodings

teiHeaders Instructions for compiling an ELTeC Header

choosingTitles Suggestions on how to select texts for ELTeC

Versioning-Guidelines-for-ELTeC Draft for defining our versioning guidelines.

Filenames and identifiers: A proposal

Please feel free to add ideas and discussion notes

Call-for-Contributions What texts can you contribute?

Example-Texts Add an example here!

ELTeC-List-of-Candidates Draft table for text candidates

Online-Text-Collections Some links to less well known collections
