TS Budapest
Topic: Corpus design and text contribution for ELTeC
The aim of the Training School is to give participants hands-on experience in creating ELTeC TEI-XML versions of source texts, starting either from scanned page images or from a pre-existing HTML version. We will supply a set of raw materials for candidates to work on, along with detailed instructions. By the end of the Training School, each participant should be able to contribute new TEI-encoded texts to the ELTeC GitHub repository. Participants will work in pairs with guidance from Martina, Carolin, Christian and Lou.
On Monday, participants will learn how to use Oxygen for encoding texts in TEI XML, and will be introduced to the sampling and balancing criteria as well as other metadata requirements for ELTeC. On Tuesday, they will work with two different transformation scenarios for creating TEI XML texts, firstly starting from page images, and secondly starting with HTML. We will also discuss the different encoding levels provided for ELTeC texts, including proposals for encoding linguistic analysis. In the closing sessions on Wednesday, participants will be asked to present what they have achieved, and discuss the outcomes of this and other parallel workshops.
The Training School is intended for any researcher interested in contributing texts to the ELTeC. Some previous experience of computer use is needed, but no knowledge of TEI XML is assumed.
| Monday | Duration | Topic | Learned | Trainer |
|---|---|---|---|---|
| 13:30-14:00 | 30 min | Joint Opening Session | what our Action is about and the tasks of the WGs | Carolin |
| 14:00-15:45 | 105 min | Introduction to Oxygen and TEI XML | basic XML structure (elements, attributes, hierarchy) and how to use Oxygen, slides http://distantreading.github.io/Training/Budapest/introduction_XMLTEI.html | Martina |
| 16:15-16:45 | 30 min | Introduction to ELTeC | what ELTeC is about, Design/Goals, first introduction about components (metadata, encoding) | Carolin |
| 16:45-17:45 | 60 min | First practical session | everyone chooses a text (not necessarily one of ours) and creates a header for it, slides http://distantreading.github.io/Training/Budapest/tutorial-hdr.html | Lou |
| 17:45-18:00 | 15 min | GitHub practice | everyone adds their chosen header to their own GitHub repo | |
| Tuesday | ||||
| 9:00-10:45 | 105 min | Introduction to OCR with OCR4all | practical session with the theoretical background kept to a minimum: introduce the workflow, briefly explain the individual modules, then spend the remaining time experimenting; short discussion of the advantages and disadvantages of ABBYY and OCR4all (not a presentation of ABBYY); generate plain text files from the material | Christian |
| 11:15-13:00 | 105 min | Working with OCR | Second practical: everyone chooses a different Polish text, slides http://distantreading.github.io/Training/Budapest/tutorial-pol.html | Lou |
| 13:00-14:00 | 60 min | LUNCH | ||
| 14:00-14:45 | 45 min | Varieties of markup and conversion tools (talk/discussion) | | Lou |
| 14:45-16:30 | 105 min | Working with other OCR outputs (ABBYY XML - docx - HTML) | Third practical: everyone chooses a different text, slides http://distantreading.github.io/Training/Budapest/tutorial-abbyy.html | Lou |
| 16:30-17:00 | 30 min | Summary - what have we learned? | | Carolin |
| 17:15-18:45 | 90 min | Katherine Bode Lecture | | |
| Wednesday | ||||
| 9:00-10:45 | 105 min | Concluding session | create presentation of the outputs of the workshop, upload data | Lou, Martina, Christian |
| 11:15-12:45 | 90 min | Joint final session | | Lou, Martina, Christian, Carolin |
E5C-discussion-paper ELTeC Corpus Composition Criteria Compliance Calculations: draft for discussion
Challenges-on-text-selection Reports on challenges regarding text selection and balancing
Workflow Step-by-step introduction for contributing texts to ELTeC.
Uploading-files-on-GitHub-Step-by-Step How to upload texts on GitHub
textFeatures Table of textual features and their encodings
teiHeaders Instructions for compiling an ELTeC Header
choosingTitles Suggestions on how to select texts for ELTeC
Versioning-Guidelines-for-ELTeC Draft for defining our versioning guidelines.
Filenames and identifiers: A proposal
Please feel free to add ideas and discussion notes
Call-for-Contributions What texts can you contribute?
Example-Texts Add an example here!
ELTeC-List-of-Candidates Draft table for text candidates
Online-Text-Collections Some links to less well known collections