<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

<title>ELTeC</title>

<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/simple.css" id="theme">
<link rel="stylesheet" href="plugin/highlight/monokai.css" id="highlight-theme">
</head>

<body>
<div class="reveal">
<div class="slides">

<section
    data-markdown=""
    data-charset="utf-8"
    data-separator="^\n--\n"
    data-separator-vertical="^\n---\n"
    data-separator-notes="^::"
    data-background-image="img/basics/distant-reading_icon.png"
    data-background-size="100px"
    data-background-position="right 10px top 10px">

<textarea data-template>

# What is ELTeC all about?

<img data-src="img/basics/distant-reading_logo.png" height="40"></img>
<br/><br/>

**Christof Schöch (Trier, Germany)**

<br/>

***
Belgrade Training School, March 22, 2022
<br/>https://distantreading.github.io/eltec-slides/
***
<img data-src="img/basics/tcdh-slim.png" height="50"></img>&nbsp;&nbsp;&nbsp;<img data-src="img/basics/uni-trier.png" height="50"></img>&nbsp;&nbsp;&nbsp;<img data-src="img/basics/cost-and-eu.png" height="70"></img>


::
- Happy to be here, thanks for the invite
- Multilingual approaches in CLS are on the rise
  -
- Need to be supported by three things
  - multilingual data
  - methods to handle it
  - people who know the context
- So I believe it is inherently a community effort
- At least, that is the premise of ELTeC, the "European Literary Text Collection"


--
### Overview
1. [What is ELTeC?](#/2)
2. [Composition criteria](#/3)
3. [Encoding principles](#/4)
4. [Publication strategy](#/5)
5. [Usage scenarios](#/6)
6. [Conclusion](#/7)

::
- So I'd like to present ELTeC today, with a focus on multilingualism


--
## (1) What is ELTeC?


---
### ELTeC in context
* COST Action "Distant Reading for European Literary History" <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
  * Research network (31 countries, 200+ researchers)
  * Ambition: Foster digital, cross-lingual research into the history of the European novel
* Areas of activity:  <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * Build a multilingual corpus of the European novel
  * Develop appropriate digital methods of analysis
  * Reflect on the theoretical consequences
  * Create a network of researchers across Europe
  * Capacity building: training schools, exchanges, joint projects


---
### "European Literary Text Collection"
* A multilingual corpus of the European novel <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
  * Time period: 1840-1920 (production, availability, OCR, copyright)
  * At least 10 different languages (currently: 10 complete, 7 more in progress, + extensions)
  * Comparable (!) collections of 100 novels per language
* Key characteristics <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * Each corpus represents the variety of production
  * Texts encoded in XML-TEI and linguistically annotated (POS, NE)
  * Everything published under open licences (CC)
* More information  <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="3" -->
  * http://www.distant-reading.net/eltec/
  * Latest release: [v1.1.0, April 2021, 14 collections, 1200 novels](https://github.com/COST-ELTeC/ELTeC)

---
### Progress of our work on ELTeC
<img data-src="img/eltec-overview_numnovels.png" height="500"></img>
<br/><small>See: https://distantreading.github.io/ELTeC/</small>


--
## (2) Eligibility and composition criteria

---
### Eligibility criteria
* Novels, i.e. narrative, fictional prose of a certain length
* Minimum length: 10,000 words
* Novels originally written in the language of the collection
* Novels first (or at least also) published in Europe

---
### Composition criteria
* Objectives <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
  * Comparability of the collections
  * Represent the diversity of the novel production
  * Go beyond the canon (and the usual collections)
* Criteria taken into account <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * Period of publication: 1840-59, 1860-79, 1880-99, 1900-1919
  * Length of the text: short (10-50k), medium (50-100k), long (100k+)
  * Author gender: male, female, diverse/mixed
  * Reprint count, 1970-2010: low (0-1), high (2+)
  * Number of novels per author: 3 (for 9-11 authors), 1 (for all others)


---
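#### Sketch: applying the composition criteria
The criteria on the previous slide can be read as simple classification rules. The following is a minimal sketch, not the project's own tooling: the function names are hypothetical, and the handling of exact boundaries (for example, whether 50,000 words counts as short or medium) is an assumption.

```python
MIN_WORDS = 10_000  # eligibility threshold stated on the slides


def time_slot(year):
    """Map a first-publication year to one of the four 20-year periods."""
    if not 1840 <= year <= 1919:
        raise ValueError(f"{year} lies outside 1840-1919")
    start = 1840 + 20 * ((year - 1840) // 20)
    return f"{start}-{start + 19}"


def size_class(words):
    """Classify text length; exact boundary handling is an assumption."""
    if words < MIN_WORDS:
        raise ValueError("shorter than the 10,000-word minimum")
    if words < 50_000:
        return "short"
    if words < 100_000:
        return "medium"
    return "long"


def reprint_class(reprints):
    """Reprint count for 1970-2010: 0-1 is low, 2 or more is high."""
    return "low" if reprints <= 1 else "high"


def describe(year, words, reprints):
    """Bundle the three derived categories for one (hypothetical) novel."""
    return {
        "period": time_slot(year),
        "size": size_class(words),
        "reprints": reprint_class(reprints),
    }


# Invented example values, not a real ELTeC entry.
print(describe(year=1887, words=72_500, reprints=3))
```

Author gender and the number of novels per author are left out because they are recorded by hand rather than derived from a number.

---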
#### Composition of the collections
<br/>
<small>

|ELTeC-eng|ELTeC-rom|
|:---:|:---:|
|<img data-src="img/mosaic-eng.svg" height="400">|<img data-src="img/mosaic-rom.svg" height="400">|
|100 novels<br/>EC5 100<br/>excellent balance| 100 novels<br/>EC5 83<br/>balance difficult to obtain|
|||

</small>

---
#### The diversity paradox
<a href="img/eltec-overview_paradox.png"><img data-src="img/eltec-overview_paradox.png" height="400"></img></a>

* Three objectives
  * Comparability of the corpora (enabled by strict criteria)
  * Diversity of texts within a corpus (enforced by strict criteria)
  * Diversity of languages in ELTeC (supported by loose criteria)


--
## (3) Encoding principles

---
### Three levels of encoding
* Everything is encoded in XML-TEI (of course!?) <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
* There is a common header for metadata <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
* Three levels of encoding <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="3" -->
  * Level 0: minimal TEI encoding (metadata + `div`, `p`, `hi`)
  * Level 1: semantic TEI encoding (`foreign`, `emph` etc.)
  * Level 2: TEI with token-level linguistic annotation (UPos, NE)
* Controlled by a set of schemas <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="4" -->
  * Schemas connected via "ODD chaining" (see Burnard et al. 2021)
  * Validation with RelaxNG and Schematron
  * Validation socially enforced by Lou


---
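#### Sketch: reading an annotated file
A minimal sketch of how a file with a common header and token-level annotation could be processed with Python's standard library. The sample document is invented and not schema-valid; in particular, the `<w>` elements with `pos` and `lemma` attributes are an assumed shape for level-2 annotation and should be checked against the actual ELTeC schemas.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # standard TEI namespace

# Toy document, NOT taken from ELTeC; the markup is illustrative only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Toy Novel</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <p>
          <s>
            <w pos="DET" lemma="the">The</w>
            <w pos="NOUN" lemma="train">train</w>
            <w pos="VERB" lemma="leave">left</w>
            <w pos="PROPN" lemma="Belgrade">Belgrade</w>
            <w pos="PUNCT" lemma=".">.</w>
          </s>
        </p>
      </div>
    </body>
  </text>
</TEI>"""


def title(root):
    """Return the title recorded in the header."""
    return root.findtext(".//tei:titleStmt/tei:title", namespaces=NS)


def pos_counts(root):
    """Count part-of-speech tags over all (assumed) <w> tokens."""
    return Counter(w.get("pos") for w in root.iterfind(".//tei:w", NS))


root = ET.fromstring(SAMPLE)
print(title(root))
print(pos_counts(root))
```

Because every collection shares the same header and element vocabulary, the same two functions could in principle be run unchanged over files in any of the languages.

---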
### Metadata
* Composition criteria (see above) <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
* Provenance <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * digital source
  * print source
  * first edition
* Type of novel (optional) <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="3" -->
  * Subgenre of the novel
  * Narrative perspective
* Textual characteristics <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="4" -->
  * Language
  * Orthography (original vs. modernised)
  * Alphabet (Latin, Cyrillic, transition)
  * Encoding level (see above)


::
- Nothing special from the point of view of digital editing
- But more detailed than what is standard in collection building


--
## (4) Publication strategy


---
### Publication strategy
* For the needs of the project <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
  * Space for collaboration (XML): [GitHub](https://github.com/cost-eltec)
  * Publication of 'releases' with DOI (XML): GitHub + [Zenodo](https://zenodo.org/communities/eltec/)
  * Overview (HTML, XML): [GitHub.io](https://distantreading.github.io/ELTeC/)
* Distribution platforms (beyond Zenodo):  <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * [TEI Publisher](https://teipublisher.com/exist/apps/eltec/index.html)
  * [GAMS](http://glossa.uni-graz.at/context:eltec)
  * [TextGrid Rep](https://dev.textgridrep.org/browse/3tg6g.0)
* Further publication formats  <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="3" -->
  * Packages for use with analysis tools like TXM or AntConc
  * Publication via analysis platforms like CWB, TXM Portal, NoSketchEngine etc.


---
#### GitHub
<img data-src="img/eltec_github.png" height="500"></img>

https://github.com/cost-eltec


---
#### Zenodo
<img data-src="img/eltec_zenodo.png" height="500"></img>

https://zenodo.org/communities/eltec/

---
#### TEI Publisher
<img data-src="img/eltec_teip.png" height="500"></img>

https://teipublisher.com/exist/apps/eltec/index.html

---
#### GAMS (Graz)
<img data-src="img/gams.png" height="500"></img>

https://glossa.uni-graz.at/archive/objects/context:eltec/methods/sdef:Context/get?mode=home#


--
## (5) Usage scenarios

---
### Some scenarios
* Shared objectives <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
  * Adapt existing statistical methods to multiple European languages
  * Evaluate these methods in a multilingual context (beyond eng, deu, fra)
* Some examples <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
  * Linguistic annotation: Cinková et al. 2020
  * Annotation of named entities: Frontini et al. 2020
  * Identification of direct speech: Byszuk et al. 2020
  * Title analysis: Patraș et al. 2021
  * Stylometric authorship attribution: Schöch et al.


---
### Identification of direct speech
<img data-src="img/byszuk-2020.png" height="400"></img>

* Key results
  * "Multilingual sentence embeddings" clearly surpass baseline
  * Performance: F1 score of about 0.89 for all nine languages

---
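#### Sketch: how an F1 score is computed
For readers unfamiliar with the metric quoted on the previous slide, here is a minimal sketch of precision, recall and F1 for a binary "direct speech / not direct speech" decision. The labels are invented toy data, and the unit of evaluation (one label per paragraph) is an assumption; the slides do not say how the cited study set this up.

```python
def precision_recall_f1(gold, predicted):
    """Compare two equal-length lists of booleans (True = direct speech)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly found
    fp = sum(1 for g, p in pairs if not g and p)    # false alarms
    fn = sum(1 for g, p in pairs if g and not p)    # missed cases
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels for six paragraphs; not data from the cited study.
gold = [True, True, True, False, False, True]
predicted = [True, True, False, False, True, True]
print(precision_recall_f1(gold, predicted))
```

F1 is the harmonic mean of precision and recall, so it is only high when both false alarms and missed cases are rare.

---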
### Title analysis
<img data-src="img/patras-2021_annotation.png" width="500"></img> <!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->
<br/><img data-src="img/patras-2021_lengths.png" width="500"></img> <!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="2" -->

---
### Stylometry: Dendrograms
<a href="img/ELTeC-fra_eders-d_1000.png"><img height="500" data-src="img/ELTeC-fra_eders-d_1000.png"></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="img/ELTeC-rom_eders-d_1000.png"><img height="500" data-src="img/ELTeC-rom_eders-d_1000.png"></a>
<br/>ELTeC-fra &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ELTeC-rom


---
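#### Sketch: a Delta distance
Judging from the image file names, the dendrograms appear to be based on Eder's Delta with 1000 most frequent words. As a simpler relative, the sketch below implements classic Burrows's Delta, not Eder's variant: relative frequencies of the most frequent words are z-scored across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. All names and the toy texts are hypothetical, and using the population standard deviation is a simplification.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(docs, n):
    """Return the n most frequent words over all documents together."""
    totals = Counter()
    for tokens in docs.values():
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def burrows_delta(docs, n_mfw=1000):
    """Return {(name_a, name_b): delta} for every pair of documents."""
    vocab = most_frequent_words(docs, n_mfw)
    names = sorted(docs)
    # Relative frequency of each vocabulary word in each document.
    freqs = {}
    for name in names:
        counts = Counter(docs[name])
        freqs[name] = [counts[w] / len(docs[name]) for w in vocab]
    # z-score every word's frequency across the documents.
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)
    # Delta is the mean absolute difference between z-score profiles.
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            deltas[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return deltas


# Invented toy "texts"; real input would be tokenised novels.
docs = {
    "a1": "the cat and the dog and the bird".split(),
    "a2": "the cat and the dog and the fish".split(),
    "b1": "a storm a storm of rain of wind".split(),
}
deltas = burrows_delta(docs)
print(deltas)
```

In this toy corpus the two "a" texts end up much closer to each other than either is to "b1", which is the kind of structure a dendrogram makes visible.

---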
### Stylometry: Evaluation
<a href="img/results_ELTeC-hun.svg"><img height="200" data-src="img/delta-hun.png"></a>&nbsp;&nbsp;&nbsp;<a href="img/results_ELTeC-fra.svg"><img height="200" data-src="img/delta-fra.png"></a><br/><a href="img/results_ELTeC-rom.svg"><img height="200" data-src="img/delta-rom.png"></a>&nbsp;&nbsp;&nbsp;<a href="img/results_ELTeC-slv.svg"><img height="200" data-src="img/delta-slv.png"></a><br/><br/>(Currently: deu, eng, fra, hun, por, rom, slv)


--
## (6) Conclusion

---
### So, what is ELTeC?
* A multilingual resource, of course <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="1" -->
* A learning opportunity regarding collaborative research <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="2" -->
* A rallying point for a European, multilingual community <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="3" -->
* A foundation for the development of cross-lingual methods <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="4" -->
* A modest start for a history of European literature that would be truly digital, multilingual and diverse <!-- .element: class="fragment fade-in-then-semi-out" data-fragment-index="5" -->

---
#### Final Action Conference
<img data-src="img/conference.png" height="500"></img>

https://www.distant-reading.net/events/conference-programme/

::
- Final Action Conference
- Building, Annotating, Analysing ELTeC
- Free participation!


---
### Thank you!
<img height="500" data-src="img/danke.png">


---
### References
<small>

* Creation of ELTeC (selection)
  * Lou Burnard, Christof Schöch, Carolin Odebrecht: “In Search of Comity: TEI for Distant Reading”, in: _Journal of the Text Encoding Initiative_, 2021. https://doi.org/10.4000/jtei.3500
  * Christof Schöch, Roxana Patraș, Diana Santos, Tomaž Erjavec: “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives”, in: _Modern Languages Open_ 2021. http://doi.org/10.3828/mlo.v0i0.364
  * Cinková, Silvie, Tomaž Erjavec, Cláudia Freitas, et al., ‘Evaluation of Taggers for 19th-Century Fiction’, in DH_Budapest_2019, <http://elte-dh.hu/dh_budapest_2019-abstract-booklet/>
  * Frontini, Francesca, Carmen Brando, Joanna Byszuk et al., ‘Named Entity Recognition for Distant Reading’, in CLARIN Annual Conference 2020 Proceedings, pp. 27–41 <https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf>
  * Stanković, Ranka, Cvetana Krstev, Branislava Šandrih Todorović, and Mihailo Škorić. 2021. “Annotation of the Serbian ELTeC Collection”. Infotheca 21 (2): 43–59. https://doi.org/10.18485/infotheca.2021.21.2.3.
<br/><br/>

* Usage of ELTeC (selection)
  * Cinková, Silvie, and Jan Rybicki, ‘Stylometry in a Bilingual Setup’, in Proceedings of LREC 2020, pp. 977–984 <https://www.aclweb.org/anthology/2020.lrec-1.123/>
  * Byszuk, Joanna, Michał Woźniak, Mike Kestemont et al. ‘Detecting Direct Speech in Multilingual Collection of 19th Century Novels’, in Proceedings of LT4HALA 2020, pp. 100–104 <https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/LT4HALAbook.pdf>
  * Mihurko-Poniž, Katja, Rosario Arias, J. Berenike Herrmann et al. ‘Thresholds to the “Great Unread”: Titling Practices across Multilingual Collections of European Novels’, Day of DH 2021, <https://www.youtube.com/watch?v=fMtkwCxkzfw>.
  * Krstev, Cvetana. 2021. “White as Snow, Black as Night – Similes in Old Serbian Literary Texts”. Infotheca 21 (2): 119–36. https://doi.org/10.18485/infotheca.2021.21.2.6.
  * Škorić, Mihailo, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, and Maciej Eder. 2022. “Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution”. Mathematics 10 (5): 838. https://doi.org/10.3390/math10050838.


</small>

</textarea>
</section>
</div>
</div>

<script src="dist/reveal.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
center: false,
controls: true,
progress: true,

// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ],
});
</script>
</body>
</html>