WG1/graz-slides.xml at master · distantreading/WG1 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="file:/home/lou/Public/TEIslides/teislides.rnc" type="application/relax-ng-compact-syntax"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>In search of comity: TEI for distant reading</title>
                <author xml:id="LB42">Lou Burnard</author>
                <author>Christof Schöch</author>
                <author>Carolin Odebrecht</author>
            </titleStmt>
            <publicationStmt>
                <p>unpublished draft</p>
            </publicationStmt>
            <sourceDesc>
                <p>born digital</p>
            </sourceDesc>
        </fileDesc>
        <revisionDesc>
            <change when="2019-09-13">Created file</change>
        </revisionDesc>
    </teiHeader>
    <!--suggestions
   markup should ensure comparability to texts from different languages
   and publication histories
    more on what a text is; e.g. we create a benchmark corpus for NLP / distant reading
        this is a new perspective on using the TEI
        orientation on surface phenomena
        integrate already existing texts (refer to textgrid etc.) - re-using resources
    unique for ELTeC:
        inform about language collections: ELTeC is special in this sense that we collect also texts from languages others than English, French, German
        European picture of what a novel might be should include Romanian, Russian, Italien, Slovenian etc. languages
        stress that we do not only rely on canons - we aim to detect and encode texts that are not already in focus of DH / distant reading

       some information about the metadata plots as a tool for monitoring corpus building

YOu could add our cooperation with Christian Reul and OCR4 all


Will you add references? (cf our abstract)
-->
    <text>
        <body>
            <!--  <div type="slide">

                <figure>
                    <graphic url="media/comity-ss.png"/>
                </figure>

            </div>-->
            <div type="slide">
                <head>Searching for comity</head>
                <p><ref target="https://www.lexico.com/en/definition/comity"
                        >https://www.lexico.com/en/definition/comity</ref></p>

                <q>Those participating in conversational encounters have to have a care for the
                    preservation of good relations by promoting the other's positive self-image, by
                    avoiding offence, encouraging comity, and so on. The negotiation of meaning is
                    also a negotiation of social relations.</q>
                <p><ref target="#B11">(Widdowson, 1990, p. 110)</ref>
                </p>
                <p>Our project is to establish a comity which can bridge the gap (identified by Jan
                    Rybicki in his plenary) between the two most interesting and longest-established
                    subfields of the Digital Humanities ... </p>

            </div>
            <div type="slide">
                <head>The TEI was born interdisciplinary</head>

                <p>This is not such a new idea...</p>
                <p>Participants at the Poughkeepsie conference came from many disciplines, including
                    computer science and computer services, NLP, theoretical linguistics but also
                    literary studies, classics, historical, and medieval studies. </p>
                <p>Its advisory board and formative working groups sought varied representation,
                    both geographically and by discipline. </p>
                <p>There is ample anecdotal evidence of cross disciplinary fertilisation</p>
            </div>
            <div type="slide">
                <head>What text really is... </head>
                <list>
                    <item>"Text" is in the mind of the reader: a construction of and for a
                        particular community</item>
                    <item>A generic abstract model summarizing the <term>significant
                            properties</term> of all texts is conceivable</item>
                    <item>Such properties may usefully be considered independently of <list>
                            <item>their expression in a particular document</item>
                            <item>their use in a particular discipline</item>
                        </list></item>
                </list>
                <p>TEI has always had to mediate two orthodoxies <list>
                        <item>text is no less and no more than the documents which instantiate
                            it</item>
                        <item>text is no less and no more than a linguistic phenomenon, a bag of
                            words whose statistical properties suffice to describe it</item>
                    </list></p>
                <p rend="box">Is this model of text compatible with the model underlying the fields
                    of stylometry, stylistics, textual analytics, aka distant reading?</p>
            </div>
            <div type="slide">
                <head>COST Action 16204 “Distant Reading for European Literary History” </head>
                <figure><graphic url="media/distantreading.png" height="20%"/></figure>
                <list>
                    <item>see <ptr target="https://www.distant-reading.net"/></item>

                    <item>European network with 35 members from 23 countries bringing together
                        researchers from different disciplines and scientific backgrounds </item>
                    <item>Like TEI, COST is a community initiative fostering collaboration,
                        interoperability, and mutual understanding</item>
                    <item>Four working groups, reporting to a management committee comprising two
                        national representatives from each participating country </item></list></div>
            <div type="slide"><head>COST contd.</head>
                <list>
                    <item>We report here on WG1 ("Scholarly resources") which is charged with design
                        and construction of the the European Literary Text Collection (ELTeC). <list>
                            <item>a set of comparable corpora for each of at least a dozen European
                                languages, </item>
                            <item>a balanced selection of 100 novels from the 19th century</item>
                            <item> metadata situating them in their contexts of production and of
                                reception. </item>
                        </list></item>
                </list>
                <p rend="box">ELTeC is of course a TEI application... </p>
            </div>
            <div type="slide">
                <head>ELTeC Encoding Requirements</head>
                <list>
                    <item>support computational approaches to literary text analysis (authorship
                        attribution, topic modelling, stylistic analysis ...)</item>
                    <item>enrich corpora with metadata and impose only a minimal structure</item>
                    <item>editorial issues of lesser interest</item>
                    <item>markup should offer the encoder very little choice, and the software
                        developer very few surprises</item>
                    <item>aim to facilitate uniform and consistent access across multiple
                        corpora</item>
                </list>
                <p rend="box">Traditional TEI, by contrast, rejoices in variety, which makes
                    comparative work harder</p>
            </div>
            <div type="slide">
                <head>ELTeC encoding scheme/s</head>
                <list>
                    <item>level 0: minimal encoding scheme for texts produced manually or by OCR
                        from print originals</item>
                    <item>level 1 : somewhat richer format derivable automatically from texts
                        encoded in other formats (Word, HTML TEI ...)</item>
                    <item>level 2 : lingistically annotated and segmented</item>
                </list>
                <p rend="box">and a tightly constrained Header common to each level</p>
            </div>
            <div type="slide">
                <head>level 0 : minimal </head>
                <list>
                    <item>discard non authorial front or back matter</item>
                    <item>distinguish titlepage from other front matter</item>
                    <item>mark chapter divisions and headings, but no substructure</item>
                    <item>mark paragraphs (MLE blocks of text)</item>
                    <item>reassemble words broken across lines</item>
                    <item>discard paratext, illustrations, notes, corrections</item>
                    <item>(optionally) mark pagebreaks and highlighting</item>
                    <item>text may or may not be normalized: no indication either way</item>

                </list>
            </div>
            <div type="slide">
                <head>for example</head>
                <figure>
                    <graphic url="media/p302-level0.png"/>
                </figure>
            </div>
            <div type="slide">
                <head>level 1 : some enrichment</head>
                <list>
                    <item>mark chapter substructure with <gi>milestone</gi> and <gi>label</gi>
                        elements</item>
                    <item>interpret highlighting, where possible, using <gi>emph</gi>
                        <gi>foreign</gi> or <gi>title</gi></item>
                    <item>record authorial notes (gathered together into back)</item>
                    <item>record graphics as <gi>gap</gi></item>
                    <item>normalized forms are explicit but original forms are lost</item>

                </list>
            </div>
            <div type="slide">
                <head>level 2 : basic linguistic annotation</head>
                <list>
                    <item>end to end segmentation using <gi>s</gi></item>
                    <item>tokenisation using <gi>w</gi>, with att.linguistic attributes
                            <att>pos</att>, <att>lemma</att>, and <att>join</att></item>
                    <item>(probably) mark named entities with <gi>rs</gi></item>
                    <item>inline annotation is possible because tagging is minimal </item>
                </list>
                <p rend="box">... a work in progress</p>
            </div>
            <div type="slide">
                <head>Header checklist</head>
                <p>All headers provide, in consistent format: <list>
                        <item>Identification of title and language (<att>xml:id</att> and
                                <att>xml:lang</att> on <gi>TEI</gi>)</item>
                        <item>Title of the work, followed by the phrase <val>: ELTeC edition</val>
                                (<gi>title</gi>)</item>
                        <item>Author, in format <val>Surname, Forename/s (birthYear-deathYear)</val>
                                (<gi>author</gi>) </item>
                        <item>Statements of responsibility in format <val>encoded by Name</val>
                                (<gi>respStmt</gi>) </item>
                        <item>Date of publication in ELTeC (<gi>publicationStmt</gi>) </item>
                        <item>Source description containing one or more <gi>bibl</gi>
                            elements</item>
                        <item>Encoding level : use the <att>n</att> attribute of
                                <gi>encodingDesc</gi></item>
                        <item>Profile description specifying languages used and sampling criteria
                            values (<gi>langUsage</gi>, <gi>textDesc</gi>)</item>
                        <item>Revision description containing at least one <gi>change</gi></item>
                    </list></p>
                <p>Most requirements are enforced by the schema: others by schematron rules </p>
                <p rend="box">a compromise between desire for minimal tagging and desire for
                    precision</p>
            </div>
            <div type="slide">
                <head>Metadata -1 </head>
                <p>A novel, for us, is an original, continuous fictional text, of over 10k words,
                    published as a single work. </p>
                <list>
                    <item>Non-opportunistic design, aiming for a balanced representation of
                        pre-defined features common across european tradition</item>
                    <item>specifically: <list>
                            <item>date of first publication (one of five time slots between 1840 and
                                1920)</item>
                            <item>longevity/canonicity (high/low, as indicated by reprint
                                count)</item>
                            <item>author sex (one of 3 values as perceived in 19c)</item>
                            <item>size (short/medium/long)</item>
                        </list></item>
                    <item>Achieving a balance of these features is NOT EASY</item>
                    <item>These features are encoded in the header using a customized
                            <gi>textDesc</gi> element</item>
                </list>
                <egXML xmlns="http://www.tei-c.org/ns/Examples" rend="tiny">
                    <textDesc xmlns:e="http://distantreading.net/eltec/ns">
                        <e:authorGender key="M"/>
                        <e:size key="short"/>
                        <e:canonicity key="low"/>
                        <e:timeSlot key="T1"/>
                    </textDesc>
                </egXML>
            </div>
            <div type="slide">
                <head>Metadata - 2</head>
                <p>We aim to record the complete pedigree of our encoded texts in the header</p>
                <list>
                    <item>multiple types of <gi>bibl</gi> may be found in the <gi>sourceDesc</gi>
                        <list>
                            <item>printEdition: print edition from which our text was derived (e.g.
                                by OCR)</item>
                            <item>digitalEdition: published online edition from which our text was
                                derived</item>
                            <item>firstEdition: details of first edition, irrespective of whether
                                this was our source</item>
                        </list></item>
                    <item>multiple <gi>respStmt</gi>s acknowledge encoders, editors, etc of previous
                        editions</item>
                </list>
            </div>
            <div type="slide">
                <head>Publication issues</head>
                <p>Only out of copyright works are included, and all texts are published under CC-BY
                    licence. </p>
                <list>
                    <item>maintenance and development in publically accessible github repositories
                        at <ptr target="https://github.com/COST-ELTeC"/></item>
                    <item>Releases archived under Zenodo (first one due next month)</item>
                </list>
                <p rend="box"> FAIR Guiding Principles: <hi>F</hi>indable on Zenodo;
                    <hi>A</hi>ccessible on Github; using TEI makes it <hi>I</hi>nteroperable and
                        <hi>R</hi>e-usable </p>

            </div>
            <div type="slide">
                <head>ODD chaining</head>
                <!--<p>Rather than maintain three independent but overlapping
                    schemas we use the facility known as ODD chaining...
                  </p>-->
                <figure>
                    <graphic height="70%" url="media/eltec-chains.png"/>
                </figure>
                <cb/>
                <list>
                    <item>base ELTeC ODD selects all and only elements required for each of the
                        three schemas, and supplies generic constraints</item>
                    <item>it is processed to create a TEI library, analogous to the
                        "p5subset" supplied with TEI P5</item>
                    <item>each ELTeC level is defined by a separate ODD,
                        which selects a subset from that
                        library</item>
                </list>
                <p>Promotes consistency of documentation and application</p>
            </div>
            <div type="slide">
                <head>State of play and future work</head>
                <p>Current state is visible at <ptr target="https://distantreading.github.io/ELTeC"
                    /></p>

                <list>
                    <item>Building the next (and subsequent!) releases</item>
                    <item>Improving metadata, e.g. operationalizing canonicity counts</item>
                    <item>Testing level2 proposals</item>
                </list>
                <p rend="box">News at <ptr target="https://www.distant-reading.net/news/"/></p>
            </div>
        </body>
    </text>
</TEI>