<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Working paper on canonicity and corpus design parameters in the ELTeC context  </title>
                <author>COST Action CA16204 – WG1 </author>
            </titleStmt>
            <editionStmt>
                <edition><date>2018-09</date></edition>
            </editionStmt>
            <publicationStmt>
                <p>Unpublished discussion document prepared for COST Action 16204</p>
            </publicationStmt>
            <sourceDesc>
                <p>Extracted from sampling proposal of the COST Action 16204.</p>
            </sourceDesc>
        </fileDesc>
        <revisionDesc>
            <change> CarolinOdebrecht split up sampling proposal into two documents, add WG3 comments </change>
        </revisionDesc>
    </teiHeader>
    <text>
        <body>
            <head>Sampling criteria for the ELTeC</head>
            <div>
                <head>Outline</head>
                <list type="ordered">
                    <item>Introduction</item>
                    <item>On Canonicity</item>
                    <item>Representativeness and balance</item>
                    <item>Literature</item>
                </list>
            </div>
            <div>
                <head>Introduction</head>
                <p>The task for WG1 is to develop guidelines for data and metadata for the creation
                    of the ELTeC. This task can be split into several distinct subtasks: guidelines
                    for corpus design, basic annotation and metadata schemes, and workflow. This
                    discussion paper focuses on corpus design and metadata because the two tasks
                    are interdependent.</p>
                <p>The task for WG3 is to explore theoretical concerns that stem from the application
                    of Distant Reading methods to literary history. WG1's task of designing the
                    corpus guidelines needs to be closely coordinated with WG3
                    in order to apply the relevant textual, paratextual, and contextual genre markers
                    of the novel. The role of WG3 is crucial in formulating the relevant research
                    questions due to its literary-historical and comparative expertise.</p>
                <p>This document is a joint working paper by WG1 and WG3 on canonicity and corpus
                    design parameters in the context of the ELTeC.</p>
            </div>
            <div>
                <head>On Canonicity</head>
                <p>A canon is the portrait of someone’s prestigious social, cultural, or economic
                    status, and it reflects normative, self-promotional, legitimating, and rating
                    decisions. In contrast, a corpus design follows a research question or context
                    and is therefore oriented more towards a research goal. This makes it paramount
                    that the hypotheses and research questions be clearly defined. A second important
                    aspect is the way the actual texts are considered. As Moisl (2009: 876) puts it,
                    ‘Data is ontologically different from the world.’ So there is a difference
                    between texts in the world and the data we create. By ‘text’, we may mean the
                    manifestation, the extension, or the work of a text (cf. IFLA 2009). A canon
                    can contain an extension of a certain text which is available in different
                    languages and prints. Ontologically, these different levels of text are
                    different from what a text in a corpus might be (cf. van Zundert and Andrews
                    2017). This means that digitization is a kind of annotation, hence
                    interpretation (Odebrecht et al. 2017). A representation of a text in a corpus
                    (e.g. transcription, OCR) is the result of interpretation. A corpus design needs
                    to take this into account in sampling and digitization.</p>
                <p>At the end of the Action, ELTeC should contain literary texts (novels) from a
                    distinct period and in several languages. For each language (and thus for each
                    cultural context) there exists a diversity of canons which reflect different,
                    changing, historical perspectives on the notion of both the canon (either as
                    national or as part of world literature) and its counter-canon(s). Each canon is
                    a result of rating texts from different perspectives. The assessment can reflect
                    intellectual rating (a text is representative of a certain literary
                    period/genre/subgenre, is influential, is important), economic rating (a text
                    is published in more than one print run), or readers’ rating (a text is most
                    popular within a certain reader group) (cf. Herrmann 2011 or Winko 1996). All
                    these ratings can change over time and may also interact with each other. A
                    canon can therefore reflect different interpretations of ‘famous’, ‘important’ or
                    ‘influential’ texts. These criteria are not comparable across the board. For
                    example, texts from a smaller language community such as Czech are less likely
                    to be frequently reprinted than English texts of the same period and genre.
                    Additionally, the international visibility and awareness of particular texts
                    abroad often depend on the socio-cultural influence of the country and its
                    publishing houses.</p>
                <p>The criteria derived from a canon are not completely comparable and categorical.
                    Which prestigious group’s canon should be considered, which should be excluded,
                    and why? Are there comparable canons for novels in all countries of the language
                    in question? These questions echo the debates about world literature
                    as circulation of texts from centres to peripheries and chime with the related
                    notions of world literature as a canon of universal masterpieces, both of which
                    deserve a critical examination. Algee-Hewitt and McGurl (2015) present an approach
                    to corpus design based on several canons and show the kinds of problems that
                    arise. Each analysis of the corpus then only shows the different effects of the
                    decisions made by the normative group. Considering national canons is also very
                    difficult and somewhat problematic. An example is the German national canon of
                    literature, which was developed in the 18th century and was promoted by the
                    national educational system until 1990. Since German reunification, the educational
                    system has not promoted a strict canon and does not recommend a list of books to
                    be read in school or at university (cf. Winko 1996). Thus, taking such types of
                    canons as part of a sampling base for ELTeC would mean reflecting a political and
                    social past of German education and politics. Choosing between canons can then
                    mean choosing between tastes of (current) literature (in past and present) and
                    tastes of past literature when the canon builder rates historical texts.
                    Finally, these canons were not built to serve as sampling guidelines for the
                    corpus which we would like to build in the Action.</p>
                <p>The MoU of CA16204 formulates the goal as follows: “The main aim and objective of
                    the Action is to develop the resources and methods necessary to
                    change the way European literary history is written.” This goal requires a new
                    approach to corpus design, metadata design and annotation models. As Fowler
                    (2002, 214) puts it: ‘The current canon sets limits to our understanding of
                    literature, in several ways’. Relying on canons will obstruct the Action’s goal
                    in a fundamental way. Canons provide traditional and normative access to the
                    history of literature. In contrast, the Action focuses on new approaches that tell
                    another story. Instead of relying on canons, we might decide that our collection should contain a
                    mixture of works that have never been reprinted since their first appearance,
                    works that have been reprinted a small number of times within one or two decades
                    of their first appearance, and works that have been reprinted in almost every
                    decade since their first appearance. </p>
                <p>Therefore, we argue for a non-normative, metadata-based approach to sampling
                    criteria which follows a corpus design approach. Corpus sampling criteria
                    are mostly shaped by the research question and/or the context of the
                    group creating the corpus. In CA16204, we have neither a distinct research question
                    nor a fixed and previously known group of corpus creators. The research context of
                    the Action is more interested in knowledge production in a methodological sense
                    and does not prefer a single method, model or theory. Furthermore, the membership
                    of the Action will fluctuate and consist of researchers from different
                    disciplines with different theoretical and cultural contexts. Thus, we need to
                    build the corpus design on a methodological basis. This will enable us to select a
                    certain number of canonical texts while also being more open and inclusive
                    than mainstream literary histories.</p>
            </div>
            <div>
                <head>Representativeness and balance</head>
                <p>In addition to the aspect of ‘prestige’ (canon), the aspect of representativeness
                    is problematic for corpus design. Developing criteria for corpus design means
                    deciding which kind of sample of the world shall be included in the database.
                    Obviously, including the whole population of 19th century literature in several
                    languages is impossible. So we need to make a compromise between what we would
                    like to have in the corpus (all literature) and what we can put in the corpus
                    (sample). The biggest challenge in this process of selection will be the
                    availability of digitised texts, since the project does not provide funding
                    for new digitisation projects.</p>
                <p>It is a truism that there is no such thing as a ‘good’ or a ‘bad’ corpus, because
                    how a corpus is designed depends on what kind of corpus it is and how it is
                    going to be used. (Hunston 2008, 155)</p>
                <p>Following Hunston (2008), a corpus design needs to follow the research goal. For
                    the Action, representativeness may be the relationship between the corpus and
                    the body of literature in question. ‘Representativeness refers to the extent to
                    which a sample includes the full range of variability in a population’ (Biber
                    1993, 243). To say something about the representativeness of a corpus requires
                    knowledge about the whole population of literature (in the period in question).
                    In fact, we do not know every book in every language that was published, read,
                    or discussed in Europe in the period in question. It is further ‘impossible to
                    identify a complete list of ‘categories’ that would exhaustively account for all
                    texts produced in a given language’ (Hunston 2008, 161) or context. Such categories
                    can refer to various factors such as characteristics of authors, e.g., gender or
                    place of birth, publishers, topics of the texts, readership etc. Against the
                    background of canon building, there is also ‘no true measure of the
                    ‘significance’ of a type of discourse to a community’. The chance that a corpus
                    represents the whole population increases with the size of the
                    corpus. In this way, size and representativeness are connected. <note
                        place="foot" xml:id="ftn1" n="1"><p rend="footnote text"> See Biber (1993)
                            and Hunston (2008) for a detailed discussion.</p></note>
                    Representativeness is therefore a kind of ideal which we would like to pursue
                    but which cannot be achieved as a whole. In line with the MoU, the ELTeC can be
                    designed as a monitor corpus where texts (from different languages and periods)
                    can be added over time. </p>
                <p>Balance refers to the internal proportions of the corpus. Note that a fully
                    balanced corpus is an ideal which we can only try to achieve, alongside the ideal
                    of representativeness (Hunston 2008, 163). According to the MoU, the corpus shall
                    contain 2,500 full-text novels in at least 10 different languages:</p>
                <list type="unordered">
                    <item>Languages: Dutch, English, French, German, Greek, Italian, Polish,
                        Portuguese, Russian, Spanish (ELTeC core)</item>
                    <item>first iteration: 6 subcollections (100 novels per language) 1840 to 1920
                        starting with British, French, Spanish, German, Greek, Polish</item>
                    <item>second iteration: 4 subcollections (100 novels per language) 1840 to
                        1920</item>
                    <item>third iteration: 6 subcollections in additional languages and
                        subcollections for all 16 languages 1780-1839</item>
                    <item>We will also try to include additional languages such as Hungarian,
                        Serbian, Swiss, Romanian, Czech, Latvian, Norwegian.</item>
                </list>
                <p>In this way, ELTeC is balanced with respect to language. With respect to genre,
                    the corpus is not balanced but homogeneous; all texts in the corpus shall be full
                    novels. With respect to time, the corpus design shall focus on the period 1840
                    to 1920.</p>
                <p>Before discussing the criteria in more detail, we would like to ask another
                    methodological question concerning corpus sampling: should each criterion be
                    used to represent the variety of possible values, or
                    should the sample represent the distribution of those values across the
                    population?</p>
                <p>Let’s say we wish to select 100 texts from a population of texts published over a
                    period of (say) 20 decades. We might select five texts from the first decade,
                    five from the second, and so on, making up our 100 titles, evenly spread across
                    the possible decades. The probability that a text in our corpus will come from
                    any given decade will always be the same: 1 in 20. This selection represents the
                        <emph>variety</emph> of possible values for the criterion. Suppose now that
                    we look more closely at the number of titles from each decade actually available
                    in the population we are sampling. It’s more than likely that this number will
                    vary significantly: for example, we might notice that there are 2000 titles
                    published in decade x, and only 100 in decade y. To represent this population
                        <emph>statistically</emph> we should therefore make it 20 times more
                    probable that a randomly chosen title will come from decade x than from decade
                    y. Since the total number of titles we can choose is quite small relative to the
                    total number available in the population, strict application of this principle
                    may mean that we cannot choose any titles at all from some decades. This is one
                    reason for preferring to make our sampling represent variety rather than
                    frequency; another is that we cannot choose fractional numbers of titles. When
                    we start considering more than one criterion, the task of ensuring that the
                    numbers in our sample accurately reflect the distribution of all values across
                    the population becomes prohibitively complex. </p>
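                <p>The contrast between the two strategies can be illustrated with a minimal Python
                    sketch. It is only an illustration under stated assumptions: the population
                    figures for decades x and y are taken from the example above, while the 18
                    remaining decades (500 titles each) and all function names are hypothetical.</p>
                <eg><![CDATA[
```python
from fractions import Fraction


def equal_allocation(decades, sample_size):
    """Give every decade the same number of titles (represents variety)."""
    per_decade, remainder = divmod(sample_size, len(decades))
    if remainder:
        raise ValueError("sample size is not divisible by the number of decades")
    return {decade: per_decade for decade in decades}


def proportional_allocation(population, sample_size):
    """Exact (possibly fractional) share per decade, proportional to its population."""
    total = sum(population.values())
    return {
        decade: Fraction(sample_size * count, total)
        for decade, count in population.items()
    }


# Hypothetical population: 2000 titles in decade x, 100 in decade y,
# and an assumed 500 titles in each of 18 further decades (20 decades in all).
population = {"x": 2000, "y": 100}
population.update({f"d{i}": 500 for i in range(1, 19)})

equal = equal_allocation(list(population), 100)
proportional = proportional_allocation(population, 100)

# Only whole titles can be selected, so the fractional shares are truncated.
whole_titles = {decade: int(share) for decade, share in proportional.items()}

print(equal["x"], equal["y"])                      # same count for both decades
print(float(proportional["x"]), float(proportional["y"]))
print(whole_titles["y"])                           # decade y drops out entirely
```
]]></eg>
                <p>Under these assumptions, decade y is entitled to less than one title in a sample of
                    100, so a strictly proportional sample would leave it unrepresented, whereas equal
                    allocation keeps five titles for every decade.</p>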
                <p>Following the approach of representing the variety of a population, we then need
                    to decide which criterion is balanced in which way and interplays with other
                    criteria. For example, we may want to choose novels from male and female authors
                    in a balanced way. This may mean that in the total of all novels one half will
                    be from female authors. Without any further regulation, we might have more
                    female authors in one decade than in other decades. If we would like to have an
                    equal number of male and female authors in every decade, we need to link the
                    criterion of the author’s gender with the criterion of time. Doing this might
                    complicate the selection process (cf. finding novels for this proportion in
                    every decade of the period in question); it may also distort the reflection of
                    the changing share of women authors among novelists, which is in itself
                    important for the history of the genre. So we have to decide which categories
                    shall be present in a balanced way in the corpus.</p>
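                <p>The difference between balancing a criterion overall and balancing it within every
                    decade can be checked mechanically. The following Python sketch uses a small,
                    entirely hypothetical sample of (decade, gender) pairs; the function names and
                    the two-value gender coding are assumptions made for illustration only.</p>
                <eg><![CDATA[
```python
from collections import Counter


def is_balanced_overall(sample):
    """True if the sample contains as many female as male authors in total."""
    totals = Counter(gender for _decade, gender in sample)
    return totals["F"] == totals["M"]


def unbalanced_decades(sample):
    """Decades in which the numbers of female and male authors differ."""
    per_cell = Counter(sample)  # keys are (decade, gender) tuples
    decades = sorted({decade for decade, _gender in sample})
    return [d for d in decades if per_cell[(d, "F")] != per_cell[(d, "M")]]


# Hypothetical sample: one novel per tuple, coded by decade and author gender.
sample = (
    [(1840, "F")] + [(1840, "M")] * 3
    + [(1850, "F")] * 3 + [(1850, "M")]
)

print(is_balanced_overall(sample))
print(unbalanced_decades(sample))
```
]]></eg>
                <p>In this hypothetical sample the genders are balanced in total, yet neither decade is
                    balanced on its own; enforcing the second condition requires linking the two
                    criteria during selection.</p>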
            </div>
            <div>
                <head>Literature</head>
                <listBibl>
                    <bibl>Algee-Hewitt, Mark; McGurl, Mark (2015): <title>Between Canon and Corpus.
                            Six Perspectives on 20th-Century Novels.</title> Stanford Literary
                        Lab Pamphlet no 8. </bibl>
                    <bibl>Biber, Douglas (1993): <title level="a">Representativeness in Corpus
                            Design.</title> In: <title level="j">Literary and Linguistic Computing
                        </title>(8), 243–257.</bibl>
                    <bibl>Herrmann, Leonhard (2011): <title level="a">System? Kanon? Epoche?</title>
                        In: Matthias Beilein, Claudia Stockinger and Simone Winko (eds.):
                            <title>Kanon, Wertung und Vermittlung. Literatur in der
                            Wissensgesellschaft.</title> Berlin: De Gruyter (Studien und Texte zur
                        Sozialgeschichte der Literatur, vol. 129), pp. 59–75.</bibl>
                    <bibl>Hunston, Susan (2008): <title level="a">Collection strategies and design
                            decisions.</title> In: Anke Lüdeling and Merja Kytö (eds.): <title>Corpus
                            Linguistics. An International Handbook</title>. 2 vols. Berlin: De
                        Gruyter (1), pp. 154–168.</bibl>
                    <bibl>IFLA (2009): <title>Functional Requirements for Bibliographic
                            Records</title> (Technical Report). Available online at
                        http://www.ifla.org/publications/functional-requirements-for-bibliographic-records,
                        last accessed 23 December 2016.</bibl>
                    <bibl>Lüdeling, Anke (2011): <title level="a">Corpora in Linguistics. Sampling
                            and Annotation</title>. In: Karl Grandin (ed.): <title level="m">Going
                            Digital. Evolutionary and Revolutionary Aspects of Digitization</title>.
                        New York: Science History Publications (Nobel Symposium, 147),
                        220–243.</bibl>
                    <bibl>Moisl, Hermann (2009): <title level="a">Exploratory Multivariate
                            Analysis</title>. In: Anke Lüdeling and Merja Kytö (eds.): <title>Corpus
                            Linguistics. An International Handbook</title>. 2 vols. Berlin: De
                        Gruyter (2), pp. 874–899.</bibl>
                    <bibl>Winko, Simone (1996): <title level="a">Literarische Wertung und
                            Kanonbildung</title>. In: <title level="m">Grundzüge der
                            Literaturwissenschaft.</title> Ed. by H. L. Arnold and H. Detering.
                        München, 585–600.</bibl>
                    <bibl>van Zundert, Joris; Andrews, Tara L. (2017): <title level="a">Qu'est-ce
                            qu'un texte numérique? A new rationale for the digital representation of
                            text.</title> In: <title level="j">Digital Scholarship in the Humanities
                        </title>(32), pp. 78–88. DOI: 10.1093/llc/fqx039.</bibl>
                </listBibl>
            </div>
        </body>
    </text>
</TEI>