<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Working paper on canonicity and corpus design parameters in the ELTeC context </title>
<author>COST Action CA16204 – WG1 </author>
</titleStmt>
<editionStmt>
<edition><date>2018-09</date></edition>
</editionStmt>
<publicationStmt>
<p>Unpublished discussion document prepared for COST Action 16204</p>
</publicationStmt>
<sourceDesc>
<p>Extracted from sampling proposal of the COST Action 16204.</p>
</sourceDesc>
</fileDesc>
<revisionDesc>
<change>Carolin Odebrecht split up the sampling proposal into two documents and added WG3 comments</change>
</revisionDesc>
</teiHeader>
<text>
<body>
<head>Sampling criteria for the ELTeC</head>
<div>
<head>Outline</head>
<list type="ordered">
<item>Introduction</item>
<item>On Canonicity</item>
<item>Representativeness and balance</item>
<item>Literature</item>
</list>
</div>
<div>
<head>Introduction</head>
<p>The task for WG1 is to develop guidelines for data and metadata for the creation
of the ELTeC. This task can be split up into several distinct subtasks: guidelines
for corpus design, basic annotation and metadata schemes, and workflow. This
discussion paper focuses on corpus design and metadata because the two
are closely interdependent.</p>
<p>The task for WG3 is to explore theoretical concerns that stem from the application
of Distant Reading methods to literary history. WG1's task of designing the
corpus guidelines needs to be closely coordinated with WG3
in order to apply the relevant textual, paratextual, and contextual genre markers
of the novel. The role of WG3 is crucial in formulating the relevant research
questions due to its literary-historical and comparative expertise.</p>
<p>This document is a joint working paper by WG1 and WG3 on canonicity and corpus
design parameters in the context of the ELTeC.</p>
</div>
<div>
<head>On Canonicity</head>
<p>A canon is a portrait of someone’s prestigious social, cultural, or economic
status, and it reflects normative, self-promotional, legitimating and rating
decisions. In contrast, a corpus design follows a research question or context
and is therefore more oriented towards a research goal. This makes it paramount that
the hypotheses and research questions be clearly defined. A second important
aspect is how the actual texts are conceived. As Moisl (2009: 876) puts it,
‘Data is ontologically different from the world.’ There is thus a difference
between texts in the world and the data we create. By texts, we may mean the
manifestation or the extension or the work of a text (cf. IFLA 2009). A canon
can contain an extension of a certain text which is available in different
languages and prints. Ontologically, these different levels of text are
different from what a text in a corpus might be (cf. van Zundert and Andrews
2017). This means that digitization is a kind of annotation, and hence of
interpretation (Odebrecht et al. 2017). A representation of a text in a corpus
(e.g. transcription, OCR) is the result of interpretation. A corpus design needs
to take this into account in sampling and digitization.</p>
<p>At the end of the Action, ELTeC should contain literary texts (novels) from a
distinct period and in several languages. For each language (and thus for each
cultural context) there exists a diversity of canons which reflect different,
changing, historical perspectives on the notion of both the canon (either as
national or as part of world literature) and its counter-canon(s). Each canon is
a result of rating texts from different perspectives. The assessment can reflect
an intellectual rating (a text is representative of a certain literary
period, genre or subgenre, is influential, is important), an economic rating (a text
is published in more than one print run), or a readers’ rating (a text is most
popular within a certain group of readers) (cf. Herrmann 2011 or Winko 1996). All
these ratings can change over time and may also interact with each other. A
canon can therefore reflect different interpretations of ‘famous’, ‘important’ or
‘influential’ texts. These criteria are not comparable across the board. For example,
texts from a smaller language community such as Czech are less likely to be
frequently reprinted than English texts of the same period and genre.
Additionally, the international visibility and awareness of particular texts
abroad often depend on the socio-cultural influence of the country and its
publishing houses.</p>
<p>The criteria derived from a canon are neither fully comparable nor categorical.
Which prestigious group’s canon should be considered, which should be excluded,
and why? Are there comparable canons for novels in all countries of the language
in question? These questions echo the debates about world literature
as the circulation of texts from centres to peripheries and chime with the related
notion of world literature as a canon of universal masterpieces, both of which
deserve critical examination. Algee-Hewitt and McGurl (2015) present an approach
to corpus design based on several canons and show the kinds of problems that arise. Each
analysis of such a corpus then only shows the effects of the decisions
made by the normative group. Considering national canons is also
problematic. An example is the German national canon of literature,
which was developed in the 18th century and was promoted by the national
educational system until the 1990s. Since German reunification, the educational
system has not promoted a strict canon and does not recommend a list of books to
be read in school or at university (cf. Winko 1996). Thus, taking such
canons as part of the sampling base of ELTeC would mean reflecting the political and
social past of German education and politics. Choosing between canons can then
mean choosing between tastes in (current) literature (in past and present) and
tastes in past literature when the canon builder rates historical texts.
Finally, these canons were not built to serve as sampling guidelines for the kind of
corpus we would like to build in the Action.</p>
<p>The MoU of CA16204 formulates the goal as follows: “The main aim and objective of
the Action is to develop the resources and methods necessary to
change the way European literary history is written.” This goal requires a new
approach to corpus design, metadata design and annotation models. As Fowler
(2002, 214) puts it: ‘The current canon sets limits to our understanding of
literature, in several ways’. Relying on canons would obstruct the Action’s goal
in a fundamental way. Canons provide traditional and normative access to the
history of literature. In contrast, the Action focuses on new approaches that tell
a different story. Instead of relying on canons, we might decide that our collection
should contain a
mixture of works that have never been reprinted since their first appearance,
works that have been reprinted a small number of times within one or two decades
of their first appearance, and works that have been reprinted in almost every
decade since their first appearance. </p>
<p>Therefore, we argue for a non-normative, metadata-based approach to sampling
criteria that follows a corpus design approach. Corpus sampling criteria
are mostly derived from the research question and/or the context of the
group creating the corpus. In CA16204, we have neither a distinct research question
nor a fixed, previously known group of corpus creators. The research context of
the Action is more interested in knowledge production in a methodological sense
and does not prefer a single method, model or theory. Furthermore, the membership
of the Action will fluctuate and consist of researchers from different
disciplines with different theoretical and cultural contexts. Thus, we need to
build the corpus design on a methodological basis. This will enable us to select a
certain number of canonical texts while also being more open and inclusive
than mainstream literary histories.</p>
</div>
<div>
<head>Representativeness and balance</head>
<p>In addition to the aspect of ‘prestige’ (canon), the aspect of representativeness
is problematic for corpus design. Developing criteria for corpus design means
deciding which kind of sample of the world shall be included in the database.
Obviously, including the whole population of 19th-century literature in several
languages is impossible. So we need to make a compromise between what we would
like to have in the corpus (all literature) and what we can put in the corpus
(a sample). The biggest challenge in this process of selection will be the
availability of digitised texts, since the project does not provide funding
for new digitisation projects. </p>
<p>It is a truism that there is no such thing as a ‘good’ or a ‘bad’ corpus, because
how a corpus is designed depends on what kind of corpus it is and how it is
going to be used. (Hunston 2008, 155)</p>
<p>Following Hunston (2008), a corpus design needs to follow the research goal. For
the Action, representativeness may be the relationship between the corpus and
the body of literature in question. ‘Representativeness refers to the extent to
which a sample includes the full range of variability in a population’ (Biber
1993, 243). To say something about the representativeness of a corpus requires
knowledge about the whole population of literature (in the period in question).
In practice, we do not know every book in every language published, read or discussed in
Europe in the period in question. It is further ‘impossible to identify a
complete list of ‘categories’ that would exhaustively account for all texts
produced in a given language’ (Hunston 2008, 161) or context. Such categories
can refer to various factors such as characteristics of authors (e.g. gender or
place of birth), publishers, topics of the texts, readership etc. Against the
background of canon building, there is also ‘no true measure of the
‘significance’ of a type of discourse to a community’. The chance that a corpus
represents the whole population of something increases with the size of the
corpus. In this way, size and representativeness are connected. <note
place="foot" xml:id="ftn1" n="1"><p rend="footnote text"> See Biber (1993)
and Hunston (2008) for a detailed discussion.</p></note>
Representativeness is therefore a kind of ideal which we would like to pursue
but which cannot be achieved in full. In line with the MoU, the ELTeC can be
designed as a monitor corpus where texts (from different languages and periods)
can be added over time. </p>
<p>Balance refers to the internal proportions of the corpus. Note that a fully
balanced corpus is an ideal which we can only try to approach, alongside the ideal
of representativeness (Hunston 2008, 163). According to the MoU, the corpus shall
contain 2,500 full-text novels in at least 10 different languages:</p>
<list type="unordered">
<item>Languages: Dutch, English, French, German, Greek, Italian, Polish,
Portuguese, Russian, Spanish (ELTeC core)</item>
<item>First iteration: 6 subcollections (100 novels per language), 1840 to 1920,
starting with British, French, Spanish, German, Greek, Polish</item>
<item>Second iteration: 4 subcollections (100 novels per language), 1840 to
1920</item>
<item>Third iteration: 6 subcollections in additional languages, and
subcollections for all 16 languages, 1780 to 1839</item>
<item>We will also try to include additional languages such as Hungarian,
Serbian, Swiss, Romanian, Czech, Latvian, Norwegian.</item>
</list>
<p>In this way, ELTeC is balanced with respect to language. With respect to genre,
the corpus is not balanced but homogeneous; all texts in the corpus shall be full
novels. With respect to time, the corpus design shall focus on the period 1840
to 1920.</p>
<p>Before discussing the criteria in more detail, we would like to raise another
methodological question concerning corpus sampling: should the sample, for each
criterion, represent the variety of possible values, or
should it represent the distribution of those values across the
population? </p>
<p>Let’s say we wish to select 100 texts from a population of texts published over a
period of (say) 20 decades. We might select five texts from the first decade,
five from the second, and so on, making up our 100 titles, evenly spread across
the possible decades. The probability that a text in our corpus will come from
any given decade will always be the same: 5 in 100, or 1 in 20. This selection represents the
<emph>variety</emph> of possible values for the criterion. Suppose now that
we look more closely at the number of titles from each decade actually available
in the population we are sampling. It’s more than likely that this number will
vary significantly: for example, we might notice that there are 2000 titles
published in decade x, and only 100 in decade y. To represent this population
<emph>statistically</emph> we should therefore make it 20 times more
probable that a randomly chosen title will come from decade x than from decade
y. Since the total number of titles we can choose is quite small relative to the
total number available in the population, strict application of this principle
may mean that we cannot choose any titles at all from some decades. This is one
reason for preferring to make our sampling represent variety rather than
frequency; another is that we cannot choose fractional numbers of titles. When
we start considering more than one criterion, the task of ensuring that the
numbers in our sample accurately reflect the distribution of all values across
the population becomes prohibitively complex. </p>
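<p>The contrast between the two sampling strategies can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function names
and the population figures are hypothetical (decade 1 is given 100 titles and decade
20 is given 2000, echoing the figures used above), and nothing here is part of the
ELTeC tooling.</p>
<eg>
```python
from fractions import Fraction


def equal_allocation(decades, sample_size):
    """Allocate the same number of titles to every decade (represents variety)."""
    per_decade, remainder = divmod(sample_size, len(decades))
    if remainder:
        raise ValueError("sample size must divide evenly across the decades")
    return {decade: per_decade for decade in decades}


def proportional_allocation(population, sample_size):
    """Allocate titles in proportion to the number available per decade.

    Returns exact (possibly fractional) quotas; they have to be rounded
    before any titles can actually be selected.
    """
    total = sum(population.values())
    return {
        decade: Fraction(sample_size * count, total)
        for decade, count in population.items()
    }


# Hypothetical population: decade 1 has 100 titles, decade 20 has 2000.
population = {decade: 100 * decade for decade in range(1, 21)}

equal = equal_allocation(list(population), 100)
proportional = proportional_allocation(population, 100)
rounded = {decade: round(quota) for decade, quota in proportional.items()}

print(equal[1], equal[20])                  # prints "5 5"
print(proportional[20] / proportional[1])   # prints "20"
print(rounded[1], rounded[20])              # prints "0 10"
```
</eg>
<p>In this invented population the proportional quota for the sparsest decade is a
fraction below one half, so after rounding no title at all would be chosen from it,
whereas the equal allocation keeps five titles in every decade.</p>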
<p>Following the approach of representing the variety of a population, we then need
to decide how each criterion is to be balanced and how it interacts with other
criteria. For example, we may want to choose novels from male and female authors
in a balanced way. This may mean that, of all the novels in the corpus, one half will
be by female authors. Without any further regulation, we might have more
female authors in one decade than in other decades. If we would like to have an
equal number of male and female authors in every decade, we need to link the
criterion of the author’s gender with the criterion of time. Doing this might
complicate the selection process (cf. finding novels for this proportion in
every decade of the period in question); it may also distort the reflection of
the changing share of women authors among novelists, which is in itself
important for the history of the genre. So we have to decide which categories
shall be present in a balanced way in the corpus. </p>
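<p>The difference between balancing a criterion overall and balancing it within
every decade can be checked mechanically. The following Python sketch is purely
illustrative: the sample, the labels "F" and "M", and the function names are
hypothetical and are not taken from any ELTeC metadata scheme.</p>
<eg>
```python
from collections import Counter


def gender_totals(sample):
    """Count novels per author gender across the whole sample."""
    return Counter(gender for _decade, gender in sample)


def cell_counts(sample):
    """Count novels per (decade, gender) cell, linking the two criteria."""
    return Counter(sample)


def balanced_overall(sample):
    """True if the sample as a whole has as many "F" as "M" entries."""
    totals = gender_totals(sample)
    return totals["F"] == totals["M"]


def balanced_in_every_decade(sample):
    """True if every decade, taken separately, has as many "F" as "M" entries."""
    cells = cell_counts(sample)
    decades = {decade for decade, _gender in sample}
    return all(cells[(d, "F")] == cells[(d, "M")] for d in decades)


# Hypothetical sample of (decade, author gender) pairs.
sample = (
    [(1840, "F")] * 1 + [(1840, "M")] * 3
    + [(1850, "F")] * 3 + [(1850, "M")] * 1
)

print(balanced_overall(sample))          # prints "True"
print(balanced_in_every_decade(sample))  # prints "False"
```
</eg>
<p>The invented sample is balanced overall (four novels by women, four by men) but
not within either decade, which is the situation described above when the two
criteria are not linked.</p>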
</div>
<div>
<head>Literature</head>
<listBibl>
<bibl>Algee-Hewitt, Mark; McGurl, Mark (2015): <title>Between Canon and Corpus.
Six Perspectives on 20th-Century Novels.</title> Stanford Literary
Lab Pamphlet no. 8. </bibl>
<bibl>Biber, Douglas (1993): <title level="a">Representativeness in Corpus
Design.</title> In: <title level="j">Literary and Linguistic Computing
</title>(8), 243–257.</bibl>
<bibl>Herrmann, Leonhard (2011): <title level="a">System? Kanon? Epoche?</title>
In: Matthias Beilein, Claudia Stockinger and Simone Winko (eds.):
<title>Kanon, Wertung und Vermittlung. Literatur in der
Wissensgesellschaft.</title> Berlin: De Gruyter (Studien und Texte zur
Sozialgeschichte der Literatur, vol. 129), pp. 59–75.</bibl>
<bibl>Hunston, Susan (2008): <title level="a">Collection strategies and design
decisions.</title> In: Anke Lüdeling and Merja Kytö (eds.): <title>Corpus
Linguistics. An International Handbook</title>. 2 vols. Berlin: De
Gruyter (1), pp. 154–168.</bibl>
<bibl>IFLA (2009): <title>Functional Requirements for Bibliographic
Records</title> (Technical Report). Available online at
http://www.ifla.org/publications/functional-requirements-for-bibliographic-records,
last accessed 23 December 2016.</bibl>
<bibl>Lüdeling, Anke (2011): <title level="a">Corpora in Linguistics. Sampling
and Annotation</title>. In: Karl Grandin (ed.): <title level="m">Going
Digital. Evolutionary and Revolutionary Aspects of Digitization</title>.
New York: Science History Publications (Nobel Symposium, 147),
pp. 220–243.</bibl>
<bibl>Moisl, Hermann (2009): <title level="a">Exploratory Multivariate
Analysis</title>. In: Anke Lüdeling and Merja Kytö (eds.): <title>Corpus
Linguistics. An International Handbook</title>. 2 vols. Berlin: De
Gruyter (2), pp. 874–899.</bibl>
<bibl>Winko, Simone (1996): <title level="a">Literarische Wertung und
Kanonbildung</title>. In: <title level="m">Grundzüge der
Literaturwissenschaft.</title> Ed. by H. L. Arnold and H. Detering.
München, pp. 585–600.</bibl>
<bibl>van Zundert, Joris; Andrews, Tara L. (2017): <title level="a">Qu'est-ce
qu'un texte numérique? A new rationale for the digital representation of
text.</title> In: <title level="j">Digital Scholarship in the Humanities
</title>(32), pp. 78–88. DOI: 10.1093/llc/fqx039.</bibl>
</listBibl>
</div>
</body>
</text>
</TEI>