I've created this issue to track updates to the underlying attribution data that we're now extracting and displaying on scaife.perseus.org.
Overview
I've extracted the existing attributions (from respStmt elements) and exported them to a Google Spreadsheet, OGL - First1kGreek Attributions. I can grant access to the appropriate persons within OGL to perform bulk edits to the data.
Once the preferred edits have been made to the spreadsheet, I will use the spreadsheet to bulk update the underlying XML files with the new attribution information and open a pull request.
If this workflow works well, we can repeat it for other OGL repos (and ideally any other repos contributing texts to scaife.perseus.org).
Desired data model
Here are a few samples of what the updated respStmt elements will look like:
Thibault Clérice, Lead Developer (University of Leipzig) 2015 - 2017
From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg0062/tlg001/tlg0062.tlg001.1st1K-grc1.xml#L28
to:
<respStmt>
  <resp from="2015" to="2017">Lead Developer</resp>
  <persName ref="https://orcid.org/0000-0003-1852-9204">Thibault Clérice</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
Notes:
- We make use of from and to attrs to denote the timeframe of the resp.
- We set a person's ORCID in persName.ref
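
As a rough illustration of the shape above, here is a minimal Python sketch that builds one such respStmt with the standard library. The helper name build_resp_stmt and its parameters are hypothetical, and the sketch ignores the TEI namespace that the real files declare.

```python
import xml.etree.ElementTree as ET


def build_resp_stmt(resp, pers_name=None, org_name=None, orcid=None,
                    when=None, start=None, end=None):
    """Build one respStmt holding a single resp (hypothetical helper)."""
    stmt = ET.Element("respStmt")
    resp_el = ET.SubElement(stmt, "resp")
    resp_el.text = resp
    # A single year goes in "when"; a range goes in "from" / "to".
    if when:
        resp_el.set("when", when)
    if start:
        resp_el.set("from", start)
    if end:
        resp_el.set("to", end)
    if pers_name:
        pers_el = ET.SubElement(stmt, "persName")
        pers_el.text = pers_name
        if orcid:
            # The ORCID URL is stored in persName/@ref.
            pers_el.set("ref", orcid)
    if org_name:
        ET.SubElement(stmt, "orgName").text = org_name
    return stmt


stmt = build_resp_stmt(
    "Lead Developer",
    pers_name="Thibault Clérice",
    org_name="University of Leipzig",
    orcid="https://orcid.org/0000-0003-1852-9204",
    start="2015",
    end="2017",
)
print(ET.tostring(stmt, encoding="unicode"))
```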
Simona Stoyanova, Project Manager (University of Leipzig), 2015, Project Assistant (University of Leipzig), 2013-2014
From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/stoa0146d/stoa001/stoa0146d.stoa001.opp-grc1.xml#L47
to:
<respStmt>
  <resp when="2015">Project Manager</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
<respStmt>
  <resp from="2013" to="2014">Project Assistant</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
Notes:
- We move from a single respStmt containing two resp elements to a 1:1 relationship between respStmt and resp
- when and from|to attrs denote the resp. timeframe
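
To make the when vs. from|to rule concrete, here is a small Python sketch that maps a free-text timeframe such as "2015" or "2013-2014" to the matching attributes. The function name timeframe_attrs is hypothetical, and it assumes timeframes are written as one or two four-digit years.

```python
import re


def timeframe_attrs(text):
    """Return resp attributes for a timeframe string (hypothetical helper).

    One year maps to "when"; two years map to "from" and "to".
    Anything else yields no attributes so it can be reviewed by hand.
    """
    years = re.findall(r"\d{4}", text)
    if len(years) == 1:
        return {"when": years[0]}
    if len(years) == 2:
        return {"from": years[0], "to": years[1]}
    return {}


print(timeframe_attrs("2015"))
print(timeframe_attrs("2013-2014"))
print(timeframe_attrs("2015 - 2017"))
```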
Gregory Crane, Leonard Muellner, Bruce Robertson, Published original versions of the electronic texts, Open Greek and Latin
From First1KGreek/data/tlg0093/tlg005/tlg0093.tlg005.1st1K-grc1.xml (line 12 in 3f5519b)
to:
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Gregory Crane</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Leonard Muellner</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Bruce Robertson</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
Notes:
- We move from a single respStmt containing multiple persName elements to a 1:1 relationship between respStmt and persName.
- We also include orgName in each respStmt
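
The split into one respStmt per person could be sketched as follows. The input snippet is an assumed shape for the existing markup, not a copy of the actual file, and split_by_person is a hypothetical helper; the TEI namespace is again left out for brevity.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_person(stmt):
    """Return one respStmt per persName, repeating resp and orgName in each."""
    resps = stmt.findall("resp")
    people = stmt.findall("persName")
    orgs = stmt.findall("orgName")
    if len(people) <= 1:
        return [stmt]  # already 1:1, nothing to split
    result = []
    for person in people:
        new_stmt = ET.Element("respStmt")
        # Order children as resp, persName, orgName, matching the samples above.
        for element in resps + [person] + orgs:
            new_stmt.append(copy.deepcopy(element))
        result.append(new_stmt)
    return result


# Assumed legacy shape: several persName elements sharing one resp and orgName.
legacy = ET.fromstring(
    "<respStmt>"
    "<persName>Gregory Crane</persName>"
    "<persName>Leonard Muellner</persName>"
    "<persName>Bruce Robertson</persName>"
    "<resp>Published original versions of the electronic texts</resp>"
    "<orgName>Open Greek and Latin</orgName>"
    "</respStmt>"
)
split = split_by_person(legacy)
for new_stmt in split:
    print(ET.tostring(new_stmt, encoding="unicode"))
```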
Implementation
Extraction process
Each row in the attributions-data worksheet corresponds to a set of URNs extracted from the underlying XML files.
There are "key" and "urn" fields which should not be modified and will be used to perform the bulk update.
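
For readers who want a mental model of the rows, here is a hedged Python sketch of grouping per-file attributions so that each distinct attribution maps to a set of URNs. The record layout, the placeholder URNs and names, and the sequential "key" are all assumptions for illustration; the actual spreadsheet key may be derived differently.

```python
from collections import defaultdict


def group_attributions(records):
    """Group (urn, resp, persName, orgName) records by attribution.

    Returns one row per distinct attribution with the URNs that carry it.
    The integer "key" is a stand-in for whatever the real export uses.
    """
    grouped = defaultdict(list)
    for urn, resp, pers_name, org_name in records:
        grouped[(resp, pers_name, org_name)].append(urn)
    rows = []
    for key, (attribution, urns) in enumerate(sorted(grouped.items()), start=1):
        resp, pers_name, org_name = attribution
        rows.append({
            "key": key,
            "urn": " ".join(sorted(urns)),
            "resp": resp,
            "persName": pers_name,
            "orgName": org_name,
        })
    return rows


# Placeholder data, not taken from the repository.
records = [
    ("urn:example:text-b", "Lead Developer", "Person A", "Org X"),
    ("urn:example:text-a", "Lead Developer", "Person A", "Org X"),
    ("urn:example:text-a", "Project Manager", "Person B", "Org X"),
]
rows = group_attributions(records)
for row in rows:
    print(row)
```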
Editing attribution data in the spreadsheet
I made an initial pass to clean up the data. This involved fixing small typos in organization names, normalizing names (Mt. Allison vs. Mount Allison, etc.) and restructuring data to fit the desired model (discussed above).
The unique-* worksheets show unique values for the resp, orgName and persName.
Ideally, we can standardize on "Proofreading" vs "proofreader" vs "Proofreading and CTS conversion" as appropriate. If proofreading and CTS conversion are two distinct responsibilities for a given text, I would suggest:
- Add an additional row beneath "Proofreading and CTS conversion"
- Edit the original resp to Proofreading
- Set the resp in the new row to CTS conversion
- Copy the other relevant fields (resp, orgName and persName) to the new row
- Leave a comment on the row so I can ensure that the urn and key fields are also populated.
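
The manual steps above amount to the following transformation, sketched here in Python on rows represented as dictionaries. The function name and the dictionary layout are assumptions; in this sketch the new row simply inherits urn and key from the original.

```python
def split_combined_resp(rows,
                        combined="Proofreading and CTS conversion",
                        parts=("Proofreading", "CTS conversion")):
    """Replace each combined-resp row with one row per distinct responsibility."""
    result = []
    for row in rows:
        if row["resp"] == combined:
            for part in parts:
                # Copy every other field (persName, orgName, urn, key) unchanged.
                result.append({**row, "resp": part})
        else:
            result.append(row)
    return result


# Placeholder row, not taken from the spreadsheet.
rows = [
    {"key": 7, "urn": "urn:example:text-a", "persName": "Person A",
     "orgName": "Org X", "resp": "Proofreading and CTS conversion"},
]
updated = split_combined_resp(rows)
for row in updated:
    print(row)
```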
There are also several instances where slight variants of a person's name are used, or where resp possibly contains data better suited for orgName.
We should not delete any rows; if there are duplicate rows in the spreadsheet, we'll use the urn and key fields to de-duplicate data.
Bulk update process
Once edits have been finalized in the spreadsheet, I'll use the urn and key fields to map the edits back to the desired data model (see above).
I will also reorder the roles so that the desired "proofreading / conversion" role(s) are listed before any other roles.
I'll open a PR and link it back to this issue. Once the PR is merged, the updated attributions will be made available on scaife.perseus.org.
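
One way to express the role reordering described above is a stable sort that moves matching roles to the front while leaving everything else in its existing order. This is only a sketch: the PRIORITY_TERMS substrings and the row layout are assumptions, not the final matching rule.

```python
# Assumed substrings identifying the "proofreading / conversion" roles.
PRIORITY_TERMS = ("proofread", "conversion")


def prioritize_roles(rows):
    """Stable sort: priority roles first, original order otherwise preserved."""
    def is_other_role(row):
        resp = row["resp"].lower()
        return not any(term in resp for term in PRIORITY_TERMS)
    # False sorts before True, so priority roles come first.
    return sorted(rows, key=is_other_role)


rows = [
    {"resp": "Lead Developer"},
    {"resp": "Proofreading"},
    {"resp": "Project Manager"},
    {"resp": "CTS conversion"},
]
ordered = prioritize_roles(rows)
print([row["resp"] for row in ordered])
```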
Closing thoughts
I'm not sure if there is a "template" for future XML files, but I would be happy to take the examples in Desired data model above and integrate them into that template.
As long as the XML files have respStmt with resp and one of persName or orgName, we can extract attributions for display on scaife.perseus.org.
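
The minimum requirement stated above can be checked mechanically. The following Python sketch tests each respStmt for a resp plus at least one persName or orgName; it assumes the files use the standard TEI namespace, and is_extractable is a hypothetical name rather than part of the actual extraction code.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def is_extractable(stmt):
    """True if the respStmt has a resp and at least one persName or orgName."""
    has_resp = stmt.find(f"{TEI}resp") is not None
    has_agent = (stmt.find(f"{TEI}persName") is not None
                 or stmt.find(f"{TEI}orgName") is not None)
    return has_resp and has_agent


# Placeholder fragment, not taken from the repository.
sample = ET.fromstring(
    '<titleStmt xmlns="http://www.tei-c.org/ns/1.0">'
    "<respStmt><resp>Proofreading</resp><persName>Person A</persName></respStmt>"
    "<respStmt><resp>Proofreading</resp></respStmt>"
    "</titleStmt>"
)
results = [is_extractable(stmt) for stmt in sample.iter(f"{TEI}respStmt")]
print(results)
```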