This repository contains a dictionary-based method for identifying architectural discourse in Late Antique and Medieval Latin texts. It is applied to a test corpus of Latin inscriptions.
The workflow combines a dictionary-based approach with an evaluation against GLiNER named-entity recognition models.
EccLexica is being developed as part of the ANR E-cclesia project.
.
├── data/
│   ├── dicts/
│   │   ├── auto_terms.csv
│   │   ├── asso_terms.csv
│   │   └── mat_terms.csv
│   └── sample.csv
│
├── output/
│
├── inscr-preprocess.py
├── inscr-lemmatize.py
├── inscr-score.py
├── gliner-eval.py
└── README.md
- Main dataset
The main dataset is not stored in this repository due to its size.
Please download EDCS_text_cleaned_2022-09-12.json from Zenodo:
https://zenodo.org/records/7072337
After downloading, place the file in the following location:
data/EDCS_text_cleaned_2022-09-12.json
- Sample dataset (data/sample.csv)
  - 100 inscriptions.
  - Used for testing and GLiNER evaluation.
This repository includes three standalone command-line scripts for a split workflow:
- inscr-preprocess.py: filter raw JSON inscription data by date and province.
- inscr-lemmatize.py: lemmatize filtered texts with spaCy.
- inscr-score.py: calculate architectural scores from lemmatized texts.
- Python 3.8+
- pandas
- numpy
- spacy
- gliner
- Latin spaCy model: la_core_web_md
Install dependencies if needed:
pip install pandas numpy spacy gliner
python -m spacy download la_core_web_md

The preprocessing step (inscr-preprocess.py) loads a JSON file, filters inscriptions by the default date range and excluded provinces, and saves the filtered CSV.
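As a rough illustration of this kind of filtering, the sketch below keeps inscriptions whose dating interval overlaps a target range and whose province is not excluded. The column names (`province`, `dating_from`, `dating_to`), the year range, and the sample rows are assumptions made for the example; they are not taken from inscr-preprocess.py, whose defaults may differ.

```python
import pandas as pd


def filter_inscriptions(df, start_year, end_year, excluded_provinces):
    """Keep rows that overlap [start_year, end_year] and are not in an excluded province.

    Column names are hypothetical placeholders for the real dataset fields.
    """
    date_from = pd.to_numeric(df["dating_from"], errors="coerce")
    date_to = pd.to_numeric(df["dating_to"], errors="coerce")
    # Interval overlap test; rows with missing dates compare as False and are dropped.
    in_range = (date_from <= end_year) & (date_to >= start_year)
    allowed = ~df["province"].isin(list(excluded_provinces))
    return df[in_range & allowed].reset_index(drop=True)


# Invented sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "province": ["Roma", "Achaia", "Gallia Narbonensis", "Syria", "Dalmatia", "Hispania citerior"],
        "dating_from": [350, 400, 100, 500, None, 700],
        "dating_to": [450, 500, 200, 600, None, 900],
    }
)

filtered = filter_inscriptions(sample, 300, 800, ["Achaia", "Syria"])
print(filtered)
```

In real use the frame would come from the downloaded JSON file (for example via `pd.read_json`) and the result would be written with `to_csv`. The script itself is run from the command line as follows: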
python inscr-preprocess.py \
--input data/EDCS_text_cleaned_2022-09-12.json \
--output data/edcs_filtered_inscriptions.csv

Use --exclude-provinces to supply a comma-separated list of provinces to exclude without an interactive prompt:
python inscr-preprocess.py --exclude-provinces "Achaia,Aegyptus,Syria" --no-prompt

The lemmatization step (inscr-lemmatize.py) reads the filtered CSV and adds a lemmatized_text column.
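A minimal sketch of how such a column could be produced with a spaCy pipeline is given below. It relies only on spaCy's `Language.pipe` method and the token attributes `lemma_`, `is_punct`, and `is_space`. Lowercasing the lemmas and dropping punctuation are assumptions made for the example, not confirmed behavior of inscr-lemmatize.py.

```python
def lemmatize_texts(nlp, texts):
    """Return one space-joined lemma string per input text.

    `nlp` is any loaded spaCy pipeline with a lemmatizer,
    for example spacy.load("la_core_web_md").
    """
    lemmatized = []
    # nlp.pipe processes texts as a stream, which is faster than calling nlp() per text.
    for doc in nlp.pipe(texts):
        lemmas = [
            token.lemma_.lower()  # lowercasing is an assumption, to ease dictionary lookup
            for token in doc
            if not (token.is_punct or token.is_space)
        ]
        lemmatized.append(" ".join(lemmas))
    return lemmatized
```

With the model installed, something like `df["lemmatized_text"] = lemmatize_texts(spacy.load("la_core_web_md"), df["text"].fillna(""))` would add the column, where `text` is a hypothetical name for the inscription text column. The script itself is invoked as follows: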
python inscr-lemmatize.py \
--input data/edcs_filtered_inscriptions.csv \
--output data/edcs_lemmatized_inscriptions.csv

The scoring step (inscr-score.py) reads the lemmatized CSV, calculates architectural scores, and writes the scored output.
python inscr-score.py \
--input data/edcs_lemmatized_inscriptions.csv \
--output output/edcs_architectural_scores.csv \
--high-output output/edcs_architectural_scores_gt50.csv

To disable creation of per-score-bin subset CSV files, use:
python inscr-score.py --no-subsets

- inscr-preprocess.py may prompt interactively for excluded provinces unless --exclude-provinces or --no-prompt is used.
- inscr-lemmatize.py uses spaCy and needs the model installed.
- inscr-score.py expects the lemmatized data to contain a lemmatized_text column.
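The scoring formula itself is not described in this README. To show the general idea of dictionary-based scoring on a lemmatized_text value, the sketch below matches lemmas against three small term sets and adds one weight per category that has a hit. The term lists, the weights, and the 0 to 100 scale are all invented for illustration. The category keys only mirror the file names in data/dicts, and the threshold of 50 only mirrors the `_gt50` suffix of the high-score output file.

```python
# Hypothetical term lists; the real ones are read from the CSV files in data/dicts.
TERMS = {
    "auto": {"ecclesia", "basilica", "templum"},
    "asso": {"aedifico", "construo", "restituo"},
    "mat": {"marmor", "lapis"},
}
# Illustrative weights only, chosen so that the maximum score is 100.
WEIGHTS = {"auto": 50, "asso": 30, "mat": 20}


def match_terms(lemmatized_text, terms=TERMS):
    """Return, per category, the dictionary terms present in the lemmatized text."""
    tokens = set(lemmatized_text.lower().split())
    return {category: sorted(tokens & words) for category, words in terms.items()}


def architectural_score(lemmatized_text, terms=TERMS, weights=WEIGHTS):
    """Toy score: sum the weight of every category with at least one matching term."""
    matches = match_terms(lemmatized_text, terms)
    return sum(weights[category] for category, hits in matches.items() if hits)


texts = ["basilica marmor orno", "deus manes sacer", "templum restituo lapis"]
scores = [architectural_score(text) for text in texts]
high = [text for text, score in zip(texts, scores) if score > 50]
print(scores)
print(high)
```

Matching on whitespace-separated lemmas keeps the lookup simple, but it would miss multi-word terms; a real dictionary with such entries would need phrase matching instead.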
gliner-eval.py evaluates GLiNER NER models against the dictionary-based architectural terms.
What it does:
- Loads a scored sample dataset from output/sample.csv.
- Runs multiple GLiNER models on:
  - Lemmatized text
  - Cleaned interpretive text
- Compares GLiNER-extracted entities with dictionary-based terms.
Evaluated labels:
- building/type
- building/part
- building/material
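GLiNER takes such labels as free-text entity types at prediction time. A minimal sketch of that call is given below, using the `gliner` package's `GLiNER.from_pretrained` and `predict_entities` methods. The checkpoint name `urchade/gliner_multi-v2.1` and the 0.5 threshold are placeholders, since the models and settings used by gliner-eval.py are not listed here. The helper names are hypothetical.

```python
from collections import defaultdict

LABELS = ["building/type", "building/part", "building/material"]


def load_model(name="urchade/gliner_multi-v2.1"):
    """Load a GLiNER checkpoint; the default name is only an example."""
    from gliner import GLiNER  # imported here so the helper below works without gliner installed

    return GLiNER.from_pretrained(name)


def extract_terms(model, text, labels=LABELS, threshold=0.5):
    """Group the entity strings predicted for `text` by label, lowercased and deduplicated."""
    found = defaultdict(set)
    # Each prediction is a dict that includes at least "text" and "label".
    for entity in model.predict_entities(text, labels, threshold=threshold):
        found[entity["label"]].add(entity["text"].lower())
    return {label: sorted(found[label]) for label in labels}
```

Calling `extract_terms(load_model(), text)` once on the lemmatized text and once on the cleaned interpretive text would give two sets of predictions per inscription to compare with the dictionary terms.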
Metrics computed:
- Precision
- Recall
- F1 score
- Percentage and count of overlapping terms
- Number of inscriptions with at least one shared term
- Scores are heuristic and intended for comparative and exploratory analysis, not absolute classification.
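The evaluation metrics listed above can be sketched with plain set operations, treating the dictionary terms as the reference and the GLiNER output as the prediction. Two points are assumptions: the overlap percentage is computed against the union of both sets, and the metrics are computed on one pair of term sets at a time. gliner-eval.py may define and aggregate them differently.

```python
def evaluate_terms(predicted, reference):
    """Precision, recall, F1 and overlap of predicted terms against reference terms."""
    predicted, reference = set(predicted), set(reference)
    shared = predicted & reference
    union = predicted | reference
    precision = len(shared) / len(predicted) if predicted else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "overlap_count": len(shared),
        # Assumption: percentage is taken relative to the union of both term sets.
        "overlap_pct": 100.0 * len(shared) / len(union) if union else 0.0,
    }


def inscriptions_with_shared_term(pairs):
    """Count (predicted, reference) pairs that share at least one term."""
    return sum(1 for predicted, reference in pairs if set(predicted) & set(reference))


metrics = evaluate_terms(
    {"templum", "porticus", "marmor"},
    {"templum", "marmor", "basilica", "columna"},
)
shared_inscriptions = inscriptions_with_shared_term(
    [({"templum"}, {"templum", "ara"}), ({"murus"}, {"porta"}), (set(), {"basilica"})]
)
print(metrics)
print(shared_inscriptions)
```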
Heřmánková, Petra. “EDCS”. Zenodo, September 12, 2022. https://doi.org/10.5281/zenodo.7072337.

