🧠 DigitAI (2025): A TEI-Aware AI Tutor for XML Encoding

DigitAI is a local RAG (Retrieval-Augmented Generation) system built to help students, archivists, and researchers understand and encode TEI documents. It combines vector search with graph structure to deliver grounded, helpful answers without relying on cloud services. This is the first stage of the project, developed in 2025.

🚀 Features

  • 🔎 Hybrid Retrieval: Combines semantic embeddings (FAISS + BGE-M3) with graph-based lookup (Neo4j)
  • 🧠 Local LLM-Compatible: Designed to integrate with local models like Mistral or LLaMA via Ollama
  • ⚙️ Configurable Pipeline: Driven by digitaiCore/config.yaml for easy control over indexing, logging, and Neo4j settings
  • 🧪 Research-Ready: Supports embedding visualization, document analysis, and RAG-based exploration

📂 Project Structure

digitai/
├── digitaiCore/
│   ├── config.yaml
│   ├── config_loader.py
│   ├── embed_bge_m3.py
│   ├── neo4j_exporter.py
│   ├── faiss_index_builder.py    # [TODO] Build and save FAISS index
│   └── rag_pipeline.py           # [TODO] Hybrid FAISS + Neo4j search
├── data/
│   └── p5/
├── requirements.txt
└── README.md

⚙️ Setup

Preliminary

  • System requirements:
    • Storage: at least 7.00 GB free (approx. 1.00 GB for core dependencies and 5.20 GB for Ollama’s qwen3:8b model download, plus working headroom)
    • Other requirements: still being tested and optimized for less powerful systems
  • Python installation: 3.13 (as of Spring 2025)
  • We use this version to optimize threading performance on macOS, but DigitAI should run on either macOS or Windows.

1. Clone the Repo

git clone https://github.com/newtfire/digitai.git
cd digitai

2. Create & Activate Virtual Environment Within the Project Directory

python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

3. Install Core Dependencies

pip install -e .

4. Download and Install Ollama and the Current Qwen Model

  • Download Ollama from https://ollama.com.
  • Once installed, pull the Qwen model (the project is configured by default to use this model as provided by Ollama).
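For command-line setup, the model can be pulled with Ollama's CLI. The model tag below (qwen3:8b) is an assumption based on the storage requirements listed above; match it to whatever tag digitaiCore/config.yaml actually expects.

```shell
# Pull the Qwen model (tag assumed; confirm against digitaiCore/config.yaml)
ollama pull qwen3:8b

# Verify the model is available locally
ollama list
```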

🔧 Configuration

All settings are controlled in digitaiCore/config.yaml. This file lets you manage global options like file paths, logging, Neo4j connection details, and model settings. If you want to change where outputs go, adjust which model is used, or tweak how embeddings run, this is the place to do it.

🛑 Reminder: Never commit passwords. Use a .env file or local override if needed.
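As a rough sketch of the kinds of settings the file governs, a config might look like the following. All key names below are hypothetical illustrations, not the project's actual schema; consult the shipped digitaiCore/config.yaml for the real options.

```yaml
# Hypothetical sketch only — key names are illustrative, not DigitAI's schema.
paths:
  data_dir: data/p5
  embeddings_out: data/p5/p5Embeddings.jsonl
logging:
  level: INFO
neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  # password: load from a .env file or local override — never commit it
model:
  embedder: BAAI/bge-m3
  llm: qwen3:8b
```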


🧪 How to Run the Pipeline

The pipeline builds the vector embeddings that the local LLM needs in order to query the RAG system. If you are building this on a local computer, the following steps run the pipeline end to end. Be sure you are running Python 3.13.

Export Nodes from Neo4j

python digitaiCore/neo4j_exporter.py

Embed with BGE-M3 and Save to JSONL

python digitaiCore/embed_bge_m3.py

Output: data/p5/p5Embeddings.jsonl (and FAISS index if configured)
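The JSONL output is easy to inspect or reuse downstream. A minimal sketch of loading it, assuming each line holds one JSON object; the field names ("id", "text", "embedding") are guesses about the exporter's output, not a documented schema, so inspect a line of the real file to confirm.

```python
import json

def load_embeddings(path="data/p5/p5Embeddings.jsonl"):
    """Read one JSON record per line from a JSONL file.

    Field names like 'id', 'text', and 'embedding' are assumptions
    about the exporter's output, not a documented schema.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records
```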

Run RAG query (Currently in beta)

python digitaiCore/rag_pipeline.py

This retrieves the most relevant TEI content using both semantic similarity and structural relationships, builds a custom context window, and sends it to your local LLM to answer or explain.
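The hybrid idea can be illustrated with a toy scorer that blends cosine similarity (the vector side) with a graph-neighborhood bonus (the structural side). This is a self-contained sketch, not the logic in rag_pipeline.py; the alpha blending scheme and data shapes are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, docs, graph_neighbors, seed_id, alpha=0.7):
    """Toy hybrid scorer: semantic similarity blended with a graph bonus.

    docs: {doc_id: vector}; graph_neighbors: {node_id: set of neighbor ids}.
    A doc linked to the seed node in the graph earns a structural bonus.
    The weighting scheme is illustrative, not DigitAI's actual logic.
    """
    scores = {}
    for doc_id, vec in docs.items():
        sem = cosine(query_vec, vec)
        struct = 1.0 if doc_id in graph_neighbors.get(seed_id, set()) else 0.0
        scores[doc_id] = alpha * sem + (1 - alpha) * struct
    # Return doc ids ordered best-first by blended score
    return sorted(scores, key=scores.get, reverse=True)
```

With a high alpha the semantic score dominates; lowering it lets graph adjacency pull structurally related nodes up the ranking even when their embeddings are less similar to the query.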


🧠 What You Can Use It For

  • ✍️ Ask for Markup Help — Get suggestions for how to encode specific TEI structures
  • 🛠️ Debug Your Encoding — Review or compare your TEI markup against the official specs (coming soon)
  • 🎓 Learn TEI — Ask questions and get clear, simple explanations using real schema content

🔮 What’s Next?

  • 🎯 Fine-Tuning Prep — Begin curating examples and formatting data for instruction tuning
  • 🧪 RAG Evaluation — Test output accuracy and context relevance using real-world queries
  • 🛠️ Interface Refinement — Improve prompt formatting, response handling, and context window logic
  • 🧱 Local.yaml Overrides — Add clean support for optional secrets/config overrides
  • 📦 (Optional) Docker Packaging — Package for easier setup across devices, if needed

🛠 Current Maintainers & Contributors

Alexander C. Fisher

Role: Project Lead, Pipeline Developer, and Literature Review Co-Lead
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @afish2003

  • Designed and built the full Python pipeline: embeddings, FAISS indexing, Neo4j integration, and RAG prompting
  • Leads configuration design, system architecture, and interface logic
  • Will lead the upcoming fine-tuning phase to improve LLM performance
  • Co-leads the literature review, analyzing scholarly sources to inform system design

Hadleigh Jae Bills

Role: Data Pipeline Lead and Graph Architect
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @HadleighJae

  • Prepares and structures TEI-derived JSON for both vector and graph pipelines
  • Builds and maintains the full Neo4j graph with custom Cypher logic
  • Designs the data model the pipeline relies on and supports structural debugging
  • Leads research on TEI schema logic and contributes key sources for literature review

Michael Simons

Role: Battle-Tester and Documentation
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @mrs7068

  • Recently joined the project as a battle-tester
  • Maintains project documentation

Dr. Elisa Beshero-Bondar

Role: Faculty Advisor, XSLT Architect, and Research Lead
Affiliation: Faculty @ Penn State Behrend
GitHub: @ebeshero

  • Authored the XSLT transformation for converting TEI P5 XML into structured JSON
  • Provides core expertise in TEI, digital editing, and scholarly infrastructure
  • Leads the literature review and guides the team’s research direction

This project is part of an ongoing digital humanities research initiative at Penn State Behrend. Please cite responsibly.

