DigitAI is a local RAG (Retrieval-Augmented Generation) system built to help students, archivists, and researchers understand and encode TEI documents. It combines vector search with graph structure to give helpful, grounded answers without relying on cloud services. This is the first stage of the project, developed in 2025.
- 🔎 Hybrid Retrieval: Combines semantic embeddings (FAISS + BGE-M3) with graph-based lookup (Neo4j)
- 🧠 Local LLM-Compatible: Designed to integrate with local models like Mistral or LLaMA via Ollama
- ⚙️ Configurable Pipeline: Driven by `digitaiCore/config.yaml` for easy control over indexing, logging, and Neo4j settings
- 🧪 Research-Ready: Supports embedding visualization, document analysis, and RAG-based exploration
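The semantic half of the hybrid retrieval above can be sketched in miniature. This toy example (the element names and vectors are illustrative, not taken from the project) ranks candidates by cosine similarity, the same operation a FAISS index over BGE-M3 embeddings performs at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" -- in DigitAI these would come from BGE-M3.
docs = {
    "persName":  [0.9, 0.1, 0.0],
    "placeName": [0.6, 0.3, 0.2],
    "lb":        [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # most semantically similar entry: "persName"
```

FAISS replaces this linear scan with an approximate-nearest-neighbor index, but the scoring idea is the same.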
```
digitai/
├── digitaiCore/
│   ├── config.yaml
│   ├── config_loader.py
│   ├── embed_bge_m3.py
│   ├── neo4j_exporter.py
│   ├── faiss_index_builder.py   # [TODO] Build and save FAISS index
│   └── rag_pipeline.py          # [TODO] Hybrid FAISS + Neo4j search
├── data/
│   └── p5/
├── requirements.txt
└── README.md
```
- System requirements:
- Storage: at least 7.00 GB (approx. 1.00 GB for core dependencies and 5.20 GB for Ollama's qwen3:8b model download)
- Other requirements: still being tested and optimized for less powerful systems
- Python installation: 3.13 (as of Spring 2025)
- We use this version to optimize threading performance on macOS, but DigitAI should run on macOS or Windows.
```
git clone https://github.com/newtfire/digitai.git
cd digitai
python -m venv .venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -e .
```
- Download and install Ollama.
- Once installed, pull the Qwen model (the project is configured by default to use this model as provided by Ollama).
All settings are controlled in digitaiCore/config.yaml. This file lets you manage global options like file paths, logging, Neo4j connection details, and model settings. If you want to change where outputs go, adjust which model is used, or tweak how embeddings run, this is the place to do it.
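The exact keys depend on the current `config_loader.py`, but a file of roughly this shape is typical; treat every key name below as illustrative, not authoritative:

```yaml
# Illustrative shape only -- check digitaiCore/config.yaml for the real keys.
paths:
  tei_json: data/p5/
  embeddings_out: data/p5/p5Embeddings.jsonl
logging:
  level: INFO
neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  # password: supply via .env or a local override, never commit it
model:
  embedder: BAAI/bge-m3
  llm: qwen3:8b   # served by Ollama
```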
🛑 Reminder: Never commit passwords. Use a `.env` file or a local override if needed.
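One stdlib-only way to honor that rule is to read the secret from the environment rather than from `config.yaml`. The variable name `NEO4J_PASSWORD` and the fallback logic below are illustrative, not project API:

```python
import os

def neo4j_password(default=None):
    """Read the Neo4j password from the environment, never from config.yaml.

    A .env file can be loaded into os.environ beforehand (e.g. with
    python-dotenv); this function only consults the environment.
    """
    pw = os.environ.get("NEO4J_PASSWORD", default)
    if pw is None:
        raise RuntimeError("Set NEO4J_PASSWORD in your environment or .env file")
    return pw

os.environ["NEO4J_PASSWORD"] = "example-only"  # demo value, never a real secret
print(neo4j_password())
```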
The pipeline we have constructed builds the vector embeddings that the LLM needs in order to query the RAG system. If you are building this on a local computer, the following steps run the pipeline to build what you need. Be sure you are running Python 3.13.
```
python digitaiCore/neo4j_exporter.py
python digitaiCore/embed_bge_m3.py
```
Output: `data/p5/p5Embeddings.jsonl` (and a FAISS index if configured)
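The field names in `p5Embeddings.jsonl` may differ from what is shown here; assuming one JSON object per line with `id`/`text`/`embedding` keys (check `embed_bge_m3.py` for the real schema), loading the file back is straightforward:

```python
import io
import json

def load_embeddings(fp):
    """Yield (record_id, vector) pairs from a JSONL embeddings file.

    Assumes each line looks like {"id": ..., "text": ..., "embedding": [...]};
    these field names are illustrative.
    """
    for line in fp:
        if line.strip():
            rec = json.loads(line)
            yield rec["id"], rec["embedding"]

# Demo with an in-memory stand-in for data/p5/p5Embeddings.jsonl
sample = io.StringIO(
    '{"id": "tei-persName", "text": "<persName>", "embedding": [0.1, 0.2]}\n'
    '{"id": "tei-placeName", "text": "<placeName>", "embedding": [0.3, 0.4]}\n'
)
index = dict(load_embeddings(sample))
print(len(index))  # 2
```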
```
python digitaiCore/rag_pipeline.py
```
This grabs the most relevant TEI content using both semantic similarity and structural relationships, builds a custom context window, and sends it to your local LLM to answer or explain.
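In miniature, "semantic similarity plus structural relationships" can be read as score fusion: vector-similarity scores boosted when the graph says two pieces of content are related. Everything below (the weight, helper names, and toy graph) is illustrative, not the project's actual logic:

```python
# Toy hybrid retrieval: fuse vector-similarity scores with a graph boost.
vector_scores = {"persName": 0.92, "placeName": 0.88, "lb": 0.10}

# Toy structural graph: which elements are related in the (Neo4j) data model.
graph_neighbors = {
    "persName": {"placeName", "name"},
    "placeName": {"persName"},
    "lb": set(),
}

GRAPH_WEIGHT = 0.05  # illustrative boost per structural relationship

def hybrid_score(elem, seed="persName"):
    """Vector score plus a small boost if elem is structurally near the seed."""
    boost = GRAPH_WEIGHT if elem in graph_neighbors.get(seed, set()) else 0.0
    return vector_scores[elem] + boost

ranked = sorted(vector_scores, key=hybrid_score, reverse=True)
print(ranked)  # graph structure promotes "placeName" above "persName"
```

Note how the structural boost reorders the purely semantic ranking; that interaction is the point of hybrid retrieval.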
- ✍️ Ask for Markup Help — Get suggestions for how to encode specific TEI structures
- 🛠️ Debug Your Encoding — Review or compare your TEI markup against the official specs (coming soon)
- 🎓 Learn TEI — Ask questions and get clear, simple explanations using real schema content
- 🎯 Fine-Tuning Prep — Begin curating examples and formatting data for instruction tuning
- 🧪 RAG Evaluation — Test output accuracy and context relevance using real-world queries
- 🛠️ Interface Refinement — Improve prompt formatting, response handling, and context window logic
- 🧱 Local.yaml Overrides — Add clean support for optional secrets/config overrides
- 📦 (Optional) Docker Packaging — Package for easier setup across devices, if needed
Role: Project Lead, Pipeline Developer, and Literature Review Co-Lead
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @afish2003
- Designed and built the full Python pipeline: embeddings, FAISS indexing, Neo4j integration, and RAG prompting
- Leads configuration design, system architecture, and interface logic
- Will lead the upcoming fine-tuning phase to improve LLM performance
- Co-leads the literature review, analyzing scholarly sources to inform system design
Role: Data Pipeline Lead and Graph Architect
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @HadleighJae
- Prepares and structures TEI-derived JSON for both vector and graph pipelines
- Builds and maintains the full Neo4j graph with custom Cypher logic
- Designs the data model the pipeline relies on and supports structural debugging
- Leads research on TEI schema logic and contributes key sources for literature review
Role: Battle-Tester and Documentation Maintainer
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @mrs7068
- Recently joined the project as a battle-tester
- Maintains project documentation
Role: Faculty Advisor, XSLT Architect, and Research Lead
Affiliation: Faculty @ Penn State Behrend
GitHub: @ebeshero
- Authored the XSLT transformation for converting TEI P5 XML into structured JSON
- Provides core expertise in TEI, digital editing, and scholarly infrastructure
- Leads the literature review and guides the team’s research direction
This project is part of an ongoing digital humanities research initiative at Penn State Behrend. Please cite responsibly.