🧠 DigitAI (2025): A TEI-Aware AI Tutor for XML Encoding

DigitAI is a local RAG (Retrieval-Augmented Generation) system built to help students, archivists, and researchers understand and encode TEI documents. It combines vector search with graph structure to deliver grounded, helpful answers without relying on cloud services. This is the first stage of the project, developed in 2025.

🚀 Features

  • 🔎 Hybrid Retrieval: Combines semantic embeddings (FAISS + BGE-M3) with graph-based lookup (Neo4j)
  • 🧠 Local LLM-Compatible: Designed to integrate with local models like Mistral or LLaMA via Ollama
  • ⚙️ Configurable Pipeline: Driven by digitaiCore/config.yaml for easy control over indexing, logging, and Neo4j settings
  • 🧪 Research-Ready: Supports embedding visualization, document analysis, and RAG-based exploration

📂 Project Structure

digitai/
├── digitaiCore/
│   ├── config.yaml
│   ├── config_loader.py
│   ├── embed_bge_m3.py
│   ├── neo4j_exporter.py
│   ├── faiss_index_builder.py    # [TODO] Build and save FAISS index
│   └── rag_pipeline.py           # [TODO] Hybrid FAISS + Neo4j search
├── data/
│   └── p5/
├── requirements.txt
└── README.md

⚙️ Setup

Preliminary

  • System requirements:
    • Storage: at least 7.00 GB free (approx. 1.00 GB for core dependencies and 5.20 GB for Ollama’s qwen3:8b model download, plus working headroom)
    • Other requirements: still being tested and optimized for less powerful systems
  • Python installation: 3.13 (as of Spring 2025)
  • We use this version to optimize threading performance on macOS, but DigitAI should run on either macOS or Windows.

1. Clone the Repo

git clone https://github.com/newtfire/digitai.git
cd digitai

2. Create & Activate Virtual Environment Within the Project Directory

python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

3. Install Core Dependencies

pip install -e .

4. Download and Install Ollama and the Current Qwen Model

  • Download Ollama from https://ollama.com.
  • Once installed, pull the Qwen model (the project is configured by default to use this model as provided by Ollama).
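For command-line setup, the model can be pulled with Ollama's CLI. The model tag below (qwen3:8b) is an assumption based on the storage requirements listed above; match it to whatever tag digitaiCore/config.yaml actually expects.

```shell
# Pull the Qwen model (tag assumed; confirm against digitaiCore/config.yaml)
ollama pull qwen3:8b

# Verify the model is available locally
ollama list
```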

🔧 Configuration

All settings are controlled in digitaiCore/config.yaml. This file lets you manage global options like file paths, logging, Neo4j connection details, and model settings. If you want to change where outputs go, adjust which model is used, or tweak how embeddings run, this is the place to do it.

🛑 Reminder: Never commit passwords. Use a .env file or local override if needed.
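As a rough sketch of the kinds of settings the file governs, a config might look like the following. All key names below are hypothetical illustrations, not the project's actual schema; consult the shipped digitaiCore/config.yaml for the real options.

```yaml
# Hypothetical sketch only — key names are illustrative, not DigitAI's schema.
paths:
  data_dir: data/p5
  embeddings_out: data/p5/p5Embeddings.jsonl
logging:
  level: INFO
neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  # password: load from a .env file or local override — never commit it
model:
  embedder: BAAI/bge-m3
  llm: qwen3:8b
```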


🧪 How to Run the Pipeline

The pipeline builds the vector embeddings that the local LLM needs in order to query the RAG system. If you are building this on a local computer, the following steps run the pipeline end to end. Be sure you are running Python 3.13.

Export Nodes from Neo4j

python digitaiCore/neo4j_exporter.py

Embed with BGE-M3 and Save to JSONL

python digitaiCore/embed_bge_m3.py

Output: data/p5/p5Embeddings.jsonl (and FAISS index if configured)
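The JSONL output is easy to inspect or reuse downstream. A minimal sketch of loading it, assuming each line holds one JSON object; the field names ("id", "text", "embedding") are guesses about the exporter's output, not a documented schema, so inspect a line of the real file to confirm.

```python
import json

def load_embeddings(path="data/p5/p5Embeddings.jsonl"):
    """Read one JSON record per line from a JSONL file.

    Field names like 'id', 'text', and 'embedding' are assumptions
    about the exporter's output, not a documented schema.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records
```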

Run RAG query (Currently in beta)

python digitaiCore/rag_pipeline.py

This retrieves the most relevant TEI content using both semantic similarity and structural relationships, builds a custom context window, and sends it to your local LLM to answer or explain.
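The hybrid idea can be illustrated with a toy scorer that blends cosine similarity (the vector side) with a graph-neighborhood bonus (the structural side). This is a self-contained sketch, not the logic in rag_pipeline.py; the alpha blending scheme and data shapes are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, docs, graph_neighbors, seed_id, alpha=0.7):
    """Toy hybrid scorer: semantic similarity blended with a graph bonus.

    docs: {doc_id: vector}; graph_neighbors: {node_id: set of neighbor ids}.
    A doc linked to the seed node in the graph earns a structural bonus.
    The weighting scheme is illustrative, not DigitAI's actual logic.
    """
    scores = {}
    for doc_id, vec in docs.items():
        sem = cosine(query_vec, vec)
        struct = 1.0 if doc_id in graph_neighbors.get(seed_id, set()) else 0.0
        scores[doc_id] = alpha * sem + (1 - alpha) * struct
    # Return doc ids ordered best-first by blended score
    return sorted(scores, key=scores.get, reverse=True)
```

With a high alpha the semantic score dominates; lowering it lets graph adjacency pull structurally related nodes up the ranking even when their embeddings are less similar to the query.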


🧠 What You Can Use It For

  • ✍️ Ask for Markup Help — Get suggestions for how to encode specific TEI structures
  • 🛠️ Debug Your Encoding — Review or compare your TEI markup against the official specs (coming soon)
  • 🎓 Learn TEI — Ask questions and get clear, simple explanations using real schema content

🔮 What’s Next?

  • 🎯 Fine-Tuning Prep — Begin curating examples and formatting data for instruction tuning
  • 🧪 RAG Evaluation — Test output accuracy and context relevance using real-world queries
  • 🛠️ Interface Refinement — Improve prompt formatting, response handling, and context window logic
  • 🧱 Local.yaml Overrides — Add clean support for optional secrets/config overrides
  • 📦 (Optional) Docker Packaging — Package for easier setup across devices, if needed

🛠 Current Maintainers & Contributors

Alexander C. Fisher

Role: Project Lead, Pipeline Developer, and Literature Review Co-Lead
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @afish2003

  • Designed and built the full Python pipeline: embeddings, FAISS indexing, Neo4j integration, and RAG prompting
  • Leads configuration design, system architecture, and interface logic
  • Will lead the upcoming fine-tuning phase to improve LLM performance
  • Co-leads the literature review, analyzing scholarly sources to inform system design

Hadleigh Jae Bills

Role: Data Pipeline Lead and Graph Architect
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @HadleighJae

  • Prepares and structures TEI-derived JSON for both vector and graph pipelines
  • Builds and maintains the full Neo4j graph with custom Cypher logic
  • Designs the data model the pipeline relies on and supports structural debugging
  • Leads research on TEI schema logic and contributes key sources for literature review

Michael Simons

Role: Battle-Tester and Documentation
Affiliation: DIGIT Major @ Penn State Behrend
GitHub: @mrs7068

  • Recently joined the project as a battle-tester
  • Maintains project documentation

Dr. Elisa Beshero-Bondar

Role: Faculty Advisor, XSLT Architect, and Research Lead
Affiliation: Faculty @ Penn State Behrend
GitHub: @ebeshero

  • Authored the XSLT transformation for converting TEI P5 XML into structured JSON
  • Provides core expertise in TEI, digital editing, and scholarly infrastructure
  • Leads the literature review and guides the team’s research direction

This project is part of an ongoing digital humanities research initiative at Penn State Behrend. Please cite responsibly.

