
Shanyu-Dabbiru/tax-code-rag


tax-code-pipeline 🏛️

A production-ready RAG pipeline for indexing and querying Title 26 (Internal Revenue Code). This project focuses on the engineering challenges of handling hierarchical legal structures, metadata integrity, and system observability.

🏗 System Architecture

  • ETL & Ingestion: Custom Python-based parser for Title 26 XML. Implements cleaning logic to strip boilerplate while preserving statutory hierarchy.
  • Hierarchical Indexing: Moves beyond fixed-length windowing to semantic section-based chunking, ensuring retrieval preserves legal context.
  • Vector Store: Qdrant (Dockerized) for metadata-filtered vector search.
  • Observability: Arize Phoenix (OpenTelemetry) integration for trace logging, latency tracking, and retrieval debugging.
  • Evaluation: Quantitative benchmarking via Ragas (Faithfulness, Relevancy, and Context Precision).
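The section-based chunking described above can be sketched as follows. This is a minimal illustration, not the repository's parser: the sample markup is a hypothetical, simplified stand-in for the Title 26 XML (the real schema is namespaced and much richer), the body text is placeholder wording rather than statutory language, and the `Chunk` class and `chunk_sections` function are names invented for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, simplified markup. The real Title 26 XML differs;
# the body text below is placeholder wording, not statutory language.
SAMPLE_XML = """
<title num="26">
  <section num="61">
    <heading>Gross income defined</heading>
    <subsection num="a">
      <content>Placeholder text for
        subsection (a).</content>
    </subsection>
    <subsection num="b">
      <content>Placeholder text for subsection (b).</content>
    </subsection>
  </section>
</title>
"""


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_sections(xml_text: str) -> list[Chunk]:
    """Emit one chunk per subsection, carrying its place in the hierarchy."""
    root = ET.fromstring(xml_text)
    title_num = root.get("num")
    chunks = []
    for section in root.iter("section"):
        sec_num = section.get("num")
        sec_heading = (section.findtext("heading") or "").strip()
        for sub in section.findall("subsection"):
            sub_num = sub.get("num")
            # Collapse runs of whitespace left over from the markup layout.
            body = " ".join((sub.findtext("content") or "").split())
            if not body:
                continue
            citation = f"{title_num} U.S.C. § {sec_num}({sub_num})"
            chunks.append(
                Chunk(
                    # Prefix the citation and heading so the chunk stays
                    # interpretable when retrieved on its own.
                    text=f"{citation} {sec_heading}: {body}",
                    metadata={
                        "title": title_num,
                        "section": sec_num,
                        "subsection": sub_num,
                        "heading": sec_heading,
                    },
                )
            )
    return chunks


chunks = chunk_sections(SAMPLE_XML)
for chunk in chunks:
    print(chunk.metadata["section"], chunk.metadata["subsection"], "|", chunk.text)
```

Keeping `title`, `section`, and `subsection` as separate metadata fields is what would let a vector store restrict a query to a single section before ranking by similarity.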

🛠 Tech Stack

  • Python 3.10+
  • Infrastructure: Docker, Docker Compose
  • Vector DB: Qdrant
  • Observability: Arize Phoenix / OpenTelemetry
  • Orchestration: LlamaIndex (or LangChain)

📂 Repository Structure

.
├── src/
│   ├── ingestion/     # XML/HTML parsing & cleaning
│   ├── processing/    # Hierarchical chunking & embedding
│   ├── retrieval/     # Hybrid search & reranking logic
│   └── api/           # FastAPI entry points
├── infra/             # Docker Compose & DB configs
├── eval/              # Ragas evaluation scripts
└── data/              # Raw Title 26 sources (git-ignored)
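The README does not say how the hybrid search in `retrieval/` combines its result lists. One common option is reciprocal rank fusion (RRF), sketched below purely as an assumption. The function name, the chunk IDs, and the constant `k = 60` (a conventional default) are illustrative and not taken from the repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each list contributes 1 / (k + rank) per ID, with ranks starting at 1,
    so an ID that ranks well in several lists rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical result lists from a dense (vector) and a keyword retriever.
dense_hits = ["sec61a", "sec62a", "sec63b"]
keyword_hits = ["sec63b", "sec61a", "sec162a"]

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

RRF uses only rank positions, so it needs no score normalization between the vector and keyword retrievers. A reranker could then reorder the top of the fused list.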

About

A production-grade RAG pipeline for the Internal Revenue Code (Title 26) featuring hierarchical chunking and hybrid search.
