- Create a new conda environment called `data_decomposer`:

  ```
  conda create --name data_decomposer python=3.10
  ```

- Activate the environment:

  ```
  conda activate data_decomposer
  ```

- Install requirements:

  ```
  pip install -r requirements.txt
  ```
The repository is organized into the following main directories:

- `core/`: Core pipeline interfaces and system setup
  - `base_implementation.py`: Abstract base class for all implementations
  - `config.py`: Configuration management
  - `factory.py`: Factory pattern for creating implementation instances
- `data/`: Data storage for input datasets
- `data_processing/`: Data processing and question generation scripts
  - Jupyter notebooks for generating questions from different data types (passages, tables, etc.)
- `implementations/`: Contains different system implementations
  - `symphony/`: Symphony implementation with data decomposition and execution
  - `ReSP/`: Retrieval-enhanced Structured Processing implementation
  - `XMODE/`: Cross-modal data handling implementation
  - `baseline/`: Baseline implementation for comparison
- `results/`: Results storage for evaluation outputs
- `results_v2/`: Extended results storage with additional metrics
- `scripts/`: Command-line tools and utilities
  - `auto_extract_embeddings.py`: Extract embeddings from data using a GPT embedding model
  - `build_index.py`: Build search indices for data retrieval
  - `run_query.py`: Run queries against the system
  - `train.py`: Train a T5-based autoencoder model
  - `extract_embeddings.py`: Extract embeddings from the trained T5-based autoencoder model
  - `passage_embedd_and_index.py`: Process and index passage data
  - `build_representation_index.py`: Build indices for cross-modal representations
  - `csv_to_sqlite.py`: Convert CSV data to SQLite database format
- `tests/`: Test suite for validating system functionality
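The `core/` layout above pairs an abstract base class (`base_implementation.py`) with a factory (`factory.py`). The repository's actual class and function names are not shown in this README, so the sketch below is purely illustrative: a minimal registry-based factory over a `BaseImplementation` ABC, with hypothetical names throughout.

```python
from abc import ABC, abstractmethod

class BaseImplementation(ABC):
    """Abstract base class every implementation must subclass (hypothetical name)."""

    @abstractmethod
    def answer(self, query: str) -> str:
        ...

# Registry mapping implementation names (e.g. "symphony", "baseline") to classes.
_REGISTRY: dict[str, type[BaseImplementation]] = {}

def register(name: str):
    """Class decorator that adds an implementation to the registry."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

def create_implementation(name: str, **kwargs) -> BaseImplementation:
    """Factory: look up the requested implementation by name and instantiate it."""
    try:
        return _REGISTRY[name](**kwargs)
    except KeyError:
        raise ValueError(f"Unknown implementation: {name!r}") from None

@register("baseline")
class BaselineImplementation(BaseImplementation):
    def answer(self, query: str) -> str:
        return f"baseline answer for: {query}"

impl = create_implementation("baseline")
print(impl.answer("What is the mechanism of action for Cetuximab?"))
```

A registry like this lets a config file select the implementation by name at runtime, which is the usual motivation for the factory pattern named above.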
To process a query with a ground truth answer for source relevance scoring:

```
python main.py --config config.yaml --ground-truth-answer "Ground truth answer text" "Your query here"
```

To evaluate the system against a dataset of queries and ground truth answers:

```
python evaluate_qa.py --config config.yaml --gt-file path/to/groundtruth.csv --output results.json
```

The ground truth file should be a CSV with the columns `question`, `answer`, `text`, and `table`, where:

- `question`: the query to process
- `answer`: the ground truth answer
- `text`: comma-separated list of expected text source files
- `table`: comma-separated list of expected table source files

Example:

```
"question","answer","text","table"
"What is the mechanism of action for Cetuximab?","Cetuximab is an EGFR binding FAB, targeting the EGFR in humans.","None","drugbank-targets"
```

All passage QA generation and processing scripts use OpenRouter for LLM calls. Place your key in `.env` at the repo root as `OPENROUTER_API_KEY`.
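The ground-truth format described above can be loaded with the standard `csv` module. The sketch below is illustrative, not code from the repo; interpreting the literal string `"None"` as an empty source list is an assumption based on the example row, and the comma-splitting of the `text`/`table` columns follows the column descriptions.

```python
import csv
import io

# Inline copy of the example row from the README, for demonstration.
SAMPLE = '''"question","answer","text","table"
"What is the mechanism of action for Cetuximab?","Cetuximab is an EGFR binding FAB, targeting the EGFR in humans.","None","drugbank-targets"
'''

def load_ground_truth(fh):
    """Parse the ground-truth CSV, splitting source columns into lists."""
    rows = []
    for row in csv.DictReader(fh):
        for col in ("text", "table"):
            raw = row[col].strip()
            # Assumption: "None" marks an empty source list.
            row[col] = [] if raw == "None" else [s.strip() for s in raw.split(",")]
        rows.append(row)
    return rows

rows = load_ground_truth(io.StringIO(SAMPLE))
print(rows[0]["table"])  # ['drugbank-targets']
```

In practice you would pass an open file handle for `path/to/groundtruth.csv` instead of the inline sample.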
End-to-end runner (the worker count decides the mode):

- Sequential (`--workers 1`):

  ```
  python data_processing/run_passage_pipeline.py --num-questions 100 --workers 1
  ```

- Parallel (`--workers` > 1):

  ```
  python data_processing/run_passage_pipeline.py --workers 4 --num-questions 100
  ```

- Optional: restrict the sources considered to the first N passages for debugging with `--limit N`.
- The passage source for each generation attempt is randomly sampled from the available passage set.
- Optional reproducibility: `--seed 42`.
- The default model is `openai/gpt-5.2` (override with `--model`).
- Outputs are CSV files:
  - All generated QA rows: `--generated-file` (default `data_processing/passage_generated.csv`)
  - Processed QA rows: `--processed-file` (default `data_processing/passage_processed.csv`)
- Each output row includes:
  - `question_id`: unique ID to reference the question
  - `short_answer`: direct short answer
  - `answer_reasoning`: brief supporting reasoning
- Files are flushed to disk every 20 written rows during execution.
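A rough sketch of the control flow the options above describe: the worker count selects sequential versus thread-pool execution, passage sampling is seeded for reproducibility, and the output CSV is flushed every 20 rows. All function names here are hypothetical and the real script's LLM call via OpenRouter is stubbed out; this is not the runner's actual code.

```python
import csv
import random
from concurrent.futures import ThreadPoolExecutor

FLUSH_EVERY = 20  # mirrors the "flushed every 20 written rows" behaviour

def generate_qa(passage: str) -> dict:
    # Stand-in for the real per-passage LLM generation call.
    return {"question_id": f"q-{passage}", "short_answer": "...",
            "answer_reasoning": "..."}

def run_pipeline(passages, num_questions, workers=1, seed=None,
                 out_path="passage_generated.csv"):
    rng = random.Random(seed)  # --seed makes sampling reproducible
    # Each generation attempt samples its passage from the available set.
    sampled = [rng.choice(passages) for _ in range(num_questions)]
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["question_id", "short_answer", "answer_reasoning"])
        writer.writeheader()

        def write_rows(results):
            for i, row in enumerate(results, start=1):
                writer.writerow(row)
                if i % FLUSH_EVERY == 0:
                    fh.flush()  # keep partial results on disk during long runs

        if workers == 1:
            write_rows(map(generate_qa, sampled))       # sequential mode
        else:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                write_rows(pool.map(generate_qa, sampled))  # parallel mode
    return out_path
```

Threads (rather than processes) are a reasonable guess here because the per-item work is a network-bound LLM call.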
Individual stages remain available:

- Generation only:

  ```
  python data_processing/generate_passage_questions.py --model openai/gpt-5.2
  ```

- Processing only:

  ```
  python data_processing/process_passage_questions.py --model openai/gpt-5.2
  ```
While building the benchmark and implementing the three methods, I used GitHub Copilot as an assistive tool. I primarily used Copilot to help write boilerplate code for functions that I planned, designed, and architected myself.