- Create a new conda environment called `data_decomposer`:

  ```
  conda create --name data_decomposer python=3.10
  ```

- Activate the environment:

  ```
  conda activate data_decomposer
  ```

- Install requirements:

  ```
  pip install -r requirements.txt
  ```
The repository is organized into the following main directories:

- `core/`: Core pipeline interfaces and system setup
  - `base_implementation.py`: Abstract base class for all implementations
  - `config.py`: Configuration management
  - `factory.py`: Factory pattern for creating implementation instances
- `data/`: Data storage for input datasets
- `data_processing/`: Data processing and question generation scripts
  - Jupyter notebooks for generating questions from different data types (passages, tables, etc.)
- `implementations/`: Contains different system implementations
  - `symphony/`: Symphony implementation with data decomposition and execution
  - `ReSP/`: Retrieval-enhanced Structured Processing implementation
  - `XMODE/`: Cross-modal data handling implementation
  - `baseline/`: Baseline implementation for comparison
- `results/`: Results storage for evaluation outputs
- `results_v2/`: Extended results storage with additional metrics
- `scripts/`: Command-line tools and utilities
  - `auto_extract_embeddings.py`: Extract embeddings from data using a GPT embedding model
  - `build_index.py`: Build search indices for data retrieval
  - `run_query.py`: Run queries against the system
  - `train.py`: Train a T5-based autoencoder model
  - `extract_embeddings.py`: Extract embeddings from the trained T5-based autoencoder model
  - `passage_embedd_and_index.py`: Process and index passage data
  - `build_representation_index.py`: Build indices for cross-modal representations
  - `csv_to_sqlite.py`: Convert CSV data to SQLite database format
- `tests/`: Test suite for validating system functionality
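The `core/` layout above pairs an abstract base class (`base_implementation.py`) with a factory (`factory.py`). The repository's actual class and function names are not shown in this README, so the sketch below is purely illustrative: a minimal registry-based factory over a `BaseImplementation` ABC, with hypothetical names throughout.

```python
from abc import ABC, abstractmethod

class BaseImplementation(ABC):
    """Abstract base class every implementation must subclass (hypothetical name)."""

    @abstractmethod
    def answer(self, query: str) -> str:
        ...

# Registry mapping implementation names (e.g. "symphony", "baseline") to classes.
_REGISTRY: dict[str, type[BaseImplementation]] = {}

def register(name: str):
    """Class decorator that adds an implementation to the registry."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

def create_implementation(name: str, **kwargs) -> BaseImplementation:
    """Factory: look up the requested implementation by name and instantiate it."""
    try:
        return _REGISTRY[name](**kwargs)
    except KeyError:
        raise ValueError(f"Unknown implementation: {name!r}") from None

@register("baseline")
class BaselineImplementation(BaseImplementation):
    def answer(self, query: str) -> str:
        return f"baseline answer for: {query}"

impl = create_implementation("baseline")
print(impl.answer("What is the mechanism of action for Cetuximab?"))
```

A registry like this lets a config file select the implementation by name at runtime, which is the usual motivation for the factory pattern named above.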
To process a query with a ground truth answer for source relevance scoring:

```
python main.py --config config.yaml --ground-truth-answer "Ground truth answer text" "Your query here"
```

To evaluate the system against a dataset of queries and ground truth answers:

```
python evaluate_qa.py --config config.yaml --gt-file path/to/groundtruth.csv --output results.json
```

The ground truth file should be a CSV with the columns `question`, `answer`, `text`, and `table`, where:

- `question`: the query to process
- `answer`: the ground truth answer
- `text`: comma-separated list of expected text source files
- `table`: comma-separated list of expected table source files

Example:

```
"question","answer","text","table"
"What is the mechanism of action for Cetuximab?","Cetuximab is an EGFR binding FAB, targeting the EGFR in humans.","None","drugbank-targets"
```

All passage QA generation and processing scripts use OpenRouter for LLM calls. Place your key in `.env` at the repo root as `OPENROUTER_API_KEY`.
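The ground-truth format described above can be loaded with the standard `csv` module. The sketch below is illustrative, not code from the repo; interpreting the literal string `"None"` as an empty source list is an assumption based on the example row, and the comma-splitting of the `text`/`table` columns follows the column descriptions.

```python
import csv
import io

# Inline copy of the example row from the README, for demonstration.
SAMPLE = '''"question","answer","text","table"
"What is the mechanism of action for Cetuximab?","Cetuximab is an EGFR binding FAB, targeting the EGFR in humans.","None","drugbank-targets"
'''

def load_ground_truth(fh):
    """Parse the ground-truth CSV, splitting source columns into lists."""
    rows = []
    for row in csv.DictReader(fh):
        for col in ("text", "table"):
            raw = row[col].strip()
            # Assumption: "None" marks an empty source list.
            row[col] = [] if raw == "None" else [s.strip() for s in raw.split(",")]
        rows.append(row)
    return rows

rows = load_ground_truth(io.StringIO(SAMPLE))
print(rows[0]["table"])  # ['drugbank-targets']
```

In practice you would pass an open file handle for `path/to/groundtruth.csv` instead of the inline sample.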
End-to-end runner (the worker count decides the mode):

- Sequential (`--workers 1`):

  ```
  python data_processing/run_passage_pipeline.py --num-questions 100 --workers 1
  ```

- Parallel (`--workers` > 1):

  ```
  python data_processing/run_passage_pipeline.py --workers 4 --num-questions 100
  ```

- Optional: restrict the sources considered to the first N passages for debugging with `--limit N`.
- The passage source for each generation attempt is randomly sampled from the available passage set.
- Optional reproducibility: `--seed 42`.
- The default model is `openai/gpt-5.2` (override with `--model`).
- Outputs are CSV files:
  - All generated QA rows: `--generated-file` (default `data_processing/passage_generated.csv`)
  - Processed QA rows: `--processed-file` (default `data_processing/passage_processed.csv`)
- Each output row includes:
  - `question_id`: unique ID to reference the question
  - `short_answer`: direct short answer
  - `answer_reasoning`: brief supporting reasoning
- Files are flushed to disk every 20 written rows during execution.
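A rough sketch of the control flow the options above describe: the worker count selects sequential versus thread-pool execution, passage sampling is seeded for reproducibility, and the output CSV is flushed every 20 rows. All function names here are hypothetical and the real script's LLM call via OpenRouter is stubbed out; this is not the runner's actual code.

```python
import csv
import random
from concurrent.futures import ThreadPoolExecutor

FLUSH_EVERY = 20  # mirrors the "flushed every 20 written rows" behaviour

def generate_qa(passage: str) -> dict:
    # Stand-in for the real per-passage LLM generation call.
    return {"question_id": f"q-{passage}", "short_answer": "...",
            "answer_reasoning": "..."}

def run_pipeline(passages, num_questions, workers=1, seed=None,
                 out_path="passage_generated.csv"):
    rng = random.Random(seed)  # --seed makes sampling reproducible
    # Each generation attempt samples its passage from the available set.
    sampled = [rng.choice(passages) for _ in range(num_questions)]
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["question_id", "short_answer", "answer_reasoning"])
        writer.writeheader()

        def write_rows(results):
            for i, row in enumerate(results, start=1):
                writer.writerow(row)
                if i % FLUSH_EVERY == 0:
                    fh.flush()  # keep partial results on disk during long runs

        if workers == 1:
            write_rows(map(generate_qa, sampled))       # sequential mode
        else:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                write_rows(pool.map(generate_qa, sampled))  # parallel mode
    return out_path
```

Threads (rather than processes) are a reasonable guess here because the per-item work is a network-bound LLM call.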
Individual stages remain available:

- Generation only:

  ```
  python data_processing/generate_passage_questions.py --model openai/gpt-5.2
  ```

- Processing only:

  ```
  python data_processing/process_passage_questions.py --model openai/gpt-5.2
  ```
While building the benchmark and implementing the three methods, I used GitHub Copilot as an assistive tool. I primarily used Copilot to help write boilerplate code for functions that I planned, designed, and architected myself.