DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management & Sharing Plans (DMPs) using an end-to-end Retrieval-Augmented Generation (RAG) workflow with a Large Language Model (LLM). It ingests documents, builds and searches a vector index, and drafts a DMP through a FastAPI web UI.
This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.
👉 Learn more: DMP-Chef.
The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt. We also retain dmp-template.md as the prompt template used by the DMP generation workflow.
- `src/data_ingestion.py` — Loads, cleans, and chunks documents; builds the vector index.
- `src/core_pipeline_UI.py` — Retrieves relevant chunks and generates the final output.
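To illustrate how these two stages fit together, here is a minimal retrieve-then-draft sketch. The function names and the word-overlap scorer are purely illustrative; the actual pipeline uses an embedding-based vector index and an LLM call, not the toy stand-ins below.

```python
# Toy sketch of the RAG flow: score chunks, keep the top-k, fold them
# into a prompt. Hypothetical names; not the project's real retriever.

def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def draft_section(query: str, chunks: list[str]) -> str:
    """Stand-in for the LLM call: fold retrieved context into a prompt."""
    context = "\n".join(retrieve(query, chunks))
    return f"PROMPT:\n{context}\nQUESTION: {query}"

chunks = [
    "Genomic data must be deposited in an NIH-designated repository.",
    "Budget justifications are covered in a separate section.",
    "Data will be shared no later than the time of publication.",
]
print(draft_section("When must data be shared?", chunks))
```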
dmpchef/
│── app.py # FastAPI entry point (defines `app = FastAPI()` + API routes). Run: `uvicorn app:app --reload`
│── README.md # Project overview, setup instructions, usage examples, API docs
│── requirements.txt # Python dependencies for `pip install -r requirements.txt`
│── setup.py # Optional packaging config (enables `pip install -e .` for editable installs)
│── .env # Local environment variables (keys/config) — keep private; DO NOT commit
│── .gitignore # Git ignore rules (e.g., venv, __pycache__, logs, .env, local data)
│
├── config/ # App/pipeline configuration
│ ├── __init__.py # Makes `config` importable as a package
│ ├── config.yaml # Main settings (models, paths, chunking, retriever params, etc.)
│ └── config_schema.py # Schema/validation for config (pydantic/dataclasses validation)
│
├── data/ # Input documents / datasets / outputs
│ ├── inputs/ # User-facing templates + example inputs
│ │ ├── dmp-template.md # Markdown prompt template used by the LLM
│ │ └── nih-dms-plan-template.docx # NIH blank DOCX template (used to preserve exact Word formatting)
│ ├── pdfs/ # NIH guidance PDFs used for RAG (config.paths.data_pdfs points here)
│ └── outputs/ # Generated artifacts
│ ├── md/ # Generated Markdown DMPs (config.paths.output_md points here)
│ ├── docx/ # Generated DOCX DMPs (config.paths.output_docx points here)
│ └── json/ # Generated JSON outputs (dmptool schema) (core_pipeline_UI writes here)
│
├── model/ # Model-related code + (optionally) persisted artifacts
│ ├── __init__.py # Makes `model` importable
│ └── models.py # Model definitions / wrappers (LLM + embeddings config objects, etc.)
│
├── src/ # Main application source code (core pipeline + reusable modules)
│ ├── __init__.py # Package marker for `src`
│ ├── core_pipeline_UI.py # Main RAG pipeline logic invoked by the app/UI (retrieve → prompt → generate)
│ └── data_ingestion.py # Ingestion + preprocessing + indexing utilities (load PDFs, chunk, embed, store)
│
├── prompt/ # Prompt templates and prompt utilities
│ ├── __init__.py # Package marker for `prompt`
│ └── prompt_library.py # Centralized prompt templates (system/user prompts, formatting, guardrails)
│
├── logger/ # Custom logging utilities
│ ├── __init__.py # Package marker for `logger`
│ └── custom_logger.py # Logger setup (formatters, handlers, file/console logging)
│
├── exception/ # Custom exception definitions
│ ├── __init__.py # Package marker for `exception`
│ └── custom_exception.py # Custom error classes for clearer debugging and error handling
│
├── utils/ # Shared helpers used across the project
│ ├── __init__.py # Package marker for `utils`
│ ├── config_loader.py # Loads/validates configuration (YAML/env), provides defaults
│ ├── model_loader.py # Loads LLM/embeddings clients and related model settings
│ ├── dmptool_json.py # ✅ Builds dmptool JSON output schema (used by core_pipeline_UI)
│ └── nih_docx_writer.py # ✅ Fills NIH blank DOCX template to preserve exact Word formatting
│
├── notebook_DMP_RAG/ # Notebooks / experiments / prototypes (not production code)
└── venv/ # Local virtual environment — ignore in git
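The tree above notes that `config/config_schema.py` validates the YAML settings via pydantic or dataclasses. As a hedged sketch of what such validation might look like with stdlib dataclasses, assuming illustrative field names drawn from the tree comments (the real schema may differ):

```python
from dataclasses import dataclass

@dataclass
class PathsConfig:
    # Illustrative defaults mirroring the directory layout above.
    data_pdfs: str = "data/pdfs"
    output_md: str = "data/outputs/md"
    output_docx: str = "data/outputs/docx"

@dataclass
class RetrieverConfig:
    # Hypothetical chunking/retriever parameters; not the project's actual keys.
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 4

    def __post_init__(self):
        # Reject configs that would make chunking loop or degenerate.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")

cfg = RetrieverConfig(chunk_size=800, chunk_overlap=100)
```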
Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```

macOS/Linux:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

(Optional, recommended for development)

```bash
pip install -e .
```

What happens: the app reads documents in `data/`, splits them into chunks, and builds an index (vector store) for retrieval.
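The chunking step described above can be sketched as a sliding window over the text. This is a minimal illustration with hypothetical parameters; the project's actual splitter and chunk sizes live in the config, not here.

```python
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping character windows (illustrative sizes).

    Consecutive chunks share `overlap` characters so that sentences cut at
    a boundary still appear whole in at least one chunk.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(250))
pieces = chunk_text(sample, size=100, overlap=20)
```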
Workflow
- Add reference documents to `data/`
- Run `src/data_ingestion.py` once to build the index (or enable rebuild)

Rebuild the index (if needed)
- Set `force_rebuild_index=True` in your config/YAML, or
- Delete the saved index folder (often `data/index/`) and run ingestion again
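The rebuild rules above amount to a simple decision: rebuild when the flag is set, or when no saved index exists. A minimal sketch (the function name and exact config wiring are illustrative, not the project's API):

```python
from pathlib import Path

def needs_rebuild(index_dir: str, force_rebuild_index: bool = False) -> bool:
    """Rebuild when forced via config, or when no saved index exists yet."""
    return force_rebuild_index or not Path(index_dir).exists()
```

Deleting the index folder therefore triggers a rebuild on the next ingestion run even when the flag is left off.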
Start the server from the project root (where `app.py` is):

```bash
uvicorn app:app --reload
```

Open in your browser:
http://127.0.0.1:8000/
- Open the NIH Data Management Plan Generator page.
- Fill in the form fields (Project Title, research summary, data types/source, human subjects + consent, volume/format).
- Click Generate DMP.
Generation time depends on your CPU/GPU. Output formats:
- JSON (structured)
- Markdown (NIH-style narrative)
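The structured JSON output is assembled by `utils/dmptool_json.py`. As a hedged sketch of building such a record, note that the field names below are hypothetical placeholders, not the actual dmptool schema:

```python
import json

def build_dmp_json(title: str, sections: dict[str, str]) -> str:
    """Assemble a structured DMP record (hypothetical fields, not the real schema)."""
    record = {
        "dmp": {
            "title": title,
            "sections": [
                {"heading": h, "text": t} for h, t in sections.items()
            ],
        }
    }
    return json.dumps(record, indent=2)

doc = build_dmp_json(
    "Example Project",
    {"Data Type": "Genomic sequence data (FASTQ)."},
)
```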
```bash
git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .
conda create -n dmpchef python=3.10 -y
conda activate dmpchef
python -m pip install --upgrade pip
pip install -r requirements.txt
# optional: install package
pip install -e .
# build index (example)
python src/data_ingestion.py
# start app
uvicorn app:app --reload
```

Then open:
http://127.0.0.1:8000/
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.
If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.