DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management & Sharing Plans (DMPs) using an end-to-end Retrieval-Augmented Generation (RAG) workflow with a Large Language Model (LLM). It ingests documents, builds and searches a vector index, and drafts a DMP through a FastAPI web UI.
This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.
👉 Learn more: DMP-Chef.
The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt. We also retain dmp-template.md as the prompt template used by the DMP generation workflow.
- `src/data_ingestion.py` — Loads, cleans, and chunks documents; builds the vector index.
- `src/core_pipeline_UI.py` — Retrieves relevant chunks and generates the final output.
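To illustrate how these two stages fit together, here is a minimal retrieve-then-draft sketch. The function names and the word-overlap scorer are purely illustrative; the actual pipeline uses an embedding-based vector index and an LLM call, not the toy stand-ins below.

```python
# Toy sketch of the RAG flow: score chunks, keep the top-k, fold them
# into a prompt. Hypothetical names; not the project's real retriever.

def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def draft_section(query: str, chunks: list[str]) -> str:
    """Stand-in for the LLM call: fold retrieved context into a prompt."""
    context = "\n".join(retrieve(query, chunks))
    return f"PROMPT:\n{context}\nQUESTION: {query}"

chunks = [
    "Genomic data must be deposited in an NIH-designated repository.",
    "Budget justifications are covered in a separate section.",
    "Data will be shared no later than the time of publication.",
]
print(draft_section("When must data be shared?", chunks))
```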
dmpchef/
│── app.py # FastAPI entry point (defines `app = FastAPI()` + API routes). Run: `uvicorn app:app --reload`
│── README.md # Project overview, setup instructions, usage examples, API docs
│── requirements.txt # Python dependencies for `pip install -r requirements.txt`
│── setup.py # Optional packaging config (enables `pip install -e .` for editable installs)
│── .env # Local environment variables (keys/config) — keep private; DO NOT commit
│── .gitignore # Git ignore rules (e.g., venv, __pycache__, logs, .env, local data)
│
├── config/ # App/pipeline configuration
│ ├── __init__.py # Makes `config` importable as a package
│ ├── config.yaml # Main settings (models, paths, chunking, retriever params, etc.)
│ └── config_schema.py # Schema/validation for config (pydantic/dataclasses validation)
│
├── data/ # Input documents / datasets / outputs
│ ├── inputs/ # User-facing templates + example inputs
│ │ ├── dmp-template.md # Markdown prompt template used by the LLM
│ │ └── nih-dms-plan-template.docx # NIH blank DOCX template (used to preserve exact Word formatting)
│ ├── pdfs/ # NIH guidance PDFs used for RAG (config.paths.data_pdfs points here)
│ └── outputs/ # Generated artifacts
│ ├── md/ # Generated Markdown DMPs (config.paths.output_md points here)
│ ├── docx/ # Generated DOCX DMPs (config.paths.output_docx points here)
│ └── json/ # Generated JSON outputs (dmptool schema) (core_pipeline_UI writes here)
│
├── model/ # Model-related code + (optionally) persisted artifacts
│ ├── __init__.py # Makes `model` importable
│ └── models.py # Model definitions / wrappers (LLM + embeddings config objects, etc.)
│
├── src/ # Main application source code (core pipeline + reusable modules)
│ ├── __init__.py # Package marker for `src`
│ ├── core_pipeline_UI.py # Main RAG pipeline logic invoked by the app/UI (retrieve → prompt → generate)
│ └── data_ingestion.py # Ingestion + preprocessing + indexing utilities (load PDFs, chunk, embed, store)
│
├── prompt/ # Prompt templates and prompt utilities
│ ├── __init__.py # Package marker for `prompt`
│ └── prompt_library.py # Centralized prompt templates (system/user prompts, formatting, guardrails)
│
├── logger/ # Custom logging utilities
│ ├── __init__.py # Package marker for `logger`
│ └── custom_logger.py # Logger setup (formatters, handlers, file/console logging)
│
├── exception/ # Custom exception definitions
│ ├── __init__.py # Package marker for `exception`
│ └── custom_exception.py # Custom error classes for clearer debugging and error handling
│
├── utils/ # Shared helpers used across the project
│ ├── __init__.py # Package marker for `utils`
│ ├── config_loader.py # Loads/validates configuration (YAML/env), provides defaults
│ ├── model_loader.py # Loads LLM/embeddings clients and related model settings
│ ├── dmptool_json.py # ✅ Builds dmptool JSON output schema (used by core_pipeline_UI)
│ └── nih_docx_writer.py # ✅ Fills NIH blank DOCX template to preserve exact Word formatting
│
├── notebook_DMP_RAG/ # Notebooks / experiments / prototypes (not production code)
└── venv/ # Local virtual environment — ignore in git
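The tree above notes that `config/config_schema.py` validates the YAML settings via pydantic or dataclasses. As a hedged sketch of what such validation might look like with stdlib dataclasses, assuming illustrative field names drawn from the tree comments (the real schema may differ):

```python
from dataclasses import dataclass

@dataclass
class PathsConfig:
    # Illustrative defaults mirroring the directory layout above.
    data_pdfs: str = "data/pdfs"
    output_md: str = "data/outputs/md"
    output_docx: str = "data/outputs/docx"

@dataclass
class RetrieverConfig:
    # Hypothetical chunking/retriever parameters; not the project's actual keys.
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 4

    def __post_init__(self):
        # Reject configs that would make chunking loop or degenerate.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")

cfg = RetrieverConfig(chunk_size=800, chunk_overlap=100)
```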
Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```

macOS/Linux:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

(Optional, recommended for development)

```bash
pip install -e .
```

What happens: the app reads documents in `data/`, splits them into chunks, and builds an index (vector store) for retrieval.
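The chunking step described above can be sketched as a sliding window over the text. This is a minimal illustration with hypothetical parameters; the project's actual splitter and chunk sizes live in the config, not here.

```python
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping character windows (illustrative sizes).

    Consecutive chunks share `overlap` characters so that sentences cut at
    a boundary still appear whole in at least one chunk.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(250))
pieces = chunk_text(sample, size=100, overlap=20)
```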
Workflow
- Add reference documents to `data/`
- Run `src/data_ingestion.py` once to build the index (or enable rebuild)

Rebuild the index (if needed)
- Set `force_rebuild_index=True` in your config/YAML, or
- Delete the saved index folder (often `data/index/`) and run ingestion again
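The rebuild rules above amount to a simple decision: rebuild when the flag is set, or when no saved index exists. A minimal sketch (the function name and exact config wiring are illustrative, not the project's API):

```python
from pathlib import Path

def needs_rebuild(index_dir: str, force_rebuild_index: bool = False) -> bool:
    """Rebuild when forced via config, or when no saved index exists yet."""
    return force_rebuild_index or not Path(index_dir).exists()
```

Deleting the index folder therefore triggers a rebuild on the next ingestion run even when the flag is left off.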
Start the server from the project root (where `app.py` is):

```bash
uvicorn app:app --reload
```

Open in your browser:
http://127.0.0.1:8000/
- Open the NIH Data Management Plan Generator page.
- Fill in the form fields (Project Title, research summary, data types/source, human subjects + consent, volume/format).
- Click Generate DMP.
Generation time depends on your CPU/GPU. Output formats:
- JSON (structured)
- Markdown (NIH-style narrative)
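The structured JSON output is assembled by `utils/dmptool_json.py`. As a hedged sketch of building such a record, note that the field names below are hypothetical placeholders, not the actual dmptool schema:

```python
import json

def build_dmp_json(title: str, sections: dict[str, str]) -> str:
    """Assemble a structured DMP record (hypothetical fields, not the real schema)."""
    record = {
        "dmp": {
            "title": title,
            "sections": [
                {"heading": h, "text": t} for h, t in sections.items()
            ],
        }
    }
    return json.dumps(record, indent=2)

doc = build_dmp_json(
    "Example Project",
    {"Data Type": "Genomic sequence data (FASTQ)."},
)
```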
```bash
git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .
conda create -n dmpchef python=3.10 -y
conda activate dmpchef
python -m pip install --upgrade pip
pip install -r requirements.txt
# optional: install package
pip install -e .
# build index (example)
python src/data_ingestion.py
# start app
uvicorn app:app --reload
```

Then open:
http://127.0.0.1:8000/
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.
If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.