MedPromptEval is a comprehensive framework for evaluating and improving large language models for medical question answering through systematic prompt engineering, multi-model evaluation, and detailed metrics analysis.
MedPromptEval provides a complete framework for generating high-quality system prompts that guide language models in answering medical questions, evaluating their performance, and analyzing the results. It's designed to help researchers and practitioners understand which combinations of models and prompting strategies yield the most accurate, relevant, and unbiased medical answers.
Medical question answering requires both accuracy and appropriate explanation. This framework allows you to:
- Generate diverse system prompts using different reasoning methodologies
- Test these prompts with various language models against medical QA datasets
- Analyze answer quality through comprehensive metrics
- Identify the most effective model and prompt combinations for medical QA
MedPromptEval can be used for:

- Medical LLM Research:
  - Benchmark different medical LLMs on standardized datasets
  - Identify optimal prompting strategies for different medical domains
- Medical Education:
  - Evaluate LLMs for patient education content generation
  - Ensure medical explanations are accurate and at appropriate reading levels
- Clinical Decision Support:
  - Test how well LLMs reason about medical cases
  - Identify and reduce bias in medical recommendations
- Healthcare Documentation:
  - Assess models for medical summarization tasks
  - Evaluate factual consistency between source documents and model outputs
- Supports multiple LLMs:
  - Phi-2
  - Mistral-7B
  - Llama-3-8B
  - Llama-3.2-1B
  - DeepSeek-R1-Distill-Qwen-1.5B
  - Qwen3-1.7B
  - Gemma-3-1B-IT
  - Granite-3.3-2B
  - Gemma-2-2B
  - OpenBioLLM-8B
- Generates prompts using various reasoning methodologies:
  - Chain of Thought: Step-by-step reasoning through medical concepts
  - Trigger Chain of Thought: Using prompts that elicit medical reasoning
  - Self Consistency: Generating multiple reasoning paths for verification
  - Prompt Chaining: Breaking complex medical questions into sub-prompts
  - ReAct: Reasoning and acting iteratively for clinical scenarios
  - Tree of Thoughts: Exploring multiple diagnostic or treatment branches
  - Role-Based: Assuming the persona of a relevant medical specialist
  - Metacognitive Prompting: Self-reflection on medical reasoning processes
  - Uncertainty-Based Prompting: Acknowledging knowledge limitations and providing confidence assessments
  - Guided Prompting: Using structured frameworks for medical explanations
- Separate optimized configurations for:
  - Prompt generation (creative, diverse)
  - Answer generation (factual, precise)
- Command-line interface for easy experimentation
- JSON output format for further processing
- Evaluation pipeline for testing system prompts on real QA datasets
- Multi-model evaluation capabilities for comprehensive performance comparisons
- Memory-efficient evaluation with comprehensive NLP metrics
- Multiple Prompt Types: Supports various prompt engineering techniques including Chain of Thought, Self-Consistency, ReAct, and more
- Comprehensive Metrics: Evaluates answers using multiple metrics including semantic similarity, ROUGE scores, BLEU score, and BERTScore
- Flexible Model Support: Works with any Hugging Face model
- Incremental Processing: Results are written to CSV immediately after each evaluation, providing:
  - Crash resistance
  - Progress visibility
  - Memory efficiency
- Resume Capability: Continue processing from any question using the `--resume-from` argument
- CSV Output Handling:
  - Automatic directory creation
  - Safe append mode
  - Header management
  - Long text handling
- Advanced Visualization: Comprehensive analysis tools with:
  - Model comparisons
  - Prompt type analysis
  - Metric distributions
  - Correlation analysis
  - Best configurations analysis
  - Normalized metric scoring
- `config.py`: Contains model configurations and prompt type definitions
- `prompt_generation.py`: Core implementation of the `PromptGenerator` class
- `answer_generation.py`: Handles question answering with different models
- `metrics_evaluator.py`: Evaluation module for comparing model answers with ground truth
- `pipeline.py`: Comprehensive pipeline for evaluating generated prompts on question-answering datasets
- `test_pipeline.py`: Command-line interface for running the generator
- `visualizer.py`: Visualization tools for analyzing evaluation results
- `summarize_experiments.py`: Automated analysis and visualization of experiment results
- `datasets/`: Directory containing medical QA datasets
- `results/`: Output directory for evaluation results
- `visualizations/`: Generated charts and plots from analysis
- Python 3.8+
- PyTorch
- Transformers
- Hugging Face account with API token (for some models)
- dotenv
- pandas
- tqdm
- colorama (for terminal output formatting)
- nltk, rouge, bert_score (for NLP metrics)
- sentence-transformers (for semantic embedding)
- textstat (for readability metrics)
- textblob (for sentiment analysis)
- matplotlib, seaborn (for data visualization)
- scipy (for statistical analysis)
- numpy (for numerical operations)
- Clone this repository
- Install requirements:

  ```bash
  pip install -r requirements.txt
  ```

  Alternatively, install individual packages:

  ```bash
  pip install torch transformers huggingface_hub python-dotenv pandas tqdm colorama nltk rouge bert_score sentence_transformers textstat textblob matplotlib seaborn scipy numpy
  ```

- Create a `.env` file in the root directory with your Hugging Face token:

  ```
  HUGGINGFACE_TOKEN=your_token_here
  ```

  You can get your token from your Hugging Face account settings.
Run the generator with default settings (Mistral-7B model):

```bash
python test_pipeline.py
```

Specify different parameters:

```bash
python test_pipeline.py --model phi-2 --output_dir outputs/phi2 --num_prompts 3
```

Generate prompts for specific reasoning types:

```bash
python test_pipeline.py --model phi-2 --prompt_types "chain of thought" "role based" --num_prompts 2
```

Evaluate how well the generated system prompts perform on question-answering tasks:

```bash
python pipeline.py --dataset datasets/cleaned/medquad_cleaned.csv --output results/qa_results.csv --prompt-models phi-2 --answer-models mistral-7b
```

For comprehensive multi-model evaluation, specify multiple models:

```bash
python pipeline.py --dataset datasets/cleaned/medquad.csv --output results/multi_model_results.csv --prompt-models phi-2 mistral-7b --answer-models phi-2 mistral-7b llama-3-8b --prompt-types "chain of thought" "self consistency" --prompts-per-type 2 --num-questions 5
```

This example would test:
- 2 prompt models × 3 answer models × 2 prompt types × 2 prompts per type = 24 combinations per question
- Across 5 sample questions, resulting in 120 total evaluations
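As a sanity check, the combination arithmetic can be reproduced in a few lines of Python (the counts mirror the example command's flags):

```python
# Counts taken from the example command's flags
prompt_models = 2      # phi-2, mistral-7b
answer_models = 3      # phi-2, mistral-7b, llama-3-8b
prompt_types = 2       # "chain of thought", "self consistency"
prompts_per_type = 2   # --prompts-per-type 2
num_questions = 5      # --num-questions 5

# Every (prompt model, answer model, prompt type, variation) tuple
# is evaluated once per question
per_question = prompt_models * answer_models * prompt_types * prompts_per_type
total = per_question * num_questions

print(per_question)  # 24
print(total)         # 120
```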
```bash
# Step 1: Generate prompts with Phi-2
python test_pipeline.py --model phi-2 --num_prompts 1 --output_dir outputs/phi2_prompts

# Step 2: Run evaluation with all prompt types
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/phi2_prompt_types.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 \
    --num-questions 20
```

```bash
# Evaluate all models on the same dataset with the same prompt type
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/model_comparison.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --prompt-types "chain of thought" \
    --num-questions 25
```

```bash
# Run with all available models and prompt types
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/comprehensive_evaluation.csv \
    --prompt-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --answer-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --prompt-types "chain of thought" "trigger chain of thought" "self consistency" "prompt chaining" "react" "tree of thoughts" "role based" "metacognitive prompting" "uncertainty based prompting" "guided prompting" \
    --prompts-per-type 2 \
    --num-questions 10
```

```bash
# Run with minimal memory usage
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/low_resource.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 \
    --prompt-types "chain of thought" \
    --num-questions 10 \
    --exclude-long-text \
    --no-verbose
```

- `--model`: Choose the model to use (`phi-2`, `mistral-7b`, `llama-3-8b`, `gemma-2-2b`, `openbiollm-8b`)
- `--output_dir`: Directory to save generated prompts
- `--num_prompts`: Number of prompts to generate per type
- `--prompt_types`: Specific prompt types to generate (default: all types)
- `--no_auth`: Run without Hugging Face authentication
- `--dataset`: Path to a CSV file containing question-answer pairs (required)
- `--output`: Path for the output CSV file (default: `results/qa_results.csv`)
- `--prompt-models`: Models to use for generating prompts; multiple may be specified (default: `phi-2`)
- `--answer-models`: Models to use for answering questions; multiple may be specified (default: `mistral-7b`)
- `--prompt-types`: Specific prompt types to use (default: all types)
- `--prompts-per-type`: Number of prompt variations to generate per type (default: 1)
- `--num-questions`: Number of question-answer pairs to process (default: all)
- `--no-auth`: Run without Hugging Face authentication
- `--no-metrics`: Disable all metrics evaluation
- `--no-deepeval`: Disable DeepEval metrics to reduce memory usage while still calculating basic NLP metrics
- `--exclude-long-text`: Exclude long text fields from CSV output
- `--no-verbose`: Disable colorized metrics display in the terminal
- `--list-metrics`: List all available metrics with descriptions and exit
The system uses two separate model configurations to optimize for different tasks:
- `PROMPT_MODEL_CONFIGS`: Models optimized for prompt generation
  - Higher temperature (0.7) for creativity
  - Balanced top_p (0.9) and top_k (50) for diverse suggestions
  - Shorter output (512 tokens) focused on system prompt creation
  - Higher repetition penalty (1.2) to avoid repetitive patterns
  - Optimized for generating structured, clear instructions
- `ANSWER_MODEL_CONFIGS`: Models optimized for answering medical questions
  - Lower temperature (0.3) for factual, precise answers
  - Higher max_new_tokens (1024) for more detailed responses
  - Lower top_p (0.7) and top_k (40) for more focused outputs
  - Balanced repetition penalty (1.1) for natural but focused answers
  - Optimized for generating comprehensive, accurate medical explanations

Each model configuration can be customized in `config.py` to fine-tune the generation parameters for specific use cases.
The system generates a JSON file containing:
- Model metadata
- Generated prompts for each reasoning methodology
Example output structure:
```json
{
  "metadata": {
    "model_info": {
      "name": "mistralai/Mistral-7B-v0.1",
      "description": "Mistral 7B base model, good for general instruction following"
    }
  },
  "prompts": {
    "chain of thought": [
      "You are a medical AI assistant. When answering medical questions, break down your reasoning into clear, logical steps..."
    ],
    "role based": [
      "Assume the role of a medical specialist most relevant to the question being asked..."
    ],
    ...
  }
}
```

The pipeline produces a CSV file with these columns:

- `prompt_num`: A sequential number for each prompt generated during execution
- `question`: The original question
- `correct_answer`: The ground truth answer
- `prompt_model`: Which model generated the system prompt
- `prompt_model_key`: The configuration key for the prompt model
- `prompt_type`: The type of reasoning (chain of thought, etc.)
- `prompt_variation`: The variation number for this prompt type
- `system_prompt`: The full system prompt text
- `answer_model`: Which model generated the answer
- `answer_model_key`: The configuration key for the answer model
- `model_answer`: The model's answer to the question
- Various metrics columns (see the Metrics Evaluation section)
The pipeline writes results to the CSV file incrementally after each question-answer evaluation, rather than saving everything at the end. This provides several benefits:
- Crash resistance: If the pipeline is interrupted or crashes, all processed results up to that point are already saved
- Progress visibility: You can open the CSV file while the pipeline is running to see current results
- Reduced memory usage: The pipeline doesn't need to keep all results in memory
- Reliable execution: Even with large-scale evaluations across multiple models, your results are saved as they're generated
After each result is processed, you'll see output like this:
```
✓ Result #42 saved: Q5, chain of thought (2/3), Phi-2 → Mistral-7B
```
This indicates that:
- Result #42 has been saved to the CSV file
- It's for question #5
- Using the "chain of thought" prompt type, variation 2 of 3
- Generated with Phi-2 as the prompt model and Mistral-7B as the answer model
The system includes a comprehensive metrics evaluation component that analyzes the quality of generated answers using various NLP techniques. This allows for quantitative assessment of model performance without requiring external APIs.
The metrics evaluator provides the following metrics:
Semantic and Relevance Metrics:
- `semantic_similarity`: Cosine similarity between question and answer embeddings (relevance)
- `answer_similarity`: Cosine similarity between model answer and correct answer
- `entailment_score`: Score indicating whether the model answer entails (is consistent with) the correct answer
- `entailment_label`: Classification label: "entailment", "neutral", or "contradiction"
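Both similarity metrics reduce to a cosine over embedding vectors. A minimal sketch of that computation, using toy vectors in place of real sentence-transformers embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for sentence embeddings
question_emb = [0.2, 0.9, 0.1]
answer_emb = [0.3, 0.8, 0.2]
print(round(cosine_similarity(question_emb, answer_emb), 3))  # 0.983
```

In the real pipeline the vectors come from a sentence-transformers model rather than hand-written lists.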
Text Comparison Metrics:
- `rouge1_f`, `rouge2_f`, `rougeL_f`: ROUGE metrics for text overlap assessment
- `bleu_score`: BLEU score for measuring precision
- `bertscore_precision`, `bertscore_recall`, `bertscore_f1`: BERTScore metrics for semantic evaluation
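To make the text-overlap idea concrete, here is a hypothetical, stripped-down ROUGE-1 F1 (unigram overlap only; the pipeline itself relies on the `rouge` package, which also handles ROUGE-2 and ROUGE-L):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("aspirin reduces fever and pain",
                     "aspirin lowers fever and relieves pain"), 3))  # 0.727
```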
Readability and Style Metrics:
- `answer_length`: Number of words in the answer
- `flesch_reading_ease`: Readability score (higher = easier to read)
- `flesch_kincaid_grade`: US grade level required to understand the text
- `sentiment_polarity`: Sentiment of the answer (-1 to +1)
- `sentiment_subjectivity`: Subjectivity of the answer (0 to 1)
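The readability scores follow the standard Flesch formula. A rough sketch (the pipeline uses `textstat`, whose syllable counting is more careful than this vowel-group heuristic):

```python
import re

def rough_syllables(word: str) -> int:
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = text.split()
    syllables = sum(rough_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The drug lowers blood pressure. Take it daily."), 1))  # 65.3
```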
Comparative Metrics:
- `comparison_answer_length_delta`: Difference in length between model and correct answers
- `comparison_flesch_reading_ease_delta`: Difference in readability
- `comparison_flesch_kincaid_grade_delta`: Difference in grade level
- `comparison_sentiment_polarity_delta`: Difference in sentiment
- `comparison_sentiment_subjectivity_delta`: Difference in subjectivity
- `comparison_relevance_delta`: Difference in question relevance
- `comparison_summary`: Overall summary of key differences
Reference Metrics:
For each model metric, there is a corresponding `correct_`-prefixed version that provides the same measurement for the reference answer, enabling direct comparison.
When verbose mode is enabled (default), metrics are displayed with color coding in the terminal:
- Green: Good scores
- Yellow: Moderate scores
- Red: Poor scores
This provides immediate visual feedback on answer quality during evaluation.
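The color coding can be sketched with plain ANSI escape codes (the pipeline itself uses colorama; the thresholds below are illustrative, not the exact cutoffs in `metrics_evaluator.py`):

```python
GREEN, YELLOW, RED, RESET = "\033[92m", "\033[93m", "\033[91m", "\033[0m"

def colorize_metric(name: str, value: float,
                    good: float = 0.7, moderate: float = 0.4) -> str:
    """Format a metric line colored green/yellow/red by threshold."""
    color = GREEN if value >= good else YELLOW if value >= moderate else RED
    return f"{color}{name}: {value:.3f}{RESET}"

print(colorize_metric("answer_similarity", 0.83))  # rendered in green
print(colorize_metric("bleu_score", 0.12))         # rendered in red
```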
The evaluation results are saved to CSV files with comprehensive metrics. You can analyze these results using:
- Spreadsheet Applications: Open in Excel, Google Sheets, etc. for basic filtering and visualization
- Data Analysis Libraries: Use pandas, matplotlib, or other Python libraries for advanced analysis
- Built-in Visualization Tools: Use the provided `visualizer.py` module for comprehensive visual analysis
The framework includes a powerful visualization module to help analyze and interpret evaluation results. By default, visualizations are saved in the `results/visualizations` directory, but you can customize the output location and organization:
```bash
# Generate a comprehensive report with all visualizations (default location)
python visualizer.py --results results/qa_results.csv

# Generate visualizations in a specific subfolder
python visualizer.py --results results/qa_results.csv --subfolder model_analysis

# Generate visualizations in a custom directory and subfolder
python visualizer.py --results results/qa_results.csv --output-dir results/analysis --subfolder prompt_analysis

# Generate specific visualization types
python visualizer.py --results results/qa_results.csv --report-type basic
python visualizer.py --results results/qa_results.csv --report-type metrics
python visualizer.py --results results/qa_results.csv --report-type prompts

# Analyze a specific model's performance
python visualizer.py --results results/qa_results.csv --report-type model --model mistral-7b
```

Visualizations are organized in the following structure:

```
results/
  visualizations/                # Default output directory
    [subfolder if specified]/    # Optional subfolder for organization
      model_comparison_by_answer_similarity.png
      prompt_type_comparison_multi_metric.png
      ...
```
- `comprehensive`: Generate all visualizations (default)
- `basic`: Basic model and prompt type comparisons
- `metrics`: Focus on metric distributions and correlations
- `model`: Detailed analysis of a specific model
- `prompts`: Analysis of prompt types and question difficulty
The visualizer creates multiple types of charts and analyses:
- Model Comparisons: Bar charts comparing performance across different models
- Prompt Type Analysis: Compare the effectiveness of different prompt methodologies
- Heatmaps: Visualize performance across combinations of prompt and answer models
- Metric Distributions: Understand the distribution of metric scores
- Correlation Matrices: See relationships between different metrics
- Question Difficulty Analysis: Identify which questions are hardest/easiest
- Per-Model Reports: Detailed performance reports for each model
All visualizations are saved as high-quality PNG files in the specified output directory.
You can also use the visualizer programmatically:
```python
from visualizer import ResultsVisualizer

# Initialize the visualizer with default settings
visualizer = ResultsVisualizer(
    results_path="results/qa_results.csv"
)

# Initialize with custom output organization
visualizer = ResultsVisualizer(
    results_path="results/qa_results.csv",
    output_dir="results/analysis",
    subfolder="model_comparison"
)

# Generate specific visualizations
visualizer.plot_model_comparison(metric="answer_similarity")
visualizer.plot_prompt_type_comparison()
visualizer.plot_heatmap(metric="entailment_score")
visualizer.plot_metric_distributions(by_column="answer_model")
visualizer.plot_correlation_matrix()
visualizer.plot_question_difficulty()

# Generate a comprehensive report
visualizer.generate_comprehensive_report()
```

Add new model configurations to the `PROMPT_MODEL_CONFIGS` and/or `ANSWER_MODEL_CONFIGS` dictionaries in `config.py`:
```python
"new-model": {
    "name": "organization/model-name",
    "description": "Description of the model",
    "max_new_tokens": 512,
    "temperature": 0.7,
    ...
}
```

Add new prompt types to the `PROMPT_TYPES` dictionary in `config.py`:

```python
"new prompt type": "Description of the reasoning methodology"
```

Extend the `metrics_evaluator.py` file to include additional evaluation metrics:

- Implement a new metric calculation method
- Add the metric to the appropriate category in `METRIC_CATEGORIES`
- Update the documentation in `get_metrics_documentation`
To add a new custom metric:
```python
# 1. Add the metric name to METRIC_CATEGORIES in metrics_evaluator.py
METRIC_CATEGORIES = {
    'Custom Metrics': [
        'my_new_metric',
        # ... other metrics
    ],
    # ... other categories
}

# 2. Implement the calculation in the _calculate_nlp_metrics method
def _calculate_nlp_metrics(self, question, model_answer, correct_answer):
    # ... existing code ...
    # Calculate your custom metric
    metrics['my_new_metric'] = calculate_my_metric(model_answer, correct_answer)
    return metrics

# 3. Update documentation
def get_metrics_documentation(self):
    metrics_docs = {}
    # ... existing code ...
    metrics_docs["Custom Metrics"] = [
        {"name": "my_new_metric", "description": "Description of what this metric measures"}
    ]
    return metrics_docs
```

Access your new metric in the evaluation results CSV or through the metrics dictionary.
- Models are configured to run on CPU by default for stability
- For better performance with larger models, consider using a machine with a GPU
- Adjust generation parameters in `config.py` to balance between quality and speed
- When running multi-model evaluations, be aware that memory usage increases with each loaded model
- Consider running extensive multi-model evaluations on high-memory machines or in batches
After the pipeline completes, it automatically analyzes the results to find the best combinations of models and prompt types. The analysis:
- Normalizes all metrics to a 0-1 scale for fair comparison
- Groups metrics into weighted categories:
  - Semantic similarity (30%): How relevant the answer is to the question
  - Answer similarity (20%): How similar the answer is to the reference answer
  - ROUGE scores (15%): Text overlap metrics
  - BERTScore (15%): BERT-based semantic similarity
  - Entailment (20%): Logical consistency with the reference
- Calculates overall scores for each combination
- Provides detailed reasoning for why each combination performed well
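The weighting scheme can be sketched as follows (the weights match the listed percentages; the helper names and normalization bounds are illustrative, not the framework's own code):

```python
# Category weights from the analysis (sum to 1.0)
METRIC_WEIGHTS = {
    "semantic_similarity": 0.30,
    "answer_similarity": 0.20,
    "rouge_scores": 0.15,
    "bertscore": 0.15,
    "entailment": 0.20,
}

def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Rescale a raw metric onto the 0-1 scale used for comparison."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def overall_score(normalized: dict) -> float:
    """Weighted sum of normalized category scores."""
    return sum(w * normalized[k] for k, w in METRIC_WEIGHTS.items())

scores = {"semantic_similarity": 0.90, "answer_similarity": 0.85,
          "rouge_scores": 0.80, "bertscore": 0.88, "entailment": 0.82}
print(round(overall_score(scores), 3))  # 0.856
```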
The analysis results are saved in two formats:

- CSV File (`results/your_results.csv`):
  - Contains all raw results with individual metrics
  - Includes all prompt variations and model combinations
  - Preserves full text of prompts and answers
- Analysis JSON (`results/your_results.analysis.json`):

```json
{
  "best_combinations": [
    {
      "prompt_model": "model_name",
      "answer_model": "model_name",
      "prompt_type": "prompt_type",
      "scores": {
        "overall": 0.85,
        "semantic_similarity": 0.90,
        "answer_similarity": 0.85,
        "rouge_scores": 0.80,
        "bertscore": 0.88,
        "entailment": 0.82
      },
      "reasoning": "Detailed explanation of why this combination performed well"
    }
  ],
  "metric_weights": {
    "semantic_similarity": 0.3,
    "answer_similarity": 0.2,
    "rouge_scores": 0.15,
    "bertscore": 0.15,
    "entailment": 0.2
  },
  "analysis_summary": {
    "total_combinations": 100,
    "best_overall_score": 0.85,
    "average_overall_score": 0.75
  }
}
```
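A short sketch of consuming the analysis JSON programmatically (the `top_combinations` helper is hypothetical, not part of the framework):

```python
import json

def top_combinations(analysis: dict, n: int = 5) -> list:
    """Rank best_combinations entries by their overall score, highest first."""
    combos = analysis["best_combinations"]
    return sorted(combos, key=lambda c: c["scores"]["overall"], reverse=True)[:n]

# Typical usage against the file written next to the results CSV:
# with open("results/your_results.analysis.json") as f:
#     for combo in top_combinations(json.load(f)):
#         print(combo["prompt_model"], "->", combo["answer_model"],
#               combo["prompt_type"], combo["scores"]["overall"])
```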
The pipeline includes these additional options for results analysis:
- `--no-analysis`: Disable automatic results analysis after pipeline completion
- `--no-metrics`: Disable metrics evaluation (also disables analysis)
- `--exclude-long-text`: Exclude long text fields from CSV output
```bash
# Run pipeline with automatic analysis
python pipeline.py \
    --dataset datasets/cleaned/your_dataset.csv \
    --output results/your_results.csv \
    --prompt-models phi-2 mistral-7b \
    --answer-models phi-2 mistral-7b \
    --prompt-types "chain of thought" "role based" \
    --prompts-per-type 2 \
    --num-questions 10

# Run pipeline without analysis
python pipeline.py \
    --dataset datasets/cleaned/your_dataset.csv \
    --output results/your_results.csv \
    --no-analysis
```

When the pipeline completes, you'll see a summary like this:
```
Top 5 Best Combinations:
================================================================================
1. mistral-7b → phi-2 (chain of thought)
   Overall Score: 0.853
   Reasoning: Excellent semantic similarity with reference answers | High answer similarity indicating good content matching | The mistral-7b model generated effective prompts for the chain of thought methodology | The phi-2 model produced high-quality answers based on these prompts
--------------------------------------------------------------------------------
2. phi-2 → mistral-7b (role based)
   Overall Score: 0.842
   Reasoning: Strong text overlap with reference answers | High BERT-based semantic similarity | The phi-2 model generated effective prompts for the role based methodology | The mistral-7b model produced high-quality answers based on these prompts
--------------------------------------------------------------------------------
```
This analysis helps you:
- Identify the most effective model combinations
- Understand which prompt types work best
- See detailed metrics for each combination
- Get explanations for why certain combinations performed well
The `summarize_experiments.py` script provides automated analysis and visualization of experiment results. It processes the output CSVs from all evaluation experiments and generates:

- A summary figure with bar charts and heatmaps comparing prompt types, models, and configurations across datasets.
- A best configuration summary table (`best_configurations_summary.csv`) that includes, for each experiment:
  - Dataset and experiment type
  - Best prompt type, prompt model, and answer model
  - System prompt used
  - All raw metric values for the best configuration
  - The overall weighted score
- Normalized versions of each experiment's results in the `normalized_results/` directory.

To run the summarization and generate all outputs:

```bash
python summarize_experiments.py
```

This will create:

- `summary_figure.png`: A multi-panel figure visualizing the main findings
- `best_configurations_summary.csv`: A table of the best configuration and metrics for each experiment
- `normalized_results/`: Folder containing normalized CSVs for each experiment
You can use these outputs directly in your paper or for further analysis.
The pipeline supports incremental processing and can resume from interruptions:
- Incremental CSV Writing: Results are written to CSV immediately after each evaluation, providing:
  - Crash resistance - no data loss if the process is interrupted
  - Progress visibility - results can be monitored in real-time
  - Memory efficiency - results don't need to be held in memory
- Resume Functionality: Use the `--resume-from` argument to continue from a specific question:

  ```bash
  python pipeline.py \
      --dataset datasets/cleaned/your_dataset.csv \
      --output results/your_results.csv \
      --prompt-models llama-3.2-1b \
      --answer-models llama-3.2-1b \
      --prompt-types "chain of thought" "guided prompting" \
      --prompts-per-type 5 \
      --num-questions 20 \
      --resume-from 19  # Resume from question #19
  ```

- Progress Tracking:
  - Shows overall progress with a progress bar
  - Displays current question being processed
  - Indicates which model combinations are being evaluated
  - Confirms each result being saved to CSV
The pipeline manages CSV output with the following features:
- Automatic Directory Creation: Creates output directories if they don't exist
- Safe Append Mode: Never overwrites existing results, always appends new data
- Header Management: Properly handles CSV headers for both new and existing files
- Long Text Handling: Option to exclude long text fields with `--exclude-long-text`
This ensures that results are saved reliably and efficiently, even when the pipeline is interrupted or when processing large datasets.
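The append behavior can be sketched in a few lines (a simplified stand-in for the pipeline's own writer; the `append_result` helper name is illustrative):

```python
import csv
import os

def append_result(path: str, row: dict) -> None:
    """Append one result row, creating the directory and CSV header as needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Write the header only for a new or empty file; otherwise append safely
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling such a helper once per evaluation is what gives the crash resistance and progress visibility described above: each row reaches disk before the next evaluation starts.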
