MedPromptEval is a comprehensive framework for evaluating and improving large language models for medical question answering through systematic prompt engineering, multi-model evaluation, and detailed metrics analysis.
MedPromptEval provides a complete framework for generating high-quality system prompts that guide language models in answering medical questions, evaluating their performance, and analyzing the results. It's designed to help researchers and practitioners understand which combinations of models and prompting strategies yield the most accurate, relevant, and unbiased medical answers.
Medical question answering requires both accuracy and appropriate explanation. This framework allows you to:
- Generate diverse system prompts using different reasoning methodologies
- Test these prompts with various language models against medical QA datasets
- Analyze answer quality through comprehensive metrics
- Identify the most effective model and prompt combinations for medical QA
MedPromptEval can be used for:

- Medical LLM Research:
  - Benchmark different medical LLMs on standardized datasets
  - Identify optimal prompting strategies for different medical domains
- Medical Education:
  - Evaluate LLMs for patient education content generation
  - Ensure medical explanations are accurate and at appropriate reading levels
- Clinical Decision Support:
  - Test how well LLMs reason about medical cases
  - Identify and reduce bias in medical recommendations
- Healthcare Documentation:
  - Assess models for medical summarization tasks
  - Evaluate factual consistency between source documents and model outputs
- Supports multiple LLMs:
  - Phi-2
  - Mistral-7B
  - Llama-3-8B
  - Llama-3.2-1B
  - DeepSeek-R1-Distill-Qwen-1.5B
  - Qwen3-1.7B
  - Gemma-3-1B-IT
  - Granite-3.3-2B
  - Gemma-2-2B
  - OpenBioLLM-8B
- Generates prompts using various reasoning methodologies:
  - Chain of Thought: Step-by-step reasoning through medical concepts
  - Trigger Chain of Thought: Using prompts that elicit medical reasoning
  - Self Consistency: Generating multiple reasoning paths for verification
  - Prompt Chaining: Breaking complex medical questions into sub-prompts
  - ReAct: Reasoning and acting iteratively for clinical scenarios
  - Tree of Thoughts: Exploring multiple diagnostic or treatment branches
  - Role-Based: Assuming the persona of a relevant medical specialist
  - Metacognitive Prompting: Self-reflection on medical reasoning processes
  - Uncertainty-Based Prompting: Acknowledging knowledge limitations and providing confidence assessments
  - Guided Prompting: Using structured frameworks for medical explanations
- Separate optimized configurations for:
  - Prompt generation (creative, diverse)
  - Answer generation (factual, precise)
- Command-line interface for easy experimentation
- JSON output format for further processing
- Evaluation pipeline for testing system prompts on real QA datasets
- Multi-model evaluation capabilities for comprehensive performance comparisons
- Memory-efficient evaluation with comprehensive NLP metrics
- Multiple Prompt Types: Supports various prompt engineering techniques including Chain of Thought, Self-Consistency, ReAct, and more
- Comprehensive Metrics: Evaluates answers using multiple metrics including semantic similarity, ROUGE scores, BLEU score, and BERTScore
- Flexible Model Support: Works with any Hugging Face model
- Incremental Processing: Results are written to CSV immediately after each evaluation, providing:
  - Crash resistance
  - Progress visibility
  - Memory efficiency
- Resume Capability: Continue processing from any question using the `--resume-from` argument
- CSV Output Handling:
  - Automatic directory creation
  - Safe append mode
  - Header management
  - Long text handling
- Advanced Visualization: Comprehensive analysis tools with:
  - Model comparisons
  - Prompt type analysis
  - Metric distributions
  - Correlation analysis
  - Best configurations analysis
  - Normalized metric scoring
- `config.py`: Contains model configurations and prompt type definitions
- `prompt_generation.py`: Core implementation of the `PromptGenerator` class
- `answer_generation.py`: Handles question answering with different models
- `metrics_evaluator.py`: Evaluation module for comparing model answers with ground truth
- `pipeline.py`: Comprehensive pipeline for evaluating generated prompts on question-answering datasets
- `test_pipeline.py`: Command-line interface for running the generator
- `visualizer.py`: Visualization tools for analyzing evaluation results
- `summarize_experiments.py`: Automated analysis and visualization of experiment results
- `datasets/`: Directory containing medical QA datasets
- `results/`: Output directory for evaluation results
- `visualizations/`: Generated charts and plots from analysis
- Python 3.8+
- PyTorch
- Transformers
- Hugging Face account with API token (for some models)
- dotenv
- pandas
- tqdm
- colorama (for terminal output formatting)
- nltk, rouge, bert_score (for NLP metrics)
- sentence-transformers (for semantic embedding)
- textstat (for readability metrics)
- textblob (for sentiment analysis)
- matplotlib, seaborn (for data visualization)
- scipy (for statistical analysis)
- numpy (for numerical operations)
- Clone this repository
- Install requirements:

  ```bash
  pip install -r requirements.txt
  ```

  Alternatively, install individual packages:

  ```bash
  pip install torch transformers huggingface_hub python-dotenv pandas tqdm colorama nltk rouge bert_score sentence_transformers textstat textblob matplotlib seaborn scipy numpy
  ```

- Create a `.env` file in the root directory with your Hugging Face token:

  ```
  HUGGINGFACE_TOKEN=your_token_here
  ```

  You can get your token from your Hugging Face account settings.
Run the generator with default settings (Mistral-7B model):

```bash
python test_pipeline.py
```

Specify different parameters:

```bash
python test_pipeline.py --model phi-2 --output_dir outputs/phi2 --num_prompts 3
```

Generate prompts for specific reasoning types:

```bash
python test_pipeline.py --model phi-2 --prompt_types "chain of thought" "role based" --num_prompts 2
```

Evaluate how well the generated system prompts perform on question-answering tasks:

```bash
python pipeline.py --dataset datasets/cleaned/medquad_cleaned.csv --output results/qa_results.csv --prompt-models phi-2 --answer-models mistral-7b
```

For comprehensive multi-model evaluation, specify multiple models:

```bash
python pipeline.py --dataset datasets/cleaned/medquad.csv --output results/multi_model_results.csv --prompt-models phi-2 mistral-7b --answer-models phi-2 mistral-7b llama-3-8b --prompt-types "chain of thought" "self consistency" --prompts-per-type 2 --num-questions 5
```

This example would test:
- 2 prompt models × 3 answer models × 2 prompt types × 2 prompts per type = 24 combinations per question
- Across 5 sample questions, resulting in 120 total evaluations
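As a sanity check, the combination arithmetic can be reproduced in a few lines of Python (the counts mirror the example command's flags):

```python
# Counts taken from the example command's flags
prompt_models = 2      # phi-2, mistral-7b
answer_models = 3      # phi-2, mistral-7b, llama-3-8b
prompt_types = 2       # "chain of thought", "self consistency"
prompts_per_type = 2   # --prompts-per-type 2
num_questions = 5      # --num-questions 5

# Every (prompt model, answer model, prompt type, variation) tuple
# is evaluated once per question
per_question = prompt_models * answer_models * prompt_types * prompts_per_type
total = per_question * num_questions

print(per_question)  # 24
print(total)         # 120
```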
```bash
# Step 1: Generate prompts with Phi-2
python test_pipeline.py --model phi-2 --num_prompts 1 --output_dir outputs/phi2_prompts

# Step 2: Run evaluation with all prompt types
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/phi2_prompt_types.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 \
    --num-questions 20
```

```bash
# Evaluate all models on the same dataset with the same prompt type
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/model_comparison.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --prompt-types "chain of thought" \
    --num-questions 25
```

```bash
# Run with all available models and prompt types
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/comprehensive_evaluation.csv \
    --prompt-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --answer-models phi-2 mistral-7b llama-3-8b llama-3.2-1b deepseek-qwen-1.5b qwen3-1.7b gemma-3-1b-it granite-3.3-2b gemma-2-2b openbiollm-8b \
    --prompt-types "chain of thought" "trigger chain of thought" "self consistency" "prompt chaining" "react" "tree of thoughts" "role based" "metacognitive prompting" "uncertainty based prompting" "guided prompting" \
    --prompts-per-type 2 \
    --num-questions 10
```

```bash
# Run with minimal memory usage
python pipeline.py \
    --dataset datasets/cleaned/medquad_cleaned.csv \
    --output results/low_resource.csv \
    --prompt-models phi-2 \
    --answer-models phi-2 \
    --prompt-types "chain of thought" \
    --num-questions 10 \
    --exclude-long-text \
    --no-verbose
```

- `--model`: Choose the model to use (`phi-2`, `mistral-7b`, `llama-3-8b`, `gemma-2-2b`, `openbiollm-8b`)
- `--output_dir`: Directory to save generated prompts
- `--num_prompts`: Number of prompts to generate per type
- `--prompt_types`: Specific prompt types to generate (default: all types)
- `--no_auth`: Run without Hugging Face authentication
- `--dataset`: Path to a CSV file containing question-answer pairs (required)
- `--output`: Path for the output CSV file (default: `results/qa_results.csv`)
- `--prompt-models`: Models to use for generating prompts; multiple may be specified (default: `phi-2`)
- `--answer-models`: Models to use for answering questions; multiple may be specified (default: `mistral-7b`)
- `--prompt-types`: Specific prompt types to use (default: all types)
- `--prompts-per-type`: Number of prompt variations to generate per type (default: 1)
- `--num-questions`: Number of question-answer pairs to process (default: all)
- `--no-auth`: Run without Hugging Face authentication
- `--no-metrics`: Disable all metrics evaluation
- `--no-deepeval`: Disable DeepEval metrics to reduce memory usage while still calculating basic NLP metrics
- `--exclude-long-text`: Exclude long text fields from CSV output
- `--no-verbose`: Disable colorized metrics display in the terminal
- `--list-metrics`: List all available metrics with descriptions and exit
The system uses two separate model configurations to optimize for different tasks:
- `PROMPT_MODEL_CONFIGS`: Models optimized for prompt generation
  - Higher temperature (0.7) for creativity
  - Balanced top_p (0.9) and top_k (50) for diverse suggestions
  - Shorter output (512 tokens) focused on system prompt creation
  - Higher repetition penalty (1.2) to avoid repetitive patterns
  - Optimized for generating structured, clear instructions
- `ANSWER_MODEL_CONFIGS`: Models optimized for answering medical questions
  - Lower temperature (0.3) for factual, precise answers
  - Higher max_new_tokens (1024) for more detailed responses
  - Lower top_p (0.7) and top_k (40) for more focused outputs
  - Balanced repetition penalty (1.1) for natural but focused answers
  - Optimized for generating comprehensive, accurate medical explanations

Each model configuration can be customized in `config.py` to fine-tune the generation parameters for specific use cases.
The system generates a JSON file containing:
- Model metadata
- Generated prompts for each reasoning methodology
Example output structure:
```json
{
  "metadata": {
    "model_info": {
      "name": "mistralai/Mistral-7B-v0.1",
      "description": "Mistral 7B base model, good for general instruction following"
    }
  },
  "prompts": {
    "chain of thought": [
      "You are a medical AI assistant. When answering medical questions, break down your reasoning into clear, logical steps..."
    ],
    "role based": [
      "Assume the role of a medical specialist most relevant to the question being asked..."
    ],
    ...
  }
}
```

The pipeline produces a CSV file with these columns:

- `prompt_num`: A sequential number for each prompt generated during execution
- `question`: The original question
- `correct_answer`: The ground truth answer
- `prompt_model`: Which model generated the system prompt
- `prompt_model_key`: The configuration key for the prompt model
- `prompt_type`: The type of reasoning (chain of thought, etc.)
- `prompt_variation`: The variation number for this prompt type
- `system_prompt`: The full system prompt text
- `answer_model`: Which model generated the answer
- `answer_model_key`: The configuration key for the answer model
- `model_answer`: The model's answer to the question
- Various metrics columns (see the Metrics Evaluation section)
The pipeline writes results to the CSV file incrementally after each question-answer evaluation, rather than saving everything at the end. This provides several benefits:
- Crash resistance: If the pipeline is interrupted or crashes, all processed results up to that point are already saved
- Progress visibility: You can open the CSV file while the pipeline is running to see current results
- Reduced memory usage: The pipeline doesn't need to keep all results in memory
- Reliable execution: Even with large-scale evaluations across multiple models, your results are saved as they're generated
After each result is processed, you'll see output like this:
```
✓ Result #42 saved: Q5, chain of thought (2/3), Phi-2 → Mistral-7B
```
This indicates that:
- Result #42 has been saved to the CSV file
- It's for question #5
- Using the "chain of thought" prompt type, variation 2 of 3
- Generated with Phi-2 as the prompt model and Mistral-7B as the answer model
The system includes a comprehensive metrics evaluation component that analyzes the quality of generated answers using various NLP techniques. This allows for quantitative assessment of model performance without requiring external APIs.
The metrics evaluator provides the following metrics:
Semantic and Relevance Metrics:
- `semantic_similarity`: Cosine similarity between question and answer embeddings (relevance)
- `answer_similarity`: Cosine similarity between model answer and correct answer
- `entailment_score`: Score indicating whether the model answer entails (is consistent with) the correct answer
- `entailment_label`: Classification label: "entailment", "neutral", or "contradiction"
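Both similarity metrics reduce to a cosine over embedding vectors. A minimal sketch of that computation, using toy vectors in place of real sentence-transformers embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for sentence embeddings
question_emb = [0.2, 0.9, 0.1]
answer_emb = [0.3, 0.8, 0.2]
print(round(cosine_similarity(question_emb, answer_emb), 3))  # 0.983
```

In the real pipeline the vectors come from a sentence-transformers model rather than hand-written lists.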
Text Comparison Metrics:
- `rouge1_f`, `rouge2_f`, `rougeL_f`: ROUGE metrics for text overlap assessment
- `bleu_score`: BLEU score for measuring precision
- `bertscore_precision`, `bertscore_recall`, `bertscore_f1`: BERTScore metrics for semantic evaluation
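To make the text-overlap idea concrete, here is a hypothetical, stripped-down ROUGE-1 F1 (unigram overlap only; the pipeline itself relies on the `rouge` package, which also handles ROUGE-2 and ROUGE-L):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("aspirin reduces fever and pain",
                     "aspirin lowers fever and relieves pain"), 3))  # 0.727
```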
Readability and Style Metrics:
- `answer_length`: Number of words in the answer
- `flesch_reading_ease`: Readability score (higher = easier to read)
- `flesch_kincaid_grade`: US grade level required to understand the text
- `sentiment_polarity`: Sentiment of the answer (-1 to +1)
- `sentiment_subjectivity`: Subjectivity of the answer (0 to 1)
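The readability scores follow the standard Flesch formula. A rough sketch (the pipeline uses `textstat`, whose syllable counting is more careful than this vowel-group heuristic):

```python
import re

def rough_syllables(word: str) -> int:
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = text.split()
    syllables = sum(rough_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The drug lowers blood pressure. Take it daily."), 1))  # 65.3
```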
Comparative Metrics:
- `comparison_answer_length_delta`: Difference in length between model and correct answers
- `comparison_flesch_reading_ease_delta`: Difference in readability
- `comparison_flesch_kincaid_grade_delta`: Difference in grade level
- `comparison_sentiment_polarity_delta`: Difference in sentiment
- `comparison_sentiment_subjectivity_delta`: Difference in subjectivity
- `comparison_relevance_delta`: Difference in question relevance
- `comparison_summary`: Overall summary of key differences
Reference Metrics:
For each model metric, there is a corresponding `correct_`-prefixed version that provides the same measurement for the reference answer, enabling direct comparison.
When verbose mode is enabled (default), metrics are displayed with color coding in the terminal:
- Green: Good scores
- Yellow: Moderate scores
- Red: Poor scores
This provides immediate visual feedback on answer quality during evaluation.
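The color coding can be sketched with plain ANSI escape codes (the pipeline itself uses colorama; the thresholds below are illustrative, not the exact cutoffs in `metrics_evaluator.py`):

```python
GREEN, YELLOW, RED, RESET = "\033[92m", "\033[93m", "\033[91m", "\033[0m"

def colorize_metric(name: str, value: float,
                    good: float = 0.7, moderate: float = 0.4) -> str:
    """Format a metric line colored green/yellow/red by threshold."""
    color = GREEN if value >= good else YELLOW if value >= moderate else RED
    return f"{color}{name}: {value:.3f}{RESET}"

print(colorize_metric("answer_similarity", 0.83))  # rendered in green
print(colorize_metric("bleu_score", 0.12))         # rendered in red
```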
The evaluation results are saved to CSV files with comprehensive metrics. You can analyze these results using:
- Spreadsheet Applications: Open in Excel, Google Sheets, etc. for basic filtering and visualization
- Data Analysis Libraries: Use pandas, matplotlib, or other Python libraries for advanced analysis
- Built-in Visualization Tools: Use the provided `visualizer.py` module for comprehensive visual analysis
The framework includes a powerful visualization module to help analyze and interpret evaluation results. By default, visualizations are saved in the `results/visualizations` directory, but you can customize the output location and organization:
```bash
# Generate a comprehensive report with all visualizations (default location)
python visualizer.py --results results/qa_results.csv

# Generate visualizations in a specific subfolder
python visualizer.py --results results/qa_results.csv --subfolder model_analysis

# Generate visualizations in a custom directory and subfolder
python visualizer.py --results results/qa_results.csv --output-dir results/analysis --subfolder prompt_analysis

# Generate specific visualization types
python visualizer.py --results results/qa_results.csv --report-type basic
python visualizer.py --results results/qa_results.csv --report-type metrics
python visualizer.py --results results/qa_results.csv --report-type prompts

# Analyze a specific model's performance
python visualizer.py --results results/qa_results.csv --report-type model --model mistral-7b
```

Visualizations are organized in the following structure:

```
results/
  visualizations/                # Default output directory
    [subfolder if specified]/    # Optional subfolder for organization
      model_comparison_by_answer_similarity.png
      prompt_type_comparison_multi_metric.png
      ...
```
- `comprehensive`: Generate all visualizations (default)
- `basic`: Basic model and prompt type comparisons
- `metrics`: Focus on metric distributions and correlations
- `model`: Detailed analysis of a specific model
- `prompts`: Analysis of prompt types and question difficulty
The visualizer creates multiple types of charts and analyses:
- Model Comparisons: Bar charts comparing performance across different models
- Prompt Type Analysis: Compare the effectiveness of different prompt methodologies
- Heatmaps: Visualize performance across combinations of prompt and answer models
- Metric Distributions: Understand the distribution of metric scores
- Correlation Matrices: See relationships between different metrics
- Question Difficulty Analysis: Identify which questions are hardest/easiest
- Per-Model Reports: Detailed performance reports for each model
All visualizations are saved as high-quality PNG files in the specified output directory.
You can also use the visualizer programmatically:
```python
from visualizer import ResultsVisualizer

# Initialize the visualizer with default settings
visualizer = ResultsVisualizer(
    results_path="results/qa_results.csv"
)

# Initialize with custom output organization
visualizer = ResultsVisualizer(
    results_path="results/qa_results.csv",
    output_dir="results/analysis",
    subfolder="model_comparison"
)

# Generate specific visualizations
visualizer.plot_model_comparison(metric="answer_similarity")
visualizer.plot_prompt_type_comparison()
visualizer.plot_heatmap(metric="entailment_score")
visualizer.plot_metric_distributions(by_column="answer_model")
visualizer.plot_correlation_matrix()
visualizer.plot_question_difficulty()

# Generate a comprehensive report
visualizer.generate_comprehensive_report()
```

Add new model configurations to the `PROMPT_MODEL_CONFIGS` and/or `ANSWER_MODEL_CONFIGS` dictionaries in `config.py`:
```python
"new-model": {
    "name": "organization/model-name",
    "description": "Description of the model",
    "max_new_tokens": 512,
    "temperature": 0.7,
    ...
}
```

Add new prompt types to the `PROMPT_TYPES` dictionary in `config.py`:

```python
"new prompt type": "Description of the reasoning methodology"
```

Extend the `metrics_evaluator.py` file to include additional evaluation metrics:

- Implement a new metric calculation method
- Add the metric to the appropriate category in `METRIC_CATEGORIES`
- Update the documentation in `get_metrics_documentation`
To add a new custom metric:
```python
# 1. Add the metric name to METRIC_CATEGORIES in metrics_evaluator.py
METRIC_CATEGORIES = {
    'Custom Metrics': [
        'my_new_metric',
        # ... other metrics
    ],
    # ... other categories
}

# 2. Implement the calculation in the _calculate_nlp_metrics method
def _calculate_nlp_metrics(self, question, model_answer, correct_answer):
    # ... existing code ...
    # Calculate your custom metric
    metrics['my_new_metric'] = calculate_my_metric(model_answer, correct_answer)
    return metrics

# 3. Update documentation
def get_metrics_documentation(self):
    metrics_docs = {}
    # ... existing code ...
    metrics_docs["Custom Metrics"] = [
        {"name": "my_new_metric", "description": "Description of what this metric measures"}
    ]
    return metrics_docs
```

Access your new metric in the evaluation results CSV or through the metrics dictionary.
- Models are configured to run on CPU by default for stability
- For better performance with larger models, consider using a machine with a GPU
- Adjust generation parameters in `config.py` to balance between quality and speed
- When running multi-model evaluations, be aware that memory usage increases with each loaded model
- Consider running extensive multi-model evaluations on high-memory machines or in batches
After the pipeline completes, it automatically analyzes the results to find the best combinations of models and prompt types. The analysis:
- Normalizes all metrics to a 0-1 scale for fair comparison
- Groups metrics into weighted categories:
  - Semantic similarity (30%): How relevant the answer is to the question
  - Answer similarity (20%): How similar the answer is to the reference answer
  - ROUGE scores (15%): Text overlap metrics
  - BERTScore (15%): BERT-based semantic similarity
  - Entailment (20%): Logical consistency with the reference
- Calculates overall scores for each combination
- Provides detailed reasoning for why each combination performed well
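The weighting scheme can be sketched as follows (the weights match the listed percentages; the helper names and normalization bounds are illustrative, not the framework's own code):

```python
# Category weights from the analysis (sum to 1.0)
METRIC_WEIGHTS = {
    "semantic_similarity": 0.30,
    "answer_similarity": 0.20,
    "rouge_scores": 0.15,
    "bertscore": 0.15,
    "entailment": 0.20,
}

def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Rescale a raw metric onto the 0-1 scale used for comparison."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def overall_score(normalized: dict) -> float:
    """Weighted sum of normalized category scores."""
    return sum(w * normalized[k] for k, w in METRIC_WEIGHTS.items())

scores = {"semantic_similarity": 0.90, "answer_similarity": 0.85,
          "rouge_scores": 0.80, "bertscore": 0.88, "entailment": 0.82}
print(round(overall_score(scores), 3))  # 0.856
```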
The analysis results are saved in two formats:

- CSV File (`results/your_results.csv`):
  - Contains all raw results with individual metrics
  - Includes all prompt variations and model combinations
  - Preserves full text of prompts and answers
- Analysis JSON (`results/your_results.analysis.json`):

```json
{
  "best_combinations": [
    {
      "prompt_model": "model_name",
      "answer_model": "model_name",
      "prompt_type": "prompt_type",
      "scores": {
        "overall": 0.85,
        "semantic_similarity": 0.90,
        "answer_similarity": 0.85,
        "rouge_scores": 0.80,
        "bertscore": 0.88,
        "entailment": 0.82
      },
      "reasoning": "Detailed explanation of why this combination performed well"
    }
  ],
  "metric_weights": {
    "semantic_similarity": 0.3,
    "answer_similarity": 0.2,
    "rouge_scores": 0.15,
    "bertscore": 0.15,
    "entailment": 0.2
  },
  "analysis_summary": {
    "total_combinations": 100,
    "best_overall_score": 0.85,
    "average_overall_score": 0.75
  }
}
```
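A short sketch of consuming the analysis JSON programmatically (the `top_combinations` helper is hypothetical, not part of the framework):

```python
import json

def top_combinations(analysis: dict, n: int = 5) -> list:
    """Rank best_combinations entries by their overall score, highest first."""
    combos = analysis["best_combinations"]
    return sorted(combos, key=lambda c: c["scores"]["overall"], reverse=True)[:n]

# Typical usage against the file written next to the results CSV:
# with open("results/your_results.analysis.json") as f:
#     for combo in top_combinations(json.load(f)):
#         print(combo["prompt_model"], "->", combo["answer_model"],
#               combo["prompt_type"], combo["scores"]["overall"])
```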
The pipeline includes these additional options for results analysis:
- `--no-analysis`: Disable automatic results analysis after pipeline completion
- `--no-metrics`: Disable metrics evaluation (also disables analysis)
- `--exclude-long-text`: Exclude long text fields from CSV output
```bash
# Run pipeline with automatic analysis
python pipeline.py \
    --dataset datasets/cleaned/your_dataset.csv \
    --output results/your_results.csv \
    --prompt-models phi-2 mistral-7b \
    --answer-models phi-2 mistral-7b \
    --prompt-types "chain of thought" "role based" \
    --prompts-per-type 2 \
    --num-questions 10

# Run pipeline without analysis
python pipeline.py \
    --dataset datasets/cleaned/your_dataset.csv \
    --output results/your_results.csv \
    --no-analysis
```

When the pipeline completes, you'll see a summary like this:
```
Top 5 Best Combinations:
================================================================================
1. mistral-7b → phi-2 (chain of thought)
   Overall Score: 0.853
   Reasoning: Excellent semantic similarity with reference answers | High answer similarity indicating good content matching | The mistral-7b model generated effective prompts for the chain of thought methodology | The phi-2 model produced high-quality answers based on these prompts
--------------------------------------------------------------------------------
2. phi-2 → mistral-7b (role based)
   Overall Score: 0.842
   Reasoning: Strong text overlap with reference answers | High BERT-based semantic similarity | The phi-2 model generated effective prompts for the role based methodology | The mistral-7b model produced high-quality answers based on these prompts
--------------------------------------------------------------------------------
```
This analysis helps you:
- Identify the most effective model combinations
- Understand which prompt types work best
- See detailed metrics for each combination
- Get explanations for why certain combinations performed well
The `summarize_experiments.py` script provides automated analysis and visualization of experiment results. It processes the output CSVs from all evaluation experiments and generates:

- A summary figure with bar charts and heatmaps comparing prompt types, models, and configurations across datasets.
- A best configuration summary table (`best_configurations_summary.csv`) that includes, for each experiment:
  - Dataset and experiment type
  - Best prompt type, prompt model, and answer model
  - System prompt used
  - All raw metric values for the best configuration
  - The overall weighted score
- Normalized versions of each experiment's results in the `normalized_results/` directory.

To run the summarization and generate all outputs:

```bash
python summarize_experiments.py
```

This will create:

- `summary_figure.png`: A multi-panel figure visualizing the main findings
- `best_configurations_summary.csv`: A table of the best configuration and metrics for each experiment
- `normalized_results/`: Folder containing normalized CSVs for each experiment
You can use these outputs directly in your paper or for further analysis.
The pipeline supports incremental processing and can resume from interruptions:
- Incremental CSV Writing: Results are written to CSV immediately after each evaluation, providing:
  - Crash resistance - no data loss if the process is interrupted
  - Progress visibility - results can be monitored in real-time
  - Memory efficiency - results don't need to be held in memory
- Resume Functionality: Use the `--resume-from` argument to continue from a specific question:

  ```bash
  python pipeline.py \
      --dataset datasets/cleaned/your_dataset.csv \
      --output results/your_results.csv \
      --prompt-models llama-3.2-1b \
      --answer-models llama-3.2-1b \
      --prompt-types "chain of thought" "guided prompting" \
      --prompts-per-type 5 \
      --num-questions 20 \
      --resume-from 19  # Resume from question #19
  ```

- Progress Tracking:
  - Shows overall progress with a progress bar
  - Displays current question being processed
  - Indicates which model combinations are being evaluated
  - Confirms each result being saved to CSV
The pipeline manages CSV output with the following features:
- Automatic Directory Creation: Creates output directories if they don't exist
- Safe Append Mode: Never overwrites existing results, always appends new data
- Header Management: Properly handles CSV headers for both new and existing files
- Long Text Handling: Option to exclude long text fields with `--exclude-long-text`
This ensures that results are saved reliably and efficiently, even when the pipeline is interrupted or when processing large datasets.
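The append behavior can be sketched in a few lines (a simplified stand-in for the pipeline's own writer; the `append_result` helper name is illustrative):

```python
import csv
import os

def append_result(path: str, row: dict) -> None:
    """Append one result row, creating the directory and CSV header as needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Write the header only for a new or empty file; otherwise append safely
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling such a helper once per evaluation is what gives the crash resistance and progress visibility described above: each row reaches disk before the next evaluation starts.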
