👑 Meeseeks Benchmark

🚀 Latest News

We officially released the multilingual version of Meeseeks!

📋 Previous Versions

Temporarily removed for arr submitting

📖 Introduction

Meeseeks is an instruction-following benchmark designed to evaluate how well models can adhere to user instructions in a multi-turn scenario.
A key feature of Meeseeks is its self-correction loop, where models receive structured feedback and must refine their responses accordingly.

This benchmark provides a realistic evaluation of a model’s adaptability, instruction adherence, and iterative improvement.

📊 Leaderboard

🍄‍🟫 A Quick Example

ROUND1-Input	Evaluation Content	Capability tags
Generate 32 colloquial user comments and 40 formal user comments from a consumer perspective in short video comment sections. Each comment should be exactly 7 characters long and must not contain the following words:["this", "good", "that"]	Whether 32 colloquial user comments were generated	Element number requirement
	Whether 40 formal user comments were generated	Element number requirement
	Whether all comments are exactly 7 characters	Generate in 0∼10 words、Generate at accurate word number
	Whether comments are non-repetitive	Generate repeat/non-repeat content
	Whether comments do not contain forbidden words: ["this", "good", "that"]	Generate with certain keywords
💡 Let's activate multi-round mode!
ROUND2 - Input (if ROUND1 model output fails to meet requirement: "Whether all comments are exactly 7 characters")
Your response has the following issues: Whether all comments are exactly 7 characters: ❌ Content character count does not match range[7, 7] [mom prouds of you] character count: 4 Please provide your corrected response based on this information. Note: Only output the answer, do not output additional information.
ROUND3 - Input ...
...

🚀 Quick Start

Step 1: Environment Setup

1.1 Install Dependencies

Run the automated installation script:

bash install_deps.sh

This script will:

Detect your Python version (3.9 or 3.10+)
Install all required dependencies
Resolve version conflicts automatically
Install language-specific NLP libraries (Chinese, Japanese, Korean, Arabic, German, French, etc.)

Requirements: Python 3.9+ (Python 3.10+ recommended)

1.2 Configure API Keys

Create a .env file in the project root with your API configurations:

# Qwen API Configuration (Extract Model)
QWEN_API_KEY=your_api_key_here
QWEN_BASE_URL=your_api_base_url_here
QWEN_MODEL=your_model_name_here

# Qwen Coder API Configuration (Score Model)
QWEN_CODER_API_KEY=your_api_key_here
QWEN_CODER_BASE_URL=your_api_base_url_here
QWEN_CODER_MODEL=your_model_name_here

# Tested Model API Configuration (Model Under Evaluation)
TESTED_MODEL_API_KEY=your_api_key_here
TESTED_MODEL_BASE_URL=your_api_base_url_here
TESTED_MODEL_NAME=your_model_name_here

💡 Tip: All three models support OpenAI-compatible API format. You can use the same model for all three roles if needed.

Step 2: Run Evaluation

2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)

Run evaluation for all Asia languages:

python default_run_asia.py

Or filter specific languages:

# Evaluate only Chinese data
python default_run_asia.py --chinese

# Evaluate only Japanese data
python default_run_asia.py --japanese

# Evaluate only Korean data
python default_run_asia.py --korean

# Combine multiple languages
python default_run_asia.py --chinese --japanese

2.2 English & Multi-language Evaluation

Run evaluation for all supported languages:

python default_run_eng.py

Or filter specific languages:

# Evaluate only English data
python default_run_eng.py --english

# Evaluate only German data
python default_run_eng.py --german

# Evaluate other languages
python default_run_eng.py --french    # French
python default_run_eng.py --spanish   # Spanish
python default_run_eng.py --portuguese # Portuguese
python default_run_eng.py --russian   # Russian
python default_run_eng.py --arabic    # Arabic
python default_run_eng.py --indonesian # Indonesian

# Combine multiple languages
python default_run_eng.py --english --german --french

⚙️ Model Requirements

Before running any evaluation, you need to configure three model APIs:

Tested Model (TESTED_MODEL_* in .env)
- The model you want to evaluate
- Must support OpenAI-compatible Chat Completions API
Extract Model (QWEN_* in .env)
- Recommended: Qwen2.5-Coder-32B-Instruct
- Used to extract structured outputs from model responses
- Requires strong code generation and structure understanding
Score Model (QWEN_CODER_* in .env)
- Recommended: Qwen2.5-32B-Instruct
- Used to evaluate and score the extracted results
- Requires strong reasoning and judgment capabilities

💡 Hardware & API Options

If you have a GPU:
Deploy open-source Qwen2.5 series models locally using vLLM, TGI, or similar frameworks.
If you don't have a GPU:
Use commercial APIs instead:
- ✅ Highly recommended: Claude 3.7 Sonnet or GPT-4
- Any OpenAI-compatible API endpoint will work

📂 Evaluation Results

Results will be automatically saved to:

Asia languages: evaluation_results_asia/
English & others: evaluation_results_english/

Each directory contains:

round_1.json, round_2.json: Detailed evaluation results per round
round_1_stats.json, round_2_stats.json: Statistical summaries
Structured logs and scoring information for analysis

🙏 Contributors behind the scenes

Temporarily removed for arr submitting

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
evaluation_results		evaluation_results
evaluation_results_english		evaluation_results_english
input_data		input_data
src_code		src_code
.gitignore		.gitignore
README.md		README.md
default_run_asia.py		default_run_asia.py
default_run_eng.py		default_run_eng.py
install_deps.sh		install_deps.sh
leaderboard.svg		leaderboard.svg
logo.jpg		logo.jpg
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

👑 Meeseeks Benchmark

🚀 Latest News

📋 Previous Versions

📖 Introduction

📊 Leaderboard

🍄‍🟫 A Quick Example

🚀 Quick Start

Step 1: Environment Setup

1.1 Install Dependencies

1.2 Configure API Keys

Step 2: Run Evaluation

2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)

2.2 English & Multi-language Evaluation

⚙️ Model Requirements

💡 Hardware & API Options

📂 Evaluation Results

🙏 Contributors behind the scenes

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

👑 Meeseeks Benchmark

🚀 Latest News

📋 Previous Versions

📖 Introduction

📊 Leaderboard

🍄‍🟫 A Quick Example

🚀 Quick Start

Step 1: Environment Setup

1.1 Install Dependencies

1.2 Configure API Keys

Step 2: Run Evaluation

2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)

2.2 English & Multi-language Evaluation

⚙️ Model Requirements

💡 Hardware & API Options

📂 Evaluation Results

🙏 Contributors behind the scenes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages