We officially released the multilingual version of Meeseeks!
Temporarily removed for arr submitting
Meeseeks is an instruction-following benchmark designed to evaluate how well models can adhere to user instructions in a multi-turn scenario.
A key feature of Meeseeks is its self-correction loop, where models receive structured feedback and must refine their responses accordingly.
This benchmark provides a realistic evaluation of a model’s adaptability, instruction adherence, and iterative improvement.
| ROUND1-Input | Evaluation Content | Capability tags |
|---|---|---|
| Generate 32 colloquial user comments and 40 formal user comments from a consumer perspective in short video comment sections. Each comment should be exactly 7 characters long and must not contain the following words:["this", "good", "that"] | Whether 32 colloquial user comments were generated | Element number requirement |
| Whether 40 formal user comments were generated | Element number requirement | |
| Whether all comments are exactly 7 characters | Generate in 0∼10 words、Generate at accurate word number | |
| Whether comments are non-repetitive | Generate repeat/non-repeat content | |
| Whether comments do not contain forbidden words: ["this", "good", "that"] | Generate with certain keywords | |
| 💡 Let's activate multi-round mode! | ||
| ROUND2 - Input (if ROUND1 model output fails to meet requirement: "Whether all comments are exactly 7 characters") | ||
| Your response has the following issues: Whether all comments are exactly 7 characters: ❌ Content character count does not match range[7, 7] [mom prouds of you] character count: 4 Please provide your corrected response based on this information. Note: Only output the answer, do not output additional information. | ||
| ROUND3 - Input ... | ||
| ... | ||
Run the automated installation script:
bash install_deps.shThis script will:
- Detect your Python version (3.9 or 3.10+)
- Install all required dependencies
- Resolve version conflicts automatically
- Install language-specific NLP libraries (Chinese, Japanese, Korean, Arabic, German, French, etc.)
Requirements: Python 3.9+ (Python 3.10+ recommended)
Create a .env file in the project root with your API configurations:
# Qwen API Configuration (Extract Model)
QWEN_API_KEY=your_api_key_here
QWEN_BASE_URL=your_api_base_url_here
QWEN_MODEL=your_model_name_here
# Qwen Coder API Configuration (Score Model)
QWEN_CODER_API_KEY=your_api_key_here
QWEN_CODER_BASE_URL=your_api_base_url_here
QWEN_CODER_MODEL=your_model_name_here
# Tested Model API Configuration (Model Under Evaluation)
TESTED_MODEL_API_KEY=your_api_key_here
TESTED_MODEL_BASE_URL=your_api_base_url_here
TESTED_MODEL_NAME=your_model_name_here💡 Tip: All three models support OpenAI-compatible API format. You can use the same model for all three roles if needed.
Run evaluation for all Asia languages:
python default_run_asia.pyOr filter specific languages:
# Evaluate only Chinese data
python default_run_asia.py --chinese
# Evaluate only Japanese data
python default_run_asia.py --japanese
# Evaluate only Korean data
python default_run_asia.py --korean
# Combine multiple languages
python default_run_asia.py --chinese --japaneseRun evaluation for all supported languages:
python default_run_eng.pyOr filter specific languages:
# Evaluate only English data
python default_run_eng.py --english
# Evaluate only German data
python default_run_eng.py --german
# Evaluate other languages
python default_run_eng.py --french # French
python default_run_eng.py --spanish # Spanish
python default_run_eng.py --portuguese # Portuguese
python default_run_eng.py --russian # Russian
python default_run_eng.py --arabic # Arabic
python default_run_eng.py --indonesian # Indonesian
# Combine multiple languages
python default_run_eng.py --english --german --frenchBefore running any evaluation, you need to configure three model APIs:
-
Tested Model (
TESTED_MODEL_*in.env)- The model you want to evaluate
- Must support OpenAI-compatible Chat Completions API
-
Extract Model (
QWEN_*in.env)- Recommended: Qwen2.5-Coder-32B-Instruct
- Used to extract structured outputs from model responses
- Requires strong code generation and structure understanding
-
Score Model (
QWEN_CODER_*in.env)- Recommended: Qwen2.5-32B-Instruct
- Used to evaluate and score the extracted results
- Requires strong reasoning and judgment capabilities
-
If you have a GPU:
Deploy open-source Qwen2.5 series models locally using vLLM, TGI, or similar frameworks. -
If you don't have a GPU:
Use commercial APIs instead:- ✅ Highly recommended: Claude 3.7 Sonnet or GPT-4
- Any OpenAI-compatible API endpoint will work
Results will be automatically saved to:
- Asia languages:
evaluation_results_asia/ - English & others:
evaluation_results_english/
Each directory contains:
round_1.json,round_2.json: Detailed evaluation results per roundround_1_stats.json,round_2_stats.json: Statistical summaries- Structured logs and scoring information for analysis
Temporarily removed for arr submitting
