Skip to content

meituan-longcat/Meeseeks

Repository files navigation

👑 Meeseeks Benchmark

Meeseeks Logo

🚀 Latest News

We officially released the multilingual version of Meeseeks!

📋 Previous Versions

Temporarily removed for arr submitting

📖 Introduction

Meeseeks is an instruction-following benchmark designed to evaluate how well models can adhere to user instructions in a multi-turn scenario.
A key feature of Meeseeks is its self-correction loop, where models receive structured feedback and must refine their responses accordingly.

This benchmark provides a realistic evaluation of a model’s adaptability, instruction adherence, and iterative improvement.


📊 Leaderboard

leaderboard


🍄‍🟫 A Quick Example

ROUND1-Input Evaluation Content Capability tags
Generate 32 colloquial user comments and 40 formal user comments from a consumer perspective in short video comment sections. Each comment should be exactly 7 characters long and must not contain the following words:["this", "good", "that"] Whether 32 colloquial user comments were generated Element number requirement
Whether 40 formal user comments were generated Element number requirement
Whether all comments are exactly 7 characters Generate in 0∼10 words、Generate at accurate word number
Whether comments are non-repetitive Generate repeat/non-repeat content
Whether comments do not contain forbidden words: ["this", "good", "that"] Generate with certain keywords
💡 Let's activate multi-round mode!
ROUND2 - Input (if ROUND1 model output fails to meet requirement: "Whether all comments are exactly 7 characters")
Your response has the following issues: Whether all comments are exactly 7 characters: ❌ Content character count does not match range[7, 7] [mom prouds of you] character count: 4 Please provide your corrected response based on this information. Note: Only output the answer, do not output additional information.
ROUND3 - Input ...
...

🚀 Quick Start

Step 1: Environment Setup

1.1 Install Dependencies

Run the automated installation script:

bash install_deps.sh

This script will:

  • Detect your Python version (3.9 or 3.10+)
  • Install all required dependencies
  • Resolve version conflicts automatically
  • Install language-specific NLP libraries (Chinese, Japanese, Korean, Arabic, German, French, etc.)

Requirements: Python 3.9+ (Python 3.10+ recommended)

1.2 Configure API Keys

Create a .env file in the project root with your API configurations:

# Qwen API Configuration (Extract Model)
QWEN_API_KEY=your_api_key_here
QWEN_BASE_URL=your_api_base_url_here
QWEN_MODEL=your_model_name_here

# Qwen Coder API Configuration (Score Model)
QWEN_CODER_API_KEY=your_api_key_here
QWEN_CODER_BASE_URL=your_api_base_url_here
QWEN_CODER_MODEL=your_model_name_here

# Tested Model API Configuration (Model Under Evaluation)
TESTED_MODEL_API_KEY=your_api_key_here
TESTED_MODEL_BASE_URL=your_api_base_url_here
TESTED_MODEL_NAME=your_model_name_here

💡 Tip: All three models support OpenAI-compatible API format. You can use the same model for all three roles if needed.


Step 2: Run Evaluation

2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)

Run evaluation for all Asia languages:

python default_run_asia.py

Or filter specific languages:

# Evaluate only Chinese data
python default_run_asia.py --chinese

# Evaluate only Japanese data
python default_run_asia.py --japanese

# Evaluate only Korean data
python default_run_asia.py --korean

# Combine multiple languages
python default_run_asia.py --chinese --japanese

2.2 English & Multi-language Evaluation

Run evaluation for all supported languages:

python default_run_eng.py

Or filter specific languages:

# Evaluate only English data
python default_run_eng.py --english

# Evaluate only German data
python default_run_eng.py --german

# Evaluate other languages
python default_run_eng.py --french    # French
python default_run_eng.py --spanish   # Spanish
python default_run_eng.py --portuguese # Portuguese
python default_run_eng.py --russian   # Russian
python default_run_eng.py --arabic    # Arabic
python default_run_eng.py --indonesian # Indonesian

# Combine multiple languages
python default_run_eng.py --english --german --french

⚙️ Model Requirements

Before running any evaluation, you need to configure three model APIs:

  1. Tested Model (TESTED_MODEL_* in .env)

    • The model you want to evaluate
    • Must support OpenAI-compatible Chat Completions API
  2. Extract Model (QWEN_* in .env)

    • Recommended: Qwen2.5-Coder-32B-Instruct
    • Used to extract structured outputs from model responses
    • Requires strong code generation and structure understanding
  3. Score Model (QWEN_CODER_* in .env)

    • Recommended: Qwen2.5-32B-Instruct
    • Used to evaluate and score the extracted results
    • Requires strong reasoning and judgment capabilities

💡 Hardware & API Options

  • If you have a GPU:
    Deploy open-source Qwen2.5 series models locally using vLLM, TGI, or similar frameworks.

  • If you don't have a GPU:
    Use commercial APIs instead:

    • Highly recommended: Claude 3.7 Sonnet or GPT-4
    • Any OpenAI-compatible API endpoint will work

📂 Evaluation Results

Results will be automatically saved to:

  • Asia languages: evaluation_results_asia/
  • English & others: evaluation_results_english/

Each directory contains:

  • round_1.json, round_2.json: Detailed evaluation results per round
  • round_1_stats.json, round_2_stats.json: Statistical summaries
  • Structured logs and scoring information for analysis

🙏 Contributors behind the scenes

Temporarily removed for arr submitting

About

A iterative feedback driven benchmark on LLM's instruction following ability

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors