
[Model BadCase]: MiniMax-M2 can't reproduce the official performance on LiveCodeBench (LCB) and SWE-bench Verified #53

@lidoo233

Description


Basic Information - Models Used

MiniMax-M2

Information about environment and deployment

Environment

OS: Ubuntu 22.04.5 LTS
GPU: H200*8
Python: 3.12.11
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post1
sgl_kernel: 0.3.16.post4

The deployment command for Sglang

python3 -m sglang.launch_server --model-path /MiniMax-M2 \
        --tp-size 8 --ep-size 8 --tool-call-parser minimax-m2 --trust-remote-code \
        --reasoning-parser minimax-append-think --nnodes 1 --node-rank 0 \
        --host 0.0.0.0 --port 8000 --mem-fraction-static 0.85
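
For reproducibility, here is a minimal sanity check of the endpoint before running any benchmark. The host/port, sampling settings, and the `reasoning_content` field are assumptions based on SGLang's OpenAI-compatible server and the `minimax-append-think` reasoning parser, not the exact settings of the failing runs:

```python
# Hypothetical sanity check against the SGLang OpenAI-compatible endpoint
# launched above; host/port and sampling settings are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MiniMax-M2",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=1.0,
    max_tokens=2048,
)

msg = resp.choices[0].message
# With --reasoning-parser minimax-append-think, the thinking trace should be
# separated from the final answer; the field name below is an assumption.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
```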

Description

  1. LiveCodeBench (LCB)

In OpenCompass's LiveCodeBench (LCB) benchmark, the MiniMax-M2 model was evaluated independently 5 times but consistently failed to reach the official score of 83.

Test Results

| Subtask | Score |
| --- | --- |
| lcb_code_generation | 29.25 |
| lcb_code_execution | 42.38 |
| lcb_test_output | 88.46 |
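
For context, the model entry pointing OpenCompass at the SGLang endpoint looked roughly like the sketch below. `OpenAISDK` is OpenCompass's OpenAI-compatible wrapper; exact argument names can vary across OpenCompass versions, so treat every field here as an assumption rather than the config that produced the numbers above:

```python
# Sketch of an OpenCompass model entry for the served MiniMax-M2 endpoint.
# All values (abbr, URL, token limits, sampling) are illustrative assumptions.
from opencompass.models import OpenAISDK

models = [
    dict(
        abbr="minimax-m2-sglang",
        type=OpenAISDK,
        path="MiniMax-M2",                          # must match the served model name
        key="EMPTY",                                # SGLang accepts any key
        openai_api_base="http://localhost:8000/v1",
        max_out_len=8192,
        batch_size=8,
        temperature=1.0,
    ),
]
```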
  2. SWE-bench Verified

I evaluated the MiniMax-M2 model with the mini-swe-agent tool, allowing up to 350 steps per test case. The final score was 38, lower than the official result.
In multiple test cases the model generated unexpectedly formatted code, so the agent repeatedly tried to correct the output structure but ultimately failed. Some such examples are attached below; a sketch for quantifying these retries follows the list.

sphinx-doc__sphinx-9591.traj.json

sympy__sympy-20916.traj.json

django__django-13964.traj.json
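
To make the formatting failures easier to inspect, a rough script like the one below can count how often the agent had to ask for a reformatted answer in each trajectory. The `.traj.json` schema (a `messages` list of role/content dicts) and the `marker` string are assumptions about mini-swe-agent's output layout; adjust both to the actual files:

```python
# Rough counter for formatting-correction turns in a mini-swe-agent trajectory.
# The file schema and the "format" marker are assumptions, not a documented API.
import json
import sys


def count_format_retries(path: str, marker: str = "format") -> int:
    with open(path) as f:
        traj = json.load(f)
    # Accept either a top-level list of messages or a dict with a "messages" key.
    messages = traj.get("messages", []) if isinstance(traj, dict) else traj
    return sum(
        1
        for m in messages
        if isinstance(m, dict)
        and m.get("role") == "user"
        and marker in str(m.get("content", "")).lower()
    )


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {count_format_retries(path)} format-correction turns")
```

Run it as, e.g., `python count_retries.py django__django-13964.traj.json` to compare how often each trajectory looped on output formatting.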
