
[Model BadCase]: MiniMax-M2 can't reproduce the official performance on LiveCodeBench (LCB) and SWE-bench Verified #53

@lidoo233

Description


Basic Information - Models Used

MiniMax-M2

Information about environment and deployment

Environment

OS: Ubuntu 22.04.5 LTS
GPU: H200*8
Python: 3.12.11
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post1
sgl_kernel: 0.3.16.post4

The deployment command for Sglang

python3 -m sglang.launch_server --model-path /MiniMax-M2 \
        --tp-size 8 --ep-size 8 --tool-call-parser minimax-m2 --trust-remote-code \
        --reasoning-parser minimax-append-think --nnodes 1 --node-rank 0 \
        --host 0.0.0.0 --port 8000 --mem-fraction-static 0.85
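
For reproducibility, here is a minimal sanity check of the endpoint before running any benchmark. The host/port, sampling settings, and the `reasoning_content` field are assumptions based on SGLang's OpenAI-compatible server and the `minimax-append-think` reasoning parser, not the exact settings of the failing runs:

```python
# Hypothetical sanity check against the SGLang OpenAI-compatible endpoint
# launched above; host/port and sampling settings are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MiniMax-M2",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=1.0,
    max_tokens=2048,
)

msg = resp.choices[0].message
# With --reasoning-parser minimax-append-think, the thinking trace should be
# separated from the final answer; the field name below is an assumption.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
```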

Description

  1. LiveCodeBench (LCB)

In OpenCompass's LiveCodeBench (LCB) benchmark, the MiniMax-M2 model was evaluated independently 5 times but consistently failed to reach the official score of 83.

Test Results

| Subtask | Score |
| --- | --- |
| lcb_code_generation | 29.25 |
| lcb_code_execution | 42.38 |
| lcb_test_output | 88.46 |
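
For context, the model entry pointing OpenCompass at the SGLang endpoint looked roughly like the sketch below. `OpenAISDK` is OpenCompass's OpenAI-compatible wrapper; exact argument names can vary across OpenCompass versions, so treat every field here as an assumption rather than the config that produced the numbers above:

```python
# Sketch of an OpenCompass model entry for the served MiniMax-M2 endpoint.
# All values (abbr, URL, token limits, sampling) are illustrative assumptions.
from opencompass.models import OpenAISDK

models = [
    dict(
        abbr="minimax-m2-sglang",
        type=OpenAISDK,
        path="MiniMax-M2",                          # must match the served model name
        key="EMPTY",                                # SGLang accepts any key
        openai_api_base="http://localhost:8000/v1",
        max_out_len=8192,
        batch_size=8,
        temperature=1.0,
    ),
]
```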
  2. SWE-bench Verified

I evaluated the MiniMax-M2 model with the mini-swe-agent tool, allowing up to 350 steps per test case. The final score was 38, lower than the official result.
In multiple test cases the model generated unexpectedly formatted code, so the agent repeatedly tried to correct the output structure but ultimately failed. Some such examples are attached below; a sketch for quantifying these retries follows the list.

sphinx-doc__sphinx-9591.traj.json

sympy__sympy-20916.traj.json

django__django-13964.traj.json
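
To make the formatting failures easier to inspect, a rough script like the one below can count how often the agent had to ask for a reformatted answer in each trajectory. The `.traj.json` schema (a `messages` list of role/content dicts) and the `marker` string are assumptions about mini-swe-agent's output layout; adjust both to the actual files:

```python
# Rough counter for formatting-correction turns in a mini-swe-agent trajectory.
# The file schema and the "format" marker are assumptions, not a documented API.
import json
import sys


def count_format_retries(path: str, marker: str = "format") -> int:
    with open(path) as f:
        traj = json.load(f)
    # Accept either a top-level list of messages or a dict with a "messages" key.
    messages = traj.get("messages", []) if isinstance(traj, dict) else traj
    return sum(
        1
        for m in messages
        if isinstance(m, dict)
        and m.get("role") == "user"
        and marker in str(m.get("content", "")).lower()
    )


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {count_format_retries(path)} format-correction turns")
```

Run it as, e.g., `python count_retries.py django__django-13964.traj.json` to compare how often each trajectory looped on output formatting.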
