In OpenCompass's LiveCodeBench (LCB) benchmark, I evaluated the MiniMax-M2 model in 5 independent runs, and it consistently failed to reach the official score of 83.
I also evaluated the MiniMax-M2 model using the mini-swe-agent tool, with 350 steps per test case. The final score was 38, which is lower than the official score.
In multiple test cases, the model generated unexpectedly formatted code, so the agent had to repeatedly ask it to correct the output structure, and those cases ultimately failed. Some examples are attached.
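
To make the formatting failure easier to reproduce, here is a minimal sketch of the kind of check an agent loop might apply to each model response. It assumes the agent expects exactly one fenced `bash` block per response; the pattern and the helper name `extract_action` are hypothetical and are not taken from mini-swe-agent's source.

```python
import re
from typing import Optional

# Assumed format: exactly one fenced block tagged `bash` per response.
# This pattern is illustrative only, not the tool's actual implementation.
ACTION_PATTERN = re.compile(r"```bash\s*\n(.*?)\n```", re.DOTALL)


def extract_action(response: str) -> Optional[str]:
    """Return the single command in the response, or None if the format is off."""
    matches = ACTION_PATTERN.findall(response)
    if len(matches) != 1:
        # Zero blocks or several blocks: the agent would have to ask the
        # model to resend its answer, which consumes one of the allowed steps.
        return None
    return matches[0].strip()
```

Running a check like this over the attached transcripts and counting the `None` results would show how many of the 350 steps were spent on format corrections rather than on the task itself.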