Large Reasoning Models (LRMs) have achieved rapid progress, yet existing benchmarks (focused on mathematics, programming, or common-sense reasoning) suffer from poor long-context evaluation, weak difficulty control, ambiguous answers, and narrow coverage of reasoning paradigms.
GrAlgoBench introduces a benchmark of graph algorithm problems to evaluate LRMs. Graph tasks naturally provide:
- Effective long-context reasoning: graph descriptions induce long inputs, testing context scalability.
- Scalable difficulty control: complexity grows smoothly with graph size (8 to 160 nodes).
- Standardized evaluation: outputs are integers/nodes/edges, enabling exact and programmatic checking.
- Diverse reasoning paradigms: tasks span Enumeration, Exploration, and Intuition, mapping to brute-force, search, and greedy paradigms.
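The "exact and programmatic checking" property can be illustrated with a minimal answer checker. This is a hypothetical sketch, not the benchmark's actual parser: it assumes the model's final answer is the last integer in its response.

```python
import re

def check_integer_answer(response: str, gold: int) -> bool:
    """Hypothetical exact-match checker: extract the last integer in a
    model response and compare it to the gold answer. The benchmark's
    real evaluation code may use a different extraction rule."""
    matches = re.findall(r"-?\d+", response)
    return bool(matches) and int(matches[-1]) == gold

print(check_integer_answer("The MST weight is 42.", 42))  # True
print(check_integer_answer("Final answer: 41", 42))       # False
```

Because answers are integers, nodes, or edges rather than free-form text, such checks need no LLM judge and leave no room for grading ambiguity.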
Experiments on nine tasks across three categories uncover two major weaknesses of current LRMs:
- Performance collapse under long contexts: accuracy drops sharply as graph size or text length grows, due to step-by-step execution errors, weak memory, and redundant reasoning.
- Ineffective self-verification: models often engage in verbose self-checking that inflates reasoning traces but rarely improves correctness, becoming the main driver of over-thinking.
By addressing the shortcomings of prior benchmarks, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and application-relevant testbed for advancing the study of reasoning in LRMs.
GrAlgoBench/
├── data_generation/   # Scripts for dataset construction
├── Inference/         # Model inference scripts and configs
├── error_analysis/    # Scripts for analyzing model errors
├── overthinking/      # Overthinking analysis module
├── label/             # Response labeling and segmentation
├── judge/             # Segment effectiveness judgment
├── entropy_analysis/  # Token entropy analysis and wordclouds
├── logs/              # Default log directory
├── results/           # Saved inference results
├── scripts/           # Helper bash scripts for batch running
├── common_utils.py    # Shared utilities and configurations
└── README.md          # Documentation
conda create -n gralgobench python=3.10
conda activate gralgobench
pip install -r requirements.txt
Then build the datasets:
python ./data_generation/build_dataset.py
Datasets are organized as:
dataset/
├── MKC_easy.pkl
├── MKC_medium.pkl
├── MST_hard.pkl
└── ...
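Each `.pkl` file can be loaded with Python's standard `pickle` module. The record schema below is illustrative only (the real fields are defined by `data_generation/build_dataset.py`); this sketch writes a tiny stand-in file and reads it back the same way a real file such as `dataset/MST_hard.pkl` would be read.

```python
import pickle
from pathlib import Path

# Hypothetical schema: a list of problem records. The actual field
# names in GrAlgoBench's pickles may differ.
sample = [{"graph": [(0, 1, 3), (1, 2, 5)], "answer": 8}]

path = Path("demo_dataset.pkl")
path.write_bytes(pickle.dumps(sample))

with path.open("rb") as f:
    data = pickle.load(f)

print(len(data), data[0]["answer"])  # 1 8
```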
Single-task example:
python Inference/infer_open.py \
--LLM Qwen3-8B \
--task MST \
--difficulty medium \
--batch_size 32 \
--gpu_num 4
Batch execution via script:
bash scripts/infer_open.sh
To analyze model errors, follow these steps:
1. Reformat raw responses. This step parses and normalizes model outputs into a consistent structure:
python error_analysis/reformat.py
2. Run the error analysis script, specifying the task, evaluation model, and the model that generated the responses:
python error_analysis/error_analysis.py \
--task MST \
--llm gpt5_mini \
--response_generated_from_what_model Qwen3-8B
GrAlgoBench provides comprehensive analysis tools to understand model behavior and reasoning patterns across multiple dimensions:
Analyze model self-verification and reasoning redundancy patterns:
./overthinking/run_overthinking.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./overthinking/run_overthinking.sh 0,1 \
--LLM Qwen3-32B \
--task_type math_competition \
--batch_size 16
Segment and label model reasoning steps for fine-grained analysis:
./label/run_label.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./label/run_label.sh 0,1 \
--LLM Qwen3-32B \
--task_type math_competition \
--batch_size 16
Evaluate the effectiveness of individual reasoning segments:
./judge/run_judge.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./judge/run_judge.sh 0,1 \
--LLM Qwen3-32B \
--task_type math \
--batch_size 16
Analyze token-level uncertainty and model confidence patterns:
./entropy_analysis/run_entropy_analysis.sh infer 0,1 \
--LLM Qwen3-32B --task MaxDegree --difficulty easy
./entropy_analysis/run_entropy_analysis.sh analyze 0,1 \
--min_freq 40000 --top_k 100
./entropy_analysis/run_entropy_analysis.sh wordcloud 0,1
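Token-level uncertainty here refers to the Shannon entropy of each next-token distribution: low entropy means the model is confident, high entropy means it is uncertain. A minimal sketch of the computation (the pipeline's actual extraction from model logits may differ):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability
    distribution. Zero-probability tokens contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over 4 tokens is maximally uncertain (2 bits);
# a peaked distribution signals high model confidence (low entropy).
print(token_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(round(token_entropy([0.97, 0.01, 0.01, 0.01]), 2))  # 0.24
```

Averaging this quantity per token, or aggregating it over high-frequency words (as the `min_freq`/`top_k` options suggest), yields the confidence patterns visualized in the wordclouds.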