reasoning-evaluation

Here are 4 public repositories matching this topic...

IAAR-Shanghai / GuessArena

[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

benchmark openai evaluation-framework large-language-models chatgpt llm-eval qwen deepseek knowledge-evaluation reliable-evaluation gamearena guessarena domain-specific-eval reasoning-evaluation

Updated Nov 15, 2025
Python

ejentum / benchmarks

Benchmark methodology, task sets, and evaluation results for RA²R

benchmarks elephant bbh musr scicode agentic-ai livecodebench reasoning-evaluation arc-agi-3 llm-benchmarks ejentum ra2r causalbench

Updated Jun 1, 2026
Python

MSc Thesis Project – Framework for Causality-Aware Structured Multi-Step Reasoning in Legal Argument Generation – AI Systems Engineering, supervised by Prof. R. Pietrantuono and PhD Cristian Mascia (2026)

neo4j reactjs argument-generation legal-ai causal-reasoning llm react-agents agentic-ai groq-cloud reasoning-evaluation

Updated May 30, 2026
Python

Martin123132 / The-Marked-Bench-

The Marked Bench: a versioned contradiction-detection benchmark for AI reasoning evaluation.

leaderboard schema-validation ai-safety reasoning contradiction-detection multihop-reasoning ai-evaluation result-card explanation-evaluation ai-benchmark reasoning-evaluation benchmark-submissions external-submissions benchmark-governance submission-evidence conformance-report benchmark-standard

Updated May 31, 2026
Python
