A two-part LLM benchmark study built on the same 100-question Slovak multiple-choice dataset. It covers mathematics, spatial and temporal reasoning, social reasoning, and trick questions. Originally conducted as part of a thesis, now replicated in 2026 using exclusively open-source models.
The repository keeps both a corrected post-submission Round 1 snapshot and a
sanitized archival copy of the originally submitted 2025 materials.
`2025-thesis` should be treated as the primary Round 1 directory for browsing
results and code, while `2025-thesis-archive` is included only for transparency,
preserving the originally submitted state before post-submission fixes and cleanup.
- Dataset
- Round 1 – thesis benchmark (2025)
- Results (2025 thesis benchmark)
- Round 2 – open-source benchmark (2026)
- Results (2026 open-source benchmark)
- Methodology notes
- API / inference settings
- Reproducibility
- Structure
💡 Reasoning models are indicated by a lightbulb icon.
100 multiple-choice questions in Slovak, used identically across both rounds to ensure comparability. The dataset draws from three sources:
- SimpleBench – a subset of questions from the public dataset, translated into Slovak and slightly modified
- Math Kangaroo-style problems – non-standard reasoning problems adapted from a Czech worksheet collection drawing on the international Math Kangaroo competition and related educational publications, including Počítejte s klokanem 1995–1999 (Prodos, ISBN 80-7230-068-7, ISBN 80-7230-077-6)
- Original questions – authored specifically for this benchmark
Each question in `dataset.json` includes a `category` and an `origin` field for full traceability.
The dataset covers five question categories:
| Category (English) | Count |
|---|---|
| Mathematics | 39 |
| Trick questions | 36 |
| Spatial reasoning | 13 |
| Temporal reasoning | 7 |
| Social reasoning | 5 |
The dataset is included in both `dataset.json` and `dataset.csv` for each round.
`dataset.json` is the canonical machine-readable source used by the benchmark scripts
and for reproducibility; `dataset.csv` is a human-readable export included for easier
browsing on GitHub or in spreadsheet tools.
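For illustration, a single `dataset.json` entry might look like the sketch below. Only the `category` and `origin` fields are described in this README; the remaining field names and the question text are placeholders invented for this example, not items from the actual dataset.

```json
{
  "id": 17,
  "category": "Mathematics",
  "origin": "Math Kangaroo-style",
  "question": "Koľko trojciferných čísel má ciferný súčet rovný 5?",
  "options": { "A": "10", "B": "12", "C": "15", "D": "20" },
  "answer": "C"
}
```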
All models in this round are proprietary, with the exception of the two
DeepSeek models, which are MIT licensed and have a reported size of
671B (37B active).
| Model | Creator | API Model ID |
|---|---|---|
| ChatGPT 4o | OpenAI | gpt-4o-2024-08-06 |
| ChatGPT o1 💡 | OpenAI | o1-2024-12-17 |
| Claude 3.7 Sonnet | Anthropic | claude-3-7-sonnet-20250219 |
| Claude 3.7 Sonnet Thinking 💡 | Anthropic | claude-3-7-sonnet-20250219 |
| DeepSeek-V3 | DeepSeek | deepseek-chat (V3-0324) |
| DeepSeek-R1 💡 | DeepSeek | deepseek-reasoner |
| Gemini 2.0 Flash | Google | gemini-2.0-flash |
| Gemini 2.5 Pro Experimental 💡 | Google | gemini-2.5-pro-exp-03-25 |
| Perplexity Sonar | Perplexity | sonar-pro |
| Perplexity Sonar Pro 💡 | Perplexity | sonar-reasoning-pro |
| Grok 3 | xAI | N/A (tested via web UI) |
| Grok 3 Thinking 💡 | xAI | N/A (tested via web UI) |
Note on Grok 3: xAI did not release an API for Grok 3 until more than 40 days after the model's launch. Both Grok 3 variants were therefore tested manually via the web UI, so their results are not directly comparable to the API-based runs.
💡 Reasoning models are indicated by a lightbulb icon.
| Model | Score |
|---|---|
| Gemini 2.5 Pro Experimental 💡 | 84/100 |
| Claude 3.7 Sonnet Thinking 💡 | 84/100 |
| ChatGPT o1 💡 | 84/100 |
| DeepSeek-R1 💡 | 83/100 |
| Grok 3 Thinking 💡 | 79/100 |
| Grok 3 | 74/100 |
| Claude 3.7 Sonnet | 72/100 |
| Perplexity Sonar Pro 💡 | 71/100 |
| DeepSeek-V3 | 70/100 |
| ChatGPT 4o | 68/100 |
| Gemini 2.0 Flash | 65/100 |
| Perplexity Sonar | 58/100 |
A per-question breakdown is available in `2025-thesis/results.csv`.
Replication of the original benchmark using open-source models only.
| Model | Creator | License | Parameters |
|---|---|---|---|
| DeepSeek V3.2 | DeepSeek | MIT | 685B (37B active) |
| DeepSeek V3.2 💡 | DeepSeek | MIT | 685B (37B active) |
| GLM-5 | Z.ai | MIT | 744B (40B active) |
| GLM-5 💡 | Z.ai | MIT | 744B (40B active) |
| Kimi K2.5 | Moonshot AI | Modified MIT | 1T (32B active) |
| Kimi K2.5 💡 | Moonshot AI | Modified MIT | 1T (32B active) |
| MiMo-V2-Flash 💡 | Xiaomi | MIT | 309B (15B active) |
| Mistral Large 3 | Mistral AI | Apache | 675B (41B active) |
| Mistral Small 4 | Mistral AI | Apache | 119B (6B active) |
| Mistral Small 4 💡 | Mistral AI | Apache | 119B (6B active) |
| Nemotron 3 Super | NVIDIA | MIT | 120B (12B active) |
| Nemotron 3 Super 💡 | NVIDIA | MIT | 120B (12B active) |
| Qwen 3.5 27B | Alibaba | Apache | 27B |
| Qwen 3.5 27B 💡 | Alibaba | Apache | 27B |
| Qwen 3.5 35B A3B | Alibaba | Apache | 35B (3B active) |
| Qwen 3.5 35B A3B 💡 | Alibaba | Apache | 35B (3B active) |
| Qwen 3.5 122B A10B | Alibaba | Apache | 122B (10B active) |
| Qwen 3.5 122B A10B 💡 | Alibaba | Apache | 122B (10B active) |
| Qwen 3.5 397B A17B | Alibaba | Apache | 397B (17B active) |
| Qwen 3.5 397B A17B 💡 | Alibaba | Apache | 397B (17B active) |
| Gemma 4 26B A4B | Google | Apache | 26B (4B active) |
| Gemma 4 26B A4B 💡 | Google | Apache | 26B (4B active) |
Note on Nemotron 3 Super: NVIDIA's official model card lists the supported languages as English, French, German, Italian, Japanese, Spanish, and Chinese. Despite Slovak not being listed, both Nemotron variants performed competitively in this benchmark, including ahead of some substantially larger models.
💡 Reasoning models are indicated by a lightbulb icon.
| Model | Score |
|---|---|
| Qwen 3.5 122B A10B 💡 | 88/100 |
| Qwen 3.5 27B 💡 | 88/100 |
| Qwen 3.5 397B A17B | 87/100 |
| GLM-5 💡 | 86/100 |
| Qwen 3.5 397B A17B 💡 | 85/100 |
| Qwen 3.5 122B A10B | 84/100 |
| DeepSeek V3.2 💡 | 82/100 |
| Qwen 3.5 35B A3B 💡 | 82/100 |
| Qwen 3.5 27B | 81/100 |
| Kimi K2.5 💡 | 81/92* |
| Kimi K2.5 | 79/100 |
| DeepSeek V3.2 | 79/100 |
| MiMo-V2-Flash 💡 | 78/100 |
| Qwen 3.5 35B A3B | 77/100 |
| Nemotron 3 Super 💡 | 75/100 |
| Gemma 4 26B A4B 💡 | 74/100 |
| Mistral Small 4 💡 | 73/100 |
| Nemotron 3 Super | 72/100 |
| Gemma 4 26B A4B | 70/100 |
| GLM-5 | 67/100 |
| Mistral Large 3 | 64/100 |
| Mistral Small 4 | 50/100 |
A per-question breakdown is available in `2026-oss/results.csv`.
*Moonshot AI had by far the least reliable API infrastructure in this benchmark.
Kimi K2.5 💡 still had 8 unresolved questions after 4 helper-script retry runs spread across several hours, with repeated `Request timed out` and `engine_overloaded_error` failures. Its score is therefore reported out of the 92 questions that completed.
- Same 100-question dataset used in both rounds
- Both rounds conducted entirely in Slovak – every question is presented to the model in Slovak
- One API call per question for provider-hosted models
- Gemma 4 runs were benchmarked locally, using one independent inference run per question
- No system prompt used in either round – models received a single user prompt per question, with no additional system instruction layer
- Zero-shot evaluation – one run per question, no examples or additional context provided
- `top_p` is never set explicitly – it is left at each provider's default, since common API guidance recommends adjusting either `temperature` or `top_p`, but not both simultaneously
- Results stored as JSON per model, summarized in `results.csv`
On zero-shot evaluation: Round 1 used a single zero-shot run per question – what the thesis described as the most "bare" view of model performance. A more robust approach (used by benchmarks like SimpleBench) averages results across 5 runs or uses majority voting across 5 attempts. Round 2 intentionally preserves the single zero-shot run to maintain direct comparability with Round 1, with the understanding that this is a known limitation of both rounds. The setup is deliberately lean and transparent rather than built around heavyweight benchmark infrastructure.
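As a concrete illustration of the single-call setup, the sketch below shows roughly how one provider-hosted question could be sent: a single user message, no system prompt, one attempt. It uses the OpenAI Python client and the Round 1 ChatGPT 4o model ID as an example; the dataset-loading assumption (a top-level list of question objects), the field names, and the helper function are placeholders for this sketch, not the repository's actual runner code.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_single_question(item: dict) -> str:
    """One zero-shot API call per question: a single user message, no system prompt."""
    prompt = f"{item['question']}\n" + "\n".join(
        f"{letter}) {text}" for letter, text in item["options"].items()
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # Round 1 model ID from the table below
        temperature=1.0,            # top_p left at the provider default
        messages=[{"role": "user", "content": prompt}],  # no system message
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("dataset.json", encoding="utf-8") as f:
        dataset = json.load(f)
    print(ask_single_question(dataset[0]))
```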
This section lists the API parameters used for provider-hosted runs and the local MLX inference settings used for Gemma 4.
Settings used per model during thesis testing:
| Model | Model ID | temperature | max_tokens |
|---|---|---|---|
| ChatGPT 4o | `gpt-4o-2024-08-06` | `1.0` | default |
| ChatGPT o1 💡 | `o1-2024-12-17` | `1.0` | default |
| Claude 3.7 Sonnet | `claude-3-7-sonnet-20250219` | `1.0` | `1024` |
| Claude 3.7 Sonnet Thinking 💡 | `claude-3-7-sonnet-20250219` | `1.0` | `8000` |
| DeepSeek-V3 | `deepseek-chat` | `1.0` | default |
| DeepSeek-R1 💡 | `deepseek-reasoner` | `1.0` | default |
| Gemini 2.0 Flash | `gemini-2.0-flash` | `1.0` | default |
| Gemini 2.5 Pro Experimental 💡 | `gemini-2.5-pro-exp-03-25` | `1.0` | default |
| Perplexity Sonar | `sonar-pro` | `1.0` | default |
| Perplexity Sonar Pro 💡 | `sonar-reasoning-pro` | `1.0` | default |
| Grok 3 | N/A (tested via web UI) | – | – |
| Grok 3 Thinking 💡 | N/A (tested via web UI) | – | – |
- Claude 3.7 Sonnet Thinking 💡 used extended thinking with `budget_tokens: 6000`.
- Gemini 2.0 Flash and Gemini 2.5 Pro Experimental 💡 were tested via the `google-generativeai` SDK, with generation settings left at their defaults.
- Grok 3 and Grok 3 Thinking 💡 were tested manually via the web UI because xAI did not provide an API for more than 40 days after release.
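For reference, a minimal sketch of what the extended-thinking call for Claude 3.7 Sonnet Thinking could look like with the Anthropic Python SDK, using the `max_tokens` and `budget_tokens` values from the table and notes above. The prompt variable is a placeholder; this is an illustrative sketch, not the repository's actual request code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = "..."  # one Slovak question plus its answer options

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8000,                                      # from the settings table
    temperature=1.0,                                      # required when thinking is enabled
    thinking={"type": "enabled", "budget_tokens": 6000},  # extended thinking budget
    messages=[{"role": "user", "content": prompt}],       # no system prompt
)

# The reply contains thinking blocks followed by the final text block.
final_text = next(block.text for block in response.content if block.type == "text")
print(final_text)
```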
| Model | Model ID | temperature | max_tokens |
|---|---|---|---|
| DeepSeek-V3.2 | `deepseek-chat` | `1.0` | default |
| DeepSeek-V3.2 💡 | `deepseek-reasoner` | not supported | default |
| GLM-5 | `glm-5` | `1.0` | default |
| GLM-5 💡 | `glm-5` | `1.0` | default |
| Kimi K2.5 | `kimi-k2.5` | `0.6` | default |
| Kimi K2.5 💡 | `kimi-k2.5` | `1.0` | default |
| MiMo-V2-Flash 💡 | `mimo-v2-flash` | `0.8` | default |
| Mistral Large 3 | `mistral-large-2512` | `0.3` | default |
| Mistral Small 4 | `mistral-small-2603` | `0.3` | default |
| Mistral Small 4 💡 | `mistral-small-2603` | `0.7` | default |
| Nemotron 3 Super | `nvidia/nemotron-3-super-120b-a12b` | `1.0` | `16000` |
| Nemotron 3 Super 💡 | `nvidia/nemotron-3-super-120b-a12b` | `1.0` | `16000` |
| Qwen 3.5 27B | `qwen3.5-27b` | `0.7` | default |
| Qwen 3.5 27B 💡 | `qwen3.5-27b` | `1.0` | default |
| Qwen 3.5 35B A3B | `qwen3.5-35b-a3b` | `0.7` | default |
| Qwen 3.5 35B A3B 💡 | `qwen3.5-35b-a3b` | `1.0` | default |
| Qwen 3.5 122B A10B | `qwen3.5-122b-a10b` | `0.7` | default |
| Qwen 3.5 122B A10B 💡 | `qwen3.5-122b-a10b` | `1.0` | default |
| Qwen 3.5 397B A17B | `qwen3.5-397b-a17b` | `0.7` | default |
| Qwen 3.5 397B A17B 💡 | `qwen3.5-397b-a17b` | `0.6` | default |
| Gemma 4 26B A4B | `mlx-community/gemma-4-26b-a4b-it-4bit` | `1.0` | `8192` |
| Gemma 4 26B A4B 💡 | `mlx-community/gemma-4-26b-a4b-it-4bit` | `1.0` | `32768` |
Where provider documentation or model cards specified recommended temperature
values, those recommendations were followed.
- DeepSeek-V3.2 💡 does not support `temperature`.
- GLM-5 and Kimi K2.5 use explicit `thinking` on/off switches for standard vs. reasoning runs.
- MiMo-V2-Flash 💡 enables thinking via `chat_template_kwargs`.
- Mistral Small 4 💡 sets `reasoning_effort="high"`.
- Nemotron 3 Super uses explicit `enable_thinking` on/off control; the thinking variant also sets `reasoning_budget: 4096`.
- Qwen 3.5 models use hybrid thinking mode by default, so standard runs explicitly disable thinking and reasoning runs explicitly enable it (see the sketch after this list).
- Gemma 4 26B A4B and Gemma 4 26B A4B 💡 were benchmarked locally via `mlx-vlm 0.4.4` on a MacBook Pro with an M4 Pro chip and 24 GB unified memory, using the `mlx-community/gemma-4-26b-a4b-it-4bit` snapshot `8bcfa0de037c2b1bfa323a1e8d1f0132243b9e87`.
- The Gemma 4 reasoning run used `enable_thinking=True` with no explicit `thinking_budget`.
- Gemma 4 result files also include runtime telemetry such as token counts, throughput, and peak memory usage.
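The sketch below illustrates how the standard vs. reasoning toggles listed above might be passed through an OpenAI-compatible client via `extra_body`. The endpoint URL is a placeholder, the exact parameter names and where each provider expects them are assumptions based only on the switches named in this list, and the snippet is not the repository's actual runner.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; the base_url and key differ per provider.
client = OpenAI(base_url="https://example-provider/v1", api_key="...")

def build_request(model_key: str, prompt: str, reasoning: bool) -> dict:
    """Assemble a request whose thinking switch mirrors the per-model notes above."""
    request = {
        "model": model_key,
        "messages": [{"role": "user", "content": prompt}],  # single user prompt, no system message
        "extra_body": {},
    }
    if model_key.startswith("nvidia/nemotron"):
        # Explicit enable_thinking on/off; the thinking variant also gets a reasoning budget.
        request["extra_body"]["enable_thinking"] = reasoning
        if reasoning:
            request["extra_body"]["reasoning_budget"] = 4096
    elif model_key == "mimo-v2-flash":
        # Thinking toggled through chat template arguments.
        request["extra_body"]["chat_template_kwargs"] = {"enable_thinking": reasoning}
    elif model_key == "mistral-small-2603" and reasoning:
        request["extra_body"]["reasoning_effort"] = "high"
    elif model_key.startswith("qwen3.5"):
        # Hybrid thinking mode: explicitly disabled for standard runs, enabled for reasoning runs.
        request["extra_body"]["enable_thinking"] = reasoning
    return request

# Example: a reasoning run for one of the Qwen 3.5 models.
response = client.chat.completions.create(**build_request("qwen3.5-27b", "...", reasoning=True))
print(response.choices[0].message.content)
```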
The actively maintained runner in this repository is the open-source Round 2
benchmark. The `2025-thesis` and `2025-thesis-archive` directories are kept
mainly as historical Round 1 materials.
Requires Python 3.10+.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .env
```

Then add the required provider API keys to `.env`.
Gemma 4 runs were executed locally via `mlx-vlm 0.4.4` on a MacBook Pro with an M4 Pro chip.
List available models:

```bash
cd 2026-oss
python run_benchmark.py --list
```

Run a single model:

```bash
cd 2026-oss
python run_benchmark.py --model <model-key>
```

Run the full suite:

```bash
cd 2026-oss
python run_benchmark.py --all
```

Outputs are written to the Round 2 `results/` directory as one JSON file per model.
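As a rough illustration only, a per-model results file might resemble the following. The README only states that results are stored as JSON per model and summarized in `results.csv`, so every field name below is an assumption made for this sketch rather than the actual schema.

```json
{
  "model": "qwen3.5-27b",
  "reasoning": true,
  "score": 88,
  "total": 100,
  "answers": [
    { "id": 1, "category": "Mathematics", "predicted": "C", "correct": "C", "is_correct": true }
  ]
}
```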
- `2025-thesis-archive` – historical archive of the original 2025 materials
- `2025-thesis` – corrected Round 1 benchmark
- `2026-oss` – open-source Round 2 benchmark
Each round directory contains the dataset, raw model outputs, benchmark code, and a score summary.