This repository contains three sub-projects that create a benchmark dataset for GemBox Software components and evaluate LLMs on that generated dataset:
- Project 1: Inputs (C#) - Contains raw question-answer pairs about common GemBox usage tasks (e.g. printing options, reading Excel files, etc.).
- Project 2: Bench Filter (Python) - Filters "1-inputs" into a benchmark dataset (JSONL format); for an example, see GBS-benchmark at HuggingFace.
- Project 3: Benchmark Direct (Python) - Uses the dataset to evaluate LLMs on accuracy, speed and cost when answering GemBox API questions.
Requirements:
- Visual Studio Code - official download
- C# 13.0 / .NET 9.0 SDK - The easiest way to install it is via the VS Code C# extension.
- GemBox.Spreadsheet and dependencies - If VS Code does not install them automatically when opening the workspace, get v2025.9.107 via NuGet:
dotnet add package GemBox.Spreadsheet --version 2025.9.107
dotnet add package HarfBuzzSharp.NativeAssets.Linux
dotnet add package SkiaSharp.NativeAssets.Linux.NoDependencies
Next steps:
- Git clone:
git clone https://github.com/ZSvedic/GemBox-benchmark
- For the Python project, use the uv package manager to install dependencies:
cd GemBox-benchmark/3-benchmark-llm/ # Go to the Python project.
uv venv --python 3.10 # Env with py 3.10 or newer.
source .venv/bin/activate # For Linux/MacOS.
uv sync # Install dependencies.
cd .. # Go back to the root.
- Create an ".env" file in the root ("GemBox-benchmark" folder) with your API keys. If you only use OpenRouter, only OPENROUTER_API_KEY is needed. Example:
OPENROUTER_API_KEY=...
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
MISTRAL_API_KEY=...
- Open in VS Code:
code GB-benchmark.code-workspace
VS Code should show a "There are unresolved dependencies" popup on first open. Select "Restore" to install all .NET dependencies.
- The VS Code "Run and Debug" tab should now have run configurations for each of the subprojects below. Alternatively, you can run each project from the CLI.
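Before running the Python projects, it can help to confirm which API keys the ".env" file from the steps above actually defines. The following is a minimal sketch, not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, and it assumes the file uses plain KEY=VALUE lines as in the example.

```python
from pathlib import Path

# Key names copied from the ".env" example; everything else here is illustrative.
PROVIDER_KEYS = [
    "OPENROUTER_API_KEY",
    "GOOGLE_API_KEY",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "MISTRAL_API_KEY",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required=PROVIDER_KEYS) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to be run from the repository root
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("Missing keys:", missing_keys(env) or "none")
```

If only OpenRouter is used, calling `missing_keys(env, ["OPENROUTER_API_KEY"])` limits the check to that single key.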
This C# project contains the Q&A data for the dataset. It uses the GemBox.Spreadsheet library to enumerate typical tasks. The comments before each task contain a single "Question:" line and one or more "Mask:" lines, one for each part of the code answer. Each mask specifies a regex that hides the matching part of the following code line before the LLM is asked to fill it in.
...
// Question: How do you enable printing of row and column headings?
// Mask: \bPrintOptions\.PrintHeadings\b
// Mask: \btrue\b
worksheet.PrintOptions.PrintHeadings = true;
// Question: How do you set the worksheet to print in landscape orientation?
// Mask: \bPrintOptions\.Portrait\b
// Mask: \bfalse\b
worksheet.PrintOptions.Portrait = false;
...
This Python project filters .cs files from "1-inputs" to extract Q&A into a benchmark dataset. Each dataset row contains:
- category (from the .cs file name),
- question (EN language query),
- masked_code (code snippet with ??? placeholders),
- answers (correct text to fill ??? placeholders).
Run run.sh to use the default input path and to log the output.
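The extraction step can be illustrated roughly as follows. This is a simplified sketch under stated assumptions, not the project's actual filter: the function name `extract_rows` is hypothetical, each question is assumed to be followed by exactly one code line, masks are applied in the order they are listed, and a mask that does not match is silently skipped.

```python
import json
import re

QUESTION_RE = re.compile(r"^\s*//\s*Question:\s*(.+)$")
MASK_RE = re.compile(r"^\s*//\s*Mask:\s*(.+)$")


def extract_rows(cs_source: str, category: str) -> list[dict]:
    """Turn Question/Mask comments plus the next code line into dataset rows."""
    rows = []
    question, masks = None, []
    for line in cs_source.splitlines():
        if m := QUESTION_RE.match(line):
            question, masks = m.group(1).strip(), []
        elif m := MASK_RE.match(line):
            masks.append(m.group(1).strip())
        elif question and masks and line.strip():
            code, answers = line.strip(), []
            for pattern in masks:
                found = re.search(pattern, code)
                if found is None:
                    continue  # assumption: ignore masks that do not match
                answers.append(found.group(0))
                code = code[:found.start()] + "???" + code[found.end():]
            rows.append({"category": category, "question": question,
                         "masked_code": code, "answers": answers})
            question, masks = None, []
    return rows


sample = r"""
// Question: How do you enable printing of row and column headings?
// Mask: \bPrintOptions\.PrintHeadings\b
// Mask: \btrue\b
worksheet.PrintOptions.PrintHeadings = true;
"""

for row in extract_rows(sample, category="PrintView"):
    print(json.dumps(row))  # one JSON object per line (JSONL)
```

Because answers are collected in mask order, this sketch only lines up with the placeholders when the masks are listed in the same left-to-right order as they appear in the code line.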
Example JSONL dataset
...
{"category": "PrintView", "question": "How do you enable printing of row and column headings?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.PrintHeadings", "true"]}
{"category": "PrintView", "question": "How do you set the worksheet to print in landscape orientation?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.Portrait", "false"]}
...
This Python project uses the dataset to run LLM evaluations. It supports OpenAI, Google, and many other providers via OpenRouter. Each model is asked to fill in ??? placeholders, and the outputs are validated. The evaluation measures error rate, speed, and cost.
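As a rough illustration of the validation step, the sketch below compares a model's answers with the expected ones and derives an error rate. It assumes an exact, order-sensitive string match after trimming whitespace; the project's real validator may be more lenient, and the names `is_correct` and `error_rate` are hypothetical.

```python
def is_correct(model_answers: list[str], expected: list[str]) -> bool:
    """Exact, order-sensitive match after trimming whitespace (assumption)."""
    return [a.strip() for a in model_answers] == [e.strip() for e in expected]


def error_rate(results: list[bool]) -> float:
    """Fraction of questions answered incorrectly (0.0 for an empty list)."""
    if not results:
        return 0.0
    return results.count(False) / len(results)


# The two answers shown in this README's sample output.
checks = [
    is_correct(["PrintOptions.PrintHeadings", "true"],
               ["PrintOptions.PrintHeadings", "true"]),
    is_correct(["PrintOptions.Orientation", "Orientation.Landscape"],
               ["PrintOptions.Portrait", "false"]),
]
print(f"error_rate={error_rate(checks):.0%}")  # prints "error_rate=50%"
```

An exact match is the strictest option: an answer that is plausible in general but differs from the expected text, such as the landscape example, counts as an error.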
...
BenchmarkContext:
timeout_seconds: 2
delay_ms: 50
verbose: False
truncate_length: 150
max_parallel_questions: 30
retry_failures: True
benchmark_n_times: 1
reasoning_effort: low
web_search: False
context:
Benchmarking 4 model(s) on 28 question(s) 3 times.
=== Run 1 of 3 ===
...
Q3: How do you enable printing of row and column headings?
worksheet.??? = ???;
Q4: How do you set the worksheet to print in landscape orientation?
worksheet.??? = ???;
...
A3: ['PrintOptions.PrintHeadings', 'true']
✓ CORRECT
A4: ['PrintOptions.Orientation', 'Orientation.Landscape']
✗ INCORRECT, expected: ['PrintOptions.Portrait', 'false']
...
=== SUMMARY OF: Plain call + low ===
gemini-2.5-flash-lite, tokens_mdn=395, cost_mdn=$0.000044, time_mdn=0.00s, error_rate_mdn=50%, api_issues_count=0/1
gemini-2.5-flash, tokens_mdn=584, cost_mdn=$0.000639, time_mdn=0.00s, error_rate_mdn=50%, api_issues_count=0/1
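The `_mdn` fields in the summary read as medians. Assuming they are taken across repeated runs of the same model, the aggregation could be sketched as below. The function name `summarize` and the per-run numbers are hypothetical and are not measured results.

```python
from statistics import median

METRICS = ("tokens", "cost", "time", "error_rate")


def summarize(runs: list[dict]) -> dict:
    """Return the median of each metric across runs, keyed as '<metric>_mdn'."""
    return {f"{name}_mdn": median(run[name] for run in runs) for name in METRICS}


# Hypothetical per-run metrics for a single model (illustrative values only).
runs = [
    {"tokens": 390, "cost": 0.00004, "time": 1.2, "error_rate": 0.50},
    {"tokens": 395, "cost": 0.00005, "time": 1.1, "error_rate": 0.25},
    {"tokens": 400, "cost": 0.00006, "time": 1.4, "error_rate": 0.75},
]

summary = summarize(runs)
print(", ".join(f"{key}={value}" for key, value in summary.items()))
```

A median is less sensitive than a mean to a single slow or failed run, which matters when only a few repetitions are made.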