GemBox Benchmark

This repository contains three sub-projects that create a benchmark dataset for GemBox Software components and evaluate LLMs on that generated dataset:

Project 1: Inputs (C#) - Contains raw question-answer pairs about common GemBox usage tasks (e.g. printing options, reading Excel files, etc.).
Project 2: Bench Filter (Python) - Filters "1-inputs" into a benchmark dataset (JSONL format). See the example: GBS-benchmark at HuggingFace
Project 3: Benchmark Direct (Python) - Uses the dataset to evaluate LLMs on accuracy, speed and cost when answering GemBox API questions.

Installation & Setup

Requirements:

Visual Studio Code - official download
C# 13.0 / .NET 9.0 SDK - The easiest way to install it is via the VS Code C# extension.
GemBox.Spreadsheet and dependencies - If VS Code does not install them automatically when opening the workspace, get v2025.9.107 via NuGet:

dotnet add package GemBox.Spreadsheet --version 2025.9.107
dotnet add package HarfBuzzSharp.NativeAssets.Linux
dotnet add package SkiaSharp.NativeAssets.Linux.NoDependencies

Next steps:

Git clone:

git clone https://github.com/ZSvedic/GemBox-benchmark

For the Python project, use the uv package manager to install dependencies:

cd GemBox-benchmark/3-benchmark-direct/ # Go to the Python project.
uv venv --python 3.10                   # Env with py 3.10 or newer.
source .venv/bin/activate               # For Linux/MacOS.
uv sync                                 # Install dependencies.
cd ..                                   # Go back to the root.

Create a ".env" file in the repository root (the "GemBox-benchmark" folder) with your API keys. If you only use OpenRouter, only OPENROUTER_API_KEY is needed. Example:

OPENROUTER_API_KEY=...
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
MISTRAL_API_KEY=...
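
To check which providers a given ".env" file would enable, a small standard-library sketch like the one below can help. It is not taken from the repository: the helper names (parse_env, configured_providers) and the provider labels are hypothetical, and the benchmark itself may load the file differently.

```python
# Hypothetical helper, not part of the repository: lists the providers
# whose API key is present and non-empty in ".env"-style text.
KEY_NAMES = {
    "OpenRouter": "OPENROUTER_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Mistral": "MISTRAL_API_KEY",
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(values):
    """Return provider labels whose key has a non-empty value."""
    return [name for name, key in KEY_NAMES.items() if values.get(key)]


if __name__ == "__main__":
    example = "OPENROUTER_API_KEY=placeholder-key\nGOOGLE_API_KEY=\n"
    print(configured_providers(parse_env(example)))  # ['OpenRouter']
```

To inspect a real file, pass its contents to parse_env, for example the text read from ".env" in the repository root.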

Open in VS Code:

code GB-benchmark.code-workspace

On first open, VS Code should show a "There are unresolved dependencies" popup. Select "Restore" to install all .NET dependencies.

The VS Code "Run and Debug" tab should now list a run configuration for each of the subprojects below. You can also run each project from the CLI.

Project "1-inputs" (optional)

This C# project contains the Q&A data for the dataset. It uses the GemBox.Spreadsheet library to enumerate typical tasks. The comments before each task contain a single "Question:" and one or more "Mask:" lines, one per code answer. Each mask specifies a regex that selects the part of the following code line to hide before the LLM is asked to fill it in.

Example input C# code

...
    // Question: How do you enable printing of row and column headings?
    // Mask: \bPrintOptions\.PrintHeadings\b
    // Mask: \btrue\b
    worksheet.PrintOptions.PrintHeadings = true;

    // Question: How do you set the worksheet to print in landscape orientation?
    // Mask: \bPrintOptions\.Portrait\b
    // Mask: \bfalse\b
    worksheet.PrintOptions.Portrait = false;
...
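
The masking step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from "2-bench-filter": the function name apply_masks is hypothetical, each mask is assumed to match once on the code line, and matches are assumed not to overlap.

```python
import re
from typing import List, Tuple


def apply_masks(code_line: str, mask_patterns: List[str],
                placeholder: str = "???") -> Tuple[str, List[str]]:
    """Replace the first match of each mask regex with a placeholder.

    Returns the masked line and the hidden texts in left-to-right order.
    Assumes the matches do not overlap.
    """
    matches = []
    for pattern in mask_patterns:
        match = re.search(pattern, code_line)
        if match is None:
            raise ValueError(f"Mask {pattern!r} does not match {code_line!r}")
        matches.append(match)

    # Process matches by position so answers line up with placeholders.
    matches.sort(key=lambda m: m.start())

    pieces, answers, position = [], [], 0
    for match in matches:
        pieces.append(code_line[position:match.start()])
        pieces.append(placeholder)
        answers.append(match.group(0))
        position = match.end()
    pieces.append(code_line[position:])
    return "".join(pieces), answers


if __name__ == "__main__":
    masked, answers = apply_masks(
        "worksheet.PrintOptions.PrintHeadings = true;",
        [r"\bPrintOptions\.PrintHeadings\b", r"\btrue\b"],
    )
    print(masked)   # worksheet.??? = ???;
    print(answers)  # ['PrintOptions.PrintHeadings', 'true']
```

Sorting the matches by position keeps the answer order aligned with the placeholder order even if the "Mask:" comments were listed in a different order.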

Project "2-bench-filter" (optional)

This Python project filters .cs files from "1-inputs" to extract Q&A into a benchmark dataset. Each dataset row contains:

category (from the .cs file name),
question (the query, in English),
masked_code (code snippet with ??? placeholders),
answers (correct text to fill ??? placeholders).

Execute run.sh to run the filter with the default input path and log the output.

Example JSONL dataset

...
{"category": "PrintView", "question": "How do you enable printing of row and column headings?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.PrintHeadings", "true"]}
{"category": "PrintView", "question": "How do you set the worksheet to print in landscape orientation?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.Portrait", "false"]}
...
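
A consumer of this dataset might load and sanity-check rows as in the sketch below. It relies only on the four fields listed above. The helper names (parse_row, fill_placeholders) are hypothetical, and the check that the number of ??? placeholders equals the number of answers is an assumption inferred from the example rows.

```python
import json

REQUIRED_FIELDS = ("category", "question", "masked_code", "answers")
PLACEHOLDER = "???"


def parse_row(line):
    """Parse one JSONL line and verify the expected fields are present."""
    row = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"Row is missing fields: {missing}")
    # Assumption: one answer per placeholder, as in the example rows.
    if row["masked_code"].count(PLACEHOLDER) != len(row["answers"]):
        raise ValueError("Placeholder count does not match answer count")
    return row


def fill_placeholders(masked_code, answers):
    """Substitute answers for placeholders, left to right."""
    parts = masked_code.split(PLACEHOLDER)
    pieces = [parts[0]]
    for answer, tail in zip(answers, parts[1:]):
        pieces.append(answer)
        pieces.append(tail)
    return "".join(pieces)


if __name__ == "__main__":
    sample = (
        '{"category": "PrintView", '
        '"question": "How do you enable printing of row and column headings?", '
        '"masked_code": "worksheet.??? = ???;", '
        '"answers": ["PrintOptions.PrintHeadings", "true"]}'
    )
    row = parse_row(sample)
    print(fill_placeholders(row["masked_code"], row["answers"]))
    # worksheet.PrintOptions.PrintHeadings = true;
```

Filling the placeholders back in reproduces the original C# line, which is a quick way to confirm that a row was extracted consistently.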

Project "3-benchmark-direct"

This Python project uses the dataset to run LLM evaluations. It supports OpenAI, Google, and many other providers via OpenRouter. Each model is asked to fill in ??? placeholders, and the outputs are validated. The evaluation measures error rate, speed, and cost.
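
As a rough illustration of the validation step, the sketch below scores model answers by exact string match and derives an error rate. The exact-match rule and the function names (is_correct, error_rate) are assumptions for illustration; the project may validate answers differently. The sample data is taken from the example output in this README.

```python
def is_correct(expected, predicted):
    """True when the model supplied exactly the expected fills, in order.

    Assumption: answers are compared as exact strings after trimming
    surrounding whitespace.
    """
    if len(expected) != len(predicted):
        return False
    return all(e.strip() == p.strip() for e, p in zip(expected, predicted))


def error_rate(expected_rows, predicted_rows):
    """Fraction of questions whose answers were not all correct."""
    if not expected_rows or len(expected_rows) != len(predicted_rows):
        raise ValueError("Need the same non-zero number of rows on both sides")
    wrong = sum(
        1 for expected, predicted in zip(expected_rows, predicted_rows)
        if not is_correct(expected, predicted)
    )
    return wrong / len(expected_rows)


if __name__ == "__main__":
    expected = [
        ["PrintOptions.PrintHeadings", "true"],
        ["PrintOptions.Portrait", "false"],
    ]
    predicted = [
        ["PrintOptions.PrintHeadings", "true"],
        ["PrintOptions.Orientation", "Orientation.Landscape"],
    ]
    print(f"{error_rate(expected, predicted):.0%}")  # 50%
```

Under exact matching, an answer that uses a different member name than the expected one counts as an error, which is how the landscape-orientation question is marked in the example output.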

Example output and results

...
BenchmarkContext:
    timeout_seconds: 2
    delay_ms: 50
    verbose: False
    truncate_length: 150
    max_parallel_questions: 30
    retry_failures: True
    benchmark_n_times: 1
    reasoning_effort: low
    web_search: False
    context:  

Benchmarking 4 model(s) on 28 question(s) 3 times.

=== Run 1 of 3 ===

...
Q3: How do you enable printing of row and column headings?
worksheet.??? = ???;
Q4: How do you set the worksheet to print in landscape orientation?
worksheet.??? = ???;
...
A3: ['PrintOptions.PrintHeadings', 'true']
✓ CORRECT
A4: ['PrintOptions.Orientation', 'Orientation.Landscape']
✗ INCORRECT, expected: ['PrintOptions.Portrait', 'false']
...
=== SUMMARY OF: Plain call + low ===
    gemini-2.5-flash-lite,                  tokens_mdn=395, cost_mdn=$0.000044,     time_mdn=0.00s, error_rate_mdn=50%,     api_issues_count=0/1
    gemini-2.5-flash,                       tokens_mdn=584, cost_mdn=$0.000639,     time_mdn=0.00s, error_rate_mdn=50%,     api_issues_count=0/1
