Projects that create a benchmark dataset for GemBox components, and test LLMs on that dataset.

ZSvedic/GemBox-benchmark

GemBox Benchmark

This repository contains three sub-projects that create a benchmark dataset for GemBox Software components and evaluate LLMs on that generated dataset:

  • Project 1: Inputs (C#) - Contains raw question-answer pairs about common GemBox usage tasks (e.g. printing options, reading Excel files).
  • Project 2: Bench Filter (Python) - Filters "1-inputs" into a benchmark dataset (JSON format). For an example, see GBS-benchmark on Hugging Face.
  • Project 3: Benchmark Direct (Python) - Uses the dataset to evaluate LLMs on accuracy, speed and cost when answering GemBox API questions.

Installation & Setup

Requirements:

dotnet add package GemBox.Spreadsheet --version 2025.9.107
dotnet add package HarfBuzzSharp.NativeAssets.Linux
dotnet add package SkiaSharp.NativeAssets.Linux.NoDependencies

Next steps:

  1. Clone the repository:
git clone https://github.com/ZSvedic/GemBox-benchmark
  2. For the Python project, use the uv package manager to install dependencies:
cd GemBox-benchmark/3-benchmark-llm/    # Go to the Python project.
uv venv --python 3.10                   # Env with py 3.10 or newer.
source .venv/bin/activate               # For Linux/macOS.
uv sync                                 # Install dependencies.
cd ..                                   # Go back to the root.
  3. Create a ".env" file in the root ("GemBox-benchmark" folder) with your API keys. If you use only OpenRouter, only OPENROUTER_API_KEY is needed. Example:
OPENROUTER_API_KEY=...
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
MISTRAL_API_KEY=...
  4. Open the workspace in VS Code:
code GB-benchmark.code-workspace

VS Code should show a "There are unresolved dependencies" popup on first open. Select "Restore" to install all .NET dependencies.

  5. The VS Code "Run and Debug" tab should now list run configurations for each of the subprojects below. Alternatively, run each project from the CLI.
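Before running the Python projects, it can help to confirm that the ".env" file contains the keys you expect. The following is a minimal, stdlib-only sketch; it is not how the repository itself loads keys (that mechanism is not shown here), and the helper names `read_env_keys` and `missing_keys` are hypothetical.

```python
def read_env_keys(text):
    """Parse simple NAME=value lines from the contents of a .env file."""
    keys = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip().strip('"').strip("'")
    return keys


def missing_keys(env, required=("OPENROUTER_API_KEY",)):
    """Return the required key names that are absent or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    from pathlib import Path

    env_path = Path(".env")  # Assumes the script runs from the repository root.
    env = read_env_keys(env_path.read_text()) if env_path.exists() else {}
    print("Missing keys:", missing_keys(env))
```

The default `required` tuple reflects the note above that OPENROUTER_API_KEY is sufficient when only OpenRouter is used; pass other key names if you call providers directly.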

Project "1-inputs" (optional)

This C# project contains the Q&A data for the dataset. It uses the GemBox.Spreadsheet library to enumerate typical tasks. The comments before each task contain a single "Question:" line and one or more "Mask:" lines, one per code answer. Each mask specifies a regex that hides the matching part of the following code line before an LLM is asked to fill it in.

Example input C# code

...
    // Question: How do you enable printing of row and column headings?
    // Mask: \bPrintOptions\.PrintHeadings\b
    // Mask: \btrue\b
    worksheet.PrintOptions.PrintHeadings = true;

    // Question: How do you set the worksheet to print in landscape orientation?
    // Mask: \bPrintOptions\.Portrait\b
    // Mask: \bfalse\b
    worksheet.PrintOptions.Portrait = false;
...
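The masking step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by "2-bench-filter": the function name `apply_masks` is hypothetical, each mask is assumed to match exactly once in the code line, and matches are assumed not to overlap.

```python
import re

PLACEHOLDER = "???"


def apply_masks(code_line, mask_patterns):
    """Replace each regex match with ??? and return (masked_code, answers).

    Assumption: every pattern matches the line, and matches do not overlap.
    """
    spans = []
    for pattern in mask_patterns:
        match = re.search(pattern, code_line)
        if match is None:
            raise ValueError(f"Mask {pattern!r} does not match {code_line!r}")
        spans.append(match.span())

    # Process matches left to right so answers follow placeholder order.
    spans.sort()
    parts, answers, cursor = [], [], 0
    for start, end in spans:
        parts.append(code_line[cursor:start])
        parts.append(PLACEHOLDER)
        answers.append(code_line[start:end])
        cursor = end
    parts.append(code_line[cursor:])
    return "".join(parts), answers


if __name__ == "__main__":
    masked, answers = apply_masks(
        "worksheet.PrintOptions.PrintHeadings = true;",
        [r"\bPrintOptions\.PrintHeadings\b", r"\btrue\b"],
    )
    print(masked)   # worksheet.??? = ???;
    print(answers)  # ['PrintOptions.PrintHeadings', 'true']
```

Sorting the matches by position keeps the answer list aligned with the placeholders even if the "Mask:" comments are written in a different order than the matches appear in the line.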

Project "2-bench-filter" (optional)

This Python project filters .cs files from "1-inputs" to extract Q&A into a benchmark dataset. Each dataset row contains:

  • category (from the .cs file name),
  • question (query in English),
  • masked_code (code snippet with ??? placeholders),
  • answers (correct text to fill ??? placeholders).

Run run.sh to use the default input path and log the output. Example dataset rows:

...
{"category": "PrintView", "question": "How do you enable printing of row and column headings?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.PrintHeadings", "true"]}
{"category": "PrintView", "question": "How do you set the worksheet to print in landscape orientation?", "masked_code": "worksheet.??? = ???;", "answers": ["PrintOptions.Portrait", "false"]}
...
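A consumer of the dataset can read these rows one JSON object per line and check that each row is internally consistent. The sketch below assumes the one-object-per-line layout shown above and that the number of ??? placeholders should equal the number of answers; `load_rows` is a hypothetical helper, not part of the repository.

```python
import json

REQUIRED_FIELDS = ("category", "question", "masked_code", "answers")


def load_rows(jsonl_text):
    """Parse dataset rows (one JSON object per line) and validate them."""
    rows = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"Line {number}: missing fields {missing}")
        # Assumption: each ??? placeholder has exactly one answer.
        if row["masked_code"].count("???") != len(row["answers"]):
            raise ValueError(f"Line {number}: placeholder/answer count mismatch")
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        '{"category": "PrintView", '
        '"question": "How do you enable printing of row and column headings?", '
        '"masked_code": "worksheet.??? = ???;", '
        '"answers": ["PrintOptions.PrintHeadings", "true"]}'
    )
    for row in load_rows(sample):
        print(row["category"], "-", row["question"])
```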

Project "3-benchmark-direct"

This Python project uses the dataset to run LLM evaluations. It supports OpenAI, Google, and many other providers via OpenRouter. Each model is asked to fill in ??? placeholders, and the outputs are validated. The evaluation measures error rate, speed, and cost.
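As an illustration of the validation step, the sketch below marks a model's answer list as correct only when every entry equals the expected entry. Exact string comparison after trimming whitespace is an assumption, as are the names `is_correct` and `error_rate`; the project's actual validation rules may differ.

```python
def is_correct(predicted, expected):
    """True when both lists have the same length and matching entries.

    Assumption: answers are compared as exact strings after trimming.
    """
    if len(predicted) != len(expected):
        return False
    return all(p.strip() == e.strip() for p, e in zip(predicted, expected))


def error_rate(outcomes):
    """Fraction of questions answered incorrectly (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("No outcomes to score")
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


if __name__ == "__main__":
    outcomes = [
        is_correct(["PrintOptions.PrintHeadings", "true"],
                   ["PrintOptions.PrintHeadings", "true"]),
        is_correct(["PrintOptions.Orientation", "Orientation.Landscape"],
                   ["PrintOptions.Portrait", "false"]),
    ]
    print(outcomes, f"error rate: {error_rate(outcomes):.0%}")
```

Under exact matching, an answer that uses a different but plausible-looking API name is still counted as incorrect, which is the behavior a fill-in-the-blank API benchmark needs.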

Example output and results

...
BenchmarkContext:
    timeout_seconds: 2
    delay_ms: 50
    verbose: False
    truncate_length: 150
    max_parallel_questions: 30
    retry_failures: True
    benchmark_n_times: 1
    reasoning_effort: low
    web_search: False
    context:  

Benchmarking 4 model(s) on 28 question(s) 3 times.

=== Run 1 of 3 ===

...
Q3: How do you enable printing of row and column headings?
worksheet.??? = ???;
Q4: How do you set the worksheet to print in landscape orientation?
worksheet.??? = ???;
...
A3: ['PrintOptions.PrintHeadings', 'true']
✓ CORRECT
A4: ['PrintOptions.Orientation', 'Orientation.Landscape']
✗ INCORRECT, expected: ['PrintOptions.Portrait', 'false']
...
=== SUMMARY OF: Plain call + low ===
    gemini-2.5-flash-lite,                  tokens_mdn=395, cost_mdn=$0.000044,     time_mdn=0.00s, error_rate_mdn=50%,     api_issues_count=0/1
    gemini-2.5-flash,                       tokens_mdn=584, cost_mdn=$0.000639,     time_mdn=0.00s, error_rate_mdn=50%,     api_issues_count=0/1
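The "_mdn" suffix in the summary suggests that each metric is a median over repeated runs. Assuming that reading, a per-model summary could be computed as in the sketch below; the function name `summarize_runs` and the sample numbers are hypothetical and are not results from this benchmark.

```python
from statistics import median

METRICS = ("tokens", "cost", "time", "error_rate")


def summarize_runs(runs):
    """Return the median of each metric across a model's runs."""
    if not runs:
        raise ValueError("At least one run is required")
    return {f"{name}_mdn": median(run[name] for run in runs) for name in METRICS}


if __name__ == "__main__":
    # Hypothetical per-run measurements for one model (illustration only).
    runs = [
        {"tokens": 390, "cost": 0.000040, "time": 1.2, "error_rate": 0.50},
        {"tokens": 395, "cost": 0.000044, "time": 0.9, "error_rate": 0.25},
        {"tokens": 410, "cost": 0.000050, "time": 1.5, "error_rate": 0.75},
    ]
    print(summarize_runs(runs))
```

A median is less sensitive than a mean to a single slow or failed API call, which is a reasonable choice when runs are few.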
