Merged
1 change: 1 addition & 0 deletions Cargo.lock


5 changes: 5 additions & 0 deletions spnl/Cargo.toml
@@ -115,6 +115,7 @@ tokio = { version = "1.44.1", features = ["rt-multi-thread"] }
dirs = "6.0"
reqwest = { version = "0.12", features = ["blocking", "json"] }
serde_json = "1.0"
csv = "1.3"

[[bench]]
name = "haystack"
@@ -137,3 +138,7 @@ required-features = ["tok"]
name = "ruler"
harness = false
required-features = ["tok"]

[[bench]]
name = "ragcsv"
harness = false
148 changes: 148 additions & 0 deletions spnl/benches/README-RAGCSV.md
@@ -0,0 +1,148 @@
# RAGCSV Benchmark for SPNL

Evaluates RAG answer accuracy over a CSV of evaluation queries, using SPNL span queries. Each row contains a question, document fragments, and an expected answer. The benchmark executes each query, then makes a second LLM call to grade the response's accuracy on a 0-100 scale.

## Overview

1. Load a CSV file with evaluation data (questions, document fragments, expected answers)
2. For each row, construct a span query that provides the document fragments as context and asks the question
3. Grade the model's response against the expected answer using a second LLM query
4. Report quantile statistics for accuracy and response time

## CSV Schema

The CSV has no headers. Columns are positional:

| Index | Content |
|-------|---------|
| 0 | Expected result |
| 1 | Fragments -- Python-style list literal (single-quoted, so not strict JSON) of `{'page_content': '...', 'metadata': {'title': '...'}, ...}` objects |
| 2 | Hallucination detection proposal (unused) |
| 3 | Timestamp (unused) |
| 4 | Question to pose |
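
As a rough illustration of the positional layout, the sketch below maps already-split fields onto a struct. `EvalRow` and `row_from_fields` are hypothetical names chosen for this example, not identifiers from the benchmark source, and the fragments column is left as a raw string rather than parsed.

```rust
/// One evaluation row, following the positional layout in the table above.
/// The struct and field names are illustrative assumptions.
#[derive(Debug, PartialEq)]
struct EvalRow {
    expected: String,  // column 0
    fragments: String, // column 1, kept in its raw Python-style form
    question: String,  // column 4
}

/// Map already-split fields onto an `EvalRow`.
/// Returns `None` when the row has fewer than five columns.
/// Columns 2 and 3 are skipped because the benchmark does not use them.
fn row_from_fields(fields: &[&str]) -> Option<EvalRow> {
    if fields.len() < 5 {
        return None;
    }
    Some(EvalRow {
        expected: fields[0].to_string(),
        fragments: fields[1].to_string(),
        question: fields[4].to_string(),
    })
}
```

Because the file has no header row, a reader would need to be configured to treat the first line as data rather than as column names.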

## Configuration via Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `RAGCSV_FILE` | (required) | Path to the CSV file |
| `RAGCSV_MODEL` | `ollama/granite3.3:2b` | Model for inference |
| `RAGCSV_GRADING_MODEL` | same as `RAGCSV_MODEL` | Model for accuracy grading |
| `RAGCSV_CONCURRENCY` | `1` | Max parallel query executions |
| `RAGCSV_LIMIT` | all rows | Limit number of rows to process |
| `RAGCSV_MAX_TOKENS` | `512` | Max tokens for primary query |
| `RAGCSV_DEBUG` | `false` | Print first row's query/response details |
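
The defaults in this table could be resolved along the following lines. This is a minimal sketch, not the benchmark's actual code: `Config` and `config_from` are hypothetical names, and treating `1` or `true` as the values that enable `RAGCSV_DEBUG` is an assumption. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Resolved settings; field names mirror the table but are illustrative.
#[derive(Debug)]
struct Config {
    file: String,
    model: String,
    grading_model: String,
    concurrency: usize,
    limit: Option<usize>, // None means "all rows"
    max_tokens: usize,
    debug: bool,
}

/// Build a `Config` from any key -> value lookup, applying the documented defaults.
fn config_from(lookup: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let file = lookup("RAGCSV_FILE").ok_or("RAGCSV_FILE is required")?;
    let model = lookup("RAGCSV_MODEL").unwrap_or_else(|| "ollama/granite3.3:2b".to_string());
    // The grading model falls back to the inference model.
    let grading_model = lookup("RAGCSV_GRADING_MODEL").unwrap_or_else(|| model.clone());

    // Parse an optional numeric variable, reporting which key was malformed.
    let number = |key: &str| -> Result<Option<usize>, String> {
        match lookup(key) {
            Some(v) => v
                .trim()
                .parse::<usize>()
                .map(Some)
                .map_err(|e| format!("{}: {}", key, e)),
            None => Ok(None),
        }
    };
    let concurrency = number("RAGCSV_CONCURRENCY")?.unwrap_or(1);
    let max_tokens = number("RAGCSV_MAX_TOKENS")?.unwrap_or(512);
    let limit = number("RAGCSV_LIMIT")?;

    // Assumption: "1" or "true" turns debug output on; anything else leaves it off.
    let debug = matches!(lookup("RAGCSV_DEBUG").as_deref(), Some("1") | Some("true"));

    Ok(Config { file, model, grading_model, concurrency, limit, max_tokens, debug })
}
```

In a real binary the lookup would typically be `|k| std::env::var(k).ok()`.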

## Usage Examples

### Basic Usage

```bash
RAGCSV_FILE=~/data/eval.csv cargo bench --bench ragcsv
```

### Debug Mode

```bash
RAGCSV_FILE=~/data/eval.csv RAGCSV_DEBUG=1 RAGCSV_LIMIT=3 \
cargo bench --bench ragcsv
```

### Concurrent Execution

```bash
RAGCSV_FILE=~/data/eval.csv \
RAGCSV_MODEL="ollama/granite3.3:2b" \
RAGCSV_CONCURRENCY=4 \
RAGCSV_LIMIT=100 \
cargo bench --bench ragcsv
```

### Separate Grading Model

```bash
RAGCSV_FILE=~/data/eval.csv \
RAGCSV_MODEL="ollama/granite3.3:2b" \
RAGCSV_GRADING_MODEL="ollama/llama3.2:3b" \
cargo bench --bench ragcsv
```

## Output Format

Progress is shown during execution:

```
[00:01:23] 42/100 | Avg Acc=72.3% | Pass(>=75%)=28/42
```

Final report includes quantile statistics for accuracy and total time:

```
=== RAGCSV Eval Accuracy (n=100) ===
min: 10.0%
p25: 55.0%
p50: 75.0%
p75: 85.0%
p90: 95.0%
p99: 100.0%
max: 100.0%
avg: 71.2%
pass (>=75%): 58/100

=== RAGCSV Eval Total Time (n=100) ===
min: 234ms
p25: 456ms
p50: 623ms
p75: 891ms
p90: 1203ms
p99: 2456ms
max: 3012ms
avg: 702ms
```
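
For readers who want to reproduce figures like these from raw scores, here is a small sketch of one way to compute them. It uses the nearest-rank quantile definition, which is an assumption; the benchmark may interpolate or use a different definition. `Summary`, `quantile`, and `summarize` are hypothetical names.

```rust
/// A subset of the statistics shown in the report (illustrative names).
#[derive(Debug)]
struct Summary {
    min: f64,
    p50: f64,
    p90: f64,
    max: f64,
    avg: f64,
    pass: usize,
}

/// Nearest-rank quantile over an ascending-sorted slice; `q` is in [0, 1].
fn quantile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    // Rank is 1-based: the smallest rank covering at least a fraction q of the data.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Summarize scores; `pass_threshold` would be 75.0 for the report above.
/// Returns `None` for an empty input.
fn summarize(scores: &[f64], pass_threshold: f64) -> Option<Summary> {
    let mut sorted = scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(Summary {
        min: *sorted.first()?,
        p50: quantile(&sorted, 0.50)?,
        p90: quantile(&sorted, 0.90)?,
        max: *sorted.last()?,
        avg: sorted.iter().sum::<f64>() / sorted.len() as f64,
        pass: sorted.iter().filter(|s| **s >= pass_threshold).count(),
    })
}
```

The same functions apply to the timing report if durations are converted to milliseconds as `f64` first.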

## Implementation Details

### Query Construction

Each row produces a span query of the form:

```
g model
(cross
(system "You are a helpful assistant. Answer the question based only on the provided documents.")
(plus [user "Document: {title}\n{page_content}", ...])
(user question)
)
temperature
max_tokens
```
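
The `plus` branch carries one user message per fragment. A minimal sketch of how those message strings could be assembled is shown here; `Fragment` and `document_messages` are hypothetical names, and plain strings stand in for whatever message type SPNL actually uses.

```rust
/// The two fragment fields the query template refers to.
/// Illustrative type; `title` corresponds to `metadata.title` in the CSV.
struct Fragment {
    title: String,
    page_content: String,
}

/// Render each fragment as the text of one user message,
/// following the "Document: {title}\n{page_content}" template.
fn document_messages(fragments: &[Fragment]) -> Vec<String> {
    fragments
        .iter()
        .map(|f| format!("Document: {}\n{}", f.title, f.page_content))
        .collect()
}
```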

### Accuracy Grading

After getting the model response, a second query grades accuracy:

```
g grading_model
(cross
(system "You are an accuracy evaluator...")
(user "Expected answer: ...\n\nActual answer: ...\n\nAccuracy score (0-100):")
)
0.0 // deterministic
16 // small max_tokens
```

The grading model is expected to return an integer from 0 to 100, which the benchmark parses from the response text.
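
Since a model may wrap the number in extra text, score extraction has to be tolerant. One possible approach is sketched below, under two assumptions: the first run of digits is taken as the score, and values above 100 are clamped. `parse_score` is a hypothetical helper, not the benchmark's own function.

```rust
/// Extract a 0-100 score from a grading response.
/// Takes the first contiguous run of ASCII digits and clamps it to 100.
/// Returns `None` if the response contains no usable number.
fn parse_score(response: &str) -> Option<u8> {
    let digits: String = response
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    // An empty or overflowing digit run fails to parse and yields None.
    let value: u32 = digits.parse().ok()?;
    Some(value.min(100) as u8)
}
```

Returning `None` rather than defaulting to zero lets the caller decide whether an unparseable grade should count as a failure or be excluded from the statistics.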

### Concurrency

Rows are dispatched as tokio tasks with a semaphore limiting parallelism to `RAGCSV_CONCURRENCY`. Results are collected via an mpsc channel.
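
The bounded-parallelism pattern can be illustrated without an async runtime. The sketch below is a standard-library stand-in, not the tokio implementation described above: a fixed number of worker threads plays the role of the semaphore, and results flow back over `std::sync::mpsc`. `run_bounded` is a hypothetical name.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Apply `work` to every item with at most `concurrency` items in flight.
/// Results are returned with their original index, sorted by that index.
fn run_bounded<T, R, F>(items: Vec<T>, concurrency: usize, work: F) -> Vec<(usize, R)>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue: VecDeque<(usize, T)> = items.into_iter().enumerate().collect();
    let queue = Arc::new(Mutex::new(queue));
    let work = Arc::new(work);
    let (tx, rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..concurrency.max(1) {
        let (queue, work, tx) = (Arc::clone(&queue), Arc::clone(&work), tx.clone());
        handles.push(thread::spawn(move || loop {
            // The lock guard is dropped at the end of this statement,
            // so the queue is not held while the work runs.
            let next = queue.lock().unwrap().pop_front();
            match next {
                Some((i, item)) => {
                    if tx.send((i, (*work)(item))).is_err() {
                        break;
                    }
                }
                None => break,
            }
        }));
    }
    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);

    let mut results: Vec<(usize, R)> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort_by_key(|(i, _)| *i);
    results
}
```

Results arrive in completion order, so anything that needs the original row order has to carry an index and sort afterwards, as done here.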

## Future Work

- Capture TTFT (time to first token) and ITL (inter-token latency) metrics programmatically
- Support additional CSV schemas
- Add per-row result CSV export

---

Made with Bob