Get started with blind evaluation and hyperparameter tuning in minutes.
Note: This application requires Ollama (© Ollama, Inc.) running locally. Ollama is a separate open-source project licensed under the MIT License.
- Ollama installed and running:

```bash
# Check if Ollama is running
curl http://127.0.0.1:11434/api/tags
```

- Python 3.10+ with dependencies installed:

```bash
pip install -r requirements.txt
```

- At least 2 models downloaded (for arena mode):

```bash
ollama pull gemma3:1b
ollama pull qwen2.5:3b
ollama pull llama3.2:3b
```
```bash
python web_chat.py
```

Open your browser: http://127.0.0.1:7860
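If you'd rather script the prerequisite check, here is a minimal sketch against Ollama's `/api/tags` endpoint (a real Ollama REST endpoint; the helper itself is illustrative and not part of this app):

```python
import requests

OLLAMA_URL = "http://127.0.0.1:11434"

def list_local_models() -> list[str]:
    """Return the names of locally installed Ollama models."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

models = list_local_models()
print(f"{len(models)} model(s) installed: {', '.join(models)}")
if len(models) < 2:
    print("Arena mode needs at least 2 models -- run `ollama pull <model>` first.")
```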
Compare multiple models side-by-side with known identities.
- Select Models: Click the dropdown, select 2-6 models
  - Models appear as chips: `gemma3:1b (T=0.7 P=0.9 K=40)`
- Configure Hyperparameters (optional): Adjust sliders before adding
  - Temperature, top_p, top_k, repeat_penalty, num_predict, seed
- Add to Arena: Click "➕ Add to Arena"
- Set System Prompt (optional): Click "🔧 System Prompt" to set instructions
- Send Prompt: Type your question, click "Send to Arena"
- Compare Responses: See all responses side-by-side in real-time
- Export: Click "💾 Export" to download JSON with full conversation history
Example Use Case: Compare code generation quality across qwen2.5:3b, llama3.2:3b, and gemma3:1b for Python refactoring tasks.
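Under the hood, a side-by-side comparison amounts to sending the same prompt to each model. Here is a minimal sketch of that loop using Ollama's `/api/chat` endpoint (the endpoint and payload shape are Ollama's; the model list and prompts are placeholders, and this is not the app's actual code):

```python
import requests

OLLAMA_URL = "http://127.0.0.1:11434"
MODELS = ["qwen2.5:3b", "llama3.2:3b", "gemma3:1b"]
SYSTEM_PROMPT = "You are a senior Python reviewer."
USER_PROMPT = "Refactor this function to be more Pythonic: def add(a,b): return a+b"

for model in MODELS:
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": USER_PROMPT},
            ],
            "stream": False,  # return one complete response per model
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(f"--- {model} ---")
    print(resp.json()["message"]["content"])
```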
Eliminate bias by hiding model identities during evaluation.
- Toggle Blind Mode: Click the 🎭 "Blind Mode" toggle (turns purple when active)
- Select & Add Models: Same as standard mode, but models will be hidden
  - You'll see "Model A", "Model B", "Model C" instead of real names
  - Display order is randomized to prevent position bias
- Send Prompts & Vote:
  - Type your question, click "Send to Arena"
  - Click 👍 or 👎 on each response to vote
  - You won't know which model is which
- Reveal When Ready: Click the "🔓 Reveal All Models" button
  - Shows actual model names, hyperparameters, and vote counts
  - Voting is locked after reveal
- Export with Privacy:
  - Before reveal: Export contains masked names (`_blind.json` suffix)
  - After reveal: Export contains full mapping
Example Use Case: Resolve team debates about which model is "best" without brand bias. Run evaluation, vote, then reveal — the data speaks for itself.
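The anti-bias mechanics are easy to reason about in code. Here is a sketch of the idea, assuming labels are assigned by shuffling (illustrative only, not the app's actual implementation):

```python
import random
import string

def assign_blind_labels(model_ids: list[str]) -> dict[str, str]:
    """Map 'Model A', 'Model B', ... to real model ids in random order."""
    shuffled = random.sample(model_ids, k=len(model_ids))
    return {f"Model {letter}": model_id
            for letter, model_id in zip(string.ascii_uppercase, shuffled)}

mapping = assign_blind_labels(["gemma3:1b", "qwen2.5:3b", "llama3.2:3b"])
# The UI shows only the keys; the mapping stays hidden until reveal.
display_order = random.sample(sorted(mapping), k=len(mapping))  # per-round shuffle
print(display_order)  # e.g. ['Model C', 'Model A', 'Model B']
print(mapping)        # revealed only at the end
```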
Find optimal hyperparameters by comparing multiple configurations of the same model.
- Select Base Model: Choose `gemma3:1b` from dropdown
- Configure First Instance:
  - Set Temperature = 0.1 (very deterministic)
  - Click "➕ Add to Arena"
- Add Second Instance:
  - Select `gemma3:1b` again
  - Set Temperature = 0.9 (more creative)
  - Click "➕ Add to Arena"
- Add Third Instance:
  - Select `gemma3:1b` again
  - Set Temperature = 2.0 (maximum creativity)
  - Click "➕ Add to Arena"
- Run Comparison: Send prompts and see how parameter changes affect output
  - Chips will show: `gemma3:1b (T=0.1 ...)`, `gemma3:1b (T=0.9 ...)`, `gemma3:1b (T=2.0 ...)`
- Export Results: Download JSON with full parameter sets for analysis
Example Use Case: Determine whether creative writing tasks benefit from T=1.5 or T=2.0 by testing multiple temperatures on the same model.
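You can reproduce the same sweep outside the UI with Ollama's `/api/generate` endpoint, passing sampler settings via `options` (the endpoint and option names are Ollama's; the script itself is a sketch):

```python
import requests

OLLAMA_URL = "http://127.0.0.1:11434"
PROMPT = "Write a two-sentence opening for a mystery novel."

for temperature in (0.1, 0.9, 2.0):
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "gemma3:1b",
            "prompt": PROMPT,
            "options": {"temperature": temperature},
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(f"--- T={temperature} ---")
    print(resp.json()["response"])
```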
Fine-tune each model instance independently with 6 parameters:
| Parameter | Range | Default | Best For | Notes |
|---|---|---|---|---|
| Temperature | 0.01-2.0 | 0.7 | Creativity control | Low (0.1-0.3) = factual, High (1.5-2.0) = creative |
| top_p | 0-1 | 0.9 | Nucleus sampling | Lower = more focused, Higher = more diverse |
| top_k | 0-100 | 40 | Token limit | Restricts vocabulary per step |
| repeat_penalty | 1.0-2.0 | 1.1 | Avoid repetition | Higher = more variation, 1.0 = no penalty |
| num_predict | -1 to 4096 | -1 | Response length | -1 = unlimited, set to cap tokens |
| seed | 0+ | 0 | Reproducibility | 0 = random, >0 = deterministic |
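All six sliders map directly onto Ollama's `options` object. The defaults from the table, written as a request payload fragment (the option names are Ollama's; the dict itself is illustrative):

```python
DEFAULT_OPTIONS = {
    "temperature": 0.7,     # creativity control
    "top_p": 0.9,           # nucleus sampling cutoff
    "top_k": 40,            # vocabulary restriction per step
    "repeat_penalty": 1.1,  # >1.0 discourages repetition
    "num_predict": -1,      # -1 = unlimited response length
    "seed": 0,              # 0 = random, >0 = deterministic
}
```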
Quick Settings for Common Tasks:
- Code Generation: T=0.2, P=0.8, K=20, R=1.2, M=-1, S=0
- Creative Writing: T=1.5, P=0.95, K=50, R=1.3, M=-1, S=0
- Factual Q&A: T=0.5, P=0.85, K=30, R=1.1, M=500, S=0
- Reproducible Tests: T=0.7, P=0.9, K=40, R=1.1, M=-1, S=42 (any seed >0)
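Expressed as code, those quick settings are just preset `options` dicts you could pass to Ollama (a sketch; the preset names are informal labels, not an API):

```python
PRESETS = {
    "code_generation":  {"temperature": 0.2, "top_p": 0.80, "top_k": 20,
                         "repeat_penalty": 1.2, "num_predict": -1, "seed": 0},
    "creative_writing": {"temperature": 1.5, "top_p": 0.95, "top_k": 50,
                         "repeat_penalty": 1.3, "num_predict": -1, "seed": 0},
    "factual_qa":       {"temperature": 0.5, "top_p": 0.85, "top_k": 30,
                         "repeat_penalty": 1.1, "num_predict": 500, "seed": 0},
    "reproducible":     {"temperature": 0.7, "top_p": 0.90, "top_k": 40,
                         "repeat_penalty": 1.1, "num_predict": -1, "seed": 42},
}
```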
Visual Indicators:
- Core params always shown: `T=0.7 P=0.9 K=40`
- Advanced params shown when non-default: `+ R=1.5 M=500 S=42`
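A sketch of that labeling rule as a hypothetical helper (the app's actual formatting code may differ):

```python
DEFAULTS = {"repeat_penalty": 1.1, "num_predict": -1, "seed": 0}

def chip_label(model: str, opts: dict) -> str:
    """Core params always; advanced params only when non-default."""
    label = f"{model} (T={opts['temperature']} P={opts['top_p']} K={opts['top_k']}"
    extras = []
    if opts["repeat_penalty"] != DEFAULTS["repeat_penalty"]:
        extras.append(f"R={opts['repeat_penalty']}")
    if opts["num_predict"] != DEFAULTS["num_predict"]:
        extras.append(f"M={opts['num_predict']}")
    if opts["seed"] != DEFAULTS["seed"]:
        extras.append(f"S={opts['seed']}")
    if extras:
        label += " + " + " ".join(extras)
    return label + ")"

print(chip_label("gemma3:1b", {"temperature": 0.7, "top_p": 0.9, "top_k": 40,
                               "repeat_penalty": 1.5, "num_predict": 500, "seed": 42}))
# -> gemma3:1b (T=0.7 P=0.9 K=40 + R=1.5 M=500 S=42)
```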
Blind evaluation:
- Run multiple rounds: Vote on 3-5 prompts before revealing for statistically meaningful results
- Diverse prompts: Test different task types (reasoning, creativity, factual)
- Team evaluations: Share the blind session with colleagues for consensus voting
- Export before reveal: Save masked JSON for audit trails showing no bias
Hyperparameter tuning:
- Start with defaults: Use the baseline (0.7, 0.9, 40, 1.1, -1, 0) as a control
- Change one at a time: Isolate effects by varying single parameter
- Document results: Export after each test for comparison
- Use seed for A/B tests: Set seed > 0 to ensure identical starting conditions
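To sanity-check the fixed-seed guarantee, compare two runs with identical settings (a sketch; determinism assumes the same model, prompt, and options on both calls):

```python
import requests

def generate(seed: int) -> str:
    """One non-streaming completion with a fixed seed."""
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={
            "model": "gemma3:1b",
            "prompt": "Name three prime numbers.",
            "options": {"temperature": 0.7, "seed": seed},
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Same seed -> outputs should match; seed 0 would vary run to run.
print("identical:", generate(42) == generate(42))
```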
Performance:
- Model size matters: Smaller models (1B-3B) run faster on CPU, larger (7B+) benefit from GPU
- Limit num_predict: Set to 500-1000 for faster responses in testing
- Batch similar prompts: Test same prompt across configs before moving to next question
Ollama not responding:

```bash
# Verify Ollama is running
curl http://127.0.0.1:11434/api/tags

# Restart Ollama if needed (Windows)
taskkill /F /IM ollama.exe
ollama serve
```

Slow responses:
- Check GPU usage: Large models (7B+) are slow on CPU-only systems
- Reduce num_predict: Set to 500 instead of -1 (unlimited)
- Use smaller models: Try `gemma3:1b` or `qwen2.5:3b` instead of `llama3.2:7b`
Hyperparameters not applying:
- Verify in export: Download JSON and check the `model_instances` array
- Check Ollama version: Ensure Ollama is up-to-date (v0.1.0+)
- Restart session: Click "New Chat" and reconfigure models
Blind mode issues:
- Labels not showing: Hard refresh browser (Ctrl+F5)
- Reveal button missing: Ensure blind mode toggle is active (purple background)
- Votes not saving: Check browser console for localStorage errors
Standard export:

```json
{
"session_id": "20260127_143022",
"timestamp": "2026-01-27T14:30:22.123Z",
"blind_mode": false,
"model_instances": [
{
"id": "gemma3_1b__0.7_0.9_40_1.1_-1_0",
"model": "gemma3:1b",
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1,
"num_predict": -1,
"seed": 0
}
],
"history": [
{
"prompt": "Explain quantum computing",
"responses": {
"gemma3_1b__0.7_0.9_40_1.1_-1_0": {
"content": "Quantum computing uses qubits...",
"metrics": {"duration_s": 2.34, "tokens": 150}
}
}
}
]
}
```

Blind export (before reveal):

```json
{
"session_id": "20260127_143022_blind",
"blind_mode": true,
"revealed": false,
"model_instances": [
{"id": "MASKED", "model": "MASKED"}
],
"history": [
{
"prompt": "Explain quantum computing",
"responses": {
"Model A": {"content": "...", "votes": {"up": 1, "down": 0}}
}
}
]
}
```

Blind export (after reveal):

```json
{
"blind_mode": true,
"revealed": true,
"blind_mapping": {
"Model A": "gemma3:1b",
"Model B": "qwen2.5:3b"
},
"model_instances": [
{
"id": "gemma3_1b__0.7_0.9_40_1.1_-1_0",
"model": "gemma3:1b",
"blind_label": "Model A"
}
],
"vote_summary": {
"Model A": {"up": 3, "down": 1},
"Model B": {"up": 5, "down": 0}
}
}
```

- Issues: Check BUG_FIXES.md for known issues
- API Reference: See API.md for endpoint details
- Contributing: Read CONTRIBUTING.md for development setup
- Changelog: CHANGELOG.md for version history
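Want to analyze an exported session programmatically? A minimal sketch against the revealed-export schema above (field names come from the examples; the file name is hypothetical):

```python
import json

with open("session_20260127_143022.json") as f:  # hypothetical export file
    export = json.load(f)

# Resolve blind labels to real names if the session was revealed.
mapping = export.get("blind_mapping", {})
for label, votes in export.get("vote_summary", {}).items():
    name = mapping.get(label, label)
    total = votes["up"] + votes["down"]
    approval = votes["up"] / total if total else 0.0
    print(f"{name}: {votes['up']} up / {votes['down']} down ({approval:.0%})")
```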
Ready to start? Run `python web_chat.py` and visit http://127.0.0.1:7860 🚀