---

## Performance Benchmarks

The following benchmarks were collected by running DocuBot's full 9-agent documentation pipeline across three inference environments. Use these results to choose the right deployment profile for your needs.

> **Note:** Intel Enterprise Inference was tested on Intel Xeon hardware to demonstrate on-premises SLM deployment for enterprise codebases.
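
The table and findings below report per-request token averages, P50/P95 latency, and throughput. As a rough illustration of how such figures can be derived from per-request logs (the log format and nearest-rank percentile below are a sketch, not DocuBot's actual benchmark harness):

```python
import statistics

# Hypothetical per-request benchmark log: (input_tokens, output_tokens, latency_ms).
# The values are made up for illustration.
requests = [
    (3820, 316, 7760),
    (4212, 270, 10540),
    (3950, 301, 8120),
    (4100, 295, 9480),
]

latencies = sorted(r[2] for r in requests)

def percentile(sorted_values: list[int], pct: float) -> int:
    """Nearest-rank percentile; adequate for benchmark summaries."""
    idx = min(len(sorted_values) - 1, round(pct * (len(sorted_values) - 1)))
    return sorted_values[idx]

p50 = percentile(latencies, 0.50)
p95 = percentile(latencies, 0.95)

# Assumes requests ran sequentially, so wall time ~= sum of latencies.
throughput = len(requests) / (sum(latencies) / 1000)

avg_total = statistics.mean(inp + out for inp, out, _ in requests)
print(f"P50={p50}ms  P95={p95}ms  throughput={throughput:.3f} req/sec  avg tokens/req={avg_total:.0f}")
```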

### Results

| Model Type / Inference Provider | Model Name | Deployment | Context Window | Avg Input Tokens | Avg Output Tokens | Avg Total Tokens / Request | P50 Latency (ms) | P95 Latency (ms) | Throughput (req/sec) | Hardware Profile |
|---|---|---|---|---|---|---|---|---|---|---|

- **Cloud leads on speed**: gpt-4o-mini delivers 26% faster P50 latency (7,760ms vs 10,540ms on Xeon) and 42% higher throughput (0.108 vs 0.076 req/sec) compared to CPU-only Qwen3-4B; cloud GPUs eliminate hardware bottlenecks for latency-sensitive pipelines.
- **Context window gap critically impacts multi-agent workflows**: cloud's 128K vs Xeon's 8.1K (a 94% reduction) forces aggressive prompt truncation for code documentation tasks, while vLLM's 262.1K on Apple Silicon enables full-context processing without chunking strategies (see the chunking sketch after this list).
- **Cloud generates more with less input**: gpt-4o-mini produces 17% more output (316 vs 270 tokens) while consuming 9% less input (3,820 vs 4,212), indicating superior prompt compression and generation efficiency.
- **Apple Silicon throughput lags despite large context**: despite a 32× larger context window (262.1K vs 8.1K), Apple Silicon achieves only 0.058 req/sec (46% slower than cloud and 24% slower than CPU-only Xeon), suggesting Metal optimization gaps for multi-agent workloads.
- **Deployment stability affects cost predictability**: cloud shows 26% token variance per run (3,618–4,915 tokens/req) vs Xeon's 7% variance (4,375–4,688), reflecting dynamic resource allocation versus consistent CPU-bound processing.
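
A minimal sketch of the kind of token-budget chunking an 8.1K-context deployment forces; the 4-chars-per-token heuristic and the reserved budget are assumptions for illustration, not DocuBot's actual strategy:

```python
# Illustrative token-budget chunking for a small context window (e.g. 8.1K).
CONTEXT_WINDOW = 8_100
RESERVED = 1_500          # assumed budget for system prompt + expected output
CHUNK_BUDGET = CONTEXT_WINDOW - RESERVED

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token); a real pipeline would use a tokenizer.
    return max(1, len(text) // 4)

def chunk_source(source: str) -> list[str]:
    """Split source code into chunks that each fit the remaining token budget."""
    chunks, current, used = [], [], 0
    for line in source.splitlines(keepends=True):
        cost = estimate_tokens(line)
        if used + cost > CHUNK_BUDGET and current:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```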

### Model Capabilities

| Model | Highlights |
|---|---|
| **Qwen3-4B-Instruct-2507** | 4B-parameter code-specialized model with 262.1K native context (deployment-limited to 8.1K on Xeon CPU). Supports multi-agent documentation generation, code analysis, and structured JSON output. Enables full on-premises deployment with data sovereignty for enterprise codebases. |
| **gpt-4o-mini** | Cloud-native multimodal model with 128K context, optimized for code understanding and technical documentation. Delivers 42% higher throughput and 26% lower latency versus CPU-based alternatives while supporting concurrent multi-agent orchestration at cloud scale. |
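
vLLM exposes an OpenAI-compatible chat endpoint for the local Qwen3-4B deployment. A minimal sketch of requesting the structured JSON output mentioned above; the base URL, API key handling, and prompt are placeholder assumptions, not DocuBot's configuration:

```python
from openai import OpenAI

# Points at a local vLLM server's OpenAI-compatible endpoint; URL is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507",
    messages=[
        {"role": "system", "content": "Return documentation metadata as JSON."},
        {"role": "user", "content": "Summarize this module as JSON with keys 'summary' and 'public_functions'."},
    ],
    response_format={"type": "json_object"},  # request structured JSON output
)
print(response.choices[0].message.content)
```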
## Environment Variables
Configure the application behavior using environment variables in `api/.env`:
0 commit comments