WIP feat (browsers): create throughput benchmark for browser providers #115
Open: **kisernl** wants to merge 2 commits into `master` from `step-throughput-benchmark`
`browser-throughput-benchmarks.yml` (new file, +190 lines):

```yaml
name: Browser Throughput Benchmark

on:
  pull_request:
    paths:
      - 'src/browser/**'
      - 'src/util/**'
      - 'src/run.ts'
      - 'src/merge-results.ts'
      - 'package.json'
  schedule:
    - cron: '0 3 * * *' # Daily at 03:00 UTC (offset from main browser benchmark)
  workflow_dispatch:
    inputs:
      iterations:
        description: 'Iterations per provider (sessions)'
        required: false
        default: '10'

concurrency:
  group: browser-throughput-benchmarks
  cancel-in-progress: true

permissions:
  contents: write
  pull-requests: write

jobs:
  bench:
    name: Bench ${{ matrix.provider }}
    runs-on: namespace-profile-default
    timeout-minutes: 60
    strategy:
      fail-fast: false
      matrix:
        provider:
          - browserbase
          - hyperbrowser
          - kernel
          - steel
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 24
          cache: 'npm'
      - name: Install dependencies
        run: |
          if [ "${{ github.event_name }}" = "schedule" ]; then
            npm update
          else
            npm ci
          fi
      - name: Clear stale results from checkout
        run: rm -rf results/browser-throughput/
      - name: Run browser throughput benchmark
        env:
          BROWSERBASE_API_KEY: ${{ secrets.BROWSERBASE_API_KEY }}
          BROWSERBASE_PROJECT_ID: ${{ secrets.BROWSERBASE_PROJECT_ID }}
          HYPERBROWSER_API_KEY: ${{ secrets.HYPERBROWSER_API_KEY }}
          KERNEL_API_KEY: ${{ secrets.KERNEL_API_KEY }}
          STEEL_API_KEY: ${{ secrets.STEEL_API_KEY }}
        run: |
          npm run bench -- \
            --mode browser-throughput \
            --provider ${{ matrix.provider }} \
            --iterations ${{ github.event_name == 'pull_request' && '3' || github.event.inputs.iterations || '10' }}
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: browser-throughput-results-${{ matrix.provider }}
          path: results/browser-throughput/
          if-no-files-found: ignore
          retention-days: 7

  collect:
    name: Collect Results
    runs-on: namespace-profile-default
    needs: bench
    if: always()
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 24
          cache: 'npm'
      - name: Install dependencies
        run: |
          if [ "${{ github.event_name }}" = "schedule" ]; then
            npm update
          else
            npm ci
          fi
      - name: Download all artifacts
        uses: actions/download-artifact@v4
        with:
          path: artifacts/
          pattern: browser-throughput-results-*
      - name: Merge results
        run: npx tsx src/merge-results.ts --input artifacts --mode browser-throughput
      - run: npm run generate-browser-throughput-svg
      - name: Upload SVG as artifact
        if: github.event_name == 'pull_request'
        uses: actions/upload-artifact@v4
        with:
          name: browser-throughput-benchmark-svg
          path: browser-throughput.svg
          if-no-files-found: ignore
          retention-days: 7
      - name: Post results to PR
        if: github.event_name == 'pull_request'
        continue-on-error: true
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = require('path');

            const runUrl = `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
            const latestPath = path.join('results', 'browser-throughput', 'latest.json');

            let body = '## Browser Throughput Benchmark Results\n\n';

            if (!fs.existsSync(latestPath)) {
              body += '> No browser-throughput benchmark results were generated.\n\n';
            } else {
              const data = JSON.parse(fs.readFileSync(latestPath, 'utf-8'));
              const results = data.results
                .filter(r => !r.skipped)
                .sort((a, b) => (b.compositeScore || 0) - (a.compositeScore || 0));

              if (results.length === 0) {
                body += '> No browser-throughput benchmark results were generated.\n\n';
              } else {
                body += '| # | Provider | Score | APS (med) | Task (med) | Task (p95) | Screenshot | Status |\n';
                body += '|---|----------|-------|-----------|------------|------------|------------|--------|\n';

                results.forEach((r, i) => {
                  const name = r.provider.charAt(0).toUpperCase() + r.provider.slice(1);
                  const score = r.compositeScore !== undefined ? r.compositeScore.toFixed(1) : '--';
                  const aps = r.summary.actionsPerSecond.median.toFixed(2) + '/s';
                  const taskMed = (r.summary.taskMs.median / 1000).toFixed(2) + 's';
                  const taskP95 = (r.summary.taskMs.p95 / 1000).toFixed(2) + 's';
                  const screenshotMed = Math.round(r.summary.perActionType.screenshot?.median || 0) + 'ms';
                  const expectedActions = 50;
                  const ok = r.iterations.filter(it => !it.error && it.actionsCompleted === expectedActions).length;
                  const count = r.iterations.length;
                  body += `| ${i + 1} | ${name} | ${score} | ${aps} | ${taskMed} | ${taskP95} | ${screenshotMed} | ${ok}/${count} |\n`;
                });

                body += '\n';
              }
            }

            body += `---\n*[View full run](${runUrl}) · SVG available as [build artifact](${runUrl}#artifacts)*`;

            const marker = '## Browser Throughput Benchmark Results';
            const { data: comments } = await github.rest.issues.listComments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
            });

            const existing = comments.find(c => c.body.startsWith(marker));

            if (existing) {
              await github.rest.issues.updateComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                comment_id: existing.id,
                body,
              });
            } else {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.issue.number,
                body,
              });
            }
      - name: Commit and push
        if: github.event_name != 'pull_request'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add package.json package-lock.json browser-throughput.svg results/browser-throughput/
          git diff --cached --quiet && echo "No changes to commit" && exit 0
          git commit -m "chore: update browser throughput benchmark results [skip ci]"
          git push
```
The accompanying documentation (new markdown file, +189 lines):
# Browser Step Throughput Benchmark

This document describes the **browser step throughput benchmark** — a measurement of how fast a browser provider can execute sequential agent-style actions inside a single running session. It is a complement to the existing browser lifecycle benchmark, which measures session provisioning latency.

## Why this benchmark exists

The existing browser benchmark (`src/browser/benchmark.ts`) measures the **lifecycle**:

```
session create → CDP connect → single page load → release
```

That answers one important question: *how fast can I get a fresh browser?* It is the right metric for short-lived sessions where each task spins up a new browser.

It does **not** answer the question that matters for long-running agent workloads:

> *Once a session is up, how fast does each individual action complete, and does performance hold up over the course of a session?*

A vision-based agent might run for thirty minutes to several hours and execute hundreds of browser actions inside a single session. For those workloads, provisioning speed is a negligible fraction of total runtime — the per-action throughput is the bottleneck. A provider that creates a session in 200ms but takes 800ms per screenshot will lose to a provider that takes 2s to create a session but only 100ms per screenshot, every single time.

This benchmark closes that gap.

## What gets measured

For each provider, the benchmark runs **N sessions** (default 10, configurable). Each session executes a fixed sequence of **50 sequential actions** end-to-end inside one running browser. We record, in order:

- Session creation time (`createMs`)
- CDP connection time (`connectMs`)
- Wall-clock duration of each individual action, tagged by action type
- Session release time (`releaseMs`)
- Total wall-clock time (`totalMs`)
- Sum of action durations (`taskMs`)
- Actions per second over the session (`actionsCompleted / (taskMs / 1000)`)

From that raw data we summarize across iterations:

- `actionsPerSecond` — median, p95, p99
- `taskMs` — median, p95, p99
- `totalMs` — median, p95, p99
- `createMs` — median, p95, p99
- `perActionType` — median, p95, p99 for each of the six action types
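The summary statistics above could be computed with a small nearest-rank percentile helper. The sketch below is illustrative, not the repository's implementation; the function names and the interpolation method are assumptions.

```typescript
// Illustrative sketch of median/p95/p99 summaries over raw per-iteration
// values, using nearest-rank percentiles. The real benchmark's helper may
// differ in interpolation details.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  // Index of the smallest value covering p% of the samples.
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function summarize(values: number[]) {
  return {
    median: percentile(values, 50),
    p95: percentile(values, 95),
    p99: percentile(values, 99),
  };
}
```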
## The 50-action sequence

Each session repeats a 10-action loop five times against Wikipedia:

```
1. goto('https://en.wikipedia.org/wiki/Special:Random')
2. waitForSelector('#firstHeading')
3. screenshot()
4. textContent('#firstHeading')
5. click('#mw-content-text a[href^="/wiki/"]:not([href*=":"])')
6. waitForSelector('#firstHeading')
7. screenshot()
8. textContent('#firstHeading')
9. page.goBack({ waitUntil: 'commit' })
10. waitForSelector('#firstHeading')
```

Five loops × ten actions = **50 actions per session**.

This pattern simulates what a vision-based agent actually does on each turn: navigate, wait for the DOM, capture a screenshot for an LLM to look at, extract some text, take an action, observe the result, and move on.
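As a sketch, the loop above can be expressed as data plus a generic timed runner. The `Action` type, `runSequence`, and the `exec` callback here are illustrative stand-ins for the real Playwright-backed runner, not code from this PR.

```typescript
// The 10-action loop as data, repeated five times per session.
type Action = { type: string; arg?: string };

const LOOP: Action[] = [
  { type: 'navigate', arg: 'https://en.wikipedia.org/wiki/Special:Random' },
  { type: 'waitForSelector', arg: '#firstHeading' },
  { type: 'screenshot' },
  { type: 'textContent', arg: '#firstHeading' },
  { type: 'click', arg: '#mw-content-text a[href^="/wiki/"]:not([href*=":"])' },
  { type: 'waitForSelector', arg: '#firstHeading' },
  { type: 'screenshot' },
  { type: 'textContent', arg: '#firstHeading' },
  { type: 'goBack' },
  { type: 'waitForSelector', arg: '#firstHeading' },
];

// 5 loops × 10 actions = 50 ordered actions per session.
const SEQUENCE: Action[] = Array.from({ length: 5 }, () => LOOP).flat();

// Times each action individually; `exec` stands in for the Playwright call.
async function runSequence(
  exec: (a: Action) => Promise<void>,
): Promise<{ index: number; type: string; ms: number; ok: boolean }[]> {
  const timings: { index: number; type: string; ms: number; ok: boolean }[] = [];
  for (let i = 0; i < SEQUENCE.length; i++) {
    const start = performance.now();
    let ok = true;
    try {
      await exec(SEQUENCE[i]); // a failing action is recorded, not fatal
    } catch {
      ok = false;
    }
    timings.push({ index: i, type: SEQUENCE[i].type, ms: performance.now() - start, ok });
  }
  return timings;
}
```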
### Why Wikipedia

Wikipedia's `Special:Random` endpoint is intentionally chosen over real-world target sites. It gives us:

- **Global availability** — no geographic restrictions, no auth flows.
- **Consistent structure** — every article page has `#firstHeading` and a `#mw-content-text` body container, so the same selectors work everywhere.
- **A rich, deterministic link graph** — every random article exposes many `/wiki/...` outbound links to follow.
- **Stable, predictable load times** — Wikipedia's CDN serves pages quickly and consistently across regions.
- **No meaningful bot detection** for scripted, polite traffic.

That isolates the variable we care about: the provider's per-action overhead. Page-level variance is small enough that differences between providers are real, not noise from the target site.

### Why these six action types

Together they cover the surface area of nearly every agent action:

| Action type | Represents |
| ----------------- | ------------------------------------------------------------- |
| `navigate` | Full-page transitions (HTTP + page load + render) |
| `waitForSelector` | DOM polling — measures CDP round-trip + selector evaluation |
| `screenshot` | Pixel capture — relevant for vision-based agents |
| `textContent` | DOM read — cheapest possible action, isolates raw CDP cost |
| `click` | Synthetic input event + waiting for the navigation it triggers |
| `goBack` | History navigation, exercises bfcache behavior |

Per-action breakdown matters: two providers can have identical end-to-end times but very different cost structures (one is screenshot-bound, the other is click-bound). The `perActionType` summary surfaces those differences.

### Stealth + real viewport

Every provider is configured with the settings agent workloads typically use:

```typescript
sessionCreateOptions: {
  stealth: true,
  headless: true,
  viewport: { width: 1920, height: 1080 },
}
```

This makes the comparison apples-to-apples and reflects realistic agent conditions (stealth mode often changes performance characteristics, and a 1920×1080 viewport produces meaningfully larger screenshots than the default).

## How the runner behaves

A few deliberate choices in `runThroughputIteration`:

- **Each action is timed individually** with `performance.now()` immediately before and after the Playwright call. The session timing is the *sum of action durations*, not measured separately — that way action-level numbers always add up to the session number.
- **A failing action does not abort the session.** If `click` times out on action 5, the loop records the failure and proceeds with action 6. This lets us measure partial completion rates and observe how providers degrade under stress, instead of throwing away an entire session because one action got unlucky.
- **The action index is recorded.** With 50 ordered actions per session, downstream analysis can detect if late-session actions are systematically slower than early-session ones — a useful signal for memory leaks or resource exhaustion in long-running sessions.
- **Action timeout is 30 seconds**, applied per-action via `withTimeout`. A single slow action can't hang an entire run, and the timeout lands well above any reasonable real action duration.
- **`page.goBack` uses `waitUntil: 'commit'`** rather than the Playwright default of `'load'`, because browsers restoring a page from the back-forward cache fire `pageshow` instead of `load` — `'load'` would hang for the full timeout on every bfcache restore. The next `waitForSelector` confirms arrival on the previous page.
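A per-action timeout like this is commonly built on `Promise.race`. The sketch below assumes that shape; the repository's actual `withTimeout` helper may differ in signature and behavior.

```typescript
// Sketch of a per-action timeout wrapper; illustrative, not the
// repository's `withTimeout`.
function withTimeout<T>(promise: Promise<T>, ms: number, label = 'action'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that the underlying action keeps running after a timeout; the wrapper only stops the benchmark from waiting on it, which is why the runner records the failure and moves on to the next action.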
## Scoring

The composite score is a single number (0–100, higher is better) for at-a-glance comparison. The weighting was chosen to reflect what actually matters for agent workloads:

```
score = (
    0.40 × score(actionsPerSecond.median)   // throughput is the primary signal
  + 0.25 × score(taskMs.median)             // total time per session
  + 0.20 × score(taskMs.p95)                // tail consistency (worst sessions)
  + 0.15 × score(screenshot.median)         // vision-agent proxy
) × successRate
```

Where the sub-scores are linear:

- `score(actionsPerSecond)` — 0/sec → 0, 10/sec → 100 (linear).
- `score(latencyMs)` — 0ms → 100, 30,000ms → 0 (linear, clamped to 0).
- `successRate` — fraction of sessions that completed all 50 actions without error. A session that completes only 49/50 does not count toward `successRate`. This deliberately punishes flakiness — an agent that fails 1 action in 50 fails 1 in every 50, period.
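The formula above translates directly into code. This is a hypothetical implementation based only on the weights and sub-score anchors stated in this document; the repository's scoring code may differ.

```typescript
// Composite score sketch. Anchors: 10 actions/sec → 100, 30,000ms → 0,
// both linear and clamped to [0, 100].
const clamp = (x: number) => Math.max(0, Math.min(100, x));
const scoreAps = (aps: number) => clamp((aps / 10) * 100);             // 0/s → 0, 10/s → 100
const scoreLatency = (ms: number) => clamp(100 - (ms / 30_000) * 100); // 0ms → 100, 30s → 0

function compositeScore(s: {
  apsMedian: number;
  taskMsMedian: number;
  taskMsP95: number;
  screenshotMsMedian: number;
  successRate: number; // fraction of sessions completing all 50 actions
}): number {
  return (
    0.40 * scoreAps(s.apsMedian) +
    0.25 * scoreLatency(s.taskMsMedian) +
    0.20 * scoreLatency(s.taskMsP95) +
    0.15 * scoreLatency(s.screenshotMsMedian)
  ) * s.successRate;
}
```

With these anchors, a provider at 10 actions/sec with near-zero latencies scores 100, and any session failures scale the whole score down multiplicatively.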
### Why these weights

- **40% on throughput**, because actions/sec is the headline metric for agent workloads. Doubling APS halves the wall-clock cost of any agent task.
- **25% on median task time**, to reward the typical case.
- **20% on p95 task time**, to reward consistency. A provider with a great median but a long tail is dangerous for agents that run for hours — the tail is what you actually pay.
- **15% on screenshot median**, because vision agents bottleneck on screenshot capture. It's separated out so this specific cost can't hide inside the aggregate.
- **× successRate**, because partial successes aren't useful. A provider that wins on speed but fails 10% of sessions is worse than a slower one that finishes.

### Why not just use APS

A single-axis score would hide important detail. A provider can have great throughput but terrible p95 (one in twenty sessions falls off a cliff) — which is unusable for production agents. The composite score forces all four axes to be acceptable to score well.

The full per-action distribution is preserved in the JSON output, so anyone who cares about a different weighting can compute their own score from the raw data.

## Running it

```bash
# Single provider, single session — useful for development
npm run bench:browser-throughput:browserbase -- --iterations 1

# All four providers, default 10 sessions each
npm run bench:browser-throughput

# Specific provider with custom iteration count
npm run bench -- --mode browser-throughput --provider hyperbrowser --iterations 25
```

Required environment variables (set in `.env` or your shell):

- `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID`
- `HYPERBROWSER_API_KEY`
- `KERNEL_API_KEY`
- `STEEL_API_KEY`

Missing credentials cause that provider to be reported as `SKIPPED` rather than failing the run.

## Output

Results are written to `results/browser-throughput/YYYY-MM-DD.json` and copied to `results/browser-throughput/latest.json`. Each iteration's JSON includes the full ordered action list with per-action durations, success flags, and errors — enough to reconstruct any per-action analysis without re-running.
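Pieced together from the fields referenced in this document and in the workflow's PR-comment script, the per-provider result shape is approximately the following. This is an inferred sketch, not an authoritative schema; field names not mentioned above are guesses.

```typescript
// Inferred result shape; not an authoritative schema.
interface Stats { median: number; p95: number; p99: number; }

interface ThroughputIteration {
  createMs: number;       // session creation
  connectMs: number;      // CDP connect
  releaseMs: number;      // session release
  totalMs: number;        // total wall-clock time
  taskMs: number;         // sum of action durations
  actionsCompleted: number;
  error?: string;
  actions: { index: number; type: string; ms: number; ok: boolean }[];
}

interface ProviderResult {
  provider: string;
  skipped?: boolean;
  compositeScore?: number; // 0–100, higher is better
  iterations: ThroughputIteration[];
  summary: {
    actionsPerSecond: Stats;
    taskMs: Stats;
    totalMs: Stats;
    createMs: Stats;
    perActionType: Record<string, Stats>; // one entry per action type
  };
}
```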
The SVG generator produces `browser-throughput.svg` with a ranked comparison table:

```bash
npm run generate-browser-throughput-svg
```

## Scheduling

The GitHub Actions workflow `browser-throughput-benchmarks.yml` runs daily at 03:00 UTC (offset from the lifecycle browser benchmark at 00:00) with 10 iterations per provider. Pull requests touching browser code run a faster 3-iteration version and post a comparison table as a PR comment.

## Limitations

- Wikipedia's CDN is fast and globally distributed — providers in regions closer to Wikipedia's edge nodes will benefit. This is acceptable for a relative comparison but it is not representative of every real-world target site.
- A 50-action session is short relative to real agent workloads. It catches per-action overhead and basic session drift, but multi-hour memory leaks or long-tail GC pauses will not show up here.
- The benchmark does not currently model concurrent sessions per account. Some providers may have very different per-action latency under high concurrency.
- Wikipedia's HTML occasionally changes. If `#firstHeading` or `#mw-content-text` get renamed or restructured, the selectors in the runner will need updating.