feat(embeddings): rate-limit cloud embedding requests to the backend's hard 60/min cap by sanil-23 · Pull Request #2461 · tinyhumansai/openhuman

sanil-23 · 2026-05-21T18:14:32Z

Summary

Throttle outbound cloud embedding requests (OpenHuman cloud, openai, remote custom:) to the backend's hard 60/min per-account cap, proactively, instead of tripping it and absorbing 429s.
Gate is applied at the single shared chokepoint OpenAiEmbedding::embed — the cloud provider delegates to it, openai/custom: use it directly — so none of the embedder construction paths can bypass it.
Budget is a process-global, per-endpoint token bucket keyed by base URL (the quota is account-wide; provider instances are ephemeral). Mirrors the existing proxy::set_runtime_proxy_config global-state pattern.
Capacity is one token (minimum-interval pacing) — no burst that could exceed the hard cap in a rolling minute; an idle bucket still lets a lone interactive query embed through immediately.
Configurable: memory.embedding_rate_limit_per_min (default 60, 0 disables) + env OPENHUMAN_MEMORY_EMBED_RATE_LIMIT. Loopback endpoints are exempt (a local Ollama/LocalAI custom: server isn't the cloud quota this guards).
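
The capacity-one pacing described above can be illustrated with a small, std-only sketch. It is not the code in rate_limit.rs: the `PacingBucket` type, its field names, and the choice to pass time in explicitly (so the math can be checked without sleeping) are assumptions made for this example.

```rust
use std::time::Duration;

/// Hypothetical single-token bucket. `now` is supplied by the caller as an
/// offset from any fixed epoch, which keeps the refill math deterministic.
#[derive(Debug)]
struct PacingBucket {
    tokens: f64,         // current tokens, always within [0, 1]
    refill_per_sec: f64, // limit_per_min / 60
    last: Duration,      // time of the last refill update
}

impl PacingBucket {
    /// `limit_per_min` must be > 0; a limit of 0 (disabled) is assumed to be
    /// handled by the caller before any bucket is created.
    fn new(limit_per_min: u32, now: Duration) -> Self {
        Self {
            tokens: 1.0, // an idle bucket starts full, so a lone request is instant
            refill_per_sec: f64::from(limit_per_min) / 60.0,
            last: now,
        }
    }

    /// Takes the token if one is available; otherwise returns how long the
    /// caller should wait before trying again.
    fn try_take(&mut self, now: Duration) -> Result<(), Duration> {
        let elapsed = now.saturating_sub(self.last).as_secs_f64();
        // Refill is capped at capacity 1, so idle time never accumulates a burst.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(1.0);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            let wait_secs = (1.0 - self.tokens) / self.refill_per_sec;
            Err(Duration::from_secs_f64(wait_secs))
        }
    }
}
```

With a limit of 60, the first `try_take` on a fresh bucket succeeds, a second call at the same instant is told to wait one second, and a long idle period still yields only one immediately available token.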

Problem

The cloud embedding backend caps requests at a hard 60/min per account. Every embed() is one HTTP POST, and memory-tree ingest fans out one call per chunk across job workers, so under load we exceed the cap. There was no proactive limiter — only reactive 429 handling (inference/provider/reliable.rs), and embeddings/openai.rs downgrades the resulting 429 to a warning breadcrumb. So we were hitting the limit and absorbing the error rather than staying under it.

Solution

New src/openhuman/embeddings/rate_limit.rs: async token bucket + process-global registry keyed by endpoint URL; acquire_embedding_slot(), set_embedding_rate_limit(), loopback exemption.
embeddings/openai.rs: acquire_embedding_slot(&self.base_url).await immediately before the POST (after the empty-batch short-circuit).
Config field on MemoryConfig (config/schema/storage_memory.rs), env override + commit to the global limiter in config/schema/load.rs::apply_env_overrides (next to the proxy commit, keeping the pure overlay side-effect-free).
Design decision (hard cap): capacity = 1 token, refilling at limit/60 tokens per second. A bucket with capacity equal to the limit could admit a full burst plus a minute of refill, reaching ~2×limit in the first rolling minute and tripping the hard cap; capacity 1 paces requests with no burst while keeping lone/idle requests instant. Trade-off: a retrieval firing 2–3 query embeds back-to-back may add ~1–2s; sustained ingest runs at the 1/sec the backend allows anyway.
The existing reactive 429 retry/backoff is preserved as a backstop.
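
The burst arithmetic behind the capacity choice can be written out under the usual token-bucket model. The symbols are introduced here for illustration: $C$ is the bucket capacity, $L$ the configured limit per minute, $r$ the refill rate, and $N$ the number of requests admitted in a half-open 60-second window $[t_0, t_0 + 60)$. How the backend actually defines its rolling window is an assumption.

```latex
\[
\begin{aligned}
r &= \frac{L}{60} \quad \text{(tokens per second)} \\
N(t_0,\, t_0 + 60) &< C + 60\,r = C + L
  \quad\Longrightarrow\quad N \le C + L - 1 \\
C = L:&\quad N \le 2L - 1 \quad (119 \text{ for } L = 60) \\
C = 1:&\quad N \le L \quad (60 \text{ for } L = 60)
\end{aligned}
\]
```

The first bound holds because every admitted request consumes one token, and the tokens available before the window closes are at most the $C$ already in the bucket plus strictly less than $60\,r$ of refill.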

Submission Checklist

Tests added or updated (happy path + failure/edge): bucket math incl. capped refill + partial-refill wait, loopback exemption incl. malformed-URL → throttled, no-burst pacing of back-to-back acquires, disabled-limit bypass, set/read round-trip.
Diff coverage ≥ 80% — focused cargo test for the changed modules passes (136 passed / 0 failed); the new module is comprehensively unit-tested. Did not run cargo-llvm-cov locally; the dedicated Rust Core Coverage CI check is the binding gate and will confirm ≥80% on changed lines.
Coverage matrix updated — N/A: no matrix row required for this diff (the Coverage Matrix Sync check passes; this is an internal reliability/throttling behavior, not a catalogued feature surface).
No new external network dependencies — no new crates; rate-limit tests exercise pure logic (no sockets), the existing OpenAI tests keep using the loopback mock.
Manual smoke checklist — N/A: no release-cut UI surface touched (core/config only).
Linked issue closed via Closes #NNN — N/A: ad-hoc work, no tracking issue.

Impact

Platform: core (Rust); affects desktop/CLI memory ingest + retrieval embedding throughput. No UI/Tauri changes.
Performance: cloud embeds now paced at ≤60/min; back-to-back embeds may be spaced ~1s (lone/idle embeds unaffected). Local Ollama/LocalAI and none are not throttled.
Compatibility: additive config field with a serde default (existing config.toml unaffected). No public API or embedding-signature change.
Security/migration: none.

Closes: N/A (ad-hoc, no tracking issue)
Follow-up PR(s)/TODOs: optionally surface embedding_rate_limit_per_min in the config-update RPC (config/schemas.rs MemorySettingsUpdate + ops.rs) and Settings UI; optionally extend the same loopback-exempt gate to the native Ollama embedders if a remote Ollama is ever supported.

AI Authored PR Metadata

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: feat/embedding-rate-limit
Commit SHA: 702d192b76693f41963c0fe16e9a5085ecf21cc1

Validation Run

pnpm --filter openhuman-app format:check — N/A: no app/ changes (ran cargo fmt for Rust instead).
pnpm typecheck — N/A: no TypeScript changes.
Focused tests: cargo test --lib embeddings::rate_limit embeddings::openai embeddings::tests config::schema::storage_memory config::schema::load → 136 passed / 0 failed.
Rust fmt/check (changed): cargo fmt + cargo check --lib + cargo clippy --lib all clean for the 5 changed files.
Tauri fmt/check — N/A: blocked locally (see Validation Blocked); CI runs it.

Validation Blocked

command: pnpm rust:check / cargo check --manifest-path app/src-tauri/Cargo.toml (also the pre-push hook; pushed with --no-verify).
error: failed to read app/src-tauri/vendor/tauri-cef/crates/tauri/Cargo.toml: No such file or directory — the vendored CEF crates are not populated in this worktree.
impact: Environment-only; unrelated to this core-only, additive change. The core lib compiles and tests/clippy pass; the Tauri shell links the core but no shell-facing API changed. CI runs the shell check in a properly-provisioned environment (the Verify tauri-cef submodule pin check passes).

Behavior Changes

Intended behavior change: cloud embedding HTTP requests are throttled to ≤ memory.embedding_rate_limit_per_min (default 60/min).
User-visible effect: under heavy ingest, embeddings pace at the backend's allowed rate instead of erroring; negligible for normal interactive use.

Parity Contract

Legacy behavior preserved: reactive 429 retry/backoff unchanged (now a backstop); ollama/none providers unthrottled; embedding signature unchanged.
Guard/fallback/dispatch parity: limit == 0 and loopback short-circuit before any bucket work; empty-batch embeds still short-circuit before acquiring a token.
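
The ordering of these short-circuits can be summarised as a small decision function. This is only a sketch of the stated contract; `GateDecision` and `gate_decision` are hypothetical names, and the real code performs these checks inline in different places (the empty-batch check in `embed`, the others inside the limiter).

```rust
/// Hypothetical outcome of the pre-request checks.
#[derive(Debug, PartialEq, Eq)]
enum GateDecision {
    /// Nothing to embed: return before touching the limiter or the network.
    SkipEmptyBatch,
    /// Limiter disabled (limit == 0) or loopback endpoint: send immediately.
    Bypass,
    /// Cloud endpoint with a non-zero limit: wait for a token, then send.
    Pace,
}

fn gate_decision(batch_len: usize, limit_per_min: u32, endpoint_is_loopback: bool) -> GateDecision {
    // The empty-batch short-circuit comes first, so it never consumes a token.
    if batch_len == 0 {
        return GateDecision::SkipEmptyBatch;
    }
    // Both bypass conditions are checked before any bucket is looked up.
    if limit_per_min == 0 || endpoint_is_loopback {
        return GateDecision::Bypass;
    }
    GateDecision::Pace
}
```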

Summary by CodeRabbit

New Features

Embedding requests now support configurable per-minute rate limiting (default: 60 requests/minute; set to 0 to disable)
Added OPENHUMAN_MEMORY_EMBED_RATE_LIMIT environment variable for runtime rate limit configuration
Loopback hosts are automatically exempt from rate limiting
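
A loopback check of this kind, including the malformed-URL → throttled case listed in the PR's tests, might look roughly like the following. It is a std-only sketch with a hand-rolled host extractor; `is_loopback_endpoint` is a hypothetical name, and the actual module may parse URLs differently.

```rust
use std::net::IpAddr;

/// Hypothetical helper: true only when the URL's host is clearly loopback.
/// Anything that cannot be parsed returns false, so it stays throttled.
fn is_loopback_endpoint(base_url: &str) -> bool {
    // Require an explicit scheme separator; otherwise treat the URL as malformed.
    let Some((_, rest)) = base_url.split_once("://") else {
        return false;
    };
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Drop any userinfo prefix such as "user:pass@".
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    let host = if let Some(stripped) = host_port.strip_prefix('[') {
        // Bracketed IPv6 literal, e.g. "[::1]:8080".
        match stripped.split_once(']') {
            Some((h, _)) => h,
            None => return false,
        }
    } else {
        // Strip an optional ":port" suffix.
        host_port.rsplit_once(':').map_or(host_port, |(h, _)| h)
    };
    if host.eq_ignore_ascii_case("localhost") {
        return true;
    }
    // Covers 127.0.0.0/8 and ::1; hostnames fail to parse and are throttled.
    host.parse::<IpAddr>().map_or(false, |ip| ip.is_loopback())
}
```

Defaulting to "not loopback" on any parse failure is the conservative choice here: a misclassified local server is merely paced, whereas a misclassified cloud endpoint could trip the hard cap.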

…s hard 60/min

Cloud embedding backends (OpenHuman/Voyage, OpenAI, custom remote endpoints) cap requests at a hard 60/min per account. Every embed() is one HTTP POST and memory-tree ingest fans out one call per chunk across job workers, so without throttling we trip the cap and absorb 429s (openai.rs downgrades them to a warning breadcrumb).

Gate every cloud embed at the shared OpenAiEmbedding::embed chokepoint (the cloud provider delegates to it; openai/custom use it directly) through a process-global, per-endpoint token bucket keyed by base URL. Capacity is one token (minimum-interval pacing) so we never burst past the hard cap; an idle bucket still lets a lone interactive query embed through immediately. Loopback endpoints are exempt -- a local Ollama/LocalAI server isn't the cloud quota this guards.

Configurable via memory.embedding_rate_limit_per_min (default 60, 0 disables) and OPENHUMAN_MEMORY_EMBED_RATE_LIMIT; committed to the process-global limiter at config load alongside the proxy commit.

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-05-21T18:16:04Z

📝 Walkthrough

Adds a process-global, per-endpoint token-bucket rate limiter for embedding requests, exposes it as a public submodule, wires configuration (including OPENHUMAN_MEMORY_EMBED_RATE_LIMIT), and invokes the acquisition gate from the OpenAI embedding provider. Includes comprehensive unit tests.

Changes

Embedding request rate limiting

| Layer / File(s) | Summary |
| --- | --- |
| Configuration schema and defaults `src/openhuman/config/schema/storage_memory.rs` | `MemoryConfig` adds `embedding_rate_limit_per_min: u32`, a `default_embedding_rate_limit_per_min()` returning 60, Default initialization, and Debug output updates. |
| Configuration loading and global limiter setup `src/openhuman/config/schema/load.rs` | `apply_env_overlay_with` parses `OPENHUMAN_MEMORY_EMBED_RATE_LIMIT`; `apply_env_overrides_from` calls `set_embedding_rate_limit` with the resolved value. |
| Rate limiter implementation and tests `src/openhuman/embeddings/rate_limit.rs` | New process-global per-endpoint token-bucket limiter with single-token no-burst buckets, refill math, loopback exemption, setter/getter, async `acquire_embedding_slot`, and unit tests for classification, pacing, refill, and waits. |
| Module exposure and provider integration `src/openhuman/embeddings/mod.rs`, `src/openhuman/embeddings/openai.rs` | Public `rate_limit` submodule and integration: `OpenAiEmbedding::embed` awaits `acquire_embedding_slot(&self.base_url)` before sending outbound requests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tinyhumansai/openhuman#2190: Also modifies OpenAiEmbedding::embed; that PR changes error reporting for non-2xx responses which may interact with rate-limit behavior.

Suggested reviewers

graycyrus
M3gA-Mind

Poem

🐰 I count the tokens, one by one,
Slow the race of requests undone,
Loopback hops skip past the gate,
Buckets fill to keep the rate,
Embeddings hum at measured fun.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(embeddings): rate-limit cloud embedding requests to the backend's hard 60/min cap' is clear, specific, and directly summarizes the main change: implementing rate limiting for cloud embedding requests. It accurately reflects the primary purpose of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 1742-1746: The code in apply_env_overlay_with reads std::env
directly (OPENHUMAN_MEMORY_EMBED_RATE_LIMIT) instead of using the injected
EnvLookup; change that branch to call the provided EnvLookup instance (e.g.,
env_lookup or env) to retrieve the variable (use its get, get_var, or similar
method), then trim/parse::<u32>() and assign to
self.memory.embedding_rate_limit_per_min as before so injected-env tests and
overlay behavior are consistent; keep the existing parsing and assignment logic
but source the value from the EnvLookup parameter rather than std::env::var.

In `@src/openhuman/embeddings/rate_limit.rs`:
- Around line 59-66: The function set_embedding_rate_limit currently clears
BUCKETS on every call; change it to first read the existing value via
CONFIGURED_LIMIT.load(Ordering::Relaxed) and compare to the incoming per_minute,
and only call CONFIGURED_LIMIT.store(...) and clear the registry (BUCKETS.get()
... .clear()) when the configured limit actually differs; keep the same
lock/unwrapping logic around BUCKETS but avoid resetting pacing state when the
value is unchanged.
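
One way to address the second comment is sketched below. The names `CONFIGURED_LIMIT`, `BUCKETS` and `set_embedding_rate_limit` come from the comment itself; the `f64` bucket value, the `buckets()` accessor and the `bool` return value are placeholders added so the example is self-contained and checkable. An atomic `swap` is used instead of load-then-store so the read and the write happen as one operation.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Mutex, OnceLock};

// Process-global state; the f64 value stands in for the real bucket type.
static CONFIGURED_LIMIT: AtomicU32 = AtomicU32::new(60);
static BUCKETS: OnceLock<Mutex<HashMap<String, f64>>> = OnceLock::new();

/// Hypothetical accessor that lazily creates the registry.
fn buckets() -> &'static Mutex<HashMap<String, f64>> {
    BUCKETS.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Stores the new limit and clears per-endpoint pacing state only when the
/// value actually changed. Returns true if the registry was reset (the
/// return value exists only to make the behaviour observable here).
fn set_embedding_rate_limit(per_minute: u32) -> bool {
    let previous = CONFIGURED_LIMIT.swap(per_minute, Ordering::Relaxed);
    if previous == per_minute {
        // Unchanged value: keep existing buckets so a config reload does not
        // hand every endpoint a fresh token.
        return false;
    }
    // Tolerate a poisoned lock rather than panicking on a config reload.
    buckets()
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
        .clear();
    true
}
```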

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fffad63d-21ec-4867-911f-efbe2aee52b5

📥 Commits

Reviewing files that changed from the base of the PR and between b1bbc53 and 702d192.

📒 Files selected for processing (5)

src/openhuman/config/schema/load.rs
src/openhuman/config/schema/storage_memory.rs
src/openhuman/embeddings/mod.rs
src/openhuman/embeddings/openai.rs
src/openhuman/embeddings/rate_limit.rs

- set_embedding_rate_limit: only clear the per-endpoint bucket registry when the rate actually changes (swap + compare), so repeated config reloads with an unchanged value don't keep handing out a fresh burst token and erode the hard-cap pacing guarantee.
- load.rs: read OPENHUMAN_MEMORY_EMBED_RATE_LIMIT via the injected EnvLookup (env.get) rather than std::env, so the override honors the apply_env_overlay_with contract and works under injected-env tests.

Co-Authored-By: Claude <noreply@anthropic.com>
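
The injected-lookup pattern from the load.rs fix can be sketched as follows. Only the names `EnvLookup`, `get` and the environment variable come from the PR; the trait's exact signature, the `MapEnv` test double, the `overlay_rate_limit` helper, and the choice to keep the current value when parsing fails are all assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the injected environment lookup.
trait EnvLookup {
    fn get(&self, key: &str) -> Option<String>;
}

/// In-memory lookup of the kind an injected-env test might use.
struct MapEnv(HashMap<String, String>);

impl EnvLookup for MapEnv {
    fn get(&self, key: &str) -> Option<String> {
        self.0.get(key).cloned()
    }
}

/// Returns the rate limit after applying the env override, reading the
/// variable through `env` instead of std::env::var.
fn overlay_rate_limit(current: u32, env: &dyn EnvLookup) -> u32 {
    match env.get("OPENHUMAN_MEMORY_EMBED_RATE_LIMIT") {
        // Assumption: an unparsable value leaves the configured limit as-is.
        Some(raw) => raw.trim().parse::<u32>().unwrap_or(current),
        None => current,
    }
}
```

Because the lookup is a parameter, a test can exercise the override without mutating the real process environment.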

sanil-23 · 2026-05-22T07:51:35Z

@graycyrus pls review

graycyrus

Looks good, nice work!

…s hard 60/min cap (tinyhumansai#2461) Co-authored-by: sanil-23 <sanil@alphahuman.xyz> Co-authored-by: Claude <noreply@anthropic.com>

sanil-23 requested a review from a team May 21, 2026 18:14

coderabbitai Bot added feature Net-new user-facing capability or product behavior. working A PR that is being worked on by the team. labels May 21, 2026

coderabbitai Bot requested changes May 21, 2026

Comment thread src/openhuman/config/schema/load.rs Outdated

Comment thread src/openhuman/embeddings/rate_limit.rs

coderabbitai Bot approved these changes May 21, 2026

graycyrus approved these changes May 22, 2026

graycyrus merged commit 9d0cce7 into tinyhumansai:main May 22, 2026
36 of 37 checks passed

graycyrus merged 2 commits into tinyhumansai:main from sanil-23:feat/embedding-rate-limit
