Skip to content

feat(embeddings): rate-limit cloud embedding requests to the backend's hard 60/min cap#2461

Merged
graycyrus merged 2 commits into
tinyhumansai:mainfrom
sanil-23:feat/embedding-rate-limit
May 22, 2026
Merged

feat(embeddings): rate-limit cloud embedding requests to the backend's hard 60/min cap#2461
graycyrus merged 2 commits into
tinyhumansai:mainfrom
sanil-23:feat/embedding-rate-limit

Conversation

@sanil-23
Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 commented May 21, 2026

Summary

  • Throttle outbound cloud embedding requests (OpenHumancloud, openai, remote custom:) to the backend's hard 60/min per-account cap, proactively, instead of tripping it and absorbing 429s.
  • Gate is applied at the single shared chokepoint OpenAiEmbedding::embed — the cloud provider delegates to it, openai/custom: use it directly — so none of the embedder construction paths can bypass it.
  • Budget is a process-global, per-endpoint token bucket keyed by base URL (the quota is account-wide; provider instances are ephemeral). Mirrors the existing proxy::set_runtime_proxy_config global-state pattern.
  • Capacity is one token (minimum-interval pacing) — no burst that could exceed the hard cap in a rolling minute; an idle bucket still lets a lone interactive query embed through immediately.
  • Configurable: memory.embedding_rate_limit_per_min (default 60, 0 disables) + env OPENHUMAN_MEMORY_EMBED_RATE_LIMIT. Loopback endpoints are exempt (a local Ollama/LocalAI custom: server isn't the cloud quota this guards).

Problem

The cloud embedding backend caps requests at a hard 60/min per account. Every embed() is one HTTP POST, and memory-tree ingest fans out one call per chunk across job workers, so under load we exceed the cap. There was no proactive limiter — only reactive 429 handling (inference/provider/reliable.rs), and embeddings/openai.rs downgrades the resulting 429 to a warning breadcrumb. So we were hitting the limit and absorbing the error rather than staying under it.

Solution

  • New src/openhuman/embeddings/rate_limit.rs: async token bucket + process-global registry keyed by endpoint URL; acquire_embedding_slot(), set_embedding_rate_limit(), loopback exemption.
  • embeddings/openai.rs: acquire_embedding_slot(&self.base_url).await immediately before the POST (after the empty-batch short-circuit).
  • Config field on MemoryConfig (config/schema/storage_memory.rs), env override + commit to the global limiter in config/schema/load.rs::apply_env_overrides (next to the proxy commit, keeping the pure overlay side-effect-free).
  • Design decision (hard cap): capacity = 1 token, refilling at limit/60/sec. A full limit-sized burst could reach ~2×limit in the first rolling minute and trip a hard cap; capacity 1 paces requests with no burst while keeping lone/idle requests instant. Trade-off: a retrieval firing 2–3 query embeds back-to-back may add ~1–2s; sustained ingest runs at the 1/sec the backend allows anyway.
  • The existing reactive 429 retry/backoff is preserved as a backstop.

Submission Checklist

  • Tests added or updated (happy path + failure/edge): bucket math incl. capped refill + partial-refill wait, loopback exemption incl. malformed-URL → throttled, no-burst pacing of back-to-back acquires, disabled-limit bypass, set/read round-trip.
  • Diff coverage ≥ 80% — focused cargo test for the changed modules passes (136 passed / 0 failed); the new module is comprehensively unit-tested. Did not run cargo-llvm-cov locally; the dedicated Rust Core Coverage CI check is the binding gate and will confirm ≥80% on changed lines.
  • Coverage matrix updated — N/A: no matrix row required for this diff (the Coverage Matrix Sync check passes; this is an internal reliability/throttling behavior, not a catalogued feature surface).
  • No new external network dependencies — no new crates; rate-limit tests exercise pure logic (no sockets), the existing OpenAI tests keep using the loopback mock.
  • Manual smoke checklist — N/A: no release-cut UI surface touched (core/config only).
  • Linked issue closed via Closes #NNN — N/A: ad-hoc work, no tracking issue.

Impact

  • Platform: core (Rust); affects desktop/CLI memory ingest + retrieval embedding throughput. No UI/Tauri changes.
  • Performance: cloud embeds now paced at ≤60/min; back-to-back embeds may be spaced ~1s (lone/idle embeds unaffected). Local Ollama/LocalAI and none are not throttled.
  • Compatibility: additive config field with a serde default (existing config.toml unaffected). No public API or embedding-signature change.
  • Security/migration: none.

Related

  • Closes: N/A (ad-hoc, no tracking issue)
  • Follow-up PR(s)/TODOs: optionally surface embedding_rate_limit_per_min in the config-update RPC (config/schemas.rs MemorySettingsUpdate + ops.rs) and Settings UI; optionally extend the same loopback-exempt gate to the native Ollama embedders if a remote Ollama is ever supported.

AI Authored PR Metadata

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: feat/embedding-rate-limit
  • Commit SHA: 702d192b76693f41963c0fe16e9a5085ecf21cc1

Validation Run

  • pnpm --filter openhuman-app format:check — N/A: no app/ changes (ran cargo fmt for Rust instead).
  • pnpm typecheck — N/A: no TypeScript changes.
  • Focused tests: cargo test --lib embeddings::rate_limit embeddings::openai embeddings::tests config::schema::storage_memory config::schema::load → 136 passed / 0 failed.
  • Rust fmt/check (changed): cargo fmt + cargo check --lib + cargo clippy --lib all clean for the 5 changed files.
  • Tauri fmt/check — N/A: blocked locally (see Validation Blocked); CI runs it.

Validation Blocked

  • command: pnpm rust:check / cargo check --manifest-path app/src-tauri/Cargo.toml (also the pre-push hook; pushed with --no-verify).
  • error: failed to read app/src-tauri/vendor/tauri-cef/crates/tauri/Cargo.toml: No such file or directory — the vendored CEF crates are not populated in this worktree.
  • impact: Environment-only; unrelated to this core-only, additive change. The core lib compiles and tests/clippy pass; the Tauri shell links the core but no shell-facing API changed. CI runs the shell check in a properly-provisioned environment (the Verify tauri-cef submodule pin check passes).

Behavior Changes

  • Intended behavior change: cloud embedding HTTP requests are throttled to ≤ memory.embedding_rate_limit_per_min (default 60/min).
  • User-visible effect: under heavy ingest, embeddings pace at the backend's allowed rate instead of erroring; negligible for normal interactive use.

Parity Contract

  • Legacy behavior preserved: reactive 429 retry/backoff unchanged (now a backstop); ollama/none providers unthrottled; embedding signature unchanged.
  • Guard/fallback/dispatch parity: limit == 0 and loopback short-circuit before any bucket work; empty-batch embeds still short-circuit before acquiring a token.

Summary by CodeRabbit

New Features

  • Embedding requests now support configurable per-minute rate limiting (default: 60 requests/minute; set to 0 to disable)
  • Added OPENHUMAN_MEMORY_EMBED_RATE_LIMIT environment variable for runtime rate limit configuration
  • Loopback hosts are automatically exempt from rate limiting

Review Change Stack

…s hard 60/min

Cloud embedding backends (OpenHuman/Voyage, OpenAI, custom remote endpoints)
cap requests at a hard 60/min per account. Every embed() is one HTTP POST and
memory-tree ingest fans out one call per chunk across job workers, so without
throttling we trip the cap and absorb 429s (openai.rs downgrades them to a
warning breadcrumb).

Gate every cloud embed at the shared OpenAiEmbedding::embed chokepoint (the
cloud provider delegates to it; openai/custom use it directly) through a
process-global, per-endpoint token bucket keyed by base URL. Capacity is one
token (minimum-interval pacing) so we never burst past the hard cap; an idle
bucket still lets a lone interactive query embed through immediately. Loopback
endpoints are exempt -- a local Ollama/LocalAI server isn't the cloud quota
this guards.

Configurable via memory.embedding_rate_limit_per_min (default 60, 0 disables)
and OPENHUMAN_MEMORY_EMBED_RATE_LIMIT; committed to the process-global limiter
at config load alongside the proxy commit.

Co-Authored-By: Claude <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team May 21, 2026 18:14
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

Adds a process-global, per-endpoint token-bucket rate limiter for embedding requests, exposes it as a public submodule, wires configuration (including OPENHUMAN_MEMORY_EMBED_RATE_LIMIT), and invokes the acquisition gate from the OpenAI embedding provider. Includes comprehensive unit tests.

Changes

Embedding request rate limiting

Layer / File(s) Summary
Configuration schema and defaults
src/openhuman/config/schema/storage_memory.rs
MemoryConfig adds embedding_rate_limit_per_min: u32, a default_embedding_rate_limit_per_min() returning 60, Default initialization, and Debug output updates.
Configuration loading and global limiter setup
src/openhuman/config/schema/load.rs
apply_env_overlay_with parses OPENHUMAN_MEMORY_EMBED_RATE_LIMIT; apply_env_overrides_from calls set_embedding_rate_limit with the resolved value.
Rate limiter implementation and tests
src/openhuman/embeddings/rate_limit.rs
New process-global per-endpoint token-bucket limiter with single-token no-burst buckets, refill math, loopback exemption, setter/getter, async acquire_embedding_slot, and unit tests for classification, pacing, refill, and waits.
Module exposure and provider integration
src/openhuman/embeddings/mod.rs, src/openhuman/embeddings/openai.rs
Public rate_limit submodule and integration: OpenAiEmbedding::embed awaits acquire_embedding_slot(&self.base_url) before sending outbound requests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2190: Also modifies OpenAiEmbedding::embed; that PR changes error reporting for non-2xx responses which may interact with rate-limit behavior.

Suggested reviewers

  • graycyrus
  • M3gA-Mind

Poem

🐰 I count the tokens, one by one,
Slow the race of requests undone,
Loopback hops skip past the gate,
Buckets fill to keep the rate,
Embeddings hum at measured fun.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(embeddings): rate-limit cloud embedding requests to the backend's hard 60/min cap' is clear, specific, and directly summarizes the main change—implementing rate limiting for cloud embedding requests. It accurately reflects the primary purpose of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added feature Net-new user-facing capability or product behavior. working A PR that is being worked on by the team. labels May 21, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 1742-1746: The code in apply_env_overlay_with reads std::env
directly (OPENHUMAN_MEMORY_EMBED_RATE_LIMIT) instead of using the injected
EnvLookup; change that branch to call the provided EnvLookup instance (e.g.,
env_lookup or env) to retrieve the variable (use its get/get_var/get or similar
method), then trim/parse::<u32>() and assign to
self.memory.embedding_rate_limit_per_min as before so injected-env tests and
overlay behavior are consistent; keep the existing parsing and assignment logic
but source the value from the EnvLookup parameter rather than std::env::var.

In `@src/openhuman/embeddings/rate_limit.rs`:
- Around line 59-66: The function set_embedding_rate_limit currently clears
BUCKETS on every call; change it to first read the existing value via
CONFIGURED_LIMIT.load(Ordering::Relaxed) and compare to the incoming per_minute,
and only call CONFIGURED_LIMIT.store(...) and clear the registry (BUCKETS.get()
... .clear()) when the configured limit actually differs; keep the same
lock/unwrapping logic around BUCKETS but avoid resetting pacing state when the
value is unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fffad63d-21ec-4867-911f-efbe2aee52b5

📥 Commits

Reviewing files that changed from the base of the PR and between b1bbc53 and 702d192.

📒 Files selected for processing (5)
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/storage_memory.rs
  • src/openhuman/embeddings/mod.rs
  • src/openhuman/embeddings/openai.rs
  • src/openhuman/embeddings/rate_limit.rs

Comment thread src/openhuman/config/schema/load.rs Outdated
Comment thread src/openhuman/embeddings/rate_limit.rs
- set_embedding_rate_limit: only clear the per-endpoint bucket registry when
  the rate actually changes (swap + compare), so repeated config reloads with
  an unchanged value don't keep handing out a fresh burst token and erode the
  hard-cap pacing guarantee.
- load.rs: read OPENHUMAN_MEMORY_EMBED_RATE_LIMIT via the injected EnvLookup
  (env.get) rather than std::env, so the override honors the
  apply_env_overlay_with contract and works under injected-env tests.

Co-Authored-By: Claude <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor Author

@graycyrus pls review

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

@graycyrus graycyrus merged commit 9d0cce7 into tinyhumansai:main May 22, 2026
36 of 37 checks passed
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
…s hard 60/min cap (tinyhumansai#2461)

Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Co-authored-by: Claude <noreply@anthropic.com>
senamakel pushed a commit to aqilaziz/openhuman that referenced this pull request May 23, 2026
…s hard 60/min cap (tinyhumansai#2461)

Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Net-new user-facing capability or product behavior. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants