Open
25 commits
22c182d
createCache(): Take CacheInfo as argument
edolstra Jun 7, 2026
d94497a
Deduplicate Cache/CacheInfo
edolstra Jun 7, 2026
343b7b1
nix serve: publish a bloom filter of valid store paths
edolstra Jun 6, 2026
6658cb9
BinaryCacheStore: consult bloom filter to skip definite misses
edolstra Jun 6, 2026
885295d
libstore: factor bloom-filter bit positions into a shared helper
edolstra Jun 6, 2026
0d82d91
Formatting
edolstra Jun 7, 2026
4bf9be5
libstore: hoist buildBloomFilter and add nix store generate-bloom-filter
edolstra Jun 7, 2026
c41ff96
Test false-positive rate
edolstra Jun 7, 2026
bc71363
tests: cover bloom-filter rule-out and disk-cache reuse via nix serve
edolstra Jun 7, 2026
e12c37e
nix serve: Implement ETag for the bloom filter
edolstra Jun 7, 2026
696e27e
tests: loop fake hashparts until one is ruled out by the bloom filter
edolstra Jun 7, 2026
1dbe07a
nix serve: Add --false-positive-rate flag for the bloom filter
edolstra Jun 7, 2026
8781bbe
Formatting
edolstra Jun 7, 2026
4b4ff7f
bloom filter: switch wire integers to u64 and use StringSink/StringSo…
edolstra Jun 7, 2026
55577fa
Move ConditionalGetResult
edolstra Jun 7, 2026
4f41029
bloom filter -> Bloom filter
edolstra Jun 8, 2026
38d11a1
bloom filter: validate falsePositiveRate and floor mBits at 8
edolstra Jun 8, 2026
1e5fe74
bloom filter: support absolute BloomFilter URLs
edolstra Jun 8, 2026
3f8dc57
bloom filter: fix doc comments
edolstra Jun 8, 2026
bf3cec8
bloom filter: drop BloomState; store raw blob; combine lookup+probe
edolstra Jun 8, 2026
538e107
bloom filter: inline maybeDisableBloomFilter as a local lambda
edolstra Jun 8, 2026
cbe4c3f
Drop catch all
edolstra Jun 8, 2026
c400002
Drop testing code
edolstra Jun 8, 2026
063501b
binary cache: add use-bloom-filter setting (default true)
edolstra Jun 8, 2026
cfce945
Drop noexcept
edolstra Jun 10, 2026
1 change: 1 addition & 0 deletions doc/manual/source/SUMMARY.md.in
@@ -129,6 +129,7 @@
- [Store Path Specification](protocols/store-path.md)
- [Nix Archive (NAR) Format](protocols/nix-archive/index.md)
- [Nix Cache Info Format](protocols/nix-cache-info.md)
- [Binary Cache Bloom Filter Format](protocols/binary-cache-bloom-filter.md)
- [Derivation "ATerm" file format](protocols/derivation-aterm.md)
- [Nix32 Encoding](protocols/nix32.md)
- [`builtins.wasm` Host Interface](protocols/wasm.md)
76 changes: 76 additions & 0 deletions doc/manual/source/protocols/binary-cache-bloom-filter.md
@@ -0,0 +1,76 @@
# Binary Cache Bloom Filter Format

A [binary cache](@docroot@/package-management/binary-cache-substituter.md) may publish a Bloom filter of all store paths it contains.
The filter's URL is announced through the [`BloomFilter`](@docroot@/protocols/nix-cache-info.md#bloomfilter) field of the cache's [`nix-cache-info`](@docroot@/protocols/nix-cache-info.md) file — either as an absolute URL or as a path relative to the cache root.
A cache that does not advertise the field does not provide a Bloom filter; clients must not probe for one at a default path.

A Bloom filter lets a client decide that a store path is **definitely not** in the cache without issuing a `.narinfo` request.
Membership tests are one-sided: a "not present" answer is authoritative, while a "possibly present" answer must still be confirmed by fetching the `.narinfo`.
False positives occur at a configurable rate; false negatives do not.

MIME type: `application/octet-stream`

## Format

The response is binary, little-endian, with a fixed 32-byte header followed by the raw bit array:

| Offset | Size | Field | Description |
|-------:|-----------:|-----------|----------------------------------------------------------|
| 0 | 8 | `magic` | ASCII bytes `NixBloom` (no terminating NUL). |
| 8 | 8 | `version` | `uint64` format version. Currently `1`. |
| 16 | 8 | `k` | `uint64` number of hash functions. |
| 24 | 8 | `m` | `uint64` size of the bit array, in bits. Nonzero multiple of 8. |
| 32 | `m / 8` | `bits` | The bit array. The bit at position `p` is `(bits[p / 8] >> (p % 8)) & 1`. |

The total response size is `32 + m / 8` bytes.
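
The header layout above can be checked with a few lines of standalone C++. The following is a minimal sketch, not the code used by Nix: the names `FilterHeader`, `parseHeader`, and `encodeHeader` are hypothetical, and the validation rules (version `1`, nonzero `m` that is a multiple of 8, exact body length) are taken from the table and the size formula above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical type for this sketch; not part of the Nix API.
struct FilterHeader {
    uint64_t version;
    uint64_t k;
    uint64_t mBits;
};

constexpr std::size_t headerLen = 32;

// Read a little-endian uint64 starting at byte `offset`.
static uint64_t readLE64(std::string_view buf, std::size_t offset)
{
    uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i)
        v |= uint64_t(static_cast<unsigned char>(buf[offset + i])) << (8 * i);
    return v;
}

// Return the header if `body` is a well-formed version-1 filter
// (header plus exactly m / 8 bytes of bit array), otherwise nullopt.
std::optional<FilterHeader> parseHeader(std::string_view body)
{
    if (body.size() < headerLen || body.substr(0, 8) != "NixBloom")
        return std::nullopt;
    FilterHeader h{readLE64(body, 8), readLE64(body, 16), readLE64(body, 24)};
    if (h.version != 1 || h.mBits == 0 || h.mBits % 8 != 0)
        return std::nullopt;
    if (body.size() != headerLen + h.mBits / 8)
        return std::nullopt;
    return h;
}

// Inverse of the layout, included only so the sketch can be exercised.
std::string encodeHeader(uint64_t k, uint64_t mBits)
{
    std::string out = "NixBloom";
    for (uint64_t v : {uint64_t(1), k, mBits})
        for (int i = 0; i < 8; ++i)
            out.push_back(char((v >> (8 * i)) & 0xff));
    return out;
}
```

For instance, `parseHeader(encodeHeader(3, 16) + std::string(2, '\0'))` yields a header with `k == 3` and `mBits == 16`, while the same header without the two bit-array bytes is rejected.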

## Membership test

A client tests whether a store path *might* be in the cache as follows:

1. Take the path's [hash part](@docroot@/protocols/store-path.md) — the first 32 [Nix32](@docroot@/protocols/nix32.md) characters of its base name.
2. Decode it into a 20-byte (160-bit) sequence using Nix32 decoding.
3. Read two 64-bit unsigned values from the decoded bytes, little-endian:
   - `h1` from bytes `0..8` (offsets 0 through 7)
   - `h2` from bytes `8..16` (offsets 8 through 15)

   (The trailing 4 bytes are unused.)
4. For each `i` in `0, 1, …, k − 1`, compute the bit position
   ```
   pos = ((h1 + i * h2) mod 2^64) mod m
   ```
   The intermediate addition and multiplication wrap modulo 2^64 (standard unsigned 64-bit overflow) before the modulo by `m`.
5. If `(bits[pos / 8] >> (pos % 8)) & 1` is `1` for every computed `pos`, the path is *possibly* present; otherwise it is *definitely not* present.

This is the standard Kirsch-Mitzenmacher double-hashing scheme.
Because a store path's hash part is already a cryptographic hash, no further hashing is required.
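
The membership procedure can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the Nix implementation: the hash part is assumed to be already Nix32-decoded into 20 bytes, and the names `HashPart`, `bloomBitPosition`, and `possiblyPresent` are hypothetical. An empty bit array is treated as "possibly present" so that a caller would fall back to the `.narinfo` request.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decoded hash part: 20 bytes (assumed to be Nix32-decoded by the caller).
using HashPart = std::array<uint8_t, 20>;

// Read a little-endian uint64 from 8 bytes at `p`.
static uint64_t readLE64(const uint8_t * p)
{
    uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i)
        v |= uint64_t(p[i]) << (8 * i);
    return v;
}

// Bit position of the i-th probe. uint64_t arithmetic wraps modulo 2^64,
// which matches the wrap-around described in step 4.
uint64_t bloomBitPosition(const HashPart & h, uint64_t i, uint64_t mBits)
{
    uint64_t h1 = readLE64(h.data());
    uint64_t h2 = readLE64(h.data() + 8);
    return (h1 + i * h2) % mBits;
}

// True means "possibly present"; false means "definitely not present".
bool possiblyPresent(const std::vector<uint8_t> & bits, uint64_t k, const HashPart & h)
{
    uint64_t mBits = uint64_t(bits.size()) * 8;
    if (mBits == 0)
        return true; // no usable filter: never rule a path out
    for (uint64_t i = 0; i < k; ++i) {
        uint64_t pos = bloomBitPosition(h, i, mBits);
        if (((bits[pos / 8] >> (pos % 8)) & 1) == 0)
            return false;
    }
    return true;
}
```

With `h1 = 5`, `h2 = 3`, `k = 2`, and `m = 16`, the probes land on bits 5 and 8, so a filter whose bytes are `0x20 0x01` reports "possibly present" and one whose second byte is zero reports "definitely not present".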

## Server-side construction

The server populates the filter by performing the same membership procedure for every valid store path and OR-ing in the resulting bits.
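
As a rough illustration of that population step, the sketch below sets the `k` bits for each path in a zero-initialised bit array. It is not the server's actual code: `HashPart` and `buildBits` are hypothetical names, and the hash parts are assumed to be already decoded to 20 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Decoded hash part: 20 bytes (assumed to be Nix32-decoded by the caller).
using HashPart = std::array<uint8_t, 20>;

// Read a little-endian uint64 from 8 bytes at `p`.
static uint64_t readLE64(const uint8_t * p)
{
    uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i)
        v |= uint64_t(p[i]) << (8 * i);
    return v;
}

// Build the raw bit array (without the 32-byte header) for a set of paths.
std::vector<uint8_t> buildBits(const std::vector<HashPart> & paths, uint64_t k, uint64_t mBits)
{
    if (mBits == 0 || mBits % 8 != 0)
        throw std::invalid_argument("mBits must be a nonzero multiple of 8");
    std::vector<uint8_t> bits(mBits / 8, 0);
    for (const auto & h : paths) {
        uint64_t h1 = readLE64(h.data());
        uint64_t h2 = readLE64(h.data() + 8);
        for (uint64_t i = 0; i < k; ++i) {
            uint64_t pos = (h1 + i * h2) % mBits; // same positions as the client computes
            bits[pos / 8] |= uint8_t(1u << (pos % 8));
        }
    }
    return bits;
}
```

Because the server and the client derive the positions the same way, every inserted path is guaranteed to test as "possibly present".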

Parameters are chosen from the count `n` of valid paths and a target false-positive rate `p`:

```
m = max(8, ceil(-n * ln(p) / (ln 2)^2) rounded up to a multiple of 8)
k = max(1, round((m / n) * ln 2))
```

If `n` is zero, the server may emit a minimal filter (e.g., `m = 8`, `k = 1`, all bits zero), which correctly reports every query as "not present".

The choice of `p` is server-defined and not advertised separately: a client can estimate the false-positive rate from `m`, `k`, and the number of paths in the cache, but does not need to do so in order to use the filter.
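
The parameter formulas can be sketched in standalone C++ as below. This is an illustration rather than the server's code: `BloomParams` and `chooseBloomParams` are hypothetical names, and the 8-bit floor and the rejection of rates outside the open interval (0, 1) are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical type for this sketch; not part of the Nix API.
struct BloomParams {
    uint64_t mBits;
    uint64_t k;
};

// Choose m and k for n paths and a target false-positive rate p.
BloomParams chooseBloomParams(uint64_t n, double p)
{
    // Written as a negated conjunction so that NaN is rejected too.
    if (!(p > 0 && p < 1))
        throw std::invalid_argument("false-positive rate must be in (0, 1)");
    if (n == 0)
        return {8, 1}; // minimal filter: every query is "not present"

    const double ln2 = std::log(2.0);
    double mF = -double(n) * std::log(p) / (ln2 * ln2);
    // Round up to a multiple of 8 and keep at least one byte of bits.
    uint64_t mBits = std::max<uint64_t>(8, ((uint64_t(std::ceil(mF)) + 7) / 8) * 8);
    long long k = std::llround(double(mBits) / double(n) * ln2);
    return {mBits, uint64_t(std::max<long long>(1, k))};
}
```

For `n = 500000` and `p = 0.01` this gives `k = 7` and an `m` of about 4.8 million bits.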

## Caching

The Bloom filter changes whenever the cache's path set changes.
Clients should refetch periodically; an HTTP cache lifetime on the order of minutes to hours is typically appropriate.

## Example

A cache containing roughly 500 000 paths, with a 1% target false-positive rate, produces a filter with `k = 7` and `m ≈ 4.8 × 10^6` bits, which is roughly 600 kB (about 585 KiB) on the wire including the header.

## See Also

- [Nix Cache Info Format](@docroot@/protocols/nix-cache-info.md)
- [Store Path Specification](@docroot@/protocols/store-path.md)
- [Nix32 Encoding](@docroot@/protocols/nix32.md)
- [HTTP Binary Cache Store](@docroot@/store/types/http-binary-cache-store.md)
15 changes: 15 additions & 0 deletions doc/manual/source/protocols/nix-cache-info.md
@@ -36,12 +36,27 @@ error: binary cache 'https://example.com' is for Nix stores with prefix '/nix/st

Integer. Sets the default for [`priority`](@docroot@/store/types/http-binary-cache-store.md#store-http-binary-cache-store-priority).

### `BloomFilter`

URL of a [Bloom filter](@docroot@/protocols/binary-cache-bloom-filter.md) that summarizes the set of store paths held by this cache.
Clients may use it to skip `.narinfo` requests for paths the filter rules out.

The value is either an absolute URL or a path relative to the cache root:

```
BloomFilter: /bloom-filter
BloomFilter: https://filters.example.com/cache-abc.bloom
```

If absent, the cache does not publish a Bloom filter and clients must not assume one is available at any default location.

## Example

```
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
BloomFilter: /bloom-filter
```
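
A client might extract the field from a file like the one above along the following lines. This is a standalone sketch, not how Nix parses `nix-cache-info`: `cacheInfoField`, `isAbsoluteUrl`, and `resolveBloomFilterUrl` are hypothetical helpers, the host `cache.example.org` is made up, and joining a relative value onto the cache root with a single `/` is an assumption about how "relative to the cache root" is resolved.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Return the value of `field` in a nix-cache-info body, if present.
std::optional<std::string> cacheInfoField(const std::string & body, const std::string & field)
{
    std::istringstream in(body);
    std::string line;
    const std::string prefix = field + ": ";
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back(); // tolerate CRLF line endings
        if (line.compare(0, prefix.size(), prefix) == 0)
            return line.substr(prefix.size());
    }
    return std::nullopt;
}

// Treat anything containing a scheme separator as an absolute URL.
bool isAbsoluteUrl(const std::string & value)
{
    return value.find("://") != std::string::npos;
}

// Assumption of this sketch: relative values are joined onto the cache
// root with exactly one '/' between them.
std::string resolveBloomFilterUrl(const std::string & cacheRoot, const std::string & value)
{
    if (isAbsoluteUrl(value))
        return value;
    std::size_t rootEnd = cacheRoot.size();
    while (rootEnd > 0 && cacheRoot[rootEnd - 1] == '/')
        --rootEnd;
    std::size_t valueStart = 0;
    while (valueStart < value.size() && value[valueStart] == '/')
        ++valueStart;
    return cacheRoot.substr(0, rootEnd) + "/" + value.substr(valueStart);
}
```

When `cacheInfoField` returns no value for `BloomFilter`, the client would simply skip the filter rather than guess a location.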

## Caching Behavior
12 changes: 7 additions & 5 deletions src/libstore-tests/nar-info-disk-cache.cc
@@ -30,15 +30,16 @@ TEST(NarInfoDiskCacheImpl, create_and_read)

// Set up "background noise" and check that different caches receive different ids
{
auto bc1 = cache->createCache("https://bar", "/nix/storedir", wantMassQuery, prio);
auto bc2 = cache->createCache("https://xyz", "/nix/storedir", false, 12);
auto bc1 =
cache->createCache("https://bar", "/nix/storedir", {.wantMassQuery = wantMassQuery, .priority = prio});
auto bc2 = cache->createCache("https://xyz", "/nix/storedir", {.priority = 12});
ASSERT_NE(bc1, bc2);
barId = bc1;
}

// Check that the fields are saved and returned correctly. This does not test
// the select statement yet, because of in-memory caching.
savedId = cache->createCache("http://foo", "/nix/storedir", wantMassQuery, prio);
savedId = cache->createCache("http://foo", "/nix/storedir", {.wantMassQuery = wantMassQuery, .priority = prio});
;
{
auto r = cache->upToDateCacheExists("http://foo");
@@ -84,7 +85,7 @@ TEST(NarInfoDiskCacheImpl, create_and_read)
}

// "Update", same data, check that the id number is reused
cache2->createCache("http://foo", "/nix/storedir", wantMassQuery, prio);
cache2->createCache("http://foo", "/nix/storedir", {.wantMassQuery = wantMassQuery, .priority = prio});

{
auto r = cache2->upToDateCacheExists("http://foo");
@@ -107,7 +108,8 @@ TEST(NarInfoDiskCacheImpl, create_and_read)
auto r0 = cache2->upToDateCacheExists("https://bar");
ASSERT_FALSE(r0);

cache2->createCache("https://bar", "/nix/storedir", !wantMassQuery, prio + 10);
cache2->createCache(
"https://bar", "/nix/storedir", {.wantMassQuery = !wantMassQuery, .priority = prio + 10});
auto r = cache2->upToDateCacheExists("https://bar");
ASSERT_EQ(r->wantMassQuery, !wantMassQuery);
ASSERT_EQ(r->priority, prio + 10);
117 changes: 117 additions & 0 deletions src/libstore/binary-cache-store.cc
@@ -14,11 +14,16 @@
#include "nix/util/signals.hh"
#include "nix/util/archive.hh"
#include "nix/util/util.hh"
#include "nix/util/users.hh"
#include "nix/store/bloom-filter.hh"
#include "nix/store/pathlocks.hh"

#include <chrono>
#include <cstring>
#include <future>
#include <regex>
#include <fstream>
#include <span>
#include <sstream>
#include <variant>

@@ -68,11 +73,118 @@ void BinaryCacheStore::init()
config.wantMassQuery.setDefault(value == "1");
} else if (name == "Priority") {
config.priority.setDefault(std::stoi(value));
} else if (name == "BloomFilter") {
bloomFilterUrl = value;
}
}
}
}

BinaryCacheStore::ConditionalGetResult
BinaryCacheStore::getFileConditional(const std::string & path, const std::string & /*expectedETag*/)
{
    /* Default: no ETag support; just do an ordinary fetch. */
    auto data = getFile(path);
    return ConditionalGetResult{.data = std::move(data), .etag = "", .notModified = false};
}

bool BinaryCacheStore::fetchBloomFilter(const std::string & uri)
{
    /* Disable the Bloom filter for this cache for a short cooldown, so an
       unavailable/broken filter doesn't cause a fetch on every query. */
    auto disable = [&] {
        auto state(bloomState.lock());
        if (state->enabled) {
            int t = 60;
            debug("disabling Bloom filter for cache '%s' for %d seconds", uri, t);
            state->enabled = false;
            state->disabledUntil = std::chrono::steady_clock::now() + std::chrono::seconds(t);
        }
        return false;
    };

    auto expectedETag = diskCache->getBloomFilterETag(uri).value_or("");

    /* `*bloomFilterUrl` can be a full (absolute) URL or a path relative to
       the cache root; either way the resolution is done by `getFile()` /
       `makeRequest()`, the same as for NAR URLs in `.narinfo` files. */
    ConditionalGetResult res;
    try {
        res = getFileConditional(*bloomFilterUrl, expectedETag);
    } catch (Error & e) {
        warn("failed to fetch Bloom filter from cache '%s': %s; disabling for now", uri, e.message());
        return disable();
    }

    if (res.notModified) {
        debug("Bloom filter for '%s' unchanged (304 Not Modified)", uri);
        diskCache->touchBloomFilter(uri, res.etag.empty() ? expectedETag : res.etag);
        return true;
    }

    if (!res.data) {
        warn("Bloom filter at '%s' returned 404; disabling for now", uri);
        return disable();
    }

    const auto & body = *res.data;
    auto params = parseBloomFilterHeader(body);
    if (!params || body.size() != bloomFilterHeaderLen + params->mBits / 8) {
        warn("Bloom filter from cache '%s' is malformed; disabling for now", uri);
        return disable();
    }

    diskCache->upsertBloomFilter(uri, res.etag, {reinterpret_cast<const std::byte *>(body.data()), body.size()});
    return true;
}

bool BinaryCacheStore::isDefinitelyMissing(const StorePath & storePath)
{
    if (!diskCache || !bloomFilterUrl || !config.useBloomFilter)
        return false;

    const auto uri = config.getReference().render(/*withParams=*/false);

    /* Per-process cooldown after a failed fetch, so an unavailable filter
       doesn't cause a fetch on every query. */
    {
        auto state(bloomState.lock());
        if (!state->enabled) {
            if (std::chrono::steady_clock::now() < state->disabledUntil)
                return false;
            state->enabled = true; // cooldown elapsed; try again
        }
    }

    auto r = diskCache->probeBloomFilter(uri, storePath);

    if (!r) {
        /* No fresh filter cached. Acquire a cross-process file lock so
           concurrent first-probers don't all hit the network, then
           re-check and fetch. */
        auto lockDir = getCacheDir() / "bloom-filter-locks";
        std::filesystem::create_directories(lockDir);
        auto lockFile =
            lockDir / hashString(HashAlgorithm::SHA256, uri).to_string(HashFormat::Base16, /*includePrefix=*/false);
        PathLocks fetchLock(
            {lockFile.string()}, fmt("waiting for another Nix process to fetch Bloom filter for '%s'...", uri));

        r = diskCache->probeBloomFilter(uri, storePath);
        if (!r) {
            if (!fetchBloomFilter(uri))
                return false;
            r = diskCache->probeBloomFilter(uri, storePath);
        }
    }

    if (!r)
        return false;

    if (!*r)
        debug("Bloom filter for '%s' ruled out '%s'", uri, printStorePath(storePath));
    return !*r;
}

std::optional<std::string> BinaryCacheStore::getNixCacheInfo()
{
return getFile(cacheInfoFile);
@@ -527,6 +639,8 @@ StorePath BinaryCacheStore::addToStoreFromDump(

bool BinaryCacheStore::isValidPathUncached(const StorePath & storePath)
{
if (isDefinitelyMissing(storePath))
return false;
// FIXME: this only checks whether a .narinfo with a matching hash
// part exists. So ‘f4kb...-foo’ matches ‘f4kb...-bar’, even
// though they shouldn't. Not easily fixed.
@@ -580,6 +694,9 @@ void BinaryCacheStore::queryPathInfoUncached(
auto callbackPtr = std::make_shared<decltype(callback)>(std::move(callback));

try {
if (isDefinitelyMissing(storePath))
return (*callbackPtr)({});

auto uri = config.getReference().render(/*FIXME withParams=*/false);
auto storePathS = printStorePath(storePath);
auto act = std::make_shared<Activity>(
67 changes: 67 additions & 0 deletions src/libstore/bloom-filter.cc
@@ -0,0 +1,67 @@
#include "nix/store/bloom-filter.hh"
#include "nix/util/serialise.hh"

#include <cmath>

namespace nix {

std::optional<BloomFilterParams> parseBloomFilterHeader(std::string_view header)
{
    using namespace std::string_view_literals;
    if (header.size() < bloomFilterHeaderLen || header.substr(0, 8) != "NixBloom"sv)
        return std::nullopt;

    StringSource source(header.substr(8));
    uint64_t version;
    uint32_t k;
    uint64_t mBits;
    try {
        source >> version >> k >> mBits;
    } catch (SerialisationError &) {
        return std::nullopt;
    }

    if (version != 1 || mBits == 0 || mBits % 8 != 0)
        return std::nullopt;

    return BloomFilterParams{.k = k, .mBits = mBits};
}

std::string buildBloomFilter(const StorePathSet & paths, double falsePositiveRate)
{
    /* Rejects NaN as well, because all comparisons with NaN are false. */
    if (!(falsePositiveRate > 0 && falsePositiveRate < 1))
        throw Error("Bloom filter false positive rate must be between 0 and 1, got %f", falsePositiveRate);

    size_t n = paths.size();

    uint64_t mBits = 8;
    uint32_t k = 1;
    if (n) {
        constexpr double ln2 = 0.6931471805599453;
        double mF = -double(n) * std::log(falsePositiveRate) / (ln2 * ln2);
        /* `falsePositiveRate` very close to 1 makes `mF` round down to zero;
           keep the floor of 8 bits so we never modulo by zero later. */
        mBits = std::max<uint64_t>(8, ((uint64_t(std::ceil(mF)) + 7) / 8) * 8);
        long kL = std::lround((double(mBits) / double(n)) * ln2);
        k = uint32_t(std::max<long>(1, kL));
    }

    StringSink sink(bloomFilterHeaderLen + mBits / 8);

    using namespace std::string_view_literals;
    sink("NixBloom"sv);
    sink << 1; // version
    sink << k;
    sink << mBits;
    assert(sink.s.size() == bloomFilterHeaderLen);

    sink.s.resize(bloomFilterHeaderLen + mBits / 8);
    char * bits = sink.s.data() + bloomFilterHeaderLen;
    for (auto & path : paths)
        forEachBloomBitPosition(path, k, mBits, [&](uint64_t pos) { bits[pos / 8] |= uint8_t(1) << (pos % 8); });

    return std::move(sink.s);
}

} // namespace nix