categories: add mmap .ndb backend for custom category lists#3157
fabiodepin wants to merge 2 commits into
Interesting stuff! This patch is quite big, so I might need some time to properly review it.
While I understand that load time is lower with this change, I really would like to see some tests and numbers about runtime/lookup performance. Is that possible? |
|
Thanks for the feedback. That makes sense. I'll address the CI warnings first.
Regarding builds without pthread / --disable-global-context-support: I agree that runtime DB updates should not require pthread in that configuration. I'll adjust the implementation so that lock/unlock become no-ops there, and any runtime reload-specific path is either disabled or compiled out as needed, while keeping the basic .ndb loading path working when possible.
For performance: yes, I can add tests and numbers. I'll collect:
I'll compare the legacy path (-G / existing structures) against the .ndb backend, and I can include both a real ndpiReader run and a smaller lookup-focused benchmark if useful. |
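To illustrate the no-op lock idea for builds without pthread, here is a minimal sketch. It is not the code from this PR: the macro CATEGORY_NDB_USE_PTHREAD, the ndb_lock_* helpers, and struct ndb_state are hypothetical names chosen for illustration. The point is only that callers stay identical in both configurations.

```c
/* HYPOTHETICAL macro: stands in for whatever the build system defines
 * when pthread / global context support is available. */
#ifdef CATEGORY_NDB_USE_PTHREAD
#include <pthread.h>
typedef pthread_mutex_t ndb_lock_t;
static int ndb_lock_init(ndb_lock_t *l)    { return pthread_mutex_init(l, NULL); }
static int ndb_lock(ndb_lock_t *l)         { return pthread_mutex_lock(l); }
static int ndb_unlock(ndb_lock_t *l)       { return pthread_mutex_unlock(l); }
static int ndb_lock_destroy(ndb_lock_t *l) { return pthread_mutex_destroy(l); }
#else
/* No pthread: the lock is a placeholder and every operation succeeds. */
typedef int ndb_lock_t;
static int ndb_lock_init(ndb_lock_t *l)    { (void)l; return 0; }
static int ndb_lock(ndb_lock_t *l)         { (void)l; return 0; }
static int ndb_unlock(ndb_lock_t *l)       { (void)l; return 0; }
static int ndb_lock_destroy(ndb_lock_t *l) { (void)l; return 0; }
#endif

/* HYPOTHETICAL state guarded by the lock. */
struct ndb_state {
  ndb_lock_t lock;
  unsigned generation; /* bumped each time a new database is swapped in */
};

/* Caller code does not change between the two configurations. */
static unsigned ndb_bump_generation(struct ndb_state *s) {
  unsigned g;
  ndb_lock(&s->lock);
  g = ++s->generation;
  ndb_unlock(&s->lock);
  return g;
}
```

With the no-op variant, a single-threaded build pays nothing for the locking calls, but it also gets no protection, so any runtime reload path that relies on the lock would still have to be disabled or compiled out as described above.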
Thanks again for the review. Quick update on the requested points:
CI warnings:
Build without pthread (--disable-global-context-support): The basic .ndb loading path still works in this configuration.
Runtime / lookup performance:
It measures:
Initial results (micro dataset) already show:
I’m now extending this to larger datasets to provide more representative numbers. I’ll share detailed results shortly. |
|
Benchmark
I ran a dedicated benchmark (not included in this PR to keep it focused):
It measures:
⸻
Key results
Large dataset (~7.5M hostnames, ~100k IPv4 rules)
Load time
Memory (RSS after load)
Hostname lookup
⸻
Interpretation
⸻
Current limitation
The benchmark also highlights the current weak point:
This dominates lookup cost for large IPv4 tables and will be addressed separately.
⸻
Summary
Even at large scale:
|
|
Happy to add the benchmark to the PR if that would make review easier. |
Interesting numbers, but quite different from the ones reported in #3151 |
Yes, please. I would like to run some tests locally before reviewing this patch.
|
Good point — the numbers differ because they come from two different kinds of measurements. In #3151, the numbers are from a real ndpiReader run, which includes:
The benchmark I shared here isolates only the category load + lookup path, so it measures:
So they are complementary rather than directly equivalent:
Regarding the benchmark: yes, I’ll add it to the PR so it can be run locally. |
I’ve just pushed the benchmark code to the PR:
It’s a simple tool to compare
It supports micro/scale/stress profiles and load-only / lookup runs. A couple of notes:
If you want to try it quickly (from repo root):
Let me know if you’d like any adjustments or additional scenarios. |
|
Introduce a compiled .ndb backend (mmap) for external custom category matching, with LEGACY / NDB_ONLY / HYBRID modes, while keeping the existing -G (Aho-Corasick) path unchanged.
Add ndpi_load_category_ndb_file() / ndpi_unload_category_ndb(), CLI options (--category-ndb, --category-ndb-reload-interval), a polling-based hot-reload helper, and the offline builder ndpi_gen_categories_bin.
Implement shared hostname normalization (generator + runtime) and define the on-disk layout in ndpi_categories_bin.h (domains and IPv4/IPv6 prefix entries).
The generator writes the database atomically (temporary file + fsync + rename), allowing ndpiReader to reload a valid file without restart.
category_ndb: use no-op locks when global context support is disabled
…oad benchmarks
Add a dedicated benchmark tool, category_ndb_bench, to compare the compiled .ndb backend against the legacy custom-category path.
The benchmark measures:
- load time
- RSS / memory usage
- hostname and IPv4 lookup latency
- median and block percentiles
It supports synthetic micro/scale/stress profiles, temp or persisted .ndb generation, load-only / lookup-only modes, and explicit backend selection (ndb / legacy / both).
Also add guardrails and help text for large legacy runs:
- default legacy safety caps for hosts / IPv4 rules
- clear skip/error behavior depending on backend mode
- mixed_global caveat in output/help
- recommended commands in --help
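For readers unfamiliar with how median and percentile latencies are typically derived from raw samples, the sketch below uses the nearest-rank method. This is an assumption about the general technique, not a description of what category_ndb_bench actually does; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Ascending comparison for qsort over unsigned long long samples. */
static int cmp_u64(const void *a, const void *b) {
  unsigned long long x = *(const unsigned long long *)a;
  unsigned long long y = *(const unsigned long long *)b;
  return (x > y) - (x < y);
}

/* Nearest-rank percentile of n latency samples (e.g. nanoseconds).
 * Sorts the samples in place. pct is expected in 1..100; pct = 50
 * gives the median (lower median for even n). Returns 0 when n == 0. */
static unsigned long long latency_percentile(unsigned long long *samples,
                                             size_t n, unsigned pct) {
  size_t rank;
  if (n == 0)
    return 0;
  qsort(samples, n, sizeof(*samples), cmp_u64);
  rank = (n * pct + 99) / 100; /* ceil(n * pct / 100) in integer math */
  if (rank < 1)
    rank = 1;
  if (rank > n)
    rank = n;
  return samples[rank - 1];
}
```

Integer arithmetic keeps the helper free of libm; for very large sample counts a selection algorithm would avoid the full sort, but sorting is usually acceptable in an offline benchmark.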
Please sign (check) the below before submitting the Pull Request:
Link to the related issue: #3151
Describe changes:
Introduce a compiled .ndb backend (mmap) for external custom category matching, with LEGACY / NDB_ONLY / HYBRID modes, while keeping the existing -G / Aho-Corasick list path unchanged.
Adds ndpi_load_category_ndb_file() / ndpi_unload_category_ndb(), ndpiReader --category-ndb and --category-ndb-reload-interval, a polling-based hot-reload helper, offline builder ndpi_gen_categories_bin, shared hostname normalization (generator + runtime), and on-disk layout in ndpi_categories_bin.h (domains plus IPv4/IPv6 prefix entries).
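As a rough illustration of what an mmap-based loader involves, the sketch below maps a file read-only and checks a leading magic value. It does not reflect the real layout in ndpi_categories_bin.h: the "DEMO" magic, struct demo_map, and the demo_map_* functions are all hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEMO_MAGIC "DEMO" /* HYPOTHETICAL magic, not the real .ndb header */
#define DEMO_MAGIC_LEN 4

/* HYPOTHETICAL handle for a read-only mapping. */
struct demo_map {
  const unsigned char *base;
  size_t len;
};

/* Map the whole file read-only and validate the magic. Returns 0 on success. */
static int demo_map_open(const char *path, struct demo_map *m) {
  struct stat st;
  void *p;
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  if (fstat(fd, &st) != 0 || st.st_size < DEMO_MAGIC_LEN) {
    close(fd);
    return -1;
  }
  p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd); /* the mapping stays valid after the descriptor is closed */
  if (p == MAP_FAILED)
    return -1;
  if (memcmp(p, DEMO_MAGIC, DEMO_MAGIC_LEN) != 0) {
    munmap(p, (size_t)st.st_size);
    return -1;
  }
  m->base = (const unsigned char *)p;
  m->len = (size_t)st.st_size;
  return 0;
}

/* Release the mapping; safe to call on an already closed handle. */
static void demo_map_close(struct demo_map *m) {
  if (m->base != NULL)
    munmap((void *)m->base, m->len);
  m->base = NULL;
  m->len = 0;
}
```

Because pages are faulted in on demand, opening the mapping is cheap regardless of file size, which is consistent with the lower load time discussed in this PR. A real loader would also validate offsets and counts in the header before trusting them.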
The generator writes the database atomically (temporary file + fsync + rename) so ndpiReader can reload a new valid file without restart.
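The temporary file + fsync + rename pattern mentioned above can be sketched as follows. This is a generic POSIX illustration, not the code of ndpi_gen_categories_bin; the function name write_file_atomic and the ".tmp.<pid>" suffix are assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes to path so that readers see either the old file or the
 * complete new one, never a partial write. Returns 0 on success. */
static int write_file_atomic(const char *path, const void *data, size_t len) {
  char tmp[4096];
  const unsigned char *p = (const unsigned char *)data;
  size_t off = 0;
  int fd, n;

  /* The temp file lives next to the target so rename() stays on one
   * filesystem, which is what makes the replacement atomic. */
  n = snprintf(tmp, sizeof(tmp), "%s.tmp.%ld", path, (long)getpid());
  if (n < 0 || (size_t)n >= sizeof(tmp))
    return -1;

  fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (fd < 0)
    return -1;

  while (off < len) {
    ssize_t w = write(fd, p + off, len - off);
    if (w < 0) {
      if (errno == EINTR)
        continue; /* interrupted before writing anything: retry */
      close(fd);
      unlink(tmp);
      return -1;
    }
    off += (size_t)w;
  }

  /* Flush file contents to stable storage before publishing the name. */
  if (fsync(fd) != 0) {
    close(fd);
    unlink(tmp);
    return -1;
  }
  if (close(fd) != 0) {
    unlink(tmp);
    return -1;
  }
  if (rename(tmp, path) != 0) {
    unlink(tmp);
    return -1;
  }
  return 0;
}
```

A process that already has the old file mapped keeps a valid view of the old contents after the rename, since the old inode survives until it is unmapped. For durability of the rename itself across a crash, some implementations also fsync the containing directory; that step is omitted here.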