feat: add NIP-50 support #160

dskvr · 2025-11-12T13:35:01Z

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

Full-text search with relevance ranking (BM25 algorithm)
Configurable search backends (LMDB, Noop)
Background indexer with catch-up mechanism
Production-ready performance optimizations
benchmark suite*

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

Abstract interface allowing pluggable search backends
Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

Inverted index stored in LMDB tables
Token-based posting lists with term frequency data
Document metadata for BM25 scoring (document length, kind)
Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

Async worker thread that catches up indexing of historical events
Clean shutdown and progress persistence via SearchState.lastIndexedLevId
Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

Executes search queries within the existing query scheduler
Integrates alongside traditional index scans
Validates content by requiring presence of all parsed query tokens in event text
BM25 scoring (k1=1.2, b=0.75)

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

terms-tf-recency (default)
terms-recency-tf
tf-terms-recency
tf-recency-terms
recency-terms-tf
recency-tf-terms

Configuration Parameters

enabled: Master switch for search functionality
backend: Search provider implementation ("lmdb" or "noop")
indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
maxQueryTerms: Maximum query terms parsed
maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
maxCandidateDocs: Maximum candidates for scoring
overfetchFactor: Candidate over-fetch before post-filtering
recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
candidateRankMode: order or weighted
candidateRanking: Order used when mode=order (list above)
rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted

Usage

Enabling Search

Build strfry:
```
make -j$(nproc)
```

Update strfry.conf:

relay {
    search {
        enabled = true
        backend = "lmdb"
    }
}

Start strfry:
```
./build/strfry relay
```

Indexing behavior:

New events are indexed on write (writer path)
Background indexer catches up historical events and updates SearchState
NIP‑11 advertises 50 when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

Multi-token queries with BM25 relevance scoring
Case-insensitive matching
Results ranked by relevance
Combines with other filter criteria (kinds, authors, tags, etc.)

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

Tokenization: ~10-15 us/event (depends on content length)
Index insertion: ~50-100 us/event (LMDB commit overhead)
Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

Lower maxCandidateDocs for faster queries with slightly lower recall
Increase overfetchFactor to improve recall for multi-token queries

Benchmark Suite

Put something together for benchmarks, but didn't finish. Will likely remove it before marking ready for review

A comprehensive benchmark suite is included under `bench/`:

bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

Prepare a test database:
```
bench/scripts/prepare.sh -s scenarios/small.yml --workers 4
```
This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

Run the benchmark:

bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)

Generate reports:

bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

Throughput: events/s sent and delivered
Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
Resource usage: RSS memory, CPU utilization, disk I/O
Search-specific: index catch-up state, results cardinality
System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

Index a test database:

# Import some events
cat events.ndjson | ./build/strfry import

# Start relay with search enabled
./build/strfry relay

Issue search queries via WebSocket:

["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]

Verify results are returned in relevance order

Integration Points

DBQuery.h: Search queries execute alongside traditional index scans
ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
QueryScheduler.h: Search provider injected into query execution path
cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

Stop the relay
Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
Enable search in config
Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

Set relay.search.enabled = false in config
Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

Search is limited to content field of events (does not index tags or metadata)
No phrase matching or proximity operators (only individual tokens)
No stemming or lemmatization (exact token matching)
Large result sets may require tuning maxCandidateDocs for optimal performance
Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

Phrase search and proximity operators
Stemming and language-specific analyzers
Alternative backends (e.g., external Elasticsearch/MeiliSearch)
Search query cost accounting for rate limiting

Related Issues

Potentially Resolves Request: NIP-50 Support #40
Implements NIP-50 as specified at: https://github.com/nostr-protocol/nips/blob/master/50.md

leesalminen · 2025-11-18T12:57:27Z

I've been working on testing this with @dskvr , have some feedback:

My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some troubles with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in sandwichfarm/feature/nip-50-indexertweaks, which is the branch I've continued testing on.

I started indexing the db with this config:

    search {
        # Enable NIP-50 search capability (requires search backend)
        enabled = true

        # Search backend to use: lmdb, noop (or external in future)
        backend = "lmdb"

        # Maximum number of search terms allowed in a query
        maxQueryTerms = 6

        # Comma-separated kinds/ranges to index. Supports: single (1), ranges (1000-1999), wildcard (*), exclusions (-5000-5999)
        indexedKinds = "0,1,34236,30000-30003,30023,34550"

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

        # Recency tie-breaker percent (0–100); 1 = 1% boost for newest events
        recencyBoostPercent = 1

        # Over-fetch multiplier to compensate for post-filtering (candidates = limit × factor, bounded by maxCandidateDocs)
        overfetchFactor = 5

        # Candidate ranking order before scoring: terms-tf-recency | terms-recency-tf | tf-terms-recency | tf-recency-terms | recency-terms-tf | recency-tf-terms
        candidateRanking = "terms-tf-recency"

        # Candidate ranking mode: order | weighted
        candidateRankMode = "weighted"

        # Weighted ranking weights (only used when candidateRankMode = "weighted")
        rankWeightTerms = 100
        rankWeightTf = 50
        rankWeightRecency = 10
    }

Indexing started running great, I came back this morning and my logs are getting spammed with:

[ 8B7FE6C0]INFO| Search indexer catching up: 13070001 to 13071000 (head: 18740192)

Where the counter never increments. It just keeps sending this same log over and over.

I tried search_set_state and incrementing by 1 and restart relay, but the logging issue persists.

It's possible this is a red herring log, where because of my indexedKinds filter, it's not counting up correctly.

My search_index_stats are:

Search index LMDB statistics:
  SearchIndex:
  entries        : 6375268
  depth          : 4
  branch pages   : 1430
  leaf pages     : 115305
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 478146560 bytes (456.00 MiB)
  SearchDocMeta:
  entries        : 6331151
  depth          : 4
  branch pages   : 687
  leaf pages     : 78768
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 325447680 bytes (310.37 MiB)
SearchState:
  lastIndexedLevId : 13070000
  indexVersion     : 1

On the bright side, query performance is great. Querying ["REQ", "test", { "search": "taylor swift" } ] is nearly instant, barely noticeable performance hit.

I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets.

Just my 2 sats.

feat: add NIP-50 support

61f4375

dskvr marked this pull request as ready for review November 12, 2025 14:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add NIP-50 support #160

feat: add NIP-50 support #160

Uh oh!

dskvr commented Nov 12, 2025 •

edited

Loading

Uh oh!

leesalminen commented Nov 18, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: add NIP-50 support #160

Are you sure you want to change the base?

feat: add NIP-50 support #160

Uh oh!

Conversation

dskvr commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Architecture

Core Components

Database Schema

Configuration

Configuration Parameters

Usage

Enabling Search

Search Queries

Monitoring

Performance Characteristics

Indexing Performance

Query Performance

Benchmark Suite

Running Benchmarks

Benchmark Metrics

Testing

Manual Testing

Integration Points

Migration Notes

Existing Databases

Rollback

Known Limitations

Future Enhancements

Related Issues

Uh oh!

leesalminen commented Nov 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dskvr commented Nov 12, 2025 •

edited

Loading

leesalminen commented Nov 18, 2025 •

edited

Loading