Skip to content

Conversation

@dskvr
Copy link

@dskvr dskvr commented Nov 12, 2025

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

  • Full-text search with relevance ranking (BM25 algorithm)
  • Configurable search backends (LMDB, Noop)
  • Background indexer with catch-up mechanism
  • Production-ready performance optimizations
  • benchmark suite*

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

  • Abstract interface allowing pluggable search backends
  • Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

  • Inverted index stored in LMDB tables
  • Token-based posting lists with term frequency data
  • Document metadata for BM25 scoring (document length, kind)
  • Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

  • Async worker thread that catches up indexing of historical events
  • Clean shutdown and progress persistence via SearchState.lastIndexedLevId
  • Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

  • Executes search queries within the existing query scheduler
  • Integrates alongside traditional index scans
  • Validates content by requiring presence of all parsed query tokens in event text
  • BM25 scoring (k1=1.2, b=0.75)

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

  • terms-tf-recency (default)
  • terms-recency-tf
  • tf-terms-recency
  • tf-recency-terms
  • recency-terms-tf
  • recency-tf-terms

Configuration Parameters

  • enabled: Master switch for search functionality
  • backend: Search provider implementation ("lmdb" or "noop")
  • indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
  • maxQueryTerms: Maximum query terms parsed
  • maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
  • maxCandidateDocs: Maximum candidates for scoring
  • overfetchFactor: Candidate over-fetch before post-filtering
  • recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
  • candidateRankMode: order or weighted
  • candidateRanking: Order used when mode=order (list above)
  • rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted

Usage

Enabling Search

  1. Build strfry:

    make -j$(nproc)
  2. Update strfry.conf:

    relay {
        search {
            enabled = true
            backend = "lmdb"
        }
    }
    
  3. Start strfry:

    ./build/strfry relay

Indexing behavior:

  • New events are indexed on write (writer path)
  • Background indexer catches up historical events and updates SearchState
  • NIP‑11 advertises 50 when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

  • Multi-token queries with BM25 relevance scoring
  • Case-insensitive matching
  • Results ranked by relevance
  • Combines with other filter criteria (kinds, authors, tags, etc.)

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

  • Tokenization: ~10-15 us/event (depends on content length)
  • Index insertion: ~50-100 us/event (LMDB commit overhead)
  • Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

  • Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
  • Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
  • Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

  • Lower maxCandidateDocs for faster queries with slightly lower recall
  • Increase overfetchFactor to improve recall for multi-token queries

Benchmark Suite

Put something together for benchmarks, but didn't finish. Will likely remove it before marking ready for review A comprehensive benchmark suite is included under `bench/`:
bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

  1. Prepare a test database:

    bench/scripts/prepare.sh -s scenarios/small.yml --workers 4

    This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

  2. Run the benchmark:

    bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)
  3. Generate reports:

    bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

  • Throughput: events/s sent and delivered
  • Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
  • Resource usage: RSS memory, CPU utilization, disk I/O
  • Search-specific: index catch-up state, results cardinality
  • System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

  1. Index a test database:

    # Import some events
    cat events.ndjson | ./build/strfry import
    
    # Start relay with search enabled
    ./build/strfry relay
  2. Issue search queries via WebSocket:

    ["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]
  3. Verify results are returned in relevance order

Integration Points

  • DBQuery.h: Search queries execute alongside traditional index scans
  • ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
  • QueryScheduler.h: Search provider injected into query execution path
  • cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

  1. Stop the relay
  2. Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
  3. Enable search in config
  4. Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

  1. Set relay.search.enabled = false in config
  2. Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

  • Search is limited to content field of events (does not index tags or metadata)
  • No phrase matching or proximity operators (only individual tokens)
  • No stemming or lemmatization (exact token matching)
  • Large result sets may require tuning maxCandidateDocs for optimal performance
  • Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

  • Phrase search and proximity operators
  • Stemming and language-specific analyzers
  • Alternative backends (e.g., external Elasticsearch/MeiliSearch)
  • Search query cost accounting for rate limiting

Related Issues

@dskvr dskvr marked this pull request as ready for review November 12, 2025 14:18
@leesalminen
Copy link

leesalminen commented Nov 18, 2025

I've been working on testing this with @dskvr , have some feedback:

My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some troubles with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in sandwichfarm/feature/nip-50-indexertweaks, which is the branch I've continued testing on.

I started indexing the db with this config:

    search {
        # Enable NIP-50 search capability (requires search backend)
        enabled = true

        # Search backend to use: lmdb, noop (or external in future)
        backend = "lmdb"

        # Maximum number of search terms allowed in a query
        maxQueryTerms = 6

        # Comma-separated kinds/ranges to index. Supports: single (1), ranges (1000-1999), wildcard (*), exclusions (-5000-5999)
        indexedKinds = "0,1,34236,30000-30003,30023,34550"

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

        # Recency tie-breaker percent (0–100); 1 = 1% boost for newest events
        recencyBoostPercent = 1

        # Over-fetch multiplier to compensate for post-filtering (candidates = limit × factor, bounded by maxCandidateDocs)
        overfetchFactor = 5

        # Candidate ranking order before scoring: terms-tf-recency | terms-recency-tf | tf-terms-recency | tf-recency-terms | recency-terms-tf | recency-tf-terms
        candidateRanking = "terms-tf-recency"

        # Candidate ranking mode: order | weighted
        candidateRankMode = "weighted"

        # Weighted ranking weights (only used when candidateRankMode = "weighted")
        rankWeightTerms = 100
        rankWeightTf = 50
        rankWeightRecency = 10
    }
 

Indexing started running great, I came back this morning and my logs are getting spammed with:

[ 8B7FE6C0]INFO| Search indexer catching up: 13070001 to 13071000 (head: 18740192)

Where the counter never increments. It just keeps sending this same log over and over.

I tried search_set_state and incrementing by 1 and restart relay, but the logging issue persists.

It's possible this is a red herring log, where because of my indexedKinds filter, it's not counting up correctly.

My search_index_stats are:

Search index LMDB statistics:
  SearchIndex:
  entries        : 6375268
  depth          : 4
  branch pages   : 1430
  leaf pages     : 115305
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 478146560 bytes (456.00 MiB)
  SearchDocMeta:
  entries        : 6331151
  depth          : 4
  branch pages   : 687
  leaf pages     : 78768
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 325447680 bytes (310.37 MiB)
SearchState:
  lastIndexedLevId : 13070000
  indexVersion     : 1
  

On the bright side, query performance is great. Querying ["REQ", "test", { "search": "taylor swift" } ] is nearly instant, barely noticeable performance hit.

I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets.

Just my 2 sats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Request: NIP-50 Support

2 participants