feat/bm25-market-similarity by vaisd · Pull Request #14 · MusashiBot/musashi-api

vaisd · 2026-05-19T07:05:36Z

Summary

Replaces the manual keyword-overlap path in areMarketsSimilar() with BM25 similarity scoring, as suggested by Taylor in the previous PR review.
Changes

Added buildBM25Stats(markets) to compute corpus IDF and avgdl in one pass. It uses the BM25+ variant to keep weights non-negative on small corpora.
Added bm25Similarity(a, b, stats), which averages BM25 scores in both directions and normalizes by the mean self-score, giving an order-independent score in [0, 1].
areMarketsSimilar() now takes a BM25Stats argument; the middle branch is bm25Sim > 0.4 instead of overlap >= 4.
detectArbitrage() builds stats once over the full candidate pool and passes them into each pair check.
Removed calculateKeywordOverlap() and its local stopword set: IDF downweights common terms naturally, and the old stopword list had already drifted from keyword-generator.ts.
Extended the test suite with a rare-term coincidence case and a high-volume shared-term case, plus finance-noise padding so common terms get a more realistic df distribution.
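
For readers who have not seen the diff, the scoring described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the tokenizer, the constants K1, B, and DELTA, the exact IDF formula, the shape of BM25Stats, and the helper bm25Score are all hypothetical, and the sketch takes plain strings where the PR's functions take markets.

```typescript
// Hypothetical shape; the PR's actual BM25Stats type is not shown here.
interface BM25Stats {
  idf: Map<string, number>; // term -> inverse document frequency
  avgdl: number;            // average document length in tokens
  docCount: number;
}

const K1 = 1.2;  // assumed term-frequency saturation
const B = 0.75;  // assumed length normalization
const DELTA = 1; // assumed BM25+ lower-bound constant

// Assumed tokenizer: lowercase, split on non-alphanumerics.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// One pass over the corpus to collect document frequencies and total length.
function buildBM25Stats(docs: string[]): BM25Stats {
  const df = new Map<string, number>();
  let totalLen = 0;
  for (const doc of docs) {
    const tokens = tokenize(doc);
    totalLen += tokens.length;
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }
  const n = docs.length;
  const idf = new Map<string, number>();
  // BM25+-style IDF, ln((N + 1) / df): positive whenever df <= N.
  df.forEach((count, term) => idf.set(term, Math.log((n + 1) / count)));
  return { idf, avgdl: n > 0 ? totalLen / n : 0, docCount: n };
}

// BM25+ score of `doc` against the distinct terms of `query` (hypothetical helper).
function bm25Score(query: string[], doc: string[], stats: BM25Stats): number {
  const tf = new Map<string, number>();
  for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);
  const norm = 1 - B + B * (doc.length / (stats.avgdl || 1));
  let score = 0;
  new Set(query).forEach((term) => {
    const f = tf.get(term) ?? 0;
    if (f === 0) return; // terms absent from doc contribute nothing
    const weight = stats.idf.get(term) ?? 0; // unseen terms get weight 0
    score += weight * ((f * (K1 + 1)) / (f + K1 * norm) + DELTA);
  });
  return score;
}

// Average of both cross scores divided by the average self-score.
// Clamped to 1 because the raw ratio is not strictly bounded.
function bm25Similarity(a: string, b: string, stats: BM25Stats): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const cross = (bm25Score(ta, tb, stats) + bm25Score(tb, ta, stats)) / 2;
  const self = (bm25Score(ta, ta, stats) + bm25Score(tb, tb, stats)) / 2;
  return self > 0 ? Math.min(1, cross / self) : 0;
}

// Usage: build stats once for the pool, then reuse them for every pair.
const exampleCorpus = [
  "Will the Fed cut rates in June",
  "Fed rate cut by June meeting",
  "Will Bitcoin close above 100k",
];
const exampleStats = buildBM25Stats(exampleCorpus);
const exampleIsSimilar =
  bm25Similarity(exampleCorpus[0], exampleCorpus[1], exampleStats) > 0.4;
```

Averaging both directions before normalizing is what makes the score independent of argument order, and building the stats once per pool keeps the per-pair check cheap.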

Title similarity, entity tiebreaker, and category gate are all unchanged.

Tradeoffs

Threshold tuning: 0.4 works well against the current test suite (rejects top out at ~14%, matches start at ~67%), but the corpus is small (~27 markets with padding). It will probably need retuning as more real cases come in, and it might eventually make sense to make the threshold category-specific.
IDF rebuild on cache miss: detectArbitrage sits behind a 15s cache, so this isn't a per-request cost, but it is worth keeping an eye on if the market pool grows a lot.
Confidence values changed: the old path mapped keyword count to confidence linearly from 0.5; the new path returns the BM25 ratio directly. The 0.5 floor in normalizeMinConfidence() still holds, but downstream consumers may see different distributions.
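
When retuning the threshold, one way to see how much room it has is to compare it against the highest-scoring reject and the lowest-scoring match in a labeled set. The helper below is a hypothetical sketch, not part of this PR: ScoredPair, MarginReport, and thresholdMargin are illustrative names, and the scores in the usage example are placeholders that only loosely echo the ~14% and ~67% figures above.

```typescript
// Hypothetical labeled pair: a similarity score plus the expected outcome.
interface ScoredPair {
  score: number;        // similarity score in [0, 1]
  shouldMatch: boolean; // expected label from a test case
}

interface MarginReport {
  maxReject: number;  // highest score among pairs that should not match
  minMatch: number;   // lowest score among pairs that should match
  separates: boolean; // true if `score > threshold` classifies every pair correctly
  headroom: number;   // distance from the threshold to the nearest labeled case
}

function thresholdMargin(pairs: ScoredPair[], threshold: number): MarginReport {
  const rejects = pairs.filter((p) => !p.shouldMatch).map((p) => p.score);
  const matches = pairs.filter((p) => p.shouldMatch).map((p) => p.score);
  // Fall back to the ends of the [0, 1] range when one class is empty.
  const maxReject = rejects.length > 0 ? Math.max(...rejects) : 0;
  const minMatch = matches.length > 0 ? Math.min(...matches) : 1;
  return {
    maxReject,
    minMatch,
    // Mirrors a `score > threshold` check: rejects must sit at or below it.
    separates: maxReject <= threshold && minMatch > threshold,
    headroom: Math.min(threshold - maxReject, minMatch - threshold),
  };
}

// Usage with placeholder scores (illustrative only, not measured results).
const exampleReport = thresholdMargin(
  [
    { score: 0.14, shouldMatch: false },
    { score: 0.67, shouldMatch: true },
  ],
  0.4,
);
console.log(exampleReport.separates, exampleReport.headroom);
```

A shrinking headroom as real cases are added would be an early signal that the threshold needs revisiting, and running the same check per category would show whether category-specific thresholds are worth it.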

Test plan
Done
node --import tsx src/api/tests/arbitrage-detector.test.ts — 8/8 pass
tsc --noEmit -p tsconfig.json and tsc --noEmit -p api/tsconfig.json — clean
Still to do
Spot-check staging arbitrage results against the new threshold before merging

vercel · 2026-05-19T07:05:41Z

@vaisd is attempting to deploy a commit to the Victor's projects Team on Vercel.

A member of the Team first needs to authorize it.

vaisd added 2 commits May 15, 2026 03:20

fix: reduce false positives in movers and arbitrage detection

fc29761

feat: replace keyword overlap with BM25 similarity scoring in arbitrage detection

f9ef4d6

vaisd changed the title ~~Feat/bm25 market similarity~~ feat/bm25 market similarity May 19, 2026

vaisd changed the title ~~feat/bm25 market similarity~~ feat/bm25-market-similarity May 19, 2026

vaisd wants to merge 2 commits into MusashiBot:main from vaisd:feat/bm25-market-similarity
