AliceLJY left a comment:
Good optimization — serial embedding calls were a real bottleneck, especially for local models.
What works well:
- Batch pre-embedding for non-profile candidates with graceful fallback to individual calls
- The `filterNoiseByEmbedding` refactoring cleanly separates bypass/embed/filter phases
- The `precomputedVector` optional parameter on `processCandidate` is a clean API extension
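For readers who have not opened the diff, the batch-then-fallback pattern praised above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the `Embedder` interface and the `embedAllWithFallback` helper are hypothetical names, and the real embedder in the repository may have a different shape.

```typescript
// Hypothetical embedder shape; the real interface in the PR may differ.
interface Embedder {
  embed(text: string): Promise<number[]>;
  embedBatch?(texts: string[]): Promise<number[][]>;
}

// Try one batch call first. If the batch method is missing, throws, or
// returns the wrong number of vectors, fall back to one call per text.
async function embedAllWithFallback(
  embedder: Embedder,
  texts: string[],
): Promise<number[][]> {
  if (texts.length === 0) return [];
  if (embedder.embedBatch) {
    try {
      const vectors = await embedder.embedBatch(texts);
      if (vectors.length === texts.length) return vectors;
    } catch {
      // Batch failed: fall through to the individual calls below.
    }
  }
  const out: number[][] = [];
  for (const text of texts) {
    out.push(await embedder.embed(text));
  }
  return out;
}
```

The fallback keeps the serial behaviour as a safety net, so a backend without a working batch endpoint still produces the same vectors, just more slowly.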
Nits (non-blocking):
- The `package-lock.json` includes a version bump (beta.9 → beta.10) and new optional platform-specific `@lancedb/*` dependencies. These are separate concerns; ideally the version bump would be in its own commit or PR to keep this diff focused on the batch optimization.
- No tests added for the batch path. The existing tests may implicitly cover it if the mock embedder implements `embedBatch`, but an explicit test for the batch→fallback path would be good to have.
Approved — the core change is solid and well-commented.
Convert the three serial embed calls in smart-extractor.ts into batched embedBatch calls
Covers 7 scenarios: Step 1b dedup batch, filterNoiseByEmbedding batch, candidate pre-compute batch, batch failure fallback, noise filter bypass correctness, profile exclusion from pre-computation
Tests 2/5/6 were exercising extractAndPersist (which never calls filterNoiseByEmbedding) and asserting embedBatch was called for the wrong reason. Now they call filterNoiseByEmbedding() directly with controlled inputs, verifying batch path, failure fallback, and bypass correctness.
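A call-counting mock is one way such tests can tell the batch path from the fallback path. The sketch below is an assumption about how that might look, not the contents of the actual test file: `createMockEmbedder` and its `failBatch` option are hypothetical names.

```typescript
// Hypothetical mock; method names mirror the embed/embedBatch calls
// mentioned in this thread, but the real embedder interface may differ.
function createMockEmbedder(options: { failBatch?: boolean } = {}) {
  const calls = { embed: 0, embedBatch: 0 };
  // Deterministic fake vector so assertions stay simple: [text length, 1].
  const toVector = (text: string): number[] => [text.length, 1];
  return {
    calls,
    async embed(text: string): Promise<number[]> {
      calls.embed += 1;
      return toVector(text);
    },
    async embedBatch(texts: string[]): Promise<number[][]> {
      calls.embedBatch += 1;
      if (options.failBatch) throw new Error("simulated batch failure");
      return texts.map(toVector);
    },
  };
}
```

A test can then pass the mock to the function under test and assert on `calls.embedBatch` and `calls.embed` afterwards, for example expecting one batch call and zero individual calls on the happy path.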
Made the corresponding changes.
@AliceLJY please take a look.
Review:
EF2: Register test/smart-extractor-batch-embed.test.mjs in CI core-regression group
EF1: TypeScript verified - no new errors (pre-existing handleSupersede scopeFilter mismatch unrelated)
MR1: Filter boundary-excluded candidates before batch pre-embedding to avoid wasted embed calls
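The MR1 idea (decide which candidates need a vector before calling the batch embedder) could look something like the sketch below. The `Candidate` fields and both helper names are hypothetical and chosen only for illustration; the actual candidate type and exclusion rule live in smart-extractor.ts and may differ.

```typescript
// Hypothetical candidate shape; field names are illustrative only.
interface Candidate {
  id: string;
  category: string; // e.g. "profile"
  text: string;
  excluded: boolean; // true if a boundary rule already rejected it
}

// Pick the candidates worth sending to the batch embedder. Profile
// candidates are left out of pre-computation, as described in this PR,
// and already-excluded candidates are skipped so no embed call is wasted.
function selectForPreEmbedding(candidates: Candidate[]): Candidate[] {
  return candidates.filter((c) => c.category !== "profile" && !c.excluded);
}

// Map batch results back to candidate ids so a processing loop can look
// up a precomputed vector, or embed on demand when none is present.
function indexVectors(
  selected: Candidate[],
  vectors: number[][],
): Map<string, number[]> {
  const byId = new Map<string, number[]>();
  selected.forEach((candidate, i) => {
    const vector = vectors[i];
    if (vector !== undefined) byId.set(candidate.id, vector);
  });
  return byId;
}
```

Keying the vectors by id rather than by array position means the later loop does not have to assume the filtered list and the original list share the same ordering.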
Must Fix: all resolved
EF2: New test file not registered in CI
Nice to Have: all implemented
Additional fixes (from internal review)
Summary of changes
File: src/smart-extractor.ts, 3 changes, each replacing serial embed calls with batched embed calls.
Change locations
Step 1b batchDedup (line ~288)
filterNoiseByEmbedding (line ~357)
Step 2 candidate processing loop (line ~307)