embeddings by fadil4u · Pull Request #96 · vitalops/datatune

fadil4u · 2026-01-07T06:47:19Z

Semantic Deduplication

clusters = dt.reduce(df, action="dedup", embedding_model="text-embedding-3-small", llm=llm)

Pass the resulting clusters to Datatune primitives

mapped = dt.map(
    prompt=mapping_prompt,
    output_fields=["passenger_trend_comment"],
    input_fields=["year","month","passengers"],
    clusters=clusters,
)(llm, df)

This reduces the number of semantically similar rows sent to the LLM, which lowers token usage and therefore cost.

Dedup

The semantic deduplicator does the following:

Embeds rows partition by partition and saves the embeddings to disk.
Clusters the embeddings using a FAISS HNSW index (approximate nearest neighbor search).
Runs an additional LLM evaluation step on each cluster.
__call__ returns the clusters in the following form:

[{'canonical_id': 5, 'duplicate_ids': [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
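To make the cluster format concrete, here is a minimal sketch of how rows could be grouped into `canonical_id` / `duplicate_ids` records. It is not the PR's implementation: it swaps the FAISS HNSW index for a brute-force cosine comparison in plain Python, skips the LLM evaluation step, and always picks the first row of a group as the canonical one. The function names, the toy embeddings, and the `threshold` value are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_rows(embeddings, threshold=0.9):
    """Greedy brute-force clustering (a stand-in for an ANN index).

    The first unassigned row becomes the canonical row; every later
    unassigned row whose similarity to it is >= threshold becomes a duplicate.
    Rows without any duplicates produce no cluster record.
    """
    assigned = set()
    clusters = []
    for i, emb in enumerate(embeddings):
        if i in assigned:
            continue
        assigned.add(i)
        duplicates = [
            j
            for j in range(i + 1, len(embeddings))
            if j not in assigned and cosine(emb, embeddings[j]) >= threshold
        ]
        if duplicates:
            assigned.update(duplicates)
            clusters.append({"canonical_id": i, "duplicate_ids": duplicates})
    return clusters


# Hypothetical 2-D embeddings: rows 0, 1 and 3 point in nearly the same direction.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.98, 0.1]]
clusters = cluster_rows(toy_embeddings)
print(clusters)
```

The brute-force comparison is quadratic in the number of rows, which is the cost an approximate nearest neighbor index such as HNSW is meant to avoid on larger datasets.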

When the clusters are passed to Datatune primitives, only the canonical row is sent to the LLM, and its output is propagated to its duplicate rows.
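The propagation step can be sketched as follows. This is an illustration under stated assumptions, not Datatune's internal code: `apply_with_clusters` and `fake_llm` are hypothetical names, rows are plain dictionaries indexed by position, and the LLM is replaced by a stub that records how often it is called.

```python
def apply_with_clusters(rows, clusters, llm_call):
    """Call llm_call once per non-duplicate row and copy results to duplicates."""
    # Map every duplicate row id to the canonical row id of its cluster.
    duplicate_to_canonical = {
        dup: cluster["canonical_id"]
        for cluster in clusters
        for dup in cluster["duplicate_ids"]
    }
    outputs = {}
    # Canonical rows and rows outside any cluster are sent to the LLM.
    for idx, row in enumerate(rows):
        if idx not in duplicate_to_canonical:
            outputs[idx] = llm_call(row)
    # Duplicate rows reuse the output of their canonical row.
    for dup, canonical in duplicate_to_canonical.items():
        outputs[dup] = outputs[canonical]
    return outputs


calls = []


def fake_llm(row):
    """Stub standing in for a real LLM call; records each invocation."""
    calls.append(row)
    return f"comment for {row['passengers']} passengers"


# Hypothetical data: rows 0, 1 and 4 are duplicates of canonical row 2.
rows = [{"passengers": p} for p in (112, 112, 112, 148, 112, 160)]
clusters = [{"canonical_id": 2, "duplicate_ids": [0, 1, 4]}]

outputs = apply_with_clusters(rows, clusters, fake_llm)
print(len(calls), "LLM calls for", len(rows), "rows")
```

In this toy run the stub is called for rows 2, 3 and 5 only, while all six rows still receive an output.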

fadil4u added 8 commits January 7, 2026 12:16

added embedding for map

d5046ee

deduplicator

343b52a

deduplication for map

5c1380d

fixes

0e54865

dedup for filter

05a7188

remove computation

d57d3b6

dependecy

f00aa20

bug fix

53fde54

fadil4u marked this pull request as ready for review January 18, 2026 18:58

fadil4u requested a review from abhijithneilabraham January 18, 2026 18:58

fadil4u added 6 commits January 22, 2026 00:11

reduce

44906ef

fixes

74a6b16

clusters var

ddf2fc3

docs

5a7b796

reduce return df

ea4fd9a

docs

00a9df3

abhijithneilabraham merged commit 91ff17a into main Jan 31, 2026
2 checks passed

abhijithneilabraham merged 14 commits into main from embedding

2 participants
