embeddings#96

Merged
abhijithneilabraham merged 14 commits into main from embedding on Jan 31, 2026
@fadil4u (Collaborator) commented Jan 7, 2026

Semantic Deduplication

clusters = dt.reduce(df, action="dedup", embedding_model="text-embedding-3-small", llm=llm)

Pass the resulting clusters to Datatune primitives

mapped = dt.map(
    prompt=mapping_prompt,
    output_fields=["passenger_trend_comment"],
    input_fields=["year","month","passengers"],
    clusters=clusters,
)(llm, df)

This reduces the number of semantically similar rows sent to the LLM, which lowers token usage and therefore cost.

Dedup

The semantic deduplicator does the following:

  1. Embeds rows partition by partition and saves the embeddings to disk.
  2. Clusters the embeddings using a FAISS HNSW index (approximate nearest neighbor search).
  3. Runs an additional LLM evaluation step on each cluster.
  4. Returns the clusters from __call__, for example:
[{'canonical_id': 5, 'duplicate_ids': [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
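
The clustering step can be illustrated without FAISS. The sketch below is not this PR's implementation: it swaps the HNSW index for a brute-force cosine-similarity comparison, merges similar rows with union-find, and emits clusters in the same shape as the output above. The function names, the 0.9 threshold, and the choice of the lowest row index as the canonical row are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_rows(embeddings: Sequence[Sequence[float]],
                 threshold: float = 0.9) -> List[Dict[str, object]]:
    """Group rows whose embeddings are at least `threshold` similar.

    Brute force, O(n^2): a stand-in for an approximate nearest
    neighbor index, suitable only for small inputs.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so it becomes canonical.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Assumption: the lowest row index in each group is the canonical row.
    # Rows with no near-duplicates are left out of the result.
    return [
        {"canonical_id": members[0], "duplicate_ids": members[1:]}
        for members in groups.values()
        if len(members) > 1
    ]
```

Because union-find merges transitively, two rows can land in the same cluster without being directly similar to each other, which is one reason a per-cluster verification step such as the LLM evaluation in step 3 is useful.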

When the clusters are passed to Datatune primitives, only the canonical row is sent to the LLM, and its output is copied to its duplicate rows.
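
A minimal sketch of that propagation, assuming clusters in the `canonical_id` / `duplicate_ids` shape shown earlier. The helpers `rows_to_send` and `propagate_outputs` are hypothetical names and do not come from Datatune; how the library handles this internally may differ.

```python
from typing import Dict, Iterable, List


def rows_to_send(row_ids: Iterable[int], clusters: List[dict]) -> List[int]:
    """Return the row ids that still need an LLM call (duplicates skipped)."""
    duplicates = {d for c in clusters for d in c["duplicate_ids"]}
    return [r for r in row_ids if r not in duplicates]


def propagate_outputs(outputs: Dict[int, str],
                      clusters: List[dict]) -> Dict[int, str]:
    """Copy each canonical row's output onto its duplicate rows."""
    filled = dict(outputs)  # leave the caller's dict untouched
    for c in clusters:
        canonical = c["canonical_id"]
        if canonical in outputs:
            for d in c["duplicate_ids"]:
                filled[d] = outputs[canonical]
    return filled


# Usage: 7 rows, of which rows 1-3 are duplicates of row 5.
clusters = [{"canonical_id": 5, "duplicate_ids": [1, 2, 3]}]
to_send = rows_to_send(range(7), clusters)
llm_outputs = {r: f"comment for row {r}" for r in to_send}  # stand-in for LLM calls
all_outputs = propagate_outputs(llm_outputs, clusters)
```

In this example only 4 of the 7 rows reach the LLM. In the cluster shown in the PR description, one canonical row stands in for 18 duplicates, so 19 rows would need a single call.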

@fadil4u fadil4u marked this pull request as ready for review January 18, 2026 18:58
@abhijithneilabraham abhijithneilabraham merged commit 91ff17a into main Jan 31, 2026
2 checks passed
2 participants