Merged
Commits
69 commits
d7e4307
Added benchmarking code and results, added fast DPP path
Pringled Jan 22, 2026
f369c03
Improved benchmark code
Pringled Jan 22, 2026
0894ccd
Improved benchmark code
Pringled Jan 23, 2026
0011535
Added latency benchmark
Pringled Jan 23, 2026
c578a72
Updated benchmarks
Pringled Jan 23, 2026
b465b3c
Updated benchmarks
Pringled Jan 23, 2026
6f597a0
Added ILMD to benchmarks
Pringled Jan 23, 2026
2ca59a5
Updated benchmarks
Pringled Jan 23, 2026
d53c5de
Updated benchmarks
Pringled Jan 23, 2026
a818c61
Updated benchmarks
Pringled Jan 23, 2026
8cb553a
Added note on SSD
Pringled Jan 23, 2026
e9b145e
Updated results
Pringled Jan 23, 2026
58c5c81
Updated results
Pringled Jan 23, 2026
8fce587
Updated results
Pringled Jan 23, 2026
95f84d3
Updated results
Pringled Jan 23, 2026
9a493b4
Updated results
Pringled Jan 23, 2026
ad0bff2
Updated results
Pringled Jan 23, 2026
0ac1288
Updated results
Pringled Jan 23, 2026
a52de26
Switched to budget relevance approach
Pringled Jan 23, 2026
91f6c64
Switched to budget relevance approach
Pringled Jan 23, 2026
ed891f8
Switched to budget relevance approach
Pringled Jan 23, 2026
3370010
Expanded results
Pringled Jan 23, 2026
c21b8c7
Updated results
Pringled Jan 23, 2026
737d009
Updated results
Pringled Jan 23, 2026
fd0c161
Updated results
Pringled Jan 23, 2026
3962a3a
Updated plots
Pringled Jan 23, 2026
ec351e4
Updated plots
Pringled Jan 23, 2026
8030d90
Updated plots
Pringled Jan 23, 2026
74f416f
Updated plots
Pringled Jan 23, 2026
feadfe1
Updated plots
Pringled Jan 23, 2026
391684e
Updated results
Pringled Jan 23, 2026
9dc73dc
Made code more consistent
Pringled Jan 23, 2026
5dc3de7
Made code more consistent
Pringled Jan 23, 2026
1ecc01f
Made code more consistent
Pringled Jan 23, 2026
bc47591
Added stars to plots
Pringled Jan 24, 2026
07ad32c
Updated plot, updated code
Pringled Jan 24, 2026
1897f32
Updated results, updated tables, fixed bug
Pringled Jan 24, 2026
960d6d9
Added diversity to table
Pringled Jan 24, 2026
4b7c0d4
Added diversity per strategy tables
Pringled Jan 25, 2026
c256b87
Updated results
Pringled Jan 25, 2026
491ee27
Updated results
Pringled Jan 25, 2026
169e65f
Updated results
Pringled Jan 25, 2026
1da0ebe
Added 99% threshold diamond to plot
Pringled Jan 25, 2026
9c5b788
Updated results
Pringled Jan 25, 2026
cb67c9f
Ran 10 times for statistical significant results, updated results
Pringled Jan 25, 2026
be1a702
Switched to using ndcg as main metric
Pringled Jan 25, 2026
6ca13f8
Updated results
Pringled Jan 26, 2026
a85c044
Changed split logic, updated results
Pringled Jan 26, 2026
3e05de8
Updated results
Pringled Jan 26, 2026
fa7de7f
Updated readme
Pringled Jan 26, 2026
eba8cac
Updated readme
Pringled Jan 26, 2026
0fa4d4d
Improved writing
Pringled Jan 26, 2026
26eb733
Improved writing
Pringled Jan 26, 2026
7e6fa9d
Improved writing
Pringled Jan 26, 2026
18fcee9
Improved writing
Pringled Jan 26, 2026
cc65ee1
Improved writing
Pringled Jan 26, 2026
60a12a8
Improved writing
Pringled Jan 26, 2026
69e960c
Updated readme
Pringled Jan 26, 2026
70e72f3
Updated readme
Pringled Jan 26, 2026
3c43169
Updated readme
Pringled Jan 26, 2026
dc6a499
Updated readme
Pringled Jan 26, 2026
f6d37a6
Updated readme
Pringled Jan 27, 2026
3119329
Updated readme
Pringled Jan 27, 2026
6916f9f
Updated settings
Pringled Jan 27, 2026
d7d89c1
Updated readme
Pringled Jan 27, 2026
fe825bf
Updated readme
Pringled Jan 27, 2026
4741fdf
Updated docstrings
Pringled Jan 27, 2026
9af941b
Updated docstrings
Pringled Jan 27, 2026
2b13487
Pinned dependencies for benchmarks
Pringled Jan 27, 2026
39 changes: 21 additions & 18 deletions README.md
@@ -18,6 +18,7 @@

[Quickstart](#quickstart) •
[Supported Strategies](#supported-strategies) •
[Benchmarks](#benchmarks) •
[Motivation](#motivation) •
[Examples](#examples) •
[References](#references)
@@ -27,7 +28,7 @@
Pyversity is a fast, lightweight library for diversifying retrieval results.
Retrieval systems often return highly similar items. Pyversity efficiently re-ranks these results to encourage diversity, surfacing items that remain relevant but less redundant.

It implements several popular diversification strategies such as MMR, MSD, DPP, and Cover with a clear, unified API. More information about the supported strategies can be found in the [supported strategies section](#supported-strategies). The only dependency is NumPy, making the package very lightweight.
It implements several popular diversification strategies such as MMR, MSD, DPP, SSD, and Cover with a clear, unified API and [benchmarks](benchmarks/) for each strategy. More information about the supported strategies can be found in the [supported strategies section](#supported-strategies). The only dependency is NumPy, making the package very lightweight.


## Quickstart
@@ -52,7 +53,7 @@ diversified_result = diversify(
embeddings=embeddings,
scores=scores,
k=10, # Number of items to select
strategy=Strategy.MMR, # Diversification strategy to use
strategy=Strategy.DPP, # Diversification strategy to use
diversity=0.5 # Diversity parameter (higher values prioritize diversity)
)

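`embeddings` and `scores` are plain NumPy arrays; one common way to produce `scores` is cosine similarity between a query embedding and the item embeddings. A minimal sketch (shapes and random data below are placeholders, not part of the library):

```python
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100, 64))  # n items, d dims (placeholder data)
query = rng.normal(size=64)                   # query embedding (placeholder)

# Cosine similarity of each item to the query serves as the relevance score.
items_n = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
scores = items_n @ (query / np.linalg.norm(query))
```

These two arrays are exactly what `diversify` takes as its `embeddings` and `scores` arguments.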
@@ -66,25 +67,30 @@ The `diversity` parameter tunes the trade-off between relevance and diversity: 0

## Supported Strategies

The following table describes the supported strategies, how they work, their time complexity, and when to use them. The papers linked in the [references](#references) section provide more in-depth information on the strengths/weaknesses of the supported strategies.
The papers linked in [references](#references) provide more details on each strategy.

| Strategy | What It Does | Time Complexity | When to Use |
| ------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
| **MMR** (Maximal Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to whats already picked. | **O(k · n · d)** | Good default. Fast, simple, and works well when you just want to avoid near-duplicates. |
| **MSD** (Max Sum of Distances) | Prefers items that are both relevant and far from *all* previous selections. | **O(k · n · d)** | Use when you want stronger spread, i.e. results that cover a wider range of topics or styles. |
| **DPP** (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic repulsion. | **O(k · n · d + n · k²)** | Ideal when you want to eliminate redundancy or ensure diversity is built-in to selection. |
| **COVER** (Facility-Location) | Ensures selected items collectively represent the full datasets structure. | **O(k · n²)** | Great for topic coverage or clustering scenarios, but slower for large `n`. |
| **SSD** (Sliding Spectrum Decomposition) | Sequence‑aware diversification: rewards novelty relative to recently shown items. | **O(k · n · d)** | Great for content feeds & infinite scroll, e.g. social/news/product feeds where users consume sequentially, as well as conversational RAG to avoid showing similar chunks within the recent window. |
| **MMR** (Maximal Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what's already picked. | **O(k · n · d)** | Simple baseline. Fast, easy to implement, competitive results. Recommended `diversity`: 0.4–0.7 |
| **MSD** (Max Sum of Distances) | Prefers items that are both relevant and far from *all* previous selections. | **O(k · n · d)** | Best for maximum variety (ILAD), but may sacrifice relevance. Recommended `diversity`: 0.3–0.5 |
| **DPP** (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic "repulsion." | **O(k · n · d + n · k²)** | **Recommended default.** Best overall in benchmarks: highest accuracy and diversity. Recommended `diversity`: 0.5–0.8 |
| **COVER** (Facility-Location) | Ensures selected items collectively represent the full dataset's structure. | **O(k · n²)** | Great for topic coverage or clustering scenarios, but slower for large `n`. |
| **SSD** (Sliding Spectrum Decomposition) | Sequence‑aware diversification: rewards novelty relative to recently shown items. | **O(k · n · d)** | Great for content feeds & infinite scroll where users consume sequentially. Recommended `diversity`: 0.6–0.9 |

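For intuition, MMR's greedy loop can be sketched in a few lines of NumPy. This is a simplified illustration of the strategy and the `diversity` trade-off, not pyversity's actual implementation; the function name and signature are made up here:

```python
import numpy as np

def mmr_sketch(embeddings: np.ndarray, scores: np.ndarray, k: int, diversity: float = 0.5) -> list[int]:
    """Greedy Maximal Marginal Relevance: O(k * n * d)."""
    # Normalize so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    candidates = list(range(len(scores)))
    max_sim = np.zeros(len(scores))  # max similarity to anything already selected
    for _ in range(min(k, len(scores))):
        # Trade relevance off against similarity to the current selection.
        mmr_score = (1 - diversity) * scores - diversity * max_sim
        best = max(candidates, key=lambda i: mmr_score[i])
        selected.append(best)
        candidates.remove(best)
        max_sim = np.maximum(max_sim, emb @ emb[best])
    return selected
```

With `diversity=0` this reduces to plain relevance ranking; raising it penalizes candidates that resemble items already picked.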
## Benchmarks

All strategies are evaluated across 4 recommendation datasets. Full methodology and detailed results can be found in [`benchmarks`](benchmarks/).

DPP is the clear winner: it boosts ILAD/ILMD by 26-44% and 54-86% respectively while improving relevance by 1.8-3.1%. Use `diversity=0.5-0.8` for best overall results.
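ILAD and ILMD (intra-list average and minimum pairwise distance) can be computed as below, assuming cosine distance; the benchmarks' exact definitions live in `benchmarks/`, and this helper is only illustrative:

```python
import numpy as np

def ilad_ilmd(embeddings: np.ndarray) -> tuple[float, float]:
    """Intra-list average / minimum pairwise cosine distance of a result list."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - emb @ emb.T             # pairwise cosine distances
    iu = np.triu_indices(len(emb), k=1)  # each unordered pair counted once
    return float(dist[iu].mean()), float(dist[iu].min())
```

Higher values mean a more spread-out (ILAD) and less redundant (ILMD) result list.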

## Motivation

Traditional retrieval systems rank results purely by relevance (how closely each item matches the query). While effective, this can lead to redundancy: top results often look nearly identical, which can create a poor user experience.

Diversification techniques like MMR, MSD, COVER, and DPP help balance relevance and variety.
Diversification techniques like MMR, MSD, COVER, DPP, and SSD help balance relevance and variety.
Each new item is chosen not only because it’s relevant, but also because it adds new information that wasn’t already covered by earlier results.

This improves exploration, user satisfaction, and coverage across many domains, for example:
This improves exploration, user satisfaction, and coverage across many domains:

- E-commerce: Show different product styles, not multiple copies of the same product.
- News search: Highlight articles from different outlets or viewpoints.
@@ -94,12 +100,9 @@ This improves exploration, user satisfaction, and coverage across many domains,

## Examples

The following examples illustrate how to apply different diversification strategies in various scenarios.

<details> <summary><b>Product / Web Search</b> — Simple diversification with MMR or DPP</summary> <br>

MMR and DPP are great general-purpose diversification strategies. They are fast, easy to use, and work well in many scenarios.
For example, in a product search setting where you want to show diverse items to a user, you can diversify the top results as follows:
MMR and DPP work well as general-purpose diversifiers. In product search, use them to avoid showing near-duplicate results:

```python
from pyversity import diversify, Strategy
@@ -120,7 +123,7 @@

<details> <summary><b>Literature Search </b> — Represent the full topic space with COVER</summary> <br>

COVER (Facility-Location) is well-suited for scenarios where you want to ensure that the selected items collectively represent the entire dataset’s structure. For instance, when searching for academic papers on a broad topic, you might want to cover various subfields and methodologies:
COVER ensures the selected items collectively represent the full topic space. For academic papers, this means covering different subfields and methodologies:

```python
from pyversity import diversify, Strategy
@@ -143,7 +146,7 @@
<summary><b>Conversational RAG</b> — Avoid redundant chunks with SSD</summary>
<br>

In retrieval-augmented generation (RAG) for conversational AI, it’s crucial to avoid feeding the model redundant or similar chunks of information within the recent conversation context. The SSD (Sliding Spectrum Decomposition) strategy is designed for sequence-aware diversification, making it ideal for this use case:
In conversational RAG, you want to avoid feeding the model redundant chunks. SSD diversifies relative to recent context, making it a natural fit:

```python
import numpy as np
@@ -171,7 +174,7 @@ recent_chunk_embeddings = np.vstack([recent_chunk_embeddings, chunk_embeddings[r

<details> <summary><b>Infinite Scroll / Recommendation Feed</b> — Sequence-aware novelty with SSD</summary> <br>

In content feeds or infinite scroll scenarios, users consume items sequentially. To keep the experience engaging, it’s important to introduce novelty relative to recently shown items. The SSD strategy is well-suited for this:
In content feeds, users consume items sequentially. SSD introduces novelty relative to recently shown items, keeping the experience fresh:

```python
import numpy as np
@@ -199,7 +202,7 @@ recent_feed_embeddings = np.vstack([recent_feed_embeddings, feed_embeddings[resu

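The window-based novelty that both SSD examples rely on can be approximated with a simple max-similarity penalty. This is a rough sketch of the idea only, not SSD's actual spectrum-based formulation, and the function name is made up:

```python
import numpy as np

def window_novelty(candidates: np.ndarray, recent: np.ndarray) -> np.ndarray:
    """Novelty of each candidate = 1 - max cosine similarity to the recent window."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    r = recent / np.linalg.norm(recent, axis=1, keepdims=True)
    return 1.0 - (c @ r.T).max(axis=1)
```

A candidate identical to something in the recent window scores 0; an orthogonal one scores 1, which is why sequence-aware strategies keep feeds from repeating themselves.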
<details> <summary><b>Single Long Document</b> — Pick diverse sections with MSD</summary> <br>

When summarizing or extracting information from a single long document, it’s beneficial to select sections that are both relevant and cover different parts of the document. The MSD strategy helps achieve this by preferring items that are far apart from each other:
When extracting from a long document, you want sections that cover different parts. MSD prefers items that are far apart from each other:

```python
from pyversity import diversify, Strategy