
Commit e15511d

feat: Add HuggingFace hub integration (#58)
* Refactor README and Vicinity class to support any serializable item type
  - Updated README.md to clarify that items can be strings or other serializable objects.
  - Modified the Vicinity class to accept a broader range of item types by changing type hints from `str` to `Any` in several methods.
  - Enhanced the insert and delete methods to handle non-string tokens appropriately, ensuring that items can be checked and managed regardless of their type.
* Update README.md to include examples for saving/loading vector stores and evaluating backends
* Refactor Vicinity class to streamline token handling
  - Simplified the logic for checking and appending tokens in the insert method, ensuring that duplicate tokens are properly managed.
* Refactor item handling in tests and Vicinity class
  - Updated the `items` fixture to return a mix of dictionaries and strings based on index parity.
  - Modified `test_vicinity_insert_duplicate` to use the updated `items` fixture for inserting items.
  - Adjusted `test_vicinity_delete_and_query` to reference items by their indices instead of hardcoded values.
  - Enhanced the Vicinity class to streamline token management, ensuring proper handling of duplicates and improving error messaging for token deletions.
* Apply suggestions from code review
* Refactor token insertion in Vicinity class to simplify duplicate handling
  - Replaced the nested loop for checking duplicates with a single extend operation for tokens.
  - Improved efficiency by directly appending tokens to the items list, ensuring proper management of duplicates.
* Refactor token deletion logic in Vicinity class to improve error handling
  - Replaced the nested loop for token matching with a more efficient list comprehension.
  - Enhanced error messaging to specify which tokens were not found in the vector space.
* Enhance error handling in Vicinity class for JSON serialization
  - Added a try-except block around the JSON serialization process to catch JSONEncodeError.
* Add non-serializable items fixture and test for Vicinity class
  - Introduced a new pytest fixture `non_serializable_items` that generates a list of non-serializable objects for testing.
  - Added a test case `test_vicinity_save_and_load_non_serializable_items` to verify that saving a Vicinity instance with non-serializable items raises a JSONEncodeError.
  - Updated the Vicinity class documentation to specify that JSONEncodeError may be raised if items are not serializable.
* Add Hugging Face integration for Vicinity class
  - Introduced HuggingFaceMixin to enable saving and loading Vicinity instances to/from the Hugging Face Hub.
  - Added optional import of HuggingFaceMixin based on huggingface_hub and datasets library availability.
  - Implemented methods for pushing Vicinity instances to the Hub, including dataset and metadata upload.
  - Created a method to load Vicinity instances from Hugging Face repositories.
* Enhance Hugging Face integration with improved error handling and dataset card template
  - Added a dataset card template for Hugging Face Hub uploads.
  - Improved error handling for the Hugging Face integration with a custom import error.
  - Updated the `push_to_hub` method to include the model name/path in the configuration.
  - Removed the conditional import of Hugging Face libraries in `vicinity.py`.
  - Added a `huggingface` optional dependency in `pyproject.toml`.
* Update pyproject.toml and README.md for improved package installation and Hugging Face integration
  - Added new optional dependency groups for integrations and backends in pyproject.toml.
  - Updated README.md with new installation instructions for specific integrations and backends.
  - Added documentation for pushing and loading vector stores from the Hugging Face Hub.
  - Simplified and clarified the installation options in the README.
* Add test for Vicinity.load_from_hub method
  - Implemented a new test case for loading a Vicinity instance from the Hugging Face Hub.
  - Added a test to verify the print statement when loading from a repository.
  - Introduced a constant for the print statement in the Hugging Face integration module.
  - Updated the print statement to use string formatting for better flexibility.
* Remove test files for utils and vicinity modules
  - Deleted `tests/test_utils.py` containing tests for normalization utility functions.
  - Removed `tests/test_vicinity.py` with comprehensive test cases for the Vicinity class.
  - These test files were superseded by the refactored test suites added below.
* Add comprehensive test suites for Vicinity and utility functions
  - Implemented `test_utils.py` with tests for vector normalization functions.
  - Created `test_vicinity.py` with extensive test cases covering Vicinity class methods.
  - Added `test_huggingface.py` to test the Hugging Face integration functionality.
  - Included tests for scenarios such as initialization and vector handling, querying and thresholding, insertion and deletion of vectors, saving and loading vector stores, handling non-serializable items, and Hugging Face Hub integration.

---------

Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
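The dict-vs-string item handling described in the commit message (dict items become one dataset column per key; anything else goes into a single `items` column) can be sketched as a standalone transform. The function name here is illustrative, not part of the vicinity API:

```python
from typing import Any


def items_to_dataset_dict(items: list[Any]) -> dict[str, list[Any]]:
    """Pack a list of items into a column-oriented dict, as push_to_hub does.

    Dict items are transposed into one column per key; any other item type
    is stored under a single "items" column. (Illustrative sketch only.)
    """
    if items and isinstance(items[0], dict):
        return {k: [item[k] for item in items] for k in items[0].keys()}
    return {"items": items}


print(items_to_dataset_dict([{"text": "a", "id": 1}, {"text": "b", "id": 2}]))
# → {'text': ['a', 'b'], 'id': [1, 2]}
print(items_to_dataset_dict(["a", "b"]))
# → {'items': ['a', 'b']}
```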
1 parent ac189cc commit e15511d

6 files changed

Lines changed: 243 additions & 4 deletions

File tree

- README.md
- pyproject.toml
- tests/test_huggingface.py
- vicinity/integrations/dataset_card_template.md
- vicinity/integrations/huggingface.py
- vicinity/vicinity.py

README.md

Lines changed: 18 additions & 3 deletions
````diff
@@ -36,7 +36,6 @@
 
 </div>
 
-
 
 Vicinity is a light-weight, low-dependency vector store. It provides a simple and intuitive interface for nearest neighbor search, with support for different backends and evaluation.
 
 There are many nearest neighbors packages and methods out there. However, we found it difficult to compare them. Every package has its own interface, quirks, and limitations, and learning a new package can be time-consuming. In addition to that, how do you effectively evaluate different packages? How do you know which one is the best for your use case?
@@ -49,7 +48,7 @@ Install the package with:
 ```bash
 pip install vicinity
 ```
-Optionally, [install any of the supported backends](#installation), or simply install all of them with:
+Optionally, [install specific backends and integrations](#installation), or simply install all of them with:
 ```bash
 pip install vicinity[all]
 ```
@@ -93,6 +92,13 @@ vicinity.save('my_vector_store')
 vicinity = Vicinity.load('my_vector_store')
 ```
 
+Pushing and loading a vector store from the Hugging Face Hub:
+
+```python
+vicinity.push_to_hub(model_name_or_path='my_vector_store', repo_id='my_vector_store')
+vicinity = Vicinity.load_from_hub(repo_id='my_vector_store')
+```
+
 Evaluating a backend:
 
 ```python
@@ -173,9 +179,18 @@ The following installation options are available:
 # Install the base package
 pip install vicinity
 
-# Install all backends
+# Install all integrations and backends
 pip install vicinity[all]
 
+# Install all integrations
+pip install vicinity[integrations]
+
+# Install specific integrations
+pip install vicinity[huggingface]
+
+# Install all backends
+pip install vicinity[backends]
+
 # Install specific backends
 pip install vicinity[annoy]
 pip install vicinity[faiss]
````

pyproject.toml

Lines changed: 21 additions & 0 deletions
```diff
@@ -42,6 +42,14 @@ dev = [
     "ruff",
     "setuptools"
 ]
+
+# Integrations
+huggingface = ["datasets"]
+integrations = [
+    "datasets"
+]
+
+# Backends
 hnsw = ["hnswlib"]
 pynndescent = [
     "pynndescent>=0.5.10",
@@ -53,7 +61,20 @@ annoy = ["annoy"]
 faiss = ["faiss-cpu"]
 usearch = ["usearch"]
 voyager = ["voyager"]
+backends = [
+    "hnswlib",
+    "pynndescent>=0.5.10",
+    "numba>=0.59.0",
+    "llvmlite>=0.42.0",
+    "numpy>=1.24.0",
+    "annoy",
+    "faiss-cpu",
+    "usearch",
+    "voyager"
+]
+
 all = [
+    "datasets",
     "hnswlib",
     "pynndescent>=0.5.10",
     "numba>=0.59.0",
```
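The extras above only declare what pip may install; at runtime the library still has to check that the optional packages are actually importable before using them. A minimal sketch of such a guard, assuming an illustrative function name (the real integration raises a prebuilt `_HUB_IMPORT_ERROR` from a try/except around the imports instead):

```python
import importlib.util


def require_modules(*modules: str) -> None:
    """Raise a helpful ImportError if any optional dependency is missing.

    Sketch of the guard behind extras like `vicinity[huggingface]`;
    the function name is illustrative, not part of the vicinity API.
    """
    for module in modules:
        if importlib.util.find_spec(module) is None:
            raise ImportError(
                f"`{module}` is required. Install it with `pip install 'vicinity[huggingface]'`"
            )


# Standard-library modules are always present, so this passes silently.
require_modules("json", "pathlib")
```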
tests/test_huggingface.py

Lines changed: 33 additions & 0 deletions

```diff
@@ -0,0 +1,33 @@
+from __future__ import annotations
+
+import io
+import sys
+
+from vicinity import Vicinity
+from vicinity.datatypes import Backend
+from vicinity.integrations.huggingface import _MODEL_NAME_OR_PATH_PRINT_STATEMENT
+
+BackendType = tuple[Backend, str]
+
+
+def test_load_from_hub(vicinity_instance: Vicinity) -> None:
+    """
+    Test Vicinity.load_from_hub.
+
+    :param vicinity_instance: A Vicinity instance.
+    """
+    repo_id = "davidberenstein1957/my-vicinity-repo"
+    # Get the first part of the print statement to test if the model name or path is printed.
+    expected_print_statement = _MODEL_NAME_OR_PATH_PRINT_STATEMENT.split(":")[0]
+
+    # Capture the output.
+    captured_output = io.StringIO()
+    sys.stdout = captured_output
+
+    Vicinity.load_from_hub(repo_id=repo_id)
+
+    # Reset the redirect.
+    sys.stdout = sys.__stdout__
+
+    # Check if the expected message is in the output.
+    assert expected_print_statement in captured_output.getvalue()
```
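The manual `sys.stdout` swap in the test above can also be written with `contextlib.redirect_stdout`, which restores the stream even if the wrapped call raises. A minimal standalone sketch (the helper name is illustrative):

```python
import contextlib
import io
from typing import Callable


def capture_stdout(func: Callable[[], None]) -> str:
    """Run func and return everything it printed (sketch of the capture step)."""
    buffer = io.StringIO()
    # redirect_stdout swaps sys.stdout in, and swaps it back on exit.
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


output = capture_stdout(
    lambda: print("Embeddings in Vicinity instance were created from model name or path: my-model")
)
assert "model name or path" in output
```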
vicinity/integrations/dataset_card_template.md

Lines changed: 30 additions & 0 deletions

````diff
@@ -0,0 +1,30 @@
+---
+tags:
+- vicinity
+- vector-store
+---
+
+# Dataset Card for {repo_id}
+
+This dataset was created using the [vicinity](https://github.com/MinishLab/vicinity) library, a lightweight nearest neighbors library with flexible backends.
+
+It contains a vector space with {num_items} items.
+
+## Usage
+
+You can load this dataset using the following code:
+
+```python
+from vicinity import Vicinity
+vicinity = Vicinity.load_from_hub("{repo_id}")
+```
+
+After loading the dataset, you can use the `vicinity.query` method to find the nearest neighbors to a vector.
+
+## Configuration
+
+The configuration of the dataset is stored in the `config.json` file. The vector backend is stored in the `backend` folder.
+
+```bash
+{config}
+```
````
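The `{repo_id}`, `{num_items}`, and `{config}` placeholders in the card above are filled with `str.format` at push time. A miniature of that rendering step, using a shortened hypothetical template rather than the real file (which is read from disk with `Path.read_text()`):

```python
# Hypothetical miniature of the dataset card template; the real template
# lives next to the integration module and has more sections.
template = (
    "# Dataset Card for {repo_id}\n"
    "It contains a vector space with {num_items} items.\n"
    "Config:\n{config}\n"
)

# Fill the placeholders the same way push_to_hub does.
card = template.format(
    repo_id="user/my-vicinity-repo",
    num_items=3,
    config='{{"backend_type": "basic"}}'.format(),
)
print(card)
```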
vicinity/integrations/huggingface.py

Lines changed: 138 additions & 0 deletions

```diff
@@ -0,0 +1,138 @@
+from __future__ import annotations
+
+import json
+import logging
+import tempfile
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+
+from vicinity.backends import BasicVectorStore, get_backend_class
+from vicinity.datatypes import Backend
+
+if TYPE_CHECKING:
+    from huggingface_hub import CommitInfo
+
+    from vicinity.vicinity import Vicinity
+
+_HUB_IMPORT_ERROR = ImportError(
+    "`datasets` and `huggingface_hub` are required to push to the Hugging Face Hub. Please install them with `pip install 'vicinity[huggingface]'`"
+)
+_MODEL_NAME_OR_PATH_PRINT_STATEMENT = (
+    "Embeddings in Vicinity instance were created from model name or path: {model_name_or_path}"
+)
+
+logger = logging.getLogger(__name__)
+
+
+class HuggingFaceMixin:
+    def push_to_hub(
+        self,
+        model_name_or_path: str,
+        repo_id: str,
+        token: str | None = None,
+        private: bool = False,
+        **kwargs: Any,
+    ) -> "CommitInfo":
+        """
+        Push the Vicinity instance to the Hugging Face Hub.
+
+        :param model_name_or_path: The name of the model or the path to the local directory
+            that was used to create the embeddings in the Vicinity instance.
+        :param repo_id: The repository ID on the Hugging Face Hub.
+        :param token: Optional authentication token for private repositories.
+        :param private: Whether to create a private repository.
+        :param **kwargs: Additional arguments passed to Dataset.push_to_hub().
+        :return: The commit info.
+        """
+        try:
+            from datasets import Dataset
+            from huggingface_hub import DatasetCard, upload_file, upload_folder
+        except ImportError:
+            raise _HUB_IMPORT_ERROR
+
+        # Create and push dataset with items and vectors
+        if isinstance(self.items[0], dict):
+            dataset_dict = {k: [item[k] for item in self.items] for k in self.items[0].keys()}
+        else:
+            dataset_dict = {"items": self.items}
+        if self.vector_store is not None:
+            dataset_dict["vectors"] = self.vector_store.vectors
+        dataset = Dataset.from_dict(dataset_dict)
+        dataset.push_to_hub(repo_id, token=token, private=private, **kwargs)
+
+        # Save backend and config files to temp directory and upload
+        with tempfile.TemporaryDirectory() as temp_dir:
+            temp_path = Path(temp_dir)
+
+            # Save and upload backend
+            self.backend.save(temp_path)
+            upload_folder(
+                repo_id=repo_id,
+                folder_path=temp_path,
+                token=token,
+                repo_type="dataset",
+                path_in_repo="backend",
+            )
+
+            # Save and upload config
+            config = {
+                "metadata": self.metadata,
+                "backend_type": self.backend.backend_type.value,
+                "model_name_or_path": model_name_or_path,
+            }
+            config_path = temp_path / "config.json"
+            config_path.write_text(json.dumps(config))
+            upload_file(
+                repo_id=repo_id,
+                path_or_fileobj=config_path,
+                token=token,
+                repo_type="dataset",
+                path_in_repo="config.json",
+            )
+
+        # Load the dataset card template from the related path
+        template_path = Path(__file__).parent / "dataset_card_template.md"
+        template = template_path.read_text()
+        content = template.format(repo_id=repo_id, num_items=len(self.items), config=json.dumps(config, indent=4))
+        return DatasetCard(content=content).push_to_hub(repo_id=repo_id, token=token, repo_type="dataset")
+
+    @classmethod
+    def load_from_hub(cls, repo_id: str, token: str | None = None, **kwargs: Any) -> "Vicinity":
+        """
+        Load a Vicinity instance from the Hugging Face Hub.
+
+        :param repo_id: The repository ID on the Hugging Face Hub.
+        :param token: Optional authentication token for private repositories.
+        :param **kwargs: Additional arguments passed to load_dataset.
+        :return: A Vicinity instance loaded from the Hub.
+        """
+        try:
+            from datasets import load_dataset
+            from huggingface_hub import snapshot_download
+        except ImportError:
+            raise _HUB_IMPORT_ERROR
+
+        # Load dataset and extract items and vectors
+        dataset = load_dataset(repo_id, token=token, split="train", **kwargs)
+        if "items" in dataset.column_names:
+            items = dataset["items"]
+        else:
+            # Create items from all columns except 'vectors'
+            items = []
+            columns = [col for col in dataset.column_names if col != "vectors"]
+            for i in range(len(dataset)):
+                items.append({col: dataset[col][i] for col in columns})
+        has_vectors = "vectors" in dataset.column_names
+        vector_store = BasicVectorStore(vectors=dataset["vectors"]) if has_vectors else None
+
+        # Download and load config and backend
+        repo_path = Path(snapshot_download(repo_id=repo_id, token=token, repo_type="dataset"))
+        with open(repo_path / "config.json") as f:
+            config = json.load(f)
+        model_name_or_path = config.pop("model_name_or_path")
+
+        print(_MODEL_NAME_OR_PATH_PRINT_STATEMENT.format(model_name_or_path=model_name_or_path))
+        backend_type = Backend(config["backend_type"])
+        backend = get_backend_class(backend_type).load(repo_path / "backend")
+
+        return cls(items=items, backend=backend, metadata=config["metadata"], vector_store=vector_store)
```
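On the load side, when no `items` column exists, `load_from_hub` rebuilds per-item dicts from the remaining dataset columns. That fallback branch can be sketched as a standalone function over a plain dict of columns (the real code iterates a `datasets.Dataset`; the function name is illustrative):

```python
from typing import Any


def rows_from_columns(dataset_columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Rebuild per-item dicts from dataset columns, skipping the 'vectors' column.

    Standalone sketch of the fallback branch in load_from_hub.
    """
    columns = [col for col in dataset_columns if col != "vectors"]
    # All columns have the same length; take it from any of them.
    num_rows = len(next(iter(dataset_columns.values()), []))
    return [{col: dataset_columns[col][i] for col in columns} for i in range(num_rows)]


rows = rows_from_columns({"text": ["a", "b"], "id": [1, 2], "vectors": [[0.1], [0.2]]})
print(rows)
# → [{'text': 'a', 'id': 1}, {'text': 'b', 'id': 2}]
```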

vicinity/vicinity.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -19,8 +19,10 @@
 
 logger = logging.getLogger(__name__)
 
+from vicinity.integrations.huggingface import HuggingFaceMixin
 
-class Vicinity:
+
+class Vicinity(HuggingFaceMixin):
     """
     Work with vector representations of items.
 
```
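This one-line base-class change is how `push_to_hub` and `load_from_hub` become available on every `Vicinity` instance: the mixin only contributes methods, and the host class supplies the attributes they use (`items`, `backend`, `metadata`, `vector_store`). A toy sketch of the pattern with hypothetical classes, not the real ones:

```python
class HubMixin:
    """Toy stand-in for HuggingFaceMixin: adds Hub-style methods to a host class.

    The mixin defines no __init__; it relies on attributes (here, `name`)
    that the host class is expected to provide.
    """

    def push_to_hub(self, repo_id: str) -> str:
        return f"pushed {self.name} to {repo_id}"


class Store(HubMixin):
    """Toy stand-in for Vicinity, supplying the attribute the mixin needs."""

    def __init__(self, name: str) -> None:
        self.name = name


store = Store("my_vector_store")
print(store.push_to_hub("user/repo"))
# → pushed my_vector_store to user/repo
```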