
Commit cd0c405

feat(graphrag-vectors): add filtering, timestamps, and CRUD operations (microsoft#2236)
* feat(graphrag-vectors): add filtering, timestamps, and CRUD operations

  Implement the vector store enhancements from the graphrag-vectors-design spec.

  New modules:
  - filtering.py: Pydantic-based filter expression system with F builder, operator overloads, JSON serialization, client-side evaluate(), and per-backend compilation (SQL for LanceDB/CosmosDB, OData for Azure AI Search)
  - timestamp.py: ISO 8601 timestamp explosion into filterable component fields

  Enhanced VectorStoreDocument:
  - data: dict for user-defined metadata fields
  - create_date / update_date: automatic ISO 8601 timestamps

  Enhanced VectorStore base class:
  - fields config for typed metadata columns
  - insert / count / remove / update CRUD methods
  - select, filters, include_vectors params on search methods
  - Automatic timestamp explosion on insert/update
  - User-defined date field explosion

  Backend implementations (LanceDB, Azure AI Search, CosmosDB):
  - Full filter compilation to native query languages
  - Typed schema creation with user-defined fields
  - All new CRUD operations

  Breaking changes:
  - search_by_id raises IndexError when document not found
  - Updated indexer_adapters.py caller to handle the new exception

  Tests:
  - 54 unit tests for filtering and timestamp modules
  - 28 LanceDB integration tests covering CRUD, filters, timestamps, select, include_vectors, and user-defined date field explosion

* fix: resolve CI build failures (formatting, lint, pyright, test mocks)
  - Fix ruff formatting and lint errors across all changed files
  - Refactor filtering.py: move operator overloads from monkey-patching to direct class methods for pyright visibility
  - Use validation_alias/serialization_alias with populate_by_name for Pydantic AND/OR/NOT models (pyright + runtime compatible)
  - Use Operator enum members instead of string literals in FieldRef
  - Add missing abstract methods (insert, count, remove, update) to test mock VectorStore classes
  - Update mock method signatures to match base class (select, filters, include_vectors params)
  - Add docstrings to FieldRef magic methods (ruff D105)
  - Fix noqa:S608 placement in cosmosdb.py

* feat: add top-level vector_size to VectorStoreConfig
  Add a vector_size field (default 3072) to VectorStoreConfig so users can set it once instead of on every individual index schema. The value is propagated to new IndexSchema entries during validation.

* chore: add semversioner patch entry

* chore: add ismatch and ftype to spellcheck dictionary

* Add example notebooks for LanceDB, Azure AI Search, and CosmosDB vector stores
  - Three notebooks demonstrating: document loading, similarity search, metadata filtering with F builder, timestamp filtering, document update/removal
  - Sample data files (text_units.parquet, embeddings.text_unit_text.parquet)
  - Add CPY001, SLF001, DTZ005 to notebook lint ignores in pyproject.toml

* refactor: extract model/tokenizer creation from generate_text_embeddings into callers
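The commit message describes an `F` builder whose operator overloads produce filter expression trees that can be evaluated client-side or compiled per backend. A minimal stdlib sketch of that idea, assuming illustrative class names (the real filtering.py is Pydantic-based with JSON serialization and OData/SQL compilers):

```python
import operator


class Expr:
    """Base filter node: combine with & (AND), | (OR), ~ (NOT)."""

    def __and__(self, other):
        return BoolOp("and", self, other)

    def __or__(self, other):
        return BoolOp("or", self, other)

    def __invert__(self):
        return NotOp(self)


class Comparison(Expr):
    _OPS = {"eq": operator.eq, "ne": operator.ne, "gt": operator.gt,
            "ge": operator.ge, "lt": operator.lt, "le": operator.le}

    def __init__(self, field, op, value):
        self.field, self.op, self.value = field, op, value

    def evaluate(self, doc):
        if self.field not in doc:
            return False  # a missing field never matches
        return self._OPS[self.op](doc[self.field], self.value)


class BoolOp(Expr):
    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right

    def evaluate(self, doc):
        if self.kind == "and":
            return self.left.evaluate(doc) and self.right.evaluate(doc)
        return self.left.evaluate(doc) or self.right.evaluate(doc)


class NotOp(Expr):
    def __init__(self, inner):
        self.inner = inner

    def evaluate(self, doc):
        return not self.inner.evaluate(doc)


class FieldRef:
    """F.some_field yields a FieldRef; comparisons build Comparison nodes."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Comparison(self.name, "eq", value)

    def __ne__(self, value):
        return Comparison(self.name, "ne", value)

    def __gt__(self, value):
        return Comparison(self.name, "gt", value)

    def __ge__(self, value):
        return Comparison(self.name, "ge", value)

    def __lt__(self, value):
        return Comparison(self.name, "lt", value)

    def __le__(self, value):
        return Comparison(self.name, "le", value)


class _FBuilder:
    def __getattr__(self, name):
        return FieldRef(name)


F = _FBuilder()

expr = (F.os == "windows") & ~(F.category == "bug")
print(expr.evaluate({"os": "windows", "category": "feature"}))  # True
print(expr.evaluate({"os": "windows", "category": "bug"}))  # False
```

Defining comparisons directly on the class (rather than monkey-patching) is what the follow-up fix commit switched to, since static checkers like pyright cannot see patched-in methods.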
1 parent 97045b5 commit cd0c405

27 files changed

Lines changed: 3271 additions & 188 deletions
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+  "type": "patch",
+  "description": "Add filtering, timestamp explosion, insert/count/remove/update operations to vector store API. Add top-level vector_size config to VectorStoreConfig."
+}

dictionary.txt

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,8 @@ dtypes
 ints
 genid
 isinstance
+ismatch
+ftype
 
 # Azure
 abfs
Lines changed: 361 additions & 0 deletions
@@ -0,0 +1,361 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7fb27b941602401d91542211134fc71a",
+   "metadata": {},
+   "source": [
+    "# Azure AI Search Vector Store Example\n",
+    "\n",
+    "This notebook demonstrates the `AzureAISearchVectorStore` from `graphrag_vectors`, including:\n",
+    "- Loading documents with metadata and embeddings\n",
+    "- Similarity search with field selection\n",
+    "- Metadata filtering using the `F` filter builder (compiled to OData)\n",
+    "- Timestamp-based filtering on exploded date fields\n",
+    "- Document update and removal\n",
+    "\n",
+    "**Prerequisites**: Set `AZURE_AI_SEARCH_URL` in your `.env` file (and optionally `AZURE_AI_SEARCH_API_KEY`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "acae54e37e7d407bbb7b55eff062a284",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import time\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import pandas as pd\n",
+    "from dotenv import load_dotenv\n",
+    "from graphrag_vectors import F, VectorStoreDocument\n",
+    "from graphrag_vectors.azure_ai_search import AzureAISearchVectorStore\n",
+    "\n",
+    "load_dotenv()\n",
+    "\n",
+    "# Load sample data (text units with embeddings)\n",
+    "data_dir = Path(\"data\")\n",
+    "text_units = pd.read_parquet(data_dir / \"text_units.parquet\")\n",
+    "embeddings = pd.read_parquet(data_dir / \"embeddings.text_unit_text.parquet\")\n",
+    "text_units = text_units.merge(embeddings, on=\"id\")\n",
+    "\n",
+    "print(\n",
+    "    f\"Loaded {len(text_units)} text units with columns: {text_units.columns.tolist()}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9a63283cbaf04dbcab1f6479b197f3a8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create and connect to an Azure AI Search vector store\n",
+    "url = os.environ[\"AZURE_AI_SEARCH_URL\"]\n",
+    "api_key = os.environ.get(\"AZURE_AI_SEARCH_API_KEY\")\n",
+    "\n",
+    "store = AzureAISearchVectorStore(\n",
+    "    url=url,\n",
+    "    api_key=api_key,\n",
+    "    index_name=\"text_units\",\n",
+    "    fields={\n",
+    "        \"os\": \"str\",\n",
+    "        \"category\": \"str\",\n",
+    "        \"timestamp\": \"date\",\n",
+    "    },\n",
+    ")\n",
+    "store.connect()\n",
+    "store.create_index()\n",
+    "\n",
+    "# Load documents\n",
+    "docs = [\n",
+    "    VectorStoreDocument(\n",
+    "        id=row[\"id\"],\n",
+    "        vector=row[\"embedding\"].tolist(),\n",
+    "        data=row.to_dict(),\n",
+    "        create_date=row.get(\"timestamp\"),\n",
+    "    )\n",
+    "    for _, row in text_units.iterrows()\n",
+    "]\n",
+    "store.load_documents(docs)\n",
+    "print(f\"Loaded {len(docs)} documents into store\")\n",
+    "\n",
+    "# Allow time for Azure AI Search to propagate\n",
+    "time.sleep(5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8dd0d8092fe74a7c96281538738b07e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Test count\n",
+    "count = store.count()\n",
+    "print(f\"Document count: {count}\")\n",
+    "assert count == 42, f\"Expected 42, got {count}\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72eea5119410473aa328ad9291626812",
+   "metadata": {},
+   "source": [
+    "## Vector Similarity Search\n",
+    "\n",
+    "Use `similarity_search_by_vector` to find the closest documents to a query embedding.\n",
+    "The `select` parameter controls which metadata fields are returned in results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8edb47106e1a46a883d545849b8ab81b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use the first document's embedding as a query vector\n",
+    "query_vector = text_units.iloc[0][\"embedding\"].tolist()\n",
+    "\n",
+    "# Basic search - returns all fields\n",
+    "results = store.similarity_search_by_vector(query_vector, k=3)\n",
+    "print(f\"Found {len(results)} results:\")\n",
+    "for r in results:\n",
+    "    print(\n",
+    "        f\"  - {r.document.id}: score={r.score:.4f}, data keys={list(r.document.data.keys())}\"\n",
+    "    )\n",
+    "\n",
+    "# Search with select - only return 'os' field\n",
+    "results = store.similarity_search_by_vector(query_vector, k=1, select=[\"os\"])\n",
+    "result = results[0]\n",
+    "print(\"\\nWith select=['os']:\")\n",
+    "print(f\"  Data fields: {result.document.data}\")\n",
+    "assert \"os\" in result.document.data, \"Expected 'os' field in data\"\n",
+    "assert \"category\" not in result.document.data, \"Expected 'category' to be excluded\"\n",
+    "print(\"  Select parameter confirmed - only 'os' field returned.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10185d26023b46108eb7d9f57d49d2b3",
+   "metadata": {},
+   "source": [
+    "## Metadata Filtering\n",
+    "\n",
+    "Use the `F` filter builder to construct filter expressions with `==`, `!=`, `>`, `<`, `>=`, `<=`.\n",
+    "Combine with `&` (AND), `|` (OR), and `~` (NOT). Filters are compiled to OData expressions for Azure AI Search."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8763a12b2bbd4a93a75aff182afb95dc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Filter by a single field\n",
+    "print(\"=== Filter: os == 'windows' ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector, k=5, filters=F.os == \"windows\"\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: os={r.document.data.get('os')}, score={r.score:.4f}\")\n",
+    "\n",
+    "# Compound filter with AND\n",
+    "print(\"\\n=== Filter: os == 'windows' AND category == 'bug' ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=(F.os == \"windows\") & (F.category == \"bug\"),\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(\n",
+    "        f\"  - {r.document.id}: os={r.document.data.get('os')}, category={r.document.data.get('category')}\"\n",
+    "    )\n",
+    "\n",
+    "# OR filter\n",
+    "print(\"\\n=== Filter: category == 'bug' OR category == 'feature' ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=(F.category == \"bug\") | (F.category == \"feature\"),\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: category={r.document.data.get('category')}\")\n",
+    "\n",
+    "# NOT filter\n",
+    "print(\"\\n=== Filter: NOT os == 'linux' ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=3,\n",
+    "    filters=~(F.os == \"linux\"),\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: os={r.document.data.get('os')}\")\n",
+    "\n",
+    "# Show the compiled OData filter string for debugging\n",
+    "filter_expr = (F.os == \"windows\") & (F.category == \"bug\")\n",
+    "print(f\"\\nCompiled OData filter: {store._compile_filter(filter_expr)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7623eae2785240b9bd12b16a66d81610",
+   "metadata": {},
+   "source": [
+    "## Timestamp Filtering\n",
+    "\n",
+    "Date fields (declared as `\"date\"` in the `fields` dict) are automatically exploded into filterable components:\n",
+    "`_year`, `_month`, `_day`, `_hour`, `_day_of_week`, `_quarter`.\n",
+    "\n",
+    "The built-in `create_date` and `update_date` fields are also exploded automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7cdc8c89c7104fffa095e18ddfef8986",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime, timedelta\n",
+    "\n",
+    "# Filter by exploded field: documents created in December\n",
+    "print(\"=== Filter: create_date_month == 12 (December) ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=F.create_date_month == 12,\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(\n",
+    "        f\"  - {r.document.id}: create_date={r.document.create_date}, month={r.document.data.get('create_date_month')}\"\n",
+    "    )\n",
+    "\n",
+    "# Filter by day of week\n",
+    "print(\"\\n=== Filter: create_date_day_of_week == 'Monday' ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=F.create_date_day_of_week == \"Monday\",\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: day={r.document.data.get('create_date_day_of_week')}\")\n",
+    "\n",
+    "# Filter by quarter\n",
+    "print(\"\\n=== Filter: create_date_quarter == 4 (Q4) ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=F.create_date_quarter == 4,\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: quarter={r.document.data.get('create_date_quarter')}\")\n",
+    "\n",
+    "# Range query on the raw create_date\n",
+    "cutoff = (datetime.now() - timedelta(days=90)).isoformat()\n",
+    "print(f\"\\n=== Filter: create_date >= '{cutoff[:10]}...' (last 90 days) ===\")\n",
+    "filtered = store.similarity_search_by_vector(\n",
+    "    query_vector,\n",
+    "    k=5,\n",
+    "    filters=F.create_date >= cutoff,\n",
+    ")\n",
+    "print(f\"Found {len(filtered)} results:\")\n",
+    "for r in filtered:\n",
+    "    print(f\"  - {r.document.id}: create_date={r.document.create_date}\")\n",
+    "\n",
+    "# Show compiled OData filter strings\n",
+    "print(f\"\\nCompiled month filter: {store._compile_filter(F.create_date_month == 12)}\")\n",
+    "print(f\"Compiled range filter: {store._compile_filter(F.create_date >= cutoff)}\")\n",
+    "print(\n",
+    "    f\"Compiled compound filter: {store._compile_filter((F.create_date_quarter == 4) & (F.update_date_day_of_week == 'Monday'))}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b118ea5561624da68c537baed56e602f",
+   "metadata": {},
+   "source": [
+    "## Document Update and Removal\n",
+    "\n",
+    "Use `update()` to modify a document's metadata and `remove()` to delete documents by ID.\n",
+    "Azure AI Search operations may require a brief delay for propagation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "938c804e27f84196a10c8828c723f798",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Update a document\n",
+    "doc_id = text_units[\"id\"].iloc[0]\n",
+    "original = store.search_by_id(doc_id)\n",
+    "print(f\"Original os: {original.data.get('os')}\")\n",
+    "\n",
+    "updated_doc = VectorStoreDocument(\n",
+    "    id=doc_id,\n",
+    "    vector=None,\n",
+    "    data={\"os\": \"updated-os-value\"},\n",
+    ")\n",
+    "store.update(updated_doc)\n",
+    "\n",
+    "# Allow time for Azure AI Search to propagate\n",
+    "time.sleep(2)\n",
+    "\n",
+    "result = store.search_by_id(doc_id)\n",
+    "print(f\"Updated os: {result.data.get('os')}\")\n",
+    "assert result.data.get(\"os\") == \"updated-os-value\", \"Update failed\"\n",
+    "print(\"Update confirmed.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "504fb2a444614c0babb325280ed9130a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Remove documents\n",
+    "ids_to_delete = text_units[\"id\"].head(5).tolist()\n",
+    "print(f\"Deleting {len(ids_to_delete)} documents...\")\n",
+    "\n",
+    "store.remove(ids_to_delete)\n",
+    "\n",
+    "# Allow time for Azure AI Search to propagate\n",
+    "time.sleep(3)\n",
+    "\n",
+    "new_count = store.count()\n",
+    "print(f\"Document count after delete: {new_count}\")\n",
+    "assert new_count == 37, f\"Expected 37, got {new_count}\"\n",
+    "print(\"Remove confirmed.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
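The exploded date components queried in the notebook (`create_date_month`, `create_date_day_of_week`, `create_date_quarter`) come from the timestamp-explosion step described in the commit message. A minimal sketch of that transformation, with an assumed function name and the component set the notebook documents (`_year`, `_month`, `_day`, `_hour`, `_day_of_week`, `_quarter`):

```python
from datetime import datetime


def explode_timestamp(field_name: str, iso_value: str) -> dict:
    """Explode an ISO 8601 timestamp into filterable component fields."""
    dt = datetime.fromisoformat(iso_value)
    return {
        f"{field_name}_year": dt.year,
        f"{field_name}_month": dt.month,
        f"{field_name}_day": dt.day,
        f"{field_name}_hour": dt.hour,
        f"{field_name}_day_of_week": dt.strftime("%A"),  # e.g. "Monday"
        f"{field_name}_quarter": (dt.month - 1) // 3 + 1,  # 1..4
    }


fields = explode_timestamp("create_date", "2024-12-02T09:30:00")
print(fields["create_date_day_of_week"], fields["create_date_quarter"])  # Monday 4
```

Storing these as typed columns at insert/update time is what lets backends answer queries like `F.create_date_quarter == 4` without any date arithmetic in the filter language.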
