Commit 4b8f980

narotsit-intugle, juhel-phanju-intugle, and JaskaranIntugle authored
Features/merged semantic search (#21)
* feat: Duckdb compatible
  - Added duckdb adapter
  - Changes made in sql generator for query compatibility
  - Added details in source
* feat: updated model version compatible to scikit-learn==1.7.1, xgboost==3.0.4
* feat: added fallback for np.float128 as it's not supported by all systems
* added Knowledge Builder Module
* added DataProductBuilder, updated documentation
* DataProductBuilder typo fix
* added macos instructions
* updated readme to correct API KEY
* added dev1
* added httpfs manually for duckdb mac
* updated version to 0.1.2dev2
* added SSL certs for nltk downloads for mac users
* added warning when entering graph recursion
* added " delimiter in dp_builder
* semantic search
* knowledge builder can be resumed if failed for dataset pipeline, not for link prediction
* added key to yamls
* Made LLM_PROVIDER optional to decouple from downstream
* updated quickstart content
* added configs, added syncing
* added semantic search to knowledge_builder
* updated tests, asyncio loop handling, sort in search
* removed semantic search md
* updated quickstart
* incremented version

---------

Co-authored-by: juhel-phanju-intugle <juhel@intugle.ai>
Co-authored-by: JaskaranIntugle <jaskaran@intugle.ai>
1 parent 3188cc7 commit 4b8f980

24 files changed

Lines changed: 2507 additions & 592 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@ notes.txt
 
 testing_base
 models
+models_bak
 
 settings.json
 archived/

README.md

Lines changed: 68 additions & 3 deletions
@@ -4,7 +4,7 @@
 </p>
 
 [![Release](https://img.shields.io/github/release/Intugle/data-tools)](https://github.com/Intugle/data-tools/releases/tag/v0.1.0)
-[![Made with Python](https://img.shields.io/badge/Made_with-Python-blue?logo=python&logoColor=white)](https://www.python.org/)
+[![Made with Python](https://img.shields.io/badge/Made_with-Python-blue?logo=python&logoColor=white)](https://www.python.org/)
 ![contributions - welcome](https://img.shields.io/badge/contributions-welcome-blue)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Open Issues](https://img.shields.io/github/issues-raw/Intugle/data-tools)](https://github.com/Intugle/data-tools/issues)
@@ -85,7 +85,7 @@ For a detailed, hands-on introduction to the project, please see the [`quickstar
 * **Accessing Enriched Metadata:** Learn how to access the profiling results and business glossary for each dataset.
 * **Visualizing Relationships:** Visualize the predicted links between your tables.
 * **Generating Data Products:** Use the semantic layer to generate data products and retrieve data.
-* **Serving the Semantic Layer:** Learn how to start the MCP server to interact with your semantic layer using natural language.
+* **Searching the Knowledge Base:** Use semantic search to find relevant columns in your datasets using natural language.
 
 ## Usage
 
@@ -147,7 +147,72 @@ data_product = dp_builder.build(etl)
 print(data_product.to_df())
 ```
 
-For detailed code examples and a complete walkthrough, please refer to our quickstart notebooks.
+For detailed code examples and a complete walkthrough, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook.
+
+### Semantic Search
+
+The semantic search feature allows you to search for columns in your datasets using natural language. It is built on top of the [Qdrant](https://qdrant.tech/) vector database.
+
+#### Prerequisites
+
+To use the semantic search feature, you need to have a running Qdrant instance. You can start one using the following Docker command:
+
+```bash
+docker run -p 6333:6333 -p 6334:6334 \
+    -v qdrant_storage:/qdrant/storage:z \
+    --name qdrant qdrant/qdrant
+```
+
+You also need to configure the Qdrant URL and API key (if using authorization) in your environment variables:
+
+```bash
+export QDRANT_URL="http://localhost:6333"
+export QDRANT_API_KEY="your-qdrant-api-key" # if authorization is used
+```
+
+Currently, the semantic search feature only supports OpenAI embedding models. Therefore, you need to have an OpenAI API key set up in your environment. The default model is `text-embedding-ada-002`. You can change the embedding model by setting the `EMBEDDING_MODEL_NAME` environment variable.
+
+**For OpenAI models:**
+
+```bash
+export OPENAI_API_KEY="your-openai-api-key"
+export EMBEDDING_MODEL_NAME="openai:ada"
+```
+
+**For Azure OpenAI models:**
+
+```bash
+export AZURE_OPENAI_API_KEY="your-azure-openai-api-key"
+export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
+export OPENAI_API_VERSION="your-openai-api-version"
+export EMBEDDING_MODEL_NAME="azure_openai:ada"
+```
+
+#### Usage
+
+Once you have built the knowledge base, you can use the `search` method to perform a semantic search. The search function returns a pandas DataFrame containing the search results, including the column's profiling metrics, category, table name, and table glossary.
+
+```python
+from intugle import KnowledgeBuilder
+
+# Define your datasets
+datasets = {
+    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
+    "patients": {"path": "path/to/patients.csv", "type": "csv"},
+    "claims": {"path": "path/to/claims.csv", "type": "csv"},
+    # ... add other datasets
+}
+
+# Build the knowledge base
+kb = KnowledgeBuilder(datasets, domain="Healthcare")
+kb.build()
+# Perform a semantic search
+search_results = kb.search("patient allergies")
+
+# View the search results
+print(search_results)
+```
+For detailed code examples and a complete walkthrough, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook.
 
 ## Community
 
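The Prerequisites text added to README.md lists several environment variables that must be set before semantic search can run. A pre-flight check along the following lines might catch a missing one early. This is only a sketch and is not part of the commit: `missing_search_settings` and `REQUIRED_BY_PROVIDER` are hypothetical names. It assumes that the part of `EMBEDDING_MODEL_NAME` before the colon selects the provider (inferred solely from the `openai:ada` and `azure_openai:ada` examples) and that OpenAI is the fallback when the variable is unset.

```python
import os

# Variable names come from the README; grouping them by provider is an assumption.
REQUIRED_BY_PROVIDER = {
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "OPENAI_API_VERSION",
    ],
}


def missing_search_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    # Assumption: the text before ":" names the provider, as in "openai:ada",
    # and an unset variable falls back to the OpenAI provider.
    model_name = env.get("EMBEDDING_MODEL_NAME") or "openai:ada"
    provider = model_name.split(":", 1)[0]
    if provider not in REQUIRED_BY_PROVIDER:
        raise ValueError(f"unrecognized embedding provider: {provider!r}")
    # QDRANT_URL is always checked; QDRANT_API_KEY is optional per the README.
    required = ["QDRANT_URL"] + REQUIRED_BY_PROVIDER[provider]
    return [name for name in required if not env.get(name)]
```

Calling `missing_search_settings()` with no arguments reads `os.environ`; an empty list means every variable checked here is set. Passing a plain dictionary instead makes the check easy to exercise without touching the real environment.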