dakouri-kobri/vector-database-qdrant

Vector Database with Qdrant

Overview

This project implements a semantic search system over the arXiv papers dataset: paper embeddings are stored in Qdrant and queried by vector similarity.


Dataset

This project uses a preprocessed version of the arXiv dataset containing paper metadata and precomputed embeddings.

The dataset includes:

  • Paper ID
  • Title
  • Authors
  • Abstract
  • Categories
  • Update date
  • Precomputed embedding vectors (1536 dimensions)

The embeddings were generated using OpenAI's text-embedding-ada-002 model. Query text is embedded with the same model, so stored vectors and query embeddings are directly comparable.

⚠️ Note: The dataset is not included in this repository due to its size.
Please download it manually from Kaggle and update the ARXIV_DATA_PATH in your .env file.

Download Instructions

  1. Go to the Kaggle dataset page
  2. Download the .zip file
  3. Extract it locally
  4. Set the path in .env:
ARXIV_DATA_PATH=path/to/your/ml-arxiv-embeddings.json.json
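
As a rough illustration of how this path might be picked up at runtime, the sketch below reads ARXIV_DATA_PATH from the environment and fails early if it is unset or the file is missing. The function name `get_arxiv_data_path` is hypothetical, and the sketch assumes the values from .env have already been exported into the process environment (for example by a dotenv loader). The project's own config.py may work differently.

```python
import os
from pathlib import Path


def get_arxiv_data_path(env=None):
    """Return the dataset path from ARXIV_DATA_PATH, checking that the file exists.

    Hypothetical helper: assumes .env values are already in the environment.
    """
    env = os.environ if env is None else env
    raw = env.get("ARXIV_DATA_PATH", "").strip()
    if not raw:
        raise RuntimeError("ARXIV_DATA_PATH is not set; add it to your .env file")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {path}")
    return path
```

Failing at startup with a clear message is usually friendlier than a crash partway through a long ingestion run.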

Features

  • Batch ingestion of a large dataset (400k+ papers)
  • Vector similarity search
  • Natural language query search (OpenAI embeddings)
  • Author-aware filtered search
  • FastAPI interface for external usage
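
To make "vector similarity search" and "author-aware filtered search" concrete, here is a small pure-Python sketch of what the database computes conceptually: narrow the candidates with an author filter, then rank the rest by cosine similarity. Qdrant does this at scale with its own storage and indexes. The record layout, the `search` function, the placeholder papers, and the toy 3-dimensional vectors are all assumptions for illustration, not the project's search.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(records, query_vector, top_n=5, author=None):
    """Brute-force search: optional author filter, then rank by similarity."""
    if author is not None:
        needle = author.lower()
        records = [r for r in records if needle in r["authors"].lower()]
    scored = [
        (cosine_similarity(r["vector"], query_vector), r["title"]) for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Placeholder data: real embeddings in this project have 1536 dimensions.
papers = [
    {"title": "Paper A", "authors": "Jane Doe, John Roe", "vector": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "authors": "John Roe", "vector": [0.0, 1.0, 0.0]},
    {"title": "Paper C", "authors": "Jane Doe", "vector": [0.6, 0.8, 0.0]},
]

results = search(papers, [1.0, 0.0, 0.0], top_n=2, author="Jane Doe")
print(results)
```

Filtering before ranking means papers by other authors never compete for the top_n slots, which is the behaviour one would expect from a query such as "papers by a given author".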

Tech Stack

  • Python (uv)
  • Qdrant (vector database)
  • OpenAI Embeddings
  • FastAPI (later stages)
  • Docker

Project Structure

vector-database-qdrant/
├─ src/
│  └─ vector_db_qdrant/
│     ├─ __init__.py
│     ├─ api.py
│     ├─ api_models.py
│     ├─ cli.py
│     ├─ config.py
│     ├─ data_loader.py
│     ├─ openai_client_manager.py
│     ├─ qdrant_client_manager.py
│     └─ search.py
├─ tests/
│  └─ test_smoke.py  
├─ docs/
│  ├─ api_response.png
│  └─ swagger_ui.png
├─ .env.example
├─ .gitignore
├─ pyproject.toml
├─ README.md
└─ uv.lock

Setup

1. Install dependencies

uv sync

2. Run Qdrant

docker run ...

3. Configure environment

cp .env.example .env

4. Load data

uv run load-data
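
The loading step has to handle 400k+ records without reading the whole file into memory. A minimal sketch of streaming plus batching follows. It assumes the dataset file holds one JSON object per line and that each record carries `id` and `embedding` fields. Those field names, the batch size, and the helper names are hypothetical and may not match data_loader.py. Each record is split into a vector and a metadata payload, and each batch would then be handed to the Qdrant client for upload.

```python
import json
from itertools import islice


def stream_records(path):
    """Yield one parsed record at a time so the file is never fully in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def to_point(record):
    """Split a record into (id, vector, payload).

    Assumes hypothetical field names "id" and "embedding".
    """
    payload = {k: v for k, v in record.items() if k not in ("id", "embedding")}
    return record["id"], record["embedding"], payload


def batched(iterable, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def iter_point_batches(path, batch_size=500):
    """Stream the file and yield batches of points ready for upload."""
    return batched((to_point(r) for r in stream_records(path)), batch_size)
```

Because every stage is a generator, memory use stays proportional to one batch rather than to the full dataset.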

Usage

CLI search

uv run search "attention mechanism in deep learning"

API

Run the FastAPI server:

uv run uvicorn vector_db_qdrant.api:app --reload

Then open:

http://localhost:8000/docs

Example request

{
  "query": "Papers on clustering by Andrew Ng",
  "top_n": 5
}
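
For callers outside the Swagger UI, the same body can be sent with any HTTP client. The sketch below builds the request using only the standard library. It assumes the endpoint is POST /search on localhost:8000 and that it accepts the JSON body shown above, which is worth checking against the schema at /docs. `build_search_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_search_request(query, top_n=5, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /search endpoint."""
    body = json.dumps({"query": query, "top_n": top_n}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


request = build_search_request("Papers on clustering by Andrew Ng", top_n=5)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```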

Demo

The application exposes a /search endpoint via FastAPI.
Below is the interactive Swagger UI where users can submit natural language queries:

[Screenshot: FastAPI Swagger UI, docs/swagger_ui.png]

Example Search Response

Below is an actual response returned by the API for the query:

"Papers on clustering by Andrew Ng"

The system returns the most relevant papers, including metadata and similarity scores:

[Screenshot: Search results, docs/api_response.png]


Challenges & Solutions

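  • Qdrant crash on Windows → solved by setting indexing_threshold=0
  • Large dataset (400k+ papers) → streamed from disk and ingested in batches
  • Payload vs vector separation → paper metadata (payload) is handled separately from the embedding vectors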
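
As an illustration of the first workaround, Qdrant's REST API accepts an `optimizers_config` block when a collection is created, and setting `indexing_threshold` to 0 disables index building for that collection. The sketch below only builds such a request. The collection name `arxiv_papers`, the cosine distance, and the default localhost:6333 address are assumptions, and the project itself may apply this setting through the Python client instead.

```python
import json
import urllib.request


def build_create_collection_request(
    collection="arxiv_papers",  # hypothetical collection name
    vector_size=1536,           # matches the dataset's embedding dimension
    base_url="http://localhost:6333",
):
    """Build (but do not send) a Qdrant REST request that creates a collection."""
    body = {
        # The distance metric is an assumption; the README does not state it.
        "vectors": {"size": vector_size, "distance": "Cosine"},
        # 0 disables vector index building for this collection.
        "optimizers_config": {"indexing_threshold": 0},
    }
    return urllib.request.Request(
        url=f"{base_url}/collections/{collection}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


create_request = build_create_collection_request()
```

With indexing disabled, searches fall back to scanning vectors directly, so this trades query speed for stability during ingestion.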

Future Improvements

  • Hybrid search (BM25 + vectors)
  • Caching embeddings
  • Frontend UI

License

MIT


Author

Dakouri Kobri
Data Science, AI/ML, & Health Science Enthusiast

About

End-to-end semantic search system for arXiv papers using the Qdrant vector database, OpenAI embeddings, a FastAPI service, and a CLI, featuring batch ingestion, similarity search, and filtered queries.
