This project implements a semantic search system for the arXiv papers dataset using vector embeddings with Qdrant.
This project uses a preprocessed version of the arXiv dataset containing paper metadata and precomputed embeddings.
- Source: Kaggle
- Dataset: arXiv Papers Dataset with Embeddings
- Link: https://www.kaggle.com/datasets/awester/arxiv-embeddings/versions/85
The dataset includes:
- Paper ID
- Title
- Authors
- Abstract
- Categories
- Update date
- Precomputed embedding vectors (1536 dimensions)
The embeddings were generated using OpenAI's text-embedding-ada-002 model, so queries must be embedded with the same model for stored vectors and query vectors to be comparable.
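To illustrate what comparing a query vector against a stored vector involves, here is a minimal pure-Python sketch of cosine similarity. It assumes cosine is the metric in use, which this README does not state, and `cosine_similarity` is a hypothetical helper name; in the running system Qdrant performs the comparison.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    # Vectors from different embedding models may differ in length
    # (the dataset's vectors have 1536 dimensions), so fail loudly.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The dimension check is the practical reason the query and the stored papers need the same embedding model: vectors of different lengths, or from different embedding spaces, cannot be compared meaningfully.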
Please download it manually from Kaggle and update the ARXIV_DATA_PATH in your .env file.
- Go to the Kaggle dataset page
- Download the .zip file
- Extract it locally
- Set the path in .env:
ARXIV_DATA_PATH=path/to/your/ml-arxiv-embeddings.json.json

Features:
- Batch ingestion of large dataset (400k+ papers)
- Vector similarity search
- Natural language query search (OpenAI embeddings)
- Author-aware filtered search
- FastAPI interface for external usage

Tech stack:
- Python (uv)
- Qdrant (vector database)
- OpenAI Embeddings
- FastAPI (later stages)
- Docker
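
The batch ingestion listed among the features can be sketched as a lazy reader that groups records into fixed-size batches, so the 400k+ papers never have to sit in memory at once. This is only a sketch: it assumes the data file holds one JSON object per line, which the README does not confirm, and `iter_records` and `batched` are hypothetical names rather than functions from `data_loader.py`.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Lazily parse one JSON object per non-empty line (assumed layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(
    records: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size records until the input is exhausted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Passing an open file handle to `iter_records` and wrapping the result in `batched(..., batch_size)` would give one list per upload call, with only a single batch held in memory at a time.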

Project structure:
vector-database-qdrant/
├─ src/
│  └─ vector_db_qdrant/
│     ├─ __init__.py
│     ├─ api.py
│     ├─ api_models.py
│     ├─ cli.py
│     ├─ config.py
│     ├─ data_loader.py
│     ├─ openai_client_manager.py
│     ├─ qdrant_client_manager.py
│     └─ search.py
├─ tests/
│  └─ test_smoke.py
├─ docs/
│  ├─ api_response.png
│  └─ swagger_ui.png
├─ .env.example
├─ .gitignore
├─ pyproject.toml
├─ README.md
└─ uv.lock

uv sync
docker run ...
cp .env.example .env
uv run load-data
uv run search "attention mechanism in deep learning"

Run the FastAPI server:
uv run uvicorn vector_db_qdrant.api:app --reload

Then open the Swagger UI in your browser and submit a request body such as:
{
  "query": "Papers on clustering by Andrew Ng",
  "top_n": 5
}

The application exposes a /search endpoint via FastAPI.
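
For callers that prefer a script over the browser, the request above can be built with the standard library alone. The sketch below makes two assumptions not stated in this README: that /search accepts POST with a JSON body, and that the server listens on uvicorn's default address `http://127.0.0.1:8000`. `build_search_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: uvicorn's default host and port; adjust if the server differs.
BASE_URL = "http://127.0.0.1:8000"


def build_search_request(
    query: str, top_n: int = 5, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /search endpoint."""
    payload = json.dumps({"query": query, "top_n": top_n}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, `urllib.request.urlopen(build_search_request("Papers on clustering by Andrew Ng"))` would send the request; the response body can then be decoded with `json.load`.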
The interactive Swagger UI, where users can submit natural language queries, is shown in docs/swagger_ui.png.
An actual response returned by the API for the query "Papers on clustering by Andrew Ng" is shown in docs/api_response.png.
The system returns the most relevant papers, including metadata and similarity scores.
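
A query such as "Papers on clustering by Andrew Ng" combines a semantic part with an author constraint. One way such an author-aware search could be expressed is as a Qdrant REST search body that pairs the query vector with a payload filter. This is a sketch under assumptions: the payload field name `authors`, the helper name `build_search_body`, and the use of a text match are illustrative, and how the project extracts the author name from the query is not described here.

```python
from typing import Any, Dict, Optional, Sequence


def build_search_body(
    vector: Sequence[float], top_n: int, author: Optional[str] = None
) -> Dict[str, Any]:
    """Build a JSON-serializable body for a Qdrant point search request."""
    body: Dict[str, Any] = {
        "vector": list(vector),
        "limit": top_n,
        "with_payload": True,  # return paper metadata alongside scores
    }
    if author:
        # Assumption: author names are stored under the "authors" payload key.
        body["filter"] = {
            "must": [{"key": "authors", "match": {"text": author}}]
        }
    return body
```

Leaving `author` unset produces a plain vector search, so the same helper would cover both filtered and unfiltered queries.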

Challenges and solutions:
- Windows Qdrant crash → solved via indexing_threshold=0
- Large dataset → streaming + batching
- Payload vs vector separation

Future improvements:
- Hybrid search (BM25 + vectors)
- Caching embeddings
- Frontend UI
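
The `indexing_threshold=0` workaround mentioned above is a collection-level optimizer setting in Qdrant. As a rough sketch, a collection-creation body carrying that setting might look like the following. The vector size comes from the dataset description; the `Cosine` distance and the helper name `build_collection_config` are assumptions, and the exact call this project makes in `qdrant_client_manager.py` is not shown in this README.

```python
from typing import Any, Dict


def build_collection_config(
    vector_size: int = 1536,
    distance: str = "Cosine",  # assumption: the metric is not stated here
    indexing_threshold: int = 0,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for creating a Qdrant collection."""
    return {
        "vectors": {"size": vector_size, "distance": distance},
        # The README reports that a threshold of 0 resolved the Windows crash.
        "optimizers_config": {"indexing_threshold": indexing_threshold},
    }
```

Keeping the setting in one builder makes it easy to restore a non-zero threshold later, for example once ingestion has finished.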

License: MIT

Author: Dakouri Kobri
Data Science, AI/ML, & Health Science Enthusiast

