A powerful semantic search system using OpenAI's CLIP model and Qdrant vector database. Search through large collections of images and videos using natural language queries with semantic understanding.
- 🔍 Semantic Image Search: Search images using natural language descriptions
- 🎬 Semantic Video Search: Search videos with frame-level understanding and clustering
- 🚀 GPU Acceleration: Optimized for CUDA-enabled GPUs with batch processing
- 🎯 High Accuracy: Powered by OpenAI's CLIP ViT-Large-Patch14 model
- 📊 Vector Database: Efficient similarity search with Qdrant
- 🎨 Interactive UI: Beautiful Streamlit web interface for both images and videos
- 📁 Recursive Scanning: Automatically processes all content in nested folders
- ⚡ Batch Processing: Process thousands of items efficiently
- 🎞️ Smart Frame Clustering: Videos are sampled, clustered, and mean-pooled for optimal search
Image search pipeline:

```
┌─────────────────┐
│ Image Folder │
└────────┬────────┘
│
▼
┌─────────────────┐
│ CLIP Encoder │ ──► Image Embeddings (768D vectors)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Qdrant Vector DB│ ──► Store & Index
└────────┬────────┘
│
▼
┌─────────────────┐
│ Text Query │ ──► Text Embedding
└────────┬────────┘
│
▼
┌─────────────────┐
│ Cosine Search │ ──► Top-K Results
└─────────────────┘
```
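The flow above compresses to a few calls. Below is a minimal end-to-end sketch, assuming the Hugging Face `transformers` CLIP API and `qdrant-client`; the file paths and point IDs are illustrative, not the repo's exact code:

```python
# Sketch of the image pipeline: encode, index, then query (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Encode one image into a normalized 768-D vector
inputs = processor(images=Image.open("images/example.jpg"), return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
emb = (emb / emb.norm(dim=-1, keepdim=True)).squeeze().tolist()

# Store and index in Qdrant with cosine distance
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="image_embeddings",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
client.upsert(
    collection_name="image_embeddings",
    points=[PointStruct(id=1, vector=emb, payload={"path": "images/example.jpg"})],
)

# Encode a text query the same way and retrieve the top matches
inputs = processor(text=["a person smiling"], return_tensors="pt", padding=True)
with torch.no_grad():
    q = model.get_text_features(**inputs)
q = (q / q.norm(dim=-1, keepdim=True)).squeeze().tolist()
hits = client.search(collection_name="image_embeddings", query_vector=q, limit=5)
```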
Video search pipeline:

```
┌─────────────────┐
│ Video Folder │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Frame Sampling │ ──► Extract frames @ configurable FPS
│ (5-15 FPS) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ CLIP Encoder │ ──► Frame Embeddings (768D vectors)
└────────┬────────┘
│
▼
┌─────────────────┐
│ HDBSCAN Cluster │ ──► Group similar frames
└────────┬────────┘
│
▼
┌─────────────────┐
│ Mean Pooling │ ──► One embedding per cluster
└────────┬────────┘
│
▼
┌─────────────────┐
│ Qdrant Vector DB│ ──► Store with metadata
└────────┬────────┘
│
▼
┌─────────────────┐
│ Text Query │ ──► Search across video segments
└────────┬────────┘
│
▼
┌─────────────────┐
│ Cosine Search │ ──► Top-K Video Segments
└─────────────────┘
```
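The first stage of this pipeline is plain OpenCV. A minimal sketch of rate-controlled frame sampling, assuming only `cv2`; `sample_frames` is an illustrative helper, not the repo's `video_embeddings/ingest.py`:

```python
# Sketch: sample frames from a video at a target rate using OpenCV.
import cv2

def sample_frames(video_path: str, sample_fps: float = 10.0):
    """Yield (frame_index, RGB frame) at roughly `sample_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))  # keep every Nth frame
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV reads BGR; convert to RGB before CLIP preprocessing
            yield index, cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        index += 1
    cap.release()
```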
Prerequisites:
- Python 3.12
- CUDA 12.1+ (for GPU support)
- Docker (for Qdrant)
- 4GB+ GPU memory recommended
- OpenCV for video processing
Docker Compose is the easiest way to run the application with all dependencies.
- Build and start services:
```
docker-compose up -d --build
```
- Access the application:
- Streamlit UI: http://localhost:8501
- Qdrant API: http://localhost:6333
- Qdrant Dashboard: http://localhost:6333/dashboard
- Index your images:
```
# Place images in ./images folder first
docker-compose exec app uv run python3 image_to_embedding.py
```
- Stop services:
```
docker-compose down
```

Docker Architecture:

```
┌─────────────────────────────────────────┐
│ Docker Network: image-search-network │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Qdrant │◄───┤ Streamlit │ │
│ │ :6333 │ │ App │ │
│ │ │ │ :8501 │ │
│ └──────┬───────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ qdrant_data/ │
│ (persistent storage) │
└─────────────────────────────────────────┘
```
Data Persistence:
- Vector DB data: `./qdrant_data/` (automatically created and persisted)
- Images: `./images/` (mounted read-only)
- Search results: `./search_results/`
- Clone the repository:
```
git clone <repository-url>
cd photo-doc-data-embeddings
```
- Install dependencies with uv:
```
uv sync
```
Or install manually:
```
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers pillow qdrant-client numpy streamlit
```
- Start Qdrant vector database:
```
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```
Quick start with Docker Compose:
```
docker-compose up -d

# Place your images in ./images folder
docker-compose exec app uv run python3 image_to_embedding.py
```
Open http://localhost:8501 in your browser, or search from the CLI:
```
docker-compose exec app uv run python3 search_images.py "your query"
```
View logs:
```
docker-compose logs -f app
docker-compose logs -f qdrant
```
Stop services:
```
docker-compose down
```
First, process your images and create embeddings:
```
python3 image_to_embedding.py
```
Configuration (edit in image_to_embedding.py):
```python
FOLDER_PATH = "./images"   # Your images folder
COLLECTION_NAME = "image_embeddings"
QDRANT_HOST = "localhost"
QDRANT_PORT = 6333
```
The script will:
- Recursively scan all images in the folder (see the sketch after the format list below)
- Generate embeddings using CLIP
- Store vectors in Qdrant with metadata
- Process in batches for efficiency
Supported image formats: .jpg, .jpeg, .png, .bmp, .gif, .tiff, .tif, .webp, .svg, .ico
Supported video formats: .mp4, .avi, .mov, .mkv
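The recursive scan referenced above amounts to walking the folder and filtering by extension. A minimal sketch using only the standard library; `find_images` is a hypothetical helper, not the repo's exact code:

```python
from pathlib import Path

# Extensions accepted by the indexer (from the list above)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif",
                    ".tiff", ".tif", ".webp", ".svg", ".ico"}

def find_images(folder: str) -> list[Path]:
    """Recursively collect supported image files under `folder`."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```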
Launch the Streamlit interface:
```
streamlit run app.py
```
Features:
- Enter natural language queries
- Adjust number of results (1-200)
- View images in a responsive grid
- See similarity scores and metadata
- Configure Qdrant connection
```
python3 search_images.py "aadhaar card"
python3 search_images.py "passport photo"
python3 search_images.py "person smiling with glasses"
```
Results are copied to the ./search_results/ folder with ranking and scores.
```
python3 search_videos.py "person walking"
python3 search_videos.py "car driving on highway"
python3 search_videos.py "people talking indoors"
```
Results show matching video segments with frame information.
1. Frame Sampling: Videos are sampled at a configurable rate (5-15 FPS)
   - Controlled by the `VIDEO_SAMPLE_RATE` environment variable
   - Extracts representative frames from the entire video
2. Embedding Generation: Each frame is processed through CLIP
   - Generates 768-dimensional embeddings
   - Batch processing for efficiency
3. Clustering: Similar frames are grouped using HDBSCAN
   - Identifies semantic scenes/segments in the video
   - Filters out noise and transitional frames
   - Configurable cluster parameters
4. Mean Pooling: Each cluster is represented by a single embedding (see the sketch after this list)
   - Averages all frame embeddings in a cluster
   - Normalized for cosine similarity search
   - Preserves semantic information
5. Indexing: Pooled embeddings are stored in Qdrant with metadata
   - Video path, cluster info, frame indices
   - Enables precise segment retrieval
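A minimal sketch of the clustering and mean-pooling steps (3-4 above), assuming the `hdbscan` package and NumPy; this is illustrative, not the repo's `cluster.py`/`mean_pool.py`:

```python
# Sketch: group frame embeddings with HDBSCAN, then mean-pool each cluster
# into one normalized vector suitable for cosine search.
import numpy as np
import hdbscan

def pool_clusters(frame_embeddings: np.ndarray,
                  min_cluster_size: int = 5,
                  min_samples: int = 3) -> dict[int, np.ndarray]:
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                min_samples=min_samples)
    labels = clusterer.fit_predict(frame_embeddings)
    pooled = {}
    for label in set(labels):
        if label == -1:  # HDBSCAN marks noise/transitional frames as -1; drop them
            continue
        mean = frame_embeddings[labels == label].mean(axis=0)
        pooled[label] = mean / np.linalg.norm(mean)  # renormalize for cosine similarity
    return pooled
```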
Environment variables for video processing:
```
# Frame sampling rate (frames per second)
VIDEO_SAMPLE_RATE=10      # Default: 10 FPS

# Clustering parameters
MIN_CLUSTER_SIZE=5        # Minimum frames to form a cluster
MIN_SAMPLES=3             # HDBSCAN min_samples parameter

# Database settings
VIDEO_COLLECTION_NAME=video_embeddings  # Qdrant collection name
VECTOR_DIMENSIONS=768     # CLIP embedding size
```
To index videos:
```
# Set video folder path
export VIDEO_FOLDER_PATH="./videos"

# Run video indexing
python3 video_to_embedding.py
```
Or use the Streamlit UI to index videos interactively.
ImageEmbeddingProcessor: the main class for image processing and search.
```python
processor = ImageEmbeddingProcessor(
    model_name="openai/clip-vit-large-patch14",
    batch_size=64  # Adjust based on GPU memory
)
```
image_to_embedding(image_path: str) -> np.ndarray
- Converts a single image to embedding vector
- Returns: 768-dimensional normalized numpy array
text_to_embedding(text: str) -> np.ndarray
- Converts text to embedding vector
- Returns: 768-dimensional normalized numpy array
batch_image_to_embeddings(image_paths: List[str]) -> np.ndarray
- Process multiple images in batch
- More efficient than individual processing
- Returns: Array of embedding vectors
process_folder_to_qdrant(folder_path, collection_name, qdrant_host, qdrant_port)
- Index all images in folder to Qdrant
- Creates/recreates collection
- Processes in batches with progress tracking
search_by_text(query_text, collection_name, qdrant_host, qdrant_port, limit)
- Search for similar images using text query
- Returns: List of results with scores and metadata
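Putting these methods together, a typical usage sketch based on the signatures above (assuming the class is importable from `image_to_embedding`):

```python
from image_to_embedding import ImageEmbeddingProcessor

processor = ImageEmbeddingProcessor(
    model_name="openai/clip-vit-large-patch14",
    batch_size=64,
)

# Index a folder, then search it with a natural language query
processor.process_folder_to_qdrant(
    folder_path="./images",
    collection_name="image_embeddings",
    qdrant_host="localhost",
    qdrant_port=6333,
)
results = processor.search_by_text(
    query_text="person smiling with glasses",
    collection_name="image_embeddings",
    qdrant_host="localhost",
    qdrant_port=6333,
    limit=10,
)
for result in results:
    print(result)  # each result carries a similarity score and metadata
```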
VideoEmbeddingProcessor: the main class for video processing and search.
```python
processor = VideoEmbeddingProcessor(
    model_name="openai/clip-vit-large-patch14"
)
```
process_videos_to_qdrant(folder_path, collection_name, qdrant_host, qdrant_port)
- Index all videos in folder to Qdrant
- Samples frames, generates embeddings, clusters, and stores
- Automatic scene detection and segmentation
search_videos_by_text(query_text, collection_name, qdrant_host, qdrant_port, limit)
- Search for video segments using text query
- Returns: List of matching segments with metadata
- Video path and name
- Cluster ID and frame indices
- Similarity score
- Frame count information
get_collection_stats(collection_name, qdrant_host, qdrant_port)
- Get statistics about indexed videos
- Returns: Total embeddings, dimensions, distance metric
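A corresponding usage sketch for the video class, based on the signatures above (assuming it is importable from `video_to_embedding`):

```python
from video_to_embedding import VideoEmbeddingProcessor

processor = VideoEmbeddingProcessor(
    model_name="openai/clip-vit-large-patch14",
)

# Index a folder of videos, search for segments, then inspect the collection
processor.process_videos_to_qdrant(
    folder_path="./videos",
    collection_name="video_embeddings",
    qdrant_host="localhost",
    qdrant_port=6333,
)
segments = processor.search_videos_by_text(
    query_text="person walking outdoors",
    collection_name="video_embeddings",
    qdrant_host="localhost",
    qdrant_port=6333,
    limit=5,
)
stats = processor.get_collection_stats(
    collection_name="video_embeddings",
    qdrant_host="localhost",
    qdrant_port=6333,
)
```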
The script automatically detects and uses the GPU if one is available (see the sketch after the batch-size list below). To force CPU:
```python
self.device = "cpu"  # In ImageEmbeddingProcessor.__init__
```
Adjust the batch size based on your GPU memory:
- 4GB GPU: `batch_size=32`
- 6GB GPU: `batch_size=64`
- 8GB+ GPU: `batch_size=128`
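The detection itself is the standard PyTorch idiom, roughly:

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
```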
Edit connection settings:
```python
QDRANT_HOST = "localhost"  # Or remote host
QDRANT_PORT = 6333
COLLECTION_NAME = "image_embeddings"
```
Performance tips:
- GPU Memory: Reduce batch size if you get OOM errors
- Indexing Speed: Use GPU for 10x faster processing
- Search Speed: Qdrant is optimized for sub-millisecond searches
- Storage: ~3KB per image for embeddings
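The ~3KB figure follows directly from the vector size: 768 float32 dimensions × 4 bytes = 3,072 bytes ≈ 3 KB per embedding, before payload metadata.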
On RTX 3050 (4GB):
- Indexing: ~10-15 images/second
- Search: <100ms for 50K images
- Embedding dimension: 768
```
photo-doc-data-embeddings/
├── Dockerfile               # App container definition
├── docker-compose.yml       # Multi-container orchestration
├── .dockerignore            # Docker build exclusions
├── image_to_embedding.py    # Image processing & indexing
├── video_to_embedding.py    # Video processing & indexing
├── search_images.py         # CLI image search tool
├── search_videos.py         # CLI video search tool
├── app.py                   # Streamlit web interface (images & videos)
├── pyproject.toml           # Dependencies
├── README.md                # Documentation
├── video_embeddings/        # Video processing module
│   ├── __init__.py          # Module exports
│   ├── ingest.py            # Video frame sampling
│   ├── embedding.py         # Frame embedding generation
│   ├── cluster.py           # HDBSCAN clustering
│   ├── mean_pool.py         # Cluster pooling
│   ├── vector_db.py         # Qdrant operations
│   └── orchestrator.py      # Video indexing pipeline
├── qdrant_data/             # Vector DB storage (Docker)
├── images/                  # Your image folder (create this)
├── videos/                  # Your video folder (create this)
├── search_results/          # Search output folder
└── .venv/                   # Virtual environment (local)
```
Qdrant (Vector Database)
- Image: `qdrant/qdrant:latest`
- Ports: 6333 (API), 6334 (gRPC)
- Volume: `./qdrant_data:/qdrant/storage` (persistent)
- Network: `image-search-network`
App (Streamlit + CLIP)
- Build: Custom Dockerfile with Python 3.12
- Port: 8501
- Environment: `QDRANT_HOST=qdrant`, `QDRANT_PORT=6333`
- Volumes: `./images:/app/images:ro` (read-only), `./search_results:/app/search_results`
View running containers:
```
docker-compose ps
```
Access container shell:
```
docker-compose exec app bash
docker-compose exec qdrant sh
```
View resource usage:
```
docker stats
```
Clean up everything:
```
# Stop and remove containers
docker-compose down

# Remove volumes (WARNING: deletes all data)
docker-compose down -v

# Remove images
docker-compose down --rmi all
```
Backup Qdrant data:
```
tar -czf qdrant_backup_$(date +%Y%m%d).tar.gz qdrant_data/
```
Restore from backup:
```
docker-compose down
tar -xzf qdrant_backup_20231224.tar.gz
docker-compose up -d
```
To enable NVIDIA GPU support:
1. Install the NVIDIA Container Toolkit
2. Add to `docker-compose.yml`:
```yaml
services:
  app:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
Troubleshooting:

Container fails to start:
```
# Check logs
docker-compose logs -f

# Rebuild without cache
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```
Qdrant not ready:
```
# Check health status
docker-compose ps

# Wait for healthy status
docker-compose up -d
curl http://localhost:6333/health
```
App can't connect to Qdrant:
```
# Test connection from app container
docker-compose exec app curl http://qdrant:6333/health

# Verify network
docker network inspect photo-doc-data-embeddings_image-search-network
```
Permission issues with qdrant_data:
```
sudo chown -R $USER:$USER qdrant_data/
```
Port conflicts:
```yaml
# Edit docker-compose.yml
ports:
  - "8502:8501"  # Change host port
  - "6334:6333"  # Change host port
```
If you get cuBLAS errors:
```python
# Use CPU instead
self.device = "cpu"
```
Or reinstall PyTorch with the correct CUDA version:
```
uv remove torch torchvision
uv add torch torchvision --index https://download.pytorch.org/whl/cu121
```
Reduce batch size:
```python
processor = ImageEmbeddingProcessor(batch_size=16)
```
Ensure Qdrant is running:
```
docker ps | grep qdrant
```
Restart if needed:
```
docker restart <qdrant-container-id>
```
No search results:
- Check collection name matches
- Verify images were indexed successfully
"indian aadhaar card""passport photograph with blue background""person wearing glasses""document with signature""group photo outdoors""landscape with mountains""indoor office setting"
"person walking outdoors""car driving on highway""people talking in meeting""sunset over ocean""cooking in kitchen""children playing in park""city traffic at night"
- CLIP: OpenAI's vision-language model
- PyTorch: Deep learning framework
- Transformers: Hugging Face model library
- Qdrant: Vector similarity search engine
- Streamlit: Web UI framework
- NumPy: Numerical computing
- Pillow: Image processing
- OpenCV: Video processing
- HDBSCAN: Density-based clustering
- scikit-learn: Machine learning utilities
- OpenAI for the CLIP model
- Qdrant team for the vector database
- Hugging Face for model hosting
For issues and questions, please open a GitHub issue.