
Ollama Embeddings Generator

A Rust application that parses directory files and generates AI embeddings using the Ollama API. The tool walks a directory, splits supported files into text chunks, and generates vector embeddings that can be used for semantic search, RAG (Retrieval-Augmented Generation), and other AI applications.

Features

  • File Parsing: Recursively parses directories and extracts text content from supported file types
  • Text Chunking: Splits text into overlapping chunks with configurable chunk size and overlap
  • Ollama Integration: Generates embeddings using local Ollama models
  • Multiple Output Formats: Supports JSON, JSON Lines, and CSV output formats
  • Configuration Management: Flexible configuration via files, environment variables, and CLI arguments
  • Progress Tracking: Real-time progress updates during embedding generation
  • Dry Run Mode: Analyze files without generating embeddings
  • Error Handling: Comprehensive error handling and logging

Supported File Types

  • Text files (.txt)
  • Markdown (.md)
  • Source code files (.rs, .py, .js, .ts, .html, .css)
  • Configuration files (.json, .yaml, .yml, .toml, .xml)
  • Data files (.csv, .log)
  • Document files (.pdf, .odt, .ods, .odp, .odg)
  • Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, .xlsx)

Installation

Prerequisites

  • Rust 1.70+ installed
  • Ollama running locally with an embedding model

Install Ollama and Models

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama
ollama serve

# Pull an embedding model
ollama pull nomic-embed-text
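
# Verify the model is installed (the name may include a tag such as :latest)
ollama list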

Build from Source

git clone <repository-url>
cd ollama-embeddings
cargo build --release

Quick Start Tools

File Inspector Script

Before processing large directories, use the inspect-files.sh script to analyze your files:

# Analyze a directory structure
./inspect-files.sh /path/to/documents

# Example output shows:
# - File counts and sizes per directory
# - Supported vs unsupported file types
# - Processing time estimates
# - Suggested command parameters

Interactive Menu

Use the interactive menu for common operations (described in detail in the Interactive Menu section below):

./menu.sh

Usage

Basic Usage

# Generate embeddings for a directory
./target/release/ollama-embeddings -d /path/to/documents

# Use a specific model
./target/release/ollama-embeddings -d /path/to/documents -m nomic-embed-text

# Save to file
./target/release/ollama-embeddings -d /path/to/documents -o embeddings.json

# Dry run to analyze files first
./target/release/ollama-embeddings -d /path/to/documents --dry-run

# Remove duplicate file references (symlinks, etc.)
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files

# Keep only newest version of files with same name
./target/release/ollama-embeddings -d /path/to/documents --detect-newest

# Combine both deduplication features
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files --detect-newest

Retraining Existing Embeddings

You can retrain existing embeddings with a different model without re-parsing the original files:

# Retrain embeddings with a different model
./target/release/ollama-embeddings --retrain-from /path/to/existing/embeddings --model new-model-name

# Example: Switch from nomic-embed-text to llama2
./target/release/ollama-embeddings --retrain-from ./embeddings --model llama2

# Retrain with custom output location
./target/release/ollama-embeddings --retrain-from ./embeddings --model mistral -o retrained_embeddings.json

Retrain Features:

  • Preserves original text chunks and metadata
  • Generates new embeddings using the specified model
  • Adds retrain metadata (original model, retrain timestamp), as sketched below
  • Creates new output files with model name suffix
  • Supports both JSON and JSONL formats
  • Maintains chunk relationships and document structure
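
As a rough sketch, a retrained document might carry metadata like this; the field names here are illustrative, so check your generated files for the exact schema:

{
  "chunk": { ... },
  "embedding": {
    "model": "mistral",
    "created_at": "2024-06-01T09:00:00Z",
    "retrain_metadata": {
      "original_model": "nomic-embed-text",
      "retrained_at": "2024-06-01T09:00:00Z"
    }
  }
}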

Advanced Usage

# Use custom Ollama server
./target/release/ollama-embeddings -d /path/to/docs -u http://remote-ollama:11434

# Use configuration file
./target/release/ollama-embeddings -d /path/to/docs -c config.toml

# Verbose logging
./target/release/ollama-embeddings -d /path/to/docs -v

Command Line Options

  • -d, --directory <DIR>: Input directory to process (required)
  • -c, --config <FILE>: Configuration file path
  • -o, --output <FILE>: Output file path
  • -m, --model <MODEL>: Ollama model to use for embeddings
  • -u, --url <URL>: Ollama server URL
  • --dry-run: Analyze files without generating embeddings
  • -v, --verbose: Enable verbose logging
  • --deduplicate-files: Remove duplicate file references (symlinks, hardlinks) and sort processing order
  • --detect-newest: When multiple files have the same name, keep only the newest based on modification time

File Deduplication Options

--deduplicate-files: Removes duplicate file references by canonicalizing paths and detecting when the same physical file is accessed via different paths (e.g., through symlinks). Files are sorted for consistent processing order.

--detect-newest: When multiple files share the same filename (in different directories), this option keeps only the newest version based on modification timestamp. Useful for:

  • Backup directories with multiple versions
  • Archive folders with dated copies
  • Development directories with old/new versions

These options can be used independently or together for comprehensive duplicate handling.
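
A small illustrative scenario (paths and timestamps invented for the example):

# docs/report.txt               modified 2024-03-01
# docs/backup/report.txt        modified 2023-12-01
# docs/latest.txt -> report.txt   (symlink)
./target/release/ollama-embeddings -d docs --deduplicate-files --detect-newest
# --deduplicate-files collapses docs/latest.txt into docs/report.txt;
# --detect-newest then skips the older docs/backup/report.txt.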

Interactive Menu

The project includes an interactive menu script that provides easy access to all build, test, and run commands:

./menu.sh

Menu Features:

  • Cargo & Build Commands: Build, clean, format, lint, documentation
  • Unit Testing: Run all tests, specific modules, with coverage
  • Application Testing: Dry runs, custom directories, embedding generation
  • System Information: Check dependencies, Ollama status, build status
  • Quick Start: Automated build + test workflow
  • Full Workflow: Complete clean + build + test + run cycle

The menu provides a user-friendly interface for developers and makes it easy to verify functionality without memorizing command-line options.

Configuration

Configuration File

Create a config.toml file:

[ollama]
base_url = "http://localhost:11434"
model = "nomic-embed-text"
timeout_seconds = 300

[processing]
chunk_size = 1000
chunk_overlap = 200
min_chunk_size = 100
max_file_size = 10485760  # 10MB
supported_extensions = ["txt", "md", "rs", "py", "js", "ts"]
concurrent_requests = 5
deduplicate_files = false  # Remove duplicate file references (symlinks, hardlinks)
detect_newest = false      # Keep only newest file when multiple files have same name

[output]
format = "Json"  # "Json", "JsonLines", or "Csv"
include_metadata = true
pretty_print = true

Environment Variables

  • OLLAMA_BASE_URL: Ollama server URL
  • OLLAMA_MODEL: Model name for embeddings
  • OLLAMA_TIMEOUT_SECONDS: Request timeout
  • CHUNK_SIZE: Text chunk size in characters
  • CHUNK_OVERLAP: Overlap between chunks
  • MIN_CHUNK_SIZE: Minimum chunk size to keep
  • MAX_FILE_SIZE: Maximum file size to process
  • CONCURRENT_REQUESTS: Number of concurrent API requests
  • OUTPUT_FORMAT: Output format (json, jsonlines, csv)
  • OUTPUT_FILE: Output file path
  • INCLUDE_METADATA: Include file metadata (true/false)
  • PRETTY_PRINT: Pretty print JSON output (true/false)
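
For example, the configuration shown above can also be supplied entirely through the environment (values are illustrative):

export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="nomic-embed-text"
export CHUNK_SIZE=1000
export CHUNK_OVERLAP=200
export OUTPUT_FORMAT=jsonlines
export OUTPUT_FILE=embeddings.jsonl
./target/release/ollama-embeddings -d /path/to/documents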

Output Format

JSON Format

[
  {
    "chunk": {
      "id": "chunk-uuid",
      "content": "Text content of the chunk...",
      "source_file": "/path/to/file.txt",
      "chunk_index": 0,
      "start_position": 0,
      "end_position": 500,
      "metadata": {
        "file_path": "/path/to/file.txt",
        "mime_type": "text/plain",
        "file_size": "1024",
        "chunk_size": "500",
        "file_extension": "txt",
        "language": "text"
      }
    },
    "embedding": {
      "id": "embedding-uuid",
      "chunk_id": "chunk-uuid",
      "vector": [0.1, 0.2, 0.3, ...],
      "model": "nomic-embed-text",
      "created_at": "2024-01-01T12:00:00Z"
    }
  }
]

JSON Lines Format

Each line contains a complete embedding document:

{"chunk": {...}, "embedding": {...}}
{"chunk": {...}, "embedding": {...}}

Architecture

The application is structured into several modules:

  • file_parser: Handles file discovery, reading, and content extraction
  • embedding: Manages text chunking and embedding document creation
  • ollama_client: HTTP client for Ollama API communication
  • config: Configuration management and validation
  • main: CLI interface and application orchestration

Key Components

FileParser

  • Recursively traverses directories
  • Filters files by extension and size
  • Handles text encoding detection
  • Validates text content

EmbeddingProcessor

  • Splits text into overlapping chunks (sketched after this list)
  • Finds natural breaking points (sentences, paragraphs)
  • Adds metadata to chunks
  • Detects programming languages
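
For intuition, the overlapping-chunk idea can be sketched in a few lines; this is an illustration only, not the project's implementation, which also adjusts boundaries and attaches metadata:

// Overlapping chunking: each chunk starts (chunk_size - overlap)
// characters after the previous one.
fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size.saturating_sub(overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}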

OllamaClient

  • Communicates with the Ollama API (see the sketch after this list)
  • Handles model validation
  • Supports batch processing
  • Includes progress tracking
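
Conceptually, each embedding is one HTTP POST to Ollama. Here is a minimal sketch against the public /api/embeddings endpoint, assuming the reqwest (with its json feature) and serde_json crates; this is an illustration, not the project's actual client:

// Request one embedding from a running Ollama server.
async fn embed(base_url: &str, model: &str, prompt: &str)
    -> Result<Vec<f64>, Box<dyn std::error::Error>>
{
    let body = serde_json::json!({ "model": model, "prompt": prompt });
    let resp: serde_json::Value = reqwest::Client::new()
        .post(format!("{base_url}/api/embeddings"))
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // The response carries the vector under the "embedding" key.
    Ok(resp["embedding"]
        .as_array()
        .ok_or("no embedding in response")?
        .iter()
        .filter_map(|v| v.as_f64())
        .collect())
}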

Development

Running Tests

cargo test

Running with Logging

RUST_LOG=debug cargo run -- -d /path/to/docs -v

Adding New File Types

To support new file types, add extensions to the supported_extensions list in your configuration or modify the default list in ProcessingConfig::default().
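
For example, to also pick up reStructuredText and Org files via the configuration file (the added extensions are illustrative; any extension you add must still be readable as plain text):

[processing]
supported_extensions = ["txt", "md", "rs", "py", "js", "ts", "rst", "org"]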

Performance Considerations

  • File Size: Large files are chunked to manage memory usage
  • Concurrency: Batch processing with configurable concurrency limits (example settings below)
  • Rate Limiting: Built-in delays between requests to avoid overwhelming Ollama
  • Memory Usage: Streaming processing for large directories
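
A conservative [processing] block for a resource-constrained machine might look like this, using the same keys as the configuration section above (values are illustrative starting points):

[processing]
chunk_size = 500
chunk_overlap = 100
max_file_size = 5242880   # 5MB
concurrent_requests = 2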

Troubleshooting

Quick Checklist

  1. Ollama not running: Ensure Ollama is started with ollama serve
  2. Model not available: Pull the required model with ollama pull <model-name>
  3. Connection timeout: Increase timeout_seconds in configuration
  4. Out of memory: Reduce chunk_size or max_file_size

Debug Mode

Enable debug logging to see detailed processing information:

RUST_LOG=debug ./target/release/ollama-embeddings -d /path/to/docs -v

Common Issues

"Model 'nomic-embed-text' is not available"

This error occurs when the embedding model is not properly installed or has a different name.

Solution:

  1. Check available models: ollama list
  2. If the model shows as nomic-embed-text:latest, use the full name:
    ./target/release/ollama-embeddings -d /path/to/docs --model nomic-embed-text:latest
  3. If the model is missing, install it:
    ollama pull nomic-embed-text

"Failed to connect to Ollama server"

Solutions:

  1. Start Ollama service: ollama serve
  2. Check if running: ps aux | grep ollama
  3. Test connectivity: curl http://localhost:11434/api/tags
  4. Use custom URL: --url http://your-ollama-server:11434

"Permission denied" or "File not found"

Solutions:

  1. Make binary executable: chmod +x target/release/ollama-embeddings
  2. Check directory permissions: ls -la /path/to/documents
  3. Use absolute paths instead of relative paths

Out of memory errors

Solutions:

  1. Process smaller directories or use file filters
  2. Reduce chunk size in configuration
  3. Use --dry-run to estimate memory usage first

Slow performance

Solutions:

  1. Use faster embedding models (smaller parameter count)
  2. Reduce chunk overlap in configuration
  3. Process files in smaller batches
  4. Use SSD storage for better I/O performance

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Roadmap

  • Support for PDF files ✅
  • Support for OpenDocument files (ODT, ODS, ODP, ODG) ✅
  • Support for Microsoft Office files (DOC, DOCX, PPT, PPTX, XLS, XLSX) ✅
  • Database storage backends
  • Web interface
  • Docker containerization
  • Kubernetes deployment manifests
  • Integration with vector databases (Qdrant, Pinecone, etc.)
