LLaVA-Caption

By David "Zanshinmu" Van de Ven <zanshin.g1@gmail.com>

Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions. Optimized for and primarily tested on Apple Silicon Macs, though cross-platform compatibility is available.


Overview

LLaVA Caption was designed to solve a specific problem in AI training: when using generated images, the original prompts often contain elements that are not present in the final images. Manual verification and captioning are time-consuming, but inaccurate captions make for bad training data. This tool provides higher-quality captions than BLIP, with options ranging from basic processing to near-manual quality.

LLaVA Caption was built and tested on Apple Silicon. While cross-platform libraries make it usable on PCs, it has not been tested on Linux or Windows.


Available Models

MLXModel (Default, Recommended)

  • Uses Qwen2-VL-7B-Instruct-8bit with Apple's MLX framework
  • Apple Silicon only
  • Fast processing with 16GB unified memory
  • Accuracy comparable to VisionModel
  • Requirements: Apple Silicon Mac, 16GB+ unified memory

VisionModel

  • Uses Llama 3.2 Vision via Ollama
  • High accuracy with moderate resource requirements
  • Excellent results with secondary caption generation
  • Ideal for training Flux/SD3
  • Requirements: 24GB RAM, GPU recommended

DualModel (Experimental)

  • Combines LLaVA 1.5 and Mixtral
  • Highest potential accuracy but resource-intensive
  • Supports distributed processing across machines
  • Currently experimental; may need further optimization
  • Requirements: 64GB RAM, GPU strongly recommended

Additional Models

  • OLModel: Basic Ollama-based processing
  • HFModel: Hugging Face Transformers-based processing (note: MPS is not supported on Apple Silicon)
  • LCPModel: Direct llama.cpp processing
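
The model can also be selected with the LLAVA_PROCESSOR environment variable (documented under Command Line Options below) instead of the --model flag; for example:

# Select VisionModel via the environment instead of --model
export LLAVA_PROCESSOR=VisionModel
llava-caption /path/to/images/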

Installation

Prerequisites

  • Python 3 with Poetry or pip
  • Ollama, if using VisionModel or OLModel
  • Hardware per the model requirements listed above (e.g., an Apple Silicon Mac with 16GB+ unified memory for MLXModel)

Installation with Poetry (Recommended)

# Clone repository
git clone https://github.com/Zanshinmu/llava-caption.git
cd llava-caption

# Install with Poetry
poetry install

# Activate virtual environment
poetry shell

# Verify installation
llava-caption --help

Alternative Installation with pip

# Clone repository
git clone https://github.com/Zanshinmu/llava-caption.git
cd llava-caption

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Verify installation
llava-caption --help

Usage

Basic Commands

# Basic usage with defaults
llava-caption /path/to/images/

# Specific model selection
llava-caption --model MLXModel /path/to/images/

# Direct captioning (no prompt comparison)
llava-caption --direct-caption /path/to/images/

Command Line Options

llava-caption [OPTIONS] DIRECTORY

Arguments:
  DIRECTORY                     Directory containing images

Model Selection:
  --model MODEL                 Model to use (default: MLXModel)
                                [env: LLAVA_PROCESSOR]

Processing Modes:
  --direct-caption              Enable direct captioning mode
  --secondary-caption           Enable secondary captioning
                                [env: SECONDARY_CAPTION]
  --no-preprocess               Disable text preprocessing
                                [env: PREPROCESSOR]

Model Parameters:
  --temperature FLOAT           Generation temperature (default: 0.0)
                                [env: TEMPERATURE]
  --gpu-layers INT              GPU layers (-1 for all)
                                [env: N_GPU_LAYERS]

Ollama Configuration:
  --ollama-address HOST:PORT    Ollama address (default: 127.0.0.1:11434)
                                [env: OLLAMA_REMOTEHOST]

Logging:
  --logging                     Enable detailed logging
  --sys-logging                 Enable system logging
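
Most options can also be set through the environment variables shown above. A minimal sketch, with illustrative values:

# Equivalent configuration via environment variables
export LLAVA_PROCESSOR=VisionModel
export TEMPERATURE=0.7
export OLLAMA_REMOTEHOST=192.168.1.110:11434
llava-caption /path/to/images/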

Example Usage Patterns

# MLXModel with direct captioning
llava-caption --model MLXModel --direct-caption /path/to/images/

# VisionModel with remote Ollama
llava-caption --model VisionModel --ollama-address 192.168.1.110:11434 /path/to/images/

# Secondary captioning with higher temperature
llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/

# Debug mode
llava-caption --logging --sys-logging /path/to/images/

Important Notes

PC Users

  • You may need to remove any mlx entries from requirements.txt to install successfully.
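
If editing by hand is inconvenient, a one-liner can strip those entries first (assuming GNU sed, as on most Linux systems):

# Remove any mlx entries from requirements.txt in place
sed -i '/mlx/d' requirements.txt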

Model Downloads

  • Models are downloaded automatically via Hugging Face Hub or Ollama (they can also be pulled ahead of time; see below)
  • Initial downloads may take time and significant disk space
  • Models are selected for optimal performance and resource usage
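
For the Ollama-backed VisionModel, a pre-pull might look like the following; the exact model tag is an assumption based on the model name:

# Pre-pull the Llama 3.2 Vision model used by VisionModel (tag assumed)
ollama pull llama3.2-vision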

Resource Requirements

  • CPU Mode: Significant CPU and RAM usage, especially with HFModel
  • GPU Usage: Set TORCH_DEVICE="cuda:0" for Nvidia GPU support (see the example below)
  • Distributed Processing: Models can be split across two hosts using DualModel
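
For example, to point the Transformers-based HFModel at an Nvidia GPU (the model choice here is illustrative):

# Run with CUDA via the TORCH_DEVICE environment variable
export TORCH_DEVICE="cuda:0"
llava-caption --model HFModel /path/to/images/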

File Handling

  • Expects matching .png and .txt files in the target directory (a quick check is sketched below)
  • Existing text files will be overwritten with new captions
  • In direct caption mode, creates a new .txt file for each image
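
As referenced above, a short shell loop can list any image missing its caption file before a prompt-comparison run:

# List .png files with no matching .txt (not needed in direct caption mode)
for img in /path/to/images/*.png; do
  [ -f "${img%.png}.txt" ] || echo "missing caption for $img"
done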

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
