Turn item photos into structured attributes using Vision Language Models. Output strict JSON with per-field confidence scores.

boijuny/vis2attr

vis2attr - Vision Language Model for Attribute Extraction

Python 3.11+ | License: MIT | Code style: black

Turn item photos into structured attributes (brand, colors, materials, condition) using Vision Language Models. Output strict JSON with per-field confidence scores.
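To make "strict JSON with per-field confidence scores" concrete, here is a minimal sketch of what validating such output could look like. The `{"value": ..., "confidence": ...}` shape, the `FieldPrediction` class, and the `parse_prediction` function are assumptions made for illustration; they are not taken from the vis2attr codebase, and the real schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPrediction:
    value: object       # extracted attribute value, e.g. "Acme" or ["red", "white"]
    confidence: float   # model-reported confidence in [0.0, 1.0]


def parse_prediction(raw: str, expected_fields: list[str]) -> dict[str, FieldPrediction]:
    """Parse model output and reject anything that deviates from the assumed shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    mismatch = set(data) ^ set(expected_fields)
    if mismatch:
        raise ValueError(f"unexpected or missing fields: {sorted(mismatch)}")

    result: dict[str, FieldPrediction] = {}
    for name in expected_fields:
        entry = data[name]
        if not isinstance(entry, dict) or set(entry) != {"value", "confidence"}:
            raise ValueError(f"field {name!r} must have exactly 'value' and 'confidence'")
        conf = entry["confidence"]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError(f"confidence for {name!r} must be a number")
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence for {name!r} must be within [0, 1]")
        result[name] = FieldPrediction(entry["value"], float(conf))
    return result


# Illustrative payload only; the values are made up.
raw = (
    '{"brand": {"value": "Acme", "confidence": 0.92}, '
    '"colors": {"value": ["red", "white"], "confidence": 0.81}}'
)
parsed = parse_prediction(raw, ["brand", "colors"])
```

Rejecting unknown or missing fields up front keeps downstream steps from having to guess whether an attribute was omitted or simply not requested.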

High-Level Architecture

flowchart LR
    A[📁 Images<br/>JPG/PNG/WebP] --> B[📝 Prompt<br/>Jinja2]
    B --> C[🤖 VLM<br/>Mistral/OpenAI]
    C --> D[📊 Parse<br/>JSON]
    D --> E[⚖️ Decide<br/>Thresholds]
    E --> F[💾 Store<br/>Parquet]
    F --> G[📈 Export<br/>CLI]
    
    CONFIG[⚙️ Config<br/>YAML] --> B
    CONFIG --> C
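
The Config and Prompt steps in the diagram can be sketched as follows. This is an illustration only: it uses `string.Template` from the standard library as a stand-in for Jinja2 so the example has no dependencies, and the `SCHEMA` dict, the prompt wording, and `render_prompt` are hypothetical rather than the project's actual template or config format.

```python
from string import Template

# Hypothetical schema, standing in for what a YAML config might declare.
SCHEMA = {
    "brand": "string",
    "colors": "list of strings",
    "materials": "list of strings",
    "condition": "string",
}

# Assumed prompt wording; the real template would live in a Jinja2 file.
PROMPT = Template(
    "Look at the attached item photos and extract these attributes:\n"
    "$field_lines\n"
    "Reply with a single JSON object. For each attribute give "
    '{"value": ..., "confidence": <number between 0 and 1>}.'
)


def render_prompt(schema: dict[str, str]) -> str:
    """Build the prompt from the schema so no field name is hard-coded in the template."""
    field_lines = "\n".join(f"- {name} ({kind})" for name, kind in schema.items())
    return PROMPT.substitute(field_lines=field_lines)


prompt = render_prompt(SCHEMA)
```

Because the field list comes from the schema, adding an attribute means editing the config rather than the prompt template.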

🚀 Quick Start

# Install
uv venv && source .venv/bin/activate
uv pip install -e .

# Set up API key
export MISTRAL_API_KEY=your_api_key_here

# Run analysis
vis2attr analyze --input ./images --output ./predictions.parquet

# Batch processing with custom config
vis2attr analyze --input ./items --batch --config ./my-config.yaml --verbose

# Generate reports
vis2attr report --predictions ./predictions.parquet --format summary

📊 Status

The core infrastructure is complete and the pipeline works end to end. Decision rules, report generation, metrics, and additional providers still need work before production use.

| Status | Component | Description |
|--------|-----------|-------------|
|  | Core Pipeline | Complete data models, configuration, Mistral provider, JSON parser |
|  | File Processing | Image ingestor with EXIF stripping, comprehensive test suite |
|  | CLI Interface | Full analyze command with batch processing, verbose logging, and result export |
|  | Storage System | Parquet-based storage with lineage tracking and metadata storage |
| 🚧 | Decision Rules | Simple threshold-based decisions, needs sophisticated quality gates |
| 🚧 | Report Generation | CLI command structure ready, needs full implementation |
|  | Metrics & Logging | Basic logging only, needs structured metrics collection |
|  | Additional Providers | Only Mistral implemented (OpenAI, Google, Anthropic planned) |
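
A simple threshold-based decision of the kind listed under Decision Rules might look like the sketch below. The `Decision` enum, the `decide` function, the default threshold of 0.8, and the accept/review labels are all assumptions for illustration, not the project's actual rules.

```python
from enum import Enum


class Decision(str, Enum):
    ACCEPT = "accept"   # confidence met the threshold
    REVIEW = "review"   # confidence too low, route to a human


def decide(
    confidences: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.8,  # assumed fallback, not a project default
) -> tuple[Decision, dict[str, Decision]]:
    """Compare each field's confidence to its threshold and derive an overall decision."""
    per_field: dict[str, Decision] = {}
    for name, conf in confidences.items():
        threshold = thresholds.get(name, default_threshold)
        per_field[name] = Decision.ACCEPT if conf >= threshold else Decision.REVIEW

    # The item is accepted only if there is at least one field and every field passed.
    all_passed = bool(per_field) and all(d is Decision.ACCEPT for d in per_field.values())
    overall = Decision.ACCEPT if all_passed else Decision.REVIEW
    return overall, per_field


overall, per_field = decide(
    {"brand": 0.92, "condition": 0.55},
    {"brand": 0.9, "condition": 0.7},
)
```

Here `brand` passes its threshold while `condition` does not, so the item as a whole is routed to review.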

📚 Documentation

  • Architecture - System design, data flow, and component overview
  • CLI Reference - Complete command-line interface documentation
  • Configuration - Configuration options and schema definitions
  • API Reference - Data models, schemas, and programming interface
  • Contributing - Development guidelines and contribution process

🎯 Design Principles

  • Schema-first & config-driven: No hard-coded fields
  • Ports & adapters: Swappable implementations via factory patterns
  • Type safety: Comprehensive data models with validation
  • Testability: Full test coverage for all components
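
As a rough illustration of the ports and adapters principle, the sketch below defines a provider port as a `typing.Protocol` and a small factory registry that selects an adapter by name. `VLMProvider`, `register_provider`, `create_provider`, and `EchoProvider` are hypothetical names chosen for this example and may not match the project's own interfaces.

```python
from typing import Callable, Protocol


class VLMProvider(Protocol):
    """Port: anything that can turn a prompt plus images into raw model output."""

    def predict(self, prompt: str, images: list[bytes]) -> str: ...


# Maps a provider name (e.g. from config) to a zero-argument factory.
_REGISTRY: dict[str, Callable[[], VLMProvider]] = {}


def register_provider(name: str):
    """Class decorator that registers an adapter under the given name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


def create_provider(name: str) -> VLMProvider:
    """Factory: look up and instantiate the adapter selected by name."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"unknown provider {name!r}; available: {sorted(_REGISTRY)}"
        ) from None
    return factory()


@register_provider("echo")
class EchoProvider:
    """Stub adapter for tests; a real adapter would call a VLM API here."""

    def predict(self, prompt: str, images: list[bytes]) -> str:
        return '{"brand": {"value": null, "confidence": 0.0}}'


provider = create_provider("echo")
```

With this layout the rest of the pipeline depends only on the `predict` method, so a stub like `EchoProvider` can replace a network-backed adapter in tests.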

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
