This guide explains how to test the complete parselmouth pipeline locally using MinIO (an S3-compatible object store) instead of Cloudflare R2.
- Docker and Docker Compose installed
- Python environment with parselmouth installed (`pixi install`)
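To confirm the prerequisites are available (a quick sanity check; use `docker compose version` instead if you have the plugin rather than the standalone binary):

```bash
# Verify the required tools are on PATH
docker --version
docker-compose --version
pixi --version
```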
The easiest way to get started is the interactive runner, which automatically starts MinIO and runs the pipeline:
```bash
# This single command does everything:
# 1. Starts MinIO (docker-compose up -d)
# 2. Shows current storage statistics
# 3. Guides you through configuration
# 4. Runs the pipeline
pixi run test-interactive
```

If you want more control or prefer command-line arguments:
```bash
# Start MinIO container
pixi run start-minio

# Wait a few seconds for MinIO to initialize
# MinIO UI will be available at http://localhost:9001
```

With interactive prompts:
```bash
# Run with flag (MinIO must already be running)
pixi run test-pipeline --interactive
# or shorthand:
pixi run test-pipeline -i
```

With command-line arguments:
```bash
# Run with defaults (pytorch, noarch, package names starting with 't', fresh mode)
pixi run test-pipeline

# Or run with custom options
pixi run test-pipeline --help
```

Command-line options:
- `--channel`: Choose channel (conda-forge, pytorch, bioconda)
- `--subdir`: Choose subdir (noarch, linux-64, osx-arm64, etc.)
- `--letter`: Filter package NAMES by first character, or use 'all' for everything
- `--mode`: Processing mode, either 'fresh' (default, reprocess all) or 'incremental' (skip existing)
- `--interactive` or `-i`: Enable interactive prompts with statistics
Examples:
```bash
# Interactive mode (shows stats, guides you through options)
pixi run test-interactive

# Test pytorch with packages named torch*, torchvision*, etc.
pixi run test-pipeline

# Test conda-forge with packages named numpy*, napari*, etc.
pixi run test-pipeline --channel conda-forge --letter n

# Test incrementally (skip packages already processed)
pixi run test-pipeline --mode incremental

# Test ALL packages in bioconda noarch (slow!)
pixi run test-pipeline --channel bioconda --letter all
```

MinIO Web UI:
- URL: http://localhost:9001
- Username: `minioadmin`
- Password: `minioadmin`
Browse the `conda` bucket to see:

- `hash-v0/` - Hash-based mappings
- `relations-v1/` - Relations table
- `pypi-to-conda-v1/` - PyPI lookup files
When you're done, clean up with:

```bash
# Clean everything (stops MinIO and removes all data)
pixi run clean-all

# Or clean separately:

# Clean all local data (cache + outputs, keep MinIO running)
pixi run clean-local-data

# Just remove output files
pixi run clean-outputs

# Just remove conda package cache (can grow to 1GB+)
pixi run clean-cache
```

What gets cleaned:
| Command | What it removes |
|---|---|
| `clean-outputs` | Output directories |
| `clean-cache` | Downloaded conda packages (can be GB) |
| `clean-local-data` | Cache + outputs (keeps MinIO) |
| `clean-all` | Everything + stops MinIO |
Details:
- `local_test_output/` - Local pipeline test outputs
- `output/` - Updater outputs
- `output_index/` - Index files
- `output_relations/` - Relations table files
- `test_production/` - Production test files
- `conda_oci_mirror/cache/` - Cached conda packages (grows with each run)
- MinIO container and volumes (with `clean-all`)
💡 Tip: Run `pixi run clean-cache` periodically to reclaim disk space. The cache grows based on how many packages you process and will be rebuilt as needed.
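To see how much space the cache currently takes before cleaning (a quick check against the cache directory listed above):

```bash
# Report the size of the downloaded conda package cache
du -sh conda_oci_mirror/cache/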
The test script runs through the complete workflow:
- Producer - Identifies new packages to process
- Updater - Downloads and processes conda packages
- Merger - Combines partial indices into master index
- Relations - Generates v1 relations table and PyPI lookups
- Verification - Checks that all data is accessible (a rough sketch of this check is shown below)
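As an illustration of what the verification step amounts to, you can list the expected prefixes yourself with the AWS CLI (a hand-rolled sketch, not the test script's actual implementation; it assumes the default MinIO endpoint, credentials, and `conda` bucket described below):

```bash
# Confirm each output prefix exists and contains objects
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
for prefix in hash-v0 relations-v1 pypi-to-conda-v1; do
  echo "== ${prefix} =="
  aws --endpoint-url http://localhost:9000 s3 ls "s3://conda/${prefix}/" | head -n 5
done
```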
The default configuration in `docker-compose.yml`:
- S3 API Port: 9000
- Web UI Port: 9001
- Default bucket: `conda`
- Credentials: `minioadmin` / `minioadmin`
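For reference, the compose setup is roughly equivalent to starting MinIO by hand like this (a sketch only; the real `docker-compose.yml` also runs an init step to create the `conda` bucket, and the image or volume details may differ):

```bash
# Standalone MinIO with the same ports and credentials as the compose file
docker run -d --name minio-local \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  quay.io/minio/minio server /data --console-address ":9001"
```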
Use command-line arguments to test different channels:
```bash
# PyTorch (default - small, fast)
pixi run test-pipeline

# Conda-forge (large - takes longer)
pixi run test-pipeline --channel conda-forge

# Bioconda (medium size)
pixi run test-pipeline --channel bioconda
```

Use `--subdir` and `--letter` to narrow down what you process:
```bash
# Process only package names starting with 'n' (numpy, napari, etc.)
pixi run test-pipeline --letter n

# Process package names starting with 'p' in linux-64 (pandas, pytorch, etc.)
pixi run test-pipeline --subdir linux-64 --letter p

# Process ALL packages (warning: very slow!)
pixi run test-pipeline --letter all
```

Multiple channels can coexist in the same bucket, separated by path prefixes. This mirrors production behavior:
```bash
# Process pytorch first (fresh mode - reprocess everything)
pixi run test-pipeline --channel pytorch --letter t

# Process conda-forge second (both will be in the same bucket)
pixi run test-pipeline --channel conda-forge --letter n

# Process bioconda third
pixi run test-pipeline --channel bioconda --letter a

# Run incrementally to only process new packages
pixi run test-pipeline --channel pytorch --mode incremental
```

In MinIO you'll see:
```
conda/
├── hash-v0/
│   ├── {sha256}   (shared across all channels - same hash = same package)
│   ├── {sha256}
│   ├── ...
│   ├── pytorch/
│   │   └── index.json   (references hashes for pytorch packages)
│   ├── conda-forge/
│   │   └── index.json   (references hashes for conda-forge packages)
│   └── bioconda/
│       └── index.json   (references hashes for bioconda packages)
├── relations-v1/
│   ├── pytorch/
│   │   ├── relations.jsonl.gz
│   │   └── metadata.json
│   ├── conda-forge/
│   │   ├── relations.jsonl.gz
│   │   └── metadata.json
│   └── bioconda/
│       ├── relations.jsonl.gz
│       └── metadata.json
└── pypi-to-conda-v1/
    ├── pytorch/
    │   ├── torch.json
    │   └── ...
    ├── conda-forge/
    │   ├── numpy.json
    │   └── ...
    └── bioconda/
        └── ...
```
Note: Individual package mappings (`hash-v0/{sha256}`) are shared across all channels because the same package with the same SHA256 can exist in multiple channels. Only the index files and relations data are separated by channel.
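You can see this split from the command line (assuming the AWS CLI can reach MinIO with the `minioadmin` credentials, as in the next section):

```bash
# Channel-agnostic hash mappings live directly under hash-v0/
aws --endpoint-url http://localhost:9000 s3 ls s3://conda/hash-v0/ | head -n 5

# The channel-specific index references those hashes
aws --endpoint-url http://localhost:9000 s3 cp s3://conda/hash-v0/pytorch/index.json - | head -c 500
```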
You can use MinIO with any parselmouth command by setting the endpoint:
```bash
export R2_PREFIX_ENDPOINT="http://localhost:9000"
export R2_PREFIX_ACCESS_KEY_ID="minioadmin"
export R2_PREFIX_SECRET_ACCESS_KEY="minioadmin"
export R2_PREFIX_BUCKET="conda"

# Now run any command
pixi run parselmouth update-v1-mappings --upload --channel pytorch
```

Note: The system also supports the standard `AWS_ENDPOINT_URL` environment variable for compatibility with boto3 tools.
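For example, tools that understand the standard variables can be pointed at MinIO without any parselmouth-specific configuration (a sketch; it assumes an AWS CLI or boto3 version recent enough to honor `AWS_ENDPOINT_URL`):

```bash
# Standard AWS-style environment variables
export AWS_ENDPOINT_URL="http://localhost:9000"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"

aws s3 ls s3://conda/
```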
Alternatively, pass the endpoint explicitly on each AWS CLI command:

```bash
# List all buckets
aws --endpoint-url http://localhost:9000 \
    s3 ls

# List files in a bucket
aws --endpoint-url http://localhost:9000 \
    s3 ls s3://conda/hash-v0/

# Download a specific file
aws --endpoint-url http://localhost:9000 \
    s3 cp s3://conda/relations-v1/pytorch/relations.jsonl.gz ./
```

MinIO data is stored in a Docker volume by default. The `pixi run clean-all` command will stop MinIO and remove all volumes.
If you want to manually manage docker-compose:
```bash
# Stop without removing volumes
docker-compose down

# Restart with existing data
docker-compose up -d

# Remove everything including volumes
docker-compose down -v
```

If MinIO fails to start, check the container logs and make sure the ports are free:

```bash
# Check container logs
docker-compose logs minio

# Ensure ports are available
lsof -i :9000
lsof -i :9001
```

Make sure MinIO is running and healthy:

```bash
# Check container status
docker-compose ps
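# Optionally hit MinIO's health endpoint directly (assumes the default S3 port 9000)
curl -f http://localhost:9000/minio/health/live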
# Wait for health check
docker-compose logs minio-init
```

If the pipeline completes but data is missing:
- Check that packages exist upstream in the conda channel (see the quick check below)
- Try a different subdir (e.g., `linux-64` instead of `noarch`)
- Check MinIO logs for errors
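One way to sanity-check both sides (a rough sketch; the channel and subdir in the URL are just examples, using the standard anaconda.org repodata layout):

```bash
# A 200 response means the channel/subdir has repodata to process
curl -sI https://conda.anaconda.org/pytorch/noarch/repodata.json | head -n 1

# Tail the MinIO logs for upload errors
docker-compose logs minio | tail -n 20
```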
MinIO runs on HTTP by default. Make sure you're using `http://`, not `https://`:

```bash
export R2_PREFIX_ENDPOINT="http://localhost:9000"  # Correct
```

| Feature | MinIO (Local) | Cloudflare R2 (Production) |
|---|---|---|
| Endpoint | `http://localhost:9000` | `https://*.r2.cloudflarestorage.com` |
| Credentials | `minioadmin` | Cloudflare API tokens |
| SSL | HTTP only | HTTPS only |
| Performance | Local disk | Global CDN |
| Cost | Free | Pay per operation |
The local setup is perfect for:
- Testing pipeline changes without affecting production
- Developing new features
- Debugging issues
- Validating data transformations
- Performance testing
A complete example testing session:

```bash
# 1. Start fresh (clean everything first)
pixi run clean-all

# 2. Start MinIO
pixi run start-minio

# 3. Run pipeline (fresh mode)
pixi run test-pipeline

# 4. Verify in UI
open http://localhost:9001

# 5. Test incremental update (skips existing packages)
pixi run test-pipeline --mode incremental

# 6. Check that existing packages were skipped (look for logs)

# 7. Clean up
pixi run clean-all
```

The integration tests (`tests/test_integration_s3_pipeline.py`) use moto to mock S3. For full end-to-end testing with real S3 behavior, use this MinIO setup instead.
To run tests against MinIO:
```bash
# Start MinIO
pixi run start-minio

# Set environment
export R2_PREFIX_ENDPOINT="http://localhost:9000"
export R2_PREFIX_ACCESS_KEY_ID="minioadmin"
export R2_PREFIX_SECRET_ACCESS_KEY="minioadmin"
export R2_PREFIX_BUCKET="conda"

# Run specific test
pixi run test
```