
taxomind

Overview

taxomind is a project for modular, multilingual hierarchical text classification against deep taxonomies (e.g. ISCO / ISIC). It builds multi-view taxonomy embeddings, runs top-down routing with explicit stopping (internal nodes are valid predictions), supports per-node incremental learning via evidence centroids, and includes error-analysis utilities.

Key Features

  • Multi-view embeddings for labels, definitions, and examples.
  • Retrieval + induced subgraph + top-down routing with explicit stopping.
  • Incremental learning via evidence centroids.
  • FastAPI async job API for taxonomy build, inference, learning, and analysis.
  • Task queue for production workloads.
  • Cross-lingual routing and validation (embedding-based).

Requirements

  • Python 3.9+
  • Optional: OPENAI_API_KEY for taxonomy enrichment.
  • Optional: Docker + Docker Compose for containerised deployment.

Quickstart

  1. Create and activate a Python virtual environment.

  2. Install dependencies:

    pip install -r requirements.txt

  3. Update conf/base/parameters.yml for embeddings and inference settings.

  4. Run a pipeline or start the API server:

    kedro run --pipeline build_taxonomy
    PYTHONPATH=src python scripts/start_api.py

Configuration and Data

  • Shared defaults live in conf/base/.
  • Environment overrides can live in conf/test/ and conf/prod/; run with kedro run --env=... (see the example after this list).
  • Local secrets and machine-specific settings belong in conf/local/ (never commit secrets).
  • Data inputs are expected under data/ (raw inputs go in data/01_raw/).
  • For enrichment, set OPENAI_API_KEY in the environment or .env.
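
For example, to run a pipeline against the conf/test/ overrides (a minimal sketch; it assumes that environment exists in your checkout):

# use conf/test/ instead of the conf/base/ defaults
kedro run --pipeline build_taxonomy --env=test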

Environment Variables

Copy .env.example to .env and adjust:

Variable                   Default     Description
API_TOKENS                 (none)      Comma-separated Bearer tokens for authentication
API_AUTH_ENABLED           true        Set to false to disable auth (dev only)
TASK_BACKEND               background  background (in-process) or dramatiq (Redis queue)
REDIS_URL                  (none)      Redis connection string; required when TASK_BACKEND=dramatiq
JOB_TTL_SECONDS            86400       How long completed jobs stay in Redis (24h default)
JOB_STALE_RUNNING_SECONDS  1800        Auto-fail a running job that has had no updates for this duration
OPENAI_API_KEY             (none)      Optional; required only for taxonomy enrichment
PORT                       3000        API server port in the production entrypoint / containers
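
A minimal .env for local development might look like this (token values are placeholders):

# example .env for development
API_TOKENS=dev-token-1,dev-token-2
API_AUTH_ENABLED=true
TASK_BACKEND=background
# REDIS_URL=redis://localhost:6379/0  # only needed when TASK_BACKEND=dramatiq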

Pipelines

Prerequisite for taxonomy-scoped workflows:

  • Run build_taxonomy first for the target taxonomy_key (or trigger POST /taxonomies in the API) and wait for successful completion.
  • Only then run the other pipelines/endpoints that depend on taxonomy_index for that key (for example inference/labeling/learning/enrichment flows).

  • build_taxonomy: Builds a per-taxonomy index with multi-view embeddings (label/definition/examples) for fast retrieval and routing. Also used by the API /taxonomies endpoint after the request JSON is normalized to CSV.
  • enrich_taxonomy: Optional; enriches taxonomy definitions/examples (LLM-assisted) and saves an enriched taxonomy definition.
  • inference / inference_batch: Hierarchical inference: retrieval -> induced subgraph -> top-down routing with explicit stopping and scoped validation.
  • learning_pipe: Incremental learning; updates per-node evidence centroids from /learn corrections (no ancestor drift).
  • error_analysis: Produces standardized targets from datasets for downstream error analysis and debugging.

Each pipeline is modular so that intermediate datasets (taxonomy enrichment, embeddings, inference results, etc.) can be cached or swapped for external services.
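
A typical sequence that respects the prerequisite above (a sketch; it assumes the target taxonomy_key is configured in conf/base/parameters.yml):

# 1. build the index for the configured taxonomy_key
kedro run --pipeline build_taxonomy

# 2. only after it completes, run pipelines that depend on taxonomy_index
kedro run --pipeline inference_batch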

API Service

src/taxomind/services/api/fastapi_app.py exposes an async job API (Bearer token auth) that maps to the Kedro pipelines:

  • GET / and GET /health (liveness and health endpoints)
  • POST /taxonomies and GET /taxonomies/{job_id}/status (create/build index from JSON request)
  • POST /taxonomies/{job_id}/cancel (cancel taxonomy job)
  • POST /taxonomies/{taxonomy_key}/enrich (run enrich_taxonomy)
  • POST /taxonomies/{taxonomy_key}/build (run build_taxonomy)
  • POST /classify and GET /classify/{job_id}/status (run inference_batch for classification)
  • POST /classify/{job_id}/cancel (cancel inference job)
  • POST /label and GET /label/{job_id}/status (run inference_batch for labeling)
  • POST /label/{job_id}/cancel (cancel labeling job)
  • POST /learn and GET /learn/{job_id}/status (run learning_pipe)
  • POST /learn/{job_id}/cancel (cancel learning job)
  • POST /error-analysis and GET /error-analysis/{job_id}/status (run error_analysis)
  • POST /error-analysis/{job_id}/cancel (cancel error-analysis job)

Job status values are pending, running, completed, failed, and canceled. Status payloads expose numeric progress (0.0 to 1.0) plus lifecycle timestamps (created_*, started_*, completed_*, failed_*) where relevant to each endpoint's response model.

For POST request bodies on /taxonomies, /classify, /label, and /learn, an optional top-level sourceSlug is supported. If omitted, it is inferred from the incoming host (e.g. domani1.com -> domani1).
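
As a sketch, submitting a classification job and polling its status could look like the following (dev server port assumed; the request body fields are illustrative, not the actual schema; see docs/API_TESTING.md for real payloads):

# submit a job (Bearer auth; body fields here are placeholders)
curl -s -X POST http://localhost:8000/classify \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"taxonomy_key": "isco", "texts": ["software engineer"]}'

# poll status with the job_id returned above
curl -s -H "Authorization: Bearer $API_TOKEN" \
  http://localhost:8000/classify/<job_id>/status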

Testing guide: docs/API_TESTING.md.

Architecture

taxomind supports two task execution modes, controlled by TASK_BACKEND:

TASK_BACKEND=background (default)        TASK_BACKEND=dramatiq (production)
┌───────────────────────┐                ┌──────┐  ┌─────────┐  ┌────────┐
│ FastAPI process       │                │ API  │  │  Redis  │  │ Worker │
│ API + BackgroundTask  │                │      ├──►         ├──►        │
│ + file JobStore       │                │      │  │ JobStore│  │ Kedro  │
└───────────────────────┘                └──────┘  └─────────┘  └────────┘
  • background: Everything runs in a single process. No Redis needed. Jobs are persisted to a local JSON file. Ideal for development and testing.
  • dramatiq: The API enqueues tasks via Dramatiq into Redis. A separate worker process picks them up and runs the Kedro pipelines. Job state is stored in Redis with a configurable TTL. Ideal for production.
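
A minimal local sketch of the dramatiq mode, assuming a Redis instance on the default local port:

# queue backend instead of in-process execution
export TASK_BACKEND=dramatiq
export REDIS_URL=redis://localhost:6379/0

# API process (enqueues jobs)
PYTHONPATH=src python scripts/start_api.py &

# worker process (runs the Kedro pipelines)
PYTHONPATH=src python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1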

Running Locally (Development)

No Docker, no Redis required:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Create .env from template
cp .env.example .env
# Edit .env: set API_TOKENS, leave TASK_BACKEND=background

# 3. Start the dev server (auto-reload, port 8000)
PYTHONPATH=src python scripts/start_api.py
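
Then verify the server is up:

# liveness check against the dev server
curl -s http://localhost:8000/health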

Running Locally with Docker

Single container (no Redis)

docker build -t taxomind .
docker run -p 3000:3000 \
  -e API_TOKENS=your-token \
  -e TASK_BACKEND=background \
  -v $(pwd)/data:/app/data \
  taxomind

This runs the API with in-process task execution — same behaviour as the dev server but inside a container.
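
A quick check against the running container:

# the container listens on port 3000 (see PORT in Environment Variables)
curl -s http://localhost:3000/health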

Full stack with Docker Compose (Redis + Worker)

# 1. Create .env with at least API_TOKENS
cp .env.example .env

# 2. Build and start all services
docker compose up --build -d

# 3. Check status
docker compose ps       # 3 services: redis, api, worker
docker compose logs -f   # follow logs

This starts three containers:

Service  Role                                               Port
redis    Message broker + job store                         internal only
api      FastAPI server (accepts requests, enqueues tasks)  host ${API_PORT:-3001} -> container 3000
worker   Dramatiq worker (executes Kedro pipelines)         none

Both api and worker share the /app/data volume so taxonomy files, models, and training data are accessible from both processes.
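
To confirm the stack is healthy through the published port:

# host port defaults to 3001 (API_PORT); the response should report task_backend: dramatiq
curl -s http://localhost:3001/health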

Note: Uvicorn access logs are disabled in the startup scripts (access_log=False) to reduce /health request noise; application and pipeline logs are still emitted.

Stop everything:

docker compose down          # stop containers (data volumes preserved)
docker compose down -v       # stop and remove volumes (full reset)

Deployment

Production — Generic

  1. Build the Docker image from the included Dockerfile:

    docker build -t taxomind .

    The image is optimized with a multi-stage build and CPU-only PyTorch wheels (configured in requirements-prod.txt) to reduce size.

  2. Run three services (or use an orchestrator like Docker Compose, Kubernetes, etc.):

    Redis:

    docker run -d --name taxomind-redis redis:7-alpine

    API:

    docker run -d --name taxomind-api \
      -p 3000:3000 \
      -e TASK_BACKEND=dramatiq \
      -e REDIS_URL=redis://taxomind-redis:6379/0 \
      -e API_TOKENS=your-production-token \
      -e PORT=3000 \
      -v taxomind-data:/app/data \
      taxomind

    Worker:

    docker run -d --name taxomind-worker \
      -e TASK_BACKEND=dramatiq \
      -e REDIS_URL=redis://taxomind-redis:6379/0 \
      -v taxomind-data:/app/data \
      taxomind \
      python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1
  3. Shared volume: The API and worker must share the same /app/data volume for taxonomy files, trained models, and training data.

  4. Scaling: Increase --processes on the worker for more concurrency (each process loads ML models into memory — size accordingly). You can also run separate worker containers for the default and inference queues:

    # Inference-only worker
    python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1 --queues inference
    
    # Everything-else worker
    python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1 --queues default

Production — Coolify

  1. Create a new Docker Compose service in Coolify from this repository.
  2. Set the following environment variables in Coolify (they are injected into the .env file referenced by the compose file):
    • API_TOKENS — your production Bearer tokens (comma-separated)
    • OPENAI_API_KEY — if using taxonomy enrichment
    • API_PORT (optional) — host port the API is published on (3001 by default in the compose file)
  3. Keep only api publicly exposed; worker and redis should remain internal.
  4. Mount a persistent volume at /app/data for the api and worker services (the compose file uses a named volume app_data by default).
  5. The compose file includes SERVICE_URL_API_3000 on api for Coolify routing.
  6. If your project view shows only the raw compose file and you need per-service controls, deploy via Services -> Docker Compose Empty instead.
  7. The health check at /health reports:
    • status: healthy — API and Redis are connected
    • status: degraded — API is up but Redis is unreachable
    • task_backend: dramatiq — confirms the queue backend is active
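
A probe against a deployed instance might look like this (the domain is hypothetical; only the status and task_backend fields are documented above):

# replace with your Coolify domain
curl -s https://taxomind.example.com/health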

Development Workflow

  • Use kedro jupyter lab or kedro ipython for exploratory work; Kedro automatically loads the catalog, parameters, and pipeline registry.
  • Run quality checks with ruff check and pytest.
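
For example, a minimal pass over the checks named above:

# lint and tests
ruff check
pytest

# exploratory session with catalog, parameters, and pipelines preloaded
kedro ipython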

License

MIT. See LICENSE.
