taxomind is a project for modular, multilingual hierarchical text classification against deep taxonomies (e.g. ISCO / ISIC). It builds multi-view taxonomy embeddings, runs top-down routing with explicit stopping (internal nodes are valid predictions), supports per-node incremental learning via evidence centroids, and includes error-analysis utilities.
- Multi-view embeddings for labels, definitions, and examples.
- Retrieval + induced subgraph + top-down routing with explicit stopping.
- Incremental learning via evidence centroids.
- FastAPI async job API for taxonomy build, inference, learning, and analysis.
- Task queue for production workloads.
- Cross-lingual routing and validation (embedding-based).
- Python 3.9+
- Optional: `OPENAI_API_KEY` for taxonomy enrichment.
- Optional: Docker + Docker Compose for containerised deployment.
- Create and activate a Python virtual environment.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Update `conf/base/parameters.yml` for embeddings and inference settings.
- Run a pipeline or start the API server:

  ```bash
  kedro run --pipeline build_taxonomy
  PYTHONPATH=src python scripts/start_api.py
  ```
- Shared defaults live in `conf/base/`.
- Environment overrides can live in `conf/test/` and `conf/prod/` (run with `kedro run --env=...`; see the example below).
- Local secrets and machine-specific settings belong in `conf/local/` (never commit secrets).
- Data inputs are expected under `data/` (see `data/01_raw/` for inputs).
- For enrichment, set `OPENAI_API_KEY` in the environment or `.env`.
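For instance, running a pipeline against an environment override (a sketch; the pipeline must exist in the registry and the environment directory under `conf/`):

```bash
# Use overrides from conf/prod/ on top of conf/base/
kedro run --pipeline inference --env prod

# Without --env, Kedro merges conf/base/ with conf/local/
kedro run --pipeline build_taxonomy
```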
Copy `.env.example` to `.env` and adjust (example below):

| Variable | Default | Description |
|---|---|---|
| `API_TOKENS` | (none) | Comma-separated Bearer tokens for authentication |
| `API_AUTH_ENABLED` | `true` | Set to `false` to disable auth (dev only) |
| `TASK_BACKEND` | `background` | `background` (in-process) or `dramatiq` (Redis queue) |
| `REDIS_URL` | (none) | Redis connection string, required when `TASK_BACKEND=dramatiq` |
| `JOB_TTL_SECONDS` | `86400` | How long completed jobs stay in Redis (24h default) |
| `JOB_STALE_RUNNING_SECONDS` | `1800` | Auto-fail a running job if it has no updates for this duration |
| `OPENAI_API_KEY` | (none) | Required only for taxonomy enrichment |
| `PORT` | `3000` | API server port in production entrypoint / containers |
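As a sketch, a minimal `.env` for the queued production setup, built from the table above (token and Redis URL are placeholders):

```bash
# .env (illustrative values only)
API_TOKENS=replace-with-a-long-random-token
API_AUTH_ENABLED=true
TASK_BACKEND=dramatiq
REDIS_URL=redis://localhost:6379/0
JOB_TTL_SECONDS=86400
JOB_STALE_RUNNING_SECONDS=1800
# OPENAI_API_KEY=...   # only needed for taxonomy enrichment
PORT=3000
```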
Prerequisite for taxonomy-scoped workflows (see the example after this list):

- Run `build_taxonomy` first for the target `taxonomy_key` (or trigger `POST /taxonomies` in the API) and wait for successful completion.
- Only then run the other pipelines/endpoints that depend on the `taxonomy_index` for that key (for example inference/labeling/learning/enrichment flows).
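For example, with a hypothetical target key `isco` (how `taxonomy_key` is supplied depends on your `parameters.yml`; `--params` is shown as one possibility):

```bash
# 1. Build the taxonomy index first and wait for it to finish
kedro run --pipeline build_taxonomy --params taxonomy_key=isco

# 2. Only then run pipelines that read the taxonomy_index for that key
kedro run --pipeline inference_batch --params taxonomy_key=isco
```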
| Pipeline | Description |
|---|---|
| `build_taxonomy` | Build a per-taxonomy index with multi-view embeddings (label/definition/examples) for fast retrieval + routing. Also used by API `/taxonomies` after request JSON is normalized to CSV. |
| `enrich_taxonomy` | Optional: enrich taxonomy definitions/examples (LLM-assisted) and save an enriched taxonomy definition. |
| `inference` / `inference_batch` | Hierarchical inference: retrieval -> induced subgraph -> top-down routing with explicit stopping + scoped validation. |
| `learning_pipe` | Incremental learning: update per-node evidence centroids from `/learn` corrections (no ancestor drift). |
| `error_analysis` | Produce standardized targets from datasets for downstream error analysis/debugging. |
Each pipeline is modular so that intermediate datasets (taxonomy enrichment, embeddings, inference results, etc.) can be cached or swapped for external services.
`src/taxomind/services/api/fastapi_app.py` exposes an async job API (Bearer token auth) that maps to the Kedro pipelines:

- `GET /` and `GET /health` (liveness and health endpoints)
- `POST /taxonomies` and `GET /taxonomies/{job_id}/status` (create/build index from JSON request)
- `POST /taxonomies/{job_id}/cancel` (cancel taxonomy job)
- `POST /taxonomies/{taxonomy_key}/enrich` (run `enrich_taxonomy`)
- `POST /taxonomies/{taxonomy_key}/build` (run `build_taxonomy`)
- `POST /classify` and `GET /classify/{job_id}/status` (run `inference_batch` for classification)
- `POST /classify/{job_id}/cancel` (cancel inference job)
- `POST /label` and `GET /label/{job_id}/status` (run `inference_batch` for labeling)
- `POST /label/{job_id}/cancel` (cancel labeling job)
- `POST /learn` and `GET /learn/{job_id}/status` (run `learning_pipe`)
- `POST /learn/{job_id}/cancel` (cancel learning job)
- `POST /error-analysis` and `GET /error-analysis/{job_id}/status` (run `error_analysis`)
- `POST /error-analysis/{job_id}/cancel` (cancel error-analysis job)
Job status values are: `pending`, `running`, `completed`, `failed`, `canceled`.
Status payloads expose numeric progress (0.0 to 1.0) plus lifecycle
timestamps (`created_*`, `started_*`, `completed_*`, `failed_*`) where relevant
for each endpoint model.

For POST request bodies on `/taxonomies`, `/classify`, `/label`, and `/learn`,
an optional top-level `sourceSlug` is supported. If omitted, it is inferred from
the incoming host (e.g. `domani1.com` -> `domani1`).
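A hedged sketch of the job lifecycle against a local server (the exact request schema lives in the Swagger UI at `/docs`; every body field except `sourceSlug` is illustrative):

```bash
# Submit a classification job
curl -s -X POST http://localhost:8000/classify \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sourceSlug": "domani1", "texts": ["senior software engineer"]}'
# -> assumed shape: {"job_id": "...", "status": "pending", ...}

# Poll until status is completed, failed, or canceled
curl -s http://localhost:8000/classify/<job_id>/status \
  -H "Authorization: Bearer $API_TOKEN"
```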
Testing guide: `docs/API_TESTING.md`.
taxomind supports two task execution modes, controlled by `TASK_BACKEND`:

```
TASK_BACKEND=background (default)      TASK_BACKEND=dramatiq (production)

┌───────────────────────┐              ┌──────┐   ┌──────────┐   ┌────────┐
│ FastAPI process       │              │ API  ├──►│  Redis   ├──►│ Worker │
│ API + BackgroundTask  │              │      │   │ JobStore │   │ Kedro  │
│ + file JobStore       │              └──────┘   └──────────┘   └────────┘
└───────────────────────┘
```
- `background`: Everything runs in a single process. No Redis needed. Jobs are persisted to a local JSON file. Ideal for development and testing.
- `dramatiq`: The API enqueues tasks via Dramatiq into Redis. A separate worker process picks them up and runs the Kedro pipelines. Job state is stored in Redis with a configurable TTL. Ideal for production. (Launch sketch below.)
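A minimal launch sketch for each mode, assuming the repository layout above and, for the queued case, a Redis reachable on localhost:

```bash
# background: the API process runs jobs itself; no Redis
TASK_BACKEND=background PYTHONPATH=src python scripts/start_api.py

# dramatiq: start the API (enqueues) and a worker (executes) separately
TASK_BACKEND=dramatiq REDIS_URL=redis://localhost:6379/0 \
  PYTHONPATH=src python scripts/start_api.py &
TASK_BACKEND=dramatiq REDIS_URL=redis://localhost:6379/0 \
  PYTHONPATH=src python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1
```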
No Docker, no Redis required:
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Create .env from template
cp .env.example .env
# Edit .env: set API_TOKENS, leave TASK_BACKEND=background

# 3. Start the dev server (auto-reload, port 8000)
PYTHONPATH=src python scripts/start_api.py
```

- Swagger UI: http://localhost:8000/docs
- Health check: http://localhost:8000/health (smoke test below)
- Jobs are stored in `data/09_job_store/jobs.json` (survives restart, lost on data wipe).
- Pipeline tasks run in-process via FastAPI `BackgroundTasks`.
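Once the server is up, a quick smoke test (whether `/health` requires a token depends on your auth settings; the Bearer header is shown for completeness):

```bash
curl -s http://localhost:8000/health
curl -s http://localhost:8000/ -H "Authorization: Bearer $API_TOKEN"
```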
```bash
docker build -t taxomind .

docker run -p 3000:3000 \
  -e API_TOKENS=your-token \
  -e TASK_BACKEND=background \
  -v $(pwd)/data:/app/data \
  taxomind
```

This runs the API with in-process task execution: same behaviour as the dev server, but inside a container.
```bash
# 1. Create .env with at least API_TOKENS
cp .env.example .env

# 2. Build and start all services
docker compose up --build -d

# 3. Check status
docker compose ps        # 3 services: redis, api, worker
docker compose logs -f   # follow logs
```

This starts three containers:
| Service | Role | Port |
|---|---|---|
| `redis` | Message broker + job store | internal only |
| `api` | FastAPI server (accepts requests, enqueues tasks) | host `${API_PORT:-3001}` -> container 3000 |
| `worker` | Dramatiq worker (executes Kedro pipelines) | none |
Both `api` and `worker` share the `/app/data` volume so taxonomy files, models,
and training data are accessible from both processes (verification sketch below).

Note: Uvicorn access logs are disabled in the startup scripts (`access_log=False`)
to reduce `/health` request noise; application and pipeline logs are still emitted.
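To confirm the containers really share the volume, a quick check (service names as in the compose file):

```bash
# Both listings should show the same taxonomy/model files
docker compose exec api ls /app/data
docker compose exec worker ls /app/data
```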
Stop everything:

```bash
docker compose down      # stop containers (data volumes preserved)
docker compose down -v   # stop and remove volumes (full reset)
```
- Build the Docker image from the included `Dockerfile`:

  ```bash
  docker build -t taxomind .
  ```

  The image is optimized with a multi-stage build and CPU-only PyTorch wheels (configured in `requirements-prod.txt`) to reduce size.

- Run three services (or use an orchestrator like Docker Compose, Kubernetes, etc.):

  Redis:

  ```bash
  docker run -d --name taxomind-redis redis:7-alpine
  ```

  API:

  ```bash
  docker run -d --name taxomind-api \
    -p 3000:3000 \
    -e TASK_BACKEND=dramatiq \
    -e REDIS_URL=redis://taxomind-redis:6379/0 \
    -e API_TOKENS=your-production-token \
    -e PORT=3000 \
    -v taxomind-data:/app/data \
    taxomind
  ```

  Worker:

  ```bash
  docker run -d --name taxomind-worker \
    -e TASK_BACKEND=dramatiq \
    -e REDIS_URL=redis://taxomind-redis:6379/0 \
    -v taxomind-data:/app/data \
    taxomind \
    python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1
  ```

- Shared volume: The API and worker must share the same `/app/data` volume for taxonomy files, trained models, and training data.

- Scaling: Increase `--processes` on the worker for more concurrency (each process loads ML models into memory, so size accordingly). You can also run separate worker containers for the `default` and `inference` queues:

  ```bash
  # Inference-only worker
  python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1 --queues inference

  # Everything-else worker
  python -m dramatiq taxomind.workers.tasks --processes 1 --threads 1 --queues default
  ```
- Create a new Docker Compose service in Coolify from this repository.
- Set the following environment variables in Coolify (they are injected into the `.env` file referenced by the compose file):
  - `API_TOKENS`: your production Bearer tokens (comma-separated)
  - `OPENAI_API_KEY`: if using taxonomy enrichment
  - optional `API_PORT`: host port for API publish (`3001` default in compose)
- Keep only `api` publicly exposed; `worker` and `redis` should remain internal.
- Mount a persistent volume at `/app/data` for the `api` and `worker` services (the compose file uses a named volume `app_data` by default).
- The compose file includes `SERVICE_URL_API_3000` on `api` for Coolify routing.
- If your project view shows only raw compose and you need per-service controls, deploy via Services -> Docker Compose Empty instead.
- The health check at `/health` reports (probe sketch below):
  - `status: healthy`: API and Redis are connected
  - `status: degraded`: API is up but Redis is unreachable
  - `task_backend: dramatiq`: confirms the queue backend is active
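A sketch of probing those fields on a running deployment (response shape abbreviated and assumed):

```bash
curl -s http://localhost:3000/health
# e.g. {"status": "healthy", "task_backend": "dramatiq", ...}   (illustrative)
```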
- Use `kedro jupyter lab` or `kedro ipython` for exploratory work; Kedro automatically loads the catalog, parameters, and pipeline registry.
- Run quality checks with `ruff check` and `pytest` (see below).
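For example (target paths are assumptions; adjust to the repository layout):

```bash
ruff check src tests
pytest
```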
MIT. See LICENSE.