token-api

A research framework for detecting malicious tokens (ERC-20, ERC-721, ERC-1155) on Ethereum using machine learning.

Overview

token-api is a factory- and config-driven platform for building and evaluating token classification models. It supports multiple model architectures, data sources, and training strategies, which can be composed via YAML configuration files without code changes.

Key capabilities:

  • Multiple model architectures -- graph neural networks, XGBoost, and logistic regression
  • Factory-driven design -- extend with new models, data sources, and training strategies via ModelFactory, DataModelFactory, and FeatureFactory
  • Config-driven pipelines -- swap model, data, and training parameters through YAML configs (see the sketch after this list)
  • Rust-accelerated feature extraction -- high-performance data extraction and feature computation with Python bindings (PyO3/Maturin)
  • SQLMesh data pipeline -- reproducible data transformations backed by DuckDB and HuggingFace
  • MLflow experiment tracking -- track training runs, metrics, and model artifacts
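
The sketch below shows how these pieces are intended to fit together: a YAML config selects the components and the factories assemble them. It is illustrative only; the import paths, factory method names, and config keys are assumptions, not the actual token_api API.

# Illustrative only: import paths, factory method names, and config keys
# are assumptions, not the repository's actual API.
import yaml

from token_api.models import ModelFactory        # assumed import path
from token_api.data import DataModelFactory      # assumed import path
from token_api.trainers import Trainer           # assumed import path

with open("src/assets/configs/trainings/example.yaml") as f:  # hypothetical config
    config = yaml.safe_load(f)

data_model = DataModelFactory.create(config["data"])    # hypothetical factory call
model = ModelFactory.create(config["model"])            # hypothetical factory call
trainer = Trainer(model=model, data_model=data_model, **config["training"])
trainer.train()

The CLI entry point (token_api.main, see Training below) presumably performs this kind of composition from the config path it is given.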

Project Structure

.
├── token-api/                      # Python package (Poetry)
│   ├── src/token_api/              # Core library
│   │   ├── models/                 # Model implementations & ModelFactory
│   │   ├── data/                   # Data models, configs & DataModelFactory
│   │   ├── trainers/               # Training orchestration (Trainer)
│   │   ├── evaluator/              # Evaluation, metrics & MetricFactory
│   │   ├── scripts/                # Data processing utilities
│   │   └── main.py                 # CLI entry point
│   ├── src/assets/configs/         # Model, data, training & pipeline YAML configs
│   ├── configs/                    # Top-level training configurations
│   ├── token_api_data/             # SQLMesh data pipeline (DuckDB + HuggingFace)
│   ├── notebooks/                  # Research & analysis notebooks
│   ├── book/                       # Jupyter Book documentation
│   ├── tests/                      # Test suite
│   └── Makefile                    # SQLMesh & training Make targets
├── crates/tokenscout-dataset/      # Rust crate: high-perf data and feature extraction (PyO3)
├── docs/                           # Extended project documentation
├── .env.example                    # Environment variable template
└── run.sh                          # Root setup & build commands

Models

  • TokenScout (Staged Pipeline) -- multi-phase GNN pipeline: graph embedding, refinement, and classification on transfer graphs
  • TokenFlow GNN -- graph neural network operating on token transfer flow patterns
  • XGBoost -- binary classification of NFT/token spam from on-chain features (grid search, k-fold cross-validation)
  • Logistic Regression -- text-based classification using token metadata (name, symbol, description)

All models extend BaseModel and are registered through ModelFactory. New models can be added by implementing the base class and registering a config.
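
As an illustration of the extension pattern, a new model would roughly look like the sketch below; the BaseModel interface and the ModelFactory registration call shown here are assumptions, not the repository's actual signatures.

# Hypothetical sketch: the BaseModel methods and the ModelFactory
# registration API are assumed, not taken from the repository.
from token_api.models import BaseModel, ModelFactory  # assumed import path

class MyTokenClassifier(BaseModel):        # hypothetical new model
    def fit(self, features, labels):       # assumed training interface
        ...

    def predict(self, features):           # assumed inference interface
        ...

ModelFactory.register("my_token_classifier", MyTokenClassifier)  # assumed API

A matching YAML config (under src/assets/configs/) would then reference the registered name so the factory can instantiate it.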

Getting Started

Prerequisites

  • Python 3.13+
  • Poetry
  • Rust toolchain (for the feature extraction crate)
  • Maturin (for building Python bindings)

Setup

chmod +x run.sh
./run.sh permissions    # Make sub-scripts executable
./run.sh dev            # Install Python dependencies via Poetry
./run.sh rust-build     # Build the Rust crate (release mode)
./run.sh rust-bindings  # Build and install Python bindings into the Poetry venv

Copy .env.example to .env and fill in the required values.
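
After the bindings step, a quick import check confirms the crate is visible from the Poetry environment. The module name below is inferred from the crate name (tokenscout-dataset) and may differ from the actual binding name.

# Smoke test for the Rust bindings; run inside the Poetry venv
# (e.g. via `poetry run python`). The module name is an assumption
# inferred from the crate name.
import tokenscout_dataset

print(tokenscout_dataset.__file__)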

Running Tests

./run.sh pre-commit     # Run all code checks and tests
./run.sh rust-test      # Run Rust crate tests

Data Pipeline

The SQLMesh pipeline lives in token-api/token_api_data/ and uses DuckDB as its local engine. Key Make targets (run from token-api/):

make sqlmesh-init       # First-time setup: download from HuggingFace, run full plan
make sqlmesh-resume     # Incremental fetch + save back to HuggingFace
make sqlmesh-plan       # Run SQLMesh plan with auto-apply
make sqlmesh-run        # Incremental update (no backfill)
make sqlmesh-ui         # Start the SQLMesh web UI
make sqlmesh-clean      # Remove local DuckDB database and data files
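
Once the pipeline has run, the DuckDB database it materializes can be inspected directly from Python. The database path below is a placeholder; the real file location and schema are defined by the SQLMesh project in token_api_data/.

# Inspect the local DuckDB database produced by the SQLMesh pipeline.
# The file path is a placeholder, not the pipeline's real location.
import duckdb

con = duckdb.connect("token_api_data/db.duckdb")   # hypothetical path
print(con.sql("SHOW TABLES").fetchall())           # list materialized models
con.close()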

Training

Training is config-driven. Configs in token-api/src/assets/configs/trainings/ define model, data, and training parameters. Run training via the entry points below:

For ERC-20-related models, use the Makefile targets for local end-to-end workflows:

make train-local        # Train on SQLMesh data
make evaluate-curated   # Evaluate on curated token set
make e2e-help           # Show all E2E training commands

For NFT-related models, run the CLI entry point directly:

cd token-api
poetry run python -m token_api.main <path-to-config.yaml>
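
Because runs are tracked with MLflow (see Key capabilities), metrics and artifacts can be queried back after training. The tracking URI below is an assumption; point MLflow at whatever your .env or training config specifies.

# Query tracked training runs. The tracking URI is an assumption;
# use whatever your .env / training config points MLflow at.
import mlflow

mlflow.set_tracking_uri("file:./mlruns")                  # assumed local store
runs = mlflow.search_runs(search_all_experiments=True)    # pandas DataFrame
print(runs[["run_id", "status", "start_time"]].head())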

Development

Run all checks before committing:

./run.sh pre-commit

Tooling:

Contributing

Contributions are welcome. Please ensure all checks pass via ./run.sh pre-commit before opening a pull request.
