DataSmith Agent — AI-assisted data pipeline builder.
DSAgt connects an MCP-compatible AI coding agent to tool registration, a semantic knowledge base, execution provenance, and observability infrastructure. It provides data-pipeline scaffolding around your existing agent CLI or VS Code extension (Claude Code, Goose, Codex, …).
Prerequisites: Python 3.10–3.13, uv, and one of the supported agent platforms below — already installed and authenticated against whatever LLM provider you intend to use.
| Agent | Install | Verify |
|---|---|---|
| Claude Code | `npm i -g @anthropic-ai/claude-code` | `claude --version` |
| Goose | See Goose docs | `goose --version` |
| Codex | `npm i -g @openai/codex` (or `brew install --cask codex`) | `codex --version` |
| opencode | See opencode docs | `opencode --version` |
| Roo Code | `npm i -g @roo-code/cli` | `roo --version` |
| Cline | `npm i -g cline` | `cline --version` |
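To see which of the agents in the table are already on your PATH, a small script along these lines may help. It is a sketch and not part of DSAgt: the executable names are taken from the Verify column, and `find_installed_agents` is a hypothetical helper.

```python
import shutil

# Executable names taken from the Verify column of the table above.
AGENT_COMMANDS = {
    "Claude Code": "claude",
    "Goose": "goose",
    "Codex": "codex",
    "opencode": "opencode",
    "Roo Code": "roo",
    "Cline": "cline",
}


def find_installed_agents(commands=AGENT_COMMANDS, which=shutil.which):
    """Return {agent name: resolved path} for every agent CLI found on PATH.

    `which` is injectable so the lookup can be exercised without touching
    the real PATH.
    """
    found = {}
    for name, exe in commands.items():
        path = which(exe)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    installed = find_installed_agents()
    if not installed:
        print("No supported agent CLI found on PATH.")
    for name, path in installed.items():
        print(f"{name}: {path}")
```

Finding an executable only shows that it is installed; authentication against your LLM provider still has to be checked in the agent itself.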
Explore DSAgt knowledge ingest, tool registration, provenance, and explicit memory using the mock project in tests/smoke_test/. The walkthrough uses claude; substitute another agent (goose / codex / opencode) if you prefer, since the prompts are agent-agnostic.
# 0. Installation
git clone https://github.com/AI-ModCon/dsagt.git
cd dsagt
uv sync # add --all-groups for the test suite
source .venv/bin/activate # so `dsagt` is on PATH
# Set convenience folder env variable for quickstart demo (not a normal dsagt step)
export SMOKE_DIR="$(pwd)/tests/smoke_test"
# 1. Create a new project called quickstart
dsagt init quickstart --agent claude
# 2. Start MLflow in the background (writes <project>/mlflow.log) and print
# the OTel routing exports for this session, including the resolved
# experiment id:
dsagt mlflow quickstart
# 3. Paste the export block dsagt mlflow printed into THIS shell, then
# launch claude from the project directory:
cd ~/dsagt-projects/quickstart && claude

Inside the agent, paste these prompts one at a time (substitute the absolute path you exported as $SMOKE_DIR; the chat doesn't expand env vars):
- Ingest the docs in `$SMOKE_DIR/knowledge/` into a collection named `knowledge`.
- Register the csvkit CLI tools `csvcut`, `csvgrep`, `csvstat`, and `csvlook`.
- Use the `scan_directory` tool from the registry to scan `$SMOKE_DIR/data/`.
- Summarize `samples.csv` (columns, row count, quality issues) using csvkit tools from the registry.
- Put this in explicit memory: samples.csv has null values in the status and timestamp columns.
- Tell me what you remember about the samples dataset.
Exit the agent (Ctrl+C or /exit), then distill the session into episodic memory and stop the MLflow daemon:
# 4. After your session, distill traces into episodic memory:
dsagt memory --project quickstart
# 5. Stop the MLflow daemon dsagt mlflow started (writes a PID into
# <project>/.runtime; this releases the port and the gunicorn workers):
dsagt stop quickstart

What this exercised:
| Prompt | Layer |
|---|---|
| 1 | Knowledge MCP server (kb_ingest): chunks and indexes docs into ChromaDB |
| 2 | Registry MCP server (save_tool_spec): writes tools/csvcut.md, tools/csvgrep.md, etc. (one per registered tool) |
| 3–4 | dsagt-run provenance wrapper: records exec layer to trace_archive/ |
| 5–6 | Explicit memory (kb_remember → explicit_memories.yaml) + KB recall |
Verify the artifacts and view traces in the MLflow UI (URL printed by dsagt mlflow):
dsagt info quickstart
ls ~/dsagt-projects/quickstart/{tools,trace_archive}
cat ~/dsagt-projects/quickstart/explicit_memories.yaml

The same flow runs non-interactively via dsagt smoke-test --agent claude (or goose / codex / opencode), which asserts each artifact is present.
dsagt setup-kb builds the shared ChromaDB collections under ~/.dsagt/kb_index/ that every project on this machine reuses. Three of the six collections shown in the architecture diagram are populated here — the other three are per-project and fill in automatically during use (see Knowledge Base below):
- Tool Specs: DSAgt's bundled tool specs from `src/dsagt/tools/`, tagged with `source: bundled` so the agent finds them via `search_registry` from the very first session.
- Skills: DSAgt's bundled skill workflows from `src/dsagt/skills/` (e.g. `datacard-generator`), discovered via `search_skills`.
- Domain Knowledge: Reference corpora (NVIDIA NeMo Curator, AI Data Readiness Inspector) downloaded and embedded so the agent has data-curation domain knowledge out of the box.
The Tool Specs and Skills collections are wiped and rebuilt on every run, so re-run setup-kb after upgrading DSAgt to pick up new bundled assets.
dsagt setup-kb # all collections (local embedder, no creds)
dsagt setup-kb --collection nemo_curator
dsagt setup-kb --embedding-backend api --embedding-base-url ... --embedding-api-key ...

The default embedder is a local sentence-transformers model (~130 MB of weights downloaded on first run, CPU-side, no API key). Pass --embedding-backend api to route through a hosted embedder via LiteLLM (15–30 minutes typical for the reference corpora, depending on rate limits).
End-to-end walkthroughs for representative scientific domains live in use_cases/. Each one covers data acquisition, tool registration, pipeline construction, and agent-driven execution against a real dataset.
| Use case | Domain | Guide |
|---|---|---|
| Microbial isolate processing | Genomics: short-read QC and assembly with fastp + megahit | isolate_demo.md |
| Cryo-EM data curation | Structural biology: EMPIAR-10017 β-galactosidase micrographs via CryoPPP | cryoem_demo.md |
| ISAAC / VASP workflows | Materials science: DFT input/output handling with VASP | use_cases/isaac_vasp/ |
Default location: ~/dsagt-projects/<name>/. Override with --location:
dsagt init my-project --agent claude --location /data/runs # /data/runs/my-project/
dsagt init my-project --agent claude --location . # ./my-project/

Projects are registered in ~/.dsagt/projects.yaml so dsagt mlflow <name> and dsagt info <name> work from any directory. The data layer (knowledge base, MLflow store, registered tools, skills, audit records) is agent-agnostic, so re-running dsagt init <same-name> --agent <other> switches platforms while preserving everything you've accumulated.
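The default location and the two `--location` examples above suggest a simple rule: the project directory is `<location>/<name>`, with `~/dsagt-projects` as the fallback location. A minimal sketch of that rule follows. It assumes nothing about DSAgt's internals, and `resolve_project_dir` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path

# Default parent directory, as stated in the text above.
DEFAULT_ROOT = Path.home() / "dsagt-projects"


def resolve_project_dir(name, location=None):
    """Return the absolute directory a project named `name` would live in.

    Hypothetical helper: mirrors the documented examples, not DSAgt's code.
    """
    root = DEFAULT_ROOT if location is None else Path(location).expanduser()
    if not root.is_absolute():
        # Relative locations such as "." are taken relative to the
        # current working directory.
        root = Path.cwd() / root
    return root / name


print(resolve_project_dir("my-project", "/data/runs"))
print(resolve_project_dir("my-project", "."))
print(resolve_project_dir("quickstart"))
```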
~/dsagt-projects/cheese-metagenome/
dsagt_config.yaml # project configuration
tools/ # registered CLI tool specs (markdown + YAML frontmatter)
tools/code/ # agent-written tool scripts
skills/ # agent skills (SKILL.md + reference docs)
trace_archive/ # tool execution records (JSON, from dsagt-run)
mlflow/ # MLflow traces, metrics, artifacts
kb_index/ # knowledge base vector collections
explicit_memories.yaml # user-confirmed facts (written by kb_remember)
# Per-agent runtime config (one of, generated by dsagt init):
# claude: CLAUDE.md, .mcp.json
# goose: goose.yaml, .goosehints
# codex: AGENTS.md, .codex-data/config.toml
# opencode: AGENTS.md, opencode.json
# roo: .roomodes, .roo/mcp.json
# cline: .clinerules/, cline_mcp_settings.json (managed via cline mcp add)
- Registry (`dsagt-registry-server`): Tool registration and dependency installation. Tools are markdown files with YAML frontmatter under `<project>/tools/`. Executables are wrapped with `dsagt-run` for provenance and `uv run --with` for Python dependencies. The agent discovers tools via `search_registry`.
- Knowledge (`dsagt-knowledge-server`): Semantic search over indexed document collections (FAISS / ChromaDB). Background jobs handle long ingest operations. The agent searches via `kb_search`, ingests via `kb_ingest`, and saves user-confirmed facts via `kb_remember`.
Tools are CLI executables defined as markdown files with YAML frontmatter in <project>/tools/. The agent registers new tools via the registry MCP server's save_tool_spec.
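To make the "markdown with YAML frontmatter" layout concrete, here is a rough sketch of how such a file splits into metadata and body. The field names in the example spec (`name`, `command`) are illustrative assumptions, not DSAgt's actual schema, and the parser handles only flat `key: value` lines; a real reader would use a proper YAML library.

```python
def split_frontmatter(text):
    """Split a markdown document into (frontmatter dict, body string).

    Simplified on purpose: only flat `key: value` lines between the two
    `---` delimiters are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text  # no closing delimiter: treat as plain markdown
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical spec; the field names are illustrative only.
EXAMPLE_SPEC = """---
name: csvstat
command: csvstat
---
Print summary statistics for each column of a CSV file.
"""

meta, body = split_frontmatter(EXAMPLE_SPEC)
print(meta["name"], "->", body)
```

Keeping the machine-readable fields in the frontmatter and the free-text description in the body lets the same file serve both the registry and a human reader.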
Skills are instruction-based agent workflows in <project>/skills/. Each skill is a directory containing a SKILL.md and optional reference docs. DSAgt ships with a bundled datacard-generator skill. The agent discovers skills via search_skills.
Six independently-partitioned ChromaDB collections hold everything the agent searches semantically. The first three are global (under ~/.dsagt/kb_index/, populated by dsagt setup-kb); the last three are per-project (under <project>/kb_index/, populated automatically during use):
| Collection | Source | Populated by |
|---|---|---|
| Tool Specs | Bundled CLI tool specs in `src/dsagt/tools/` | `dsagt setup-kb` |
| Skills | Bundled skill workflows in `src/dsagt/skills/` | `dsagt setup-kb` |
| Domain Knowledge | NeMo Curator + AIDRIN reference corpora; user-ingested docs | `dsagt setup-kb` + agent's `kb_ingest` |
| Explicit Memory | User-confirmed facts | Agent's `kb_remember` (also written to `<project>/explicit_memories.yaml`); the agent fetches these via `kb_get_memories` on demand (typically when you ask it to recall), and they are not auto-loaded at session start |
| Episodic Memory | Distilled facts from MLflow traces | `dsagt memory --project <name>` (per-category outlier detection via embedding centroids) |
| Tool Use Records | `dsagt-run` execution traces | `dsagt-run` wrapper writes JSON to `<project>/trace_archive/`; indexed into ChromaDB by `dsagt memory` |
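The Episodic Memory row mentions per-category outlier detection via embedding centroids. One plausible reading is sketched below: average each category's embeddings into a centroid, then flag items whose cosine distance from that centroid exceeds a threshold. The choice of cosine distance, the threshold value, and the category name are assumptions made for illustration; this README does not say how DSAgt defines any of them.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_outliers(embeddings_by_category, threshold=0.5):
    """Return {category: [indices]} of embeddings far from their centroid.

    `threshold` is an assumed cut-off, not a documented DSAgt setting.
    """
    outliers = {}
    for category, vectors in embeddings_by_category.items():
        center = centroid(vectors)
        outliers[category] = [
            i for i, vec in enumerate(vectors)
            if cosine_distance(vec, center) > threshold
        ]
    return outliers


# Toy 2-D "embeddings"; real ones would come from the configured embedder.
demo = {"tool_errors": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]}
print(find_outliers(demo))  # → {'tool_errors': [3]}
```

In the toy data, the fourth vector points in a different direction from the other three, so it is the only one flagged.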
The default embedding backend is local (sentence-transformers, CPU-side, no API needed). Switch to embedding.backend: api in dsagt_config.yaml to route through a hosted embedder via LiteLLM. Cross-encoder reranking is optional (knowledge.rerank: true).
The agent searches via kb_search (knowledge MCP server) and writes via kb_ingest / kb_remember. Tool Specs and Skills are queried through specialized routes (search_registry, search_skills) over the same backend.
MLflow runs at http://localhost:<mlflow_port> (pinned at init time, listed by dsagt info). The trace view shows:
- Knowledge base operations: `kb.search` / `kb.embed` / `kb.index_search` / `kb.rerank` span trees with per-phase timing.
- Tool executions: `tool.execute` spans with exit code, duration, file counts, truncated stderr. Full payload in `trace_archive/<record_id>.json`.
- Registry events: `save_tool_spec`, `install_dependencies`, `reconstruct_pipeline` spans.
- Native agent OTel (optional): when you export `MLFLOW_TRACKING_URI` and `OTEL_EXPORTER_OTLP_ENDPOINT` (printed by `dsagt init`), the agent's own LLM-call traces land in the same MLflow store. Trace coverage varies by agent: claude / goose emit full payloads, codex emits token counts + tool names, opencode emits nothing natively.
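Native agent traces depend on the two environment variables named in the last bullet, so it can be worth checking that both are set in the shell before launching the agent. The sketch below does only that. `missing_trace_vars` is a hypothetical helper, and the values themselves should be the ones DSAgt prints for you.

```python
import os

# Variable names as given in the list above.
REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "OTEL_EXPORTER_OTLP_ENDPOINT")


def missing_trace_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_trace_vars()
    if missing:
        print("Not exported in this shell:", ", ".join(missing))
    else:
        print("Trace routing variables are set.")
```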
Every span carries the project's session.id for filtering. Tool execution records on disk provide the canonical provenance chain — the agent calls reconstruct_pipeline to render the trace archive as a reproducible bash script or Snakemake workflow.
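To illustrate what rendering the trace archive as a bash script could look like, the sketch below loads JSON records from a directory, orders them by start time, and emits one shell-quoted command per successful run. The record fields (`started_at`, `argv`, `exit_code`) and the choice to skip failed runs are assumptions made for this example; the real `trace_archive/` schema and the behaviour of `reconstruct_pipeline` may differ.

```python
import json
import shlex
from pathlib import Path


def load_records(archive_dir):
    """Read every *.json file in `archive_dir` into a list of dicts."""
    records = []
    for path in sorted(Path(archive_dir).glob("*.json")):
        with path.open() as fh:
            records.append(json.load(fh))
    return records


def render_bash_script(records):
    """Render execution records as a bash script, oldest command first.

    Assumed record fields: `started_at` (sortable string), `argv` (list of
    strings), `exit_code` (int). Runs with a non-zero exit code are skipped.
    """
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for rec in sorted(records, key=lambda r: r["started_at"]):
        if rec.get("exit_code", 0) != 0:
            continue
        # shlex.join quotes arguments containing spaces or shell metacharacters.
        lines.append(shlex.join(rec["argv"]))
    return "\n".join(lines) + "\n"


example = [
    {"started_at": "2024-01-02T00:00:00", "argv": ["csvstat", "samples.csv"], "exit_code": 0},
    {"started_at": "2024-01-01T00:00:00", "argv": ["csvcut", "-c", "status", "my file.csv"], "exit_code": 0},
    {"started_at": "2024-01-03T00:00:00", "argv": ["false"], "exit_code": 1},
]
print(render_bash_script(example))
```

Storing the argument vector instead of a pre-joined command string lets the renderer apply the quoting, so filenames with spaces survive the round trip.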
| Command | Description |
|---|---|
| `dsagt init <name> --agent <platform> [--location <path>] [--mlflow-port N]` | Create a project; write per-agent MCP config; print launch one-liner |
| `dsagt mlflow <name>` | Run MLflow in the foreground against a project's store (port pinned at init time) |
| `dsagt memory --project <name>` | Distill new traces from this project's MLflow into episodic memory |
| `dsagt info <name> [--json]` | Resolved config (with source per value) and a session/error summary |
| `dsagt setup-kb [--collection <name>]` | Build the shared core knowledge base collections |
| `dsagt list` | List all projects with agent, status, and path |
| `dsagt mv <name> <new-location>` | Move a project to a new location |
| `dsagt rm <name> [-y] [--keep-files]` | Unregister a project (and optionally delete its directory) |
| `dsagt smoke-test [--agent claude\|goose\|codex\|opencode]` | End-to-end install verification |
For tests, proxy mode, troubleshooting, and other developer-facing material, see developer.md.
