Conversational AI agent that uses scikit-rec as its tool belt. The agent reasons about the user's data and goals, then calls scikit-rec APIs via structured tool use to build, evaluate, and compare recommendation systems.
```bash
pip install scikit-rec-agent[anthropic]        # with Claude
pip install scikit-rec-agent[openai]           # with GPT-4
pip install scikit-rec-agent                   # bring your own LLM
pip install scikit-rec-agent[anthropic,torch]  # + deep-learning models
```

```bash
export ANTHROPIC_API_KEY=...
scikit-rec-agent chat
```

Auto-detects the provider from env vars. Pass `--provider {anthropic,openai}` if both are set.
```python
import anthropic

from scikit_rec_agent import Agent
from scikit_rec_agent.llm.anthropic import AnthropicAdapter

agent = Agent(llm=AnthropicAdapter(anthropic.Anthropic()))

for event in agent.chat_turn("I have click data at /data/interactions.csv — help me build a ranker"):
    ...
```

See examples/customizations/ for:
- `custom_tool.py` — register a user-defined tool
- `custom_prompt.py` — extend or replace the system prompt
- `custom_llm.py` — plug in your company's internal LLM via the `BaseLLM` protocol
- `custom_frontend.py` — drive the agent from Jupyter / Slack / web
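As a rough illustration of the `custom_llm.py` path: the adapter only needs to satisfy the `BaseLLM` protocol. The class and method below are hypothetical placeholders (the real protocol and its method signatures live in the example file), so treat this as a shape sketch rather than the actual interface:

```python
from scikit_rec_agent import Agent


class InternalLLM:
    """Hypothetical adapter for a company-internal chat endpoint.

    The method name and signature here are assumptions -- copy the real
    BaseLLM protocol from examples/customizations/custom_llm.py.
    """

    def complete(self, messages, tools=None):
        # Call your internal endpoint here and return its reply in the
        # shape the protocol expects.
        raise NotImplementedError


agent = Agent(llm=InternalLLM())
```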
See examples/transcripts/ for full captured chat sessions:
- `movielens_session.md` — the sweep flow: compare 7 methods on MovieLens-1M
- `movielens_hierarchical_session.md` — the one-model design flow: walk through the picker step by step on the same data
Fifteen tools cover the full scikit-rec workflow — from raw data to a saved, tuned model:
| Tool | What it does |
|---|---|
| `profile_data` | Loads a CSV/parquet and reports shape, dtypes, sparsity, target type, and temporal range. Heuristic role detection for USER_ID / ITEM_ID / OUTCOME / TIMESTAMP. |
| `validate_data` | Checks a file against scikit-rec's required schema. Suggests column-rename mappings when names are close. |
| `transform_data` | Reshapes a raw file into one of nine scikit-rec contracts (long, long-with-timestamp, long-multi-reward, wide multi-output, multiclass, prebuilt sequences, sessions, user features, item features). Auto-detects source shape; applies pivot, melt, aggregate, dedupe, and cast as needed. Preserves user features across wide↔long reshapes. Surfaces a `dropped_targets` manifest when single-class ITEM_* columns are auto-dropped. |
| `create_datasets` | Builds scikit-rec Dataset handles from file paths. Auto-generates schemas from dtypes; auto-dispatches to InteractionsDataset / InteractionMultiOutputDataset / InteractionMultiClassDataset. For wide multi-output / multi-class bundles, auto-merges user features into the interactions frame (the wide scorers reject a separate users frame). Refuses bad joins upfront via a USER_ID overlap check. |
| `split_data` | Splits a bundle into train/valid/test using temporal, leave-last-n-per-user, random-split-per-user, leave-n-users-out, or random-split. Errors loudly on degenerate splits (e.g. per-user split on one-row-per-user data). |
| `list_compatible_options` | Drives the hierarchical model-design flow: walks the user through recommender_type → scorer_type → estimator_type → model_type → hyperparameters one step at a time. Each option carries a `what_it_is` / `when_to_pick` / `tradeoff_vs_alternatives` triple. The terminal step returns an `assembled_config` that plugs straight into `train_model`. |
| `train_model` | Trains a recommender from a RecommenderConfig dict via scikit-rec's factory. Accepts a `scorer_config` block (e.g. `{'on_degenerate_target': 'constant'}` to enable MultioutputScorer's constant-predictor fallback for single-class targets). Auto-picks a curated default config when none is supplied. Failure envelopes carry a `category` from the diagnose registry plus a one-line hint. |
| `sweep_methods` | Trains and evaluates multiple methods on the same bundle and returns a ranked leaderboard. Modes: `list` (menu only — adds a `reshape_recommendation` field when wide_multioutput could be broadened by melting to long), `auto` (data-aware filter + hyperparameter resize), `all` (every entry — requires `confirmed_all=True`), `broad` (every capability-compatible triple), or explicit method dicts / short_names. Requires `drop_non_winners` to be set explicitly on >100K-row bundles. Idempotent across re-runs. |
| `diagnose_training_failure` | Pattern-matches a failed `train_model` envelope against a 26-pattern registry and returns ranked candidate fixes with structured actions. Auto-retries the top safe fix; bounded by `max_retries` to prevent loops. Multioutput-specific patterns (binary-only targets, retriever incompatibility, item_subset rejection, users-frame rejection, single-class targets) are matched before generic sklearn fallbacks, so the more specific diagnosis fires first. |
| `evaluate_model` | Runs offline evaluation on a trained model with any of 7 evaluator types × 9 metrics at multiple k values. Auto-builds `eval_kwargs` from the bundle's validation interactions for the simple evaluator (including the wide multi-output shape — `(n_users, n_targets)` logged_rewards from ITEM_* columns). Returns per-target metrics with `per_label=True` on MultioutputScorer (classification + regression) or long-format UniversalScorer (roc_auc / pr_auc). Non-@k metrics (rmse, mae, roc_auc, pr_auc) are computed once regardless of `k_values`. |
| `compare_models` | Renders a markdown leaderboard across all (or a chosen subset of) trained models in the session, sorted by a primary metric. |
| `run_hpo` | Optuna-driven hyperparameter search over a user-specified `search_space`. Persists the best config and writes the tuned model into the session. |
| `save_model` | Persists a trained model to the local file-based registry with optional tags. |
| `list_models` | Lists saved models in the registry with their metadata and tags. |
| `load_model` | Restores a saved model into the current session for further use. |
The system prompt is built at import time from scikit-rec's live enum maps, so new recommender / scorer / estimator types get picked up automatically.
The wide_multioutput contract — one row per user, several ITEM_* columns as joint prediction targets — is fully supported end-to-end:
- Classifier and regressor modes: binary `ITEM_*` columns route to MultioutputScorer (classifier); continuous `ITEM_*` route to regressor mode. The auto-sweep ships both `xgb_multioutput` and `xgb_multioutput_regression`; the data profile picks the right one based on the column dtype.
- Per-target metrics: pass `per_label=True` to `evaluate_model` or `sweep_methods` to get `Dict[str, float]` keyed by ITEM_* name. The macro-averaged scalar is the default; per-target is the deliberate "show me each label" path.
- Degenerate single-class targets: `transform_data` auto-drops them by default and lists them in `dropped_targets`. To keep them with a constant-predictor fallback, pass `scorer_config={'on_degenerate_target': 'constant'}` to `train_model` instead — `degenerate_targets` then surfaces in the train envelope.
- Long-format equivalent: melt the wide contract into long_interactions with `transform_data` to broaden the comparison to the universal-scorer methods (XGBoost, MF, NCF, Two-Tower, DCN, NFM). Side features are preserved across the reshape.
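For concreteness, a minimal sketch of the wide contract as a pandas frame, plus the melt that takes it to the long shape. The column values and the `age_bucket` feature column are made up for illustration; in practice `transform_data` does this reshaping for you:

```python
import pandas as pd

# One row per user, several ITEM_* columns as joint binary targets,
# plus an optional user-feature column (illustrative names and values).
wide = pd.DataFrame({
    "USER_ID":     [1, 2, 3],
    "age_bucket":  ["18-24", "25-34", "25-34"],
    "ITEM_comedy": [1, 0, 1],
    "ITEM_drama":  [0, 1, 1],
    "ITEM_horror": [0, 0, 1],
})

# Roughly what melting to the long contract looks like -- one row per
# (user, item) pair, with the ITEM_* value becoming the outcome.
long = wide.melt(
    id_vars=["USER_ID", "age_bucket"],
    var_name="ITEM_ID",
    value_name="OUTCOME",
)
```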
Five user-decision points are enforced as MissingDecision error envelopes at the tool layer (not just prompt guidance) so an LLM that ignores the system prompt can't silently default through them:
- Primary metric — required schema field on `sweep_methods` and `compare_models`.
- `per_label` — required on multioutput bundles with ≥2 ITEM_* targets (default None → MissingDecision).
- `drop_non_winners` — required on bundles >100K rows (default None → MissingDecision). MF + NCF + Two-Tower together can hold 1–3 GB of user embeddings.
- `methods='all'` — requires `confirmed_all=True` so the menu-pick flow isn't bypassed.
- Reshape vs stay — `sweep_methods(methods='list')` on wide_multioutput surfaces a `reshape_recommendation` field for the agent to relay.
When you receive a MissingDecision envelope, read its message for the question and re-call the tool with the user's answer in the named parameter.
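A sketch of that handling in Python. The envelope contents and the `primary_metric` parameter name are made-up illustrations; only the keys follow the envelope shape documented in the tool-envelope section below:

```python
# `envelope` is whatever your tool invocation returned; the contents here are
# a fabricated MissingDecision example for illustration.
envelope = {
    "status": "error",
    "error_type": "MissingDecision",
    "message": "Which metric should rank the sweep leaderboard?",
}

if envelope["status"] == "error" and envelope["error_type"] == "MissingDecision":
    question = envelope["message"]   # the question to relay to the user
    answer = "ndcg@10"               # whatever the user says
    # Re-call the same tool with the answer in the named parameter, e.g.
    # sweep_methods(..., primary_metric=answer); the parameter name is illustrative.
```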
The agent expects natural language. There's no DSL, no required prompt structure — just describe your data and goal. Two main paths cover most workflows; the agent picks based on what you ask for.
For "I want results — show me which method works best on my data."
"I have click data at /data/interactions.csv with users in users.csv
and items in items.csv. Compare a few methods and tell me which works
best."
What happens:
- `profile_data` + `validate_data` on each file
- `transform_data` if the shape doesn't match the target contract
- `create_datasets` + `split_data`
- `sweep_methods(methods="list")` — agent surfaces the menu (XGBoost, MF, NCF, Two-Tower, DCN, NFM, SASRec — whichever fit your data) with brief descriptions, asks you to pick or say "all"
- `sweep_methods(methods=[...])` — trains + evaluates the picked methods, returns a ranked leaderboard
- Agent reports the winner; offers to save / run HPO
The auto-sweep table is data-aware by default (methods="auto"): MF only runs in the high-sparsity regime, embedding methods only when n_rows ≥ 5K, sequential only with timestamps. Hyperparameters are tier-sized to your data scale. Pass methods="all" to override the filter and run every entry as-is.
See examples/transcripts/movielens_session.md — 7 methods on MovieLens-1M, SASRec wins with NDCG@10 ≈ 0.021.
For "I want to understand the choice, not just see a leaderboard."
"Walk me through how to choose a recommender for this data. I want
to understand the design space."
What happens — the agent walks the hierarchical flow via list_compatible_options:
- `recommender_type` — Ranking? Sequential? Uplift? Bandits? Each option carries a `what_it_is / when_to_pick / tradeoff_vs_alternatives` triple. Options that don't fit your data (e.g. sequential when there's no TIMESTAMP) are filtered out automatically.
- `scorer_type` — given your previous pick, what scoring strategy applies. Universal / independent / multioutput / multiclass / sequential / hierarchical, again with explanations.
- `estimator_type` — tabular (XGBoost) / embedding (MF, NCF, Two-Tower, DCN, NFM) / sequential (SASRec, HRNN). Filtered by data size (embedding needs ≥5K rows, etc).
- `model_type` — pick the specific family.
- Terminal step — agent shows the data-tier-sized default hyperparameters with `what_it_is` and `why_this_default`. Three actions:
  - `train_with_defaults` — accept the sized defaults, train one model
  - `train_with_overrides` — change specific hyperparameters before training
  - `run_hpo` — search the pre-suggested ranges via Optuna
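For a feel of what the terminal step hands back, here is an illustrative `assembled_config`. The top-level keys mirror the flow steps above, but the exact RecommenderConfig schema and the hyperparameter names are assumptions, so treat this as a sketch rather than the real payload:

```python
# Illustrative only -- field and hyperparameter names are assumptions,
# not the verified RecommenderConfig schema.
assembled_config = {
    "recommender_type": "ranking",
    "scorer_type": "universal",
    "estimator_type": "embedding",
    "model_type": "two_tower",
    "hyperparameters": {
        "embedding_dim": 64,   # data-tier-sized default
        "epochs": 10,
    },
}

# train_with_defaults / train_with_overrides boil down to handing a dict
# like this to the train_model tool.
```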
Uplift gets one extra step. Picking recommender_type=uplift adds required_recommender_params to the terminal payload — control_item_id (which ITEM_ID is the control / no-recommendation case?) and mode (T-Learner / S-Learner / X-Learner, each with its own triple). Both are user-supplied; the agent won't silently default them. train_with_defaults is blocked for uplift; you go through train_with_overrides.
See examples/transcripts/movielens_hierarchical_session.md for a real walk-through on MovieLens-1M.
| Goal | Path |
|---|---|
| "What works best on my data?" | A — sweep |
| "Should I use sequential or ranking? Help me choose." | B — design |
| "Compare 3 specific methods I picked." | A — sweep with explicit methods=[...] |
| "I want uplift. Help me set it up." | B — design |
| "Bulk compare everything and run HPO on the winner." | A — sweep, then run_hpo on the winner |
| "Train one specific model I already have a config for." | Skip both — use train_model directly |
"train_model errored — here's the envelope: {error_type: ValueError,
message: 'Input contains NaN', ...}. Help."
The agent calls diagnose_training_failure, pattern-matches the error against a 26-pattern registry, returns ranked candidate fixes with structured actions. Bounded retries (max 2 per model_name) prevent loops; if the category is unknown, it surfaces the raw error to you instead of guessing.
The agent asks targeted clarifying questions when the data or goal is genuinely ambiguous:
- "Which column is your timestamp?" — when
profile_data's heuristic role detection is uncertain - "Ranking or sequential? Your data has timestamps so both are valid" — when a design choice has real tradeoffs
- "Your
gendercolumn isM/Fstrings. Drop it, label-encode (0/1), or one-hot?" — whentrain_modelwould otherwise fail on object-dtype features - "What's your control item ID for uplift?" — when the design path lands on uplift and there's no sensible default
It does not ask you to write any code. Tool calls happen behind the scenes; you only see them if you watch chat_turn's event stream.
Every tool returns a JSON envelope. The shape is:
{"status": "ok", "data": {...}} // success
{"status": "error", "error_type": "...", "message": "...", // failure
"hint": "actionable next step", "category": "diagnostics-bucket"}category and hint are populated by the diagnose registry — when present they tell you exactly what failed and how to react. Sweep leaderboards sort by your primary metric; rows with status: "error" are kept in the leaderboard with their per-method failure category for later inspection.
The agent runs two deterministic detectors on every turn's output:
- URL echo check — flags `https://...` links the model introduces that the user did not supply this session. Shipped adapters have no web retrieval, so model-introduced URLs are common fabrications.
- Foreign-reference check — scans fenced Python blocks for imports and bare-alias usage outside `{skrec, scikit_rec, scikit_rec_agent, stdlib}`. Library APIs we own have a runtime backstop via the scikit-rec factory; external libraries don't.
Warnings are emitted as AgentEvent(type="warning") and never enter conversation history. Opt out with Agent(..., enable_safeguards=False).
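In a custom frontend you can route these warnings separately from normal output. A sketch follows; attribute access on the event objects is an assumption based on the `AgentEvent(type="warning")` shape above:

```python
import anthropic

from scikit_rec_agent import Agent
from scikit_rec_agent.llm.anthropic import AnthropicAdapter

agent = Agent(llm=AnthropicAdapter(anthropic.Anthropic()))

for event in agent.chat_turn("Compare a few methods on /data/interactions.csv"):
    # `type` is the field shown above for warnings; other fields are assumptions.
    if getattr(event, "type", None) == "warning":
        print(f"[safeguard] {event}")
    else:
        print(event)
```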
The detectors are deliberately narrow. They catch the common confident-plausible-looking fabrication case with near-zero false positives, not every possible hallucination. What they do not catch:
- Semantic errors inside trusted APIs (wrong `RecommenderConfig` shape, poor metric choice). The scikit-rec factory catches bad configs at `train_model`; the rest is on the user.
- Invented keyword arguments for external libraries. We flag `pandas` as unverified, not the specific `make_up_kwarg=True`.
- Fabricated dataset names, paper citations, or prose claims. We only inspect URLs and Python code blocks.
- Adversarial evasion (aliased `importlib`, f-string import args, triple-backticks inside docstrings, `ast.parse`-rejecting blocks).
See scikit_rec_agent/safeguards.py for the full contract.
See agentic_design.md for the authoritative spec.
Contributions welcome — see CONTRIBUTING.md for dev setup, test commands, and where new work fits best.
Apache-2.0