Move / DeepBook Fine-Tuning Dataset

This repository produces a high-quality instruction-to-code dataset for fine-tuning a small open model into a Sui Move 2024 + DeepBook v3 coding specialist.

Target model: Gemma 4 E2B (google/gemma-4-E2B-it) trained on vram.ai with QLoRA.
Teacher model: DeepSeek V4 Flash via OpenRouter.

Quick start

# 1. Check your Python version (the pipeline uses the standard library only; there are no third-party packages to install)
python --version   # 3.9+

# 2. Set your OpenRouter key (only needed for M4 synthesis)
cp .env.example .env
# edit .env and fill in OPENROUTER_API_KEY

# 3. Run M1 → M2 → M3 (no API key needed)
python scripts/build_move_dataset.py --stage m1
python scripts/build_move_dataset.py --stage m2
python scripts/build_move_dataset.py --stage m3

# 4. Synthesize pairs (requires OPENROUTER_API_KEY)
set -a; source .env; set +a   # exports every variable defined in .env (or: export OPENROUTER_API_KEY=...)
python scripts/build_move_dataset.py --stage m4

# 5. QC + train/eval split
python scripts/build_move_dataset.py --stage m5

# 6. Run the full compile harness (requires `sui` CLI on PATH)
python scripts/move_eval.py --input data/pairs.jsonl --output eval_report.json
python scripts/move_eval.py --input data/eval.jsonl  --output eval_report_eval.json

Deliverables

| File | Description |
| --- | --- |
| `data/pairs.jsonl` | Training pairs (Section 7.1 schema) |
| `data/eval.jsonl` | Hold-out eval pairs (decontaminated) |
| `data/SOURCES.md` | All data sources with URLs + licenses |
| `data/filter_report.json` | M2 filter count report |
| `scripts/build_move_dataset.py` | Full pipeline |
| `scripts/move_eval.py` | Compile-based eval harness |
| `training/MODELS.md` | HF repo IDs + teacher model config |

Pair schema (data/pairs.jsonl)

{
  "id": "string",
  "instruction": "natural-language task",
  "input": "optional context or partial code",
  "output": "the Move code answer",
  "tags": ["deepbook" | "receiver-syntax" | "dependency-pattern" | "general"],
  "source": "provenance string",
  "compiles": true
}
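
A quick sanity check of rows against this schema can be sketched in stdlib Python. This is an illustrative sketch, not the validation that `scripts/build_move_dataset.py` actually performs. The function names `validate_pair` and `validate_jsonl_lines` are hypothetical, `input` is treated as an optional field (an assumption based on its "optional context" description), and `error-to-fix` is included in the allowed tags because the Guardrails section mentions it.

```python
import json

# Tags listed in the schema, plus "error-to-fix" from the Guardrails section (assumption).
ALLOWED_TAGS = {"deepbook", "receiver-syntax", "dependency-pattern", "general", "error-to-fix"}

# Field name -> expected Python type after json.loads.
REQUIRED_FIELDS = {
    "id": str,
    "instruction": str,
    "output": str,
    "tags": list,
    "source": str,
    "compiles": bool,
}
# Assumption: "input" may be absent, but must be a string when present.
OPTIONAL_FIELDS = {"input": str}


def validate_pair(row):
    """Return a list of problems found in one decoded row (empty list means OK)."""
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in row and not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    tags = row.get("tags")
    if isinstance(tags, list):
        for tag in tags:
            if not isinstance(tag, str) or tag not in ALLOWED_TAGS:
                problems.append(f"unknown tag: {tag}")
    return problems


def validate_jsonl_lines(lines):
    """Validate an iterable of JSONL lines; return {line_number: problems} for bad lines."""
    bad = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            bad[number] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_pair(row)
        if problems:
            bad[number] = problems
    return bad
```

To check a file, pass an open file handle, for example `validate_jsonl_lines(open("data/pairs.jsonl", encoding="utf-8"))`; an empty dict means every row passed.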

Eval report schema (eval_report.json)

{
  "model": "string",
  "n": 0,
  "compile_rate": 0.0,
  "uses_2024_syntax_rate": 0.0,
  "deepbook_task_pass_rate": 0.0,
  "timestamp_ms": 0
}
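
How the rates in this report might be aggregated from per-row results can be sketched as follows. This is a guess at the arithmetic, not the logic of `scripts/move_eval.py`. It assumes each per-row record carries a boolean `compiles`, a boolean `uses_2024_syntax` (a hypothetical field name), and a `tags` list, and it assumes `deepbook_task_pass_rate` means the share of `deepbook`-tagged rows that compile. The function name `build_report` is hypothetical too.

```python
import time


def build_report(model, rows, now_ms=None):
    """Aggregate per-row results into a dict shaped like the eval report schema."""

    def rate(hits, total):
        # Avoid division by zero when there are no rows in the group.
        return round(hits / total, 4) if total else 0.0

    # Assumption: a "DeepBook task" is any row tagged "deepbook".
    deepbook_rows = [r for r in rows if "deepbook" in r.get("tags", [])]

    return {
        "model": model,
        "n": len(rows),
        "compile_rate": rate(sum(1 for r in rows if r.get("compiles")), len(rows)),
        "uses_2024_syntax_rate": rate(
            sum(1 for r in rows if r.get("uses_2024_syntax")), len(rows)
        ),
        "deepbook_task_pass_rate": rate(
            sum(1 for r in deepbook_rows if r.get("compiles")), len(deepbook_rows)
        ),
        # Milliseconds since the Unix epoch, unless the caller pins a value.
        "timestamp_ms": now_ms if now_ms is not None else int(time.time() * 1000),
    }
```

Passing `now_ms` explicitly keeps the output reproducible, which helps when comparing reports across runs.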

Guardrails

No secrets committed. Scan before every push.
Only permissive licenses (MIT / Apache 2.0). See data/SOURCES.md.
Non-compiling synthesized rows are dropped in M5 unless they are tagged `error-to-fix`.
Dataset is private (internal use only).
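
The M5 drop rule for non-compiling rows can be illustrated with a small sketch. It is not the code in `scripts/build_move_dataset.py`; the helper names `keep_row` and `filter_rows` are hypothetical, and the sketch assumes it is given only synthesized rows, each with the `compiles` and `tags` fields from the pair schema.

```python
def keep_row(row):
    """Keep rows that compile, plus non-compiling rows tagged error-to-fix."""
    if row.get("compiles") is True:
        return True
    # Non-compiling rows survive only as deliberate error-to-fix examples.
    return "error-to-fix" in row.get("tags", [])


def filter_rows(rows):
    """Return (kept_rows, dropped_count) for a list of synthesized rows."""
    kept = [row for row in rows if keep_row(row)]
    return kept, len(rows) - len(kept)
```

Returning the dropped count alongside the kept rows makes it easy to log how many rows the rule removed.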
