From 0d6972f4f4221bf206ee94a7f8a2495f59540871 Mon Sep 17 00:00:00 2001
From: Marcin Spoczynski <marcin@spoczynski.com>
Date: Sun, 7 Jun 2026 18:44:56 -0400
Subject: [PATCH 1/4] docs: link DSL.md guide from README

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index 63b9e3e..ca51f3f 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@ The optimizer analyzes Triton kernels, identifies performance issues, and applie
 - [CLI Reference](#cli-reference)
 - [Environment Variables Reference](#environment-variables-reference)
 - [Knowledge Base](#knowledge-base)
+  - [Adding a New DSL](#adding-a-new-dsl)
 - [Examples](#examples)
 - [Roofline Plots](#roofline-plots)
 - [Troubleshooting](#troubleshooting)
@@ -681,6 +682,10 @@ Both variables have these defaults, so if you place the `knowledge_base/` direct
 - **Analyzer** receives the critical constraints so it can detect violations (wrong device placement, missing grf_mode declaration, etc.) before flagging issues.
 - **Optimizer** receives the patterns and examples relevant to the current stage — before/after code pairs and real optimized kernels it can learn from.
 
+### Adding a New DSL
+
+Xe Forge is DSL-aware end to end (Triton, Gluon, SYCL, CUDA). To add support for another kernel language — including the knowledge base, executor, and prompt wiring required — see [`DSL.md`](DSL.md).
+
 ---
 
 ## Examples

From e9cbae8aefda1cb67ec0d7dd4ff13661b316c098 Mon Sep 17 00:00:00 2001
From: Marcin Spoczynski <marcin@spoczynski.com>
Date: Sun, 7 Jun 2026 18:45:08 -0400
Subject: [PATCH 2/4] docs: add DSL.md guide for adding new languages

---
 DSL.md | 507 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 507 insertions(+)
 create mode 100644 DSL.md

diff --git a/DSL.md b/DSL.md
new file mode 100644
index 0000000..4e50a52
--- /dev/null
+++ b/DSL.md
@@ -0,0 +1,507 @@
+# Adding a New DSL to Xe Forge
+
+This guide explains how to add support for a new kernel **DSL** (domain-specific
+language / programming model — e.g. Triton, Gluon, SYCL, CUDA C++) to the Xe Forge
+optimization pipeline.
+
+A "DSL" in Xe Forge is the source language the kernels are written in. The pipeline
+is DSL-aware end to end: the **analyzer**, **planner**, **optimizer**, **executor**,
+**knowledge base**, and **prompt library** all branch on the active DSL so that the
+LLM sees the right instructions and the generated code is compiled, run, and verified
+with the right toolchain.
+
+Xe Forge currently ships four DSLs, defined in [src/xe_forge/models.py](src/xe_forge/models.py#L11-L21):
+
+| DSL     | Value    | Language | Executor             |
+|---------|----------|----------|----------------------|
+| Triton  | `triton` | Python   | `KernelBenchExecutor` |
+| Gluon   | `gluon`  | Python   | `KernelBenchExecutor` |
+| SYCL    | `sycl`   | C++      | `SyclExecutor`        |
+| CUDA    | `cuda`   | C++      | `KernelBenchExecutor` |
+
+The DSL is selected at runtime via the `DSL` environment variable or the
+`--dsl` CLI flag, both resolved in [src/xe_forge/config.py](src/xe_forge/config.py#L234-L236)
+into `config.device_config.dsl`.
+
+---
+
+## Table of Contents
+
+1. [Architecture overview](#architecture-overview)
+2. [The pieces you must touch](#the-pieces-you-must-touch)
+3. [Step 1 — Register the DSL enum](#step-1--register-the-dsl-enum)
+4. [Step 2 — Declare supported stages (DSL registry)](#step-2--declare-supported-stages-dsl-registry)
+5. [Step 3 — Provide an executor](#step-3--provide-an-executor)
+6. [Step 4 — Wire executor selection](#step-4--wire-executor-selection)
+7. [Step 5 — Add prompt-library entries](#step-5--add-prompt-library-entries)
+8. [Step 6 — Add analyzer & optimizer signatures](#step-6--add-analyzer--optimizer-signatures)
+9. [Step 7 — Add the knowledge base](#step-7--add-the-knowledge-base)
+10. [Step 8 — (Optional) issue types & stage mapping](#step-8--optional-issue-types--stage-mapping)
+11. [Step 9 — CLI / config plumbing](#step-9--cli--config-plumbing)
+12. [Step 10 — Claude engine templates](#step-10--claude-engine-templates)
+13. [Testing your DSL](#testing-your-dsl)
+14. [Checklist](#checklist)
+
+---
+
+## Architecture overview
+
+A single optimization run flows through [`XeForgePipeline.optimize()`](src/xe_forge/pipeline.py#L166)
+roughly as follows:
+
+```
+spec (YAML) ──► input shapes, dtypes, FLOPs
+kernel code ──► AnalyzerAgent.analyze()        # detects issues (DSL-aware prompt)
+                       │
+                       ▼
+              PlannerAgent.plan()               # orders the stages
+                       │
+       filtered by  get_stages_for_dsl(dsl)     # dsl_registry.py
+                       ▼
+   for each stage:  OptimizerAgent.optimize_stage()
+                       │
+                       ▼  (CoVeR loop: generate → compile → run → compare)
+              Executor.compare_kernels()        # KernelBenchExecutor or SyclExecutor
+                       │
+                       ▼
+              re-analyze, next stage …
+```
+
+Three knobs make this DSL-aware:
+
+- **`config.device_config.dsl`** — the active DSL string, read everywhere.
+- **`DSL.code_language`** — `"python"` or `"cpp"`, used to pick file extensions,
+  comment markers, and DSPy `Code[...]` types ([models.py](src/xe_forge/models.py#L17-L21)).
+- **The DSL registry** — which optimization stages even apply to this DSL
+  ([dsl_registry.py](src/xe_forge/dsl_registry.py)).
+
+The cleanest mental model: **Triton is the reference DSL**. Anything that does not
+special-case your DSL falls back to the Triton path, so the more your DSL resembles
+Triton's Python+KernelBench flow, the less you have to write.
+
+---
+
+## The pieces you must touch
+
+| # | Concern | File(s) | Required? |
+|---|---------|---------|-----------|
+| 1 | DSL identity | [models.py](src/xe_forge/models.py#L11-L21) | **Yes** |
+| 2 | Supported stages | [dsl_registry.py](src/xe_forge/dsl_registry.py) | **Yes** |
+| 3 | Compile / run / verify | [core/executor.py](src/xe_forge/core/executor.py) or [core/sycl_executor.py](src/xe_forge/core/sycl_executor.py) | **Yes** (reuse or new) |
+| 4 | Executor selection | [pipeline.py](src/xe_forge/pipeline.py#L65-L80), [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139) | **Yes** |
+| 5 | Prompt components | [prompts/device_prompts.py](src/xe_forge/prompts/device_prompts.py) | **Yes** |
+| 6 | DSPy signatures | [agents/analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py), [agents/optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py), [agents/react_agent.py](src/xe_forge/agents/react_agent.py) | If code rules differ from Triton |
+| 7 | Knowledge base | [knowledge_base/](knowledge_base/) directory | Optional (recommended) |
+| 8 | Issue types / mapping | [models.py](src/xe_forge/models.py#L45-L104), [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py) | Only for novel issues |
+| 9 | CLI / config | [cli.py](src/xe_forge/cli.py), [config.py](src/xe_forge/config.py) | Usually trivial |
+| 10 | Claude engine templates | [claude/templates/](src/xe_forge/claude/templates/) | Only for the Claude engine |
+
+The rest of this document walks each step in order.
+
+---
+
+## Step 1 — Register the DSL enum
+
+Add a member to the `DSL` `StrEnum` in [src/xe_forge/models.py](src/xe_forge/models.py#L11-L21)
+and make sure `code_language` returns the correct value for it:
+
+```python
+class DSL(StrEnum):
+    TRITON = "triton"
+    GLUON = "gluon"
+    SYCL = "sycl"
+    CUDA = "cuda"
+    MOJO = "mojo"          # <-- new DSL
+
+    @property
+    def code_language(self) -> str:
+        if self in (DSL.SYCL, DSL.CUDA):
+            return "cpp"
+        return "python"     # mojo falls here; add to the cpp tuple if it is C++-like
+```
+
+`code_language` drives:
+
+- the saved-file extension and comment marker in [`_save_results`](src/xe_forge/pipeline.py#L582-L599)
+  (`.py`/`#` vs `.cpp`/`//`), and
+- the `dspy.Code["python"]` / `dspy.Code["cpp"]` type hint used by the CoVeR verify
+  callback ([optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L472)).
+
+> The enum **value** (`"mojo"`) is the string used everywhere else — env var `DSL=mojo`,
+> the `--dsl mojo` flag, knowledge-base directory name, and prompt lookups. Keep it
+> lowercase and stable.
+
+---
+
+## Step 2 — Declare supported stages (DSL registry)
+
+[src/xe_forge/dsl_registry.py](src/xe_forge/dsl_registry.py) maps each DSL to the
+set of `OptimizationStage`s it supports. The planner's output is filtered against
+this set in the pipeline ([pipeline.py](src/xe_forge/pipeline.py#L329-L332)), so any
+stage you omit here will never run for your DSL.
+
+```python
+DSL_SUPPORTED_STAGES: dict[DSL, set[OptimizationStage]] = {
+    ...
+    DSL.MOJO: {
+        OptimizationStage.ANALYSIS,
+        OptimizationStage.ALGORITHMIC,
+        OptimizationStage.DTYPE_FIX,
+        OptimizationStage.FUSION,
+        OptimizationStage.MEMORY_ACCESS,
+        OptimizationStage.DEVICE_SPECIFIC,
+        OptimizationStage.AUTOTUNING,
+        OptimizationStage.DISCOVERY,
+    },
+}
+```
+
+Pick only the stages that make sense for the language. For example,
+`BLOCK_POINTERS` and `PERSISTENT_KERNEL` are Triton/CUDA concepts and are
+deliberately absent from the SYCL set. The full list of stages lives in the
+`OptimizationStage` enum ([models.py](src/xe_forge/models.py#L30-L42)); the canonical
+run order is `list(OptimizationStage)`, which `get_stages_for_dsl()` preserves.
+
+If you skip this step, `get_stages_for_dsl()` falls back to the Triton stage set
+([dsl_registry.py](src/xe_forge/dsl_registry.py#L57)).
+
+---
+
+## Step 3 — Provide an executor
+
+The **executor** is what makes a DSL real: it takes generated source, compiles it,
+runs it on the device, measures time/TFLOPS, and compares the optimized kernel
+against the original for correctness. Its `compare_kernels()` feedback string is fed
+straight back to the LLM inside the CoVeR loop.
+
+Two executors exist today, and they define the contract you must satisfy:
+
+### Option A — Reuse `KernelBenchExecutor` (Python-based DSLs)
+
+[core/executor.py](src/xe_forge/core/executor.py) handles any DSL whose kernels are
+**importable Python modules** exposing either a KernelBench-style `class Model` with
+`forward()`, or a named callable. Triton, Gluon, and CUDA (via Python wrappers /
+`torch.utils.cpp_extension`) all use it. If your DSL is invoked from Python, you
+likely need **no new executor** — just make sure the generated code imports cleanly
+and exposes `Model`/`forward`.
+
+Key methods you can rely on:
+
+- `execute(kernel_code, kernel_name, input_shapes, flop=…, dtype=…, init_args=…, input_dtypes=…)`
+  → `ExecutionResult`
+- `compare_kernels(original_code, optimized_code, …)` → `ComparisonResult` with a
+  `feedback_message` for the agent.
+
+### Option B — Write a new executor (compiled / out-of-process DSLs)
+
+If your DSL needs a separate compiler and runs as a subprocess (like SYCL), model it
+on [core/sycl_executor.py](src/xe_forge/core/sycl_executor.py). The `SyclExecutor`:
+
+- writes source to a temp `.cpp`, compiles via `ai_bench.sycl.compiler.SYCLCompiler`,
+- runs the binary with CLI args, parses `TFlop/s` and `ms` from stdout,
+- compares outputs by dumping `D2.bin` files and `numpy.allclose`,
+- returns a `SyclComparisonResult` whose `feedback_message` mirrors the
+  `KernelBenchExecutor` wording (SUCCESS / REGRESSION / CORRECTNESS FAILURE).
+
+**Contract for any executor** — to drop into the pipeline it must expose:
+
+- `execute(...) -> ExecutionResult` (the model in [models.py](src/xe_forge/models.py#L181-L188))
+- `compare_kernels(...) -> <result with .speedup, .feedback_message, .optimized_correct, .is_slower>`
+
+Export your new class from [core/__init__.py](src/xe_forge/core/__init__.py).
+
+---
+
+## Step 4 — Wire executor selection
+
+Two places choose the executor by DSL. Add a branch for yours (or let it fall through
+to `KernelBenchExecutor` if you reuse Option A):
+
+1. **Pipeline constructor** — [pipeline.py](src/xe_forge/pipeline.py#L65-L80):
+
+   ```python
+   if self.config.device_config.dsl == DSL.SYCL:
+       executor = SyclExecutor(verify=…)
+   else:
+       executor = KernelBenchExecutor(device=…, …)   # triton/gluon/cuda/mojo land here
+   ```
+
+2. **`create_executor_from_config()`** — [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139),
+   used by skills/scripts.
+
+The pipeline also has SYCL-specific branches around baseline measurement
+([pipeline.py](src/xe_forge/pipeline.py#L223-L260) and
+[L500-L516](src/xe_forge/pipeline.py#L500-L516)) that test `isinstance(self.executor, SyclExecutor)`.
+If your DSL needs dims-based (M/N/K) execution rather than `input_shapes`, follow the
+`_is_sycl` pattern; otherwise the `input_shapes` path is used automatically.
+
+---
+
+## Step 5 — Add prompt-library entries
+
+[src/xe_forge/prompts/device_prompts.py](src/xe_forge/prompts/device_prompts.py)
+centralizes all DSL/device-aware prompt text via `PromptLibrary(dsl, device_type)`.
+At minimum, register a human-readable name so the LLM knows what it is writing:
+
+```python
+_DSL_NAMES: dict[str, str] = {
+    "triton": "Triton",
+    "gluon": "Gluon",
+    "sycl": "SYCL/XeTLA",
+    "cuda": "CUDA C++",
+    "mojo": "Mojo",          # <-- new
+}
+```
+
+Then review the methods that branch on `self.dsl` and add your cases as needed:
+
+- `code_requirements()` — DSL-specific validation rules surfaced to the agent
+  ([device_prompts.py](src/xe_forge/prompts/device_prompts.py#L123-L145)).
+- `stage_guidance(stage)` — per-stage hints (e.g. how `block_pointers` or
+  `autotuning` should be done in your DSL)
+  ([device_prompts.py](src/xe_forge/prompts/device_prompts.py#L81-L121)).
+- `optimizer_signature_doc()` / `analyzer_signature_doc()` — the system docstrings.
+
+Anything you don't override degrades to a generic message, which is fine for a first
+cut.
+
+---
+
+## Step 6 — Add analyzer & optimizer signatures
+
+The agents pick a **DSPy signature** based on the DSL. Today the split is binary —
+SYCL vs "everything else (Triton-shaped)":
+
+- **Analyzer** — [analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L332-L335):
+  ```python
+  sig = SyclAnalysisSignature if self.dsl == DSL.SYCL else AnalysisSignature
+  ```
+  The issue-category block is also built per-DSL via `_build_issue_categories(dsl)`
+  ([analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L70-L210)), with SYCL
+  skipping Triton-only issue types (`_SYCL_SKIP_ISSUES`).
+
+- **Optimizer** — [optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L722-L776)
+  selects `SyclOptimizationSignature` / `SyclAlgorithmicOptimizationSignature` for
+  SYCL, else the Triton signatures. The CoVeR verify callback also branches: SYCL
+  goes through `_verify_sycl`, the Triton path runs `ast.parse` + `@triton.jit`/`Model`
+  checks ([optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L479-L519)).
+
+- **ReAct optimizer** (alternative strategy) — same pattern in
+  [react_agent.py](src/xe_forge/agents/react_agent.py#L395).
+
+**If your DSL is Python + KernelBench `Model` shaped** (like Triton), you can reuse the
+default signatures and only adjust `code_requirements()` from Step 5 — the generic
+path will validate and run it.
+
+**If your DSL is C++/compiled or has different code rules**, add new signatures
+(`MojoOptimizationSignature`, `MojoAnalysisSignature`, …) modeled on the SYCL ones and
+extend the `if self.dsl == DSL.MOJO:` branches in the analyzer, optimizer, and react
+agents, plus a `_verify_<dsl>` helper for the verify callback if the structural checks
+differ.
+
+---
+
+## Step 7 — Add the knowledge base
+
+The knowledge base is **optional** (the pipeline runs on the LLM's built-in knowledge
+when it is empty or disabled) but strongly recommended for quality. It is loaded by
+[knowledge/loader.py](src/xe_forge/knowledge/loader.py) and enabled with
+`KNOWLEDGE_BASE_ENABLED=true` + `KNOWLEDGE_DIR=./knowledge_base`.
+
+### Directory layout
+
+`load_knowledge_base(dir, dsl, device_type)` collects YAML files in priority order
+([loader.py](src/xe_forge/knowledge/loader.py#L318-L351)):
+
+```
+knowledge_base/
+├── common/                     # DSL-agnostic, always loaded
+│   ├── algorithmic_patterns.yaml
+│   └── correctness.yaml
+├── <dsl>/                      # e.g. triton, sycl, gluon  →  your new <dsl>/
+│   ├── common/                 # optional: DSL-wide, device-agnostic
+│   └── <device_type>/          # e.g. xpu, cuda
+│       ├── *.yaml              # patterns + constraints for (dsl, device)
+│       └── examples/
+│           ├── index.yaml      # reference-kernel manifest
+│           └── *.py / *.cpp    # the actual before/after kernels
+```
+
+So for a new DSL targeting XPU you would create
+`knowledge_base/mojo/xpu/*.yaml` and `knowledge_base/mojo/xpu/examples/`.
+Look at [knowledge_base/triton/xpu/](knowledge_base/triton/xpu/) and
+[knowledge_base/sycl/xpu/](knowledge_base/sycl/xpu/) as templates.
+
+### Pattern YAML schema
+
+Each YAML file may contain `constraints:` and `patterns:` lists
+([loader.py](src/xe_forge/knowledge/loader.py#L359-L391)):
+
+```yaml
+patterns:
+  - id: large_tiles
+    name: Use large tiles on XPU
+    stage: device_specific        # must map to an OptimizationStage (aliases allowed)
+    description: ...
+    rationale: ...
+    pattern_before: |             # or "before:"
+      ...code...
+    pattern_after: |              # or "after:"
+      ...code...
+    expected_speedup: "2-4x"
+    notes: ...
+
+constraints:
+  - id: grf_mode_constexpr        # the id keyword routes it to a stage
+    name: grf_mode must be constexpr
+    severity: critical
+    description: ...
+```
+
+Notes on the loader's behavior:
+
+- `stage` strings are normalized through `_STAGE_ALIASES`
+  ([loader.py](src/xe_forge/knowledge/loader.py#L29-L48)) — e.g. `memory`, `dtype`,
+  `xpu_specific`, `stream_k` all resolve to canonical stages. Unmappable stages are
+  skipped and logged.
+- **Constraints have no `stage` field**; their target stage is inferred from keywords
+  in their `id` via `_CONSTRAINT_STAGE_HINTS`
+  ([loader.py](src/xe_forge/knowledge/loader.py#L50-L74)). A constraint with no keyword
+  match applies to *all* stages.
+- Only patterns/constraints for stages your DSL supports (Step 2) will ever be shown.
+
+### Examples manifest
+
+`examples/index.yaml` lists reference kernels with `stages:` tags and points at the
+code files ([loader.py](src/xe_forge/knowledge/loader.py#L467-L549)). See
+[knowledge_base/triton/xpu/examples/index.yaml](knowledge_base/triton/xpu/examples/index.yaml):
+
+```yaml
+examples:
+  - id: gemm_activation
+    name: GEMM + Activation Fusion
+    stages: [algorithmic, fusion, device_specific, autotuning]
+    description: ...
+    unoptimized: gemm_activation_unoptimized.py   # or "file:" for optimized-only
+    optimized: gemm_activation_optimized.py
+    optimizations_applied: [ ... ]
+    expected_speedup: 2-4x
+```
+
+If `stages:` is omitted, the loader infers them from keywords in the description /
+`optimizations_applied` ([loader.py](src/xe_forge/knowledge/loader.py#L552-L607)).
+
+How it is consumed: `KnowledgeBase.format_for_stage(stage)` returns only the
+constraints, patterns, and examples relevant to the stage currently running
+([loader.py](src/xe_forge/knowledge/loader.py#L165-L267)), keeping the context window
+lean. The analyzer gets the critical constraints; the optimizer gets the stage's
+patterns and examples.
+
+---
+
+## Step 8 — (Optional) issue types & stage mapping
+
+If your DSL has optimization opportunities **not covered** by the existing
+`IssueType` enum, you can add them — but in most cases you don't need to, because:
+
+- the `OPEN_ENDED` / `DISCOVERY` mechanism lets the LLM propose novel optimizations
+  without a predefined type ([models.py](src/xe_forge/models.py#L100-L104)), and
+- new issue strings are auto-routed to a stage by keyword/prefix inference in
+  [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py#L98-L190).
+
+If you do add a type:
+
+1. Add the member to `IssueType` ([models.py](src/xe_forge/models.py#L45-L104)).
+2. Map it to a stage in `_MAPPING` in
+   [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py#L32-L89) — or rely on the
+   keyword/prefix layers, or call `register_stage(value, stage)` at runtime
+   ([patterns.py](src/xe_forge/knowledge/patterns.py#L199-L213)).
+3. Add a one-line description in `_descriptions` inside `_build_issue_categories`
+   ([analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L100-L172)) so the LLM
+   knows when to emit it.
+4. If the issue is Triton-only and should be hidden from your DSL, add it to a skip
+   set analogous to `_SYCL_SKIP_ISSUES`.
+
+Anything unmapped falls back to `ANALYSIS` and is skipped with a warning, so nothing
+breaks silently.
+
+---
+
+## Step 9 — CLI / config plumbing
+
+The `--dsl` flag is already generic: it accepts any string and sets the `DSL` env var
+([cli.py](src/xe_forge/cli.py#L110-L114), [L242-L243](src/xe_forge/cli.py#L242-L243)),
+which `ConfigManager` reads into `device_config.dsl`
+([config.py](src/xe_forge/config.py#L234-L236)). So a new DSL is usable as
+`--dsl mojo` with no parser change.
+
+Review these DSL-string checks in the CLI and add your DSL where the behavior should
+match a C++/compiled flow rather than the Python/reference flow:
+
+- reference-implementation reading is gated to non-C++ DSLs
+  ([cli.py](src/xe_forge/cli.py#L504), [L519-L521](src/xe_forge/cli.py#L519-L521)) —
+  `if dsl not in ("sycl", "cuda")`.
+- default variant resolution for compiled DSLs
+  ([cli.py](src/xe_forge/cli.py#L428-L431)).
+
+If your DSL needs its own device defaults (tile sizes, warps), either reuse
+`XPUConfig`/`CUDAConfig` or add a new `DeviceConfig` subclass and branch in
+`_build_device_config()` ([config.py](src/xe_forge/config.py#L288-L310)). Note device
+config is keyed on **device type** (`xpu`/`cuda`), not DSL, so this is usually
+unnecessary.
+
+---
+
+## Step 10 — Claude engine templates
+
+Xe Forge has two engines ([engines/](src/xe_forge/engines/)): the automated **DSPy**
+pipeline (everything above) and the **Claude** engine, which generates a ready-to-run
+workspace. The Claude generator renders Jinja templates with the DSL
+([claude/generator.py](src/xe_forge/claude/generator.py#L47-L69)) from
+[claude/templates/](src/xe_forge/claude/templates/) (`CLAUDE.md.j2`,
+`optimize-kernel.md.j2`, `tool-runner.md.j2`, `config.yaml.j2`).
+
+If you want the Claude engine to support your DSL, make sure those templates handle
+the `dsl` variable (instructions, file extensions, build/run commands). The DSPy
+engine and the Claude engine are independent — you can ship a DSL on one without the
+other.
+
+---
+
+## Testing your DSL
+
+1. **Unit-level**: the spec loader and validator have tests in
+   [tests/](tests/) ([test_spec_loader.py](tests/test_spec_loader.py),
+   [test_validator.py](tests/test_validator.py)). Add KB-loading coverage modeled on
+   [runners/test_kb_examples.py](runners/test_kb_examples.py) if you add a knowledge base.
+2. **Knowledge base sanity**: run with `KNOWLEDGE_BASE_ENABLED=true` and check the
+   startup log line `Knowledge base loaded (dsl=…): N patterns, M constraints, K examples`
+   plus any "Skipped … unmappable stage" warnings
+   ([loader.py](src/xe_forge/knowledge/loader.py#L306-L313)).
+3. **End-to-end**: write a small kernel + YAML spec (mirror the pairs in
+   [test_kernels/](test_kernels/)) and run:
+   ```bash
+   python -m xe_forge.cli --dsl mojo --device xpu --kernel my_kernel.<ext> --spec my_kernel.yaml
+   ```
+   Confirm the baseline measures, the planned stages are filtered to your supported
+   set, and the executor's compile/run/compare feedback flows back into the agent.
+
+---
+
+## Checklist
+
+- [ ] **Step 1** — `DSL` enum member + correct `code_language` ([models.py](src/xe_forge/models.py#L11-L21))
+- [ ] **Step 2** — `DSL_SUPPORTED_STAGES` entry ([dsl_registry.py](src/xe_forge/dsl_registry.py))
+- [ ] **Step 3** — executor: reuse `KernelBenchExecutor` or add a new one ([core/](src/xe_forge/core/))
+- [ ] **Step 4** — executor selection in [pipeline.py](src/xe_forge/pipeline.py#L65-L80) and [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139)
+- [ ] **Step 5** — `_DSL_NAMES` + relevant `PromptLibrary` branches ([device_prompts.py](src/xe_forge/prompts/device_prompts.py))
+- [ ] **Step 6** — analyzer/optimizer/react signatures (only if code rules differ from Triton)
+- [ ] **Step 7** — `knowledge_base/<dsl>/<device>/` patterns, constraints, examples (optional)
+- [ ] **Step 8** — new `IssueType`s + stage mapping (only if needed)
+- [ ] **Step 9** — CLI DSL-string checks / device config (usually trivial)
+- [ ] **Step 10** — Claude engine templates (only for the Claude engine)
+- [ ] **Test** — KB load log is clean, stages filter correctly, an end-to-end run compiles/runs/compares
+
+For the minimum viable DSL (Python-based, KernelBench `Model`-shaped, XPU), only
+Steps 1, 2, and 5 are strictly required — everything else falls back to the Triton
+path.

From a6ccdd33763eecbda41b24c1ed4bc84e955270c1 Mon Sep 17 00:00:00 2001
From: Marcin Spoczynski <marcin@spoczynski.com>
Date: Sun, 7 Jun 2026 19:01:53 -0400
Subject: [PATCH 3/4] Shorten dsl readme

---
 DSL.md | 522 ++++++++++++++-------------------------------------------
 1 file changed, 124 insertions(+), 398 deletions(-)

diff --git a/DSL.md b/DSL.md
index 4e50a52..5222535 100644
--- a/DSL.md
+++ b/DSL.md
@@ -1,109 +1,26 @@
 # Adding a New DSL to Xe Forge
 
-This guide explains how to add support for a new kernel **DSL** (domain-specific
-language / programming model — e.g. Triton, Gluon, SYCL, CUDA C++) to the Xe Forge
-optimization pipeline.
-
-A "DSL" in Xe Forge is the source language the kernels are written in. The pipeline
-is DSL-aware end to end: the **analyzer**, **planner**, **optimizer**, **executor**,
-**knowledge base**, and **prompt library** all branch on the active DSL so that the
-LLM sees the right instructions and the generated code is compiled, run, and verified
-with the right toolchain.
-
-Xe Forge currently ships four DSLs, defined in [src/xe_forge/models.py](src/xe_forge/models.py#L11-L21):
-
-| DSL     | Value    | Language | Executor             |
-|---------|----------|----------|----------------------|
-| Triton  | `triton` | Python   | `KernelBenchExecutor` |
-| Gluon   | `gluon`  | Python   | `KernelBenchExecutor` |
-| SYCL    | `sycl`   | C++      | `SyclExecutor`        |
-| CUDA    | `cuda`   | C++      | `KernelBenchExecutor` |
-
-The DSL is selected at runtime via the `DSL` environment variable or the
-`--dsl` CLI flag, both resolved in [src/xe_forge/config.py](src/xe_forge/config.py#L234-L236)
-into `config.device_config.dsl`.
-
----
-
-## Table of Contents
-
-1. [Architecture overview](#architecture-overview)
-2. [The pieces you must touch](#the-pieces-you-must-touch)
-3. [Step 1 — Register the DSL enum](#step-1--register-the-dsl-enum)
-4. [Step 2 — Declare supported stages (DSL registry)](#step-2--declare-supported-stages-dsl-registry)
-5. [Step 3 — Provide an executor](#step-3--provide-an-executor)
-6. [Step 4 — Wire executor selection](#step-4--wire-executor-selection)
-7. [Step 5 — Add prompt-library entries](#step-5--add-prompt-library-entries)
-8. [Step 6 — Add analyzer & optimizer signatures](#step-6--add-analyzer--optimizer-signatures)
-9. [Step 7 — Add the knowledge base](#step-7--add-the-knowledge-base)
-10. [Step 8 — (Optional) issue types & stage mapping](#step-8--optional-issue-types--stage-mapping)
-11. [Step 9 — CLI / config plumbing](#step-9--cli--config-plumbing)
-12. [Step 10 — Claude engine templates](#step-10--claude-engine-templates)
-13. [Testing your DSL](#testing-your-dsl)
-14. [Checklist](#checklist)
-
----
-
-## Architecture overview
-
-A single optimization run flows through [`XeForgePipeline.optimize()`](src/xe_forge/pipeline.py#L166)
-roughly as follows:
-
-```
-spec (YAML) ──► input shapes, dtypes, FLOPs
-kernel code ──► AnalyzerAgent.analyze()        # detects issues (DSL-aware prompt)
-                       │
-                       ▼
-              PlannerAgent.plan()               # orders the stages
-                       │
-       filtered by  get_stages_for_dsl(dsl)     # dsl_registry.py
-                       ▼
-   for each stage:  OptimizerAgent.optimize_stage()
-                       │
-                       ▼  (CoVeR loop: generate → compile → run → compare)
-              Executor.compare_kernels()        # KernelBenchExecutor or SyclExecutor
-                       │
-                       ▼
-              re-analyze, next stage …
-```
-
-Three knobs make this DSL-aware:
-
-- **`config.device_config.dsl`** — the active DSL string, read everywhere.
-- **`DSL.code_language`** — `"python"` or `"cpp"`, used to pick file extensions,
-  comment markers, and DSPy `Code[...]` types ([models.py](src/xe_forge/models.py#L17-L21)).
-- **The DSL registry** — which optimization stages even apply to this DSL
-  ([dsl_registry.py](src/xe_forge/dsl_registry.py)).
-
-The cleanest mental model: **Triton is the reference DSL**. Anything that does not
-special-case your DSL falls back to the Triton path, so the more your DSL resembles
-Triton's Python+KernelBench flow, the less you have to write.
-
----
-
-## The pieces you must touch
-
-| # | Concern | File(s) | Required? |
-|---|---------|---------|-----------|
-| 1 | DSL identity | [models.py](src/xe_forge/models.py#L11-L21) | **Yes** |
-| 2 | Supported stages | [dsl_registry.py](src/xe_forge/dsl_registry.py) | **Yes** |
-| 3 | Compile / run / verify | [core/executor.py](src/xe_forge/core/executor.py) or [core/sycl_executor.py](src/xe_forge/core/sycl_executor.py) | **Yes** (reuse or new) |
-| 4 | Executor selection | [pipeline.py](src/xe_forge/pipeline.py#L65-L80), [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139) | **Yes** |
-| 5 | Prompt components | [prompts/device_prompts.py](src/xe_forge/prompts/device_prompts.py) | **Yes** |
-| 6 | DSPy signatures | [agents/analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py), [agents/optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py), [agents/react_agent.py](src/xe_forge/agents/react_agent.py) | If code rules differ from Triton |
-| 7 | Knowledge base | [knowledge_base/](knowledge_base/) directory | Optional (recommended) |
-| 8 | Issue types / mapping | [models.py](src/xe_forge/models.py#L45-L104), [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py) | Only for novel issues |
-| 9 | CLI / config | [cli.py](src/xe_forge/cli.py), [config.py](src/xe_forge/config.py) | Usually trivial |
-| 10 | Claude engine templates | [claude/templates/](src/xe_forge/claude/templates/) | Only for the Claude engine |
-
-The rest of this document walks each step in order.
+A **DSL** is the source language kernels are written in. Xe Forge is DSL-aware end to
+end — analyzer, planner, optimizer, executor, knowledge base, and prompts all branch
+on the active DSL. It ships with four, defined in `src/xe_forge/models.py`:
+
+| DSL | Value | Language | Executor |
+|-----|-------|----------|----------|
+| Triton | `triton` | Python | `KernelBenchExecutor` |
+| Gluon | `gluon` | Python | `KernelBenchExecutor` |
+| SYCL | `sycl` | C++ | `SyclExecutor` |
+| CUDA | `cuda` | C++ (via Python wrappers) | `KernelBenchExecutor` |
+
+The DSL is chosen with `--dsl <name>` or the `DSL` env var, ending up in
+`config.device_config.dsl`. **Triton is the reference path**: anything that doesn't
+special-case a DSL falls back to it. A Python + KernelBench-`Model`-shaped DSL on XPU
+only needs Steps 1, 2, and 5.
 
 ---
 
 ## Step 1 — Register the DSL enum
 
-Add a member to the `DSL` `StrEnum` in [src/xe_forge/models.py](src/xe_forge/models.py#L11-L21)
-and make sure `code_language` returns the correct value for it:
+`src/xe_forge/models.py`:
 
 ```python
 class DSL(StrEnum):
@@ -111,37 +28,26 @@ class DSL(StrEnum):
     GLUON = "gluon"
     SYCL = "sycl"
     CUDA = "cuda"
-    MOJO = "mojo"          # <-- new DSL
+    MOJO = "mojo"          # new
 
     @property
     def code_language(self) -> str:
-        if self in (DSL.SYCL, DSL.CUDA):
+        if self in (DSL.SYCL, DSL.CUDA):   # add MOJO here if it is C++-like
             return "cpp"
-        return "python"     # mojo falls here; add to the cpp tuple if it is C++-like
+        return "python"
 ```
 
-`code_language` drives:
-
-- the saved-file extension and comment marker in [`_save_results`](src/xe_forge/pipeline.py#L582-L599)
-  (`.py`/`#` vs `.cpp`/`//`), and
-- the `dspy.Code["python"]` / `dspy.Code["cpp"]` type hint used by the CoVeR verify
-  callback ([optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L472)).
-
-> The enum **value** (`"mojo"`) is the string used everywhere else — env var `DSL=mojo`,
-> the `--dsl mojo` flag, knowledge-base directory name, and prompt lookups. Keep it
-> lowercase and stable.
-
----
+`code_language` decides the saved-file extension (`.py`/`.cpp`), the comment marker, and
+the `dspy.Code["python"|"cpp"]` type. The enum value (`"mojo"`) is the string used for the
+flag, env var, and knowledge-base directory, so keep it lowercase and stable.
 
-## Step 2 — Declare supported stages (DSL registry)
+## Step 2 — Declare supported stages
 
-[src/xe_forge/dsl_registry.py](src/xe_forge/dsl_registry.py) maps each DSL to the
-set of `OptimizationStage`s it supports. The planner's output is filtered against
-this set in the pipeline ([pipeline.py](src/xe_forge/pipeline.py#L329-L332)), so any
-stage you omit here will never run for your DSL.
+`src/xe_forge/dsl_registry.py`. The planner output is filtered to this set, so omitted
+stages never run.
 
 ```python
-DSL_SUPPORTED_STAGES: dict[DSL, set[OptimizationStage]] = {
+DSL_SUPPORTED_STAGES = {
     ...
     DSL.MOJO: {
         OptimizationStage.ANALYSIS,
@@ -156,225 +62,112 @@ DSL_SUPPORTED_STAGES: dict[DSL, set[OptimizationStage]] = {
 }
 ```
 
-Pick only the stages that make sense for the language. For example,
-`BLOCK_POINTERS` and `PERSISTENT_KERNEL` are Triton/CUDA concepts and are
-deliberately absent from the SYCL set. The full list of stages lives in the
-`OptimizationStage` enum ([models.py](src/xe_forge/models.py#L30-L42)); the canonical
-run order is `list(OptimizationStage)`, which `get_stages_for_dsl()` preserves.
-
-If you skip this step, `get_stages_for_dsl()` falls back to the Triton stage set
-([dsl_registry.py](src/xe_forge/dsl_registry.py#L57)).
-
----
-
-## Step 3 — Provide an executor
-
-The **executor** is what makes a DSL real: it takes generated source, compiles it,
-runs it on the device, measures time/TFLOPS, and compares the optimized kernel
-against the original for correctness. Its `compare_kernels()` feedback string is fed
-straight back to the LLM inside the CoVeR loop.
+Include only stages that make sense for the language (e.g. SYCL omits `BLOCK_POINTERS`
+and `PERSISTENT_KERNEL`). A DSL with no entry here falls back to the Triton stage set.
 
-Two executors exist today, and they define the contract you must satisfy:
+## Step 3 — Executor
 
-### Option A — Reuse `KernelBenchExecutor` (Python-based DSLs)
+The executor compiles, runs, times, and compares kernels; its `compare_kernels()`
+feedback string is fed back to the LLM. It must expose:
 
-[core/executor.py](src/xe_forge/core/executor.py) handles any DSL whose kernels are
-**importable Python modules** exposing either a KernelBench-style `class Model` with
-`forward()`, or a named callable. Triton, Gluon, and CUDA (via Python wrappers /
-`torch.utils.cpp_extension`) all use it. If your DSL is invoked from Python, you
-likely need **no new executor** — just make sure the generated code imports cleanly
-and exposes `Model`/`forward`.
-
-Key methods you can rely on:
-
-- `execute(kernel_code, kernel_name, input_shapes, flop=…, dtype=…, init_args=…, input_dtypes=…)`
-  → `ExecutionResult`
-- `compare_kernels(original_code, optimized_code, …)` → `ComparisonResult` with a
-  `feedback_message` for the agent.
-
-### Option B — Write a new executor (compiled / out-of-process DSLs)
-
-If your DSL needs a separate compiler and runs as a subprocess (like SYCL), model it
-on [core/sycl_executor.py](src/xe_forge/core/sycl_executor.py). The `SyclExecutor`:
-
-- writes source to a temp `.cpp`, compiles via `ai_bench.sycl.compiler.SYCLCompiler`,
-- runs the binary with CLI args, parses `TFlop/s` and `ms` from stdout,
-- compares outputs by dumping `D2.bin` files and `numpy.allclose`,
-- returns a `SyclComparisonResult` whose `feedback_message` mirrors the
-  `KernelBenchExecutor` wording (SUCCESS / REGRESSION / CORRECTNESS FAILURE).
-
-**Contract for any executor** — to drop into the pipeline it must expose:
-
-- `execute(...) -> ExecutionResult` (the model in [models.py](src/xe_forge/models.py#L181-L188))
-- `compare_kernels(...) -> <result with .speedup, .feedback_message, .optimized_correct, .is_slower>`
-
-Export your new class from [core/__init__.py](src/xe_forge/core/__init__.py).
+```python
+def execute(...) -> ExecutionResult: ...
+def compare_kernels(...):   # result has .speedup, .feedback_message, .optimized_correct, .is_slower
+```
 
----
+- **Python / importable kernels** (Triton, Gluon, CUDA): reuse `KernelBenchExecutor`
+  (`src/xe_forge/core/executor.py`). Just emit code that imports cleanly and exposes a
+  `class Model` with `forward()` (or a named callable). Usually **no new executor needed**.
+- **Compiled / out-of-process** (like SYCL): model a new class on `SyclExecutor`
+  (`src/xe_forge/core/sycl_executor.py`), which writes the source to a temp file, compiles it, runs it
+  as a subprocess, parses the timing, and compares output dumps. Export the new class from `src/xe_forge/core/__init__.py`.
 
 ## Step 4 — Wire executor selection
 
-Two places choose the executor by DSL. Add a branch for yours (or let it fall through
-to `KernelBenchExecutor` if you reuse Option A):
-
-1. **Pipeline constructor** — [pipeline.py](src/xe_forge/pipeline.py#L65-L80):
-
-   ```python
-   if self.config.device_config.dsl == DSL.SYCL:
-       executor = SyclExecutor(verify=…)
-   else:
-       executor = KernelBenchExecutor(device=…, …)   # triton/gluon/cuda/mojo land here
-   ```
+Two spots pick the executor. Add a branch or let it fall through to
+`KernelBenchExecutor`:
 
-2. **`create_executor_from_config()`** — [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139),
-   used by skills/scripts.
+```python
+# src/xe_forge/pipeline.py  (constructor)  and  src/xe_forge/core/__init__.py (create_executor_from_config)
+if config.device_config.dsl == DSL.SYCL:
+    executor = SyclExecutor(...)
+else:
+    executor = KernelBenchExecutor(...)   # triton/gluon/cuda/mojo
+```
 
-The pipeline also has SYCL-specific branches around baseline measurement
-([pipeline.py](src/xe_forge/pipeline.py#L223-L260) and
-[L500-L516](src/xe_forge/pipeline.py#L500-L516)) that test `isinstance(self.executor, SyclExecutor)`.
-If your DSL needs dims-based (M/N/K) execution rather than `input_shapes`, follow the
-`_is_sycl` pattern; otherwise the `input_shapes` path is used automatically.
+If your DSL runs from M/N/K dims instead of `input_shapes`, follow the `_is_sycl`
+branches in `pipeline.py`.
 
----
+## Step 5 — Prompt library
 
-## Step 5 — Add prompt-library entries
-
-[src/xe_forge/prompts/device_prompts.py](src/xe_forge/prompts/device_prompts.py)
-centralizes all DSL/device-aware prompt text via `PromptLibrary(dsl, device_type)`.
-At minimum, register a human-readable name so the LLM knows what it is writing:
+`src/xe_forge/prompts/device_prompts.py`. At minimum register the display name:
 
 ```python
-_DSL_NAMES: dict[str, str] = {
+_DSL_NAMES = {
     "triton": "Triton",
-    "gluon": "Gluon",
     "sycl": "SYCL/XeTLA",
-    "cuda": "CUDA C++",
-    "mojo": "Mojo",          # <-- new
+    "mojo": "Mojo",          # new
 }
 ```
 
-Then review the methods that branch on `self.dsl` and add your cases as needed:
-
-- `code_requirements()` — DSL-specific validation rules surfaced to the agent
-  ([device_prompts.py](src/xe_forge/prompts/device_prompts.py#L123-L145)).
-- `stage_guidance(stage)` — per-stage hints (e.g. how `block_pointers` or
-  `autotuning` should be done in your DSL)
-  ([device_prompts.py](src/xe_forge/prompts/device_prompts.py#L81-L121)).
-- `optimizer_signature_doc()` / `analyzer_signature_doc()` — the system docstrings.
-
-Anything you don't override degrades to a generic message, which is fine for a first
-cut.
-
----
+Then add cases as needed in `code_requirements()` (validation rules) and
+`stage_guidance(stage)` (per-stage hints). Unhandled cases degrade to generic text.
 
-## Step 6 — Add analyzer & optimizer signatures
+## Step 6 — Agent signatures (only if code rules differ from Triton)
 
-The agents pick a **DSPy signature** based on the DSL. Today the split is binary —
-SYCL vs "everything else (Triton-shaped)":
+Agents pick a DSPy signature per DSL — today it's SYCL vs Triton-shaped:
 
-- **Analyzer** — [analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L332-L335):
-  ```python
-  sig = SyclAnalysisSignature if self.dsl == DSL.SYCL else AnalysisSignature
-  ```
-  The issue-category block is also built per-DSL via `_build_issue_categories(dsl)`
-  ([analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L70-L210)), with SYCL
-  skipping Triton-only issue types (`_SYCL_SKIP_ISSUES`).
-
-- **Optimizer** — [optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L722-L776)
-  selects `SyclOptimizationSignature` / `SyclAlgorithmicOptimizationSignature` for
-  SYCL, else the Triton signatures. The CoVeR verify callback also branches: SYCL
-  goes through `_verify_sycl`, the Triton path runs `ast.parse` + `@triton.jit`/`Model`
-  checks ([optimizer_agent.py](src/xe_forge/agents/optimizer_agent.py#L479-L519)).
-
-- **ReAct optimizer** (alternative strategy) — same pattern in
-  [react_agent.py](src/xe_forge/agents/react_agent.py#L395).
-
-**If your DSL is Python + KernelBench `Model` shaped** (like Triton), you can reuse the
-default signatures and only adjust `code_requirements()` from Step 5 — the generic
-path will validate and run it.
-
-**If your DSL is C++/compiled or has different code rules**, add new signatures
-(`MojoOptimizationSignature`, `MojoAnalysisSignature`, …) modeled on the SYCL ones and
-extend the `if self.dsl == DSL.MOJO:` branches in the analyzer, optimizer, and react
-agents, plus a `_verify_<dsl>` helper for the verify callback if the structural checks
-differ.
-
----
-
-## Step 7 — Add the knowledge base
+```python
+# analyzer_agent.py
+sig = SyclAnalysisSignature if self.dsl == DSL.SYCL else AnalysisSignature
+# optimizer_agent.py: SyclOptimizationSignature / SyclAlgorithmicOptimizationSignature for SYCL, else the Triton signatures
+```
 
-The knowledge base is **optional** (the pipeline runs on the LLM's built-in knowledge
-when it is empty or disabled) but strongly recommended for quality. It is loaded by
-[knowledge/loader.py](src/xe_forge/knowledge/loader.py) and enabled with
-`KNOWLEDGE_BASE_ENABLED=true` + `KNOWLEDGE_DIR=./knowledge_base`.
+Python+`Model`-shaped DSLs reuse the default signatures (only adjust Step 5). For a
+C++/compiled DSL, add `MojoAnalysisSignature` / `MojoOptimizationSignature` modeled on
+the SYCL ones, extend the `if self.dsl == DSL.MOJO:` branches in
+`analyzer_agent.py`, `optimizer_agent.py`, `react_agent.py`, and add a `_verify_<dsl>`
+helper for the CoVeR verify callback if the structural checks differ from the Triton
+`ast.parse` + `@triton.jit`/`Model` checks.
 
-### Directory layout
+## Step 7 — Knowledge base (optional, recommended)
 
-`load_knowledge_base(dir, dsl, device_type)` collects YAML files in priority order
-([loader.py](src/xe_forge/knowledge/loader.py#L318-L351)):
+Loaded by `src/xe_forge/knowledge/loader.py`, enabled with
+`KNOWLEDGE_BASE_ENABLED=true`. Layout (priority: `common` → `<dsl>/common` → `<dsl>/<device>`):
 
 ```
 knowledge_base/
-├── common/                     # DSL-agnostic, always loaded
-│   ├── algorithmic_patterns.yaml
-│   └── correctness.yaml
-├── <dsl>/                      # e.g. triton, sycl, gluon  →  your new <dsl>/
-│   ├── common/                 # optional: DSL-wide, device-agnostic
-│   └── <device_type>/          # e.g. xpu, cuda
-│       ├── *.yaml              # patterns + constraints for (dsl, device)
-│       └── examples/
-│           ├── index.yaml      # reference-kernel manifest
-│           └── *.py / *.cpp    # the actual before/after kernels
+├── common/                  # DSL-agnostic, always loaded
+└── mojo/xpu/                # your <dsl>/<device>
+    ├── *.yaml               # patterns + constraints
+    └── examples/
+        ├── index.yaml
+        └── *.py / *.cpp
 ```
 
-So for a new DSL targeting XPU you would create
-`knowledge_base/mojo/xpu/*.yaml` and `knowledge_base/mojo/xpu/examples/`.
-Look at [knowledge_base/triton/xpu/](knowledge_base/triton/xpu/) and
-[knowledge_base/sycl/xpu/](knowledge_base/sycl/xpu/) as templates.
-
-### Pattern YAML schema
-
-Each YAML file may contain `constraints:` and `patterns:` lists
-([loader.py](src/xe_forge/knowledge/loader.py#L359-L391)):
+Pattern / constraint YAML:
 
 ```yaml
 patterns:
   - id: large_tiles
     name: Use large tiles on XPU
-    stage: device_specific        # must map to an OptimizationStage (aliases allowed)
+    stage: device_specific        # aliases ok: memory, dtype, xpu_specific, stream_k...
     description: ...
     rationale: ...
-    pattern_before: |             # or "before:"
+    pattern_before: |
       ...code...
-    pattern_after: |              # or "after:"
+    pattern_after: |
       ...code...
     expected_speedup: "2-4x"
-    notes: ...
 
 constraints:
-  - id: grf_mode_constexpr        # the id keyword routes it to a stage
+  - id: grf_mode_constexpr        # stage inferred from keywords in the id
     name: grf_mode must be constexpr
     severity: critical
     description: ...
 ```
 
-Notes on the loader's behavior:
-
-- `stage` strings are normalized through `_STAGE_ALIASES`
-  ([loader.py](src/xe_forge/knowledge/loader.py#L29-L48)) — e.g. `memory`, `dtype`,
-  `xpu_specific`, `stream_k` all resolve to canonical stages. Unmappable stages are
-  skipped and logged.
-- **Constraints have no `stage` field**; their target stage is inferred from keywords
-  in their `id` via `_CONSTRAINT_STAGE_HINTS`
-  ([loader.py](src/xe_forge/knowledge/loader.py#L50-L74)). A constraint with no keyword
-  match applies to *all* stages.
-- Only patterns/constraints for stages your DSL supports (Step 2) will ever be shown.
-
-### Examples manifest
-
-`examples/index.yaml` lists reference kernels with `stages:` tags and points at the
-code files ([loader.py](src/xe_forge/knowledge/loader.py#L467-L549)). See
-[knowledge_base/triton/xpu/examples/index.yaml](knowledge_base/triton/xpu/examples/index.yaml):
+Examples manifest (`examples/index.yaml`):
 
 ```yaml
 examples:
@@ -382,126 +175,59 @@ examples:
     name: GEMM + Activation Fusion
     stages: [algorithmic, fusion, device_specific, autotuning]
     description: ...
-    unoptimized: gemm_activation_unoptimized.py   # or "file:" for optimized-only
+    unoptimized: gemm_activation_unoptimized.py    # or "file:" for optimized-only
     optimized: gemm_activation_optimized.py
-    optimizations_applied: [ ... ]
     expected_speedup: 2-4x
 ```
 
-If `stages:` is omitted, the loader infers them from keywords in the description /
-`optimizations_applied` ([loader.py](src/xe_forge/knowledge/loader.py#L552-L607)).
-
-How it is consumed: `KnowledgeBase.format_for_stage(stage)` returns only the
-constraints, patterns, and examples relevant to the stage currently running
-([loader.py](src/xe_forge/knowledge/loader.py#L165-L267)), keeping the context window
-lean. The analyzer gets the critical constraints; the optimizer gets the stage's
-patterns and examples.
-
----
-
-## Step 8 — (Optional) issue types & stage mapping
-
-If your DSL has optimization opportunities **not covered** by the existing
-`IssueType` enum, you can add them — but in most cases you don't need to, because:
-
-- the `OPEN_ENDED` / `DISCOVERY` mechanism lets the LLM propose novel optimizations
-  without a predefined type ([models.py](src/xe_forge/models.py#L100-L104)), and
-- new issue strings are auto-routed to a stage by keyword/prefix inference in
-  [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py#L98-L190).
+`format_for_stage()` shows only the constraints/patterns/examples for the stage
+currently running, so context stays lean. Copy `knowledge_base/triton/xpu/` or
+`knowledge_base/sycl/xpu/` as a starting template.
 
-If you do add a type:
+## Step 8 — Issue types (only if needed)
 
-1. Add the member to `IssueType` ([models.py](src/xe_forge/models.py#L45-L104)).
-2. Map it to a stage in `_MAPPING` in
-   [knowledge/patterns.py](src/xe_forge/knowledge/patterns.py#L32-L89) — or rely on the
-   keyword/prefix layers, or call `register_stage(value, stage)` at runtime
-   ([patterns.py](src/xe_forge/knowledge/patterns.py#L199-L213)).
-3. Add a one-line description in `_descriptions` inside `_build_issue_categories`
-   ([analyzer_agent.py](src/xe_forge/agents/analyzer_agent.py#L100-L172)) so the LLM
-   knows when to emit it.
-4. If the issue is Triton-only and should be hidden from your DSL, add it to a skip
-   set analogous to `_SYCL_SKIP_ISSUES`.
+Usually skip this: the LLM can propose novel optimizations via the `OPEN_ENDED` /
+`DISCOVERY` path, and unknown issue strings are auto-routed by keyword/prefix in
+`src/xe_forge/knowledge/patterns.py`. To add a real type: add it to `IssueType`
+(`models.py`), map it in `_MAPPING` (`patterns.py`), and give it a description in
+`_build_issue_categories` (`analyzer_agent.py`).
 
-Anything unmapped falls back to `ANALYSIS` and is skipped with a warning, so nothing
-breaks silently.
+## Step 9 — CLI / config
 
----
-
-## Step 9 — CLI / config plumbing
-
-The `--dsl` flag is already generic: it accepts any string and sets the `DSL` env var
-([cli.py](src/xe_forge/cli.py#L110-L114), [L242-L243](src/xe_forge/cli.py#L242-L243)),
-which `ConfigManager` reads into `device_config.dsl`
-([config.py](src/xe_forge/config.py#L234-L236)). So a new DSL is usable as
-`--dsl mojo` with no parser change.
+`--dsl` already accepts any string. Just check the DSL-string gates in
+`src/xe_forge/cli.py` (e.g. `if dsl not in ("sycl", "cuda")` for reading the reference
+implementation and default variant) and add your DSL where it should follow the
+compiled-flow path instead of the Python/reference path. Device defaults are keyed on
+device type, not DSL, so `config.py` rarely needs changes.
 
-Review these DSL-string checks in the CLI and add your DSL where the behavior should
-match a C++/compiled flow rather than the Python/reference flow:
+## Step 10 — Claude engine templates (only for the Claude engine)
 
-- reference-implementation reading is gated to non-C++ DSLs
-  ([cli.py](src/xe_forge/cli.py#L504), [L519-L521](src/xe_forge/cli.py#L519-L521)) —
-  `if dsl not in ("sycl", "cuda")`.
-- default variant resolution for compiled DSLs
-  ([cli.py](src/xe_forge/cli.py#L428-L431)).
-
-If your DSL needs its own device defaults (tile sizes, warps), either reuse
-`XPUConfig`/`CUDAConfig` or add a new `DeviceConfig` subclass and branch in
-`_build_device_config()` ([config.py](src/xe_forge/config.py#L288-L310)). Note device
-config is keyed on **device type** (`xpu`/`cuda`), not DSL, so this is usually
-unnecessary.
+The Claude engine renders `src/xe_forge/claude/templates/*.j2` with the `dsl`
+variable. If you want it to support your DSL, make those templates handle the new
+value (build/run commands, extensions). The DSPy and Claude engines are independent.
 
 ---
 
-## Step 10 — Claude engine templates
-
-Xe Forge has two engines ([engines/](src/xe_forge/engines/)): the automated **DSPy**
-pipeline (everything above) and the **Claude** engine, which generates a ready-to-run
-workspace. The Claude generator renders Jinja templates with the DSL
-([claude/generator.py](src/xe_forge/claude/generator.py#L47-L69)) from
-[claude/templates/](src/xe_forge/claude/templates/) (`CLAUDE.md.j2`,
-`optimize-kernel.md.j2`, `tool-runner.md.j2`, `config.yaml.j2`).
+## Test
 
-If you want the Claude engine to support your DSL, make sure those templates handle
-the `dsl` variable (instructions, file extensions, build/run commands). The DSPy
-engine and the Claude engine are independent — you can ship a DSL on one without the
-other.
-
----
-
-## Testing your DSL
-
-1. **Unit-level**: the spec loader and validator have tests in
-   [tests/](tests/) ([test_spec_loader.py](tests/test_spec_loader.py),
-   [test_validator.py](tests/test_validator.py)). Add KB-loading coverage modeled on
-   [runners/test_kb_examples.py](runners/test_kb_examples.py) if you add a knowledge base.
-2. **Knowledge base sanity**: run with `KNOWLEDGE_BASE_ENABLED=true` and check the
-   startup log line `Knowledge base loaded (dsl=…): N patterns, M constraints, K examples`
-   plus any "Skipped … unmappable stage" warnings
-   ([loader.py](src/xe_forge/knowledge/loader.py#L306-L313)).
-3. **End-to-end**: write a small kernel + YAML spec (mirror the pairs in
-   [test_kernels/](test_kernels/)) and run:
-   ```bash
-   python -m xe_forge.cli --dsl mojo --device xpu --kernel my_kernel.<ext> --spec my_kernel.yaml
-   ```
-   Confirm the baseline measures, the planned stages are filtered to your supported
-   set, and the executor's compile/run/compare feedback flows back into the agent.
+```bash
+python -m xe_forge.cli --dsl mojo --device xpu --kernel my_kernel.<ext> --spec my_kernel.yaml
+```
 
----
+Confirm: baseline measures, planned stages are filtered to your supported set, the KB
+load log (`Knowledge base loaded (dsl=mojo): N patterns ...`) is clean with no
+"unmappable stage" warnings, and the executor's compile/run/compare feedback flows
+back into the agent. Mirror the kernel+spec pairs in `test_kernels/`.
 
 ## Checklist
 
-- [ ] **Step 1** — `DSL` enum member + correct `code_language` ([models.py](src/xe_forge/models.py#L11-L21))
-- [ ] **Step 2** — `DSL_SUPPORTED_STAGES` entry ([dsl_registry.py](src/xe_forge/dsl_registry.py))
-- [ ] **Step 3** — executor: reuse `KernelBenchExecutor` or add a new one ([core/](src/xe_forge/core/))
-- [ ] **Step 4** — executor selection in [pipeline.py](src/xe_forge/pipeline.py#L65-L80) and [core/__init__.py](src/xe_forge/core/__init__.py#L118-L139)
-- [ ] **Step 5** — `_DSL_NAMES` + relevant `PromptLibrary` branches ([device_prompts.py](src/xe_forge/prompts/device_prompts.py))
-- [ ] **Step 6** — analyzer/optimizer/react signatures (only if code rules differ from Triton)
-- [ ] **Step 7** — `knowledge_base/<dsl>/<device>/` patterns, constraints, examples (optional)
-- [ ] **Step 8** — new `IssueType`s + stage mapping (only if needed)
-- [ ] **Step 9** — CLI DSL-string checks / device config (usually trivial)
-- [ ] **Step 10** — Claude engine templates (only for the Claude engine)
-- [ ] **Test** — KB load log is clean, stages filter correctly, an end-to-end run compiles/runs/compares
-
-For the minimum viable DSL (Python-based, KernelBench `Model`-shaped, XPU), only
-Steps 1, 2, and 5 are strictly required — everything else falls back to the Triton
-path.
+- [ ] Step 1 — `DSL` enum + `code_language` (`models.py`)
+- [ ] Step 2 — `DSL_SUPPORTED_STAGES` entry (`dsl_registry.py`)
+- [ ] Step 3 — executor: reuse `KernelBenchExecutor` or add one (`core/`)
+- [ ] Step 4 — executor selection (`pipeline.py`, `core/__init__.py`)
+- [ ] Step 5 — `_DSL_NAMES` + `PromptLibrary` branches (`prompts/device_prompts.py`)
+- [ ] Step 6 — agent signatures (only if code rules differ from Triton)
+- [ ] Step 7 — `knowledge_base/<dsl>/<device>/` (optional)
+- [ ] Step 8 — issue types (only if needed)
+- [ ] Step 9 — CLI DSL-string checks (`cli.py`)
+- [ ] Step 10 — Claude engine templates (only for the Claude engine)

From de09b8813cc3824e117b2264cb269c8db53d76a3 Mon Sep 17 00:00:00 2001
From: Marcin Spoczynski <marcin@spoczynski.com>
Date: Wed, 10 Jun 2026 08:11:23 -0700
Subject: [PATCH 4/4] Update Claude and move other readme files to the separate
 directory

---
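The validator dispatch that the new Step 10 in docs/DSL.md describes (a
`_validate_<dsl>` branch, with unknown DSLs falling back to the Triton checks)
can be sketched roughly as below. This is a minimal illustration under stated
assumptions, not the code in `src/xe_forge/core/validator.py`: the local `DSL`
enum, the `_VALIDATORS` table, `_validate_mojo`, the module-level `validate()`
function, and the string checks themselves are all hypothetical stand-ins.

```python
from enum import Enum


class DSL(str, Enum):
    """Stand-in for xe_forge.models.DSL; only the string values matter here."""

    TRITON = "triton"
    SYCL = "sycl"
    MOJO = "mojo"  # hypothetical new DSL


def _validate_triton(code: str) -> list:
    # Placeholder check; the real Triton validation is more thorough.
    return [] if "@triton.jit" in code else ["missing @triton.jit kernel"]


def _validate_sycl(code: str) -> list:
    # Placeholder check for the SYCL path.
    return [] if "sycl::" in code else ["no sycl:: usage found"]


def _validate_mojo(code: str) -> list:
    # Hypothetical structural check for the new DSL.
    return [] if "fn " in code else ["no Mojo fn definition found"]


# One entry per DSL that has its own structural checks (assumed layout).
_VALIDATORS = {
    DSL.TRITON: _validate_triton,
    DSL.SYCL: _validate_sycl,
    DSL.MOJO: _validate_mojo,
}


def validate(code: str, dsl: str = "triton") -> list:
    """Return a list of problems; an empty list means the code passed."""
    try:
        key = DSL(dsl)
    except ValueError:
        key = DSL.TRITON  # unknown DSL string: fall back to the Triton checks
    return _VALIDATORS.get(key, _validate_triton)(code)
```

A lookup table keeps the Triton fallback in one place; an `if`/`elif` chain
inside the existing `validate` method would behave the same way.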
 README.md                       |  8 ++++----
 DSL.md => docs/DSL.md           | 25 +++++++++++++++++++++++--
 EXAMPLES.md => docs/EXAMPLES.md |  0
 TILE.md => docs/TILE.md         |  0
 VTUNE.md => docs/VTUNE.md       |  0
 src/xe_forge/cli.py             |  2 +-
 6 files changed, 28 insertions(+), 7 deletions(-)
 rename DSL.md => docs/DSL.md (87%)
 rename EXAMPLES.md => docs/EXAMPLES.md (100%)
 rename TILE.md => docs/TILE.md (100%)
 rename VTUNE.md => docs/VTUNE.md (100%)

diff --git a/README.md b/README.md
index ca51f3f..315651f 100644
--- a/README.md
+++ b/README.md
@@ -229,7 +229,7 @@ python -m xe_forge.cli --dsl sycl --tile-tune \
 python -m xe_forge.cli --dsl sycl --tile-tune --tune-config tune.yaml
 ```
 
-LLM-driven tile configuration tuning for CUTLASS SYCL kernels (GEMM, Flash Attention V2, MoE GEMM, Grouped GEMM) on Intel Xe GPUs. Uses a propose-validate-benchmark loop: an LLM proposes tile shapes, a hardware validator checks them against Intel Xe DPAS constraints, and valid configs are compiled and benchmarked on the GPU. Supports batch tuning via YAML configs. See [TILE.md](TILE.md) for the full setup guide, supported kernel types, and YAML config format.
+LLM-driven tile configuration tuning for CUTLASS SYCL kernels (GEMM, Flash Attention V2, MoE GEMM, Grouped GEMM) on Intel Xe GPUs. Uses a propose-validate-benchmark loop: an LLM proposes tile shapes, a hardware validator checks them against Intel Xe DPAS constraints, and valid configs are compiled and benchmarked on the GPU. Supports batch tuning via YAML configs. See [docs/TILE.md](docs/TILE.md) for the full setup guide, supported kernel types, and YAML config format.
 
 ---
 
@@ -582,7 +582,7 @@ xe-forge --input KERNEL --spec SPEC [OPTIONS]
 | `--max-trials` | Max optimization trials (default: 10) |
 | `--trials-dir` | Trial state directory (default: `./trials`) |
 | `--no-trials` | Disable trial tracking |
-| `--vtune` | Enable VTune GPU profiling (see [VTUNE.md](VTUNE.md)) |
+| `--vtune` | Enable VTune GPU profiling (see [docs/VTUNE.md](docs/VTUNE.md)) |
 | `--vtune-bin` | Path to VTune binary |
 | `--workspace` | Workspace directory (Claude engine only) |
 
@@ -684,13 +684,13 @@ Both variables have these defaults, so if you place the `knowledge_base/` direct
 
 ### Adding a New DSL
 
-Xe Forge is DSL-aware end to end (Triton, Gluon, SYCL, CUDA). To add support for another kernel language — including the knowledge base, executor, and prompt wiring required — see [`DSL.md`](DSL.md).
+Xe Forge is DSL-aware end to end (Triton, Gluon, SYCL, CUDA). To add support for another kernel language — including the knowledge base, executor, and prompt wiring required — see [`docs/DSL.md`](docs/DSL.md).
 
 ---
 
 ## Examples
 
-Curated kernels from [KernelBench](https://github.com/ScalingIntelligence/KernelBench) Level 2 and the [Intel XPU Triton benchmarks](https://github.com/intel/intel-xpu-backend-for-triton/tree/main/benchmarks/triton_kernels_benchmark), organized by category (GEMM, fused ops, reductions, attention, mixed ops). See [`EXAMPLES.md`](EXAMPLES.md) for the full list and usage instructions.
+Curated kernels from [KernelBench](https://github.com/ScalingIntelligence/KernelBench) Level 2 and the [Intel XPU Triton benchmarks](https://github.com/intel/intel-xpu-backend-for-triton/tree/main/benchmarks/triton_kernels_benchmark), organized by category (GEMM, fused ops, reductions, attention, mixed ops). See [`docs/EXAMPLES.md`](docs/EXAMPLES.md) for the full list and usage instructions.
 
 ---
 
diff --git a/DSL.md b/docs/DSL.md
similarity index 87%
rename from DSL.md
rename to docs/DSL.md
index 5222535..2e663d3 100644
--- a/DSL.md
+++ b/docs/DSL.md
@@ -200,7 +200,27 @@ implementation and default variant) and add your DSL where it should follow the
 compiled-flow path instead of the Python/reference path. Device defaults are keyed on
 device type, not DSL, so `config.py` rarely needs changes.
 
-## Step 10 — Claude engine templates (only for the Claude engine)
+## Step 10 — Skills folder
+
+`src/xe_forge/skills/` is a set of thin CLI wrappers around the core modules (`validate`,
+`benchmark`, `analyze`, `profile`, `trial`). Even though the skills just call core, their
+DSL knowledge is hardcoded and must be updated:
+
+```python
+# src/xe_forge/skills/__init__.py — add the value to every --dsl choices list
+p_validate.add_argument("--dsl", default="triton",
+                        choices=["triton", "sycl", "gluon", "cuda", "mojo"])
+```
+
+- `skills/benchmark.py` constructs `KernelBenchExecutor` directly — switch it to
+  `create_executor_from_config(...)` (or branch on the DSL) so a compiled DSL gets the
+  right executor.
+- `skills/validate.py` forwards `--dsl` to `KernelValidator.validate(code, dsl=...)`
+  in `src/xe_forge/core/validator.py`, which dispatches `_validate_triton` /
+  `_validate_sycl`. Add a `_validate_<dsl>` branch there (else it falls back to the
+  Triton checks).
+
+## Step 11 — Claude engine templates (only for the Claude engine)
 
 The Claude engine renders `src/xe_forge/claude/templates/*.j2` with the `dsl`
 variable. If you want it to support your DSL, make those templates handle the new
@@ -230,4 +250,5 @@ back into the agent. Mirror the kernel+spec pairs in `test_kernels/`.
 - [ ] Step 7 — `knowledge_base/<dsl>/<device>/` (optional)
 - [ ] Step 8 — issue types (only if needed)
 - [ ] Step 9 — CLI DSL-string checks (`cli.py`)
-- [ ] Step 10 — Claude engine templates (only for the Claude engine)
+- [ ] Step 10 — skills folder: `--dsl` choices, executor, validator (`skills/`, `core/validator.py`)
+- [ ] Step 11 — Claude engine templates (only for the Claude engine)
diff --git a/EXAMPLES.md b/docs/EXAMPLES.md
similarity index 100%
rename from EXAMPLES.md
rename to docs/EXAMPLES.md
diff --git a/TILE.md b/docs/TILE.md
similarity index 100%
rename from TILE.md
rename to docs/TILE.md
diff --git a/VTUNE.md b/docs/VTUNE.md
similarity index 100%
rename from VTUNE.md
rename to docs/VTUNE.md
diff --git a/src/xe_forge/cli.py b/src/xe_forge/cli.py
index 8f0ad00..b7aa7f6 100644
--- a/src/xe_forge/cli.py
+++ b/src/xe_forge/cli.py
@@ -188,7 +188,7 @@ def _parse_args():
     tune_group.add_argument(
         "--tune-config",
         type=str,
-        help="YAML config file for multi-workload tile tuning (see TILE.md)",
+        help="YAML config file for multi-workload tile tuning (see docs/TILE.md)",
     )
     tune_group.add_argument("--m", type=int, default=4096, help="GEMM M dimension")
     tune_group.add_argument("--gemm-n", type=int, default=4096, help="GEMM N dimension")