Skip to content

Add CoreMatch dialect: surface pattern matching for Strata Core#1161

Open
julesmt wants to merge 8 commits into
main2from
features/julesmt/pattern_matching
Open

Add CoreMatch dialect: surface pattern matching for Strata Core#1161
julesmt wants to merge 8 commits into
main2from
features/julesmt/pattern_matching

Conversation

@julesmt
Copy link
Copy Markdown
Member

@julesmt julesmt commented May 13, 2026

Summary

Adds the CoreMatch dialect: a Core extension giving surface-level
pattern matching at both expression and statement positions, with
two recursion styles that lower to identical Core.

  • Expression-level match lowers to an application of the
    auto-generated structural eliminator D$Elim. Recursive functions
    written this way carry no termination obligation — the eliminator
    is total by construction.
  • Statement-level match lowers to a nested if chain over the
    auto-generated testers D..isCᵢ, with pattern binders prepended
    as var b : T := D..fᵢ(scrut); init statements.
  • The arm keyword anchors expression-level arms against ambiguity
    with Core operators.

Two recursion surfaces, one lowering

// Natural style: write the call.
rec function length(xs : List) : int {
  match xs {
    arm Nil()                       => 0;
    arm Cons(hd : int, tl : List)   => 1 + length(tl);
  }
};

// Cata style: bind the precomputed result.
function length(xs : List) : int {
  match xs {
    arm Nil()                                  => 0;
    arm Cons(hd : int, tl : List, lenTl : int) => 1 + lenTl;
  }
};

Both forms compile to the same List$Elim application. The natural
style runs through a small two-pass transformation: tryRecRewrite
swaps each f(field) self-call for a sentinel fvar during body
translation, and desugarNaturalToCata injects rec-result binders,
lifts existing bvars past them, and substitutes the sentinels for
the matching bvars.

What CoreMatch can do today

  • Statement-level match with constructor and wildcard arms,
    and any number of pattern binders per constructor.
  • Expression-level match with the head arm anchoring the
    result type and the arm keyword on tail arms.
  • Datatype declarations with multiple constructors, each with
    zero or more typed fields; constructors can mix recursive and
    non-recursive fields freely.
  • Wildcard arms (_ =>) covering one or more uncovered
    constructors; redundancy checker rejects a wildcard that's already
    redundant given the other arms.
  • Recursive functions in two surface styles:
    • Natural — f(field) self-calls on recursive constructor fields,
      rewritten internally to rec-result references.
    • Cata — explicit rec-result binders appended to the pattern
      binders, typed as the function's return type.
  • Multiple recursive fields per constructor — e.g. binary tree
    DFS with both left and right recursive fields, each producing
    its own auto-injected rec-result binder.
  • Mutually recursive function blocks — every function in the
    same rec function … and … block is a valid self-call target.
  • Nested matches — an arm body can itself be a match.
  • Constructor calls in arm bodies — return a freshly-built
    datatype value from a recursive case (map_inc, filter_pos,
    etc.).
  • Heterogeneously typed fields within a single datatype.
  • Procedure specsrequires / ensures clauses, with
    free attributes and user-supplied labels, both preserved through
    lowering.
  • Full statement repertoire in procedure bodies — assert,
    assume, if/else, non-deterministic if *, labelled blocks,
    var n : T; (no rhs), var n : T := e;, assignments.
  • Type breadthint, bool, string, real, every
    bv{1,8,16,32,64} width, Map K V, Sequence T, arrow types,
    and arbitrarily nested type constructors.
  • Diagnostics with source ranges for non-exhaustive match,
    duplicate or unknown constructor in an arm, redundant or multiple
    wildcards, non-structural recursive calls (f(otherList),
    f(reverse(tl)), f as a higher-order value), and wrong arm
    binder counts.

What CoreMatch cannot do today

  • Polymorphic datatypes — declarations are monomorphic. The
    translator's toCoreDatatype accepts an empty typeParams list
    only.
  • Recursion through type constructors — the natural-style
    rewrite only fires for f(field) where field's type matches the
    scrutinee datatype directly. Children stored in Sequence T or
    Map K T aren't lifted across the type constructor, so sum on
    an n-ary Tree { Node(value : int, children : Sequence Tree) }
    cannot be written structurally.
  • Non-structural recursion in general — merge sort, merge,
    Ackermann, gcd, etc. all require either an explicit decreases
    measure or a different fixed-point principle. CoreMatch has no
    surface form for decreases clauses yet; the escape hatch is to
    fall back to Core's procedure-level termination machinery.
  • Or-patterns and nested patterns — each arm matches a single
    constructor at depth 1. Nested decomposition needs a nested
    match inside the arm body.
  • Guards on arms — no when … or if … guard clause; push
    the predicate into the body.
  • Literal patterns — patterns bind constructor field names,
    never literal values. Match Succ(Succ(_)) is not parseable.
  • Cross-module / cross-dialect datatypes — every match's
    scrutinee datatype must be cached during the same Translate
    walk. Datatypes from imports aren't visible to the redundancy
    check or natural-style rewrite.
  • Match expressions as conditions or arbitrary subterms in
    statements
    — the only way to lift an expression-level match
    into statement position is via var x := match …;.

Jules added 8 commits May 8, 2026 14:10
translateSyntaxDef collapsed any single-ident syntax to .passthrough,
but the parser path (Parser.opSyntaxParser) hardcodes the passthrough's
target as argDecls[0].  For rules like

  fn matchExprArms_one (tp : Type, a : MatchExprArm tp) : MatchExprArms tp =>
    a;

the actual target is argDecls[1] (because tp is an implicit type
parameter at index 0).  Passthrough sent the parser to argDecls[0],
which is the implicit Type slot, producing nonsensical syntax trees
and triggering parsing-ambiguity panics in elabExpr.

Fix: only enable passthrough when the single ident references
argDecls[0].  Otherwise emit .std atoms prec — the parser will route
through liftToKind with the correct ident index.
translateSyntaxDef now also folds bound type variables in the result
type into the implicit-arg map, so e.g.

  fn match_expr (dt : Type, tp : Type, scrut : dt, ..., body : tp) : tp =>
    "match " scrut " { ... " body " ... }";

is accepted: `tp` is reachable through both `body : tp` (already worked)
and the result `: tp` (new).  Matches the existing convention that
implicit type params must appear *somewhere* in the rule's signature.
Lays the programmatic foundation for the dialect.  Two modules:

* `Strata/Languages/CoreMatch/CoreMatch.lean` — the `MExpr` / `MArm` /
  `MStmt` / `MStmtArm` / `MFunction` / `MProcedure` / `MDecl` /
  `MProgram` types.  Constructor arms carry a *pre-curried case
  function* `caseFn = λb₁:T₁. … λbₖ:Tₖ. body` in de Bruijn form, and
  statement-level arms carry pre-lowered bodies that already include
  the `var b : T := D..f(scrut);` binder inits.  Both choices let the
  lowering pass collapse to a single straightforward translation.

* `Strata/Languages/CoreMatch/ToCore.lean` — lowering an `MProgram`
  to a `Core.Program`.  Expression-level matches lower to
  `D$Elim scrut case₁ … caseₙ` with cases reordered into the
  datatype's declaration order; constructors with no explicit arm
  fall through to the wildcard arm (padded with `λ_:T.` lambdas to
  match arity), or to a `__coreMatch_missing_<ctor>` sentinel fvar
  when no wildcard is present (unreachable when the redundancy
  checker has run).  Statement matches lower to a nested-if chain
  over the auto-generated testers `D..isCᵢ`; uncovered constructors
  fall through to the wildcard arm or to `assert false` with a
  `coreMatch_nonexhaustive` label.

This commit is self-contained: no DDM grammar, no source parser, no
verifier integration.  The next commits add the redundancy checker
(ArmCheck) and the source pipeline.
`Strata/Languages/CoreMatch/ArmCheck.lean` — a small pure module that
inspects a list of arm "keys" against a datatype's constructor list
and reports any of:

* `unknownConstructor` — constructor name not in the datatype
* `duplicateConstructor` — same constructor named twice
* `multipleWildcards` — more than one `_ => …` arm
* `redundantWildcard` — wildcard appears though every constructor is
  already covered explicitly
* `nonExhaustive` — at least one constructor has no arm and there's no
  wildcard

`ArmKey = Option String` is a deliberately abstract view of an arm
(`some ctor` or `none` for a wildcard) so the checker is shape-
independent: both the expression-level `MArm` and statement-level
`MStmtArm` project to it via `MArm.key` / `MStmtArm.key` and run
through the same `checkAgainst`.

The checker doesn't depend on the source pipeline; it can be called
on hand-built `MProgram`s as well, which is why it lands separately
from the DDM-side translator.
The source-side of the dialect, on top of the AST, lowering, and
ArmCheck modules from the previous two commits.

* `DDMTransform/Grammar.lean` declares the `CoreMatch` dialect (which
  imports `Core`) and the surface syntax for `match`:
  - statement form `match scrut { Ctor(v:T,…) => { … } … }` for
    procedure bodies
  - expression form `match scrut { arm Ctor(v:T,…) => body; … }` for
    function bodies, with a head-arm body surfaced as a direct
    `tp`-typed argument so DDM's type inference fixes the result
    type from the user's first body, plus a tail-arm-list category
    whose bodies are typed `Expr` (re-checked by the translator)
  - the `arm` keyword anchors each tail arm against parser ambiguity
    with Core operators
  - both forms share the `match` keyword; the parser dispatches by
    category (Statement vs Expr) so the two never compete
* `DDMTransform/StrataGen.lean` runs `#strata_gen CoreMatch` to
  produce the typed `CoreMatchDDM.*` AST.
* `DDMTransform/Translate.lean` walks the typed `CoreMatchDDM.Program`
  and emits an `MProgram`.  Tracks `bvars`, `tyBVars`, datatype
  ctor info, label counters, and a `currentRecFns` / `recRewrite`
  state used by the rec-call rewrite for natural-style structural
  recursion.  Detects two arm shapes (`natural`, `cata`, `unknown`)
  via `classifyArm`, rewrites self-calls `f(field)` into sentinel
  fvars during arm body translation, then resolves the sentinels to
  the eliminator's rec-result bvars at the absMulti wrapping step.
  Non-structural self-references (bare references, calls on
  non-recursive arguments) are rejected with a clear diagnostic.
* `Verify.lean` exposes `toCore` (programmatic) and `parseToCore`
  (source) entry points for the dialect.

`Strata.lean` registers the three new modules.
* `Examples/pattern_matching/`: nine `.coreMatch.st` programs covering
  non-recursive enums with wildcards, structural recursion (natural
  and cata-style), nested matches, mixed-type constructor fields,
  expression-tree evaluation, and constructor calls in arm bodies.
  README documents the conventions.

* `StrataTest/Languages/CoreMatch/CompileTest.lean`: programmatic
  invariants on `MProgram → Core.Program` lowering plus ArmCheck
  diagnostics.

* `StrataTest/Languages/CoreMatch/SourceTest.lean`: end-to-end source
  pipeline tests asserting `D$Elim` head, no self-references after
  natural-style rewrite, ite-shaped procedure bodies, and the
  redundancy / non-structural diagnostic surface.
* `editors/GenSyntax.lean` extended to emit a per-dialect highlighter:
  add `LangCfg`/`EmacsCfg` config records and a `DialectGenSpec`
  describing each language to write.  The generator now produces
  highlighters for both `Core` (`.core.st`) and `CoreMatch`
  (`.coreMatch.st`).
* Auto-generated mode files: `editors/emacs/coreMatch-st-mode.el`,
  `editors/vscode/syntaxes/coreMatch-st.tmLanguage.json`.
* VSCode `package.json` and Emacs / VSCode README updates wire the new
  language entries.
Quick CLI that loads the CoreMatch DDM dialect, parses a single
`.coreMatch.st` file, runs the redundancy / exhaustiveness checker,
and lowers it to a `Core.Program`.  Prints `OK: <N> decls` (with
optional `--show-decls` summary) or `ERR: <message>`.

Useful for iterating on individual programs in
`Examples/pattern_matching/` without rebuilding the full test suite.
@julesmt julesmt requested a review from a team May 13, 2026 17:34
@github-actions github-actions Bot added Waiting-For-Review dependencies Pull requests that update a dependency file labels May 13, 2026
@julesmt julesmt requested review from joscoh and shigoel May 21, 2026 20:48
@julesmt julesmt changed the base branch from main to main2 May 22, 2026 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file Waiting-For-Review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant