Merged
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,33 @@ packages, and tooling contracts may change before a stable release.

### Changed

- Conflict diagnostics (`duplicate_route`, `route_method_conflict`, including
contract-route conflicts) now carry a related source location pointing at the
first declaration. `gowdk check --json` gains an additive `related` array per
diagnostic, and the language server reports it as `relatedInformation`.
- The formatter now tracks brace depth with the parser's string- and
comment-aware scanner, so braces inside string literals, comments, and
template literals (for example `title "a { b"`) no longer skew indentation.
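
The brace-tracking fix described above can be illustrated with a minimal sketch. It is not the gowdk scanner: `braceDelta` is a hypothetical helper, and it assumes double-quoted strings with backslash escapes, backtick template literals, and `//` line comments. It also works on one line at a time, so a literal that spans lines would need state carried between calls.

```go
package main

import "fmt"

// braceDelta returns the net change in brace depth for one line, ignoring
// braces inside double-quoted strings, backtick literals, and // comments.
// Hypothetical helper for illustration; not the gowdk implementation.
func braceDelta(line string) int {
	depth := 0
	var quote byte // 0 outside a literal, otherwise '"' or '`'
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case quote != 0:
			if c == '\\' && quote == '"' {
				i++ // skip the escaped byte inside a string
			} else if c == quote {
				quote = 0 // literal closed
			}
		case c == '"' || c == '`':
			quote = c // literal opened
		case c == '/' && i+1 < len(line) && line[i+1] == '/':
			return depth // the rest of the line is a comment
		case c == '{':
			depth++
		case c == '}':
			depth--
		}
	}
	return depth
}

func main() {
	fmt.Println(braceDelta(`title "a { b"`))    // brace is inside a string
	fmt.Println(braceDelta(`view { // note }`)) // closing brace is inside a comment
	fmt.Println(braceDelta("}"))
}
```

A naive counter would report `+1` for the first line and `0` for the second; the literal- and comment-aware version reports `0` and `+1`, which is what keeps indentation stable.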

### Implemented

- A machine-checked `.gwdk` conformance corpus
(`internal/lang/testdata/conformance/`) pins the language contract: `accept/`
cases must check clean and `reject/` cases must produce their declared stable
diagnostic codes. See `docs/language/conformance.md`.
- A per-construct stability and deprecation table
(`docs/language/stability.md`) documents which blocks, metadata keywords, and
`g:` directives are stable, partial, planned, or deprecated, guarded against
drift from the code registries by a test.
- `source.SourcePosition` carries a byte `Offset`, with `source.PositionAt` and
`source.OffsetOf` conversion helpers, as the exact substrate for future
AST-backed formatting and precise editor edits.
- ADR 0010 records the decision to replace the line-oriented parser with a
shared tokenizer and a recursive-descent parser with error recovery, migrated
behind the stable `gwdkast` AST seam.
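
As a rough illustration of the offset conversions mentioned above, the sketch below converts between byte offsets and 1-based line/column positions. The names `Position`, `positionAt`, and `offsetOf` are hypothetical stand-ins, not the signatures of `source.PositionAt` or `source.OffsetOf`, and the sketch assumes columns are counted in bytes.

```go
package main

import "fmt"

// Position is a hypothetical stand-in for source.SourcePosition.
type Position struct {
	Line, Column int // 1-based
	Offset       int // 0-based byte offset
}

// positionAt converts a byte offset into a line/column position.
// Out-of-range offsets are clamped to the source bounds.
func positionAt(src string, offset int) Position {
	if offset < 0 {
		offset = 0
	}
	if offset > len(src) {
		offset = len(src)
	}
	line, lineStart := 1, 0
	for i := 0; i < offset; i++ {
		if src[i] == '\n' {
			line++
			lineStart = i + 1
		}
	}
	return Position{Line: line, Column: offset - lineStart + 1, Offset: offset}
}

// offsetOf converts a 1-based line/column into a byte offset. It returns -1
// when the line does not exist or the offset falls past the end of the
// source. It does not check that the column stays within its line.
func offsetOf(src string, line, column int) int {
	if line < 1 || column < 1 {
		return -1
	}
	cur, lineStart := 1, 0
	for i := 0; i < len(src) && cur < line; i++ {
		if src[i] == '\n' {
			cur++
			lineStart = i + 1
		}
	}
	if cur != line {
		return -1
	}
	off := lineStart + column - 1
	if off > len(src) {
		return -1
	}
	return off
}

func main() {
	src := "ab\ncd"
	fmt.Printf("%+v\n", positionAt(src, 4))
	fmt.Println(offsetOf(src, 2, 2))
}
```

The two functions round-trip: the position computed for an offset maps back to the same offset, which is the property exact editor edits rely on.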

### Changed

- A page that declares no `guard` is no longer a build error. `guard` is now
optional, but a page is not public by default: `missing_page_guard` is now a
**warning** and the page's route is denied (403) at request time until the
6 changes: 6 additions & 0 deletions docs/compiler/pipeline.md
@@ -70,3 +70,9 @@ project config

Future build work should expand from the current generated-output slice while
keeping downstream passes on `internal/gwdkir.Program`.

The `lex/parse full AST` front-end is the line-oriented parser today. The
decision to replace it with a shared tokenizer and a recursive-descent parser
with error recovery, migrated behind the stable `internal/gwdkast` AST seam, is
recorded in
`docs/engineering/decisions/0010-tokenizer-recursive-descent-parser.md`.
5 changes: 5 additions & 0 deletions docs/compiler/syntax-contributors.md
@@ -44,6 +44,11 @@ language contract.
- LSP/editor: `go test ./internal/lsp` plus editor checks when touched.
- CLI report changes: update `cmd/gowdk/testdata/*_golden` and run
`go test ./cmd/gowdk`.
8. Add a conformance corpus case:
- Accepted syntax: an `accept/` file under
`internal/lang/testdata/conformance/` that exercises it.
- A rejection or new diagnostic: a `reject/` file with a leading
`// expect: <code>` directive. See `docs/language/conformance.md`.

## Guardrails

139 changes: 139 additions & 0 deletions docs/engineering/decisions/0010-tokenizer-recursive-descent-parser.md
@@ -0,0 +1,139 @@
# ADR 0010: Tokenizer and Recursive-Descent Parser Direction

Date: 2026-06-11

## Status

Accepted

## Context

The compiler front-end is line-oriented. `internal/parser.ParseSyntax` reads
source with a `bufio.Scanner`, matches patterns against each trimmed line
(`internal/parser/patterns.go` `lexLine`), tracks nesting with a separate stateful
brace scanner (`internal/parser/braces.go`), and returns on the first syntax
error with no recovery. Source positions are 1-based line/column with no byte
offset, so many spans are line-wide approximations (`sourceLineSpan`). The
formatter (`internal/lang/format.go`) is independent whitespace-only string
manipulation that counts braces without skipping strings or comments.

This single foundation is the upstream constraint behind most of the deferred
parser/formatter/diagnostics work (#250): error recovery, an AST-backed
formatter, exact token spans, and granular per-construct diagnostic codes are all
downstream of having a real token stream and a node-producing parser. Right now
the line-oriented parser persists by omission rather than by an explicit
decision.

Two facts make the direction clear rather than open-ended:

1. The documented target pipeline (`docs/compiler/pipeline.md`) already names a
`lex/parse full AST -> semantic analysis -> stable internal IR` front-end.
This ADR makes explicit the parser-internals decision that target already
implies.
2. A real character-level tokenizer already exists. `internal/lang.Lex`
(`internal/lang/lexer.go`) scans runes into typed tokens with line/column
positions, but only editor and CLI tooling consume it. The compiler parser
ignores it and re-lexes per line. The codebase therefore maintains two
divergent front-ends for the same language.

Crucially, the typed AST is already a stable seam. `internal/parser.ParseSyntax`
produces the `internal/gwdkast` AST, and every downstream pass
(`internal/gwdkanalysis` lowering to `internal/gwdkir.Program`, validation, and
generation) consumes that AST. The parser can be replaced behind that seam
without disturbing IR, validation, reports, or codegen.

## Decision

Commit to a single shared tokenizer and a recursive-descent parser with error
recovery, producing the existing `internal/gwdkast` AST. Migrate incrementally
behind the AST seam.

Concretely:

- **One tokenizer.** Promote the `internal/lang` rune scanner into the shared
lexer that both the compiler parser and editor/CLI tooling consume. Retire the
per-line `lexLine` path in `internal/parser`. There is one lexical definition
of `.gwdk`, not two.
- **Recursive-descent parser over tokens.** Parse the token stream into
`gwdkast.File` with explicit declaration, block, and view productions instead
of line-pattern matching. The brace scanner's string/comment/template state
becomes ordinary lexer state rather than a separate counter.
- **Error recovery.** The parser synchronizes at top-level declaration
boundaries and block braces so one syntax error does not hide the rest of the
file. It accumulates diagnostics instead of returning on the first error.
- **Exact spans.** Tokens carry byte offsets (this ADR depends on #294), so AST
nodes and diagnostics get exact token ranges instead of line-wide approximations.
- **AST is the frozen seam.** `internal/gwdkast.File` is the contract. The new
parser must produce the same AST as the line-oriented parser for the currently
supported subset; `gwdkanalysis`, `gwdkir`, validation, reports, and codegen do
not change as part of this work.
- **Formatter follows.** Once the parser yields full nodes, the AST-backed
formatter deferred in #250 becomes possible and replaces line-oriented
`format.go`. Until then, the line-oriented formatter keeps its documented
limits (see #296).

Migration is incremental and non-breaking. The line-oriented parser keeps working
while the new parser is built to produce identical `gwdkast.File` output for the
supported subset, gated by golden AST-equivalence tests and the language
conformance corpus (#295). Cutover happens per declaration kind once equivalence
holds, then the line-oriented path and `lexLine` are removed.
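
The shape of this decision can be sketched on a toy grammar. Everything below is hypothetical and is not `.gwdk`: the grammar is `file = decl*`, `decl = "block" name "{" item* "}"`, and the types `token`, `decl`, and `parser` are invented for illustration. The sketch shows the three properties the decision names: tokens carry byte offsets, the parser accumulates diagnostics instead of returning on the first error, and it resynchronizes at the next top-level declaration.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a lexeme plus the byte offset of its first byte.
type token struct {
	text   string
	offset int
}

// lex splits src into "{", "}", and whitespace-delimited words.
func lex(src string) []token {
	var toks []token
	for i := 0; i < len(src); {
		switch c := src[i]; {
		case c == ' ' || c == '\t' || c == '\n':
			i++
		case c == '{' || c == '}':
			toks = append(toks, token{src[i : i+1], i})
			i++
		default:
			start := i
			for i < len(src) && strings.IndexByte(" \t\n{}", src[i]) < 0 {
				i++
			}
			toks = append(toks, token{src[start:i], start})
		}
	}
	return toks
}

type decl struct {
	name  string
	items []string
}

type parser struct {
	toks  []token
	pos   int
	end   int // offset reported for end-of-file errors
	diags []string
}

// cur returns the current token, or an empty-text token at end of input.
func (p *parser) cur() token {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return token{"", p.end}
}

// fail records a diagnostic at the current token and returns false.
func (p *parser) fail(want string) bool {
	t := p.cur()
	p.diags = append(p.diags,
		fmt.Sprintf("offset %d: expected %s, found %q", t.offset, want, t.text))
	return false
}

func isName(s string) bool {
	return s != "" && s != "{" && s != "}" && s != "block"
}

// parseDecl parses: "block" name "{" item* "}".
func (p *parser) parseDecl() (decl, bool) {
	if p.cur().text != "block" {
		return decl{}, p.fail(`"block"`)
	}
	p.pos++
	name := p.cur()
	if !isName(name.text) {
		return decl{}, p.fail("a name")
	}
	p.pos++
	if p.cur().text != "{" {
		return decl{}, p.fail(`"{"`)
	}
	p.pos++
	d := decl{name: name.text}
	for isName(p.cur().text) {
		d.items = append(d.items, p.cur().text)
		p.pos++
	}
	if p.cur().text != "}" {
		return decl{}, p.fail(`"}"`)
	}
	p.pos++
	return d, true
}

// sync skips to the next "block" keyword so parsing resumes at a
// top-level declaration boundary.
func (p *parser) sync() {
	for p.pos < len(p.toks) && p.toks[p.pos].text != "block" {
		p.pos++
	}
}

// parse returns every declaration that parsed cleanly plus all diagnostics.
func parse(src string) ([]decl, []string) {
	p := &parser{toks: lex(src), end: len(src)}
	var decls []decl
	for p.pos < len(p.toks) {
		if d, ok := p.parseDecl(); ok {
			decls = append(decls, d)
		} else {
			p.sync()
		}
	}
	return decls, p.diags
}

func main() {
	decls, diags := parse("block a { x y }\nblock { z }\nblock c { w }")
	for _, d := range decls {
		fmt.Println(d.name, d.items)
	}
	for _, m := range diags {
		fmt.Println(m)
	}
}
```

In the sample input the second declaration is missing its name. The parser reports one diagnostic at the offending token's byte offset and still returns declarations `a` and `c`, whereas a first-error parser would stop before `c`.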

## Consequences

### Positive

- One lexical and grammatical definition of `.gwdk` shared by the compiler and
the language server, instead of a line parser plus a separate tooling lexer.
- Error recovery, exact spans, AST-backed formatting, and granular diagnostic
codes become reachable; #250 stops being blocked by the front-end.
- Diagnostics point at tokens rather than whole lines, improving CLI output and
LSP precision.
- Braces inside strings, comments, and template literals are handled by lexer
state, removing a class of parser and formatter miscounts by construction.

### Negative

- A recursive-descent parser plus recovery is materially more code than the
current line parser, and the migration must preserve AST output exactly to stay
non-breaking.
- Equivalence testing across every declaration kind is required before cutover;
this is real up-front cost before any user-visible benefit lands.
- Recovery and span precision depend on byte offsets (#294) landing first.

### Neutral

- The public language surface does not change. This is a front-end
implementation decision, not a grammar change; the conformance corpus (#295)
pins behavior across the migration.
- Downstream passes are untouched because the AST seam is stable.

## Alternatives Considered

- **Keep the line-oriented parser, document its limits.** Lowest cost, but
permanently caps span precision, error recovery, and AST-backed formatting, and
keeps two divergent front-ends. Rejected: it contradicts the already-documented
target pipeline and leaves #250 structurally blocked.
- **Adopt a parser generator or third-party combinator library** (ANTLR,
participle, goyacc). Rejected: adds a dependency and a generated/runtime layer
against the project's lean-dependency stance, and a hand-written
recursive-descent parser gives better control over recovery and diagnostics for
a small surface language.
- **Incremental/streaming parser from day one.** Useful for an editor, but
premature. The AST seam lets an incremental layer be added later without
another front-end decision.

## Follow-Up

- #294 (byte offsets in source positions) is the prerequisite; land it first.
- Build the shared tokenizer by promoting `internal/lang`'s scanner; retire
`internal/parser` `lexLine`.
- Build the recursive-descent parser to `gwdkast.File` with recovery, gated by
golden AST-equivalence tests and the conformance corpus (#295).
- Cut over per declaration kind; remove the line-oriented parser when equivalence
holds across the supported subset.
- AST-backed formatter and granular per-construct diagnostic codes (#250) consume
the new parser; #296 is the interim formatter guard.
- Link this ADR from the #250 deferral so the line-oriented limitation is a
conscious choice with a committed exit.
- Keep `docs/compiler/pipeline.md` and `docs/engineering/architecture.md` aligned
as the migration proceeds.
3 changes: 3 additions & 0 deletions docs/engineering/decisions/README.md
@@ -23,3 +23,6 @@ Recommended naming:
- `0007-static-first-spa-navigation.md`: accepted static-first SPA navigation and generated JavaScript guardrails.
- `0008-bounded-client-language.md`: accepted bounded `client {}` language and page-scoped store boundaries.
- `0009-optional-inline-go-authoring.md`: accepted optional inline Go authoring direction, with extraction to normal package Go.
- `0010-tokenizer-recursive-descent-parser.md`: accepted shared tokenizer and
recursive-descent parser with error recovery, migrated behind the stable
`gwdkast` AST seam.
2 changes: 2 additions & 0 deletions docs/language/README.md
@@ -51,6 +51,8 @@ component contract and inline package-go-block slices.
- `hybrid.md`: hybrid request-time behavior and deferred hybrid capabilities.
- `diagnostics.md`: current diagnostic shape and known codes.
- `formatting.md`: current formatter behavior.
- `stability.md`: per-construct stability and deprecation tiers.
- `conformance.md`: machine-checked accept/reject corpus that pins the contract.

## File Kinds

51 changes: 51 additions & 0 deletions docs/language/conformance.md
@@ -0,0 +1,51 @@
# .gwdk Conformance Corpus

The conformance corpus is the machine-checked source of truth for the `.gwdk`
language contract. The prose in `docs/language/spec.md` and
`docs/language/grammar.md` describes the language; the corpus *pins* it, so a
parser or validator change that silently accepts or rejects different syntax
fails a test instead of drifting from the docs.

## Location

```text
internal/lang/testdata/conformance/
accept/ # files that must check clean (no error-severity diagnostics)
reject/ # files that must produce specific stable diagnostic codes
```

The runners are `TestConformanceCorpusAccept` and `TestConformanceCorpusReject`
in `internal/lang/conformance_test.go`. Each file is checked with
`lang.CheckSource`, the same single-file path the editor and `gowdk check` use,
so cases are hermetic and need no project layout.

## Accept cases

Any `.gwdk` file under `accept/` must produce no error-severity diagnostics.
Warnings (for example `missing_img_alt`) are allowed, because they do not fail a
build. File-kind classification follows the filename suffix, so a component case
is named `*.cmp.gwdk` and a layout case `*.layout.gwdk`.
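
The accept rule and the suffix convention above might look like the following sketch. The `diagnostic` type, the severity strings, and both helpers are hypothetical; they are not the types used by `lang.CheckSource`, and the sketch names only the two file kinds this page mentions.

```go
package main

import (
	"fmt"
	"strings"
)

// diagnostic is a hypothetical shape; the real type lives in the gowdk code.
type diagnostic struct {
	Code     string
	Severity string // assumed to be "error" or "warning"
}

// fileKind classifies a corpus file by its filename suffix.
func fileKind(name string) string {
	switch {
	case strings.HasSuffix(name, ".cmp.gwdk"):
		return "component"
	case strings.HasSuffix(name, ".layout.gwdk"):
		return "layout"
	default:
		return "other" // kinds not named on this page
	}
}

// checksClean reports whether an accept case passes: warnings are allowed,
// any error-severity diagnostic fails the case.
func checksClean(diags []diagnostic) bool {
	for _, d := range diags {
		if d.Severity == "error" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(fileKind("card.cmp.gwdk"))
	fmt.Println(checksClean([]diagnostic{{Code: "missing_img_alt", Severity: "warning"}}))
}
```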

## Reject cases

Any `.gwdk` file under `reject/` must declare the stable diagnostic codes it is
expected to produce in a leading directive comment:

```gwdk
// expect: old_action_block_syntax
package pages
...
```

Multiple codes may be comma- or space-separated. The test asserts every named
code appears among the diagnostics for that file. Diagnostic codes are the ones
registered in `internal/diagnostics/registry.go` and documented in
`docs/reference/diagnostic-codes.md`.
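
A minimal sketch of how a runner could read the directive and apply the assertion described above. `parseExpect` and `missingCodes` are hypothetical helpers, not the code in `internal/lang/conformance_test.go`, and the sketch assumes the directive sits on the first line of the file.

```go
package main

import (
	"fmt"
	"strings"
)

// parseExpect extracts diagnostic codes from a leading "// expect:" directive.
// Codes may be separated by commas, spaces, or both. It returns nil when the
// first line is not a directive.
func parseExpect(src string) []string {
	first := src
	if i := strings.IndexByte(src, '\n'); i >= 0 {
		first = src[:i]
	}
	first = strings.TrimSpace(first)
	const prefix = "// expect:"
	if !strings.HasPrefix(first, prefix) {
		return nil
	}
	return strings.FieldsFunc(strings.TrimPrefix(first, prefix), func(r rune) bool {
		return r == ',' || r == ' ' || r == '\t'
	})
}

// missingCodes returns the expected codes that do not appear in got.
// Extra codes in got are ignored: the rule is "every named code appears".
func missingCodes(expected, got []string) []string {
	seen := make(map[string]bool, len(got))
	for _, c := range got {
		seen[c] = true
	}
	var missing []string
	for _, c := range expected {
		if !seen[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	src := "// expect: duplicate_route, route_method_conflict\npackage pages\n"
	want := parseExpect(src)
	fmt.Println(want)
	fmt.Println(missingCodes(want, []string{"duplicate_route"}))
}
```

Because the check is containment rather than equality, a reject case keeps passing if the compiler later emits additional diagnostics for the same file.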

## Adding a corpus case

New or changed `.gwdk` syntax must come with a corpus case. Adding accepted
syntax means an `accept/` file exercising it; adding a rejection or a new
diagnostic means a `reject/` file with the expected code. This requirement is
part of the syntax contributor checklist in
`docs/compiler/syntax-contributors.md`.
4 changes: 4 additions & 0 deletions docs/language/grammar.md
@@ -2,6 +2,10 @@

This is the grammar accepted by the current metadata parser. It is intentionally line-oriented and incomplete.

Accepted and rejected syntax is pinned by the machine-checked conformance corpus
in [Conformance Corpus](conformance.md), which is the contract source of truth
when this grammar drifts.

```text
file = line*
line = blank | comment | packageDecl | metadataDecl | importDecl | useDecl | blockDecl | goDecl | actionDecl | apiDecl | unsupportedBlock | other
10 changes: 10 additions & 0 deletions docs/language/spec.md
@@ -8,6 +8,16 @@ instead of becoming accidental behavior.
Detailed behavior stays in the feature pages linked from
[GOWDK Language](README.md).

This prose is pinned by the machine-checked conformance corpus described in
[Conformance Corpus](conformance.md): accepted syntax has an `accept/` case that
must check clean, and rejected syntax has a `reject/` case asserting its stable
diagnostic code. When this spec and the corpus disagree, the corpus is the
contract and one of them is a bug.

Per-construct stability and deprecation tiers (which blocks, metadata keywords,
and `g:` directives are stable, partial, planned, or deprecated) are published
in [Language Construct Stability](stability.md).

## Status Terms

- Implemented: accepted by the current compiler and covered by tests or a