fix(plpgsql): use context-specific SQL scanner tokens by AlexS778 · Pull Request #32 · gmr/tree-sitter-postgres

AlexS778 · 2026-05-21T19:07:19Z

Summary

Refactors the PL/pgSQL external scanner to use context-specific SQL fragment tokens instead of one global SQL-body heuristic.

This fixes valid PL/pgSQL where words like INTO, USING, NULL, or LOOP are SQL syntax in one context but PL/pgSQL delimiters in another.

Fixed examples

SELECT * INTO q
FROM main.tasks
WHERE id = 1;

INSERT INTO main.tasks (kind)
VALUES ('x');

INSERT INTO main.tasks (kind)
VALUES ('x')
RETURNING * INTO q;

What changed

Added context-specific scanner modes for SQL fragments ending at THEN, LOOP, ;, INTO/USING, comma, range, and assignment boundaries.
Updated PL/pgSQL grammar rules to request the correct scanner mode per context.
Kept the visible node as (sql_expression) via aliases, so existing SQL injections continue to work.
Added corpus coverage for delimiter-sensitive PL/pgSQL cases.
Regenerated only PL/pgSQL Tree-sitter artifacts.
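
As a rough illustration of the idea behind the list above, the sketch below shows how one scanning loop can stop at a different delimiter depending on the requested mode. It is not the code from plpgsql/src/scanner.c: the names `FragmentMode`, `keyword_at`, `terminates`, and `fragment_length` are hypothetical, it works on a plain C string rather than a Tree-sitter lexer, and it ignores comments, dollar quotes, and quoted identifiers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modes; the PR's real ScanMode struct is not shown in this thread. */
typedef enum { UNTIL_SEMICOLON, UNTIL_THEN, UNTIL_LOOP, UNTIL_INTO_OR_USING } FragmentMode;

/* ASCII-only helpers, so no libc ctype functions are needed. */
static char ascii_lower(char c) { return (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c; }

static bool is_word_char(char c) {
  c = ascii_lower(c);
  return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
}

/* True if the whole word `kw` (given in lowercase) starts at s[i], ignoring case. */
static bool keyword_at(const char *s, size_t i, const char *kw) {
  if (i > 0 && is_word_char(s[i - 1])) return false;
  size_t n = strlen(kw);
  for (size_t k = 0; k < n; k++) {
    if (ascii_lower(s[i + k]) != kw[k]) return false;
  }
  return !is_word_char(s[i + n]);
}

/* Does a terminator for `mode` begin at s[i]? */
static bool terminates(const char *s, size_t i, FragmentMode mode) {
  switch (mode) {
    case UNTIL_SEMICOLON: return s[i] == ';';
    case UNTIL_THEN: return keyword_at(s, i, "then");
    case UNTIL_LOOP: return keyword_at(s, i, "loop");
    case UNTIL_INTO_OR_USING:
      return s[i] == ';' || keyword_at(s, i, "into") || keyword_at(s, i, "using");
  }
  return false;
}

/* Length of the SQL fragment at the start of `s`, with trailing whitespace trimmed. */
size_t fragment_length(const char *s, FragmentMode mode) {
  int depth = 0;
  size_t i = 0;
  while (s[i] != '\0') {
    if (s[i] == '\'') { /* skip a single-quoted string literal */
      i++;
      while (s[i] != '\0' && s[i] != '\'') i++;
      if (s[i] == '\'') i++;
      continue;
    }
    if (s[i] == '(') depth++;
    else if (s[i] == ')' && depth > 0) depth--;
    else if (depth == 0 && terminates(s, i, mode)) break; /* only at top level */
    i++;
  }
  while (i > 0 && (s[i - 1] == ' ' || s[i - 1] == '\t' || s[i - 1] == '\n')) i--;
  return i;
}
```

With this sketch, `INTO` ends the fragment only in `UNTIL_INTO_OR_USING` mode; in `UNTIL_SEMICOLON` mode the same word is ordinary SQL text, which is the distinction the fixed examples rely on.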

Testing

cd plpgsql && ../node_modules/.bin/tree-sitter test
Parsed local schema corpus:
- 17 .pl.sql files
- 60 extracted PL/pgSQL bodies
- 0 parse errors

closes #31

Summary by CodeRabbit

Bug Fixes
- More accurate parsing of embedded SQL across control flow, assignment, loops, cursors, RETURN/RAISE/ASSERT, and EXECUTE/PERFORM/CALL forms, yielding more consistent parse trees.
Tests
- Expanded corpus coverage for cursors, dynamic SQL, control-flow conditions, exceptions, loops, and common statement patterns to validate the parsing improvements.

Replace the single global PL/pgSQL SQL-body scanner token with context-specific external tokens for the different embedded SQL contexts. This lets the scanner stop SQL fragments at the correct delimiter for each grammar rule, such as THEN, LOOP, semicolon, INTO/USING, comma, range, or assignment boundaries. Keep the visible node shape as sql_expression via aliases so existing SQL injection queries continue to work. Add corpus coverage for INSERT INTO, SELECT INTO FROM, RETURNING INTO, dynamic EXECUTE USING, RETURN QUERY EXECUTE USING, FOR EXECUTE USING, cursor arguments, RAISE arguments, and dollar-quoted dynamic SQL containing semicolons. Regenerate only the PL/pgSQL tree-sitter artifacts.
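
The dollar-quoted case mentioned above is the least obvious one, so here is a minimal sketch of how a scanner might step over a dollar-quoted string before looking for the terminating semicolon. The function names `skip_dollar_quoted` and `statement_end` are hypothetical and do not come from the PR; the sketch assumes ASCII-only tags and does not handle single-quoted strings or comments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only tag classification (an assumption of this sketch). */
static bool is_tag_start(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static bool is_tag_char(char c) { return is_tag_start(c) || (c >= '0' && c <= '9'); }

/* If a dollar-quoted string ($$...$$ or $tag$...$tag$) opens at s[i], return the
   index just past its closing delimiter, or the end of input if it is never closed.
   Otherwise return i unchanged. */
size_t skip_dollar_quoted(const char *s, size_t i) {
  if (s[i] != '$') return i;
  size_t j = i + 1;
  if (is_tag_start(s[j])) {
    while (is_tag_char(s[j])) j++;
  }
  if (s[j] != '$') return i;    /* e.g. the parameter $1 is not a quote opener */
  size_t delim_len = j - i + 1; /* length of "$tag$" including both dollars */
  const char *p = s + j + 1;
  for (; *p != '\0'; p++) {
    if (strncmp(p, s + i, delim_len) == 0) return (size_t)(p - s) + delim_len;
  }
  return (size_t)(p - s);
}

/* Index of the first ';' outside any dollar-quoted string, or the length of s. */
size_t statement_end(const char *s) {
  size_t i = 0;
  while (s[i] != '\0' && s[i] != ';') {
    size_t next = skip_dollar_quoted(s, i);
    i = (next > i) ? next : i + 1;
  }
  return i;
}
```

For an input such as `EXECUTE $q$SELECT 1; SELECT 2$q$;` the semicolon inside the `$q$` body is skipped and only the final one ends the statement.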

coderabbitai · 2026-05-21T19:07:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73ac6a83-5703-4e1b-946d-69d2d21630fd

📥 Commits

Reviewing files that changed from the base of the PR and between 34d72ac and a43da41.

📒 Files selected for processing (3)

plpgsql/src/scanner.c
plpgsql/test/corpus/control_flow.txt
plpgsql/test/corpus/statements.txt

📝 Walkthrough

The PL/pgSQL grammar and scanner are refactored to replace a single generic SQL-body token with 16+ context-specific external tokens that terminate at PL/pgSQL-relevant delimiters. The scanner now uses a ScanMode mechanism to determine the correct terminator set for each context, and all statement rules are rewritten to consume the appropriate context-specific wrapper nodes instead of the generic sql_expression.

Changes

Context-Specific SQL Fragment Tokens

Layer / File(s)	Summary
External scanner refactoring with context-specific tokens `plpgsql/src/scanner.c`	The token model shifts from a single SQL body to multiple delimiter-aware types. New ScanMode struct and mode_for() function determine which delimiters terminate each fragment. ASCII-only character classification enables WebAssembly builds. Core scan_sql() tracks nesting depth, handles strings/comments, and uses keyword-termination logic without libc ctype. Main scanner dispatch routes through valid_symbols to determine context.
Grammar external symbols and alias wrappers `plpgsql/grammar.js`, `plpgsql/src/grammar.json`, `script/generate-plpgsql-grammar.js`	Externals list expands from one `_sql_body` to many context-specific tokens (`_sql_statement`, `_sql_until_semicolon`, `_sql_until_then`, `_sql_until_when`, `_sql_until_loop`, `_sql_until_assignment`, `_sql_until_range`, etc.). New hidden wrapper rules (`_sql_*_expr`) alias each external token back to the visible `sql_expression` node, preserving injection interfaces.
PL/pgSQL statement rules routed to context-specific fragments `plpgsql/grammar.js`, `plpgsql/src/grammar.json`, `script/generate-plpgsql-grammar.js`	Cursor declarations, variable defaults, assignments, IF/ELSIF/CASE conditions, WHILE/FOR/FOREACH loops, EXIT WHEN, RETURN variants, RAISE, ASSERT, dynamic EXECUTE/PERFORM, CALL/DO, OPEN, FETCH direction, and `stmt_execsql` are rewritten to consume appropriate terminator-aware `_sql_*_expr` wrappers (e.g., IF uses `_sql_then_expr`, EXECUTE uses `_sql_into_using_or_semicolon_expr`, stmt_execsql uses `_sql_statement_expr`).
node-types and stmt_execsql contract `plpgsql/src/node-types.json`	`sql_expression` node-type location adjusted; `stmt_execsql` children/types narrowed to a single `sql_expression` child per updated grammar shape.
Test corpus additions and parse-tree updates `plpgsql/test/corpus/*`	New and updated corpus fixtures cover IF complex IS conditions, OPEN cursor forms (FOR EXECUTE, USING, bound args), FETCH ABSOLUTE INTO, dynamic EXECUTE with multiple USING args and dollar-quoted strings, RETURN QUERY EXECUTE USING, RAISE with multiple format arguments, dynamic loops with USING, and static SQL statement fixtures asserting `stmt_execsql (sql_expression)`. Parse-tree snapshots updated to reflect removed intermediate keyword nodes and new `sql_expression` placements.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

PL/pgSQL scanner should use context-specific SQL fragment tokens instead of one global SQL-body heuristic #31: This PR directly implements the architectural change described in the issue to replace a single generic SQL-body token with context-specific external fragment tokens.

Possibly related PRs

gmr/tree-sitter-postgres#13: Overlaps on removing libc ctype usage and adding ASCII-only classification in plpgsql/src/scanner.c.
gmr/tree-sitter-postgres#18: Related changes to how embedded SQL fragments are detected/terminated by the scanner.
gmr/tree-sitter-postgres#1: Original generator/script that introduced the single $._sql_body handling replaced here.

Poem

🐰 I nibble tokens, split the sea—

One body grows to many ends,
THEN, LOOP, INTO find their key.
No more errors where SQL bends,
Grammar hops, and parsing mends.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(plpgsql): use context-specific SQL scanner tokens' directly describes the main change: replacing a single generic SQL-body token with context-specific external scanner tokens.
Linked Issues check	✅ Passed	The PR fulfills all key objectives from issue `#31`: replaces single generic _sql_body with context-specific external tokens (statement, semicolon, then, when, loop, assignment, range, etc.), updates grammar rules to select appropriate externals per construct, maintains sql_expression compatibility via aliases, and provides comprehensive regression test coverage for known failing cases.
Out of Scope Changes check	✅ Passed	All changes are tightly scoped to the stated objective: scanner refactoring, grammar rule updates for context-specific token routing, test corpus updates for regression coverage, and build script regeneration. No extraneous code modifications or formatting changes are evident.

AlexS778 · 2026-05-21T19:08:54Z

@gmr part of this was fixed in #10 (comment), the one about the PL/pgSQL scanner stopping too early at NULL/INTO/etc.

I found some other cases that still gave me errors and fixed them here.

AlexS778 · 2026-05-21T19:53:39Z

@coderabbitai review

coderabbitai · 2026-05-21T19:53:45Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

script/generate-plpgsql-grammar.js (2)

213-236: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the ALIAS FOR arm before variable declarations in the generator.

The checked-in plpgsql/grammar.js still relies on the alias arm coming first so ALIAS is not consumed as a type name. This template has the arms reversed, so the next regeneration will reintroduce that ambiguity.

Suggested fix

     decl_statement: $ => choice(
+      // Alias declaration: name ALIAS FOR target ;
+      seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'),
       // Variable declaration: name [CONSTANT] type [COLLATE collation] [NOT NULL] [DEFAULT|:=|= expr] ;
       seq(
         $.decl_varname,
         optional($.kw_constant),
         $.decl_datatype,
         optional($.decl_collate),
         optional(seq($.kw_not, $.kw_null)),
         optional($.decl_defval),
         ';'
       ),
-      // Alias declaration: name ALIAS FOR target ;
-      seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'),
       // Cursor declaration: name [NO SCROLL | SCROLL] CURSOR [(args)] FOR|IS query ;
       seq(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/generate-plpgsql-grammar.js` around lines 213 - 236, The choice arms
inside decl_statement are ordered so the variable-declaration arm can consume
the token ALIAS as a type; move the Alias declaration arm (the
seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';')) to appear
before the variable declaration arm (the seq that includes $.decl_datatype,
optional($.decl_collate), etc.) so that $.kw_alias is recognized as the ALIAS
form instead of being parsed as a datatype; update the ordering in
decl_statement accordingly.

574-581: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require the cursor name even when FROM is omitted.

This template makes the cursor identifier part of the optional FROM clause, so regenerating from it would reject valid forms like FETCH NEXT cur INTO target; and only accept the FROM cur variant.

Suggested fix

     stmt_fetch: $ => seq(
       $.kw_fetch,
       optional($.fetch_direction),
-      optional(seq($.kw_from, $.any_identifier)),
+      optional($.kw_from),
+      $.any_identifier,
       $.kw_into,
       $.into_target,
       ';'
     ),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/generate-plpgsql-grammar.js` around lines 574 - 581, The grammar rule
stmt_fetch currently nests the cursor identifier inside optional(seq($.kw_from,
$.any_identifier)), making the cursor name optional when FROM is omitted; change
the sequence so the cursor identifier ($.any_identifier) is required regardless
of the presence of $.kw_from — e.g. keep $.kw_fetch and
optional($.fetch_direction), make $.kw_from optional by itself, then require
$.any_identifier before $.kw_into and $.into_target, so symbols to edit are
stmt_fetch, $.fetch_direction, $.kw_from, and $.any_identifier.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plpgsql/src/scanner.c`:
- Around line 433-438: The code currently treats any top-level '<<' as a
block-label start even when SQL content has already begun; update the
conditional in the scanner so that the '<<' label-start handling only triggers
when at top-level (depth == 0) AND no SQL content has been emitted yet
(has_content is false). Concretely, modify the check around lexer->lookahead ==
'<' (the block that calls lexer->mark_end, lexer->advance and return
finish_token(lexer, mode.symbol, has_content)) to include a guard that
!has_content (or equivalent "fragment-start" state) before consuming '<<' as a
label; apply the same change to the analogous block that handles '<<' at the
second location (the 628-635 region) so 'a << 1' and similar SQL expressions are
not mis-parsed as label boundaries.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b183ca6-5137-4358-ad5e-78eadd5ec90c

📥 Commits

Reviewing files that changed from the base of the PR and between 33e140f and 34d72ac.

📒 Files selected for processing (13)

plpgsql/grammar.js
plpgsql/src/grammar.json
plpgsql/src/node-types.json
plpgsql/src/parser.c
plpgsql/src/scanner.c
plpgsql/test/corpus/control_flow.txt
plpgsql/test/corpus/cursors.txt
plpgsql/test/corpus/diagnostics.txt
plpgsql/test/corpus/dynamic_sql.txt
plpgsql/test/corpus/exceptions.txt
plpgsql/test/corpus/loops.txt
plpgsql/test/corpus/statements.txt
script/generate-plpgsql-grammar.js

Only treat top-level `<<` as a PL/pgSQL label opener when the SQL fragment has not emitted any content yet. This prevents expressions like `a << 1` from being split as if `<<` started a block label. Apply the same guard in both scanner paths and add corpus coverage for `<<` in IF conditions and static SQL statements.
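
The guard described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in plpgsql/src/scanner.c: the function name `fragment_before_label` is hypothetical, it scans a plain C string instead of using the Tree-sitter lexer API, and it only models the `depth` and `has_content` conditions named in the review comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the SQL fragment at the start of `s`,
   stopping at a top-level ';' or at a `<<` that opens a block label. */
size_t fragment_before_label(const char *s) {
  bool has_content = false; /* set once any non-whitespace SQL text has been seen */
  int depth = 0;            /* parenthesis nesting depth */
  size_t i = 0;
  for (; s[i] != '\0'; i++) {
    char c = s[i];
    if (c == ' ' || c == '\t' || c == '\n') continue;
    if (depth == 0 && c == ';') break;
    /* `<<` is a label opener only at top level and before any SQL content. */
    if (depth == 0 && !has_content && c == '<' && s[i + 1] == '<') break;
    if (c == '(') depth++;
    else if (c == ')' && depth > 0) depth--;
    has_content = true;
  }
  return i;
}
```

Without the `!has_content` condition, the input `a << 1;` would be cut after `a`, which is the misparse the review comment describes.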

AlexS778 · 2026-05-23T11:32:59Z

@coderabbitai review

coderabbitai · 2026-05-23T11:33:05Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

AlexS778 · 2026-05-26T13:11:02Z

@gmr
Hey, since I've been working through recent PL/pgSQL generator/scanner issues and other stuff, would you be open to adding me as a maintainer with permissions to merge PRs?

I’m happy to keep things conservative: work through PRs, keep changes reviewable, not touch releases unless you want me to, and coordinate before any big grammar/generator changes.

Right now I’m doing everything from a fork, and it’s getting difficult to keep branches, generated files, and fixes in sync. Having merge access would let me cleanly land fixes like #30/#32, keep the repo moving while you’re busy, and reduce the amount of coordination needed from you.

My coworker and I are having trouble using Tree-sitter in Zed, because parts of the code come from my fork and parts from this repo. So yeah, it's not ideal.

Totally fine if you’d prefer a limited setup first - for example, merge/write access but releases stay with you, or a trial period.

coderabbitai Bot reviewed May 21, 2026

Comment thread plpgsql/src/scanner.c Outdated

AlexS778 wants to merge 2 commits into gmr:main from AlexS778:fix/into-scanner-context
