fix(plpgsql): use context-specific SQL scanner tokens #32

Open
AlexS778 wants to merge 2 commits into gmr:main from AlexS778:fix/into-scanner-context

Conversation

@AlexS778

@AlexS778 AlexS778 commented May 21, 2026

Contributor

Summary

Refactors the PL/pgSQL external scanner to use context-specific SQL fragment tokens instead of one global SQL-body heuristic.

This fixes valid PL/pgSQL where words like INTO, USING, NULL, or LOOP are SQL syntax in one context but PL/pgSQL delimiters in another.

Fixed examples

SELECT * INTO q
FROM main.tasks
WHERE id = 1;

INSERT INTO main.tasks (kind)
VALUES ('x');

INSERT INTO main.tasks (kind)
VALUES ('x')
RETURNING * INTO q;

What changed

  • Added context-specific scanner modes for SQL fragments ending at THEN, LOOP, ;, INTO/USING, comma, range, and assignment boundaries.

  • Updated PL/pgSQL grammar rules to request the correct scanner mode per context.

  • Kept the visible node as (sql_expression) via aliases, so existing SQL injections continue to work.

  • Added corpus coverage for delimiter-sensitive PL/pgSQL cases.

  • Regenerated only PL/pgSQL Tree-sitter artifacts.
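
As a rough illustration of the per-context terminators listed above, here is a minimal JavaScript model of a delimiter-aware scan. It is a sketch only: the real scanner is C code in plpgsql/src/scanner.c and is not reproduced here. The names `MODES`, `scanSql`, and the mode keys are hypothetical, and the model skips comments and dollar-quoted strings, which the real scanner is described as handling.

```javascript
// Hypothetical mode table: each context lists the top-level keywords and
// characters that end its SQL fragment. Names are illustrative only.
const MODES = {
  untilThen: { keywords: ['THEN'], chars: [] },
  untilLoop: { keywords: ['LOOP'], chars: [] },
  statement: { keywords: [], chars: [';'] },
  intoUsingOrSemicolon: { keywords: ['INTO', 'USING'], chars: [';'] },
};

// Return the SQL fragment at the start of `input`, ending at the first
// terminator of the given mode that appears outside parentheses and strings.
function scanSql(input, modeName) {
  const mode = MODES[modeName];
  let depth = 0;
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (ch === "'") {
      // Skip a single-quoted string, treating '' as an escaped quote.
      i++;
      while (i < input.length) {
        if (input[i] === "'") {
          if (input[i + 1] === "'") { i += 2; continue; }
          break;
        }
        i++;
      }
      i++;
      continue;
    }
    if (ch === '(') { depth++; i++; continue; }
    if (ch === ')') { depth = Math.max(0, depth - 1); i++; continue; }
    if (depth === 0 && mode.chars.includes(ch)) break;
    if (/[A-Za-z_]/.test(ch)) {
      // Read the whole word so a keyword only matches as a complete token.
      let j = i;
      while (j < input.length && /[A-Za-z0-9_]/.test(input[j])) j++;
      const word = input.slice(i, j).toUpperCase();
      if (depth === 0 && mode.keywords.includes(word)) break;
      i = j;
      continue;
    }
    i++;
  }
  return input.slice(0, i).trim();
}

// INTO is ordinary SQL inside a static statement, so only ';' ends it.
console.log(scanSql('SELECT * INTO q FROM main.tasks WHERE id = 1;', 'statement'));
// In an IF condition the fragment ends at THEN instead.
console.log(scanSql('x IS NULL THEN RETURN;', 'untilThen')); // → x IS NULL
// After EXECUTE, USING ends the dynamic SQL string.
console.log(scanSql("'SELECT 1' USING a, b;", 'intoUsingOrSemicolon')); // → 'SELECT 1'
```

The point of the table is that the same word (INTO here) is a terminator in one mode and plain SQL in another, which a single global heuristic cannot express.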

Testing

  • cd plpgsql && ../node_modules/.bin/tree-sitter test
  • Parsed local schema corpus:
    • 17 .pl.sql files
    • 60 extracted PL/pgSQL bodies
    • 0 parse errors

closes #31

Summary by CodeRabbit

  • Bug Fixes

    • More accurate parsing of embedded SQL across control flow, assignment, loops, cursors, RETURN/RAISE/ASSERT, and EXECUTE/PERFORM/CALL forms, yielding more consistent parse trees.
  • Tests

    • Expanded corpus coverage for cursors, dynamic SQL, control-flow conditions, exceptions, loops, and common statement patterns to validate the parsing improvements.

Replace the single global PL/pgSQL SQL-body scanner token with
context-specific external tokens for the different embedded SQL contexts.

This lets the scanner stop SQL fragments at the correct delimiter for each
grammar rule, such as THEN, LOOP, semicolon, INTO/USING, comma, range, or
assignment boundaries.

Keep the visible node shape as sql_expression via aliases so existing SQL
injection queries continue to work.

Add corpus coverage for INSERT INTO, SELECT INTO FROM, RETURNING INTO,
dynamic EXECUTE USING, RETURN QUERY EXECUTE USING, FOR EXECUTE USING, cursor
arguments, RAISE arguments, and dollar-quoted dynamic SQL containing
semicolons.

Regenerate only the PL/pgSQL tree-sitter artifacts.
@coderabbitai

coderabbitai Bot commented May 21, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73ac6a83-5703-4e1b-946d-69d2d21630fd

📥 Commits

Reviewing files that changed from the base of the PR and between 34d72ac and a43da41.

📒 Files selected for processing (3)
  • plpgsql/src/scanner.c
  • plpgsql/test/corpus/control_flow.txt
  • plpgsql/test/corpus/statements.txt

📝 Walkthrough

Walkthrough

The PL/pgSQL grammar and scanner are refactored to replace a single generic SQL-body token with 16+ context-specific external tokens that terminate at PL/pgSQL-relevant delimiters. The scanner now uses a ScanMode mechanism to determine the correct terminator set for each context, and all statement rules are rewritten to consume the appropriate context-specific wrapper nodes instead of the generic sql_expression.

Changes

Context-Specific SQL Fragment Tokens

| Layer / File(s) | Summary |
|---|---|
| External scanner refactoring with context-specific tokens<br>plpgsql/src/scanner.c | The token model shifts from a single SQL body to multiple delimiter-aware types. New ScanMode struct and mode_for() function determine which delimiters terminate each fragment. ASCII-only character classification enables WebAssembly builds. Core scan_sql() tracks nesting depth, handles strings/comments, and uses keyword-termination logic without libc ctype. Main scanner dispatch routes through valid_symbols to determine context. |
| Grammar external symbols and alias wrappers<br>plpgsql/grammar.js, plpgsql/src/grammar.json, script/generate-plpgsql-grammar.js | Externals list expands from one _sql_body to many context-specific tokens (_sql_statement, _sql_until_semicolon, _sql_until_then, _sql_until_when, _sql_until_loop, _sql_until_assignment, _sql_until_range, etc.). New hidden wrapper rules (_sql_*_expr) alias each external token back to the visible sql_expression node, preserving injection interfaces. |
| PL/pgSQL statement rules routed to context-specific fragments<br>plpgsql/grammar.js, plpgsql/src/grammar.json, script/generate-plpgsql-grammar.js | Cursor declarations, variable defaults, assignments, IF/ELSIF/CASE conditions, WHILE/FOR/FOREACH loops, EXIT WHEN, RETURN variants, RAISE, ASSERT, dynamic EXECUTE/PERFORM, CALL/DO, OPEN, FETCH direction, and stmt_execsql are rewritten to consume appropriate terminator-aware _sql_*_expr wrappers (e.g., IF uses _sql_then_expr, EXECUTE uses _sql_into_using_or_semicolon_expr, stmt_execsql uses _sql_statement_expr). |
| node-types and stmt_execsql contract<br>plpgsql/src/node-types.json | sql_expression node-type location adjusted; stmt_execsql children/types narrowed to a single sql_expression child per updated grammar shape. |
| Test corpus additions and parse-tree updates<br>plpgsql/test/corpus/* | New and updated corpus fixtures cover IF complex IS conditions, OPEN cursor forms (FOR EXECUTE, USING, bound args), FETCH ABSOLUTE INTO, dynamic EXECUTE with multiple USING args and dollar-quoted strings, RETURN QUERY EXECUTE USING, RAISE with multiple format arguments, dynamic loops with USING, and static SQL statement fixtures asserting stmt_execsql (sql_expression). Parse-tree snapshots updated to reflect removed intermediate keyword nodes and new sql_expression placements. |
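
To make the alias-wrapper idea in the summary concrete, the following sketch shows the general shape such a hidden wrapper rule could take. It is an assumption about the pattern, not the PR's actual grammar.js: `seq`, `alias`, and `$` are replaced by small stand-ins so the snippet runs in plain Node, and `stmt_if_head`, `kw_if`, and `kw_then` are hypothetical names. Only `_sql_until_then`, `_sql_then_expr`, and `sql_expression` are taken from the summary above.

```javascript
// Stand-ins for the Tree-sitter DSL helpers so this sketch runs in plain Node.
// In a real grammar.js, `seq` and `alias` are supplied by the tree-sitter CLI.
const seq = (...members) => ({ type: 'SEQ', members });
const alias = (content, value) => ({ type: 'ALIAS', content, value });

// `$` normally resolves rule names; a Proxy that echoes the property name
// is enough to inspect the shape of the rules here.
const $ = new Proxy({}, { get: (_, name) => ({ type: 'SYMBOL', name }) });

const rules = {
  // Hidden wrapper: the external token is renamed to the visible
  // sql_expression node, so injection queries keep matching it.
  _sql_then_expr: $ => alias($._sql_until_then, $.sql_expression),

  // Simplified, hypothetical IF head that requests the THEN-terminated fragment.
  stmt_if_head: $ => seq($.kw_if, $._sql_then_expr, $.kw_then),
};

const wrapper = rules._sql_then_expr($);
console.log(wrapper.content.name); // → _sql_until_then
console.log(wrapper.value.name);   // → sql_expression
```

Because every wrapper aliases to the same visible name, consumers of the parse tree see `sql_expression` regardless of which external token the scanner produced.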

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble tokens, split the sea—

One body grows to many ends,
THEN, LOOP, INTO find their key.
No more errors where SQL bends,
Grammar hops, and parsing mends.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(plpgsql): use context-specific SQL scanner tokens' directly describes the main change: replacing a single generic SQL-body token with context-specific external scanner tokens. |
| Linked Issues check | ✅ Passed | The PR fulfills all key objectives from issue #31: replaces single generic _sql_body with context-specific external tokens (statement, semicolon, then, when, loop, assignment, range, etc.), updates grammar rules to select appropriate externals per construct, maintains sql_expression compatibility via aliases, and provides comprehensive regression test coverage for known failing cases. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to the stated objective: scanner refactoring, grammar rule updates for context-specific token routing, test corpus updates for regression coverage, and build script regeneration. No extraneous code modifications or formatting changes are evident. |


@AlexS778

Contributor Author

@gmr part of this was already fixed in #10 (comment), which dealt with the PL/pgSQL scanner stopping too early at NULL/INTO/etc.

I found some other cases that gave me errors and fixed them here.

@AlexS778

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
script/generate-plpgsql-grammar.js (2)

213-236: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the ALIAS FOR arm before variable declarations in the generator.

The checked-in plpgsql/grammar.js still relies on the alias arm coming first so ALIAS is not consumed as a type name. This template has the arms reversed, so the next regeneration will reintroduce that ambiguity.

Suggested fix
     decl_statement: $ => choice(
+      // Alias declaration: name ALIAS FOR target ;
+      seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'),
       // Variable declaration: name [CONSTANT] type [COLLATE collation] [NOT NULL] [DEFAULT|:=|= expr] ;
       seq(
         $.decl_varname,
         optional($.kw_constant),
         $.decl_datatype,
         optional($.decl_collate),
         optional(seq($.kw_not, $.kw_null)),
         optional($.decl_defval),
         ';'
       ),
-      // Alias declaration: name ALIAS FOR target ;
-      seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'),
       // Cursor declaration: name [NO SCROLL | SCROLL] CURSOR [(args)] FOR|IS query ;
       seq(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/generate-plpgsql-grammar.js` around lines 213 - 236, The choice arms
inside decl_statement are ordered so the variable-declaration arm can consume
the token ALIAS as a type; move the Alias declaration arm (the
seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';')) to appear
before the variable declaration arm (the seq that includes $.decl_datatype,
optional($.decl_collate), etc.) so that $.kw_alias is recognized as the ALIAS
form instead of being parsed as a datatype; update the ordering in
decl_statement accordingly.

574-581: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require the cursor name even when FROM is omitted.

This template makes the cursor identifier part of the optional FROM clause, so regenerating from it would reject valid forms like FETCH NEXT cur INTO target; and only accept the FROM cur variant.

Suggested fix
     stmt_fetch: $ => seq(
       $.kw_fetch,
       optional($.fetch_direction),
-      optional(seq($.kw_from, $.any_identifier)),
+      optional($.kw_from),
+      $.any_identifier,
       $.kw_into,
       $.into_target,
       ';'
     ),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/generate-plpgsql-grammar.js` around lines 574 - 581, The grammar rule
stmt_fetch currently nests the cursor identifier inside optional(seq($.kw_from,
$.any_identifier)), making the cursor name optional when FROM is omitted; change
the sequence so the cursor identifier ($.any_identifier) is required regardless
of the presence of $.kw_from — e.g. keep $.kw_fetch and
optional($.fetch_direction), make $.kw_from optional by itself, then require
$.any_identifier before $.kw_into and $.into_target, so symbols to edit are
stmt_fetch, $.fetch_direction, $.kw_from, and $.any_identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plpgsql/src/scanner.c`:
- Around line 433-438: The code currently treats any top-level '<<' as a
block-label start even when SQL content has already begun; update the
conditional in the scanner so that the '<<' label-start handling only triggers
when at top-level (depth == 0) AND no SQL content has been emitted yet
(has_content is false). Concretely, modify the check around lexer->lookahead ==
'<' (the block that calls lexer->mark_end, lexer->advance and return
finish_token(lexer, mode.symbol, has_content)) to include a guard that
!has_content (or equivalent "fragment-start" state) before consuming '<<' as a
label; apply the same change to the analogous block that handles '<<' at the
second location (the 628-635 region) so 'a << 1' and similar SQL expressions are
not mis-parsed as label boundaries.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b183ca6-5137-4358-ad5e-78eadd5ec90c

📥 Commits

Reviewing files that changed from the base of the PR and between 33e140f and 34d72ac.

📒 Files selected for processing (13)
  • plpgsql/grammar.js
  • plpgsql/src/grammar.json
  • plpgsql/src/node-types.json
  • plpgsql/src/parser.c
  • plpgsql/src/scanner.c
  • plpgsql/test/corpus/control_flow.txt
  • plpgsql/test/corpus/cursors.txt
  • plpgsql/test/corpus/diagnostics.txt
  • plpgsql/test/corpus/dynamic_sql.txt
  • plpgsql/test/corpus/exceptions.txt
  • plpgsql/test/corpus/loops.txt
  • plpgsql/test/corpus/statements.txt
  • script/generate-plpgsql-grammar.js

Comment thread plpgsql/src/scanner.c Outdated
Only treat top-level `<<` as a PL/pgSQL label opener when the SQL fragment has
not emitted any content yet. This prevents expressions like `a << 1` from being
split as if `<<` started a block label.

Apply the same guard in both scanner paths and add corpus coverage for `<<` in
IF conditions and static SQL statements.
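
The guard described in this commit can be modelled in a few lines. This is a simplified JavaScript sketch under stated assumptions, not the code in scanner.c: `scanFragment` and its return shape are hypothetical, the fragment is assumed to end at a top-level semicolon, and strings and comments are ignored for brevity.

```javascript
// Simplified model: decide whether input starts a block label or an SQL fragment.
// `hasContent` mirrors the idea of "no SQL content emitted yet" from the review.
function scanFragment(input) {
  let depth = 0;
  let hasContent = false;
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      depth = Math.max(0, depth - 1);
    } else if (ch === ';' && depth === 0) {
      break;
    } else if (ch === '<' && input[i + 1] === '<' && depth === 0 && !hasContent) {
      // Nothing emitted yet: `<<` opens a block label, so no SQL fragment starts here.
      return { kind: 'label', text: '' };
    }
    // Any non-whitespace character means the fragment has started, after
    // which `<<` is treated as the shift operator.
    if (!/\s/.test(ch)) hasContent = true;
    i++;
  }
  return { kind: 'sql', text: input.slice(0, i).trim() };
}

console.log(scanFragment('a << 1;'));        // kind 'sql', text 'a << 1'
console.log(scanFragment('<<outer>> LOOP')); // kind 'label'
```

Without the `!hasContent` condition, the first call would stop at `<<` and split `a << 1` in two, which is the misparse the review comment describes.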
@AlexS778

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@AlexS778

AlexS778 commented May 26, 2026

Contributor Author

@gmr
Hey, since I've been working through recent PL/pgSQL generator/scanner issues and other stuff, would you be open to adding me as a maintainer with permissions to merge PRs?

I’m happy to keep things conservative: work through PRs, keep changes reviewable, not touch releases unless you want me to, and coordinate before any big grammar/generator changes.

Right now I’m doing everything from a fork, and it’s getting difficult to keep branches, generated files, and fixes in sync. Having merge access would let me cleanly land fixes like #30/#32, keep the repo moving while you’re busy, and reduce the amount of coordination needed from you.

My coworker and I are having trouble using Tree-sitter in Zed, because parts of the code come from my fork and parts from this repo. So yeah, it's not ideal.

Totally fine if you’d prefer a limited setup first - for example, merge/write access but releases stay with you, or a trial period.

Development

Successfully merging this pull request may close these issues.

PL/pgSQL scanner should use context-specific SQL fragment tokens instead of one global SQL-body heuristic

1 participant