fix(plpgsql): use context-specific SQL scanner tokens#32
Conversation
Replace the single global PL/pgSQL SQL-body scanner token with context-specific external tokens for the different embedded SQL contexts. This lets the scanner stop SQL fragments at the correct delimiter for each grammar rule, such as THEN, LOOP, semicolon, INTO/USING, comma, range, or assignment boundaries. Keep the visible node shape as sql_expression via aliases so existing SQL injection queries continue to work. Add corpus coverage for INSERT INTO, SELECT INTO FROM, RETURNING INTO, dynamic EXECUTE USING, RETURN QUERY EXECUTE USING, FOR EXECUTE USING, cursor arguments, RAISE arguments, and dollar-quoted dynamic SQL containing semicolons. Regenerate only the PL/pgSQL tree-sitter artifacts.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThe PL/pgSQL grammar and scanner are refactored to replace a single generic SQL-body token with 16+ context-specific external tokens that terminate at PL/pgSQL-relevant delimiters. The scanner now uses a ScanMode mechanism to determine the correct terminator set for each context, and all statement rules are rewritten to consume the appropriate context-specific wrapper nodes instead of the generic sql_expression. ChangesContext-Specific SQL Fragment Tokens
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@gmr part of this was fixed in here #10 (comment) about I found some other cases which gave me errors, fixed them. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
script/generate-plpgsql-grammar.js (2)
213-236:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the
ALIAS FORarm before variable declarations in the generator.The checked-in
plpgsql/grammar.jsstill relies on the alias arm coming first soALIASis not consumed as a type name. This template has the arms reversed, so the next regeneration will reintroduce that ambiguity.Suggested fix
decl_statement: $ => choice( + // Alias declaration: name ALIAS FOR target ; + seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'), // Variable declaration: name [CONSTANT] type [COLLATE collation] [NOT NULL] [DEFAULT|:=|= expr] ; seq( $.decl_varname, optional($.kw_constant), $.decl_datatype, optional($.decl_collate), optional(seq($.kw_not, $.kw_null)), optional($.decl_defval), ';' ), - // Alias declaration: name ALIAS FOR target ; - seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';'), // Cursor declaration: name [NO SCROLL | SCROLL] CURSOR [(args)] FOR|IS query ; seq(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@script/generate-plpgsql-grammar.js` around lines 213 - 236, The choice arms inside decl_statement are ordered so the variable-declaration arm can consume the token ALIAS as a type; move the Alias declaration arm (the seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier, ';')) to appear before the variable declaration arm (the seq that includes $.decl_datatype, optional($.decl_collate), etc.) so that $.kw_alias is recognized as the ALIAS form instead of being parsed as a datatype; update the ordering in decl_statement accordingly.
574-581:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire the cursor name even when
FROMis omitted.This template makes the cursor identifier part of the optional
FROMclause, so regenerating from it would reject valid forms likeFETCH NEXT cur INTO target;and only accept theFROM curvariant.Suggested fix
stmt_fetch: $ => seq( $.kw_fetch, optional($.fetch_direction), - optional(seq($.kw_from, $.any_identifier)), + optional($.kw_from), + $.any_identifier, $.kw_into, $.into_target, ';' ),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@script/generate-plpgsql-grammar.js` around lines 574 - 581, The grammar rule stmt_fetch currently nests the cursor identifier inside optional(seq($.kw_from, $.any_identifier)), making the cursor name optional when FROM is omitted; change the sequence so the cursor identifier ($.any_identifier) is required regardless of the presence of $.kw_from — e.g. keep $.kw_fetch and optional($.fetch_direction), make $.kw_from optional by itself, then require $.any_identifier before $.kw_into and $.into_target, so symbols to edit are stmt_fetch, $.fetch_direction, $.kw_from, and $.any_identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plpgsql/src/scanner.c`:
- Around line 433-438: The code currently treats any top-level '<<' as a
block-label start even when SQL content has already begun; update the
conditional in the scanner so that the '<<' label-start handling only triggers
when at top-level (depth == 0) AND no SQL content has been emitted yet
(has_content is false). Concretely, modify the check around lexer->lookahead ==
'<' (the block that calls lexer->mark_end, lexer->advance and return
finish_token(lexer, mode.symbol, has_content)) to include a guard that
!has_content (or equivalent "fragment-start" state) before consuming '<<' as a
label; apply the same change to the analogous block that handles '<<' at the
second location (the 628-635 region) so 'a << 1' and similar SQL expressions are
not mis-parsed as label boundaries.
---
Outside diff comments:
In `@script/generate-plpgsql-grammar.js`:
- Around line 213-236: The choice arms inside decl_statement are ordered so the
variable-declaration arm can consume the token ALIAS as a type; move the Alias
declaration arm (the seq($.decl_varname, $.kw_alias, $.kw_for, $.any_identifier,
';')) to appear before the variable declaration arm (the seq that includes
$.decl_datatype, optional($.decl_collate), etc.) so that $.kw_alias is
recognized as the ALIAS form instead of being parsed as a datatype; update the
ordering in decl_statement accordingly.
- Around line 574-581: The grammar rule stmt_fetch currently nests the cursor
identifier inside optional(seq($.kw_from, $.any_identifier)), making the cursor
name optional when FROM is omitted; change the sequence so the cursor identifier
($.any_identifier) is required regardless of the presence of $.kw_from — e.g.
keep $.kw_fetch and optional($.fetch_direction), make $.kw_from optional by
itself, then require $.any_identifier before $.kw_into and $.into_target, so
symbols to edit are stmt_fetch, $.fetch_direction, $.kw_from, and
$.any_identifier.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5b183ca6-5137-4358-ad5e-78eadd5ec90c
📒 Files selected for processing (13)
plpgsql/grammar.jsplpgsql/src/grammar.jsonplpgsql/src/node-types.jsonplpgsql/src/parser.cplpgsql/src/scanner.cplpgsql/test/corpus/control_flow.txtplpgsql/test/corpus/cursors.txtplpgsql/test/corpus/diagnostics.txtplpgsql/test/corpus/dynamic_sql.txtplpgsql/test/corpus/exceptions.txtplpgsql/test/corpus/loops.txtplpgsql/test/corpus/statements.txtscript/generate-plpgsql-grammar.js
Only treat top-level `<<` as a PL/pgSQL label opener when the SQL fragment has not emitted any content yet. This prevents expressions like `a << 1` from being split as if `<<` started a block label. Apply the same guard in both scanner paths and add corpus coverage for `<<` in IF conditions and static SQL statements.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@gmr I’m happy to keep things conservative: work through PRs, keep changes reviewable, not touch releases unless you want me to, and coordinate before any big grammar/generator changes. Right now I’m doing everything from a fork, and it’s getting difficult to keep branches, generated files, and fixes in sync. Having merge access would let me cleanly land fixes like #30/#32, keep the repo moving while you’re busy, and reduce the amount of coordination needed from you. Me and my coworker are having troubles using tree sitter in zed, because parts of the code are from my fork, parts from this repo and etc. So yeah it's not ideal. Totally fine if you’d prefer a limited setup first - for example, merge/write access but releases stay with you, or a trial period. |
Summary
Refactors the PL/pgSQL external scanner to use context-specific SQL fragment tokens instead of one global SQL-body heuristic.
This fixes valid PL/pgSQL where words like
INTO,USING,NULL, orLOOPare SQL syntax in one context but PL/pgSQL delimiters in another.Fixed examples
What changed
Added context-specific scanner modes for SQL fragments ending at THEN, LOOP, ;, INTO/USING, comma, range, and assignment boundaries.
Updated PL/pgSQL grammar rules to request the correct scanner mode per context.
Kept the visible node as (sql_expression) via aliases, so existing SQL injections continue to work.
Added corpus coverage for delimiter-sensitive PL/pgSQL cases.
Regenerated only PL/pgSQL Tree-sitter artifacts.
Testing
cd plpgsql && ../node_modules/.bin/tree-sitter test.pl.sqlfilescloses #31
Summary by CodeRabbit
Bug Fixes
Tests