Skip to content

Conversation

@HazelYuAhiru
Copy link
Collaborator

Overview:

This PR implements the query parser and adds tests for converting an input SQL query string into our custom AST structure (SQL → JSON via mosql → custom AST). The base test case passes, and all additional test cases run without errors.

Code Changes:

  • Added a base parsing test in tests/test_query_parser.py.
  • Implemented parse() and its helper functions.

colinthebomb1
colinthebomb1 previously approved these changes Dec 1, 2025
Copy link
Collaborator

@colinthebomb1 colinthebomb1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! let one small comment

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a query parser that converts SQL strings to a custom AST structure via the mosql library. It adds a basic parsing test and implements the core parse() method along with helper functions for parsing different SQL clauses.

Key Changes:

  • Implements QueryParser.parse() method that converts SQL to AST via mosql intermediate JSON
  • Adds helper methods for parsing SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY clauses
  • Implements expression parsing, alias resolution, and operator/join type normalization
  • Adds comprehensive test case test_basic_parse() that validates parsing of complex queries with JOINs, aggregations, and multiple clauses

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

File Description
tests/test_query_parser.py Adds test_basic_parse() function with comprehensive test case covering JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and OFFSET; updates imports to include new node types
core/query_parser.py Implements complete query parser with parse() method and 10+ helper methods for clause parsing, expression parsing, alias resolution, and type normalization

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 224 to 255
def resolve_aliases(self, expr: Node) -> Node:
if isinstance(expr, OperatorNode):
# Recursively resolve aliases in operator operands
left = self.resolve_aliases(expr.children[0])
right = self.resolve_aliases(expr.children[1])
return OperatorNode(left, expr.name, right)
elif isinstance(expr, FunctionNode):
# Check if this function matches an aliased function from SELECT
if expr.alias is None:
for alias, aliased_expr in self.aliases.items():
if isinstance(aliased_expr, FunctionNode):
if (expr.name == aliased_expr.name and
len(expr.children) == len(aliased_expr.children) and
all(expr.children[i] == aliased_expr.children[i]
for i in range(len(expr.children)))):
# This function matches an aliased one, use the alias
expr.alias = alias
break
return expr
elif isinstance(expr, ColumnNode):
# Check if this column matches an aliased column from SELECT
if expr.alias is None:
for alias, aliased_expr in self.aliases.items():
if isinstance(aliased_expr, ColumnNode):
if (expr.name == aliased_expr.name and
expr.parent_alias == aliased_expr.parent_alias):
# This column matches an aliased one, use the alias
expr.alias = alias
break
return expr
else:
return expr
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The alias resolution in the HAVING clause iterates through all aliases to find matching FunctionNode or ColumnNode instances (lines 233-241, 246-252). For large SELECT clauses with many aliases, this could be inefficient. Consider optimizing by creating a separate lookup structure or only checking relevant aliases.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is OK for now.


# [2] Our new code
# Any (JSON) -> AST (QueryNode)
self.aliases = {}
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self.aliases instance variable is initialized within the parse() method rather than in __init__. This means the aliases dictionary persists between parse calls and could lead to stale alias references affecting subsequent parses. Consider either: 1) initializing self.aliases = {} in a proper __init__ method, or 2) passing aliases as a parameter through the helper methods instead of using instance state.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point which I also noticed. If we use the instance member to store aliases, this instance is not safe for multi-threading. [1] One way to solve this issue is through a helper parameter being passed through call the following function calls through the function arguments, similar to what we use as the memo in the rewriter engine. [2] Or we can make sure the QueryParser is declared as non-thread-safe, and every place we need to use the parse function, we create a new instance of the parser, i.e., parser = QueryParser(), and if we use this approach, it makes more sense to init the self.aliases = {} in the __init__ function instead of this parse function. @HazelYuAhiru , please make a decision with coordination with @colinthebomb1 's PR, if he uses the same approach (some internal state shared by different functions as an instance-level state), we can use [2]. If he already used [1], you may change to use [1]. If you decide to use [2], please find a proper annotation tag in Python to say it is non-thread-safe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'v updated the parser to use a thread-safe approach

from core.ast.node import (
Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'SubqueryNode' is not used.
Import of 'VarNode' is not used.
Import of 'VarSetNode' is not used.

Suggested change
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, JoinNode

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed this version of implementation does not consider subquery. We will consider it in the next iteration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, in this case I will leave a TODO comment above as reminders.

Copy link
Contributor

@baiqiushi baiqiushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HazelYuAhiru for the PR! It looks fantastic overall. Ony a few small changes are needed. Please take care of them. Thanks!

from core.ast.node import (
Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed this version of implementation does not consider subquery. We will consider it in the next iteration.


# [2] Our new code
# Any (JSON) -> AST (QueryNode)
self.aliases = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point which I also noticed. If we use the instance member to store aliases, this instance is not safe for multi-threading. [1] One way to solve this issue is through a helper parameter being passed through call the following function calls through the function arguments, similar to what we use as the memo in the rewriter engine. [2] Or we can make sure the QueryParser is declared as non-thread-safe, and every place we need to use the parse function, we create a new instance of the parser, i.e., parser = QueryParser(), and if we use this approach, it makes more sense to init the self.aliases = {} in the __init__ function instead of this parse function. @HazelYuAhiru , please make a decision with coordination with @colinthebomb1 's PR, if he uses the same approach (some internal state shared by different functions as an instance-level state), we can use [2]. If he already used [1], you may change to use [1]. If you decide to use [2], please find a proper annotation tag in Python to say it is non-thread-safe.

HazelYuAhiru and others added 5 commits December 16, 2025 16:21
first batch fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants