This file contains important information about the sql-metadata repository for AI agents and developers working on the codebase.
sql-metadata is a Python library that parses SQL queries and extracts metadata such as:
- Tables referenced in queries
- Columns used
- Query types (SELECT, INSERT, UPDATE, etc.)
- WITH clause (CTE) definitions
- Subqueries and aliases
Technology Stack:
- Python 3.10+
- sqlparse library for tokenization
- Poetry for dependency management
- pytest for testing
- flake8 and pylint for linting
sql-metadata/
├── sql_metadata/ # Main package
│ ├── parser.py # Core Parser class
│ ├── token.py # SQLToken and EmptyToken classes
│ ├── keywords_lists.py # SQL keyword definitions
│ └── __init__.py
├── test/ # Test suite
│ ├── test_with_statements.py
│ ├── test_getting_tables.py
│ ├── test_getting_columns.py
│ └── ... (30+ test files)
├── pyproject.toml # Poetry configuration
├── Makefile # Common commands
├── .flake8 # Flake8 configuration
└── README.md
poetry install # Install dependencies
make test # Run all tests with pytest
poetry run pytest -vv # Verbose test output
poetry run pytest -x # Stop on first failure
poetry run pytest test/test_with_statements.py::test_name # Run specific test
make lint # Run flake8 and pylint
poetry run flake8 sql_metadata/
poetry run pylint sql_metadata/
make format # Run black formatter
make coverage # Run tests with coverage report
Important: The project has a 100% test coverage requirement (fail_under = 100 in pyproject.toml).
- Max line length: Not explicitly set (defaults apply)
- Max complexity: 8 (C901 error for complexity > 8)
- Exceptions: Use # noqa: C901 for complex but necessary functions
When a function legitimately needs higher complexity, suppress the warning:
@property
def complex_method(self) -> Type:  # noqa: C901
    """Method with necessary complexity"""
Examples in codebase:
- parser.py:134: tokens property
- parser.py:450: with_names property
- parser.py:822: _resolve_nested_query method
The Parser class has # pylint: disable=R0902 to suppress "too many instance attributes" warnings.
Located in sql_metadata/parser.py
The Parser class uses sqlparse to tokenize SQL and then processes tokens to extract metadata.
Key Properties (lazy evaluation):
- tokens - Tokenized SQL
- tables - Tables referenced in query
- columns - Columns referenced
- with_names - CTE (Common Table Expression) names
- with_queries - CTE definitions
- query_type - Type of SQL query
- subqueries - Subquery definitions
Important Pattern: Most properties cache their results:
@property
def example(self):
    if self._example is not None:
        return self._example
    # ... computation ...
    self._example = result
    return self._example
The parser processes SQLToken objects which have properties like:
- value - The token text
- normalized - Uppercased token value
- next_token - Next token in sequence
- previous_token - Previous token
- next_token_not_comment - Next non-comment token
- is_as_keyword - Boolean flag
- is_with_query_end - Boolean flag for WITH clause boundaries
- token_type - Type classification
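To make the linked-token model above concrete, here is a minimal sketch of a doubly linked token chain. ToyToken, its is_comment flag, and the link helper are hypothetical names invented for illustration; they are not the library's SQLToken, and the real class may represent the end of the chain differently (this sketch uses None).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyToken:
    """Hypothetical stand-in for SQLToken; not the library's class."""

    value: str
    is_comment: bool = False
    # Neighbour pointers are excluded from repr/compare to avoid recursion
    next_token: Optional[ToyToken] = field(default=None, repr=False, compare=False)
    previous_token: Optional[ToyToken] = field(default=None, repr=False, compare=False)

    @property
    def normalized(self) -> str:
        # Uppercased token text
        return self.value.upper()

    @property
    def next_token_not_comment(self) -> Optional[ToyToken]:
        # Walk forward past comment tokens until a real token or the end
        token = self.next_token
        while token is not None and token.is_comment:
            token = token.next_token
        return token


def link(tokens: list[ToyToken]) -> Optional[ToyToken]:
    """Wire up next/previous pointers and return the first token."""
    for left, right in zip(tokens, tokens[1:]):
        left.next_token = right
        right.previous_token = left
    return tokens[0] if tokens else None


first = link([
    ToyToken("select"),
    ToyToken("/* hint */", is_comment=True),
    ToyToken("a"),
])
```

Here first.next_token is the comment, while first.next_token_not_comment skips it and lands on the token "a".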
Located in parser.py:450 (with_names property)
Key Logic:
- Iterates through tokens looking for "WITH" keywords
- Enters a while loop that stays in WITH block until finding ending keywords
- Processes each CTE by finding "AS" keywords and extracting names
- Advances through tokens until finding is_with_query_end
- Checks if at end of WITH block using WITH_ENDING_KEYWORDS
WITH_ENDING_KEYWORDS (from keywords_lists.py):
- UPDATE
- SELECT
- DELETE
- REPLACE
- INSERT
Common Pitfall: Malformed SQL with consecutive AS keywords (e.g., WITH a AS (...) AS b) can cause infinite loops if not properly detected and handled.
Solution Pattern: After processing a WITH clause, always check if the next token is another AS keyword (which indicates malformed SQL) and raise ValueError("This query is wrong").
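The scanning logic and the malformed-SQL check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code in parser.py: toy_with_names and _skip_parenthesised are hypothetical helpers, the regex tokenizer is a crude substitute for sqlparse, and CTEs with column lists or string literals containing parentheses are not handled.

```python
import re

# Same keywords as WITH_ENDING_KEYWORDS in keywords_lists.py
WITH_ENDING_KEYWORDS = {"UPDATE", "SELECT", "DELETE", "REPLACE", "INSERT"}


def _skip_parenthesised(tokens: list[str], start: int) -> int:
    """Return the index just past the parenthesised body starting at `start`."""
    if start >= len(tokens) or tokens[start] != "(":
        # AS must be followed by a parenthesised CTE body
        raise ValueError("This query is wrong")
    depth = 0
    for index in range(start, len(tokens)):
        if tokens[index] == "(":
            depth += 1
        elif tokens[index] == ")":
            depth -= 1
            if depth == 0:
                return index + 1
    return len(tokens)  # unbalanced parentheses: consume the rest


def toy_with_names(sql: str) -> list[str]:
    """Collect CTE names from a WITH block (hypothetical, simplified)."""
    tokens = re.findall(r"\w+|[(),]", sql)
    names: list[str] = []
    if not tokens or tokens[0].upper() != "WITH":
        return names
    i = 1
    while i < len(tokens):
        upper = tokens[i].upper()
        if upper in WITH_ENDING_KEYWORDS:
            break  # the main statement starts here, so the WITH block is over
        if upper == "AS":
            names.append(tokens[i - 1])  # the name sits right before AS
            i = _skip_parenthesised(tokens, i + 1)  # always moves forward
            if i < len(tokens) and tokens[i].upper() == "AS":
                # Consecutive AS after a CTE body: malformed, fail fast
                raise ValueError("This query is wrong")
        else:
            i += 1  # every other branch advances too
    return names


print(toy_with_names("WITH a AS (SELECT 1), b AS (SELECT 2) SELECT * FROM a, b"))
# ['a', 'b']
```

Every branch of the loop either advances the index or raises, so the malformed input WITH a AS (SELECT 1) AS b ... ends in a ValueError instead of spinning forever.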
The codebase has established patterns for handling malformed SQL:
- Detect the malformed pattern early
- Raise ValueError("This query is wrong") - This is the standard error message
- Use pytest.raises in tests:
parser = Parser(malformed_query)
with pytest.raises(ValueError, match="This query is wrong"):
    parser.tables
Examples:
- test_with_statements.py:500-528: Tests for malformed WITH queries
- parser.py:679: Detection in _handle_with_name_save
When processing tokens in loops:
- Always ensure the token advances in each iteration
- Check for malformed patterns before looping back
- Have clear exit conditions
Pattern:
while condition and token.next_token:
    if some_pattern:
        # ... process ...
        if exit_condition:
            break
        else:
            # Always advance token to prevent infinite loop
            token = token.next_token
    else:
        token = token.next_token
Tests are organized by feature/SQL clause:
- test_with_statements.py - WITH clause (CTEs)
- test_getting_tables.py - Table extraction
- test_getting_columns.py - Column extraction
- test_query_type.py - Query type detection
- Database-specific: test_mssql_server.py, test_postgress.py, test_hive.py, etc.
def test_descriptive_name():
    """Optional docstring explaining the test"""
    query = """SQL query here"""
    parser = Parser(query)
    assert parser.tables == ["expected", "tables"]

def test_malformed_case():
    # Comment explaining what's being tested and why
    # Include issue reference if applicable: # https://github.com/macbre/sql-metadata/issues/XXX
    query = """Malformed SQL"""
    parser = Parser(query)
    with pytest.raises(ValueError, match="This query is wrong"):
        parser.tables
- Every new feature needs tests
- Every bug fix needs a test that would have caught the bug
- Coverage must remain at 100%
Following the established pattern:
Brief description of change
Resolves #issue-number.
More detailed explanation of what was wrong and why.
The issue was: [explain the problem]
This fix:
- Bullet point 1
- Bullet point 2
- Bullet point 3
Co-Authored-By: Claude <noreply@anthropic.com>
- Feature: feature/description
- Bug fix: fix/description
- Example: fix/parser-tables-hangs
1fbfee4 Drop Python 3.9 support (#604)
d0e6fc6 Parser.columns drops column named 'source' when it is the last column in a SELECT statement (#603)
Symptoms: Parser never returns when calling .tables or other properties
Common Causes:
- Token not advancing in a while loop
- Malformed SQL not detected early enough
- Missing exit condition in nested loops
Solution Checklist:
- Ensure token advances in all loop branches
- Check for malformed SQL patterns and raise ValueError
- Verify exit conditions are reachable
- Add timeout test to verify fix
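One way to add such a timeout check without extra dependencies is a small context manager built on the standard signal module. This is a sketch, not something that exists in the repository: time_limit and spin_forever are hypothetical names, and the approach assumes a Unix platform and code running in the main thread, since SIGALRM is unavailable elsewhere.

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`."""

    def _on_alarm(signum, frame):
        raise TimeoutError(f"block exceeded {seconds} seconds")

    # Install the handler and start a one-shot real-time timer
    previous_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore whatever handler was there before
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous_handler)


def spin_forever():
    """Stand-in for a parser call that never returns."""
    while True:
        pass


try:
    with time_limit(0.2):
        spin_forever()
    timed_out = False
except TimeoutError:
    timed_out = True
```

In a regression test the same wrapper could surround the property access, for example with time_limit(5): around parser.tables inside the pytest.raises block, so a reintroduced hang fails the test instead of stalling the run.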
When it happens: Function exceeds complexity threshold of 8
Solutions:
- Refactor to reduce complexity (preferred)
- Use # noqa: C901 if complexity is necessary (see examples in codebase)
Cause: Missing test coverage for new code paths
Solution:
poetry run pytest -vv --cov=sql_metadata --cov-report=term-missing
This shows which lines are not covered.
- Lines 134-200: Token processing and initialization
- Lines 450-482: WITH clause parsing (with_names property)
- Lines 484-580: WITH queries extraction
- Lines 669-700: _handle_with_name_save helper method
- Lines 822+: Nested query resolution
Defines SQL keyword sets:
- WITH_ENDING_KEYWORDS (line 40)
- SUBQUERY_PRECEDING_KEYWORDS
- TABLE_ADJUSTMENT_KEYWORDS
- KEYWORDS_BEFORE_COLUMNS
- SUPPORTED_QUERY_TYPES
Comprehensive tests for WITH clause parsing:
- Valid multi-CTE queries
- CTEs with column definitions
- Nested WITH statements
- Malformed SQL detection (lines 500-540)
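timeout 5 poetry run pytest test/test_file.py::test_name -vv
timeout 3 poetry run python -c "from sql_metadata import Parser; Parser(query).tables"
Here query is a placeholder for the SQL string under test. If the command is killed by timeout (exit status 124), there is still an infinite loop.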
Add debug prints in parser.py:
print(f"Token: {token.value}, Next: {token.next_token.value if token.next_token else None}")
- sqlparse (>=0.4.1, <0.6.0): SQL tokenization
- pytest (^8.4.2): Testing framework
- pytest-cov (^7.0.0): Coverage reporting
- black (^25.11): Code formatting
- flake8 (^7.3.0): Linting
- pylint (^3.3.9): Advanced linting
- coverage (^7.10): Coverage measurement
- Current Version: 2.19.0
- Python Support: ^3.10 (Python 3.9 support dropped in #604)
- License: MIT
- Homepage: https://github.com/macbre/sql-metadata
Always cache property results:
@property
def my_property(self):
    if self._my_property is not None:
        return self._my_property
    self._my_property = self._compute_property()
    return self._my_property
In loops, ensure every branch advances:
while flag and token.next_token:
    if pattern_match:
        # ... process ...
        if should_exit:
            flag = False  # clears the loop condition, so the loop ends
        else:
            token = token.next_token  # MUST advance
    else:
        token = token.next_token  # MUST advance
Use consistent error messages:
- "This query is wrong" - for malformed SQL
- Keep messages simple and consistent with existing patterns
Reference issues in test comments:
def test_issue_fix():
    # Test for issue #556 - malformed WITH query causes infinite loop
    # https://github.com/macbre/sql-metadata/issues/556
    ...  # test body goes here
# Setup
poetry install
# Test
make test # All tests
poetry run pytest test/test_with_statements.py -vv # Specific file
poetry run pytest -x # Stop on first failure
poetry run pytest -k "test_name" # Run by name pattern
# Quality
make lint # Lint check
make format # Format code
make coverage # Coverage report
# Debug
poetry run python -c "from sql_metadata import Parser; print(Parser('SELECT * FROM t').tables)"
- Consider refactoring the with_names property to bring its complexity down to 8 or below, so the # noqa: C901 marker can be removed
- Consider extracting token advancement logic into helper methods
- Poetry dev-dependencies section is deprecated (migrate to tool.poetry.group.dev.dependencies in pyproject.toml)
- Consider adding type hints more comprehensively
- Some test files could be consolidated
2026-03-04 - Initial creation after fixing issue #556 (infinite loop in WITH statement parsing)