Separate aggregation from measure definition #22
Measures become named row-level expressions; aggregation is specified at query time using colon syntax (e.g., `revenue:sum`, `*:count`). Custom aggregations are first-class model entities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

This PR introduces explicit query-time aggregation syntax (colon notation).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant FormulaParser as Formula Parser
    participant Enrichment as Enrichment Engine
    participant QueryEngine as Query Engine
    participant SQLGen as SQL Generator
    participant Database as Database
    Client->>FormulaParser: parse_formula("revenue:sum + cost:sum")
    FormulaParser->>FormulaParser: _preprocess_agg_refs()
    FormulaParser->>FormulaParser: Replace "revenue:sum" & "cost:sum" with placeholders
    FormulaParser->>FormulaParser: Build AST with AggregatedMeasureRef nodes
    FormulaParser->>Enrichment: Return FieldSpec with agg_refs map
    Client->>Enrichment: enrich_query(query with fields=["revenue:sum"])
    Enrichment->>Enrichment: Detect AggregatedMeasureRef
    Enrichment->>Enrichment: _ensure_aggregated_measure(measure="revenue", agg="sum")
    Enrichment->>Enrichment: Validate "sum" in allowed_aggregations
    Enrichment->>Enrichment: Create EnrichedMeasure(aggregation="sum", aggregation_def=...)
    Enrichment->>QueryEngine: Return enriched query with resolved measures
    QueryEngine->>SQLGen: Generate SQL for enriched query
    SQLGen->>SQLGen: Map aggregation="sum" via _AGG_FUNCTION_MAP
    SQLGen->>SQLGen: _build_agg(measure, aggregation="sum")
    SQLGen->>SQLGen: Return "SUM(revenue_col)" SQL fragment
    SQLGen->>Database: Execute full query
    Database->>Client: Return result rows
```
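The placeholder step in the diagram can be sketched as follows. This is a hypothetical, self-contained illustration of what `_preprocess_agg_refs` does (the simplified regex and `preprocess` function here are not the project's actual code): colon references are swapped for numeric placeholders before the AST is built.

```python
import re

# Simplified version of the colon-reference pattern (no dotted paths or args
# handling beyond a bare optional group; illustrative only).
_AGG_REF_RE = re.compile(r'(\*|[a-zA-Z_]\w*):([a-zA-Z_]\w*)(\([^)]*\))?')

def preprocess(formula: str) -> tuple[str, dict]:
    """Replace each measure:agg reference with a __aggN__ placeholder."""
    agg_refs: dict[str, tuple[str, str]] = {}

    def repl(m: re.Match) -> str:
        ph = f"__agg{len(agg_refs)}__"
        agg_refs[ph] = (m.group(1), m.group(2))  # (measure, aggregation)
        return ph

    return _AGG_REF_RE.sub(repl, formula), agg_refs

print(preprocess("revenue:sum + cost:sum"))
# ('__agg0__ + __agg1__', {'__agg0__': ('revenue', 'sum'), '__agg1__': ('cost', 'sum')})
```

The remaining formula (`__agg0__ + __agg1__`) can then be parsed by an ordinary expression parser, with the placeholders resolved back to `AggregatedMeasureRef` nodes.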
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (4)
.gitignore (1)
213-213: Consider adding a comment for the `benchmarks/` entry. While the pattern is self-explanatory, adding a brief comment (similar to line 214) would improve consistency with the documentation style used elsewhere in the file.
📝 Optional improvement

```diff
+# Benchmark results and data
 benchmarks/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore at line 213, Add a brief inline comment for the gitignore entry "benchmarks/" to match the documentation style used elsewhere (e.g., similar to the comment on the following line), e.g., a one-line explanatory comment above or beside the "benchmarks/" entry describing why benchmark artifacts are ignored; update the ".gitignore" around the "benchmarks/" token to include that short comment.

slayer/engine/enrichment.py (2)
332-344: Consider extracting canonical name computation to reduce duplication. The canonical name logic (`spec.aggregation_name if spec.measure_name == "*" else f"{spec.measure_name}_{spec.aggregation_name}"`) is duplicated between `_flatten_spec` (lines 332-336), `_ensure_aggregated_measure` (lines 147-150), and the query field processing loop (lines 446-449). Consider extracting this into a helper function.

♻️ Suggested helper function
```python
def _canonical_agg_name(measure_name: str, aggregation_name: str) -> str:
    """Compute canonical result column name for an aggregated measure."""
    if measure_name == "*":
        return aggregation_name  # *:count → "count"
    return f"{measure_name}_{aggregation_name}"  # revenue:sum → "revenue_sum"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@slayer/engine/enrichment.py` around lines 332 - 344, The canonical aggregated measure name logic is duplicated across _flatten_spec, _ensure_aggregated_measure, and the query field processing loop; extract it into a small helper function (e.g., _canonical_agg_name(measure_name: str, aggregation_name: str) -> str) that returns aggregation_name when measure_name == "*" otherwise f"{measure_name}_{aggregation_name}", then replace the inline expressions in _flatten_spec, _ensure_aggregated_measure, and the query field loop to call _canonical_agg_name to compute canonical_name before passing alias_key or returning the model-qualified name.
479-485: Validation may be redundant when an explicit time arg is provided. The validation at lines 479-485 checks whether `last_agg_time_column` is None after calling `_ensure_aggregated_measure`. However, `_ensure_aggregated_measure` (lines 183-187) already updates `last_agg_time_column` when `agg_args` contains an explicit time column. This means the error would never trigger for `revenue:last(ordered_at)`, since `last_agg_time_column` would already be set. This is actually correct behavior (no error when the time arg is explicit), but the code flow is subtle. Consider adding a comment explaining that `_ensure_aggregated_measure` may have already set `last_agg_time_column`.
Verify each finding against the current code and only fix it if needed. In `@slayer/engine/enrichment.py` around lines 479 - 485, The validation raising ValueError when spec.aggregation_name is "first" or "last" and last_agg_time_column is None is correct but subtle because _ensure_aggregated_measure may already set last_agg_time_column when agg_args includes an explicit time column; add a concise inline comment above this validation referencing _ensure_aggregated_measure and explaining that it can populate last_agg_time_column from explicit agg_args (so the check only triggers when no time column was provided anywhere), and mention the relevant symbols spec.aggregation_name, spec.measure_name, and last_agg_time_column to make the control flow explicit for future readers.slayer/core/formula.py (1)
122-133: Args parsing assumes no commas within argument values. The current implementation splits on commas without considering that argument values could potentially contain commas (e.g., in future string-based args). For the current use cases (`weight=quantity,ordered_at`), this works fine, but consider adding a comment noting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@slayer/core/formula.py` around lines 122 - 133, The args parsing block that builds agg_args and agg_kwargs from args_str (the code around agg_args, agg_kwargs, and the for loop parsing inner) naively splits on commas and therefore will break if argument values themselves contain commas; update the parsing code or—if changing behavior is not desired now—add a clear inline comment next to the agg_args/agg_kwargs parsing logic stating this limitation (that values cannot contain commas) and document expected formats (e.g., key=val or bare identifiers) so future maintainers know the constraint and consider switching to a more robust parser if needed.
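A standalone illustration of the limitation flagged above. The `parse_args` function here is hypothetical, mirroring the described `agg_args`/`agg_kwargs` split logic rather than reproducing the project code:

```python
def parse_args(args_str: str) -> tuple[list, dict]:
    """Naive comma-split parsing of aggregation args into (args, kwargs)."""
    agg_args: list[str] = []
    agg_kwargs: dict[str, str] = {}
    for part in args_str.split(","):  # NOTE: breaks if a value contains a comma
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            key, val = part.split("=", 1)
            agg_kwargs[key.strip()] = val.strip()
        else:
            agg_args.append(part)
    return agg_args, agg_kwargs

print(parse_args("weight=quantity,ordered_at"))
# (['ordered_at'], {'weight': 'quantity'})
```

For the documented formats (bare identifiers and `key=val` pairs) this works; any value that itself contained a comma would be silently split into two arguments, which is exactly the constraint worth documenting inline.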
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7f052283-3ecb-4320-b366-ce3b0e542832
📒 Files selected for processing (19)

- .gitignore
- CLAUDE.md
- docs/examples/02_sql_vs_dsl/sql_vs_dsl_nb.ipynb
- docs/examples/03_auto_ingest/auto_ingest_nb.ipynb
- docs/examples/04_time/time_nb.ipynb
- docs/examples/05_joined_measures/joined_measures_nb.ipynb
- docs/examples/05_joins/joins_nb.ipynb
- docs/examples/06_multistage_queries/multistage_queries_nb.ipynb
- slayer/core/enums.py
- slayer/core/formula.py
- slayer/core/models.py
- slayer/engine/enriched.py
- slayer/engine/enrichment.py
- slayer/engine/ingestion.py
- slayer/engine/query_engine.py
- slayer/sql/generator.py
- tests/integration/test_integration.py
- tests/integration/test_integration_duckdb.py
- tests/integration/test_integration_postgres.py
```python
# Built-in aggregation SQL formulas (for aggregations that use a template).
# {value} = measure's SQL expression; {param_name} = parameter values.
BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
```
🧩 Analysis chain

🏁 Scripts executed (repository: MotleyAI/slayer):

```shell
#!/bin/bash
# Verify where weighted_avg is defined/used and whether denominator guards exist.
rg -n "weighted_avg|SUM\\(\\{value\\} \\* \\{weight\\}\\) / SUM\\(\\{weight\\}\\)|NULLIF\\(SUM\\(\\{weight\\}\\),\\s*0\\)"

# Get context around where weighted_avg formula is built in generator.py
sed -n '920,940p' slayer/sql/generator.py

# Check the models.py context around line 118 for weight handling
sed -n '110,130p' slayer/core/models.py

# Search for any weight validation or guards before aggregation
rg -B5 -A5 "weighted_avg|weight\s*=" slayer/ --type py -m 10

# Check if there are any tests or examples showing weighted_avg usage
fd -t f -e py | xargs rg -l "weighted_avg" | head -5

# Search for test files with weighted_avg
fd -t f -e py -path "*/test*" | xargs rg "weighted_avg" -l 2>/dev/null || echo "No test matches"

# Check enrichment layer for any weight validation
sed -n '125,200p' slayer/engine/enrichment.py

# Search for any division or NULLIF handling in formula processing
rg "division|NULLIF|divide" slayer/ --type py
```
Guard `weighted_avg` against a zero denominator.

`SUM({weight})` can be zero for legitimate data groups (all weight values zero or NULL), which causes runtime errors or undefined behavior across SQL engines. Add `NULLIF` to safely return NULL instead of dividing by zero.

Proposed fix

```diff
 BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
-    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
+    "weighted_avg": "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)",
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@slayer/core/enums.py` at line 127, The "weighted_avg" SQL expression in the
enum/dict must guard against a zero denominator; update the "weighted_avg" value
(the mapping entry named "weighted_avg" in slayer/core/enums.py) to divide by
NULLIF(SUM({weight}), 0) instead of SUM({weight}) so the expression becomes
SUM({value} * {weight}) / NULLIF(SUM({weight}), 0), ensuring NULL is returned
when total weight is zero.
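The NULLIF guard from the proposed fix can be demonstrated with SQLite: for a group whose weights are all zero, the guarded formula returns NULL rather than dividing by zero. The template string below matches the review's proposed fix; the table and column names are made up for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (value REAL, weight REAL)")
# Every row has weight 0, so SUM(weight) = 0 for the whole table.
con.executemany("INSERT INTO t VALUES (?, ?)", [(10.0, 0.0), (20.0, 0.0)])

template = "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)"
guarded = template.format(value="value", weight="weight")
row = con.execute(f"SELECT {guarded} FROM t").fetchone()
print(row[0])  # None: NULLIF turned the zero denominator into NULL
```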
```python
_AGG_REF_RE = re.compile(
    r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\.\*)?)'  # group 1: measure name, *, or path.*
    r':'                                               # colon separator
    r'([a-zA-Z_]\w*)'                                  # group 2: aggregation name
    r'(\([^)]*\))?'                                    # group 3: optional (args)
)
```
🧩 Analysis chain

🏁 Scripts executed (repository: MotleyAI/slayer):

```shell
#!/bin/bash
# Check if path.* syntax is used anywhere in the codebase or tests
rg -n '\w+\.\*:' --type=py

# Read the AggregatedMeasureRef class and surrounding context
sed -n '31,60p' slayer/core/formula.py

# Search for tests that verify the regex pattern
fd -e py -path '*/test*' | xargs rg -l 'AGG_REF_RE|AggregatedMeasureRef' 2>/dev/null | head -5

# Check how placeholders are generated and used
rg -n '__agg\d+__' --type=py | head -20

# Look for test examples of aggregation syntax
rg -n 'customers\.|revenue:' --type=py -A 2 -B 2 | head -40

# Search for placeholder generation logic
rg -n '__agg|_preprocess_agg_refs' slayer/core/formula.py | head -30

# Check enrichment.py to see how it consumes AggregatedMeasureRef
sed -n '1,50p' slayer/engine/enrichment.py

# Look for actual placeholder generation in _preprocess_agg_refs function
sed -n '109,150p' slayer/core/formula.py
```
Remove undocumented `path.*` syntax from the regex or add documentation.

The regex includes `(?:\.\*)?` to match patterns like `customers.*:count`, but this syntax is not documented in the `AggregatedMeasureRef` docstring examples and is not used anywhere in the codebase. Either remove this dead-code pattern or add it to the documented examples if it's intended for future use.

The concern about double underscores in identifiers is unfounded — the placeholder system correctly replaces identifiers like `__internal__:sum` with numeric placeholders like `__agg0__`, so there is no conflict.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@slayer/core/formula.py` around lines 101 - 106, The regex _AGG_REF_RE
currently includes an undocumented optional path wildcard (?:\.\*)? that matches
patterns like "customers.*:count"; remove this dead-code by deleting that
optional group from the pattern so the measure name group is only the base name
or dotted path (i.e., change _AGG_REF_RE to
r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?' and update
the inline comment accordingly), and run tests; if you instead intend to support
path.* keep the group but add an example to the AggregatedMeasureRef docstring
demonstrating "path.*:agg" usage so documentation matches the code.
```python
logger = logging.getLogger(__name__)
```
🧩 Analysis chain

🏁 Scripts executed (repository: MotleyAI/slayer):

```shell
#!/bin/bash
# Check if logger is used anywhere in this file
rg -n 'logger\.' slayer/engine/enrichment.py
```
Remove the unused logger or add logging calls for aggregation resolution.
The logger defined at line 41 is never used in this file. Either remove the unused import and definition, or add logging for debugging measure/aggregation resolution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@slayer/engine/enrichment.py` around lines 41 - 42, The file defines an unused
logger variable named logger (logging.getLogger(__name__))—either remove the
unused import/definition or add meaningful logging statements where aggregation
or measure resolution happens (e.g., inside functions that compute or resolve
aggregations such as resolve_aggregation or resolve_measure if present) so the
logger is referenced; update imports accordingly (remove logging import if
deleting logger) and insert process-level debug/info/error logs at key decision
points to enable aggregation resolution tracing.
```python
# For first/last with explicit time dimension arg, update last_agg_time_column
nonlocal last_agg_time_column
if aggregation_name in ("first", "last") and agg_args:
    explicit_time = agg_args[0]
    if "." not in explicit_time:
        explicit_time = f"{model.name}.{explicit_time}"
    last_agg_time_column = explicit_time
```
Side effect on `last_agg_time_column` may cause unexpected behavior with multiple first/last measures.

The `nonlocal last_agg_time_column` modification means that if multiple first/last aggregations are used with different explicit time columns, only the last one processed will take effect. This could lead to incorrect ordering for earlier first/last measures.

Consider whether each measure should track its own time column, or document that only one explicit time column is supported per query.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@slayer/engine/enrichment.py` around lines 181 - 187, The current use of
nonlocal last_agg_time_column inside the aggregation handling (when
aggregation_name is "first" or "last" and agg_args is present) creates a global
side-effect so only the last explicit_time wins; instead, stop mutating the
shared last_agg_time_column and associate the explicit time column with the
specific measure (e.g., attach it to the measure/aggregation metadata or store
it in a dict keyed by the measure name or aggregation identifier). Update the
logic around aggregation_name, agg_args and where last_agg_time_column is
referenced to read per-measure time columns (e.g., measure.time_column or
per_measure_time_map[measure_id]) and remove the nonlocal mutation to avoid
cross-measure interference.
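A minimal sketch of the per-measure alternative suggested above: track each first/last measure's explicit time column in a dict keyed by measure name instead of mutating one shared variable. All names here are illustrative, not the project's actual API.

```python
per_measure_time: dict[str, str] = {}

def register_time_column(measure_name: str, aggregation_name: str,
                         agg_args: list, model_name: str) -> None:
    """Record an explicit first/last time column for one measure only."""
    if aggregation_name in ("first", "last") and agg_args:
        explicit_time = agg_args[0]
        if "." not in explicit_time:
            explicit_time = f"{model_name}.{explicit_time}"
        per_measure_time[measure_name] = explicit_time

# Two measures with different explicit time columns no longer clobber
# each other, which is the failure mode the nonlocal version has.
register_time_column("revenue", "last", ["ordered_at"], "orders")
register_time_column("cost", "first", ["shipped_at"], "orders")
print(per_measure_time)
# {'revenue': 'orders.ordered_at', 'cost': 'orders.shipped_at'}
```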
```python
_agg_reverse = {
    ph: (
        f"{ref.measure_name}:{ref.aggregation_name}"
        if not ref.agg_args and not ref.agg_kwargs
        else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
    )
    for ph, ref in agg_refs.items()
}
```
Reverse mapping doesn't include kwargs, potentially producing incomplete formulas.

The reverse map construction at lines 831-838 only includes `agg_args` when building the original colon form, but ignores `agg_kwargs`. For a formula like `price:weighted_avg(weight=quantity)`, the reconstructed formula would be `price:weighted_avg()` instead of `price:weighted_avg(weight=quantity)`.
🐛 Proposed fix to include kwargs in reverse mapping

```diff
 _agg_reverse = {
     ph: (
         f"{ref.measure_name}:{ref.aggregation_name}"
         if not ref.agg_args and not ref.agg_kwargs
-        else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
+        else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args + [f'{k}={v}' for k, v in ref.agg_kwargs.items()])})"
     )
     for ph, ref in agg_refs.items()
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@slayer/engine/enrichment.py` around lines 831 - 838, The reverse mapping
_agg_reverse built from agg_refs omits ref.agg_kwargs when formatting the
aggregation signature, causing reconstructed formulas to lose keyword arguments;
update the comprehension that builds _agg_reverse to include both ref.agg_args
and ref.agg_kwargs inside the parentheses (format kwargs as key=value pairs and
join them with positional args, e.g., ", ".join(ref.agg_args + [f"{k}={v}" for
k,v in ref.agg_kwargs.items()])), preserving the existing branch that omits
parentheses when there are no args/kwargs; target the _agg_reverse comprehension
and the attributes ref.measure_name, ref.aggregation_name, ref.agg_args,
ref.agg_kwargs.
```python
def _build_formula_agg(self, measure: EnrichedMeasure, agg_name: str) -> exp.Expression:
    """Build SQL for formula-based aggregations (weighted_avg, custom)."""
    # Get formula: from aggregation_def or built-in
    formula = None
    if measure.aggregation_def and measure.aggregation_def.formula:
        formula = measure.aggregation_def.formula
    elif agg_name in BUILTIN_AGGREGATION_FORMULAS:
        formula = BUILTIN_AGGREGATION_FORMULAS[agg_name]

    if formula is None:
        raise ValueError(
            f"Aggregation '{agg_name}' has no formula. "
            f"Custom aggregations must define a formula."
        )

    # Collect param values: query-time overrides > aggregation_def defaults
    param_defaults = {}
    if measure.aggregation_def:
        param_defaults = {p.name: p.sql for p in measure.aggregation_def.params}
    params = {**param_defaults, **measure.agg_kwargs}

    # Validate required params
    required = BUILTIN_AGGREGATION_REQUIRED_PARAMS.get(agg_name, [])
    for req in required:
        if req not in params:
            raise ValueError(
                f"Aggregation '{agg_name}' requires parameter '{req}'. "
                f"Set it in the model's aggregation definition or at query time "
                f"(e.g., 'measure:{agg_name}({req}=column)')."
            )

    # Resolve {value} and {param_name} in formula
    col_expr = measure.sql or measure.name
    substituted = formula.replace("{value}", col_expr)
    for param_name, param_val in params.items():
        substituted = substituted.replace(f"{{{param_name}}}", param_val)

    return sqlglot.parse_one(substituted, dialect=self.dialect)
```
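The substitution order in the method above (first `{value}`, then each parameter) can be walked through as a standalone snippet. The column names are illustrative, the template uses the NULLIF-guarded variant from the earlier comment, and the final `sqlglot.parse_one` step is omitted:

```python
formula = "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)"
params = {"weight": "quantity"}  # query-time kwargs override model defaults

# {value} is the measure's row-level SQL expression.
substituted = formula.replace("{value}", "price")
# Then every named parameter is substituted in turn.
for param_name, param_val in params.items():
    substituted = substituted.replace(f"{{{param_name}}}", param_val)

print(substituted)
# SUM(price * quantity) / NULLIF(SUM(quantity), 0)
```

Because parameter values are spliced in as raw text before parsing, any restriction on what they may contain (column names vs. arbitrary expressions) has to be enforced upstream, which is what the verification scripts below probe for.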
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check where agg_kwargs values originate from in the enrichment flow
rg -n "agg_kwargs" --type py -C 3 slayer/Repository: MotleyAI/slayer
Length of output: 10262
🏁 Script executed:
# Check how agg_kwargs values are validated and what BUILTIN_AGGREGATION_FORMULAS looks like
rg -n "BUILTIN_AGGREGATION_FORMULAS|BUILTIN_AGGREGATION_REQUIRED_PARAMS" --type py -C 2 slayer/Repository: MotleyAI/slayer
Length of output: 1789
🏁 Script executed:
# Look at the formula definitions to understand how parameters are used
fd -e py -x grep -l "weighted_avg\|custom.*aggregation" {} \; slayer/Repository: MotleyAI/slayer
Length of output: 392
🏁 Script executed:
# Check if there's any validation of agg_kwargs before it reaches _build_formula_agg
rg -n "agg_kwargs" --type py -B 5 -A 5 slayer/engine/enrichment.py | head -100Repository: MotleyAI/slayer
Length of output: 4177
🏁 Script executed:
# Look at the actual substitution logic and any context around it
sed -n '929,970p' slayer/sql/generator.pyRepository: MotleyAI/slayer
Length of output: 2038
🏁 Script executed:
# Search for any validation of parameter values or agg_kwargs
rg -n "validate|sanitize|escape" --type py slayer/engine/enrichment.py | grep -i paramRepository: MotleyAI/slayer
Length of output: 41
🏁 Script executed:
# Look for tests that show expected usage of weighted_avg and other parameterized aggregations
fd . slayer --full-path -e py -name "*test*" | head -20Repository: MotleyAI/slayer
Length of output: 289
🏁 Script executed:
# Check for parameter validation in the aggregation definition or enrichment
cat slayer/core/enums.py | grep -A 10 "class Aggregation\|@dataclass.*Aggregation"Repository: MotleyAI/slayer
Length of output: 41
🏁 Script executed:
# Look for the Aggregation and Parameter dataclasses to understand structure
rg -n "class.*Aggregation|class.*Parameter|@dataclass.*Aggregation" --type py slayer/ -A 5Repository: MotleyAI/slayer
Length of output: 843
🏁 Script executed:
# Check if there are any tests for weighted_avg or formula substitution
rg -n "weighted_avg\|_build_formula_agg" --type py slayer/ -B 2 -A 5 | head -80Repository: MotleyAI/slayer
Length of output: 41
🏁 Script executed:
# Look at the full Aggregation model definition
sed -n '115,135p' slayer/core/models.pyRepository: MotleyAI/slayer
Length of output: 894
🏁 Script executed:
# Search for test files and examples using weighted_avg
```shell
find slayer -name "*.py" -type f | xargs grep -l "weighted_avg" | head -5
```

Repository: MotleyAI/slayer

🏁 Script executed:

```shell
# Check if there are any constraints or validation on parameter values
rg -n "param|weight" --type py slayer/core/models.py -B 2 -A 2 | head -50
```

🏁 Script executed:

```shell
# Look for documentation or examples in README or docs
find . -name "*.md" -o -name "*.rst" | xargs grep -l "weighted_avg\|agg_kwargs" 2>/dev/null | head -3
```

🏁 Script executed:

```shell
# Check if sqlglot.parse_one is being called and how it handles the substituted string
rg -n "parse_one" --type py slayer/sql/generator.py -B 3 -A 2
```

🏁 Script executed:

```shell
# Check CLAUDE.md for documentation on weighted_avg usage
cat ./CLAUDE.md | grep -A 20 "weighted_avg"
```

🏁 Script executed:

```shell
# Look for tests that might show expected behavior
find slayer -name "*.py" -path "*test*" -type f
```

🏁 Script executed:

```shell
# Search for any test files in the repo
find . -name "test_*.py" -o -name "*_test.py" | head -10
```

🏁 Script executed:

```shell
# Check if there's validation in the formula parsing step
rg -n "agg_kwargs" --type py slayer/core/formula.py -B 5 -A 5
```

🏁 Script executed:

```shell
# Look for how the parameter values are actually used - check if they're expected to be column names or full expressions
rg -n "def.*_build_formula_agg|substituted" --type py slayer/sql/generator.py -B 2 -A 10
```

🏁 Script executed:

```shell
# Check how the expression returned from _build_formula_agg is used
rg -n "_build_formula_agg" --type py slayer/sql/generator.py -A 3
```

🏁 Script executed:

```shell
# Look at test files to see if there are tests for weighted_avg with different parameter values
grep -r "weighted_avg" tests/ --include="*.py" -B 2 -A 2
```

🏁 Script executed:

```shell
# Check if there's any sanitization in the enrichment pipeline for parameter values
rg -n "param.*val|val.*strip|escape|quote" --type py slayer/engine/enrichment.py | head -20
```

🏁 Script executed:

```shell
# Look at how column expressions are built - check _build_column_expr to understand if col_expr on line 961 is validated
sed -n '860,875p' slayer/sql/generator.py
```

🏁 Script executed:

```shell
# Check if sqlglot.parse_one can handle injection attempts safely - look for any error handling
rg -n "try|except|parse_one" --type py slayer/sql/generator.py | grep -A 2 "parse_one"
```

🏁 Script executed:

```shell
# Verify the expected format of parameter values - look at documentation or tests for constraints
rg -n "weight|quantity|agg_kwargs" --type py tests/ -B 2 -A 2 | head -40
```
🏁 Script executed:

```shell
# Check how col_expr is constructed - does _resolve_sql provide any validation?
rg -n "_resolve_sql" --type py slayer/sql/generator.py -A 15 | head -25
```

🏁 Script executed:

```shell
# Look for any tests that try malicious input
grep -r "injection\|DROP\|;\|--" tests/ --include="*.py" | head -10
```

🏁 Script executed:

```shell
# Check if parameters can be arbitrary expressions or only column names
rg -n "allowed_aggregations|param.*validate|weight.*column" --type py slayer/ -B 2 -A 2 | head -30
```

🏁 Script executed:

```shell
# Check if there are any try-catch blocks around parse_one that might handle injection attempts
rg -n "try|except" --type py slayer/sql/generator.py | grep -B 5 -A 5 "966\|parse_one"
```

🏁 Script executed:

```shell
# Look for any validation that restricts agg_kwargs to column names
rg -n "Column|Identifier|parse_one" --type py slayer/sql/generator.py | grep -B 3 -A 3 "param\|weight\|agg_kwargs"
```

🏁 Script executed:

```shell
# Check what happens if sqlglot.parse_one fails - is there error handling upstream?
rg -n "query_engine\|SlayerQuery" --type py slayer/ -l | head -3
```

🏁 Script executed:

```shell
# Look at how col_expr (measure.sql) is built - is it already validated as an expression?
rg -n "col_expr = measure.sql" --type py slayer/sql/generator.py -B 3 -A 10
```

🏁 Script executed:

```shell
# Check for any documentation on expected parameter format or constraints
rg -n "param.*format|parameter.*column|weight.*definition" --type py slayer/core/models.py -B 2 -A 2
```

🏁 Script executed:

```shell
# Look at how queries are built and executed - check if the generated SQL from _build_formula_agg is safely incorporated
rg -n "execute\|query\|run_sql" --type py slayer/ -l | head -5
```

🏁 Script executed:

```shell
# Check the broader context of _build_agg usage to see how the returned expression is used
rg -n "_build_agg\|_build_formula_agg" --type py slayer/sql/generator.py -B 2 -A 5
```

🏁 Script executed:

```shell
# Look for any input validation or constraints in the aggregation parameter parsing
sed -n '115,145p' slayer/core/formula.py
```

🏁 Script executed:

```shell
# Check if parameter values should be restricted to identifier-like patterns
rg -n "identifier|column.*name|bare.*column" --type py slayer/core/models.py -B 1 -A 1 | head -30
```
🏁 Script executed:

```shell
# Check if sqlglot.parse_one can parse multi-statement SQL or if it safely rejects it
python3 << 'EOF'
import sqlglot
# Test if sqlglot.parse_one can handle potential injection attempts
test_cases = [
    "SUM(amount * quantity) / SUM(quantity)",  # normal
    "SUM(amount * quantity); DROP TABLE orders",  # injection attempt
    "SUM(amount * quantity) OR 1=1",  # another attempt
]
for test in test_cases:
    try:
        result = sqlglot.parse_one(test, dialect="sqlite")
        print(f"✓ Parsed: {test[:50]}")
    except Exception as e:
        print(f"✗ Failed: {test[:50]} -> {type(e).__name__}")
EOF
```

🏁 Script executed:

```shell
# Look at the AggregationParam definition to understand expected format
sed -n '109,113p' slayer/core/models.py
```

🏁 Script executed:

```shell
# Check if there's any documentation about parameter value constraints
rg -n "AggregationParam|param.*sql|weight.*sql" --type py slayer/core/models.py -B 2 -A 3
```

🏁 Script executed:

```shell
# Look at how sqlglot.parse_one is imported and check for any safety measures
rg -n "import.*sqlglot|from.*sqlglot" --type py slayer/sql/generator.py -A 2
```

🏁 Script executed:

```shell
# Check if there are integration tests for weighted_avg that might show expected behavior
find tests -name "*.py" -type f -exec grep -l "weighted_avg" {} \;
```

🏁 Script executed:

```shell
# Look for any validation of formula or aggregation definitions
rg -n "Aggregation|formula.*validate|param.*validate" --type py slayer/core/models.py -A 3 | head -40
```
🏁 Script executed:

```shell
# Install sqlglot and test if it provides injection protection
pip install sqlglot -q && python3 << 'EOF'
import sqlglot
# Test if sqlglot.parse_one safely handles injection attempts in formula substitution
formula = "SUM({value} * {weight}) / SUM({weight})"
test_cases = [
    ("quantity", "Normal column name"),
    ("quantity + 1", "Expression"),
    ("quantity); DROP TABLE orders; --", "Injection attempt"),
    ("(SELECT COUNT(*) FROM orders)", "Subquery"),
]
for param_val, desc in test_cases:
    substituted = formula.replace("{value}", "price").replace("{weight}", param_val)
    print(f"\n{desc}: weight={param_val}")
    print(f"  Substituted: {substituted}")
    try:
        result = sqlglot.parse_one(substituted, dialect="sqlite")
        print(f"  ✓ Parsed successfully")
        print(f"  SQL output: {result.sql(dialect='sqlite')}")
    except Exception as e:
        print(f"  ✗ Parse error: {e}")
EOF
```

🏁 Script executed:

```shell
# Check if there's any documentation about parameter validation in comments or docstrings
rg -n "param.*safe|sanitize|validate|column.*name|SQL.*expression" --type py slayer/core/ -B 1 -A 1 | head -50
```

🏁 Script executed:

```shell
# Check how the final SQL is executed - is it parameterized or executed as a raw string?
rg -n "execute\|run\|query" --type py slayer/engine/ -l
```

🏁 Script executed:

```shell
# Look at the query engine to see if SQL is executed directly or with parameters
sed -n '1,50p' slayer/engine/query_engine.py | head -30
```

🏁 Script executed:

```shell
# Check if there's any escaping or quoting of parameter values
rg -n "quote|escape|literal|param" --type py slayer/sql/generator.py | head -20
```

🏁 Script executed:

```shell
# Check how the SQL from _build_formula_agg is used downstream
rg -n "def _build_agg|_build_formula_agg" --type py slayer/sql/generator.py -A 20 | grep -A 15 "def _build_agg"
```

🏁 Script executed:

```shell
# Check how the final SQL expression is rendered and executed
rg -n "\.sql\(|execute|client" --type py slayer/engine/query_engine.py | head -20
```

🏁 Script executed:

```shell
# Look for any tests that validate parameter safety or show expected usage
find tests -name "*.py" -type f -exec grep -l "agg_kwargs\|weighted_avg\|aggregation.*param" {} \;
```

🏁 Script executed:

```shell
# Check if there's any documentation about parameter constraints or safety
cat slayer/core/models.py | grep -A 10 "class AggregationParam"
```

🏁 Script executed:

```shell
# Check SlayerSQLClient to see if it uses parameterized queries or raw SQL execution
rg -n "class SlayerSQLClient|def execute" --type py slayer/sql/client.py -A 10 | head -30
```

🏁 Script executed:

```shell
# Look at the actual execution to confirm raw SQL string usage
sed -n '1,100p' slayer/sql/client.py | head -50
```
🏁 Script executed:

```shell
# Search for any validation of formula or parameter values before they reach _build_formula_agg
rg -n "_ensure_aggregated_measure\|_build_formula_agg" --type py slayer/engine/enrichment.py -B 5 -A 5 | head -40
```

🏁 Script executed:

```shell
# Check the _execute_sql function to see how SQL is actually executed
rg -n "def _execute_sql" --type py slayer/sql/client.py -A 30
```

🏁 Script executed:

```shell
# Look for any tests that demonstrate parameter value validation or injection testing
grep -r "injection\|DROP\|--\|;\|parametr" tests/ --include="*.py" -i | head -20
```

🏁 Script executed:

```shell
# Check if there's a CHANGELOG or docs explaining design decisions around parameter handling
find . -name "CHANGELOG*" -o -name "*.md" | xargs grep -l "param\|aggregation\|injection" 2>/dev/null | head -3
```

🏁 Script executed:

```shell
# Check what sqlglot actually renders for the injection case - detailed test
pip install sqlglot -q && python3 << 'EOF'
import sqlglot
# Test the exact scenario: formula substitution → parse → render
formula = "SUM({value} * {weight}) / SUM({weight})"
col_expr = "price"
malicious_weight = "quantity); DROP TABLE orders; --"
# Step 1: String substitution (what happens in _build_formula_agg)
substituted = formula.replace("{value}", col_expr).replace("{weight}", malicious_weight)
print(f"1. After string substitution:\n   {substituted}\n")
# Step 2: Parse with sqlglot
try:
    parsed = sqlglot.parse_one(substituted, dialect="sqlite")
    print(f"2. Parsed successfully as AST")
except Exception as e:
    print(f"2. Parse failed: {e}")
    exit(1)
# Step 3: Render back to SQL
rendered_sql = parsed.sql(dialect="sqlite")
print(f"3. Rendered SQL:\n   {rendered_sql}\n")
# Step 4: Check if malicious statement is present
if "DROP TABLE" in rendered_sql:
    print("⚠️ VULNERABLE: DROP TABLE statement is present in rendered SQL!")
else:
    print("✓ Safe: DROP TABLE statement was removed/neutralized")
EOF
```

🏁 Script executed:

```shell
# Check if there are any comments in the codebase about parameter validation or security
rg -n "TODO|FIXME|HACK|SECURITY|injection|safe" --type py slayer/sql/generator.py | grep -i "param\|agg\|formula\|safe"
```

🏁 Script executed:

```shell
# Look at the broader context of how parameters should be used - check if there are constraints documented
cat slayer/core/models.py | sed -n '100,200p'
```
Validate and sanitize user-provided parameter values in aggregations to prevent SQL injection.
The _build_formula_agg function substitutes user-provided parameter values directly into SQL formulas via string replacement (lines 962–964), then passes the result to sqlglot.parse_one(). While parameter definitions in aggregations support SQL expressions, user-provided agg_kwargs from query-time aggregation syntax (e.g., "revenue:weighted_avg(weight=user_input)") are not validated before substitution.
Testing confirms that malicious input like "quantity); DROP TABLE orders; --" is not rejected by sqlglot's parser and produces executable SQL statements when rendered. Since the generated SQL is executed as raw text via conn.execute(sa.text(sql)), this is a real SQL injection vector.
Mitigations to consider:
- Validate parameter values against a whitelist of column names or safe expression patterns before substitution
- Use sqlglot's parameterization capabilities instead of string replacement
- Document that parameters must come from trusted sources only and enforce this at the enrichment stage
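As an illustration of the first mitigation, a conservative whitelist could reject anything that is not a bare (optionally table-qualified) column identifier before substitution. The function and pattern below are a hypothetical sketch, not SLayer code:

```python
import re

# Hypothetical guard (not part of SLayer): accept only bare column
# identifiers, optionally table-qualified, as aggregation parameter values.
_SAFE_PARAM = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)?$")

def validate_agg_param(name, value):
    """Reject parameter values that are not simple column references."""
    if not _SAFE_PARAM.match(value):
        raise ValueError(f"Unsafe value for aggregation parameter {name!r}: {value!r}")
    return value

validate_agg_param("weight", "quantity")         # accepted
validate_agg_param("weight", "orders.quantity")  # accepted
try:
    validate_agg_param("weight", "quantity); DROP TABLE orders; --")
except ValueError:
    pass  # rejected before it can reach string substitution
```

This trades expressiveness (no arbitrary weight expressions) for safety; if expressions must be supported, parsing the value with sqlglot and accepting only column/arithmetic nodes would be a middle ground.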
Summary
This PR redesigns SLayer's measure/aggregation system to separate the row-level expression (the measure) from the aggregation method. Previously, a measure tightly coupled both: `{name: "revenue", sql: "amount", type: "sum"}`. Now the measure is just `{name: "revenue", sql: "amount"}`, and the aggregation is chosen at query time: `"revenue:sum"`.

Why?

Instead of pre-defining one measure per aggregation (`revenue_sum`, `revenue_avg`, `revenue_min`, ...), you define one measure (`revenue`) and aggregate it any way you want at query time.

New query syntax
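The concrete examples under this heading did not survive extraction; as an illustration, the colon syntax described in this PR ("measure:agg" with optional keyword parameters) can be captured by a small stand-alone parser. This is a hypothetical sketch, not SLayer's actual formula parser:

```python
import re

# Hypothetical stand-alone parser for the query-time aggregation syntax:
# "measure:agg" or "measure:agg(param=value)", including "*" and "model.*".
FIELD_RE = re.compile(
    r"^(?P<measure>\*|[\w.]+|\w+\.\*):(?P<agg>\w+)(?:\((?P<kwargs>[^)]*)\))?$"
)

def parse_field(field):
    """Split a field reference into (measure, aggregation, agg_kwargs)."""
    m = FIELD_RE.match(field)
    if not m:
        raise ValueError(f"Not an aggregated field reference: {field!r}")
    kwargs = {}
    if m.group("kwargs"):
        for pair in m.group("kwargs").split(","):
            key, _, value = pair.partition("=")
            kwargs[key.strip()] = value.strip()
    return m.group("measure"), m.group("agg"), kwargs

parse_field("revenue:sum")                          # ("revenue", "sum", {})
parse_field("*:count")                              # ("*", "count", {})
parse_field("price:weighted_avg(weight=quantity)")  # ("price", "weighted_avg", {"weight": "quantity"})
```

The real implementation additionally rewrites these references into placeholder AST nodes (`AggregatedMeasureRef`) during formula preprocessing.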
New model syntax
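The examples under this heading were also lost in extraction, so here is a hedged sketch of what a model definition might look like under the new design. The field names are inferred from the models listed in Files changed (`Aggregation`, `AggregationParam`, `Measure.allowed_aggregations`) and the weighted_avg formula exercised in the scripts above; treat the exact schema as illustrative:

```python
# Illustrative model definition: measures are bare row-level expressions,
# and custom aggregations are first-class entities with formula parameters.
# The field names below are assumptions, not SLayer's verified schema.
model = {
    "name": "orders",
    "measures": [
        # No "type" field: aggregation is chosen at query time (revenue:sum).
        {"name": "revenue", "sql": "amount"},
        # Optional whitelist restricting which aggregations may be applied.
        {
            "name": "price",
            "sql": "unit_price",
            "allowed_aggregations": ["avg", "weighted_avg"],
        },
    ],
    "aggregations": [
        # Custom aggregation: a formula with named parameters, matching the
        # weighted_avg formula shown in the verification scripts above.
        {
            "name": "weighted_avg",
            "formula": "SUM({value} * {weight}) / SUM({weight})",
            "params": [{"name": "weight"}],
        },
    ],
}
```

Queried as `price:weighted_avg(weight=quantity)`, the formula's `{value}` slot would receive the measure's SQL and `{weight}` the supplied parameter value.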
Result column naming
Colon becomes underscore in result keys (same canonical form as before):
- `revenue:sum` → `orders.revenue_sum`
- `*:count` → `orders.count`
- `customers.*:count` → `orders.customers__count`

Built-in aggregations (always available)
| Aggregation | SQL |
| --- | --- |
| `sum` | `SUM(expr)` |
| `avg` | `AVG(expr)` |
| `min` | `MIN(expr)` |
| `max` | `MAX(expr)` |
| `count` | `COUNT(expr)` or `COUNT(*)`: `*:count` = `COUNT(*)`, `col:count` = `COUNT(col)` |
| `count_distinct` | `COUNT(DISTINCT expr)` |
| `first` | |
| `last` | |
| `weighted_avg` | `SUM(v*w)/SUM(w)`, takes a `weight` param |
| `median` | `PERCENTILE_CONT(0.5)` |

Backward compatibility
- The `type` field on measures is still accepted (deprecated, emits `DeprecationWarning`)
- Bare measure references (`"revenue"`) still work when the measure has a deprecated `type` field

Auto-ingestion changes
- Before: auto-ingestion generated several pre-aggregated measures per column (`amount_sum`, `amount_avg`, `amount_min`, `amount_max`, `amount_distinct`)
- Now: one measure per column (`amount`), aggregated at query time (`amount:sum`, `amount:avg`, etc.)
- `*:count` is always available without any measure definition

Files changed
- `slayer/core/enums.py`: `BUILTIN_AGGREGATIONS`, formula/param constants
- `slayer/core/models.py`: `Aggregation` and `AggregationParam` models; `Measure.allowed_aggregations`
- `slayer/core/formula.py`: `AggregatedMeasureRef`, colon-syntax preprocessing
- `slayer/engine/enriched.py`: `EnrichedMeasure` uses an `aggregation` string
- `slayer/engine/enrichment.py`: aggregated measure resolution, backward compatibility, filter preprocessing
- `slayer/engine/query_engine.py`: cross-model `*` measures, virtual model simplification
- `slayer/sql/generator.py`: string-based dispatch, formula aggregations, `first`/`median` support
- `slayer/engine/ingestion.py`: one measure per column

Test plan
- `poetry run ruff check slayer/ tests/` passes clean
- Measures with the deprecated `type` field still work (with deprecation warnings)
- Define a custom `weighted_avg` aggregation and query it
- The `allowed_aggregations` whitelist rejects invalid aggregations

🤖 Generated with Claude Code
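Putting the pieces together, a toy end-to-end sketch of aggregation dispatch plus the result-key naming rule above may help picture the design. All helper names here are hypothetical; SLayer's real logic lives in `slayer/sql/generator.py`:

```python
# Toy sketch: build an aggregate SQL fragment and a result key for a
# "measure:agg" field. Not SLayer's actual _build_agg implementation.
AGG_SQL = {"sum": "SUM", "avg": "AVG", "min": "MIN", "max": "MAX"}

def build_agg(expr, agg):
    if agg == "count":
        return "COUNT(*)" if expr == "*" else f"COUNT({expr})"
    if agg == "count_distinct":
        return f"COUNT(DISTINCT {expr})"
    if agg == "median":
        return f"PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {expr})"
    return f"{AGG_SQL[agg]}({expr})"

def result_key(model, field):
    # Colon becomes underscore: revenue:sum -> orders.revenue_sum,
    # *:count -> orders.count, customers.*:count -> orders.customers__count.
    measure, _, agg = field.partition(":")
    if measure == "*":
        return f"{model}.{agg}"
    return f"{model}.{measure.replace('.*', '_').replace('.', '_')}_{agg}"

build_agg("amount", "sum")           # "SUM(amount)"
build_agg("*", "count")              # "COUNT(*)"
result_key("orders", "revenue:sum")  # "orders.revenue_sum"
```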
Summary by CodeRabbit
Release Notes
New Features
- Query-time aggregation syntax (`revenue:sum`, `*:count`, `price:weighted_avg(weight=quantity)`) to apply different aggregations at query time.

Documentation