Separate aggregation from measure definition#22

Open
ZmeiGorynych wants to merge 1 commit into main from
egor/dev-1228-specify-aggregation-separately-from-measure

Conversation

@ZmeiGorynych
Member

@ZmeiGorynych ZmeiGorynych commented Apr 9, 2026

Summary

This PR redesigns SLayer's measure/aggregation system to separate the row-level expression (measure) from the aggregation method. Previously, a measure tightly coupled both: {name: "revenue", sql: "amount", type: "sum"}. Now:

  • Measures are named row-level expressions: {name: "revenue", sql: "amount"}
  • Aggregations are specified at query time using colon syntax: "revenue:sum"
  • Custom aggregations (e.g., weighted average) are first-class named entities defined at the model level

Why?

  1. Eliminates measure proliferation — instead of N measures per column (revenue_sum, revenue_avg, revenue_min, ...), you define one measure (revenue) and aggregate it any way you want at query time
  2. Enables complex aggregations — weighted averages, custom formulas, percentiles can be defined and reused
  3. Cleaner separation of concerns — the "what" (measure expression) is separate from the "how" (aggregation method)

New query syntax

# Old syntax (still works via backward compat, emits DeprecationWarning)
fields=["count", "revenue"]

# New colon syntax
fields=[
    "*:count",                              # COUNT(*)
    "revenue:sum",                          # SUM(amount)
    "revenue:avg",                          # AVG(amount)
    "customer_id:count_distinct",           # COUNT(DISTINCT customer_id)
    "price:weighted_avg(weight=quantity)",   # custom aggregation with params
    "revenue:median",                       # PERCENTILE_CONT(0.5)
    "revenue:first(ordered_at)",            # earliest value (explicit time col)
    "revenue:last",                         # latest value
]

# Transforms wrap aggregated refs naturally
fields=[
    "cumsum(revenue:sum)",                  # running sum of SUM(revenue)
    {"formula": "change(revenue:sum)", "name": "mom"},  # month-over-month change
]

# Arithmetic works too
fields=[
    {"formula": "revenue:sum / *:count", "name": "aov"},  # average order value
]

# Cross-model measures
fields=[
    "customers.*:count",                    # COUNT(*) on joined customers model
    "customers.score:avg",                  # AVG(score) on joined customers model
]

New model syntax

measures:
  - name: revenue
    sql: amount
  - name: price
    sql: price
    allowed_aggregations: [sum, avg, min, max, weighted_avg]  # optional whitelist
  - name: customer_id
    sql: customer_id
    allowed_aggregations: [count, count_distinct]

aggregations:
  # Override built-in to set default weight for this model
  - name: weighted_avg
    params:
      - name: weight
        sql: quantity

  # Fully custom aggregation
  - name: trimmed_mean
    formula: "AVG(CASE WHEN {value} BETWEEN {lo} AND {hi} THEN {value} END)"
    params:
      - name: lo
        sql: "0"
      - name: hi
        sql: "1000"
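For illustration, colon refs like the ones above can be split into their parts with a small helper. This is a hypothetical sketch — the names `AGG_REF` and `parse_agg_ref` are illustrative, not SLayer's actual internals (the real preprocessing lives in slayer/core/formula.py) — and, like the implementation, it assumes argument values contain no commas:

```python
import re

# Hypothetical colon-syntax parser; illustrative only, not SLayer's code.
AGG_REF = re.compile(
    r"(?P<measure>\*|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)"  # measure name, * or dotted path
    r":(?P<agg>[A-Za-z_]\w*)"                           # aggregation name
    r"(?:\((?P<args>[^)]*)\))?"                         # optional (args)
)

def parse_agg_ref(field: str) -> tuple[str, str, list[str], dict[str, str]]:
    """Split 'measure:agg(key=val, ...)' into (measure, agg, args, kwargs)."""
    m = AGG_REF.fullmatch(field)
    if m is None:
        raise ValueError(f"not an aggregated measure ref: {field!r}")
    args: list[str] = []
    kwargs: dict[str, str] = {}
    if m.group("args"):
        # Note: naive comma split — argument values may not contain commas.
        for part in m.group("args").split(","):
            part = part.strip()
            if "=" in part:
                key, val = part.split("=", 1)
                kwargs[key.strip()] = val.strip()
            else:
                args.append(part)
    return m.group("measure"), m.group("agg"), args, kwargs
```

For example, `parse_agg_ref("price:weighted_avg(weight=quantity)")` yields `("price", "weighted_avg", [], {"weight": "quantity"})`.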

Result column naming

Colon becomes underscore in result keys (same canonical form as before):

  • revenue:sum → orders.revenue_sum
  • *:count → orders.count
  • customers.*:count → orders.customers__count
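The naming rule can be sketched as a small pure function. This is an illustrative reconstruction from the examples above, not SLayer's actual code; the `model="orders"` default is an assumption matching those examples:

```python
def canonical_name(measure: str, aggregation: str, model: str = "orders") -> str:
    """Illustrative sketch of the colon-to-underscore result-key rule."""
    if measure == "*":
        base = aggregation                                 # *:count -> count
    elif "." in measure:
        joined_model, col = measure.split(".", 1)
        if col == "*":
            base = f"{joined_model}__{aggregation}"        # customers.*:count -> customers__count
        else:
            # Assumed extrapolation for cross-model column measures.
            base = f"{joined_model}__{col}_{aggregation}"
    else:
        base = f"{measure}_{aggregation}"                  # revenue:sum -> revenue_sum
    return f"{model}.{base}"
```

Usage: `canonical_name("revenue", "sum")` returns `"orders.revenue_sum"`.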

Built-in aggregations (always available)

| Name | SQL | Notes |
|---|---|---|
| sum | SUM(expr) | |
| avg | AVG(expr) | |
| min | MIN(expr) | |
| max | MAX(expr) | |
| count | COUNT(expr) or COUNT(*) | `*:count` = COUNT(*), `col:count` = COUNT(col) |
| count_distinct | COUNT(DISTINCT expr) | |
| first | Earliest record's value | Requires time dimension |
| last | Latest record's value | Requires time dimension |
| weighted_avg | SUM(v*w)/SUM(w) | Requires weight param |
| median | PERCENTILE_CONT(0.5) | Dialect-dependent |
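Template-based aggregations such as weighted_avg expand by substituting the measure's SQL expression and parameter values into the formula string. A minimal sketch: the `WEIGHTED_AVG` template matches the PR's built-in formula, while `render_aggregation` is a hypothetical helper, not SLayer's actual SQL generator:

```python
# Illustrative template expansion for parameterised aggregations.
WEIGHTED_AVG = "SUM({value} * {weight}) / SUM({weight})"

def render_aggregation(template: str, value_sql: str, **params: str) -> str:
    """Substitute {value} and parameter placeholders into an aggregation template."""
    return template.format(value=value_sql, **params)

sql = render_aggregation(WEIGHTED_AVG, value_sql="price", weight="quantity")
print(sql)  # SUM(price * quantity) / SUM(quantity)
```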

Backward compatibility

  • Old type field on measures is still accepted (deprecated, emits DeprecationWarning)
  • Bare measure names in formulas (e.g., "revenue") work when the measure has a deprecated type field
  • All existing queries continue to work with deprecation warnings guiding migration

Auto-ingestion changes

  • Before: N measures per column (amount_sum, amount_avg, amount_min, amount_max, amount_distinct)
  • After: 1 measure per column (amount), aggregated at query time (amount:sum, amount:avg, etc.)
  • *:count is always available without any measure definition

Files changed

  • slayer/core/enums.py — BUILTIN_AGGREGATIONS, formula/param constants
  • slayer/core/models.py — Aggregation, AggregationParam models; Measure.allowed_aggregations
  • slayer/core/formula.py — AggregatedMeasureRef, colon-syntax preprocessing
  • slayer/engine/enriched.py — EnrichedMeasure uses aggregation string
  • slayer/engine/enrichment.py — aggregated measure resolution, backward compat, filter preprocessing
  • slayer/engine/query_engine.py — cross-model * measures, virtual model simplification
  • slayer/sql/generator.py — string-based dispatch, formula aggregations, first/median support
  • slayer/engine/ingestion.py — one measure per column
  • All example notebooks and integration tests updated

Test plan

  • 409 tests pass (unit + integration including SQLite, Postgres, DuckDB, notebooks)
  • poetry run ruff check slayer/ tests/ — clean
  • Backward compat: old-style measures with type field still work (deprecation warnings)
  • Manual test: define a model with custom weighted_avg aggregation and query it
  • Manual test: verify allowed_aggregations whitelist rejects invalid aggregations

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced explicit aggregation syntax using colon notation (e.g., revenue:sum, *:count, price:weighted_avg(weight=quantity)) to apply different aggregations at query time.
    • Simplified auto-generated measures: now creates one measure per non-ID column instead of multiple aggregation variants.
    • Maintained backward compatibility for existing measure syntax with deprecation warnings.
  • Documentation

    • Updated example notebooks to demonstrate the new aggregation syntax.

Measures become named row-level expressions; aggregation is specified
at query time using colon syntax (e.g., revenue:sum, *:count).
Custom aggregations are first-class model entities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces explicit query-time aggregation syntax (colon notation: measure:aggregation) as an alternative to auto-generated aggregate measures. It restructures measure definitions to contain one per column, updates all example notebooks to use new field syntax, and reworks the enrichment and SQL generation pipelines to resolve aggregations at query time.

Changes

Cohort / File(s) Summary
Documentation & Examples
.gitignore, CLAUDE.md, docs/examples/02_sql_vs_dsl/sql_vs_dsl_nb.ipynb, docs/examples/03_auto_ingest/auto_ingest_nb.ipynb, docs/examples/04_time/time_nb.ipynb, docs/examples/05_joined_measures/joined_measures_nb.ipynb, docs/examples/05_joins/joins_nb.ipynb, docs/examples/06_multistage_queries/multistage_queries_nb.ipynb
Updated examples and documentation to reflect new colon-based aggregation syntax (measure:agg, *:count), replaced auto-generated measure names with explicit aggregation references, and cleared execution outputs from notebooks. CLAUDE.md updated with comprehensive semantics for the new aggregation model and deprecation guidance.
Core Data Model
slayer/core/enums.py, slayer/core/models.py
Added aggregation configuration constants (BUILTIN_AGGREGATIONS, BUILTIN_AGGREGATION_FORMULAS, BUILTIN_AGGREGATION_REQUIRED_PARAMS). Extended Measure with optional type and allowed_aggregations field. Introduced new Aggregation and AggregationParam models. Added model-level validator for aggregation whitelisting.
Formula Parsing & AST
slayer/core/formula.py
Introduced AggregatedMeasureRef AST node for colon-syntax aggregations. Added _preprocess_agg_refs() to scan and replace aggregation patterns before AST parsing. Updated ArithmeticField and MixedArithmeticField with agg_refs tracking. Modified parse_formula() to preprocess and inject aggregated-ref mappings into AST construction.
Enrichment & Query Resolution
slayer/engine/enriched.py, slayer/engine/enrichment.py
Refactored EnrichedMeasure from type: DataType to aggregation: str with aggregation_def and agg_kwargs fields. Added _ensure_aggregated_measure() to validate and enrich aggregated measure references. Implemented backward compatibility for deprecated measure type and bare count. Extended arithmetic/filter enrichment to handle AggregatedMeasureRef and resolve canonical column naming.
Measure Ingestion & Virtual Models
slayer/engine/ingestion.py, slayer/engine/query_engine.py
Changed auto-ingestion from creating multiple aggregate measures per column to creating exactly one measure per non-ID column. Updated virtual model construction to use per-column measures instead of predefined aggregation measures. Modified _resolve_cross_model_measure() to accept explicit aggregation_name and agg_kwargs parameters and handle measure_name == "*" for wildcard COUNT(*) support.
SQL Generation
slayer/sql/generator.py
Replaced DataType-based aggregation mapping with aggregation-name driven flow. Added _build_formula_agg() to support custom/parameterized aggregations via formula templates. Generalized first/last support to be aggregation-name based. Expanded aggregation function mapping to include median and custom formula-based aggregations with parameter substitution.
Integration Tests
tests/integration/test_integration.py, tests/integration/test_integration_duckdb.py, tests/integration/test_integration_postgres.py
Updated test assertions and query formulas to use new colon-syntax aggregations (*:count, measure:sum). Changed expected auto-ingested measures from pre-aggregated names to base-column names. Updated rollup and cross-model query formulas to reflect new aggregation syntax.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FormulaParser as Formula Parser
    participant Enrichment as Enrichment Engine
    participant QueryEngine as Query Engine
    participant SQLGen as SQL Generator
    participant Database as Database

    Client->>FormulaParser: parse_formula("revenue:sum + cost:sum")
    FormulaParser->>FormulaParser: _preprocess_agg_refs()
    FormulaParser->>FormulaParser: Replace "revenue:sum" & "cost:sum" with placeholders
    FormulaParser->>FormulaParser: Build AST with AggregatedMeasureRef nodes
    FormulaParser->>Enrichment: Return FieldSpec with agg_refs map

    Client->>Enrichment: enrich_query(query with fields=["revenue:sum"])
    Enrichment->>Enrichment: Detect AggregatedMeasureRef
    Enrichment->>Enrichment: _ensure_aggregated_measure(measure="revenue", agg="sum")
    Enrichment->>Enrichment: Validate "sum" in allowed_aggregations
    Enrichment->>Enrichment: Create EnrichedMeasure(aggregation="sum", aggregation_def=...)
    Enrichment->>QueryEngine: Return enriched query with resolved measures

    QueryEngine->>SQLGen: Generate SQL for enriched query
    SQLGen->>SQLGen: Map aggregation="sum" via _AGG_FUNCTION_MAP
    SQLGen->>SQLGen: _build_agg(measure, aggregation="sum")
    SQLGen->>SQLGen: Return "SUM(revenue_col)" SQL fragment
    SQLGen->>Database: Execute full query

    Database->>Client: Return result rows

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Filtering on computed columns #6: Overlaps in formula/filter parsing and engine filter-transform extraction code to support inline transforms and post-filters.
  • Refactor query_engine.py #17: Directly related—PR #17 refactors the call path to enrich_query, which this PR extensively implements with aggregation handling.
  • Dynamic Joins & Multi-Stage Queries #9: Both PRs modify core formula parsing and enrichment logic to handle measure references and aggregation resolution across multiple engine components.

Suggested reviewers

  • AivanF

🐰 A colon syntax hops through the parser's fields,
Aggregations bloom where query time yields,
No more predefined measures, wild and free,
Just measure names with aggregations—see?
One measure per column, clean and neat,
Makes DSL queries dance to a rhythmic beat!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.07%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title "Separate aggregation from measure definition" directly and concisely describes the main architectural change throughout this PR: measures now define row-level expressions while aggregations are specified separately at query time. |


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
.gitignore (1)

213-213: Consider adding a comment for the benchmarks/ entry.

While the pattern is self-explanatory, adding a brief comment (similar to line 214) would improve consistency with the documentation style used elsewhere in the file.

📝 Optional improvement
+# Benchmark results and data
 benchmarks/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 213, Add a brief inline comment for the gitignore entry
"benchmarks/" to match the documentation style used elsewhere (e.g., similar to
the comment on the following line), e.g., a one-line explanatory comment above
or beside the "benchmarks/" entry describing why benchmark artifacts are
ignored; update the ".gitignore" around the "benchmarks/" token to include that
short comment.
slayer/engine/enrichment.py (2)

332-344: Consider extracting canonical name computation to reduce duplication.

The canonical name logic (spec.aggregation_name if spec.measure_name == "*" else f"{spec.measure_name}_{spec.aggregation_name}") is duplicated between _flatten_spec (lines 332-336), _ensure_aggregated_measure (lines 147-150), and the query field processing loop (lines 446-449). Consider extracting this into a helper function.

♻️ Suggested helper function
def _canonical_agg_name(measure_name: str, aggregation_name: str) -> str:
    """Compute canonical result column name for an aggregated measure."""
    if measure_name == "*":
        return aggregation_name  # *:count → "count"
    return f"{measure_name}_{aggregation_name}"  # revenue:sum → "revenue_sum"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 332 - 344, The canonical aggregated
measure name logic is duplicated across _flatten_spec,
_ensure_aggregated_measure, and the query field processing loop; extract it into
a small helper function (e.g., _canonical_agg_name(measure_name: str,
aggregation_name: str) -> str) that returns aggregation_name when measure_name
== "*" otherwise f"{measure_name}_{aggregation_name}", then replace the inline
expressions in _flatten_spec, _ensure_aggregated_measure, and the query field
loop to call _canonical_agg_name to compute canonical_name before passing
alias_key or returning the model-qualified name.

479-485: Validation may be redundant when explicit time arg is provided.

The validation at lines 479-485 checks if last_agg_time_column is None after calling _ensure_aggregated_measure. However, _ensure_aggregated_measure (lines 183-187) already updates last_agg_time_column when agg_args contains an explicit time column. This means the error would never trigger for revenue:last(ordered_at) since last_agg_time_column would already be set.

This is actually correct behavior (no error when time arg is explicit), but the code flow is subtle. Consider adding a comment explaining that _ensure_aggregated_measure may have already set last_agg_time_column.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 479 - 485, The validation raising
ValueError when spec.aggregation_name is "first" or "last" and
last_agg_time_column is None is correct but subtle because
_ensure_aggregated_measure may already set last_agg_time_column when agg_args
includes an explicit time column; add a concise inline comment above this
validation referencing _ensure_aggregated_measure and explaining that it can
populate last_agg_time_column from explicit agg_args (so the check only triggers
when no time column was provided anywhere), and mention the relevant symbols
spec.aggregation_name, spec.measure_name, and last_agg_time_column to make the
control flow explicit for future readers.
slayer/core/formula.py (1)

122-133: Args parsing assumes no commas within argument values.

The current implementation splits on commas without considering that argument values could potentially contain commas (e.g., in future string-based args). For the current use cases (weight=quantity, ordered_at), this works fine, but consider adding a comment noting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/formula.py` around lines 122 - 133, The args parsing block that
builds agg_args and agg_kwargs from args_str (the code around agg_args,
agg_kwargs, and the for loop parsing inner) naively splits on commas and
therefore will break if argument values themselves contain commas; update the
parsing code or—if changing behavior is not desired now—add a clear inline
comment next to the agg_args/agg_kwargs parsing logic stating this limitation
(that values cannot contain commas) and document expected formats (e.g., key=val
or bare identifiers) so future maintainers know the constraint and consider
switching to a more robust parser if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@slayer/core/enums.py`:
- Line 127: The "weighted_avg" SQL expression in the enum/dict must guard
against a zero denominator; update the "weighted_avg" value (the mapping entry
named "weighted_avg" in slayer/core/enums.py) to divide by NULLIF(SUM({weight}),
0) instead of SUM({weight}) so the expression becomes SUM({value} * {weight}) /
NULLIF(SUM({weight}), 0), ensuring NULL is returned when total weight is zero.

In `@slayer/core/formula.py`:
- Around line 101-106: The regex _AGG_REF_RE currently includes an undocumented
optional path wildcard (?:\.\*)? that matches patterns like "customers.*:count";
remove this dead-code by deleting that optional group from the pattern so the
measure name group is only the base name or dotted path (i.e., change
_AGG_REF_RE to
r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?' and update
the inline comment accordingly), and run tests; if you instead intend to support
path.* keep the group but add an example to the AggregatedMeasureRef docstring
demonstrating "path.*:agg" usage so documentation matches the code.

In `@slayer/engine/enrichment.py`:
- Around line 181-187: The current use of nonlocal last_agg_time_column inside
the aggregation handling (when aggregation_name is "first" or "last" and
agg_args is present) creates a global side-effect so only the last explicit_time
wins; instead, stop mutating the shared last_agg_time_column and associate the
explicit time column with the specific measure (e.g., attach it to the
measure/aggregation metadata or store it in a dict keyed by the measure name or
aggregation identifier). Update the logic around aggregation_name, agg_args and
where last_agg_time_column is referenced to read per-measure time columns (e.g.,
measure.time_column or per_measure_time_map[measure_id]) and remove the nonlocal
mutation to avoid cross-measure interference.
- Around line 41-42: The file defines an unused logger variable named logger
(logging.getLogger(__name__))—either remove the unused import/definition or add
meaningful logging statements where aggregation or measure resolution happens
(e.g., inside functions that compute or resolve aggregations such as
resolve_aggregation or resolve_measure if present) so the logger is referenced;
update imports accordingly (remove logging import if deleting logger) and insert
process-level debug/info/error logs at key decision points to enable aggregation
resolution tracing.
- Around line 831-838: The reverse mapping _agg_reverse built from agg_refs
omits ref.agg_kwargs when formatting the aggregation signature, causing
reconstructed formulas to lose keyword arguments; update the comprehension that
builds _agg_reverse to include both ref.agg_args and ref.agg_kwargs inside the
parentheses (format kwargs as key=value pairs and join them with positional
args, e.g., ", ".join(ref.agg_args + [f"{k}={v}" for k,v in
ref.agg_kwargs.items()])), preserving the existing branch that omits parentheses
when there are no args/kwargs; target the _agg_reverse comprehension and the
attributes ref.measure_name, ref.aggregation_name, ref.agg_args, ref.agg_kwargs.

---

Nitpick comments:
In @.gitignore:
- Line 213: Add a brief inline comment for the gitignore entry "benchmarks/" to
match the documentation style used elsewhere (e.g., similar to the comment on
the following line), e.g., a one-line explanatory comment above or beside the
"benchmarks/" entry describing why benchmark artifacts are ignored; update the
".gitignore" around the "benchmarks/" token to include that short comment.

In `@slayer/core/formula.py`:
- Around line 122-133: The args parsing block that builds agg_args and
agg_kwargs from args_str (the code around agg_args, agg_kwargs, and the for loop
parsing inner) naively splits on commas and therefore will break if argument
values themselves contain commas; update the parsing code or—if changing
behavior is not desired now—add a clear inline comment next to the
agg_args/agg_kwargs parsing logic stating this limitation (that values cannot
contain commas) and document expected formats (e.g., key=val or bare
identifiers) so future maintainers know the constraint and consider switching to
a more robust parser if needed.

In `@slayer/engine/enrichment.py`:
- Around line 332-344: The canonical aggregated measure name logic is duplicated
across _flatten_spec, _ensure_aggregated_measure, and the query field processing
loop; extract it into a small helper function (e.g.,
_canonical_agg_name(measure_name: str, aggregation_name: str) -> str) that
returns aggregation_name when measure_name == "*" otherwise
f"{measure_name}_{aggregation_name}", then replace the inline expressions in
_flatten_spec, _ensure_aggregated_measure, and the query field loop to call
_canonical_agg_name to compute canonical_name before passing alias_key or
returning the model-qualified name.
- Around line 479-485: The validation raising ValueError when
spec.aggregation_name is "first" or "last" and last_agg_time_column is None is
correct but subtle because _ensure_aggregated_measure may already set
last_agg_time_column when agg_args includes an explicit time column; add a
concise inline comment above this validation referencing
_ensure_aggregated_measure and explaining that it can populate
last_agg_time_column from explicit agg_args (so the check only triggers when no
time column was provided anywhere), and mention the relevant symbols
spec.aggregation_name, spec.measure_name, and last_agg_time_column to make the
control flow explicit for future readers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f052283-3ecb-4320-b366-ce3b0e542832

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf5d66 and ae8826b.

📒 Files selected for processing (19)
  • .gitignore
  • CLAUDE.md
  • docs/examples/02_sql_vs_dsl/sql_vs_dsl_nb.ipynb
  • docs/examples/03_auto_ingest/auto_ingest_nb.ipynb
  • docs/examples/04_time/time_nb.ipynb
  • docs/examples/05_joined_measures/joined_measures_nb.ipynb
  • docs/examples/05_joins/joins_nb.ipynb
  • docs/examples/06_multistage_queries/multistage_queries_nb.ipynb
  • slayer/core/enums.py
  • slayer/core/formula.py
  • slayer/core/models.py
  • slayer/engine/enriched.py
  • slayer/engine/enrichment.py
  • slayer/engine/ingestion.py
  • slayer/engine/query_engine.py
  • slayer/sql/generator.py
  • tests/integration/test_integration.py
  • tests/integration/test_integration_duckdb.py
  • tests/integration/test_integration_postgres.py

# Built-in aggregation SQL formulas (for aggregations that use a template).
# {value} = measure's SQL expression; {param_name} = parameter values.
BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify where weighted_avg is defined/used and whether denominator guards exist.
rg -n "weighted_avg|SUM\\(\\{value\\} \\* \\{weight\\}\\) / SUM\\(\\{weight\\}\\)|NULLIF\\(SUM\\(\\{weight\\}\\),\\s*0\\)"

Repository: MotleyAI/slayer

Length of output: 1723


🏁 Script executed:

# Get context around where weighted_avg formula is built in generator.py
sed -n '920,940p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 928


🏁 Script executed:

# Check the models.py context around line 118 for weight handling
sed -n '110,130p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 772


🏁 Script executed:

# Search for any weight validation or guards before aggregation
rg -B5 -A5 "weighted_avg|weight\s*=" slayer/ --type py -m 10

Repository: MotleyAI/slayer

Length of output: 8237


🏁 Script executed:

# Check if there are any tests or examples showing weighted_avg usage
fd -t f -e py | xargs rg -l "weighted_avg" | head -5

Repository: MotleyAI/slayer

Length of output: 180


🏁 Script executed:

# Search for test files with weighted_avg
fd -t f -e py -path "*/test*" | xargs rg "weighted_avg" -l 2>/dev/null || echo "No test matches"

Repository: MotleyAI/slayer

Length of output: 398


🏁 Script executed:

# Check enrichment layer for any weight validation
sed -n '125,200p' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 3329


🏁 Script executed:

# Search for any division or NULLIF handling in formula processing
rg "division|NULLIF|divide" slayer/ --type py

Repository: MotleyAI/slayer

Length of output: 41


Guard weighted_avg against zero denominator.

SUM({weight}) can be zero for legitimate data groups (all weight values zero or NULL), which causes runtime errors or undefined behavior across SQL engines. Add NULLIF to safely return NULL instead of dividing by zero.

Proposed fix
 BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
-    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
+    "weighted_avg": "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)",
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
+    "weighted_avg": "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/enums.py` at line 127, The "weighted_avg" SQL expression in the
enum/dict must guard against a zero denominator; update the "weighted_avg" value
(the mapping entry named "weighted_avg" in slayer/core/enums.py) to divide by
NULLIF(SUM({weight}), 0) instead of SUM({weight}) so the expression becomes
SUM({value} * {weight}) / NULLIF(SUM({weight}), 0), ensuring NULL is returned
when total weight is zero.

Comment on lines +101 to +106
_AGG_REF_RE = re.compile(
    r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\.\*)?)'  # group 1: measure name, *, or path.*
    r':'                                              # colon separator
    r'([a-zA-Z_]\w*)'                                 # group 2: aggregation name
    r'(\([^)]*\))?'                                   # group 3: optional (args)
)
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if path.* syntax is used anywhere in the codebase or tests
rg -n '\w+\.\*:' --type=py

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Read the AggregatedMeasureRef class and surrounding context
sed -n '31,60p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1397


🏁 Script executed:

# Search for tests that verify the regex pattern
fd -e py -path '*/test*' | xargs rg -l 'AGG_REF_RE|AggregatedMeasureRef' 2>/dev/null | head -5

Repository: MotleyAI/slayer

Length of output: 295


🏁 Script executed:

# Check how placeholders are generated and used
rg -n '__agg\d+__' --type=py | head -20

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for test examples of aggregation syntax
rg -n 'customers\.|revenue:' --type=py -A 2 -B 2 | head -40

Repository: MotleyAI/slayer

Length of output: 2542


🏁 Script executed:

# Search for placeholder generation logic
rg -n '__agg|_preprocess_agg_refs' slayer/core/formula.py | head -30

Repository: MotleyAI/slayer

Length of output: 258


🏁 Script executed:

# Check enrichment.py to see how it consumes AggregatedMeasureRef
sed -n '1,50p' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 1441


🏁 Script executed:

# Look for actual placeholder generation in _preprocess_agg_refs function
sed -n '109,150p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1469


Remove undocumented path.* syntax from regex or add documentation.

The regex includes (?:\.\*)? to match patterns like customers.*:count, but this syntax is not documented in the AggregatedMeasureRef docstring examples and is not used anywhere in the codebase. Either remove this dead-code pattern or add it to the documented examples if it's intended for future use.

The concern about double underscores in identifiers is unfounded — the placeholder system correctly replaces identifiers like __internal__:sum with numeric placeholders like __agg0__, so there is no conflict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/formula.py` around lines 101 - 106, The regex _AGG_REF_RE
currently includes an undocumented optional path wildcard (?:\.\*)? that matches
patterns like "customers.*:count"; remove this dead-code by deleting that
optional group from the pattern so the measure name group is only the base name
or dotted path (i.e., change _AGG_REF_RE to
r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?' and update
the inline comment accordingly), and run tests; if you instead intend to support
path.* keep the group but add an example to the AggregatedMeasureRef docstring
demonstrating "path.*:agg" usage so documentation matches the code.
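To make the difference concrete, here is a small standalone check. The "current" pattern with the wildcard group is reconstructed from the prompt above and may differ in detail from `_AGG_REF_RE` as committed:

```python
import re

# Simplified pattern without the optional path wildcard, as suggested above.
WITHOUT_WILDCARD = re.compile(
    r"(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?"
)
# Reconstructed current pattern including the undocumented (?:\.\*)? group.
WITH_WILDCARD = re.compile(
    r"(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\.\*)?):([a-zA-Z_]\w*)(\([^)]*\))?"
)

# The documented colon syntax matches either way:
for ref in ["*:count", "revenue:sum", "price:weighted_avg(weight=quantity)"]:
    assert WITHOUT_WILDCARD.fullmatch(ref) and WITH_WILDCARD.fullmatch(ref)

# Only the wildcard variant accepts the undocumented "customers.*:count" form:
assert WITH_WILDCARD.fullmatch("customers.*:count")
assert WITHOUT_WILDCARD.fullmatch("customers.*:count") is None
```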

Comment on lines +41 to +42
logger = logging.getLogger(__name__)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if logger is used anywhere in this file
rg -n 'logger\.' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 41


Remove the unused logger or add logging calls for aggregation resolution.

The logger defined at line 41 is never used in this file. Either remove the unused import and definition, or add logging for debugging measure/aggregation resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 41 - 42, The file defines an unused
logger variable named logger (logging.getLogger(__name__))—either remove the
unused import/definition or add meaningful logging statements where aggregation
or measure resolution happens (e.g., inside functions that compute or resolve
aggregations such as resolve_aggregation or resolve_measure if present) so the
logger is referenced; update imports accordingly (remove logging import if
deleting logger) and insert process-level debug/info/error logs at key decision
points to enable aggregation resolution tracing.

Comment on lines +181 to +187
# For first/last with explicit time dimension arg, update last_agg_time_column
nonlocal last_agg_time_column
if aggregation_name in ("first", "last") and agg_args:
explicit_time = agg_args[0]
if "." not in explicit_time:
explicit_time = f"{model.name}.{explicit_time}"
last_agg_time_column = explicit_time

⚠️ Potential issue | 🟡 Minor

Side effect on last_agg_time_column may cause unexpected behavior with multiple first/last measures.

The nonlocal last_agg_time_column modification means that if multiple first/last aggregations are used with different explicit time columns, only the last one processed will take effect. This could lead to incorrect ordering for earlier first/last measures.

Consider whether each measure should track its own time column, or document that only one explicit time column is supported per query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 181 - 187, The current use of
nonlocal last_agg_time_column inside the aggregation handling (when
aggregation_name is "first" or "last" and agg_args is present) creates a global
side-effect so only the last explicit_time wins; instead, stop mutating the
shared last_agg_time_column and associate the explicit time column with the
specific measure (e.g., attach it to the measure/aggregation metadata or store
it in a dict keyed by the measure name or aggregation identifier). Update the
logic around aggregation_name, agg_args and where last_agg_time_column is
referenced to read per-measure time columns (e.g., measure.time_column or
per_measure_time_map[measure_id]) and remove the nonlocal mutation to avoid
cross-measure interference.
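A minimal sketch of the per-measure alternative (function and variable names here are illustrative, not SLayer's actual API):

```python
def resolve_first_last_time_columns(agg_refs, model_name, default_time_column):
    """Map each aggregation ref to the time column its first/last ordering
    should use, instead of mutating one shared nonlocal variable."""
    per_ref_time = {}
    for key, (agg_name, agg_args) in agg_refs.items():
        if agg_name in ("first", "last") and agg_args:
            explicit = agg_args[0]
            if "." not in explicit:
                explicit = f"{model_name}.{explicit}"
            per_ref_time[key] = explicit
        else:
            per_ref_time[key] = default_time_column
    return per_ref_time
```

Each `first`/`last` measure then reads its own entry, so two measures with different explicit time columns no longer clobber each other.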

Comment on lines +831 to +838
_agg_reverse = {
ph: (
f"{ref.measure_name}:{ref.aggregation_name}"
if not ref.agg_args and not ref.agg_kwargs
else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
)
for ph, ref in agg_refs.items()
}

⚠️ Potential issue | 🟡 Minor

Reverse mapping doesn't include kwargs, potentially producing incomplete formulas.

The reverse map construction at lines 831-838 only includes agg_args when building the original colon form, but ignores agg_kwargs. For a formula like price:weighted_avg(weight=quantity), the reconstructed formula would be price:weighted_avg() instead of price:weighted_avg(weight=quantity).

🐛 Proposed fix to include kwargs in reverse mapping
     _agg_reverse = {
         ph: (
             f"{ref.measure_name}:{ref.aggregation_name}"
             if not ref.agg_args and not ref.agg_kwargs
-            else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
+            else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args + [f'{k}={v}' for k, v in ref.agg_kwargs.items()])})"
         )
         for ph, ref in agg_refs.items()
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 831 - 838, The reverse mapping
_agg_reverse built from agg_refs omits ref.agg_kwargs when formatting the
aggregation signature, causing reconstructed formulas to lose keyword arguments;
update the comprehension that builds _agg_reverse to include both ref.agg_args
and ref.agg_kwargs inside the parentheses (format kwargs as key=value pairs and
join them with positional args, e.g., ", ".join(ref.agg_args + [f"{k}={v}" for
k,v in ref.agg_kwargs.items()])), preserving the existing branch that omits
parentheses when there are no args/kwargs; target the _agg_reverse comprehension
and the attributes ref.measure_name, ref.aggregation_name, ref.agg_args,
ref.agg_kwargs.
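As a standalone illustration of the fixed reconstruction (attribute names taken from the snippet; a sketch, not the committed code):

```python
def to_colon_form(measure_name, aggregation_name, agg_args=(), agg_kwargs=None):
    """Rebuild the query-time colon syntax, keeping both positional and
    keyword aggregation arguments."""
    agg_kwargs = agg_kwargs or {}
    if not agg_args and not agg_kwargs:
        return f"{measure_name}:{aggregation_name}"
    parts = list(agg_args) + [f"{k}={v}" for k, v in agg_kwargs.items()]
    return f"{measure_name}:{aggregation_name}({', '.join(parts)})"

assert to_colon_form("revenue", "sum") == "revenue:sum"
assert to_colon_form("revenue", "first", ["ordered_at"]) == "revenue:first(ordered_at)"
assert (to_colon_form("price", "weighted_avg", agg_kwargs={"weight": "quantity"})
        == "price:weighted_avg(weight=quantity)")
```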

Comment on lines +929 to +966
def _build_formula_agg(self, measure: EnrichedMeasure, agg_name: str) -> exp.Expression:
"""Build SQL for formula-based aggregations (weighted_avg, custom)."""
# Get formula: from aggregation_def or built-in
formula = None
if measure.aggregation_def and measure.aggregation_def.formula:
formula = measure.aggregation_def.formula
elif agg_name in BUILTIN_AGGREGATION_FORMULAS:
formula = BUILTIN_AGGREGATION_FORMULAS[agg_name]

if formula is None:
raise ValueError(
f"Aggregation '{agg_name}' has no formula. "
f"Custom aggregations must define a formula."
)

# Collect param values: query-time overrides > aggregation_def defaults
param_defaults = {}
if measure.aggregation_def:
param_defaults = {p.name: p.sql for p in measure.aggregation_def.params}
params = {**param_defaults, **measure.agg_kwargs}

# Validate required params
required = BUILTIN_AGGREGATION_REQUIRED_PARAMS.get(agg_name, [])
for req in required:
if req not in params:
raise ValueError(
f"Aggregation '{agg_name}' requires parameter '{req}'. "
f"Set it in the model's aggregation definition or at query time "
f"(e.g., 'measure:{agg_name}({req}=column)')."
)

# Resolve {value} and {param_name} in formula
col_expr = measure.sql or measure.name
substituted = formula.replace("{value}", col_expr)
for param_name, param_val in params.items():
substituted = substituted.replace(f"{{{param_name}}}", param_val)

return sqlglot.parse_one(substituted, dialect=self.dialect)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check where agg_kwargs values originate from in the enrichment flow
rg -n "agg_kwargs" --type py -C 3 slayer/

Repository: MotleyAI/slayer

Length of output: 10262


🏁 Script executed:

# Check how agg_kwargs values are validated and what BUILTIN_AGGREGATION_FORMULAS looks like
rg -n "BUILTIN_AGGREGATION_FORMULAS|BUILTIN_AGGREGATION_REQUIRED_PARAMS" --type py -C 2 slayer/

Repository: MotleyAI/slayer

Length of output: 1789


🏁 Script executed:

# Look at the formula definitions to understand how parameters are used
fd -e py -x grep -l "weighted_avg\|custom.*aggregation" {} \; slayer/

Repository: MotleyAI/slayer

Length of output: 392


🏁 Script executed:

# Check if there's any validation of agg_kwargs before it reaches _build_formula_agg
rg -n "agg_kwargs" --type py -B 5 -A 5 slayer/engine/enrichment.py | head -100

Repository: MotleyAI/slayer

Length of output: 4177


🏁 Script executed:

# Look at the actual substitution logic and any context around it
sed -n '929,970p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 2038


🏁 Script executed:

# Search for any validation of parameter values or agg_kwargs
rg -n "validate|sanitize|escape" --type py slayer/engine/enrichment.py | grep -i param

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for tests that show expected usage of weighted_avg and other parameterized aggregations
fd . slayer --full-path -e py -name "*test*" | head -20

Repository: MotleyAI/slayer

Length of output: 289


🏁 Script executed:

# Check for parameter validation in the aggregation definition or enrichment
cat slayer/core/enums.py | grep -A 10 "class Aggregation\|@dataclass.*Aggregation"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for the Aggregation and Parameter dataclasses to understand structure
rg -n "class.*Aggregation|class.*Parameter|@dataclass.*Aggregation" --type py slayer/ -A 5

Repository: MotleyAI/slayer

Length of output: 843


🏁 Script executed:

# Check if there are any tests for weighted_avg or formula substitution
rg -n "weighted_avg\|_build_formula_agg" --type py slayer/ -B 2 -A 5 | head -80

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the full Aggregation model definition
sed -n '115,135p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 894


🏁 Script executed:

# Search for test files and examples using weighted_avg
find slayer -name "*.py" -type f | xargs grep -l "weighted_avg" | head -5

Repository: MotleyAI/slayer

Length of output: 178


🏁 Script executed:

# Check if there are any constraints or validation on parameter values
rg -n "param|weight" --type py slayer/core/models.py -B 2 -A 2 | head -50

Repository: MotleyAI/slayer

Length of output: 763


🏁 Script executed:

# Look for documentation or examples in README or docs
find . -name "*.md" -o -name "*.rst" | xargs grep -l "weighted_avg\|agg_kwargs" 2>/dev/null | head -3

Repository: MotleyAI/slayer

Length of output: 70


🏁 Script executed:

# Check if sqlglot.parse_one is being called and how it handles the substituted string
rg -n "parse_one" --type py slayer/sql/generator.py -B 3 -A 2

Repository: MotleyAI/slayer

Length of output: 4052


🏁 Script executed:

# Check CLAUDE.md for documentation on weighted_avg usage
cat ./CLAUDE.md | grep -A 20 "weighted_avg"

Repository: MotleyAI/slayer

Length of output: 4633


🏁 Script executed:

# Look for tests that might show expected behavior
find slayer -name "*.py" -path "*test*" -type f

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Search for any test files in the repo
find . -name "test_*.py" -o -name "*_test.py" | head -10

Repository: MotleyAI/slayer

Length of output: 417


🏁 Script executed:

# Check if there's validation in the formula parsing step
rg -n "agg_kwargs" --type py slayer/core/formula.py -B 5 -A 5

Repository: MotleyAI/slayer

Length of output: 2995


🏁 Script executed:

# Look for how the parameter values are actually used - check if they're expected to be column names or full expressions
rg -n "def.*_build_formula_agg|substituted" --type py slayer/sql/generator.py -B 2 -A 10

Repository: MotleyAI/slayer

Length of output: 1684


🏁 Script executed:

# Check how the expression returned from _build_formula_agg is used
rg -n "_build_formula_agg" --type py slayer/sql/generator.py -A 3

Repository: MotleyAI/slayer

Length of output: 512


🏁 Script executed:

# Look at test files to see if there are tests for weighted_avg with different parameter values
grep -r "weighted_avg" tests/ --include="*.py" -B 2 -A 2

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check if there's any sanitization in the enrichment pipeline for parameter values
rg -n "param.*val|val.*strip|escape|quote" --type py slayer/engine/enrichment.py | head -20

Repository: MotleyAI/slayer

Length of output: 294


🏁 Script executed:

# Look at how column expressions are built - check _build_column_expr to understand if col_expr on line 961 is validated
sed -n '860,875p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 1037


🏁 Script executed:

# Check if sqlglot.parse_one can handle injection attempts safely - look for any error handling
rg -n "try|except|parse_one" --type py slayer/sql/generator.py | grep -A 2 "parse_one"

Repository: MotleyAI/slayer

Length of output: 972


🏁 Script executed:

# Verify the expected format of parameter values - look at documentation or tests for constraints
rg -n "weight|quantity|agg_kwargs" --type py tests/ -B 2 -A 2 | head -40

Repository: MotleyAI/slayer

Length of output: 2237


🏁 Script executed:

# Check how col_expr is constructed - does _resolve_sql provide any validation?
rg -n "_resolve_sql" --type py slayer/sql/generator.py -A 15 | head -25

Repository: MotleyAI/slayer

Length of output: 1480


🏁 Script executed:

# Look for any tests that try malicious input
grep -r "injection\|DROP\|;\|--" tests/ --include="*.py" | head -10

Repository: MotleyAI/slayer

Length of output: 964


🏁 Script executed:

# Check if parameters can be arbitrary expressions or only column names
rg -n "allowed_aggregations|param.*validate|weight.*column" --type py slayer/ -B 2 -A 2 | head -30

Repository: MotleyAI/slayer

Length of output: 2308


🏁 Script executed:

# Check if there are any try-catch blocks around parse_one that might handle injection attempts
rg -n "try|except" --type py slayer/sql/generator.py | grep -B 5 -A 5 "966\|parse_one"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any validation that restricts agg_kwargs to column names
rg -n "Column|Identifier|parse_one" --type py slayer/sql/generator.py | grep -B 3 -A 3 "param\|weight\|agg_kwargs"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check what happens if sqlglot.parse_one fails - is there error handling upstream?
rg -n "query_engine\|SlayerQuery" --type py slayer/ -l | head -3

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at how col_expr (measure.sql) is built - is it already validated as an expression?
rg -n "col_expr = measure.sql" --type py slayer/sql/generator.py -B 3 -A 10

Repository: MotleyAI/slayer

Length of output: 786


🏁 Script executed:

# Check for any documentation on expected parameter format or constraints
rg -n "param.*format|parameter.*column|weight.*definition" --type py slayer/core/models.py -B 2 -A 2

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at how queries are built and executed - check if the generated SQL from _build_formula_agg is safely incorporated
rg -n "execute\|query\|run_sql" --type py slayer/ -l | head -5

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check the broader context of _build_agg usage to see how the returned expression is used
rg -n "_build_agg\|_build_formula_agg" --type py slayer/sql/generator.py -B 2 -A 5

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any input validation or constraints in the aggregation parameter parsing
sed -n '115,145p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1048


🏁 Script executed:

# Check if parameter values should be restricted to identifier-like patterns
rg -n "identifier|column.*name|bare.*column" --type py slayer/core/models.py -B 1 -A 1 | head -30

Repository: MotleyAI/slayer

Length of output: 497


🏁 Script executed:

# Check if sqlglot.parse_one can parse multi-statement SQL or if it safely rejects it
python3 << 'EOF'
import sqlglot

# Test if sqlglot.parse_one can handle potential injection attempts
test_cases = [
    "SUM(amount * quantity) / SUM(quantity)",  # normal
    "SUM(amount * quantity); DROP TABLE orders",  # injection attempt
    "SUM(amount * quantity) OR 1=1",  # another attempt
]

for test in test_cases:
    try:
        result = sqlglot.parse_one(test, dialect="sqlite")
        print(f"✓ Parsed: {test[:50]}")
    except Exception as e:
        print(f"✗ Failed: {test[:50]} -> {type(e).__name__}")
EOF

Repository: MotleyAI/slayer

Length of output: 178


🏁 Script executed:

# Look at the AggregationParam definition to understand expected format
sed -n '109,113p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 226


🏁 Script executed:

# Check if there's any documentation about parameter value constraints
rg -n "AggregationParam|param.*sql|weight.*sql" --type py slayer/core/models.py -B 2 -A 3

Repository: MotleyAI/slayer

Length of output: 476


🏁 Script executed:

# Look at how sqlglot.parse_one is imported and check for any safety measures
rg -n "import.*sqlglot|from.*sqlglot" --type py slayer/sql/generator.py -A 2

Repository: MotleyAI/slayer

Length of output: 309


🏁 Script executed:

# Check if there are integration tests for weighted_avg that might show expected behavior
find tests -name "*.py" -type f -exec grep -l "weighted_avg" {} \;

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any validation of formula or aggregation definitions
rg -n "Aggregation|formula.*validate|param.*validate" --type py slayer/core/models.py -A 3 | head -40

Repository: MotleyAI/slayer

Length of output: 875


🏁 Script executed:

# Install sqlglot and test if it provides injection protection
pip install sqlglot -q && python3 << 'EOF'
import sqlglot

# Test if sqlglot.parse_one safely handles injection attempts in formula substitution
formula = "SUM({value} * {weight}) / SUM({weight})"

test_cases = [
    ("quantity", "Normal column name"),
    ("quantity + 1", "Expression"),
    ("quantity); DROP TABLE orders; --", "Injection attempt"),
    ("(SELECT COUNT(*) FROM orders)", "Subquery"),
]

for param_val, desc in test_cases:
    substituted = formula.replace("{value}", "price").replace("{weight}", param_val)
    print(f"\n{desc}: weight={param_val}")
    print(f"  Substituted: {substituted}")
    try:
        result = sqlglot.parse_one(substituted, dialect="sqlite")
        print(f"  ✓ Parsed successfully")
        print(f"  SQL output: {result.sql(dialect='sqlite')}")
    except Exception as e:
        print(f"  ✗ Parse error: {e}")
EOF

Repository: MotleyAI/slayer

Length of output: 953


🏁 Script executed:

# Check if there's any documentation about parameter validation in comments or docstrings
rg -n "param.*safe|sanitize|validate|column.*name|SQL.*expression" --type py slayer/core/ -B 1 -A 1 | head -50

Repository: MotleyAI/slayer

Length of output: 2618


🏁 Script executed:

# Check how the final SQL is executed - is it parameterized or executed as a raw string?
rg -n "execute\|run\|query" --type py slayer/engine/ -l

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the query engine to see if SQL is executed directly or with parameters
sed -n '1,50p' slayer/engine/query_engine.py | head -30

Repository: MotleyAI/slayer

Length of output: 949


🏁 Script executed:

# Check if there's any escaping or quoting of parameter values
rg -n "quote|escape|literal|param" --type py slayer/sql/generator.py | head -20

Repository: MotleyAI/slayer

Length of output: 1327


🏁 Script executed:

# Check how the SQL from _build_formula_agg is used downstream
rg -n "def _build_agg|_build_formula_agg" --type py slayer/sql/generator.py -A 20 | grep -A 15 "def _build_agg"

Repository: MotleyAI/slayer

Length of output: 958


🏁 Script executed:

# Check how the final SQL expression is rendered and executed
rg -n "\.sql\(|execute|client" --type py slayer/engine/query_engine.py | head -20

Repository: MotleyAI/slayer

Length of output: 531


🏁 Script executed:

# Look for any tests that validate parameter safety or show expected usage
find tests -name "*.py" -type f -exec grep -l "agg_kwargs\|weighted_avg\|aggregation.*param" {} \;

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check if there's any documentation about parameter constraints or safety
cat slayer/core/models.py | grep -A 10 "class AggregationParam"

Repository: MotleyAI/slayer

Length of output: 471


🏁 Script executed:

# Check SlayerSQLClient to see if it uses parameterized queries or raw SQL execution
rg -n "class SlayerSQLClient|def execute" --type py slayer/sql/client.py -A 10 | head -30

Repository: MotleyAI/slayer

Length of output: 623


🏁 Script executed:

# Look at the actual execution to confirm raw SQL string usage
sed -n '1,100p' slayer/sql/client.py | head -50

Repository: MotleyAI/slayer

Length of output: 1355


🏁 Script executed:

# Search for any validation of formula or parameter values before they reach _build_formula_agg
rg -n "_ensure_aggregated_measure\|_build_formula_agg" --type py slayer/engine/enrichment.py -B 5 -A 5 | head -40

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check the _execute_sql function to see how SQL is actually executed
rg -n "def _execute_sql" --type py slayer/sql/client.py -A 30

Repository: MotleyAI/slayer

Length of output: 1104


🏁 Script executed:

# Look for any tests that demonstrate parameter value validation or injection testing
grep -r "injection\|DROP\|--\|;\|parametr" tests/ --include="*.py" -i | head -20

Repository: MotleyAI/slayer

Length of output: 2101


🏁 Script executed:

# Check if there's a CHANGELOG or docs explaining design decisions around parameter handling
find . -name "CHANGELOG*" -o -name "*.md" | xargs grep -l "param\|aggregation\|injection" 2>/dev/null | head -3

Repository: MotleyAI/slayer

Length of output: 137


🏁 Script executed:

# Check what sqlglot actually renders for the injection case - detailed test
pip install sqlglot -q && python3 << 'EOF'
import sqlglot

# Test the exact scenario: formula substitution → parse → render
formula = "SUM({value} * {weight}) / SUM({weight})"
col_expr = "price"
malicious_weight = "quantity); DROP TABLE orders; --"

# Step 1: String substitution (what happens in _build_formula_agg)
substituted = formula.replace("{value}", col_expr).replace("{weight}", malicious_weight)
print(f"1. After string substitution:\n   {substituted}\n")

# Step 2: Parse with sqlglot
try:
    parsed = sqlglot.parse_one(substituted, dialect="sqlite")
    print(f"2. Parsed successfully as AST")
except Exception as e:
    print(f"2. Parse failed: {e}")
    exit(1)

# Step 3: Render back to SQL
rendered_sql = parsed.sql(dialect="sqlite")
print(f"3. Rendered SQL:\n   {rendered_sql}\n")

# Step 4: Check if malicious statement is present
if "DROP TABLE" in rendered_sql:
    print("⚠️ VULNERABLE: DROP TABLE statement is present in rendered SQL!")
else:
    print("✓ Safe: DROP TABLE statement was removed/neutralized")
EOF

Repository: MotleyAI/slayer

Length of output: 384


🏁 Script executed:

# Check if there are any comments in the codebase about parameter validation or security
rg -n "TODO|FIXME|HACK|SECURITY|injection|safe" --type py slayer/sql/generator.py | grep -i "param\|agg\|formula\|safe"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the broader context of how parameters should be used - check if there are constraints documented
cat slayer/core/models.py | sed -n '100,200p'

Repository: MotleyAI/slayer

Length of output: 3858


Validate and sanitize user-provided parameter values in aggregations to prevent SQL injection.

The _build_formula_agg function substitutes user-provided parameter values directly into SQL formulas via string replacement (lines 962–964), then passes the result to sqlglot.parse_one(). While parameter definitions in aggregations support SQL expressions, user-provided agg_kwargs from query-time aggregation syntax (e.g., "revenue:weighted_avg(weight=user_input)") are not validated before substitution.

Testing confirms that malicious input like "quantity); DROP TABLE orders; --" is not rejected by sqlglot's parser and produces executable SQL statements when rendered. Since the generated SQL is executed as raw text via conn.execute(sa.text(sql)), this is a real SQL injection vector.
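The failure mode reduces to plain string substitution; the formula shape below matches the built-in weighted_avg used in the verification script above:

```python
formula = "SUM({value} * {weight}) / SUM({weight})"
malicious_weight = "quantity); DROP TABLE orders; --"

substituted = formula.replace("{value}", "price").replace("{weight}", malicious_weight)
# The attacker-controlled statement survives into the SQL text verbatim.
assert "DROP TABLE orders" in substituted
```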

Mitigations to consider:

  • Validate parameter values against a whitelist of column names or safe expression patterns before substitution
  • Use sqlglot's parameterization capabilities instead of string replacement
  • Document that parameters must come from trusted sources only and enforce this at the enrichment stage
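A stdlib-only sketch of the first mitigation, restricting query-time parameter values to bare (optionally dotted) column names — names here are illustrative, not SLayer's API:

```python
import re

_SAFE_PARAM_RE = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\Z")

def validate_agg_param(value: str) -> str:
    """Accept only identifier-like column references as aggregation params."""
    if not _SAFE_PARAM_RE.match(value):
        raise ValueError(f"Unsafe aggregation parameter: {value!r}")
    return value

assert validate_agg_param("quantity") == "quantity"
assert validate_agg_param("orders.quantity") == "orders.quantity"
try:
    validate_agg_param("quantity); DROP TABLE orders; --")
except ValueError:
    pass
else:
    raise AssertionError("injection string was accepted")
```

Anything needing a richer expression can still be permitted explicitly in the model's trusted aggregation definition rather than at query time.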
