Separate aggregation from measure definition#22

Open
ZmeiGorynych wants to merge 1 commit into main from
egor/dev-1228-specify-aggregation-separately-from-measure

Conversation

@ZmeiGorynych
Member

@ZmeiGorynych ZmeiGorynych commented Apr 9, 2026

Summary

This PR redesigns SLayer's measure/aggregation system to separate the row-level expression (measure) from the aggregation method. Previously, a measure tightly coupled both: {name: "revenue", sql: "amount", type: "sum"}. Now:

  • Measures are named row-level expressions: {name: "revenue", sql: "amount"}
  • Aggregations are specified at query time using colon syntax: "revenue:sum"
  • Custom aggregations (e.g., weighted average) are first-class named entities defined at the model level

Why?

  1. Eliminates measure proliferation — instead of N measures per column (revenue_sum, revenue_avg, revenue_min, ...), you define one measure (revenue) and aggregate it any way you want at query time
  2. Enables complex aggregations — weighted averages, custom formulas, percentiles can be defined and reused
  3. Cleaner separation of concerns — the "what" (measure expression) is separate from the "how" (aggregation method)

New query syntax

# Old syntax (still works via backward compat, emits DeprecationWarning)
fields=["count", "revenue"]

# New colon syntax
fields=[
    "*:count",                              # COUNT(*)
    "revenue:sum",                          # SUM(amount)
    "revenue:avg",                          # AVG(amount)
    "customer_id:count_distinct",           # COUNT(DISTINCT customer_id)
    "price:weighted_avg(weight=quantity)",   # custom aggregation with params
    "revenue:median",                       # PERCENTILE_CONT(0.5)
    "revenue:first(ordered_at)",            # earliest value (explicit time col)
    "revenue:last",                         # latest value
]

# Transforms wrap aggregated refs naturally
fields=[
    "cumsum(revenue:sum)",                  # running sum of SUM(revenue)
    {"formula": "change(revenue:sum)", "name": "mom"},  # month-over-month change
]

# Arithmetic works too
fields=[
    {"formula": "revenue:sum / *:count", "name": "aov"},  # average order value
]

# Cross-model measures
fields=[
    "customers.*:count",                    # COUNT(*) on joined customers model
    "customers.score:avg",                  # AVG(score) on joined customers model
]

New model syntax

measures:
  - name: revenue
    sql: amount
  - name: price
    sql: price
    allowed_aggregations: [sum, avg, min, max, weighted_avg]  # optional whitelist
  - name: customer_id
    sql: customer_id
    allowed_aggregations: [count, count_distinct]

aggregations:
  # Override built-in to set default weight for this model
  - name: weighted_avg
    params:
      - name: weight
        sql: quantity

  # Fully custom aggregation
  - name: trimmed_mean
    formula: "AVG(CASE WHEN {value} BETWEEN {lo} AND {hi} THEN {value} END)"
    params:
      - name: lo
        sql: "0"
      - name: hi
        sql: "1000"
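For illustration, colon refs like the ones above can be split into their parts with a small helper. This is a hypothetical sketch — the names `AGG_REF` and `parse_agg_ref` are illustrative, not SLayer's actual internals (the real preprocessing lives in slayer/core/formula.py) — and, like the implementation, it assumes argument values contain no commas:

```python
import re

# Hypothetical colon-syntax parser; illustrative only, not SLayer's code.
AGG_REF = re.compile(
    r"(?P<measure>\*|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)"  # measure name, * or dotted path
    r":(?P<agg>[A-Za-z_]\w*)"                           # aggregation name
    r"(?:\((?P<args>[^)]*)\))?"                         # optional (args)
)

def parse_agg_ref(field: str) -> tuple[str, str, list[str], dict[str, str]]:
    """Split 'measure:agg(key=val, ...)' into (measure, agg, args, kwargs)."""
    m = AGG_REF.fullmatch(field)
    if m is None:
        raise ValueError(f"not an aggregated measure ref: {field!r}")
    args: list[str] = []
    kwargs: dict[str, str] = {}
    if m.group("args"):
        # Note: naive comma split — argument values may not contain commas.
        for part in m.group("args").split(","):
            part = part.strip()
            if "=" in part:
                key, val = part.split("=", 1)
                kwargs[key.strip()] = val.strip()
            else:
                args.append(part)
    return m.group("measure"), m.group("agg"), args, kwargs
```

For example, `parse_agg_ref("price:weighted_avg(weight=quantity)")` yields `("price", "weighted_avg", [], {"weight": "quantity"})`.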

Result column naming

Colon becomes underscore in result keys (same canonical form as before):

  • revenue:sum → orders.revenue_sum
  • *:count → orders.count
  • customers.*:count → orders.customers__count
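The naming rule can be sketched as a small pure function. This is an illustrative reconstruction from the examples above, not SLayer's actual code; the `model="orders"` default is an assumption matching those examples:

```python
def canonical_name(measure: str, aggregation: str, model: str = "orders") -> str:
    """Illustrative sketch of the colon-to-underscore result-key rule."""
    if measure == "*":
        base = aggregation                                 # *:count -> count
    elif "." in measure:
        joined_model, col = measure.split(".", 1)
        if col == "*":
            base = f"{joined_model}__{aggregation}"        # customers.*:count -> customers__count
        else:
            # Assumed extrapolation for cross-model column measures.
            base = f"{joined_model}__{col}_{aggregation}"
    else:
        base = f"{measure}_{aggregation}"                  # revenue:sum -> revenue_sum
    return f"{model}.{base}"
```

Usage: `canonical_name("revenue", "sum")` returns `"orders.revenue_sum"`.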

Built-in aggregations (always available)

| Name | SQL | Notes |
|---|---|---|
| sum | SUM(expr) | |
| avg | AVG(expr) | |
| min | MIN(expr) | |
| max | MAX(expr) | |
| count | COUNT(expr) or COUNT(*) | `*:count` = COUNT(*), `col:count` = COUNT(col) |
| count_distinct | COUNT(DISTINCT expr) | |
| first | Earliest record's value | Requires time dimension |
| last | Latest record's value | Requires time dimension |
| weighted_avg | SUM(v*w)/SUM(w) | Requires weight param |
| median | PERCENTILE_CONT(0.5) | Dialect-dependent |
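Template-based aggregations such as weighted_avg expand by substituting the measure's SQL expression and parameter values into the formula string. A minimal sketch: the `WEIGHTED_AVG` template matches the PR's built-in formula, while `render_aggregation` is a hypothetical helper, not SLayer's actual SQL generator:

```python
# Illustrative template expansion for parameterised aggregations.
WEIGHTED_AVG = "SUM({value} * {weight}) / SUM({weight})"

def render_aggregation(template: str, value_sql: str, **params: str) -> str:
    """Substitute {value} and parameter placeholders into an aggregation template."""
    return template.format(value=value_sql, **params)

sql = render_aggregation(WEIGHTED_AVG, value_sql="price", weight="quantity")
print(sql)  # SUM(price * quantity) / SUM(quantity)
```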

Backward compatibility

  • Old type field on measures is still accepted (deprecated, emits DeprecationWarning)
  • Bare measure names in formulas (e.g., "revenue") work when the measure has a deprecated type field
  • All existing queries continue to work with deprecation warnings guiding migration

Auto-ingestion changes

  • Before: N measures per column (amount_sum, amount_avg, amount_min, amount_max, amount_distinct)
  • After: 1 measure per column (amount), aggregated at query time (amount:sum, amount:avg, etc.)
  • *:count is always available without any measure definition

Files changed

  • slayer/core/enums.py — BUILTIN_AGGREGATIONS, formula/param constants
  • slayer/core/models.py — Aggregation, AggregationParam models; Measure.allowed_aggregations
  • slayer/core/formula.py — AggregatedMeasureRef, colon-syntax preprocessing
  • slayer/engine/enriched.py — EnrichedMeasure uses aggregation string
  • slayer/engine/enrichment.py — aggregated measure resolution, backward compat, filter preprocessing
  • slayer/engine/query_engine.py — cross-model * measures, virtual model simplification
  • slayer/sql/generator.py — string-based dispatch, formula aggregations, first/median support
  • slayer/engine/ingestion.py — one measure per column
  • All example notebooks and integration tests updated

Test plan

  • 409 tests pass (unit + integration including SQLite, Postgres, DuckDB, notebooks)
  • poetry run ruff check slayer/ tests/ — clean
  • Backward compat: old-style measures with type field still work (deprecation warnings)
  • Manual test: define a model with custom weighted_avg aggregation and query it
  • Manual test: verify allowed_aggregations whitelist rejects invalid aggregations

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced explicit aggregation syntax using colon notation (e.g., revenue:sum, *:count, price:weighted_avg(weight=quantity)) to apply different aggregations at query time.
    • Simplified auto-generated measures: now creates one measure per non-ID column instead of multiple aggregation variants.
    • Maintained backward compatibility for existing measure syntax with deprecation warnings.
  • Documentation

    • Updated example notebooks to demonstrate the new aggregation syntax.

Measures become named row-level expressions; aggregation is specified
at query time using colon syntax (e.g., revenue:sum, *:count).
Custom aggregations are first-class model entities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces explicit query-time aggregation syntax (colon notation: measure:aggregation) as an alternative to auto-generated aggregate measures. It restructures measure definitions to contain one per column, updates all example notebooks to use new field syntax, and reworks the enrichment and SQL generation pipelines to resolve aggregations at query time.

Changes

Cohort / File(s) Summary
Documentation & Examples
.gitignore, CLAUDE.md, docs/examples/02_sql_vs_dsl/sql_vs_dsl_nb.ipynb, docs/examples/03_auto_ingest/auto_ingest_nb.ipynb, docs/examples/04_time/time_nb.ipynb, docs/examples/05_joined_measures/joined_measures_nb.ipynb, docs/examples/05_joins/joins_nb.ipynb, docs/examples/06_multistage_queries/multistage_queries_nb.ipynb
Updated examples and documentation to reflect new colon-based aggregation syntax (measure:agg, *:count), replaced auto-generated measure names with explicit aggregation references, and cleared execution outputs from notebooks. CLAUDE.md updated with comprehensive semantics for the new aggregation model and deprecation guidance.
Core Data Model
slayer/core/enums.py, slayer/core/models.py
Added aggregation configuration constants (BUILTIN_AGGREGATIONS, BUILTIN_AGGREGATION_FORMULAS, BUILTIN_AGGREGATION_REQUIRED_PARAMS). Extended Measure with optional type and allowed_aggregations field. Introduced new Aggregation and AggregationParam models. Added model-level validator for aggregation whitelisting.
Formula Parsing & AST
slayer/core/formula.py
Introduced AggregatedMeasureRef AST node for colon-syntax aggregations. Added _preprocess_agg_refs() to scan and replace aggregation patterns before AST parsing. Updated ArithmeticField and MixedArithmeticField with agg_refs tracking. Modified parse_formula() to preprocess and inject aggregated-ref mappings into AST construction.
Enrichment & Query Resolution
slayer/engine/enriched.py, slayer/engine/enrichment.py
Refactored EnrichedMeasure from type: DataType to aggregation: str with aggregation_def and agg_kwargs fields. Added _ensure_aggregated_measure() to validate and enrich aggregated measure references. Implemented backward compatibility for deprecated measure type and bare count. Extended arithmetic/filter enrichment to handle AggregatedMeasureRef and resolve canonical column naming.
Measure Ingestion & Virtual Models
slayer/engine/ingestion.py, slayer/engine/query_engine.py
Changed auto-ingestion from creating multiple aggregate measures per column to creating exactly one measure per non-ID column. Updated virtual model construction to use per-column measures instead of predefined aggregation measures. Modified _resolve_cross_model_measure() to accept explicit aggregation_name and agg_kwargs parameters and handle measure_name == "*" for wildcard COUNT(*) support.
SQL Generation
slayer/sql/generator.py
Replaced DataType-based aggregation mapping with aggregation-name driven flow. Added _build_formula_agg() to support custom/parameterized aggregations via formula templates. Generalized first/last support to be aggregation-name based. Expanded aggregation function mapping to include median and custom formula-based aggregations with parameter substitution.
Integration Tests
tests/integration/test_integration.py, tests/integration/test_integration_duckdb.py, tests/integration/test_integration_postgres.py
Updated test assertions and query formulas to use new colon-syntax aggregations (*:count, measure:sum). Changed expected auto-ingested measures from pre-aggregated names to base-column names. Updated rollup and cross-model query formulas to reflect new aggregation syntax.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FormulaParser as Formula Parser
    participant Enrichment as Enrichment Engine
    participant QueryEngine as Query Engine
    participant SQLGen as SQL Generator
    participant Database as Database

    Client->>FormulaParser: parse_formula("revenue:sum + cost:sum")
    FormulaParser->>FormulaParser: _preprocess_agg_refs()
    FormulaParser->>FormulaParser: Replace "revenue:sum" & "cost:sum" with placeholders
    FormulaParser->>FormulaParser: Build AST with AggregatedMeasureRef nodes
    FormulaParser->>Enrichment: Return FieldSpec with agg_refs map

    Client->>Enrichment: enrich_query(query with fields=["revenue:sum"])
    Enrichment->>Enrichment: Detect AggregatedMeasureRef
    Enrichment->>Enrichment: _ensure_aggregated_measure(measure="revenue", agg="sum")
    Enrichment->>Enrichment: Validate "sum" in allowed_aggregations
    Enrichment->>Enrichment: Create EnrichedMeasure(aggregation="sum", aggregation_def=...)
    Enrichment->>QueryEngine: Return enriched query with resolved measures

    QueryEngine->>SQLGen: Generate SQL for enriched query
    SQLGen->>SQLGen: Map aggregation="sum" via _AGG_FUNCTION_MAP
    SQLGen->>SQLGen: _build_agg(measure, aggregation="sum")
    SQLGen->>SQLGen: Return "SUM(revenue_col)" SQL fragment
    SQLGen->>Database: Execute full query

    Database->>Client: Return result rows

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Filtering on computed columns #6: Overlaps in formula/filter parsing and engine filter-transform extraction code to support inline transforms and post-filters.
  • Refactor query_engine.py #17: Directly related—PR #17 refactors the call path to enrich_query, which this PR extensively implements with aggregation handling.
  • Dynamic Joins & Multi-Stage Queries #9: Both PRs modify core formula parsing and enrichment logic to handle measure references and aggregation resolution across multiple engine components.

Suggested reviewers

  • AivanF

🐰 A colon syntax hops through the parser's fields,
Aggregations bloom where query time yields,
No more predefined measures, wild and free,
Just measure names with aggregations—see?
One measure per column, clean and neat,
Makes DSL queries dance to a rhythmic beat!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.07%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title "Separate aggregation from measure definition" directly and concisely describes the main architectural change throughout this PR: measures now define row-level expressions while aggregations are specified separately at query time. |


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
.gitignore (1)

213-213: Consider adding a comment for the benchmarks/ entry.

While the pattern is self-explanatory, adding a brief comment (similar to line 214) would improve consistency with the documentation style used elsewhere in the file.

📝 Optional improvement
+# Benchmark results and data
 benchmarks/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 213, Add a brief inline comment for the gitignore entry
"benchmarks/" to match the documentation style used elsewhere (e.g., similar to
the comment on the following line), e.g., a one-line explanatory comment above
or beside the "benchmarks/" entry describing why benchmark artifacts are
ignored; update the ".gitignore" around the "benchmarks/" token to include that
short comment.
slayer/engine/enrichment.py (2)

332-344: Consider extracting canonical name computation to reduce duplication.

The canonical name logic (spec.aggregation_name if spec.measure_name == "*" else f"{spec.measure_name}_{spec.aggregation_name}") is duplicated between _flatten_spec (lines 332-336), _ensure_aggregated_measure (lines 147-150), and the query field processing loop (lines 446-449). Consider extracting this into a helper function.

♻️ Suggested helper function
def _canonical_agg_name(measure_name: str, aggregation_name: str) -> str:
    """Compute canonical result column name for an aggregated measure."""
    if measure_name == "*":
        return aggregation_name  # *:count → "count"
    return f"{measure_name}_{aggregation_name}"  # revenue:sum → "revenue_sum"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 332 - 344, The canonical aggregated
measure name logic is duplicated across _flatten_spec,
_ensure_aggregated_measure, and the query field processing loop; extract it into
a small helper function (e.g., _canonical_agg_name(measure_name: str,
aggregation_name: str) -> str) that returns aggregation_name when measure_name
== "*" otherwise f"{measure_name}_{aggregation_name}", then replace the inline
expressions in _flatten_spec, _ensure_aggregated_measure, and the query field
loop to call _canonical_agg_name to compute canonical_name before passing
alias_key or returning the model-qualified name.

479-485: Validation may be redundant when explicit time arg is provided.

The validation at lines 479-485 checks if last_agg_time_column is None after calling _ensure_aggregated_measure. However, _ensure_aggregated_measure (lines 183-187) already updates last_agg_time_column when agg_args contains an explicit time column. This means the error would never trigger for revenue:last(ordered_at) since last_agg_time_column would already be set.

This is actually correct behavior (no error when time arg is explicit), but the code flow is subtle. Consider adding a comment explaining that _ensure_aggregated_measure may have already set last_agg_time_column.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 479 - 485, The validation raising
ValueError when spec.aggregation_name is "first" or "last" and
last_agg_time_column is None is correct but subtle because
_ensure_aggregated_measure may already set last_agg_time_column when agg_args
includes an explicit time column; add a concise inline comment above this
validation referencing _ensure_aggregated_measure and explaining that it can
populate last_agg_time_column from explicit agg_args (so the check only triggers
when no time column was provided anywhere), and mention the relevant symbols
spec.aggregation_name, spec.measure_name, and last_agg_time_column to make the
control flow explicit for future readers.
slayer/core/formula.py (1)

122-133: Args parsing assumes no commas within argument values.

The current implementation splits on commas without considering that argument values could potentially contain commas (e.g., in future string-based args). For the current use cases (weight=quantity, ordered_at), this works fine, but consider adding a comment noting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/formula.py` around lines 122 - 133, The args parsing block that
builds agg_args and agg_kwargs from args_str (the code around agg_args,
agg_kwargs, and the for loop parsing inner) naively splits on commas and
therefore will break if argument values themselves contain commas; update the
parsing code or—if changing behavior is not desired now—add a clear inline
comment next to the agg_args/agg_kwargs parsing logic stating this limitation
(that values cannot contain commas) and document expected formats (e.g., key=val
or bare identifiers) so future maintainers know the constraint and consider
switching to a more robust parser if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@slayer/core/enums.py`:
- Line 127: The "weighted_avg" SQL expression in the enum/dict must guard
against a zero denominator; update the "weighted_avg" value (the mapping entry
named "weighted_avg" in slayer/core/enums.py) to divide by NULLIF(SUM({weight}),
0) instead of SUM({weight}) so the expression becomes SUM({value} * {weight}) /
NULLIF(SUM({weight}), 0), ensuring NULL is returned when total weight is zero.

In `@slayer/core/formula.py`:
- Around line 101-106: The regex _AGG_REF_RE currently includes an undocumented
optional path wildcard (?:\.\*)? that matches patterns like "customers.*:count";
remove this dead-code by deleting that optional group from the pattern so the
measure name group is only the base name or dotted path (i.e., change
_AGG_REF_RE to
r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?' and update
the inline comment accordingly), and run tests; if you instead intend to support
path.* keep the group but add an example to the AggregatedMeasureRef docstring
demonstrating "path.*:agg" usage so documentation matches the code.

In `@slayer/engine/enrichment.py`:
- Around line 181-187: The current use of nonlocal last_agg_time_column inside
the aggregation handling (when aggregation_name is "first" or "last" and
agg_args is present) creates a global side-effect so only the last explicit_time
wins; instead, stop mutating the shared last_agg_time_column and associate the
explicit time column with the specific measure (e.g., attach it to the
measure/aggregation metadata or store it in a dict keyed by the measure name or
aggregation identifier). Update the logic around aggregation_name, agg_args and
where last_agg_time_column is referenced to read per-measure time columns (e.g.,
measure.time_column or per_measure_time_map[measure_id]) and remove the nonlocal
mutation to avoid cross-measure interference.
- Around line 41-42: The file defines an unused logger variable named logger
(logging.getLogger(__name__))—either remove the unused import/definition or add
meaningful logging statements where aggregation or measure resolution happens
(e.g., inside functions that compute or resolve aggregations such as
resolve_aggregation or resolve_measure if present) so the logger is referenced;
update imports accordingly (remove logging import if deleting logger) and insert
process-level debug/info/error logs at key decision points to enable aggregation
resolution tracing.
- Around line 831-838: The reverse mapping _agg_reverse built from agg_refs
omits ref.agg_kwargs when formatting the aggregation signature, causing
reconstructed formulas to lose keyword arguments; update the comprehension that
builds _agg_reverse to include both ref.agg_args and ref.agg_kwargs inside the
parentheses (format kwargs as key=value pairs and join them with positional
args, e.g., ", ".join(ref.agg_args + [f"{k}={v}" for k,v in
ref.agg_kwargs.items()])), preserving the existing branch that omits parentheses
when there are no args/kwargs; target the _agg_reverse comprehension and the
attributes ref.measure_name, ref.aggregation_name, ref.agg_args, ref.agg_kwargs.

---

Nitpick comments:
In @.gitignore:
- Line 213: Add a brief inline comment for the gitignore entry "benchmarks/" to
match the documentation style used elsewhere (e.g., similar to the comment on
the following line), e.g., a one-line explanatory comment above or beside the
"benchmarks/" entry describing why benchmark artifacts are ignored; update the
".gitignore" around the "benchmarks/" token to include that short comment.

In `@slayer/core/formula.py`:
- Around line 122-133: The args parsing block that builds agg_args and
agg_kwargs from args_str (the code around agg_args, agg_kwargs, and the for loop
parsing inner) naively splits on commas and therefore will break if argument
values themselves contain commas; update the parsing code or—if changing
behavior is not desired now—add a clear inline comment next to the
agg_args/agg_kwargs parsing logic stating this limitation (that values cannot
contain commas) and document expected formats (e.g., key=val or bare
identifiers) so future maintainers know the constraint and consider switching to
a more robust parser if needed.

In `@slayer/engine/enrichment.py`:
- Around line 332-344: The canonical aggregated measure name logic is duplicated
across _flatten_spec, _ensure_aggregated_measure, and the query field processing
loop; extract it into a small helper function (e.g.,
_canonical_agg_name(measure_name: str, aggregation_name: str) -> str) that
returns aggregation_name when measure_name == "*" otherwise
f"{measure_name}_{aggregation_name}", then replace the inline expressions in
_flatten_spec, _ensure_aggregated_measure, and the query field loop to call
_canonical_agg_name to compute canonical_name before passing alias_key or
returning the model-qualified name.
- Around line 479-485: The validation raising ValueError when
spec.aggregation_name is "first" or "last" and last_agg_time_column is None is
correct but subtle because _ensure_aggregated_measure may already set
last_agg_time_column when agg_args includes an explicit time column; add a
concise inline comment above this validation referencing
_ensure_aggregated_measure and explaining that it can populate
last_agg_time_column from explicit agg_args (so the check only triggers when no
time column was provided anywhere), and mention the relevant symbols
spec.aggregation_name, spec.measure_name, and last_agg_time_column to make the
control flow explicit for future readers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f052283-3ecb-4320-b366-ce3b0e542832

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf5d66 and ae8826b.

📒 Files selected for processing (19)
  • .gitignore
  • CLAUDE.md
  • docs/examples/02_sql_vs_dsl/sql_vs_dsl_nb.ipynb
  • docs/examples/03_auto_ingest/auto_ingest_nb.ipynb
  • docs/examples/04_time/time_nb.ipynb
  • docs/examples/05_joined_measures/joined_measures_nb.ipynb
  • docs/examples/05_joins/joins_nb.ipynb
  • docs/examples/06_multistage_queries/multistage_queries_nb.ipynb
  • slayer/core/enums.py
  • slayer/core/formula.py
  • slayer/core/models.py
  • slayer/engine/enriched.py
  • slayer/engine/enrichment.py
  • slayer/engine/ingestion.py
  • slayer/engine/query_engine.py
  • slayer/sql/generator.py
  • tests/integration/test_integration.py
  • tests/integration/test_integration_duckdb.py
  • tests/integration/test_integration_postgres.py

# Built-in aggregation SQL formulas (for aggregations that use a template).
# {value} = measure's SQL expression; {param_name} = parameter values.
BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify where weighted_avg is defined/used and whether denominator guards exist.
rg -n "weighted_avg|SUM\\(\\{value\\} \\* \\{weight\\}\\) / SUM\\(\\{weight\\}\\)|NULLIF\\(SUM\\(\\{weight\\}\\),\\s*0\\)"

Repository: MotleyAI/slayer

Length of output: 1723


🏁 Script executed:

# Get context around where weighted_avg formula is built in generator.py
sed -n '920,940p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 928


🏁 Script executed:

# Check the models.py context around line 118 for weight handling
sed -n '110,130p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 772


🏁 Script executed:

# Search for any weight validation or guards before aggregation
rg -B5 -A5 "weighted_avg|weight\s*=" slayer/ --type py -m 10

Repository: MotleyAI/slayer

Length of output: 8237


🏁 Script executed:

# Check if there are any tests or examples showing weighted_avg usage
fd -t f -e py | xargs rg -l "weighted_avg" | head -5

Repository: MotleyAI/slayer

Length of output: 180


🏁 Script executed:

# Search for test files with weighted_avg
fd -t f -e py -path "*/test*" | xargs rg "weighted_avg" -l 2>/dev/null || echo "No test matches"

Repository: MotleyAI/slayer

Length of output: 398


🏁 Script executed:

# Check enrichment layer for any weight validation
sed -n '125,200p' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 3329


🏁 Script executed:

# Search for any division or NULLIF handling in formula processing
rg "division|NULLIF|divide" slayer/ --type py

Repository: MotleyAI/slayer

Length of output: 41


Guard weighted_avg against zero denominator.

SUM({weight}) can be zero for legitimate data groups (all weight values zero or NULL), which causes runtime errors or undefined behavior across SQL engines. Add NULLIF to safely return NULL instead of dividing by zero.

Proposed fix
 BUILTIN_AGGREGATION_FORMULAS: dict[str, str] = {
-    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
+    "weighted_avg": "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)",
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    "weighted_avg": "SUM({value} * {weight}) / SUM({weight})",
+    "weighted_avg": "SUM({value} * {weight}) / NULLIF(SUM({weight}), 0)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/enums.py` at line 127, The "weighted_avg" SQL expression in the
enum/dict must guard against a zero denominator; update the "weighted_avg" value
(the mapping entry named "weighted_avg" in slayer/core/enums.py) to divide by
NULLIF(SUM({weight}), 0) instead of SUM({weight}) so the expression becomes
SUM({value} * {weight}) / NULLIF(SUM({weight}), 0), ensuring NULL is returned
when total weight is zero.

Comment on lines +101 to +106
_AGG_REF_RE = re.compile(
    r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\.\*)?)'  # group 1: measure name, *, or path.*
    r':'                                              # colon separator
    r'([a-zA-Z_]\w*)'                                 # group 2: aggregation name
    r'(\([^)]*\))?'                                   # group 3: optional (args)
)
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if path.* syntax is used anywhere in the codebase or tests
rg -n '\w+\.\*:' --type=py

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Read the AggregatedMeasureRef class and surrounding context
sed -n '31,60p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1397


🏁 Script executed:

# Search for tests that verify the regex pattern
fd -e py -path '*/test*' | xargs rg -l 'AGG_REF_RE|AggregatedMeasureRef' 2>/dev/null | head -5

Repository: MotleyAI/slayer

Length of output: 295


🏁 Script executed:

# Check how placeholders are generated and used
rg -n '__agg\d+__' --type=py | head -20

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for test examples of aggregation syntax
rg -n 'customers\.|revenue:' --type=py -A 2 -B 2 | head -40

Repository: MotleyAI/slayer

Length of output: 2542


🏁 Script executed:

# Search for placeholder generation logic
rg -n '__agg|_preprocess_agg_refs' slayer/core/formula.py | head -30

Repository: MotleyAI/slayer

Length of output: 258


🏁 Script executed:

# Check enrichment.py to see how it consumes AggregatedMeasureRef
sed -n '1,50p' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 1441


🏁 Script executed:

# Look for actual placeholder generation in _preprocess_agg_refs function
sed -n '109,150p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1469


Remove undocumented path.* syntax from regex or add documentation.

The regex includes (?:\.\*)? to match patterns like customers.*:count, but this syntax is not documented in the AggregatedMeasureRef docstring examples and is not used anywhere in the codebase. Either remove this dead-code pattern or add it to the documented examples if it's intended for future use.

The concern about double underscores in identifiers is unfounded — the placeholder system correctly replaces identifiers like __internal__:sum with numeric placeholders like __agg0__, so there is no conflict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/core/formula.py` around lines 101 - 106, The regex _AGG_REF_RE
currently includes an undocumented optional path wildcard (?:\.\*)? that matches
patterns like "customers.*:count"; remove this dead-code by deleting that
optional group from the pattern so the measure name group is only the base name
or dotted path (i.e., change _AGG_REF_RE to
r'(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?' and update
the inline comment accordingly), and run tests; if you instead intend to support
path.* keep the group but add an example to the AggregatedMeasureRef docstring
demonstrating "path.*:agg" usage so documentation matches the code.
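To make the difference concrete, here is a small standalone check. The "current" pattern with the wildcard group is reconstructed from the prompt above and may differ in detail from `_AGG_REF_RE` as committed:

```python
import re

# Simplified pattern without the optional path wildcard, as suggested above.
WITHOUT_WILDCARD = re.compile(
    r"(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*):([a-zA-Z_]\w*)(\([^)]*\))?"
)
# Reconstructed current pattern including the undocumented (?:\.\*)? group.
WITH_WILDCARD = re.compile(
    r"(\*|[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\.\*)?):([a-zA-Z_]\w*)(\([^)]*\))?"
)

# The documented colon syntax matches either way:
for ref in ["*:count", "revenue:sum", "price:weighted_avg(weight=quantity)"]:
    assert WITHOUT_WILDCARD.fullmatch(ref) and WITH_WILDCARD.fullmatch(ref)

# Only the wildcard variant accepts the undocumented "customers.*:count" form:
assert WITH_WILDCARD.fullmatch("customers.*:count")
assert WITHOUT_WILDCARD.fullmatch("customers.*:count") is None
```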

Comment on lines +41 to +42
logger = logging.getLogger(__name__)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if logger is used anywhere in this file
rg -n 'logger\.' slayer/engine/enrichment.py

Repository: MotleyAI/slayer

Length of output: 41


Remove the unused logger or add logging calls for aggregation resolution.

The logger defined at line 41 is never used in this file. Either remove the unused import and definition, or add logging for debugging measure/aggregation resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 41 - 42, The file defines an unused
logger variable named logger (logging.getLogger(__name__))—either remove the
unused import/definition or add meaningful logging statements where aggregation
or measure resolution happens (e.g., inside functions that compute or resolve
aggregations such as resolve_aggregation or resolve_measure if present) so the
logger is referenced; update imports accordingly (remove logging import if
deleting logger) and insert process-level debug/info/error logs at key decision
points to enable aggregation resolution tracing.

Comment on lines +181 to +187
# For first/last with explicit time dimension arg, update last_agg_time_column
nonlocal last_agg_time_column
if aggregation_name in ("first", "last") and agg_args:
explicit_time = agg_args[0]
if "." not in explicit_time:
explicit_time = f"{model.name}.{explicit_time}"
last_agg_time_column = explicit_time

⚠️ Potential issue | 🟡 Minor

Side effect on last_agg_time_column may cause unexpected behavior with multiple first/last measures.

The nonlocal last_agg_time_column modification means that if multiple first/last aggregations are used with different explicit time columns, only the last one processed will take effect. This could lead to incorrect ordering for earlier first/last measures.

Consider whether each measure should track its own time column, or document that only one explicit time column is supported per query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 181 - 187, The current use of
nonlocal last_agg_time_column inside the aggregation handling (when
aggregation_name is "first" or "last" and agg_args is present) creates a global
side-effect so only the last explicit_time wins; instead, stop mutating the
shared last_agg_time_column and associate the explicit time column with the
specific measure (e.g., attach it to the measure/aggregation metadata or store
it in a dict keyed by the measure name or aggregation identifier). Update the
logic around aggregation_name, agg_args and where last_agg_time_column is
referenced to read per-measure time columns (e.g., measure.time_column or
per_measure_time_map[measure_id]) and remove the nonlocal mutation to avoid
cross-measure interference.
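A minimal sketch of the per-measure alternative (function and variable names here are illustrative, not SLayer's actual API):

```python
def resolve_first_last_time_columns(agg_refs, model_name, default_time_column):
    """Map each aggregation ref to the time column its first/last ordering
    should use, instead of mutating one shared nonlocal variable."""
    per_ref_time = {}
    for key, (agg_name, agg_args) in agg_refs.items():
        if agg_name in ("first", "last") and agg_args:
            explicit = agg_args[0]
            if "." not in explicit:
                explicit = f"{model_name}.{explicit}"
            per_ref_time[key] = explicit
        else:
            per_ref_time[key] = default_time_column
    return per_ref_time
```

Each `first`/`last` measure then reads its own entry, so two measures with different explicit time columns no longer clobber each other.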

Comment on lines +831 to +838
_agg_reverse = {
ph: (
f"{ref.measure_name}:{ref.aggregation_name}"
if not ref.agg_args and not ref.agg_kwargs
else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
)
for ph, ref in agg_refs.items()
}

⚠️ Potential issue | 🟡 Minor

Reverse mapping doesn't include kwargs, potentially producing incomplete formulas.

The reverse map construction at lines 831-838 only includes agg_args when building the original colon form, but ignores agg_kwargs. For a formula like price:weighted_avg(weight=quantity), the reconstructed formula would be price:weighted_avg() instead of price:weighted_avg(weight=quantity).

🐛 Proposed fix to include kwargs in reverse mapping
     _agg_reverse = {
         ph: (
             f"{ref.measure_name}:{ref.aggregation_name}"
             if not ref.agg_args and not ref.agg_kwargs
-            else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args)})"
+            else f"{ref.measure_name}:{ref.aggregation_name}({', '.join(ref.agg_args + [f'{k}={v}' for k, v in ref.agg_kwargs.items()])})"
         )
         for ph, ref in agg_refs.items()
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@slayer/engine/enrichment.py` around lines 831 - 838, The reverse mapping
_agg_reverse built from agg_refs omits ref.agg_kwargs when formatting the
aggregation signature, causing reconstructed formulas to lose keyword arguments;
update the comprehension that builds _agg_reverse to include both ref.agg_args
and ref.agg_kwargs inside the parentheses (format kwargs as key=value pairs and
join them with positional args, e.g., ", ".join(ref.agg_args + [f"{k}={v}" for
k,v in ref.agg_kwargs.items()])), preserving the existing branch that omits
parentheses when there are no args/kwargs; target the _agg_reverse comprehension
and the attributes ref.measure_name, ref.aggregation_name, ref.agg_args,
ref.agg_kwargs.
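As a standalone illustration of the fixed reconstruction (attribute names taken from the snippet; a sketch, not the committed code):

```python
def to_colon_form(measure_name, aggregation_name, agg_args=(), agg_kwargs=None):
    """Rebuild the query-time colon syntax, keeping both positional and
    keyword aggregation arguments."""
    agg_kwargs = agg_kwargs or {}
    if not agg_args and not agg_kwargs:
        return f"{measure_name}:{aggregation_name}"
    parts = list(agg_args) + [f"{k}={v}" for k, v in agg_kwargs.items()]
    return f"{measure_name}:{aggregation_name}({', '.join(parts)})"

assert to_colon_form("revenue", "sum") == "revenue:sum"
assert to_colon_form("revenue", "first", ["ordered_at"]) == "revenue:first(ordered_at)"
assert (to_colon_form("price", "weighted_avg", agg_kwargs={"weight": "quantity"})
        == "price:weighted_avg(weight=quantity)")
```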

Comment on lines +929 to +966
def _build_formula_agg(self, measure: EnrichedMeasure, agg_name: str) -> exp.Expression:
"""Build SQL for formula-based aggregations (weighted_avg, custom)."""
# Get formula: from aggregation_def or built-in
formula = None
if measure.aggregation_def and measure.aggregation_def.formula:
formula = measure.aggregation_def.formula
elif agg_name in BUILTIN_AGGREGATION_FORMULAS:
formula = BUILTIN_AGGREGATION_FORMULAS[agg_name]

if formula is None:
raise ValueError(
f"Aggregation '{agg_name}' has no formula. "
f"Custom aggregations must define a formula."
)

# Collect param values: query-time overrides > aggregation_def defaults
param_defaults = {}
if measure.aggregation_def:
param_defaults = {p.name: p.sql for p in measure.aggregation_def.params}
params = {**param_defaults, **measure.agg_kwargs}

# Validate required params
required = BUILTIN_AGGREGATION_REQUIRED_PARAMS.get(agg_name, [])
for req in required:
if req not in params:
raise ValueError(
f"Aggregation '{agg_name}' requires parameter '{req}'. "
f"Set it in the model's aggregation definition or at query time "
f"(e.g., 'measure:{agg_name}({req}=column)')."
)

# Resolve {value} and {param_name} in formula
col_expr = measure.sql or measure.name
substituted = formula.replace("{value}", col_expr)
for param_name, param_val in params.items():
substituted = substituted.replace(f"{{{param_name}}}", param_val)

return sqlglot.parse_one(substituted, dialect=self.dialect)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check where agg_kwargs values originate from in the enrichment flow
rg -n "agg_kwargs" --type py -C 3 slayer/

Repository: MotleyAI/slayer

Length of output: 10262


🏁 Script executed:

# Check how agg_kwargs values are validated and what BUILTIN_AGGREGATION_FORMULAS looks like
rg -n "BUILTIN_AGGREGATION_FORMULAS|BUILTIN_AGGREGATION_REQUIRED_PARAMS" --type py -C 2 slayer/

Repository: MotleyAI/slayer

Length of output: 1789


🏁 Script executed:

# Look at the formula definitions to understand how parameters are used
fd -e py -x grep -l "weighted_avg\|custom.*aggregation" {} \; slayer/

Repository: MotleyAI/slayer

Length of output: 392


🏁 Script executed:

# Check if there's any validation of agg_kwargs before it reaches _build_formula_agg
rg -n "agg_kwargs" --type py -B 5 -A 5 slayer/engine/enrichment.py | head -100

Repository: MotleyAI/slayer

Length of output: 4177


🏁 Script executed:

# Look at the actual substitution logic and any context around it
sed -n '929,970p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 2038


🏁 Script executed:

# Search for any validation of parameter values or agg_kwargs
rg -n "validate|sanitize|escape" --type py slayer/engine/enrichment.py | grep -i param

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for tests that show expected usage of weighted_avg and other parameterized aggregations
fd . slayer --full-path -e py -name "*test*" | head -20

Repository: MotleyAI/slayer

Length of output: 289


🏁 Script executed:

# Check for parameter validation in the aggregation definition or enrichment
cat slayer/core/enums.py | grep -A 10 "class Aggregation\|@dataclass.*Aggregation"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for the Aggregation and Parameter dataclasses to understand structure
rg -n "class.*Aggregation|class.*Parameter|@dataclass.*Aggregation" --type py slayer/ -A 5

Repository: MotleyAI/slayer

Length of output: 843


🏁 Script executed:

# Check if there are any tests for weighted_avg or formula substitution
rg -n "weighted_avg\|_build_formula_agg" --type py slayer/ -B 2 -A 5 | head -80

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the full Aggregation model definition
sed -n '115,135p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 894


🏁 Script executed:

# Search for test files and examples using weighted_avg
find slayer -name "*.py" -type f | xargs grep -l "weighted_avg" | head -5

Repository: MotleyAI/slayer

Length of output: 178


🏁 Script executed:

# Check if there are any constraints or validation on parameter values
rg -n "param|weight" --type py slayer/core/models.py -B 2 -A 2 | head -50

Repository: MotleyAI/slayer

Length of output: 763


🏁 Script executed:

# Look for documentation or examples in README or docs
find . -name "*.md" -o -name "*.rst" | xargs grep -l "weighted_avg\|agg_kwargs" 2>/dev/null | head -3

Repository: MotleyAI/slayer

Length of output: 70


🏁 Script executed:

# Check if sqlglot.parse_one is being called and how it handles the substituted string
rg -n "parse_one" --type py slayer/sql/generator.py -B 3 -A 2

Repository: MotleyAI/slayer

Length of output: 4052


🏁 Script executed:

# Check CLAUDE.md for documentation on weighted_avg usage
cat ./CLAUDE.md | grep -A 20 "weighted_avg"

Repository: MotleyAI/slayer

Length of output: 4633


🏁 Script executed:

# Look for tests that might show expected behavior
find slayer -name "*.py" -path "*test*" -type f

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Search for any test files in the repo
find . -name "test_*.py" -o -name "*_test.py" | head -10

Repository: MotleyAI/slayer

Length of output: 417


🏁 Script executed:

# Check if there's validation in the formula parsing step
rg -n "agg_kwargs" --type py slayer/core/formula.py -B 5 -A 5

Repository: MotleyAI/slayer

Length of output: 2995


🏁 Script executed:

# Look for how the parameter values are actually used - check if they're expected to be column names or full expressions
rg -n "def.*_build_formula_agg|substituted" --type py slayer/sql/generator.py -B 2 -A 10

Repository: MotleyAI/slayer

Length of output: 1684


🏁 Script executed:

# Check how the expression returned from _build_formula_agg is used
rg -n "_build_formula_agg" --type py slayer/sql/generator.py -A 3

Repository: MotleyAI/slayer

Length of output: 512


🏁 Script executed:

# Look at test files to see if there are tests for weighted_avg with different parameter values
grep -r "weighted_avg" tests/ --include="*.py" -B 2 -A 2

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check if there's any sanitization in the enrichment pipeline for parameter values
rg -n "param.*val|val.*strip|escape|quote" --type py slayer/engine/enrichment.py | head -20

Repository: MotleyAI/slayer

Length of output: 294


🏁 Script executed:

# Look at how column expressions are built - check _build_column_expr to understand if col_expr on line 961 is validated
sed -n '860,875p' slayer/sql/generator.py

Repository: MotleyAI/slayer

Length of output: 1037


🏁 Script executed:

# Check if sqlglot.parse_one can handle injection attempts safely - look for any error handling
rg -n "try|except|parse_one" --type py slayer/sql/generator.py | grep -A 2 "parse_one"

Repository: MotleyAI/slayer

Length of output: 972


🏁 Script executed:

# Verify the expected format of parameter values - look at documentation or tests for constraints
rg -n "weight|quantity|agg_kwargs" --type py tests/ -B 2 -A 2 | head -40

Repository: MotleyAI/slayer

Length of output: 2237


🏁 Script executed:

# Check how col_expr is constructed - does _resolve_sql provide any validation?
rg -n "_resolve_sql" --type py slayer/sql/generator.py -A 15 | head -25

Repository: MotleyAI/slayer

Length of output: 1480


🏁 Script executed:

# Look for any tests that try malicious input
grep -r "injection\|DROP\|;\|--" tests/ --include="*.py" | head -10

Repository: MotleyAI/slayer

Length of output: 964


🏁 Script executed:

# Check if parameters can be arbitrary expressions or only column names
rg -n "allowed_aggregations|param.*validate|weight.*column" --type py slayer/ -B 2 -A 2 | head -30

Repository: MotleyAI/slayer

Length of output: 2308


🏁 Script executed:

# Check if there are any try-catch blocks around parse_one that might handle injection attempts
rg -n "try|except" --type py slayer/sql/generator.py | grep -B 5 -A 5 "966\|parse_one"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any validation that restricts agg_kwargs to column names
rg -n "Column|Identifier|parse_one" --type py slayer/sql/generator.py | grep -B 3 -A 3 "param\|weight\|agg_kwargs"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check what happens if sqlglot.parse_one fails - is there error handling upstream?
rg -n "query_engine\|SlayerQuery" --type py slayer/ -l | head -3

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at how col_expr (measure.sql) is built - is it already validated as an expression?
rg -n "col_expr = measure.sql" --type py slayer/sql/generator.py -B 3 -A 10

Repository: MotleyAI/slayer

Length of output: 786


🏁 Script executed:

# Check for any documentation on expected parameter format or constraints
rg -n "param.*format|parameter.*column|weight.*definition" --type py slayer/core/models.py -B 2 -A 2

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at how queries are built and executed - check if the generated SQL from _build_formula_agg is safely incorporated
rg -n "execute\|query\|run_sql" --type py slayer/ -l | head -5

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check the broader context of _build_agg usage to see how the returned expression is used
rg -n "_build_agg\|_build_formula_agg" --type py slayer/sql/generator.py -B 2 -A 5

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any input validation or constraints in the aggregation parameter parsing
sed -n '115,145p' slayer/core/formula.py

Repository: MotleyAI/slayer

Length of output: 1048


🏁 Script executed:

# Check if parameter values should be restricted to identifier-like patterns
rg -n "identifier|column.*name|bare.*column" --type py slayer/core/models.py -B 1 -A 1 | head -30

Repository: MotleyAI/slayer

Length of output: 497


🏁 Script executed:

# Check if sqlglot.parse_one can parse multi-statement SQL or if it safely rejects it
python3 << 'EOF'
import sqlglot

# Test if sqlglot.parse_one can handle potential injection attempts
test_cases = [
    "SUM(amount * quantity) / SUM(quantity)",  # normal
    "SUM(amount * quantity); DROP TABLE orders",  # injection attempt
    "SUM(amount * quantity) OR 1=1",  # another attempt
]

for test in test_cases:
    try:
        result = sqlglot.parse_one(test, dialect="sqlite")
        print(f"✓ Parsed: {test[:50]}")
    except Exception as e:
        print(f"✗ Failed: {test[:50]} -> {type(e).__name__}")
EOF

Repository: MotleyAI/slayer

Length of output: 178


🏁 Script executed:

# Look at the AggregationParam definition to understand expected format
sed -n '109,113p' slayer/core/models.py

Repository: MotleyAI/slayer

Length of output: 226


🏁 Script executed:

# Check if there's any documentation about parameter value constraints
rg -n "AggregationParam|param.*sql|weight.*sql" --type py slayer/core/models.py -B 2 -A 3

Repository: MotleyAI/slayer

Length of output: 476


🏁 Script executed:

# Look at how sqlglot.parse_one is imported and check for any safety measures
rg -n "import.*sqlglot|from.*sqlglot" --type py slayer/sql/generator.py -A 2

Repository: MotleyAI/slayer

Length of output: 309


🏁 Script executed:

# Check if there are integration tests for weighted_avg that might show expected behavior
find tests -name "*.py" -type f -exec grep -l "weighted_avg" {} \;

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look for any validation of formula or aggregation definitions
rg -n "Aggregation|formula.*validate|param.*validate" --type py slayer/core/models.py -A 3 | head -40

Repository: MotleyAI/slayer

Length of output: 875


🏁 Script executed:

# Install sqlglot and test if it provides injection protection
pip install sqlglot -q && python3 << 'EOF'
import sqlglot

# Test if sqlglot.parse_one safely handles injection attempts in formula substitution
formula = "SUM({value} * {weight}) / SUM({weight})"

test_cases = [
    ("quantity", "Normal column name"),
    ("quantity + 1", "Expression"),
    ("quantity); DROP TABLE orders; --", "Injection attempt"),
    ("(SELECT COUNT(*) FROM orders)", "Subquery"),
]

for param_val, desc in test_cases:
    substituted = formula.replace("{value}", "price").replace("{weight}", param_val)
    print(f"\n{desc}: weight={param_val}")
    print(f"  Substituted: {substituted}")
    try:
        result = sqlglot.parse_one(substituted, dialect="sqlite")
        print(f"  ✓ Parsed successfully")
        print(f"  SQL output: {result.sql(dialect='sqlite')}")
    except Exception as e:
        print(f"  ✗ Parse error: {e}")
EOF

Repository: MotleyAI/slayer

Length of output: 953


🏁 Script executed:

# Check if there's any documentation about parameter validation in comments or docstrings
rg -n "param.*safe|sanitize|validate|column.*name|SQL.*expression" --type py slayer/core/ -B 1 -A 1 | head -50

Repository: MotleyAI/slayer

Length of output: 2618


🏁 Script executed:

# Check how the final SQL is executed - is it parameterized or executed as a raw string?
rg -n "execute\|run\|query" --type py slayer/engine/ -l

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the query engine to see if SQL is executed directly or with parameters
sed -n '1,50p' slayer/engine/query_engine.py | head -30

Repository: MotleyAI/slayer

Length of output: 949


🏁 Script executed:

# Check if there's any escaping or quoting of parameter values
rg -n "quote|escape|literal|param" --type py slayer/sql/generator.py | head -20

Repository: MotleyAI/slayer

Length of output: 1327


🏁 Script executed:

# Check how the SQL from _build_formula_agg is used downstream
rg -n "def _build_agg|_build_formula_agg" --type py slayer/sql/generator.py -A 20 | grep -A 15 "def _build_agg"

Repository: MotleyAI/slayer

Length of output: 958


🏁 Script executed:

# Check how the final SQL expression is rendered and executed
rg -n "\.sql\(|execute|client" --type py slayer/engine/query_engine.py | head -20

Repository: MotleyAI/slayer

Length of output: 531


🏁 Script executed:

# Look for any tests that validate parameter safety or show expected usage
find tests -name "*.py" -type f -exec grep -l "agg_kwargs\|weighted_avg\|aggregation.*param" {} \;

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check if there's any documentation about parameter constraints or safety
cat slayer/core/models.py | grep -A 10 "class AggregationParam"

Repository: MotleyAI/slayer

Length of output: 471


🏁 Script executed:

# Check SlayerSQLClient to see if it uses parameterized queries or raw SQL execution
rg -n "class SlayerSQLClient|def execute" --type py slayer/sql/client.py -A 10 | head -30

Repository: MotleyAI/slayer

Length of output: 623


🏁 Script executed:

# Look at the actual execution to confirm raw SQL string usage
sed -n '1,100p' slayer/sql/client.py | head -50

Repository: MotleyAI/slayer

Length of output: 1355


🏁 Script executed:

# Search for any validation of formula or parameter values before they reach _build_formula_agg
rg -n "_ensure_aggregated_measure\|_build_formula_agg" --type py slayer/engine/enrichment.py -B 5 -A 5 | head -40

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Check the _execute_sql function to see how SQL is actually executed
rg -n "def _execute_sql" --type py slayer/sql/client.py -A 30

Repository: MotleyAI/slayer

Length of output: 1104


🏁 Script executed:

# Look for any tests that demonstrate parameter value validation or injection testing
grep -r "injection\|DROP\|--\|;\|parametr" tests/ --include="*.py" -i | head -20

Repository: MotleyAI/slayer

Length of output: 2101


🏁 Script executed:

# Check if there's a CHANGELOG or docs explaining design decisions around parameter handling
find . -name "CHANGELOG*" -o -name "*.md" | xargs grep -l "param\|aggregation\|injection" 2>/dev/null | head -3

Repository: MotleyAI/slayer

Length of output: 137


🏁 Script executed:

# Check what sqlglot actually renders for the injection case - detailed test
pip install sqlglot -q && python3 << 'EOF'
import sqlglot

# Test the exact scenario: formula substitution → parse → render
formula = "SUM({value} * {weight}) / SUM({weight})"
col_expr = "price"
malicious_weight = "quantity); DROP TABLE orders; --"

# Step 1: String substitution (what happens in _build_formula_agg)
substituted = formula.replace("{value}", col_expr).replace("{weight}", malicious_weight)
print(f"1. After string substitution:\n   {substituted}\n")

# Step 2: Parse with sqlglot
try:
    parsed = sqlglot.parse_one(substituted, dialect="sqlite")
    print(f"2. Parsed successfully as AST")
except Exception as e:
    print(f"2. Parse failed: {e}")
    exit(1)

# Step 3: Render back to SQL
rendered_sql = parsed.sql(dialect="sqlite")
print(f"3. Rendered SQL:\n   {rendered_sql}\n")

# Step 4: Check if malicious statement is present
if "DROP TABLE" in rendered_sql:
    print("⚠️ VULNERABLE: DROP TABLE statement is present in rendered SQL!")
else:
    print("✓ Safe: DROP TABLE statement was removed/neutralized")
EOF

Repository: MotleyAI/slayer

Length of output: 384


🏁 Script executed:

# Check if there are any comments in the codebase about parameter validation or security
rg -n "TODO|FIXME|HACK|SECURITY|injection|safe" --type py slayer/sql/generator.py | grep -i "param\|agg\|formula\|safe"

Repository: MotleyAI/slayer

Length of output: 41


🏁 Script executed:

# Look at the broader context of how parameters should be used - check if there are constraints documented
cat slayer/core/models.py | sed -n '100,200p'

Repository: MotleyAI/slayer

Length of output: 3858


Validate and sanitize user-provided parameter values in aggregations to prevent SQL injection.

The _build_formula_agg function substitutes user-provided parameter values directly into SQL formulas via string replacement (lines 962–964), then passes the result to sqlglot.parse_one(). While parameter definitions in aggregations support SQL expressions, user-provided agg_kwargs from query-time aggregation syntax (e.g., "revenue:weighted_avg(weight=user_input)") are not validated before substitution.

Testing confirms that malicious input like "quantity); DROP TABLE orders; --" is not rejected by sqlglot's parser and produces executable SQL statements when rendered. Since the generated SQL is executed as raw text via conn.execute(sa.text(sql)), this is a real SQL injection vector.
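The failure mode reduces to plain string substitution; the formula shape below matches the built-in weighted_avg used in the verification script above:

```python
formula = "SUM({value} * {weight}) / SUM({weight})"
malicious_weight = "quantity); DROP TABLE orders; --"

substituted = formula.replace("{value}", "price").replace("{weight}", malicious_weight)
# The attacker-controlled statement survives into the SQL text verbatim.
assert "DROP TABLE orders" in substituted
```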

Mitigations to consider:

  • Validate parameter values against a whitelist of column names or safe expression patterns before substitution
  • Use sqlglot's parameterization capabilities instead of string replacement
  • Document that parameters must come from trusted sources only and enforce this at the enrichment stage
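A stdlib-only sketch of the first mitigation, restricting query-time parameter values to bare (optionally dotted) column names — names here are illustrative, not SLayer's API:

```python
import re

_SAFE_PARAM_RE = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\Z")

def validate_agg_param(value: str) -> str:
    """Accept only identifier-like column references as aggregation params."""
    if not _SAFE_PARAM_RE.match(value):
        raise ValueError(f"Unsafe aggregation parameter: {value!r}")
    return value

assert validate_agg_param("quantity") == "quantity"
assert validate_agg_param("orders.quantity") == "orders.quantity"
try:
    validate_agg_param("quantity); DROP TABLE orders; --")
except ValueError:
    pass
else:
    raise AssertionError("injection string was accepted")
```

Anything needing a richer expression can still be permitted explicitly in the model's trusted aggregation definition rather than at query time.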
