RFC 012: Unified scalar expression surface #25

@dannymeijer

Area

  • Specification (RFCs)
  • Package & tests
  • Documentation

Summary

RFC 012 proposes a single canonical scalar expression model for row-level relational meaning in InQL. Filters, computed projection values, grouping keys, and aggregate arguments should all lower through the same scalar-expression contract, while aggregate outputs remain a distinct aggregate-measure layer. This matters because the current direction invites separate mini-DSLs for predicates, literals, and projection expressions. That split makes the package surface harder to learn, duplicates semantics across the planning and lowering layers, and leaves room for silent degradation when a public surface accepts expression shapes it cannot represent faithfully.

Motivation

The current design pressure is clear: filter(...), with_column(...), grouping keys, and aggregate inputs all express row-level meaning, but because features have landed incrementally they are easy to model as separate builder families. That split is the wrong end state. It forces authors to remember which helper family belongs to which surface, encourages duplicated semantics across the package/Prism/Substrait boundary, and makes future concise DSL sugar harder to add because there is no single lowering target.

The most important technical motivation is correctness. InQL should not accept a broad expression shape in a public API and then quietly reinterpret or drop that shape downstream. Unsupported expressions must fail explicitly. RFC 012 is the design step that makes that rule coherent across the whole relational surface instead of one method at a time.

The timing also matters: the package has real filter, aggregate, and computed-column slices in flight. If the expression model is not unified soon, more surface area will accrete around the current split.

RFC document path: docs/rfcs/012_unified_scalar_expression_surface.md

Proposal sketch

Define one canonical scalar expression model for row-level relational authoring in InQL.

The contract is:

  • row-level filters consume scalar expressions
  • computed projection values consume scalar expressions
  • grouping keys consume scalar expressions
  • aggregate inputs consume scalar expressions
  • aggregate outputs are not row-level scalar expressions; they remain a distinct aggregate-measure layer
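
The contract above can be pictured with a small Python model. This is a hedged sketch only, not InQL's actual types: Col, Lit, Call, AggregateMeasure, require_scalar, and the *_step and *_measure helpers are all hypothetical names invented for illustration. It shows every row-level position funnelling through one scalar check, while aggregate outputs live in a separate type that the scalar check rejects.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical scalar expression nodes (not the real InQL types).
@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Call:
    op: str
    args: tuple


ScalarExpr = Union[Col, Lit, Call]


# Aggregate outputs are a separate layer: they wrap a scalar input
# but are not themselves scalar expressions.
@dataclass(frozen=True)
class AggregateMeasure:
    func: str
    arg: Optional[ScalarExpr]  # None models an argument-free count()


def require_scalar(expr, position):
    """Single gate shared by every row-level position."""
    if not isinstance(expr, (Col, Lit, Call)):
        raise TypeError(
            f"{position} expects a scalar expression, got {type(expr).__name__}"
        )
    return expr


def filter_step(predicate):
    return ("filter", require_scalar(predicate, "filter"))


def with_column_step(name, expr):
    return ("with_column", name, require_scalar(expr, "with_column"))


def group_by_step(keys):
    return ("group_by", tuple(require_scalar(k, "group_by key") for k in keys))


def sum_measure(arg):
    return AggregateMeasure("sum", require_scalar(arg, "aggregate input"))


def count_measure():
    return AggregateMeasure("count", None)
```

Under these assumptions, passing sum_measure(Col("amount")) to filter_step raises a TypeError instead of being accepted and reinterpreted later, which is the boundary the last bullet describes.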

Illustrative author-facing shape:

from pub::inql import LazyFrame
from pub::inql.functions import col, lit, gt, add, sum, count
from models import Order, OrderSummary

def enrich_orders(orders: LazyFrame[Order]) -> LazyFrame[Order]:
    return (
        orders
            .filter(gt(col("amount"), lit(100)))
            .with_column("amount_plus_fee", add(col("amount"), lit(5)))
    )

def summarize_orders(orders: LazyFrame[Order]) -> LazyFrame[OrderSummary]:
    return (
        orders
            .group_by([col("customer_id")])
            .agg([sum(col("amount")), count()])
    )

Key design constraints:

  • unsupported expression shapes must fail explicitly; silent degradation is forbidden
  • future concise surfaces such as .amount > 100 or sum(.amount) should lower into this same model rather than creating a separate semantic path
  • the RFC is about the InQL contract, not about introducing new Incan parser syntax directly
  • aggregate outputs remain distinct from row-level scalar expressions unless a later RFC says otherwise
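
The first two constraints can be illustrated together. The following Python sketch is an assumption-laden model, not the InQL or Substrait lowering code: Expr, lower, SUPPORTED_OPS, UnsupportedExpressionError, and the dictionary output format are hypothetical. Operator overloading stands in for a future concise surface such as .amount > 100, and the lowering step raises on any operator it cannot represent rather than dropping it.

```python
from dataclasses import dataclass


class UnsupportedExpressionError(Exception):
    """Raised when lowering meets a shape it cannot represent faithfully."""


@dataclass(frozen=True)
class Expr:
    kind: str        # "col", "lit", or "call"
    payload: object  # column name, literal value, or operator name
    args: tuple = ()

    def __gt__(self, other):
        # Concise sugar builds the same node as the explicit gt() helper.
        return Expr("call", "gt", (self, _wrap(other)))


def _wrap(value):
    return value if isinstance(value, Expr) else Expr("lit", value)


def col(name):
    return Expr("col", name)


def lit(value):
    return Expr("lit", value)


def gt(left, right):
    return Expr("call", "gt", (left, right))


# Hypothetical set of operators the lowering target can express.
SUPPORTED_OPS = {"gt", "add"}


def lower(expr):
    """Lower an expression tree to a plain dict, failing loudly on gaps."""
    if expr.kind == "col":
        return {"field": expr.payload}
    if expr.kind == "lit":
        return {"literal": expr.payload}
    if expr.kind == "call":
        if expr.payload not in SUPPORTED_OPS:
            raise UnsupportedExpressionError(
                f"operator {expr.payload!r} has no lowering"
            )
        return {"call": expr.payload, "args": [lower(a) for a in expr.args]}
    raise UnsupportedExpressionError(f"unknown node kind {expr.kind!r}")
```

In this model, col("amount") > 100 and gt(col("amount"), lit(100)) produce equal trees and therefore identical lowered output, so the sugar adds no second semantic path.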

The current draft RFC is here:

  • docs/rfcs/012_unified_scalar_expression_surface.md

Alternatives considered

Keep predicate, literal, and projection surfaces separate.
This preserves the current incremental direction, but it duplicates concepts that are semantically the same and makes drift between authoring surfaces more likely.

Unify only literals.
This is too small. The real problem is not helper naming alone; it is that row-level meaning is being expressed through multiple semantic systems.

Treat aggregate calls as ordinary scalar expressions everywhere.
This collapses a real semantic boundary. Aggregate outputs are group-level values, not row-level values, and blurring that distinction makes typing and position rules less coherent.

Wait for concise DSL syntax first.
That would postpone the contract question until after more syntax work lands, which is backwards. Concise syntax needs a stable lowering target first.

Impact / compatibility

The RFC itself is additive, but adopting it will likely require cleanup of existing public builder families in the InQL package.

Likely compatibility consequences:

  • legacy typed literal helpers may need to become compatibility shims
  • legacy predicate-specific wrappers may need deprecation if they survive at all
  • docs and examples will need to present one canonical row-level expression model
  • Prism and Substrait lowering will need one shared contract for scalar expressions and one shared contract for aggregate measures
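
As a rough illustration of the first consequence, a legacy typed literal helper could survive as a thin shim over the canonical literal constructor. The Python sketch below is hypothetical: int_lit is an invented legacy name and Lit is a stand-in node, neither taken from the InQL package.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    value: object


def lit(value):
    """Canonical literal constructor in this sketch."""
    return Lit(value)


def int_lit(value):
    """Hypothetical legacy helper kept only as a compatibility shim."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(f"int_lit() expects an int, got {type(value).__name__}")
    warnings.warn(
        "int_lit() is deprecated; use lit() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return lit(value)
```

The shim keeps the old type check, emits a deprecation warning, and returns exactly the node that lit() would, so downstream layers only ever see one literal shape.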

This should improve migration quality, not hurt it, because it gives a clear north star instead of letting surface drift continue.

Implementation notes (optional)

The relevant design record already exists:

  • docs/rfcs/012_unified_scalar_expression_surface.md

Likely touch points if accepted:

  • this repo:
    • RFCs 001, 003, 004, and 007 for cross-RFC coherence
    • package builder surfaces and tests
    • Prism logical representation
    • Substrait lowering and validation
    • docs/reference/examples
  • Incan:
    • only insofar as future scoped DSL sugar from RFC 040 / RFC 045 needs to lower into the InQL-defined scalar-expression contract

Testing should focus on semantic consistency:

  • filter, computed projection, grouping keys, and aggregate inputs should all share one expression contract
  • unsupported shapes should fail explicitly instead of degrading silently
  • package, planning, and lowering layers should agree on the scalar-versus-aggregate boundary
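
A table-driven test is one plausible way to pin these properties down. The sketch below uses Python's unittest with a stand-in validator; check_scalar, POSITIONS, and the tuple-based expression shapes are hypothetical, and a real suite would exercise each position through its own public entry point rather than calling one function directly.

```python
import unittest

SCALAR_KINDS = {"col", "lit", "call"}

# The four row-level positions named in the contract.
POSITIONS = ["filter", "with_column", "group_by key", "aggregate input"]


def check_scalar(expr, position):
    """Stand-in for the one validator every row-level position would share."""
    kind = expr[0]
    if kind not in SCALAR_KINDS:
        raise TypeError(f"{position}: {kind!r} is not a scalar expression")
    return expr


class ScalarContractTest(unittest.TestCase):
    def test_every_position_accepts_the_same_scalar_shapes(self):
        shapes = [
            ("col", "amount"),
            ("lit", 100),
            ("call", "add", ("col", "amount"), ("lit", 5)),
        ]
        for position in POSITIONS:
            for shape in shapes:
                with self.subTest(position=position, shape=shape):
                    self.assertEqual(check_scalar(shape, position), shape)

    def test_every_position_rejects_aggregate_measures(self):
        measure = ("measure", "sum", ("col", "amount"))
        for position in POSITIONS:
            with self.subTest(position=position):
                with self.assertRaises(TypeError):
                    check_scalar(measure, position)
```

Iterating over positions in one table makes drift visible: if one surface starts accepting or rejecting a shape the others do not, a single subtest fails and names the position.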

I checked current open InQL issues before drafting this. There are open RFC issues for 000, 003, 004, 005, 006, and 007, plus feature work, but nothing open that already tracks RFC 012 directly.

Checklist

  • I checked for an existing RFC or issue covering this.
  • I can describe how this impacts existing code and how to migrate (if needed).

Labels

  • RFC: RFC design and planning
  • documentation: Improvements or additions to documentation
  • package: Library source, tests, incan.toml
  • specification: docs/rfcs/ normative RFCs