|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | +
|
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | +
|
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +--- |
| 21 | +name: make-pythonic |
| 22 | +description: Audit and improve datafusion-python functions to accept native Python types (int, float, str, bool) instead of requiring explicit lit() or col() wrapping. Analyzes function signatures, checks upstream Rust implementations for type constraints, and applies the appropriate coercion pattern. |
| 23 | +argument-hint: [scope] (e.g., "string functions", "datetime functions", "array functions", "math functions", "all", or a specific function name like "split_part") |
| 24 | +--- |
| 25 | + |
| 26 | +# Make Python API Functions More Pythonic |
| 27 | + |
| 28 | +You are improving the datafusion-python API to feel more natural to Python users. The goal is to allow functions to accept native Python types (int, float, str, bool, etc.) for arguments that are contextually always or typically literal values, instead of requiring users to manually wrap them in `lit()`. |
| 29 | + |
| 30 | +**Core principle:** A Python user should be able to write `split_part(col("a"), ",", 2)` instead of `split_part(col("a"), lit(","), lit(2))` when the arguments are contextually obvious literals. |
| 31 | + |
| 32 | +## How to Identify Candidates |
| 33 | + |
| 34 | +The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions.py`. |
| 35 | + |
| 36 | +For each function, determine if any parameter can accept native Python types by evaluating **two complementary signals**: |
| 37 | + |
| 38 | +### Signal 1: Contextual Understanding |
| 39 | + |
| 40 | +Some arguments are contextually always or almost always literal values based on what the function does: |
| 41 | + |
| 42 | +| Context | Typical Arguments | Examples | |
| 43 | +|---------|------------------|----------| |
| 44 | +| **String position/count** | Character counts, indices, repetition counts | `left(str, n)`, `right(str, n)`, `repeat(str, n)`, `lpad(str, count, ...)` | |
| 45 | +| **Delimiters/separators** | Fixed separator characters | `split_part(str, delim, idx)`, `concat_ws(sep, ...)` | |
| 46 | +| **Search/replace patterns** | Literal search strings, replacements | `replace(str, from, to)`, `regexp_replace(str, pattern, replacement, flags)` | |
| 47 | +| **Date/time parts** | Part names from a fixed set | `date_part(part, date)`, `date_trunc(part, date)` | |
| 48 | +| **Rounding precision** | Decimal place counts | `round(val, places)`, `trunc(val, places)` | |
| 49 | +| **Fill characters** | Padding characters | `lpad(str, count, fill)`, `rpad(str, count, fill)` | |
| 50 | + |
| 51 | +### Signal 2: Upstream Rust Implementation |
| 52 | + |
| 53 | +Check the Rust binding in `crates/core/src/functions.rs` and the upstream DataFusion function implementation to determine type constraints. The upstream source is cached locally at: |
| 54 | + |
| 55 | +``` |
| 56 | +~/.cargo/registry/src/index.crates.io-*/datafusion-functions-<VERSION>/src/ |
| 57 | +``` |
| 58 | + |
| 59 | +Check the DataFusion version in `crates/core/Cargo.toml` to find the right directory. Key subdirectories: `string/`, `datetime/`, `math/`, `regex/`. |
| 60 | + |
| 61 | +There are three concrete techniques to check, in order of signal strength: |
| 62 | + |
| 63 | +#### Technique 1: Check `invoke_with_args()` for literal-only enforcement (strongest signal) |
| 64 | + |
| 65 | +Some functions pattern-match on `ColumnarValue::Scalar` in their `invoke_with_args()` method and **return an error** if the argument is a column/array. This means the argument **must** be a literal — passing a column expression will fail at runtime. |
| 66 | + |
| 67 | +Example from `date_trunc.rs`: |
| 68 | +```rust |
| 69 | +let granularity_str = if let ColumnarValue::Scalar(ScalarValue::Utf8(Some(v))) = granularity { |
| 70 | + v.to_lowercase() |
| 71 | +} else { |
| 72 | + return exec_err!("Granularity of `date_trunc` must be non-null scalar Utf8"); |
| 73 | +}; |
| 74 | +``` |
| 75 | + |
| 76 | +**If you find this pattern:** The argument is **Category B** — accept only the corresponding native Python type (e.g., `str`), not `Expr`. The function will error at runtime with a column expression anyway. |
| 77 | + |
| 78 | +#### Technique 2: Check the `Signature` for data type constraints |
| 79 | + |
| 80 | +Each function defines a `Signature::coercible(...)` that specifies what data types each argument accepts, using `Coercion` entries. This tells you the expected **data type** even if it doesn't enforce literal-only. |
| 81 | + |
| 82 | +Example from `repeat.rs`: |
| 83 | +```rust |
| 84 | +signature: Signature::coercible( |
| 85 | + vec![ |
| 86 | + Coercion::new_exact(TypeSignatureClass::Native(logical_string())), |
| 87 | + Coercion::new_implicit( |
| 88 | + TypeSignatureClass::Native(logical_int64()), |
| 89 | + vec![TypeSignatureClass::Integer], |
| 90 | + NativeType::Int64, |
| 91 | + ), |
| 92 | + ], |
| 93 | + Volatility::Immutable, |
| 94 | +), |
| 95 | +``` |
| 96 | + |
| 97 | +This tells you arg 2 (`n`) must be an integer type coerced to Int64. Use this to choose the correct Python type (e.g., `int` not `str` or `float`). |
| 98 | + |
| 99 | +Common mappings: |
| 100 | +| Rust Type Constraint | Python Type | |
| 101 | +|---------------------|-------------| |
| 102 | +| `logical_int64()` / `TypeSignatureClass::Integer` | `int` | |
| 103 | +| `logical_float64()` / `TypeSignatureClass::Numeric` | `int \| float` | |
| 104 | +| `logical_string()` / `TypeSignatureClass::String` | `str` | |
| 105 | +| `LogicalType::Boolean` | `bool` | |
| 106 | + |
| 107 | +#### Technique 3: Check `return_field_from_args()` for `scalar_arguments` usage |
| 108 | + |
| 109 | +Functions that inspect literal values at query planning time use `args.scalar_arguments.get(n)` in their `return_field_from_args()` method. This indicates the argument is **expected to be a literal** for optimal behavior (e.g., to determine output type precision), but may still work as a column. |
| 110 | + |
| 111 | +Example from `round.rs`: |
| 112 | +```rust |
| 113 | +let decimal_places: Option<i32> = match args.scalar_arguments.get(1) { |
| 114 | + None => Some(0), |
| 115 | + Some(None) => None, // argument is not a literal (column) |
| 116 | + Some(Some(scalar)) if scalar.is_null() => Some(0), |
| 117 | + Some(Some(scalar)) => Some(decimal_places_from_scalar(scalar)?), |
| 118 | +}; |
| 119 | +``` |
| 120 | + |
| 121 | +**If you find this pattern:** The argument is **Category A** — accept native types AND `Expr`. It works as a column but is primarily used as a literal. |
| 122 | + |
| 123 | +#### Decision flow |
| 124 | + |
| 125 | +``` |
| 126 | +Is argument rejected at runtime if not a literal? |
| 127 | + (check invoke_with_args for ColumnarValue::Scalar-only match + exec_err!) |
| 128 | + → YES: Category B — accept only native type, no Expr |
| 129 | + → NO: Does the Signature constrain it to a specific data type? |
| 130 | + → YES: Category A — accept Expr | <native type matching the constraint> |
| 131 | + → NO: Leave as Expr only |
| 132 | +``` |
| 133 | + |
| 134 | +## Coercion Categories |
| 135 | + |
| 136 | +When making a function more pythonic, apply the correct coercion pattern based on **what the argument represents**: |
| 137 | + |
| 138 | +### Category A: Arguments That Should Accept Native Types AND Expr |
| 139 | + |
| 140 | +These are arguments that are *typically* literals but *could* be column references in advanced use cases. For these, accept a union type and coerce native types to `Expr.literal()`. |
| 141 | + |
| 142 | +**Type hint pattern:** `Expr | int`, `Expr | str`, `Expr | int | str`, etc. |
| 143 | + |
| 144 | +**When to use:** When the argument could plausibly come from a column in some use case (e.g., the repeat count might come from a column in a data-driven scenario). |
| 145 | + |
| 146 | +```python |
| 147 | +def repeat(string: Expr, n: Expr | int) -> Expr: |
| 148 | + """Repeats the ``string`` to ``n`` times. |
| 149 | +
|
| 150 | + Examples: |
| 151 | + >>> ctx = dfn.SessionContext() |
| 152 | + >>> df = ctx.from_pydict({"a": ["ha"]}) |
| 153 | + >>> result = df.select( |
| 154 | + ... dfn.functions.repeat(dfn.col("a"), 3).alias("r")) |
| 155 | + >>> result.collect_column("r")[0].as_py() |
| 156 | + 'hahaha' |
| 157 | + """ |
| 158 | + if not isinstance(n, Expr): |
| 159 | + n = Expr.literal(n) |
| 160 | + return Expr(f.repeat(string.expr, n.expr)) |
| 161 | +``` |
| 162 | + |
| 163 | +### Category B: Arguments That Should ONLY Accept Specific Native Types |
| 164 | + |
| 165 | +These are arguments where an `Expr` never makes sense because the value must be a fixed literal known at query-planning time (not a per-row value). For these, accept only the native type(s) and wrap internally. |
| 166 | + |
| 167 | +**Type hint pattern:** `str`, `int`, `list[str]`, etc. (no `Expr` in the union) |
| 168 | + |
| 169 | +**When to use:** When the argument is from a fixed enumeration or is always a compile-time constant: |
| 170 | +- Date/time part names (`"year"`, `"month"`, `"day"`, etc.) |
| 171 | +- Regex flags (`"g"`, `"i"`, etc.) |
| 172 | +- Values that the Rust implementation already accepts as native types |
| 173 | + |
| 174 | +```python |
| 175 | +def date_part(part: str, date: Expr) -> Expr: |
| 176 | + """Extracts a subfield from the date. |
| 177 | +
|
| 178 | + Args: |
| 179 | + part: The part of the date to extract. Must be one of "year", "month", |
| 180 | + "day", "hour", "minute", "second", etc. |
| 181 | + date: The date expression to extract from. |
| 182 | +
|
| 183 | + Examples: |
| 184 | + >>> ctx = dfn.SessionContext() |
| 185 | + >>> df = ctx.from_pydict({"a": ["2021-07-15T00:00:00"]}) |
| 186 | + >>> df = df.select(dfn.functions.to_timestamp(dfn.col("a")).alias("a")) |
| 187 | + >>> result = df.select( |
| 188 | + ... dfn.functions.date_part("year", dfn.col("a")).alias("y")) |
| 189 | + >>> result.collect_column("y")[0].as_py() |
| 190 | + 2021 |
| 191 | + """ |
| 192 | + part = Expr.literal(part) |
| 193 | + return Expr(f.date_part(part.expr, date.expr)) |
| 194 | +``` |
| 195 | + |
| 196 | +### Category C: Arguments That Should Accept str as Column Name |
| 197 | + |
| 198 | +In some contexts a string argument naturally refers to a column name rather than a literal. This is the pattern used by DataFrame methods. |
| 199 | + |
| 200 | +**Type hint pattern:** `Expr | str` |
| 201 | + |
| 202 | +**When to use:** Only when the string contextually means a column name (rare in `functions.py`, more common in DataFrame methods). |
| 203 | + |
| 204 | +```python |
| 205 | +# Use _to_raw_expr() from expr.py for this pattern |
| 206 | +from datafusion.expr import _to_raw_expr |
| 207 | + |
| 208 | +def some_function(column: Expr | str) -> Expr: |
| 209 | + raw = _to_raw_expr(column) # str -> col(str) |
| 210 | + return Expr(f.some_function(raw)) |
| 211 | +``` |
| 212 | + |
| 213 | +**IMPORTANT:** In `functions.py`, string arguments almost never mean column names. Functions operate on expressions, and column references should use `col()`. Category C applies mainly to DataFrame methods and context APIs, not to scalar/aggregate/window functions. Do NOT convert string arguments to column expressions in `functions.py` unless there is a very clear reason to do so. |
| 214 | + |
| 215 | +## Implementation Steps |
| 216 | + |
| 217 | +For each function being updated: |
| 218 | + |
| 219 | +### Step 1: Analyze the Function |
| 220 | + |
| 221 | +1. Read the current Python function signature in `python/datafusion/functions.py` |
| 222 | +2. Read the Rust binding in `crates/core/src/functions.rs` |
| 223 | +3. Optionally check the upstream DataFusion docs for the function |
| 224 | +4. Determine which category (A, B, or C) applies to each parameter |
| 225 | + |
| 226 | +### Step 2: Update the Python Function |
| 227 | + |
| 228 | +1. **Change the type hints** to accept native types (e.g., `Expr` -> `Expr | int`) |
| 229 | +2. **Add coercion logic** at the top of the function body |
| 230 | +3. **Update the docstring** examples to use the simpler calling convention |
| 231 | +4. **Preserve backward compatibility** — existing code using `Expr` must still work |
| 232 | + |
| 233 | +### Step 3: Update Docstring Examples |
| 234 | + |
| 235 | +Per the project's CLAUDE.md rules: |
| 236 | +- Every function must have doctest-style examples |
| 237 | +- Optional parameters need examples both without and with the optional args, using keyword argument syntax |
| 238 | +- Reuse the same input data across examples where possible |
| 239 | + |
| 240 | +**Update examples to demonstrate the pythonic calling convention:** |
| 241 | + |
| 242 | +```python |
| 243 | +# BEFORE (old style - still works but verbose) |
| 244 | +dfn.functions.left(dfn.col("a"), dfn.lit(3)) |
| 245 | + |
| 246 | +# AFTER (new style - shown in examples) |
| 247 | +dfn.functions.left(dfn.col("a"), 3) |
| 248 | +``` |
| 249 | + |
| 250 | +### Step 4: Run Tests |
| 251 | + |
| 252 | +After making changes, run the doctests to verify: |
| 253 | +```bash |
| 254 | +python -m pytest --doctest-modules python/datafusion/functions.py -v |
| 255 | +``` |
| 256 | + |
| 257 | +## Coercion Helper Pattern |
| 258 | + |
| 259 | +When adding coercion to a function, use this inline pattern: |
| 260 | + |
| 261 | +```python |
| 262 | +if not isinstance(arg, Expr): |
| 263 | + arg = Expr.literal(arg) |
| 264 | +``` |
| 265 | + |
| 266 | +Do NOT create a new helper function for this — the inline check is clear and explicit. The project intentionally uses `ensure_expr()` to reject non-Expr values in contexts where coercion is not wanted; the pythonic coercion is the opposite pattern and should be visually distinct. |
| 267 | + |
| 268 | +## What NOT to Change |
| 269 | + |
| 270 | +- **Do not change arguments that represent data columns.** If an argument is the primary data being operated on (e.g., the `string` in `left(string, n)` or the `array` in `array_sort(array)`), it should remain `Expr` only. Users should use `col()` for column references. |
| 271 | +- **Do not change variadic `*args: Expr` parameters.** These represent multiple expressions and should stay as `Expr`. |
| 272 | +- **Do not change arguments where the coercion is ambiguous.** If it is unclear whether a string should be a column name or a literal, leave it as `Expr` and let the user be explicit. |
| 273 | +- **Do not change functions that are simple aliases.** If a function is just `return other_function(...)`, update the primary function only. |
| 274 | +- **Do not change the Rust bindings.** All coercion happens in the Python layer. The Rust functions continue to accept `PyExpr`. |
| 275 | + |
| 276 | +## Priority Order |
| 277 | + |
| 278 | +When auditing functions, process them in this order: |
| 279 | + |
| 280 | +1. **Date/time functions** — `date_part`, `date_trunc`, `date_bin` — these have the clearest literal arguments |
| 281 | +2. **String functions** — `left`, `right`, `repeat`, `lpad`, `rpad`, `split_part`, `substring`, `replace`, `regexp_replace`, `regexp_match`, `regexp_count` — common and verbose without coercion |
| 282 | +3. **Math functions** — `round`, `trunc`, `power` — numeric literal arguments |
| 283 | +4. **Array functions** — `array_slice`, `array_position`, `array_remove_n`, `array_replace_n`, `array_resize`, `array_element` — index and count arguments |
| 284 | +5. **Other functions** — any remaining functions with literal arguments |
| 285 | + |
| 286 | +## Output Format |
| 287 | + |
| 288 | +For each function analyzed, report: |
| 289 | + |
| 290 | +``` |
| 291 | +## [Function Name] |
| 292 | +
|
| 293 | +**Current signature:** `function(arg1: Expr, arg2: Expr) -> Expr` |
| 294 | +**Proposed signature:** `function(arg1: Expr, arg2: Expr | int) -> Expr` |
| 295 | +**Category:** A (accepts native + Expr) |
| 296 | +**Arguments changed:** |
| 297 | +- `arg2`: Expr -> Expr | int (always a literal count) |
| 298 | +**Rust binding:** Takes PyExpr, wraps to literal internally |
| 299 | +**Status:** [Changed / Skipped / Needs Discussion] |
| 300 | +``` |
| 301 | + |
| 302 | +If asked to implement (not just audit), make the changes directly and show a summary of what was updated. |
0 commit comments