Overview
For varchar/text columns, min and max are currently suppressed (they only apply to numeric and date columns). String columns get no shape information beyond distinct_count and not_null_proportion. Adding min_length, max_length, and avg_length fills this gap.
Motivation
String length statistics catch a class of data quality bugs that are otherwise invisible:
- Truncation bugs: a column that should hold 50-char values suddenly has max_length of 20
- Empty string vs null: not_null_proportion = 1.0 but min_length = 0 reveals unexpected empty strings
- Padding issues: unexpectedly uniform lengths suggest padding or fixed-width encoding
- Encoding problems: avg_length spikes when multi-byte characters are counted as bytes
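As a rough illustration of the empty-string and encoding cases above, the sketch below profiles a tiny in-memory SQLite table. The table name `customers`, the column `name`, and the sample values are made up for the example and are not part of the proposal.

```python
import sqlite3

# Hypothetical sample data: one empty string, no nulls, one non-ASCII value.
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (name text)")
conn.executemany(
    "insert into customers (name) values (?)",
    [("Alice",), ("",), ("Bob",), ("Zo\u00eb",)],
)

# Profile the column: the null check looks clean, the length stats do not.
row = conn.execute(
    """
    select
        avg(case when name is not null then 1.0 else 0.0 end) as not_null_proportion,
        min(length(name)) as min_length,
        max(length(name)) as max_length,
        avg(length(name)) as avg_length
    from customers
    """
).fetchone()
not_null_proportion, min_length, max_length, avg_length = row

# Characters vs bytes: the same value has different "lengths" depending on
# which one is counted, which is how an encoding problem shifts avg_length.
char_len = len("Zo\u00eb")
byte_len = len("Zo\u00eb".encode("utf-8"))

print(not_null_proportion, min_length, max_length, avg_length)
print(char_len, byte_len)
```

Here not_null_proportion alone reports a fully populated column, while min_length = 0 exposes the empty string.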
Implementation notes
Simple cross-database implementation using length() / len():
-- default
min(length(col))
max(length(col))
avg(length(col))
-- SQL Server uses len(), which ignores trailing spaces
min(len(col))
max(len(col))
-- cast to float so avg() does not return a truncated integer
avg(cast(len(col) as float))
Applied only when is_string_dtype(data_type) is true (mirrors how avg/median apply only to numeric types).
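One way to picture the gating and the per-database dispatch is the Python sketch below. It is only a model of the logic: the real implementation would presumably live in SQL/Jinja macros, and the `is_string_dtype` body, the `length_measures` helper, and the `dialect` argument are all assumptions made for this example.

```python
from typing import Dict

# Assumed substrings that mark a string type; the real is_string_dtype
# check may use a different rule.
STRING_TYPE_MARKERS = ("char", "text", "string")


def is_string_dtype(data_type: str) -> bool:
    """Hypothetical stand-in for the is_string_dtype check named above."""
    lowered = data_type.lower()
    return any(marker in lowered for marker in STRING_TYPE_MARKERS)


def length_measures(column: str, data_type: str, dialect: str = "default") -> Dict[str, str]:
    """Return SQL expressions for the three length measures.

    Returns an empty dict for non-string columns, so the measures are
    simply skipped there.
    """
    if not is_string_dtype(data_type):
        return {}
    if dialect == "sqlserver":
        length_expr = f"len({column})"
        # Cast so the average is not truncated to an integer.
        avg_expr = f"avg(cast({length_expr} as float))"
    else:
        length_expr = f"length({column})"
        avg_expr = f"avg({length_expr})"
    return {
        "min_length": f"min({length_expr})",
        "max_length": f"max({length_expr})",
        "avg_length": avg_expr,
    }


print(length_measures("name", "varchar(50)"))
print(length_measures("name", "nvarchar(50)", dialect="sqlserver"))
print(length_measures("amount", "numeric"))
```

Returning an empty mapping for non-string types keeps the caller free of special cases, in the same spirit as avg/median being skipped for non-numeric columns.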
API design
Three new measures are added to the default measure list. They are optional in the sense that they can be turned off with exclude_measures:
exclude_measures: [min_length, max_length, avg_length]
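The opt-out behaviour can be sketched as a simple filter over the default list. The `DEFAULT_MEASURES` list below is abbreviated and the `resolve_measures` helper is hypothetical; only the measure names and the `exclude_measures` option come from this proposal.

```python
from typing import List, Optional

# Abbreviated, assumed default list; the real one contains more measures.
DEFAULT_MEASURES = [
    "distinct_count",
    "not_null_proportion",
    "min_length",
    "max_length",
    "avg_length",
]


def resolve_measures(exclude_measures: Optional[List[str]] = None) -> List[str]:
    """Return the default measures minus any excluded ones, order preserved."""
    excluded = set(exclude_measures or [])
    return [m for m in DEFAULT_MEASURES if m not in excluded]


# With no exclusions the new length measures are included by default.
print(resolve_measures())
# Excluding all three restores the previous set of measures.
print(resolve_measures(["min_length", "max_length", "avg_length"]))
```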