fix: PandasPreprocessor._spark_apply_steps breaks due to bad kwarg #47

Merged
mbaak merged 1 commit into ing-bank:main from Lodewic:fix/pandas_preprocessor_with_spark_session
Mar 11, 2026
Conversation

Lodewic (Contributor) commented Mar 11, 2026

Fixes #46

PandasPreprocessor._spark_apply_steps is called when all of the following hold:

  1. PandasEntityMatching is used
  2. spark_session is passed to the model
  3. 200k+ records are processed

It breaks because the call site passes the keyword argument functions, while the inner calc function names that parameter funcs.

Reproduce the problem

The snippet below raises the error with the versions noted in its comments. This PR fixes it.

import pandas as pd  # pandas 2.3.3
from pyspark.sql import SparkSession  # pyspark 3.5.2

from emm import PandasEntityMatching  # emm 2.1.10

spark = SparkSession.builder.getOrCreate()

# If spark_session is passed, and at least 200k records processed, 
# then Spark will be used for transformations. Otherwise, pandas will be used.
model = PandasEntityMatching(parameters={"spark_session": spark})

# Build 200k+ records
names = [
    'John Smith LLC',
    'ING LLC',
    'John Doe LLC',
    'Zhe Sun G.M.B.H',
    'Random GMBH',
] * 100_000
df = pd.DataFrame(names, columns=['name']).assign(id=range(len(names)))

# This raises:
# TypeError: PandasPreprocessor._spark_apply_steps.<locals>.calc() got an unexpected keyword argument 'functions'
model.fit(df)

The fix

Rename the parameter of the inner calc function from funcs:

def calc(chunk, funcs):
    for func in funcs:
        chunk = func(chunk)
    return chunk.index.values, chunk.values

to

def calc(chunk, functions):
    for func in functions:
        chunk = func(chunk)
    return chunk.index.values, chunk.values

This matches how the function is actually called, with functions passed as a keyword argument.
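The failure mode is plain Python and can be seen without Spark or emm. A minimal sketch, in which run_steps, calc_before and calc_after are hypothetical stand-ins and not emm code, and the steps operate on a string rather than a pandas chunk:

```python
def calc_before(chunk, funcs):
    # Pre-fix shape: the parameter is named `funcs`.
    for func in funcs:
        chunk = func(chunk)
    return chunk


def calc_after(chunk, functions):
    # Post-fix shape: the parameter name matches the keyword used by the caller.
    for func in functions:
        chunk = func(chunk)
    return chunk


def run_steps(calc, chunk, steps):
    # Hypothetical caller: hands the steps over by keyword, as `functions`.
    return calc(chunk, functions=steps)


steps = [str.strip, str.lower]

try:
    run_steps(calc_before, "  ING LLC ", steps)
    error_message = None
except TypeError as exc:
    # The keyword `functions` does not match the parameter `funcs`.
    error_message = str(exc)

result = run_steps(calc_after, "  ING LLC ", steps)
```

The same mismatch would go unnoticed if the caller passed the steps positionally, which is presumably why it only surfaces on the Spark code path.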

Tests

As far as I can tell there are no existing tests covering this code path, so I have not added or updated any.
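One possible shape for a lightweight check, should a test be added later, is to verify that a callable accepts the keyword its caller uses. This is only a sketch: accepts_keyword is a hypothetical helper, and because calc is defined inside _spark_apply_steps it cannot be inspected from outside, so a real regression test would probably have to exercise the Spark path end to end.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with a keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def calc_old(chunk, funcs):  # stand-in for the pre-fix signature
    return chunk


def calc_new(chunk, functions):  # stand-in for the post-fix signature
    return chunk


old_ok = accepts_keyword(calc_old, "functions")
new_ok = accepts_keyword(calc_new, "functions")
```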

mbaak (Contributor) left a comment

Thanks!

@mbaak mbaak merged commit b5bfa48 into ing-bank:main Mar 11, 2026
4 checks passed
Development

Successfully merging this pull request may close these issues.

Bug: PandasPreprocessor._spark_apply_steps is broken

2 participants