
Conversation

@AKHIL-149 (Contributor)

Summary

Fixes #63314: pivot_table creating duplicate indices on Python 3.14 with NumPy 1.26.

Tracked down the actual bug. It wasn't in compress_group_index like I thought; it's NumPy's searchsorted that misbehaves with this version combo.

What was happening

  • unstack uses searchsorted to build the compressor array
  • with Python 3.14 + NumPy 1.26, searchsorted returns duplicate values instead of unique positions
  • this causes multiple different index values to map to the same output row (see the sketch below)
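
A minimal sketch of the two compressor computations (a simplified stand-in for the pandas internals; the names comp_index and ngroups mirror the code this PR touches):

import numpy as np

# Simplified stand-in for the sorted group-id array that unstack builds
comp_index = np.repeat(np.arange(5), 3)  # 5 groups, 3 rows each
ngroups = 5

# Sorted path: first position of each group id via searchsorted.
# This is the call that reportedly misbehaves on py3.14 + numpy 1.26.
compressor_fast = comp_index.searchsorted(np.arange(ngroups))

# Non-sorted path: index of the first occurrence of each unique value.
compressor_safe = np.sort(np.unique(comp_index, return_index=True)[1])

# On a working setup both give [0, 3, 6, 9, 12]
assert (compressor_fast == compressor_safe).all()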

The fix

Fall back to the np.unique approach when on Python 3.14 + NumPy < 2.0. This is the same method the non-sorted path already uses, so it is already covered by tests; a sketch of the routing follows.
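
A hedged sketch of that routing (a hypothetical helper, not the actual diff; has_searchsorted_bug is the guard shown in the review comment below):

import numpy as np

def build_compressor(comp_index, ngroups, has_searchsorted_bug):
    # Hypothetical helper illustrating the fallback routing in this PR
    if has_searchsorted_bug:
        # py3.14 + numpy < 2.0: reuse the non-sorted path's method
        return np.sort(np.unique(comp_index, return_index=True)[1])
    # Normal case: searchsorted on the already-sorted group ids
    return comp_index.searchsorted(np.arange(ngroups))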

Testing

Tested with the reproduction case from the issue (100k rows, 3 metrics); the output is now correct.


Comment on lines 239 to 241
# GH 63314: avoid searchsorted bug with py3.14 + numpy < 2.0
numpy_major = int(np.__version__.split(".")[0])
has_searchsorted_bug = sys.version_info >= (3, 14) and numpy_major < 2
Member

There are some existing constants you can use for this:

from pandas.compat.numpy import np_version_gt2
from pandas.compat import PY314
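
A sketch of the guard rewritten with those constants (assuming np_version_gt2 is True for NumPy 2.0 and later, matching its name):

from pandas.compat import PY314
from pandas.compat.numpy import np_version_gt2

# GH 63314: same check as the diff above, without manual version parsing
has_searchsorted_bug = PY314 and not np_version_gt2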

@jorisvandenbossche (Member)

@AKHIL-149 Thanks for the analysis!

Can you reproduce the issue with Python 3.14 + NumPy 1.26 with a small example as well? (We don't have that combo in CI, and it's also not really a combination supported by NumPy.) For example:

>>> arr = np.repeat(np.arange(100_000), 3)
>>> res1 = arr.searchsorted(np.arange(100_000))
>>> res2 = np.sort(np.unique(arr, return_index=True)[1])
>>> np.allclose(res1, res2)
True

@AKHIL-149 (Contributor Author) commented Dec 12, 2025

@jorisvandenbossche Thanks for the review; updated to use the existing constants.

Regarding the searchsorted test, the simple example you provided works fine:

>>> arr = np.repeat(np.arange(100_000), 3)
>>> res1 = arr.searchsorted(np.arange(100_000))
>>> res2 = np.sort(np.unique(arr, return_index=True)[1])
>>> np.allclose(res1, res2)
True

However, the bug manifests specifically in the pivot_table/unstack scenario with the comp_index array that gets created. The issue seems to be triggered by the specific pattern of values and array size that occurs during the unstack operation, not with a simple repeated array pattern.

I tested with Python 3.14.2 + NumPy 1.26.4 and can reproduce the pivot_table bug consistently at around 15k+ indices (45k+ rows), where it returns 5001 unique indices instead of 15000 (exactly 1/3 ratio). The workaround using np.unique fixes the issue.
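
A reproduction sketch along those lines (column names are hypothetical; the exact frame from the issue may differ):

import numpy as np
import pandas as pd

n = 15_000  # bug reported at around 15k+ distinct index values (45k+ rows)
df = pd.DataFrame(
    {
        "key": np.repeat(np.arange(n), 3),      # 3 rows per key
        "metric": np.tile(["a", "b", "c"], n),  # 3 metrics
        "value": np.arange(3 * n, dtype=float),
    }
)
result = df.pivot_table(index="key", columns="metric", values="value")
# On Python 3.14 + NumPy 1.26 this reportedly yields about n/3 unique
# index values; on a working setup it is exactly n.
assert result.index.nunique() == n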


@rhshadrach (Member) left a comment

With NumPy 1.26 EOL prior to the release of Python 3.14, I'm not sure we should be making changes in pandas to support this.

@AKHIL-149 (Contributor Author) left a comment

Yeah, that's a fair point about the EOL timing. The main reason I looked at this is that people are still hitting it in the wild (like the OP), probably from locked environments or slow upgrades.

The fix itself is pretty minimal, just routing to the existing fallback path. But I get it if supporting an EOL combo doesn't make sense for pandas.

We could also just close this and document it as a known issue with that specific version combo, if you prefer.

@rhshadrach (Member)

It seems to me a good resolution here would be to enforce numpy>=2.0 when the user has Python 3.14.

@AKHIL-149 (Contributor Author)

Ah, yeah, that's probably cleaner than the workaround. So basically: require numpy>=2.0 when python_version >= '3.14' in the dependency specs? Something like the sketch below.

Makes sense; it forces people to upgrade instead of patching around the bug. I can update the PR to do that instead if you want, or just close this one.
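
A hypothetical sketch of that constraint using environment markers in pyproject.toml (the exact pins are up to the maintainers):

[project]
dependencies = [
    "numpy>=1.26.0; python_version < '3.14'",
    "numpy>=2.0.0; python_version >= '3.14'",
]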
