Fix pivot_table duplicate indices with Python 3.14 + NumPy 1.26 #63324
base: main
Conversation
found the real issue - searchsorted is broken with python 3.14 + numpy 1.26. it's not compress_group_index, it's the compressor calculation in unstack that uses searchsorted. just fall back to the unique/return_index approach for this combo, same as what the non-sorted path does. works with 100k rows now.
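A minimal, illustrative sketch of the two calculations being contrasted here (this is not the actual reshape.py code; `comp_index` and `ngroups` are stand-ins for the sorted group-id array and group count that unstack builds internally):

```python
import numpy as np

def first_occurrence_positions(comp_index, ngroups, use_searchsorted):
    # comp_index is assumed to be a sorted array of group ids 0..ngroups-1
    if use_searchsorted:
        # sorted path: first occurrence of each id via binary search
        return comp_index.searchsorted(np.arange(ngroups))
    # fallback, same idea as the non-sorted path: first occurrences via np.unique
    return np.sort(np.unique(comp_index, return_index=True)[1])

comp_index = np.repeat(np.arange(5), 3)  # [0, 0, 0, 1, 1, 1, ...]
print(first_occurrence_positions(comp_index, 5, True))   # [ 0  3  6  9 12]
print(first_occurrence_positions(comp_index, 5, False))  # [ 0  3  6  9 12]
```

On a healthy NumPy both branches agree; the report above is that the searchsorted branch goes wrong under Python 3.14 with NumPy 1.26, so the PR routes that combination to the unique/return_index branch.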
pre-commit.ci autofix
for more information, see https://pre-commit.ci
pandas/core/reshape/reshape.py
Outdated
```python
# GH 63314: avoid searchsorted bug with py3.14 + numpy < 2.0
numpy_major = int(np.__version__.split(".")[0])
has_searchsorted_bug = sys.version_info >= (3, 14) and numpy_major < 2
```
There are some existing constants you can use for this:
from pandas.compat.numpy import np_version_gt2
from pandas.compat import PY314
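For illustration, a hedged sketch of how the guard could look with those constants (assuming they are importable exactly as named above; the real placement inside reshape.py may differ):

```python
from pandas.compat import PY314
from pandas.compat.numpy import np_version_gt2

# GH 63314: searchsorted misbehaves on Python 3.14 with NumPy < 2.0
has_searchsorted_bug = PY314 and not np_version_gt2
```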
@AKHIL-149 Thanks for the analysis! Can you reproduce the issue with Python 3.14 + NumPy 1.26 with a small example as well? (Since we don't have that combo in CI, it's also not really a supported combination by numpy):
addressing review feedback - use PY314 and np_version_gt2 instead of manual version parsing
@jorisvandenbossche Thanks for the review! Updated to use the existing constants. Regarding the searchsorted test, the simple example you provided works fine:

```python
>>> arr = np.repeat(np.arange(100_000), 3)
>>> res1 = arr.searchsorted(np.arange(100_000))
>>> res2 = np.sort(np.unique(arr, return_index=True)[1])
>>> np.allclose(res1, res2)
True
```

However, the bug manifests specifically in the pivot_table/unstack scenario with the comp_index array that gets created. The issue seems to be triggered by the specific pattern of values and array size that occurs during the unstack operation, not by a simple repeated array pattern. I tested with Python 3.14.2 + NumPy 1.26.4 and can reproduce the pivot_table bug consistently at around 15k+ indices (45k+ rows), where it returns 5001 unique indices instead of 15000 (exactly a 1/3 ratio). The workaround using np.unique fixes the issue.
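For reference, a hedged reproduction sketch of the scenario described above (column names, sizes, and the aggregation are illustrative assumptions, not taken verbatim from the original issue):

```python
import numpy as np
import pandas as pd

n = 15_000  # ~45k rows once each id carries three metrics
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "id": np.repeat(np.arange(n), 3),
        "metric": np.tile(["a", "b", "c"], n),
        "value": rng.normal(size=3 * n),
    }
)

result = df.pivot_table(index="id", columns="metric", values="value")
# Healthy setup: 15000 unique index values and a unique index.
# The comment above reports only ~5001 unique values on py3.14 + numpy 1.26.
print(result.index.nunique(), result.index.is_unique)
```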
@jorisvandenbossche done - switched to using the compat constants. the searchsorted bug is weird, only shows up with the specific array pattern from unstack, not with simple test cases. but the workaround handles it.
rhshadrach
left a comment
With NumPy 1.26 EOL prior to the release of Python 3.14, I'm not sure we should be making changes in pandas to support this.
AKHIL-149
left a comment
yeah that's a fair point about the EOL timing. i guess the main reason i looked at this is that people are still hitting it in the wild (like the OP) - probably locked environments or slow upgrades.
the fix itself is pretty minimal, just routing to the existing fallback path. but i get it if supporting an EOL combo doesn't make sense for pandas.
could also just close this and document it as a known issue with that specific version combo if you prefer.
It seems to me a good resolution here would be to enforce NumPy >= 2.0 when running on Python 3.14.
ah yeah that's probably cleaner than the workaround. so basically, add numpy>=2.0 to the install_requires when python_version>='3.14' in the dependency specs? makes sense - forces people to upgrade instead of patching around the bug. i can update the PR to do that instead if you want, or just close this one.
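As a hedged illustration of that idea (PEP 508 environment markers shown as a plain Python list; this is not pandas' actual dependency metadata, which lives in pyproject.toml):

```python
# Hypothetical sketch only - not the project's real requirement strings.
install_requires = [
    "numpy>=2.0.0; python_version >= '3.14'",  # force NumPy 2.x on Python 3.14+
    # the existing, lower NumPy bound would remain for older Python versions
]
```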
Summary
fixes #63314 - pivot_table creating duplicate indices on python 3.14 with numpy 1.26
tracked down the actual bug. wasn't in compress_group_index like i thought - it's numpy's searchsorted that's broken with this version combo.
What was happening
the compressor calculation in unstack uses searchsorted on the comp_index array. with python 3.14 + numpy 1.26 that call returns the wrong positions (e.g. 5001 unique indices instead of 15000), which surfaces as duplicate indices in the pivot_table result.
The fix
fall back to the np.unique approach when on python 3.14 + numpy < 2.0. this is the same method the non-sorted path already uses, so it's tested.
Testing
tested with the reproduction case from the issue (100k rows, 3 metrics). works correctly now.
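A regression-test-style sketch of that check (hedged: this is not the actual test from the PR, and the column names and exact sizes are assumptions based on the description above):

```python
import numpy as np
import pandas as pd

def test_pivot_table_unique_index_large():
    # roughly 100k rows: ~33k ids, each with three metrics
    n_ids = 33_333
    df = pd.DataFrame(
        {
            "id": np.repeat(np.arange(n_ids), 3),
            "metric": np.tile(["m1", "m2", "m3"], n_ids),
            "value": 1.0,
        }
    )
    result = df.pivot_table(index="id", columns="metric", values="value")
    assert result.index.is_unique
    assert len(result.index) == n_ids
```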