Skip to content

Conversation

@b-pass
Copy link
Collaborator

@b-pass b-pass commented Dec 26, 2025

Description

Change both the Python->C++ and the C++->Python function calls to use the faster "vector call" calling protocol.

This makes function calls faster, at the minor expense of having to handle the arguments slightly differently.

Suggested changelog entry:

  • Improved performance of function calls between Python and C++ by switching to the "vectorcall" calling protocol.

@b-pass b-pass marked this pull request as draft December 26, 2025 13:25
@b-pass
Copy link
Collaborator Author

b-pass commented Dec 26, 2025

Small benchmark results for short function calls, this PR provides modest improvement over current master... perf

@b-pass b-pass requested a review from oremanj December 26, 2025 16:52
@b-pass b-pass marked this pull request as ready for review December 26, 2025 16:52
@rwgk
Copy link
Collaborator

rwgk commented Dec 27, 2025

@b-pass I merged master, which had no conflicts.

My idea was that maybe Cursor can figure out the root cause for these two CI failures:

2025-12-26T23:29:10.4917170Z  E       ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
...
2025-12-26T23:29:12.4779770Z  E       ModuleNotFoundError: No module named 'mod_shared_interpreter_gil'

I'll wait for this CI run to finish, to see if these errors appear again.

When testing locally with Python 3.14.2, default and freethreaded, I saw this error (with both builds):

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

EDIT: Please see #5948 (comment) for the resolution.

I'm attaching the full build & test logs, JIC.

I switched back to master, everything else being equal, this error didn't appear. So it's definitely somehow connected to this PR.

pybind11_gcc_v3.14_df793163d58_default_tests_log_2025-12-26+193154.txt

pybind11_gcc_v3.14_df793163d58_freethreaded_log_2025-12-26+193608.txt

The `mod_per_interpreter_gil`, `mod_shared_interpreter_gil`, and
`mod_per_interpreter_gil_with_singleton` modules were being built
but not installed into the wheel when using scikit-build-core
(SKBUILD=true). This caused iOS (and potentially Android) CIBW
tests to fail with ModuleNotFoundError.

Root cause analysis:
- The main test targets have install() commands (line 531)
- The PYBIND11_MULTIPLE_INTERPRETERS_TEST_MODULES were missing
  equivalent install() commands
- For regular CMake builds, this wasn't a problem because
  LIBRARY_OUTPUT_DIRECTORY places the modules next to pybind11_tests
- For wheel builds, only targets with explicit install() commands
  are included in the wheel

This issue was latent until commit fee2527 changed the test imports
from `pytest.importorskip()` (graceful skip) to direct `import`
statements (hard failure), which exposed the missing modules.

Failing tests:
- test_multiple_interpreters.py::test_independent_subinterpreters
- test_multiple_interpreters.py::test_dependent_subinterpreters

Error: ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
@rwgk
Copy link
Collaborator

rwgk commented Dec 27, 2025

Quick update: I think Cursor found the root cause for the CIBW / iOS failures, and a fix.

I'm still looking into why we're not seeing the m.array_reshape2(b) errors here.

Add numpy==2.4.0 requirement for Python 3.14 (both default and
free-threaded builds). NumPy 2.4.0 is the first version to provide
official PyPI wheels for Python 3.14:

- numpy-2.4.0-cp314-cp314-manylinux_2_27_x86_64...whl (default)
- numpy-2.4.0-cp314-cp314t-manylinux_2_27_x86_64...whl (free-threaded)

Previously, CI was skipping all numpy-dependent tests for Python 3.14
because PIP_ONLY_BINARY was set and no wheels were available:

  SKIPPED [...] test_numpy_array.py:8: could not import 'numpy':
  No module named 'numpy'

With this change, the full numpy test suite will run on Python 3.14,
providing better test coverage for the newest Python version.

Note: Using exact pin (==2.4.0) rather than compatible release (~=2.4.0)
to ensure reproducible CI results with the first known-working version.
@rwgk rwgk requested a review from henryiii as a code owner December 27, 2025 05:00
@rwgk
Copy link
Collaborator

rwgk commented Dec 27, 2025

I pushed commit ca72d6c here because we only have very limited GHA resources. Depending on how it goes, it might be best to merge that commit in a separate PR. Let's see.

Ideally commit 6ed6d5a should also be in a separate PR, but same issue with the GHA resources.

@rwgk
Copy link
Collaborator

rwgk commented Dec 27, 2025

NumPy 2.3.5 vs 2.4.0 Observations

The Issue

When testing this PR locally with Python 3.14 + NumPy 2.3.5, I observed a test failure:

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

test_numpy_array.py::test_array_resize FAILED

This occurred when calling array_reshape2 on a transposed view of a resized 64-element array (which should have worked since 64 = 8²).

What We Know For Sure

  1. The error appeared with NumPy 2.3.5 (released 2025-11-16) on Python 3.14
  2. The error is gone with NumPy 2.4.0 (released 2025-12-20) on Python 3.14
  3. This is NOT a bug in the vectorcall PR — the same test passes on master with NumPy 2.3.5, but that's because CI was skipping numpy tests for Python 3.14 entirely (no numpy entry in tests/requirements.txt for 3.14)
  4. NumPy 2.4.0 wheels for Python 3.14 are now available on PyPI (both default cp314 and free-threaded cp314t)

What's Undetermined

The exact bug fix in NumPy between 2.3.5 and 2.4.0 that resolved this issue. I searched the NumPy 2.4.0 release notes but couldn't find a specific fix that clearly matches our observation. The release notes mention "many expired deprecations and bug fixes" without listing all of them individually.

Resolution

Added numpy==2.4.0 to tests/requirements.txt for Python 3.14+ (commit ca72d6c). This:

  • Enables numpy tests on Python 3.14 in CI (previously skipped)
  • Avoids the NumPy 2.3.5 bug

Add `-v` to the pytest command in tests/pyproject.toml to help
diagnose hanging tests in CIBW jobs (particularly iOS).

This will show each test name as it runs, making it easier to
identify which specific test is hanging.
- Add `IOS` platform constant to `tests/env.py` for consistency with
  existing `ANDROID`, `LINUX`, `MACOS`, `WIN`, `FREEBSD` constants.

- Skip `test_multiple_interpreters.py` module on iOS. Subinterpreters
  are not supported in the iOS simulator environment. These tests were
  previously skipped implicitly because the modules weren't installed
  in the wheel; now that they are (commit 6ed6d5a), we need an
  explicit skip.

- Change pytest timeout from 0 (disabled) to 120 seconds. This provides
  a safety net to catch hanging tests before the CI job times out after
  hours. Normal test runs complete in 33-55 seconds total (~1100 tests),
  so 120 seconds per test is very generous.

- Add `-v` flag for verbose output to help diagnose any future issues.
@rwgk
Copy link
Collaborator

rwgk commented Dec 27, 2025

Cursor Code Review (with some manual edits)

Question for the Author

Is there a specific reason unpacking_vectorcall_collector requires Python 3.12+ instead of 3.8+?


Overview

This PR changes pybind11's function calling convention from METH_VARARGS | METH_KEYWORDS to METH_FASTCALL | METH_KEYWORDS (vectorcall protocol). This is a performance optimization that avoids tuple allocation for positional arguments when calling C++ functions from Python.

The PR also adds a new unpacking_vectorcall_collector for the reverse direction (C++ calling Python) on Python 3.12+.

Summary of Changes

1. include/pybind11/pybind11.h (183 lines changed)

Key Changes:

  • Changed ml_flags from METH_VARARGS | METH_KEYWORDS to METH_FASTCALL | METH_KEYWORDS
  • Changed dispatcher signature from (PyObject *self, PyObject *args_in, PyObject *kwargs_in) to (PyObject *self, PyObject *const *args_in_arr, size_t nargsf, PyObject *kwnames_in)
  • Replaced PyTuple_GET_ITEM(args_in, i) with args_in_arr[i]
  • Replaced PyTuple_GET_SIZE(args_in) with PyVectorcall_NARGS(nargsf)
  • Changed kwargs handling from dict-based to tuple-based (kwnames)
  • Added keyword_index() helper function for efficient keyword lookup
  • Added used_kwargs vector to track which kwargs have been consumed

Observations:

  • ✅ The keyword_index() function uses a two-pass approach: first checking pointer equality (for interned strings), then falling back to lexicographic comparison. This is a smart optimization.
  • ✅ The code properly handles PY_VECTORCALL_ARGUMENTS_OFFSET in the outgoing direction.
  • ⚠️ The used_kwargs vector is sorted before iterating to build the kwargs dict. This is O(n log n) but n is typically very small.

2. include/pybind11/cast.h (194 lines added)

Key Changes:

  • Added unpacking_vectorcall_collector<policy> class for Python 3.12+ (PY_VERSION_HEX >= 0x030C0000)
  • This class collects arguments for C++ → Python calls using the vectorcall protocol
  • Uses PyObject_Vectorcall() instead of PyObject_Call()

Observations:

  • ✅ Properly uses PY_VECTORCALL_ARGUMENTS_OFFSET to allow the callee to use the space before the first argument
  • ✅ Keeps temporary objects alive in m_temp vector to prevent premature destruction
  • ✅ Builds keyword names as a list first, then converts to tuple at the end (avoids resizing tuples)
  • ⚠️ The code uses PyList_Append with name.release().ptr() - this transfers ownership but could leak if the append fails. However, throw error_already_set() is called immediately after, so this should be fine.

3. include/pybind11/detail/argument_vector.h (123 lines changed)

Key Changes:

  • Renamed/generalized argument_vector to small_vector<T, InlineSize>
  • Removed static_assert for trivially move constructible/destructible (now supports non-trivial types)
  • Added proper destructor calls for inline storage
  • Changed from memcpy to proper move construction
  • Added data() method
  • Added sort() method
  • Made argument_vector and args_convert_vector type aliases to small_vector

Observations:

  • ✅ The generalization allows small_vector to be used for other types (e.g., small_vector<size_t, ...> for used_kwargs)
  • ✅ Proper handling of non-trivial types with explicit destructor calls
  • ✅ Added static_assert for exception safety in move_to_heap_vector_with_reserved_size

4. tests/test_kwargs_and_defaults.py (3 lines changed)

Key Changes:

  • Updated test_args_refcount to expect 3 extra references when creating a new tuple internally

Observations:

  • ✅ This is expected behavior change: with vectorcall, the code creates a new tuple for *args which holds references to the arguments

Question regarding Python Version Compatibility

  • The dispatcher changes apply to all Python versions
  • The unpacking_vectorcall_collector is only for Python 3.12+ (vectorcall was added in 3.8, but the PR uses 3.12+ for the C++ → Python direction)

Question: Why is unpacking_vectorcall_collector limited to Python 3.12+? Vectorcall has been available since Python 3.8. Perhaps there's a specific API used that requires 3.12+?

Code Quality

Strengths

  • ✅ Well-structured changes with clear separation of concerns
  • ✅ Good use of small vector optimization to avoid heap allocations
  • ✅ Proper handling of edge cases (empty kwargs, etc.)
  • ✅ The keyword_index() function is well-optimized for the common case

Minor Suggestion

  • Comment clarity: The comment "// -1 to account for PY_VECTORCALL_ARGUMENTS_OFFSET" could be expanded to explain that the first slot is reserved for the callee to potentially use.

Compatibility

  • ABI: This change affects the internal calling convention but should be transparent to users
  • API: No public API changes
  • Behavior: Minor reference counting changes (documented in test change)

Testing

The existing test suite appears to cover the changes well:

  • ✅ All Python 3.14 tests pass (with numpy 2.4.0)
  • ✅ CI is passing across multiple platforms and Python versions
  • ✅ The kwargs/args tests explicitly verify the new behavior

Verdict

Recommendation: Approve with minor suggestions

This is a well-implemented performance optimization that:

  1. Uses the modern vectorcall protocol for better performance
  2. Properly handles all edge cases
  3. Maintains backward compatibility
  4. Has good test coverage

The changes are consistent with Python's own direction (vectorcall is the preferred calling convention) and should provide performance benefits, especially for frequently-called functions with few arguments.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but mostly based on the Cursor review. I haven't looked very closely manually. I recommend waiting for a review from @oremanj before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants