Change function calls to use vectorcall #5948

b-pass · 2025-12-26T04:59:39Z

Description

Change both the Python->C++ and the C++->Python function calls to use the faster "vector call" calling protocol.

This makes function calls faster, at the minor expense of having to handle the arguments slightly differently.

Suggested changelog entry:

Improved performance of function calls between Python and C++ by switching to the "vectorcall" calling protocol.

This reverts commit 418a034.

It cannot re-use the incoming tuple as before, because it is no longer a tuple at all. So a new tuple must be created, which then holds references for each member.

b-pass · 2025-12-26T16:52:11Z

Small benchmark results for short function calls, this PR provides modest improvement over current master...

This would be easier with `if constexpr`

rwgk · 2025-12-27T03:42:24Z

@b-pass I merged master, which had no conflicts.

My idea was that maybe Cursor can figure out the root cause for these two CI failures:

2025-12-26T23:29:10.4917170Z  E       ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
...
2025-12-26T23:29:12.4779770Z  E       ModuleNotFoundError: No module named 'mod_shared_interpreter_gil'

I'll wait for this CI run to finish, to see if these errors appear again.

When testing locally with Python 3.14.2, default and freethreaded, I saw this error (with both builds):

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

EDIT: Please see #5948 (comment) for the resolution.

I'm attaching the full build & test logs, JIC.

I switched back to master, everything else being equal, this error didn't appear. So it's definitely somehow connected to this PR.

pybind11_gcc_v3.14_df793163d58_default_tests_log_2025-12-26+193154.txt

pybind11_gcc_v3.14_df793163d58_freethreaded_log_2025-12-26+193608.txt

The `mod_per_interpreter_gil`, `mod_shared_interpreter_gil`, and `mod_per_interpreter_gil_with_singleton` modules were being built but not installed into the wheel when using scikit-build-core (SKBUILD=true). This caused iOS (and potentially Android) CIBW tests to fail with ModuleNotFoundError. Root cause analysis: - The main test targets have install() commands (line 531) - The PYBIND11_MULTIPLE_INTERPRETERS_TEST_MODULES were missing equivalent install() commands - For regular CMake builds, this wasn't a problem because LIBRARY_OUTPUT_DIRECTORY places the modules next to pybind11_tests - For wheel builds, only targets with explicit install() commands are included in the wheel This issue was latent until commit fee2527 changed the test imports from `pytest.importorskip()` (graceful skip) to direct `import` statements (hard failure), which exposed the missing modules. Failing tests: - test_multiple_interpreters.py::test_independent_subinterpreters - test_multiple_interpreters.py::test_dependent_subinterpreters Error: ModuleNotFoundError: No module named 'mod_per_interpreter_gil'

rwgk · 2025-12-27T04:34:53Z

Quick update: I think Cursor found the root cause for the CIBW / iOS failures, and a fix.

I'm still looking into why we're not seeing the m.array_reshape2(b) errors here.

Add numpy==2.4.0 requirement for Python 3.14 (both default and free-threaded builds). NumPy 2.4.0 is the first version to provide official PyPI wheels for Python 3.14: - numpy-2.4.0-cp314-cp314-manylinux_2_27_x86_64...whl (default) - numpy-2.4.0-cp314-cp314t-manylinux_2_27_x86_64...whl (free-threaded) Previously, CI was skipping all numpy-dependent tests for Python 3.14 because PIP_ONLY_BINARY was set and no wheels were available: SKIPPED [...] test_numpy_array.py:8: could not import 'numpy': No module named 'numpy' With this change, the full numpy test suite will run on Python 3.14, providing better test coverage for the newest Python version. Note: Using exact pin (==2.4.0) rather than compatible release (~=2.4.0) to ensure reproducible CI results with the first known-working version.

rwgk · 2025-12-27T05:03:04Z

I pushed commit ca72d6c here because we only have very limited GHA resources. Depending on how it goes, it might be best to merge that commit in a separate PR. Let's see.

Ideally commit 6ed6d5a should also be in a separate PR, but same issue with the GHA resources.

rwgk · 2025-12-27T05:26:57Z

NumPy 2.3.5 vs 2.4.0 Observations

The Issue

When testing this PR locally with Python 3.14 + NumPy 2.3.5, I observed a test failure:

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

test_numpy_array.py::test_array_resize FAILED

This occurred when calling array_reshape2 on a transposed view of a resized 64-element array (which should have worked since 64 = 8²).

What We Know For Sure

The error appeared with NumPy 2.3.5 (released 2025-11-16) on Python 3.14
The error is gone with NumPy 2.4.0 (released 2025-12-20) on Python 3.14
This is NOT a bug in the vectorcall PR — the same test passes on master with NumPy 2.3.5, but that's because CI was skipping numpy tests for Python 3.14 entirely (no numpy entry in tests/requirements.txt for 3.14)
NumPy 2.4.0 wheels for Python 3.14 are now available on PyPI (both default cp314 and free-threaded cp314t)

What's Undetermined

The exact bug fix in NumPy between 2.3.5 and 2.4.0 that resolved this issue. I searched the NumPy 2.4.0 release notes but couldn't find a specific fix that clearly matches our observation. The release notes mention "many expired deprecations and bug fixes" without listing all of them individually.

Resolution

Added numpy==2.4.0 to tests/requirements.txt for Python 3.14+ (commit ca72d6c). This:

Enables numpy tests on Python 3.14 in CI (previously skipped)
Avoids the NumPy 2.3.5 bug

Add `-v` to the pytest command in tests/pyproject.toml to help diagnose hanging tests in CIBW jobs (particularly iOS). This will show each test name as it runs, making it easier to identify which specific test is hanging.

- Add `IOS` platform constant to `tests/env.py` for consistency with existing `ANDROID`, `LINUX`, `MACOS`, `WIN`, `FREEBSD` constants. - Skip `test_multiple_interpreters.py` module on iOS. Subinterpreters are not supported in the iOS simulator environment. These tests were previously skipped implicitly because the modules weren't installed in the wheel; now that they are (commit 6ed6d5a), we need an explicit skip. - Change pytest timeout from 0 (disabled) to 120 seconds. This provides a safety net to catch hanging tests before the CI job times out after hours. Normal test runs complete in 33-55 seconds total (~1100 tests), so 120 seconds per test is very generous. - Add `-v` flag for verbose output to help diagnose any future issues.

rwgk · 2025-12-27T06:25:26Z

Cursor Code Review (with some manual edits)

Question for the Author

Is there a specific reason unpacking_vectorcall_collector requires Python 3.12+ instead of 3.8+?

Overview

This PR changes pybind11's function calling convention from METH_VARARGS | METH_KEYWORDS to METH_FASTCALL | METH_KEYWORDS (vectorcall protocol). This is a performance optimization that avoids tuple allocation for positional arguments when calling C++ functions from Python.

The PR also adds a new unpacking_vectorcall_collector for the reverse direction (C++ calling Python) on Python 3.12+.

Summary of Changes

1. `include/pybind11/pybind11.h` (183 lines changed)

Key Changes:

Changed ml_flags from METH_VARARGS | METH_KEYWORDS to METH_FASTCALL | METH_KEYWORDS
Changed dispatcher signature from (PyObject *self, PyObject *args_in, PyObject *kwargs_in) to (PyObject *self, PyObject *const *args_in_arr, size_t nargsf, PyObject *kwnames_in)
Replaced PyTuple_GET_ITEM(args_in, i) with args_in_arr[i]
Replaced PyTuple_GET_SIZE(args_in) with PyVectorcall_NARGS(nargsf)
Changed kwargs handling from dict-based to tuple-based (kwnames)
Added keyword_index() helper function for efficient keyword lookup
Added used_kwargs vector to track which kwargs have been consumed

Observations:

✅ The keyword_index() function uses a two-pass approach: first checking pointer equality (for interned strings), then falling back to lexicographic comparison. This is a smart optimization.
✅ The code properly handles PY_VECTORCALL_ARGUMENTS_OFFSET in the outgoing direction.
⚠️ The used_kwargs vector is sorted before iterating to build the kwargs dict. This is O(n log n) but n is typically very small.

2. `include/pybind11/cast.h` (194 lines added)

Key Changes:

Added unpacking_vectorcall_collector<policy> class for Python 3.12+ (PY_VERSION_HEX >= 0x030C0000)
This class collects arguments for C++ → Python calls using the vectorcall protocol
Uses PyObject_Vectorcall() instead of PyObject_Call()

Observations:

✅ Properly uses PY_VECTORCALL_ARGUMENTS_OFFSET to allow the callee to use the space before the first argument
✅ Keeps temporary objects alive in m_temp vector to prevent premature destruction
✅ Builds keyword names as a list first, then converts to tuple at the end (avoids resizing tuples)
⚠️ The code uses PyList_Append with name.release().ptr() - this transfers ownership but could leak if the append fails. However, throw error_already_set() is called immediately after, so this should be fine.

3. `include/pybind11/detail/argument_vector.h` (123 lines changed)

Key Changes:

Renamed/generalized argument_vector to small_vector<T, InlineSize>
Removed static_assert for trivially move constructible/destructible (now supports non-trivial types)
Added proper destructor calls for inline storage
Changed from memcpy to proper move construction
Added data() method
Added sort() method
Made argument_vector and args_convert_vector type aliases to small_vector

Observations:

✅ The generalization allows small_vector to be used for other types (e.g., small_vector<size_t, ...> for used_kwargs)
✅ Proper handling of non-trivial types with explicit destructor calls
✅ Added static_assert for exception safety in move_to_heap_vector_with_reserved_size

4. `tests/test_kwargs_and_defaults.py` (3 lines changed)

Key Changes:

Updated test_args_refcount to expect 3 extra references when creating a new tuple internally

Observations:

✅ This is expected behavior change: with vectorcall, the code creates a new tuple for *args which holds references to the arguments

Question regarding Python Version Compatibility

The dispatcher changes apply to all Python versions
The unpacking_vectorcall_collector is only for Python 3.12+ (vectorcall was added in 3.8, but the PR uses 3.12+ for the C++ → Python direction)

Question: Why is unpacking_vectorcall_collector limited to Python 3.12+? Vectorcall has been available since Python 3.8. Perhaps there's a specific API used that requires 3.12+?

Code Quality

Strengths

✅ Well-structured changes with clear separation of concerns
✅ Good use of small vector optimization to avoid heap allocations
✅ Proper handling of edge cases (empty kwargs, etc.)
✅ The keyword_index() function is well-optimized for the common case

Minor Suggestion

Comment clarity: The comment "// -1 to account for PY_VECTORCALL_ARGUMENTS_OFFSET" could be expanded to explain that the first slot is reserved for the callee to potentially use.

Compatibility

ABI: This change affects the internal calling convention but should be transparent to users
API: No public API changes
Behavior: Minor reference counting changes (documented in test change)

Testing

The existing test suite appears to cover the changes well:

✅ All Python 3.14 tests pass (with numpy 2.4.0)
✅ CI is passing across multiple platforms and Python versions
✅ The kwargs/args tests explicitly verify the new behavior

Verdict

Recommendation: Approve with minor suggestions

This is a well-implemented performance optimization that:

Uses the modern vectorcall protocol for better performance
Properly handles all edge cases
Maintains backward compatibility
Has good test coverage

The changes are consistent with Python's own direction (vectorcall is the preferred calling convention) and should provide performance benefits, especially for frequently-called functions with few arguments.

rwgk

LGTM but mostly based on the Cursor review. I haven't looked very closely manually. I recommend waiting for a review from @oremanj before merging.

b-pass added 10 commits December 23, 2025 23:17

Make argument_vector re-usable for other types.

ef662f6

Attempt to collect args into array for vectorcall

418a034

Revert "Attempt to collect args into array for vectorcall"

fc22c97

This reverts commit 418a034.

Implement vectorcall args collector

9894a5a

pre-commit fixes

73ee046

Checkpoint in moving to METH_FASTCALL

4615f8a

pre-commit fixes

1dc7b3f

Use the names tuple directly, cleaner code and less reference counting

893df5d

Fix unit test, the code now holds more references

655db8a

It cannot re-use the incoming tuple as before, because it is no longer a tuple at all. So a new tuple must be created, which then holds references for each member.

Make clangtidy happy

c43b156

b-pass marked this pull request as draft December 26, 2025 13:25

b-pass and others added 4 commits December 26, 2025 08:51

Oops, _v is C++14

d881a7b

style: pre-commit fixes

76831ec

Minor code cleanup

0285818

Fix signed conversions

cc2b5a9

b-pass requested a review from oremanj December 26, 2025 16:52

b-pass marked this pull request as ready for review December 26, 2025 16:52

b-pass and others added 4 commits December 26, 2025 17:59

Fix args expansion

25485b3

This would be easier with `if constexpr`

style: pre-commit fixes

6c6bd71

Code cleanup

afd6299

Merge branch 'master' into b-pass→vectorcall

b8ba02d

rwgk requested a review from henryiii as a code owner December 27, 2025 05:00

tests: Add verbose flag to CIBW pytest command

7148a4a

Add `-v` to the pytest command in tests/pyproject.toml to help diagnose hanging tests in CIBW jobs (particularly iOS). This will show each test name as it runs, making it easier to identify which specific test is hanging.

rwgk approved these changes Dec 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Change function calls to use vectorcall #5948

Change function calls to use vectorcall #5948

b-pass commented Dec 26, 2025

Uh oh!

b-pass commented Dec 26, 2025

Uh oh!

rwgk commented Dec 27, 2025 •

edited

Loading

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Change function calls to use vectorcall #5948

Are you sure you want to change the base?

Change function calls to use vectorcall #5948

Conversation

b-pass commented Dec 26, 2025

Description

Suggested changelog entry:

Uh oh!

b-pass commented Dec 26, 2025

Uh oh!

rwgk commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk commented Dec 27, 2025

Uh oh!

rwgk commented Dec 27, 2025

NumPy 2.3.5 vs 2.4.0 Observations

The Issue

What We Know For Sure

What's Undetermined

Resolution

Uh oh!

rwgk commented Dec 27, 2025

Cursor Code Review (with some manual edits)

Question for the Author

Overview

Summary of Changes

1. include/pybind11/pybind11.h (183 lines changed)

2. include/pybind11/cast.h (194 lines added)

3. include/pybind11/detail/argument_vector.h (123 lines changed)

4. tests/test_kwargs_and_defaults.py (3 lines changed)

Question regarding Python Version Compatibility

Code Quality

Strengths

Minor Suggestion

Compatibility

Testing

Verdict

Uh oh!

rwgk left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rwgk commented Dec 27, 2025 •

edited

Loading

1. `include/pybind11/pybind11.h` (183 lines changed)

2. `include/pybind11/cast.h` (194 lines added)

3. `include/pybind11/detail/argument_vector.h` (123 lines changed)

4. `tests/test_kwargs_and_defaults.py` (3 lines changed)