File: pocs/linux/kernelctf/CVE-2025-40216_mitigation/docs/exploit.md

# Vulnerability Overview

This vulnerability exists within the **io_uring** subsystem, specifically in how fixed buffers are registered and subsequently imported.

The root cause is an **incorrect offset calculation** when handling user pointers that are not aligned to the folio size.

This logic error results in an incorrect `bv_len` (buffer vector length), which subsequently triggers an **Out-of-Bounds (OOB) access** and the use of **uninitialized memory** during I/O operations.

# Root Cause Analysis

## 1. Incorrect Offset Calculation (`io_sqe_buffer_register`)

In `io_sqe_buffer_register`, the kernel attempts to calculate the offset of the first page.

However, the bitmask logic assumes alignment guarantees for user pointers that io_uring never enforces.

When `iov->iov_base` is not aligned as expected relative to `imu->folio_shift`, the calculated `off` variable is incorrect.

```C
static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
struct io_mapped_ubuf **pimu,
struct page **last_hpage)
{
// ...
/*
 * VULNERABILITY:
 * The bitwise AND logic here produces an incorrect offset if the
 * user pointer (iov_base) alignment does not match the folio logic.
 */
off = (unsigned long) iov->iov_base & ((1UL << imu->folio_shift) - 1);
*pimu = imu;
ret = 0;

for (i = 0; i < nr_pages; i++) {
size_t vec_len;

/*
 * Because 'off' is potentially wrong, 'vec_len' (the length of
 * this segment) becomes smaller than it should be.
 */
vec_len = min_t(size_t, size, (1UL << imu->folio_shift) - off);
bvec_set_page(&imu->bvec[i], pages[i], vec_len, off);
off = 0;
// ...
```

As a result, the `bv_len` values stored in the `io_mapped_ubuf`'s bvec array sum to less than the length required to cover the registered data.

## 2. OOB Access (`io_import_fixed`)

Later, when `io_import_fixed` is called to perform I/O using the registered buffer, it iterates over the buffer vectors.

The logic attempts to skip segments (`seg_skip`) based on the requested offset.

Because the stored `bv_len` is artificially small (due to the bug above), the function believes it needs to skip more segments than actually exist to reach the requested offset.

```C
int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len)
{
// ...
if (offset < bvec->bv_len) {
iter->bvec = bvec;
iter->count -= offset;
iter->iov_offset = offset;
} else {
unsigned long seg_skip;

/* skip first vec */
offset -= bvec->bv_len;

/*
 * VULNERABILITY:
 * Because 'bvec->bv_len' was too small, the remaining 'offset' is too large.
 * This causes 'seg_skip' to calculate a value larger than the bvec array length.
 */
seg_skip = 1 + (offset >> imu->folio_shift);

/*
 * 'iter->bvec' now points Out-Of-Bounds (OOB) past the end of the array.
 * This results in the usage of uninitialized memory for io_read/io_write
 * operations.
 */
iter->bvec = bvec + seg_skip;
iter->nr_segs -= seg_skip;
iter->count -= bvec->bv_len + offset;
iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
}
// ...
return 0;
}
```

# Exploit Strategy

The uninitialized memory usage in `io_import_fixed` can be weaponized to achieve a **container escape**.

### Triggering the Vulnerability

By triggering the OOB condition, the `iter->bvec` pointer is made to point to uninitialized memory. We manipulate the kernel heap to ensure that this uninitialized memory region contains a pointer to a page we control or have recently freed.

### 1. Prepare Spray FDs
We set up multiple **io_uring instances** (`RING_FD_COUNT` of them) that will be used to spray `io_mapped_ubuf` structures and `bvec` arrays onto the kernel heap.

### 2. Spray Ubuf arrays onto Heap
We construct our target memory area. We explicitly use a 1-page shared target buffer mapped via `IORING_OFF_PBUF_RING`, preceded by a 3-page anonymous allocation, creating a contiguous 4-page sequence. We specifically avoid using an anonymous mapping (`MAP_ANONYMOUS`) for the target buffer itself because anonymous pages are notoriously difficult to reliably reclaim as page tables (`pgtable`) after being freed.

When we register this layout using `IORING_REGISTER_BUFFERS` across all our setup instances, the kernel creates `bvec` arrays of size 4 where the `bv_page` of the 4th element perfectly points to the physical page of our target shared buffer `io_addr`. This mass-populates the kernel heap with controllable `bvec` structures.

### 3. Setup Main Exploitation Ring & Free Sprayed Arrays
We set up the main exploitation ring. Crucially, we then **unregister** all the previously sprayed buffers, freeing them back to the heap. The kernel does not zero out this memory on free, meaning our controllable `bvec` array pointers remain in the heap as **uninitialized memory**.

We then map 8KB regions using `IORING_OFF_PBUF_RING` to exploit the OOB vulnerability. Because the kernel backs these with order-1 allocations, the resulting folio size is 8192 bytes (`folio_shift = 13`). We deliberately allocate 8KB chunks to massage the heap layout, ensuring our target pointer lands **unaligned** relative to this 8KB boundary (e.g. at an odd 4KB page boundary).

Because the pointer is not aligned to the 8KB folio boundary, the offset calculation in `io_sqe_buffer_register` produces the wrong `off`, setting up our Out-of-Bounds read/write primitives against the sprayed uninitialized arrays from step 2.

### 4. Spray Target Pages
We allocate **multiple pages** across a wide target area using standard `mmap`. This reclaims the previously freed physical page specifically as a **page table (pgtable)**. Because we only read from these pages (never write to them), each read fault populates a page table entry (PTE) pointing at the global shared zero page. This predictably fills the page table with known PTE values we can hunt for.

### 5. Trigger Page Write/Read via IO_URING
We trigger an I/O read and write over the vulnerability to perform uninitialized memory reads.

This allows us to read back a valid **Page Table Entry (PTE)** from our mapped pages. The read is successful if the low byte of the leaked value matches the magic byte (`EXPECTED_MAGIC` = `0x25`).

### 6. Calculate and Write Fake PTE
Using the leaked information from our read, we modify the PTE value. We keep the base page address but alter the alignment offset and flags.

Specifically, we apply a bitmask (`PTE_FLAGS_MASK` = `0x367`) that sets the **Present, Read/Write, User-accessible, Accessed, and Dirty** bits, turning the entry into a fully permissive mapping.

We adjust the leaked pointer to point to the kernel's `core_pattern` address and write it back into the page tables using our OOB write primitive.

### 7. Spray core_pattern payload
With the mutated page table, we write our core_pattern exploitation payload `|/proc/%P/fd/666 %P` into the newly mapped region.

Finally, when the exploit intentionally triggers a crash with `rip_trigger_crash`, the kernel will execute our arbitrary program as root, giving us a **container escape**.

---

Requirements:
Capabilities: None
Kernel configuration: CONFIG_IO_URING
User namespaces required: no
Introduced by: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a8edbb424b1391b077407c75d8f5d2ede77aa70d
Fixed by: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3a3c6d61577dbb23c09df3e21f6f9eda1ecd634b
Affected kernel versions: v6.11 - v6.15
Affected component: io_uring
Cause: Out-of-bounds read
Syscall to disable: io_uring_setup, io_uring_register, io_uring_enter
Description:
Out-of-bounds read in io_uring

---

all: exploit

prerequisites:
	@echo "No prerequisites needed"

exploit: exploit.cpp target_db.kxdb
	g++ exploit.cpp -lkernelXDK -g -static -o exploit -I/usr/local/include/ -L/usr/lib/
