Add/bof dll stomp - BOF concurrency compatibility #2

Open
Loki-rt wants to merge 10 commits into MaorSabag:add/bof-dll-stomp from Loki-rt:add/bof-dll-stomp

Loki-rt commented May 16, 2026

Summary

This PR converts the DLL stomping implementation for BOFs from a global singleton guarded by a CRITICAL_SECTION to dynamically allocated per-instance contexts (BOF_STOMP_CTX*). It also adds a pool of reusable contexts for asynchronous BOFs.

1. Core architectural change: global → dynamic per-instance context

Upstream (add/bof-dll-stomp)

bof_stomp.h declares a single global variable:

extern BOF_STOMP_CTX g_BofStomp;
BOOL InitBofStomp(const char* sacrificialDll, int method);

Now

The global variable and the embedded CRITICAL_SECTION are removed, the struct is annotated, and a BOOL pooled field is added. InitBofStomp is replaced by a create/destroy pair:

BOF_STOMP_CTX* BofStompCreate(const char* sacrificialDll, int method);
void           BofStompDestroy(BOF_STOMP_CTX* ctx);

BofStompCreate returns a BOF_STOMP_CTX* allocated with MemAllocLocal. BofStompDestroy restores .text/.pdata, frees the saved sections, unmaps the DLL, and releases the struct.

Reason: the singleton prevented concurrent execution of asynchronous BOFs across multiple threads. With independent contexts, each BOF carries its own stomping state without contention.
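
To illustrate the create/destroy lifecycle described above, here is a minimal, portable sketch. It is not the PR's code: StompCtxSketch, StompCreateSketch and StompDestroySketch are hypothetical stand-ins, and a byte vector plays the role of the sacrificial DLL's mapped .text section. The point is only that each context owns its own saved state, so two contexts never share anything that would need a lock.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for BOF_STOMP_CTX. The real struct holds
// Windows handles and section pointers; here plain vectors model the memory.
struct StompCtxSketch {
    std::string sacrificialDll;            // which DLL this context stomps
    int method;                            // load method selector (meaning assumed)
    std::vector<unsigned char> text;       // models the DLL's live .text bytes
    std::vector<unsigned char> savedText;  // pristine copy used for restore
    bool pooled;                           // mirrors the new BOOL pooled field
};

// Models BofStompCreate: every call returns an independent context.
StompCtxSketch* StompCreateSketch(const char* sacrificialDll, int method) {
    if (!sacrificialDll || !*sacrificialDll) return nullptr;
    StompCtxSketch* ctx = new StompCtxSketch{sacrificialDll, method, {}, {}, false};
    ctx->text.assign(64, 0x90);   // pretend these are the original .text bytes
    ctx->savedText = ctx->text;   // save them so they can be restored later
    return ctx;
}

// Models BofStompDestroy: restore the saved bytes, then release the struct.
void StompDestroySketch(StompCtxSketch* ctx) {
    if (!ctx) return;
    ctx->text = ctx->savedText;   // restore before the mapping would be dropped
    delete ctx;
}
```

Writing into one context's text leaves every other context untouched, which is the property the singleton could not provide.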


2. Context pool for asynchronous BOFs (Boffer)

Upstream

Boffer::Init() called InitBofStomp(...) once, creating the singleton. Asynchronous BOFs competed for the same g_BofStomp.lock.

Now

Boffer.h adds:

#define BOF_STOMP_POOL_MAX 32

struct StompSlot {
    BOF_STOMP_CTX* ctx;
    BOOL           inUse;
};

and in Boffer:

StompSlot        stompPool[BOF_STOMP_POOL_MAX];
int              stompPoolSize;
CRITICAL_SECTION stompPoolLock;

BOF_STOMP_CTX* AcquireStompSlot();
void           ReleaseStompSlot(BOF_STOMP_CTX* ctx);

Boffer::Init() now calls the new getBofStompDllsAsync() function to retrieve the list of DLLs and creates one BOF_STOMP_CTX per DLL, marking them as pooled = TRUE.

AsyncBofContext receives the BOF_STOMP_CTX* stompCtx field. When executing an async BOF, RunBof acquires a slot from the pool using AcquireStompSlot(); if no free slots are available, it falls back to VirtualAlloc. Once execution finishes, ReleaseStompSlot returns the slot back to the pool without destroying it.

Boffer::~Boffer() properly destroys all pool slots.
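
The acquire/release flow can be sketched roughly as follows. This is a portable approximation, not the Boffer code: std::mutex stands in for the CRITICAL_SECTION, and StompCtxStub and StompPoolSketch are hypothetical names. A null return from Acquire corresponds to the case where RunBof falls back to VirtualAlloc.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

struct StompCtxStub { int id; };       // hypothetical stand-in for BOF_STOMP_CTX

constexpr std::size_t kPoolMax = 32;   // mirrors BOF_STOMP_POOL_MAX

struct SlotSketch {
    StompCtxStub* ctx = nullptr;
    bool inUse = false;
};

class StompPoolSketch {
public:
    // Registers a context created at init time; fails when the pool is full.
    bool Add(StompCtxStub* ctx) {
        std::lock_guard<std::mutex> guard(lock_);
        if (!ctx || size_ >= kPoolMax) return false;
        slots_[size_++].ctx = ctx;
        return true;
    }

    // Returns the first free context, or nullptr so the caller can fall back.
    StompCtxStub* Acquire() {
        std::lock_guard<std::mutex> guard(lock_);
        for (std::size_t i = 0; i < size_; ++i) {
            if (!slots_[i].inUse) {
                slots_[i].inUse = true;
                return slots_[i].ctx;
            }
        }
        return nullptr;
    }

    // Marks the slot free again; the context itself is kept for reuse.
    void Release(StompCtxStub* ctx) {
        std::lock_guard<std::mutex> guard(lock_);
        for (std::size_t i = 0; i < size_; ++i) {
            if (slots_[i].ctx == ctx) {
                slots_[i].inUse = false;
                return;
            }
        }
    }

private:
    std::array<SlotSketch, kPoolMax> slots_{};
    std::size_t size_ = 0;
    std::mutex lock_;
};
```

The lock is held only while scanning the slot table, not while a BOF runs, so concurrent BOFs contend for a few instructions rather than for the whole execution.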


3. Signatures propagated throughout the entire execution stack

The following functions change their signatures to carry stompCtx instead of accessing the global:

Function | Upstream parameters | Added now
AllocateSections | (coffFile, pHeader, mapSections, outMapFunctions) | + BOF_STOMP_CTX* stompCtx
CleanupSections | (mapSections, maxSections, mapFunctions) | + BOF_STOMP_CTX* stompCtx
ExecuteProc | (entryFuncName, args, argsSize, pSymbolTable, pHeader, mapSections) | + BOF_STOMP_CTX* stompCtx

4. New configuration function: getBofStompDllsAsync

config.h :

int getBofStompDllsAsync(const char*** outArray);

config.tpl :

Implementation that parses the BOF_STOMP_DLL_NAME_ASYNC macro at runtime (format: "dll1.dll|dll2.dll|dll3.dll") using a static buffer and idempotent lazy initialization. Falls back to {"xpsservices.dll", "Hydrogen.dll", "actxprxy.dll"} if the macro is not defined.
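
A rough sketch of the parsing idea, assuming a '|' separated list copied into a static buffer and split in place on first use. SplitDllListSketch and GetDllsSketch are hypothetical names, the buffer sizes are arbitrary, and the list is passed as a parameter here instead of coming from the BOF_STOMP_DLL_NAME_ASYNC macro. Unlike the real function, this sketch makes no attempt at thread safety.

```cpp
#include <cstddef>
#include <cstring>

constexpr int kMaxDllsSketch = 32;  // assumed cap, matching the pool size

// Splits "a.dll|b.dll" in place: each '|' becomes '\0' and out[i] points
// into buf. Empty entries (for example from "a.dll||b.dll") are skipped.
int SplitDllListSketch(char* buf, const char** out, int maxOut) {
    int count = 0;
    char* p = buf;
    while (p && *p && count < maxOut) {
        char* sep = std::strchr(p, '|');
        if (sep) *sep = '\0';
        if (*p) out[count++] = p;
        p = sep ? sep + 1 : nullptr;
    }
    return count;
}

// Lazy, idempotent accessor: the list is parsed once, later calls return
// the cached array and ignore their argument.
int GetDllsSketch(const char* list, const char*** outArray) {
    static char buf[512];
    static const char* names[kMaxDllsSketch];
    static int count = -1;
    if (count < 0) {
        std::strncpy(buf, list, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        count = SplitDllListSketch(buf, names, kMaxDllsSketch);
    }
    *outArray = names;
    return count;
}
```

Because the pointers in the returned array refer to the static buffer, callers can keep them for the lifetime of the process without copying.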


5. Secondary refactorings

AllocateSections — single contiguous VirtualAlloc

In the upstream implementation, each COFF section was allocated with a separate VirtualAlloc call (+ an additional call for mapFunctions). Now, the total size of all sections is calculated using 16-byte alignment (ALIGN_UP) and a single VirtualAlloc call is performed. mapFunctions is placed at the end of the block. CleanupSections only frees mapSections[0] (the base of the block) and no longer iterates through the remaining sections.
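
The layout computation can be sketched as below. This models only the offset arithmetic, assuming each section is rounded up to 16 bytes and mapFunctions follows the last section. AlignUpSketch, LayoutSketch and PlanLayoutSketch are hypothetical names, and the real code would pass totalSize to a single VirtualAlloc call.

```cpp
#include <cstddef>
#include <vector>

// Rounds v up to the next multiple of a (a must be a power of two).
constexpr std::size_t AlignUpSketch(std::size_t v, std::size_t a) {
    return (v + a - 1) & ~(a - 1);
}

struct LayoutSketch {
    std::vector<std::size_t> sectionOffsets;  // offset of each section from the base
    std::size_t functionsOffset = 0;          // where mapFunctions would start
    std::size_t totalSize = 0;                // size of the single allocation
};

// Plans one contiguous block: sections first, function table at the end.
LayoutSketch PlanLayoutSketch(const std::vector<std::size_t>& sectionSizes,
                              std::size_t functionsSize) {
    LayoutSketch layout;
    std::size_t offset = 0;
    for (std::size_t size : sectionSizes) {
        layout.sectionOffsets.push_back(offset);
        offset += AlignUpSketch(size, 16);
    }
    layout.functionsOffset = offset;
    layout.totalSize = offset + AlignUpSketch(functionsSize, 16);
    return layout;
}
```

For section sizes of 10, 33 and 16 bytes plus a 100-byte function table, the sections land at offsets 0, 16 and 64, the table at 80, and the block is 192 bytes in total. Since everything hangs off one base address, freeing that base releases all of it.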

FindTextSection — extracted helper

The code used to locate the .text section inside a loaded PE was duplicated across the LoadLibraryEx and NtCreateSection paths. The logic has now been extracted into FindTextSection(PVOID base, PVOID* outTextBase, SIZE_T* outTextSize), and both paths invoke it.

Fix for IMAGE_REL_AMD64_ADDR32NB relocation

The upstream implementation calculated the relative offset using index arithmetic that could become incorrect for non-contiguous sections. It now implements the proper calculation (pc-relative offset = target_addr - (reloc_site + 4)) with 32-bit range validation.
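
As a sketch of the arithmetic stated above, the function below computes target_addr - (reloc_site + 4) on plain 64-bit addresses and rejects results that do not fit in a signed 32-bit field. ComputeRel32Sketch is a hypothetical name, and handling of any addend already stored at the relocation site is deliberately left out.

```cpp
#include <cstdint>
#include <limits>

// Computes the 32-bit displacement from the end of a 4-byte relocation field
// to the target. Returns false when the distance exceeds the signed 32-bit
// range, in which case *out is left untouched.
bool ComputeRel32Sketch(std::uint64_t targetAddr,
                        std::uint64_t relocSiteAddr,
                        std::int32_t* out) {
    // Unsigned subtraction wraps; reinterpreting as signed yields the
    // signed distance on two's complement targets.
    const std::int64_t delta =
        static_cast<std::int64_t>(targetAddr - (relocSiteAddr + 4));
    if (delta < std::numeric_limits<std::int32_t>::min() ||
        delta > std::numeric_limits<std::int32_t>::max()) {
        return false;
    }
    *out = static_cast<std::int32_t>(delta);
    return true;
}
```

The range check matters most in the VirtualAlloc fallback path, where sections are no longer guaranteed to sit next to each other.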

Asynchronous BOF output

The upstream implementation set ctx->state = ASYNC_BOF_STATE_FINISHED and packaged the completion packet inside the BOF thread itself (under lock). A new state, ASYNC_BOF_STATE_DONE (0x4), has now been introduced: the BOF thread transitions to DONE, and the monitor loop is responsible for packaging and sending the completion packet during its iteration, simplifying synchronization.
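
The handoff can be pictured with the sketch below. Only the value 0x4 for the DONE state comes from this PR. The other state values, the type and function names, and the assumption that the monitor moves the context to FINISHED after packaging are all illustrative.

```cpp
#include <atomic>
#include <string>
#include <vector>

enum AsyncStateSketch : int {
    kStateRunning  = 0x1,  // hypothetical value
    kStateFinished = 0x3,  // hypothetical value
    kStateDone     = 0x4,  // matches ASYNC_BOF_STATE_DONE (0x4)
};

struct AsyncBofSketch {
    int taskId = 0;
    std::atomic<int> state{kStateRunning};
};

// Called at the end of the BOF thread: it only flips the state and does
// no packaging, so it needs no lock.
void BofThreadExitSketch(AsyncBofSketch& ctx) {
    ctx.state.store(kStateDone, std::memory_order_release);
}

// Called once per monitor loop iteration. Packages the completion packet
// exactly once, when it observes DONE, and then marks the context FINISHED.
bool MonitorTickSketch(AsyncBofSketch& ctx, std::vector<std::string>& outbox) {
    int expected = kStateDone;
    if (!ctx.state.compare_exchange_strong(expected, kStateFinished)) {
        return false;  // still running, or already handled
    }
    outbox.push_back("task " + std::to_string(ctx.taskId) + " completed");
    return true;
}
```

With this split, only the monitor loop touches the outgoing packet buffer, which is what removes the need for the BOF thread to package anything under a lock.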

BofOutputToTask was also refactored to use tls_CurrentBofContext directly instead of relying on the indirect IsAsyncBofThread() + AsyncBofOutput() flow.
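
A minimal sketch of the thread-local routing idea, assuming the output function checks a thread-local pointer and writes to that context when it is set. The names below are hypothetical stand-ins for tls_CurrentBofContext and BofOutputToTask, and std::string replaces the real packet buffers.

```cpp
#include <string>

struct AsyncOutCtxSketch {
    std::string output;  // per-BOF output buffer
};

// Set by an async BOF thread before it runs the BOF; null on the main thread.
thread_local AsyncOutCtxSketch* tls_CurrentCtxSketch = nullptr;

// Stand-in for the synchronous task's output buffer.
std::string g_syncTaskOutputSketch;

// Routes output with a single pointer check instead of a separate
// "am I an async thread?" query followed by a second call.
void BofOutputSketch(const std::string& data) {
    if (tls_CurrentCtxSketch) {
        tls_CurrentCtxSketch->output += data;
    } else {
        g_syncTaskOutputSketch += data;
    }
}
```

Each async thread sees its own copy of the pointer, so output from concurrent BOFs lands in separate buffers without any lookup table.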


UI

image

Quick demonstration: concurrent asynchronous + synchronous BOF execution.

image
  • keylog start: asynchronous BOF
  • kerbeus monitor /interval:1: asynchronous BOF
  • clipboard: synchronous BOF

In Memory

  • Main thread: 5928
  • keylog start: thread 5676
  • kerbeus monitor: thread 2504
image

thread 5676

DLL STOMPING: xpsservices.dll

image

thread 2504

DLL STOMPING: actxprxy.dll

image

To verify that DLL STOMP also works in synchronous mode, we execute SharpHound.exe using execute-assembly so we have enough time to observe what happens in the Beacon’s main thread.

image image

DLL STOMPING: wmp.dll


Now we execute SharpHound.exe again with execute-assembly, but this time in asynchronous mode.

image image image

DLL STOMPING: mfc42.dll


When asynchronous BOFs finish execution, their DLLs are returned to the pool and become available again for future asynchronous BOF executions. Keep in mind that the pool holds at most 32 DLLs (BOF_STOMP_POOL_MAX).

darks added 10 commits May 4, 2026 19:06
When a BOF exceeds the sacrificial DLL's .text section size and falls
back to VirtualAlloc, two bugs caused crashes with large assemblies
like SharpHound:

1. ADDR32NB relocation was recalculated assuming contiguous sections
   from mapSections[0], which only holds in the stomp path. Restored
   the original formula (site-relative pointer arithmetic) which is
   correct regardless of memory layout.

2. Fallback allocated each section individually via separate VirtualAlloc
   calls, scattering them across the address space. RtlAddFunctionTable
   then received invalid RVAs for .pdata entries referencing distant
   sections, crashing the unwinder on any exception. Changed to a single
   contiguous allocation so all section offsets stay within 32-bit range
   of mapSections[0].

BOFs that fit within the sacrificial DLL .text are unaffected — the
stomp path is unchanged. Fallback now behaves identically to the
original beacon for oversized BOFs.

Tested: SharpHound (1.3MB) runs async via execute-assembly without
killing the beacon. xpsservices.dll, wmp.dll, mfc42.dll used as sacrificial DLL.

With CLR: we close the handle and let the thread continue running. The thread will eventually terminate on its own once the BOF respects the stop event.

Terminating an asynchronous BOF with NtTerminateThread while the CLR is loaded inside the Beacon process is highly dangerous, as it could kill the beacon itself. Therefore, if the CLR is loaded, we never use NtTerminateThread. The BOF must respect the stop event, and if it does not, the thread becomes orphaned, which is still safer than having the beacon crash.

Two issues caused async BOF output to be held until the next
operator command on SMB beacons:

1. ProcessAsyncBofs() was called after Exchange(), so any output
   produced by a BOF wakeup was queued into packerOut too late to
   be sent in the current cycle, requiring an extra loop iteration.
   Fixed by moving ProcessAsyncBofs() before command processing,
   so output is available before Exchange() in the next cycle.

2. After the wakeupEvent was consumed and reset, the SMB connector
   re-entered WaitForMultipleObjects(INFINITE) with no pending
   signal, blocking indefinitely until the parent sent a command.
   Fixed by passing pollIntervalMs=50 to Sleep() whenever async
   BOFs are active. This forces the SMB connector to poll at 50ms
   intervals during BOF execution only, matching the responsiveness
   of HTTP/TCP beacons. Normal INFINITE wait resumes once all
   async BOFs complete.
…utToContext. This way, SMB wakes up naturally whenever there is data, without relying on the 50ms polling loop. HTTP/TCP/DNS still sleep for the full sleep_delay, but if the BOF produces output, WaitMaskWithEvent also receives the wakeupEvent and wakes up.

* We added SignalWakeup() inside BofOutputToContext immediately after writing to the buffer. This uniformly covers all connectors.

* Commit 143215a caused an asynchronous BOF running in an HTTP Beacon to ignore sleep_delay. Although it solved the issue where asynchronous BOF output remained queued in SMB Beacons, it introduced the problem of HTTP Beacons not respecting sleep_delay during asynchronous BOF execution.

* This commit fixes both issues: all Beacons now respect sleep_delay, and asynchronous BOF output no longer remains queued in SMB Beacons.
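
The wake-on-output behaviour can be approximated with the portable sketch below. It is not the Beacon code: a condition variable stands in for wakeupEvent, WakeupSketch and OutputToContextSketch are hypothetical names, and the real connectors wait on Windows event handles. The idea shown is that the connector always waits for the full sleep delay, and writing output signals the event so the wait ends early.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

class WakeupSketch {
public:
    // Models SignalWakeup(): remember the signal and wake any waiter.
    void Signal() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Waits up to the full delay. Returns true if woken by a signal and
    // false if the delay elapsed. The signal is consumed either way.
    bool WaitFor(std::chrono::milliseconds delay) {
        std::unique_lock<std::mutex> lock(mutex_);
        const bool woke = cv_.wait_for(lock, delay, [this] { return signaled_; });
        signaled_ = false;
        return woke;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

struct OutputBufferSketch {
    std::string data;
    WakeupSketch* wakeup;
};

// Models BofOutputToContext: write first, then signal, so the woken
// connector always finds the data already in the buffer.
void OutputToContextSketch(OutputBufferSketch& buf, const std::string& s) {
    buf.data += s;
    if (buf.wakeup) buf.wakeup->Signal();
}
```

No polling interval appears anywhere in this shape: with no output the wait runs for the whole delay, and with output it returns as soon as the signal is seen.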