Add/bof dll stomp - BOF concurrency compatibility #2
Open
Loki-rt wants to merge 10 commits into
added 10 commits
May 4, 2026 19:06
When a BOF exceeds the sacrificial DLL's .text section size and falls back to VirtualAlloc, two bugs caused crashes with large assemblies like SharpHound:

1. ADDR32NB relocation was recalculated assuming contiguous sections from mapSections[0], which only holds in the stomp path. Restored the original formula (site-relative pointer arithmetic), which is correct regardless of memory layout.
2. Fallback allocated each section individually via separate VirtualAlloc calls, scattering them across the address space. RtlAddFunctionTable then received invalid RVAs for .pdata entries referencing distant sections, crashing the unwinder on any exception. Changed to a single contiguous allocation so all section offsets stay within 32-bit range of mapSections[0].

BOFs that fit within the sacrificial DLL .text are unaffected; the stomp path is unchanged. Fallback now behaves identically to the original beacon for oversized BOFs.

Tested: SharpHound (1.3MB) runs async via execute-assembly without killing the beacon. xpsservices.dll, wmp.dll, mfc42.dll used as sacrificial DLL.
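The two fixes in this commit message can be illustrated with a small portable sketch. It is not the loader's code: `plan_contiguous_block`, `BlockLayout`, and `rel32_displacement` are hypothetical names, the 16-byte alignment is taken from the PR description, and the displacement follows the `target - (site + 4)` formula quoted there. The real loader would pass `total` to one VirtualAlloc call; here only the arithmetic is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Round value up to the next multiple of a power-of-two alignment.
inline std::size_t align_up(std::size_t value, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical result type: where each section starts inside one block.
struct BlockLayout {
    std::vector<std::size_t> offsets;  // offset of each section from the block base
    std::size_t total = 0;             // size to request in a single allocation
};

// Lay all sections out back to back, each starting on a 16-byte boundary.
inline BlockLayout plan_contiguous_block(const std::vector<std::size_t>& section_sizes) {
    BlockLayout layout;
    for (std::size_t size : section_sizes) {
        layout.offsets.push_back(layout.total);
        layout.total = align_up(layout.total + size, 16);
    }
    return layout;
}

// Site-relative displacement: target - (site + 4).
// Returns nullopt when the result does not fit in a signed 32-bit field.
inline std::optional<std::int32_t> rel32_displacement(std::uint64_t target,
                                                      std::uint64_t site) {
    const std::int64_t delta =
        static_cast<std::int64_t>(target) - static_cast<std::int64_t>(site + 4);
    if (delta < std::numeric_limits<std::int32_t>::min() ||
        delta > std::numeric_limits<std::int32_t>::max()) {
        return std::nullopt;
    }
    return static_cast<std::int32_t>(delta);
}
```

Because every section offset is measured from one base, any two addresses inside the block are at most `total` bytes apart, so the range check can only fail for targets outside the block.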
With CLR: we close the handle and let the thread continue running. The thread will eventually terminate on its own once the BOF respects the stop event. Terminating an asynchronous BOF with NtTerminateThread while the CLR is loaded inside the Beacon process is highly dangerous, as it could kill the beacon itself. Therefore, if the CLR is loaded, we never use NtTerminateThread. The BOF must respect the stop event, and if it does not, the thread becomes orphaned, which is still safer than having the beacon crash.
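The stop policy described above can be written as a small decision function. This is a sketch with hypothetical names (`StopAction`, `choose_stop_action`, `run_steps`); it assumes forced termination remains the fallback when no CLR is loaded, which the message implies but does not state outright.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical outcome of a stop request for an asynchronous BOF thread.
enum class StopAction {
    AlreadyExited,   // thread honored the stop event; nothing left to do
    OrphanThread,    // close the handle and leave the thread running
    ForceTerminate   // assumed fallback when no CLR is loaded
};

// Never force-terminate while the CLR is loaded in the Beacon process.
inline StopAction choose_stop_action(bool thread_exited, bool clr_loaded) {
    if (thread_exited) return StopAction::AlreadyExited;
    return clr_loaded ? StopAction::OrphanThread : StopAction::ForceTerminate;
}

// A cooperative worker checks the stop flag between units of work.
// Returns how many steps it completed before stopping.
inline int run_steps(const std::atomic<bool>& stop_requested, int max_steps) {
    assert(max_steps >= 0);
    int done = 0;
    while (done < max_steps && !stop_requested.load(std::memory_order_acquire)) {
        ++done;  // one unit of BOF work would go here
    }
    return done;
}
```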
Two issues caused async BOF output to be held until the next operator command on SMB beacons:

1. ProcessAsyncBofs() was called after Exchange(), so any output produced by a BOF wakeup was queued into packerOut too late to be sent in the current cycle, requiring an extra loop iteration. Fixed by moving ProcessAsyncBofs() before command processing, so output is available before Exchange() in the next cycle.
2. After the wakeupEvent was consumed and reset, the SMB connector re-entered WaitForMultipleObjects(INFINITE) with no pending signal, blocking indefinitely until the parent sent a command. Fixed by passing pollIntervalMs=50 to Sleep() whenever async BOFs are active. This forces the SMB connector to poll at 50ms intervals during BOF execution only, matching the responsiveness of HTTP/TCP beacons. Normal INFINITE wait resumes once all async BOFs complete.
…utToContext. This way, SMB wakes up naturally whenever there is data, without relying on the 50ms polling loop. HTTP/TCP/DNS still sleep for the full sleep_delay, but if the BOF produces output, WaitMaskWithEvent also receives the wakeupEvent and wakes up.

* We added SignalWakeup() inside BofOutputToContext immediately after writing to the buffer. This uniformly covers all connectors.
* Commit 143215a caused an asynchronous BOF running in an HTTP Beacon to ignore sleep_delay. Although it solved the issue where asynchronous BOF output remained queued in SMB Beacons, it introduced the problem of HTTP Beacons not respecting sleep_delay during asynchronous BOF execution.
* This commit fixes both issues: all Beacons now respect sleep_delay, and asynchronous BOF output no longer remains queued in SMB Beacons.
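The wake-on-output behavior described in this commit can be modeled with standard C++ primitives. The sketch below is an analogy, not the Beacon code: `WakeupEvent` and `OutputBuffer` are hypothetical stand-ins for the wakeupEvent and for the buffer written by BofOutputToContext, and a condition variable takes the place of the Windows event object.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

// Auto-reset event: a timed wait returns early when signaled.
class WakeupEvent {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Sleeps for up to `timeout` (the sleep_delay analogue).
    // Returns true if woken by signal(), false if the full delay elapsed.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        const bool woken = cv_.wait_for(lock, timeout, [this] { return signaled_; });
        signaled_ = false;  // consume the signal so the next wait blocks again
        return woken;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Hypothetical output path: write to the buffer, then signal immediately.
struct OutputBuffer {
    std::string data;
    WakeupEvent* wakeup;

    void write(const std::string& chunk) {
        data += chunk;
        wakeup->signal();
    }
};
```

With no output the waiter sleeps for the whole delay, which matches the stated goal that all Beacons respect sleep_delay; a write cuts the wait short without any polling interval.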
Summary
The DLL Stomping implementation for BOFs was converted from a design based on a global singleton with CRITICAL_SECTION to a design based on dynamic per-instance contexts (BOF_STOMP_CTX*), additionally introducing a pool of reusable contexts for asynchronous BOFs.

1. Core architectural change: global → dynamic per-instance context.

Upstream (add/bof-dll-stomp)

bof_stomp.h declares a single global variable, g_BofStomp, with an embedded CRITICAL_SECTION.

Now

The global variable and the embedded CRITICAL_SECTION are removed. The struct is annotated and the BOOL pooled field is added.

BofStompCreate returns a BOF_STOMP_CTX* allocated with MemAllocLocal. BofStompDestroy restores .text/.pdata, frees the saved sections, unmaps the DLL, and releases the struct.

Reason: the singleton prevented concurrent execution of asynchronous BOFs across multiple threads. With independent contexts, each BOF carries its own stomping state without contention.
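A minimal sketch of the per-instance idea, using a plain byte buffer in place of a mapped DLL. `StompCtx`, `stomp_create`, `stomp_write`, and `stomp_destroy` are hypothetical stand-ins for BOF_STOMP_CTX, BofStompCreate, and BofStompDestroy; the real functions also handle .pdata and unmapping, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// All stomping state lives in the instance, so no global lock is needed.
struct StompCtx {
    std::uint8_t* text_base = nullptr;     // start of the region being overwritten
    std::vector<std::uint8_t> saved_text;  // original bytes, restored on destroy
    bool pooled = false;                   // mirrors the BOOL pooled field
};

// Save the original bytes of the target region and return a fresh context.
inline std::unique_ptr<StompCtx> stomp_create(std::uint8_t* text_base,
                                              std::size_t text_size, bool pooled) {
    assert(text_base != nullptr);
    auto ctx = std::make_unique<StompCtx>();
    ctx->text_base = text_base;
    ctx->saved_text.assign(text_base, text_base + text_size);
    ctx->pooled = pooled;
    return ctx;
}

// Overwrite the region; returns false when the payload does not fit,
// in which case the caller would fall back to a separate allocation.
inline bool stomp_write(StompCtx& ctx, const std::uint8_t* payload, std::size_t size) {
    if (size > ctx.saved_text.size()) return false;
    if (size != 0) std::memcpy(ctx.text_base, payload, size);
    return true;
}

// Restore the saved bytes; the context is freed when the pointer goes out of scope.
inline void stomp_destroy(std::unique_ptr<StompCtx> ctx) {
    if (!ctx->saved_text.empty()) {
        std::memcpy(ctx->text_base, ctx->saved_text.data(), ctx->saved_text.size());
    }
}
```

Two contexts created over two different regions never touch shared state, which is the property the singleton design lacked.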
2. Context pool for asynchronous BOFs (Boffer)

Upstream

Boffer::Init() called InitBofStomp(...) once, creating the singleton. Asynchronous BOFs competed for the same g_BofStomp.lock.

Now

Boffer.h gains new declarations, some of them inside Boffer.

* Boffer::Init() now calls the new getBofStompDllsAsync() function to retrieve the list of DLLs and creates one BOF_STOMP_CTX per DLL, marking each as pooled = TRUE.
* AsyncBofContext receives the BOF_STOMP_CTX* stompCtx field. When executing an async BOF, RunBof acquires a slot from the pool using AcquireStompSlot(); if no free slots are available, it falls back to VirtualAlloc. Once execution finishes, ReleaseStompSlot returns the slot to the pool without destroying it.
* Boffer::~Boffer() destroys all pool slots.

3. Signatures propagated throughout the entire execution stack.
The following functions change their signatures to carry stompCtx instead of accessing the global:

* AllocateSections(coffFile, pHeader, mapSections, outMapFunctions) + BOF_STOMP_CTX* stompCtx
* CleanupSections(mapSections, maxSections, mapFunctions) + BOF_STOMP_CTX* stompCtx
* ExecuteProc(entryFuncName, args, argsSize, pSymbolTable, pHeader, mapSections) + BOF_STOMP_CTX* stompCtx

4. New configuration function: getBofStompDllsAsync

config.h and config.tpl: an implementation that parses the BOF_STOMP_DLL_NAME_ASYNC macro at runtime (format: "dll1.dll|dll2.dll|dll3.dll") using a static buffer and idempotent lazy initialization. Falls back to {"xpsservices.dll", "Hydrogen.dll", "actxprxy.dll"} if the macro is not defined.

5. Secondary refactorings
AllocateSections: single contiguous VirtualAlloc

In the upstream implementation, each COFF section was allocated with a separate VirtualAlloc call (plus an additional call for mapFunctions). Now, the total size of all sections is calculated using 16-byte alignment (ALIGN_UP) and a single VirtualAlloc call is performed. mapFunctions is placed at the end of the block. CleanupSections only frees mapSections[0] (the base of the block) and no longer iterates through the remaining sections.

FindTextSection: extracted helper

The code used to locate the .text section inside a loaded PE was duplicated across the LoadLibraryEx and NtCreateSection paths. The logic has now been extracted into FindTextSection(PVOID base, PVOID* outTextBase, SIZE_T* outTextSize), and both paths invoke it.

Fix for IMAGE_REL_AMD64_ADDR32NB relocation

The upstream implementation calculated the relative offset using index arithmetic that could become incorrect for non-contiguous sections. It now implements the proper calculation (pc-relative offset = target_addr - (reloc_site + 4)) with 32-bit range validation.

Asynchronous BOF output

The upstream implementation set ctx->state = ASYNC_BOF_STATE_FINISHED and packaged the completion packet inside the BOF thread itself (under lock). A new state, ASYNC_BOF_STATE_DONE (0x4), has now been introduced: the BOF thread transitions to DONE, and the monitor loop is responsible for packaging and sending the completion packet during its iteration, simplifying synchronization. BofOutputToTask was also refactored to use tls_CurrentBofContext directly instead of relying on the indirect IsAsyncBofThread() + AsyncBofOutput() flow.

UI
Quick demonstration: concurrent asynchronous + synchronous BOF execution.

* keylog start: asynchronous BOF
* kerbeus monitor /interval:1: asynchronous BOF
* clipboard: synchronous BOF

In Memory

5928

* keylog start: thread 5676
* kerbeus monitor: thread 2504

thread 5676

thread 2504

To verify that DLL STOMP also works in synchronous mode, we execute SharpHound.exe using execute-assembly so we have enough time to observe what happens in the Beacon's main thread.

Now we execute SharpHound.exe again with execute-assembly, but this time in asynchronous mode.

Once the asynchronous BOFs finish execution, the DLLs are returned to the pool and become available again for future asynchronous BOF executions. Keep in mind that up to 32 DLLs can be added to the pool.
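The pool lifecycle summarized above (parse the pipe-delimited DLL list, hand out slots, return them on completion, cap at 32) can be sketched in portable C++. `split_dll_list` and `StompPool` are hypothetical names; they stand in for getBofStompDllsAsync, AcquireStompSlot, and ReleaseStompSlot, and the per-slot atomic flag is an assumption about how slots could be claimed without a shared lock.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kMaxPoolSlots = 32;  // cap stated in the description

// Split "dll1.dll|dll2.dll|dll3.dll" into names, skipping empty entries.
inline std::vector<std::string> split_dll_list(const std::string& list) {
    std::vector<std::string> names;
    std::size_t start = 0;
    while (start <= list.size() && names.size() < kMaxPoolSlots) {
        const std::size_t bar = list.find('|', start);
        const std::size_t stop = (bar == std::string::npos) ? list.size() : bar;
        if (stop > start) names.push_back(list.substr(start, stop - start));
        if (bar == std::string::npos) break;
        start = bar + 1;
    }
    return names;
}

class StompPool {
public:
    explicit StompPool(std::vector<std::string> dlls) : dlls_(std::move(dlls)) {
        if (dlls_.size() > kMaxPoolSlots) dlls_.resize(kMaxPoolSlots);
        for (auto& flag : in_use_) flag.store(false);
    }

    // Returns a slot index, or -1 when every slot is busy
    // (the caller would then fall back to a plain allocation).
    int acquire() {
        for (std::size_t i = 0; i < dlls_.size(); ++i) {
            bool expected = false;
            if (in_use_[i].compare_exchange_strong(expected, true)) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Marks the slot free again; the slot itself is kept for reuse.
    void release(int slot) {
        if (slot >= 0 && static_cast<std::size_t>(slot) < dlls_.size()) {
            in_use_[static_cast<std::size_t>(slot)].store(false);
        }
    }

    const std::string& dll(int slot) const {
        return dlls_[static_cast<std::size_t>(slot)];
    }

private:
    std::vector<std::string> dlls_;
    std::array<std::atomic<bool>, kMaxPoolSlots> in_use_;
};
```

With three DLLs configured, a fourth concurrent asynchronous BOF gets -1 and takes the fallback path; as soon as any earlier BOF releases its slot, the same DLL is handed out again.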