Skip to content

Trampolines#1777

Open
Bike wants to merge 20 commits into
mainfrom
trampolines
Open

Trampolines#1777
Bike wants to merge 20 commits into
mainfrom
trampolines

Conversation

@Bike

@Bike Bike commented May 27, 2026

Copy link
Copy Markdown
Member

Adds a facility for giving bytecodes native "trampolines" so that they are visible by name in external backtraces (e.g. perf, gdb). Supersedes #1765. drmeister has done the work here, just filing a PR so I can review it.

Christian Schafmeisterr and others added 11 commits May 26, 2026 17:48
Arena-based trampolines (hand-coded x86_64), sampling profiler,
flame graph generation, trampoline-aware backtraces, command-line
extensions, and snapshot save/load support. Excludes bytecode
interpreter changes (computed gotos, VMDynRecord dynamic binding
stack) which remain on the interpreter-work branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also demangle C++ function names
- New trampoline_aarch64.h paralleling trampoline_x86_64.h with
  hand-coded bytecode (36B) and GF (32B) trampolines using LDR
  from literal pool. Shared CIE and per-kind FDEs with full DWARF
  CFI for unwinding. Same instructions for Linux arm64 and Apple
  Silicon; macOS W^X support still needs MAP_JIT in ExecutableArena.
- Wire aarch64 templates into trampolineWork.cc via #elif __aarch64__.
- Demangle C++ symbols in sampling_profiler symbolicate_one via
  abi::__cxa_demangle for readable flame graph frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At profile-start, snapshot /proc/self/maps (Linux) or dyld image list
(macOS) into a sorted executable-range cache. The SIGPROF handler
binary-searches each saved_rip during the frame-pointer walk: if the
address isn't in any executable mapping, the chain is broken (frame
compiled without -fno-omit-frame-pointer) and the walk stops cleanly
instead of following garbage pointers.

New JIT/arena pages registered dynamically via
sampling_profiler_add_executable_range(), called from
ExecutableArena::allocate() so trampolines mmap'd during profiling
are immediately recognized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps a body form with sampling profiler start/stop and writes a
flame graph SVG on completion. Profiler is stopped and reset via
unwind-protect on any exit path.

  (ext:with-flame-profile ("/tmp/my.svg" :rate 197 :title "test")
    (expensive-work))

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .dif file stored hash used to have to match the calculated hash
of the clasp_gc.sif file.  That's too restrictive

@Bike Bike left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly seems ok. some of the comments are false or misleading. I didn't read over the sampling profiler or flame graphs in great detail, but they're optional, so I'm not too concerned about those.

of course, tests are failing, so that's no good

Comment thread include/clasp/core/clasp_gmpxx.h
* That profiler measures user-annotated regions; this one periodically
* snapshots whatever code is running.
*
* See Phase 4 / Phase 5 for post-mortem symbolication and flame-graph

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What phases are this referring to? (Probably phases of a Claude plan, but that's not apparent from the code, so.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phase 4 and Phase 5 are mentioned here:
https://github.com/clasp-developers/clasp/blob/trampolines/src/core/sampling_profiler.cc#L557
They are symbolication and collapsed-stacks aggregation.

// Populate the calling thread's stack bounds for later frame-walking.
// Must be called from a non-signal context. sampling_profiler_start
// calls this automatically for the calling thread; other threads that
// should be fully profiled need to call ext:profile-register-thread

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they do, since mpPackage.h does for all threads?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you are asking.

I added sampling_profiler_register_current_thread so the sampling_profiler has easy access to the thread stacks that we want to walk. It was easier than figuring how to feed it our existing stack bounds. It's 40 bytes for each lisp thread. I added calls for the main thread and all threads launched in mpPackage.cc

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the comment says that threads need to call ext:profile-register-thread. But they actually don't, since you set it up already in all thread launches in mpPackage.cc and for the main thread.

Comment thread src/core/trampoline/trampoline.cc Outdated
// name participates in the same "wrapper:name" -> unique-name substitution
// (the suffix "_end" survives unchanged), so each trampoline gets its own
// matching end marker symbol.
__attribute__((used, noinline)) void WRAPPER_END_MARKER() asm("wrapper:name_end");

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're hardcoding the instructions, none of this should be necessary, and i don't see it being used

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it.

Comment thread src/core/trampoline/trampoline.cc Outdated

return_type bytecode_call(uint64_t pc, void *closure, uint64_t nargs, void **args);
// Indirect through a global function pointer rather than calling bytecode_call
// directly. Each compiled trampoline's call topology is now identical

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"now identical" - claude seems prone to putting progress reports in comments, but they'd be more appropriate in commits

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file is removed.

Comment thread src/llvmo/trampoline_arena.cc Outdated
* the preceding CIE (= cie_size + 4).
* - FDE's PC range = code_size.
* Because every slot has the same layout, the bytes are identical across
* slots and a single memcpy is all that's required.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phrased misleadingly - each copy needs to be patched with some addresses, so in the end they will have different bytes. And while there's only one explicit memcpy call, the template is copied into a std::vector first, which probably does the like of memcpy.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment. The trampolines use 8-byte absolute address to bytecode_vm so they don't need to be individually fixed up.

bool install_template(const uint8_t* tramp_bytes, size_t tramp_size,
const uint8_t* cie_bytes, size_t cie_len,
const uint8_t* fde_bytes, size_t fde_len) {
std::lock_guard<std::mutex> g(_init_lock);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if _initialized is true first (presumably with acquire) should be faster than grabbing a lock every time we compile anything.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only called once at startup. Not every time we compile anything.

Comment thread src/lisp/kernel/lsp/flamegraph.lisp Outdated

(defun join (list sep)
(with-output-to-string (out)
(loop for cell on list

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(loop for (item . rest) on list do (write-string item out) when rest do (write-char sep out))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed - thanks.

}

// Read the interrupted instruction pointer out of the context structure.
// x86_64 only for now — portable to arm64 when we need it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... not portable?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it on arm64 yet.


thread_local ThreadStackBounds t_stack_bounds{0, 0, false};

static void populate_stack_bounds_for_this_thread() {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already do almost exactly this for scanning in the mmtk branch. so I guess that'll have to get merged eventually

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants