frgo misc optimization by Bike · Pull Request #1791 · clasp-developers/clasp

Bike · 2026-06-06T00:41:09Z

Incorporates and supersedes parts of @dg1sbg's #1771:

- the allocation profiler slots are now non-atomic, which saves some cycles
- countObjectFileNames now uses an accessory non-lisp hash table rather than walking all object files every time, which saves a big chunk of time
- cache my_thread in bytecode_vm because TLS is slightly expensive to access
- write_string into a string output stream copies in bulk instead of one at a time and avoids boxing/unboxing

The first three are pretty much exactly as in #1771, except that I expanded the use of the my_thread caching. The last one, write_string, I spun off into a generic bulk copying function for Lisp arrays, which is now used for copy-subarray and therefore by a couple of different functions, like replace. From Lisp it only avoids consing if you copy an array into another of the same element type, but even aside from that it takes care of displacement ahead of time and so on, so it should speed things up.

GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel (member _Allocations) and is only ever accessed via my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator fast path, gcFunctions, startRunStop, memoryManagement). There is no shared instance and no cross-thread read, so the std::atomic counters are pure overhead on registerAllocation(), which runs on every heap allocation. Switch them to plain int64_t with in-class zero-init (which also fixes three counters the constructors never initialized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
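A minimal sketch of the idea, not the Clasp code: `AllocationProfilerSketch`, its two fields, and `thread_profiler()` are hypothetical names chosen for illustration. It shows plain `int64_t` counters with in-class initializers living in thread-local storage, so the hot path is an ordinary add rather than an atomic read-modify-write.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a per-thread allocation profiler.
struct AllocationProfilerSketch {
  // In-class initializers: every counter starts at zero no matter
  // which constructor runs.
  std::int64_t bytesAllocated = 0;
  std::int64_t allocationCount = 0;

  // Hot path: plain increments, no atomics, because only the owning
  // thread ever touches this instance.
  void registerAllocation(std::size_t size) {
    bytesAllocated += static_cast<std::int64_t>(size);
    ++allocationCount;
  }
};

// One instance per thread.
inline AllocationProfilerSketch& thread_profiler() {
  thread_local AllocationProfilerSketch profiler;
  return profiler;
}
```

This is only safe under the condition the commit message states: no other thread reads or writes the counters.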

"global" makes it sound like it's shared between threads, but it is not. Also remove yet more unnecessary Claudeish comments.

countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on each call, and ensureUniqueMemoryBufferName calls it once per JIT-module registration -- so registering N object files is O(N^2). On a JIT/compilation-heavy workload it was the single largest self-time function (~15% in one profile). _AllObjectFiles is only ever appended to (registerObjectFile, the single add point) or bulk-cleared, never individually pruned, so keep an auxiliary mutex-guarded name->count map in sync at those points and answer countObjectFileNames from it in O(1). The map holds no GC pointers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
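The auxiliary index described above could look roughly like the following. This is a sketch under assumptions: `NameCountIndex` and its method names are hypothetical, and the real code keeps the map next to _AllObjectFiles rather than in a standalone class.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical name->count index kept in sync with an append-only list.
class NameCountIndex {
public:
  // Call at the single add point, alongside the append to the main list.
  void noteAdded(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++counts_[name];
  }

  // Call whenever the main list is bulk-cleared.
  void clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    counts_.clear();
  }

  // Average O(1) lookup instead of scanning every registered entry.
  std::size_t count(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(name);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

Because entries are never removed one at a time, the index only needs those two mutation hooks to stay consistent with the list.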

On Darwin every thread_local access compiles to a _tlv_get_addr thunk call (there is no ELF initial-exec model; tls_model is a no-op). The bytecode interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check) and in several opcodes. Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a local and pass it to maybe_step_call, removing the per-call thunk. The thread does not change during a VM frame. Correct (regression suite identical) and zero-downside (a no-op load off Darwin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
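As an illustration of the pattern (not the interpreter itself), the sketch below resolves a thread-local pointer once per frame and hands it to the helper as an argument. `ThreadStateSketch`, `current_thread`, `maybe_step_call_sketch`, and `run_frame_sketch` are all hypothetical names.

```cpp
#include <cstdint>

// Hypothetical per-thread state; the real structure is much larger.
struct ThreadStateSketch {
  bool breakstep = false;
  std::uint64_t stepChecks = 0;
};

// The only place that touches thread-local storage.
inline ThreadStateSketch* current_thread() {
  thread_local ThreadStateSketch state;
  return &state;
}

// Takes the thread pointer as a parameter instead of re-reading TLS.
inline void maybe_step_call_sketch(ThreadStateSketch* thread) {
  ++thread->stepChecks;
  if (thread->breakstep) {
    // A real interpreter would enter the stepper here.
  }
}

// One TLS access per frame; every simulated call reuses the local.
inline std::uint64_t run_frame_sketch(int calls) {
  ThreadStateSketch* thread = current_thread();
  for (int i = 0; i < calls; ++i) maybe_step_call_sketch(thread);
  return thread->stepChecks;
}
```

The caching is valid only because a frame never migrates between threads while it runs, which is the invariant the commit message relies on.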

The default AnsiStream_O::write_string writes one character at a time; each character pays a boxing (clasp_make_character) and a virtual vectorPushExtend with a fill-pointer/realloc check. Override it for string-output-streams to (1) grow the backing string once (geometric), (2) bulk-copy via the underlying simple-vector with non-virtual typed access, and (3) update the output cursor by scanning the range once. A safe fallback (the tested unsafe_setf_subseq path) handles character-source-into-base-string narrowing. Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is byte-identical (verified against a base/extended/tab/newline/fill-pointer/narrowing golden test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
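The three steps can be sketched on a plain std::string. This is an illustration only: `StringSinkSketch` is a hypothetical type, `column` stands in for the output cursor, and the 8-column tab stop is an assumption rather than something the commit message specifies.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical string sink with a tracked output column.
struct StringSinkSketch {
  std::string buffer;
  std::size_t column = 0;

  void writeString(std::string_view src) {
    // (1) Grow once, geometrically, instead of checking per character.
    std::size_t needed = buffer.size() + src.size();
    if (needed > buffer.capacity())
      buffer.reserve(std::max(needed, buffer.capacity() * 2));

    // (2) Bulk-copy the whole range in one call.
    buffer.append(src.data(), src.size());

    // (3) Update the cursor with a single scan of what was written.
    for (char c : src) {
      if (c == '\n') column = 0;
      else if (c == '\t') column = (column / 8 + 1) * 8;  // assumed tab stops
      else ++column;
    }
  }
};
```

Writing "ab\ncd" into an empty sink leaves the column at 2, since the newline resets it before the last two characters.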

Also use more <algorithm> stuff for other array operations. copy_subarray would be a good candidate too, but it's a little more awkward due to the possibility of overlap. The STL doesn't have a memmove equivalent for iterators/ranges, which is nuts.
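For context on the overlap problem, here is one common way to get memmove-like behavior out of <algorithm>: pick std::copy or std::copy_backward depending on which way the ranges overlap. This is a general sketch, not what the PR does; `move_within` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copies `count` elements inside one vector from index `from` to index
// `to`, choosing the copy direction so overlapping ranges are safe.
template <typename T>
void move_within(std::vector<T>& v, std::size_t from, std::size_t to,
                 std::size_t count) {
  if (from == to || count == 0) return;  // nothing to do
  T* base = v.data();
  T* first = base + from;
  T* last = first + count;
  if (to < from) {
    // Destination starts before the source: copy front to back.
    std::copy(first, last, base + to);
  } else {
    // Destination starts after the source: copy back to front.
    std::copy_backward(first, last, base + to + count);
  }
}
```

The caller is responsible for keeping both ranges inside the vector; the helper does no bounds checking.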

copy_n should be about as fast as possible when provided enough type information, and dispatches just enough to handle the common case of types being the same. Hopefully.

The template I had in before wasn't ever actually being used, and trying to make it used resulted in me learning the very cool fact that in C++, two separately written occurrences of the constraint !a are not recognized as the same constraint. So if constexpr it is.
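A rough sketch of dispatching with if constexpr instead of constrained overloads, assuming raw typed pointers as the interface. `copy_n_sketch` is a hypothetical name and the real copy_n in Clasp works on Lisp array objects, not bare pointers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical typed copy: the branch is chosen at compile time.
template <typename Src, typename Dst>
void copy_n_sketch(const Src* src, std::size_t n, Dst* dst) {
  if constexpr (std::is_same_v<Src, Dst> &&
                std::is_trivially_copyable_v<Src>) {
    // Same element type: one block copy (ranges must not overlap).
    std::memcpy(dst, src, n * sizeof(Src));
  } else {
    // Different element types: convert element by element, unboxed.
    std::transform(src, src + n, dst,
                   [](const Src& x) { return static_cast<Dst>(x); });
  }
}
```

A single function template with if constexpr avoids having to order two overloads by their constraints, which is where negated constraints get in the way.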

why on earth did we accept nil as a character designator? I hope removing this doesn't break anything in cando, but that's gross. The base-char range part fixes a problem where e.g. (let ((str1 (make-string 7 :initial-element #\a :element-type 'base-char)) (str2 (make-string 3 :initial-element #\做))) (replace str1 str2 :start1 2) str1) would write just the low byte of the character into the string, in this case Z. Now it signals an error as it should.
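The low-byte behavior is easy to reproduce outside Lisp. The sketch below assumes a base-char is an 8-bit code, which is an assumption made for illustration; `narrow_unchecked` and `narrow_checked` are hypothetical helpers. The character 做 is U+505A, and its low byte 0x5A is ASCII Z.

```cpp
#include <stdexcept>

// Unchecked narrowing keeps only the low 8 bits of the code point.
inline unsigned char narrow_unchecked(char32_t c) {
  return static_cast<unsigned char>(c);
}

// Checked narrowing refuses code points outside the assumed 8-bit range.
inline unsigned char narrow_checked(char32_t c) {
  if (c > 0xFF)
    throw std::out_of_range("character does not fit in a base-char");
  return static_cast<unsigned char>(c);
}
```

With these definitions, narrow_unchecked(0x505A) yields 'Z', while narrow_checked(0x505A) throws instead of silently corrupting the string.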

dg1sbg and others added 14 commits June 4, 2026 09:08

Rename GlobalAllocationProfiler -> AllocationProfiler

eb51390

fix claude comment

9bafeee

more my_thread avoidance in bytecode_vm

8786de0

Speed up unsafe_setf_subseq for same element type

e59c9de

Speed up some bulk array operations (hopefully)

efa4749

Replace Array_O::unsafe_setf_subseq with copy_n(d)

86759e5

fix array copy_n to actually copy disparate integers without boxing

7dacba4

update release notes for array changes

7c0f6aa

head off char/int type puns in array copy_n

a0f90a9

Bike mentioned this pull request Jun 6, 2026

Runtime performance optimizations + Apple Silicon FFI-callback W^X fix #1771

Open

fix copy_n start point for virtual case

9969cc9

Bike merged commit 25f2392 into main Jun 6, 2026
5 of 12 checks passed

Bike deleted the frgo-misc-optimization branch June 6, 2026 04:30

Bike merged 15 commits into main from frgo-misc-optimization
