
frgo misc optimization #1791

Merged
Bike merged 15 commits into main from frgo-misc-optimization
Jun 6, 2026


@Bike Bike commented Jun 6, 2026

Member

Incorporates and supersedes parts of @dg1sbg's #1771:

  • the allocation profiler slots are now non-atomic, which saves some cycles
  • countObjectFileNames now uses an auxiliary non-Lisp hash table rather than walking all object files every time, which saves a big chunk of time
  • cache my_thread in bytecode_vm because TLS is slightly expensive to access
  • write_string into a string output stream copies in bulk instead of one character at a time and avoids boxing/unboxing

The first three are pretty much exactly as in #1771, except that I expanded the use of the my_thread caching. I spun the last one, the write_string change, off into a generic bulk-copying function for Lisp arrays, which is now used for copy-subarray and therefore by a few other functions, such as replace. From Lisp it only avoids consing if you copy an array into another of the same element type, but even aside from that it resolves displacement ahead of time, among other things, so it should speed things up.

dg1sbg and others added 14 commits June 4, 2026 09:08
GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel
(member _Allocations) and is only ever accessed via
my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator
fast path, gcFunctions, startRunStop, memoryManagement). There is no shared
instance and no cross-thread read, so the std::atomic counters are pure
overhead on registerAllocation(), which runs on every heap allocation.

Switch them to plain int64_t with in-class zero-init (which also fixes three
counters the constructors never initialized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
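The non-atomic counter change described above can be sketched roughly as follows. This is a minimal illustration, not the actual Clasp code: `AllocationProfilerSketch`, its two fields, and `threadProfiler` are hypothetical names, and the real profiler has more counters.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-thread allocation profiler.
struct AllocationProfilerSketch {
  // Plain integers with in-class initializers: every counter starts at zero
  // regardless of which constructor runs, and increments are ordinary adds.
  std::int64_t bytesAllocated = 0;
  std::int64_t allocationCount = 0;

  // Called on every allocation, so it should stay as cheap as possible.
  void registerAllocation(std::size_t bytes) {
    bytesAllocated += static_cast<std::int64_t>(bytes);
    ++allocationCount;
  }
};

// One instance per thread. Only the owning thread ever touches it, which is
// the assumption that makes dropping std::atomic safe.
inline AllocationProfilerSketch& threadProfiler() {
  thread_local AllocationProfilerSketch profiler;
  return profiler;
}
```

If some other thread ever needed to read these counters, the plain integers would become a data race, so the sketch only holds under the single-owner assumption stated in the commit message.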
"global" makes it sound like it's shared between threads, but it
is not.

Also remove yet more unnecessary Claudeish comments.
countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on
each call, and ensureUniqueMemoryBufferName calls it once per JIT-module
registration -- so registering N object files is O(N^2). On a JIT/compilation-
heavy workload it was the single largest self-time function (~15% in one
profile).

_AllObjectFiles is only ever appended to (registerObjectFile, the single add
point) or bulk-cleared, never individually pruned, so keep an auxiliary
mutex-guarded name->count map in sync at those points and answer
countObjectFileNames from it in O(1). The map holds no GC pointers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
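One way to picture the auxiliary name->count map is the sketch below. It assumes object files can be represented by their names alone; `ObjectFileRegistrySketch` and its members are hypothetical and do not mirror the real Clasp data structures.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry: a list that is only appended to or bulk-cleared,
// plus a side table that is updated at exactly those two points.
class ObjectFileRegistrySketch {
public:
  // The single add point: append to the list and bump the count together.
  void registerObjectFile(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    allObjectFiles_.push_back(name);
    ++nameCounts_[name];
  }

  // Bulk clear: both structures are emptied under the same lock.
  void clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    allObjectFiles_.clear();
    nameCounts_.clear();
  }

  // Average O(1) hash lookup instead of a scan over every registered file.
  std::size_t countObjectFileNames(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = nameCounts_.find(name);
    return it == nameCounts_.end() ? 0 : it->second;
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::string> allObjectFiles_;
  std::unordered_map<std::string, std::size_t> nameCounts_;
};
```

Because entries are never removed individually, the map cannot drift from the list as long as every append and every clear goes through these two functions.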
On Darwin every thread_local access compiles to a _tlv_get_addr thunk call
(there is no ELF initial-exec model; tls_model is a no-op). The bytecode
interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check)
and in several opcodes.

Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a
local and pass it to maybe_step_call, removing the per-call thunk. The thread
does not change during a VM frame. Correct (regression suite identical) and
zero-downside (a no-op load off Darwin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
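The caching pattern looks roughly like this in isolation. The names `ThreadStateSketch`, `g_threadState`, `maybeStepCall`, and `runFrame` are made up for the illustration; the real interpreter loop is far larger.

```cpp
#include <cstdint>

// Hypothetical per-thread state, standing in for what my_thread points at.
struct ThreadStateSketch {
  bool breakstep = false;
  std::uint64_t calls = 0;
};

thread_local ThreadStateSketch g_threadState;

// Receives the thread state as a parameter instead of reading the
// thread_local itself, so no TLS access happens per call.
inline void maybeStepCall(ThreadStateSketch* thread) {
  ++thread->calls;
  if (thread->breakstep) {
    // A real VM would hand control to the stepper here.
  }
}

// Simulates one VM frame that performs n calls.
inline std::uint64_t runFrame(int n) {
  // Single TLS lookup for the whole frame; valid because a frame never
  // migrates to another thread while it is running.
  ThreadStateSketch* thread = &g_threadState;
  for (int i = 0; i < n; ++i) {
    maybeStepCall(thread);
  }
  return thread->calls;
}
```

On platforms where a thread_local read is already a cheap load, the compiler would likely produce much the same code either way, which matches the commit's claim that the change costs nothing off Darwin.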
The default AnsiStream_O::write_string writes one character at a time; each
character pays a boxing (clasp_make_character) and a virtual vectorPushExtend
with a fill-pointer/realloc check. Override it for string-output-streams to (1)
grow the backing string once (geometric), (2) bulk-copy via the underlying
simple-vector with non-virtual typed access, and (3) update the output cursor by
scanning the range once. A safe fallback (the tested unsafe_setf_subseq path)
handles character-source-into-base-string narrowing.

Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is
byte-identical (verified against a base/extended/tab/newline/fill-pointer/
narrowing golden test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
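The three steps (grow once, bulk-copy, update the cursor with one scan) can be sketched on top of `std::string`. This is an assumption-laden simplification: `StringOutputSketch` is hypothetical, it handles only single-byte characters, and it ignores tabs, which the real stream accounts for.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical string output stream with a column cursor.
struct StringOutputSketch {
  std::string buffer;
  std::size_t column = 0;

  void writeString(std::string_view src) {
    // (1) Grow the backing store at most once, at least doubling it.
    const std::size_t needed = buffer.size() + src.size();
    if (needed > buffer.capacity()) {
      buffer.reserve(std::max(needed, buffer.capacity() * 2));
    }
    // (2) Bulk copy: one append rather than one push per character.
    buffer.append(src.data(), src.size());
    // (3) Update the column by locating the last newline in the range.
    const std::size_t newline = src.rfind('\n');
    if (newline == std::string_view::npos) {
      column += src.size();
    } else {
      column = src.size() - newline - 1;
    }
  }
};
```

The per-character path pays a capacity check and a cursor update for every element; here each of those costs is paid once per call, which is plausibly where most of the measured speedup on long strings comes from.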
and use more <algorithm> stuff for other array operations.
copy_subarray would be good too but it's a little more awkward due
to the possibility of overlap. STL doesn't have a memmove equivalent
for iterators/ranges, which is nuts.
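For reference, the usual workaround for the missing memmove equivalent is to pick `std::copy` or `std::copy_backward` based on the direction of the overlap. The helper below is a hypothetical sketch of that idea on a single `std::vector`, not what the commit does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copies count elements inside one vector from index src to index dst,
// tolerating overlap the way memmove does. The caller must ensure that both
// ranges lie within the vector.
template <typename T>
void copyWithin(std::vector<T>& v, std::size_t src, std::size_t dst,
                std::size_t count) {
  if (src == dst || count == 0) {
    return;  // nothing to move
  }
  auto first = v.begin() + static_cast<std::ptrdiff_t>(src);
  auto last = first + static_cast<std::ptrdiff_t>(count);
  auto out = v.begin() + static_cast<std::ptrdiff_t>(dst);
  if (dst < src) {
    // Destination starts before the source: a forward copy never reads an
    // element it has already overwritten.
    std::copy(first, last, out);
  } else {
    // Destination starts after the source: copy from the back instead.
    std::copy_backward(first, last, out + static_cast<std::ptrdiff_t>(count));
  }
}
```

The awkward part the commit message alludes to is that this direction check only works when both ranges are known to be in the same underlying storage, which generic iterators cannot tell you.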
copy_n should be about as fast as possible when provided enough
type information, and dispatches just enough to handle the common
case of types being the same. Hopefully.
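A minimal sketch of dispatching just enough to handle the same-type case might look like this. `bulkCopy` is a hypothetical function over `std::vector`, standing in for the typed Lisp array storage, and it assumes the source and destination are distinct containers.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Copies count elements from one vector into another, possibly of a
// different element type. The two vectors must not be the same object.
template <typename Src, typename Dst>
void bulkCopy(const std::vector<Src>& from, std::size_t srcStart,
              std::vector<Dst>& to, std::size_t dstStart, std::size_t count) {
  if (srcStart > from.size() || count > from.size() - srcStart ||
      dstStart > to.size() || count > to.size() - dstStart) {
    throw std::out_of_range("bulkCopy: range does not fit");
  }
  auto first = from.begin() + static_cast<std::ptrdiff_t>(srcStart);
  auto out = to.begin() + static_cast<std::ptrdiff_t>(dstStart);
  if constexpr (std::is_same_v<Src, Dst>) {
    // Same element type: with full type information, standard libraries can
    // typically lower copy_n to a memmove for trivially copyable types.
    std::copy_n(first, count, out);
  } else {
    // Different element types: convert element by element.
    std::transform(first, first + static_cast<std::ptrdiff_t>(count), out,
                   [](const Src& x) { return static_cast<Dst>(x); });
  }
}
```

The branch is chosen at compile time, so the same-type instantiation contains no conversion code at all.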
The template I had in before was never actually being used, and
trying to get it used resulted in me learning the very cool fact
that for C++, !a and !a are not recognized as the same concept.
So if constexpr it is.

Why on earth did we accept nil as a character designator? I hope
removing this doesn't break anything in cando, but that's gross.

The base-char range part fixes a problem where e.g.

(let ((str1 (make-string 7 :initial-element #\a :element-type 'base-char))
      (str2 (make-string 3 :initial-element #\做)))
  (replace str1 str2 :start1 2)
  str1)

would write just the low byte of the character into the string, in
this case Z. Now it signals an error as it should.
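The shape of the fix can be illustrated in C++ with a range-checked narrowing copy. Everything here is an assumption made for the sketch: `replaceNarrowing` is a hypothetical helper, base characters are taken to be one byte each so that code points above 0xFF are rejected, and the check runs before any write so a failure leaves the destination untouched. The character 做 is U+505A, whose low byte 0x5A is the ASCII code for Z, which is why the unchecked version produced Z.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>

// Copies characters from a wide string into a byte string starting at
// start, like (replace dst src :start1 start), refusing to truncate.
inline void replaceNarrowing(std::string& dst, std::size_t start,
                             const std::u32string& src) {
  if (start > dst.size()) {
    throw std::out_of_range("replaceNarrowing: start is past the end");
  }
  // Like replace, copy only as many characters as fit.
  const std::size_t n = std::min(src.size(), dst.size() - start);
  // Validate the whole range first instead of silently keeping the low byte.
  for (std::size_t i = 0; i < n; ++i) {
    if (src[i] > 0xFF) {
      throw std::range_error("replaceNarrowing: character does not fit");
    }
  }
  for (std::size_t i = 0; i < n; ++i) {
    dst[start + i] = static_cast<char>(static_cast<unsigned char>(src[i]));
  }
}
```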
@Bike Bike merged commit 25f2392 into main Jun 6, 2026
5 of 12 checks passed
@Bike Bike deleted the frgo-misc-optimization branch June 6, 2026 04:30


2 participants