frgo misc optimization #1791
Merged
GlobalAllocationProfiler lives in the THREAD_LOCAL ThreadLocalStateLowLevel (member _Allocations) and is only ever accessed via my_thread_low_level->_Allocations, i.e. by the owning thread alone (allocator fast path, gcFunctions, startRunStop, memoryManagement). There is no shared instance and no cross-thread read, so the std::atomic counters are pure overhead on registerAllocation(), which runs on every heap allocation. Switch them to plain int64_t with in-class zero-init (which also fixes three counters the constructors never initialized). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
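For illustration, a minimal sketch of the idea, not the actual Clasp code: the struct, its field names, and threadProfiler() are hypothetical. It shows plain int64_t counters with in-class zero-init on an object that only its owning thread touches.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-thread profiler; names are illustrative.
struct AllocationProfilerSketch {
  // In-class initializers guarantee every counter starts at zero,
  // whichever constructor runs.
  std::int64_t bytesAllocated = 0;
  std::int64_t allocationCount = 0;

  // Plain increments: no atomic read-modify-write is needed because only
  // the owning thread ever touches this instance.
  void registerAllocation(std::int64_t bytes) {
    bytesAllocated += bytes;
    ++allocationCount;
  }
};

// One instance per thread; there is no shared instance.
inline AllocationProfilerSketch& threadProfiler() {
  thread_local AllocationProfilerSketch profiler;
  return profiler;
}
```

This only holds as long as no other thread reads the counters; a cross-thread reader would bring the data race back.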
"global" makes it sound like it's shared between threads, but it is not. Also remove yet more unnecessary Claudeish comments.
countObjectFileNames rescanned the entire _AllObjectFiles list with a memcmp on each call, and ensureUniqueMemoryBufferName calls it once per JIT-module registration -- so registering N object files is O(N^2). On a JIT/compilation-heavy workload it was the single largest self-time function (~15% in one profile). _AllObjectFiles is only ever appended to (registerObjectFile, the single add point) or bulk-cleared, never individually pruned, so keep an auxiliary mutex-guarded name->count map in sync at those points and answer countObjectFileNames from it in O(1). The map holds no GC pointers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
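A rough sketch of the auxiliary map, assuming std::unordered_map and std::mutex; the class and method names here are hypothetical and not the ones in the patch.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical registry; the class and method names are illustrative only.
class ObjectFileNameCounts {
public:
  // Call at the single point where an object file is appended.
  void noteRegistered(const std::string& name) {
    std::lock_guard<std::mutex> guard(mutex_);
    ++counts_[name];
  }

  // Call wherever the object-file list is bulk-cleared.
  void clear() {
    std::lock_guard<std::mutex> guard(mutex_);
    counts_.clear();
  }

  // Average O(1) lookup instead of rescanning every registered file.
  std::size_t count(const std::string& name) const {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = counts_.find(name);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

Because entries are never removed one at a time, the map only has to be updated at the append point and the bulk-clear point to stay in sync with the list.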
On Darwin every thread_local access compiles to a _tlv_get_addr thunk call (there is no ELF initial-exec model; tls_model is a no-op). The bytecode interpreter hit my_thread on every Lisp call (maybe_step_call's breakstep check) and in several opcodes. Resolve my_thread once per VM frame (bytecode_vm and long_dispatch) into a local and pass it to maybe_step_call, removing the per-call thunk. The thread does not change during a VM frame. Correct (regression suite identical) and zero-downside (a no-op load off Darwin). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
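Sketched in isolation, the pattern looks roughly like this. The struct, the toy opcode loop, and the function names are hypothetical; only the shape (one TLS lookup per frame, then pass the pointer along) reflects the change.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-thread state; field names are illustrative.
struct ThreadStateSketch {
  bool breakstep = false;
  std::int64_t calls = 0;
};

thread_local ThreadStateSketch t_thread_state;

// Takes the thread state as a parameter, so it performs no TLS lookup itself.
inline void maybeStepCall(ThreadStateSketch* thread) {
  ++thread->calls;
  if (thread->breakstep) {
    // A real VM would enter the stepper here.
  }
}

// Runs a toy "frame" of ops; the TLS address is resolved once, up front.
inline std::int64_t runFrame(const std::vector<int>& ops) {
  ThreadStateSketch* thread = &t_thread_state;  // single TLS access per frame
  std::int64_t acc = 0;
  for (int op : ops) {
    maybeStepCall(thread);
    acc += op;
  }
  return acc;
}
```

The cached pointer is valid only because a frame never migrates between threads while it runs.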
The default AnsiStream_O::write_string writes one character at a time; each character pays a boxing (clasp_make_character) and a virtual vectorPushExtend with a fill-pointer/realloc check. Override it for string-output-streams to (1) grow the backing string once (geometric), (2) bulk-copy via the underlying simple-vector with non-virtual typed access, and (3) update the output cursor by scanning the range once. A safe fallback (the tested unsafe_setf_subseq path) handles character-source-into-base-string narrowing. Measured 1.8x (14 chars) to 62x (2000 chars) vs the per-char path; output is byte-identical (verified against a base/extended/tab/newline/fill-pointer/narrowing golden test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
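The three steps can be sketched over a plain std::string. This is a simplified stand-in, not the AnsiStream_O override: the struct is hypothetical, the cursor is reduced to a column count, and tab handling and narrowing are left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical string-output-stream; not Clasp's actual class.
struct StringOutputSketch {
  std::string buffer;
  std::size_t column = 0;  // output cursor: characters since the last newline

  void writeString(std::string_view src) {
    // (1) Grow once, geometrically, if the new text does not fit.
    const std::size_t needed = buffer.size() + src.size();
    if (needed > buffer.capacity()) {
      const std::size_t doubled = buffer.capacity() * 2;
      buffer.reserve(needed > doubled ? needed : doubled);
    }
    // (2) Bulk-copy the whole range in one call.
    buffer.append(src.data(), src.size());
    // (3) Update the cursor with a single scan of the written range.
    for (char c : src) {
      column = (c == '\n') ? 0 : column + 1;
    }
  }
};
```

The per-character cost that remains is the cursor scan, which needs neither boxing nor a virtual call.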
and use more <algorithm> stuff for other array operations. copy_subarray would be good too but it's a little more awkward due to the possibility of overlap. STL doesn't have a memmove equivalent for iterators/ranges, which is nuts.
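For reference, one way to get memmove-like behavior out of <algorithm> when source and destination live in the same container is to pick the copy direction by hand. This is a sketch with a hypothetical helper name, not what copy_subarray does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copies count elements from index src to index dst within one vector,
// behaving like memmove even when the two ranges overlap.
// Precondition: src + count and dst + count do not exceed v.size().
template <typename T>
void copyWithin(std::vector<T>& v, std::size_t dst, std::size_t src,
                std::size_t count) {
  auto first = v.begin() + src;
  auto last = first + count;
  if (dst < src) {
    // Destination starts before the source: a forward copy is safe.
    std::copy(first, last, v.begin() + dst);
  } else if (dst > src) {
    // Destination starts after the source: copy from the back.
    std::copy_backward(first, last, v.begin() + dst + count);
  }
  // dst == src: nothing to do.
}
```

This only works when both ranges are known to index the same storage, which is part of what makes the general case awkward.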
copy_n should be about as fast as possible when provided enough type information, and dispatches just enough to handle the common case of types being the same. Hopefully.
The template I had in before wasn't ever actually being used, and trying to make it used resulted in me learning the very cool fact that in C++, !a and !a are not recognized as the same concept. So if constexpr it is.
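A minimal sketch of the if constexpr dispatch, assuming raw typed buffers that do not overlap; copyElements is a hypothetical name and the real function handles Lisp array types rather than plain pointers.

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: copies n elements between non-overlapping typed buffers.
template <typename Src, typename Dst>
void copyElements(const Src* src, std::size_t n, Dst* dst) {
  if constexpr (std::is_same_v<Src, Dst>) {
    // Same element type: std::copy_n can lower to a memmove for trivial types.
    std::copy_n(src, n, dst);
  } else {
    // Different element types: convert element by element.
    std::transform(src, src + n, dst,
                   [](const Src& x) { return static_cast<Dst>(x); });
  }
}
```

The branch is chosen at compile time, so the same-type case carries no runtime dispatch, and there are no negated constraints for overload resolution to compare.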
why on earth did we accept nil as a character designator? I hope
removing this doesn't break anything in cando, but that's gross.
The base-char range part fixes a problem where e.g.
    (let ((str1 (make-string 7 :initial-element #\a :element-type 'base-char))
          (str2 (make-string 3 :initial-element #\做)))
      (replace str1 str2 :start1 2)
      str1)
would write just the low byte of the character into the string, in
this case Z. Now it signals an error as it should.
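The check amounts to something like the following sketch. It assumes base-char covers code points 0 to 255, and the function name and exception type are hypothetical; the real code signals a Lisp error instead.

```cpp
#include <cstddef>
#include <stdexcept>

// Assumption for this sketch: base-char covers code points 0..255.
constexpr char32_t kBaseCharLimit = 256;

// Narrows wide characters into a byte buffer, refusing anything out of range
// rather than silently keeping only the low byte.
inline void narrowInto(const char32_t* src, std::size_t n, unsigned char* dst) {
  for (std::size_t i = 0; i < n; ++i) {
    if (src[i] >= kBaseCharLimit) {
      throw std::out_of_range("character does not fit in a base-char string");
    }
    dst[i] = static_cast<unsigned char>(src[i]);
  }
}
```

For the example above, 做 is U+505A, whose low byte 0x5A is Z, which is where the stray Z came from. Note that this sketch may already have written a prefix of the destination by the time it throws.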
Incorporates and supersedes parts of @dg1sbg's #1771:
- countObjectFileNames now uses an accessory non-lisp hash table rather than walking all object files every time, which saves a big chunk of time
- my_thread is cached in bytecode_vm because TLS is slightly expensive to access
- write_string into a string output stream copies in bulk instead of one at a time and avoids boxing/unboxing

The first three are pretty much exactly as in #1771 except that I expanded the use of the my_thread caching. The last with write_string I spun off into a generic bulk copying function for Lisp arrays which is now used for copy-subarray and therefore a couple different functions, like replace. From Lisp it only avoids consing if you copy an array into another of the same element type, but even aside from that it takes care of displacement ahead of time and so on, so it should speed things up.