Conversation
Also move from rep movsd to rep movsb, which should be faster on modern hardware.
|
Also, for the record: apparently during testing, the current libc [...]. This test program exposes it: [...] which produces: [...]. I'm not sure if this is why, when I stub out the [...]. |
|
Based on my reference memcpy implementation included in the push, I shaved ~1-2 minutes off a build, which is a 10% reduction(ish), which is nice ;). |
|
pretty cool. Thank you! |
|
Are the AppVeyor failures related? If not: 10% saved and tests otherwise passing seems like a no-brainer to get the new memcpy in.

A related note: I've found that GCC often performs badly when the number of bytes is not known to the compiler; for small amounts (I think it was less than 256 bytes) the general implementation seems to be worse than a simple loop that the optimizer improves, to the point that in libcob I've added an explicit switch between the two depending on the size. |
|
the current AppVeyor failures are because I made changes that I didn't quite vet properly. I'm trying to get it worked out, but I keep running into one problem after another. Soon, I guess.... Another aspect of what I checked in is that it may be a little slower overall; part of my current vetting is also to see if I can address that... |
|
@GitMensch So I think it may be partly me? (I have no idea why this would break it; if anything, it fixes issues.) |
|
https://www.microsoft.com/en-us/msrc/blog/2021/01/building-faster-amd64-memset-routines The goal here should be to make every commonly used, expensive libc function slightly faster. |
|
Hmmm.... I found a crash in the following function:

inline void* memmove_backwards(void* s1, const void* s2, size_t sz)
{
char* dest_char = s1;
const char* src_char = s2;
size_t num_loop = sz / 64;
/*
while (num_loop > 0)
{
num_loop -= 4;
__m128i loaded_val = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop));
__m128i loaded_val1 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 1));
__m128i loaded_val2 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 2));
__m128i loaded_val3 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 3));
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 3), loaded_val3);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 2), loaded_val2);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 1), loaded_val1);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop), loaded_val);
}
*/
sz -= num_loop * 64;
while (num_loop > 0)
{
num_loop -= 4;
__asm {
lea ecx, num_loop
movups xmm0, [src_char + ecx];
movups [dest_char + ecx], xmm0;
}
}
while (sz > 0)
{
sz--;
((char*)s1)[sz] = ((const char*)s2)[sz];
}
return s1;
}

This crashes the compiler every time. You'll notice that I'm trying to work around an issue where the commented-out block of code doesn't work, as my assumption there is that [...]. This is compounded by MSVC happily accepting this block of code. I'm not sure if this is a "I don't know enough about the NASM syntax" issue, however. |
|
oh I just realized what is probably wrong with the 'lea' instructions. NASM syntax does not really understand things like:

mov eax, my_variable

I hadn't thought about that aspect of MASM in a very long time.... Instead, I think you have to write:

mov eax, [my_variable]

This is actually something that is probably easy to adjust in the inline assembly parser.... it just needs to notice what is happening and add the brackets internally. If you think it would be good for compatibility, I can address it. |
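A side-by-side sketch of the difference (NASM syntax; illustrative only):

```nasm
; In MASM, a bare symbol is a memory operand; in NASM, it is the
; symbol's address, so loading the value requires brackets.
section .data
my_variable: dd 42

section .text
load_it:
    mov eax, my_variable    ; NASM: eax = address of my_variable
    mov eax, [my_variable]  ; NASM: eax = 42, the value stored there
    ret
```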
|
I think we could tackle this in two ways, both acceptable: accept [...]. I think that's the best solution. |
The hot loop is now ~6 instructions as opposed to the old ~8, while doing 4x the number of compares. Small strings may be slightly slower, since the cold loop is there to reach alignment before the fast path kicks in, so strings < 15 bytes in length may be slower. Anything > 32 bytes in length should be faster than before, however.
|
I'm somewhat stalling out on this at the moment, but I might get back to it next week. I am still unhappy with my [...]. Past that, I think I might aim at [...]. Outside of that, I am interested in trying to learn LLVM's loop transformations, such as matching patterns that rewrite a basic block into a function call (in particular, something like optimizing a simple memcpy-style loop into a call to the library memcpy when the size is unknown or greater than a certain value). |
|
so I think it is valuable that you are doing this 😄 |
|
I'm kinda-sorta stalling out on this while working on my benchmark lib, because [...]. On that note: for adding little utilities that don't quite fit in the Utils/ folder, should I add a [...]? |
|
yeah that would be fine, adding another folder for maintenance stuff 😄 I'm making progress, slowly. I had most of the test stuff compiling and running, but there are still a couple of idiosyncrasies. After several hours of debugging one of the more complex programs in the test suite, I found I've broken simple loops like this: [...] lol... |
|
I've noticed we also don't support these:

<Opcode Name="clflush {byte}'memloc:mem8'" op="0x0F:8 0xAE:8 'memloc':8" Class="mem8"/>
<Opcode Name="clflushopt {byte}'memloc:mem8'" op="0x66:8 0x0F:8 0xAE:8 'memloc':8" Class="mem8"/>

Is this the right way to do this? I'm not sure about the whole referencing portion of op <-> opcode name.

Edit: I think I figured it out:

<Opcode Name="clflush {byte} 'mem:mem8'" mod="7" op="0x0F:8 0xAE:8 'mem':8"/>
<Opcode Name="clflushopt {byte} 'mem:mem8'" mod="7" op="0x66:8 0x0F:8 0xAE:8 'mem':8"/>

I'll check whether the generated code is (roughly) correct; if it is, then I can just go ahead there....

2nd edit, I am provably dumb about this:

<Opcode Name="clflush">
<Operands Name="{byte} 'mem:mem8'" op="0x0F:8 0xAE:8 'mem':8"/>
</Opcode>
<Opcode Name="clflushopt">
<Operands Name="{byte} 'mem:mem8'" op="0x66:8 0x0F:8 0xAE:8 'mem':8"/>
</Opcode> |
|
I know it is confusing, but the second edit is close... it might work, except you probably want to use r/m-style addressing rather than just a direct address, as this instruction allows that. I've added some other stuff you probably need as well.... The 'coding' field is what actually generates the instruction byte codes; here we say 'native' to make it pick whichever addressing-mode line applies for the addressing mode present in the source code (all the possible addressing modes for an rm8 are iterated earlier in the file). The 'mod' field is set to 7 because that is what the documentation specifies for it; the R and W are fields in the 64-bit prefix, if one is generated... usually they are zero. I think that will work 😄 |
|
Ok, your suggestion seems to have worked; it does mandate that the [...]. |
Also fix my strlen modification to shave off a single instruction. Also decorate noreturn functions with noreturn.
Yippeee! |
|
wow. I am so surprised at that... |
|
To be fair, this was a comparison of the compiled implementation of my code vs the assembly vs an EXTREMELY basic memcpy copying byte by byte, so the results shouldn't be too surprising, I think. |
|
I've merged my branch with the latest (I did regen the x64 parse stuff so that we have my instruction updates as well as yours). |
|
sounds good. I'm really happy with the changes you made to omake, thank you for that! |
that is definitely the way to go |
|
the next build will check %APPVEYOR% |
This is my incubation area for optimization; so far I'm including my gen.cpp changes while I try to figure out what in the universe is going on with my memcpy implementation. I'll be pushing my memcpy implementation later, but for now this should be a good reference point to compare the AppVeyor build times with, as I suspect this should increase the speed of the compiler slightly, but not majorly.