Conversation
Also move from rep movsd to rep movsb, which should be faster on modern hardware.
|
Also, for the record: apparently during testing, the current libc [...]. This test program exposes it: [...] which produces: [...]. I'm not sure if this is why, when I stub out the [...]. |
|
Based on my reference memcpy implementation included in the push, I shaved ~1-2 minutes off a build, which is a 10% reduction(ish), which is nice ;). |
|
pretty cool. Thank you! |
|
Are the AppVeyor failures related? If not: 10% saved and tests otherwise passing seems like a no-brainer to get the new memcpy in.

A related note: I've found that GCC often performs badly when the number of bytes is not known to the compiler; for small amounts (I think it was less than 256 bytes) the general implementation seems to be worse than a simple loop that the optimizer improves, to the point that in libcob I've added an explicit switch between the two depending on the size. |
|
the current AppVeyor failures are because I made changes that I didn't quite vet properly. I'm trying to get it worked out, but I keep running into one problem after another. Soon, I guess.... Another aspect of what I checked in is that it may be a little slower overall; part of my current vetting is also to see if I can address that... |
|
@GitMensch So I think it may be partly me? (I have no idea why this would break it; if anything, it fixes issues.) |
|
https://www.microsoft.com/en-us/msrc/blog/2021/01/building-faster-amd64-memset-routines The goal here should be to make every commonly used, expensive libc function slightly faster. |
|
Hmmm.... I found a crash in the following function:

inline void* memmove_backwards(void* s1, const void* s2, size_t sz)
{
char* dest_char = s1;
const char* src_char = s2;
size_t num_loop = sz / 64;
/*
while (num_loop > 0)
{
num_loop -= 4;
__m128i loaded_val = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop));
__m128i loaded_val1 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 1));
__m128i loaded_val2 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 2));
__m128i loaded_val3 = _mm_loadu_si128(((const __m128i*)(src_char) + num_loop + 3));
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 3), loaded_val3);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 2), loaded_val2);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop + 1), loaded_val1);
_mm_storeu_si128(((__m128i*)(dest_char) + num_loop), loaded_val);
}
*/
sz -= num_loop * 64;
while (num_loop > 0)
{
num_loop -= 4;
__asm {
lea ecx, num_loop
movups xmm0, [src_char + ecx];
movups [dest_char + ecx], xmm0;
}
}
while (sz > 0)
{
sz--;
((char*)s1)[sz] = ((const char*)s2)[sz];
}
return s1;
}

This crashes the compiler every time. You'll notice that I'm trying to work around an issue where the commented-out block of code doesn't work, as my assumption there is that [...]. This is compounded by MSVC happily accepting this block of code. I'm not sure if this is a "I don't know enough about the NASM syntax" issue, however. |
|
oh I just realized what is probably wrong with the 'lea' instructions. NASM syntax does not really understand things like:

mov eax, my_variable

I hadn't thought about that aspect of MASM in a very long time.... Instead, I think you have to write:

mov eax, [my_variable]

This is actually something that is probably easy to adjust in the inline assembly parser.... it just needs to notice what is happening and add the brackets internally. If you think it would be good for compatibility, I can address it. |
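A side-by-side sketch of the difference (NASM syntax; illustrative only):

```nasm
; In MASM, a bare symbol is a memory operand; in NASM, it is the
; symbol's address, so loading the value requires brackets.
section .data
my_variable: dd 42

section .text
load_it:
    mov eax, my_variable    ; NASM: eax = address of my_variable
    mov eax, [my_variable]  ; NASM: eax = 42, the value stored there
    ret
```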
|
I think we could tackle this in two ways, both acceptable: accept [...]. I think that's the best solution. |
The hot loop is now ~6 instructions as opposed to the old ~8, while doing 4x the number of compares. Small strings may be slightly slower, since the cold loop is there to reach alignment before the fast path kicks in, so strings < 15 bytes in length may be slower. Anything > 32 bytes in length should be faster than before, however.
|
I'm somewhat stalling out on this at the moment, but I might get back to it next week. I am still unhappy with my [...]. Past that, I think I might aim at [...]. Outside of that, I am interested in trying to learn LLVM's loop transformations, such as matching patterns that rewrite a basic block into a function call (in particular, something like optimizing a simple memcpy-style loop into a call to the library memcpy when the size is unknown or greater than a certain value). |
|
so I think it is valuable that you are doing this 😄 |
|
I'm kinda-sorta stalling out on this while working on my benchmark lib, because [...]. On that note: for adding little utilities that don't quite fit in the Utils/ folder, should I add a [...]? |
|
yeah that would be fine, adding another folder for maintenance stuff 😄 I'm making progress, slowly. I had most of the test stuff compiling and running, but there are still a couple of idiosyncrasies. After several hours of debugging one of the more complex programs in the test suite, I found I've broken simple loops like this: [...] lol... |
|
I've noticed we also don't support these:

<Opcode Name="clflush {byte}'memloc:mem8'" op="0x0F:8 0xAE:8 'memloc':8" Class="mem8"/>
<Opcode Name="clflushopt {byte}'memloc:mem8'" op="0x66:8 0x0F:8 0xAE:8 'memloc':8" Class="mem8"/>

Is this the right way to do this? I'm not sure about the whole referencing portion of op <-> opcode name.

Edit: I think I figured it out:

<Opcode Name="clflush {byte} 'mem:mem8'" mod="7" op="0x0F:8 0xAE:8 'mem':8"/>
<Opcode Name="clflushopt {byte} 'mem:mem8'" mod="7" op="0x66:8 0x0F:8 0xAE:8 'mem':8"/>

I'll check whether the generated code is (roughly) correct; if it is, then I can just go ahead there....

2nd edit, I am provably dumb about this:

<Opcode Name="clflush">
<Operands Name="{byte} 'mem:mem8'" op="0x0F:8 0xAE:8 'mem':8"/>
</Opcode>
<Opcode Name="clflushopt">
<Operands Name="{byte} 'mem:mem8'" op="0x66:8 0x0F:8 0xAE:8 'mem':8"/>
</Opcode> |
|
I know it is confusing, but the second edit is close... it might work, except you probably want to use r/m-style addressing rather than just a direct address, as this instruction allows that. I've added some other stuff you probably need as well.... The 'coding' field is what actually generates the instruction byte codes; here we say 'native' to make it pick whichever addressing-mode line applies for the addressing mode present in the source code (all the possible addressing modes for an rm8 are iterated earlier in the file). The 'mod' field is set to 7 because that is what the documentation specifies for it; the R and W are fields in the 64-bit prefix, if one is generated... usually they are zero. I think that will work 😄 |
|
Ok, your suggestion seems to have worked; it does mandate that the [...]. |
Also fix my strlen modification to shave off a single instruction. Also decorate noreturn functions with noreturn.
Yippeee! |
|
wow. I am so surprised at that... |
|
To be fair, this was a comparison of the compiled implementation of my code vs the assembly vs an EXTREMELY basic memcpy copying byte by byte, so the results shouldn't be too surprising, I think. |
|
I've merged my branch with the latest (I did regen the x64 parse stuff so that we have my instruction updates as well as yours). |
|
sounds good. I'm really happy with the changes you made to omake, thank you for that! |
that is definitely the way to go |
|
the next build will check %APPVEYOR% |
This is my incubation area for optimization; so far I'm including my gen.cpp changes while I try to figure out what in the universe is going on with my memcpy implementation. I'll be pushing my memcpy implementation later, but for now this should be a good reference point to compare the AppVeyor build times with, as I suspect this should increase the speed of the compiler slightly, but not majorly.