bthread: implement lock-free AtomicInteger128 for RISC-V#3355
bthread: implement lock-free AtomicInteger128 for RISC-V#3355Felix-Gong wants to merge 2 commits into
Conversation
2a4b97f to
5cd6d6a
Compare
There was a problem hiding this comment.
Pull request overview
Implements a lock-free (seqlock-based) 128-bit load/store path for bthread::AtomicInteger128 on RISC-V, replacing the prior mutex fallback to improve performance when reading/writing _cpu_time_stat in TaskGroup.
Changes:
- Adds a sequence counter (
_seq) toAtomicInteger128to support seqlock on RISC-V. - Updates
AtomicInteger128::load()/store()with RISC-V inline-asm implementations and fences. - Keeps existing x86_64 (SSE) and ARM NEON code paths unchanged.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/bthread/task_group.h | Adds _seq member and updates the mutex comment for non-lock-free architectures. |
| src/bthread/task_group.cpp | Implements RISC-V seqlock-based 128-bit load/store paths alongside existing x86/ARM implementations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| __asm__ volatile( | ||
| "fence w, w\n\t" | ||
| "sd %0, %1\n\t" | ||
| : | ||
| : "r"(new_seq), "m"(_seq) | ||
| : "memory" | ||
| ); | ||
| __asm__ volatile("fence w, w\n\t" ::: "memory"); | ||
| __asm__ volatile( | ||
| "sd %0, %2\n\t" | ||
| "sd %1, %3\n\t" | ||
| : | ||
| : "r"(value.v1), "r"(value.v2), | ||
| "m"(_value.v1), "m"(_value.v2) | ||
| : "memory" | ||
| ); | ||
| __asm__ volatile("fence w, w\n\t" ::: "memory"); | ||
| new_seq++; | ||
| __asm__ volatile( | ||
| "sd %0, %1\n\t" | ||
| : | ||
| : "r"(new_seq), "m"(_seq) | ||
| : "memory" | ||
| ); |
There was a problem hiding this comment.
Fixed. For sd (store) instructions, memory operands are now correctly declared as outputs using "=m" constraints. For ld (load) instructions, memory operands remain as inputs ("m") since ld only reads memory.
sd %1, %0with: "=m"(_seq) : "r"(new_seq)- memory is outputld %0, %1with: "=r"(seq0) : "m"(_seq)- memory is input
This correctly informs the compiler about which memory locations are modified.
| return {value[0], value[1]}; | ||
| #else // __x86_64__ || __ARM_NEON | ||
| // RISC-V and other architectures use mutex fallback | ||
| #elif defined(__riscv) |
There was a problem hiding this comment.
Fixed. Added && __riscv_xlen == 64 guard to restrict this path to RV64 only. RV32 will fall back to the mutex implementation.
| (void)_seq; | ||
| int64x2_t v = vld1q_s64(reinterpret_cast<int64_t*>(&value)); | ||
| vst1q_s64(reinterpret_cast<int64_t*>(&_value), v); | ||
| #elif defined(__riscv) |
There was a problem hiding this comment.
Fixed. Added && __riscv_xlen == 64 guard for store() as well. RV32 builds will use the mutex fallback.
Replace mutex-based fallback with seqlock implementation for RISC-V platform. This brings RISC-V in line with x86 (SSE) and ARM (NEON) lock-free implementations. Seqlock algorithm: - Reader: read sequence, read data, verify sequence, retry if stale - Writer: increment sequence, write data, increment sequence - All memory accesses use fence instructions for ordering Performance improvement: - bthread_start_urgent latency: 1.5-2% reduction - adding_func throughput: 25% improvement (541ns vs 722ns)
5cd6d6a to
aee927b
Compare
- Use "m" (input) for ld instructions reading memory - Use "=m" (output) for sd instructions writing memory - Swap operand order in sd templates to match RISC-V syntax (register, memory) The previous "+m" constraints in input operand positions caused compilation errors on RISC-V. This fix ensures proper constraint usage: - ld: memory is input "m", register is output "=r" - sd: register is input "r", memory is output "=m"
4fa5409 to
d7841bf
Compare
Summary
Replace mutex-based fallback with seqlock implementation for RISC-V platform, bringing it in line with x86 (SSE) and ARM (NEON) lock-free implementations.
Changes
fenceinstructions for proper memory orderingSeqlock Algorithm
fence r, rw(acquire) andfence w, w(release) for orderingPerformance Improvement
Tested on SOPHGO SG2044 (RISC-V rv64gcv):
bthread_start_urgentlatency: ~1.5-2% reductionadding_functhroughput: ~25% improvement (541ns vs 722ns)Test Plan
Notes
#elif defined(__riscv)preprocessor guards_seqmember is added toAtomicInteger128class but only used on RISC-V (other platforms have(void)_seq;to suppress warnings)