189 commits
92b8ef3
X86 render
Jul 15, 2025
dc3aefa
CPUProgram patch for ARM termux device
Jul 16, 2025
be757b2
alu share dst and src if src will expire
Jul 16, 2025
e25829c
cont'd
Jul 16, 2025
d93a011
x86 can use lea
Jul 16, 2025
d30479a
fix share reg alu test on arm
Jul 16, 2025
ea1fa51
range newline
Jul 16, 2025
7e0a5b7
no fma, unroll limit size to 8
Jul 16, 2025
40397b6
set up failed test_abs
Jul 16, 2025
12b3e06
abs fail with just data error
Jul 17, 2025
8011352
positive number correct
Jul 17, 2025
50e2912
abs works
Jul 17, 2025
c86afe7
asm
Jul 17, 2025
92be831
arm fma
Jul 17, 2025
bd980e8
abs test with neg hardcoded val
Jul 17, 2025
800f49e
abs with float64
Jul 18, 2025
7b79e0c
test abs integer 32
Jul 18, 2025
87c7772
load and store for var need to check bool, and handle movsd
Jul 18, 2025
0aadd39
abs int32 works
Jul 18, 2025
0f18ff4
just f64 failing only in python
Jul 18, 2025
b1ffae5
8 byte alignment in data
Jul 18, 2025
81058a8
int64 and movsd fix
Jul 19, 2025
38d399e
test acos
Jul 19, 2025
e62755e
acos works
Jul 20, 2025
c58fa1a
test acos with unroll, disable float4 in clang
Jul 20, 2025
089526e
float reg spill
Jul 20, 2025
de2d8e8
full helper for testing backward as well
Jul 20, 2025
cfde1ff
recip; nan on backward pass
Jul 20, 2025
c4a2037
forward only for now
Jul 20, 2025
9dc120d
cmpne
Jul 20, 2025
f8c9a5d
debug info for clang render
Jul 20, 2025
d255f74
x86 bitcast
Jul 20, 2025
3dc2587
test assign specific reg
Jul 20, 2025
3b32531
assign a specific reg
Jul 21, 2025
28d3019
idiv x86 works, incorrect results
Jul 21, 2025
c63f48a
idiv works on x86 correct results
Jul 21, 2025
0becead
test acosh
Jul 21, 2025
d125e2c
need to handle AND for acosh
Jul 21, 2025
d62f7a4
and works
Jul 21, 2025
a804543
acosh invalid result
Jul 21, 2025
b5461f2
just have to use tiny_backend
Jul 21, 2025
0313ff1
handle int64 -> int32
Jul 21, 2025
a4fcf86
default to tiny_backend
Jul 21, 2025
2deb52b
bool
Jul 21, 2025
e76b0ae
test_all works
Jul 22, 2025
1deb479
more testall
Jul 22, 2025
096109c
wip
Jul 22, 2025
b9d3f05
just acosh failing with stack offset being too big
Jul 22, 2025
c9680f1
use x30 as a second stack pointer for stuff > 255
Jul 22, 2025
55d075c
acosh fails with unroll
Jul 23, 2025
c37e44d
revert: just cosh fail with accuracy
Jul 23, 2025
b2a9151
need simpler code before handling acosh
Jul 23, 2025
31bc026
starting Allocator v2
Jul 23, 2025
cb0970d
allocator 2 blueprint
Jul 23, 2025
1b8d9cc
rename do_not_use to blocked
Jul 23, 2025
8a4bcc1
remove helper extend kernel
Jul 23, 2025
7432f40
exclude now uses regs
Jul 23, 2025
1275f68
reserve is reg based
Jul 24, 2025
7a8764c
rename i, index to cur_step
Jul 24, 2025
5430515
remove .variables
Jul 24, 2025
8921780
hoist x86_params
Jul 24, 2025
e877438
sum test
Jul 24, 2025
e2234d4
assert where
Jul 24, 2025
ae4b54c
more acos test
Jul 24, 2025
2d90abb
more test_abs
Jul 24, 2025
551bc07
more tests
Jul 24, 2025
31b1d38
remove __get__ for more explicit assignment
Jul 24, 2025
06c2a85
explicit number of register
Jul 24, 2025
a2db5d4
alloc return just a reg
Jul 24, 2025
21df9b6
fix alloc from pool when regs are blocked
Jul 24, 2025
146fb68
_spill method
Jul 24, 2025
11a4265
assign requires explicit reg type
Jul 24, 2025
5ea59b6
assign_reg and alloc_reg
Jul 24, 2025
4fc04e1
save_var_to_stack is handled by _spill
Jul 24, 2025
861ccc3
alloc multiple drives alloc
Jul 24, 2025
f0487ef
assign multiple implementation
Jul 25, 2025
a911c54
float_cmp can use alloc_multiple
Jul 25, 2025
942831e
set up the branch for assign_multiple
Jul 25, 2025
da88c15
need to load val into reg if stack is set up for a var
Jul 25, 2025
bf6b91c
alu uses assign_multiple
Jul 25, 2025
a90587f
_index uses assign_multiple
Jul 25, 2025
dfbe583
cont'd
Jul 25, 2025
7dd73f2
to_bool uses assign_multiple
Jul 25, 2025
8cd33ea
to_bool uses assign_multiple, cont'd
Jul 25, 2025
c507125
float_cmp to use assign_multiple
Jul 25, 2025
126c628
float_cmp uses assign_multiple
Jul 25, 2025
0a6384d
refactor alu
Jul 25, 2025
3a1da2a
refactor
Jul 25, 2025
d12ee77
refactor
Jul 25, 2025
b74597f
refactor
Jul 25, 2025
899c8ae
refactor
Jul 25, 2025
4e84477
refactor
Jul 25, 2025
6eb4755
refactor
Jul 25, 2025
7339b8e
refactor
Jul 25, 2025
f5b1452
standalone arm cmplt
Jul 25, 2025
2366ff9
standalone arm cmp for both ne and lt
Jul 25, 2025
e6d994a
acosh fails with large positive num
Jul 25, 2025
07ebb61
x86 standalone cmp
Jul 25, 2025
80d8ace
fix arm cmp
Jul 25, 2025
6c01e0a
test acosh with manual switch
Jul 25, 2025
030048c
arm recip with fmov
Jul 26, 2025
0708c66
arm passes acosh
Jul 26, 2025
d1c2685
x86 cmp refactor
Jul 26, 2025
ca00b3f
return reg takes a list
Jul 26, 2025
125576d
x86 cmp refactor
Jul 26, 2025
a899f51
x86 cmp refactor
Jul 26, 2025
e74f146
x86 fdiv
Jul 26, 2025
2a6d910
just high num of log failing on x86
Jul 26, 2025
d49e073
just high num of log with unroll failing on x86
Jul 26, 2025
e5db169
just log with unroll fails on x86
Jul 26, 2025
97945fb
check for zf flag first with ucomiss
Jul 26, 2025
5365933
if stack is not None, load into reg forcibly, x86 fails with log2 o…
Jul 27, 2025
9538061
track reg and stack modification, keep var assign early return if reg…
Jul 27, 2025
82bd5fd
x86 passes acosh, fixes idiv
Jul 27, 2025
315250c
float arange works
Jul 28, 2025
ee73430
argmax implementation
Jul 29, 2025
6c0af4d
cast bool to int
Jul 29, 2025
e0b7ec4
max and argmax on x86
Jul 29, 2025
e0e5d6a
arm max
Jul 29, 2025
829a1be
wip
Jul 29, 2025
2e4e413
x86 pool2d
Jul 29, 2025
df1bef9
fix arm ldr
Jul 29, 2025
96854ed
acc fix
Jul 29, 2025
b60cf1d
Merge remote-tracking branch 'local/asm-10' into asm-10
Jul 29, 2025
644f881
offset x29 if over the limit
Jul 29, 2025
ed4cfe7
make sure to xor before alu for a new destination
Jul 30, 2025
7d00b0d
xor on arm
Jul 30, 2025
8174dda
arm seem to sign extend that fixes the xor issue on x86, need more st…
Jul 30, 2025
d04c78f
pool running out of spill candidates because of too many acc
Jul 30, 2025
31c4592
acc do not reserve
Jul 30, 2025
957c8fb
arm can handle large stack size
Jul 30, 2025
5f1fcf5
sigmoid passes on x86
Jul 30, 2025
a55591a
idiv refactor on x86
Jul 30, 2025
8c3a101
more idiv fixes
Jul 31, 2025
b49267a
no more segfaults
Jul 31, 2025
43a3986
running out of regs, some are orphaned
Jul 31, 2025
f3aa62d
allocator pool
Aug 1, 2025
73b85d5
allocator pool cont'd
Aug 1, 2025
ae57e7d
allocator pool cont'd
Aug 1, 2025
2b6b262
allocator pool cont'd
Aug 1, 2025
e76e721
allocator pool cont'd
Aug 1, 2025
71d4a6e
allocator pool cont'd
Aug 1, 2025
a84ab83
var.load only render code
Aug 1, 2025
adf4660
ldr and str as function helper
Aug 1, 2025
2471ed6
release step is now consistent, idiv is very ugly
Aug 1, 2025
66feb85
pool and acquired sum is now consistent, idiv is very ugly
Aug 1, 2025
c256841
mod works
Aug 1, 2025
f73f076
split arm mod into idiv and alu
Aug 1, 2025
5df9920
ops.sub in arm
Aug 1, 2025
1dacee6
cast bool to float
Aug 2, 2025
0862319
cast bool to float cont'd
Aug 2, 2025
56ae153
cast op
Aug 2, 2025
db8d4dc
cast op
Aug 2, 2025
40bf4d0
print bytes before invoking program
Aug 2, 2025
8579390
cast op on x86, need to now fix uint64 idiv
Aug 2, 2025
616a1da
uint64 division
Aug 2, 2025
dfbb01d
test cmplt backward
Aug 2, 2025
f6e4c55
arm cast
Aug 2, 2025
1911156
bool need to be i32 when converting
Aug 2, 2025
212270e
arm use data section for uint
Aug 2, 2025
88b65c5
arm uint in data section need IReg
Aug 2, 2025
d22d515
bitcast and uint promote
Aug 2, 2025
98eb86b
fix uint overflow
Aug 2, 2025
7de6220
debug info and alignment on arm for data section
Aug 2, 2025
dff4699
fix uint overflow, udiv in arm
Aug 3, 2025
2133020
x86 idiv use xor for uint rdx
Aug 3, 2025
01b4a09
xorps instead of por
Aug 3, 2025
df3d575
uint8
Aug 3, 2025
4a61571
uints max
Aug 3, 2025
bd667d2
test uint min
Aug 3, 2025
7a01620
alu use only 32 bit
Aug 3, 2025
5077d32
x86 params greater than 6
Aug 3, 2025
deb2a61
x86 params offset
Aug 3, 2025
4b88db1
test param exceeding 5
Aug 3, 2025
9bcf831
x86 params use negative value to indicate stack params
Aug 4, 2025
5d59492
scatter reduce
Aug 4, 2025
5c47f9f
stack params only in x86
Aug 4, 2025
f35c996
define_acc uses assign_multiple
Aug 4, 2025
68d8d17
_where bool
Aug 4, 2025
9d1ca36
clear dst first before gated load
Aug 4, 2025
7e93f2c
arm already zero extend
Aug 4, 2025
ab4145c
f32 cast to f64
Aug 4, 2025
f6487cc
log output bytes
Aug 5, 2025
d1d49a2
failing interpolate due to uint8
Aug 5, 2025
519e4bd
unsigned mul for x86 uses rax
Aug 5, 2025
0c37169
gemm na on cpu
Aug 5, 2025
0531ef1
gemm
Aug 5, 2025
2cad65b
f16
Aug 5, 2025
f7c689f
fp16 limited support
Aug 5, 2025
5 changes: 3 additions & 2 deletions test/test_ops.py
@@ -14,7 +14,7 @@
if CI:
warnings.filterwarnings("ignore", message="Non-empty compiler output encountered")

-FORWARD_ONLY = getenv("FORWARD_ONLY", 0)
+FORWARD_ONLY = getenv("FORWARD_ONLY", 1)
PRINT_TENSORS = getenv("PRINT_TENSORS", 0)

def helper_test_op(shps, torch_fxn, tinygrad_fxn=None, atol=1e-6, rtol=1e-3, grad_atol=1e-4, grad_rtol=1e-3,
@@ -1284,7 +1284,8 @@ def test_small_gemm_range(self):
def test_small_gemm_eye(self):
helper_test_op(None, lambda x,y: x.matmul(y), lambda x,y: x@y, vals=[np.eye(8).astype(np.float32), np.eye(8).astype(np.float32)])
@unittest.skipIf(CI and Device.DEFAULT in ["NV", "LLVM", "GPU", "CUDA"] or IMAGE
-or (Device.DEFAULT == "WEBGPU" and platform.system() == "Windows"), "not supported on these in CI/IMAGE")
+or (Device.DEFAULT == "WEBGPU" and platform.system() == "Windows")
+or (Device.DEFAULT == "ASM"), "not supported on these in CI/IMAGE")
def test_gemm_fp16(self):
helper_test_op([(64,64), (64,64)], lambda x,y: x.half().matmul(y.half()), atol=5e-3, rtol=5e-3)
def test_gemm(self):
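
With the default flipped above, a plain test run skips the backward checks unless `FORWARD_ONLY=0` is set explicitly. A minimal sketch of that lookup follows; `getenv_int` is a hypothetical stand-in written for illustration and is not tinygrad's own `getenv` helper.

```python
import os

def getenv_int(key: str, default: int = 0) -> int:
    """Read an integer flag from the environment, falling back to `default`.

    Hypothetical helper for illustration only.
    """
    return int(os.environ.get(key, default))

# With the variable unset, the new default (1) selects forward-only testing.
os.environ.pop("FORWARD_ONLY", None)
forward_only_default = getenv_int("FORWARD_ONLY", 1)

# Setting FORWARD_ONLY=0 opts back in to the backward-pass checks.
os.environ["FORWARD_ONLY"] = "0"
forward_only_override = getenv_int("FORWARD_ONLY", 1)

# Leave the environment as we found it.
os.environ.pop("FORWARD_ONLY", None)
```

Anyone relying on gradient coverage from this test file would, under this assumption, need to export `FORWARD_ONLY=0` before running it.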
801 changes: 801 additions & 0 deletions test/test_ops_2.py

Large diffs are not rendered by default.

21 changes: 18 additions & 3 deletions tinygrad/device.py
@@ -283,7 +283,6 @@ def _offset(self, buf, size:int, offset:int): return from_mv(self._as_buffer(buf

# CPUProgram is a jit/shellcode program that can be just mmapped and jumped to
class CPUProgram:
-rt_lib = ctypes.CDLL(ctypes.util.find_library('System' if OSX else 'kernel32') if OSX or sys.platform == "win32" else 'libgcc_s.so.1')

def __init__(self, name:str, lib:bytes):
if sys.platform == "win32":
@@ -303,6 +302,8 @@ def __init__(self, name:str, lib:bytes):
# MAP_JIT allows us to easily flip pages from RW- to R-X and vice versa. It is a noop on intel cpus. (man pthread_jit_write_protect_np)
self.mem = mmap(-1, len(lib), MAP_ANON | MAP_PRIVATE | (MAP_JIT if OSX else 0), PROT_READ | PROT_WRITE | PROT_EXEC)

+if OSX or sys.platform == "win32":
+CPUProgram.rt_lib = ctypes.CDLL(ctypes.util.find_library('System' if OSX else 'kernel32') if OSX or sys.platform == "win32" else 'libgcc_s.so.1')
if OSX: CPUProgram.rt_lib.pthread_jit_write_protect_np(False)
self.mem.write(lib)
if OSX: CPUProgram.rt_lib.pthread_jit_write_protect_np(True)
@@ -311,19 +312,33 @@ def __init__(self, name:str, lib:bytes):
# libgcc_s comes as shared library but compiler-rt is only a bunch of static library archives which we can't directly load, but fortunately
# it somehow found its way into libSystem on macos (likely because it used __builtin_clear_cache) and libgcc_s is ~always present on linux
# Using ["name"] instead of .name because otherwise name is getting mangled: https://docs.python.org/3.12/reference/expressions.html#index-5
-CPUProgram.rt_lib["__clear_cache"](ctypes.c_void_p(mv_address(self.mem)), ctypes.c_void_p(mv_address(self.mem) + len(lib)))
+if hasattr(CPUProgram, "rt_lib"):
+CPUProgram.rt_lib["__clear_cache"](ctypes.c_void_p(mv_address(self.mem)), ctypes.c_void_p(mv_address(self.mem) + len(lib)))

self.fxn = ctypes.CFUNCTYPE(None)(mv_address(self.mem))

def __call__(self, *bufs, vals=(), wait=False):
args = list(bufs) + list(vals)
+if p:=os.environ.get("SAVE_BYTES"):
+for i, b in enumerate(bufs[1:]):
+print(f"Data {i+1}:")
+_bytes = bytes(b)
+print(", ".join([f"0x{_b:02x}" for _b in _bytes]))
+print()
# NOTE: replace this by --target={host's triple}-elf in clang args once we only support macos sequoia and later.
# Apple relaxes abi requirement for stack arguments to always be at least 8 byte aligned on arm64
# https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
# This hack is required because clang/llvm bug doesn't allow us to just use {host's triple}+'-elf' (relocation failures)
# The bug was fixed in https://github.com/llvm/llvm-project/commit/454cc36630296262cdb6360b60f90a64a97f7f1a but was only backported to xcode 16+
if platform.machine() == "arm64" and OSX: args = args[:8] + [ctypes.c_int64(a) if isinstance(a, int) else a for a in args[8:]]
-return cpu_time_execution(lambda: self.fxn(*args), enable=wait)
+ret = cpu_time_execution(lambda: self.fxn(*args), enable=wait)
+if p:=os.environ.get("SAVE_BYTES"):
+for i, b in enumerate(bufs[0:1]):
+print(f"Data {i}:")
+_bytes = bytes(b)
+print(", ".join([f"0x{_b:02x}" for _b in _bytes]))
+print()
+return ret

def __del__(self):
if sys.platform == 'win32': ctypes.windll.kernel32.VirtualFree(ctypes.c_void_p(self.mem), ctypes.c_size_t(0), 0x8000) #0x8000 - MEM_RELEASE
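
The `SAVE_BYTES` branches above print the input buffers before the call and the output buffer after it, each as a comma-separated list of hex bytes. A minimal standalone sketch of that formatting is shown here; `format_bytes` and `dump_buffers` are hypothetical names, and the sketch assumes every buffer is a bytes-like object.

```python
def format_bytes(buf) -> str:
    """Render a bytes-like object as comma-separated two-digit hex literals."""
    return ", ".join(f"0x{b:02x}" for b in bytes(buf))

def dump_buffers(bufs, start: int = 0) -> list:
    """Return the label and hex lines for each buffer, numbering from `start`.

    Hypothetical helper: it collects lines instead of printing so the
    result can be inspected or written to a file.
    """
    lines = []
    for i, b in enumerate(bufs, start=start):
        lines.append(f"Data {i}:")
        lines.append(format_bytes(b))
    return lines

# Inputs are numbered from 1 (buffer 0 is the output in the diff above).
example = dump_buffers([bytearray([1, 2]), b"\xff"], start=1)
```

Returning the lines rather than printing them keeps the two call sites (before and after execution) down to one shared helper.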
2 changes: 1 addition & 1 deletion tinygrad/opt/heuristic.py
@@ -81,7 +81,7 @@ def has_expanded_axis(shape, strides): return any(resolve(s > 1) and not resolve
# if last reduce dim is small(ish), loop unroll the reduce
upcast_size = prod(k.full_shape[a] for a in k.axes_of(AxisType.UPCAST, AxisType.UNROLL))
if k.unrollable_dims and (upcast_size <= 4 or not k.axes_of(AxisType.UNROLL)) and (upcast_size < 64):
-if (s:=k.full_shape[k.unrollable_dims[-1]]) <= 32:
+if (s:=k.full_shape[k.unrollable_dims[-1]]) <= 8:
k.apply_opt(Opt(OptOps.UNROLL, k.unrollable_dims[-1]-k.first_reduce, 0))
# if it's small, upcast a second reduce dimension too
if k.unrollable_dims and s <= 3 and k.full_shape[k.unrollable_dims[-1]] <= 3:
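
The effect of lowering the threshold from 32 to 8 can be sketched as a plain predicate. This is a simplified model of the condition in the hunk above, not the heuristic itself: `should_unroll` and its parameters are hypothetical, and the kernel object is reduced to three scalar inputs.

```python
def should_unroll(last_dim: int, upcast_size: int, has_unroll_axis: bool,
                  limit: int = 8) -> bool:
    """Decide whether the last reduce dimension is small enough to unroll.

    Mirrors the guard in the diff: the existing upcast/unroll product must
    be modest, and the candidate dimension must not exceed `limit`.
    """
    # Outer guard: upcast_size <= 4, or no UNROLL axis applied yet,
    # and in either case the product must stay below 64.
    if not ((upcast_size <= 4 or not has_unroll_axis) and upcast_size < 64):
        return False
    # Inner guard: the dimension itself must be at most `limit`.
    return last_dim <= limit

# A reduce dimension of 16 was unrolled under the old limit but not the new one.
old_behaviour = should_unroll(16, 1, False, limit=32)
new_behaviour = should_unroll(16, 1, False, limit=8)
```

Under this model, reduce dimensions between 9 and 32 stop being fully unrolled, which matches the "unroll limit size to 8" commit in the list above.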