chore(O): remove SIMD/NEON backend and BackendType::Metal #21
Merged
dexwritescode merged 5 commits into main on May 6, 2026
Conversation
Delete cpu_buffer.h, compute_backend.mm (an abandoned Obj-C++ draft), and the simd/ directory. Remove the misleadingly-named simd_graph() and metal_graph() helpers, and drop BackendType::Metal from the enum. Apple Silicon → MLX only. Linux/Windows → CUDA/ROCm (future phases). No CPU SIMD or bare Metal backend will be added for LLM-scale inference.
…backs
Phase O cleanup: the Tensor/BackendBuffer and ComputeGraph/ComputeGraphBuilder
abstractions were bypassed entirely on Apple Silicon (all three model families
used mlx_weights_ directly). Removing them deletes ~7,100 lines of dead code and
leaves ComputeBackend as a thin lifecycle handle only.
Removed:
- core/tensor.{h,cpp}, core/graph.{h,cpp}
- backends/mlx/mlx_buffer.h, mlx_utils.h
- model/kv_cache.h
- model/gemma_model{,_base}.{h,cpp}
- model/qwen3_moe_model{,_base}.{h,cpp}
- tests/compute/test_symbolic_api.cpp, test_mlx_backend.cpp
Simplified:
- ComputeBackend: 5 lifecycle methods only (type/name/is_available/initialize/cleanup)
- MlxBackend: implements those 5 methods; ~730 lines of Tensor ops deleted
- LlamaModel, GemmaModelMLX, Qwen3MoeModelMLX: removed inheritance from base
Tensor-path classes; MLX classes own config_ and tokenizer_ directly
- ModelLoader: load_model()/load_all_safetensors() removed; load_model_mlx() kept
- language_model.cpp: Gemma/Qwen3MoE dispatch is now MLX-only
- BackendType::Metal removed (vestigial, never instantiated)
- Tests updated to remove calls to deleted APIs (forward(), attention_layer(),
wrap_native_tensor(), load_model(backend))
…ethods

ComputeBackend is now a pure lifecycle abstraction: type(), name(), is_available(), initialize(), cleanup(). All ~40 Tensor-based math methods (matmul, dequantize, rope, softmax, sdpa, etc.) are removed from the interface and from MlxBackend. GemmaModelMLX and Qwen3MoeModelMLX no longer inherit from their Tensor-based base classes; config_ and tokenizer_ are owned directly. ModelLoader no longer exposes load_model() or load_all_safetensors().
- ErrorCode: remove InvalidArgument, InsufficientMemory, TensorNotFound, NotImplemented — none were ever returned in production code
- ComputeBackend: remove preferred_batch_size() and supports_async() — declared and overridden in MlxBackend but never called by any client
- ModelConfig: remove name_or_path and transformers_version — parsed from JSON but never read after parsing
- LlamaModel: remove context_size_ member — set in mlx_setup(), never read
- Qwen3MoeModelMLX: remove context_size_ member — same pattern
- Delete tinyllama_inference.h/.cpp — Phase D compatibility alias no longer needed; update 5 test files to use LlamaModel directly
- Delete test_attention_qkv_trace.cpp — became an empty placeholder after attention_layer() was removed in Phase O
Summary
- Delete cpu_buffer.h, compute_backend.mm (abandoned Obj-C++ draft), and the simd/ directory
- Remove simd_graph() and metal_graph() convenience helpers from graph.h
- Drop BackendType::Metal from the enum — updates compute_backend.cpp, neurons_service.cpp, and the non-MLX mock in test_model_loader.cpp
- Apple Silicon → MLX only. Linux/Windows → CUDA/ROCm (future phases). No CPU SIMD fallback or bare Metal backend will exist for LLM-scale inference.