
Commit 64b087c

Author: Ralf Waldukat (committed)

Fix flash_attn default to match upstream AUTO behavior

Critical fixes from code review:
- server/settings.py: Change flash_attn default from False to None (AUTO). Upstream llama.cpp defaults to LLAMA_FLASH_ATTN_TYPE_AUTO; the server was incorrectly forcing DISABLED, blocking the optimization for models that need it.
- llama_cpp.py: Consistent stub style (pass -> ...) for llama_max_tensor_buft_overrides.
- CMakeLists.txt: Document the version workaround for the mtmd build.

1 parent 77b13a4 commit 64b087c

File tree: 3 files changed, +5 −2 lines changed

CMakeLists.txt (3 additions, 0 deletions)

@@ -154,6 +154,9 @@ if (LLAMA_BUILD)
     endif()

     # Set version for mtmd (required by upstream CMakeLists.txt)
+    # NOTE: This is a workaround for mtmd build requirements.
+    # Version is set to 0.0.0 for local builds. If upstream adds version
+    # compatibility checks, this may need to match llama.cpp version.
     if (NOT DEFINED LLAMA_BUILD_NUMBER)
         set(LLAMA_BUILD_NUMBER 0)
     endif()

llama_cpp/llama_cpp.py (1 addition, 1 deletion)

@@ -1400,7 +1400,7 @@ def llama_supports_rpc() -> bool: ...
 @ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
 def llama_max_tensor_buft_overrides() -> int:
     """Get maximum number of tensor buffer type overrides"""
-    pass
+    ...


 # LLAMA_API enum llama_params_fit_status llama_params_fit(
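The `pass` → `...` change is stylistic but meaningful: in this binding layer the decorated function body is never executed, because the decorator replaces it with a dispatcher into the shared library, and `...` signals "intentional stub" more clearly than `pass`. A minimal sketch of the pattern, with a hypothetical `ctypes_function_sketch` decorator standing in for the library's real `@ctypes_function`:

```python
import ctypes
from typing import Callable

def ctypes_function_sketch(name: str, argtypes: list, restype):
    """Hypothetical stand-in for llama-cpp-python's @ctypes_function.

    The decorated stub's body is never run; the decorator returns a
    wrapper that would dispatch to the C function in the shared library.
    """
    def decorator(stub: Callable) -> Callable:
        def wrapper(*args):
            # A real binding would do: getattr(lib, name)(*args).
            # Here we just demonstrate that the stub body is unreachable.
            return f"dispatch {name}{args}"
        wrapper.__doc__ = stub.__doc__  # keep the stub's docstring
        return wrapper
    return decorator

@ctypes_function_sketch("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...  # stub body; "..." marks it as intentionally empty
```

Because the body is unreachable, `...` (like in the `llama_supports_rpc` stub visible in the hunk header) is the idiomatic placeholder.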

llama_cpp/server/settings.py (1 addition, 1 deletion)

@@ -104,7 +104,7 @@ class ModelSettings(BaseSettings):
         default=True, description="Whether to offload kqv to the GPU."
     )
     flash_attn: Optional[bool] = Field(
-        default=False,
+        default=None,
         description="Use flash attention. None=auto, True=enabled, False=disabled.",
     )
     # Sampling Params
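The point of the tri-state default can be sketched as a small mapping from the server setting onto upstream's flash-attention mode. The integer values below are illustrative stand-ins, not the actual `LLAMA_FLASH_ATTN_TYPE_*` enum values from llama.cpp:

```python
from typing import Optional

# Illustrative stand-ins for llama.cpp's LLAMA_FLASH_ATTN_TYPE_* enum;
# the real constants are defined in the C API.
FLASH_ATTN_AUTO = -1
FLASH_ATTN_DISABLED = 0
FLASH_ATTN_ENABLED = 1

def resolve_flash_attn(flash_attn: Optional[bool]) -> int:
    """Map the server's tri-state setting onto the upstream mode:
    None -> AUTO (let llama.cpp decide), True -> ENABLED, False -> DISABLED."""
    if flash_attn is None:
        return FLASH_ATTN_AUTO
    return FLASH_ATTN_ENABLED if flash_attn else FLASH_ATTN_DISABLED

# With default=None the server no longer forces DISABLED:
print(resolve_flash_attn(None))   # AUTO: upstream picks per model
print(resolve_flash_attn(False))  # explicit disable is still possible
```

With the old `default=False`, every model silently landed in the DISABLED branch; `default=None` restores upstream's AUTO path while keeping explicit `True`/`False` overrides available.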

0 commit comments