auto-derive max_req_total_len from model config #1297
Owleye4 wants to merge 10 commits into ModelTC:main from
Conversation
Code Review
This pull request implements automatic derivation of max_req_total_len from model configurations, replacing the previous hardcoded default. It introduces logic to handle various RoPE scaling types and ensures consistency across processes by publishing the effective, KV-cache-clamped value via shared memory. Documentation has been updated to reflect these changes. Feedback suggests using canonical paths for cached configuration lookups, defining the safety margin for token clamping as a named constant, and considering a more dynamic cap for CUDA graph capture lengths.
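A minimal sketch of how such a derivation could look, assuming a parsed `config.json` dict; the helper name, the handled scaling types, and the fallback constant below are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of the derivation described above; field names follow
# common Hugging Face config.json conventions, and the fallback constant is a
# placeholder rather than the PR's actual default.
DEFAULT_MAX_REQ_TOTAL_LEN = 16384


def derive_max_req_total_len(config: dict) -> int:
    """Derive a request length limit from a parsed model config dict."""
    # Prefer an explicit sequence-length field when present.
    base_len = config.get("max_sequence_length") or config.get("max_position_embeddings")
    if not base_len:
        return DEFAULT_MAX_REQ_TOTAL_LEN

    # Some RoPE scaling types (e.g. linear or dynamic NTK) extend the usable
    # context by a multiplicative factor.
    rope_scaling = config.get("rope_scaling") or {}
    scaling_type = rope_scaling.get("type") or rope_scaling.get("rope_type")
    if scaling_type in ("linear", "dynamic"):
        base_len = int(base_len * rope_scaling.get("factor", 1.0))

    return int(base_len)
```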
Code Review
This pull request implements automatic derivation of the --max_req_total_len parameter from model configurations, such as max_sequence_length or max_position_embeddings adjusted by RoPE scaling factors. It introduces logic to soft-clamp this value against the actual KV-cache pool capacity and uses shared memory to synchronize the effective limit across different server processes. Additionally, the changes include early S3 model preparation and updates to documentation. Feedback was provided suggesting that a hardcoded margin of 8 used during KV pool clamping should be replaced with a named constant to improve code maintainability.
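As a sketch of the suggested change, the hardcoded margin could become a named constant; the identifiers below are illustrative rather than the PR's actual ones, and the capacity figure would come from the server's KV-pool sizing:

```python
# Illustrative names; the real capacity value comes from the KV-cache pool sizing logic.
KV_POOL_CLAMP_MARGIN = 8  # the hardcoded "8" the review suggests turning into a named constant


def clamp_to_kv_pool(derived_len: int, kv_pool_token_capacity: int) -> int:
    """Soft-clamp the derived limit so it never exceeds the KV-cache pool capacity."""
    return min(derived_len, kv_pool_token_capacity - KV_POOL_CLAMP_MARGIN)
```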
Force-pushed from 127de7b to 698e0b7
Auto-derive `max_req_total_len` from `model_dir/config.json` at API start time. If derivation fails, fall back to the previous default value to keep existing behavior.
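A rough sketch of this start-up flow, assuming the user did not set the flag explicitly; the function name and fallback constant are hypothetical, and `derive_max_req_total_len` refers to the derivation sketch earlier in this thread:

```python
import json
import os

# Placeholder; the real fallback is whatever default the server used before this PR.
FALLBACK_MAX_REQ_TOTAL_LEN = 16384


def resolve_max_req_total_len(model_dir: str) -> int:
    """Read model_dir/config.json and derive the limit, falling back on any failure."""
    try:
        with open(os.path.join(model_dir, "config.json"), "r") as f:
            config = json.load(f)
        return derive_max_req_total_len(config)  # see the derivation sketch above
    except Exception:
        # Any read/parse/derivation error keeps the previous default behavior.
        return FALLBACK_MAX_REQ_TOTAL_LEN
```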