feat: Support llm-d manual runs #16
albertoperdomo2 wants to merge 3 commits into openshift-psap:main from
Conversation
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
📝 Walkthrough
Adds LLM-D deployment mode to the JSON import script, introducing deployment metadata fields (DP, EP, replicas, prefill_pod_count, decode_pod_count, router_config, notes), CLI flags and validation, propagated parameters through parsing/processing, and conditional CSV headers/rows including the new columns.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
manual_runs/scripts/import_manual_runs_json_v2.py (2)
28-52: Docstring missing documentation for new parameters.
The new LLM-D parameters (`dp`, `ep`, `replicas`, `prefill_pod_count`, `decode_pod_count`, `router_config`) are not documented in the Args section of the docstring.

📝 Suggested docstring addition

```diff
 Args:
     benchmark: Benchmark data from JSON (guidellm 0.5.x format).
     accelerator: Accelerator type (e.g., H200, MI300X).
     model_name: Name of the AI model.
     version: Version of the inference server.
     tp_size: Tensor parallelism size.
     runtime_args: Runtime configuration arguments.
     global_data_config: Global data configuration from top-level args.
     image_tag: Container image tag used for the run.
     guidellm_version: Version of guidellm used to run the benchmark.
     guidellm_start_time_ms: Aggregated start time in milliseconds.
     guidellm_end_time_ms: Aggregated end time in milliseconds.
+    dp: Data parallelism size (LLM-D mode).
+    ep: Expert parallelism size (LLM-D mode).
+    replicas: Number of replicas (LLM-D mode).
+    prefill_pod_count: Number of prefill pods (LLM-D mode).
+    decode_pod_count: Number of decode pods (LLM-D mode).
+    router_config: Router/endpoint picker configuration (LLM-D mode).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 28 - 52, Add docstring entries for the new LLM-D parameters (dp, ep, replicas, prefill_pod_count, decode_pod_count, router_config) in the function's Args section: describe each parameter's expected type and purpose (e.g., dp: int or None — data parallel size; ep: int or None — expert/experts or expert parallelism; replicas: int or None — number of model replicas; prefill_pod_count: int or None — number of pods used for prefill stage; decode_pod_count: int or None — number of pods used for decode stage; router_config: dict or None — routing configuration for request distribution), mark optional parameters as None default, and keep formatting consistent with the existing docstring style used in this function.
199-212: Docstring missing documentation for new parameters.
Similar to `process_benchmark_section`, the new LLM-D parameters should be documented in the Args section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 199 - 212, The docstring for the function that starts with "Parse guidellm 0.5.x JSON benchmark results." is missing entries for the new LLM-D parameters; update its Args section to document each new parameter exactly as done in process_benchmark_section (include parameter name, type, and brief description for LLM-D specific fields such as any tokenizer/sequence/prompt config, temperature/beam/candidate settings, or other runtime flags added), ensuring names match the function signature and runtime_args/guidellm_version/other existing params are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 426-476: The LLM-D branch (when args.llm_d is true) is missing the
guidellm_start_time_ms and guidellm_end_time_ms columns, so rows produced by
parse_guidellm_json (which computes and adds guidellm_start_time_ms and
guidellm_end_time_ms) are truncated when written; update the fieldnames list
inside the args.llm_d block to include guidellm_start_time_ms and
guidellm_end_time_ms (matching the standard-mode fieldnames) so those timing
fields are preserved in the CSV output.
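The fieldnames fix above can be sketched by deriving the LLM-D header from the standard one, so shared columns (including the timing fields) cannot drift out of sync. This is a hypothetical reconstruction: the column names come from the review text and the standard-mode list is abbreviated, since the script's actual fieldnames are longer.

```python
# Sketch: build the LLM-D CSV header from the standard header instead of
# maintaining two independent lists. Lists are abbreviated for illustration.
STANDARD_FIELDNAMES = [
    "accelerator", "model_name", "version", "tp_size",
    "guidellm_start_time_ms", "guidellm_end_time_ms",
]
LLM_D_EXTRA = [
    "DP", "EP", "replicas",
    "prefill_pod_count", "decode_pod_count", "router_config",
]

def build_fieldnames(llm_d: bool) -> list[str]:
    """Return the CSV header for the selected mode; the timing columns
    are always present because they live in the shared base list."""
    return STANDARD_FIELDNAMES + LLM_D_EXTRA if llm_d else list(STANDARD_FIELDNAMES)
```

Structuring the two modes this way makes the truncation flagged above impossible by construction, rather than relying on the two literal lists staying in step.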
---
Nitpick comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 28-52: Add docstring entries for the new LLM-D parameters (dp, ep,
replicas, prefill_pod_count, decode_pod_count, router_config) in the function's
Args section: describe each parameter's expected type and purpose (e.g., dp: int
or None — data parallel size; ep: int or None — expert/experts or expert
parallelism; replicas: int or None — number of model replicas;
prefill_pod_count: int or None — number of pods used for prefill stage;
decode_pod_count: int or None — number of pods used for decode stage;
router_config: dict or None — routing configuration for request distribution),
mark optional parameters as None default, and keep formatting consistent with
the existing docstring style used in this function.
- Around line 199-212: The docstring for the function that starts with "Parse
guidellm 0.5.x JSON benchmark results." is missing entries for the new LLM-D
parameters; update its Args section to document each new parameter exactly as
done in process_benchmark_section (include parameter name, type, and brief
description for LLM-D specific fields such as any tokenizer/sequence/prompt
config, temperature/beam/candidate settings, or other runtime flags added),
ensuring names match the function signature and
runtime_args/guidellm_version/other existing params are preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8c34a60d-9fcd-4e64-96af-b02dfcabf30a
📒 Files selected for processing (1)
manual_runs/scripts/import_manual_runs_json_v2.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
manual_runs/scripts/import_manual_runs_json_v2.py (1)
413-529: ⚠️ Potential issue | 🟡 Minor: Data loss risk when appending CSV files with different modes.
The script supports two CSV modes: standard (default) and `--llm-d`. When appending to an existing CSV file (line 414), if the existing file was created with a different mode, the `combined_df[fieldnames]` filtering at line 529 will silently drop all columns not in the current mode's fieldnames list. For example: running in standard mode on an LLM-D CSV will drop DP, EP, replicas, prefill_pod_count, decode_pod_count, and router_config columns from all rows.
Note: The `--llm-d` flag is not documented in the README and requires explicit parameters (`--dp`, `--ep`, `--replicas`, `--router-config`), reducing the likelihood of accidental mode mixing. However, the vulnerability exists and should be mitigated to prevent silent data loss. Consider adding a schema mismatch check when appending:
🛡️ Proposed fix: Add schema mismatch detection
```diff
 if os.path.exists(args.csv_file):
     print(f"Appending {len(new_data_df)} new rows to {args.csv_file}...")
     existing_df = pd.read_csv(args.csv_file)
+    # Detect schema mismatch
+    llm_d_columns = {"DP", "EP", "replicas", "prefill_pod_count", "decode_pod_count", "router_config"}
+    existing_has_llm_d = bool(llm_d_columns & set(existing_df.columns))
+    if existing_has_llm_d != args.llm_d:
+        print(f"Warning: Existing CSV {'has' if existing_has_llm_d else 'lacks'} LLM-D columns, "
+              f"but current mode is {'LLM-D' if args.llm_d else 'standard'}. "
+              f"Some columns may be dropped or added with null values.")
     combined_df = pd.concat([existing_df, new_data_df], ignore_index=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 413 - 529, Appending to an existing CSV ignores schema differences between modes and then drops columns by reindexing to the current mode's fieldnames; detect and prevent this by comparing the existing_df.columns set to the current mode fieldnames (use args.llm_d to choose the expected list), and if there is a mismatch raise/print a clear error or merge-safe warning instead of blindly doing combined_df = combined_df[fieldnames]; update the logic around existing_df/combined_df and the fieldnames construction (refer to variables fieldnames, existing_df, combined_df, and args.llm_d) to either (a) preserve all existing columns by unioning fieldnames with existing_df.columns before reindexing, or (b) abort with a schema-mismatch message that explains which columns differ and suggests running with the matching --llm-d flag.
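Option (b) above, aborting on a schema mismatch, can be sketched without pandas. This is a hypothetical helper, not the script's actual code: `LLM_D_COLUMNS` mirrors the column names quoted in the review, and the real script may structure the check differently.

```python
# Illustrative sketch of a schema-mismatch check before appending rows
# to an existing CSV. Column names are taken from the review text.
LLM_D_COLUMNS = {"DP", "EP", "replicas",
                 "prefill_pod_count", "decode_pod_count", "router_config"}

def check_schema(existing_columns, llm_d_mode: bool) -> list[str]:
    """Return the mode-specific columns that would be silently dropped
    (or null-filled) given the existing CSV header and the requested
    mode. An empty list means the schemas agree and append is safe."""
    existing_has_llm_d = bool(LLM_D_COLUMNS & set(existing_columns))
    if existing_has_llm_d == llm_d_mode:
        return []  # same mode on both sides: nothing is lost
    # Mismatch: either LLM-D columns exist and would be dropped
    # (standard mode), or they are all about to be added as nulls.
    affected = sorted(LLM_D_COLUMNS & set(existing_columns))
    return affected or sorted(LLM_D_COLUMNS)
```

The caller would raise or print the warning the review proposes whenever the returned list is non-empty, naming the affected columns and suggesting the matching `--llm-d` flag.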
🧹 Nitpick comments (1)
manual_runs/scripts/import_manual_runs_json_v2.py (1)
28-51: Docstrings lack documentation for new parameters.
The new LLM-D parameters (`dp`, `ep`, `replicas`, `prefill_pod_count`, `decode_pod_count`, `router_config`) are not documented in the function docstrings. Consider adding Args entries for completeness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 28 - 51, The docstring for the function that processes a single benchmark section is missing entries for the new LLM-D parameters (dp, ep, replicas, prefill_pod_count, decode_pod_count, router_config); update the Args block in that function's docstring to add a short description and expected type for each of these parameters (e.g., dp: data parallelism size (int), ep: expert parallelism (int), replicas: number of model replicas (int), prefill_pod_count: pods used for prefill (int), decode_pod_count: pods used for decode (int), router_config: routing configuration dict/str) so the docstring remains complete and consistent with other parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 413-529: Appending to an existing CSV ignores schema differences
between modes and then drops columns by reindexing to the current mode's
fieldnames; detect and prevent this by comparing the existing_df.columns set to
the current mode fieldnames (use args.llm_d to choose the expected list), and if
there is a mismatch raise/print a clear error or merge-safe warning instead of
blindly doing combined_df = combined_df[fieldnames]; update the logic around
existing_df/combined_df and the fieldnames construction (refer to variables
fieldnames, existing_df, combined_df, and args.llm_d) to either (a) preserve all
existing columns by unioning fieldnames with existing_df.columns before
reindexing, or (b) abort with a schema-mismatch message that explains which
columns differ and suggests running with the matching --llm-d flag.
---
Nitpick comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 28-51: The docstring for the function that processes a single
benchmark section is missing entries for the new LLM-D parameters (dp, ep,
replicas, prefill_pod_count, decode_pod_count, router_config); update the Args
block in that function's docstring to add a short description and expected type
for each of these parameters (e.g., dp: data parallelism size (int), ep: expert
parallelism (int), replicas: number of model replicas (int), prefill_pod_count:
pods used for prefill (int), decode_pod_count: pods used for decode (int),
router_config: routing configuration dict/str) so the docstring remains complete
and consistent with other parameters.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb9730ea-0526-4f4f-a6f9-c31219aef13d
📒 Files selected for processing (1)
manual_runs/scripts/import_manual_runs_json_v2.py
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
manual_runs/scripts/import_manual_runs_json_v2.py (1)
28-52: Docstring missing documentation for new parameters.
The new parameters (`dp`, `ep`, `replicas`, `prefill_pod_count`, `decode_pod_count`, `router_config`, `notes`) are not documented in the function docstring. Consider adding them for completeness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 28 - 52, Update the function docstring for process_benchmark_section to document the newly added parameters dp, ep, replicas, prefill_pod_count, decode_pod_count, router_config, and notes; for each parameter add a short one-line description and expected type (e.g., int, dict, str, or None) and include any semantic meaning (e.g., dp/ep are data/engine parallel sizes, replicas is number of server replicas, prefill_pod_count/decode_pod_count are pod counts for prefill/decode stages, router_config is routing settings, notes is freeform metadata). Ensure these entries follow the existing Args style and placement with the other parameters (accelerator, model_name, etc.) in the same docstring block for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 273-279: The call to process_benchmark_section is missing the
notes argument so the CLI --notes value parsed in parse_guidellm_json is never
forwarded; update the call that currently passes dp, ep, replicas,
prefill_pod_count, decode_pod_count, router_config (the invocation inside
parse_guidellm_json) to include notes=notes so the notes parameter is propagated
into process_benchmark_section and stored on each row.
---
Nitpick comments:
In `@manual_runs/scripts/import_manual_runs_json_v2.py`:
- Around line 28-52: Update the function docstring for process_benchmark_section
to document the newly added parameters dp, ep, replicas, prefill_pod_count,
decode_pod_count, router_config, and notes; for each parameter add a short
one-line description and expected type (e.g., int, dict, str, or None) and
include any semantic meaning (e.g., dp/ep are data/engine parallel sizes,
replicas is number of server replicas, prefill_pod_count/decode_pod_count are
pod counts for prefill/decode stages, router_config is routing settings, notes
is freeform metadata). Ensure these entries follow the existing Args style and
placement with the other parameters (accelerator, model_name, etc.) in the same
docstring block for consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b23b331d-2b0f-41d7-940e-fbb39bf04705
📒 Files selected for processing (1)
manual_runs/scripts/import_manual_runs_json_v2.py
```python
    dp=dp,
    ep=ep,
    replicas=replicas,
    prefill_pod_count=prefill_pod_count,
    decode_pod_count=decode_pod_count,
    router_config=router_config,
)
```
Missing `notes` parameter in call to `process_benchmark_section`.
The `notes` parameter is accepted by `parse_guidellm_json` (line 200) but is never forwarded to `process_benchmark_section`. This means the CLI `--notes` value will be silently ignored and all rows will have `notes=None`.
🐛 Proposed fix to forward the notes parameter

```diff
     dp=dp,
     ep=ep,
     replicas=replicas,
     prefill_pod_count=prefill_pod_count,
     decode_pod_count=decode_pod_count,
     router_config=router_config,
+    notes=notes,
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    dp=dp,
    ep=ep,
    replicas=replicas,
    prefill_pod_count=prefill_pod_count,
    decode_pod_count=decode_pod_count,
    router_config=router_config,
    notes=notes,
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@manual_runs/scripts/import_manual_runs_json_v2.py` around lines 273 - 279,
The call to process_benchmark_section is missing the notes argument so the CLI
--notes value parsed in parse_guidellm_json is never forwarded; update the call
that currently passes dp, ep, replicas, prefill_pod_count, decode_pod_count,
router_config (the invocation inside parse_guidellm_json) to include notes=notes
so the notes parameter is propagated into process_benchmark_section and stored
on each row.
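The accepted-but-unforwarded keyword bug described above can be reproduced in miniature. This is an illustrative sketch, not the script's code: the real signatures take many more parameters, and the two-parameter functions below only stand in for the call chain the review names.

```python
# Minimal reproduction of the bug pattern: the outer function accepts
# `notes` but forgets to pass it along, so every row gets notes=None.
def process_benchmark_section(dp=None, notes=None):
    # Stand-in for the row-building function; returns one "row".
    return {"DP": dp, "notes": notes}

def parse_guidellm_json_buggy(dp=None, notes=None):
    # `notes` is silently dropped here, matching the flagged call site.
    return process_benchmark_section(dp=dp)

def parse_guidellm_json_fixed(dp=None, notes=None):
    # The proposed fix: forward notes explicitly.
    return process_benchmark_section(dp=dp, notes=notes)
```

Because `notes` defaults to `None` in the inner function, the buggy version raises no error; the value just never reaches the output, which is why the review calls the failure silent.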
Add support for llm-d manual runs with the necessary new fields.