Conversation
debermudez left a comment
This seems surprisingly lightweight and straightforward.
I think the only piece really missing is an update to the unit tests.
I would also like to set this up to test it e2e too.
tgerdesnv left a comment
I'd really like to see an E2E test protecting this. Not sure if the CI is ready for that yet, but a high-priority ticket should be created at a minimum.
  if args.output_tokens_mean_deterministic:
      parser.error(
-         "The --output-tokens-mean-deterministic option is only supported with the Triton service-kind."
+         "The --output-tokens-mean-deterministic option is only supported with the kserve endpoint type."
Should the code be changed to check endpoint_type != kserve? I know that with the current code it is the same result, but it introduces an assumption (endpoint kserve -> service_kind triton) that could trip up a future developer.
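A minimal sketch of the suggested check, assuming the parsed args expose an `endpoint_type` attribute (names taken from the surrounding diff; illustrative only):

```python
# Sketch: validate against the user-facing endpoint type rather than the
# derived service_kind, so the check does not depend on the
# kserve -> triton mapping staying one-to-one.
if args.output_tokens_mean_deterministic and args.endpoint_type != "kserve":
    parser.error(
        "The --output-tokens-mean-deterministic option is only supported "
        "with the kserve endpoint type."
    )
```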
      default="tensorrtllm",
      required=False,
-     help=f'When using the "triton" service-kind, '
+     help=f'When using the "kserve" endpoint type, '
Can the generate endpoint not use trtllm vs. vllm?
It can - I haven't added any different behavior for the different backends. Actually, it has only been tested against vllm at the moment, so this is a fair point ...
Let me move this back to draft - I plan to test trt-llm in the next week or so.
  void
  ChatCompletionRequest::SendResponse(bool is_final, bool is_null)
  {
    // if final response has already been sent
The classes in this file should be renamed since they aren't specific to Chat Completions.
I think "HTTP with SSE Support" is in the end what it is .... not sure the best name.
I'd really like to see the classes refactored. We shouldn't need two independent, full HTTP clients: either one goes away, or we get a base class and then some really thin implementation classes on top. We already have stories for this (TMA-1644), so no big deal if this is ignored in this PR.
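For illustration, the shape being suggested, sketched in Python for brevity (the actual clients are C++ in PA; all names here are hypothetical):

```python
# Hypothetical refactor shape: one shared HTTP/SSE client plus thin
# per-endpoint subclasses that only build payloads and parse responses.
class BaseSseHttpClient:
    def send(self, payload: dict, stream: bool = False) -> str:
        # shared connection handling, SSE parsing, and final-response
        # bookkeeping would live here
        raise NotImplementedError


class ChatCompletionsClient(BaseSseHttpClient):
    def build_payload(self, prompt: str) -> dict:
        return {"messages": [{"role": "user", "content": prompt}]}


class GenerateClient(BaseSseHttpClient):
    def build_payload(self, prompt: str) -> dict:
        return {"text_input": prompt}
```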
  @@ -731,7 +748,10 @@ def _extract_openai_text_output(self, response: str) -> str:

      def _is_openai_empty_response(self, response: str) -> bool:
We should change the name of the function, since it's no longer just OpenAI.
  )

  endpoint_group.add_argument(
      "--service-kind",
Finally, one less CLI option 😄 Can we also update the README to reflect the changes in CLI options?
@nnshah1 what's blocking this PR from being marked ready for review?
It's out of date from some of our other work now and needs to be ported to the new repository at a bare minimum. I really like the additions here, so I would like to see them integrated soon as well.
This PR does two main things:
1. Add support for Triton's generate endpoint. This reuses the PA implementation for the OpenAI HTTP client, as it supports text in / text out and streaming. The format of the input message is similar to completions, but uses "text_input" and "text_output" instead of "prompt" (see the payload sketch at the end of this description).
Remove "service-kind" parameter from top level cli. Service kind can be inferred from
endpoint-typeandendpoint-typeis more clear.endpoint-typeis tied to the API and not the implementation. service kind "openAI' vs "triton" also was not parallel as "openAI" is an API and "triton" is a server. As the PA implementation is tied to service-kind - this change is only at the genai-perf level, and internally service-kind is still present it is just set based onendpoint-type. To facillitate a newendpoint-typeofkservewas added.Existing Tests have been updated.
No new tests have been added; they could be added here or in a separate PR.
Note: most changes are in genai-perf, but a small change was added to PA to allow using the end of a request as a completion event even in streaming cases. Since generate doesn't include an explicit done message, we use the end of the request as the indication of done.
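For reference, a rough sketch of a generate request under this scheme, written as a Python dict for consistency with the other sketches above (only the "text_input"/"text_output" field names come from the description; everything else is illustrative):

```python
# Illustrative generate payload: same shape as a completions request,
# but with "prompt" replaced by "text_input".
generate_request = {
    "text_input": "What is deep learning?",
    "stream": True,  # streamed responses end with the end of the request,
                     # since generate sends no explicit done message
}

# The generated text comes back under "text_output", e.g. (fields assumed):
# {"model_name": "my_model", "text_output": "Deep learning is ..."}
```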