
fix(openvino): define PartialShape bounds for tensors #21637

Draft
thedanhoffman wants to merge 2 commits into ggml-org:master from thedanhoffman:ov_partial_shape

Conversation

@thedanhoffman
Contributor

Overview

Bound PartialShape dimensions to be within some reasonable range. This is sufficient to fix an OpenCL allocation issue observed by running #20938 with a GPU.

Even in the case with no functional problems, these shape bounds help inform the OpenVINO graph compiler.

Additional information

Discussed offline with @cavusmustafa. The underlying issue will probably be root-caused to an OpenVINO upstream issue, but if following "best practices" on the API-user side resolves it, I don't see much of a problem with this workaround.

export GGML_OPENVINO_STATEFUL_EXECUTION=1

/home/dhoffman/llama_openvino/llama.cpp/build/bin/llama-cli --verbose -m /home/dhoffman/models/Qwen3-4B-Instruct-2507-Q4_0.gguf -ngl 99 -np 1 -c 8192 -ctk f16 -ctv f16 -ctkd f16 -ctvd f16 --flash-attn on --jinja --reasoning-format deepseek --simple-io --single-turn -n 16 -p hello --no-warmup -c 16

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: All code was written by me personally, but I did throw an AI at the debug process. Again, I work at Intel and have discussed this with the owners in person, so the changes are not unreasonable

if (m_is_static) {
    input_shape = ov::PartialShape{1, 1, 1, m_compute_params.output_len};
} else {
    input_shape = ov::PartialShape{1, 1, 1, ov::Dimension(1, m_compute_params.output_len)};
}
Contributor

The changes look good to me, but why do we need to remove the one-liners for this?

For example, can we still use something like this?

input_shape = ov::PartialShape{1, 1, 1,
    m_is_static ? m_compute_params.output_len : ov::Dimension(1, m_compute_params.output_len)};

Contributor Author


Yeah, I can change it back to how it was.

if (!m_is_static) {
    // do not fix ctx size to make llama-bench work across test params
    input_shape[2] = dim_span_ctx;  // previously: input_shape[2] = -1;
}
Contributor


llama-bench failed on my side for larger context sizes and for all stateful executions.
Tested with Llama 3.2 1B Q4_0.

stateless: -d 512,1024 fails
stateful: all ctx sizes fail

Contributor Author


Investigating this now.

Contributor Author


The cache wasn't being properly invalidated, and there's no easy way (AFAICT) to get the maximum possible batch size.

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenVINO labels Apr 10, 2026
@thedanhoffman
Contributor Author

The cache wasn't being invalidated for cases where the batch size changes. Adjusted ComputeParams to accept the maximum observed prefill size and invalidate the cache when that maximum increases.

I'm hoping that for most "real" workloads there will be some sort of system prompt that exceeds whatever batch size is configured, so there shouldn't be a performance hit in this case (except maybe post-warmup, but that's a one-time cost, so not a big deal).

@thedanhoffman thedanhoffman marked this pull request as draft April 10, 2026 05:19
@thedanhoffman
Contributor Author

Also went ahead and moved this to draft, since how we go about invalidating the cache might be a broader architectural question.


Labels

ggml (changes relating to the ggml tensor library for machine learning), OpenVINO


2 participants