chore(benchmarks): enhance safetensors loading strategies #146

Open: wolegechu wants to merge 2 commits into main from ychu/update-bench
@wolegechu wolegechu commented Jan 20, 2026

Note

Introduces a new loader strategy and H2D microbenchmark, plus documentation.

  • Loader host_pack (Strategy C_host_pack): Implements CPU-side pack + 1D H2D path (--strategy=host_pack/hp), including planning, host staging, pinned-pack buffers, async H2D, and metrics; adds StrategyKind::kC_HostPack, execution flow, and logging.
  • New h2d_2d_baseline mode: Benchmarks cudaMemcpy2DAsync(H2D) for 2D stride-to-packed copies; adds mode parsing, per-GPU execution, timing, and aggregated throughput output.
  • CLI additions: New flags for 2D H2D (--h2d_2d_width_bytes, --h2d_2d_height, --h2d_2d_src_pitch_bytes, --h2d_2d_dst_pitch_bytes) and extended --h2d_per_gpu_pinned_pool semantics; updated --mode and --strategy help.
  • Utilities: Adds HostBufferSink for host staging writes; minor planner/stat counters and result logging updates to include C_host_pack.
  • Docs: README expanded with host_pack usage and h2d_2d_baseline; new detailed benchmark report covering results and comparisons.
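The packing step that host_pack performs on the CPU, and that h2d_2d_baseline delegates to cudaMemcpy2DAsync, can be sketched in plain C++ as below. This is only an illustration under stated assumptions: `PackRows` and `PackToContiguous` are hypothetical names that do not come from this PR, and the parameters simply mirror the width, height, source-pitch, and destination-pitch flags listed above. The real loader would presumably write into a pinned buffer and follow the pack with a single 1D async H2D copy, which is omitted here so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the PR): copies `height` rows of `width_bytes`
// bytes each from a strided source into a destination with its own pitch.
// When dst_pitch_bytes == width_bytes the destination ends up fully packed.
inline void PackRows(const std::uint8_t* src, std::size_t src_pitch_bytes,
                     std::uint8_t* dst, std::size_t dst_pitch_bytes,
                     std::size_t width_bytes, std::size_t height) {
  // Each pitch must be at least as wide as the payload of one row.
  assert(src_pitch_bytes >= width_bytes && dst_pitch_bytes >= width_bytes);
  for (std::size_t row = 0; row < height; ++row) {
    std::memcpy(dst + row * dst_pitch_bytes,
                src + row * src_pitch_bytes,
                width_bytes);
  }
}

// Hypothetical convenience wrapper: packs a strided slice into a contiguous
// buffer. In a host_pack-style path this buffer would be pinned memory and
// would then be sent to the GPU with one contiguous async copy (assumption).
inline std::vector<std::uint8_t> PackToContiguous(
    const std::vector<std::uint8_t>& src, std::size_t src_pitch_bytes,
    std::size_t width_bytes, std::size_t height) {
  // The last row only needs `width_bytes`, not a full pitch.
  assert(height == 0 ||
         src.size() >= (height - 1) * src_pitch_bytes + width_bytes);
  std::vector<std::uint8_t> packed(width_bytes * height);
  PackRows(src.data(), src_pitch_bytes, packed.data(), width_bytes,
           width_bytes, height);
  return packed;
}
```

For example, packing 3 rows of 4 bytes from a source with an 8-byte pitch yields a 12-byte buffer holding source bytes 0..3, 8..11, and 16..19 back to back. The trade-off being benchmarked is roughly whether this extra CPU pass plus one contiguous transfer beats letting the copy engine handle the stride directly.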

Written by Cursor Bugbot for commit b1ee0aa.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1ee0aa84b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +3804 to +3805
host_pack_pool =
std::make_unique<common::memory::PinnedBufferPool>(pack_total_bytes, pack_chunk_bytes, std::move(pool_opts));
[P2] Align host_pack pinned pool size to 4K

PinnedBufferPool allocates its slab with aligned_alloc(4096, slab_bytes) (see core/common/memory/pinned_buffer_pool.cc:115–118), which requires the total size to be a multiple of 4096. Here pack_total_bytes is computed as pack_chunk_bytes * 2, but pack_chunk_bytes is only aligned to 512, so totals such as 1024, 3072, or 5120 bytes are possible for small or narrow slices. In those cases the constructor will LOG(FATAL) and the host_pack benchmark will abort. Consider rounding pack_total_bytes (or pack_chunk_bytes) up to PinnedBufferPool::kMemoryAlignment before constructing the pool.
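One possible shape for the suggested rounding is sketched below. It assumes the alignment constant is 4096, as the review comment states; `kAssumedAlignment`, `RoundUpToMultiple`, and `AlignedPackTotalBytes` are hypothetical names for illustration and are not part of the PR or of PinnedBufferPool. Rounding the chunk size first (rather than the total) is just one of the two options mentioned above; it keeps the total at exactly two chunks.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value of PinnedBufferPool::kMemoryAlignment (per the review comment).
constexpr std::size_t kAssumedAlignment = 4096;

// Rounds `bytes` up to the next multiple of `alignment` (alignment > 0).
constexpr std::size_t RoundUpToMultiple(std::size_t bytes,
                                        std::size_t alignment) {
  return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical sizing helper: align the chunk first, then double it, so the
// total stays a whole number of chunks and a multiple of the alignment.
constexpr std::size_t AlignedPackTotalBytes(std::size_t pack_chunk_bytes) {
  return RoundUpToMultiple(pack_chunk_bytes, kAssumedAlignment) * 2;
}

// The problematic totals named in the review all round up cleanly.
static_assert(RoundUpToMultiple(1024, kAssumedAlignment) == 4096, "1024");
static_assert(RoundUpToMultiple(3072, kAssumedAlignment) == 4096, "3072");
static_assert(RoundUpToMultiple(5120, kAssumedAlignment) == 8192, "5120");
static_assert(RoundUpToMultiple(4096, kAssumedAlignment) == 4096, "4096");
```

With this sketch, a 512-byte chunk would produce an 8192-byte pool instead of a 1024-byte one, at the cost of some unused pinned memory for very small slices.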

Useful? React with 👍 / 👎.
