
UPSTREAM PR #21405: vendor : update cpp-httplib to 0.40.1 (#1331)

Open
loci-dev wants to merge 1 commit into main from loci/pr-21405-cpp-httplib-041

Conversation

@loci-dev loci-dev commented Apr 4, 2026

Note

Source pull request: ggml-org/llama.cpp#21405

Overview

Additional information

Requirements


loci-review bot commented Apr 4, 2026

Overview

A single commit updates the vendored cpp-httplib library (0.40.0 → 0.40.1). Analysis covers 125,669 functions across 15 binaries: 146 modified (0.12%), 186 new (0.15%), 93 removed (0.07%), 125,244 unchanged (99.66%).

Power consumption changes (all <0.2%):

  • build.bin.llama-bench: +0.185%
  • build.bin.llama-cvector-generator: +0.176%
  • build.bin.llama-tts: +0.092%
  • All other binaries (libllama.so, libmtmd.so, llama-tokenize, llama-quantize, llama-qwen2vl-cli, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, libggml.so, libggml-cpu.so, libggml-base.so): 0.000%

Key finding: Major HTTP/WebSocket client initialization improvements (85-88% faster) with minor regressions in non-critical utility functions.

Function Analysis

Major Improvements (HTTP Client Initialization):

httplib::Client::Client() (llama-bench, llama-tts, llama-cvector-generator):

  • Response time: 61,787-61,825 ns → 7,131-7,165 ns (-88.4% to -88.5%, ~54,000 ns savings)
  • Throughput time: 591-609 ns → 379 ns (-36% to -38%)
  • Source change: Replaced regex-based URL parsing (std::regex_match, 43,000 ns compilation + 7,400 ns matching) with structured detail::parse_url() (1,733 ns) using UrlComponents. Eliminated expensive regex compilation from constructor hot path.

httplib::WebSocketClient::WebSocketClient() (llama-bench, llama-tts, llama-cvector-generator):

  • Response time: 62,761-62,819 ns → 8,893-8,945 ns (-85.7% to -85.8%, ~54,000 ns savings)
  • Throughput time: 629-638 ns → 442-444 ns (-29.7% to -30.4%)
  • Source change: Same optimization pattern—replaced regex validation with structured parsing, applied move semantics for host/path assignments.

STL Template Improvements (llama-tts):

  • std::vector<jinja::token>::begin(): 265 ns → 84 ns (-68.2%)
  • nlohmann::json::get(): 243 ns → 61 ns throughput (-75.0%)
  • __gnu_cxx::__ops::__pred_iter(): 269 ns → 93 ns throughput (-65.6%)
  • Cause: cpp-httplib eliminated std::any and std::regex, reducing template bloat and enabling better compiler optimizations for all STL templates.

Minor Regressions (Non-Critical Paths):

std::make_error_code() (llama-bench):

  • Response time: 145 ns → 333 ns (+128.7%, +187 ns)
  • Throughput time: 109 ns → 296 ns (+171.5%, +187 ns)
  • Cause: Compiler-generated entry block indirection (9 blocks → 11 blocks). Error handling path, not in hot path.

serialize_parser_variant() (llama-cvector-generator):

  • Throughput time: 62 ns → 182 ns (+193.3%, +120 ns)
  • Cause: Unnecessary entry block indirection from compiler code generation. Called once per grammar compilation, not per token.

httplib::ClientImpl::stop() (llama-bench):

  • Throughput time: 130 ns → 308 ns (+137.3%, +178 ns)
  • Cause: Entry point indirect branching. Client cleanup function, infrequently called.

Other analyzed functions (std::swap, std::function::operator=, SSEClient::set_headers, Jinja template utilities) showed minor regressions (65-86 ns) in non-critical paths due to compiler code generation changes.

Flame Graph Comparison

Selected function: httplib::Client::Client() (llama-tts) — best illustrates the 88% response time improvement from regex elimination.

Base version:

Base version flame graph

Target version:

Target version flame graph

The base version is dominated by regex compilation (_M_compile: 42,990 ns, 68% of total) and matching (7,374 ns). The target version eliminates all regex operations, replacing them with the lightweight parse_url (1,730 ns, 24% of total). Call depth drops from 10 to 4 levels, for an 8.7x speedup.

Additional Findings

Zero impact on core inference operations: All changes isolated to HTTP/networking infrastructure. No modifications to performance-critical paths: matrix operations (GEMM), attention mechanisms, KV cache, quantization kernels, or GPU backends (CUDA, Metal, HIP, Vulkan, SYCL). Core inference libraries (libllama.so, libggml.so, libggml-cpu.so) show 0.000% power consumption change.

Architectural alignment: Update follows llama.cpp's philosophy of using specialized implementations over generic libraries. HTTP client initialization improvements benefit model downloading, API communication, and benchmarking infrastructure without touching inference engine.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 34734bc to 55afbee on April 11, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from d101579 to 63ab8d1 on April 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 02:19