VP8 encoder: pipeline-gated SIMD, pack fast path, pooled output, multi-partition tokens, benchmarks by gabilan · Pull Request #1611 · sipsorcery-org/sipsorcery

gabilan · 2026-05-07T01:24:22Z

Summary

This PR improves end-to-end VP8 encode throughput in SIPSorcery.VP8 while keeping bit-exact parity on the default single-partition (log2 = 0) path and adding decoder round-trip coverage for optional multi-partition streams.

What "Legacy", "Optimized", and "four-part" mean here

Legacy — The scalar encode pipeline (LegacyVp8FrameEncodePipeline): same stages as always, no SIMD/intrinsic fast paths (IEncoderMemoryOps.UseSimdEncoderKernels off for this pipeline). Useful as the reference for parity tests and apples-to-apples regressions.
Optimized — The SIMD-capable pipeline (OptimizedVp8FrameEncodePipeline via Vp8FrameEncodePipelineFactory): FDCT / Walsh / quantize / IDCT and related helpers can use x64 SSE2/SSE4.1 or ARM64 AdvSimd where implemented. This is the main throughput path; it still uses one token partition (log2 = 0) unless you configure multi-partition (below).
Four-part (multi-partition tokens) — Not "4×4" in the transform-block sense (VP8 still uses 4×4 residuals everywhere). Here it means four parallel VP8 token bitstreams (log2NumTokenPartitions = 2, i.e. 2^2 = 4 partitions), with per-partition packing and a stitched output. Benchmarks label this Keyframe_Optimized_4Part / Inter_Optimized_4Part. Only the Optimized factory path is set up this way in the benches; Legacy stays single-partition for comparison.
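The partition arithmetic behind "four-part" can be sketched in a few lines. This is an illustrative Python sketch, not the SIPSorcery code: the function names `partition_count` and `partition_for_mb_row` are hypothetical, and the row-to-partition rule (macroblock row masked with N-1) is taken from the "Key changes" description of the parallel pack.

```python
def partition_count(log2_num_token_partitions: int) -> int:
    """Number of token partitions for a log2 setting (the PR supports 0..3)."""
    if not 0 <= log2_num_token_partitions <= 3:
        raise ValueError("log2 token partitions must be in 0..3")
    return 1 << log2_num_token_partitions


def partition_for_mb_row(mb_row: int, num_partitions: int) -> int:
    """Rows are dealt round-robin; N is a power of two, so a mask suffices."""
    return mb_row & (num_partitions - 1)


n = partition_count(2)        # the "four-part" configuration
mb_rows = (480 + 15) // 16    # a 640x480 frame has 30 macroblock rows
assignment = [partition_for_mb_row(r, n) for r in range(mb_rows)]
print(n, assignment[:8])
```

With log2 = 0 the mask is 0, so every row lands in partition 0, which is the default single-partition layout.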

Key changes

- Encode pipeline kinds (Legacy vs Optimized): SIMD/intrinsic work runs only on the Optimized pipeline (Vp8FrameEncodePipelineFactory, IEncoderMemoryOps.UseSimdEncoderKernels). The Legacy path remains scalar for apples-to-apples comparison and regression safety.
- SIMD kernels (x64 SSE2/SSE4.1, ARM64 AdvSimd): Optional paths for FDCT / Walsh / quantize / IDCT and related residual helpers (*EncoderSimd.cs, EncoderMemoryOps, mb_encoder dispatch). Scalar implementations stay the reference in dct / quantize where applicable.
- PackTokens hot path: vp8_pack_tokens (Span path) hoists BOOL_CODER state and inlines the bool coder loop to cut per-token overhead (bitstream.cs).
- Pooled bitstream output: ArraySegment<byte> / "Pooled" API returns a slice over FrameEncoderBuffers.OutBuf instead of allocating + copying every frame; byte[] entry points remain as copy-out wrappers.
- Multi-token-partition encoding (log2 0..3): Parallel per-partition pack (partition = MB row r & (N-1)), stitch of the (N-1)*3 size table and partition bytes, header log2_nbr_of_dct_partitions wired in bitstream validation. OutBuf is pinned for the whole encode so BOOL_CODER's raw pointers stay valid under GC pressure.
- Correctness / perf hygiene: tokenize: per-thread last coef-row cache for ConditionalWeakTable fast path (avoids cross-thread wrong-row races); shared ValidateLog2TokenPartitions; reused PartitionLengthsScratch for multi-partition lengths.
- Diagnostics & CI: Optional EncodeProfiler phase buckets (thread-local enable flag so parallel xUnit does not flip profiling for other tests); BenchmarkDotNet project for pipeline/micro benches; optional GitHub Actions workflow for VP8 (vp8-encoder-ci.yml).
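The "(N-1)*3 size table" in the multi-partition item refers to the VP8 layout in which the sizes of all token partitions except the last are written as 3-byte little-endian values, followed by the partition bytes back to back. A minimal Python sketch of that stitch and its inverse follows; `stitch_partitions` and `split_partitions` are hypothetical names, and the sketch covers only the token partitions, not the frame header or first partition that precede them in a real stream.

```python
def stitch_partitions(partitions: list[bytes]) -> bytes:
    """Emit the (N-1)*3 byte size table, then all partition payloads."""
    table = bytearray()
    for part in partitions[:-1]:          # the last size is implied
        if len(part) >= 1 << 24:
            raise ValueError("partition too large for a 24-bit size")
        table += len(part).to_bytes(3, "little")
    return bytes(table) + b"".join(partitions)


def split_partitions(data: bytes, num_partitions: int) -> list[bytes]:
    """Inverse of stitch_partitions, as a decoder would locate partitions."""
    table_len = (num_partitions - 1) * 3
    sizes = [int.from_bytes(data[i:i + 3], "little")
             for i in range(0, table_len, 3)]
    parts, pos = [], table_len
    for size in sizes:
        parts.append(data[pos:pos + size])
        pos += size
    parts.append(data[pos:])              # remainder is the last partition
    return parts


parts = [b"aa", b"b", b"", b"cccc"]       # stand-ins for packed token bytes
stitched = stitch_partitions(parts)
assert split_partitions(stitched, len(parts)) == parts
```

For four partitions the table is 9 bytes, which is why only three sizes need to be tracked per frame.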

Scope is roughly 6k lines of churn (one squashed commit on the branch for ease of push); the content naturally groups into SIMD/pipeline, pack, pooling, multi-partition, profiler/benchmarks/CI, and test/review fixes.

Performance (local BenchmarkDotNet)

Host: macOS Tahoe 26.3, Apple M1 Max (10 cores), .NET SDK 10.0.103, runtime .NET 10.0.3, Arm64 RyuJIT (AdvSimd available).

Project: test/SIPSorcery.VP8.Benchmarks, filter EncodePipelineBenchmarks, DefaultJob (warm, out-of-process child).

Keyframes (random I420, q=32; one full encode per iteration)

| Method | Resolution | Mean | Note |
| --- | --- | --- | --- |
| `Keyframe_Legacy` | 640×480 | 27.91 ms | Scalar baseline |
| `Keyframe_Optimized` | 640×480 | 25.95 ms | ~7% faster than Legacy |
| `Keyframe_Legacy` | 1280×720 | 83.35 ms | |
| `Keyframe_Optimized` | 1280×720 | 78.16 ms | ~6% faster than Legacy |
| `Keyframe_Optimized_4Part` | 640×480 | 10.94 ms | Four token partitions; ~61% less time than Legacy @ 640×480 (~2.5×); higher alloc (~4.2 KB vs ~1.6 KB in BDN managed view) |
| `Keyframe_Optimized_4Part` | 1280×720 | 32.73 ms | |

Inter (640×480; InvocationCount=1, ~5–8 ms/iter — directional only)

| Method | Mean |
| --- | --- |
| `Inter_Legacy` | 7.49 ms |
| `Inter_Optimized` | 5.86 ms |
| `Inter_Optimized_4Part` | 4.23 ms |

How to read BenchmarkDotNet Ratio: in this class the baseline is Keyframe_Legacy @ 640×480; ratios vs that row are not "Legacy vs Optimized at the same resolution" for 720p rows. Prefer the explicit ms table above for cross-resolution comparisons.
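Same-resolution comparisons can be recomputed directly from the keyframe means instead of relying on the Ratio column. The Python sketch below does only that arithmetic; the dictionary simply restates the reported means, and `vs_legacy` is a hypothetical helper name.

```python
# Keyframe means in milliseconds, copied from the table above.
means_ms = {
    ("Legacy", "640x480"): 27.91,
    ("Optimized", "640x480"): 25.95,
    ("Optimized_4Part", "640x480"): 10.94,
    ("Legacy", "1280x720"): 83.35,
    ("Optimized", "1280x720"): 78.16,
    ("Optimized_4Part", "1280x720"): 32.73,
}


def vs_legacy(method: str, resolution: str) -> tuple[float, float]:
    """Return (percent less time, speedup factor) against Legacy
    measured at the same resolution."""
    base = means_ms[("Legacy", resolution)]
    mean = means_ms[(method, resolution)]
    return (1.0 - mean / base) * 100.0, base / mean


for resolution in ("640x480", "1280x720"):
    for method in ("Optimized", "Optimized_4Part"):
        saved, speedup = vs_legacy(method, resolution)
        print(f"{method:16s} {resolution:9s} "
              f"{saved:5.1f}% less time, {speedup:.2f}x")
```

With the numbers above, the 1280×720 four-partition row also comes out at roughly 2.5× faster than Legacy at the same resolution.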

Reproduce (Release):

dotnet run -c Release --project test/SIPSorcery.VP8.Benchmarks -- --filter '*EncodePipelineBenchmarks*' -e github

Testing

Unit tests: existing VP8 unit tests updated/extended where touched; new coverage includes:
- Legacy vs Optimized parity for contiguous I420 encode (encode_pipeline_parity_unittest).
- Multi-partition keyframe + inter decoder round-trips and header sanity (multi_partition_unittest).
- SIMD-focused tests for fdct/quantize/idct/walsh where added.
- Pack tokens reference checks + two-thread GetCoefProbRowForPack stress (pack_tokens_unittest).
- Bool decoder: vp8dx_start_decode-style helpers that held unpinned pointers were removed; tests pin buffers for decode spans.
Rationale: pack/token paths dominate profile time; parity tests lock the default layout to the scalar/Legacy reference, while multi-partition tests assert spec-correct streams and decoder acceptance (layout differs from single-partition by design for log2 > 0).
Known environment caveats: two tests require GDI+ / Windows-oriented image APIs; CI excludes them on non-Windows-friendly hosts (vp8-encoder-ci.yml filter). Benchmarks are Release / BenchmarkDotNet and are meant for perf signal, not functional correctness.

Splitting / stacking

If maintainers prefer a smaller review, I'm happy to rework this into a stacked series.

~5 PRs (coarse): SIMD+pipeline → pack fast path → pooled output → multi-partition (+pinning) → profiler/benchmarks+CI
~20 PRs (fine): e.g. SIMD split per kernel + infra, multi-partition split (plumbing / buffers / parallel pack / stitch / pin / tests), benchmarks split per bench + workflow — as outlined in review discussion.

If you want a stack, specify preferred granularity and base branch naming; I can branch and open dependent PRs from this same fork.

…ition, benchmarks

- Legacy vs Optimized encode pipelines (Vp8FrameEncodePipeline); SIMD only on Optimized
- Scalar DCT/quantize with optional SSE2/SSE4.1/AdvSimd kernels (Fdct/Idct/Walsh/Quantize)
- PackTokens: hoist BOOL_CODER state + inline bool coder in bitstream
- Pooled ArraySegment output over FrameEncoderBuffers.OutBuf; VP8Codec partitioning knob
- Multi token partitions (log2 0..3), parallel pack, stitch + pinned OutBuf for BOOL_CODER
- tokenize: thread-local coef-row cache; shared log2 validation helper
- EncodeProfiler, BenchmarkDotNet project, optional GitHub workflow for VP8 CI
- Tests: parity, multi-partition round-trip, SIMD unit tests, coef row threading test

Co-authored-by: Cursor <cursoragent@cursor.com>

The profiler test toggles global EncodeProfiler state; running it with the full xUnit parallel workload can abort the test host (reproduced locally with the same dotnet test filter as CI). Exclude this class from vp8-encoder runs; it can still be executed locally or in a dedicated single-threaded job.

Co-authored-by: Cursor <cursoragent@cursor.com>

- Make EncodeProfiler.Enabled thread-local for parallel xUnit runs.
- Remove unsafe vp8dx_start_decode array helper; pin buffers at call sites.
- Fix QuantizeEncoderSimd spill stores for stackalloc int alignment.
- Run full VP8 CI suite in parallel (drop MaxParallelThreads=1; keep GDI+ test filter).
- Refresh EncodePipeline benchmark doc comments with current M1 Max timings.

Co-authored-by: Cursor <cursoragent@cursor.com>

…artition VP8 encoder: SIMD-gated pipeline, pack fast path, pooling, multi-part…

sipsorcery · 2026-05-07T05:00:06Z

Thanks for the PR.

Other than the failing CI pipelines, could you add the header blocks to the new files to keep them consistent with the rest of the code base? Ideally also record which LLM model was used, which will help with targeting future improvements.

sipsorcery · 2026-05-07T05:03:35Z

Actually, a bigger issue is that the encoder stage is now broken with this PR. The WebRTCGetStarted example now produces the result below for me:

gabilan and others added 4 commits May 6, 2026 17:41

Merge pull request #1 from gabilan/vp8-encoder-throughput-simd-multip…

f1bba05

…artition VP8 encoder: SIMD-gated pipeline, pack fast path, pooling, multi-part…

gabilan wants to merge 4 commits into sipsorcery-org:master from gabilan:master
