
VP8 encoder: pipeline-gated SIMD, pack fast path, pooled output, multi-partition tokens, benchmarks#1611

Open
gabilan wants to merge 4 commits into sipsorcery-org:master from gabilan:master



gabilan commented May 7, 2026

Summary

This PR improves end-to-end VP8 encode throughput in SIPSorcery.VP8 while keeping bit-exact parity on the default single-partition (log2 = 0) path and adding decoder round-trip coverage for optional multi-partition streams.

What "Legacy", "Optimized", and "four-part" mean here

  • Legacy — The scalar encode pipeline (LegacyVp8FrameEncodePipeline): same stages as always, no SIMD/intrinsic fast paths (IEncoderMemoryOps.UseSimdEncoderKernels off for this pipeline). Useful as the reference for parity tests and apples-to-apples regressions.

  • Optimized — The SIMD-capable pipeline (OptimizedVp8FrameEncodePipeline via Vp8FrameEncodePipelineFactory): FDCT / Walsh / quantize / IDCT and related helpers can use x64 SSE2/SSE4.1 or ARM64 AdvSimd where implemented. This is the main throughput path; it still uses one token partition (log2 = 0) unless you configure multi-partition (below).

  • Four-part (multi-partition tokens) — Not "4×4" in the transform-block sense (VP8 still uses 4×4 residuals everywhere). Here it means four parallel VP8 token bitstreams (log2NumTokenPartitions = 2, i.e. 2^2 = 4 partitions), with per-partition packing and a stitched output. Benchmarks label this Keyframe_Optimized_4Part / Inter_Optimized_4Part. Only the Optimized factory path is set up this way in the benches; Legacy stays single-partition for comparison.
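The relationship between the log2 setting and the number of token partitions can be sketched in a few lines. This is an illustrative Python sketch, not code from this PR: `token_partition_count` is a hypothetical name, and the accepted range 0..3 is taken from the description of this change.

```python
def token_partition_count(log2_num_token_partitions: int) -> int:
    """Return the number of token partitions for a log2 setting.

    Hypothetical helper for illustration only. The accepted range 0..3
    (1, 2, 4, or 8 partitions) follows the PR description.
    """
    if not 0 <= log2_num_token_partitions <= 3:
        raise ValueError("log2 token partitions must be in 0..3")
    return 1 << log2_num_token_partitions


# Print the partition count for every valid setting.
for log2 in range(4):
    print(log2, token_partition_count(log2))
```

With this mapping, the default path (log2 = 0) has one partition and the "four-part" configuration corresponds to log2 = 2.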

Key changes

  • Encode pipeline kinds (Legacy vs Optimized)
    SIMD/intrinsic work runs only on the Optimized pipeline (Vp8FrameEncodePipelineFactory, IEncoderMemoryOps.UseSimdEncoderKernels). The Legacy path remains scalar for apples-to-apples comparison and regression safety.

  • SIMD kernels (x64 SSE2/SSE4.1, ARM64 AdvSimd)
    Optional paths for FDCT / Walsh / quantize / IDCT and related residual helpers (*EncoderSimd.cs, EncoderMemoryOps, mb_encoder dispatch). Scalar implementations stay the reference in dct / quantize where applicable.

  • PackTokens hot path
    vp8_pack_tokens (Span path) hoists BOOL_CODER state and inlines the bool coder loop to cut per-token overhead (bitstream.cs).

  • Pooled bitstream output
    ArraySegment<byte> / "Pooled" API returns a slice over FrameEncoderBuffers.OutBuf instead of allocating + copying every frame; byte[] entry points remain as copy-out wrappers.

  • Multi–token-partition encoding (log2 0..3)
    Parallel per-partition pack (partition = MB row r & (N-1)), stitch of the (N-1)*3 size table and partition bytes, header log2_nbr_of_dct_partitions wired in bitstream validation. OutBuf is pinned for the whole encode so BOOL_CODER's raw pointers stay valid under GC pressure.

  • Correctness / perf hygiene
    tokenize: per-thread last coef-row cache for ConditionalWeakTable fast path (avoids cross-thread wrong-row races); shared ValidateLog2TokenPartitions; reused PartitionLengthsScratch for multi-partition lengths.

  • Diagnostics & CI
    Optional EncodeProfiler phase buckets (thread-local enable flag so parallel xUnit does not flip profiling for other tests); BenchmarkDotNet project for pipeline/micro benches; optional GitHub Actions workflow for VP8 (vp8-encoder-ci.yml).
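As a rough illustration of the multi-partition item above, the row-to-partition mapping and the stitch step might look like the following. This is a Python sketch under stated assumptions, not the PR's C# code: `partition_for_row` and `stitch_partitions` are hypothetical names, and the size-table layout (a 3-byte little-endian length for every partition except the last, followed by the partition bytes in order) is the layout described by the VP8 bitstream format; that the PR's stitch matches it exactly is inferred from the "(N-1)*3 size table" wording.

```python
from typing import Sequence


def partition_for_row(mb_row: int, num_partitions: int) -> int:
    """Map a macroblock row to its token partition: r & (N - 1).

    Assumes num_partitions is a power of two (1, 2, 4, or 8).
    """
    return mb_row & (num_partitions - 1)


def stitch_partitions(partitions: Sequence[bytes]) -> bytes:
    """Concatenate token partitions behind an (N - 1) * 3 byte size table.

    Each of the first N - 1 partitions contributes a 3-byte
    little-endian length; the last partition's length is implied by
    the remaining bytes, so it has no table entry.
    """
    table = bytearray()
    for part in partitions[:-1]:
        if len(part) >= 1 << 24:
            raise ValueError("partition too large for a 3-byte length")
        table += len(part).to_bytes(3, "little")
    return bytes(table) + b"".join(partitions)


# Rows 0..5 with four partitions cycle through partitions 0, 1, 2, 3, 0, 1.
rows = [partition_for_row(r, 4) for r in range(6)]
stitched = stitch_partitions([b"\x01\x02", b"\x03", b"", b"\x04\x05\x06"])
print(rows, stitched.hex())
```

In the sketch, per-partition packing is assumed to have already produced the byte strings; the real encoder packs them in parallel into a pinned buffer, which the sketch does not model.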

Scope is roughly 6k lines of churn (one squashed commit on the branch for ease of push); the content naturally groups into SIMD/pipeline, pack, pooling, multi-partition, profiler/benchmarks/CI, and test/review fixes.

Performance (local BenchmarkDotNet)

Host: macOS Tahoe 26.3, Apple M1 Max (10 cores), .NET SDK 10.0.103, runtime .NET 10.0.3, Arm64 RyuJIT (AdvSimd available).

Project: test/SIPSorcery.VP8.Benchmarks, filter EncodePipelineBenchmarks, DefaultJob (warm, out-of-process child).

Keyframes (random I420, q=32; one full encode per iteration)

| Method | Resolution | Mean | Note |
| --- | --- | --- | --- |
| Keyframe_Legacy | 640×480 | 27.91 ms | Scalar baseline |
| Keyframe_Optimized | 640×480 | 25.95 ms | ~7% faster than Legacy |
| Keyframe_Legacy | 1280×720 | 83.35 ms | |
| Keyframe_Optimized | 1280×720 | 78.16 ms | ~6% faster than Legacy |
| Keyframe_Optimized_4Part | 640×480 | 10.94 ms | Four token partitions; ~61% less time than Legacy @ 640×480 (~2.5×); higher alloc (~4.2 KB vs ~1.6 KB in BDN managed view) |
| Keyframe_Optimized_4Part | 1280×720 | 32.73 ms | |

Inter (640×480; InvocationCount=1, ~5–8 ms/iter — directional only)

| Method | Mean |
| --- | --- |
| Inter_Legacy | 7.49 ms |
| Inter_Optimized | 5.86 ms |
| Inter_Optimized_4Part | 4.23 ms |

How to read BenchmarkDotNet Ratio: in this class the baseline is Keyframe_Legacy @ 640×480; ratios vs that row are not "Legacy vs Optimized at the same resolution" for 720p rows. Prefer the explicit ms table above for cross-resolution comparisons.
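To compare like with like, the ratios can be recomputed from the reported means against the Legacy row at the same resolution. The snippet below is a small Python sketch of that arithmetic; the numbers are copied from the keyframe results above, and `same_resolution_ratio` is a hypothetical helper name.

```python
# Mean keyframe encode times in milliseconds, copied from the results above.
MEANS_MS = {
    ("Keyframe_Legacy", "640x480"): 27.91,
    ("Keyframe_Optimized", "640x480"): 25.95,
    ("Keyframe_Optimized_4Part", "640x480"): 10.94,
    ("Keyframe_Legacy", "1280x720"): 83.35,
    ("Keyframe_Optimized", "1280x720"): 78.16,
    ("Keyframe_Optimized_4Part", "1280x720"): 32.73,
}


def same_resolution_ratio(method: str, resolution: str,
                          baseline: str = "Keyframe_Legacy") -> float:
    """Return mean(method) / mean(baseline) at the same resolution.

    Values below 1.0 mean the method took less time than the baseline.
    """
    return MEANS_MS[(method, resolution)] / MEANS_MS[(baseline, resolution)]


for (method, resolution) in MEANS_MS:
    ratio = same_resolution_ratio(method, resolution)
    print(f"{method:26s} {resolution:9s} ratio vs Legacy: {ratio:.3f}")
```

Computed this way, each 1280×720 row is measured against the 1280×720 Legacy mean rather than the 640×480 baseline that BenchmarkDotNet uses for its Ratio column.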

Reproduce (Release):

dotnet run -c Release --project test/SIPSorcery.VP8.Benchmarks -- --filter '*EncodePipelineBenchmarks*' -e github

Testing

  • Unit tests: existing VP8 unit tests updated/extended where touched; new coverage includes:
    • Legacy vs Optimized parity for contiguous I420 encode (encode_pipeline_parity_unittest).
    • Multi-partition keyframe + inter decoder round-trips and header sanity (multi_partition_unittest).
    • SIMD-focused tests for fdct/quantize/idct/walsh where added.
    • Pack tokens reference checks + two-thread GetCoefProbRowForPack stress (pack_tokens_unittest).
    • Bool decoder: vp8dx_start_decode-style helpers that held unpinned pointers were removed; tests pin buffers for decode spans.
  • Rationale: pack/token paths dominate profile time; parity tests lock the default layout to the scalar/Legacy reference, while multi-partition tests assert spec-correct streams and decoder acceptance (layout differs from single-partition by design for log2 > 0).
  • Known environment caveats: two tests require GDI+ / Windows-oriented image APIs; CI excludes them on non-Windows-friendly hosts (vp8-encoder-ci.yml filter). Benchmarks are Release / BenchmarkDotNet and are meant for perf signal, not functional correctness.

Splitting / stacking

If maintainers prefer a smaller review, I'm happy to rework this into a stacked series.

  • ~5 PRs (coarse): SIMD+pipeline → pack fast path → pooled output → multi-partition (+pinning) → profiler/benchmarks+CI
  • ~20 PRs (fine): e.g. SIMD split per kernel + infra, multi-partition split (plumbing / buffers / parallel pack / stitch / pin / tests), benchmarks split per bench + workflow — as outlined in review discussion.

If you want a stack, specify preferred granularity and base branch naming; I can branch and open dependent PRs from this same fork.

gabilan and others added 4 commits May 6, 2026 17:41
…ition, benchmarks

- Legacy vs Optimized encode pipelines (Vp8FrameEncodePipeline); SIMD only on Optimized
- Scalar DCT/quantize with optional SSE2/SSE4.1/AdvSimd kernels (Fdct/Idct/Walsh/Quantize)
- PackTokens: hoist BOOL_CODER state + inline bool coder in bitstream
- Pooled ArraySegment output over FrameEncoderBuffers.OutBuf; VP8Codec partitioning knob
- Multi token partitions (log2 0..3), parallel pack, stitch + pinned OutBuf for BOOL_CODER
- tokenize: thread-local coef-row cache; shared log2 validation helper
- EncodeProfiler, BenchmarkDotNet project, optional GitHub workflow for VP8 CI
- Tests: parity, multi-partition round-trip, SIMD unit tests, coef row threading test

Co-authored-by: Cursor <cursoragent@cursor.com>
The profiler test toggles global EncodeProfiler state; running it with the
full xUnit parallel workload can abort the test host (reproduced locally with
the same dotnet test filter as CI). Exclude this class from vp8-encoder runs;
it can still be executed locally or in a dedicated single-threaded job.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Make EncodeProfiler.Enabled thread-local for parallel xUnit runs.
- Remove unsafe vp8dx_start_decode array helper; pin buffers at call sites.
- Fix QuantizeEncoderSimd spill stores for stackalloc int alignment.
- Run full VP8 CI suite in parallel (drop MaxParallelThreads=1; keep GDI+ test filter).
- Refresh EncodePipeline benchmark doc comments with current M1 Max timings.

Co-authored-by: Cursor <cursoragent@cursor.com>
…artition

VP8 encoder: SIMD-gated pipeline, pack fast path, pooling, multi-part…
@sipsorcery (Member) commented:

Thanks for the PR.

Other than the failing CI pipelines, could you add the header blocks to the new files to keep them consistent with the rest of the code base? Ideally they would also record which LLM model was used, which will help with targeting future improvements.

@sipsorcery (Member) commented:

Actually, a bigger issue is that the encoder stage is now broken for this PR. The WebRTCGetStarted example now produces the result below for me:

[Image: screenshot attached to the original comment]
