VP8 encoder: pipeline-gated SIMD, pack fast path, pooled output, multi-partition tokens, benchmarks #1611
Open
gabilan wants to merge 4 commits into
…ition, benchmarks
- Legacy vs Optimized encode pipelines (Vp8FrameEncodePipeline); SIMD only on Optimized
- Scalar DCT/quantize with optional SSE2/SSE4.1/AdvSimd kernels (Fdct/Idct/Walsh/Quantize)
- PackTokens: hoist BOOL_CODER state + inline bool coder in bitstream
- Pooled ArraySegment output over FrameEncoderBuffers.OutBuf; VP8Codec partitioning knob
- Multi token partitions (log2 0..3), parallel pack, stitch + pinned OutBuf for BOOL_CODER
- tokenize: thread-local coef-row cache; shared log2 validation helper
- EncodeProfiler, BenchmarkDotNet project, optional GitHub workflow for VP8 CI
- Tests: parity, multi-partition round-trip, SIMD unit tests, coef row threading test
Co-authored-by: Cursor <cursoragent@cursor.com>
The profiler test toggles global EncodeProfiler state; running it with the full xUnit parallel workload can abort the test host (reproduced locally with the same dotnet test filter as CI). Exclude this class from vp8-encoder runs; it can still be executed locally or in a dedicated single-threaded job.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Make EncodeProfiler.Enabled thread-local for parallel xUnit runs.
- Remove unsafe vp8dx_start_decode array helper; pin buffers at call sites.
- Fix QuantizeEncoderSimd spill stores for stackalloc int alignment.
- Run full VP8 CI suite in parallel (drop MaxParallelThreads=1; keep GDI+ test filter).
- Refresh EncodePipeline benchmark doc comments with current M1 Max timings.
Co-authored-by: Cursor <cursoragent@cursor.com>
…artition VP8 encoder: SIMD-gated pipeline, pack fast path, pooling, multi-part…
Member
Thanks for the PR. Other than the failing CI pipelines, could you add the header blocks to the new files to keep them consistent with the rest of the code base? Ideally, also record which LLM model was used, which will help with targeting future improvements.
Member
Actually, a bigger issue is that the encoder stage is now broken for this PR. The WebRTCGetStarted example now produces the result below for me:

Summary
This PR improves end-to-end VP8 encode throughput in SIPSorcery.VP8 while keeping bit-exact parity on the default single-partition (log2 = 0) path and adding decoder round-trip coverage for optional multi-partition streams.
What "Legacy", "Optimized", and "four-part" mean here
- Legacy: the scalar encode pipeline (LegacyVp8FrameEncodePipeline). Same stages as always, with no SIMD/intrinsic fast paths (IEncoderMemoryOps.UseSimdEncoderKernels is off for this pipeline). Useful as the reference for parity tests and apples-to-apples regression comparisons.
- Optimized: the SIMD-capable pipeline (OptimizedVp8FrameEncodePipeline via Vp8FrameEncodePipelineFactory). FDCT / Walsh / quantize / IDCT and related helpers can use x64 SSE2/SSE4.1 or ARM64 AdvSimd where implemented. This is the main throughput path; it still uses one token partition (log2 = 0) unless you configure multi-partition (below).
- Four-part (multi-partition tokens): not "4×4" in the transform-block sense (VP8 still uses 4×4 residuals everywhere). Here it means four parallel VP8 token bitstreams (log2NumTokenPartitions = 2, i.e. 2^2 = 4 partitions), with per-partition packing and a stitched output. Benchmarks label this Keyframe_Optimized_4Part / Inter_Optimized_4Part. Only the Optimized factory path is set up this way in the benches; Legacy stays single-partition for comparison.
Key changes
Encode pipeline kinds (Legacy vs Optimized)
SIMD/intrinsic work runs only on the Optimized pipeline (Vp8FrameEncodePipelineFactory, IEncoderMemoryOps.UseSimdEncoderKernels). The Legacy path remains scalar for apples-to-apples comparison and regression safety.
SIMD kernels (x64 SSE2/SSE4.1, ARM64 AdvSimd)
Optional paths for FDCT / Walsh / quantize / IDCT and related residual helpers (*EncoderSimd.cs, EncoderMemoryOps, mb_encoder dispatch). Scalar implementations stay the reference in dct / quantize where applicable.
PackTokens hot path
vp8_pack_tokens (Span path) hoists BOOL_CODER state and inlines the bool coder loop to cut per-token overhead (bitstream.cs).
Pooled bitstream output
ArraySegment<byte> / "Pooled" API returns a slice over FrameEncoderBuffers.OutBuf instead of allocating + copying every frame; byte[] entry points remain as copy-out wrappers.
Multi–token-partition encoding (log2 0..3)
Parallel per-partition pack (partition = MB row r & (N-1)), stitch of the (N-1)*3 size table and partition bytes, header log2_nbr_of_dct_partitions wired in bitstream validation. OutBuf is pinned for the whole encode so BOOL_CODER's raw pointers stay valid under GC pressure.
Correctness / perf hygiene
tokenize: per-thread last coef-row cache for ConditionalWeakTable fast path (avoids cross-thread wrong-row races); shared ValidateLog2TokenPartitions; reused PartitionLengthsScratch for multi-partition lengths.
Diagnostics & CI
Optional EncodeProfiler phase buckets (thread-local enable flag so parallel xUnit does not flip profiling for other tests); BenchmarkDotNet project for pipeline/micro benches; optional GitHub Actions workflow for VP8 (vp8-encoder-ci.yml).
Scope is roughly 6k lines of churn (one squashed commit on the branch for ease of push); the content naturally groups into SIMD/pipeline, pack, pooling, multi-partition, profiler/benchmarks/CI, and test/review fixes.
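For reviewers who want the multi-partition arithmetic spelled out, here is a minimal sketch of the row-to-partition mapping and the (N-1)*3 size table described under multi-token-partition encoding. The class and method names are hypothetical and are not the ones used in this PR; the 3-byte little-endian length layout follows the VP8 bitstream format (RFC 6386).

```csharp
using System;

// Illustrative helpers only; these names are hypothetical and not taken from the PR.
public static class TokenPartitionSketch
{
    // VP8 allows log2 values 0..3, i.e. 1, 2, 4, or 8 token partitions.
    public static int PartitionCount(int log2NumTokenPartitions)
    {
        if (log2NumTokenPartitions < 0 || log2NumTokenPartitions > 3)
            throw new ArgumentOutOfRangeException(nameof(log2NumTokenPartitions));
        return 1 << log2NumTokenPartitions;
    }

    // Macroblock row r goes to partition r & (N - 1); the mask works because N is a power of two.
    public static int PartitionForRow(int mbRow, int partitionCount)
        => mbRow & (partitionCount - 1);

    // The size table stores a 3-byte little-endian length for every partition except the last.
    public static byte[] BuildSizeTable(ReadOnlySpan<int> partitionLengths)
    {
        int n = partitionLengths.Length;
        if (n < 1)
            throw new ArgumentException("At least one partition is required.", nameof(partitionLengths));

        var table = new byte[(n - 1) * 3];
        for (int i = 0; i < n - 1; i++)
        {
            int len = partitionLengths[i];
            table[i * 3 + 0] = (byte)(len & 0xFF);
            table[i * 3 + 1] = (byte)((len >> 8) & 0xFF);
            table[i * 3 + 2] = (byte)((len >> 16) & 0xFF);
        }
        return table;
    }
}
```

With log2 = 2, rows 0, 4, 8, ... land in partition 0, rows 1, 5, 9, ... in partition 1, and so on, which is what lets each partition be packed independently before the stitch.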
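The difference between the pooled entry point and the byte[] wrappers can be illustrated roughly as follows. FrameBuffersSketch, EncodePooled, and Encode are made-up names for this sketch; only OutBuf and the ArraySegment<byte> return type come from the description above, and the fill loop merely stands in for a real encode.

```csharp
using System;

// Hypothetical stand-in for the encoder's reusable buffers; only the OutBuf name comes from the PR.
public sealed class FrameBuffersSketch
{
    public byte[] OutBuf { get; } = new byte[64 * 1024];
}

public static class PooledOutputSketch
{
    // Stand-in for a real encode: writes `length` bytes into OutBuf and returns a view over them.
    public static ArraySegment<byte> EncodePooled(FrameBuffersSketch buffers, int length)
    {
        for (int i = 0; i < length; i++)
            buffers.OutBuf[i] = (byte)i;

        // No allocation and no copy: the segment aliases OutBuf,
        // so it is only valid until the next encode overwrites the buffer.
        return new ArraySegment<byte>(buffers.OutBuf, 0, length);
    }

    // Copy-out wrapper for callers that need an independent byte[].
    public static byte[] Encode(FrameBuffersSketch buffers, int length)
        => EncodePooled(buffers, length).AsSpan().ToArray();
}
```

The trade-off is lifetime: the pooled result must be consumed (for example, handed to the packetizer) before the same buffers are used for the next frame, whereas the byte[] wrapper pays one allocation and copy for a result that can be kept.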
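As a rough illustration of why a thread-local enable flag keeps parallel tests from interfering with each other, the sketch below uses [ThreadStatic]. ProfilerFlagSketch is a hypothetical stand-in; the PR text does not say whether EncodeProfiler uses [ThreadStatic], ThreadLocal<T>, or another mechanism.

```csharp
using System;
using System.Threading;

// Hypothetical profiler switch; the real EncodeProfiler API is not shown in this PR text.
public static class ProfilerFlagSketch
{
    // Each thread gets its own copy of this field, defaulting to false.
    [ThreadStatic]
    private static bool _enabled;

    public static bool Enabled
    {
        get => _enabled;
        set => _enabled = value;
    }

    // Returns the value of Enabled as observed by a freshly started thread.
    public static bool ReadOnNewThread()
    {
        bool seen = true;
        var thread = new Thread(() => seen = Enabled);
        thread.Start();
        thread.Join();
        return seen;
    }
}
```

Setting Enabled to true on one test thread leaves it false on every other thread, so a profiler test cannot switch profiling on for tests running elsewhere. One caveat of this approach: [ThreadStatic] values do not follow async continuations onto other threads, so AsyncLocal<T> would be the usual alternative if that matters.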
Performance (local BenchmarkDotNet)
Host: macOS Tahoe 26.3, Apple M1 Max (10 cores), .NET SDK 10.0.103, runtime .NET 10.0.3, Arm64 RyuJIT (AdvSimd available).
Project: test/SIPSorcery.VP8.Benchmarks, filter EncodePipelineBenchmarks, DefaultJob (warm, out-of-process child).
Keyframes (random I420, q=32; one full encode per iteration)
[Table rows: Keyframe_Legacy, Keyframe_Optimized, Keyframe_Legacy, Keyframe_Optimized, Keyframe_Optimized_4Part, Keyframe_Optimized_4Part; timing columns not recoverable]
Inter (640×480; InvocationCount=1, ~5–8 ms/iter; directional only)
[Table rows: Inter_Legacy, Inter_Optimized, Inter_Optimized_4Part; timing columns not recoverable]
How to read BenchmarkDotNet Ratio: in this class the baseline is Keyframe_Legacy @ 640×480; ratios vs that row are not "Legacy vs Optimized at the same resolution" for 720p rows. Prefer the explicit ms table above for cross-resolution comparisons.
Reproduce (Release):
Testing
- … encode_pipeline_parity_unittest).
- … multi_partition_unittest).
- GetCoefProbRowForPack stress (pack_tokens_unittest).
- vp8dx_start_decode-style helpers that held unpinned pointers were removed; tests pin buffers for decode spans.
- … log2 > 0).
- … vp8-encoder-ci.yml filter). Benchmarks are Release / BenchmarkDotNet and are meant for perf signal, not functional correctness.
Splitting / stacking
If maintainers prefer a smaller review, I'm happy to rework this into a stacked series.
If you want a stack, specify preferred granularity and base branch naming; I can branch and open dependent PRs from this same fork.