Feature/umq fast path #137

Open
vkommine wants to merge 3 commits into intel:main from vkommine:feature/umq-fast-path

No description provided.

vkommine added 3 commits May 18, 2026 22:18
- UAPI copy: CMDQ_INFO, UMQ_ENABLE, UMQ_DISABLE ioctls + mmap offsets
- vpu_command_queue.cpp: VPUDeviceQueueUMQ class
  - tryCreate(): CMDQ_INFO + UMQ_ENABLE + mmap ring + mmap doorbell
  - submitCommandBuffer(): ring write + doorbell MMIO + setUmqMode(true)
  - ringJob(): sfence-ordered ring buffer write + doorbell ring
  - checkReset(): reset_counter detection
- vpu_command_queue.hpp: VPUDeviceQueueUMQ declaration
- vpu_device.cpp: UMQ queue creation path
- vpu_driver_api.cpp/hpp: commandQueueGetInfo, commandQueueUmqEnable/Disable
- command_buffer.cpp: waitForCompletion() ULLS-Light pattern
  - UMQ mode: umonitor + umwait(C0.2, 16000 cycles), yielding between
    iterations; matches Intel GPU compute-runtime WaitUtils (wait_util.h)
  - Non-UMQ mode: unchanged (busyWait 15ms cap + ioctl)
- command_buffer.hpp: setUmqMode(), umqMode flag
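
The ringJob() step listed above (ring write, store fence, doorbell write) could look roughly like the sketch below. This is not the code from this PR: `RingEntry`, `UmqRing`, and their fields are hypothetical, the real ring layout is defined by the KMD/FW, and the ring and doorbell would be mmap'ed regions rather than ordinary memory. The free-space check against the consumer is omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h> // _mm_sfence
#endif

// Hypothetical ring entry; the real UMQ entry format is not shown in this PR.
struct RingEntry {
    uint64_t jobAddr;  // assumed: address of the command buffer
    uint32_t jobId;    // assumed: job identifier
    uint32_t reserved;
};

// Hypothetical view of the producer side of the queue.
struct UmqRing {
    RingEntry *entries;          // ring storage (mmap'ed in a real driver)
    uint32_t capacity;           // number of slots, must be a power of two
    uint32_t tail;               // free-running producer index
    volatile uint32_t *doorbell; // doorbell register (mmap'ed MMIO in a real driver)
};

// Order all earlier stores before any later store.
inline void storeFence() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_sfence();
#else
    std::atomic_thread_fence(std::memory_order_release);
#endif
}

// Write one entry, fence, then ring the doorbell with the new tail.
// Note: no check that the consumer has freed the slot (sketch only).
inline uint32_t ringJob(UmqRing &ring, uint64_t jobAddr, uint32_t jobId) {
    assert(ring.capacity != 0 && (ring.capacity & (ring.capacity - 1)) == 0);
    RingEntry &slot = ring.entries[ring.tail & (ring.capacity - 1)];
    slot.jobAddr = jobAddr;
    slot.jobId = jobId;
    slot.reserved = 0;
    storeFence();               // entry must be visible before the doorbell write
    ring.tail += 1;
    *ring.doorbell = ring.tail; // doorbell write tells the consumer about new work
    return ring.tail;
}
```

The fence sits between the entry write and the doorbell write so the consumer never observes the doorbell before the entry it announces.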

Async benchmark (1M iter, cpu10 @ 2GHz, TILES=1):
  BS4  nireq=1: -24% median latency, +30% FPS vs baseline
  BS8  nireq=1: -22% median latency, +24% FPS vs baseline
  BS10 nireq=1: -20% median latency, +19% FPS vs baseline
  nireq=4: ~9% FPS improvement across all batch sizes
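
The UMQ-mode wait described for waitForCompletion() (umonitor, umwait with C0.2 and a 16000-cycle budget, then yield) could be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `waitForFence` and its parameters are hypothetical, the completion condition is assumed to be "fence value >= expected", and the WAITPKG path is only compiled when the compiler is invoked with WAITPKG enabled (for example `-mwaitpkg`). A real implementation would also check CPU support at run time before executing these instructions.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#if defined(__WAITPKG__)
#include <x86intrin.h> // _umonitor, _umwait, __rdtsc
// Control value 0 requests the C0.2 state; 16000 TSC cycles is the budget
// quoted in the commit message.
constexpr unsigned kUmwaitC02 = 0;
constexpr uint64_t kUmwaitCycles = 16000;
#endif

// Wait until *fence >= expected or the timeout expires.
// Returns true on completion, false on timeout.
inline bool waitForFence(const volatile uint64_t *fence, uint64_t expected,
                         std::chrono::microseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (*fence < expected) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;
#if defined(__WAITPKG__)
        // Arm the monitor on the fence address, re-check to avoid missing a
        // write that landed before arming, then sleep until a write to the
        // monitored line or until the TSC deadline.
        _umonitor(const_cast<uint64_t *>(fence));
        if (*fence >= expected)
            break;
        _umwait(kUmwaitC02, __rdtsc() + kUmwaitCycles);
#endif
        // Yield between iterations so other runnable threads get the core.
        std::this_thread::yield();
    }
    return true;
}
```

Without WAITPKG the loop degrades to a plain yield-and-poll, which keeps the sketch usable on other targets but does not reproduce the power or latency behaviour of umwait.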
Phase 3 of persistent CmdQ PoC (PTL-SUT-0144 / NPU5010):

- ivpu_accel.h (uapi): add DRM_IVPU_CMDQ_FLAG_PERSISTENT 0x00000002u;
  add __u32 _pad field to drm_ivpu_cmdq_create for ABI alignment
- vpu_driver_api.hpp: extend commandQueueCreate() with isPersistent param
- vpu_driver_api.cpp: set DRM_IVPU_CMDQ_FLAG_PERSISTENT in createArgs.flags
  when isPersistent=true
- vpu_command_queue.cpp: pass isPersistent=umqCapability when creating the
  UMQ command queue so persistent flag is set on the hot path

The persistent flag instructs the KMD/FW to skip per-submission CmdQ setup,
eliminating ~8-10µs of FW FSM overhead per inference.
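
Setting the persistent flag from an `isPersistent` parameter could be sketched like this. `CmdqCreateArgs` is a hypothetical stand-in for `drm_ivpu_cmdq_create` (the field names and order here are assumptions, not the UAPI definition), and `makeCmdqCreateArgs` is an illustrative helper rather than a function from this PR. Only the flag value 0x00000002u and the added 32-bit pad field are taken from the commit message.

```cpp
#include <cstdint>

// Hypothetical mirror of the create-ioctl arguments; the real layout lives in
// ivpu_accel.h and may differ.
struct CmdqCreateArgs {
    uint32_t cmdqId;   // assumed: filled in by the kernel on return
    uint32_t priority; // assumed: requested queue priority
    uint32_t flags;    // bitmask of queue flags
    uint32_t pad;      // padding added for ABI alignment, left as zero
};
static_assert(sizeof(CmdqCreateArgs) == 16, "expected 4 x 32-bit fields");

// Flag value quoted in the commit message.
constexpr uint32_t kCmdqFlagPersistent = 0x00000002u;

// Build zero-initialised arguments and set the persistent bit on request.
inline CmdqCreateArgs makeCmdqCreateArgs(uint32_t priority, bool isPersistent) {
    CmdqCreateArgs args{};
    args.priority = priority;
    if (isPersistent)
        args.flags |= kCmdqFlagPersistent;
    return args;
}
```

Zero-initialising the whole struct keeps the pad field at zero, which matters if the kernel side rejects non-zero padding.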

Latency results (PTL-SUT-0144, Mar5 RC FW, cpu10@2GHz, 50K iters):
  MLP15_b4:  -8.40µs (-14.1%)
  MLP15_b8:  -8.77µs (-14.5%)
  MLP15_b10: -9.83µs (-16.1%)
