fix: after the machine reboot, vNPU cannot be deleted, resulting in idle resource occupation #54

dartagnanli · 2026-01-30T09:07:42Z

After the machine restarts, there is no time to execute Ascend's destroy method, so the vNPUs that were in use are left behind in an idle state. When the model is rescheduled, a new vNPU is created. When resources are insufficient, this can prevent the model service pod from starting, and the vNPUs that were not deleted before the last restart have to be deleted manually. This PR adds logic that periodically checks for idle vNPUs and runs the deletion logic whenever one is found.

In the image below, the vNPUs with a status of 0 are the idle vNPUs left over after a restart. They were in use before the model restarted, and kubelet did not have enough time to execute the container's post-hook method during the restart.
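The periodic check described above could look roughly like the sketch below. It is not the code from this PR: `VNPU`, `Cleaner`, and the `List`, `Destroy`, and `InUse` hooks are hypothetical names used only for illustration, and the only detail taken from the description is that a status of 0 marks an idle vNPU.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// VNPU is a hypothetical view of one virtual NPU as reported by the driver.
type VNPU struct {
	CardID int
	ID     int
	Status int // 0 means idle, as in the PR description
}

// Cleaner bundles hypothetical hooks into the device layer.
// None of these names come from ascend-device-plugin itself.
type Cleaner struct {
	List    func() ([]VNPU, error) // enumerate all vNPUs on the node
	Destroy func(v VNPU) error     // delete a single vNPU
	InUse   func(v VNPU) bool      // optional guard: true if a pod still owns v
}

// CleanOnce deletes every idle vNPU that is not claimed by a pod
// and returns the number of vNPUs it removed.
func (c *Cleaner) CleanOnce() (int, error) {
	vnpus, err := c.List()
	if err != nil {
		return 0, fmt.Errorf("list vNPUs: %w", err)
	}
	removed := 0
	for _, v := range vnpus {
		if v.Status != 0 {
			continue // still running a workload
		}
		if c.InUse != nil && c.InUse(v) {
			continue // idle for now, but allocated to a pod that may be starting
		}
		if err := c.Destroy(v); err != nil {
			// Log and keep going so one failure does not block the rest.
			log.Printf("destroy vNPU %d on card %d: %v", v.ID, v.CardID, err)
			continue
		}
		removed++
	}
	return removed, nil
}

// Run calls CleanOnce on every tick until ctx is cancelled.
func (c *Cleaner) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n, err := c.CleanOnce(); err != nil {
				log.Printf("idle vNPU cleanup failed: %v", err)
			} else if n > 0 {
				log.Printf("removed %d idle vNPU(s)", n)
			}
		}
	}
}
```

A caller would start this once at plugin startup, for example `go cleaner.Run(ctx, time.Minute)`, where the one-minute interval is an arbitrary choice. The `InUse` guard is an assumption worth keeping in some form: without it, a vNPU that was just created for a pod that has not started yet could be mistaken for a leftover.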

hami-robot · 2026-01-30T09:07:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dartagnanli
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-01-30T09:07:51Z

Welcome @dartagnanli! It looks like this is your first PR to Project-HAMi/ascend-device-plugin 🎉

DSFans2014 · 2026-02-02T02:09:50Z

Please sign off your commit.

hami-robot bot added the dco-signoff: no label Jan 30, 2026

hami-robot bot requested review from DSFans2014 and archlitchi January 30, 2026 09:07

hami-robot bot added the size/L label Jan 30, 2026

DSFans2014 reviewed Feb 2, 2026

internal/manager/manager.go (outdated, resolved)

DSFans2014 reviewed Feb 2, 2026

cmd/main.go (outdated, resolved)

dartagnanli added 2 commits February 4, 2026 16:37

fix: after the machine reboot, vNPU cannot be deleted, resulting in i…

243af02

…dle resource occupation Signed-off-by: libin18 <libin18@kingsoft.com>

refactor: translate Chinese into English,and adjust the clean idle vn…

09ab29d

…pus code Signed-off-by: libin18 <libin18@kingsoft.com>

dartagnanli force-pushed the main branch from 6644a40 to 09ab29d February 4, 2026 08:40

hami-robot bot added dco-signoff: yes and removed dco-signoff: no labels Feb 4, 2026
