stp, sig-network: Introduce the stuntime measurement STP#37

Open
Anatw wants to merge 1 commit into RedHatQE:main from Anatw:stuntime_measurement_stp

Conversation


@Anatw Anatw commented Feb 18, 2026

What this PR does

Introduce an STP for stuntime measurement of VMs during live migration across different migration scenarios, focusing on secondary networks: Linux bridge and OVN localnet.

Summary by CodeRabbit

  • Documentation
    • Added a QE test plan for measuring VM live-migration stuntime on secondary networks (Linux bridge and OVN localnet).
    • Defines measurement method (ICMP with high-resolution timestamps), baseline/threshold calculation, and bidirectional testing across 12 scenarios and three migration paths.
    • Includes environment blueprint, tooling, entry criteria, risks/mitigations, known limitations, traceability, and approvers.


coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new QE Software Test Plan document for measuring stuntime during VM live migration on secondary networks (Linux bridge and OVN localnet). Specifies measurement method (ICMP ping with high-resolution timestamps, IPv4), baseline/threshold logic, 12 test scenarios, environment, tooling, entry criteria, risks, limitations, traceability, and approvers.

Changes

| Cohort / File(s) | Summary |
|:---|:---|
| **Test Plan Documentation**<br>`stps/sig-network/stuntime_measurement.md` | Adds a comprehensive QE Software Test Plan for stuntime measurement during VM live migration on secondary networks. Includes metadata/conventions, feature overview, scope/out-of-scope, measurement approach (ICMP ping, high-resolution timestamps, IPv4-only), baseline/threshold calculation (global baseline from repeated BM runs; per-scenario thresholds allowed), 12 scenarios (two topology types × three migration paths, bidirectional initiators), environment blueprint (Bare Metal focus, multi-node, NMState), tools (pytest / openshift-virtualization-tests), entry criteria, risks & mitigations, known limitations, traceability to CNV-72773, and approvers. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|:---|:---|:---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'stp, sig-network: Introduce the stuntime measurement STP' directly and clearly summarizes the main change: introducing a new STP document for stuntime measurement in the sig-network domain. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-virtualization-qe-bot-5

Report bugs in Issues

Welcome! 🎉

This pull request will be automatically processed with the following features:

🔄 Automatic Actions

  • Reviewer Assignment: Reviewers are automatically assigned based on the OWNERS file in the repository root
  • Size Labeling: PR size labels (XS, S, M, L, XL, XXL) are automatically applied based on changes
  • Issue Creation: A tracking issue is created for this PR and will be closed when the PR is merged or closed
  • Branch Labeling: Branch-specific labels are applied to track the target branch
  • Auto-verification: Auto-verified users have their PRs automatically marked as verified
  • Labels: Enabled categories: branch, can-be-merged, cherry-pick, has-conflicts, hold, needs-rebase, size, verified, wip

📋 Available Commands

PR Status Management

  • /wip - Mark PR as work in progress (adds WIP: prefix to title)
  • /wip cancel - Remove work in progress status
  • /hold - Block PR merging (approvers only)
  • /hold cancel - Unblock PR merging
  • /verified - Mark PR as verified
  • /verified cancel - Remove verification status
  • /reprocess - Trigger complete PR workflow reprocessing (useful if webhook failed or configuration changed)
  • /regenerate-welcome - Regenerate this welcome message

Review & Approval

  • /lgtm - Approve changes (looks good to me)
  • /approve - Approve PR (approvers only)
  • /assign-reviewers - Assign reviewers based on OWNERS file
  • /assign-reviewer @username - Assign specific reviewer
  • /check-can-merge - Check if PR meets merge requirements

Testing & Validation

  • /retest tox - Run Python test suite with tox
  • /retest all - Run all available tests

Cherry-pick Operations

  • /cherry-pick <branch> - Schedule cherry-pick to target branch when PR is merged
    • Multiple branches: /cherry-pick branch1 branch2 branch3

Label Management

  • /<label-name> - Add a label to the PR
  • /<label-name> cancel - Remove a label from the PR

✅ Merge Requirements

This PR will be automatically approved when the following conditions are met:

  1. Approval: /approve from at least one approver
  2. LGTM Count: Minimum 2 /lgtm from reviewers
  3. Status Checks: All required status checks must pass
  4. No Blockers: No WIP, hold, conflict labels
  5. Verified: PR must be marked as verified (if verification is enabled)

📊 Review Process

Approvers and Reviewers

Approvers:

  • EdDev

Reviewers:

  • Anatw
  • EdDev
  • azhivovk
  • servolkov
  • yossisegev
Available Labels
  • hold
  • verified
  • wip
  • lgtm
  • approve

💡 Tips

  • WIP Status: Use /wip when your PR is not ready for review
  • Verification: The verified label is automatically removed on each new commit
  • Cherry-picking: Cherry-pick labels are processed when the PR is merged
  • Permission Levels: Some commands require approver permissions
  • Auto-verified Users: Certain users have automatic verification and merge privileges

For more information, please refer to the project documentation or contact the maintainers.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
stps/sig-network/stuntime_measurement.md (1)

98-100: Clarify baseline storage and consider a minimum threshold floor.

Two gaps worth addressing before the plan is finalised:

  1. Where will the baseline values live? The STP says the threshold is "defined during the development phase" but does not specify where the concrete measured values (max stuntime per scenario) will be recorded — e.g., a Jira comment, a constants file in the test code, or a follow-up doc. Capturing this keeps the baseline auditable and makes future updates traceable.

  2. No minimum floor on the threshold. If the BM baseline yields a very low max (e.g., 50 ms), min(50 ms × 4, 5 s) = 200 ms. At 100 ms ping granularity that leaves only a ~1-packet margin, which may cause spurious failures. Consider establishing a minimum threshold floor (e.g., 500 ms) to absorb measurement noise.
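The reviewer's proposed rule can be expressed as a small helper. This is a sketch: the `min(baseline_max × 4, 5 s)` part is from the STP, while the `floor_s` default of 0.5 s is the reviewer's suggested addition, not something the plan mandates.

```python
def scenario_threshold(baseline_max_s: float,
                       factor: int = 4,
                       cap_s: float = 5.0,
                       floor_s: float = 0.5) -> float:
    """Per-scenario pass/fail threshold in seconds.

    STP rule: min(baseline_max * factor, cap_s).
    Reviewer addition: apply a floor afterwards so a very low
    bare-metal baseline cannot produce a near-zero threshold.
    """
    return max(min(baseline_max_s * factor, cap_s), floor_s)
```

With a 50 ms baseline this yields min(0.2 s, 5 s) = 0.2 s, lifted to the 0.5 s floor; a 2 s baseline is capped at 5 s.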

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` around lines 98 - 100, Update the
"Baseline and threshold" section in stuntime_measurement.md to (1) specify where
baseline measurements will be recorded (e.g., a dedicated
"stuntime-baselines.md" or a constants file in the test repo and an associated
Jira ticket ID) and the required metadata (scenario name, run date, BM cluster
ID, max stuntime, author) so baselines are auditable and traceable, and (2) add
a minimum threshold floor (e.g., floor = 500 ms) applied after computing
min(max*4, 5s) to avoid spurious failures from very low baselines; reference the
section header "Baseline and threshold" when adding these requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Line 166: Update the "Known Limitations" line that currently reads "Not
covering any special hardware or operators (e.g. no SR-IOV or service-mesh
reference)." by removing the misleading "service-mesh" reference and replacing
it with a more relevant exclusion such as "DPDK" or "macvlan"; specifically edit
the markdown under the Known Limitations heading (the line starting with "Not
covering any special hardware or operators") to mention only applicable
secondary CNIs/hardware (e.g., SR-IOV, DPDK, macvlan) and drop any
application-layer technologies like Istio/OSSM.
- Line 61: Add a note to the "Known Limitations" or "Test Environment" section
that the chosen command invocation (ping -D -O -i 0.1) requires elevated
privileges because unprivileged users are limited to a 200ms minimum interval;
explicitly state that tests must run as root, with CAP_NET_RAW, or with
appropriate kernel settings (e.g., net.ipv4.ping_group_range) inside the Fedora
VM, and update the test framework invocation to perform privilege escalation or
validate/abort if those privileges/settings are not present so the 100ms
interval measurement will succeed.

---

Nitpick comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Around line 98-100: Update the "Baseline and threshold" section in
stuntime_measurement.md to (1) specify where baseline measurements will be
recorded (e.g., a dedicated "stuntime-baselines.md" or a constants file in the
test repo and an associated Jira ticket ID) and the required metadata (scenario
name, run date, BM cluster ID, max stuntime, author) so baselines are auditable
and traceable, and (2) add a minimum threshold floor (e.g., floor = 500 ms)
applied after computing min(max*4, 5s) to avoid spurious failures from very low
baselines; reference the section header "Baseline and threshold" when adding
these requirements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
stps/sig-network/stuntime_measurement.md (1)

101-101: Unresolved: ping -i 0.1 requires elevated privileges inside the Fedora VM.

ping enforces a 200ms minimum interval for unprivileged users — sub-200ms intervals require root, CAP_NET_RAW on the binary, or a sufficiently broad net.ipv4.ping_group_range. The test environment section and known limitations do not document this requirement. Please add a note to either the Test Environment or Known Limitations section specifying the required privilege configuration for the Fedora VM.
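The validate-or-abort idea could be sketched as a preflight check in the test framework. This is a hypothetical helper: it covers the root and `net.ipv4.ping_group_range` cases that are visible from Python, while CAP_NET_RAW on the binary would need a separate `getcap` check.

```python
import os
from pathlib import Path

def assert_fast_ping_possible(interval_s: float = 0.1) -> None:
    """Fail fast if `ping -i <interval>` would be rejected for this user.

    iputils ping enforces a 200 ms minimum interval for unprivileged
    users; root, CAP_NET_RAW on the binary, or a ping_group_range
    covering the user's group lifts the restriction. This sketch checks
    only root and the group range (simplified to the primary GID).
    """
    if interval_s >= 0.2 or os.geteuid() == 0:
        return  # interval allowed for everyone, or we are root
    lo, hi = map(int, Path("/proc/sys/net/ipv4/ping_group_range")
                 .read_text().split())
    if not (lo <= os.getgid() <= hi):
        raise RuntimeError(
            f"ping -i {interval_s} needs root, CAP_NET_RAW, or "
            f"net.ipv4.ping_group_range covering gid {os.getgid()}"
        )
```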

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` at line 101, Add a note in the Test
Environment or Known Limitations section that the proposed command "ping -D -O
-i 0.1" requires elevated privileges on the Fedora VM (unprivileged users are
limited to 200ms intervals), and specify the acceptable remediation options
(running as root, granting CAP_NET_RAW to the ping binary, or configuring
net.ipv4.ping_group_range) so the reader knows the required privilege
configuration for accurate 100ms ICMP measurements of Stuntime.
🧹 Nitpick comments (3)
stps/sig-network/stuntime_measurement.md (3)

145-151: Add Exit Criteria to complement the Entry Criteria.

Section 4 only defines conditions to start testing. Adding Exit Criteria (e.g., all P0 scenarios executed, baseline thresholds recorded and committed, no open blockers) provides a clear definition of done for the testing phase and is standard STP practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` around lines 145 - 151, The
document only lists "4. Entry Criteria" and lacks a matching "Exit Criteria" to
define test completion; add a new subsection (e.g., "4a. Exit Criteria" or "5.
Exit Criteria") after the Entry Criteria heading in stuntime_measurement.md that
enumerates concrete done conditions such as "All P0/P1 test scenarios executed
and passed or documented failures," "Baseline thresholds recorded and
committed," "No open blocking defects," "Test reports and logs uploaded," and
"Acceptance sign-off obtained" so testers have a clear definition of done;
ensure the new section mirrors the style/format of the Entry Criteria list and
references the same terms used elsewhere in the document (Entry Criteria, P0
scenarios, baseline thresholds).

7-9: Consider linking the HLD in the Enhancement(s) field.

The field currently shows "-". If an HLD document exists for this feature, it should be referenced here — an enhancement PR is not required, but at minimum the HLD link provides traceability context for reviewers.

Based on learnings: in this repository, when no enhancement PR exists, it is acceptable to reference only the HLD document in the Enhancement(s) field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` around lines 7 - 9, Update the
Enhancement(s) field in stuntime_measurement.md to reference the HLD document
instead of a "-" placeholder: locate the table row labeled "**Enhancement(s)**"
and replace the "-" with a link to the HLD (or a short note like "HLD: <URL or
doc name>") so reviewers have traceability; if no HLD exists, replace "-" with
"None" or "No enhancement PR; see HLD: <if available>" to make intent explicit.

63-83: Consider condensing the test goals into a matrix.

The 12 bullet points are fully determined by the cross-product of two dimensions, resulting in repeated phrasing across both network sections. A compact table would present the same information without redundancy:

| Migration Scenario | Linux Bridge — Migrated VM | Linux Bridge — Static VM | OVN Localnet — Migrated VM | OVN Localnet — Static VM |
|:---|:---:|:---:|:---:|:---:|
| Same node → different node | | | | |
| Different node → same node | | | | |
| Between two different nodes | | | | |
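The cross-product structure also means the 12 scenario names could be generated rather than hand-maintained. A sketch, with labels taken from the matrix above (the exact string format is illustrative):

```python
from itertools import product

NETWORKS = ("Linux bridge", "OVN localnet")
MIGRATIONS = (
    "same node -> different node",
    "different node -> same node",
    "between two different nodes",
)
INITIATORS = ("migrated VM", "static VM")  # bidirectional connectivity initiation

# One scenario per (network, migration path, ping initiator) combination.
SCENARIOS = [
    f"{net} | {mig} | initiated by {who}"
    for net, mig, who in product(NETWORKS, MIGRATIONS, INITIATORS)
]
```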
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` around lines 63 - 83, The "Testing
Goals" section repeats 12 nearly identical bullet points; replace the two
repeated network subsections ("VM with secondary network connected to a Linux
bridge" and "VM with secondary network connected to OVN localnet") with a
compact matrix/table that cross-products the three migration scenarios ("Same
node → different node", "Different node → same node", "Between two different
nodes") against the four test targets ("Linux Bridge — Migrated VM", "Linux
Bridge — Static VM", "OVN Localnet — Migrated VM", "OVN Localnet — Static VM"),
keep the P0 priority note, and ensure the table entries mark which scenarios
apply (e.g., ✓) to remove redundancy while preserving all original test cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Around line 96-98: Update the "Baseline and threshold" section to explicitly
state that baselines and thresholds are derived per scenario (i.e., compute a
separate baseline for each of the 12 stuntime measurement scenarios) and clarify
that "10 runs" means 10 repetitions per scenario (120 measurements total);
describe the computation as: for each scenario, run it 10 times on the BM
cluster, take the maximum stuntime observed for that scenario, then set that
scenario's threshold to min(max × 4, 5s). Reference the "Baseline and threshold"
heading and the "stuntime measurement scenarios" phrasing so readers know which
items get per-scenario baselines.

---

Duplicate comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Line 101: Add a note in the Test Environment or Known Limitations section that
the proposed command "ping -D -O -i 0.1" requires elevated privileges on the
Fedora VM (unprivileged users are limited to 200ms intervals), and specify the
acceptable remediation options (running as root, granting CAP_NET_RAW to the
ping binary, or configuring net.ipv4.ping_group_range) so the reader knows the
required privilege configuration for accurate 100ms ICMP measurements of
Stuntime.

---

Nitpick comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Around line 145-151: The document only lists "4. Entry Criteria" and lacks a
matching "Exit Criteria" to define test completion; add a new subsection (e.g.,
"4a. Exit Criteria" or "5. Exit Criteria") after the Entry Criteria heading in
stuntime_measurement.md that enumerates concrete done conditions such as "All
P0/P1 test scenarios executed and passed or documented failures," "Baseline
thresholds recorded and committed," "No open blocking defects," "Test reports
and logs uploaded," and "Acceptance sign-off obtained" so testers have a clear
definition of done; ensure the new section mirrors the style/format of the Entry
Criteria list and references the same terms used elsewhere in the document
(Entry Criteria, P0 scenarios, baseline thresholds).
- Around line 7-9: Update the Enhancement(s) field in stuntime_measurement.md to
reference the HLD document instead of a "-" placeholder: locate the table row
labeled "**Enhancement(s)**" and replace the "-" with a link to the HLD (or a
short note like "HLD: <URL or doc name>") so reviewers have traceability; if no
HLD exists, replace "-" with "None" or "No enhancement PR; see HLD: <if
available>" to make intent explicit.
- Around line 63-83: The "Testing Goals" section repeats 12 nearly identical
bullet points; replace the two repeated network subsections ("VM with secondary
network connected to a Linux bridge" and "VM with secondary network connected to
OVN localnet") with a compact matrix/table that cross-products the three
migration scenarios ("Same node → different node", "Different node → same node",
"Between two different nodes") against the four test targets ("Linux Bridge —
Migrated VM", "Linux Bridge — Static VM", "OVN Localnet — Migrated VM", "OVN
Localnet — Static VM"), keep the P0 priority note, and ensure the table entries
mark which scenarios apply (e.g., ✓) to remove redundancy while preserving all
original test cases.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efd6fcc and 4cac21a.

📒 Files selected for processing (1)
  • stps/sig-network/stuntime_measurement.md

@Anatw
Author

Anatw commented Feb 23, 2026

Change:

  • Change marking to reflect review status.
  • Move stuntime measurement details into its own section (Measurement approach).
  • Reword IPv6 out-of-scope for clarity.
  • Reword worst-case guarantee rationale (scope exclusion, not product limitation).
  • Expand Platform rationale (BM-only; product behavior, not cloud/PSI).
  • Remove service-mesh from Known Limitations.

@Anatw
Author

Anatw commented Feb 23, 2026

Change:

  • Drop another IPv6 future reference.
  • Refactor threshold definition for clarity.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
stps/sig-network/stuntime_measurement.md (1)

100-101: Consider breaking the measurement approach into a short structured list.

The single paragraph packs tool selection, discarded alternatives, command flags, and the stuntime calculation formula into one dense run-on sentence. Splitting it into sub-bullets (one per concern) would make this section much easier to scan.

♻️ Suggested restructure
-**Measurement approach**
-Stuntime will be measured using the ICMP ping tool (simple, already in codebase and matches main use case with a simple stuntime measurement). No need for more robust connectivity tools since the goal is stuntime duration, not connection verification. Alternatives considered: tcping (not in codebase, adds dependency), iperf3 (heavier, overkill for drop/return timing), curl (requires server in VM). ICMP packets will be sent at 100ms intervals with UNIX timestamps enabled and explicit reporting of dropped packets (ping -D -O -i 0.1) to achieve high-resolution measurement. Stuntime is defined as the connectivity gap duration, calculated by subtracting the timestamp of the last successful packet before failure from the timestamp of the first successful packet after recovery.
+**Measurement approach**
+
+Stuntime is measured using the ICMP `ping` tool — it is already in the codebase, is simple, and directly matches the use case (stuntime duration, not connection verification).
+
+Alternatives considered and discarded:
+- **tcping**: not in codebase, adds a dependency.
+- **iperf3**: heavier protocol, overkill for drop/return timing.
+- **curl**: requires a server process inside the VM.
+
+ICMP packets are sent at 100 ms intervals using `ping -D -O -i 0.1`:
+- `-D`: prints a UNIX timestamp before each output line.
+- `-O`: reports dropped packets explicitly ("no answer yet for icmp_seq=N").
+- `-i 0.1`: sets the 100 ms inter-packet interval.
+
+**Stuntime calculation:** connectivity gap duration = timestamp of the first successful packet after recovery − timestamp of the last successful packet before failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stps/sig-network/stuntime_measurement.md` around lines 100 - 101, Split the
dense "Measurement approach" paragraph into a short structured list: one bullet
for the chosen tool and why (ICMP ping), one bullet listing discarded
alternatives and brief reasons, one bullet showing the exact command/flags to
use (ping -D -O -i 0.1) and their purpose, and one bullet stating the stuntime
definition and calculation (subtract timestamp of last successful packet before
failure from first successful packet after recovery); keep each bullet short and
use the existing headings/phrasing ("Measurement approach", "ICMP ping", "ping
-D -O -i 0.1", "stuntime") so readers can easily locate and scan the content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Line 132: The Test Environment table's Network row is missing a Configuration
cell; update the table in stuntime_measurement.md by adding a short, consistent
value such as "IPv4 / Multi-NIC" (or "IPv4; Multi-NIC") into the Configuration
column for the row labeled "Network" so the table matches other rows and remains
consistent.

---

Nitpick comments:
In `@stps/sig-network/stuntime_measurement.md`:
- Around line 100-101: Split the dense "Measurement approach" paragraph into a
short structured list: one bullet for the chosen tool and why (ICMP ping), one
bullet listing discarded alternatives and brief reasons, one bullet showing the
exact command/flags to use (ping -D -O -i 0.1) and their purpose, and one bullet
stating the stuntime definition and calculation (subtract timestamp of last
successful packet before failure from first successful packet after recovery);
keep each bullet short and use the existing headings/phrasing ("Measurement
approach", "ICMP ping", "ping -D -O -i 0.1", "stuntime") so readers can easily
locate and scan the content.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4cac21a and b28b6da.

📒 Files selected for processing (1)
  • stps/sig-network/stuntime_measurement.md


| Category | Tools/Frameworks |
|:-------------------|:--------------------------------------------------------------------------------------------------|
| **Test Framework** | Standard pytest/openshift-virtualization-tests. |


I am not in favor of this section; I see it is empty in other STPs. It is an implementation detail: tomorrow we get rid of pytest and this section immediately becomes obsolete.

But I understand you are following the template. Let's see what @rnetser says.


| Requirement ID | Requirement Summary | Test Scenario(s) | Tier | Priority |
|:---------------|:---------------------|:-----------------|:-------|:---------|
| CNV-72773 | As a user, I want to know what stuntime I can expect from a VM during live migration, in different migration scenarios. | Measure stuntime for all 12 scenarios (Linux bridge + OVN localnet × 3 migration scenarios × bidirectional connectivity initiation) | Tier 2 | P0 |


Hm, as for me, I get it; hopefully other folks will not force you to disclose the list.

@Anatw
Author

Anatw commented Feb 26, 2026

Changes:

  • Defer threshold details to STD (Feature Overview + Technology Challenges)
  • Trim Scope of Testing duplication; consolidate IPv4/IPv6 reasoning into Out-of-Scope table
  • Expand IPv6 out-of-scope rationale; add IP-family-independent risk mitigation
  • Fix Test Strategy: Security N→N/A, Compatibility Y→N, add Regression cross-version note
  • Add Guest OS to Test Environment; replace Known Limitations list with Out-of-Scope reference

@servolkov

servolkov commented Feb 26, 2026

Generally LGTM. I have caught the idea and the direction you are going to move in; I don't want to spend more time and focus on nit details. Thanks.

@Anatw
Author

Anatw commented Mar 1, 2026

Changes:

  • Remove OS reference (move to the STD).
  • Remove Test Framework reference.

Contributor

@azhivovk azhivovk left a comment

Thanks

Collaborator

@EdDev EdDev left a comment

I reviewed the first section: "Motivation and Requirements".

I will continue with the rest later, please start addressing the current ones.

Comment on lines +10 to +11
| **Feature in Jira** | https://issues.redhat.com/browse/CNV-72773 |
| **Jira Tracking** | https://issues.redhat.com/browse/CNV-78676 |
Collaborator

"Jira Tracking" refers to the epic that tracks this effort and in which all the work is expressed.
"Feature in Jira" is referring to a Feature type ticket, but there seems to be none, so you need to explain why not or provide an alternative.

@rnetser , maybe the fields should be named "Feature ref" and "Epic ref", WDYT?

Author

I followed the pattern I saw in previous STPs. Changed according to your explanation.

| **QE Owner(s)** | Anat Wax (awax@redhat.com) |
| **Owning SIG** | sig-network |
| **Participating SIGs** | sig-network |
| **Current Status** | Draft |
Collaborator

When merged, it will not be a draft.

@rnetser , this is a repeating issue I see. Maybe it needs to be dropped as no one will update this field.
In the kubevirt project, the status is tracked by an issue that has links to everything, but here the tracker is the epic and it seems good enough IMO.

Author

Yes, I was wondering how this will work as well.

Comment on lines +21 to +22
- **Migrated VM:** The VM that is live-migrated during the test.
- **Static VM:** The peer VM that remains on its node throughout the test. Used as the reference point for migration direction (e.g., "from the static VM's node").
Collaborator

This is a strong hint that you defined an STP in this document.

Author

I assume you meant "STD".
A question arose during the review about what "live migrated from the same node to a different node" or "live migrated from a different node to the same node" mean. I added these definitions (Migrated VM, Static VM) so the test scenarios would be clear.
#37 (comment)


### **Feature Overview**

Customers running live migration on secondary networks need predictable VM downtime. We need a way to detect regressions in migration behavior.
Collaborator

running live migration on secondary networks

This describes something different from what I suspect you intended.
There is such a feature on which migration actually runs on secondary networks (and not on the primary). What I suspect you intended to describe here is that the network connectivity traffic which is relevant for the "downtime" is passing through these networks.

IMO you should generalize the description here and do not mention which network.
In the scope, you can emphasize that the primary or other secondary network types are not in the scope.

Author

I wasn't aware of this feature, thanks for the explanation. This is indeed not what I meant to describe.
I simplified this paragraph. The network scope (default pod network and other secondary types out of scope) is already covered in the Out of Scope section.

### **Feature Overview**

Customers running live migration on secondary networks need predictable VM downtime. We need a way to detect regressions in migration behavior.
The feature defines and measures VM stuntime during live migration and establishes a baseline and a pass/fail threshold for testing (to be defined in the STD). Testing focuses on configurations used by the vast majority of our customers - secondary network configurations: Linux bridge and OVN localnet.
Collaborator

Suggested change
The feature defines and measures VM stuntime during live migration and establishes a baseline and a pass/fail threshold for testing (to be defined in the STD). Testing focuses on configurations used by the vast majority of our customers - secondary network configurations: Linux bridge and OVN localnet.
The feature defines and measures VM stuntime during live migration and establishes a baseline with a pass/fail threshold. Cover and focus on configurations used by the vast majority of our customers - secondary network configurations: Linux bridge and OVN localnet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

| **Review Requirements** | [x] | Publish VM stuntime during live migration for users' awareness. To be measured on secondary networks - Linux bridge and OVN localnet. | |
| **Understand Value** | [x] | Published stuntime lets users set expectations for VM downtime during live migration and compare OCP-V with other virtualization solutions. For QE, automated measurement provides on-demand stuntime data, and continuous test coverage catches regressions in live migration behavior. | |
| **Customer Use Cases** | [x] | Customers need predictable VM downtime during VM live migration. | |
| **Testability** | [x] | Stuntime is testable: connectivity gap is measurable from first packet loss to first packet recovery. The measuring scope (secondary networks, topologies) is well-defined. | |
Collaborator

Suggested change
| **Testability** | [x] | Stuntime is testable: connectivity gap is measurable from first packet loss to first packet recovery. The measuring scope (secondary networks, topologies) is well-defined. | |
| **Testability** | [x] | Testable by measuring the time period in which network traffic is lost. | |

Author

Ack.

| **Understand Value** | [x] | Published stuntime lets users set expectations for VM downtime during live migration and compare OCP-V with other virtualization solutions. For QE, automated measurement provides on-demand stuntime data, and continuous test coverage catches regressions in live migration behavior. | |
| **Customer Use Cases** | [x] | Customers need predictable VM downtime during VM live migration. | |
| **Testability** | [x] | Stuntime is testable: connectivity gap is measurable from first packet loss to first packet recovery. The measuring scope (secondary networks, topologies) is well-defined. | |
| **Acceptance Criteria** | [x] | Stuntime measured in a BM environment to allow later publication in blog/KCS. Stuntime value must be easily retrievable from test logs to enable baseline updates and reports. | |
Collaborator

  • In which environment it is measured is irrelevant in this context.
  • "Easily retrievable" and "from test logs" are relative and assume implementation details. Please focus on the end result, e.g., "Measured values can be publicly shared and used as formal thresholds".

To clarify, it is unlikely that we will take the numbers measured from the logs and post them anywhere.
Most likely, and this is my expectation, the tests will measure values and assert that they are in a certain range. The range will be published, and the tests will just make sure we are not outside of it.
If we detect that the range has changed, there is either a product issue or we need to update the published values.
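The range-assertion approach described here could be sketched minimally as below. The threshold value, constant, and function name are placeholders, not values from the STP or the eventual published range:

```python
# Hypothetical sketch: assert a measured stuntime falls within the published
# range instead of reporting raw numbers. The threshold is a placeholder.

PUBLISHED_STUNTIME_MAX = 2.0  # seconds; stands in for the published value


def check_stuntime(measured_seconds, threshold=PUBLISHED_STUNTIME_MAX):
    """Fail the test run if the measured stuntime exceeds the published range."""
    assert measured_seconds <= threshold, (
        f"stuntime {measured_seconds:.2f}s exceeds published threshold "
        f"{threshold:.2f}s: either a product regression, or the published "
        f"values need updating"
    )


check_stuntime(0.3)  # within range: passes silently
```

A failing assertion then surfaces exactly the decision the reviewer describes: investigate a product issue, or update the published values.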

Author

Makes sense, thanks.

| **Customer Use Cases** | [x] | Customers need predictable VM downtime during VM live migration. | |
| **Testability** | [x] | Stuntime is testable: connectivity gap is measurable from first packet loss to first packet recovery. The measuring scope (secondary networks, topologies) is well-defined. | |
| **Acceptance Criteria** | [x] | Stuntime measured in a BM environment to allow later publication in blog/KCS. Stuntime value must be easily retrievable from test logs to enable baseline updates and reports. | |
| **Non-Functional Requirements (NFRs)** | [x] | Measured stuntime will be documented in a KCS or a Red Hat blog post. | |
Collaborator

In what way is this relevant to the STP? I do not understand.

Author

Maybe I don't understand NFRs correctly; I thought the documentation (KCS/blog) fell under this category.
Reading through the example, it does mention docs specifically:

Confirmed coverage for NFRs, including Performance, Security, Usability, Downtime, Connectivity, Monitoring (alerts/metrics), Scalability, Portability (e.g., cloud support), and Docs.

If it doesn’t fit in this context, I’m happy to change it. Could you clarify what you’d expect here?


| Check | Done | Details/Notes | Comments |
|:---------------------------------|:-----|:--------------------------------------------------------------------------------------------------------------------------------------------------------|:---------|
| **Developer Handoff/QE Kickoff** | [x] | Not a new feature. | |
Collaborator

The fact that it is not a new product feature is not convincing. Any work is expected to pass through some kind of kickoff to sync the relevant engineering members and clarify open questions.

Author

I had a meeting with you and with Petr to go over the epic and discuss measurement strategies. I've updated the details.

| Check | Done | Details/Notes | Comments |
|:---------------------------------|:-----|:--------------------------------------------------------------------------------------------------------------------------------------------------------|:---------|
| **Developer Handoff/QE Kickoff** | [x] | Not a new feature. | |
| **Technology Challenges** | [x] | Stuntime is sensitive to network workload and infrastructure, so measured values may vary across different labs and environments. A single threshold will be established and validated across available BM environments (see STD). Per-environment adjustments will be considered only if specific labs show consistent deviation. | |
Collaborator

  • I do not think you should mention the STD in an STP.
  • You can mention what is needed, something like "a relatively stable setup environment which can provide as stable results as possible".

Author

Ack.

Introduce STP for stuntime measurement of VMs through live migration
across different migration scenarios, focusing on secondary networks:
Linux bridge and OVN localnet.
@Anatw (Author) commented Mar 4, 2026

Changed according to CR:

  • Feature Overview: Generalized; removed network-specific wording; scope now points to Section II.
  • Metadata: Clarified Feature in Jira (N/A, work tracked under Epic).
  • Requirements checklist: Updated Review Requirements, Understand Value, Customer Use Cases, Testability, Acceptance Criteria, Developer Handoff, and Technology Challenges per review feedback.
  • Scope of Testing: Clarified connectivity vs migration; added a pointer to Testing Goals.
  • Out of Scope: Confirmed default pod network and other secondary CNIs are listed.
  • Other: Minor edits and consistency fixes (e.g., Storage row, Upgrade wording).
