Summary
get_nth_of_type() in selector_generation.rs identifies a node's position among its siblings by comparing outer_html() strings, not by DOM node identity. When two or more sibling elements have identical HTML content, the function always returns position 1 for every one of them, producing wrong nth-of-type() selectors.
Category
Bug
Severity
Medium
Location
- File:
src/parser/selector_generation.rs
- Line(s): 74–91 (
get_nth_of_type function), called at lines 23 and 55
Details
// Find our position by comparing node identity via outer_html
let our_html = selector.outer_html();
for (i, child) in same_tag.iter().enumerate() {
if child.outer_html().as_str() == our_html.as_str() {
return i + 1;
}
}
Consider this HTML:
<ul>
<li>Apple</li>
<li>Apple</li>
<li>Apple</li>
</ul>
For the third <li>, our_html is <li>Apple</li>. The loop finds the first child that matches — always index 0 — and returns position 1. All three siblings get nth-of-type(1), meaning:
generate_css_selector returns li:nth-of-type(1) for all three
generate_xpath_selector returns //li[1] for all three
This affects any scraper that uses selector generation for element relocation (the adaptive mode feature). Re-locating the second or third duplicate sibling will always resolve to the first one instead.
Suggested Fix
The ego_tree crate exposes NodeId which is a stable, comparable identity token. Use it for position finding instead of HTML string comparison:
fn get_nth_of_type(selector: &Selector) -> usize {
let tag = selector.tag();
let self_id = selector.node_id(); // expose NodeId or pass it in
if let Some(parent) = selector.parent() {
let children = parent.children();
let same_tag: Vec<_> = children.into_iter().filter(|c| c.tag() == tag).collect();
if same_tag.len() <= 1 {
return 0;
}
for (i, child) in same_tag.iter().enumerate() {
if child.node_id() == self_id {
return i + 1;
}
}
}
0
}
This requires exposing a node_id() accessor on Selector (the field is already pub(crate) internally). Alternatively, use std::ptr::eq on the underlying node reference if the tree crate provides a stable reference.
Effort Estimate
30 min
Automated finding by repo-health-agent v1.0
Summary
get_nth_of_type()inselector_generation.rsidentifies a node's position among its siblings by comparingouter_html()strings, not by DOM node identity. When two or more sibling elements have identical HTML content, the function always returns position1for every one of them, producing wrongnth-of-type()selectors.Category
Bug
Severity
Medium
Location
src/parser/selector_generation.rsget_nth_of_typefunction), called at lines 23 and 55Details
Consider this HTML:
For the third
<li>,our_htmlis<li>Apple</li>. The loop finds the first child that matches — always index 0 — and returns position1. All three siblings getnth-of-type(1), meaning:generate_css_selectorreturnsli:nth-of-type(1)for all threegenerate_xpath_selectorreturns//li[1]for all threeThis affects any scraper that uses selector generation for element relocation (the adaptive mode feature). Re-locating the second or third duplicate sibling will always resolve to the first one instead.
Suggested Fix
The
ego_treecrate exposesNodeIdwhich is a stable, comparable identity token. Use it for position finding instead of HTML string comparison:This requires exposing a
node_id()accessor onSelector(the field is alreadypub(crate)internally). Alternatively, usestd::ptr::eqon the underlying node reference if the tree crate provides a stable reference.Effort Estimate
30 min
Automated finding by repo-health-agent v1.0