Skip to content

[repo-health] Bug: get_nth_of_type() uses outer_html string comparison for node identity — generates wrong selectors for duplicate siblings #14

@Liohtml

Description

@Liohtml

Summary

get_nth_of_type() in selector_generation.rs identifies a node's position among its siblings by comparing outer_html() strings, not by DOM node identity. When two or more sibling elements have identical HTML content, the function always returns position 1 for every one of them, producing wrong nth-of-type() selectors.

Category

Bug

Severity

Medium

Location

  • File: src/parser/selector_generation.rs
  • Line(s): 74–91 (get_nth_of_type function), called at lines 23 and 55

Details

// Find our position by comparing node identity via outer_html
let our_html = selector.outer_html();
for (i, child) in same_tag.iter().enumerate() {
    if child.outer_html().as_str() == our_html.as_str() {
        return i + 1;
    }
}

Consider this HTML:

<ul>
  <li>Apple</li>
  <li>Apple</li>
  <li>Apple</li>
</ul>

For the third <li>, our_html is <li>Apple</li>. The loop finds the first child that matches — always index 0 — and returns position 1. All three siblings get nth-of-type(1), meaning:

  • generate_css_selector returns li:nth-of-type(1) for all three
  • generate_xpath_selector returns //li[1] for all three

This affects any scraper that uses selector generation for element relocation (the adaptive mode feature). Re-locating the second or third duplicate sibling will always resolve to the first one instead.

Suggested Fix

The ego_tree crate exposes NodeId which is a stable, comparable identity token. Use it for position finding instead of HTML string comparison:

fn get_nth_of_type(selector: &Selector) -> usize {
    let tag = selector.tag();
    let self_id = selector.node_id(); // expose NodeId or pass it in
    if let Some(parent) = selector.parent() {
        let children = parent.children();
        let same_tag: Vec<_> = children.into_iter().filter(|c| c.tag() == tag).collect();
        if same_tag.len() <= 1 {
            return 0;
        }
        for (i, child) in same_tag.iter().enumerate() {
            if child.node_id() == self_id {
                return i + 1;
            }
        }
    }
    0
}

This requires exposing a node_id() accessor on Selector (the field is already pub(crate) internally). Alternatively, use std::ptr::eq on the underlying node reference if the tree crate provides a stable reference.

Effort Estimate

30 min


Automated finding by repo-health-agent v1.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions