fix(haidan): Fix seeding quantity and size extraction with deduplication#741

Open
madrays wants to merge 1 commit into pt-plugins:master from madrays:fix/haidan-seeding-data

madrays commented Nov 1, 2025

fix(haidan): Fix seeding quantity and size extraction, and add deduplication

- Fix seeding quantity: count deduplicated table rows instead of directly reading the incorrect value 588.
- Fix seeding size: skip duplicate torrents when accumulating torrent sizes.
- Use the torrent ID (details.php?id=XXX) as the unique identifier for deduplication.
- Resolve the issue where the HTML fragment returned by the Haidan site contains 100% duplicated data.
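
The deduplicated count described above can be sketched without any DOM library. This is a minimal illustration, not the code in the PR: the helper names `torrentIdOf` and `countUniqueRows` and the sample rows are hypothetical, and each row is assumed to be available as a raw HTML string. Rows without a details.php link fall back to their full HTML as the key, mirroring the fallback mentioned in the reviewer's guide.

```typescript
// Hypothetical row shape: the raw HTML of one <tr> from the seeding table.
type RowHtml = string;

// Pull the torrent ID out of the first details.php?id=... link in a row.
function torrentIdOf(row: RowHtml): string | null {
  const match = /details\.php\?id=(\d+)/.exec(row);
  return match ? match[1] : null;
}

// Count rows by unique torrent ID. Rows without an ID use their own HTML as the key.
function countUniqueRows(rows: RowHtml[]): number {
  const seen = new Set<string>();
  for (const row of rows) {
    seen.add(torrentIdOf(row) ?? row);
  }
  return seen.size;
}

// Made-up sample: the first two rows are the same torrent listed twice.
const rows: RowHtml[] = [
  '<tr><td><a href="details.php?id=101">A</a></td><td>1.5 GB</td></tr>',
  '<tr><td><a href="details.php?id=101">A</a></td><td>1.5 GB</td></tr>',
  '<tr><td><a href="details.php?id=202">B</a></td><td>700 MB</td></tr>',
];

console.log(countUniqueRows(rows)); // 2
```

Using a `Set` keeps the count independent of row order and of how many times a torrent is repeated.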

sourcery-ai bot left a comment

We've reviewed this pull request using the Sourcery rules engine

sourcery-ai bot commented Nov 1, 2025

Reviewer's Guide

This PR overhauls the Haidan site scraper by moving seeding count and size calculations into a new AJAX‐driven process pipeline, parsing the returned HTML fragment with createDocument and Sizzle, deduplicating entries by torrent ID, and accumulating sizes with parseSizeString, plus minor regex and URL fixes.

Sequence diagram for AJAX-driven seeding info extraction and deduplication

sequenceDiagram
    participant Client
    participant HaidanSite
    participant "createDocument/Sizzle"
    participant "Deduplication Logic"
    participant "Size Accumulation"
    Client->>HaidanSite: GET /getusertorrentlistajax.php (type=seeding)
    HaidanSite-->>Client: HTML fragment (table rows)
    Client->>"createDocument/Sizzle": Parse HTML fragment
    "createDocument/Sizzle"-->>"Deduplication Logic": Extract torrent IDs from details.php?id=XXX
    "Deduplication Logic"-->>Client: Deduplicated seeding count
    "createDocument/Sizzle"-->>"Size Accumulation": Extract and sum sizes for unique torrents
    "Size Accumulation"-->>Client: Deduplicated seeding size

ER diagram for deduplicated seeding data extraction

erDiagram
    USER ||--o{ TORRENT : "seeds"
    TORRENT {
        id int PK
        size float
    }
    USER {
        id int PK
        seeding_count int
        seeding_size float
    }
    USER ||--o{ SEEDING : "deduplicated by torrent id"
    SEEDING {
        user_id int FK
        torrent_id int FK
    }

Class diagram for updated seeding and seedingSize extraction logic

classDiagram
    class SiteMetadata {
        +process: Array<ProcessStep>
    }
    class ProcessStep {
        +requestConfig
        +fields
        +selectors
    }
    class SeedingSelector {
        +filters: [deduplicate by torrent ID]
    }
    class SeedingSizeSelector {
        +filters: [deduplicate by torrent ID, accumulate size]
    }
    SiteMetadata --> ProcessStep
    ProcessStep --> SeedingSelector
    ProcessStep --> SeedingSizeSelector
    SeedingSelector <|-- SeedingSizeSelector

File-Level Changes

Change: Imported HTML and filesize utilities and fixed regex for numeric parsing
  • Added imports for parseSizeString, sizePattern, Sizzle, and createDocument
  • Adjusted whitespace/comma regex to /[\s,]/g
  Files: src/packages/site/definitions/haidan.ts

Change: Updated site base URL
  Files: src/packages/site/definitions/haidan.ts

Change: Replaced static selector-based seeding extraction with AJAX pipeline
  • Removed direct seeding selectors and moved seeding/size fields into process steps
  • Added multi-step process: index.php → userdetails.php → getusertorrentlistajax.php → mybonus.php
  Files: src/packages/site/definitions/haidan.ts

Change: Implemented deduplicated seeding count logic
  • Parsed HTML fragment into Document and selected all rows
  • Extracted torrent ID from details.php link or fallback to row HTML
  • Used a Set to dedupe IDs and returned its size
  Files: src/packages/site/definitions/haidan.ts

Change: Implemented deduplicated seeding size accumulation
  • Auto-detected size column index using sizePattern
  • Parsed each row's size cell with parseSizeString
  • Skipped duplicate IDs via Set and summed total size
  Files: src/packages/site/definitions/haidan.ts
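
The size-accumulation change listed above combines three ideas: detect which column holds the size, skip rows whose torrent ID was already seen, and sum the rest. The sketch below illustrates that combination under stated assumptions. `SIZE_PATTERN` and `toBytes` are simplified stand-ins for the project's `sizePattern` and `parseSizeString` (whose real definitions are not shown in this PR), binary units are assumed, and the `SeedingRow` shape and sample data are hypothetical.

```typescript
// Simplified stand-ins for the project's sizePattern and parseSizeString.
// Assumption: sizes look like "1.5 GB" and use binary (1024-based) units.
const SIZE_PATTERN = /^\d+(?:\.\d+)?\s*[KMGT]?B$/i;
const UNIT_FACTORS: Record<string, number> = {
  B: 1,
  KB: 1024,
  MB: 1024 ** 2,
  GB: 1024 ** 3,
  TB: 1024 ** 4,
};

function toBytes(text: string): number {
  // Drop whitespace and thousands separators using the /[\s,]/g character class.
  const cleaned = text.replace(/[\s,]/g, "").toUpperCase();
  const match = /^(\d+(?:\.\d+)?)([KMGT]?B)$/.exec(cleaned);
  return match ? parseFloat(match[1]) * UNIT_FACTORS[match[2]] : 0;
}

// Hypothetical row shape: torrent ID plus the text of each table cell.
interface SeedingRow {
  torrentId: string;
  cells: string[];
}

// Use the first cell of the first row that looks like a size; otherwise fall back to column 2.
function detectSizeColumn(rows: SeedingRow[], fallback = 2): number {
  if (rows.length === 0) return fallback;
  const index = rows[0].cells.findIndex((cell) => SIZE_PATTERN.test(cell.trim()));
  return index >= 0 ? index : fallback;
}

// Sum sizes, counting each torrent ID only once.
function sumUniqueSizes(rows: SeedingRow[]): number {
  const sizeIndex = detectSizeColumn(rows);
  const seen = new Set<string>();
  let total = 0;
  for (const row of rows) {
    if (seen.has(row.torrentId)) continue; // duplicate row: already counted
    seen.add(row.torrentId);
    total += toBytes(row.cells[sizeIndex] ?? "");
  }
  return total;
}

// Made-up sample: torrent 101 appears twice and the size sits in column 3.
const rows: SeedingRow[] = [
  { torrentId: "101", cells: ["Movie", "A", "12", "1.5 GB"] },
  { torrentId: "101", cells: ["Movie", "A", "12", "1.5 GB"] },
  { torrentId: "202", cells: ["TV", "B", "4", "1,024 KB"] },
];

console.log(sumUniqueSizes(rows)); // 1611661312
```

Without the `seen` check the same sample would sum to 3222274048 bytes, which is the kind of inflation that fully duplicated rows produce.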

Rhilip (Collaborator) commented Nov 4, 2025

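For sites built on the NPHP (NexusPHP) architecture, there is already a dedicated function that handles the seeding and seedingSize calculation. Judging from the PR alone, its implementation is essentially the same as the one in the schema, except that the schema has no deduplication: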

    /**
     * NexusPHP uses an AJAX interaction here. If responseType: 'document' is forced,
     * parsing fails (i.e. data = undefined) because the returned content is not valid HTML.
     * So responseType is left unspecified and the response is returned as a plain-text
     * string, leaving it to getUserSeedingStatus to build the Document.
     *
     * @param userId
     * @param type
     * @protected
     */
    protected async requestUserSeedingPage(userId: number, type: string = "seeding"): Promise<string | null> {
      const { data } = await this.request<string>({
        url: "/getusertorrentlistajax.php",
        params: { userid: userId, type },
      });
      return data || null;
    }

    protected async parseUserInfoForSeedingStatus(flushUserInfo: Partial<IUserInfo>): Promise<Partial<IUserInfo>> {
      const userId = flushUserInfo.id as number;
      const userSeedingRequestString = await this.requestUserSeedingPage(userId);
      let seedStatus = { seeding: 0, seedingSize: 0 };
      if (userSeedingRequestString && userSeedingRequestString?.includes("<table")) {
        const userSeedingDocument = createDocument(userSeedingRequestString);
        // Preferred path: a summary line of the form "<count> | <size>"
        const divSeeding = Sizzle("div > div:contains(' | ')", userSeedingDocument);
        if (divSeeding.length > 0 && divSeeding[0].textContent) {
          const seedingText = divSeeding[0].textContent.split("|");
          seedStatus.seeding = definedFilters.parseNumber(seedingText[0]);
          seedStatus.seedingSize = definedFilters.parseSize(seedingText[1]);
        } else {
          // Fallback path: count the table rows (skipping the header) and sum their sizes
          const trAnothers = Sizzle("table:last tr:not(:eq(0))", userSeedingDocument);
          if (trAnothers.length > 0) {
            seedStatus.seeding = trAnothers.length;
            // Auto-detect which td.rowfollow:eq(?) index holds the size
            let sizeIndex = 2;
            const tdAnothers = Sizzle("> td", trAnothers[0]);
            for (let i = 0; i < tdAnothers.length; i++) {
              if (sizePattern.test((tdAnothers[i] as HTMLElement).innerText)) {
                sizeIndex = i;
                break;
              }
            }
            trAnothers.forEach((trAnother) => {
              const sizeSelector = Sizzle(`td:eq(${sizeIndex})`, trAnother)[0] as HTMLElement;
              seedStatus.seedingSize += parseSizeString(sizeSelector.innerText.trim());
            });
          }
        }
      }
      flushUserInfo = mergeWith(flushUserInfo, seedStatus, (objValue, srcValue) => {
        return typeof srcValue === "undefined" ? objValue : srcValue;
      });
      return flushUserInfo;
    }


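Also, I think whether deduplication is needed at all is open to discussion.
Judging from the NPHP code, /getusertorrentlistajax.php?userid=xxxx&type=seeding maps to SQL along the lines of SELECT xxx FROM peers WHERE xxxx. If a user really is seeding the same torrent from multiple locations, the response will necessarily contain duplicates, and in that case deduplication would make the statistics wrong.
As for duplicate seeding entries caused by a client restart, the site's scheduled cleanup job removes them, so the numbers return to normal on a later refresh.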
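
The disagreement comes down to what one row of the response represents. The sketch below contrasts the two counting policies on made-up data. It assumes, as the comment above suggests, that the endpoint returns one row per peer record rather than per torrent. The `PeerRow` shape, function names, and addresses are hypothetical.

```typescript
// Hypothetical shape of one row from the peers table: one record per peer, not per torrent.
interface PeerRow {
  torrentId: number;
  ip: string;
}

// Policy A: one entry per peer row (no deduplication).
function countPeerRows(rows: PeerRow[]): number {
  return rows.length;
}

// Policy B: one entry per torrent (deduplicate by torrent ID).
function countUniqueTorrents(rows: PeerRow[]): number {
  return new Set(rows.map((row) => row.torrentId)).size;
}

// Made-up sample using documentation-only address ranges.
const peerRows: PeerRow[] = [
  { torrentId: 101, ip: "203.0.113.7" }, // torrent 101 announced over IPv4
  { torrentId: 101, ip: "2001:db8::7" }, // torrent 101 announced over IPv6
  { torrentId: 202, ip: "203.0.113.7" },
];

console.log(countPeerRows(peerRows)); // 3
console.log(countUniqueTorrents(peerRows)); // 2
```

The rows alone do not reveal whether the two records for torrent 101 are one client reachable over two address families or two separate seeding locations, so neither policy is correct in every case.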

madrays (Author) commented Nov 4, 2025

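The Haidan site has never returned the seeding count and seeding size correctly, and the size came out abnormally at the YB level, which completely broke the timeline summary data. So I tried handling it as a special case. After doing that I found that this site's data is 100% duplicated (probably because both IPv4 and IPv6 are in use at the same time), so I added deduplication as well. If you have a better solution, that would of course be even better.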
