Refined RSS ingestion, added chunking and improved retrieval quality by galaxy101quest · Pull Request #355 · nlweb-ai/NLWeb

galaxy101quest · 2025-09-28T09:02:51Z

Refined RSS ingestion to better fit website articles and improved retrieval quality by splitting long posts into meaningful sections. The result is more precise answers.

rss2schema.py (parse_rss_2_0): Updated the RSS→Schema.org mapping so feed items are treated as website articles with correct fields and stable URLs, rather than podcast-specific objects.
db_load.py (process_rss_feed): Implemented heading-aware chunking (leveraging H1–H5 and block elements) to split long articles into anchored segments for higher-precision indexing and retrieval.
requirements.txt (import): Added beautifulsoup4, which db_load.py now needs.
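
As a rough illustration of the heading-aware chunking described for db_load.py, the sketch below splits an article's HTML at H1–H5 headings and gives each chunk a URL with a fragment anchor. It is not the code from this PR: the PR relies on BeautifulSoup, while this sketch swaps in the standard library's `html.parser` to stay dependency-free. The names `HeadingChunker`, `slugify`, and `chunk_article` are hypothetical, and the generated anchors are assumed rather than read from the page.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5"}
# Block elements whose end should separate text (an assumed, partial list).
BLOCK_TAGS = {"p", "div", "li", "blockquote", "pre", "section", "article"}


class HeadingChunker(HTMLParser):
    """Collects text into chunks, starting a new chunk at each H1-H5 heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []           # finished chunks: {"heading", "text"}
        self._heading = ""         # heading of the chunk being built
        self._text = []            # body text pieces of the chunk being built
        self._in_heading = False
        self._heading_parts = []

    def _flush(self):
        # Normalize whitespace and store the chunk if it has any content.
        text = " ".join("".join(self._text).split())
        if self._heading or text:
            self.chunks.append({"heading": self._heading, "text": text})
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()          # a new heading closes the previous chunk
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._heading = " ".join("".join(self._heading_parts).split())
            self._in_heading = False
        elif tag in BLOCK_TAGS:
            self._text.append(" ")  # keep adjacent blocks from running together

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()              # emit the final chunk


def slugify(text):
    """Lowercase the text and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def chunk_article(html, url):
    """Return one dict per section, each with an anchored URL (hypothetical scheme)."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    chunks = []
    for i, chunk in enumerate(parser.chunks):
        anchor = slugify(chunk["heading"]) or f"section-{i}"
        chunks.append({"url": f"{url}#{anchor}", **chunk})
    return chunks
```

Text that appears before the first heading becomes its own chunk with an empty heading and a positional anchor such as `#section-0`.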

Contributions by Misha - misha@futurescreen.media

chelseacarter29 · 2025-09-29T03:46:20Z

Hi Misha! @galaxy101quest

Thanks for the PR! This is a good start - we may want to make a few adjustments and are going to do some additional testing. A couple of things I've seen so far:

Entire podcast RSS feeds are now encoded as a single document instead of as individual episodes.
Something may be off in the chunking. I encoded some articles to test it (using https://platformer.news/feed to see what article results might look like), and the results seem to return the same article many times, each one jumping to a different section (sometimes the comments, for example). I need to look into it a bit more.
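
One way to check the second observation is to group result URLs by article and see how many section anchors each article contributes. This is only a diagnostic sketch: it assumes chunk URLs differ from the article URL by their fragment alone, which the thread does not confirm, and `group_by_article` is a hypothetical helper.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_article(result_urls):
    """Map each base article URL to the list of fragments seen in the results."""
    groups = defaultdict(list)
    for url in result_urls:
        base, fragment = urldefrag(url)  # splits "page#frag" into ("page", "frag")
        groups[base].append(fragment)
    return dict(groups)
```

If one base URL accounts for most of the results, the ranking is returning many chunks of the same article rather than distinct articles.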

Something Guha and I were chatting about was maybe having the ability to specify the 'type' when doing data load so you have options.
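
A minimal sketch of what such a 'type' option might look like, assuming each feed item has already been parsed into a plain dict. The function name `item_to_schema`, the `item_type` parameter, and the input keys (`title`, `link`, `pubDate`, `description`, `enclosure_url`) are all hypothetical and are not taken from rss2schema.py; only the Schema.org type and property names are real.

```python
def item_to_schema(item, item_type="Article"):
    """Map a parsed feed item (plain dict) to a Schema.org object of the given type."""
    obj = {
        "@context": "https://schema.org",
        "@type": item_type,
        "url": item["link"],
        "description": item.get("description", ""),
        "datePublished": item.get("pubDate", ""),
    }
    if item_type == "Article":
        obj["headline"] = item["title"]
    elif item_type == "PodcastEpisode":
        obj["name"] = item["title"]
        # Podcast items usually carry an audio enclosure; keep it if present.
        if item.get("enclosure_url"):
            obj["associatedMedia"] = {
                "@type": "MediaObject",
                "contentUrl": item["enclosure_url"],
            }
    else:
        raise ValueError(f"unsupported item_type: {item_type!r}")
    return obj
```

With an explicit type, a podcast feed could keep producing one object per episode while article feeds go through the article mapping.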

galaxy101quest · 2025-09-29T11:17:12Z

Hi Chelsea :) @chelseacarter29

Thanks for the feedback. It's not supposed to do that, so something is off.
I have a few versions on my end - I'll check for improvements and send them over.
I'll test both points with the rss feed from the website you shared - that way it would be easier to compare results.

Yes, I was also thinking about having more options when loading data - different types might make things easier.

rvguha · 2026-04-05T23:11:15Z

Thanks for the contribution! Closing this as the codebase has changed significantly since this was opened (notably code/ → AskAgent/ restructure). If you'd like to revisit this, please feel free to open a fresh PR against the current main branch.

Misha Ristich and others added 3 commits September 28, 2025 10:46

Add beautifulsoup4 to requirements.txt for RSS chunking

83141fc

- needed in db_load.py

Merge branch 'main' into rss_chunking_branch

3599c65

rvguha closed this Apr 5, 2026

galaxy101quest wants to merge 3 commits into nlweb-ai:main from galaxy101quest:rss_chunking_branch
