
Refined RSS ingestion, added chunking and improved retrieval quality #355

Closed
galaxy101quest wants to merge 3 commits into nlweb-ai:main from galaxy101quest:rss_chunking_branch

galaxy101quest commented Sep 28, 2025

Refined RSS ingestion to better fit website articles and improved retrieval quality by splitting long posts into meaningful sections. The result is more precise answers.

  • rss2schema.py (parse_rss_2_0): Updated the RSS→Schema.org mapping so feed items are treated as website articles with correct fields and stable URLs, rather than podcast-specific objects.

  • db_load.py (process_rss_feed): Implemented heading-aware chunking (leveraging H1–H5 and block elements) to split long articles into anchored segments for higher-precision indexing and retrieval.

  • requirements.txt (import): Added beautifulsoup, which db_load.py now requires.
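
To illustrate the heading-aware chunking described in the db_load.py item, here is a minimal sketch. It is not the code from this PR: it uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without extra dependencies, and the names HeadingChunker, chunk_article and slugify, as well as the slug-based "#anchor" URLs, are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5"}


class HeadingChunker(HTMLParser):
    """Collect article text into chunks, starting a new chunk at each H1-H5."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_heading = False
        self._new_chunk()  # holds any text that appears before the first heading

    def _new_chunk(self):
        self.chunks.append({"heading": "", "text": []})

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._new_chunk()

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        current = self.chunks[-1]
        if self._in_heading:
            # Heading text may arrive in several pieces (e.g. nested <em>).
            current["heading"] = (current["heading"] + " " + data).strip()
        else:
            current["text"].append(data)


def slugify(text):
    """Turn a heading into a URL fragment (hypothetical anchor scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def chunk_article(html, url):
    """Split article HTML into one record per heading-delimited section."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    results = []
    for chunk in parser.chunks:
        body = " ".join(chunk["text"])
        if not body:
            continue  # skip headings that have no text under them
        anchor = slugify(chunk["heading"])
        results.append({
            "url": f"{url}#{anchor}" if anchor else url,
            "heading": chunk["heading"],
            "text": body,
        })
    return results
```

Each returned record keeps the parent article URL as a prefix, so a retrieval layer could group chunks that belong to the same article. Whether the generated fragments resolve on the live page depends on the site's own heading ids, which this sketch does not inspect.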

Contributions by Misha - misha@futurescreen.media

Misha Ristich and others added 3 commits September 28, 2025 10:46

@chelseacarter29
Collaborator

Hi Misha! @galaxy101quest

Thanks for the PR! This is a good start - we may want to make a few adjustments and are going to do some additional testing. A couple of things I've seen so far:

  • An entire podcast RSS feed is now encoded as a single document instead of as individual episodes.
  • Something may be off in the chunking. I encoded some articles to test it (I used https://platformer.news/feed to see what article results might look like), and retrieval seems to return the same article many times, with each result jumping to a different section (sometimes the comments, for example). I need to look into it a bit more.

Something Guha and I were chatting about was maybe having the ability to specify the 'type' when doing data load so you have options.
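
One way such a 'type' option could look is sketched below. This is only an illustration, not the project's loader: the function rss_item_to_schema, the item_type parameter, the SUPPORTED_TYPES set and the flat item dictionary with an "enclosure_url" key are all hypothetical. Each feed item maps to its own document, so a podcast feed would yield one record per episode.

```python
from email.utils import parsedate_to_datetime
from typing import Any, Dict

# Hypothetical set of types a loader might accept.
SUPPORTED_TYPES = {"Article", "PodcastEpisode"}


def rss_item_to_schema(item: Dict[str, str], item_type: str = "Article") -> Dict[str, Any]:
    """Map one parsed RSS item to a Schema.org-style dict of the requested type."""
    if item_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported type: {item_type!r}")

    doc: Dict[str, Any] = {
        "@context": "https://schema.org",
        "@type": item_type,
        "url": item.get("link", ""),
        "name": item.get("title", ""),
        "description": item.get("description", ""),
    }

    # RSS 2.0 dates use the RFC 822 format; convert them to ISO 8601.
    if item.get("pubDate"):
        doc["datePublished"] = parsedate_to_datetime(item["pubDate"]).isoformat()

    # Only episodes carry the audio enclosure in this sketch.
    if item_type == "PodcastEpisode" and item.get("enclosure_url"):
        doc["associatedMedia"] = {
            "@type": "MediaObject",
            "contentUrl": item["enclosure_url"],
        }
    return doc
```

A caller would pick the type per feed, for example rss_item_to_schema(item, item_type="PodcastEpisode") for a podcast and the default for a news site.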

@galaxy101quest
Author

Hi Chelsea :) @chelseacarter29

Thanks for the feedback. It's not supposed to do that, so something is off.
I have a few versions on my end - I'll check for improvements and send them over.
I'll test both points with the RSS feed from the website you shared - that way it will be easier to compare results.

Yes, I was also thinking about having more options when loading data - different types might make things easier.

@rvguha
Collaborator

rvguha commented Apr 5, 2026

Thanks for the contribution! Closing this as the codebase has changed significantly since this was opened (notably the code/AskAgent/ restructure). If you'd like to revisit this, please feel free to open a fresh PR against the current main branch.

@rvguha rvguha closed this Apr 5, 2026