Refined RSS ingestion, added chunking and improved retrieval quality #355
galaxy101quest wants to merge 3 commits into nlweb-ai:main
Hi Misha! @galaxy101quest Thanks for the PR! This is a good start - we may want to make a few adjustments and are going to do some additional testing. A couple of things I've seen so far:
Something Guha and I were chatting about was maybe having the ability to specify the 'type' when doing data load so you have options.

Hi Chelsea :) @chelseacarter29 Thanks for the feedback. It's not supposed to do that, so something is off. Yes, I was also thinking about having more options when loading data - different types might make things easier.

Thanks for the contribution! Closing this as the codebase has changed significantly since this was opened (notably
Refined RSS ingestion to better fit website articles and improved retrieval quality by splitting long posts into meaningful sections. The result is more precise answers.
- rss2schema.py (parse_rss_2_0): Updated the RSS→Schema.org mapping so feed items are treated as website articles with correct fields and stable URLs, rather than podcast-specific objects.
- db_load.py (process_rss_feed): Implemented heading-aware chunking (using H1–H5 headings and block elements) to split long articles into anchored segments for higher-precision indexing and retrieval.
- requirements.txt: Added BeautifulSoup, which db_load.py needs.
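
For readers who have not opened the diff, here is a rough sketch of the kind of RSS→Schema.org mapping described for rss2schema.py. It is not the code from this PR: the function names `rss_item_to_article` and `parse_feed`, the choice of Schema.org properties, and the fallback from `<link>` to `<guid>` for the URL are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def to_iso(pub_date):
    """Convert an RFC 822 pubDate to ISO 8601; return the input unchanged if it cannot be parsed."""
    if not pub_date:
        return None
    try:
        return parsedate_to_datetime(pub_date).isoformat()
    except (TypeError, ValueError):
        return pub_date


def rss_item_to_article(item):
    """Map one RSS 2.0 <item> element to a Schema.org Article dict (hypothetical field choice)."""

    def text(tag):
        node = item.find(tag)
        return node.text.strip() if node is not None and node.text else None

    article = {
        "@context": "https://schema.org",
        "@type": "Article",  # a website article, not a podcast-specific type
        "url": text("link") or text("guid"),  # assumption: <link> is the stable URL
        "headline": text("title"),
        "description": text("description"),
        "datePublished": to_iso(text("pubDate")),
    }
    # Drop fields the feed item did not provide.
    return {key: value for key, value in article.items() if value is not None}


def parse_feed(xml_text):
    """Return one Article dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [rss_item_to_article(item) for item in root.iter("item")]
```

Calling `parse_feed` on a feed string returns a list of plain dicts, one per item, which can then be serialized as JSON-LD or passed to a loader.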
Contributions by Misha - misha@futurescreen.media
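
As an illustration of the heading-aware chunking idea described for db_load.py, here is a minimal sketch. It is not the PR's implementation: it swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the names `HeadingChunker` and `chunk_article`, the set of block tags, and the slug-based anchors are assumptions.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5"}
BLOCKS = {"p", "li", "blockquote", "pre", "div"}  # assumed set of block elements


class HeadingChunker(HTMLParser):
    """Collect the text under each H1-H5 heading as a separate chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # finished chunks: heading, anchor, text
        self._heading, self._anchor = "", ""
        self._parts, self._heading_buf = [], []
        self._in_heading, self._pending_id = False, None

    def _flush(self):
        # Close the current chunk, collapsing runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.chunks.append(
                {"heading": self._heading, "anchor": self._anchor, "text": text}
            )
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()  # a new heading ends the previous chunk
            self._in_heading, self._heading_buf = True, []
            self._pending_id = dict(attrs).get("id")
        elif tag in BLOCKS:
            self._parts.append(" ")  # block boundaries separate words

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._heading = " ".join("".join(self._heading_buf).split())
            # Prefer the heading's own id; otherwise derive a slug (an assumption).
            slug = re.sub(r"[^a-z0-9]+", "-", self._heading.lower()).strip("-")
            self._anchor = self._pending_id or slug
            self._in_heading = False
        elif tag in BLOCKS:
            self._parts.append(" ")

    def handle_data(self, data):
        (self._heading_buf if self._in_heading else self._parts).append(data)

    def close(self):
        super().close()
        self._flush()  # emit the final chunk


def chunk_article(html, url):
    """Split article HTML into chunks, each with a URL anchored at its heading."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return [
        {
            "url": f"{url}#{chunk['anchor']}" if chunk["anchor"] else url,
            "heading": chunk["heading"],
            "text": chunk["text"],
        }
        for chunk in parser.chunks
    ]
```

Text that appears before the first heading becomes a chunk with the bare article URL. A derived slug only resolves in a browser if the page actually exposes a matching id, so a real loader would likely need to check how the target site generates its anchors.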