feat: add structured archive export #17
Conversation
Hello, thanks for the PR. 🙌 I like some of the ideas, but I don't really understand why the ITIR and structured JSON export is needed or what its purpose is. Maybe you could give more insight there.
Hi mate, thanks for making the repo. Firstly, I had a few issues around sign-in detection (I forget exactly which now, I think around session detection/redirects), and I also have some stupidly long threads which the scraper reported as fully pulled when they were not. Regarding the JSON export, I figured it was better to just provide a generic interface; however, the canonical use case within my repos is to store it in SQLite (see https://github.com/1ch1n/mychatarchive). I don't think I PR'd any ITIR integration per se; rather, ITIR canonically operates over that db (arbitrary text is also fine; dedupe is the priority re SQL)... Please let me know if this helps :)
@chboishabba Heyo, I added a new folder where you can add custom file exporters. Maybe that could solve some of your problems.
I'll have a look. I started on adding artifact export for links, pictures, and code... The main thing is being able to reliably export massive threads. Perplexity's pagination will happily omit loading turns if it can avoid it... I modified the loader script to handle this better... I think I also added pdf/md export |
@chboishabba Is the fetching of long threads still unreliable on the latest commit?
Summary
This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.
The exporter now writes `itir.perplexity.thread.v1` JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
Why
The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.
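To make the dedupe and reimport argument concrete, here is a minimal sketch of what a canonical thread record and an ID-based merge could look like. Only the schema name `itir.perplexity.thread.v1` comes from this PR description; every field name (`threadId`, `messages`, `provenance`, and so on) and the `mergeThreads` helper are hypothetical and may not match the actual exporter.

```typescript
// Hypothetical record shape. Only the schema string is taken from the PR text;
// all field names below are illustrative assumptions.
interface ArchiveMessage {
  id: string; // stable source message ID
  role: "user" | "assistant";
  text: string;
}

interface ArchiveThread {
  schema: "itir.perplexity.thread.v1";
  threadId: string; // stable source thread ID
  title: string;
  messages: ArchiveMessage[];
  provenance: { capturedFrom: string; capturedAt: string };
}

// Merge a re-exported thread into an existing record, keeping each message ID once.
// Messages already present keep their original position and content.
function mergeThreads(existing: ArchiveThread, incoming: ArchiveThread): ArchiveThread {
  if (existing.threadId !== incoming.threadId) {
    throw new Error("cannot merge records for different threads");
  }
  const seen = new Set(existing.messages.map((m) => m.id));
  const merged = [...existing.messages];
  for (const message of incoming.messages) {
    if (seen.has(message.id)) continue; // duplicate from a repeated export
    seen.add(message.id);
    merged.push(message);
  }
  return { ...existing, messages: merged };
}
```

Because the merge keys on message IDs rather than rendered Markdown, running the same export twice leaves the record unchanged, which is the property a SQLite-backed archive would rely on.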
A major practical issue is long Perplexity threads. The first browser-captured `/rest/thread/<id>` response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.
Pagination / long-thread behavior
This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:
- `/rest/thread/<thread-id>` response for the conversation being exported
- `next_cursor` from page context with authenticated browser cookies
That last point is important: this does not pretend to bypass Perplexity/Cloudflare or guarantee all private webapp pagination works forever. It makes successful pagination useful, and failed/replayed pagination detectable and safe.
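As a rough illustration of what "detectable and safe" could mean, the sketch below folds already-fetched pages into a running set of entry IDs and stops as soon as a page adds nothing new. The `ThreadPage` shape, the `entryIds` and `nextCursor` field names, and both functions are assumptions made for this example; they are not the PR's actual code or Perplexity's response format.

```typescript
// Hypothetical page shape: stable entry IDs plus a cursor that is null on the last page.
interface ThreadPage {
  entryIds: string[];
  nextCursor: string | null;
}

type PageOutcome = "advanced" | "replayed" | "complete";

interface PaginationState {
  seenIds: Set<string>;
  orderedIds: string[];
}

// Fold one page into the state and report whether it made progress.
function applyPage(state: PaginationState, page: ThreadPage): PageOutcome {
  let added = 0;
  for (const id of page.entryIds) {
    if (state.seenIds.has(id)) continue;
    state.seenIds.add(id);
    state.orderedIds.push(id);
    added += 1;
  }
  // A non-empty page with no new IDs means an earlier page was served again.
  if (added === 0 && page.entryIds.length > 0) return "replayed";
  return page.nextCursor === null ? "complete" : "advanced";
}

// Walk the pages in order; `complete` is true only if a final page was reached.
function collectEntries(pages: ThreadPage[]): { ids: string[]; complete: boolean } {
  const state: PaginationState = { seenIds: new Set<string>(), orderedIds: [] };
  for (const page of pages) {
    const outcome = applyPage(state, page);
    if (outcome === "replayed") return { ids: state.orderedIds, complete: false };
    if (outcome === "complete") return { ids: state.orderedIds, complete: true };
  }
  // Ran out of pages while a cursor was still pending.
  return { ids: state.orderedIds, complete: false };
}
```

The design choice here is to never report a thread as complete unless a page without a further cursor was actually seen, so a run that only captured page one is flagged rather than silently accepted.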
For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. `npm run bundle:perplexity-downloads` converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.
Changes
- `EXPORT_STRUCTURED_JSON=true`.
- `EXPORT_MARKDOWN=true`.
- `bundle:perplexity-downloads` for converting downloaded Markdown chunks into the same structured JSON shape.
Validation
- `npm run type-check`
- `npm run test:unit` (25 tests passed)