fix: restore compatibility with Perplexity API changes (2026|06) by DaveG7 · Pull Request #39 · simwai/perplexity-ai-export

DaveG7 · 2026-06-03T21:18:07Z

Description

Perplexity changed their REST API response format, breaking both thread discovery and conversation extraction.

Library discovery (library-discovery.ts):

The list_ask_threads response no longer includes slug, query_str, first_answer, last_query_datetime, total_threads, etc.; these are replaced by a leaner format with link, context_uuid, and mode_type
Updated RawThread and ConversationMeta interfaces to handle the new format while keeping old fields optional for backward compatibility
Fixed URL construction: now uses thread.link first, falling back to slug, then uuid. The old code produced https://www.perplexity.ai/search/undefined for all new-format threads
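
The fallback order described above (link, then slug, then uuid) could be sketched roughly as follows. This is not the code from library-discovery.ts: the function name `buildThreadUrl` is hypothetical, the interface lists only the fields named in this description, and it assumes `link` is either a full URL or a bare path segment, which the PR does not confirm.

```typescript
// Hypothetical, trimmed-down shape: only the fields mentioned in the description.
interface RawThread {
  uuid: string;
  link?: string; // new response format
  slug?: string; // old response format, kept optional
}

const SEARCH_BASE = "https://www.perplexity.ai/search/";

// Picks the first non-empty identifier in the order link -> slug -> uuid.
// Assumption: `link` may already be an absolute URL; otherwise it is treated
// as a path segment to append to the search base.
function buildThreadUrl(thread: RawThread): string {
  const candidate = thread.link || thread.slug || thread.uuid;
  if (candidate.startsWith("http://") || candidate.startsWith("https://")) {
    return candidate;
  }
  return SEARCH_BASE + candidate;
}
```

Because `uuid` is always present, the result can no longer end in `/search/undefined`, which is the failure the old slug-only construction produced for new-format threads.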

Conversation extraction (conversation-extractor.ts):

Replaced the Playwright response-event listener approach with a direct page.evaluate() fetch to /rest/thread/{uuid} (same pattern already used by library discovery, more reliable)
Deduplicated answer blocks in parseMessages: the new API returns both ask_text_0_markdown and ask_text blocks with identical content, causing doubled output
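
One way to picture the deduplication step is the sketch below. It is an illustration under assumptions, not the actual parseMessages logic: the helper name `uniqueAnswers` is hypothetical, the block fields follow the response sample posted later in this thread, and it assumes duplicates are detected by comparing trimmed answer text.

```typescript
// Minimal block shape, based on the fields visible in the sample response.
interface Block {
  intended_usage: string;
  markdown_block?: { answer?: string };
}

// Collects answer text from ask_text* blocks, keeping only the first
// occurrence of each distinct (trimmed) answer, in original order.
function uniqueAnswers(blocks: Block[]): string[] {
  const seen = new Set<string>();
  const answers: string[] = [];
  for (const block of blocks) {
    if (!block.intended_usage.startsWith("ask_text")) continue;
    const answer = block.markdown_block?.answer;
    if (answer === undefined) continue;
    const key = answer.trim();
    if (key === "" || seen.has(key)) continue;
    seen.add(key);
    answers.push(answer);
  }
  return answers;
}
```

Comparing content rather than block type keeps two genuinely different answers in the same entry, while the identical ask_text_0_markdown and ask_text pair collapses to one.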

Related Issues

Fixes #

Checklist

I have read the CONTRIBUTING.md
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Diagnostic Logs

N/A — The conversation extractor no longer uses debug/api-diagnostics.jsonl — the response-listener that wrote to it was replaced by a direct API call, making per-response diagnostics unnecessary.

Cheers, Dave

simwai · 2026-06-04T14:45:14Z

Hello @DaveG7, thanks for your PR.
It would be neat if you could post a sanitized version of the exact HTTP request and response in order to verify this shape change.

What I like:
I appreciate that you added the new schema as a fallback.

What I don't like:

You removed the complete zod validator for the HTTP response shape.
You removed the diagnostics logic.
You switched to page.evaluate(), citing better reliability. Why is it more reliable here? I don't see it.

…rect fetch

Address PR simwai#39 review feedback on conversation-extractor:

- Restore ApiResponseSchema, validated against a live 2026 /rest/thread/{id} response. Pagination is the top-level has_next_page/next_cursor pair (not collection_info, which is the list endpoint). Diagnose-and-continue: shape drift writes a diagnostic and falls through to the per-entry EntrySchema gate.
- Restore ApiDiagnosticsWriter calls (zod_error / unknown_shape / empty_entries) so the debug/api-diagnostics.jsonl path the REPL references works again.
- Keep the page.evaluate()+fetch approach for consistency with library-discovery (the response-listener was the lone divergent /rest/ path); replace hardcoded version=2.18 with shared DEFAULT_API_VERSION.
- Remove dead adaptive-timeout no-ops (reduceTimeout/recoverTimeout) and their now-unused worker-pool callers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A live thread with 212 entries paginates across 3 pages of ~99; the new single-fetch path only returned page 1, truncating long conversations. fetchThreadData now keeps the single-fetch fast path for normal threads and, when has_next_page is true, follows the top-level next_cursor (same URL + &cursor=<encoded>) accumulating entries in API order until the thread is complete, capped at 50 pages. Split the per-page fetch+validate into fetchThreadPage. This restores the long-thread coverage the old response listener provided.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

DaveG7 · 2026-06-04T21:25:48Z

Hey @simwai, thanks for the careful review — you were right to push back on the parts I'd stripped out. Went back through all of it.

Sanitized request/response you asked for:

GET https://www.perplexity.ai/rest/thread/<thread-uuid>?version=2.18&source=default
Cookie: <session cookies>           # credentials: 'include'
→ HTTP 200

{
  "entries": [
    {
      "uuid": "<uuid>",
      "status": "COMPLETED",
      "thread_title": "i am a simple test thread",
      "query_str": "i am a simple test thread",
      "updated_datetime": "2026-06-04T20:01:55.688657",
      "blocks": [
        { "intended_usage": "ask_text_0_markdown",
          "markdown_block": { "answer": "Understood — this is a test thread." } },
        { "intended_usage": "ask_text",
          "markdown_block": { "answer": "Understood — this is a test thread." } }
        // plan / workflow_root / answer_tabs / pending_followups blocks elided
      ]
      // classifier_results / mhe_predictions / social_info etc. elided (not used)
    }
  ],
  "background_entries": [],
  "has_next_page": false,
  "next_cursor": null,
  "status": "success",
  "thread_metadata": { "title": "i am a simple test thread", "...": "..." }
}

Two things it cleared up: pagination is a top-level has_next_page/next_cursor pair, not inside collection_info (that key isn't on the thread endpoint) — I'd had the check in the wrong place. And the answer comes back twice (ask_text_0_markdown + ask_text), which is the duplication the dedup handles.

On your three points:

1 — Validator restored. Dropping it was the wrong move. ApiResponseSchema validates the body again in fetchThreadData(), modelled on the real shape above (optional fields, so new keys don't reject a valid response). I made it diagnose-and-continue — on a mismatch it writes a diagnostic and falls through to the per-entry EntrySchema gate, so a future drift gets surfaced instead of silently dropping threads. Happy to hard-gate instead if you prefer.
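
To make the diagnose-and-continue idea concrete, here is a dependency-free sketch. It deliberately uses plain type guards in place of the zod schemas (ApiResponseSchema / EntrySchema) so that it runs on its own; the function name `extractEntries`, the `Diagnostic` type, and the exact conditions that trigger each diagnostic are assumptions, not the behaviour of the real fetchThreadData().

```typescript
// Diagnostic kinds borrowed from the names mentioned in this PR.
type DiagnosticKind = "unknown_shape" | "empty_entries";
interface Diagnostic {
  kind: DiagnosticKind;
  detail: string;
}

interface Entry {
  uuid: string;
  blocks: unknown[];
}

// Per-entry gate: stands in for a zod EntrySchema check in this sketch.
function isEntry(value: unknown): value is Entry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["uuid"] === "string" && Array.isArray(v["blocks"]);
}

// Soft top-level check: a mismatch is recorded but does not abort.
// Whatever entries can still be recovered pass through the per-entry gate.
function extractEntries(body: unknown, diagnostics: Diagnostic[]): Entry[] {
  const record =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>)
      : {};
  const raw = record["entries"];
  if (!Array.isArray(raw)) {
    diagnostics.push({ kind: "unknown_shape", detail: "entries is not an array" });
  }
  const entries = Array.isArray(raw) ? raw.filter(isEntry) : [];
  if (entries.length === 0) {
    diagnostics.push({ kind: "empty_entries", detail: "no valid entries found" });
  }
  return entries;
}
```

A hard gate would instead throw on the first mismatch; the soft variant trades strictness for visibility, since a drifted response still yields whatever entries remain valid plus a diagnostic record.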

2 — Diagnostics restored. ApiDiagnosticsWriter writes again (zod_error / unknown_shape / empty_entries). I'd missed that the REPL still points users to debug/api-diagnostics.jsonl — removing the writer quietly broke that, so thanks for catching it.

3 — page.evaluate() — fair question, "more reliable" was lazy on my part. The real reason is consistency: library-discovery.ts already fetches every /rest/thread/list_* endpoint this exact way, and the response-listener was the one /rest/ path that diverged (and needed the adaptive-timeout handling to compensate — those reduceTimeout/recoverTimeout hooks had decayed to no-ops worker-pool.ts was still calling, so I removed them). It also gives explicit HTTP-status handling, and I swapped the hardcoded version=2.18 for a shared DEFAULT_API_VERSION. If you'd rather keep the listener, I'm glad to go that way.

On pagination — I dug into this properly. Short threads come back in one response (has_next_page: false), but I found a long one that genuinely paginates: 212 entries across 3 pages of ~99. The next page is the same URL + &cursor=<encodeURIComponent(next_cursor)>, and each page carries its own next_cursor/has_next_page. So I implemented the accumulation: fetchThreadData keeps the single-fetch fast path for normal threads, and when has_next_page is true it follows the cursor and concatenates entries (in API order — same order the old listener relied on) until the thread is complete, with a 50-page safety cap. So this actually closes the gap the old listener used to cover for long threads, rather than just warning.
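
The accumulation loop described above might look something like this sketch. The per-page fetch is injected as a function so the example stays independent of Playwright; `fetchAllEntries`, `FetchPage`, and `ThreadPage` are hypothetical names, and only the has_next_page / next_cursor fields, the `&cursor=` parameter, and the 50-page cap come from the description.

```typescript
// Page shape limited to the fields relevant for pagination.
interface ThreadPage {
  entries: unknown[];
  has_next_page: boolean;
  next_cursor: string | null;
}

// Hypothetical per-page fetcher (in the PR this role is played by fetchThreadPage).
type FetchPage = (url: string) => Promise<ThreadPage>;

const MAX_PAGES = 50; // safety cap from the PR description

// Fetches the first page and, while the API reports more pages, follows
// next_cursor, concatenating entries in the order the API returns them.
async function fetchAllEntries(
  baseUrl: string,
  fetchPage: FetchPage
): Promise<unknown[]> {
  const entries: unknown[] = [];
  let url = baseUrl;
  for (let page = 0; page < MAX_PAGES; page++) {
    const body = await fetchPage(url);
    entries.push(...body.entries);
    if (!body.has_next_page || body.next_cursor === null) break;
    // baseUrl is assumed to already contain a query string (?version=...).
    url = `${baseUrl}&cursor=${encodeURIComponent(body.next_cursor)}`;
  }
  return entries;
}
```

For a short thread the loop exits after one request, so the single-fetch fast path falls out of the same code. If the cap is reached, this sketch simply returns what it has collected; whether the real code warns or fails at that point is not stated here.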

Thanks again — happy to adjust any of it.

Cheers Dave

simwai · 2026-06-04T22:12:23Z

Thank you for the explanation. However, consistency with library-discovery.ts is not a strong enough reason to replace the working network listener. The listener handled pagination reliably and had no timing issues. The page.evaluate fetch approach introduces undeniable timing downsides that the listener did not have. I would prefer to keep the original listener and only update the Zod schema and pagination detection as needed.

simwai · 2026-06-04T22:12:39Z

@DaveG7

simwai · 2026-06-04T22:13:49Z

I would also like to know whether the HTTP request and response you sent are real or entirely AI-generated.

DaveG7 · 2026-06-06T09:41:21Z

Ciao @simwai

Sorry for the delay, family and so on... ;-) We could actually switch to our native language, but no worries, let's stick to the international one.

I fully understand your skepticism. But I am not here to keep you entertained and waste your time; time is what most people lack, me included. I am seriously contributing in the ways I can, and to answer your question briefly:
Yes, the request was made entirely by hand via the browser DevTools, then redacted to remove my personal data and shortened a bit. If an unshortened version of the request/response would help you evaluate the PR more confidently, I am happy to provide it.

My first PR was rushed; I had no time to dig into your code base. I was looking for a fast way to extract my data from Perplexity, which I have been using for approximately one year. I wanted to build a small pipeline in a Docker container but had some problems getting it to run, as you need the Headless=false option. When I tested your version locally (before #39), it wasn't working for me. At that time I wasn't aware of the available debugging output, which would have helped a lot. I leaned on AI assistance to get something working quickly, which stripped some of your original logic. I apologize for that.

The second PR should be better, and so far it's working fine for me. It's your repo; you choose what you can use and what not. No offense taken; on the contrary, I will dig into the timing concerns you mentioned, as this is precisely the kind of thing I do not yet have the experience to spot directly. You're an engineer, I am not. I took a different main path (natural science), but IT and tech are my second mainstay: a passion and interest since childhood, and something I try to bring into my primary field.

I wish you a pleasant weekend and I am happy to hear from you.

Cheers Dave

simwai · 2026-06-06T10:25:08Z

@DaveG7 I understand that the last versions were buggy. I did not do enough testing myself before merging the output of the agent.

I will investigate the requests and responses a bit more today. I am sure we can get this done somehow.

By the way, it would be helpful if you said on which URL you spotted the request/response.

DaveG7 added 3 commits June 3, 2026 22:59

fix: adapt scraper to Perplexity API changes (2026)

ddfa165

chore: ignore AI tooling directories

69e499e

test: update conversation-extractor tests for new API approach

7cfba60

DaveG7 and others added 3 commits June 4, 2026 20:35

Merge branch 'master' into master

4dae353

Merge branch 'master' into master

c9baafb

DaveG7 wants to merge 7 commits into simwai:master from DaveG7:master
