Skip to content

fix: restore compatibility with Perplexity API changes (2026|06)#39

Open
DaveG7 wants to merge 7 commits into
simwai:masterfrom
DaveG7:master
Open

fix: restore compatibility with Perplexity API changes (2026|06)#39
DaveG7 wants to merge 7 commits into
simwai:masterfrom
DaveG7:master

Conversation

@DaveG7

@DaveG7 DaveG7 commented Jun 3, 2026

Copy link
Copy Markdown

Description

Perplexity changed their REST API response format, breaking both thread discovery and conversation extraction.

Library discovery (library-discovery.ts):

  • The list_ask_threads response no longer includes slug, query_str, first_answer, last_query_datetime, total_threads, etc. — replaced by a leaner format with link, context_uuid,
    and mode_type
  • Updated RawThread and ConversationMeta interfaces to handle the new format while keeping old fields optional for backward compatibility
  • Fixed URL construction: now uses thread.link first, falling back to slug, then uuid — the old code produced https://www.perplexity.ai/search/undefined for all new-format
    threads

Conversation extraction (conversation-extractor.ts):

  • Replaced the Playwright response-event listener approach with a direct page.evaluate() fetch to /rest/thread/{uuid} — same pattern already used by library discovery, more
    reliable
  • Deduplicated answer blocks in parseMessages: the new API returns both ask_text_0_markdown and ask_text blocks with identical content, causing doubled output

Related Issues

Fixes #

Checklist

  • I have read the CONTRIBUTING.md
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

Diagnostic Logs

N/A — The conversation extractor no longer uses debug/api-diagnostics.jsonl — the response-listener that wrote to it was replaced by a direct API call, making per-response diagnostics unnecessary.

Cheers, Dave

@simwai

simwai commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Hello @DaveG7, thanks for your PR.
It would be neat if you could post a sanitized version of the exact HTTP request and response in order to verify this shape change.

What I like:
I appreciate that you added the new schema as fallback.

What I don't like:

  • You removed the complete zod validator for the HTTP response shape.
  • You removed the diagnostics logic.
  • You swapped the approach to page.evaluate() with the reason of more reliability. Why is it more reliable here? I don't see it.

DaveG7 and others added 3 commits June 4, 2026 20:35
…rect fetch

Address PR simwai#39 review feedback on conversation-extractor:

- Restore ApiResponseSchema, validated against a live 2026 /rest/thread/{id}
  response. Pagination is the top-level has_next_page/next_cursor pair (not
  collection_info, which is the list endpoint). Diagnose-and-continue: shape
  drift writes a diagnostic and falls through to the per-entry EntrySchema gate.
- Restore ApiDiagnosticsWriter calls (zod_error / unknown_shape / empty_entries)
  so the debug/api-diagnostics.jsonl path the REPL references works again.
- Keep the page.evaluate()+fetch approach for consistency with library-discovery
  (the response-listener was the lone divergent /rest/ path); replace hardcoded
  version=2.18 with shared DEFAULT_API_VERSION.
- Remove dead adaptive-timeout no-ops (reduceTimeout/recoverTimeout) and their
  now-unused worker-pool callers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A live thread with 212 entries paginates across 3 pages of ~99; the new
single-fetch path only returned page 1, truncating long conversations.

fetchThreadData now keeps the single-fetch fast path for normal threads and,
when has_next_page is true, follows the top-level next_cursor (same URL +
&cursor=<encoded>) accumulating entries in API order until the thread is
complete, capped at 50 pages. Split the per-page fetch+validate into
fetchThreadPage. This restores the long-thread coverage the old response
listener provided.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@DaveG7

DaveG7 commented Jun 4, 2026

Copy link
Copy Markdown
Author

Hey @simwai, thanks for the careful review — you were right to push back on the parts I'd stripped out. Went back through all of it.

Sanitized request/response you asked for:

GET https://www.perplexity.ai/rest/thread/<thread-uuid>?version=2.18&source=default
Cookie: <session cookies>           # credentials: 'include'
→ HTTP 200
{
  "entries": [
    {
      "uuid": "<uuid>",
      "status": "COMPLETED",
      "thread_title": "i am a simple test thread",
      "query_str": "i am a simple test thread",
      "updated_datetime": "2026-06-04T20:01:55.688657",
      "blocks": [
        { "intended_usage": "ask_text_0_markdown",
          "markdown_block": { "answer": "Understood — this is a test thread." } },
        { "intended_usage": "ask_text",
          "markdown_block": { "answer": "Understood — this is a test thread." } }
        // plan / workflow_root / answer_tabs / pending_followups blocks elided
      ]
      // classifier_results / mhe_predictions / social_info etc. elided (not used)
    }
  ],
  "background_entries": [],
  "has_next_page": false,
  "next_cursor": null,
  "status": "success",
  "thread_metadata": { "title": "i am a simple test thread", "...": "..." }
}

Two things it cleared up: pagination is a top-level has_next_page/next_cursor pair, not inside collection_info (that key isn't on the thread endpoint) — I'd had the check in the wrong place. And the answer comes back twice (ask_text_0_markdown + ask_text), which is the duplication the dedup handles.

On your three points:

1 — Validator restored. Dropping it was the wrong move. ApiResponseSchema validates the body again in fetchThreadData(), modelled on the real shape above (optional fields, so new keys don't reject a valid response). I made it diagnose-and-continue — on a mismatch it writes a diagnostic and falls through to the per-entry EntrySchema gate, so a future drift gets surfaced instead of silently dropping threads. Happy to hard-gate instead if you prefer.

2 — Diagnostics restored. ApiDiagnosticsWriter writes again (zod_error / unknown_shape / empty_entries). I'd missed that the REPL still points users to debug/api-diagnostics.jsonl — removing the writer quietly broke that, so thanks for catching it.

3 — page.evaluate() — fair question, "more reliable" was lazy on my part. The real reason is consistency: library-discovery.ts already fetches every /rest/thread/list_* endpoint this exact way, and the response-listener was the one /rest/ path that diverged (and needed the adaptive-timeout handling to compensate — those reduceTimeout/recoverTimeout hooks had decayed to no-ops worker-pool.ts was still calling, so I removed them). It also gives explicit HTTP-status handling, and I swapped the hardcoded version=2.18 for a shared DEFAULT_API_VERSION. If you'd rather keep the listener, I'm glad to go that way.

On pagination — I dug into this properly. Short threads come back in one response (has_next_page: false), but I found a long one that genuinely paginates: 212 entries across 3 pages of ~99. The next page is the same URL + &cursor=<encodeURIComponent(next_cursor)>, and each page carries its own next_cursor/has_next_page. So I implemented the accumulation: fetchThreadData keeps the single-fetch fast path for normal threads, and when has_next_page is true it follows the cursor and concatenates entries (in API order — same order the old listener relied on) until the thread is complete, with a 50-page safety cap. So this actually closes the gap the old listener used to cover for long threads, rather than just warning.

Thanks again — happy to adjust any of it.

Cheers Dave

@simwai

simwai commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Thank you for the explanation. However, consistency with library-discovery.ts is not a strong enough reason to replace the working network listener. The listener handled pagination reliably and had no timing issues. The page.evaluate fetch approach introduces undeniable timing downsides that the listener did not have. I would prefer to keep the original listener and only update the Zod schema and pagination detection as needed.

@simwai

simwai commented Jun 4, 2026

Copy link
Copy Markdown
Owner

@DaveG7

@simwai

simwai commented Jun 4, 2026

Copy link
Copy Markdown
Owner

I also would like to know if the HTTP request and response you have sent is real or completely from AI?

@DaveG7

DaveG7 commented Jun 6, 2026

Copy link
Copy Markdown
Author

Ciao @simwai

Sorry for the delay, family and so... ;-) In reality, we could switch to our native language, but easy, let's stick to the international language.

I fully understand your skepticism. But I am not here to "Dich bespassen" and waste your time; time is what most people lack, me included. I am seriously contributing in the ways I can, and to shortly answer your question:
Yes, the request was fully manually done over the devTools in the browser and then redacted from my personal data and shortened a bit. If an unshortened version of the request/response would help you evaluate the PR more confidently, I am happy to provide it.

My first PR was rushed; I had no time to dig into your code base. I was searching for a fast solution to extract my data from Perplexity, which I have been using for approximately 1 year. Wanted to build a small pipeline in a Docker container but had some problems getting it to run, as you need the Headless=false option. When I tested your version locally (before #39 ), it wasn't working for me. At that time I wasn't aware of the possible debugging output, which could have helped even better. I leaned on AI assistance to get something working quickly, which stripped some of your original logic—I apologize for that.

Second PR should be better, and so far it's working fine for me. It's your repo, you choose what you can use and what not. No offense taken; on the contrary, I will dig into your mentioned timing concerns, as this is precisely the thing I still do not have the needed experience to directly see that stuff. You're an engineer, I am not. I have taken another main path (natural science), but IT and Tech is my second "Stammbein", it's passion and interest since childhood, something I try to add into my primary education path.

I wish you a pleasant weekend and I am happy to hear from you.

Cheers Dave

@simwai

simwai commented Jun 6, 2026

Copy link
Copy Markdown
Owner

@DaveG7 I understand that the last versions were buggy. I did not enough testing myself before merging the output of the agent.

I will investigate today a little bit more the requests and responses. I am sure we can get this done somehow.

Btw it is helpful when you say on which url you spotted the request/response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants