feat(phys_org): add Phys.org mirror site #6
Open
Within-yao wants to merge 1 commit into
Adds a 16th WebHarbor mirror at https://phys.org, a science / technology / research news aggregator. Real RSS-derived catalog of 210 articles across 7 categories (Physics, Earth, Technology, Biology, Chemistry, Astronomy, Nanotechnology) with real thumbnails, plus 4 benchmark users with seeded saved articles, comments (incl. cross-user reply chains), and search history. Registered as the 16th site at port 40015.

.gitignore was tightened because the previous inline-comment patterns for sites/*/scraped_data/ and sites/*/instance/ were not matching (Codex finding, fixed in this PR).

Site features:
- Categories with recent/popular sort
- Article detail with source journal / institution / DOI
- Threaded comments with reply UI (parent-article validation)
- Save articles with notes (auth)
- Token-overlap scored search with category filter
- Trending list, user profile, account edit, login/register

Determinism work for byte-identical reset:
- RSS pubDate parsing strips trailing TZ token (strptime %Z rejects EDT)
- Pinned bcrypt hash for benchmark users (random salt would drift md5)
- Per-article RNG seeded by slug for synthesized author/journal/views
- /article/<slug> GET no longer mutates Article.views (Codex finding)

Open-redirect hardening:
- _safe_next() validates next= targets in /login and /save (Codex finding)

Tasks: 18 WebVoyager-format tasks in sites/phys_org/tasks.jsonl, covering search, browse, detail, comment thread reading, save toggle, auth flows, and one comparison task.

Assets: heavy assets (instance_seed/phys_org.db, static/images/) live in the paired HF dataset PR; phys_org.tar.gz is 460K, db md5 b4a324122c3cb0a56b8d511e73ff13a7. .assets-revision uses 'main' so the HF merge will roll in automatically.
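The first and third determinism items above can be sketched as follows. This is a minimal illustration, not the code in `seed_data.py`: the function names `parse_pub` and `synth_fields`, the value of `MIRROR_REFERENCE_DATE`, the `JOURNALS` list, the `":meta"` seed suffix, and the view-count range are all assumptions made for the example.

```python
import random
from datetime import datetime

# Hypothetical fallback date; the PR names MIRROR_REFERENCE_DATE but not its value.
MIRROR_REFERENCE_DATE = datetime(2026, 1, 1)

# Hypothetical pool of journal names used only for this illustration.
JOURNALS = ["Journal A", "Journal B", "Journal C"]


def parse_pub(raw: str) -> datetime:
    """Parse an RSS pubDate such as 'Tue, 05 May 2026 10:30:00 EDT'.

    The trailing timezone token is dropped before parsing because
    strptime's %Z does not accept abbreviations like EDT or PDT.
    The result is a naive datetime in the feed's local time.
    """
    without_tz = raw.strip().rsplit(" ", 1)[0]
    try:
        return datetime.strptime(without_tz, "%a, %d %b %Y %H:%M:%S")
    except ValueError:
        # Unparseable input falls back to the fixed reference date.
        return MIRROR_REFERENCE_DATE


def synth_fields(slug: str) -> dict:
    """Derive synthesized fields from a per-article RNG seeded by the slug.

    String seeds are hashed by the random module itself, so the sequence
    does not depend on PYTHONHASHSEED or on the machine.
    """
    rng = random.Random(slug + ":meta")  # ":meta" suffix is an assumption
    return {
        "journal": rng.choice(JOURNALS),
        "views": rng.randint(100, 50_000),
    }
```

Because each article gets its own generator, adding or reordering articles does not change the values synthesized for the others, which a single shared RNG would not guarantee.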
Author
Closing temporarily. Will reopen tomorrow after I run the docker build + /reset/ parity verification and Playwright visual diff. Local cache + branch retained.
Author
Reopening: docker build + /reset/phys_org parity + Playwright visual diff are now done. Updating PR body with verification outputs.
Summary
Adds a 16th WebHarbor mirror at https://phys.org — a science / technology / research news aggregator. Real RSS-derived catalog of 210 articles across 7 categories (Physics, Earth Sciences, Technology, Biology, Chemistry, Astronomy & Space, Nanotechnology) with real thumbnails, plus 4 benchmark users with seeded saved articles, comments (incl. cross-user reply chains), and search history.
Registered as the 16th site at port `40015`. The eighth category `other` is created but empty (the live phys.org RSS endpoint for that section 404s).

Site features
- Categories with `recent`/`popular` sort
- Threaded comments with reply UI (`post_comment`)
- Saved articles (`/saved`)
- Tasks in `sites/phys_org/tasks.jsonl`

Seeded rows
Password: `TestPass123!`
Benchmark users: `alice.j@test.com`, `bob.c@test.com`, `carol.d@test.com`, `david.k@test.com`.

Determinism work for byte-identical reset
Three subtle non-determinism fixes that the benchmark invariant required:
- `strptime('%Z')` rejects `EDT`/`PDT`, so all 210 articles were collapsing onto `MIRROR_REFERENCE_DATE`. Fixed by stripping the trailing TZ token in `_parse_pub`. Side benefit: task #11 (compare two pub dates) now has a real answer.
- `bcrypt.generate_password_hash` mixes in a random salt on every call, which would change `users.password_hash` on every seed and break byte-identity. Fixed by pinning one valid hash (`PINNED_PASSWORD_HASH`); `check_password_hash` accepts it normally.
- Synthesized author/journal/views come from a per-article RNG, `random.Random(slug + ":...")`, so two clean machines produce identical rows.

Codex review fixes (4 findings, 4 fixed)
- `.gitignore` inline-comment patterns weren't matching `sites/*/scraped_data/` and `sites/*/instance/`. Fixed: comments moved to their own lines. `git check-ignore -v` now matches both.
- `/article/<slug>` GET was incrementing `Article.views` on every page view, letting agent browse order shift answers for trending tasks (#3, #15) and break `/reset` byte-identity. Fixed: increment removed; `views` is now read-only seed data.
- `next=` redirect targets in `/login` and `/save` were not validated. Fixed: `_safe_next()` only accepts same-app relative paths.
- `article_detail.html`: no `parent.article_id` validation. Fixed: every comment renders a Reply link that pre-fills a hidden `parent_id`; `post_comment()` rejects non-int parents, missing parents, and parents from a different article.

Paired Hugging Face PR
- Tarball: `phys_org.tar.gz` (460 KB), checksum `5c4294d956a7a8e0dbb7bade7bfd150e`
- DB md5: `b4a324122c3cb0a56b8d511e73ff13a7`
- Contents: `sites/phys_org/instance_seed/phys_org.db` (495 KB seed) + `sites/phys_org/static/images/` (210 real RSS thumbnails, 840 KB)

`.assets-revision` is left at `revision: main` so the HF merge will roll in automatically (same approach as the TED PR).

Verification
All checks below were run on this contributor's machine against `webharbor:dev` built from this branch.
`./scripts/check_assets.sh`
`./scripts/build.sh webharbor:dev`
Container start
All 16 sites came up — startup log:
Port sweep — all 16 return 200
`/_health` for phys_org
`/reset/phys_org` byte-identical reset
`/reset-all` parallel reset
Idempotency across container restart
Phys.org route smoke (10 routes, all 200)
Phys.org auth + state-workflow (`alice.j@test.com`)
Reply hardening: all 4 invalid-parent cases rejected with `Invalid reply target.`:
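A minimal sketch of the kind of parent check that would produce these rejections. It is not the PR's `post_comment()`: the `Comment` dataclass, the `validate_parent` helper, and the in-memory `comments_by_id` lookup are hypothetical stand-ins for the ORM model and database query, and only the three invalid cases named in the PR description are covered.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Comment:
    # Hypothetical stand-in for the ORM comment model.
    id: int
    article_id: int


def validate_parent(raw_parent_id, article_id: int,
                    comments_by_id: Dict[int, Comment]) -> Optional[int]:
    """Return the parent comment id, or None for a top-level comment.

    Raises ValueError("Invalid reply target.") when the submitted parent
    is not an integer, does not exist, or belongs to another article.
    """
    if raw_parent_id is None or raw_parent_id == "":
        return None  # no parent submitted: top-level comment
    try:
        parent_id = int(raw_parent_id)
    except (TypeError, ValueError):
        raise ValueError("Invalid reply target.") from None
    parent = comments_by_id.get(parent_id)
    if parent is None or parent.article_id != article_id:
        raise ValueError("Invalid reply target.")
    return parent_id
```

Checking `parent.article_id` on the server matters because the hidden `parent_id` field is client-controlled; without it a reply could be attached to a thread on a different article.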
Open-redirect guard:
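A minimal sketch of a guard in the spirit of `_safe_next()`, which the PR describes as accepting only same-app relative paths. The function name `safe_next`, the `default` parameter, and the exact rejection rules are assumptions for illustration, not the code in `app.py`.

```python
from urllib.parse import urlsplit


def safe_next(target, default="/"):
    """Return target if it is a same-app relative path, else default."""
    if not target:
        return default
    # Backslashes and control characters can be normalized by browsers
    # into something that looks like a different URL, so reject them.
    if "\\" in target or any(ord(ch) < 0x20 for ch in target):
        return default
    # Must be rooted at this app; "//host" is a scheme-relative URL.
    if not target.startswith("/") or target.startswith("//"):
        return default
    try:
        parts = urlsplit(target)
    except ValueError:
        return default
    # Belt and braces: no scheme or host may survive parsing.
    if parts.scheme or parts.netloc:
        return default
    return target
```

For example, `safe_next("/saved")` returns `"/saved"`, while `safe_next("https://evil.example/")` and `safe_next("//evil.example")` both fall back to `"/"`.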
Playwright visual diff vs live phys.org
10 full-page Chromium screenshots (1280×900 viewport) captured side-by-side under `/tmp/phys_org_screenshots/{mirror,live}/`:

| Mirror (`localhost:41015`) | Live (`phys.org`) |
| --- | --- |
| `/` | `/` |
| `/category/physics` | `/physics-news/` |
| `/article/quantum-circuit-test-finally-exposes-what-has-been-warping-performance` | |
| `/search?q=quantum` | `/search/?search=quantum` |
| `/login` | `/account/login/` |

Visual fidelity: deep-navy header + tagline + search bar, blue accent button, white background, card-based article grid with thumb-left / title+meta-right layout, sidebar with Trending list, Related Stories on detail pages, and a colored source/journal block on the article body. Not pixel-identical (intentionally), but the brand feel is consistent with the live site.
Files
- `sites/phys_org/{app.py, seed_data.py, _health.py, requirements.txt, tasks.jsonl, templates/*, static/{css,icons,js}/*}`
- `websyn_start.sh`, `control_server.py`, `Dockerfile`, `.gitignore`
- Heavy assets (`instance_seed/phys_org.db`, `static/images/`) live in HF PR #7, not in git.

🤖 Generated with Claude Code