
feat(phys_org): add Phys.org mirror site #6

Open
Within-yao wants to merge 1 commit into aiming-lab:main from Within-yao:feat/phys_org-mirror

@Within-yao Within-yao commented May 13, 2026

Summary

Adds a 16th WebHarbor mirror at https://phys.org — a science / technology / research news aggregator. Real RSS-derived catalog of 210 articles across 7 categories (Physics, Earth Sciences, Technology, Biology, Chemistry, Astronomy & Space, Nanotechnology) with real thumbnails, plus 4 benchmark users with seeded saved articles, comments (incl. cross-user reply chains), and search history.

Registered as the 16th site at port 40015. The eighth category, other, is created but left empty because the live phys.org RSS endpoint for that section returns 404.

Site features

  • Categories with recent / popular sort
  • Article detail page with source journal / institution / DOI byline
  • Threaded comments with reply UI (parent-article validation in post_comment)
  • Save articles with notes (auth-gated), browsable via /saved
  • Token-overlap scored search with optional category filter
  • Trending list, public user profile, account edit, login/register/logout
  • 18 WebVoyager-format tasks in sites/phys_org/tasks.jsonl
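The token-overlap search listed above could work roughly as follows. This is a minimal sketch, not the code in app.py: the function names, the dict-based article records, and the choice to score against title plus summary are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(articles, query, category=None):
    """Rank articles by the number of query tokens they share.

    `articles` is assumed to be a list of dicts with slug, title,
    summary, and category keys (hypothetical shape for this sketch).
    """
    query_tokens = tokenize(query)
    scored = []
    for article in articles:
        # Optional category filter, applied before scoring.
        if category and article["category"] != category:
            continue
        doc_tokens = tokenize(article["title"] + " " + article["summary"])
        overlap = len(query_tokens & doc_tokens)
        if overlap:
            scored.append((overlap, article))
    # Highest overlap first; ties broken by slug so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]["slug"]))
    return [article for _, article in scored]
```

Breaking ties on the slug keeps result order stable between runs, which matters when benchmark answers depend on ranking.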

Seeded rows

  • articles: 210
  • categories: 8 (one empty by design)
  • users: 4 benchmark users, all with password TestPass123! (stored as a bcrypt hash)
  • comments: 15 (incl. 3 cross-user reply chains)
  • saved_articles: 23
  • search_history: 10

Benchmark users: alice.j@test.com, bob.c@test.com, carol.d@test.com, david.k@test.com.

Determinism work for byte-identical reset

Three subtle non-determinism fixes that the benchmark invariant required:

  1. RSS pubDate parsing: strptime('%Z') rejects EDT / PDT, so all 210 articles were collapsing onto MIRROR_REFERENCE_DATE. Fixed by stripping the trailing TZ token in _parse_pub. Side benefit: task #11 (compare two pub dates) now has a real answer.
  2. Pinned bcrypt hash: bcrypt.generate_password_hash mixes in a random salt on every call, which would change users.password_hash on every seed run and break byte-identity. Fixed by pinning one valid hash (PINNED_PASSWORD_HASH); check_password_hash accepts it normally.
  3. Per-row RNG seeded by slug — synthesized author / journal / institution / view counts use random.Random(slug + ":...") so two clean machines produce identical rows.
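The pubDate fix in item 1 can be sketched as below. This is an illustration under stated assumptions, not the actual _parse_pub: the exact RSS date layout, the fallback behaviour, and the placeholder value given to MIRROR_REFERENCE_DATE are guesses. Python's %Z only matches UTC, GMT, and the local zone names, which is why an abbreviation like EDT fails on most machines.

```python
from datetime import datetime

# Placeholder value for this sketch; the real constant lives in seed_data.py.
MIRROR_REFERENCE_DATE = datetime(2026, 5, 13)


def parse_pub(raw):
    """Parse an RSS pubDate, ignoring a trailing alphabetic TZ abbreviation.

    Assumes the layout 'Wed, 13 May 2026 10:00:03 EDT'. The zone is simply
    dropped, so the result is a naive datetime in the feed's local time.
    """
    text = raw.strip()
    head, _, tail = text.rpartition(" ")
    if head and tail.isalpha():
        text = head  # strip 'EDT', 'PDT', 'GMT', ...
    try:
        return datetime.strptime(text, "%a, %d %b %Y %H:%M:%S")
    except ValueError:
        # Unparseable dates fall back to the reference date (assumed behaviour).
        return MIRROR_REFERENCE_DATE
```

Numeric offsets such as +0000 are not handled here and would take the fallback path; a fuller version could parse them with %z.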

Codex review fixes (4 findings, 4 fixed)

  • High: .gitignore inline-comment patterns weren't matching sites/*/scraped_data/ and sites/*/instance/. Fixed: comments moved to their own lines. git check-ignore -v now matches both.
  • Medium: the /article/<slug> GET handler incremented Article.views on every page view, so an agent's browse order could shift the answers to trending tasks (#3, #15) and break /reset byte-identity. Fixed: increment removed; views is now read-only seed data.
  • Low: next= redirect targets in /login and /save were not validated. Fixed: _safe_next() only accepts same-app relative paths.
  • Low: reply support was route-only, with no UI control on article_detail.html and no parent.article_id validation. Fixed: every comment renders a Reply link that pre-fills a hidden parent_id; post_comment() rejects non-int parents, missing parents, and parents from a different article.

Paired Hugging Face PR

  • Heavy assets: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/7
  • Tarball: phys_org.tar.gz (460 KB)
  • Tarball md5: 5c4294d956a7a8e0dbb7bade7bfd150e
  • DB md5: b4a324122c3cb0a56b8d511e73ff13a7
  • Asset contents: sites/phys_org/instance_seed/phys_org.db (495 KB seed) + sites/phys_org/static/images/ (210 real RSS thumbnails, 840 KB)
  • .assets-revision is left at revision: main so the HF merge will roll in automatically (same approach as the TED PR).

Verification

All checks below were run on this contributor's machine against webharbor:dev built from this branch.

./scripts/check_assets.sh

[check] all sites have instance_seed/ (14 sites lack at least one optional asset dir — that's OK)

./scripts/build.sh webharbor:dev

[build] webharbor:dev ready (5.92GB)
IMAGE           ID             DISK USAGE   CONTENT SIZE
webharbor:dev   43c8218e62bf       5.92GB         2.76GB

Container start

docker run -d --rm --name wh-test \
  -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev

All 16 sites came up — startup log:

[WebSyn] Waiting for sites to become ready...
  [2/30s] 16/16 sites ready
[WebSyn] Site status:
  [OK] allrecipes :40000     [OK] amazon :40001          [OK] apple :40002
  [OK] arxiv :40003          [OK] bbc_news :40004        [OK] booking :40005
  [OK] github :40006         [OK] google_flights :40007  [OK] google_map :40008
  [OK] google_search :40009  [OK] huggingface :40010     [OK] wolfram_alpha :40011
  [OK] cambridge_dictionary :40012  [OK] coursera :40013 [OK] espn :40014
  [OK] phys_org :40015

Port sweep — all 16 return 200

41000: 200   41001: 200   41002: 200   41003: 200
41004: 200   41005: 200   41006: 200   41007: 200
41008: 200   41009: 200   41010: 200   41011: 200
41012: 200   41013: 200   41014: 200   41015: 200

/_health for phys_org

$ curl -s http://localhost:41015/_health
{"ok":true,"site":"phys_org"}

/reset/phys_org byte-identical reset

$ curl -X POST http://localhost:8201/reset/phys_org
{"pid":156,"ready":true,"site":"phys_org"}

$ docker exec wh-test md5sum \
  /opt/WebSyn/phys_org/instance/phys_org.db \
  /opt/WebSyn/phys_org/instance_seed/phys_org.db
b4a324122c3cb0a56b8d511e73ff13a7  /opt/WebSyn/phys_org/instance/phys_org.db
b4a324122c3cb0a56b8d511e73ff13a7  /opt/WebSyn/phys_org/instance_seed/phys_org.db
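The same byte-identity check can be scripted instead of read off by eye. The helper below is a small sketch using only the standard library; the function names are hypothetical, and the paths would be whatever the container exposes.

```python
import hashlib


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_db, seed_db):
    """True when the live database hashes to the same value as the seed."""
    return md5_of(live_db) == md5_of(seed_db)
```

For example, is_byte_identical("instance/phys_org.db", "instance_seed/phys_org.db") should hold right after a reset, assuming those relative paths match the layout shown above.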

/reset-all parallel reset

$ time curl -s -X POST http://localhost:8201/reset-all
... ok: true, all 16 sites ready ...
elapsed: 0.72s

Idempotency across container restart

$ docker restart wh-test
wh-test
$ docker exec wh-test md5sum /opt/WebSyn/phys_org/instance/phys_org.db /opt/WebSyn/phys_org/instance_seed/phys_org.db
b4a324122c3cb0a56b8d511e73ff13a7  /opt/WebSyn/phys_org/instance/phys_org.db
b4a324122c3cb0a56b8d511e73ff13a7  /opt/WebSyn/phys_org/instance_seed/phys_org.db

Phys.org route smoke (10 routes, all 200)

/                                                                  200
/category/physics                                                  200
/category/astronomy?sort=popular                                   200
/article/quantum-circuit-test-finally-exposes-what-has-been-warping-performance  200
/search?q=quantum                                                  200
/trending                                                          200
/login                                                             200
/register                                                          200
/user/alice_j                                                      200
/_health                                                           200

Phys.org auth + state-workflow (alice.j@test.com)

GET /login: 200
POST /login: 302 → /                          (CSRF + bcrypt verified)
GET /saved: 200, 6 article cards              (matches seeded saved set)
GET /account: 200, recent searches present
GET /search?q=CRISPR: 200                     (handles 0-result query)
detail save_btn: True
comment post: 200, persisted: True            (top-level + reply to seeded comment)
save post: 200, flash visible: True
GET /user/alice_j: 200, recent comments rendered

Reply hardening: all 3 invalid-parent cases are rejected with "Invalid reply target.", while valid replies and top-level comments still work:

valid same-article reply persisted:                    True
cross-article injection rejected (parent_id from another article): True
malformed parent_id ('abc') rejected:                  True
non-existent parent_id ('99999') rejected:             True
top-level (empty parent_id) still works:               True
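The five cases above correspond to a check along these lines. This is a framework-free sketch of the validation logic only: the function name, the dict lookup standing in for a database query, and the return convention are assumptions, not the real post_comment().

```python
INVALID_REPLY = "Invalid reply target."


def validate_parent(raw_parent_id, article_id, comments_by_id):
    """Return (parent_id, error) for a submitted parent_id form value.

    `comments_by_id` maps comment id -> {"article_id": ...} and stands in
    for the database lookup (hypothetical shape for this sketch).
    """
    # Empty or missing parent_id means a top-level comment.
    if raw_parent_id is None or raw_parent_id == "":
        return None, None
    # Reject anything that is not an integer, e.g. 'abc'.
    try:
        parent_id = int(raw_parent_id)
    except (TypeError, ValueError):
        return None, INVALID_REPLY
    parent = comments_by_id.get(parent_id)
    # Reject missing parents and parents that belong to another article.
    if parent is None or parent["article_id"] != article_id:
        return None, INVALID_REPLY
    return parent_id, None
```

Returning the same message for every failure mode avoids telling a caller whether a given comment id exists on another article.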

Open-redirect guard:

POST /login?next=https://evil.example.com/x → 302 → /  (NOT the attacker URL)
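A guard with this behaviour might look like the following. It is a minimal sketch of one common way to accept only same-app relative paths, written with the standard library; the real _safe_next() may differ, and the function name and default here are illustrative.

```python
from urllib.parse import urlsplit


def safe_next(target, default="/"):
    """Return `target` only if it is a same-app relative path, else `default`."""
    if not target:
        return default
    # Backslashes and control characters can be normalised by browsers
    # into something that escapes the site, so refuse them outright.
    if "\\" in target or any(ord(ch) < 32 for ch in target):
        return default
    try:
        parts = urlsplit(target)
    except ValueError:
        return default
    # Any scheme or host means the URL points off-site.
    if parts.scheme or parts.netloc:
        return default
    # Require a single leading slash; '//host' is protocol-relative.
    if not target.startswith("/") or target.startswith("//"):
        return default
    return target
```

With this sketch, safe_next("https://evil.example.com/x") falls back to "/", matching the redirect shown above, while safe_next("/saved") is returned unchanged.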

Playwright visual diff vs live phys.org

10 full-page Chromium screenshots (1280×900 viewport) captured side-by-side under /tmp/phys_org_screenshots/{mirror,live}/:

page      mirror (localhost:41015)                                                         live (phys.org)
home      /                                                                                /
category  /category/physics                                                                /physics-news/
article   /article/quantum-circuit-test-finally-exposes-what-has-been-warping-performance  first phys-news tile (currently a 3D-atomic-rearrangement quantum article)
search    /search?q=quantum                                                                /search/?search=quantum
login     /login                                                                           /account/login/

Visual fidelity: deep-navy header + tagline + search bar, blue accent button, white background, card-based article grid with thumb-left / title+meta-right layout, sidebar with Trending list, Related Stories on detail pages, and a colored source/journal block on the article body. Not pixel-identical (intentionally), but the brand feel is consistent with the live site.

Files

  • New: sites/phys_org/{app.py, seed_data.py, _health.py, requirements.txt, tasks.jsonl, templates/*, static/{css,icons,js}/*}
  • Modified: websyn_start.sh, control_server.py, Dockerfile, .gitignore
  • Heavy assets (instance_seed/phys_org.db, static/images/) live in HF PR #7, not in git.

🤖 Generated with Claude Code

Adds a 16th WebHarbor mirror at https://phys.org — a science / technology
/ research news aggregator. Real RSS-derived catalog of 210 articles
across 7 categories (Physics, Earth, Technology, Biology, Chemistry,
Astronomy, Nanotechnology) with real thumbnails, plus 4 benchmark users
with seeded saved articles, comments (incl. cross-user reply chains),
and search history.

Registered as the 16th site at port 40015. .gitignore was tightened
because the previous inline-comment patterns for sites/*/scraped_data/
and sites/*/instance/ were not matching (Codex finding, fixed in this
PR).

Site features:
- Categories with recent/popular sort
- Article detail with source journal / institution / DOI
- Threaded comments with reply UI (parent-article validation)
- Save articles with notes (auth)
- Token-overlap scored search with category filter
- Trending list, user profile, account edit, login/register

Determinism work for byte-identical reset:
- RSS pubDate parsing strips trailing TZ token (strptime %Z rejects EDT)
- Pinned bcrypt hash for benchmark users (random salt would drift md5)
- Per-article RNG seeded by slug for synthesized author/journal/views
- /article/<slug> GET no longer mutates Article.views (Codex finding)
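
The per-article RNG item above can be illustrated with a short sketch. The field names, the candidate lists, and the ":author" / ":journal" / ":views" suffixes are invented for illustration; the source only shows the random.Random(slug + ":...") pattern. Seeding random.Random with a string is deterministic across processes and machines, unlike the built-in hash() of a string.

```python
import random

# Hypothetical candidate pools; the real lists are not shown in this PR.
AUTHORS = ["A. Rivera", "B. Chen", "C. Okafor"]
JOURNALS = ["Physical Review Letters", "Nature Communications", "Science Advances"]


def synthesize_fields(slug):
    """Derive synthetic metadata for an article from its slug alone.

    Each field gets its own generator, seeded by the slug plus a field
    suffix, so adding a new field later does not shift existing values.
    """
    return {
        "author": random.Random(slug + ":author").choice(AUTHORS),
        "journal": random.Random(slug + ":journal").choice(JOURNALS),
        "views": random.Random(slug + ":views").randint(100, 50000),
    }
```

Because no global random state is involved, the row for a given slug does not depend on the order in which articles are seeded.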

Open-redirect hardening:
- _safe_next() validates next= targets in /login and /save (Codex finding)

Tasks: 18 WebVoyager-format tasks in sites/phys_org/tasks.jsonl,
covering search, browse, detail, comment thread reading, save toggle,
auth flows, and one comparison task.

Assets: heavy assets (instance_seed/phys_org.db, static/images/) live
in the paired HF dataset PR; phys_org.tar.gz is 460K, db md5
b4a324122c3cb0a56b8d511e73ff13a7. .assets-revision uses 'main' so the
HF merge will roll in automatically.
@Within-yao
Author

Closing temporarily — will reopen tomorrow after I run the docker build + /reset/ parity verification and Playwright visual diff. Local cache + branch retained.

@Within-yao Within-yao closed this May 13, 2026
@Within-yao
Author

Reopening: docker build + /reset/phys_org parity + Playwright visual diff are now done. Updating PR body with verification outputs.

@Within-yao Within-yao reopened this May 13, 2026