Add UC Berkeley mirror site (port 40015) #11

Open
richard-peng-xia wants to merge 1 commit into aiming-lab:main from richard-peng-xia:add-berkeley-mirror

richard-peng-xia commented May 13, 2026

TL;DR

Adds a fully functional berkeley.edu mirror site to WebHarbor at port 40015, with 30 benchmark tasks covering programs, news, events, faculty, research centers, and admissions.

Motivation

UC Berkeley is the #1 public research university in the US and one of the most-visited university portals. It covers a domain — university information browsing (academics, research, campus life) — not represented in the existing 15 sites, and offers rich multi-step navigation tasks across programs, faculty, events, and news that are well-suited for web-agent benchmarks.

Design

Flask application (sites/berkeley/app.py)

Eight SQLAlchemy content models (College, Department, Program, NewsArticle, Event, ResearchCenter, Faculty, Bookmark) plus a User model for auth. All are seeded idempotently: each seed function returns early when the database is already populated, which preserves the byte-identical reset invariant.
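The gating pattern can be sketched as follows. This is a minimal illustration that uses the standard-library sqlite3 module instead of SQLAlchemy so it runs on its own; the `seed_colleges` function, the `college` table, and the two sample rows are assumptions for the example, not code from this PR.

```python
import sqlite3

# Hypothetical sample rows; the PR's real seed data is not reproduced here.
COLLEGES = ["College of Engineering", "College of Chemistry"]


def seed_colleges(conn: sqlite3.Connection) -> int:
    """Insert the sample colleges unless the table is already populated.

    Returns the number of rows inserted (0 when the gate skips seeding).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS college "
        "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM college").fetchone()
    if count > 0:
        return 0  # gate: table already seeded, so leave it untouched
    conn.executemany(
        "INSERT INTO college (name) VALUES (?)", [(name,) for name in COLLEGES]
    )
    conn.commit()
    return len(COLLEGES)


conn = sqlite3.connect(":memory:")
first = seed_colleges(conn)   # inserts the sample rows
second = seed_colleges(conn)  # skipped by the gate
total = conn.execute("SELECT COUNT(*) FROM college").fetchone()[0]
```

Because the second call writes nothing, running the seed step twice leaves the table contents unchanged, which is the property the reset invariant relies on.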

Route coverage mirrors the real site's navigation:

  • Homepage with featured news, upcoming events, and quick stats
  • News listing + article detail with category/search filters
  • Academics overview listing all 14 colleges/schools
  • Programs listing + detail with degree type and college filters
  • Events listing + detail with category and date filters
  • Research centers listing + detail pages
  • Departments listing (grouped by college) + detail with faculty roster
  • Faculty listing + profile pages with department filter
  • Admissions overview with undergraduate/graduate tabs
  • About page with real Berkeley statistics
  • Unified search across programs, news, events, and faculty
  • Auth: login, register, logout, account (bookmarks)

Seed database

14 UC Berkeley colleges/schools (real names), 83 degree programs (BA/BS/MA/MS/PhD/MBA/JD/MD/MEng), 121 news articles (2023–2025, 7 categories), 64 events (upcoming + past, 7 categories), 25 research centers (BAIR, QB3, MSRI, …), 82 faculty (Jennifer Doudna, Stuart Russell, Saul Perlmutter, …), and 4 benchmark users (alice/bob/carol/dave, password: test1234). The seed DB is generated at image build time via seed_data.py.

Templates

23 Jinja2 templates styled with Berkeley Blue (#003262) and California Gold (#FDB515), modeled on the real berkeley.edu layout: responsive nav with five top-level sections (About / Admissions / Academics / Research / Campus Life), card-grid listings, detail pages with sidebars, and paginated results (20 items/page).
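As a rough sketch of the 20-items-per-page behavior, the slicing logic might look like the following. The `paginate` helper and its clamping of out-of-range page numbers are assumptions for illustration; the PR may instead rely on a pagination helper from its ORM layer.

```python
from math import ceil

PER_PAGE = 20  # page size stated in the PR description


def paginate(items, page, per_page=PER_PAGE):
    """Return (items on the requested page, total page count).

    Pages are 1-based; out-of-range page numbers are clamped.
    """
    total_pages = max(1, ceil(len(items) / per_page))
    page = min(max(page, 1), total_pages)
    start = (page - 1) * per_page
    return items[start:start + per_page], total_pages


# 121 placeholder IDs, matching the number of seeded news articles
articles = list(range(1, 122))
first_page, pages = paginate(articles, 1)
last_page, _ = paginate(articles, pages)
```

With 121 articles this yields seven pages, the last of which holds a single item.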

Benchmark tasks (tasks.jsonl)

30 tasks (IDs through ) covering: program search by degree type, news browsing by category, event filtering by date/type, faculty research lookup, research center exploration, admissions requirements, department navigation, and 5+ multi-step reasoning tasks.
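A small validator for a task file could look like the sketch below. It assumes WebVoyager-style records with the fields `web_name`, `id`, `ques`, and `web`; the sample record, its ID, and the localhost URL are hypothetical and are not taken from the PR's tasks.jsonl.

```python
import json

# Assumed WebVoyager-style field names; adjust if the real schema differs.
REQUIRED_KEYS = {"web_name", "id", "ques", "web"}


def load_tasks(lines):
    """Parse JSONL lines, checking required keys and ID uniqueness."""
    tasks = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        tasks.append(task)
    ids = [task["id"] for task in tasks]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate task IDs")
    return tasks


# Hypothetical record for illustration only
sample = [
    '{"web_name": "Example", "id": "Example--0", '
    '"ques": "Find a PhD program.", "web": "http://localhost:40015/"}'
]
tasks = load_tasks(sample)
```

In practice the lines would come from `open("tasks.jsonl")`, and a check like this can run before the tasks are handed to an agent harness.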

Verification

Check Result
All 11 main routes → HTTP 200 (werkzeug test client)
Seed: 14 colleges, 83 programs, 121 articles, 64 events, 82 faculty, 25 centers, 4 users
Seed idempotent (second run produces no duplicate rows)
[build] missing assets, fetching from HF...
[fetch] huggingface.co/datasets/ChilleD/WebHarbor @ main -> sites/
path=/home/pxia/WebHarbor/sites/.cache/tarballs
[fetch] extracting allrecipes
[fetch] extracting amazon
[fetch] extracting apple
[fetch] extracting arxiv
[fetch] extracting bbc_news
[fetch] extracting booking
[fetch] extracting cambridge_dictionary
[fetch] extracting coursera
[fetch] extracting espn
[fetch] extracting github
[fetch] extracting google_flights
[fetch] extracting google_map
[fetch] extracting google_search
[fetch] extracting huggingface
[fetch] extracting wolfram_alpha
[fetch] done — 15 site(s) extracted into sites/
[check] all sites have instance_seed/ (15 sites lack at least one optional asset dir — that's OK)
[build] docker build -t webharbor:dev . → requires Docker daemon, not run locally
+ md5sum match → requires Docker daemon, not run locally
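Once a Docker daemon is available, the md5sum comparison could be reproduced with a few lines of Python. This is a sketch only: the `md5_of` and `is_byte_identical` helpers and the throwaway file names are assumptions, and the demo files stand in for a seed database and its reset copy.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path_a, path_b):
    """True when both files have the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)


# Demo with two throwaway files standing in for the seed DB and a reset copy
with tempfile.TemporaryDirectory() as tmp:
    seed = os.path.join(tmp, "seed.db")
    reset = os.path.join(tmp, "reset.db")
    for path in (seed, reset):
        with open(path, "wb") as fh:
            fh.write(b"placeholder database bytes")
    same = is_byte_identical(seed, reset)
    with open(reset, "ab") as fh:
        fh.write(b"!")  # simulate drift after a non-idempotent reseed
    differs = not is_byte_identical(seed, reset)
```

MD5 is used here only to mirror the md5sum check named above; it detects accidental drift and is not meant as a security measure.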

HuggingFace assets

Berkeley has no scraped image assets; all data is code-generated from seed_data.py. The instance_seed DB is therefore built directly inside the Docker image via a step added to the Dockerfile, eliminating the need for a HuggingFace tarball for this site. No bump is required.

Registration

Site registered in all three required locations:

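  • control_server.py: 'berkeley' added to SITES (index 15, port 40015)
  • websyn_start.sh: 'berkeley' added to the startup array
  • Dockerfile: EXPOSE 40015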
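The index and port stated above are consistent with a simple base-plus-index scheme. The sketch below assumes a base port of 40000, inferred only from "index 15, port 40015"; the `port_for` helper is hypothetical and is not taken from control_server.py.

```python
BASE_PORT = 40000  # assumption inferred from "index 15, port 40015"


def port_for(index, base=BASE_PORT):
    """Map a zero-based site index to its port under the assumed scheme."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return base + index


berkeley_port = port_for(15)
```

If the existing sites do not follow this scheme, the port should be read from the SITES registry directly instead of being computed.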

🤖 Generated with Claude Code

Adds a full Flask mirror of berkeley.edu as the 16th WebHarbor site.

**Site features:**
- 8 SQLAlchemy models: College, Department, Program, NewsArticle, Event,
  ResearchCenter, Faculty, Bookmark (+ User with auth)
- 20+ routes: homepage, news, programs, events, research centers,
  departments, faculty, admissions, about, unified search
- 23 Jinja2 templates styled with Berkeley Blue (#003262) / Gold (#FDB515)
- 30 benchmark tasks in tasks.jsonl (WebVoyager schema)

**Seed data (fully idempotent):**
- 14 UC Berkeley colleges/schools (real names)
- 83 degree programs (BA/BS/MA/MS/PhD/MBA/JD/MD/MEng)
- 121 news articles (2023–2025, 7 categories)
- 64 events (upcoming + past, 7 categories)
- 25 research centers (BAIR, QB3, MSRI, …)
- 82 faculty (Jennifer Doudna, Stuart Russell, Saul Perlmutter, …)
- 4 benchmark users: alice/bob/carol/dave (password: test1234)

**Infrastructure changes:**
- control_server.py: add 'berkeley' to SITES (port 40015)
- websyn_start.sh: add 'berkeley' to startup array
- Dockerfile: EXPOSE 40015, generate instance_seed DB at build time
  (no HF assets needed — all data is code-generated via seed_data.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>