Quick Start · Features · How It Works · Pipeline Tech Stack · npm · Roadmap
No more vendor lock-in. Your Google Sites content belongs to you. Paste a URL, get a complete static clone with all images, styles, and navigation — ready for self-hosting.
Google Sites stores your content behind an SPA that search engines can't index and you can't export. google-sites-clone uses a two-pass pipeline (SingleFile + Puppeteer) to capture everything — CSS fidelity from SingleFile and clean semantic content from Puppeteer — then merges both into standalone HTML files with localized images and SEO metadata.
| Feature | Description |
|---|---|
| 🔍 Auto-crawl | Discovers all pages from sidebar navigation automatically |
| 🎨 Two-pass pipeline | SingleFile for CSS/images + Puppeteer for clean content |
| 🖼️ Image localization | Downloads all images as local files (no CDN dependency) |
| 📺 YouTube thumbnails | Converts embedded iframes to clickable thumbnails |
| 🎬 Video grid | Injects CSS Grid of video thumbnails into SingleFile pages |
| 🗺️ SEO ready | Generates sitemap.xml + robots.txt |
| ⚡ Batch processing | 5 pages per batch with anti-rate-limit pauses |
| 🔄 SPA fallback | Internal navigation for pages that fail direct URL loading |
| 🚀 GitHub Pages deploy | One command to push to gh-pages branch |
| 📦 ZIP export | Create downloadable archive of cloned site |
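The batch processing described above runs pages in groups of five with an anti-rate-limit pause between groups. A minimal sketch of that pattern, assuming illustrative names (not the actual `lib/puppeteer.js` API):

```javascript
const BATCH_SIZE = 5;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process pages in batches of 5: pages within a batch run concurrently,
// batches run sequentially with a cooldown pause in between.
async function processInBatches(pages, handler, cooldownMs = 60000) {
  const results = [];
  for (let i = 0; i < pages.length; i += BATCH_SIZE) {
    const batch = pages.slice(i, i + BATCH_SIZE);
    results.push(...(await Promise.all(batch.map(handler))));
    const moreToGo = i + BATCH_SIZE < pages.length;
    if (moreToGo) await sleep(cooldownMs); // anti-rate-limit pause
  }
  return results;
}
```

The `--cooldown` CLI flag maps onto the pause between batches; with `--cooldown 0` the batches run back-to-back.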
Clone any Google Site at gsclone.osovsky.com — sign in with Google, paste URL, get ZIP by email.
Requires: Node.js 18+, Chrome/Chromium. SingleFile CLI is auto-installed on first run (~30 MB).
```shell
npx google-sites-clone https://sites.google.com/view/your-site
```

📦 View on npm · Installs: Puppeteer (~400 MB) + SingleFile CLI (~30 MB)
📋 Manual setup
```shell
git clone https://github.com/maximosovsky/google-sites-clone
cd google-sites-clone
npm install
node bin/gsclone.js https://sites.google.com/view/your-site
```

⚙️ CLI Options
```shell
gsclone <url> [options]

Options:
  -o, --output <dir>   Output directory (default: ./clone)
  --max-pages <n>      Limit number of pages to clone
  --cooldown <ms>      Pause between batches of 5 pages in ms (default: 60000)
  --no-images          Skip image localization
  --no-youtube         Skip YouTube thumbnail download
  --serve              Start local server after build
  --custom-nav         Use custom sidebar navigation
  --inline             Keep images inline (base64)
  --zip                Create ZIP archive of site after build
```

🚀 Deploy to GitHub Pages

```shell
gsclone deploy ./clone/site --repo username/my-clone
```

Pushes site/ to the gh-pages branch. Enable Pages in repo Settings → Pages → Branch: gh-pages.
| Tier | Auth | Clones | Max ZIP |
|---|---|---|---|
| Free | Google | 1 total | 250 MB |
| Starred | Google + GitHub + ⭐ repo | 5/day, 20/month | 250 MB |
| Unlimited | By request | ∞ | ∞ |
```text
URL → [1. Crawl]      → page-map.json (~2 KB)
    → [2. SingleFile] → _pages/ CSS + base64 (~7 MB/page)
    → [3. Puppeteer]  → _content/ clean content (~9 KB/page)
    → [4. Images]     → site/images/ decoded files
    → [4b. Video]     → site/thumbnails/ (~50 KB each)
    → [5. Build]      → site/ iframe nav + video grid + report
    → [6. ZIP]        → clone-result.zip (40–250 MB)
    → [7. R2 Upload]  → Cloudflare R2 (direct via aws s3 cp)
    → [8. Email]      → "Clone ready!" + download links
```
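Step 4b replaces embedded players with static thumbnails. For YouTube embeds the URL mapping can be sketched as follows — the helper name is hypothetical (the real `lib/video.js` may differ), but the `img.youtube.com` thumbnail URL pattern is YouTube's standard one:

```javascript
// Map a YouTube embed URL (as found in an <iframe src>) to a clickable
// thumbnail: the watch page URL plus a static thumbnail image URL.
function youtubeThumbnail(embedSrc) {
  // Matches e.g. https://www.youtube.com/embed/dQw4w9WgXcQ?rel=0
  // (video IDs are 11 chars of [A-Za-z0-9_-]).
  const m = embedSrc.match(/youtube(?:-nocookie)?\.com\/embed\/([\w-]{11})/);
  if (!m) return null;
  const id = m[1];
  return {
    watchUrl: `https://www.youtube.com/watch?v=${id}`,
    // hqdefault.jpg exists for every video; maxresdefault.jpg may 404.
    thumbUrl: `https://img.youtube.com/vi/${id}/hqdefault.jpg`,
  };
}
```

The build step can then download `thumbUrl` into `site/thumbnails/` and wrap the image in a link to `watchUrl`, removing the iframe and its CDN dependency.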
| Pass | Tool | Output | Size |
|---|---|---|---|
| 1 | Puppeteer | page-map.json | ~2 KB |
| 2 | SingleFile CLI | _pages/*.html | ~7 MB/page |
| 3 | Puppeteer ×5 | _content/*.html | ~9 KB/page |
| 4 | Base64 decoder | site/images/ | varies |
| 4b | Video scanner | site/thumbnails/ | ~50 KB each |
| 5 | Build script | site/ (nav + grid + report) | — |
| 6 | ZIP | clone-result.zip | up to 250 MB |
| 7 | AWS CLI | R2: zips/ (7d auto-delete), reports/ (360d auto-delete) | — |
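The 7-day and 360-day auto-deletes in pass 7 correspond to bucket lifecycle rules. A sketch in S3 lifecycle-configuration JSON (Cloudflare R2 exposes S3-compatible lifecycle rules; the rule IDs and prefixes below are assumptions based on the table above):

```json
{
  "Rules": [
    {
      "ID": "zips-expire-7d",
      "Filter": { "Prefix": "zips/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    },
    {
      "ID": "reports-expire-360d",
      "Filter": { "Prefix": "reports/" },
      "Status": "Enabled",
      "Expiration": { "Days": 360 }
    }
  ]
}
```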
```mermaid
graph TD
URL["🌐 Google Sites URL"] --> S1
subgraph "Step 1: Crawl"
S1["Puppeteer opens main page"] --> S1b["Parse sidebar links"]
S1b --> PM["📋 page-map.json"]
end
subgraph "Step 2: SingleFile"
PM --> S2["SingleFile CLI per page"]
S2 --> SF["📦 _pages/ — CSS + base64"]
end
subgraph "Step 3: Puppeteer"
PM --> S3["Puppeteer batch ×5"]
S3 --> PP["📄 _content/ — clean HTML"]
end
subgraph "Step 4: Assets"
SF --> S4["Decode base64 images"]
S4 --> IMG["🖼️ site/images/"]
PP --> S4v["Scan for video iframes"]
S4v --> THUMB["🎬 site/thumbnails/"]
end
subgraph "Step 5: Build"
PM --> NAV["Sidebar navigation"]
SF --> COPY["Copy SF pages"]
IMG --> REWRITE["Rewrite image URLs"]
THUMB --> REPLACE["Video thumbnail grid"]
NAV --> INDEX["index.html"]
COPY --> SITE["📁 site/"]
REWRITE --> SITE
REPLACE --> SITE
INDEX --> SITE
end
SITE --> ZIP["📦 ZIP archive"]
ZIP --> R2["☁️ Cloudflare R2"]
R2 --> EMAIL["📧 Clone ready!"]
```
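The build step also emits the sitemap.xml mentioned in the feature table. A minimal generator sketch from the crawled page list (hypothetical helper, not the actual `lib/build.js` code):

```javascript
// Build a sitemap.xml string from a base URL and a list of page paths.
function buildSitemap(baseUrl, paths) {
  const base = baseUrl.replace(/\/$/, ''); // drop trailing slash
  const urls = paths
    .map((p) => `  <url><loc>${base}/${p.replace(/^\//, '')}</loc></url>`)
    .join('\n');
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n</urlset>`;
}
```

Paired with a two-line robots.txt pointing at the sitemap, this is enough for search engines to index the static clone — which the original SPA never allowed.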
| Layer | Technology |
|---|---|
| Runtime | Node.js 18+ |
| Content extraction | Puppeteer |
| CSS preservation | SingleFile CLI |
| CLI interface | Commander.js |
| Auth | Google OAuth 2.0, GitHub OAuth |
| Storage | Cloudflare R2 (S3-compatible) |
| Email | Resend |
| Hosting | Vercel (landing + API) |
```text
google-sites-clone/
├── bin/
│   └── gsclone.js            # CLI entry point
├── lib/
│   ├── index.js              # Pipeline orchestrator
│   ├── crawl.js              # Auto-crawl navigation
│   ├── singlefile.js         # SingleFile pass
│   ├── puppeteer.js          # Puppeteer batch extraction
│   ├── images.js             # Base64 → local images
│   ├── video.js              # YouTube/Vimeo thumbnail download
│   ├── build.js              # iframe nav + page assembly
│   ├── report.js             # Clone report dashboard
│   ├── deploy.js             # GitHub Pages deploy
│   ├── zip.js                # ZIP archive creation
│   └── error-reporter.js     # R2 error logs + Resend email
├── rebuild.js                # Quick rebuild from cache
├── site/                     # Landing page (Vercel)
│   ├── index.html            # Landing page + auth UI
│   ├── style.css             # Design system
│   ├── vercel.json           # API rewrites
│   └── api/
│       ├── _session.js       # HMAC session helper
│       ├── _r2.js            # Cloudflare R2 helper
│       ├── _redis.js         # Upstash Redis helper
│       ├── _ratelimit.js     # Usage tier rate limiting
│       ├── _email.js         # Resend email helper
│       ├── auth-google.js    # Google OAuth redirect
│       ├── auth-google-callback.js
│       ├── auth-github.js    # GitHub OAuth redirect
│       ├── auth-github-callback.js
│       ├── auth-me.js        # Get current user
│       ├── auth-logout.js    # Clear session
│       ├── clone.js          # Trigger clone pipeline
│       ├── upload.js         # Webhook: R2 upload + email
│       ├── download.js       # Presigned R2 download
│       └── preview.js        # Clone preview
├── ARCHITECTURE.md
├── MANUAL.md
├── ROADMAP.md
└── package.json
```
See ROADMAP.md for full details.
- Core pipeline (SingleFile + Puppeteer)
- CLI interface
- Auto-crawl navigation
- Image localization
- iframe-based navigation (sidebar + content)
- Clone report dashboard
- YouTube/Vimeo thumbnail download
- Video grid (YT/Vimeo/GDrive)
- GitHub Pages deploy
- ZIP export
- npm publish
- Rate limits + usage tiers (Free / Starred / Unlimited)
- Email delivery (Resend)
- Cloudflare R2 storage + lifecycle auto-cleanup
- Real Google OAuth
- Real GitHub OAuth
| Tool | Approach | Google Sites (new) |
|---|---|---|
| HTTrack | Recursive wget-style crawl | ❌ Can't execute JavaScript — downloads empty SPA shell |
| google-sites-backup | Google Sites API (GData) | ❌ Classic Sites only, API deprecated |
| generate-static-site | Headless SSR pre-render | |
| google-sites-clone | Puppeteer + SingleFile | ✅ Full SPA rendering, auto-crawl, CSS fidelity, image localization |
New Google Sites (2020+) is a single-page application — all content is rendered by JavaScript. Traditional crawlers see an empty page. That's why this project uses a headless browser.
Fork → feature/name → PR
Maxim Osovsky. Licensed under MIT.