A comprehensive index of public .yxdb files found on GitHub — both the original E1 format and the newer AMP-engine E2 format.
YXDB (.yxdb) is the native binary data format used by Alteryx. There are two variants:
- E1 — The original format, produced by the classic Alteryx engine.
- E2 — A newer variant produced by Alteryx's AMP (Alteryx Multi-threaded Processing) engine.
Both share the .yxdb extension but are distinct formats.
| | Repositories | Files |
|---|---|---|
| E1 | 170 | 2,407 |
| E2 | 57 | 207 |
| Total unique repos | 203 | 2,614 |
1,388 GitHub repositories scanned. 24 repos contain both E1 and E2 files.
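The unique-repo total is not a plain sum of the two rows, because the 24 repos that hold both formats would otherwise be counted twice. A quick check of the arithmetic, using only the figures stated above:

```python
# Figures copied from the summary table and the line below it.
e1_repos, e1_files = 170, 2_407
e2_repos, e2_files = 57, 207
repos_with_both = 24

# Inclusion-exclusion: repos holding both formats appear in both rows,
# so subtract them once to get the number of distinct repos.
unique_repos = e1_repos + e2_repos - repos_with_both
total_files = e1_files + e2_files

print(unique_repos)  # 203
print(total_files)   # 2614
```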
Source listings:
- E1-Sources.md — Index of repositories containing E1 files with per-repo counts.
- E2-Sources.md — Full index of E2 files with per-file paths, sizes, and SHA-256 hashes. Repos archived on Software Heritage.
- index.json — Machine-readable index of all repositories and files.
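Because E2-Sources.md records a SHA-256 hash for each E2 file, a downloaded copy can be checked against its listing. The sketch below is one way to do that with the standard library; the function names `sha256_of` and `matches_listing` are hypothetical and are not taken from this project's scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, expected_hex: str) -> bool:
    """Compare a local file against the hash recorded in the listing."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat even for large `.yxdb` files.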
- Discover: Search GitHub for repos containing `.yxdb` or `.yxzp` files, using repository search queries and code-search queries (for adjacent Alteryx file extensions like `.yxmd`, `.yxmc`, `.yxzp`) via the `gh` CLI. Alteryx-owned repos are filtered out.
- Check: For each candidate repo, resolve the current HEAD commit SHA. If the SHA matches the last check, skip the repo entirely (no API calls for tree enumeration or downloads). This makes weekly re-runs cheap.
- Download & classify: For repos with new commits, enumerate the git tree for `.yxdb`/`.yxzp` files, download each one, and classify it as E1 or E2. E2 files are saved locally and recorded in the index.
- Archive: Repos containing E2 files are submitted to the Software Heritage Foundation for permanent, independent archival. E1 files are common enough that archival is not necessary.
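The Check and Download & classify steps can be sketched as two small pure functions. This is a minimal illustration, not the logic in `scan.py`: the ledger layout (a dict mapping `OWNER/NAME` to a record with a `head_sha` key) and both function names are assumptions, while the tree entries follow the shape returned by GitHub's Git Trees API (`path` and `type` fields).

```python
from pathlib import PurePosixPath

TARGET_SUFFIXES = {".yxdb", ".yxzp"}


def needs_scan(ledger: dict, repo: str, head_sha: str) -> bool:
    """True when the repo is new or its HEAD moved since the last check."""
    record = ledger.get(repo)
    return record is None or record.get("head_sha") != head_sha


def candidate_paths(tree_entries: list[dict]) -> list[str]:
    """Keep blob paths whose extension is .yxdb or .yxzp (case-insensitive)."""
    return [
        entry["path"]
        for entry in tree_entries
        if entry.get("type") == "blob"
        and PurePosixPath(entry["path"]).suffix.lower() in TARGET_SUFFIXES
    ]


# Hypothetical ledger and tree, for illustration only.
ledger = {"OWNER/NAME": {"head_sha": "abc123"}}
tree = [
    {"path": "data/sales.yxdb", "type": "blob"},
    {"path": "workflows/report.yxmd", "type": "blob"},
    {"path": "packages", "type": "tree"},
    {"path": "packages/Demo.YXZP", "type": "blob"},
]

print(needs_scan(ledger, "OWNER/NAME", "abc123"))  # False: same SHA, skip
print(needs_scan(ledger, "OWNER/NAME", "def456"))  # True: new commit
print(candidate_paths(tree))  # ['data/sales.yxdb', 'packages/Demo.YXZP']
```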
- Python 3.13+
- `gh` CLI installed and authenticated (`gh auth login`)
```bash
# Full run: discover new repos + check for updates + download
uv run scripts/scan.py

# Discovery only (no downloads; just find new candidate repos)
uv run scripts/scan.py --discover-only

# Re-check known repos only (skip discovery search)
uv run scripts/scan.py --check-only
```

```bash
uv run scripts/archive.py            # submit all unarchived repos
uv run scripts/archive.py --status   # check submission status
uv run scripts/archive.py --dry-run  # show what would be submitted
uv run scripts/archive.py --force    # re-submit all repos
```

Set `SWH_API_TOKEN` in `.env` for higher rate limits (1,200/hr vs 120/hr anonymous).
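To put those limits in perspective, the average spacing between requests that stays under each quota can be worked out directly. The sketch below also shows one way to choose a quota from the environment. How `archive.py` actually loads `.env` is not documented here, so reading `os.environ` is an assumption, and the helper names are hypothetical.

```python
import os

ANONYMOUS_PER_HOUR = 120
AUTHENTICATED_PER_HOUR = 1_200


def hourly_quota(env=os.environ) -> int:
    """Pick the quota based on whether SWH_API_TOKEN is set and non-empty."""
    return AUTHENTICATED_PER_HOUR if env.get("SWH_API_TOKEN") else ANONYMOUS_PER_HOUR


def min_interval_seconds(requests_per_hour: int) -> float:
    """Average spacing between requests that stays within the quota."""
    return 3600 / requests_per_hour


print(min_interval_seconds(ANONYMOUS_PER_HOUR))      # 30.0
print(min_interval_seconds(AUTHENTICATED_PER_HOUR))  # 3.0
```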
```bash
uv run scripts/download.py e2                    # download all 207 E2 files
uv run scripts/download.py e1                    # download all 2,407 E1 files
uv run scripts/download.py all                   # download everything
uv run scripts/download.py e2 --repo OWNER/NAME  # single repo
uv run scripts/download.py e2 --dry-run          # preview
```

E2 files download directly from paths in `index.json`. E1 files require tree enumeration per repo (only counts are indexed).
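When a repository name, a ref, and a file path are all known, a direct download URL can be built without enumerating the tree. The sketch below assumes the standard `raw.githubusercontent.com` URL layout. The function name is hypothetical, and `download.py` may well fetch files differently (for example, through the `gh` CLI).

```python
from urllib.parse import quote


def raw_file_url(repo: str, ref: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for one file at a given ref.

    `repo` is "OWNER/NAME"; `ref` is a commit SHA or branch name.
    Each part is percent-encoded, but "/" separators are kept.
    """
    return (
        "https://raw.githubusercontent.com/"
        f"{quote(repo, safe='/')}/{quote(ref, safe='/')}/{quote(path, safe='/')}"
    )


print(raw_file_url("OWNER/NAME", "abc123", "data/my file.yxdb"))
# https://raw.githubusercontent.com/OWNER/NAME/abc123/data/my%20file.yxdb
```

Pinning `ref` to a commit SHA rather than a branch name keeps the URL pointing at the same bytes even if the repo changes later.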
```bash
uv run scripts/gen_index.py  # regenerate index.json from known_repos.json
```

```bash
uv run scripts/report.py
```

The scanner is designed to be respectful to GitHub:
- Uses the `gh` CLI (which handles auth tokens automatically)
- 2-second pause between search API calls
- 0.15–0.3 second pause between file downloads
- Automatic 65-second back-off on rate limit errors
- Repos at the same commit SHA are skipped entirely
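The pacing rules above could be implemented along these lines. This is a sketch rather than the code in `github.py`: the `RateLimited` exception, the `polite_call` helper, and the retry count are hypothetical, and only the delay values come from the list above.

```python
import random
import time

SEARCH_PAUSE = 2.0            # seconds between search API calls
DOWNLOAD_PAUSE = (0.15, 0.3)  # random pause range between file downloads
RATE_LIMIT_BACKOFF = 65.0     # seconds to wait after a rate-limit error


class RateLimited(Exception):
    """Hypothetical error raised by a request function on a rate-limit response."""


def download_pause(rng=random) -> float:
    """Pick a jittered pause within the download range."""
    return rng.uniform(*DOWNLOAD_PAUSE)


def polite_call(func, pause: float, retries: int = 3, sleep=time.sleep):
    """Run func(), backing off on rate limits, then pause before returning."""
    for attempt in range(retries + 1):
        try:
            result = func()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            sleep(RATE_LIMIT_BACKOFF)
        else:
            sleep(pause)
            return result
```

Passing `sleep` as a parameter keeps the helper testable without real delays. A search call would use `polite_call(do_search, SEARCH_PAUSE)` and a download would use `polite_call(fetch, download_pause())`, where `do_search` and `fetch` stand in for whatever request functions the caller supplies.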
Data files:

- `data/known_repos.json`: The ledger of every repo checked, with commit SHAs and E2 file hashes
- `data/swh_submissions.json`: Software Heritage submission log
- `index.json`: Machine-readable index of all YXDB sources
- `E2-Sources.md` / `E1-Sources.md`: Human-readable source listings
- `downloads/`: Downloaded E2 files (git-ignored)
- `data/sources.json`: Transient download resume state (git-ignored)
```
YXDB-Sources/
├── README.md                 # This file: project overview
├── E1-Sources.md             # E1 file source listing
├── E2-Sources.md             # E2 file source listing
├── index.json                # Machine-readable index (both E1 + E2)
├── LICENSE                   # MIT (code) + CC BY 4.0 (docs)
├── scripts/
│   ├── scan.py               # Weekly scan orchestrator
│   ├── download.py           # Bulk file downloader
│   ├── archive.py            # Software Heritage archival submissions
│   ├── gen_index.py          # Generate index.json from known_repos.json
│   ├── detect.py             # Format detection + metadata extraction
│   ├── github.py             # GitHub API + download helpers
│   ├── state.py              # Persistent state management
│   ├── report.py             # Status report generator
│   └── seed_known_repos.py   # One-time seed from existing provenance
├── data/
│   ├── known_repos.json      # Committed: the repo/commit/file ledger
│   └── swh_submissions.json  # Committed: SWH submission log
└── downloads/                # Git-ignored: downloaded YXDB files
```
- Code (`.py` files): MIT
- Documentation/Data (`README.md`, `E1-Sources.md`, `E2-Sources.md`, `index.json`): CC BY 4.0
If you reference the source listings or index data, please credit this project. The referenced repositories are owned by their respective authors and subject to their own licenses.