Commit ee155fe

wip
1 parent 4c1831b commit ee155fe

67 files changed

Lines changed: 2653 additions & 1752 deletions

.github/copilot-instructions.md

Lines changed: 17 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,30 +1,30 @@
11
# JDK Metadata DB Scraper - AI Coding Guide
22

33
## Project Overview
4-
A parallel Java application that scrapes JDK metadata from 35+ vendors (Temurin, Zulu, Liberica, Corretto, etc.) via vendor APIs and GitHub releases. Outputs structured JSON metadata files with checksums for each JDK distribution.
4+
A parallel Java application that scrapes JDK metadata from 35+ distros (Temurin, Zulu, Liberica, Corretto, etc.) via distro APIs and GitHub releases. Outputs structured JSON metadata files with checksums for each JDK distribution.
55

66
## Architecture
77

88
### Core Execution Flow
99
1. **Main** (`Main.java`) - CLI entry via Picocli, manages ExecutorService for parallel scraping
1010
2. **ScraperFactory** - Uses Java ServiceLoader to discover scrapers via `META-INF/services/dev.jbang.jdkdb.scraper.Scraper$Discovery`
1111
3. **ProgressReporter** - Dedicated thread receives events from all scrapers via `BlockingQueue<ProgressEvent>`
12-
4. **Scrapers** - Each vendor scraper implements `Callable<ScraperResult>` for concurrent execution
12+
4. **Scrapers** - Each distro scraper implements `Callable<ScraperResult>` for concurrent execution
1313

1414
### Base Class Hierarchy
1515
- **BaseScraper** - Common functionality: HTTP downloads, hash computation, metadata persistence, progress tracking
1616
- **GitHubReleaseScraper** extends BaseScraper - GitHub API integration with pagination and rate limiting
1717
- **AdoptiumMarketplaceScraper** extends BaseScraper - Adoptium Marketplace API patterns
18-
- Vendor scrapers (e.g., `Temurin`, `Microsoft`, `SemeruBaseScraper`) - Specific API implementations
18+
- Distro scrapers (e.g., `Temurin`, `Microsoft`, `SemeruBaseScraper`) - Specific API implementations
1919

2020
### Service Provider Interface (SPI)
2121
All scrapers register via nested `Discovery` class implementing `Scraper.Discovery`:
2222
```java
2323
public static class Discovery implements Scraper.Discovery {
24-
public String name() { return "vendor-id"; }
25-
public String vendor() { return "vendor-name"; }
24+
public String name() { return "distro-id"; }
25+
public String distro() { return "distro-name"; }
2626
public When when() { return When.ALWAYS; }
27-
public Scraper create(ScraperConfig config) { return new VendorScraper(config); }
27+
public Scraper create(ScraperConfig config) { return new DistroScraper(config); }
2828
}
2929
```
3030
Registration: Add fully qualified class name to `src/main/resources/META-INF/services/dev.jbang.jdkdb.scraper.Scraper$Discovery`
@@ -41,13 +41,13 @@ Registration: Add fully qualified class name to `src/main/resources/META-INF/ser
4141
### Running Scrapers
4242
```bash
4343
java -jar build/libs/jdkdb-scraper-*-standalone.jar --list # List all scrapers
44-
java -jar build/libs/jdkdb-scraper-*-standalone.jar --scrapers temurin # Run specific vendor
44+
java -jar build/libs/jdkdb-scraper-*-standalone.jar --scrapers temurin # Run specific distro
4545
java -jar build/libs/jdkdb-scraper-*-standalone.jar --from-start # Ignore existing metadata
4646
java -jar build/libs/jdkdb-scraper-*-standalone.jar --limit-progress 3 # Limit to 3 items for testing
4747
```
4848

4949
### GitHub Token for Rate Limiting
50-
Set `GITHUB_TOKEN` environment variable to avoid GitHub API rate limits when scraping GitHub-based vendors.
50+
Set `GITHUB_TOKEN` environment variable to avoid GitHub API rate limits when scraping GitHub-based distros.
5151

5252
## Project-Specific Conventions
5353

@@ -87,16 +87,16 @@ Use inherited normalization methods from BaseScraper:
8787
## Data Model
8888

8989
### JdkMetadata Fields (snake_case in JSON)
90-
Required: `vendor`, `filename`, `version`, `java_version`, `os`, `architecture`, `file_type`, `image_type`, `url`
90+
Required: `distro`, `filename`, `version`, `java_version`, `os`, `architecture`, `file_type`, `image_type`, `url`
9191
Checksums: `md5`, `sha1`, `sha256`, `sha512` + corresponding `*_file` fields for external checksum URLs
9292
Features: Array of strings (e.g., `["openj9"]`, `["lts"]`, `["musl"]`)
9393
This class follow the API defined in `./openapi.yaml` and can't be changed!
9494

9595
### Output Structure
9696
```
97-
docs/
98-
├── metadata/vendor/{vendor-name}/*.json # Individual release metadata
99-
└── checksums/vendor/{vendor-name}/* # Hash files
97+
/
98+
├── metadata/{distro-name}/*.json # Individual release metadata
99+
└── checksums/{distro-name}/* # Hash files
100100
```
101101

102102
## Testing
@@ -134,11 +134,11 @@ Override `getApiBase()`, `getAvailableReleasesPath()`, `getAssetsPathTemplate()`
134134
## Dependencies
135135
- Jackson 2.16.1 for JSON (use `readJson(string)` helper in BaseScraper)
136136
- Java 21+ HttpClient (`java.net.http`) - configured in `HttpUtils` with 30s timeout, auto-redirect
137-
- SLF4J/Logback - Logger per scraper: `LoggerFactory.getLogger("vendors." + name)`
137+
- SLF4J/Logback - Logger per scraper: `LoggerFactory.getLogger("distros." + name)`
138138
- Picocli 4.7.5 - CLI in Main.java only
139139

140140
## Key Files to Reference
141-
- [BaseScraper.java](src/main/java/dev/jbang/jdkdb/scraper/BaseScraper.java) - All helper methods and patterns
142-
- [Temurin.java](src/main/java/dev/jbang/jdkdb/scraper/vendors/Temurin.java) - Adoptium Marketplace example
143-
- [SemeruBaseScraper.java](src/main/java/dev/jbang/jdkdb/scraper/vendors/SemeruBaseScraper.java) - GitHub Release + repo discovery
144-
- [BaseScraperTest.java](src/test/java/dev/jbang/jdkdb/scraper/BaseScraperTest.java) - OS/arch normalization reference
141+
- [BaseScraper.java](../src/main/java/dev/jbang/jdkdb/scraper/BaseScraper.java) - All helper methods and patterns
142+
- [Temurin.java](../src/main/java/dev/jbang/jdkdb/scraper/distros/Temurin.java) - Adoptium Marketplace example
143+
- [SemeruBaseScraper.java](../src/main/java/dev/jbang/jdkdb/scraper/distros/SemeruBaseScraper.java) - GitHub Release + repo discovery
144+
- [BaseScraperTest.java](../src/test/java/dev/jbang/jdkdb/scraper/BaseScraperTest.java) - OS/arch normalization reference

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -23,4 +23,5 @@ out/
2323
/logs
2424
*.code-workspace
2525
.vscode/
26-
docs/
26+
metadata/
27+
checksums/

README.md

Lines changed: 55 additions & 55 deletions
@@ -1,16 +1,16 @@
11
# jdkdb-scraper - JDK Metadata DB Scraper
22

3-
A Java-based application for scraping JDK metadata from various vendors. This project replaces the original bash scripts with a robust, parallel Java implementation.
3+
A Java-based application for scraping JDK metadata from various distros. This project replaces the original bash scripts with a robust, parallel Java implementation.
44

55
This project is based on [Joschi's Java Metadata project](https://github.com/joschi/java-metadata) and incorporates ideas from the [Foojay's Disco API project](https://github.com/foojayio/discoapi).
66

77
## Features
88

9-
- **Parallel Execution**: Run multiple vendor scrapers concurrently for improved performance
10-
- **Selective Scraping**: Run all scrapers or select specific vendors
9+
- **Parallel Execution**: Run multiple distro scrapers concurrently for improved performance
10+
- **Selective Scraping**: Run all scrapers or select specific distros
1111
- **Central Reporting**: Thread-safe progress reporting with real-time status updates
12-
- **Extensible Architecture**: Easy to add new vendor scrapers
13-
- **Generic Base Classes**: Reduces code duplication for similar vendors (e.g., Semeru versions, Trava versions)
12+
- **Extensible Architecture**: Easy to add new distro scrapers
13+
- **Generic Base Classes**: Reduces code duplication for similar distros (e.g., Semeru versions, Trava versions)
1414
- **Comprehensive Logging**: SLF4J/Logback integration with both console and file output
1515
- **Multi-command CLI**: Separate commands for updating metadata, generating indexes, downloading checksums, and cleaning up old releases
1616
- **Archive Extraction**: Automatically extracts release information from JDK archives
@@ -50,17 +50,17 @@ jbang scraper@jbangdev/jdkdb-scraper update --include tar_gz,zip
5050
# Update: Exclude specific file types
5151
jbang scraper@jbangdev/jdkdb-scraper update --exclude msi,exe
5252

53-
# Index: Generate all.json files for all vendors
53+
# Index: Generate all.json files for all distros
5454
jbang scraper@jbangdev/jdkdb-scraper index
5555

56-
# Index: Regenerate all.json for specific vendors
57-
jbang scraper@jbangdev/jdkdb-scraper index --vendors temurin,zulu
56+
# Index: Regenerate all.json for specific distros
57+
jbang scraper@jbangdev/jdkdb-scraper index --distros temurin,zulu
5858

59-
# Download: Download and compute missing checksums for all vendors
59+
# Download: Download and compute missing checksums for all distros
6060
jbang scraper@jbangdev/jdkdb-scraper download
6161

62-
# Download: Process specific vendors
63-
jbang scraper@jbangdev/jdkdb-scraper download --vendors microsoft
62+
# Download: Process specific distros
63+
jbang scraper@jbangdev/jdkdb-scraper download --distros microsoft
6464

6565
# Download: Randomize download order
6666
jbang scraper@jbangdev/jdkdb-scraper download --randomize
@@ -88,8 +88,8 @@ jbang scraper@jbangdev/jdkdb-scraper clean --prune-checksums
8888

8989
The application provides four main commands:
9090

91-
- **`update`** - Scrape JDK metadata from various vendors and update metadata files
92-
- **`index`** - Generate aggregated all.json files for vendor directories
91+
- **`update`** - Scrape JDK metadata from various distros and update metadata files
92+
- **`index`** - Generate aggregated all.json files for distro directories
9393
- **`download`** - Download and compute checksums for metadata files with missing checksums
9494
- **`clean`** - Clean up metadata by removing incomplete files and pruning old EA releases
9595

@@ -114,7 +114,7 @@ The application checks for tokens in this order: environment variable first, the
114114

115115
### Typical usage
116116

117-
- You can simply run `update` in the root of the data repository (where the `docs/` folder is located) and let it do its work. It will scrape all the vendor sites, obtain the latest metadata, download the jdk distributions, calculate checksums and update all the indices. Nothing else to be done. But this can take some time.
117+
- You can simply run `update` in the root of the data repository (where the `metadata/` folder is located) and let it do its work. It will scrape all the distro sites, obtain the latest metadata, download the jdk distributions, calculate checksums and update all the indices. Nothing else to be done. But this can take some time.
118118
- You can split the work into two steps:
119119

120120
1. You run `update --no-download` which will do the scraping and will make sure that we have all the latest distributions cataloged. It will write all the metadata but with _missing_ checksums (and release info).
@@ -130,12 +130,12 @@ And finally the `clean` command can be used to get rid of any invalid or orphane
130130

131131
```bash
132132
Usage: jdkdb-scraper [-hV] [COMMAND]
133-
Scrapes JDK metadata from various vendors and generates index files
133+
Scrapes JDK metadata from various distros and generates index files
134134
-h, --help Show this help message and exit.
135135
-V, --version Print version information and exit.
136136
Commands:
137-
update Scrape JDK metadata from various vendors and update metadata files
138-
index Generate all.json files for vendor directories by aggregating
137+
update Scrape JDK metadata from various distros and update metadata files
138+
index Generate all.json files for distro directories by aggregating
139139
individual metadata files
140140
download Download and compute checksums for metadata files that have missing
141141
checksum values
@@ -156,11 +156,11 @@ Usage: jdkdb-scraper update [-hlV] [--from-start] [--no-download] [--no-index]
156156
[--skip-ea=<skipEa>] [-t=<maxThreads>]
157157
[-s=<scraperIds>[,<scraperIds>...]]...
158158

159-
Scrape JDK metadata from various vendors and update metadata files
159+
Scrape JDK metadata from various distros and update metadata files
160160

161161
Options:
162162
-c, --checksum-dir=<checksumDir>
163-
Directory to store checksum files (default: docs/checksums)
163+
Directory to store checksum files (default: db/checksums)
164164
--exclude=<excludeFileTypes>[,<excludeFileTypes>...]
165165
Exclude these file types (e.g., msi,exe). These types will
166166
not be downloaded.
@@ -178,7 +178,7 @@ Options:
178178
Maximum total number of downloads to accept before
179179
stopping (default: unlimited)
180180
-m, --metadata-dir=<metadataDir>
181-
Directory to store metadata files (default: docs/metadata)
181+
Directory to store metadata files (default: db/metadata)
182182
--max-failures=<maxFailures>
183183
Maximum number of allowed failures per scraper before
184184
aborting that scraper (default: 10)
@@ -202,9 +202,9 @@ Options:
202202
203203
```bash
204204
Usage: jdkdb-scraper index [-hV] [--allow-incomplete] [-m=<metadataDir>]
205-
[-v=<vendorNames>[,<vendorNames>...]]...
205+
[-v=<distroNames>[,<distroNames>...]]...
206206

207-
Generate all.json files for vendor directories by aggregating individual
207+
Generate all.json files for distro directories by aggregating individual
208208
metadata files
209209

210210
Options:
@@ -214,10 +214,10 @@ Options:
214214
-h, --help Show this help message and exit.
215215
-m, --metadata-dir=<metadataDir>
216216
Directory containing metadata files (default:
217-
docs/metadata)
218-
-v, --vendors=<vendorNames>[,<vendorNames>...]
219-
Comma-separated list of vendor names to regenerate
220-
all.json for (if not specified, all vendors are
217+
db/metadata)
218+
-v, --distros=<distroNames>[,<distroNames>...]
219+
Comma-separated list of distro names to regenerate
220+
all.json for (if not specified, all distros are
221221
processed)
222222
-V, --version Print version information and exit.
223223
```
@@ -232,14 +232,14 @@ Usage: jdkdb-scraper download [-hV] [--randomize] [--stats-only]
232232
[--limit-progress=<limitProgress>]
233233
[--limit-total=<limitTotal>]
234234
[-m=<metadataDir>] [-t=<maxThreads>]
235-
[-v=<vendorNames>[,<vendorNames>...]]...
235+
[-v=<distroNames>[,<distroNames>...]]...
236236

237237
Download and compute checksums for metadata files that have missing checksum
238238
values
239239

240240
Options:
241241
-c, --checksum-dir=<checksumDir>
242-
Directory to store checksum files (default: docs/checksums)
242+
Directory to store checksum files (default: db/checksums)
243243
--exclude=<excludeFileTypes>[,<excludeFileTypes>...]
244244
Exclude these file types (e.g., msi,exe). These types will
245245
not be downloaded.
@@ -255,17 +255,17 @@ Options:
255255
stopping (default: unlimited)
256256
-m, --metadata-dir=<metadataDir>
257257
Directory containing metadata files (default:
258-
docs/metadata)
258+
db/metadata)
259259
--randomize Randomize the order of downloads instead of processing
260260
files in order
261261
--stats-only Skip downloading files and only show statistics (for
262262
testing/dry-run)
263263
-t, --threads=<maxThreads>
264264
Maximum number of parallel download threads (default:
265265
number of processors)
266-
-v, --vendors=<vendorNames>[,<vendorNames>...]
267-
Comma-separated list of vendor names to process (if not
268-
specified, all vendors are processed)
266+
-v, --distros=<distroNames>[,<distroNames>...]
267+
Comma-separated list of distro names to process (if not
268+
specified, all distros are processed)
269269
-V, --version Print version information and exit.
270270
```
271271
@@ -282,12 +282,12 @@ Clean up metadata by removing incomplete files and pruning old EA releases
282282
Options:
283283
-c, --checksum-dir=<checksumDir>
284284
Directory containing checksum files (default:
285-
docs/checksums)
285+
db/checksums)
286286
--dry-run Show statistics without actually deleting files
287287
-h, --help Show this help message and exit.
288288
-m, --metadata-dir=<metadataDir>
289289
Directory containing metadata files (default:
290-
docs/metadata)
290+
db/metadata)
291291
--prune-checksums
292292
Remove orphaned checksum files that don't have a matching
293293
metadata file
@@ -339,7 +339,7 @@ java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar update
339339
### Core Components
340340
341341
- **Main**: Entry point with Picocli command dispatcher
342-
- **UpdateCommand**: Scrapes JDK metadata from vendors and updates files
342+
- **UpdateCommand**: Scrapes JDK metadata from distros and updates files
343343
- **IndexCommand**: Aggregates individual metadata files into all.json files
344344
- **DownloadCommand**: Downloads JDK files to compute missing checksums
345345
- **CleanCommand**: Cleans up incomplete metadata and prunes old EA releases
@@ -353,13 +353,13 @@ java -jar build/libs/jdkdb-scraper-1.0.0-SNAPSHOT-standalone.jar update
353353
- **Scraper.Discovery**: Service provider interface for scraper registration via Java ServiceLoader
354354
- **DownloadManager**: Interface for downloading JDK files (with default and no-op implementations)
355355
356-
### Vendor Scrapers
356+
### Distro Scrapers
357357
358-
The project includes **35 vendor scrapers**, supporting all major JDK distributions:
358+
The project includes **35 distro scrapers**, supporting all major JDK distributions:
359359
360-
#### Scraper IDs and Vendors
360+
#### Scraper IDs and Distros
361361
362-
| Scraper ID | Vendor | Notes |
362+
| Scraper ID | Distro | Notes |
363363
|------------|--------|-------|
364364
| `adoptopenjdk` | AdoptOpenJDK | Legacy |
365365
| `bisheng` | Bisheng | Huawei |
@@ -413,8 +413,8 @@ Example:
413413
```java
414414
package dev.jbang.jdkdb.scraper.impl;
415415
416-
public class NewVendor extends BaseScraper {
417-
public NewVendor(ScraperConfig config) {
416+
public class NewDistro extends BaseScraper {
417+
public NewDistro(ScraperConfig config) {
418418
super(config);
419419
}
420420
@@ -458,17 +458,17 @@ public class NewVendor extends BaseScraper {
458458
public static class Discovery implements Scraper.Discovery {
459459
@Override
460460
public String name() {
461-
return "new-vendor";
461+
return "new-distro";
462462
}
463463

464464
@Override
465-
public String vendor() {
466-
return "New Vendor";
465+
public String distro() {
466+
return "New Distro";
467467
}
468468

469469
@Override
470470
public Scraper create(ScraperConfig config) {
471-
return new NewVendor(config);
471+
return new NewDistro(config);
472472
}
473473
}
474474
}
@@ -508,7 +508,7 @@ src/
508508
│ │ │ ├── PaginatedIterator.java # GitHub pagination helper
509509
│ │ │ ├── InterruptedProgressException.java # Exception types
510510
│ │ │ ├── TooManyFailuresException.java
511-
│ │ │ └── impl/ # Vendor scraper implementations
511+
│ │ │ └── impl/ # Distro scraper implementations
512512
│ │ │ ├── Temurin.java
513513
│ │ │ ├── Zulu.java
514514
│ │ │ ├── ZuluPrime.java
@@ -553,7 +553,7 @@ src/
553553
│ │ ├── HtmlUtils.java # HTML parsing utilities
554554
│ │ ├── HttpUtils.java # HTTP operations
555555
│ │ ├── MetadataUtils.java # Metadata validation/utilities
556-
│ │ ├── VendorLoggerDiscriminator.java # Logging configuration
556+
│ │ ├── DistroLoggerDiscriminator.java # Logging configuration
557557
│ │ └── VersionComparator.java # Version comparison
558558
│ └── resources/
559559
│ ├── logback.xml # Logging configuration
@@ -585,25 +585,25 @@ src/
585585
586586
## Output
587587
588-
The scrapers generate structured output in the `docs/` directory:
588+
The scrapers generate structured output in the `metadata/` directory:
589589
590-
### Metadata Files (`docs/metadata/`)
590+
### Metadata Files (`db/metadata`)
591591
592592
1. **Top-level aggregated indexes**:
593-
- `all.json` - All JDK releases across all vendors
593+
- `all.json` - All JDK releases across all distros
594594
- `ga.json` - General Availability (stable) releases only
595595
- `ea.json` - Early Access releases only
596-
- `latest.json` - Latest releases per vendor
596+
- `latest.json` - Latest releases per distro
597597
2. **Organized by release type** (`all/`, `ea/`, `ga/`):
598598
- OS-specific files: `linux.json`, `macosx.json`, `windows.json`, `aix.json`, `solaris.json`
599599
- Architecture-specific subdirectories with further breakdowns
600-
3. **Vendor-specific metadata** (`vendor/<vendor-name>/`):
600+
3. **Distro-specific metadata** (`<distro-name>/`):
601601
- Individual `.json` files for each JDK release
602-
- `all.json` file combining all releases for that vendor
602+
- `all.json` file combining all releases for that distro
603603
604-
### Checksum Files (`docs/checksums/`)
604+
### Checksum Files (`db/checksums/`)
605605
606-
- Stored in vendor-specific directories: `docs/checksums/<vendor-name>/`
606+
- Stored in distro-specific directories: `db/checksums/<distro-name>/`
607607
- Contains MD5, SHA1, SHA256, and SHA512 checksum files
608608
- Organized to match the corresponding metadata files
609609
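
Reading aid for the layout change in this commit: the "Output Structure" hunk in `.github/copilot-instructions.md` moves files from `docs/metadata/vendor/{vendor-name}/` to `metadata/{distro-name}/` (and likewise for `checksums`). The sketch below shows that path mapping in isolation. `LayoutMigration` and `toNewLayout` are hypothetical names and are not part of jdkdb-scraper. The sketch also assumes the old path always contains the `vendor/` segment, which the README hunks do not show for checksums, and it ignores the `db/` default directories that the README hunks mention.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of jdkdb-scraper: it only illustrates the
// directory layout change described in the copilot-instructions.md hunk.
class LayoutMigration {

    /**
     * Maps docs/<kind>/vendor/<name>/<file...> to <kind>/<name>/<file...>,
     * where <kind> is "metadata" or "checksums".
     */
    static Path toNewLayout(Path oldPath) {
        Path p = oldPath.normalize();
        int n = p.getNameCount();
        // Expect at least: docs / <kind> / vendor / <name> / <file>
        boolean matches = n >= 5
                && p.getName(0).toString().equals("docs")
                && (p.getName(1).toString().equals("metadata")
                        || p.getName(1).toString().equals("checksums"))
                && p.getName(2).toString().equals("vendor");
        if (!matches) {
            throw new IllegalArgumentException("Not an old-layout path: " + oldPath);
        }
        // Keep <kind>, drop "docs" and "vendor", keep everything after them.
        return p.subpath(1, 2).resolve(p.subpath(3, n));
    }

    public static void main(String[] args) {
        // Prints the new-layout location of one example metadata file.
        System.out.println(toNewLayout(Path.of("docs/metadata/vendor/temurin/example.json")));
    }
}
```

Paths that do not match the old layout are rejected rather than passed through, so a script built on this sketch would not silently skip files that already live in the new location.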
