Conversion tools to create ColDP archives from various online sources not readily available otherwise. The conversion is fully automated so it can run in a scheduler.

    # Build fat JAR
    mvn package

    # Run a specific generator
    java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s <source>

| Option | Default | Description |
|---|---|---|
| `-s, --source` | (required) | Source name (see below) |
| `-r, --repository` | `/tmp/coldp-generator` | Output directory for generated archives |
| `--tmp` | `/tmp/coldp-generator-sources` | Directory for downloaded source files |
| `--api-key` | | API key for authenticated sources, e.g. WSC |
| `--lpsn-user` / `--lpsn-pass` | | Credentials for LPSN |
| `--date` | | Date filter for incremental updates for WSC |
| `--no-download` | `false` | Skip downloading source files; reuse existing local copies |


| Source | Name | ChecklistBank | Notes |
|---|---|---|---|
| `antcat` | AntCat | 54937 | Online Catalog of the Ants of the World |
| `bats` | Bats of the World | 314574 | Chiroptera taxonomy by Simmons & Cirranello (AMNH) |
| `birdlife` | Birdlife HBW | 170809 | Handbook of the Birds of the World |
| `biolib` | BioLib | 54592 | |
| `clements` | Clements | 2013 | Clements Checklist of Birds of the World |
| `cycads` | Cycads | 1163 | The World List of Cycads |
| `grin` | GRIN | 2018 | GRIN-Global Taxonomy (cultivated plants) |
| `ioc` | IOC World Bird List | 2036 | IOC World Bird List (latest version auto-detected) |
| `ictv` | ICTV | 1014 | ICTV Master Species List |
| `ipni` | IPNI | 2006 | International Plant Names Index |
| `itis` | ITIS | 2144 | Integrated Taxonomic Information System (all 7 kingdoms) |
| `lpsn` | LPSN | 2015 | List of Prokaryotic names with Standing in Nomenclature |
| `mdd` | MDD | 9802 | Mammal Diversity Database |
| `mites` | Mites | | |
| `otl` | OTL | 201891 | Open Tree of Life Synthesis Tree |
| `ott` | OTT | 201890 | Open Tree of Life Reference Taxonomy |
| `pbdb` | PBDB | 1174 | The Paleobiology Database |
| `pfnr` | PFNR | 314595 | International Fossil Plant Names Registry |
| `wikidata` | Wikidata | 314569 | Wikidata taxonomy (downloads full Wikidata + Commons dumps, ~260 GB total) |
| `wikispecies` | WikiSpecies | 314570 | |
| `wsc` | WSC | 56185 | World Spider Catalog |


The `grin` source requires `cabextract` to be installed (`brew install cabextract`). It downloads a single `taxonomy_data.cab` (~27 MB) from the GRIN-Global downloads page.

The `lpsn` source requires `--lpsn-user` and `--lpsn-pass` credentials.

The `wikidata` source downloads and processes two large dumps automatically:

| Dump | Size | Purpose |
|---|---|---|
| `latest-all.json.gz` (Wikidata) | ~160 GB | Taxonomy, names, synonyms, distributions, references |
| `commonswiki-latest-pages-articles.xml.bz2` (Commons) | ~106 GB | Taxon gallery images with metadata |

Both dumps are downloaded in parallel at startup (the Commons dump in a background thread while Wikidata processing runs). Freshness is checked via `Last-Modified` headers; files are only re-downloaded when the remote is newer. Use `--no-download` to skip all downloads and reuse existing local files.
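
The freshness rule described above can be sketched in Java roughly as follows. This is a minimal illustration and not the generator's actual code: the class `FreshnessCheck` and its method names are hypothetical, and treating a missing `Last-Modified` header as "keep the local copy" is an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative, not taken from the project.
final class FreshnessCheck {

    /** Pure decision rule: true only when the remote timestamp is known and newer. */
    static boolean isRemoteNewer(long remoteMillis, long localMillis) {
        // HttpURLConnection reports 0 when the Last-Modified header is absent.
        // Assumption: in that case the existing local copy is kept.
        return remoteMillis > 0 && remoteMillis > localMillis;
    }

    /** Reads the Last-Modified header with a HEAD request (no body is transferred). */
    static long remoteLastModified(URI uri) throws IOException {
        HttpURLConnection con = (HttpURLConnection) uri.toURL().openConnection();
        con.setRequestMethod("HEAD");
        try {
            return con.getLastModified();
        } finally {
            con.disconnect();
        }
    }

    /** A file needs downloading when it is missing locally or the remote copy is newer. */
    static boolean needsDownload(URI uri, Path local) throws IOException {
        if (!Files.exists(local)) {
            return true; // nothing cached yet, no network call needed to decide
        }
        long localMillis = Files.getLastModifiedTime(local).toMillis();
        return isRemoteNewer(remoteLastModified(uri), localMillis);
    }
}
```

For this comparison to stay meaningful across runs, the downloaded file's modification time would have to reflect either the download time or the remote timestamp; which of the two the generator relies on is not stated here.
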

Output includes `Media.tsv` populated from two sources:

- P18 (Wikidata property): one representative image per taxon, URL built directly from the filename in the Wikidata dump, with no extra HTTP calls.
- P935 gallery pages (Commons dump): all images listed in the taxon's curated Commons gallery, with `title`, `created`, `creator`, `license`, and `remarks` extracted from the file description pages.
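
To illustrate how an image URL can be derived from a P18 filename alone, here is a small Java sketch. It assumes the Commons `Special:FilePath` redirect endpoint as the URL scheme; the scheme the generator actually uses is not specified in this document, and the class `CommonsUrl` is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the URL scheme is an assumption, not the project's confirmed method.
final class CommonsUrl {

    private static final String BASE =
        "https://commons.wikimedia.org/wiki/Special:FilePath/";

    /** Builds a URL from a Commons filename as found in a P18 claim, without any HTTP call. */
    static String fromFilename(String filename) {
        // MediaWiki treats spaces and underscores in titles as equivalent.
        String normalized = filename.trim().replace(' ', '_');
        // URLEncoder does form encoding; because spaces were already replaced,
        // no '+' is emitted for them and the result is usable as a path segment.
        return BASE + URLEncoder.encode(normalized, StandardCharsets.UTF_8);
    }
}
```

For example, `CommonsUrl.fromFilename("Panthera leo.jpg")` yields `https://commons.wikimedia.org/wiki/Special:FilePath/Panthera_leo.jpg`.
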

Example runs:

    # Standard run (downloads everything automatically):
    java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s wikidata

    # Re-run without re-downloading (both dumps already cached):
    java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s wikidata --no-download

Find and delete 404 error files by their exact size:

    find . -type f -size 131c -delete
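
The `find` command above deletes matches immediately. For reviewing the matches first, a Java equivalent might look like the sketch below. The class `SizedFileCleaner` and its methods are hypothetical and not part of the generator; the size of 131 bytes is taken from the command above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper mirroring `find . -type f -size 131c -delete`.
final class SizedFileCleaner {

    /** Lists regular files (symlinks excluded) under root whose size is exactly sizeBytes. */
    static List<Path> findBySize(Path root, long sizeBytes) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(p -> Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS))
                .filter(p -> sizeOf(p) == sizeBytes)
                .collect(Collectors.toList());
        }
    }

    /** Deletes every file returned by findBySize and reports how many were removed. */
    static int deleteBySize(Path root, long sizeBytes) throws IOException {
        List<Path> matches = findBySize(root, sizeBytes);
        for (Path p : matches) {
            Files.delete(p);
        }
        return matches.size();
    }

    // Files.size throws a checked exception, so wrap it for use inside the stream.
    private static long sizeOf(Path p) {
        try {
            return Files.size(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `findBySize(root, 131)` and inspecting the result before `deleteBySize(root, 131)` gives a dry run that the one-line `find` invocation does not.
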