CatalogueOfLife/coldp-generator

ColDP Archive Generator

Conversion tools that create ColDP archives from various online sources not readily available otherwise. The conversion is fully automated, so it can run from a scheduler.

Build & Run

# Build fat JAR
mvn package

# Run a specific generator
java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s <source>

CLI Options

| Option | Default | Description |
|---|---|---|
| -s, --source | (required) | Source name (see Supported Sources) |
| -r, --repository | /tmp/coldp-generator | Output directory for generated archives |
| --tmp | /tmp/coldp-generator-sources | Directory for downloaded source files |
| --api-key | | API key for authenticated sources, e.g. WSC |
| --lpsn-user / --lpsn-pass | | Credentials for LPSN |
| --date | | Date filter for incremental WSC updates |
| --no-download | false | Skip downloading source files; reuse existing local copies |
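
Since the conversion is meant to run unattended, a scheduler can call a small wrapper around the JAR. The following is a minimal sketch, not part of the project: the function name `run_generator`, the `DRY_RUN` switch, and the environment variable names are assumptions made for illustration. Only `-s`, `-r` and `--tmp` come from the options above.

```shell
#!/bin/sh
# Hypothetical wrapper script; the defaults mirror the documented ones.
JAR="${JAR:-target/coldp-generator-1.0-SNAPSHOT.jar}"
REPO="${REPO:-/tmp/coldp-generator}"
TMP_DIR="${TMP_DIR:-/tmp/coldp-generator-sources}"

# Run the generator for one source.
# With DRY_RUN set, only print the command instead of executing it.
run_generator() {
  src="$1"
  if [ -n "${DRY_RUN:-}" ]; then
    echo java -jar "$JAR" -s "$src" -r "$REPO" --tmp "$TMP_DIR"
  else
    java -jar "$JAR" -s "$src" -r "$REPO" --tmp "$TMP_DIR"
  fi
}

# Treat every script argument as a source name; keep going if one fails.
for src in "$@"; do
  run_generator "$src" || echo "generator failed for $src" >&2
done
```

Saved as, say, `run-generators.sh`, it could be called from cron with a line such as `0 3 * * 0 /path/to/run-generators.sh antcat mdd` (path and schedule are placeholders). Sources that need credentials, such as wsc or lpsn, would need the corresponding options added.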

Supported Sources

| Source | Name | ChecklistBank | Notes |
|---|---|---|---|
| antcat | AntCat | 54937 | Online Catalog of the Ants of the World |
| bats | Bats of the World | 314574 | Chiroptera taxonomy by Simmons & Cirranello (AMNH) |
| birdlife | Birdlife HBW | 170809 | Handbook of the Birds of the World |
| biolib | BioLib | 54592 | |
| clements | Clements | 2013 | Clements Checklist of Birds of the World |
| cycads | Cycads | 1163 | The World List of Cycads |
| grin | GRIN | 2018 | GRIN-Global Taxonomy (cultivated plants) |
| ioc | IOC World Bird List | 2036 | IOC World Bird List (latest version auto-detected) |
| ictv | ICTV | 1014 | ICTV Master Species List |
| ipni | IPNI | 2006 | International Plant Names Index |
| itis | ITIS | 2144 | Integrated Taxonomic Information System (all 7 kingdoms) |
| lpsn | LPSN | 2015 | List of Prokaryotic names with Standing in Nomenclature |
| mdd | MDD | 9802 | Mammal Diversity Database |
| mites | Mites | | |
| otl | OTL | 201891 | Open Tree of Life Synthesis Tree |
| ott | OTT | 201890 | Open Tree of Life Reference Taxonomy |
| pbdb | PBDB | 1174 | The Paleobiology Database |
| pfnr | PFNR | 314595 | International Fossil Plant Names Registry |
| wikidata | Wikidata | 314569 | Wikidata taxonomy (downloads full Wikidata + Commons dumps, ~260 GB total) |
| wikispecies | WikiSpecies | 314570 | |
| wsc | WSC | 56185 | World Spider Catalog |

Generator-specific Notes

GRIN

Requires cabextract to be installed (for example with Homebrew: brew install cabextract). Downloads a single taxonomy_data.cab (~27 MB) from the GRIN-Global downloads page.

LPSN

Requires --lpsn-user and --lpsn-pass credentials.

Wikidata

Downloads and processes two large dumps automatically:

| Dump | Size | Purpose |
|---|---|---|
| latest-all.json.gz (Wikidata) | ~160 GB | Taxonomy, names, synonyms, distributions, references |
| commonswiki-latest-pages-articles.xml.bz2 (Commons) | ~106 GB | Taxon gallery images with metadata |

Both dumps are downloaded in parallel at startup (the Commons dump in a background thread while Wikidata processing runs). Freshness is checked via Last-Modified headers; files are only re-downloaded when the remote is newer. Use --no-download to skip all downloads and reuse existing local files.
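
The freshness check described above behaves like an HTTP conditional download. As an illustration only (this is not the generator's own code), the same idea can be sketched with curl: `-z FILE` sends an If-Modified-Since header based on the local file's modification time, and `-R` stamps the downloaded file with the remote Last-Modified time. The helper name `fetch_if_newer` is hypothetical.

```shell
#!/bin/sh
# Sketch: download URL to FILE only when the remote copy is newer.
# fetch_if_newer is an illustrative helper, not part of coldp-generator.
fetch_if_newer() {
  url="$1"
  file="$2"
  if [ -f "$file" ]; then
    # Local copy exists: make the request conditional on its mtime.
    # curl leaves the file untouched when the remote is not newer.
    curl -fsSL -R -z "$file" -o "$file" "$url"
  else
    # No local copy yet: plain download, keeping the remote timestamp.
    curl -fsSL -R -o "$file" "$url"
  fi
}
```

A call would look like `fetch_if_newer https://example.org/dump.gz dump.gz`, where the URL is a placeholder. For dumps of this size, a more careful version would download to a temporary file and rename it on success, so an interrupted transfer does not replace a good local copy.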

Output includes Media.tsv populated from two sources:

  • P18 (Wikidata property): one representative image per taxon, with the URL built directly from the filename in the Wikidata dump, so no extra HTTP calls are needed.
  • P935 gallery pages (Commons dump): all images listed in the taxon's curated Commons gallery, with title, created, creator, license, and remarks extracted from the file description pages.

# Standard run (downloads everything automatically):
java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s wikidata

# Re-run without re-downloading (both dumps already cached):
java -jar target/coldp-generator-1.0-SNAPSHOT.jar -s wikidata --no-download
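
For the P18 case, one common way to turn a Commons filename into a URL without any HTTP call is the Special:FilePath redirect on commons.wikimedia.org. The sketch below assumes that URL form; the generator may build its URLs differently, and the function name `commons_file_url` is hypothetical. It only replaces spaces with underscores and does not percent-encode other special characters.

```shell
#!/bin/sh
# Sketch: build a Commons image URL from a P18 filename, offline.
# Assumes the Special:FilePath URL form; not taken from the generator.
commons_file_url() {
  # Commons page titles use underscores in place of spaces.
  name="$(printf '%s' "$1" | tr ' ' '_')"
  printf 'https://commons.wikimedia.org/wiki/Special:FilePath/%s\n' "$name"
}
```

For example, `commons_file_url 'Panthera leo.jpg'` prints `https://commons.wikimedia.org/wiki/Special:FilePath/Panthera_leo.jpg`.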

World Spider Catalog (WSC)

Find and delete 404 error files by their exact size (131 bytes; the c suffix in -size 131c means bytes):

find . -type f -size 131c -delete 
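
Because -delete cannot be undone, it may be worth previewing the matches first. A minimal sketch, assuming the same 131-byte size; the helper name `list_error_files` is hypothetical:

```shell
#!/bin/sh
# List files of exactly SIZE bytes under DIR without deleting anything.
# Defaults: current directory and 131 bytes, as in the command above.
list_error_files() {
  dir="${1:-.}"
  size="${2:-131}"
  find "$dir" -type f -size "${size}c" -print
}
```

Running `list_error_files .` shows what the delete command would remove. Any legitimate file that happens to be exactly 131 bytes would match as well, so the list is worth a quick look.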
