281 lines (270 loc) · 156 KB

| Dataset Short Name | Dataset Full Name | Dataset Description | Dataset Source (URL) | Data Kind | License | Row Count | Row Groups - Parquet | File Size - Parquet | File Size - Vortex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 120 years of Olympic history: athletes and results | 120 years of Olympic history: athletes and results | Basic bio data on athletes and medal results from Athens 1896 to Rio 2016. This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. "I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub." — adapted from the dataset's Kaggle description (heesoo37/120-years-of-olympic-history-athletes-and-results). | https://github.com/rgriff23/Olympic_history | Tabular (CSV) | CC0-1.0 | 271,116 | — | 4.4 MB | 5.6 MB |
| ⚠ C4 (en, validation) | AllenAI/C4 — Colossal Clean Crawled Corpus, English validation split | C4 (Raffel et al., JMLR 2020) — a heavily-filtered scrape of Common Crawl used to pretrain T5. This entry pulls only the 8-shard English validation split (~365k documents), enough for type coverage and as a smoke-test for the Common-Crawl scrape playbook. Flip allow_patterns to en/c4-train.*.json.gz to mirror the 327 GB English training set. | https://huggingface.co/datasets/allenai/c4 | Structured (JSON) | ODC-By-1.0 | 364,608 | — | 339.7 MB | 435.3 MB |
| ⚠ CodeParrot Clean (validation) | codeparrot/codeparrot-clean — validation split | 61k Python source files scraped from MIT/BSD/Apache-licensed GitHub repos by the CodeParrot project. Validation split (142 MB raw .json.gz). Showcases code-corpus shape: string content, numeric quality metrics (line_mean, alpha_frac), boolean autogenerated flag, and per-row license attribution. | https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid | Structured + Blobs | Apache-2.0 | 61,373 | — | 149.7 MB | 343.5 MB |
| ⚠ FineMath (4+ quality subset) | HuggingFaceTB/finemath — quality ≥ 4 math web pages | Math-focused subset of the fineweb pipeline, filtered to documents with an automated quality score ≥ 4 (the higher of the two quality tiers). 64 parquet shards. Schema mirrors fineweb (text + harvest metadata). | https://huggingface.co/datasets/HuggingFaceTB/finemath | Tabular (Parquet) | ODC-By-1.0 | 6,699,493 | 7 | 12,647.7 MB | 21,724.8 MB |
| ⚠ FinePDFs (English test sample) | HuggingFaceFW/finepdfs — English test split (sample) | Sample slice of HuggingFace's PDF-derived corpus (3T tokens total, 1733 languages). Schema mirrors fineweb (text + harvest metadata) but the source is OCR/extracted PDFs rather than HTML. Limited here to the English test split (~1 shard) for tractable build size — flip allow patterns to eng_Latn/train/*.parquet for the full English corpus (579 shards, multi-hour fetch). | https://huggingface.co/datasets/HuggingFaceFW/finepdfs | Tabular (Parquet) | ODC-By-1.0 | 373 | 1 | 7.0 MB | 12.7 MB |
| ⚠ Fineweb (sample, 10BT) | HuggingFaceFW/fineweb — 10B-token reproducibility sample | 10B-token reproducibility sample of HuggingFace's Fineweb (a 15TB Common-Crawl-filtered English-text corpus released for LLM pretraining research). 15 parquet shards of text + deduplication metadata (~32 GB raw). Flip allow_patterns to sample/100BT/*.parquet (300 GB) or sample/350BT/*.parquet (1 TB) for larger reproducibility samples. | https://huggingface.co/datasets/HuggingFaceFW/fineweb | Tabular (Parquet) | ODC-By-1.0 | 14,868,862 | — | 20,364.5 MB | 26,202.2 MB |
| ⚠ Fineweb-2 (Swedish sample) | HuggingFaceFW/fineweb-2 — Swedish Latin-script subset (sample) | Swedish-language subset of HuggingFace's multilingual fineweb-2 (the 1000+-language extension of the original English fineweb). Each row carries text plus the same dedup/quality metadata fields as fineweb. Swedish was picked as a moderately-sized, representative non-English split; flip hf_allow_patterns to <lang>_<script>/*.parquet for any of the other 1000+ language/script pairs. | https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 | Tabular (Parquet) | ODC-By-1.0 | 3,626,000 | 4 | 4,583.5 MB | 5,807.7 MB |
| ⚠ LAION-400M (metadata) | LAION-400M — image-text pairs, metadata + CLIP only | 400M image-text pairs scraped from Common Crawl, joined with CLIP-ViT-B/32 similarity scores. This entry pulls only the metadata parquets (URL, caption, height/width, similarity) — the images themselves are not redistributed by LAION. Gated on Hugging Face: requires accepting LAION's terms once via the dataset page before the API will serve downloads. | https://laion.ai/blog/laion-400-open-dataset/ | Tabular (Parquet) | CC-BY-4.0 | 2,820,459 | — | 256.4 MB | 318.3 MB |
| ⚠ SlimPajama-6B | DKYoon/SlimPajama-6B — 6B-token deduplicated sample | 6B-token deduplicated subsample of Cerebras's SlimPajama-627B (itself a cleaned subset of RedPajama-1T). 50 parquet shards of text + meta.redpajama_set_name covering Common Crawl, GitHub, Wikipedia, books, ArXiv. Used as a small reproducible pretraining sample by the LLM research community. | https://huggingface.co/datasets/DKYoon/SlimPajama-6B | Tabular (Parquet) | Apache-2.0 | 5,507,693 | — | 9,441.5 MB | 13,695.7 MB |
| Abalone | UCI ML Repository — Abalone | Predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope — a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. | https://archive.ics.uci.edu/dataset/1/abalone | Tabular (CSV) | CC-BY-4.0 | 4,177 | — | 0.1 MB | 0.1 MB |
| Adult | UCI ML Repository — Adult | Predict whether the annual income of an individual exceeds $50K/yr based on census data; also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person's income is over $50,000 a year. | https://archive.ics.uci.edu/dataset/2/adult | Tabular (CSV) | CC-BY-4.0 | 48,842 | — | 0.4 MB | 0.4 MB |
| AI2 ARC | allenai/ai2_arc — AI2 Reasoning Challenge (Easy + Challenge) | 7.7K grade-school-level science multiple-choice questions, split into Easy (5.2K) and Challenge (2.6K) subsets. Each row carries question, choices (list<string>), and answerKey. Both subsets are concatenated, with a split column distinguishing the source. | https://huggingface.co/datasets/allenai/ai2_arc | Tabular (Parquet) | CC-BY-SA-4.0 | 7,787 | 1 | 0.8 MB | 1.1 MB |
| AI4I 2020 Predictive Maintenance Dataset | UCI ML Repository — AI4I 2020 Predictive Maintenance Dataset | A synthetic dataset that reflects real predictive maintenance data encountered in industry. Since real predictive maintenance datasets are generally difficult to obtain, and in particular difficult to publish, the authors provide a synthetic dataset that reflects real predictive maintenance encountered in industry to the best of their knowledge. | https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset | Tabular (CSV) | CC-BY-4.0 | 10,000 | — | 0.1 MB | 0.1 MB |
| Air Quality | UCI ML Repository — Air Quality | Responses of a gas multisensor device deployed in the field in an Italian city; hourly response averages are recorded along with gas-concentration references from a certified analyzer. The dataset contains 9,358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air-quality chemical sensor device responses. | https://archive.ics.uci.edu/dataset/360/air+quality | Tabular (CSV) | CC-BY-4.0 | 9,357 | — | 0.2 MB | 0.2 MB |
| Airbnb Open Data | Airbnb Open Data | New York City Airbnb open data (data-cleaning exercise). Airbnb, Inc. is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving a commission from each booking. — adapted from the dataset's Kaggle description (arianazmoudeh/airbnbopendata). | http://insideairbnb.com/explore/ | Tabular (CSV) | ODbL-1.0 | 102,599 | — | 4.9 MB | 5.6 MB |
| Airbnb Prices in European Cities | Airbnb Prices in European Cities | Determinants of price by room type, location, cleanliness rating, and more. Each listing is evaluated for various attributes such as room types, cleanliness and satisfaction ratings, bedrooms, distance from the city centre, and more to capture an in-depth understanding of Airbnb prices on both weekdays and weekends. Using spatial econometric methods, the authors analyse and identify the determinants of Airbnb prices across these cities, aiming to offer insight into how global markets are affected by social dynamics and geographical factors, which in turn determine pricing strategies. — adapted from the dataset's Kaggle description (thedevastator/airbnb-prices-in-european-cities). | https://zenodo.org/record/4446043#.Y9Y9ENJBwUE | Tabular (CSV) | CC0-1.0 | 51,707 | — | 3.5 MB | 3.0 MB |
| AMPds — Whole-House Electricity | AMPds v2 — Whole-House Electricity (Makonin et al., Sci. Data 2016) | Whole-house electricity consumption time-series from AMPds v2 (the Almanac of Minutely Power dataset, version 2). 1-minute-resolution measurements of the residence's main electrical service over April 2012–April 2014, paired with rich electrical sub-metrics (voltage, current, frequency, power factor, real/reactive/apparent power, plus running totals). One row per minute (~1.05M rows). Schema: unix_ts, V, I, f, DPF, APF, P, Pt, Q, Qt, S, St. | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FIE0S4 | Tabular (CSV) | CC-BY-4.0 | 1,051,200 | 2 | 18.6 MB | 20.3 MB |
| Anthropic Economic Index | Anthropic/EconomicIndex — Claude.ai usage panel (release 2026-03-24) | Aggregated Claude.ai usage metrics from the fourth Anthropic Economic Index report (data window 2026-02-05 – 2026-02-12). Each row carries one metric value for a (geography, facet, variable, cluster) tuple. Schema: geo_id (country / country-state / global ISO codes), geography, date_start, date_end, platform_and_product, facet (collaboration / request / onet_task / etc.), level, variable, cluster_name, value. Bound here to the consumer-product file from the latest release (release_2026_03_24/data/aei_raw_claude_ai_*.csv); the sibling aei_raw_1p_api_*.csv covers API usage, and earlier releases live under release_2025_*/data/ and release_2026_01_15/data/. | https://huggingface.co/datasets/Anthropic/EconomicIndex | Tabular (CSV) | MIT | 425,257 | 1 | 3.6 MB | 5.8 MB |
| Anthropic HH-RLHF (helpful-base) | Anthropic Helpful & Harmless RLHF — helpful-base subset | Anthropic's RLHF preference dataset, scoped to the helpful-base subset (~46k pairs of chosen/rejected assistant responses to the same prompt). Each pair is a complete conversation transcript. First RLHF-preference-data slug in the catalog; other subsets (harmless-base, helpful-online, helpful-rejection-sampled, red-team-attempts) can be added as sibling slugs. | https://huggingface.co/datasets/Anthropic/hh-rlhf | Structured (JSON) | MIT | 46,189 | — | 25.8 MB | 34.7 MB |
| Anthropic Interviewer | Anthropic/AnthropicInterviewer — qualitative AI-interview transcripts | Anthropic's published transcripts from AI-conducted qualitative research interviews, across three populations: creatives, scientists, and workforce participants. Each row carries the interview transcript plus participant demographics and study metadata. | https://huggingface.co/datasets/Anthropic/AnthropicInterviewer | Tabular (Parquet) | MIT | 1,250 | 1 | 3.0 MB | 5.4 MB |
| Auto MPG | UCI ML Repository — Auto MPG | Revised from the CMU StatLib library; the data concern city-cycle fuel consumption. This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original". "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993) | https://archive.ics.uci.edu/dataset/9/auto+mpg | Tabular (CSV) | CC-BY-4.0 | 398 | — | 0.0 MB | 0.0 MB |
| Automobile | UCI ML Repository — Automobile | From the 1985 Ward's Automotive Yearbook. This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price; then, if a car is more (or less) risky, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". | https://archive.ics.uci.edu/dataset/10/automobile | Tabular (CSV) | CC-BY-4.0 | 205 | — | 0.0 MB | 0.0 MB |
| Aya Collection (aya_dataset config) | CohereLabs/aya_collection — Aya-curated subset | The smaller aya_dataset config of the broader Aya Collection (which totals ~513M instances across 114 languages, mostly templated). This subset is the human-curated Aya Dataset in the Aya Collection's unified schema. Flip allow patterns to other configs (e.g. templated_xnli/) for the templated variants. | https://huggingface.co/datasets/CohereLabs/aya_collection | Tabular (Parquet) | Apache-2.0 | 202,364 | 1 | 99.1 MB | 160.3 MB |
| Aya Dataset | CohereLabs/aya_dataset — human-curated multilingual instructions | 204K human-curated multilingual instruction/response pairs in 65 languages, contributed by 3K participants in the Cohere For AI Aya open-science project. Each row carries inputs, targets, language, language_code, and annotation_type. | https://huggingface.co/datasets/CohereLabs/aya_dataset | Tabular (Parquet) | Apache-2.0 | 204,112 | 1 | 99.3 MB | 162.2 MB |
| Bank Account Fraud Dataset Suite (NeurIPS 2022) | Bank Account Fraud Dataset Suite (NeurIPS 2022) | Biased, imbalanced, dynamic tabular datasets for ML evaluation. The Bank Account Fraud (BAF) suite of datasets was published at NeurIPS 2022 and comprises 6 different synthetic bank-account-fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind. Each dataset is composed of: 1 million instances; 30 realistic features used in the fraud-detection use case; a "month" column providing temporal information about the dataset; and protected attributes (age group, employment status, and % income). — adapted from the dataset's Kaggle description (sgpjesus/bank-account-fraud-dataset-neurips-2022). | https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf | Tabular (CSV) | CC-BY-NC-SA-4.0 | 1,000,000 | — | 61.1 MB | 56.5 MB |
| Bank Marketing | UCI ML Repository — Bank Marketing | Data related to direct marketing campaigns (phone calls) of a Portuguese banking institution; the classification goal is to predict whether the client will subscribe to a term deposit (variable y). The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). | https://archive.ics.uci.edu/dataset/222/bank+marketing | Tabular (CSV) | CC-BY-4.0 | 45,211 | — | 0.3 MB | 0.4 MB |
| Behavioral Risk Factor Surveillance System | Behavioral Risk Factor Surveillance System | Public health surveys of 400k people from 2011–2015. The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. — adapted from the dataset's Kaggle description (cdc/behavioral-risk-factor-surveillance-system). | https://www.cdc.gov/brfss/ | Custom | CC0-1.0 | 445,132 | — | 26.8 MB | 54.2 MB |
| Beijing Multi-Site Air Quality | UCI ML Repository — Beijing Multi-Site Air Quality (Liang et al., 2017) | Hourly air-quality + meteorological readings from 12 monitoring stations across Beijing, March 2013 – February 2017 (~420K rows). Each station ships as a separate CSV in the upstream nested zip; the build concatenates them with the existing station column intact. Schema (18 cols): No, year, month, day, hour, PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM, station. Standard substitute for the KDD'15 U-Air dataset (which is no longer publicly hosted). | https://archive.ics.uci.edu/dataset/501/beijing+multi+site+air+quality+data | Tabular (CSV) | CC-BY-4.0 | 420,768 | 1 | 5.4 MB | 7.3 MB |
| BeIR / MS MARCO | BeIR/msmarco — text-only passage corpus + queries | MS MARCO repackaged for the BEIR retrieval benchmark: 8.84M passages from the Bing search corpus paired with 510k user queries. Both ship as large_string parquet — a solid showcase for string-heavy datasets and FSST encoding in Vortex. Concatenated into one slug with a split column distinguishing corpus vs queries. | https://huggingface.co/datasets/BeIR/msmarco | Tabular (Parquet) | MIT | 9,351,785 | — | 1,095.3 MB | 1,681.8 MB |
| BI-Arade | Public BI Benchmark — Arade | Public BI Benchmark workload Arade — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Arade | Tabular (CSV) | MIT | 9,888,775 | — | 312.5 MB | 136.7 MB |
| BI-Bimbo | Public BI Benchmark — Bimbo | Public BI Benchmark workload Bimbo — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Bimbo | Tabular (CSV) | MIT | 74,180,464 | — | 368.6 MB | 436.5 MB |
| BI-CityMaxCapita | Public BI Benchmark — CityMaxCapita | Public BI Benchmark workload CityMaxCapita — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CityMaxCapita | Tabular (CSV) | MIT | 912,657 | — | 102.0 MB | 136.6 MB |
| BI-CMSprovider | Public BI Benchmark — CMSprovider | Public BI Benchmark workload CMSprovider — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CMSprovider | Tabular (CSV) | MIT | 18,575,754 | — | 798.3 MB | 804.8 MB |
| BI-CommonGovernment | Public BI Benchmark — CommonGovernment | Public BI Benchmark workload CommonGovernment — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CommonGovernment | Tabular (CSV) | MIT | 141,123,827 | — | 6,358.3 MB | 9,153.8 MB |
| BI-Corporations | Public BI Benchmark — Corporations | Public BI Benchmark workload Corporations — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Corporations | Tabular (CSV) | MIT | 741,723 | — | 53.6 MB | 67.9 MB |
| BI-Eixo | Public BI Benchmark — Eixo | Public BI Benchmark workload Eixo — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Eixo | Tabular (CSV) | MIT | 7,559,227 | — | 463.1 MB | 616.6 MB |
| BI-Euro2016 | Public BI Benchmark — Euro2016 | Public BI Benchmark workload Euro2016 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Euro2016 | Tabular (CSV) | MIT | 2,052,497 | — | 127.7 MB | 156.9 MB |
| BI-Food | Public BI Benchmark — Food | Public BI Benchmark workload Food — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Food | Tabular (CSV) | MIT | 5,216,593 | — | 36.4 MB | 40.4 MB |
| BI-Generico | Public BI Benchmark — Generico | Public BI Benchmark workload Generico — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Generico | Tabular (CSV) | MIT | 114,124,607 | — | 2,341.3 MB | 3,619.4 MB |
| BI-HashTags | Public BI Benchmark — HashTags | Public BI Benchmark workload HashTags — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/HashTags | Tabular (CSV) | MIT | 511,511 | — | 138.7 MB | 186.8 MB |
| BI-Hatred | Public BI Benchmark — Hatred | Public BI Benchmark workload Hatred — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Hatred | Tabular (CSV) | MIT | 873,166 | — | 100.6 MB | 133.6 MB |
| BI-IGlocations1 | Public BI Benchmark — IGlocations1 | Public BI Benchmark workload IGlocations1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IGlocations1 | Tabular (CSV) | MIT | 81,611 | — | 1.8 MB | 2.3 MB |
| BI-IGlocations2 | Public BI Benchmark — IGlocations2 | Public BI Benchmark workload IGlocations2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IGlocations2 | Tabular (CSV) | MIT | 4,341,308 | — | 515.6 MB | 720.2 MB |
| BI-IUBLibrary | Public BI Benchmark — IUBLibrary | Public BI Benchmark workload IUBLibrary — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IUBLibrary | Tabular (CSV) | MIT | 1,795 | — | 0.2 MB | 0.2 MB |
| BI-Medicare1 | Public BI Benchmark — Medicare1 | Public BI Benchmark workload Medicare1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare1 | Tabular (CSV) | MIT | 17,290,144 | — | 939.9 MB | 800.9 MB |
| BI-Medicare2 | Public BI Benchmark — Medicare2 | Public BI Benchmark workload Medicare2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare2 | Tabular (CSV) | MIT | 18,306,546 | — | 853.2 MB | 939.2 MB |
| BI-Medicare3 | Public BI Benchmark — Medicare3 | Public BI Benchmark workload Medicare3 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare3 | Tabular (CSV) | MIT | 9,287,877 | — | 452.9 MB | 481.6 MB |
| BI-MedPayment1 | Public BI Benchmark — MedPayment1 | Public BI Benchmark workload MedPayment1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MedPayment1 | Tabular (CSV) | MIT | 9,153,273 | — | 419.5 MB | 472.5 MB |
| BI-MedPayment2 | Public BI Benchmark — MedPayment2 | Public BI Benchmark workload MedPayment2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MedPayment2 | Tabular (CSV) | MIT | 9,153,273 | — | 488.0 MB | 524.0 MB |
| BI-MLB | Public BI Benchmark — MLB | Public BI Benchmark workload MLB — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MLB | Tabular (CSV) | MIT | 32,472,563 | — | 1,160.3 MB | 2,018.4 MB |
| BI-Motos | Public BI Benchmark — Motos | Public BI Benchmark workload Motos — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Motos | Tabular (CSV) | MIT | 28,364,361 | — | 581.7 MB | 894.2 MB |
| BI-MulheresMil | Public BI Benchmark — MulheresMil | Public BI Benchmark workload MulheresMil — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MulheresMil | Tabular (CSV) | MIT | 7,561,432 | — | 464.9 MB | 622.8 MB |
| BI-NYC | Public BI Benchmark — NYC | Public BI Benchmark workload NYC — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/NYC | Tabular (CSV) | MIT | 19,242,976 | — | 856.5 MB | 733.2 MB |
| BI-PanCreactomy1 | Public BI Benchmark — PanCreactomy1 | Public BI Benchmark workload PanCreactomy1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/PanCreactomy1 | Tabular (CSV) | MIT | 9,153,273 | — | 423.0 MB | 475.2 MB |
| BI-PanCreactomy2 | Public BI Benchmark — PanCreactomy2 | Public BI Benchmark workload PanCreactomy2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/PanCreactomy2 | Tabular (CSV) | MIT | 18,306,546 | — | 845.9 MB | 948.3 MB |
| BI-Physicians | Public BI Benchmark — Physicians | Public BI Benchmark workload Physicians — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Physicians | Tabular (CSV) | MIT | 9,153,273 | — | 419.5 MB | 473.4 MB |
BI-Provider Public BI Benchmark β€” Provider Public BI Benchmark workload Provider β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Provider Tabular (CSV) MIT 73,226,184 β€” 3,412.1 MB 3,752.2 MB
BI-RealEstate1 Public BI Benchmark β€” RealEstate1 Public BI Benchmark workload RealEstate1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/RealEstate1 Tabular (CSV) MIT 39,062,718 β€” 2,367.9 MB 2,418.3 MB
BI-RealEstate2 Public BI Benchmark β€” RealEstate2 Public BI Benchmark workload RealEstate2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/RealEstate2 Tabular (CSV) MIT 66,415,881 β€” 5,305.3 MB 6,127.7 MB
BI-Redfin1 Public BI Benchmark β€” Redfin1 Public BI Benchmark workload Redfin1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin1 Tabular (CSV) MIT 12,120,220 β€” 1,619.4 MB 1,642.1 MB
BI-Redfin2 Public BI Benchmark β€” Redfin2 Public BI Benchmark workload Redfin2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin2 Tabular (CSV) MIT 9,090,165 β€” 1,214.9 MB 1,227.8 MB
BI-Redfin3 Public BI Benchmark β€” Redfin3 Public BI Benchmark workload Redfin3 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin3 Tabular (CSV) MIT 6,534,558 β€” 875.0 MB 909.6 MB
BI-Redfin4 Public BI Benchmark β€” Redfin4 Public BI Benchmark workload Redfin4 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin4 Tabular (CSV) MIT 3,267,279 β€” 457.8 MB 477.1 MB
BI-Rentabilidad Public BI Benchmark β€” Rentabilidad Public BI Benchmark workload Rentabilidad β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Rentabilidad Tabular (CSV) MIT 3,595,905 β€” 759.9 MB 952.7 MB
BI-Romance Public BI Benchmark β€” Romance Public BI Benchmark workload Romance β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Romance Tabular (CSV) MIT 3,173,176 β€” 325.9 MB 463.9 MB
BI-SalariesFrance Public BI Benchmark β€” SalariesFrance Public BI Benchmark workload SalariesFrance β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/SalariesFrance Tabular (CSV) MIT 16,223,877 β€” 2,493.9 MB 2,832.0 MB
BI-TableroSistemaPenal Public BI Benchmark β€” TableroSistemaPenal Public BI Benchmark workload TableroSistemaPenal β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TableroSistemaPenal Tabular (CSV) MIT 25,274,916 β€” 272.2 MB 386.0 MB
BI-Taxpayer Public BI Benchmark β€” Taxpayer Public BI Benchmark workload Taxpayer β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Taxpayer Tabular (CSV) MIT 91,532,730 β€” 4,264.8 MB 4,685.6 MB
BI-Telco Public BI Benchmark β€” Telco Public BI Benchmark workload Telco β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Telco Tabular (CSV) MIT 2,913,060 β€” 1,051.0 MB 951.5 MB
BI-TrainsUK1 Public BI Benchmark β€” TrainsUK1 Public BI Benchmark workload TrainsUK1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TrainsUK1 Tabular (CSV) MIT 12,909,724 β€” 443.7 MB 510.5 MB
BI-TrainsUK2 Public BI Benchmark β€” TrainsUK2 Public BI Benchmark workload TrainsUK2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TrainsUK2 Tabular (CSV) MIT 31,123,554 β€” 1,263.8 MB 1,596.0 MB
BI-Uberlandia Public BI Benchmark β€” Uberlandia Public BI Benchmark workload Uberlandia β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Uberlandia Tabular (CSV) MIT 7,559,227 β€” 464.9 MB 627.1 MB
BI-USCensus Public BI Benchmark β€” USCensus Public BI Benchmark workload USCensus β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/USCensus Tabular (CSV) MIT 9,398,385 β€” 2,421.9 MB 3,037.9 MB
BI-Wins Public BI Benchmark β€” Wins Public BI Benchmark workload Wins β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Wins Tabular (CSV) MIT 2,115,449 β€” 734.3 MB 820.4 MB
BI-YaleLanguages Public BI Benchmark β€” YaleLanguages Public BI Benchmark workload YaleLanguages β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/YaleLanguages Tabular (CSV) MIT 5,762,082 β€” 94.7 MB 123.0 MB
Bike Sharing UCI ML Repository β€” Bike Sharing This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership through rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues. https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset Tabular (CSV) CC-BY-4.0 17,379 β€” 0.2 MB 0.2 MB
Breast Cancer UCI ML Repository β€” Breast Cancer This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. (See also lymphography and primary-tumor.) This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal. https://archive.ics.uci.edu/dataset/14/breast+cancer Tabular (CSV) CC-BY-4.0 286 β€” 0.0 MB 0.0 MB
Breast Cancer Wisconsin (Diagnostic) UCI ML Repository β€” Breast Cancer Wisconsin (Diagnostic) Diagnostic Wisconsin Breast Cancer Database. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/. The separating plane was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming," Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method that uses linear programming to construct a decision tree. https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic Tabular (CSV) CC-BY-4.0 569 β€” 0.1 MB 0.1 MB
Breast Cancer Wisconsin (Original) UCI ML Repository β€” Breast Cancer Wisconsin (Original) Original Wisconsin Breast Cancer Database. Samples arrive periodically as Dr. Wolberg reports his clinical cases, so the database reflects this chronological grouping of the data. The grouping information, removed from the data itself, is: Group 1: 367 instances (January 1989); Group 2: 70 instances (October 1989); Group 3: 31 instances (February 1990); Group 4: 17 instances (April 1990); Group 5: 48 instances (August 1990); Group 6: 49 instances (updated January 1991); Group 7: 31 instances (June 1991); Group 8: 86 instances (November 1991). https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original Tabular (CSV) CC-BY-4.0 699 β€” 0.0 MB 0.0 MB
CA Housing California Housing Prices Median house prices for California districts derived from the 1990 census. This is the dataset used in the second chapter of AurΓ©lien GΓ©ron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome. The data contains information from the 1990 California census. β€” adapted from the dataset's Kaggle description (camnugent/california-housing-prices). http://lib.stat.cmu.edu/datasets/houses.zip Custom CC0-1.0 20,640 β€” 0.3 MB 0.4 MB
Car Evaluation UCI ML Repository β€” Car Evaluation Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure: CAR (car acceptability) branches into PRICE (overall price: buying price, maintenance price) and TECH (technical characteristics: COMFORT, …). https://archive.ics.uci.edu/dataset/19/car+evaluation Tabular (CSV) CC-BY-4.0 1,728 β€” 0.0 MB 0.0 MB
Cardiovascular Diseases Risk Prediction Dataset Cardiovascular Diseases Risk Prediction Dataset The 2021 BRFSS Dataset from CDC. CVDs risk prediction using personal lifestyle factors. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. β€” adapted from the dataset's Kaggle description (alphiree/cardiovascular-diseases-risk-prediction-dataset). https://www.cdc.gov/brfss/ Custom CC0-1.0 418,268 β€” 32.3 MB 58.0 MB
CDC Diabetes Health Indicators UCI ML Repository β€” CDC Diabetes Health Indicators The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or healthy. Dataset link: https://www.cdc.gov/brfss/annual_data/annual_2014.html https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators Tabular (CSV) CC-BY-4.0 253,680 β€” 2.2 MB 1.4 MB
Census Income UCI ML Repository β€” Census Income Predict whether income exceeds $50K/yr based on census data. Also known as the Adult dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year. https://archive.ics.uci.edu/dataset/20/census+income Tabular (CSV) CC-BY-4.0 48,842 β€” 0.4 MB 0.4 MB
Chess (Lichess) Lichess Standard Rated Games (2013-01 monthly dump) 20,000+ Lichess games, including moves, victor, rating, opening details and more. This is a set of just over 20,000 games collected from a selection of users on the site Lichess.org, along with instructions on how to collect more. I will also upload more games in the future as I collect them. I collected this data using the Lichess API, which enables collection of any given user's game history. β€” adapted from the dataset's Kaggle description (datasnaek/chess). https://database.lichess.org/ Custom CC0-1.0 121,332 β€” 22.8 MB 23.9 MB
Chronic Kidney Disease UCI ML Repository β€” Chronic Kidney Disease This dataset can be used to predict chronic kidney disease; it was collected from a hospital over a period of nearly two months. The following field abbreviations are used: age: age; bp: blood pressure; sg: specific gravity; al: albumin; su: sugar; rbc: red blood cells; pc: pus cell; pcc: pus cell clumps; ba: bacteria; bgr: blood glucose random; bu: blood urea; sc: serum creatinine; sod: sodium; pot: potassium; hemo: hemoglobin; pcv: packed cell volume; wc: white blood cell count; rc: red blood cell count; htn: hypertension; dm: diabetes mellitus; cad: coronary artery disease… https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease Tabular (CSV) CC-BY-4.0 400 β€” 0.0 MB 0.0 MB
ClickBench Hits ClickBench Hits (Yandex Metrica log) 100M-row Yandex Metrica web-analytics event log used by the ClickBench OLAP benchmark suite. 105 columns covering URL, user agent, geo, click counts, and session metadata β€” heterogeneous string + numeric mix that exercises columnar query engines on wide, sparse rows. https://github.com/ClickHouse/ClickBench/blob/main/LICENSE Tabular (Parquet) Apache-2.0 99,997,497 β€” 9,497.6 MB 14,562.4 MB
CNN/DailyMail abisee/cnn_dailymail β€” news summarization (3.0.0) 300K English news articles from CNN and the Daily Mail (2007–2015) paired with multi-sentence reference summaries. Each row carries article, highlights, and id. Standard summarization eval; uses the 3.0.0 (non-anonymized) version. Original-author repo (Abi See). https://huggingface.co/datasets/abisee/cnn_dailymail Tabular (Parquet) Apache-2.0 311,971 1 543.3 MB 713.8 MB
CodeContests deepmind/code_contests (Li et al., 2022) 13K competitive-programming problems from Codeforces, AtCoder, et al., released with AlphaCode. Each row carries name, description, cf_* Codeforces metadata, public_tests, private_tests, and reference solutions in multiple languages. Used for code-generation evaluation. https://huggingface.co/datasets/deepmind/code_contests Tabular (Parquet) CC-BY-4.0 13,610 1 4,108.3 MB β€”
Cohere Wikipedia Simple (multilingual-v3 embeddings) Cohere/wikipedia-2023-11-embed-multilingual-v3 β€” Simple English subset (1024d) Simple English Wikipedia (646k passage-chunked rows) paired with Cohere's embed-multilingual-v3 1024-dimensional embeddings. The simple subset of the larger Cohere/wikipedia-2023-11-embed-multilingual-v3 repo, scoped via hf_allow_patterns. Showcases fixed_size_list<float, 1024> via the cast in hf_concat_splits. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3 Tabular (Parquet) Apache-2.0 646,424 β€” 1,263.2 MB 2,293.7 MB
COIG BAAI/COIG β€” Chinese Open Instruction Generalist BAAI's Chinese instruction-tuning corpus assembled from translation, exam, leetcode, human-value alignment, and counterfactual-correction subsets. Each row carries instruction, input, output, plus a subset-source tag. The default config concatenates the translatable and non-translatable splits. https://huggingface.co/datasets/BAAI/COIG Tabular (Parquet) Apache-2.0 178,246 1 55.5 MB 101.7 MB
Concrete Compressive Strength UCI ML Repository β€” Concrete Compressive Strength Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. 1,030 instances; 9 attributes (8 quantitative input variables and 1 quantitative output variable); no missing attribute values. https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength Tabular (CSV) CC-BY-4.0 1,030 β€” 0.0 MB 0.0 MB
CoronaHack -Chest X-Ray-Dataset CoronaHack -Chest X-Ray-Dataset Classify chest X-ray images for COVID-19. The COVID-19 virus affects the respiratory system, and chest X-rays are one of the important imaging methods for identifying it. With this chest X-ray dataset, develop a machine-learning model to classify X-rays of healthy patients vs. those affected by pneumonia (COVID-19), powering an AI application for faster COVID-19 testing. Credit to a postdoctoral fellow at Mila, University of Montreal, for the COVID-19 images; about 80% of the dataset was collected from other sources. β€” adapted from the dataset's Kaggle description (praveengovi/coronahack-chest-xraydataset). https://github.com/ieee8023/covid-chestxray-dataset Tabular (CSV) CC-BY-4.0 5,910 β€” 0.1 MB 0.1 MB
Cosmopedia (Stanford subset) HuggingFaceTB/cosmopedia β€” Stanford-style synthetic textbooks subset Synthetic textbook-style content generated by Mixtral-8x7B-Instruct in the style of Stanford coursework (one of the eight cosmopedia subsets). 13 shards, ~5GB. Open-weight-model provenance β€” no closed-API ToS issue. Flip allow patterns to auto_math_text/, khanacademy/, openstax/, stories/, or web_samples_v1/ for the other subsets. https://huggingface.co/datasets/HuggingFaceTB/cosmopedia Tabular (Parquet) Apache-2.0 1,020,024 1 2,022.9 MB 3,116.9 MB
Countries of the World CIA World Factbook (JSON mirror β†’ VARIANT parquet) Country names linked to region, population, area size, GDP, infant mortality and more. A world fact sheet, fun to link with other datasets. All of these data come from US-government sources. β€” adapted from the dataset's Kaggle description (fernandol/countries-of-the-world). https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html Custom CC0-1.0 262 β€” 1.6 MB 4.7 MB
COVID-19 data from Johns Hopkins University COVID-19 data from Johns Hopkins University Updated daily at 6am UTC in both raw and convenient form. This is a daily-updating version of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). The data updates every day at 6am UTC, just after the raw JHU data typically updates. It is available in both a raw form (files with the prefix RAW) and a convenient form (files prefixed with CONVENIENT). β€” adapted from the dataset's Kaggle description (antgoldbloom/covid19-data-from-john-hopkins-university). https://github.com/CSSEGISandData/COVID-19 Tabular (CSV) CC-BY-4.0 289 β€” 1.5 MB 2.3 MB
COVID-19 World Vaccination Progress COVID-19 World Vaccination Progress Daily and Total Vaccination for COVID-19 in the World from Our World in Data. Data is collected daily from Our World in Data GitHub repository for covid-19, merged and uploaded. Country level vaccination data is gathered and assembled in one single file. Then, this data file is merged with locations data file to include vaccination sources information. β€” adapted from the dataset's Kaggle description (gpreda/covid-world-vaccination-progress). https://github.com/owid/covid-19-data Tabular (CSV) CC0-1.0 196,246 β€” 3.7 MB 6.8 MB
Credit Approval UCI ML Repository β€” Credit Approval This dataset concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. The dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values. https://archive.ics.uci.edu/dataset/27/credit+approval Tabular (CSV) CC-BY-4.0 690 β€” 0.0 MB 0.0 MB
Crimes in Boston Crimes in Boston Times, locations, and descriptions of crimes. Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. Records begin on June 14, 2015 and continue through September 3, 2018. β€” adapted from the dataset's Kaggle description (AnalyzeBoston/crimes-in-boston). https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system Tabular (CSV) CC0-1.0 319,073 β€” 5.9 MB 20.0 MB
Databricks Dolly 15k databricks-dolly-15k (Conover et al., 2023) 15k human-written instruction/response pairs across 7 task categories (closed QA, classification, summarization, etc.), generated by Databricks employees specifically for instruction-tuning open LLMs. Distinguished by being entirely human-authored (no model output recycled as data), making it CC-BY-SA-3.0 β€” commercially-clear unlike most synthetic instruction corpora. https://huggingface.co/datasets/databricks/databricks-dolly-15k Structured (JSON) CC-BY-SA-3.0 15,011 β€” 5.1 MB 6.3 MB
dbpedia + Embeddings DBpedia Entities 1M + OpenAI text-embedding-3-large (1536-dim) 1M DBpedia entity abstracts paired with 1536-dim OpenAI text-embedding-3-large embeddings (Qdrant's release). Each row carries the entity's title, abstract, and the dense vector. Standard reference for vector-search benchmarks. https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M Tabular (Parquet) CC-BY-SA-4.0 1,000,000 β€” 6,898.3 MB 10,616.6 MB
Default of Credit Card Clients UCI ML Repository β€” Default of Credit Card Clients This research examines the case of customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result β€” credible or not-credible clients. Because the real probability of default is unknown, this study presents the novel Sorting Smoothing Method to estimate it. https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients Tabular (CSV) CC-BY-4.0 30,000 β€” 1.2 MB 1.0 MB
Diabetes UCI ML Repository β€” Diabetes This diabetes dataset is from AIM '94 Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). Thus paper records have fictitious uniform recording times whereas electronic records have more realistic time stamps. Diabetes files consist of four fields per record. https://archive.ics.uci.edu/dataset/34/diabetes Tabular (CSV) CC-BY-4.0 29,264 β€” 0.1 MB 0.1 MB
Diabetes 130-US Hospitals for Years 1999-2008 UCI ML Repository β€” Diabetes 130-US Hospitals for Years 1999-2008 The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, with over 50 features representing patient and hospital outcomes. Each row concerns hospital records of patients diagnosed with diabetes who underwent laboratory tests, received medications, and stayed up to 14 days. The goal is to determine early readmission of the patient within 30 days of discharge. The problem is important for the following reasons. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fails to attend to glycemic control. Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the morbidity and mortality of the patients, who may face complications associated with diabetes. Information was extracted from the database for encounters that satisfied the following criteria: (1) it is an inpatient encounter (a hospital admission); (2) it is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis; (3) the length of stay was at least 1 day and at most 14 days; (4) laboratory tests were performed during the encounter; (5) medications were administered during the encounter. https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 Tabular (CSV) CC-BY-4.0 101,766 β€” 2.0 MB 2.2 MB
Diabetes Health Diabetes Health Indicators Dataset 253,680 survey responses from the cleaned BRFSS 2015 survey, plus a balanced variant. Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate blood glucose levels, which can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are released into the bloodstream. β€” adapted from the dataset's Kaggle description (alexteboul/diabetes-health-indicators-dataset). https://www.cdc.gov/brfss/annual_data/annual_2015.html Custom CC0-1.0 441,456 β€” 33.1 MB 57.6 MB
Disease Symptoms Disease Symptom Prediction Helps to create a disease-prediction or healthcare system. A dataset to provide students a source to create a healthcare-related system. A project using double Decision Tree Classification is available at https://github.com/itachi9604/healthcare-chatbot; a get_dummies-processed file is available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv. Content: there are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language. β€” adapted from the dataset's Kaggle description (itachi9604/disease-symptom-description-dataset). https://github.com/itachi9604/healthcare-chatbot Tabular (CSV) CC-BY-SA-4.0 4,920 β€” 0.0 MB 0.1 MB
Docmatix (zero-shot subset) HuggingFaceM4/Docmatix β€” zero-shot evaluation subset Small zero-shot evaluation subset of Docmatix, the synthetic Doc-VQA training corpus released with Idefics3 (~1M images total, 9.5M Q&A pairs in the full set). Each row pairs page images with question/answer tuples. https://huggingface.co/datasets/HuggingFaceM4/Docmatix Tabular (Parquet) MIT 1,900 1 604.5 MB 615.4 MB
Dry Bean UCI ML Repository β€” Dry Bean Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains. Seven different types of dry beans were used in this research, taking into account features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish the seven registered varieties, which have similar features, in order to obtain uniform seed classification. https://archive.ics.uci.edu/dataset/602/dry+bean+dataset Tabular (CSV) CC-BY-4.0 13,611 β€” 1.7 MB 0.8 MB
Electric Motor Temperature Electric Motor Temperature 185 hours of recordings from a permanent magnet synchronous motor (PMSM). UPDATE 26.04.2021: all data is deanonymized now. Moreover, 17 additional measurement profiles were added, expanding the dataset from 138 hours to 185 hours of records. The data set comprises sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench. β€” adapted from the dataset's Kaggle description (wkirgsn/electric-motor-temperature). https://github.com/upb-lea/deep-pmsm Tabular (CSV) CC-BY-SA-4.0 1,330,816 β€” 84.2 MB 102.5 MB
ElectricityLoadDiagrams20112014 UCI ML Repository β€” ElectricityLoadDiagrams20112014 This data set contains the electricity consumption of 370 points/clients. The data set has no missing values. Values are in kW at 15-minute intervals; to convert to kWh, divide the values by 4. Each column represents one client. Some clients were created after 2011; in these cases consumption was considered zero. All time labels refer to Portuguese time, and all days present 96 measures (24*4). Every year on the March time-change day (which has only 23 hours), the values between 1:00 am and 2:00 am are zero for all points. Every year on the October time-change day (which has 25 hours), the values between 1:00 am and 2:00 am aggregate the consumption of two hours. https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014 Tabular (CSV) CC-BY-4.0 140,256 β€” 40.6 MB 52.1 MB
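The kW-to-kWh conversion this row describes is simple arithmetic: each value is average power over a 15-minute window, so energy per interval is the reading divided by 4. A minimal sketch (the function names are illustrative, not part of the dataset):

```python
def kw_to_kwh(readings_kw):
    """Convert 15-minute kW load readings to kWh per interval.

    Each value is average power (kW) over a 15-minute window, so
    energy = power * 0.25 h, i.e. divide by 4 as the dataset notes.
    """
    return [v / 4.0 for v in readings_kw]


def daily_energy_kwh(readings_kw):
    """Total kWh for one day of 96 readings (24 h * 4 per hour)."""
    return sum(kw_to_kwh(readings_kw))
```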
Emissions by Country Emissions by Country Quantifying sources and emission levels of CO2 by country. It contains information on total emissions as well as emissions from coal, oil, gas, cement production and flaring, and other sources. The data also provides a breakdown of per capita CO2 emissions per country, showing which countries lead in pollution levels and identifying potential areas where reduction efforts should be concentrated. This dataset is essential for anyone who wants to get informed about their own environmental footprint or conduct research on international development trends. β€” adapted from the dataset's Kaggle description (thedevastator/global-fossil-co2-emissions-by-country-2002-2022). https://zenodo.org/record/7215364 Tabular (CSV) CC0-1.0 63,104 β€” 0.7 MB 1.1 MB
Emotions NLP Emotions Dataset for NLP Emotions dataset for NLP classification tasks. A few questions your emotion classification model can answer based on customer reviews: What is the sentiment of a customer comment? What is the mood of today's special food? β€” adapted from the dataset's Kaggle description (praveengovi/emotions-dataset-for-nlp). https://www.aclweb.org/anthology/D18-1404/ Tabular (Parquet) CC-BY-SA-4.0 416,809 β€” 16.1 MB 19.3 MB
Energy Efficiency UCI ML Repository β€” Energy Efficiency This study looked into assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real-valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer. https://archive.ics.uci.edu/dataset/242/energy+efficiency Tabular (CSV) CC-BY-4.0 768 β€” 0.0 MB 0.0 MB
Estimation of Obesity Levels Based On Eating Habits and Physical Condition UCI ML Repository β€” Estimation of Obesity Levels Based On Eating Habits and Physical Condition This dataset includes data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition Tabular (CSV) CC-BY-4.0 2,111 β€” 0.1 MB 0.1 MB
Exoplanet Hunting in Deep Space Exoplanet Hunting in Deep Space Kepler labelled time series data for the search for new Earths. The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1: a label of 2 indicates that the star is confirmed to have at least one exoplanet in orbit; some observations are in fact multi-planet systems. β€” adapted from the dataset's Kaggle description (keplersmachines/kepler-labelled-time-series-data). https://github.com/winterdelta/keplersmachines Tabular (CSV) CC0-1.0 5,087 β€” 102.2 MB 116.2 MB
F1 Championship Formula 1 World Championship (1950 - 2024) F1 race data from 1950 to 2024. Formula 1 (a.k.a. F1 or Formula One) is the highest class of single-seater auto racing sanctioned by the FΓ©dΓ©ration Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. β€” adapted from the dataset's Kaggle description (rohanrao/formula-1-world-championship-1950-2020). https://ergast.com/mrd/ Tabular (CSV) CC0-1.0 26,759 β€” 0.4 MB 0.7 MB
FineTranslations (Swedish sample) HuggingFaceFW/finetranslations β€” Swedish parallel-text sample Single 2 GB shard of HuggingFace's 1+T-token translation corpus β€” Swedish-anchored parallel-text. Schema centers on paired source/target text plus translation-quality metadata. Bound here to the first shard (swe_Latn/train/0000.parquet at refs/convert/parquet); the full Swedish split is 182 shards Γ— 2 GB = ~365 GB. Flip allow_patterns to swe_Latn/train/*.parquet for the full corpus, or to <lang>_<script>/train/0000.parquet for any of the other ~600 language anchors at the same one-shard size. https://huggingface.co/datasets/HuggingFaceFW/finetranslations Tabular (Parquet) ODC-By-1.0 321,000 1 1,450.3 MB 1,853.2 MB
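The allow_patterns shard selection mentioned in this row behaves like glob filtering over the repo's file paths. A minimal local sketch with Python's fnmatch (select_shards is an illustrative helper, not a real API; an actual mirror would pass allow_patterns to huggingface_hub's snapshot_download):

```python
from fnmatch import fnmatch

# Patterns from the catalog entry: one shard for a sample,
# or a glob for the full Swedish split.
SAMPLE_PATTERN = "swe_Latn/train/0000.parquet"
FULL_SPLIT_PATTERN = "swe_Latn/train/*.parquet"


def select_shards(repo_files, allow_patterns):
    """Keep the repo files matching any of the glob patterns,
    mimicking huggingface_hub-style allow_patterns filtering."""
    return [f for f in repo_files
            if any(fnmatch(f, p) for p in allow_patterns)]
```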
FitBit Tracker FitBit Fitness Tracker Data Pattern recognition with tracker data: improve your overall health. This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). β€” adapted from the dataset's Kaggle description (arashnic/fitbit). https://zenodo.org/record/53894#.YMoUpnVKiP9 Tabular (CSV) CC0-1.0 1,397 β€” 0.0 MB 0.1 MB
Football Results International football results from 1872 to 2026 An up-to-date dataset of over 49,000 international football results. Well, what happened was that I was looking for a semi-definite easy-to-read list of international football matches and couldn't find anything decent. So I took it upon myself to collect it for my own use. I might as well share it. β€” adapted from the dataset's Kaggle description (martj42/international-football-results-from-1872-to-2017). https://github.com/martj42/international_results Tabular (CSV) CC0-1.0 49,328 β€” 0.4 MB 0.5 MB
Forest Fires UCI ML Repository β€” Forest Fires This is a difficult regression task, where the aim is to predict the burned area of forest fires in the northeast region of Portugal using meteorological and other data (see details at: http://www.dsi.uminho.pt/~pcortez/forestfires). In [Cortez and Morais, 2007], the output 'area' was first transformed with a ln(x+1) function. Then, several data mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 10-fold cross-validation over 30 runs, and two regression metrics were measured: MAD and RMSE. https://archive.ics.uci.edu/dataset/162/forest+fires Tabular (CSV) CC-BY-4.0 517 β€” 0.0 MB 0.0 MB
FRAMES google/frames-benchmark β€” Factuality, Retrieval, And reasoning MEasurement Set 824 multi-hop fact questions designed to require both retrieval (across multiple Wikipedia articles) and reasoning. Each row carries the Prompt, Answer, and wiki_links (the relevant source URLs). Small but structurally rich for retrieval-augmented eval. https://huggingface.co/datasets/google/frames-benchmark Tabular (Parquet) Apache-2.0 824 1 0.2 MB 0.2 MB
German Traffic Signs GTSRB - German Traffic Sign Recognition Benchmark Multi-class, single-image classification challenge. We cordially invite researchers from relevant fields to participate: the competition is designed to allow for participation without special domain knowledge. Our benchmark has the following properties: single-image, multi-class classification; more than 40 classes; more than 50,000 images in total; a large, lifelike database. β€” adapted from the dataset's Kaggle description (meowmeowmeowmeowmeow/gtsrb-german-traffic-sign). http://benchmark.ini.rub.de/ Tabular (CSV) CC0-1.0 12,630 β€” 0.1 MB 0.1 MB
GHCN-Daily NOAA Global Historical Climatology Network (Daily) 3.17B daily weather observations from NOAA's Global Historical Climatology Network β€” surface-station readings since 1763. One row per (station, day, element) with min/max temperature, precipitation, snowfall, etc. The reference dataset for century-scale climate time-series analysis. https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily Custom US-Government-PD 3,178,406,394 β€” 5,975.8 MB 10,492.7 MB
Glass Classification Glass Classification Can you correctly identify glass type? https://archive.ics.uci.edu/ Tabular (CSV) DbCL-1.0 214 β€” 0.0 MB 0.0 MB
GloVe 6B 100d GloVe 6B Global Vectors (100-dimensional) 400,000 English word embeddings at 100 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The middle of the three GloVe-6B slugs; the dimension most commonly cited in classic NLP papers. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 143.9 MB 128.6 MB
GloVe 6B 200d GloVe 6B Global Vectors (200-dimensional) 400,000 English word embeddings at 200 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The largest of the three GloVe-6B slugs; higher-fidelity vectors at 4Γ— the storage. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 285.3 MB 255.1 MB
GloVe 6B 50d GloVe 6B Global Vectors (50-dimensional) 400,000 English word embeddings at 50 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The smallest of the three GloVe-6B slugs. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 73.3 MB 68.3 MB
goodbooks-10k goodbooks-10k Ten thousand books, one million ratings. Also books marked to-read, and tags. This version of the dataset is obsolete: it contains duplicate ratings (same user_id, book_id), as reported by Philipp Spachtholz in his illustrious notebook. The current version has duplicates removed and more ratings (six million), sorted by time. β€” adapted from the dataset's Kaggle description (zygmunt/goodbooks-10k). https://github.com/zygmuntz Tabular (CSV) CC-BY-SA-4.0 5,976,479 β€” 18.7 MB 23.2 MB
Google Cluster Trace 2011 β€” machine_events Google Cluster Trace v2 β€” machine_events table (2011) Machine-add/remove/update events from Google's 29-day production-cluster trace (May 2011, ~12.5K-machine cluster). One row per event: timestamp (microseconds since trace start), machine ID, event type (ADD / REMOVE / UPDATE), platform ID (string-hashed), and CPU+memory capacity. Six columns with no header in the source file β€” column names autogenerate as f0..f5. Schema documented at the upstream schema.csv. Subset of the broader cluster trace (job_events, task_events, task_usage, machine_attributes, task_constraints β€” all present in the same GCS bucket). machine_events is a single ~347 KB compressed file; the task_events / task_usage tables are 500 parts each and ~50 GB total. https://github.com/google/cluster-data Tabular (CSV) CC-BY-3.0 37,780 1 0.3 MB 0.4 MB
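Since machine_events ships without a header row, assigning the documented column names at parse time avoids the autogenerated f0..f5. A minimal stdlib sketch (parse_machine_events and the column-name list are assumptions based on this row and the upstream schema.csv, not an official loader):

```python
import csv
import io

# Column names per the upstream schema.csv; without them the six
# headerless columns would surface as f0..f5.
MACHINE_EVENT_COLUMNS = [
    "timestamp", "machine_id", "event_type",
    "platform_id", "cpu_capacity", "memory_capacity",
]


def parse_machine_events(text):
    """Parse headerless machine_events CSV text into dicts keyed by
    the documented column names; empty fields become None."""
    reader = csv.reader(io.StringIO(text))
    return [
        {name: (value if value != "" else None)
         for name, value in zip(MACHINE_EVENT_COLUMNS, record)}
        for record in reader
    ]
```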
GSM8K Grade School Math 8K (Cobbe et al., 2021) 8.5k grade-school math word problems with step-by-step natural-language reasoning in the answer (Cobbe et al., 2021). Each problem requires 2–8 elementary arithmetic steps. The standard arithmetic-reasoning eval for chain-of-thought prompting; this is the main config (a sibling socratic config exists upstream but is omitted here). https://huggingface.co/datasets/openai/gsm8k Tabular (Parquet) MIT 8,792 β€” 1.8 MB 2.5 MB
Hacker News Hacker News posts + comments archive 28.7M Hacker News posts and comments spanning the site's full history (2007 onward). Single flat parquet from Google's bigquery-public-data export; each row carries id, type (story/comment/poll), author, time, parent, title, text, and url. Standard dataset for Hacker News thread analysis. https://news.ycombinator.com/ Tabular (Parquet) Public 41,813,385 β€” 6,761.8 MB 8,489.9 MB
Heart Disease UCI ML Repository β€” Heart Disease 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). https://archive.ics.uci.edu/dataset/45/heart+disease Tabular (CSV) CC-BY-4.0 303 β€” 0.0 MB 0.0 MB
Heart Disease Health Indicators Dataset Heart Disease Health Indicators Dataset 253,680 survey responses from cleaned BRFSS 2015 - binary classification. Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year β€” making it the leading cause of death. The buildup of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease. β€” adapted from the dataset's Kaggle description (alexteboul/heart-disease-health-indicators-dataset). https://www.cdc.gov/brfss/annual_data/annual_2015.html Custom CC0-1.0 441,456 β€” 33.1 MB 57.6 MB
Heart Disease Indicators Indicators of Heart Disease (2022 UPDATE) 2022 annual CDC survey data of 400k+ adults related to their health status. According to the CDC, heart disease is a leading cause of death for people of most races in the U.S. (African Americans, American Indians and Alaska Natives, and whites). About half of all Americans (47%) have at least 1 of 3 major risk factors for heart disease: high blood pressure, high cholesterol, and smoking. β€” adapted from the dataset's Kaggle description (kamilpytlak/personal-key-indicators-of-heart-disease). https://www.cdc.gov/brfss/annual_data/annual_2022.html Custom CC0-1.0 445,132 β€” 26.8 MB 54.2 MB
Heart Failure Clinical Records UCI ML Repository β€” Heart Failure Clinical Records This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features. A detailed description of the dataset can be found in the Dataset section of the following paper: Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone". BMC Medical Informatics and Decision Making 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5 https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records Tabular (CSV) CC-BY-4.0 299 β€” 0.0 MB 0.0 MB
HellaSwag HellaSwag β€” Commonsense NLI (Zellers et al., ACL 2019) 70k commonsense-NLI multiple-choice questions: pick the plausible ending to a video-caption or how-to context. Adversarial filtering via models-of-the-day was used to keep the wrong endings hard (Zellers et al., ACL 2019). Standard benchmark for commonsense reasoning in LLMs. https://huggingface.co/datasets/Rowan/hellaswag Tabular (Parquet) MIT 59,950 β€” 23.2 MB 30.2 MB
HelpSteer2 nvidia/HelpSteer2 β€” open reward-model training data ~10K human-rated prompt/response pairs scored on five attributes: helpfulness, correctness, coherence, complexity, verbosity (each 0–4). NVIDIA's open-source reward-model training set. Each row carries prompt, response, plus the five score columns. https://huggingface.co/datasets/nvidia/HelpSteer2 Tabular (Parquet) CC-BY-4.0 21,362 1 12.6 MB 20.7 MB
Hotel Booking Hotel Booking Demand From the paper: hotel booking demand datasets. Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? β€” adapted from the dataset's Kaggle description (jessemostipak/hotel-booking-demand). https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md Tabular (CSV) CC-BY-4.0 119,390 β€” 0.9 MB 1.5 MB
HotpotQA (fullwiki) hotpotqa/hotpot_qa β€” multi-hop QA, fullwiki configuration 113K multi-hop QA over Wikipedia. Fullwiki config: each question must be answered by reasoning across multiple Wikipedia paragraphs. Each row carries question, answer, supporting_facts (list of title/sentence-id pairs), and context (list of relevant paragraphs). https://huggingface.co/datasets/hotpotqa/hotpot_qa Tabular (Parquet) CC-BY-SA-4.0 105,257 1 263.0 MB 342.9 MB
Housing Prices Dataset Housing Prices Dataset Housing Prices Prediction - a regression problem. A simple yet challenging project: predict the housing price based on factors like house area, bedrooms, furnishing, and nearness to the main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model? β€” adapted from the dataset's Kaggle description (yasserh/housing-prices-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Housing_Price_Prediction_/main/hs.jpg Tabular (CSV) CC0-1.0 545 β€” 0.0 MB 0.0 MB
Human Activity Recognition Using Smartphones UCI ML Repository β€” Human Activity Recognition Using Smartphones Human Activity Recognition database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones Tabular (CSV) CC-BY-4.0 10,299 β€” 26.4 MB 19.7 MB
HumanEval openai/openai_humaneval (Chen et al., 2021) Canonical 164-problem Python coding benchmark. Each row carries task_id, prompt (function signature + docstring), canonical_solution, test, and entry_point. The standard 'pass@k' eval target. https://huggingface.co/datasets/openai/openai_humaneval Tabular (Parquet) MIT 164 1 0.1 MB 0.1 MB
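The 'pass@k' metric this benchmark targets is usually computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), where n samples were drawn for a
    problem and c of them passed the tests."""
    if n - c < k:
        # Fewer failures than k draws: some sample must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```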
Individual Household Electric Power Consumption UCI ML Repository β€” Individual Household Electric Power Consumption Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available. This archive contains 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months). Notes: 1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt-hours) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3. 2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption Tabular (CSV) CC-BY-4.0 2,075,259 β€” 9.0 MB 22.3 MB
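Note 1 in this row is a direct formula for the per-minute energy not covered by the three sub-meters; a minimal sketch (the function name is illustrative):

```python
def unmetered_energy_wh(global_active_power_kw, sub1, sub2, sub3):
    """Active energy (watt-hours) consumed in one minute by equipment
    not covered by sub-meterings 1-3, per the dataset notes:
    global_active_power * 1000 / 60 - sub_1 - sub_2 - sub_3.

    global_active_power is in kW; the sub-meterings are in Wh.
    """
    return global_active_power_kw * 1000.0 / 60.0 - sub1 - sub2 - sub3
```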
Iowa Liquor Sales Iowa Liquor Sales 12 million alcoholic beverage sales in the Midwest. The Iowa Department of Commerce requires that every store that sells alcohol in bottled form for off-the-premises consumption must hold a class "E" liquor license (an arrangement typical of most state alcohol regulatory bodies). All alcoholic sales made by stores so registered with the Iowa Department of Commerce are logged in the Commerce department system, which is in turn published as open data by the State of Iowa. This dataset contains information on the name, kind, price, quantity, and location of sale of individual containers or packages of containers of alcoholic beverages. β€” adapted from the dataset's Kaggle description (residentmario/iowa-liquor-sales). https://gist.github.com/dannguyen/18ed71d3451d147af414 Tabular (CSV) CC0-1.0 12,591,077 β€” 281.4 MB 350.1 MB
Iris UCI ML Repository β€” Iris A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods. This is one of the earliest datasets used in the literature on classification methods and widely used in statistics and machine learning. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain. This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick@espeedaz.net). https://archive.ics.uci.edu/dataset/53/iris Tabular (CSV) CC-BY-4.0 150 β€” 0.0 MB 0.0 MB
JSONBench (Bluesky 100m) ClickHouse JSONBench β€” Bluesky firehose 100 M records 100 M Bluesky firehose events (likes / follows / posts / reposts / ...) stored as a single VARIANT column; benchmark dataset for semi-structured workloads. https://github.com/ClickHouse/JSONBench Custom Apache-2.0 100,000,000 β€” 11,411.6 MB β€”
Kepler Exoplanet Search Results Kepler Exoplanet Search Results 10,000 exoplanet candidates examined by the Kepler Space Observatory. The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission. β€” adapted from the dataset's Kaggle description (nasa/kepler-exoplanet-search-results). https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html Tabular (CSV) CC0-1.0 9,564 β€” 3.2 MB 3.2 MB
LibriSpeech (test-clean) openslr/librispeech_asr β€” clean test split (sample) LibriSpeech ASR clean test split: ~2.6K read-aloud utterances from audiobooks with verbatim transcripts. Each row carries audio (struct<bytes, path>), text, speaker_id, chapter_id. Sample of the full ~1000-hour corpus; flip allow_patterns to all/train.clean.100/*.parquet (or 360/500) for the bigger splits. https://huggingface.co/datasets/openslr/librispeech_asr Tabular (Parquet) CC-BY-4.0 2,620 1 330.0 MB 350.5 MB
Loan Default Dataset Loan Default Dataset Loan default classification problem. Banks earn major revenue from lending, but lending is often associated with risk: borrowers may default on the loan. β€” adapted from the dataset's Kaggle description (yasserh/loan-default-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Loan_Default_Risk_Expectancy_/main/loan.jpg Tabular (CSV) CC0-1.0 148,670 β€” 2.9 MB 2.9 MB
Lung Cancer Lung Cancer Does smoking cause lung cancer? https://archive.ics.uci.edu/ Tabular (CSV) CC0-1.0 32 β€” 0.0 MB 0.1 MB
MAGIC Gamma Telescope UCI ML Repository β€” MAGIC Gamma Telescope The data are MC generated to simulate registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope Tabular (CSV) CC-BY-4.0 19,020 β€” 1.0 MB 0.5 MB
Marketing Analytics Marketing Analytics Practice exploratory and statistical analysis with marketing data. This data is publicly available on GitHub and can be utilized for EDA, statistical analysis, and visualizations. The data set ifood_df.csv consists of 2,206 customers of the XYZ company, with data on customer profiles, product preferences, campaign successes/failures, and channel performance. Acknowledgement: I do not own this dataset. β€” adapted from the dataset's Kaggle description (jackdaoud/marketing-data). https://github.com/nailson/ifood-data-business-analyst-test Tabular (CSV) CC0-1.0 2,240 β€” 0.1 MB 0.1 MB
MBPP google-research-datasets/mbpp β€” Mostly Basic Python Problems 974 short Python coding problems with reference solutions and three test cases each. Each row carries text (problem statement), code (reference solution), test_list, and test_setup_code. Smaller complement to HumanEval at the entry-level coding-eval tier. https://huggingface.co/datasets/google-research-datasets/mbpp Tabular (Parquet) CC-BY-4.0 974 1 0.1 MB 0.2 MB
Medical Cost Medical Cost Personal Datasets Health Insurance Premium charges based on Gender, BMI and other characteristics. This Dataset is something I found online when I wanted to practice regression models. It is an openly available online dataset at multiple places. Though I do not know the exact origin and collection methodology of the data, I would recommend this dataset to everybody who is just beginning their journey in Data science. β€” adapted from the dataset's Kaggle description (simranjain17/insurance). https://github.com/stedy/Machine-Learning-with-R-datasets Tabular (CSV) DbCL-1.0 1,338 β€” 0.0 MB 0.0 MB
MedMCQA openlifescienceai/medmcqa β€” Indian medical-entrance exam MCQ 194K multiple-choice questions from Indian medical-entrance exams (AIIMS, NEET-PG). Each row carries question, four opa..opd answer options, cop (correct option index), subject_name, topic_name, and exp (explanation). https://huggingface.co/datasets/openlifescienceai/medmcqa Tabular (Parquet) Apache-2.0 193,155 1 53.7 MB 70.8 MB
mlcourse.ai mlcourse.ai Datasets and notebooks of the open Machine Learning course mlcourse.ai. mlcourse.ai is an open Machine Learning course by OpenDataScience (ods.ai), led by Yury Kashnitsky (yorko). Holding both a Ph.D. in applied math and a Kaggle Competitions Master tier, Yury aimed at designing an ML course with a perfect balance between theory and practice. β€” adapted from the dataset's Kaggle description (kashnitsky/mlcourse). https://github.com/Yorko/mlcourse.ai Tabular (CSV) CC-BY-NC-SA-4.0 32,561 β€” 0.3 MB 0.3 MB
MMLU Massive Multitask Language Understanding (Hendrycks et al., 2021) 57-subject multiple-choice eval covering STEM, humanities, social sciences, and professional topics (Hendrycks et al., 2021). 14k test questions with 4 answer choices each. Loaded from the all config (subjects merged into a single parquet with a subject column for grouping); the per-subject configs are not pulled in. The most-cited general-knowledge LLM benchmark. https://huggingface.co/datasets/cais/mmlu Tabular (Parquet) MIT 115,700 β€” 27.9 MB 85.9 MB
MMLU-Pro TIGER-Lab/MMLU-Pro β€” improved 12k-question multitask eval 12k harder multiple-choice questions across 14 categories, with 10 answer options per question (vs MMLU's 4) and chain-of-thought reasoning included. Successor to cais/mmlu we ship as Tier A; small (4 MB) but high-value for LLM evaluation tooling. https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro Tabular (Parquet) MIT 12,102 β€” 2.6 MB 4.5 MB
MMMLU (multilingual MMLU) openai/MMMLU β€” MMLU professionally translated into 14 languages MMLU's 14k test questions translated into 14 languages by professional human translators (released by OpenAI). 14 CSV shards keyed by <LANG>-<COUNTRY> filename; concatenated into one parquet with the language tag preserved as a split column. First multilingual-eval slug in the catalog. https://huggingface.co/datasets/openai/MMMLU Tabular (CSV) MIT 196,588 β€” 36.3 MB 61.6 MB
MMMU MMMU/MMMU β€” Massive Multi-discipline Multimodal Understanding 11.5K multimodal college-level questions across 30 disciplines (art, business, science, health, humanities, tech). Each row carries question text, up to 7 reference images (binary blobs), 4 answer choices, and a discipline label. Comprehensive cross-modal expert-AGI eval. https://huggingface.co/datasets/MMMU/MMMU Tabular (Parquet) Apache-2.0 11,550 1 3,465.0 MB 3,505.0 MB
MNIST ylecun/mnist β€” handwritten digit classification 70K 28x28 grayscale handwritten digit images (60k train + 10k test) with integer 0–9 labels. The canonical small image-classification benchmark. Image column is binary-blob PNG bytes. https://huggingface.co/datasets/ylecun/mnist Tabular (Parquet) MIT 70,000 1 15.1 MB 19.5 MB
Movie Industry Movie Industry Movies dataset for recommendation system. Welcome to the Movie Recommendation Dataset! This dataset is curated for building recommendation systems in the fascinating world of movies. Whether you're a data scientist, machine learning enthusiast, or a movie buff, this dataset provides a rich collection of information about various movies, offering endless possibilities for analysis and recommendation system development. β€” adapted from the dataset's Kaggle description (abdallahwagih/movies). https://github.com/Juanets/movie-stats Tabular (CSV) CC0-1.0 7,668 β€” 0.4 MB 0.5 MB
Mushroom UCI ML Repository β€” Mushroom From the Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom β€” no rule like "leaflets three, let it be" for poisonous oak and ivy. https://archive.ics.uci.edu/dataset/73/mushroom Tabular (CSV) CC-BY-4.0 8,124 β€” 0.0 MB 0.1 MB
NBA Database NBA Database NBA rookies classification. This is publicly available data that has been scraped from NBA statistics. The data is from between 1990 and 2016. Each row describes the performance of a basketball player during their first ('rookie') year. β€” adapted from the dataset's Kaggle description (tombutton/basketball). https://github.com/wyattowalsh/nba-db Tabular (CSV) CC-BY-SA-4.0 1,308 β€” 0.0 MB 0.1 MB
New York City Airbnb Open Data New York City Airbnb Open Data Airbnb listings and metrics in NYC, NY, USA (2019). Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019. This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions. β€” adapted from the dataset's Kaggle description (dgomonov/new-york-city-airbnb-open-data). http://data.insideairbnb.com/united-states/ny/new-york-city/ Tabular (CSV) CC0-1.0 48,895 β€” 2.1 MB 2.2 MB
News Headlines Sarcasm News Headlines Dataset For Sarcasm Detection High-quality dataset for the task of sarcasm and fake-news detection. Past studies in sarcasm detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. β€” adapted from the dataset's Kaggle description (rmisra/news-headlines-dataset-for-sarcasm-detection). https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection Tabular (Parquet) CC-BY-4.0 26,709 β€” 0.8 MB 0.9 MB
NIPS Papers NIPS Papers Titles, authors, abstracts, and extracted text for all NIPS papers (1987-2017). Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world. It covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning. This dataset includes the title, authors, abstracts, and extracted text for all NIPS papers to date (from the first conference in 1987 through 2017). β€” adapted from the dataset's Kaggle description (benhamner/nips-papers). https://github.com/benhamner/nips-papers Tabular (CSV) ODbL-1.0 9,680 β€” 119.4 MB 188.0 MB
No Robots HuggingFaceH4/no_robots β€” 10k human-written instruction-tuning examples 10k high-quality human-written prompt/response pairs from HuggingFaceH4. Each row carries messages: list<struct<role, content>> plus a category label across 10 task types. Smaller but cleaner counterpart to OpenOrca / OpenAssistant in the instruction-tuning corner of the catalog. https://huggingface.co/datasets/HuggingFaceH4/no_robots Tabular (Parquet) CC-BY-NC-4.0 20,000 β€” 13.8 MB 18.3 MB
NYC 311 NYC 311 Service Requests (2020–present) 20.9M non-emergency service requests filed with NYC 311 from 2020 to present. Covers complaints (noise, sanitation, illegal parking, etc.), inquiries, and service requests β€” each row carries borough, agency, category, complaint type, location, and resolution metadata. https://www.nyc.gov/home/terms-of-use.page Tabular (CSV) NYC Open Data (public) 21,007,848 β€” 1,480.6 MB 2,006.2 MB
NYC Parking Tickets NYC Parking Tickets 42.3M Rows of Parking Ticket Data, Aug 2013-June 2017. The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year!). This data is made publicly available to aid in ticket resolution and to guide policymakers. There are four files, covering Aug 2013-June 2017. β€” adapted from the dataset's Kaggle description (new-york-city/nyc-parking-tickets). https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2024/pvqr-7yc4 Tabular (CSV) CC0-1.0 11,353,336 β€” 227.1 MB 375.8 MB
NYC Property Sales NYC Property Sales A year's worth of properties sold on the NYC real estate market. This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period. This dataset contains the location, address, type, sale price, and sale date of building units sold. A reference on the trickier fields: BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). β€” adapted from the dataset's Kaggle description (new-york-city/nyc-property-sales). https://www.nyc.gov/site/finance/property/property-rolling-sales-data.page Custom CC0-1.0 81,567 β€” 1.3 MB 2.3 MB
NYC TLC FHV 2025 NYC TLC FHV Trip Data 2025 Pre-app for-hire-vehicle trips (livery, black-car, luxury) reported by base companies to the TLC for 2025. Smaller and patchier than the high-volume Uber/Lyft data; useful as a contrast to the FHVHV record. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 25,047,544 β€” 253.1 MB 197.2 MB
NYC TLC Green 2025 NYC TLC Green Trip Data 2025 NYC's outer-borough Boro Taxis (street-hail livery cabs introduced in 2013) for calendar year 2025. Pickups are restricted to areas outside Manhattan's central business district plus the airports β€” complements the medallion-fleet yellow data with a different geographic footprint. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 591,375 β€” 12.5 MB 10.4 MB
NYC TLC HVFHV 2025 NYC TLC HVFHV Trip Data 2025 High-volume for-hire-vehicle trips β€” the post-2019 Uber, Lyft, Via, and Juno records the TLC began collecting after the rideshare-cap legislation. Far larger than the medallion or non-HV FHV streams; the dominant share of NYC's ride-hail data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 243,589,684 β€” 5,606.6 MB 5,273.4 MB
NYC TLC Yellow 2025 NYC TLC Yellow Trip Data 2025 Manhattan's iconic medallion taxis (the yellow cabs) β€” every metered trip recorded by the TLC for calendar year 2025. Rides are concentrated in Manhattan and the airports; pickups are exclusive to medallion holders. Long-running monthly time series; the go-to dataset for OLAP demos. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 48,722,602 β€” 857.0 MB 784.2 MB
NYPD Complaints NYPD Complaint Data Historic Historic NYPD complaint records from 2006 forward β€” every felony, misdemeanor, and violation reported to police. Each row has incident type, premises, suspect/victim demographics, location (precinct, borough, lat/lon), and dates. https://www.nyc.gov/home/terms-of-use.page Tabular (CSV) NYC Open Data (public) 10,071,507 β€” 320.3 MB 389.1 MB
Online Retail UCI ML Repository β€” Online Retail This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. https://archive.ics.uci.edu/dataset/352/online+retail Tabular (CSV) CC-BY-4.0 541,909 β€” 2.9 MB 3.3 MB
Online Retail II UCI ML Repository β€” Online Retail II A real online retail transaction data set of two years. This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. https://archive.ics.uci.edu/dataset/502/online+retail+ii Tabular (CSV) CC-BY-4.0 1,067,371 β€” 5.9 MB 6.9 MB
Online Shoppers Purchasing Intention Dataset UCI ML Repository β€” Online Shoppers Purchasing Intention Dataset Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples ending with shopping. The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset Tabular (CSV) CC-BY-4.0 12,330 β€” 0.2 MB 0.3 MB
Open Food Facts Open Food Facts product database Crowd-sourced product facts for 4.4M packaged food products. Each row carries deeply nested nutrition, ingredient, allergen, and labelling metadata (origin, packaging, traffic-light scores). One of the larger heavily-nested JSON-shaped corpora in the catalog; the current build ships the canonical JSONL as a single raw_json: string column. https://world.openfoodfacts.org/data Custom ODbL-1.0 4,466,927 β€” 12,910.8 MB 36,431.3 MB
OpenAssistant Conversations (oasst1) OpenAssistant Conversations Release 1 (KΓΆpf et al., NeurIPS 2023) 85k crowd-authored assistant conversation messages organized as tree-structured threads, spanning 35 languages (KΓΆpf et al., NeurIPS 2023). Each row carries quality, toxicity, and emoji-feedback labels alongside the message text and parent pointer. First public RLHF-grade conversation corpus; powers OpenAssistant and many downstream fine-tunes. https://huggingface.co/datasets/OpenAssistant/oasst1 Tabular (Parquet) Apache-2.0 88,838 β€” 27.0 MB 43.2 MB
OpenLibrary Authors Internet Archive OpenLibrary β€” Authors Bibliographic records for book authors β€” name variants, birth/death dates, Wikidata cross-references, biographical notes. Joins to openlibrary-works via author keys. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 15,177,329 β€” 809.1 MB 2,061.5 MB
OpenLibrary Editions Internet Archive OpenLibrary β€” Editions Bibliographic records for individual book editions (ISBN, publisher, language, page count, physical format). Each edition ties back to a works row via the work key. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 55,962,700 β€” 12,931.6 MB 28,298.0 MB
OpenLibrary Works Internet Archive OpenLibrary β€” Works Bibliographic records for literary works (the abstract concept of a book β€” title, author, subjects). Each work has many editions; shipped as a separate slug. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 40,981,783 β€” 4,245.1 MB 8,826.4 MB
OpenOrca OpenOrca (Open-Orca, 2023) β€” GPT-4 + GPT-3.5 augmented FLAN ~4.2M instruction-response pairs generated via the Orca self-augmentation method, drawn from FLAN-Collection prompts answered by GPT-4 (1M) and GPT-3.5 (3.2M). Concatenated into one parquet with a split column distinguishing the two model sources. https://huggingface.co/datasets/Open-Orca/OpenOrca Tabular (Parquet) MIT 4,233,923 β€” 2,713.5 MB 3,610.5 MB
OpenPowerlifting OpenPowerlifting meet results 3.9M competition-lift records from powerlifting meets worldwide, maintained by openpowerlifting.org. One row per lift attempt with lifter, federation, weight class, equipment, and the four scores (squat / bench / deadlift / total). https://www.openpowerlifting.org/data Tabular (CSV) CC0-1.0 3,916,281 β€” 104.8 MB 156.5 MB
Optical Recognition of Handwritten Digits UCI ML Repository β€” Optical Recognition of Handwritten Digits Two versions of this database are available. We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and a different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels is counted in each block. This generates an 8x8 input matrix where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions. For info on NIST preprocessing routines, see M. D. Garris et al. https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits Tabular (CSV) CC-BY-4.0 5,620 β€” 0.2 MB 0.2 MB
OSM Germany Nodes OpenStreetMap Germany β€” Nodes OSM nodes for Germany from the Geofabrik extract β€” point features (addresses, POIs, traffic signals, etc.) with their geographic coordinates and tag bag. Emitted as GeoParquet 1.1 with WKB geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 432,906,290 β€” 15,335.9 MB 21,061.1 MB
OSM Germany Relations OpenStreetMap Germany β€” Relations OSM relations (composite features) for Germany β€” multi-polygon land covers, route memberships, administrative boundaries. The richest of the three OSM-Germany slugs in tag complexity. Emitted as GeoParquet 1.1 with WKB geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 889,712 β€” 91.2 MB 148.5 MB
OSM Germany Ways OpenStreetMap Germany β€” Ways OSM ways (linear features) for Germany from the Geofabrik extract β€” roads, paths, rivers, building outlines, railway lines. Each row carries a tag bag plus the ordered node references. Emitted as GeoParquet 1.1 with WKB LineString / Polygon geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 70,097,667 701 9,514.0 MB 14,789.9 MB
OSMI Mental Health in Tech 2016 OSMI 2016 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. With over 1,400 responses so far, the ongoing 2016 survey aims to measure attitudes towards mental health in the tech workplace and examine the frequency of mental health disorders among tech workers. How will this data be used? We are interested in gauging how mental health is viewed within the tech/IT workplace, and the prevalence of certain mental health disorders within the tech industry. β€” adapted from the dataset's Kaggle description (osmi/mental-health-in-tech-2016). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 1,433 β€” 0.1 MB 0.3 MB
OSMI Mental Health in Tech 2017 OSMI 2017 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. OSMI Mental Health in Tech Survey 2017 The 2017 survey aims to measure attitudes towards mental health in the tech workplace, and examine the frequency of mental health disorders among tech workers. How Will This Data Be Used? We are interested in gauging how mental health is viewed within the tech/IT workplace, and the prevalence of certain mental health disorders within the tech industry. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2017). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 756 β€” 0.2 MB 0.4 MB
OSMI Mental Health in Tech 2018 OSMI 2018 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2018). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 417 β€” 0.2 MB 0.3 MB
OSMI Mental Health in Tech 2019 OSMI 2019 Mental Health in Tech Survey β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2019). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 352 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2020 OSMI 2020 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-2020-mental-health-in-tech-survey-results). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 180 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2021 OSMI 2021 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmh-2021-mental-health-in-tech-survey-results). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 131 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2022 OSMI 2022 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmh-mental-health-in-tech-survey-2022). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 164 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2023 OSMI 2023 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2023). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 6 β€” 0.1 MB 0.1 MB
P3 (eval-subset) bigscience/P3 β€” T0 held-out eval subset (7 prompt templates) Curated 7-config subset of bigscience/P3 (Sanh et al., ICLR 2022) β€” the held-out evaluation tasks used to benchmark T0 / T0pp, each represented by one canonical prompt template. Configs concatenated with a split column distinguishing train / validation / test where available; the source <config> itself is preserved via hf_concat_splits's per-shard tagging. Templates picked: super_glue_rte_GPT_3_style, super_glue_cb_GPT_3_style, super_glue_copa_C1_or_C2_premise_so_because_, super_glue_wic_GPT_3_prompt, super_glue_boolq_GPT_3_Style, winogrande_winogrande_debiased_Replace, hellaswag_Appropriate_continuation_Yes_or_No. Flip the allow-pattern list to any of the 658 configs upstream (see the parquet API for the full inventory) for a different prompt slice. https://huggingface.co/datasets/bigscience/P3 Tabular (Parquet) Apache-2.0 102,963 1 27.4 MB 39.5 MB
Palmer Penguins Palmer Archipelago (Antarctica) penguin data Drop-in replacement for the Iris dataset. Please refer to the official GitHub page for details and license information; the details below have also been taken from there. Artwork: @allisonhorst. β€” adapted from the dataset's Kaggle description (parulpandey/palmer-archipelago-antarctica-penguin-data). https://archive.ics.uci.edu/ Tabular (CSV) CC0-1.0 344 β€” 0.0 MB 0.0 MB
Parkinsons UCI ML Repository β€” Parkinsons Oxford Parkinson's Disease Detection Dataset This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD. The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. https://archive.ics.uci.edu/dataset/174/parkinsons Tabular (CSV) CC-BY-4.0 195 β€” 0.0 MB 0.1 MB
People's Speech (clean validation) MLCommons/peoples_speech β€” clean validation split Validation slice of MLCommons People's Speech, a 30K-hour CC/PD-licensed English supervised conversational ASR corpus. Each row carries audio (struct<bytes, path>) plus alignment metadata. The full corpus is multi-TB; this validation split is a tractable sample. Flip allow patterns to clean/train/*.parquet for the full ~28K-hour clean training set. https://huggingface.co/datasets/MLCommons/peoples_speech Tabular (Parquet) CC-BY-2.0 18,622 1 2,199.5 MB 2,345.0 MB
Phishing Websites UCI ML Repository β€” Phishing Websites This dataset was collected mainly from the PhishTank archive, the MillerSmiles archive, and Google's search operators. One of the challenges faced by our research was the unavailability of reliable training datasets; in fact this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages, hence it is difficult to shape a dataset that covers all possible features. https://archive.ics.uci.edu/dataset/327/phishing+websites Tabular (CSV) CC-BY-4.0 11,055 β€” 0.0 MB 0.1 MB
PleIAs SYNTH PleIAs/SYNTH β€” open generalist synthetic reasoning corpus PleIAs's open synthetic dataset for training small reasoning models, generated from open-weight model outputs (no closed-API ToS issue). 7 train shards. Schema centers on prompt/completion pairs with task-type metadata. https://huggingface.co/datasets/PleIAs/SYNTH Tabular (Parquet) CDLA-Permissive-2.0 777,943 1 1,565.3 MB 2,153.1 MB
Predict Droughts using Weather & Soil Data Predict Droughts using Weather & Soil Data Predicting continental US drought levels using meteorological & soil data. To make using previous drought scores for prediction easier (e.g. by interpolating), I merged them into one file and set the drought scores to NaN where they were not available. The US drought monitor is a measure of drought across the US manually created by experts using a wide range of data. β€” adapted from the dataset's Kaggle description (cdminix/us-drought-meteorological-data). https://github.com/Epistoteles/predicting-drought Tabular (CSV) CC0-1.0 19,300,680 β€” 513.8 MB 529.5 MB
Predict Students' Dropout and Academic Success UCI ML Repository β€” Predict Students' Dropout and Academic Success A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies. The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task, in which there is a strong imbalance towards one of the classes. https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success Tabular (CSV) CC-BY-4.0 4,424 β€” 0.1 MB 0.1 MB
PubMedQA (labeled) qiaojin/PubMedQA β€” 1k expert-labeled biomedical QA 1k expert-annotated biomedical research questions paired with PubMed abstracts and yes/no/maybe answers. Each row carries question, context.contexts (list<string>), long_answer, and final_decision. The pqa_artificial (61k) and pqa_unlabeled (211k) splits exist upstream β€” point allow_patterns at them for the larger set. https://huggingface.co/datasets/qiaojin/PubMedQA Tabular (Parquet) MIT 1,000 1 0.7 MB 1.0 MB
Real Estate Valuation UCI ML Repository β€” Real Estate Valuation The real estate valuation is a regression problem. The market historical data set of real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan. The data set was randomly split into a training set (2/3 of samples) and a testing set (1/3 of samples). https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set Tabular (CSV) CC-BY-4.0 414 β€” 0.0 MB 0.0 MB
San Francisco Building Permits San Francisco Building Permits 5 years and 200k building permits. Background A building permit is an official approval document issued by a governmental agency that allows you or your contractor to proceed with a construction or remodeling project on your property. For more details go to https://www.thespruce.com/what-is-a-building-permit-1398344. Each city or county has its own office related to buildings, which can perform multiple functions like issuing permits, inspecting buildings to enforce safety measures, modifying rules to accommodate the needs of a growing population, etc. β€” adapted from the dataset's Kaggle description (aparnashastry/building-permit-applications-data). https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9/data Tabular (CSV) DbCL-1.0 198,900 β€” 14.6 MB 19.4 MB
Seeds UCI ML Repository β€” Seeds Measurements of geometrical properties of kernels belonging to three different varieties of wheat. A soft X-ray technique and GRAINS package were used to construct all seven, real-valued attributes. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. https://archive.ics.uci.edu/dataset/236/seeds Tabular (CSV) CC-BY-4.0 210 β€” 0.0 MB 0.0 MB
Seoul Bike Sharing Demand UCI ML Repository β€” Seoul Bike Sharing Demand The dataset contains the count of public bicycles rented per hour in the Seoul Bike Sharing System, with corresponding weather data and holiday information. Rental bikes are currently introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time, as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of the bike count required at each hour for a stable supply of rental bikes. https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand Tabular (CSV) CC-BY-4.0 8,760 β€” 0.1 MB 0.1 MB
SF Salaries SF Salaries Explore San Francisco city employee salary data. One way to understand how a city government works is by looking at who it employs and how its employees are compensated. This data contains the names, job title, and compensation for San Francisco city employees on an annual basis from 2011 to 2014. Exploration Ideas To help get you started, here are some data exploration ideas: - How have salaries changed over time between different groups of people? β€” adapted from the dataset's Kaggle description (kaggle/sf-salaries). https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd Tabular (CSV) CC0-1.0 1,096,102 β€” 58.9 MB 51.3 MB
SMS Spam Collection UCI ML Repository β€” SMS Spam Collection The SMS Spam Collection is a public set of labeled SMS messages that have been collected for mobile phone spam research. This corpus has been collected from free or free-for-research sources on the Internet. A collection of 425 SMS spam messages was manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. https://archive.ics.uci.edu/dataset/228/sms+spam+collection Tabular (CSV) CC-BY-4.0 5,574 β€” 0.2 MB 0.3 MB
Spambase	UCI ML Repository β€” Spambase	Classifying email as spam or non-spam. The "spam" concept is diverse: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography, and so on. The classification task for this dataset is to determine whether a given email is spam or not. Our collection of spam e-mails came from our postmaster and individuals who had filed spam; our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter.	https://archive.ics.uci.edu/dataset/94/spambase	Tabular (CSV)	CC-BY-4.0	4,601	β€”	0.2 MB	0.4 MB
SQuAD v2 Stanford Question Answering Dataset v2.0 Stanford Question Answering Dataset v2 β€” 130k crowdsourced questions about Wikipedia paragraphs, with 50k of them deliberately unanswerable from the given context. Each row pairs a question, the passage it was asked about, and the canonical answer span(s). The v2 release added the unanswerable subset specifically to test whether models know when to abstain β€” a longstanding QA-eval weakness. https://huggingface.co/datasets/rajpurkar/squad_v2 Tabular (Parquet) CC-BY-SA-4.0 142,192 β€” 11.1 MB 16.5 MB
Stack Overflow 2018 Developer Survey	Stack Overflow 2018 Developer Survey	Individual responses to the 2018 Developer Survey fielded by Stack Overflow. Each year, we at Stack Overflow ask the developer community about everything from their favorite technologies to their job preferences. This year marks the eighth year we’ve published our Annual Developer Survey resultsβ€”with the largest number of respondents yet. Over 100,000 developers took the 30-minute survey in January 2018. β€” adapted from the dataset's Kaggle description (stackoverflow/stack-overflow-2018-developer-survey).	https://insights.stackoverflow.com/survey/2018	Tabular (CSV)	DbCL-1.0	98,855	β€”	6.9 MB	9.5 MB
Stack Overflow Badges Stack Exchange Data Dump β€” Stack Overflow Badges Badges earned by Stack Overflow users β€” badge name, class (gold/silver/bronze), tag-based vs activity-based, awarded timestamp. Joins to stackoverflow-users via user_id. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 51,289,973 β€” 583.8 MB 487.6 MB
Stack Overflow PostLinks Stack Exchange Data Dump β€” Stack Overflow PostLinks Post-to-post link graph for Stack Overflow β€” duplicate-question and related-question relationships. Each row has both endpoint post IDs plus a link type code; joins to stackoverflow-posts on either side. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 6,552,590 β€” 146.8 MB 107.1 MB
Stack Overflow Posts Stack Exchange Data Dump β€” Stack Overflow Posts Every Stack Overflow question and answer from 2008 to 2024 β€” title, body (HTML), tags, score, view count, accepted-answer ID, and authorship. The largest single table in the Stack Exchange data dump. https://archive.org/details/stackexchange Tabular (Parquet) CC-BY-SA-4.0 58,329,355 β€” 23,896.5 MB 38,972.6 MB
Stack Overflow Tags Stack Exchange Data Dump β€” Stack Overflow Tags Tag metadata and usage counts for Stack Overflow β€” tag name, total uses, excerpt-post and wiki-post IDs. ~60k tags following a long-tail distribution. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 65,675 β€” 1.4 MB 1.3 MB
Stack Overflow Users Stack Exchange Data Dump β€” Stack Overflow Users User profiles from Stack Overflow β€” display name, location, reputation, badge counts (gold/silver/bronze), account creation date, last access. Joins to stackoverflow-posts via owner_user_id and to stackoverflow-badges via user_id. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 22,484,235 β€” 1,148.3 MB 1,251.5 MB
Statlog (German Credit Data)	UCI ML Repository β€” Statlog (German Credit Data)	This dataset classifies people described by a set of attributes as good or bad credit risks. It comes in two formats (one all-numeric) and ships with a cost matrix. Two datasets are provided: the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data". For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric", which has been edited and extended with several indicator variables to suit algorithms that cannot cope with categorical variables; several ordered categorical attributes (such as attribute 17) have been coded as integers.	https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data	Tabular (CSV)	CC-BY-4.0	1,000	β€”	0.0 MB	0.0 MB
Student Performance	UCI ML Repository β€” Student Performance	Predict student performance in secondary education (high school). These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades plus demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1.	https://archive.ics.uci.edu/dataset/320/student+performance	Tabular (CSV)	CC-BY-4.0	649	β€”	0.0 MB	0.0 MB
Synthetic Text-to-SQL gretelai/synthetic_text_to_sql — Gretel synthetic NL→SQL pairs ~105K synthetic natural-language → SQL pairs generated by Gretel with structured metadata: sql_complexity, sql_task_type, domain, sql_explanation, sql_prompt, sql_context. Useful baseline for text-to-SQL training and for exercising structured-string columns. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql Tabular (Parquet) Apache-2.0 105,851 1 21.9 MB 36.6 MB
Temperature change	Temperature change	Global Warming, Temperature Change, Climate Change. The FAOSTAT Temperature Change domain disseminates statistics of mean surface temperature change by country, with annual updates. The current dissemination covers the period 1961–2023. Statistics are available for monthly, seasonal, and annual mean temperature anomalies, i.e., temperature change with respect to a baseline climatology corresponding to the period 1951–1980. β€” adapted from the dataset's Kaggle description (sevgisarac/temperature-change).	https://data.giss.nasa.gov/gistemp/	Tabular (CSV)	Attribution 3.0 IGO (CC BY 3.0 IGO)	147	β€”	0.0 MB	0.0 MB
Thyroid Disease	UCI ML Repository β€” Thyroid Disease	10 separate databases from the Garavan Institute. Documentation as given by Ross Quinlan: 6 databases come from the Garavan Institute in Sydney, Australia, each with approximately 2,800 training instances and 972 test instances, plenty of missing data, and roughly 29 attributes, either Boolean or continuously valued. 2 additional databases, also from Ross Quinlan, are included: Hypothyroid.data and sick-euthyroid.data; Quinlan believes these have been corrupted, though their format is highly similar to the other databases. 1 mo…	https://archive.ics.uci.edu/dataset/102/thyroid+disease	Tabular (CSV)	CC-BY-4.0	2,800	β€”	0.0 MB	0.1 MB
TinyStories	roneneldan/TinyStories (Eldan & Li, 2023)	~2.1M short synthetic children's stories (3–4 paragraphs each) generated by GPT-3.5/4 to study how small language models acquire coherent narrative ability. Two columns: text (the story) plus a validation-split flag. Canonical small-LM training corpus.	https://huggingface.co/datasets/roneneldan/TinyStories	Tabular (Parquet)	CDLA-Sharing-1.0	2,141,709	3	633.5 MB	917.9 MB
Titanic Dataset Titanic Dataset Titanic Survival Prediction Dataset. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered β€œunsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. β€” adapted from the dataset's Kaggle description (yasserh/titanic-dataset). https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/carData/TitanicSurvival.csv Tabular (CSV) CC0-1.0 891 β€” 0.0 MB 0.0 MB
TruthfulQA (multiple-choice) truthfulqa/truthful_qa β€” multiple-choice configuration 817 questions designed to elicit imitative-falsehood answers β€” common misconceptions humans repeat. Multiple-choice config: mc1_targets (single-answer) and mc2_targets (probabilistic). The generation config is omitted (use the original repo for free-form prompts). https://huggingface.co/datasets/truthfulqa/truthful_qa Tabular (Parquet) Apache-2.0 817 1 0.2 MB 0.3 MB
U.S. Airbnb Open Data	U.S. Airbnb Open Data	Airbnb listings and metrics of regions in the U.S. Since its inception in 2008, Airbnb has disrupted the traditional hospitality industry as more travellers decide to use Airbnb as their primary means of accommodation. Airbnb offers travellers a more unique and personalized way of accommodation and experience. I compiled this dataset for a project of mine on 20 October 2020. β€” adapted from the dataset's Kaggle description (kritikseth/us-airbnb-open-data).	http://insideairbnb.com/get-the-data/	Tabular (CSV)	CC0-1.0	232,147	β€”	11.3 MB	13.5 MB
Uber Pickups NYC	Uber Pickups in New York City	Trip data for over 20 million Uber (and other for-hire vehicle) trips in NYC. From the Uber TLC FOIL response: this directory contains data on over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015. Trip-level data on 10 other for-hire vehicle (FHV) companies, as well as aggregated data for 329 FHV companies, is also included. All the files are as they were received on August 3, Sept. … β€” adapted from the dataset's Kaggle description (fivethirtyeight/uber-pickups-in-new-york-city).	https://github.com/fivethirtyeight/uber-tlc-foil-response	Tabular (CSV)	CC0-1.0	564,516	β€”	2.2 MB	3.5 MB
UFC-Fight historical data from 1993 to 2021	UFC-Fight historical data from 1993 to 2021	Compiled UFC fight, fighter stats, and event information. Update: this dataset got a lot of love from the community, and many people asked for an updated version, so the latest scraped and processed data (as of 21/03/2021) has been uploaded. It is now easy for anyone to get the latest dataset with a single command; for bleeding-edge data or the scraping code, see the linked repository. β€” adapted from the dataset's Kaggle description (rajeevw/ufcdata).	https://github.com/WarrierRajeev/UFC-Predictions	Tabular (CSV)	CC0-1.0	6,012	β€”	1.5 MB	2.9 MB
UK Price Paid HM Land Registry Price Paid Data (1995–present) Every residential property sale in England and Wales since 1995, published by HM Land Registry. Each row carries price, postcode, property type, new-build flag, lease/freehold, and local authority. https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads Tabular (CSV) OGL-UK-3.0 31,192,682 β€” 1,024.7 MB 1,429.3 MB
UK Road Safety: Traffic Accidents and Vehicles UK Road Safety: Traffic Accidents and Vehicles Detailed dataset of road accidents and involved vehicles in the UK (2005-2017). The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research. The creation of this dataset was inspired by the one previously published by [Dave Fisher-Hickey][1]. β€” adapted from the dataset's Kaggle description (tsiaras/uk-road-safety-accidents-and-vehicles). https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data Tabular (CSV) DbCL-1.0 9,015,100 β€” 379.3 MB 531.0 MB
UltraChat 200k UltraChat 200k (HuggingFaceH4, 2024) 200k filtered + formatted multi-turn conversations from the UltraChat corpus, used by HuggingFaceH4 to instruction-tune Zephyr-7b-Ξ². Each row carries a messages field of list<struct<role, content>> for the dialogue turns. Larger and more conversational than the no-robots / dolly-15k cohort; synthetic but stylistically diverse. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k Tabular (Parquet) MIT 515,311 1 1,055.0 MB 1,575.9 MB
UltraFeedback (binarized) HuggingFaceH4/ultrafeedback_binarized β€” DPO-ready binarized preferences 61k DPO/SFT preference triples binarized from the UltraFeedback corpus by HuggingFaceH4. Each row carries chosen and rejected as list<struct<role, content>> plus per-side score doubles β€” a rich showcase for paired list-of-struct columns alongside the UltraChat 200k slug we already ship. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized Tabular (Parquet) MIT 187,405 β€” 438.7 MB 670.7 MB
US Accidents (2016 - 2023) US Accidents (2016 - 2023) A Countrywide Traffic Accident Dataset (2016 - 2023). This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. β€” adapted from the dataset's Kaggle description (sobhanmoosavi/us-accidents). https://smoosavi.org/datasets/us_accidents Tabular (Parquet) CC-BY-NC-SA-4.0 7,728,394 β€” 541.0 MB 756.6 MB
Walmart Walmart Dataset Walmart Store Sales Prediction - Regression Problem. One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. β€” adapted from the dataset's Kaggle description (yasserh/walmart-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Retail_Analysis_with_Walmart/main/Wallmart1.jpg Tabular (CSV) CC0-1.0 6,435 β€” 0.1 MB 0.1 MB
Waxal (Dagbani ASR, test split) google/WaxalNLP β€” Dagbani ASR test split Dagbani-language ASR test split from Google's Waxal multilingual African-language speech corpus. Audio binary blobs paired with Dagbani-text transcripts; 3 parquet shards from the canonical test set (train + validation + unlabeled splits exist upstream and total ~57 GB). WaxalNLP covers ~33 language-task pairs across Acholi (ach), Akan (aka), Amharic (amh), Dagbani (dag), Ewe (ewe), Igbo (ibo), Luganda (lug), Luo (luo), Swahili (swa), Yoruba (yor), and others β€” flip allow_patterns to <lang>_<asr|tts>/test-*.parquet for any of them. (Note: despite the 'Waxal' name, the upstream does not include a Wolof config; pick a different language slug.) https://huggingface.co/datasets/google/WaxalNLP Tabular (Parquet) CC-BY-SA-4.0 1,838 1 536.8 MB 544.8 MB
WebSight v0.1 HuggingFaceM4/WebSight v0.1 β€” synthetic HTML/screenshot pairs ~822K synthetic HTML pages with rendered screenshots, used to train the Idefics3 visual-document model. Each row pairs html (source code) with image (rendered PNG bytes). Smaller v0.1 release (71 shards); v0.2 exists (738 shards, ~10x larger). https://huggingface.co/datasets/HuggingFaceM4/WebSight Tabular (Parquet) CC-BY-4.0 822,987 1 27,964.7 MB 33,160.0 MB
Wholesale customers UCI ML Repository β€” Wholesale customers The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories https://archive.ics.uci.edu/dataset/292/wholesale+customers Tabular (CSV) CC-BY-4.0 440 β€” 0.0 MB 0.0 MB
Wikipedia (English) wikimedia/wikipedia β€” English Wikipedia, 2023-11-01 dump Cleaned-text English Wikipedia dump from 2023-11-01 (~6.4M articles). Each row carries id, url, title, text from Wikimedia's parquet auto-conversion. Distinct from cohere-wikipedia-simple-embed (which ships embeddings of Simple English) β€” this is the raw multilingual-friendly text corpus, gated here to English alone to bound the build size. Flip hf_allow_patterns to a different 20231101.<lang> for other languages. https://huggingface.co/datasets/wikimedia/wikipedia Tabular (Parquet) CC-BY-SA-3.0 6,407,814 7 7,753.8 MB 11,238.8 MB
Wikipedia Structured Contents Wikipedia Structured Contents Pre-parsed English and French Wikipedia Articles, Including Infoboxes. Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.). β€” adapted from the dataset's Kaggle description (wikimedia-foundation/wikipedia-structured-contents). https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents Custom CC-BY-SA-4.0 10,112,058 β€” 34,032.7 MB β€”
Wine	UCI ML Repository β€” Wine	Using chemical analysis to determine the origin of wines. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The donor notes that the initial data set had around 30 variables, but only the 13-dimensional version survives.	https://archive.ics.uci.edu/dataset/109/wine	Tabular (CSV)	CC-BY-4.0	178	β€”	0.0 MB	0.0 MB
Wine Quality UCI ML Repository β€” Wine Quality Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/). The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). https://archive.ics.uci.edu/dataset/186/wine+quality Tabular (CSV) CC-BY-4.0 6,497 β€” 0.1 MB 0.1 MB
World Bank WDI World Development Indicators World Bank's World Development Indicators β€” global country-level time-series spanning ~1500 indicators across economic, social, demographic, and environmental categories. The canonical cross-country comparison dataset for development economics. https://datacatalog.worldbank.org/search/dataset/0037712 Tabular (CSV) CC-BY-4.0 395,276 β€” 70.3 MB 112.3 MB
World Energy Consumption	World Energy Consumption	Consumption of energy by different countries. Our complete Energy dataset is a collection of key metrics maintained by Our World in Data. It is updated regularly and includes data on energy consumption (primary energy, per capita, and growth rates), energy mix, electricity mix, and other relevant metrics. The CSV and XLSX files follow a format of 1 row per location and year. β€” adapted from the dataset's Kaggle description (pralabhpoudel/world-energy-consumption).	https://github.com/owid	Tabular (CSV)	Attribution 4.0 International (CC BY 4.0)	23,377	β€”	3.6 MB	6.7 MB
YouTube-Commons (sample) PleIAs/YouTube-Commons β€” single-shard transcript sample Single-shard sample (cctube_0.parquet, ~385MB) of PleIAs's YouTube-Commons corpus: ~2M transcripts of YouTube videos uploaders explicitly marked CC-BY. Schema: video metadata + multilingual transcript text. Flip allow patterns to cctube_*.parquet for the full ~426-shard set (~165GB). https://huggingface.co/datasets/PleIAs/YouTube-Commons Tabular (Parquet) CC-BY-4.0 49,967 1 253.6 MB 409.9 MB
Zoo Animal Classification Zoo Animal Classification Use Machine Learning Methods to Correctly Classify Animals Based Upon Attributes https://archive.ics.uci.edu/ml/datasets/Zoo Tabular (CSV) DbCL-1.0 101 β€” 0.0 MB 0.0 MB
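
The Stack Overflow tables above are designed to join on shared keys (badges β†’ users via user_id, posts β†’ users via owner_user_id). A minimal sketch of those joins, using hypothetical toy rows and an in-memory SQLite database (column names follow the descriptions above; the real dump's exact schema may differ):

```python
import sqlite3

# Hypothetical toy rows mirroring the dump's join keys;
# the real tables hold ~22M users, ~51M badges, ~58M posts.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, display_name TEXT);
    CREATE TABLE badges (user_id INTEGER, name TEXT, class INTEGER);
    CREATE TABLE posts  (id INTEGER PRIMARY KEY, owner_user_id INTEGER, score INTEGER);
""")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
con.executemany("INSERT INTO badges VALUES (?, ?, ?)",
                [(1, "Teacher", 3), (1, "Editor", 3), (2, "Teacher", 3)])
con.executemany("INSERT INTO posts VALUES (?, ?, ?)", [(10, 1, 5), (11, 2, 2)])

# badges joins to users via user_id; posts joins via owner_user_id.
badge_counts = dict(con.execute("""
    SELECT u.display_name, COUNT(*)
    FROM badges b JOIN users u ON u.id = b.user_id
    GROUP BY u.display_name
"""))
post_scores = dict(con.execute("""
    SELECT u.display_name, p.score
    FROM posts p JOIN users u ON u.id = p.owner_user_id
"""))
```

The same two equi-joins carry over to any engine (DuckDB, pandas, Spark) once the XML or parquet shards are loaded.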

⚠ Scrape advisories

These datasets aggregate or reference content whose underlying licenses have not been individually cleared. The aggregator's declared license (the License column above) governs only the metadata it ships, not the content it points at. Read each advisory before redistributing or building on top of one of these slugs.

C4 (en, validation) (c4-en-validation) β€” C4 (Colossal Clean Crawled Corpus) is a heavily-filtered scrape of Common Crawl. Allen AI's ODC-By-1.0 license covers the harvest layer; the underlying web text remains subject to per-publisher copyright. Treat as research convenience, not a license-cleared corpus.

CodeParrot Clean (validation) (codeparrot-clean-valid) β€” CodeParrot is a public scrape of MIT/BSD/Apache-licensed Python repositories from GitHub. The dataset's redistribution covers only the harvest layer; per-repository licenses still apply to each content row, and downstream redistribution requires honouring those licenses individually.

FineMath (4+ quality subset) (finemath-4plus) β€” Math-themed slice of the Common-Crawl-derived fineweb pipeline. Same harvest-only ODC-By-1.0 license β€” per-document copyright not cleared.

FinePDFs (English test sample) (finepdfs-en-test) β€” PDF-extracted parallel of fineweb. Same ODC-By-1.0 harvest license; per-document underlying copyright is not cleared. Research-pretraining convenience only.

Fineweb (sample, 10BT) (fineweb-sample-10bt) β€” Fineweb is a 15TB scrape of Common Crawl filtered for English text quality. Released under ODC-By 1.0, but redistribution covers only the harvest layer β€” per-page web content remains subject to individual copyright. Treat as research convenience for LLM pretraining research, not a license-cleared corpus.

Fineweb-2 (Swedish sample) (fineweb-2-swedish) β€” Multilingual extension of the Common-Crawl-derived fineweb pipeline. Released under ODC-By-1.0 by HuggingFace, but redistribution covers only the harvest layer β€” per-page web content remains subject to individual copyright. Treat as research convenience for multilingual LM pretraining.

LAION-400M (metadata) (laion-400m) β€” LAION-400M is a public-web scrape: rows pair URLs to web-hosted images with their alt-text captions and CLIP similarity scores. LAION's redistribution license (CC-BY-4.0) covers only this metadata table β€” it does NOT clear the underlying images, captions, or alt-text, which are subject to per-item copyright and the takedown model LAION operates. Treat the dataset as a research convenience, not a license-cleared corpus. For any production use, dereference URLs only after you've cleared rights with the original publishers, and consult LAION's safety advisories around the corpus's known content issues.

SlimPajama-6B (slimpajama-6b) β€” SlimPajama-6B is a deduplicated 6B-token sample of SlimPajama-627B, itself derived from RedPajama-1T. The underlying corpus pulls from Common Crawl, GitHub, Wikipedia, books, and ArXiv β€” per-source licenses are not individually cleared. Treat as research convenience, not a license-cleared corpus.
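
Several rows above (C4, Waxal, Wikipedia, YouTube-Commons) suggest flipping allow_patterns to mirror a different slice of a Hugging Face repo. Those patterns are fnmatch-style globs; a quick sanity check of which shards a pattern would select, over a hypothetical file listing (real names come from each repo's file tree):

```python
from fnmatch import fnmatch

# Hypothetical shard names modeled on the slugs in the table above.
files = [
    "en/c4-train.00000-of-01024.json.gz",
    "en/c4-validation.00000-of-00008.json.gz",
    "dag_asr/test-00000-of-00003.parquet",
    "dag_asr/train-00000-of-00120.parquet",
    "cctube_0.parquet",
]

def select(pattern: str) -> list[str]:
    """Return the files a single allow_patterns glob would keep."""
    return [f for f in files if fnmatch(f, pattern)]

train_shards = select("en/c4-train.*.json.gz")   # English training shards only
asr_test = select("dag_asr/test-*.parquet")      # one language-task test split
full_cctube = select("cctube_*.parquet")         # every YouTube-Commons shard
```

The same glob can then be passed to huggingface_hub's snapshot_download(repo_id=..., repo_type="dataset", allow_patterns=[...]) to mirror only that slice instead of the full repo.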