281 lines (270 loc) · 156 KB

| Dataset Short Name | Dataset Full Name | Dataset Description | Dataset Source (URL) | Data Kind | License | Row Count | Row Groups - Parquet | File Size - Parquet | File Size - Vortex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 120 years of Olympic history: athletes and results | 120 years of Olympic history: athletes and results | Basic bio data on athletes and medal results from Athens 1896 to Rio 2016. This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. "I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub." — adapted from the dataset's Kaggle description (heesoo37/120-years-of-olympic-history-athletes-and-results). | https://github.com/rgriff23/Olympic_history | Tabular (CSV) | CC0-1.0 | 271,116 | — | 4.4 MB | 5.6 MB |
| ⚠ C4 (en, validation) | AllenAI/C4 — Colossal Clean Crawled Corpus, English validation split | C4 (Raffel et al., JMLR 2020) — a heavily-filtered scrape of Common Crawl used to pretrain T5. This entry pulls only the 8-shard English validation split (~365k documents), enough for type coverage and as a smoke-test for the Common-Crawl scrape playbook. Flip allow_patterns to en/c4-train.*.json.gz to mirror the 327 GB English training set. | https://huggingface.co/datasets/allenai/c4 | Structured (JSON) | ODC-By-1.0 | 364,608 | — | 339.7 MB | 435.3 MB |
| ⚠ CodeParrot Clean (validation) | codeparrot/codeparrot-clean — validation split | 61k Python source files scraped from MIT/BSD/Apache-licensed GitHub repos by the CodeParrot project. Validation split (142 MB raw .json.gz). Showcases code-corpus shape: string content, numeric quality metrics (line_mean, alpha_frac), boolean autogenerated flag, and per-row license attribution. | https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid | Structured + Blobs | Apache-2.0 | 61,373 | — | 149.7 MB | 343.5 MB |
| ⚠ FineMath (4+ quality subset) | HuggingFaceTB/finemath — quality ≥ 4 math web pages | Math-focused subset of the fineweb pipeline, filtered to documents with an automated quality score ≥ 4 (the higher of the two quality tiers). 64 parquet shards. Schema mirrors fineweb (text + harvest metadata). | https://huggingface.co/datasets/HuggingFaceTB/finemath | Tabular (Parquet) | ODC-By-1.0 | 6,699,493 | 7 | 12,647.7 MB | 21,724.8 MB |
| ⚠ FinePDFs (English test sample) | HuggingFaceFW/finepdfs — English test split (sample) | Sample slice of HuggingFace's PDF-derived corpus (3T tokens total, 1733 languages). Schema mirrors fineweb (text + harvest metadata) but the source is OCR/extracted PDFs rather than HTML. Limited here to the English test split (~1 shard) for tractable build size — flip allow patterns to eng_Latn/train/*.parquet for the full English corpus (579 shards, multi-hour fetch). | https://huggingface.co/datasets/HuggingFaceFW/finepdfs | Tabular (Parquet) | ODC-By-1.0 | 373 | 1 | 7.0 MB | 12.7 MB |
| ⚠ Fineweb (sample, 10BT) | HuggingFaceFW/fineweb — 10B-token reproducibility sample | 10B-token reproducibility sample of HuggingFace's Fineweb (a 15TB Common-Crawl-filtered English-text corpus released for LLM pretraining research). 15 parquet shards of text + deduplication metadata (~32 GB raw). Flip allow_patterns to sample/100BT/*.parquet (300 GB) or sample/350BT/*.parquet (1 TB) for larger reproducibility samples. | https://huggingface.co/datasets/HuggingFaceFW/fineweb | Tabular (Parquet) | ODC-By-1.0 | 14,868,862 | — | 20,364.5 MB | 26,202.2 MB |
| ⚠ Fineweb-2 (Swedish sample) | HuggingFaceFW/fineweb-2 — Swedish Latin-script subset (sample) | Swedish-language subset of HuggingFace's multilingual fineweb-2 (the 1000+-language extension of the original English fineweb). Each row carries text plus the same dedup/quality metadata fields as fineweb. Swedish was picked as a moderately-sized, representative non-English split; flip hf_allow_patterns to <lang>_<script>/*.parquet for any of the other 1000+ language/script pairs. | https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 | Tabular (Parquet) | ODC-By-1.0 | 3,626,000 | 4 | 4,583.5 MB | 5,807.7 MB |
| ⚠ LAION-400M (metadata) | LAION-400M — image-text pairs, metadata + CLIP only | 400M image-text pairs scraped from Common Crawl, joined with CLIP-ViT-B/32 similarity scores. This entry pulls only the metadata parquets (URL, caption, height/width, similarity) — the images themselves are not redistributed by LAION. Gated on Hugging Face: requires accepting LAION's terms once via the dataset page before the API will serve downloads. | https://laion.ai/blog/laion-400-open-dataset/ | Tabular (Parquet) | CC-BY-4.0 | 2,820,459 | — | 256.4 MB | 318.3 MB |
| ⚠ SlimPajama-6B | DKYoon/SlimPajama-6B — 6B-token deduplicated sample | 6B-token deduplicated subsample of Cerebras's SlimPajama-627B (itself a cleaned subset of RedPajama-1T). 50 parquet shards of text + meta.redpajama_set_name covering Common Crawl, GitHub, Wikipedia, books, ArXiv. Used as a small reproducible pretraining sample by the LLM research community. | https://huggingface.co/datasets/DKYoon/SlimPajama-6B | Tabular (Parquet) | Apache-2.0 | 5,507,693 | — | 9,441.5 MB | 13,695.7 MB |
| Abalone | UCI ML Repository — Abalone | Predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope — a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. | https://archive.ics.uci.edu/dataset/1/abalone | Tabular (CSV) | CC-BY-4.0 | 4,177 | — | 0.1 MB | 0.1 MB |
| Adult | UCI ML Repository — Adult | Predict whether the annual income of an individual exceeds $50K/yr based on census data; also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person's income is over $50,000 a year. | https://archive.ics.uci.edu/dataset/2/adult | Tabular (CSV) | CC-BY-4.0 | 48,842 | — | 0.4 MB | 0.4 MB |
| AI2 ARC | allenai/ai2_arc — AI2 Reasoning Challenge (Easy + Challenge) | 7.7K grade-school-level science multiple-choice questions, split into Easy (5.2K) and Challenge (2.6K) subsets. Each row carries question, choices (list<string>), and answerKey. Both subsets are concatenated, with a split column distinguishing the source. | https://huggingface.co/datasets/allenai/ai2_arc | Tabular (Parquet) | CC-BY-SA-4.0 | 7,787 | 1 | 0.8 MB | 1.1 MB |
| AI4I 2020 Predictive Maintenance Dataset | UCI ML Repository — AI4I 2020 Predictive Maintenance Dataset | A synthetic dataset that reflects real predictive maintenance data encountered in industry. Since real predictive maintenance datasets are generally difficult to obtain, and in particular difficult to publish, the authors provide a synthetic dataset that reflects real predictive maintenance encountered in industry to the best of their knowledge. | https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset | Tabular (CSV) | CC-BY-4.0 | 10,000 | — | 0.1 MB | 0.1 MB |
| Air Quality | UCI ML Repository — Air Quality | Responses of a gas multisensor device deployed in the field in an Italian city; hourly response averages are recorded along with gas-concentration references from a certified analyzer. The dataset contains 9,358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air-quality chemical sensor device responses. | https://archive.ics.uci.edu/dataset/360/air+quality | Tabular (CSV) | CC-BY-4.0 | 9,357 | — | 0.2 MB | 0.2 MB |
| Airbnb Open Data | Airbnb Open Data | New York City Airbnb open data (data-cleaning exercise). Airbnb, Inc. is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving a commission from each booking. — adapted from the dataset's Kaggle description (arianazmoudeh/airbnbopendata). | http://insideairbnb.com/explore/ | Tabular (CSV) | ODbL-1.0 | 102,599 | — | 4.9 MB | 5.6 MB |
| Airbnb Prices in European Cities | Airbnb Prices in European Cities | Determinants of price by room type, location, cleanliness rating, and more. Each listing is evaluated for various attributes such as room types, cleanliness and satisfaction ratings, bedrooms, distance from the city centre, and more to capture an in-depth understanding of Airbnb prices on both weekdays and weekends. Using spatial econometric methods, the authors analyse and identify the determinants of Airbnb prices across these cities, aiming to offer insight into how global markets are affected by social dynamics and geographical factors, which in turn determine pricing strategies. — adapted from the dataset's Kaggle description (thedevastator/airbnb-prices-in-european-cities). | https://zenodo.org/record/4446043#.Y9Y9ENJBwUE | Tabular (CSV) | CC0-1.0 | 51,707 | — | 3.5 MB | 3.0 MB |
| AMPds — Whole-House Electricity | AMPds v2 — Whole-House Electricity (Makonin et al., Sci. Data 2016) | Whole-house electricity consumption time-series from AMPds v2 (the Almanac of Minutely Power dataset, version 2). 1-minute-resolution measurements of the residence's main electrical service over April 2012–April 2014, paired with rich electrical sub-metrics (voltage, current, frequency, power factor, real/reactive/apparent power, plus running totals). One row per minute (~1.05M rows). Schema: unix_ts, V, I, f, DPF, APF, P, Pt, Q, Qt, S, St. | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FIE0S4 | Tabular (CSV) | CC-BY-4.0 | 1,051,200 | 2 | 18.6 MB | 20.3 MB |
| Anthropic Economic Index | Anthropic/EconomicIndex — Claude.ai usage panel (release 2026-03-24) | Aggregated Claude.ai usage metrics from the fourth Anthropic Economic Index report (data window 2026-02-05 – 2026-02-12). Each row carries one metric value for a (geography, facet, variable, cluster) tuple. Schema: geo_id (country / country-state / global ISO codes), geography, date_start, date_end, platform_and_product, facet (collaboration / request / onet_task / etc.), level, variable, cluster_name, value. Bound here to the consumer-product file from the latest release (release_2026_03_24/data/aei_raw_claude_ai_*.csv); the sibling aei_raw_1p_api_*.csv covers API usage, and earlier releases live under release_2025_*/data/ and release_2026_01_15/data/. | https://huggingface.co/datasets/Anthropic/EconomicIndex | Tabular (CSV) | MIT | 425,257 | 1 | 3.6 MB | 5.8 MB |
| Anthropic HH-RLHF (helpful-base) | Anthropic Helpful & Harmless RLHF — helpful-base subset | Anthropic's RLHF preference dataset, scoped to the helpful-base subset (~46k pairs of chosen/rejected assistant responses to the same prompt). Each pair is a complete conversation transcript. First RLHF-preference-data slug in the catalog; other subsets (harmless-base, helpful-online, helpful-rejection-sampled, red-team-attempts) can be added as sibling slugs. | https://huggingface.co/datasets/Anthropic/hh-rlhf | Structured (JSON) | MIT | 46,189 | — | 25.8 MB | 34.7 MB |
| Anthropic Interviewer | Anthropic/AnthropicInterviewer — qualitative AI-interview transcripts | Anthropic's published transcripts from AI-conducted qualitative research interviews, across three populations: creatives, scientists, and workforce participants. Each row carries the interview transcript plus participant demographics and study metadata. | https://huggingface.co/datasets/Anthropic/AnthropicInterviewer | Tabular (Parquet) | MIT | 1,250 | 1 | 3.0 MB | 5.4 MB |
| Auto MPG | UCI ML Repository — Auto MPG | Revised from the CMU StatLib library; the data concern city-cycle fuel consumption. This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original". "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993) | https://archive.ics.uci.edu/dataset/9/auto+mpg | Tabular (CSV) | CC-BY-4.0 | 398 | — | 0.0 MB | 0.0 MB |
| Automobile | UCI ML Repository — Automobile | From the 1985 Ward's Automotive Yearbook. This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price; then, if a car is more (or less) risky, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". | https://archive.ics.uci.edu/dataset/10/automobile | Tabular (CSV) | CC-BY-4.0 | 205 | — | 0.0 MB | 0.0 MB |
| Aya Collection (aya_dataset config) | CohereLabs/aya_collection — Aya-curated subset | The smaller aya_dataset config of the broader Aya Collection (which totals ~513M instances across 114 languages, mostly templated). This subset is the human-curated Aya Dataset in the Aya Collection's unified schema. Flip allow patterns to other configs (e.g. templated_xnli/) for the templated variants. | https://huggingface.co/datasets/CohereLabs/aya_collection | Tabular (Parquet) | Apache-2.0 | 202,364 | 1 | 99.1 MB | 160.3 MB |
| Aya Dataset | CohereLabs/aya_dataset — human-curated multilingual instructions | 204K human-curated multilingual instruction/response pairs in 65 languages, contributed by 3K participants in the Cohere For AI Aya open-science project. Each row carries inputs, targets, language, language_code, and annotation_type. | https://huggingface.co/datasets/CohereLabs/aya_dataset | Tabular (Parquet) | Apache-2.0 | 204,112 | 1 | 99.3 MB | 162.2 MB |
| Bank Account Fraud Dataset Suite (NeurIPS 2022) | Bank Account Fraud Dataset Suite (NeurIPS 2022) | Biased, imbalanced, dynamic tabular datasets for ML evaluation. The Bank Account Fraud (BAF) suite of datasets was published at NeurIPS 2022 and comprises 6 different synthetic bank-account-fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind. Each dataset is composed of: 1 million instances; 30 realistic features used in the fraud-detection use case; a "month" column providing temporal information about the dataset; and protected attributes (age group, employment status, and % income). — adapted from the dataset's Kaggle description (sgpjesus/bank-account-fraud-dataset-neurips-2022). | https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf | Tabular (CSV) | CC-BY-NC-SA-4.0 | 1,000,000 | — | 61.1 MB | 56.5 MB |
| Bank Marketing | UCI ML Repository — Bank Marketing | Data related to direct marketing campaigns (phone calls) of a Portuguese banking institution; the classification goal is to predict whether the client will subscribe to a term deposit (variable y). The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). | https://archive.ics.uci.edu/dataset/222/bank+marketing | Tabular (CSV) | CC-BY-4.0 | 45,211 | — | 0.3 MB | 0.4 MB |
| Behavioral Risk Factor Surveillance System | Behavioral Risk Factor Surveillance System | Public health surveys of 400k people from 2011–2015. The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. — adapted from the dataset's Kaggle description (cdc/behavioral-risk-factor-surveillance-system). | https://www.cdc.gov/brfss/ | Custom | CC0-1.0 | 445,132 | — | 26.8 MB | 54.2 MB |
| Beijing Multi-Site Air Quality | UCI ML Repository — Beijing Multi-Site Air Quality (Liang et al., 2017) | Hourly air-quality + meteorological readings from 12 monitoring stations across Beijing, March 2013 – February 2017 (~420K rows). Each station ships as a separate CSV in the upstream nested zip; the build concatenates them with the existing station column intact. Schema (18 cols): No, year, month, day, hour, PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM, station. Standard substitute for the KDD'15 U-Air dataset (which is no longer publicly hosted). | https://archive.ics.uci.edu/dataset/501/beijing+multi+site+air+quality+data | Tabular (CSV) | CC-BY-4.0 | 420,768 | 1 | 5.4 MB | 7.3 MB |
| BeIR / MS MARCO | BeIR/msmarco — text-only passage corpus + queries | MS MARCO repackaged for the BEIR retrieval benchmark: 8.84M passages from the Bing search corpus paired with 510k user queries. Both ship as large_string parquet — a solid showcase for string-heavy datasets and FSST encoding in Vortex. Concatenated into one slug with a split column distinguishing corpus vs queries. | https://huggingface.co/datasets/BeIR/msmarco | Tabular (Parquet) | MIT | 9,351,785 | — | 1,095.3 MB | 1,681.8 MB |
| BI-Arade | Public BI Benchmark — Arade | Public BI Benchmark workload Arade — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Arade | Tabular (CSV) | MIT | 9,888,775 | — | 312.5 MB | 136.7 MB |
| BI-Bimbo | Public BI Benchmark — Bimbo | Public BI Benchmark workload Bimbo — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Bimbo | Tabular (CSV) | MIT | 74,180,464 | — | 368.6 MB | 436.5 MB |
| BI-CityMaxCapita | Public BI Benchmark — CityMaxCapita | Public BI Benchmark workload CityMaxCapita — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CityMaxCapita | Tabular (CSV) | MIT | 912,657 | — | 102.0 MB | 136.6 MB |
| BI-CMSprovider | Public BI Benchmark — CMSprovider | Public BI Benchmark workload CMSprovider — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CMSprovider | Tabular (CSV) | MIT | 18,575,754 | — | 798.3 MB | 804.8 MB |
| BI-CommonGovernment | Public BI Benchmark — CommonGovernment | Public BI Benchmark workload CommonGovernment — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/CommonGovernment | Tabular (CSV) | MIT | 141,123,827 | — | 6,358.3 MB | 9,153.8 MB |
| BI-Corporations | Public BI Benchmark — Corporations | Public BI Benchmark workload Corporations — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Corporations | Tabular (CSV) | MIT | 741,723 | — | 53.6 MB | 67.9 MB |
| BI-Eixo | Public BI Benchmark — Eixo | Public BI Benchmark workload Eixo — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Eixo | Tabular (CSV) | MIT | 7,559,227 | — | 463.1 MB | 616.6 MB |
| BI-Euro2016 | Public BI Benchmark — Euro2016 | Public BI Benchmark workload Euro2016 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Euro2016 | Tabular (CSV) | MIT | 2,052,497 | — | 127.7 MB | 156.9 MB |
| BI-Food | Public BI Benchmark — Food | Public BI Benchmark workload Food — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Food | Tabular (CSV) | MIT | 5,216,593 | — | 36.4 MB | 40.4 MB |
| BI-Generico | Public BI Benchmark — Generico | Public BI Benchmark workload Generico — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Generico | Tabular (CSV) | MIT | 114,124,607 | — | 2,341.3 MB | 3,619.4 MB |
| BI-HashTags | Public BI Benchmark — HashTags | Public BI Benchmark workload HashTags — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/HashTags | Tabular (CSV) | MIT | 511,511 | — | 138.7 MB | 186.8 MB |
| BI-Hatred | Public BI Benchmark — Hatred | Public BI Benchmark workload Hatred — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Hatred | Tabular (CSV) | MIT | 873,166 | — | 100.6 MB | 133.6 MB |
| BI-IGlocations1 | Public BI Benchmark — IGlocations1 | Public BI Benchmark workload IGlocations1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IGlocations1 | Tabular (CSV) | MIT | 81,611 | — | 1.8 MB | 2.3 MB |
| BI-IGlocations2 | Public BI Benchmark — IGlocations2 | Public BI Benchmark workload IGlocations2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IGlocations2 | Tabular (CSV) | MIT | 4,341,308 | — | 515.6 MB | 720.2 MB |
| BI-IUBLibrary | Public BI Benchmark — IUBLibrary | Public BI Benchmark workload IUBLibrary — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/IUBLibrary | Tabular (CSV) | MIT | 1,795 | — | 0.2 MB | 0.2 MB |
| BI-Medicare1 | Public BI Benchmark — Medicare1 | Public BI Benchmark workload Medicare1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare1 | Tabular (CSV) | MIT | 17,290,144 | — | 939.9 MB | 800.9 MB |
| BI-Medicare2 | Public BI Benchmark — Medicare2 | Public BI Benchmark workload Medicare2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare2 | Tabular (CSV) | MIT | 18,306,546 | — | 853.2 MB | 939.2 MB |
| BI-Medicare3 | Public BI Benchmark — Medicare3 | Public BI Benchmark workload Medicare3 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Medicare3 | Tabular (CSV) | MIT | 9,287,877 | — | 452.9 MB | 481.6 MB |
| BI-MedPayment1 | Public BI Benchmark — MedPayment1 | Public BI Benchmark workload MedPayment1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MedPayment1 | Tabular (CSV) | MIT | 9,153,273 | — | 419.5 MB | 472.5 MB |
| BI-MedPayment2 | Public BI Benchmark — MedPayment2 | Public BI Benchmark workload MedPayment2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MedPayment2 | Tabular (CSV) | MIT | 9,153,273 | — | 488.0 MB | 524.0 MB |
| BI-MLB | Public BI Benchmark — MLB | Public BI Benchmark workload MLB — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MLB | Tabular (CSV) | MIT | 32,472,563 | — | 1,160.3 MB | 2,018.4 MB |
| BI-Motos | Public BI Benchmark — Motos | Public BI Benchmark workload Motos — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Motos | Tabular (CSV) | MIT | 28,364,361 | — | 581.7 MB | 894.2 MB |
| BI-MulheresMil | Public BI Benchmark — MulheresMil | Public BI Benchmark workload MulheresMil — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/MulheresMil | Tabular (CSV) | MIT | 7,561,432 | — | 464.9 MB | 622.8 MB |
| BI-NYC | Public BI Benchmark — NYC | Public BI Benchmark workload NYC — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/NYC | Tabular (CSV) | MIT | 19,242,976 | — | 856.5 MB | 733.2 MB |
| BI-PanCreactomy1 | Public BI Benchmark — PanCreactomy1 | Public BI Benchmark workload PanCreactomy1 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/PanCreactomy1 | Tabular (CSV) | MIT | 9,153,273 | — | 423.0 MB | 475.2 MB |
| BI-PanCreactomy2 | Public BI Benchmark — PanCreactomy2 | Public BI Benchmark workload PanCreactomy2 — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/PanCreactomy2 | Tabular (CSV) | MIT | 18,306,546 | — | 845.9 MB | 948.3 MB |
| BI-Physicians | Public BI Benchmark — Physicians | Public BI Benchmark workload Physicians — pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. | https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Physicians | Tabular (CSV) | MIT | 9,153,273 | — | 419.5 MB | 473.4 MB |
BI-Provider Public BI Benchmark β€” Provider Public BI Benchmark workload Provider β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Provider Tabular (CSV) MIT 73,226,184 β€” 3,412.1 MB 3,752.2 MB
BI-RealEstate1 Public BI Benchmark β€” RealEstate1 Public BI Benchmark workload RealEstate1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/RealEstate1 Tabular (CSV) MIT 39,062,718 β€” 2,367.9 MB 2,418.3 MB
BI-RealEstate2 Public BI Benchmark β€” RealEstate2 Public BI Benchmark workload RealEstate2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/RealEstate2 Tabular (CSV) MIT 66,415,881 β€” 5,305.3 MB 6,127.7 MB
BI-Redfin1 Public BI Benchmark β€” Redfin1 Public BI Benchmark workload Redfin1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin1 Tabular (CSV) MIT 12,120,220 β€” 1,619.4 MB 1,642.1 MB
BI-Redfin2 Public BI Benchmark β€” Redfin2 Public BI Benchmark workload Redfin2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin2 Tabular (CSV) MIT 9,090,165 β€” 1,214.9 MB 1,227.8 MB
BI-Redfin3 Public BI Benchmark β€” Redfin3 Public BI Benchmark workload Redfin3 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin3 Tabular (CSV) MIT 6,534,558 β€” 875.0 MB 909.6 MB
BI-Redfin4 Public BI Benchmark β€” Redfin4 Public BI Benchmark workload Redfin4 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Redfin4 Tabular (CSV) MIT 3,267,279 β€” 457.8 MB 477.1 MB
BI-Rentabilidad Public BI Benchmark β€” Rentabilidad Public BI Benchmark workload Rentabilidad β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Rentabilidad Tabular (CSV) MIT 3,595,905 β€” 759.9 MB 952.7 MB
BI-Romance Public BI Benchmark β€” Romance Public BI Benchmark workload Romance β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Romance Tabular (CSV) MIT 3,173,176 β€” 325.9 MB 463.9 MB
BI-SalariesFrance Public BI Benchmark β€” SalariesFrance Public BI Benchmark workload SalariesFrance β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/SalariesFrance Tabular (CSV) MIT 16,223,877 β€” 2,493.9 MB 2,832.0 MB
BI-TableroSistemaPenal Public BI Benchmark β€” TableroSistemaPenal Public BI Benchmark workload TableroSistemaPenal β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TableroSistemaPenal Tabular (CSV) MIT 25,274,916 β€” 272.2 MB 386.0 MB
BI-Taxpayer Public BI Benchmark β€” Taxpayer Public BI Benchmark workload Taxpayer β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Taxpayer Tabular (CSV) MIT 91,532,730 β€” 4,264.8 MB 4,685.6 MB
BI-Telco Public BI Benchmark β€” Telco Public BI Benchmark workload Telco β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Telco Tabular (CSV) MIT 2,913,060 β€” 1,051.0 MB 951.5 MB
BI-TrainsUK1 Public BI Benchmark β€” TrainsUK1 Public BI Benchmark workload TrainsUK1 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TrainsUK1 Tabular (CSV) MIT 12,909,724 β€” 443.7 MB 510.5 MB
BI-TrainsUK2 Public BI Benchmark β€” TrainsUK2 Public BI Benchmark workload TrainsUK2 β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/TrainsUK2 Tabular (CSV) MIT 31,123,554 β€” 1,263.8 MB 1,596.0 MB
BI-Uberlandia Public BI Benchmark β€” Uberlandia Public BI Benchmark workload Uberlandia β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Uberlandia Tabular (CSV) MIT 7,559,227 β€” 464.9 MB 627.1 MB
BI-USCensus Public BI Benchmark β€” USCensus Public BI Benchmark workload USCensus β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/USCensus Tabular (CSV) MIT 9,398,385 β€” 2,421.9 MB 3,037.9 MB
BI-Wins Public BI Benchmark β€” Wins Public BI Benchmark workload Wins β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/Wins Tabular (CSV) MIT 2,115,449 β€” 734.3 MB 820.4 MB
BI-YaleLanguages Public BI Benchmark β€” YaleLanguages Public BI Benchmark workload YaleLanguages β€” pipe-delimited CSV partitions drawn from a real-world BI dashboard. Part of CWI's 47-workload corpus assembled to stress-test columnar query engines on quirky production data: inconsistent encodings, mixed quoting, sparse columns, real-world cardinalities. Each workload shares a schema across its partitions and ships as raw CSV (no parquet upstream); raincloud merges them via public_bi_merge. https://github.com/cwida/public_bi_benchmark/tree/master/benchmark/YaleLanguages Tabular (CSV) MIT 5,762,082 β€” 94.7 MB 123.0 MB
Bike Sharing UCI ML Repository β€” Bike Sharing This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership through rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues. https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset Tabular (CSV) CC-BY-4.0 17,379 β€” 0.2 MB 0.2 MB
Breast Cancer UCI ML Repository β€” Breast Cancer This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. (See also lymphography and primary-tumor.) This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal. https://archive.ics.uci.edu/dataset/14/breast+cancer Tabular (CSV) CC-BY-4.0 286 β€” 0.0 MB 0.0 MB
Breast Cancer Wisconsin (Diagnostic) UCI ML Repository β€” Breast Cancer Wisconsin (Diagnostic) Diagnostic Wisconsin Breast Cancer Database. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/. The separating plane was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming," Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method that uses linear programming to construct a decision tree. https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic Tabular (CSV) CC-BY-4.0 569 β€” 0.1 MB 0.1 MB
Breast Cancer Wisconsin (Original) UCI ML Repository β€” Breast Cancer Wisconsin (Original) Original Wisconsin Breast Cancer Database. Samples arrive periodically as Dr. Wolberg reports his clinical cases, so the database reflects this chronological grouping of the data. The grouping information, removed from the data itself, is: Group 1: 367 instances (January 1989); Group 2: 70 instances (October 1989); Group 3: 31 instances (February 1990); Group 4: 17 instances (April 1990); Group 5: 48 instances (August 1990); Group 6: 49 instances (updated January 1991); Group 7: 31 instances (June 1991); Group 8: 86 instances (November 1991). https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original Tabular (CSV) CC-BY-4.0 699 β€” 0.0 MB 0.0 MB
CA Housing California Housing Prices Median house prices for California districts derived from the 1990 census. This is the dataset used in the second chapter of AurΓ©lien GΓ©ron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome. The data contains information from the 1990 California census. β€” adapted from the dataset's Kaggle description (camnugent/california-housing-prices). http://lib.stat.cmu.edu/datasets/houses.zip Custom CC0-1.0 20,640 β€” 0.3 MB 0.4 MB
Car Evaluation UCI ML Repository β€” Car Evaluation Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure: CAR (car acceptability) branches into PRICE (overall price: buying price, maintenance price) and TECH (technical characteristics: COMFORT, …). https://archive.ics.uci.edu/dataset/19/car+evaluation Tabular (CSV) CC-BY-4.0 1,728 β€” 0.0 MB 0.0 MB
Cardiovascular Diseases Risk Prediction Dataset Cardiovascular Diseases Risk Prediction Dataset The 2021 BRFSS Dataset from CDC. CVDs risk prediction using personal lifestyle factors. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. β€” adapted from the dataset's Kaggle description (alphiree/cardiovascular-diseases-risk-prediction-dataset). https://www.cdc.gov/brfss/ Custom CC0-1.0 418,268 β€” 32.3 MB 58.0 MB
CDC Diabetes Health Indicators UCI ML Repository β€” CDC Diabetes Health Indicators The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or healthy. Dataset link: https://www.cdc.gov/brfss/annual_data/annual_2014.html https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators Tabular (CSV) CC-BY-4.0 253,680 β€” 2.2 MB 1.4 MB
Census Income UCI ML Repository β€” Census Income Predict whether income exceeds $50K/yr based on census data. Also known as the Adult dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year. https://archive.ics.uci.edu/dataset/20/census+income Tabular (CSV) CC-BY-4.0 48,842 β€” 0.4 MB 0.4 MB
Chess (Lichess) Lichess Standard Rated Games (2013-01 monthly dump) 20,000+ Lichess games, including moves, victor, rating, opening details and more. This is a set of just over 20,000 games collected from a selection of users on the site Lichess.org, along with instructions on how to collect more. I will also upload more games in the future as I collect them. I collected this data using the Lichess API, which enables collection of any given user's game history. β€” adapted from the dataset's Kaggle description (datasnaek/chess). https://database.lichess.org/ Custom CC0-1.0 121,332 β€” 22.8 MB 23.9 MB
Chronic Kidney Disease UCI ML Repository β€” Chronic Kidney Disease This dataset can be used to predict chronic kidney disease; it was collected from a hospital over a period of nearly two months. The following field abbreviations are used: age: age; bp: blood pressure; sg: specific gravity; al: albumin; su: sugar; rbc: red blood cells; pc: pus cell; pcc: pus cell clumps; ba: bacteria; bgr: blood glucose random; bu: blood urea; sc: serum creatinine; sod: sodium; pot: potassium; hemo: hemoglobin; pcv: packed cell volume; wc: white blood cell count; rc: red blood cell count; htn: hypertension; dm: diabetes mellitus; cad: coronary artery disease… https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease Tabular (CSV) CC-BY-4.0 400 β€” 0.0 MB 0.0 MB
ClickBench Hits ClickBench Hits (Yandex Metrica log) 100M-row Yandex Metrica web-analytics event log used by the ClickBench OLAP benchmark suite. 105 columns covering URL, user agent, geo, click counts, and session metadata β€” heterogeneous string + numeric mix that exercises columnar query engines on wide, sparse rows. https://github.com/ClickHouse/ClickBench/blob/main/LICENSE Tabular (Parquet) Apache-2.0 99,997,497 β€” 9,497.6 MB 14,562.4 MB
CNN/DailyMail abisee/cnn_dailymail β€” news summarization (3.0.0) 300K English news articles from CNN and the Daily Mail (2007–2015) paired with multi-sentence reference summaries. Each row carries article, highlights, and id. Standard summarization eval; uses the 3.0.0 (non-anonymized) version. Original-author repo (Abi See). https://huggingface.co/datasets/abisee/cnn_dailymail Tabular (Parquet) Apache-2.0 311,971 1 543.3 MB 713.8 MB
CodeContests deepmind/code_contests (Li et al., 2022) 13K competitive-programming problems from Codeforces, AtCoder, et al., released with AlphaCode. Each row carries name, description, cf_* Codeforces metadata, public_tests, private_tests, and reference solutions in multiple languages. Used for code-generation evaluation. https://huggingface.co/datasets/deepmind/code_contests Tabular (Parquet) CC-BY-4.0 13,610 1 4,108.3 MB β€”
Cohere Wikipedia Simple (multilingual-v3 embeddings) Cohere/wikipedia-2023-11-embed-multilingual-v3 β€” Simple English subset (1024d) Simple English Wikipedia (646k passage-chunked rows) paired with Cohere's embed-multilingual-v3 1024-dimensional embeddings. The simple subset of the larger Cohere/wikipedia-2023-11-embed-multilingual-v3 repo, scoped via hf_allow_patterns. Showcases fixed_size_list<float, 1024> via the cast in hf_concat_splits. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3 Tabular (Parquet) Apache-2.0 646,424 β€” 1,263.2 MB 2,293.7 MB
COIG BAAI/COIG β€” Chinese Open Instruction Generalist BAAI's Chinese instruction-tuning corpus assembled from translation, exam, leetcode, human-value alignment, and counterfactual-correction subsets. Each row carries instruction, input, output, plus a subset-source tag. The default config concatenates the translatable and non-translatable splits. https://huggingface.co/datasets/BAAI/COIG Tabular (Parquet) Apache-2.0 178,246 1 55.5 MB 101.7 MB
Concrete Compressive Strength UCI ML Repository β€” Concrete Compressive Strength Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. 1,030 instances; 9 attributes (8 quantitative input variables and 1 quantitative output variable); no missing attribute values. https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength Tabular (CSV) CC-BY-4.0 1,030 β€” 0.0 MB 0.0 MB
CoronaHack -Chest X-Ray-Dataset CoronaHack -Chest X-Ray-Dataset Classify chest X-ray images for COVID-19. The COVID-19 virus affects the respiratory system, and chest X-rays are one of the important imaging methods for identifying it. With this chest X-ray dataset, develop a machine-learning model to classify X-rays of healthy patients vs. those affected by pneumonia (COVID-19), powering an AI application for faster COVID-19 testing. Credit to a postdoctoral fellow at Mila, University of Montreal, for the COVID-19 images; about 80% of the dataset was collected from other sources. β€” adapted from the dataset's Kaggle description (praveengovi/coronahack-chest-xraydataset). https://github.com/ieee8023/covid-chestxray-dataset Tabular (CSV) CC-BY-4.0 5,910 β€” 0.1 MB 0.1 MB
Cosmopedia (Stanford subset) HuggingFaceTB/cosmopedia β€” Stanford-style synthetic textbooks subset Synthetic textbook-style content generated by Mixtral-8x7B-Instruct in the style of Stanford coursework (one of the eight cosmopedia subsets). 13 shards, ~5GB. Open-weight-model provenance β€” no closed-API ToS issue. Flip allow patterns to auto_math_text/, khanacademy/, openstax/, stories/, or web_samples_v1/ for the other subsets. https://huggingface.co/datasets/HuggingFaceTB/cosmopedia Tabular (Parquet) Apache-2.0 1,020,024 1 2,022.9 MB 3,116.9 MB
Countries of the World CIA World Factbook (JSON mirror β†’ VARIANT parquet) Country names linked to region, population, area size, GDP, infant mortality and more. A world fact sheet, fun to link with other datasets. All of these data come from US-government sources. β€” adapted from the dataset's Kaggle description (fernandol/countries-of-the-world). https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html Custom CC0-1.0 262 β€” 1.6 MB 4.7 MB
COVID-19 data from Johns Hopkins University COVID-19 data from Johns Hopkins University Updated daily at 6am UTC in both raw and convenient form. This is a daily-updating version of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). The data updates every day at 6am UTC, just after the raw JHU data typically updates. It is available in both a raw form (files with the prefix RAW) and a convenient form (files prefixed with CONVENIENT). β€” adapted from the dataset's Kaggle description (antgoldbloom/covid19-data-from-john-hopkins-university). https://github.com/CSSEGISandData/COVID-19 Tabular (CSV) CC-BY-4.0 289 β€” 1.5 MB 2.3 MB
COVID-19 World Vaccination Progress COVID-19 World Vaccination Progress Daily and Total Vaccination for COVID-19 in the World from Our World in Data. Data is collected daily from Our World in Data GitHub repository for covid-19, merged and uploaded. Country level vaccination data is gathered and assembled in one single file. Then, this data file is merged with locations data file to include vaccination sources information. β€” adapted from the dataset's Kaggle description (gpreda/covid-world-vaccination-progress). https://github.com/owid/covid-19-data Tabular (CSV) CC0-1.0 196,246 β€” 3.7 MB 6.8 MB
Credit Approval UCI ML Repository β€” Credit Approval This dataset concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. The dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values. https://archive.ics.uci.edu/dataset/27/credit+approval Tabular (CSV) CC-BY-4.0 690 β€” 0.0 MB 0.0 MB
Crimes in Boston Crimes in Boston Times, locations, and descriptions of crimes. Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. Records begin on June 14, 2015 and continue through September 3, 2018. β€” adapted from the dataset's Kaggle description (AnalyzeBoston/crimes-in-boston). https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system Tabular (CSV) CC0-1.0 319,073 β€” 5.9 MB 20.0 MB
Databricks Dolly 15k databricks-dolly-15k (Conover et al., 2023) 15k human-written instruction/response pairs across 7 task categories (closed QA, classification, summarization, etc.), generated by Databricks employees specifically for instruction-tuning open LLMs. Distinguished by being entirely human-authored (no model output recycled as data), making it CC-BY-SA-3.0 β€” commercially-clear unlike most synthetic instruction corpora. https://huggingface.co/datasets/databricks/databricks-dolly-15k Structured (JSON) CC-BY-SA-3.0 15,011 β€” 5.1 MB 6.3 MB
dbpedia + Embeddings DBpedia Entities 1M + OpenAI text-embedding-3-large (1536-dim) 1M DBpedia entity abstracts paired with 1536-dim OpenAI text-embedding-3-large embeddings (Qdrant's release). Each row carries the entity's title, abstract, and the dense vector. Standard reference for vector-search benchmarks. https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M Tabular (Parquet) CC-BY-SA-4.0 1,000,000 β€” 6,898.3 MB 10,616.6 MB
Default of Credit Card Clients UCI ML Repository β€” Default of Credit Card Clients This research examines the case of customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result β€” credible or not-credible clients. Because the real probability of default is unknown, this study presents the novel Sorting Smoothing Method to estimate it. https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients Tabular (CSV) CC-BY-4.0 30,000 β€” 1.2 MB 1.0 MB
Diabetes UCI ML Repository β€” Diabetes This diabetes dataset is from AIM '94 Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). Thus paper records have fictitious uniform recording times whereas electronic records have more realistic time stamps. Diabetes files consist of four fields per record. https://archive.ics.uci.edu/dataset/34/diabetes Tabular (CSV) CC-BY-4.0 29,264 β€” 0.1 MB 0.1 MB
Diabetes 130-US Hospitals for Years 1999-2008 UCI ML Repository β€” Diabetes 130-US Hospitals for Years 1999-2008 The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, with over 50 features representing patient and hospital outcomes. Each row concerns hospital records of patients diagnosed with diabetes who underwent laboratory tests, received medications, and stayed up to 14 days. The goal is to determine early readmission of the patient within 30 days of discharge. The problem is important for the following reasons. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fails to attend to glycemic control. Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the morbidity and mortality of the patients, who may face complications associated with diabetes. Information was extracted from the database for encounters that satisfied the following criteria: (1) it is an inpatient encounter (a hospital admission); (2) it is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis; (3) the length of stay was at least 1 day and at most 14 days; (4) laboratory tests were performed during the encounter; (5) medications were administered during the encounter. https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 Tabular (CSV) CC-BY-4.0 101,766 β€” 2.0 MB 2.2 MB
Diabetes Health Diabetes Health Indicators Dataset 253,680 survey responses from the cleaned BRFSS 2015 survey, plus a balanced variant. Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate blood glucose levels, which can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are released into the bloodstream. β€” adapted from the dataset's Kaggle description (alexteboul/diabetes-health-indicators-dataset). https://www.cdc.gov/brfss/annual_data/annual_2015.html Custom CC0-1.0 441,456 β€” 33.1 MB 57.6 MB
Disease Symptoms Disease Symptom Prediction Helps to create a disease-prediction or healthcare system. A dataset to provide students a source to create a healthcare-related system. A project using double Decision Tree Classification is available at https://github.com/itachi9604/healthcare-chatbot; a get_dummies-processed file is available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv. Content: there are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language. β€” adapted from the dataset's Kaggle description (itachi9604/disease-symptom-description-dataset). https://github.com/itachi9604/healthcare-chatbot Tabular (CSV) CC-BY-SA-4.0 4,920 β€” 0.0 MB 0.1 MB
Docmatix (zero-shot subset) HuggingFaceM4/Docmatix β€” zero-shot evaluation subset Small zero-shot evaluation subset of Docmatix, the synthetic Doc-VQA training corpus released with Idefics3 (~1M images total, 9.5M Q&A pairs in the full set). Each row pairs page images with question/answer tuples. https://huggingface.co/datasets/HuggingFaceM4/Docmatix Tabular (Parquet) MIT 1,900 1 604.5 MB 615.4 MB
Dry Bean UCI ML Repository β€” Dry Bean Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains. Seven different types of dry beans were used in this research, taking into account features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish the seven registered varieties, which have similar features, in order to obtain uniform seed classification. https://archive.ics.uci.edu/dataset/602/dry+bean+dataset Tabular (CSV) CC-BY-4.0 13,611 β€” 1.7 MB 0.8 MB
Electric Motor Temperature Electric Motor Temperature 185 hours of recordings from a permanent magnet synchronous motor (PMSM). UPDATE 26.04.2021: all data is deanonymized now. Moreover, 17 additional measurement profiles were added, expanding the dataset from 138 hours to 185 hours of records. The data set comprises sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench. β€” adapted from the dataset's Kaggle description (wkirgsn/electric-motor-temperature). https://github.com/upb-lea/deep-pmsm Tabular (CSV) CC-BY-SA-4.0 1,330,816 β€” 84.2 MB 102.5 MB
ElectricityLoadDiagrams20112014 UCI ML Repository β€” ElectricityLoadDiagrams20112014 This data set contains the electricity consumption of 370 points/clients. The data set has no missing values. Values are in kW at 15-minute intervals; to convert to kWh, divide the values by 4. Each column represents one client. Some clients were created after 2011; in these cases consumption was considered zero. All time labels refer to Portuguese time, and all days present 96 measures (24*4). Every year on the March time-change day (which has only 23 hours), the values between 1:00 am and 2:00 am are zero for all points. Every year on the October time-change day (which has 25 hours), the values between 1:00 am and 2:00 am aggregate the consumption of two hours. https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014 Tabular (CSV) CC-BY-4.0 140,256 β€” 40.6 MB 52.1 MB
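The kW-to-kWh conversion this row describes is simple arithmetic: each value is average power over a 15-minute window, so energy per interval is the reading divided by 4. A minimal sketch (the function names are illustrative, not part of the dataset):

```python
def kw_to_kwh(readings_kw):
    """Convert 15-minute kW load readings to kWh per interval.

    Each value is average power (kW) over a 15-minute window, so
    energy = power * 0.25 h, i.e. divide by 4 as the dataset notes.
    """
    return [v / 4.0 for v in readings_kw]


def daily_energy_kwh(readings_kw):
    """Total kWh for one day of 96 readings (24 h * 4 per hour)."""
    return sum(kw_to_kwh(readings_kw))
```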
Emissions by Country Emissions by Country Quantifying sources and emission levels of CO2 by country. It contains information on total emissions as well as emissions from coal, oil, gas, cement production and flaring, and other sources. The data also provides a breakdown of per capita CO2 emissions per country, showing which countries lead in pollution levels and identifying potential areas where reduction efforts should be concentrated. This dataset is essential for anyone who wants to get informed about their own environmental footprint or conduct research on international development trends. β€” adapted from the dataset's Kaggle description (thedevastator/global-fossil-co2-emissions-by-country-2002-2022). https://zenodo.org/record/7215364 Tabular (CSV) CC0-1.0 63,104 β€” 0.7 MB 1.1 MB
Emotions NLP Emotions Dataset for NLP Emotions dataset for NLP classification tasks. A few questions your emotion classification model can answer based on customer reviews: What is the sentiment of a customer comment? What is the mood of today's special food? β€” adapted from the dataset's Kaggle description (praveengovi/emotions-dataset-for-nlp). https://www.aclweb.org/anthology/D18-1404/ Tabular (Parquet) CC-BY-SA-4.0 416,809 β€” 16.1 MB 19.3 MB
Energy Efficiency UCI ML Repository β€” Energy Efficiency This study looked into assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real-valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer. https://archive.ics.uci.edu/dataset/242/energy+efficiency Tabular (CSV) CC-BY-4.0 768 β€” 0.0 MB 0.0 MB
Estimation of Obesity Levels Based On Eating Habits and Physical Condition UCI ML Repository β€” Estimation of Obesity Levels Based On Eating Habits and Physical Condition This dataset includes data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition Tabular (CSV) CC-BY-4.0 2,111 β€” 0.1 MB 0.1 MB
Exoplanet Hunting in Deep Space Exoplanet Hunting in Deep Space Kepler labelled time series data for the search for new Earths. The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1: a label of 2 indicates that the star is confirmed to have at least one exoplanet in orbit; some observations are in fact multi-planet systems. β€” adapted from the dataset's Kaggle description (keplersmachines/kepler-labelled-time-series-data). https://github.com/winterdelta/keplersmachines Tabular (CSV) CC0-1.0 5,087 β€” 102.2 MB 116.2 MB
F1 Championship Formula 1 World Championship (1950 - 2024) F1 race data from 1950 to 2024. Formula 1 (a.k.a. F1 or Formula One) is the highest class of single-seater auto racing sanctioned by the FΓ©dΓ©ration Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. β€” adapted from the dataset's Kaggle description (rohanrao/formula-1-world-championship-1950-2020). https://ergast.com/mrd/ Tabular (CSV) CC0-1.0 26,759 β€” 0.4 MB 0.7 MB
FineTranslations (Swedish sample) HuggingFaceFW/finetranslations β€” Swedish parallel-text sample Single 2 GB shard of HuggingFace's 1+T-token translation corpus β€” Swedish-anchored parallel-text. Schema centers on paired source/target text plus translation-quality metadata. Bound here to the first shard (swe_Latn/train/0000.parquet at refs/convert/parquet); the full Swedish split is 182 shards Γ— 2 GB = ~365 GB. Flip allow_patterns to swe_Latn/train/*.parquet for the full corpus, or to <lang>_<script>/train/0000.parquet for any of the other ~600 language anchors at the same one-shard size. https://huggingface.co/datasets/HuggingFaceFW/finetranslations Tabular (Parquet) ODC-By-1.0 321,000 1 1,450.3 MB 1,853.2 MB
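The allow_patterns shard selection mentioned in this row behaves like glob filtering over the repo's file paths. A minimal local sketch with Python's fnmatch (select_shards is an illustrative helper, not a real API; an actual mirror would pass allow_patterns to huggingface_hub's snapshot_download):

```python
from fnmatch import fnmatch

# Patterns from the catalog entry: one shard for a sample,
# or a glob for the full Swedish split.
SAMPLE_PATTERN = "swe_Latn/train/0000.parquet"
FULL_SPLIT_PATTERN = "swe_Latn/train/*.parquet"


def select_shards(repo_files, allow_patterns):
    """Keep the repo files matching any of the glob patterns,
    mimicking huggingface_hub-style allow_patterns filtering."""
    return [f for f in repo_files
            if any(fnmatch(f, p) for p in allow_patterns)]
```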
FitBit Tracker FitBit Fitness Tracker Data Pattern recognition with tracker data: improve your overall health. This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). β€” adapted from the dataset's Kaggle description (arashnic/fitbit). https://zenodo.org/record/53894#.YMoUpnVKiP9 Tabular (CSV) CC0-1.0 1,397 β€” 0.0 MB 0.1 MB
Football Results International football results from 1872 to 2026 An up-to-date dataset of over 49,000 international football results. Well, what happened was that I was looking for a semi-definite easy-to-read list of international football matches and couldn't find anything decent. So I took it upon myself to collect it for my own use. I might as well share it. β€” adapted from the dataset's Kaggle description (martj42/international-football-results-from-1872-to-2017). https://github.com/martj42/international_results Tabular (CSV) CC0-1.0 49,328 β€” 0.4 MB 0.5 MB
Forest Fires UCI ML Repository β€” Forest Fires This is a difficult regression task, where the aim is to predict the burned area of forest fires in the northeast region of Portugal using meteorological and other data (see details at: http://www.dsi.uminho.pt/~pcortez/forestfires). In [Cortez and Morais, 2007], the output 'area' was first transformed with a ln(x+1) function. Then, several data mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 10-fold cross-validation over 30 runs, and two regression metrics were measured: MAD and RMSE. https://archive.ics.uci.edu/dataset/162/forest+fires Tabular (CSV) CC-BY-4.0 517 β€” 0.0 MB 0.0 MB
FRAMES google/frames-benchmark β€” Factuality, Retrieval, And reasoning MEasurement Set 824 multi-hop fact questions designed to require both retrieval (across multiple Wikipedia articles) and reasoning. Each row carries the Prompt, Answer, and wiki_links (the relevant source URLs). Small but structurally rich for retrieval-augmented eval. https://huggingface.co/datasets/google/frames-benchmark Tabular (Parquet) Apache-2.0 824 1 0.2 MB 0.2 MB
German Traffic Signs GTSRB - German Traffic Sign Recognition Benchmark Multi-class, single-image classification challenge. We cordially invite researchers from relevant fields to participate: the competition is designed to allow for participation without special domain knowledge. Our benchmark has the following properties: single-image, multi-class classification; more than 40 classes; more than 50,000 images in total; a large, lifelike database. β€” adapted from the dataset's Kaggle description (meowmeowmeowmeowmeow/gtsrb-german-traffic-sign). http://benchmark.ini.rub.de/ Tabular (CSV) CC0-1.0 12,630 β€” 0.1 MB 0.1 MB
GHCN-Daily NOAA Global Historical Climatology Network (Daily) 3.17B daily weather observations from NOAA's Global Historical Climatology Network β€” surface-station readings since 1763. One row per (station, day, element) with min/max temperature, precipitation, snowfall, etc. The reference dataset for century-scale climate time-series analysis. https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily Custom US-Government-PD 3,178,406,394 β€” 5,975.8 MB 10,492.7 MB
Glass Classification Glass Classification Can you correctly identify glass type? https://archive.ics.uci.edu/ Tabular (CSV) DbCL-1.0 214 β€” 0.0 MB 0.0 MB
GloVe 6B 100d GloVe 6B Global Vectors (100-dimensional) 400,000 English word embeddings at 100 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The middle of the three GloVe-6B slugs; the dimension most commonly cited in classic NLP papers. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 143.9 MB 128.6 MB
GloVe 6B 200d GloVe 6B Global Vectors (200-dimensional) 400,000 English word embeddings at 200 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The largest of the three GloVe-6B slugs; higher-fidelity vectors at 4Γ— the storage. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 285.3 MB 255.1 MB
GloVe 6B 50d GloVe 6B Global Vectors (50-dimensional) 400,000 English word embeddings at 50 dimensions, trained on 6B tokens from Wikipedia 2014 + Gigaword 5 (Pennington et al., EMNLP 2014). The smallest of the three GloVe-6B slugs. https://nlp.stanford.edu/projects/glove/ Structured (Embeddings) PDDL-1.0 400,000 β€” 73.3 MB 68.3 MB
goodbooks-10k goodbooks-10k Ten thousand books, one million ratings. Also books marked to-read, and tags. This version of the dataset is obsolete: it contains duplicate ratings (same user_id, book_id), as reported by Philipp Spachtholz in his illustrious notebook. The current version has duplicates removed and more ratings (six million), sorted by time. β€” adapted from the dataset's Kaggle description (zygmunt/goodbooks-10k). https://github.com/zygmuntz Tabular (CSV) CC-BY-SA-4.0 5,976,479 β€” 18.7 MB 23.2 MB
Google Cluster Trace 2011 β€” machine_events Google Cluster Trace v2 β€” machine_events table (2011) Machine-add/remove/update events from Google's 29-day production-cluster trace (May 2011, ~12.5K-machine cluster). One row per event: timestamp (microseconds since trace start), machine ID, event type (ADD / REMOVE / UPDATE), platform ID (string-hashed), and CPU+memory capacity. Six columns with no header in the source file β€” column names autogenerate as f0..f5. Schema documented at the upstream schema.csv. Subset of the broader cluster trace (job_events, task_events, task_usage, machine_attributes, task_constraints β€” all present in the same GCS bucket). machine_events is a single ~347 KB compressed file; the task_events / task_usage tables are 500 parts each and ~50 GB total. https://github.com/google/cluster-data Tabular (CSV) CC-BY-3.0 37,780 1 0.3 MB 0.4 MB
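Since machine_events ships without a header row, assigning the documented column names at parse time avoids the autogenerated f0..f5. A minimal stdlib sketch (parse_machine_events and the column-name list are assumptions based on this row and the upstream schema.csv, not an official loader):

```python
import csv
import io

# Column names per the upstream schema.csv; without them the six
# headerless columns would surface as f0..f5.
MACHINE_EVENT_COLUMNS = [
    "timestamp", "machine_id", "event_type",
    "platform_id", "cpu_capacity", "memory_capacity",
]


def parse_machine_events(text):
    """Parse headerless machine_events CSV text into dicts keyed by
    the documented column names; empty fields become None."""
    reader = csv.reader(io.StringIO(text))
    return [
        {name: (value if value != "" else None)
         for name, value in zip(MACHINE_EVENT_COLUMNS, record)}
        for record in reader
    ]
```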
GSM8K Grade School Math 8K (Cobbe et al., 2021) 8.5k grade-school math word problems with step-by-step natural-language reasoning in the answer (Cobbe et al., 2021). Each problem requires 2–8 elementary arithmetic steps. The standard arithmetic-reasoning eval for chain-of-thought prompting; this is the main config (a sibling socratic config exists upstream but is omitted here). https://huggingface.co/datasets/openai/gsm8k Tabular (Parquet) MIT 8,792 β€” 1.8 MB 2.5 MB
Hacker News Hacker News posts + comments archive 28.7M Hacker News posts and comments spanning the site's full history (2007 onward). Single flat parquet from Google's bigquery-public-data export; each row carries id, type (story/comment/poll), author, time, parent, title, text, and url. Standard dataset for Hacker News thread analysis. https://news.ycombinator.com/ Tabular (Parquet) Public 41,813,385 β€” 6,761.8 MB 8,489.9 MB
Heart Disease UCI ML Repository β€” Heart Disease 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). https://archive.ics.uci.edu/dataset/45/heart+disease Tabular (CSV) CC-BY-4.0 303 β€” 0.0 MB 0.0 MB
Heart Disease Health Indicators Dataset Heart Disease Health Indicators Dataset 253,680 survey responses from cleaned BRFSS 2015 - binary classification. Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year β€” making it the leading cause of death. The buildup of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease. β€” adapted from the dataset's Kaggle description (alexteboul/heart-disease-health-indicators-dataset). https://www.cdc.gov/brfss/annual_data/annual_2015.html Custom CC0-1.0 441,456 β€” 33.1 MB 57.6 MB
Heart Disease Indicators Indicators of Heart Disease (2022 UPDATE) 2022 annual CDC survey data of 400k+ adults related to their health status. According to the CDC, heart disease is a leading cause of death for people of most races in the U.S. (African Americans, American Indians and Alaska Natives, and whites). About half of all Americans (47%) have at least 1 of 3 major risk factors for heart disease: high blood pressure, high cholesterol, and smoking. β€” adapted from the dataset's Kaggle description (kamilpytlak/personal-key-indicators-of-heart-disease). https://www.cdc.gov/brfss/annual_data/annual_2022.html Custom CC0-1.0 445,132 β€” 26.8 MB 54.2 MB
Heart Failure Clinical Records UCI ML Repository β€” Heart Failure Clinical Records This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features. A detailed description of the dataset can be found in the Dataset section of the following paper: Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone". BMC Medical Informatics and Decision Making 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5 https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records Tabular (CSV) CC-BY-4.0 299 β€” 0.0 MB 0.0 MB
HellaSwag HellaSwag β€” Commonsense NLI (Zellers et al., ACL 2019) 70k commonsense-NLI multiple-choice questions: pick the plausible ending to a video-caption or how-to context. Adversarial filtering via models-of-the-day was used to keep the wrong endings hard (Zellers et al., ACL 2019). Standard benchmark for commonsense reasoning in LLMs. https://huggingface.co/datasets/Rowan/hellaswag Tabular (Parquet) MIT 59,950 β€” 23.2 MB 30.2 MB
HelpSteer2 nvidia/HelpSteer2 β€” open reward-model training data ~10K human-rated prompt/response pairs scored on five attributes: helpfulness, correctness, coherence, complexity, verbosity (each 0–4). NVIDIA's open-source reward-model training set. Each row carries prompt, response, plus the five score columns. https://huggingface.co/datasets/nvidia/HelpSteer2 Tabular (Parquet) CC-BY-4.0 21,362 1 12.6 MB 20.7 MB
Hotel Booking Hotel Booking Demand From the paper: hotel booking demand datasets. Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? β€” adapted from the dataset's Kaggle description (jessemostipak/hotel-booking-demand). https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md Tabular (CSV) CC-BY-4.0 119,390 β€” 0.9 MB 1.5 MB
HotpotQA (fullwiki) hotpotqa/hotpot_qa β€” multi-hop QA, fullwiki configuration 113K multi-hop QA over Wikipedia. Fullwiki config: each question must be answered by reasoning across multiple Wikipedia paragraphs. Each row carries question, answer, supporting_facts (list of title/sentence-id pairs), and context (list of relevant paragraphs). https://huggingface.co/datasets/hotpotqa/hotpot_qa Tabular (Parquet) CC-BY-SA-4.0 105,257 1 263.0 MB 342.9 MB
Housing Prices Dataset Housing Prices Dataset Housing Prices Prediction - a regression problem. A simple yet challenging project: predict the housing price based on factors like house area, bedrooms, furnishing, and nearness to the main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model? β€” adapted from the dataset's Kaggle description (yasserh/housing-prices-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Housing_Price_Prediction_/main/hs.jpg Tabular (CSV) CC0-1.0 545 β€” 0.0 MB 0.0 MB
Human Activity Recognition Using Smartphones UCI ML Repository β€” Human Activity Recognition Using Smartphones Human Activity Recognition database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones Tabular (CSV) CC-BY-4.0 10,299 β€” 26.4 MB 19.7 MB
HumanEval openai/openai_humaneval (Chen et al., 2021) Canonical 164-problem Python coding benchmark. Each row carries task_id, prompt (function signature + docstring), canonical_solution, test, and entry_point. The standard 'pass@k' eval target. https://huggingface.co/datasets/openai/openai_humaneval Tabular (Parquet) MIT 164 1 0.1 MB 0.1 MB
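The 'pass@k' metric this benchmark targets is usually computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), where n samples were drawn for a
    problem and c of them passed the tests."""
    if n - c < k:
        # Fewer failures than k draws: some sample must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```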
Individual Household Electric Power Consumption UCI ML Repository β€” Individual Household Electric Power Consumption Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available. This archive contains 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months). Notes: 1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt-hours) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3. 2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption Tabular (CSV) CC-BY-4.0 2,075,259 β€” 9.0 MB 22.3 MB
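Note 1 in this row is a direct formula for the per-minute energy not covered by the three sub-meters; a minimal sketch (the function name is illustrative):

```python
def unmetered_energy_wh(global_active_power_kw, sub1, sub2, sub3):
    """Active energy (watt-hours) consumed in one minute by equipment
    not covered by sub-meterings 1-3, per the dataset notes:
    global_active_power * 1000 / 60 - sub_1 - sub_2 - sub_3.

    global_active_power is in kW; the sub-meterings are in Wh.
    """
    return global_active_power_kw * 1000.0 / 60.0 - sub1 - sub2 - sub3
```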
Iowa Liquor Sales Iowa Liquor Sales 12 million alcoholic beverage sales in the Midwest. The Iowa Department of Commerce requires that every store that sells alcohol in bottled form for off-the-premises consumption must hold a class "E" liquor license (an arrangement typical of most state alcohol regulatory bodies). All alcoholic sales made by stores so registered with the Iowa Department of Commerce are logged in the Commerce department system, which is in turn published as open data by the State of Iowa. This dataset contains information on the name, kind, price, quantity, and location of sale of individual containers or packages of containers of alcoholic beverages. β€” adapted from the dataset's Kaggle description (residentmario/iowa-liquor-sales). https://gist.github.com/dannguyen/18ed71d3451d147af414 Tabular (CSV) CC0-1.0 12,591,077 β€” 281.4 MB 350.1 MB
Iris UCI ML Repository β€” Iris A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods. This is one of the earliest datasets used in the literature on classification methods and widely used in statistics and machine learning. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain. This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick@espeedaz.net). https://archive.ics.uci.edu/dataset/53/iris Tabular (CSV) CC-BY-4.0 150 β€” 0.0 MB 0.0 MB
JSONBench (Bluesky 100m) ClickHouse JSONBench β€” Bluesky firehose 100 M records 100 M Bluesky firehose events (likes / follows / posts / reposts / ...) stored as a single VARIANT column; benchmark dataset for semi-structured workloads. https://github.com/ClickHouse/JSONBench Custom Apache-2.0 100,000,000 β€” 11,411.6 MB β€”
Kepler Exoplanet Search Results Kepler Exoplanet Search Results 10,000 exoplanet candidates examined by the Kepler Space Observatory. The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission. β€” adapted from the dataset's Kaggle description (nasa/kepler-exoplanet-search-results). https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html Tabular (CSV) CC0-1.0 9,564 β€” 3.2 MB 3.2 MB
LibriSpeech (test-clean) openslr/librispeech_asr β€” clean test split (sample) LibriSpeech ASR clean test split: ~2.6K read-aloud utterances from audiobooks with verbatim transcripts. Each row carries audio (struct<bytes, path>), text, speaker_id, chapter_id. Sample of the full ~1000-hour corpus; flip allow_patterns to all/train.clean.100/*.parquet (or 360/500) for the bigger splits. https://huggingface.co/datasets/openslr/librispeech_asr Tabular (Parquet) CC-BY-4.0 2,620 1 330.0 MB 350.5 MB
Loan Default Dataset Loan Default Dataset Loan default classification problem. Banks earn major revenue from lending, but lending is often associated with risk: borrowers may default on the loan. β€” adapted from the dataset's Kaggle description (yasserh/loan-default-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Loan_Default_Risk_Expectancy_/main/loan.jpg Tabular (CSV) CC0-1.0 148,670 β€” 2.9 MB 2.9 MB
Lung Cancer Lung Cancer Does smoking cause lung cancer? https://archive.ics.uci.edu/ Tabular (CSV) CC0-1.0 32 β€” 0.0 MB 0.1 MB
MAGIC Gamma Telescope UCI ML Repository β€” MAGIC Gamma Telescope The data are MC generated to simulate registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope Tabular (CSV) CC-BY-4.0 19,020 β€” 1.0 MB 0.5 MB
Marketing Analytics Marketing Analytics Practice exploratory and statistical analysis with marketing data. This data is publicly available on GitHub and can be utilized for EDA, statistical analysis, and visualizations. The data set ifood_df.csv consists of 2,206 customers of the XYZ company, with data on customer profiles, product preferences, campaign successes/failures, and channel performance. Acknowledgement: I do not own this dataset. β€” adapted from the dataset's Kaggle description (jackdaoud/marketing-data). https://github.com/nailson/ifood-data-business-analyst-test Tabular (CSV) CC0-1.0 2,240 β€” 0.1 MB 0.1 MB
MBPP google-research-datasets/mbpp β€” Mostly Basic Python Problems 974 short Python coding problems with reference solutions and three test cases each. Each row carries text (problem statement), code (reference solution), test_list, and test_setup_code. Smaller complement to HumanEval at the entry-level coding-eval tier. https://huggingface.co/datasets/google-research-datasets/mbpp Tabular (Parquet) CC-BY-4.0 974 1 0.1 MB 0.2 MB
Medical Cost Medical Cost Personal Datasets Health Insurance Premium charges based on Gender, BMI and other characteristics. This Dataset is something I found online when I wanted to practice regression models. It is an openly available online dataset at multiple places. Though I do not know the exact origin and collection methodology of the data, I would recommend this dataset to everybody who is just beginning their journey in Data science. β€” adapted from the dataset's Kaggle description (simranjain17/insurance). https://github.com/stedy/Machine-Learning-with-R-datasets Tabular (CSV) DbCL-1.0 1,338 β€” 0.0 MB 0.0 MB
MedMCQA openlifescienceai/medmcqa β€” Indian medical-entrance exam MCQ 194K multiple-choice questions from Indian medical-entrance exams (AIIMS, NEET-PG). Each row carries question, four opa..opd answer options, cop (correct option index), subject_name, topic_name, and exp (explanation). https://huggingface.co/datasets/openlifescienceai/medmcqa Tabular (Parquet) Apache-2.0 193,155 1 53.7 MB 70.8 MB
mlcourse.ai mlcourse.ai Datasets and notebooks of the open Machine Learning course mlcourse.ai. mlcourse.ai is an open Machine Learning course by OpenDataScience (ods.ai), led by Yury Kashnitsky (yorko). Holding both a Ph.D. in applied math and a Kaggle Competitions Master tier, Yury aimed at designing an ML course with a perfect balance between theory and practice. β€” adapted from the dataset's Kaggle description (kashnitsky/mlcourse). https://github.com/Yorko/mlcourse.ai Tabular (CSV) CC-BY-NC-SA-4.0 32,561 β€” 0.3 MB 0.3 MB
MMLU Massive Multitask Language Understanding (Hendrycks et al., 2021) 57-subject multiple-choice eval covering STEM, humanities, social sciences, and professional topics (Hendrycks et al., 2021). 14k test questions with 4 answer choices each. Loaded from the all config (subjects merged into a single parquet with a subject column for grouping); the per-subject configs are not pulled in. The most-cited general-knowledge LLM benchmark. https://huggingface.co/datasets/cais/mmlu Tabular (Parquet) MIT 115,700 β€” 27.9 MB 85.9 MB
MMLU-Pro TIGER-Lab/MMLU-Pro β€” improved 12k-question multitask eval 12k harder multiple-choice questions across 14 categories, with 10 answer options per question (vs MMLU's 4) and chain-of-thought reasoning included. Successor to cais/mmlu we ship as Tier A; small (4 MB) but high-value for LLM evaluation tooling. https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro Tabular (Parquet) MIT 12,102 β€” 2.6 MB 4.5 MB
MMMLU (multilingual MMLU) openai/MMMLU β€” MMLU professionally translated into 14 languages MMLU's 14k test questions translated into 14 languages by professional human translators (released by OpenAI). 14 CSV shards keyed by <LANG>-<COUNTRY> filename; concatenated into one parquet with the language tag preserved as a split column. First multilingual-eval slug in the catalog. https://huggingface.co/datasets/openai/MMMLU Tabular (CSV) MIT 196,588 β€” 36.3 MB 61.6 MB
MMMU MMMU/MMMU β€” Massive Multi-discipline Multimodal Understanding 11.5K multimodal college-level questions across 30 disciplines (art, business, science, health, humanities, tech). Each row carries question text, up to 7 reference images (binary blobs), 4 answer choices, and a discipline label. Comprehensive cross-modal expert-AGI eval. https://huggingface.co/datasets/MMMU/MMMU Tabular (Parquet) Apache-2.0 11,550 1 3,465.0 MB 3,505.0 MB
MNIST ylecun/mnist β€” handwritten digit classification 70K 28x28 grayscale handwritten digit images (60k train + 10k test) with integer 0–9 labels. The canonical small image-classification benchmark. Image column is binary-blob PNG bytes. https://huggingface.co/datasets/ylecun/mnist Tabular (Parquet) MIT 70,000 1 15.1 MB 19.5 MB
Movie Industry Movie Industry Movies dataset for recommendation system. Welcome to the Movie Recommendation Dataset! This dataset is curated for building recommendation systems in the fascinating world of movies. Whether you're a data scientist, machine learning enthusiast, or a movie buff, this dataset provides a rich collection of information about various movies, offering endless possibilities for analysis and recommendation system development. β€” adapted from the dataset's Kaggle description (abdallahwagih/movies). https://github.com/Juanets/movie-stats Tabular (CSV) CC0-1.0 7,668 β€” 0.4 MB 0.5 MB
Mushroom UCI ML Repository β€” Mushroom From the Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom β€” no rule like "leaflets three, let it be" for poisonous oak and ivy. https://archive.ics.uci.edu/dataset/73/mushroom Tabular (CSV) CC-BY-4.0 8,124 β€” 0.0 MB 0.1 MB
NBA Database NBA Database NBA rookies classification. This is publicly available data that has been scraped from NBA statistics. The data is from between 1990 and 2016. Each row describes the performance of a basketball player during their first ('rookie') year. β€” adapted from the dataset's Kaggle description (tombutton/basketball). https://github.com/wyattowalsh/nba-db Tabular (CSV) CC-BY-SA-4.0 1,308 β€” 0.0 MB 0.1 MB
New York City Airbnb Open Data New York City Airbnb Open Data Airbnb listings and metrics in NYC, NY, USA (2019). Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019. This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions. β€” adapted from the dataset's Kaggle description (dgomonov/new-york-city-airbnb-open-data). http://data.insideairbnb.com/united-states/ny/new-york-city/ Tabular (CSV) CC0-1.0 48,895 β€” 2.1 MB 2.2 MB
News Headlines Sarcasm News Headlines Dataset For Sarcasm Detection High-quality dataset for the task of sarcasm and fake-news detection. Past studies in sarcasm detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. β€” adapted from the dataset's Kaggle description (rmisra/news-headlines-dataset-for-sarcasm-detection). https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection Tabular (Parquet) CC-BY-4.0 26,709 β€” 0.8 MB 0.9 MB
NIPS Papers NIPS Papers Titles, authors, abstracts, and extracted text for all NIPS papers (1987-2017). Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world. It covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning. This dataset includes the title, authors, abstracts, and extracted text for all NIPS papers to date (from the first conference in 1987 through 2017). β€” adapted from the dataset's Kaggle description (benhamner/nips-papers). https://github.com/benhamner/nips-papers Tabular (CSV) ODbL-1.0 9,680 β€” 119.4 MB 188.0 MB
No Robots HuggingFaceH4/no_robots β€” 10k human-written instruction-tuning examples 10k high-quality human-written prompt/response pairs from HuggingFaceH4. Each row carries messages: list<struct<role, content>> plus a category label across 10 task types. Smaller but cleaner counterpart to OpenOrca / OpenAssistant in the instruction-tuning corner of the catalog. https://huggingface.co/datasets/HuggingFaceH4/no_robots Tabular (Parquet) CC-BY-NC-4.0 20,000 β€” 13.8 MB 18.3 MB
NYC 311 NYC 311 Service Requests (2020–present) 20.9M non-emergency service requests filed with NYC 311 from 2020 to present. Covers complaints (noise, sanitation, illegal parking, etc.), inquiries, and service requests β€” each row carries borough, agency, category, complaint type, location, and resolution metadata. https://www.nyc.gov/home/terms-of-use.page Tabular (CSV) NYC Open Data (public) 21,007,848 β€” 1,480.6 MB 2,006.2 MB
NYC Parking Tickets NYC Parking Tickets 42.3M Rows of Parking Ticket Data, Aug 2013-June 2017. The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year!). This data is made publicly available to aid in ticket resolution and to guide policymakers. There are four files, covering Aug 2013-June 2017. β€” adapted from the dataset's Kaggle description (new-york-city/nyc-parking-tickets). https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2024/pvqr-7yc4 Tabular (CSV) CC0-1.0 11,353,336 β€” 227.1 MB 375.8 MB
NYC Property Sales NYC Property Sales A year's worth of properties sold on the NYC real estate market. This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period. This dataset contains the location, address, type, sale price, and sale date of building units sold. A reference on the trickier fields: BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). β€” adapted from the dataset's Kaggle description (new-york-city/nyc-property-sales). https://www.nyc.gov/site/finance/property/property-rolling-sales-data.page Custom CC0-1.0 81,567 β€” 1.3 MB 2.3 MB
NYC TLC FHV 2025 NYC TLC FHV Trip Data 2025 Pre-app for-hire-vehicle trips (livery, black-car, luxury) reported by base companies to the TLC for 2025. Smaller and patchier than the high-volume Uber/Lyft data; useful as a contrast to the FHVHV record. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 25,047,544 β€” 253.1 MB 197.2 MB
NYC TLC Green 2025 NYC TLC Green Trip Data 2025 NYC's outer-borough Boro Taxis (street-hail livery cabs introduced in 2013) for calendar year 2025. Pickups are restricted to areas outside Manhattan's central business district plus the airports β€” complements the medallion-fleet yellow data with a different geographic footprint. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 591,375 β€” 12.5 MB 10.4 MB
NYC TLC HVFHV 2025 NYC TLC HVFHV Trip Data 2025 High-volume for-hire-vehicle trips β€” the post-2019 Uber, Lyft, Via, and Juno records the TLC began collecting after the rideshare-cap legislation. Far larger than the medallion or non-HV FHV streams; the dominant share of NYC's ride-hail data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 243,589,684 β€” 5,606.6 MB 5,273.4 MB
NYC TLC Yellow 2025 NYC TLC Yellow Trip Data 2025 Manhattan's iconic medallion taxis (the yellow cabs) β€” every metered trip recorded by the TLC for calendar year 2025. Rides are concentrated in Manhattan and the airports; pickups are exclusive to medallion holders. Long-running monthly time series; the go-to dataset for OLAP demos. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page Tabular (Parquet) NYC TLC Terms of Use 48,722,602 β€” 857.0 MB 784.2 MB
NYPD Complaints NYPD Complaint Data Historic Historic NYPD complaint records from 2006 forward β€” every felony, misdemeanor, and violation reported to police. Each row has incident type, premises, suspect/victim demographics, location (precinct, borough, lat/lon), and dates. https://www.nyc.gov/home/terms-of-use.page Tabular (CSV) NYC Open Data (public) 10,071,507 β€” 320.3 MB 389.1 MB
Online Retail UCI ML Repository β€” Online Retail This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. https://archive.ics.uci.edu/dataset/352/online+retail Tabular (CSV) CC-BY-4.0 541,909 β€” 2.9 MB 3.3 MB
Online Retail II UCI ML Repository β€” Online Retail II A real online retail transaction data set of two years. This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. https://archive.ics.uci.edu/dataset/502/online+retail+ii Tabular (CSV) CC-BY-4.0 1,067,371 β€” 5.9 MB 6.9 MB
Online Shoppers Purchasing Intention Dataset UCI ML Repository β€” Online Shoppers Purchasing Intention Dataset Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples ending with shopping. The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset Tabular (CSV) CC-BY-4.0 12,330 β€” 0.2 MB 0.3 MB
Open Food Facts Open Food Facts product database Crowd-sourced product facts for 4.4M packaged food products. Each row carries deeply nested nutrition, ingredient, allergen, and labelling metadata (origin, packaging, traffic-light scores). One of the larger heavily-nested JSON-shaped corpora in the catalog; the current build ships the canonical JSONL as a single raw_json: string column. https://world.openfoodfacts.org/data Custom ODbL-1.0 4,466,927 β€” 12,910.8 MB 36,431.3 MB
OpenAssistant Conversations (oasst1) OpenAssistant Conversations Release 1 (KΓΆpf et al., NeurIPS 2023) 85k crowd-authored assistant conversation messages organized as tree-structured threads, spanning 35 languages (KΓΆpf et al., NeurIPS 2023). Each row carries quality, toxicity, and emoji-feedback labels alongside the message text and parent pointer. First public RLHF-grade conversation corpus; powers OpenAssistant and many downstream fine-tunes. https://huggingface.co/datasets/OpenAssistant/oasst1 Tabular (Parquet) Apache-2.0 88,838 β€” 27.0 MB 43.2 MB
OpenLibrary Authors Internet Archive OpenLibrary β€” Authors Bibliographic records for book authors β€” name variants, birth/death dates, Wikidata cross-references, biographical notes. Joins to openlibrary-works via author keys. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 15,177,329 β€” 809.1 MB 2,061.5 MB
OpenLibrary Editions Internet Archive OpenLibrary β€” Editions Bibliographic records for individual book editions (ISBN, publisher, language, page count, physical format). Each edition ties back to a works row via the work key. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 55,962,700 β€” 12,931.6 MB 28,298.0 MB
OpenLibrary Works Internet Archive OpenLibrary β€” Works Bibliographic records for literary works (the abstract concept of a book β€” title, author, subjects). Each work has many editions; shipped as a separate slug. Sourced from OpenLibrary's monthly data dumps. https://openlibrary.org/developers/dumps Custom CC0-1.0 40,981,783 β€” 4,245.1 MB 8,826.4 MB
OpenOrca OpenOrca (Open-Orca, 2023) β€” GPT-4 + GPT-3.5 augmented FLAN ~4.2M instruction-response pairs generated via the Orca self-augmentation method, drawn from FLAN-Collection prompts answered by GPT-4 (1M) and GPT-3.5 (3.2M). Concatenated into one parquet with a split column distinguishing the two model sources. https://huggingface.co/datasets/Open-Orca/OpenOrca Tabular (Parquet) MIT 4,233,923 β€” 2,713.5 MB 3,610.5 MB
OpenPowerlifting OpenPowerlifting meet results 3.9M competition-lift records from powerlifting meets worldwide, maintained by openpowerlifting.org. One row per lift attempt with lifter, federation, weight class, equipment, and the four scores (squat / bench / deadlift / total). https://www.openpowerlifting.org/data Tabular (CSV) CC0-1.0 3,916,281 β€” 104.8 MB 156.5 MB
Optical Recognition of Handwritten Digits UCI ML Repository β€” Optical Recognition of Handwritten Digits Two versions of this database are available. We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and a different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels is counted in each block. This generates an 8x8 input matrix where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions. For info on NIST preprocessing routines, see M. D. Garris et al. https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits Tabular (CSV) CC-BY-4.0 5,620 β€” 0.2 MB 0.2 MB
OSM Germany Nodes OpenStreetMap Germany β€” Nodes OSM nodes for Germany from the Geofabrik extract β€” point features (addresses, POIs, traffic signals, etc.) with their geographic coordinates and tag bag. Emitted as GeoParquet 1.1 with WKB geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 432,906,290 β€” 15,335.9 MB 21,061.1 MB
OSM Germany Relations OpenStreetMap Germany β€” Relations OSM relations (composite features) for Germany β€” multi-polygon land covers, route memberships, administrative boundaries. The richest of the three OSM-Germany slugs in tag complexity. Emitted as GeoParquet 1.1 with WKB geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 889,712 β€” 91.2 MB 148.5 MB
OSM Germany Ways OpenStreetMap Germany β€” Ways OSM ways (linear features) for Germany from the Geofabrik extract β€” roads, paths, rivers, building outlines, railway lines. Each row carries a tag bag plus the ordered node references. Emitted as GeoParquet 1.1 with WKB LineString / Polygon geometry. https://www.openstreetmap.org/copyright Geo (OSM PBF) ODbL-1.0 70,097,667 701 9,514.0 MB 14,789.9 MB
OSMI Mental Health in Tech 2016 OSMI 2016 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. With over 1,400 responses so far, the ongoing 2016 survey aims to measure attitudes towards mental health in the tech workplace and examine the frequency of mental health disorders among tech workers. How will this data be used? We are interested in gauging how mental health is viewed within the tech/IT workplace, and the prevalence of certain mental health disorders within the tech industry. β€” adapted from the dataset's Kaggle description (osmi/mental-health-in-tech-2016). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 1,433 β€” 0.1 MB 0.3 MB
OSMI Mental Health in Tech 2017 OSMI 2017 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. OSMI Mental Health in Tech Survey 2017 The 2017 survey aims to measure attitudes towards mental health in the tech workplace, and examine the frequency of mental health disorders among tech workers. How Will This Data Be Used? We are interested in gauging how mental health is viewed within the tech/IT workplace, and the prevalence of certain mental health disorders within the tech industry. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2017). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 756 β€” 0.2 MB 0.4 MB
OSMI Mental Health in Tech 2018 OSMI 2018 Mental Health in Tech Survey Data on prevalence and attitudes towards mental health among tech workers. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2018). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 417 β€” 0.2 MB 0.3 MB
OSMI Mental Health in Tech 2019 OSMI 2019 Mental Health in Tech Survey β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2019). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 352 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2020 OSMI 2020 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-2020-mental-health-in-tech-survey-results). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 180 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2021 OSMI 2021 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmh-2021-mental-health-in-tech-survey-results). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 131 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2022 OSMI 2022 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmh-mental-health-in-tech-survey-2022). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 164 β€” 0.1 MB 0.2 MB
OSMI Mental Health in Tech 2023 OSMI 2023 Mental Health in Tech Survey Results from the yearly survey. β€” adapted from the dataset's Kaggle description (osmihelp/osmi-mental-health-in-tech-survey-2023). https://osmhhelp.org/research.html Tabular (CSV) CC-BY-SA-4.0 6 β€” 0.1 MB 0.1 MB
P3 (eval-subset) bigscience/P3 β€” T0 held-out eval subset (7 prompt templates) Curated 7-config subset of bigscience/P3 (Sanh et al., ICLR 2022) β€” the held-out evaluation tasks used to benchmark T0 / T0pp, each represented by one canonical prompt template. Configs concatenated with a split column distinguishing train / validation / test where available; the source <config> itself is preserved via hf_concat_splits's per-shard tagging. Templates picked: super_glue_rte_GPT_3_style, super_glue_cb_GPT_3_style, super_glue_copa_C1_or_C2_premise_so_because_, super_glue_wic_GPT_3_prompt, super_glue_boolq_GPT_3_Style, winogrande_winogrande_debiased_Replace, hellaswag_Appropriate_continuation_Yes_or_No. Flip the allow-pattern list to any of the 658 configs upstream (see the parquet API for the full inventory) for a different prompt slice. https://huggingface.co/datasets/bigscience/P3 Tabular (Parquet) Apache-2.0 102,963 1 27.4 MB 39.5 MB
Palmer Penguins Palmer Archipelago (Antarctica) penguin data Drop-in replacement for the Iris dataset. Please refer to the official GitHub page for details and license information; the details below have also been taken from there. Artwork: @allisonhorst. β€” adapted from the dataset's Kaggle description (parulpandey/palmer-archipelago-antarctica-penguin-data). https://archive.ics.uci.edu/ Tabular (CSV) CC0-1.0 344 β€” 0.0 MB 0.0 MB
Parkinsons UCI ML Repository β€” Parkinsons Oxford Parkinson's Disease Detection Dataset This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD. The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. https://archive.ics.uci.edu/dataset/174/parkinsons Tabular (CSV) CC-BY-4.0 195 β€” 0.0 MB 0.1 MB
People's Speech (clean validation) MLCommons/peoples_speech β€” clean validation split Validation slice of MLCommons People's Speech, a 30K-hour CC/PD-licensed English supervised conversational ASR corpus. Each row carries audio (struct<bytes, path>) plus alignment metadata. The full corpus is multi-TB; this validation split is a tractable sample. Flip allow patterns to clean/train/*.parquet for the full ~28K-hour clean training set. https://huggingface.co/datasets/MLCommons/peoples_speech Tabular (Parquet) CC-BY-2.0 18,622 1 2,199.5 MB 2,345.0 MB
Phishing Websites UCI ML Repository β€” Phishing Websites This dataset was collected mainly from the PhishTank archive, the MillerSmiles archive, and Google's search operators. One of the challenges faced by our research was the unavailability of reliable training datasets; in fact this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages, hence it is difficult to shape a dataset that covers all possible features. https://archive.ics.uci.edu/dataset/327/phishing+websites Tabular (CSV) CC-BY-4.0 11,055 β€” 0.0 MB 0.1 MB
PleIAs SYNTH PleIAs/SYNTH β€” open generalist synthetic reasoning corpus PleIAs's open synthetic dataset for training small reasoning models, generated from open-weight model outputs (no closed-API ToS issue). 7 train shards. Schema centers on prompt/completion pairs with task-type metadata. https://huggingface.co/datasets/PleIAs/SYNTH Tabular (Parquet) CDLA-Permissive-2.0 777,943 1 1,565.3 MB 2,153.1 MB
Predict Droughts using Weather & Soil Data Predict Droughts using Weather & Soil Data Predicting continental US drought levels using meteorological & soil data. To make using previous drought scores for prediction easier (e.g. by interpolating), I merged them into one file and set the drought scores to NaN where they were not available. The US drought monitor is a measure of drought across the US manually created by experts using a wide range of data. β€” adapted from the dataset's Kaggle description (cdminix/us-drought-meteorological-data). https://github.com/Epistoteles/predicting-drought Tabular (CSV) CC0-1.0 19,300,680 β€” 513.8 MB 529.5 MB
Predict Students' Dropout and Academic Success UCI ML Repository β€” Predict Students' Dropout and Academic Success A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies. The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task, in which there is a strong imbalance towards one of the classes. https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success Tabular (CSV) CC-BY-4.0 4,424 β€” 0.1 MB 0.1 MB
PubMedQA (labeled) qiaojin/PubMedQA β€” 1k expert-labeled biomedical QA 1k expert-annotated biomedical research questions paired with PubMed abstracts and yes/no/maybe answers. Each row carries question, context.contexts (list<string>), long_answer, and final_decision. The pqa_artificial (61k) and pqa_unlabeled (211k) splits exist upstream β€” point allow_patterns at them for the larger set. https://huggingface.co/datasets/qiaojin/PubMedQA Tabular (Parquet) MIT 1,000 1 0.7 MB 1.0 MB
Real Estate Valuation UCI ML Repository β€” Real Estate Valuation The real estate valuation is a regression problem. The market historical data set of real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan. The data set was randomly split into a training set (2/3 of samples) and a testing set (1/3 of samples). https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set Tabular (CSV) CC-BY-4.0 414 β€” 0.0 MB 0.0 MB
San Francisco Building Permits San Francisco Building Permits 5 years and 200k building permits. Background A building permit is an official approval document issued by a governmental agency that allows you or your contractor to proceed with a construction or remodeling project on your property. For more details go to https://www.thespruce.com/what-is-a-building-permit-1398344. Each city or county has its own office related to buildings, which can perform multiple functions like issuing permits, inspecting buildings to enforce safety measures, modifying rules to accommodate the needs of a growing population, etc. β€” adapted from the dataset's Kaggle description (aparnashastry/building-permit-applications-data). https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9/data Tabular (CSV) DbCL-1.0 198,900 β€” 14.6 MB 19.4 MB
Seeds UCI ML Repository β€” Seeds Measurements of geometrical properties of kernels belonging to three different varieties of wheat. A soft X-ray technique and GRAINS package were used to construct all seven, real-valued attributes. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. https://archive.ics.uci.edu/dataset/236/seeds Tabular (CSV) CC-BY-4.0 210 β€” 0.0 MB 0.0 MB
Seoul Bike Sharing Demand UCI ML Repository β€” Seoul Bike Sharing Demand The dataset contains the count of public bicycles rented per hour in the Seoul Bike Sharing System, with corresponding weather data and holiday information. Rental bikes are currently introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time, as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of the bike count required at each hour for a stable supply of rental bikes. https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand Tabular (CSV) CC-BY-4.0 8,760 β€” 0.1 MB 0.1 MB
SF Salaries SF Salaries Explore San Francisco city employee salary data. One way to understand how a city government works is by looking at who it employs and how its employees are compensated. This data contains the names, job title, and compensation for San Francisco city employees on an annual basis from 2011 to 2014. Exploration Ideas To help get you started, here are some data exploration ideas: - How have salaries changed over time between different groups of people? β€” adapted from the dataset's Kaggle description (kaggle/sf-salaries). https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd Tabular (CSV) CC0-1.0 1,096,102 β€” 58.9 MB 51.3 MB
SMS Spam Collection UCI ML Repository β€” SMS Spam Collection The SMS Spam Collection is a public set of labeled SMS messages that have been collected for mobile phone spam research. This corpus has been collected from free or free-for-research sources on the Internet. A collection of 425 SMS spam messages was manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. https://archive.ics.uci.edu/dataset/228/sms+spam+collection Tabular (CSV) CC-BY-4.0 5,574 β€” 0.2 MB 0.3 MB
Spambase	UCI ML Repository β€” Spambase	Classifying email as spam or non-spam. The "spam" concept is diverse: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography, and so on. The classification task for this dataset is to determine whether a given email is spam or not. Our collection of spam e-mails came from our postmaster and individuals who had filed spam; our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter.	https://archive.ics.uci.edu/dataset/94/spambase	Tabular (CSV)	CC-BY-4.0	4,601	β€”	0.2 MB	0.4 MB
SQuAD v2 Stanford Question Answering Dataset v2.0 Stanford Question Answering Dataset v2 β€” 130k crowdsourced questions about Wikipedia paragraphs, with 50k of them deliberately unanswerable from the given context. Each row pairs a question, the passage it was asked about, and the canonical answer span(s). The v2 release added the unanswerable subset specifically to test whether models know when to abstain β€” a longstanding QA-eval weakness. https://huggingface.co/datasets/rajpurkar/squad_v2 Tabular (Parquet) CC-BY-SA-4.0 142,192 β€” 11.1 MB 16.5 MB
Stack Overflow 2018 Developer Survey	Stack Overflow 2018 Developer Survey	Individual responses to the 2018 Developer Survey fielded by Stack Overflow. Each year, we at Stack Overflow ask the developer community about everything from their favorite technologies to their job preferences. This year marks the eighth year we’ve published our Annual Developer Survey resultsβ€”with the largest number of respondents yet. Over 100,000 developers took the 30-minute survey in January 2018. β€” adapted from the dataset's Kaggle description (stackoverflow/stack-overflow-2018-developer-survey).	https://insights.stackoverflow.com/survey/2018	Tabular (CSV)	DbCL-1.0	98,855	β€”	6.9 MB	9.5 MB
Stack Overflow Badges Stack Exchange Data Dump β€” Stack Overflow Badges Badges earned by Stack Overflow users β€” badge name, class (gold/silver/bronze), tag-based vs activity-based, awarded timestamp. Joins to stackoverflow-users via user_id. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 51,289,973 β€” 583.8 MB 487.6 MB
Stack Overflow PostLinks Stack Exchange Data Dump β€” Stack Overflow PostLinks Post-to-post link graph for Stack Overflow β€” duplicate-question and related-question relationships. Each row has both endpoint post IDs plus a link type code; joins to stackoverflow-posts on either side. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 6,552,590 β€” 146.8 MB 107.1 MB
Stack Overflow Posts Stack Exchange Data Dump β€” Stack Overflow Posts Every Stack Overflow question and answer from 2008 to 2024 β€” title, body (HTML), tags, score, view count, accepted-answer ID, and authorship. The largest single table in the Stack Exchange data dump. https://archive.org/details/stackexchange Tabular (Parquet) CC-BY-SA-4.0 58,329,355 β€” 23,896.5 MB 38,972.6 MB
Stack Overflow Tags Stack Exchange Data Dump β€” Stack Overflow Tags Tag metadata and usage counts for Stack Overflow β€” tag name, total uses, excerpt-post and wiki-post IDs. ~60k tags following a long-tail distribution. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 65,675 β€” 1.4 MB 1.3 MB
Stack Overflow Users Stack Exchange Data Dump β€” Stack Overflow Users User profiles from Stack Overflow β€” display name, location, reputation, badge counts (gold/silver/bronze), account creation date, last access. Joins to stackoverflow-posts via owner_user_id and to stackoverflow-badges via user_id. https://archive.org/details/stackexchange Structured (XML) CC-BY-SA-4.0 22,484,235 β€” 1,148.3 MB 1,251.5 MB
Statlog (German Credit Data)	UCI ML Repository β€” Statlog (German Credit Data)	This dataset classifies people described by a set of attributes as good or bad credit risks. It comes in two formats (one all-numeric) and ships with a cost matrix. Two datasets are provided: the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data". For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric", which has been edited and extended with several indicator variables to suit algorithms that cannot cope with categorical variables; several ordered categorical attributes (such as attribute 17) have been coded as integers.	https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data	Tabular (CSV)	CC-BY-4.0	1,000	β€”	0.0 MB	0.0 MB
Student Performance	UCI ML Repository β€” Student Performance	Predict student performance in secondary education (high school). These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades plus demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1.	https://archive.ics.uci.edu/dataset/320/student+performance	Tabular (CSV)	CC-BY-4.0	649	β€”	0.0 MB	0.0 MB
Synthetic Text-to-SQL gretelai/synthetic_text_to_sql — Gretel synthetic NL→SQL pairs ~105K synthetic natural-language → SQL pairs generated by Gretel with structured metadata: sql_complexity, sql_task_type, domain, sql_explanation, sql_prompt, sql_context. Useful baseline for text-to-SQL training and for exercising structured-string columns. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql Tabular (Parquet) Apache-2.0 105,851 1 21.9 MB 36.6 MB
Temperature change	Temperature change	Global Warming, Temperature Change, Climate Change. The FAOSTAT Temperature Change domain disseminates statistics of mean surface temperature change by country, with annual updates. The current dissemination covers the period 1961–2023. Statistics are available for monthly, seasonal, and annual mean temperature anomalies, i.e., temperature change with respect to a baseline climatology corresponding to the period 1951–1980. β€” adapted from the dataset's Kaggle description (sevgisarac/temperature-change).	https://data.giss.nasa.gov/gistemp/	Tabular (CSV)	Attribution 3.0 IGO (CC BY 3.0 IGO)	147	β€”	0.0 MB	0.0 MB
Thyroid Disease	UCI ML Repository β€” Thyroid Disease	10 separate databases from the Garavan Institute. Documentation as given by Ross Quinlan: 6 databases come from the Garavan Institute in Sydney, Australia, each with approximately 2,800 training instances and 972 test instances, plenty of missing data, and roughly 29 attributes, either Boolean or continuously valued. 2 additional databases, also from Ross Quinlan, are included: Hypothyroid.data and sick-euthyroid.data; Quinlan believes these have been corrupted, though their format is highly similar to the other databases. 1 mo…	https://archive.ics.uci.edu/dataset/102/thyroid+disease	Tabular (CSV)	CC-BY-4.0	2,800	β€”	0.0 MB	0.1 MB
TinyStories	roneneldan/TinyStories (Eldan & Li, 2023)	~2.1M short synthetic children's stories (3–4 paragraphs each) generated by GPT-3.5/4 to study how small language models acquire coherent narrative ability. Two columns: text (the story) plus a validation-split flag. Canonical small-LM training corpus.	https://huggingface.co/datasets/roneneldan/TinyStories	Tabular (Parquet)	CDLA-Sharing-1.0	2,141,709	3	633.5 MB	917.9 MB
Titanic Dataset Titanic Dataset Titanic Survival Prediction Dataset. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered β€œunsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. β€” adapted from the dataset's Kaggle description (yasserh/titanic-dataset). https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/carData/TitanicSurvival.csv Tabular (CSV) CC0-1.0 891 β€” 0.0 MB 0.0 MB
TruthfulQA (multiple-choice) truthfulqa/truthful_qa β€” multiple-choice configuration 817 questions designed to elicit imitative-falsehood answers β€” common misconceptions humans repeat. Multiple-choice config: mc1_targets (single-answer) and mc2_targets (probabilistic). The generation config is omitted (use the original repo for free-form prompts). https://huggingface.co/datasets/truthfulqa/truthful_qa Tabular (Parquet) Apache-2.0 817 1 0.2 MB 0.3 MB
U.S. Airbnb Open Data	U.S. Airbnb Open Data	Airbnb listings and metrics of regions in the U.S. Since its inception in 2008, Airbnb has disrupted the traditional hospitality industry as more travellers decide to use Airbnb as their primary means of accommodation. Airbnb offers travellers a more unique and personalized way of accommodation and experience. I compiled this dataset for a project of mine on 20 October 2020. β€” adapted from the dataset's Kaggle description (kritikseth/us-airbnb-open-data).	http://insideairbnb.com/get-the-data/	Tabular (CSV)	CC0-1.0	232,147	β€”	11.3 MB	13.5 MB
Uber Pickups NYC	Uber Pickups in New York City	Trip data for over 20 million Uber (and other for-hire vehicle) trips in NYC. From the Uber TLC FOIL response: this directory contains data on over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015. Trip-level data on 10 other for-hire vehicle (FHV) companies, as well as aggregated data for 329 FHV companies, is also included. All the files are as they were received on August 3, Sept. … β€” adapted from the dataset's Kaggle description (fivethirtyeight/uber-pickups-in-new-york-city).	https://github.com/fivethirtyeight/uber-tlc-foil-response	Tabular (CSV)	CC0-1.0	564,516	β€”	2.2 MB	3.5 MB
UFC-Fight historical data from 1993 to 2021	UFC-Fight historical data from 1993 to 2021	Compiled UFC fight, fighter stats, and event information. Update: this dataset got a lot of love from the community, and many people asked for an updated version, so the latest scraped and processed data (as of 21/03/2021) has been uploaded. It is now easy for anyone to get the latest dataset with a single command; for bleeding-edge data or the scraping code, see the linked repository. β€” adapted from the dataset's Kaggle description (rajeevw/ufcdata).	https://github.com/WarrierRajeev/UFC-Predictions	Tabular (CSV)	CC0-1.0	6,012	β€”	1.5 MB	2.9 MB
UK Price Paid HM Land Registry Price Paid Data (1995–present) Every residential property sale in England and Wales since 1995, published by HM Land Registry. Each row carries price, postcode, property type, new-build flag, lease/freehold, and local authority. https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads Tabular (CSV) OGL-UK-3.0 31,192,682 β€” 1,024.7 MB 1,429.3 MB
UK Road Safety: Traffic Accidents and Vehicles UK Road Safety: Traffic Accidents and Vehicles Detailed dataset of road accidents and involved vehicles in the UK (2005-2017). The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research. The creation of this dataset was inspired by the one previously published by [Dave Fisher-Hickey][1]. β€” adapted from the dataset's Kaggle description (tsiaras/uk-road-safety-accidents-and-vehicles). https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data Tabular (CSV) DbCL-1.0 9,015,100 β€” 379.3 MB 531.0 MB
UltraChat 200k UltraChat 200k (HuggingFaceH4, 2024) 200k filtered + formatted multi-turn conversations from the UltraChat corpus, used by HuggingFaceH4 to instruction-tune Zephyr-7b-Ξ². Each row carries a messages field of list<struct<role, content>> for the dialogue turns. Larger and more conversational than the no-robots / dolly-15k cohort; synthetic but stylistically diverse. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k Tabular (Parquet) MIT 515,311 1 1,055.0 MB 1,575.9 MB
UltraFeedback (binarized) HuggingFaceH4/ultrafeedback_binarized β€” DPO-ready binarized preferences 61k DPO/SFT preference triples binarized from the UltraFeedback corpus by HuggingFaceH4. Each row carries chosen and rejected as list<struct<role, content>> plus per-side score doubles β€” a rich showcase for paired list-of-struct columns alongside the UltraChat 200k slug we already ship. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized Tabular (Parquet) MIT 187,405 β€” 438.7 MB 670.7 MB
US Accidents (2016 - 2023) US Accidents (2016 - 2023) A Countrywide Traffic Accident Dataset (2016 - 2023). This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. β€” adapted from the dataset's Kaggle description (sobhanmoosavi/us-accidents). https://smoosavi.org/datasets/us_accidents Tabular (Parquet) CC-BY-NC-SA-4.0 7,728,394 β€” 541.0 MB 756.6 MB
Walmart Walmart Dataset Walmart Store Sales Prediction - Regression Problem. One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. β€” adapted from the dataset's Kaggle description (yasserh/walmart-dataset). https://raw.githubusercontent.com/Masterx-AI/Project_Retail_Analysis_with_Walmart/main/Wallmart1.jpg Tabular (CSV) CC0-1.0 6,435 β€” 0.1 MB 0.1 MB
Waxal (Dagbani ASR, test split) google/WaxalNLP β€” Dagbani ASR test split Dagbani-language ASR test split from Google's Waxal multilingual African-language speech corpus. Audio binary blobs paired with Dagbani-text transcripts; 3 parquet shards from the canonical test set (train + validation + unlabeled splits exist upstream and total ~57 GB). WaxalNLP covers ~33 language-task pairs across Acholi (ach), Akan (aka), Amharic (amh), Dagbani (dag), Ewe (ewe), Igbo (ibo), Luganda (lug), Luo (luo), Swahili (swa), Yoruba (yor), and others β€” flip allow_patterns to <lang>_<asr|tts>/test-*.parquet for any of them. (Note: despite the 'Waxal' name, the upstream does not include a Wolof config; pick a different language slug.) https://huggingface.co/datasets/google/WaxalNLP Tabular (Parquet) CC-BY-SA-4.0 1,838 1 536.8 MB 544.8 MB
WebSight v0.1 HuggingFaceM4/WebSight v0.1 β€” synthetic HTML/screenshot pairs ~822K synthetic HTML pages with rendered screenshots, used to train the Idefics3 visual-document model. Each row pairs html (source code) with image (rendered PNG bytes). Smaller v0.1 release (71 shards); v0.2 exists (738 shards, ~10x larger). https://huggingface.co/datasets/HuggingFaceM4/WebSight Tabular (Parquet) CC-BY-4.0 822,987 1 27,964.7 MB 33,160.0 MB
Wholesale customers UCI ML Repository β€” Wholesale customers The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories https://archive.ics.uci.edu/dataset/292/wholesale+customers Tabular (CSV) CC-BY-4.0 440 β€” 0.0 MB 0.0 MB
Wikipedia (English) wikimedia/wikipedia β€” English Wikipedia, 2023-11-01 dump Cleaned-text English Wikipedia dump from 2023-11-01 (~6.4M articles). Each row carries id, url, title, text from Wikimedia's parquet auto-conversion. Distinct from cohere-wikipedia-simple-embed (which ships embeddings of Simple English) β€” this is the raw multilingual-friendly text corpus, gated here to English alone to bound the build size. Flip hf_allow_patterns to a different 20231101.<lang> for other languages. https://huggingface.co/datasets/wikimedia/wikipedia Tabular (Parquet) CC-BY-SA-3.0 6,407,814 7 7,753.8 MB 11,238.8 MB
Wikipedia Structured Contents Wikipedia Structured Contents Pre-parsed English and French Wikipedia Articles, Including Infoboxes. Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.). β€” adapted from the dataset's Kaggle description (wikimedia-foundation/wikipedia-structured-contents). https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents Custom CC-BY-SA-4.0 10,112,058 β€” 34,032.7 MB β€”
Wine	UCI ML Repository β€” Wine	Using chemical analysis to determine the origin of wines. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The donor notes that the initial data set had around 30 variables, but only the 13-dimensional version survives.	https://archive.ics.uci.edu/dataset/109/wine	Tabular (CSV)	CC-BY-4.0	178	β€”	0.0 MB	0.0 MB
Wine Quality UCI ML Repository β€” Wine Quality Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/). The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). https://archive.ics.uci.edu/dataset/186/wine+quality Tabular (CSV) CC-BY-4.0 6,497 β€” 0.1 MB 0.1 MB
World Bank WDI World Development Indicators World Bank's World Development Indicators β€” global country-level time-series spanning ~1500 indicators across economic, social, demographic, and environmental categories. The canonical cross-country comparison dataset for development economics. https://datacatalog.worldbank.org/search/dataset/0037712 Tabular (CSV) CC-BY-4.0 395,276 β€” 70.3 MB 112.3 MB
World Energy Consumption	World Energy Consumption	Consumption of energy by different countries. Our complete Energy dataset is a collection of key metrics maintained by Our World in Data. It is updated regularly and includes data on energy consumption (primary energy, per capita, and growth rates), energy mix, electricity mix, and other relevant metrics. The CSV and XLSX files follow a format of 1 row per location and year. β€” adapted from the dataset's Kaggle description (pralabhpoudel/world-energy-consumption).	https://github.com/owid	Tabular (CSV)	Attribution 4.0 International (CC BY 4.0)	23,377	β€”	3.6 MB	6.7 MB
YouTube-Commons (sample) PleIAs/YouTube-Commons β€” single-shard transcript sample Single-shard sample (cctube_0.parquet, ~385MB) of PleIAs's YouTube-Commons corpus: ~2M transcripts of YouTube videos uploaders explicitly marked CC-BY. Schema: video metadata + multilingual transcript text. Flip allow patterns to cctube_*.parquet for the full ~426-shard set (~165GB). https://huggingface.co/datasets/PleIAs/YouTube-Commons Tabular (Parquet) CC-BY-4.0 49,967 1 253.6 MB 409.9 MB
Zoo Animal Classification Zoo Animal Classification Use Machine Learning Methods to Correctly Classify Animals Based Upon Attributes https://archive.ics.uci.edu/ml/datasets/Zoo Tabular (CSV) DbCL-1.0 101 β€” 0.0 MB 0.0 MB
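
The Stack Overflow tables above are designed to join on shared keys (badges β†’ users via user_id, posts β†’ users via owner_user_id). A minimal sketch of those joins, using hypothetical toy rows and an in-memory SQLite database (column names follow the descriptions above; the real dump's exact schema may differ):

```python
import sqlite3

# Hypothetical toy rows mirroring the dump's join keys;
# the real tables hold ~22M users, ~51M badges, ~58M posts.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, display_name TEXT);
    CREATE TABLE badges (user_id INTEGER, name TEXT, class INTEGER);
    CREATE TABLE posts  (id INTEGER PRIMARY KEY, owner_user_id INTEGER, score INTEGER);
""")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
con.executemany("INSERT INTO badges VALUES (?, ?, ?)",
                [(1, "Teacher", 3), (1, "Editor", 3), (2, "Teacher", 3)])
con.executemany("INSERT INTO posts VALUES (?, ?, ?)", [(10, 1, 5), (11, 2, 2)])

# badges joins to users via user_id; posts joins via owner_user_id.
badge_counts = dict(con.execute("""
    SELECT u.display_name, COUNT(*)
    FROM badges b JOIN users u ON u.id = b.user_id
    GROUP BY u.display_name
"""))
post_scores = dict(con.execute("""
    SELECT u.display_name, p.score
    FROM posts p JOIN users u ON u.id = p.owner_user_id
"""))
```

The same two equi-joins carry over to any engine (DuckDB, pandas, Spark) once the XML or parquet shards are loaded.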

⚠ Scrape advisories

These datasets aggregate or reference content whose underlying licenses have not been individually cleared. The aggregator's declared license (the License column above) governs only the metadata it ships, not the content it points at. Read each advisory before redistributing or building on top of one of these slugs.

C4 (en, validation) (c4-en-validation) β€” C4 (Colossal Clean Crawled Corpus) is a heavily-filtered scrape of Common Crawl. Allen AI's ODC-By-1.0 license covers the harvest layer; the underlying web text remains subject to per-publisher copyright. Treat as research convenience, not a license-cleared corpus.

CodeParrot Clean (validation) (codeparrot-clean-valid) β€” CodeParrot is a public scrape of MIT/BSD/Apache-licensed Python repositories from GitHub. The dataset's redistribution covers only the harvest layer; per-repository licenses still apply to each content row, and downstream redistribution requires honouring those licenses individually.

FineMath (4+ quality subset) (finemath-4plus) β€” Math-themed slice of the Common-Crawl-derived fineweb pipeline. Same harvest-only ODC-By-1.0 license β€” per-document copyright not cleared.

FinePDFs (English test sample) (finepdfs-en-test) β€” PDF-extracted parallel of fineweb. Same ODC-By-1.0 harvest license; per-document underlying copyright is not cleared. Research-pretraining convenience only.

Fineweb (sample, 10BT) (fineweb-sample-10bt) β€” Fineweb is a 15TB scrape of Common Crawl filtered for English text quality. Released under ODC-By 1.0, but redistribution covers only the harvest layer β€” per-page web content remains subject to individual copyright. Treat as research convenience for LLM pretraining research, not a license-cleared corpus.

Fineweb-2 (Swedish sample) (fineweb-2-swedish) β€” Multilingual extension of the Common-Crawl-derived fineweb pipeline. Released under ODC-By-1.0 by HuggingFace, but redistribution covers only the harvest layer β€” per-page web content remains subject to individual copyright. Treat as research convenience for multilingual LM pretraining.

LAION-400M (metadata) (laion-400m) β€” LAION-400M is a public-web scrape: rows pair URLs to web-hosted images with their alt-text captions and CLIP similarity scores. LAION's redistribution license (CC-BY-4.0) covers only this metadata table β€” it does NOT clear the underlying images, captions, or alt-text, which are subject to per-item copyright and the takedown model LAION operates. Treat the dataset as a research convenience, not a license-cleared corpus. For any production use, dereference URLs only after you've cleared rights with the original publishers, and consult LAION's safety advisories around the corpus's known content issues.

SlimPajama-6B (slimpajama-6b) β€” SlimPajama-6B is a deduplicated 6B-token sample of SlimPajama-627B, itself derived from RedPajama-1T. The underlying corpus pulls from Common Crawl, GitHub, Wikipedia, books, and ArXiv β€” per-source licenses are not individually cleared. Treat as research convenience, not a license-cleared corpus.
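
Several rows above (C4, Waxal, Wikipedia, YouTube-Commons) suggest flipping allow_patterns to mirror a different slice of a Hugging Face repo. Those patterns are fnmatch-style globs; a quick sanity check of which shards a pattern would select, over a hypothetical file listing (real names come from each repo's file tree):

```python
from fnmatch import fnmatch

# Hypothetical shard names modeled on the slugs in the table above.
files = [
    "en/c4-train.00000-of-01024.json.gz",
    "en/c4-validation.00000-of-00008.json.gz",
    "dag_asr/test-00000-of-00003.parquet",
    "dag_asr/train-00000-of-00120.parquet",
    "cctube_0.parquet",
]

def select(pattern: str) -> list[str]:
    """Return the files a single allow_patterns glob would keep."""
    return [f for f in files if fnmatch(f, pattern)]

train_shards = select("en/c4-train.*.json.gz")   # English training shards only
asr_test = select("dag_asr/test-*.parquet")      # one language-task test split
full_cctube = select("cctube_*.parquet")         # every YouTube-Commons shard
```

The same glob can then be passed to huggingface_hub's snapshot_download(repo_id=..., repo_type="dataset", allow_patterns=[...]) to mirror only that slice instead of the full repo.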