This repository aggregates datasets for developing conversational AI techniques. It covers the research tasks of open-domain conversation, conversational recommendation, and conversational search.

| Dataset | #dialogues | collection | year | download |
|---|---|---|---|---|
| QuAC | 13,569 | Crowdsourcing | 2018 | Download |
| MANtIS | 80,324 | Stack Exchange | 2019 | Download |
| CoQA | 8,399 | Crowdsourcing | 2019 | Download |
| ShARC | 948 | Crowdsourcing | 2018 | Download |
| MSDialog | 2,199 | Microsoft Community | 2018 | Download |

| Dataset | #dialogues | Corpus Size | collection | year | download |
|---|---|---|---|---|---|
| CAsT-19,20,21,22 | 30 - 50 | 38,426,252 | Crowdsourcing | 2019 | Download |
| OR-QuAC | 5,644 | 11,377,951 | Update QuAC for self-containment | 2020 | Download |

| Dataset | #dialogues | #utterances | domain | collection | language | year | download |
|---|---|---|---|---|---|---|---|
| ReDial | 10,006 | 182,150 | Movie | Amazon Mechanical Turk (AMT) | ENG | 2018 | Download |
| OpenDialKG | 12,320 | 71,873 | Movies & Books | KG-walk Crowdsourcing | ENG | 2019 | Download |
| INSPIRED | 1,001 | 35,811 | Movie | Social-encouraged crowdsourcing (AMT) | ENG | 2020 | Download |
| TG-ReDial | 10,000 | 129,392 | Movie | Topic-driven generation, crowdsourcing | CHN | 2020 | Download |
| DuRecDial2.0 | 16,482 | 255,346 | Movie, music, star, food, restaurant, weather | translation from DuRecDial (crowdsourced) | ENG, CHN | 2021 | Download |
| INSPIRED2 | 1,001 | 35,811 | Movie | clean & augment INSPIRED | ENG | 2022 | Download |
| U-NEED | 7,698 | 53,712 | e-commerce | pre-sale dialogues from Taobao | CHN | 2023 | Download |
| PEARL | 57,277 | 548,061 | Movie | review-based synthetic dialogues | ENG | 2024 | Download |

| Dataset | #dialogues | #utterances | #domain | collection | language | year | download |
|---|---|---|---|---|---|---|---|
| MultiWoZ | 8,438 | 113,556 | 7 | Wizard-of-Oz | ENG | 2018 | Download |
| SGD | 16,142 | 329,964 | 16 | outline simulation then crowdsourced paraphrasing | ENG | 2020 | Download |

| Dataset | Paper | Link |
|---|---|---|
| MG-ShopDial | MG-ShopDial: A Multi-Goal Conversational Dataset for e-Commerce | link |

| Dataset | Paper | Link |
|---|---|---|
| DialogStudio | DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI | link |