tensorflow_datasets/datasets/multi_news: 4 files changed, +28 -16 lines changed

- Multi-News, consists of news articles and human-written summaries
- of these articles from the site newser.com.
- Each summary is professionally written by editors and
- includes links to the original articles cited.
-
- There are two features:
- - document: text of news articles seperated by special token "|||||".
- - summary: news summary.
+ # Multi-News Dataset
+
+ Multi-News consists of news articles and human-written summaries of these
+ articles from the news site `newser.com`. Each summary is professionally written
+ by editors and includes links to the original articles cited.
+
+ This is the first large-scale dataset for multi-document summarization on news
+ articles.
+
+ Each record has two features:
+
+ * `document`: Texts of news articles, separated by special token "|||||".
+ * `summary`: Summary of the news.
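
The `document` feature described above packs several articles into one string. A minimal sketch of recovering the individual articles, assuming the separator appears literally as `|||||` in the text; the helper name `split_articles` and the sample string are hypothetical and not part of the dataset code:

```python
from typing import List

# Separator token named in the dataset description.
SEPARATOR = "|||||"


def split_articles(document: str) -> List[str]:
    """Split a `document` feature into its individual news articles."""
    # Trim whitespace around each piece and drop empty pieces.
    parts = (part.strip() for part in document.split(SEPARATOR))
    return [part for part in parts if part]


# Made-up sample text, for illustration only.
example = "First article text. ||||| Second article text."
articles = split_articles(example)
```

Dropping empty pieces is a design choice here: it keeps a leading or trailing separator from producing an empty article.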

+ content.data-type.text # Contains text data.
+ content.subject.news # Relates to news.
+ content.language.en # Contains text in language English / en.
+ ml.task.abstractive-text-summarization # Relates to Abstractive Text Summarization, a machine learning task.
+ ml.task.natural-language-understanding # Relates to Natural Language Understanding, a machine learning task.
+ ml.task.text-summarization # Relates to Text Summarization, a machine learning task.

- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/test.src.cleaned 133 d04c4581d52321a30c246d2caa72853ee7f28c6b7a3985ee436f54c4bc264315 test.src.cleaned
- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/test.tgt 132 afba4aa26d95bb557c0eaa0cb8f7495af2104f1e43f4b5f9ef429b8752477abd test.tgt
- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/train.src.cleaned 134 75f87b786ff1982bf1bd5803c6a7377d1834b81956ac680a6955789ba047cc0b train.src.cleaned
- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/train.tgt 133 9f1e9b290a6aae1aa67bd5b361c934ee9db32486e5cd97d83184c097ef8b27e5 train.tgt
- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/val.src.cleaned 133 8df3ef6bd1882094de8120fa635c3abf758e10427f81f306aaa4786df7b57861 val.src.cleaned
- https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/val.tgt 132 9c0377a443ea92b17449f7df17f1cdfa7c7ebbfe3a45f2f8cd7b3e0ffb47b1df val.tgt
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/test.src.cleaned 68999509 138d3ac2dc899cbcd2e3745aaa94d1c1db55fb7058d9df4ba3ef2dac05a3a186 test.src.cleaned
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/test.tgt 7309099 fa97cf91a62ae82a0af6da88f2ddf8e06eb4e3b90f7971d8e0c516436518fae3 test.tgt
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/train.src.cleaned 547512283 627781c8ce55d528fcdacd495db45583a915e2d24b7983b0a5a6693ede933bb1 train.src.cleaned
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/train.tgt 58793912 e9e82b8f413b0f1ed4eb7c883f93bb744f829c218c1608b6ba7615d687d07121 train.tgt
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/val.src.cleaned 66875522 f0a43902da366eea2b882e39ddd4c0975ad44aba6b61095a2ea90362e9e2bb65 val.src.cleaned
+ https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/val.tgt 7295302 bb08a078e0cb2b8ca9cc0fe3bfbe9d4098dee706bd00eb97449155e41b880157 val.tgt
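
Each checksum entry above appears to carry four whitespace-separated fields. The sketch below assumes they are URL, size in bytes, SHA-256 hex digest, and file name, and shows how a downloaded payload could be checked against such an entry. `ChecksumEntry`, `parse_entry`, and `matches` are hypothetical names for illustration, not TFDS APIs, and the payload is made up.

```python
import hashlib
from typing import NamedTuple


class ChecksumEntry(NamedTuple):
    url: str
    size: int       # assumed: size in bytes
    sha256: str     # assumed: SHA-256 hex digest
    filename: str


def parse_entry(line: str) -> ChecksumEntry:
    """Parse one whitespace-separated checksum line."""
    url, size, sha256, filename = line.split()
    return ChecksumEntry(url, int(size), sha256, filename)


def matches(entry: ChecksumEntry, payload: bytes) -> bool:
    """Return True if the payload has the recorded size and digest."""
    digest = hashlib.sha256(payload).hexdigest()
    return len(payload) == entry.size and digest == entry.sha256


# Entry copied from the added lines of the diff.
real = parse_entry(
    "https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/test.tgt "
    "7309099 fa97cf91a62ae82a0af6da88f2ddf8e06eb4e3b90f7971d8e0c516436518fae3 test.tgt"
)

# Made-up payload and matching entry, to exercise the check without a download.
payload = b"hello"
demo = ChecksumEntry(
    "https://example.invalid/demo.txt",
    len(payload),
    hashlib.sha256(payload).hexdigest(),
    "demo.txt",
)
demo_ok = matches(demo, payload)
```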

@@ -19,7 +19,7 @@
 import tensorflow_datasets.public_api as tfds

 _URL_PATH = (
-    "https://huggingface.co/datasets/alexfabbri/multi_news/raw/main/data/"
+    "https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/"
 )
 _LICENSE = "For non-commercial research and educational purposes only"

@@ -31,10 +31,11 @@
 class Builder(tfds.core.GeneratorBasedBuilder):
   """DatasetBuilder for multi_news dataset."""

-  VERSION = tfds.core.Version("2.0.0")
+  VERSION = tfds.core.Version("2.1.0")
   RELEASE_NOTES = {
       "1.0.0": "Initial release.",
       "2.0.0": "Update the dataset with valid URLs.",
+      "2.1.0": "Update the dataset with cleaned URLs.",
   }

   def _info(self) -> tfds.core.DatasetInfo:
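
The base URL changed in this diff pairs with the file names listed in the checksum entries. A minimal sketch of how the per-split download URLs could be composed from it follows. The helper `split_urls` and its `src`/`tgt` keys are hypothetical and are not taken from the builder, whose download code is not shown in this diff.

```python
# Base path as it reads after the change in the diff.
_URL_PATH = (
    "https://huggingface.co/datasets/alexfabbri/multi_news/resolve/main/data/"
)


def split_urls(split: str) -> dict:
    """Build source and target URLs for one split (hypothetical helper)."""
    return {
        "src": f"{_URL_PATH}{split}.src.cleaned",
        "tgt": f"{_URL_PATH}{split}.tgt",
    }


# Split names follow the file names in the checksum entries.
urls = {split: split_urls(split) for split in ("train", "val", "test")}
```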