- ELITR-minuting-corpus (small)
- Additional data generated by ChatGPT from transcripts?
-
HuggingFace: https://huggingface.co/datasets?task_categories=task_categories:summarization
-
CNN-Daily Mail
- on HuggingFace
- CNN and Daily Mail news articles and summaries
-
AMI Meeting Corpus
- on HuggingFace
- product meeting transcripts and summaries
-
ICSI Meeting Corpus
- not on HuggingFace
- academic meeting transcripts and summaries
- code to process: https://github.com/xcfcode/meeting_summarization_dataset
-
Spotify Podcast Dataset
- not on HuggingFace
-
SAMSum Corpus
- on HuggingFace
- messenger-like conversations with summaries
-
DialogSum
- on HuggingFace
- dialogues and summaries
-
XSum Dataset
- on HuggingFace
- news articles and summaries
-
MediaSum
- on HuggingFace
- media interview transcripts and summaries
-
OAGK/OAGKX
- on Lindat
- scientific articles with abstracts
-
QMSum
- not on HuggingFace
- meeting transcripts with query-based summaries of the topics discussed
-
Info:
- AMI and ICSI are the best fit but rather small
- Original minutes
- Taken in real time, so some content of the meeting might be missing
- Generated minutes
- Taken later by an independent annotator who was not present in the meeting
- Both can be used, but both have some problems
- They are aligned manually
- Can also be used for training: extract smaller pieces of transcripts and minutes that are properly aligned
- We can talk to Marie Hledikova, by email or on Wednesday at 10:00
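
A rough sketch of how properly aligned pieces of transcripts and minutes could be turned into training pairs. The alignment format here (a list of `(transcript_line_index, minute_item_index)` pairs) and the names `TrainingPair` and `extract_aligned_pairs` are assumptions made for illustration; the real corpus files may be laid out differently.

```python
from dataclasses import dataclass


@dataclass
class TrainingPair:
    transcript: str  # concatenated transcript lines aligned to one minute item
    minute: str      # text of that minute item


def extract_aligned_pairs(transcript_lines, minute_items, alignment):
    """Group transcript lines by the minute item they are aligned to.

    `alignment` is a list of (transcript_line_index, minute_item_index)
    pairs. This format is hypothetical, not the corpus's actual layout.
    """
    grouped = {}
    for line_idx, minute_idx in alignment:
        grouped.setdefault(minute_idx, []).append(line_idx)

    pairs = []
    for minute_idx in sorted(grouped):
        # Keep transcript lines in their original order.
        lines = [transcript_lines[i] for i in sorted(grouped[minute_idx])]
        pairs.append(TrainingPair(" ".join(lines), minute_items[minute_idx]))
    return pairs


# Toy data, invented for the example.
transcript_lines = [
    "PERSON1: Let's start with the budget.",
    "PERSON2: We are ten percent over.",
    "PERSON1: Okay, next the deadline.",
    "PERSON2: The deadline moves to Friday.",
]
minute_items = ["Budget is ten percent over.", "Deadline moved to Friday."]
alignment = [(0, 0), (1, 0), (3, 1)]  # transcript line 2 is left unaligned

pairs = extract_aligned_pairs(transcript_lines, minute_items, alignment)
for pair in pairs:
    print(pair.minute, "<-", pair.transcript)
```

Unaligned transcript lines are simply dropped, which keeps the pairs short enough for models with limited input length.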
-
https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
-
Relevant:
-
DialogLM (model for long dialogue understanding) https://github.com/microsoft/DialogLM
- Much work has been done, but it is not documented properly
- Some documentation: https://elitr.eu/deliverables/
- https://nlp-yang.github.io/Dialogue_Summarization.pdf
- https://www.catalyzex.com/s/Meeting%20Summarization
- https://nero-docs.stanford.edu/jupyter-slurm.html
- https://stackoverflow.com/questions/35545402/how-to-run-an-ipynb-jupyter-notebook-from-terminal
- Try some pretrained HuggingFace summarization models and fine-tune them on the minuting dataset
- Download and analyze the various datasets
- Preprocess the Europarl dataset
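
For the dataset analysis step, a minimal sketch of basic length statistics. It assumes each record is a dict with a source text and a summary; the key names `dialogue` and `summary` are an assumption and will differ between datasets. Whitespace splitting is only a crude stand-in for a real tokenizer.

```python
from statistics import mean


def length_stats(records, source_key="dialogue", summary_key="summary"):
    """Return mean whitespace-token lengths and the mean compression ratio."""
    source_lens = [len(r[source_key].split()) for r in records]
    summary_lens = [len(r[summary_key].split()) for r in records]
    # Compression ratio = source length / summary length, skipping empty summaries.
    ratios = [s / t for s, t in zip(source_lens, summary_lens) if t > 0]
    return {
        "n": len(records),
        "mean_source_tokens": mean(source_lens),
        "mean_summary_tokens": mean(summary_lens),
        "mean_compression": mean(ratios),
    }


# Toy records, invented for the example.
records = [
    {"dialogue": "A: Are you coming today? B: Yes, at five.",
     "summary": "B comes at five."},
    {"dialogue": "A: Did you send the report? B: Not yet, tomorrow morning.",
     "summary": "B sends the report tomorrow."},
]

stats = length_stats(records)
print(stats)
```

Comparing these numbers across the datasets listed above should show quickly which ones have inputs too long for a given pretrained model.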