[Great News] 🎉🎉🎉 Our paper has been accepted by the WWW'25 Resource Track
This is the official repo for the paper: RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
The dataset is publicly available on Zenodo and Hugging Face:

https://zenodo.org/records/11406538

https://huggingface.co/datasets/zzha6204/RU-AI-origin

The noise-augmented dataset is publicly available on Hugging Face:

https://huggingface.co/datasets/zzha6204/RU-AI-noise

| Dataset | Modality | Content | Real/Human | Machine Generated Content | Task |
|---|---|---|---|---|---|
| M4 | Text | General | 10,019,311 | 122,481 | Multi-lingual AI Text Detection |
| DeepfakeTextDetect | Text | General | 447,674 | 447,674 | Generalised AI Text Detection |
| ArguGPT | Text | Essay | 4,115 | 4,038 | Language Learner-AI Text Detection |
| HC3 | Text | Question Answers | 80,805 | 44,425 | AI Answer Detection |
| CNNSpot | Image | General | 362,000 | 362,000 | AI Image Detection |
| DE-FAKE | Image | General | 20,000 | 191,946 | AI Image Detection |
| GenImage | Image | General | 1,331,167 | 1,350,000 | AI Image Detection |
| WaveFake | Voice | General | 13,600 | 104,885 | Fake Voice Detection |
| Sprocket-VC | Voice | General | 3,132 | 3,456 | Fake Voice Detection |
| FakeAVCeleb | Video-Voice | Face | 500 | 19,500 | DeepFake Detection |
| ForgeryNet | Video-Image | Face | 1,438,201 | 1,457,861 | DeepFake Detection |
| DFDC | Video-Image-Voice | Face | 23,654 | 104,500 | DeepFake Detection |
| DGM4 | Text-Image | General | 77,426 | 152,574 | Media Manipulation Detection |
| Ours | Text-Image-Voice | General | 245,895 | 1,229,475 | AI Text Image Voice Detection |
The full dataset requires at least 500GB of disk space.
Model inference requires an NVIDIA GPU with at least 16GB of VRAM. We recommend an NVIDIA RTX 3090 (24GB) or better for this project.
We highly recommend installing this package inside a virtual environment such as conda or venv.
Environment requirements:
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
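The Python-side requirement above can be checked before installing anything; a minimal sketch (the `env_ok` helper is illustrative, not part of this repo):

```python
import sys

def env_ok(minimum=(3, 8)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

if not env_ok():
    sys.exit("RU-AI requires Python >= 3.8")
print(f"Python {sys.version_info.major}.{sys.version_info.minor} OK")

# PyTorch and CUDA can be checked the same way once installed:
try:
    import torch
    print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet - see the installation steps below")
```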
Clone the project:

```shell
git clone https://github.com/ZhihaoZhang97/RU-AI.git
```

Create the virtual environment via conda with Python 3.8:

```shell
conda create -n ruai python=3.8
```

Activate the environment:

```shell
conda activate ruai
```

Move into the project directory:

```shell
cd RU-AI
```

Install the dependencies:

```shell
pip3 install -r requirements.txt
```

We provide a quick tutorial on downloading and inspecting the dataset in the data-example.ipynb notebook.
You can also directly run the following command to download sample data sourced from Flickr8k:

```shell
python ./download_flickr.py
```

Alternatively, you can download the full dataset with the command below.
Please note the whole dataset is over 157GB compressed and can take up to 500GB after decompression.
The download will take a while; the actual speed depends on your internet connection.

```shell
python ./download_all.py
```

After downloading, you can also go to ./data to inspect the data manually.
Here is the directory tree after downloading all the data:
```
├── audio
│   ├── coco
│   │   ├── efficientspeech
│   │   ├── real
│   │   ├── styletts2
│   │   ├── vits
│   │   ├── xtts2
│   │   └── yourtts
│   ├── flickr8k
│   │   ├── efficientspeech
│   │   ├── real
│   │   ├── styletts2
│   │   ├── vits
│   │   ├── xtts2
│   │   └── yourtts
│   └── place
│       ├── efficientspeech
│       ├── real
│       ├── styletts2
│       ├── vits
│       ├── xtts2
│       └── yourtts
├── image
│   ├── coco
│   │   ├── real
│   │   ├── stable-diffusion-images-absolutereality-remove-black
│   │   ├── stable-diffusion-images-epicrealism-remove-black
│   │   ├── stable-diffusion-images-v1-5
│   │   ├── stable-diffusion-images-v6-0-remove-black
│   │   └── stable-diffusion-images-xl-v3-0-remove-black
│   ├── flickr8k
│   │   ├── real
│   │   ├── stable-diffusion-images-absolutereality
│   │   ├── stable-diffusion-images-epicrealism
│   │   ├── stable-diffusion-images-v1-5
│   │   ├── stable-diffusion-images-v6-0
│   │   └── stable-diffusion-images-xl-v3-0
│   └── place
│       ├── real
│       ├── stable-diffusion-images-absolutereality-remove-black
│       ├── stable-diffusion-images-epicrealism-remove-black
│       ├── stable-diffusion-images-v1-5
│       ├── stable-diffusion-images-v6-0-remove-black
│       └── stable-diffusion-images-xl-v3-0-remove-black
└── text
    ├── coco
    ├── flickr8k
    └── place
```
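The layout above can also be inspected programmatically; here is a minimal sketch that tallies files per modality and source, assuming the tree is rooted at ./data as created by the download scripts (the `count_files` helper is illustrative, not part of this repo):

```python
import os
from collections import Counter

def count_files(root="./data"):
    """Count files under each (modality, source) pair, e.g. ("audio", "coco")."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) >= 2 and filenames:
            counts[(parts[0], parts[1])] += len(filenames)
    return counts

if __name__ == "__main__":
    for (modality, source), n in sorted(count_files().items()):
        print(f"{modality}/{source}: {n} files")
```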
Before running model inference, replace image_data_paths, audio_data_paths, and text_data in the infer_imagebind_model.py and infer_languagebind_model.py files with real data / data paths.
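As a sketch, those variables might be filled in as follows; the filenames below are placeholders, not files shipped with the dataset:

```python
# Hypothetical example values - substitute paths that exist on your machine.
image_data_paths = [
    "./data/image/flickr8k/real/example.jpg",
    "./data/image/flickr8k/stable-diffusion-images-v1-5/example.jpg",
]
audio_data_paths = [
    "./data/audio/flickr8k/real/example.wav",
    "./data/audio/flickr8k/vits/example.wav",
]
text_data = [
    "A dog runs across a grassy field.",
]
```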
ImageBind-based model:

```shell
python infer_imagebind_model.py
```

LanguageBind-based model:

```shell
python infer_languagebind_model.py
```

We appreciate the open-source community for the datasets and the models.
Microsoft COCO: Common Objects in Context
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics
Learning Deep Features for Scene Recognition using Places Database
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Unsupervised Learning of Spoken Language with Visual Context
Learning Word-Like Units from Joint Audio-Visual Analysis
ImageBind: One Embedding Space To Bind Them All
If you find our dataset or research useful, please cite:
```bibtex
@misc{huang2024ruai,
    title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
    author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
    year={2024},
    eprint={2406.04906},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

