TuPy Data Engineering

This repository contains the processes used to create TuPy, a Portuguese hate speech dataset. TuPy is an annotated corpus designed to support the development of hate speech detection models using machine learning (ML) and natural language processing (NLP) techniques. It combines the datasets annotated by Fortuna et al. (2019), Leite et al. (2021), and Vargas et al. (2020) with 10 thousand previously unpublished annotated documents collected in 2023.

This repository is organized as follows:

root.
    ├── datasets 
    ├── figures
    ├── models
    ├── notebooks
    ├── src
    ├── LICENSE
    └── README.md

Quick start

Run the following command:

bash INIT.sh

Alternatively, install Miniconda 3 and then run the following commands in order:

conda create -n tupi-env python=3.10
conda activate tupi-env
pip install poetry
poetry install
poetry run python -m nltk.downloader stopwords
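
After the commands above finish, it can be useful to confirm that the environment is usable. The following is a minimal sketch, not part of this repository: the function name check_environment and its parameters are hypothetical, and it assumes only what the commands above imply (Python 3.10, the nltk package, and the NLTK stopwords corpus).

```python
import importlib.util
import sys


def check_environment(required=("nltk",), min_python=(3, 10), check_stopwords=True):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper: the defaults mirror the quick start commands,
    not any check shipped with this repository.
    """
    problems = []

    # The conda environment in the quick start is created with python=3.10.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"python older than {wanted}")

    # Look each package up without importing it.
    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")

    # The last quick start command downloads the NLTK stopwords corpus.
    if check_stopwords and importlib.util.find_spec("nltk") is not None:
        import nltk

        try:
            nltk.data.find("corpora/stopwords")
        except LookupError:
            problems.append("missing NLTK corpus: stopwords")

    return problems
```

Calling check_environment() inside the activated tupi-env environment and printing the result should give an empty list when every step succeeded. Any entry in the list names the step that probably needs to be rerun.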

Acknowledgments

The TuPy project grew out of Felipe Oliveira's thesis and the work of several collaborators. It is funded by the Federal University of Rio de Janeiro (UFRJ) and the Alberto Luiz Coimbra Institute for Postgraduate Studies and Research in Engineering (COPPE).
