
TuPy Data Engineering

This repository contains the processes used to create the Portuguese hate speech dataset (TuPy), an annotated corpus designed to support the development of hate speech detection models using machine learning (ML) and natural language processing (NLP) techniques. TuPy combines datasets annotated by Fortuna et al. (2019), Leite et al. (2021), and Vargas et al. (2020) with 10,000 previously unpublished annotated documents collected in 2023.

This repository is organized as follows:

root.
    ├── datasets
    ├── figures
    ├── models
    ├── notebooks
    ├── src
    ├── LICENSE
    └── README.md

Quick start

Run the following command:

bash INIT.sh

Alternatively, install Miniconda 3 and run the following commands in order:

conda create -n tupi-env python=3.10
conda activate tupi-env
pip install poetry
poetry install
poetry run python -m nltk.downloader stopwords
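
After the commands above finish, a quick sanity check can confirm that the environment is usable. The script below is only a sketch and is not part of this repository: the function name `check_environment` is hypothetical, and the check assumes the project needs nothing beyond Python 3.10+, an importable `nltk`, and the NLTK stopwords corpus, which are the requirements visible in the install steps.

```python
import importlib.util
import sys

# Minimum version taken from the `conda create ... python=3.10` step above.
MIN_PYTHON = (3, 10)


def check_environment(version_info=sys.version_info):
    """Return a list of problems found; an empty list means every check passed.

    `check_environment` is a hypothetical helper, not something shipped
    with this repository.
    """
    problems = []

    # 1. Interpreter version.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )

    # 2. NLTK itself, then the stopwords corpus fetched by the last install step.
    if importlib.util.find_spec("nltk") is None:
        problems.append("nltk is not installed; run `poetry install`")
    else:
        import nltk

        try:
            nltk.data.find("corpora/stopwords")
        except LookupError:
            problems.append(
                "NLTK stopwords corpus missing; run "
                "`poetry run python -m nltk.downloader stopwords`"
            )

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("[FAIL]", problem)
```

Saved under any file name (for example `check_env.py`, also hypothetical), it can be run inside the environment with `poetry run python check_env.py`. No output means all three checks passed.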

Acknowledgements

The TuPy project grew out of Felipe Oliveira's thesis work and the contributions of several collaborators. It is funded by the Federal University of Rio de Janeiro (UFRJ) and the Alberto Luiz Coimbra Institute for Postgraduate Studies and Research in Engineering (COPPE).