Paper: Dataset Cartography for Question Answering
Original paper: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (EMNLP 2020).
Original repo: Dataset Cartography.
This extension of the Dataset Cartography framework adds support for computing confidence and variability on the SQuAD dataset.
This project is based on Hugging Face Transformers.
Install requirements:
pip install -r requirements.txt
A trainer is provided to train a model on SQuAD. It produces training dynamics data and a JSON version of the training data in which example ids are converted to integers, for better torch tensor compatibility.
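The id conversion can be pictured with a minimal sketch. The exact scheme used by the trainer is not documented here, so this assumes the simplest option: each string id is replaced by its position in the dataset, and a lookup table is kept to recover the original id. The function name `remap_ids` and the sample ids are hypothetical.

```python
from typing import Dict, List, Tuple


def remap_ids(examples: List[dict]) -> Tuple[List[dict], Dict[int, str]]:
    """Replace each example's string id with a consecutive integer.

    Returns the rewritten examples and an int -> original-id lookup table.
    """
    int_to_original: Dict[int, str] = {}
    remapped = []
    for index, example in enumerate(examples):
        int_to_original[index] = example["id"]
        # Copy the record so the caller's data is left untouched.
        remapped.append({**example, "id": index})
    return remapped, int_to_original


# Made-up ids in the style of SQuAD's hexadecimal strings.
examples = [
    {"id": "aaaa0000bbbb1111cccc2222", "question": "first question"},
    {"id": "dddd3333eeee4444ffff5555", "question": "second question"},
]
remapped, lookup = remap_ids(examples)
```

Integer ids can be stored directly in a torch tensor alongside the model inputs, which is not possible with arbitrary strings.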
To train a model, use:
python -m cartography.classification.run_squad \
--do_train \
--output_dir ./trained_model
By default, the google/electra-small-discriminator model is trained on this task.
To train another model, use:
python -m cartography.classification.run_squad \
--do_train \
--model $MODEL \
--output_dir ./trained_model
Other arguments are available. Check the Hugging Face documentation for more information and tutorials.
This example plots data maps for SQuAD. The trained_model directory should contain
a directory named training_dynamics, which holds the logits and gold labels for each
training example at each epoch. This information is used to calculate confidence and
variability.
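As a rough illustration of that calculation: in the original Dataset Cartography paper, confidence is the mean probability the model assigns to the gold label across epochs, and variability is the standard deviation of that probability. How this project combines SQuAD start and end position logits is not described here, so the sketch below assumes a single logit vector and a single gold index per example and epoch. The function names are hypothetical and are not taken from the repository.

```python
import numpy as np


def gold_probabilities(logits: np.ndarray, gold: np.ndarray) -> np.ndarray:
    """Softmax probability of the gold label.

    logits: shape (epochs, examples, classes)
    gold:   shape (examples,), integer label per example
    Returns an array of shape (epochs, examples).
    """
    # Subtract the per-row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Pick the gold-label probability for every epoch and example.
    return probs[:, np.arange(logits.shape[1]), gold]


def confidence_and_variability(logits: np.ndarray, gold: np.ndarray):
    """Mean and standard deviation of the gold probability over epochs."""
    p = gold_probabilities(logits, gold)
    return p.mean(axis=0), p.std(axis=0)


# Toy data: 2 epochs, 2 examples, 2 classes. Using log-probabilities as
# logits makes the softmax return the probabilities written here.
logits = np.log(np.array([
    [[0.9, 0.1], [0.5, 0.5]],   # epoch 1
    [[0.9, 0.1], [0.1, 0.9]],   # epoch 2
]))
gold = np.array([0, 0])

confidence, variability = confidence_and_variability(logits, gold)

# Lowest confidence first (hard), highest variability first (ambiguous).
hardest = np.argsort(confidence)
most_ambiguous = np.argsort(-variability)
```

In this toy case the second example has both the lower confidence (0.3 against 0.9) and the higher variability (0.2 against 0.0), so it ranks first in both orderings. Ranking by these two quantities is the idea behind the confidence and variability metrics used for filtering.

The command that plots the SQuAD data map is: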
python -m cartography.selection.train_dy_filtering \
--plot \
--task_name SQUAD \
--model_dir ./trained_model \
--model model-name \
--plots_dir ./plots
To filter hard examples, use:
python -m cartography.selection.train_dy_filtering \
--filter \
--task_name SQUAD \
--model_dir ./trained_model \
--metric confidence \
--data_dir ./trained_model/glue_data \
--filtering_output_dir ./filtered_train_data
To filter ambiguous examples, use:
python -m cartography.selection.train_dy_filtering \
--filter \
--task_name SQUAD \
--model_dir ./trained_model \
--metric variability \
--data_dir ./trained_model/glue_data \
--filtering_output_dir ./filtered_train_data
If you use this work, cite:
@misc{fernandez:22,
title={Dataset Artifacts and Cartography in Question Answering},
author={Daniel Fernandez},
url={https://github.com/dferndz/cartography/raw/main/fernandez-squad-cartography.pdf},
year={2022}
}