dferndz/cartography

Dataset Cartography

SQUAD Support

Paper: Dataset Cartography for Question Answering

Original paper: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics, presented at EMNLP 2020.

Original repo: Dataset Cartography.

This extension of the Dataset Cartography framework adds support for computing confidence and variability on the SQuAD dataset.

Requirements

This project is based on Hugging Face Transformers.

Install requirements:

pip install -r requirements.txt

Training a model

A trainer is provided to train a model on SQuAD. It produces training dynamics data and a JSON version of the training data in which example ids are converted to integers, for better compatibility with torch tensors.
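The id conversion described above can be sketched as follows. This is a minimal illustration, not the trainer's actual code: the function name `index_example_ids`, the example ids, and the choice of sequential integers are all assumptions.

```python
import json


def index_example_ids(examples):
    """Replace each example's string id with an integer index.

    Returns the rewritten examples together with the lookup table
    needed to map the integers back to the original string ids.
    (Hypothetical helper; the repository may use a different scheme.)
    """
    id_to_int = {}
    converted = []
    for example in examples:
        # Assign the next free integer the first time an id is seen.
        int_id = id_to_int.setdefault(example["id"], len(id_to_int))
        converted.append({**example, "id": int_id})
    return converted, id_to_int


# Made-up examples with string ids.
examples = [
    {"id": "abc123", "question": "first question"},
    {"id": "def456", "question": "second question"},
]
converted, id_to_int = index_example_ids(examples)
print(json.dumps(converted, indent=2))
```

Integer ids can be stored directly in a tensor next to the logits, whereas string ids cannot, which is presumably why the conversion helps.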

To train a model, use:

python -m cartography.classification.run_squad \
    --do_train \
    --output_dir ./trained_model

By default, the google/electra-small-discriminator model is trained on this task.

To train another model, use:

python -m cartography.classification.run_squad \
    --do_train \
    --model $MODEL \
    --output_dir ./trained_model

Other arguments are available. Check the Hugging Face Transformers documentation for more information and tutorials.

Plotting Data Maps

This example plots data maps for SQuAD. The trained_model directory should contain a directory named training_dynamics, which holds the logits and gold labels for each training example at each epoch. This information is used to calculate confidence and variability.

python -m cartography.selection.train_dy_filtering \
    --plot \
    --task_name SQUAD \
    --model_dir ./trained_model \
    --model model-name \
    --plots_dir ./plots
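As a rough illustration of how confidence and variability follow from per-epoch logits and a gold label, the sketch below uses the definitions from the original Dataset Cartography paper: confidence is the mean probability assigned to the gold label across epochs, and variability is the standard deviation of that probability. The function names are hypothetical, and treating each example as a single classification over positions is an assumption; how this repository combines SQuAD start and end positions is not shown here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_and_variability(logits_per_epoch, gold):
    """Compute (confidence, variability) for one training example.

    logits_per_epoch: one list of logits per epoch.
    gold: index of the gold label within each logits list.
    """
    # Probability assigned to the gold label at each epoch.
    gold_probs = [softmax(logits)[gold] for logits in logits_per_epoch]
    mean = sum(gold_probs) / len(gold_probs)
    # Population standard deviation across epochs.
    variance = sum((p - mean) ** 2 for p in gold_probs) / len(gold_probs)
    return mean, math.sqrt(variance)


# Two epochs of uniform logits: the gold probability is 0.5 both times.
conf, var = confidence_and_variability([[0.0, 0.0], [0.0, 0.0]], gold=0)
print(conf, var)  # 0.5 0.0
```

In a data map, each example is plotted with variability on the x-axis and confidence on the y-axis.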

Filtering Training Data

To filter hard examples, use:

python -m cartography.selection.train_dy_filtering \
    --filter \
    --task_name SQUAD \
    --model_dir ./trained_model \
    --metric confidence \
    --data_dir ./trained_model/glue_data \
    --filtering_output_dir ./filtered_train_data

To filter ambiguous examples, use:

python -m cartography.selection.train_dy_filtering \
    --filter \
    --task_name SQUAD \
    --model_dir ./trained_model \
    --metric variability \
    --data_dir ./trained_model/glue_data \
    --filtering_output_dir ./filtered_train_data
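The selection behind these two commands can be sketched as a ranking over per-example metrics. Following the original paper, hard examples are those with the lowest confidence and ambiguous examples are those with the highest variability. The function `select_examples`, its `fraction` parameter, and the metric values below are hypothetical and only illustrate the idea; they are not the script's interface.

```python
def select_examples(metrics, metric, fraction):
    """Return the ids of the top `fraction` of examples for a metric.

    metrics: dict mapping example id to a dict with "confidence"
             and "variability" values.
    metric:  "confidence" selects the lowest-confidence (hard) examples,
             "variability" selects the highest-variability (ambiguous) ones.
    """
    if metric == "confidence":
        descending = False  # lowest confidence first
    elif metric == "variability":
        descending = True   # highest variability first
    else:
        raise ValueError(f"unsupported metric: {metric}")
    ranked = sorted(metrics, key=lambda i: metrics[i][metric], reverse=descending)
    # Always keep at least one example.
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Made-up metric values for three examples.
metrics = {
    0: {"confidence": 0.9, "variability": 0.05},
    1: {"confidence": 0.2, "variability": 0.10},
    2: {"confidence": 0.5, "variability": 0.40},
}
hard = select_examples(metrics, "confidence", fraction=0.34)
ambiguous = select_examples(metrics, "variability", fraction=0.34)
print(hard, ambiguous)  # [1] [2]
```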

Citation

If you use this work, please cite:

@misc{fernandez:22,
    title={Dataset Artifacts and Cartography in Question Answering},
    author={Daniel Fernandez},
    url={https://github.com/dferndz/cartography/raw/main/fernandez-squad-cartography.pdf},
    year={2022}
}
