This repository contains code to reproduce results from our paper:
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, and Christopher A. Choquette-Choo
In Findings of the Association for Computational Linguistics (ACL) 2025
https://arxiv.org/abs/2502.15680
- Required for training and inference: transformers, datasets, accelerate, torch, and tqdm. We use Python v3.9.12 with transformers v4.44.0, datasets v2.14.7, accelerate v0.29.2, and torch v2.2.2.
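For a quick start, the pinned versions above can be installed with pip (assuming a Python 3.9 environment is already set up):

```bash
pip install transformers==4.44.0 datasets==2.14.7 accelerate==0.29.2 torch==2.2.2 tqdm
```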
Note: Given the sensitive nature of our datasets, we make them available on a request basis. If you would like to access our datasets for training, please email borkar.j@northeastern.edu.
To train a model and checkpoint it every 10% of training, run:
python training.py continue_train
This will save the checkpoints in the models directory and the data seen during training in the data directory. Next, run python process_data_files.py <folder_path> and python process_checkpoints.py <folder_path>, where <folder_path> is the path to the directory containing your data files and checkpoints. These scripts rename the data files and checkpoints into a more readable structure that encodes the epoch and training interval.
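For reference, here is a minimal sketch of how checkpointing every 10% of training can be configured with Hugging Face TrainingArguments. This is illustrative only, not the actual logic in training.py, and the step count is a placeholder:

```python
# Minimal sketch (not the repo's training.py): checkpoint every 10% of training.
from transformers import TrainingArguments

total_steps = 1000                         # placeholder: set from your dataset/batch size
args = TrainingArguments(
    output_dir="models",                   # checkpoints are written here
    max_steps=total_steps,
    save_strategy="steps",
    save_steps=max(total_steps // 10, 1),  # one checkpoint per 10% of training
)
print(args.save_steps)  # 100
```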
To train all ten of our models for this setup, run:
python training.py retrain
This will save all of the models in the models directory.
First, you will need to download a slice of Common Crawl that we will use to prompt our models. You can do this with ./commoncrawl.sh, which downloads a WET file (crawl.wet) from the December 2024 crawl. You can also use a crawl of your choice.
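If you would rather fetch a WET file by hand, the steps look roughly like the following. The crawl ID (CC-MAIN-2024-51 for December 2024) and the URL layout are our assumptions about Common Crawl's hosting; check ./commoncrawl.sh for the authoritative commands:

```bash
# Fetch the list of WET segments for the December 2024 crawl, then grab one.
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/wet.paths.gz
SEGMENT=$(zcat wet.paths.gz | head -n 1)
wget "https://data.commoncrawl.org/${SEGMENT}" -O crawl.wet.gz
gunzip -c crawl.wet.gz > crawl.wet
```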
To generate samples, run:
python extract.py --wet_file crawl.wet --batch_size 100 --num_samples 25000 --max_length 256
You can adjust --batch_size according to your available compute. The generations will be saved to a .txt file.
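Under the hood, generation of this kind typically looks like the following minimal sketch. extract.py is the authoritative implementation; the model name, sampling settings, and file handling here are placeholder assumptions:

```python
# Illustrative sketch only, not extract.py: prompt a causal LM with text
# prefixes (which extract.py takes from crawl.wet) and save the generations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # gpt2 has no pad token
tokenizer.padding_side = "left"                      # left-pad for generation
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Placeholder prompts; extract.py derives these from the Common Crawl WET file.
prompts = ["The quick brown fox", "Contact us at"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model.generate(**inputs, max_length=256, do_sample=True, top_k=40,
                         pad_token_id=tokenizer.eos_token_id)

with open("generations.txt", "w") as f:
    for text in tokenizer.batch_decode(out, skip_special_tokens=True):
        f.write(text + "\n")
```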
To evaluate memorization in the Continuous Training setup and generate a plot, run:
python taxonomy.py
This will also save the memorized examples, annotated with taxonomy labels, as CSV files.
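As a rough illustration of what a memorization check involves (taxonomy.py is the authoritative implementation; the file names and email regex below are assumptions), one can flag a generation when PII from the training data, such as an email address, reappears verbatim:

```python
# Illustrative sketch only, not taxonomy.py: flag a generation as PII
# memorization if an email address in it also appears in the training data.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

with open("train_data.txt") as f:          # placeholder training-data dump
    train_emails = set(EMAIL_RE.findall(f.read()))

memorized = []
with open("generations.txt") as f:         # output of extract.py
    for line in f:
        hits = set(EMAIL_RE.findall(line)) & train_emails
        if hits:
            memorized.append((line.strip(), sorted(hits)))

print(f"{len(memorized)} generations contain training-set emails")
```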
For the Retraining Setup, run:
python pii_add.py
@misc{borkar2025privacyrippleeffectsadding,
      title={Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training},
      author={Jaydeep Borkar and Matthew Jagielski and Katherine Lee and Niloofar Mireshghallah and David A. Smith and Christopher A. Choquette-Choo},
      year={2025},
      eprint={2502.15680},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.15680},
}