
Quantifying Language Confusion

This is the official repository for our paper Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis.

Data and Results

Please refer to Zenodo for the datasets, language graphs, and results:

DATA includes the following datasets:

i) the Raw Language Graphs;

ii) the Language Similarities calculated from the Language Graphs;

iii) MTEI: files from the experimental results of the multilingual inversion attacks, and the language confusion entropy calculated from them;

iv) LCB: files from the Language Confusion Benchmark, and the language confusion entropy calculated from them.

Results contains aggregated results for further analysis:

i) inversion_language_confusion: results from MTEI;

ii) prompting_language_confusion: results from LCB.
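As an illustration of the metric (a minimal sketch only; the paper's exact definition and weighting may differ), language confusion entropy can be thought of as Shannon entropy over the distribution of languages detected in a model's outputs:

```python
import math
from collections import Counter

def language_confusion_entropy(detected_languages):
    """Shannon entropy (base 2) over the empirical distribution of
    languages detected in a model's outputs. Zero means the model
    answered consistently in one language; higher values indicate
    more confusion across languages."""
    counts = Counter(detected_languages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A model that always answers in one language shows no confusion:
assert language_confusion_entropy(["en", "en", "en", "en"]) == 0.0
# A model split evenly across two languages carries one bit of entropy:
assert language_confusion_entropy(["en", "de", "en", "de"]) == 1.0
```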

Installation

  1. Clone the repository to your local machine.

  2. Create and activate a new conda environment:

conda create -n envlc python=3.12

conda activate envlc

  3. Install PyTorch and the packages from requirements:

pip3 install torch torchvision torchaudio

pip install -r requirements.txt

  4. Specifics
  • To tokenize Japanese, after installing fugashi[unidic], download the UniDic dictionary:

python -m unidic download

Language Confusion Analysis

The analysis code is located in src/analysis_language_confusion.