Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš. Evaluating Multilingual Text Encoders for Unsupervised CLIR. arXiv preprint arXiv:2101.08370
We recommend installing the dependencies with Anaconda as follows:
```bash
conda create --name clir python=3.7
conda activate clir
conda install faiss-gpu cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```
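Optionally, you can sanity-check the installation with a quick test (illustrative commands, not part of the repository):

```bash
# Confirm that the GPU build of FAISS sees at least one device
python -c "import faiss; print('GPUs visible to faiss:', faiss.get_num_gpus())"
# Confirm that the NLTK tokenizer data was downloaded (raises LookupError if missing)
python -c "import nltk; nltk.data.find('tokenizers/punkt')"
```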
(Optional) Install LASER and set the `LASER` environment variable, then set the `LASER_EMB` variable in `config.py`. You need to manually adjust the sequence length in `LASER_HOME/source/embed.py`.
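For example (the install path below is hypothetical; use wherever you cloned LASER):

```bash
# Point the LASER environment variable at your local LASER checkout
export LASER=/path/to/LASER
```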
- We run our experiments on CLEF for document-level retrieval and on samples of Europarl for sentence-level retrieval. Download and extract the files into `PROJECT_ROOT/data/corpus`.
- Download the following embeddings and place them into `PROJECT_ROOT/data/embedding_spaces`:
| Method | XLM (layer) | mBERT (layer) |
|---|---|---|
| ISO | L1 | L0 |
| AOC | L12, L15 | L9 |
- We further make the following cross-lingual word embedding spaces available:
| Type | CLWE Space |
|---|---|
| Supervised | CCA, proc, procB, RCSLS |
| Unsupervised | VecMap, Muse, ICP |
- Optional: Download model checkpoints here and extract them into `PROJECT_ROOT/data/checkpoints`.
- Note: Reference files for significance tests (`PROJECT_ROOT/data/ttest-references`) are not available at the moment.
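After these steps, the data directory should look roughly as follows (a sketch based on the paths above):

```
PROJECT_ROOT/
└── data/
    ├── corpus/               # CLEF and Europarl files
    ├── embedding_spaces/     # AOC/ISO embeddings and CLWE spaces
    ├── checkpoints/          # optional model checkpoints
    └── ttest-references/     # significance-test references (not yet available)
```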
We provide the following two scripts for running cross-lingual word embedding (CLWE) experiments and multilingual encoder experiments, respectively:
```bash
python run_CLWE_exp.py    # Run baseline/CLWE/AOC/ISO experiments
    --dataset             # One of: europarl,clef
    --emb_spaces          # Zero or more: cca,proc,procb,rcsls,icp,muse,vecmap,xlm_aoc,mbert_aoc,xlm_iso,mbert_iso
    --retrieval_models    # One or more: IDF-SUM,TbT-QT
    --baselines           # Zero or more: unigram,gtranslate
    --lang_pairs          # One or more: enfi,enit,ende,enru,defi,deit,deru,fiit,firu
```
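An illustrative invocation (flag values are taken from the lists above; whether multi-value flags expect space- or comma-separated values depends on the script's argument parsing, so consult `python run_CLWE_exp.py --help`):

```bash
python run_CLWE_exp.py --dataset clef --emb_spaces procb --retrieval_models IDF-SUM --lang_pairs enfi ende
```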
```bash
python run_ENC_exp.py     # Run SEMB or similarity-specialized sentence encoder experiments
    --processes           # Number of processes (for parallel idf calculations), e.g. 10
    --gpu                 # Cuda device
    --name                # Arbitrary experiment name, results stored in PROJECT_HOME/results/{name}/
    --dataset             # One of: europarl, clef
    --encoder             # One of: mbert,xlm,laser,labse,muse,distil_mbert,distil_xlmr,distil_muse
    --lang_pairs          # One or more: enfi,enit,enru,ende,defi,deit,deru,fiit,firu
    --maxlen              # 1 <= max_sequence_length <= 512
```
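Again for illustration only (the experiment name is arbitrary, and the exact multi-value flag syntax should be checked via `--help`):

```bash
python run_ENC_exp.py --name labse_europarl --processes 10 --gpu 0 --dataset europarl --encoder labse --lang_pairs enfi --maxlen 128
```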
If you use this repository, please consider citing our paper:

```bibtex
@inproceedings{litschko2021encoderclir,
    author    = {Litschko, Robert and Vuli{\'c}, Ivan and Ponzetto, Simone Paolo and Glava{\v{s}}, Goran},
    booktitle = {Proceedings of ECIR},
    title     = {Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval},
    year      = {2021}
}
```