Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš. Evaluating Multilingual Text Encoders for Unsupervised CLIR. arXiv preprint arXiv:2101.08370
We recommend installing the dependencies with Anaconda as follows:
```bash
conda create --name clir python=3.7
conda activate clir
conda install faiss-gpu cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```
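Optionally, you can sanity-check the installation with a quick test (illustrative commands, not part of the repository):

```bash
# Confirm that the GPU build of FAISS sees at least one device
python -c "import faiss; print('GPUs visible to faiss:', faiss.get_num_gpus())"
# Confirm that the NLTK tokenizer data was downloaded (raises LookupError if missing)
python -c "import nltk; nltk.data.find('tokenizers/punkt')"
```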
(Optional) Install LASER and set the `LASER` environment variable, then set the `LASER_EMB` variable in `config.py`. You need to manually adjust the sequence length in `LASER_HOME/source/embed.py`.
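For example (the install path below is hypothetical; use wherever you cloned LASER):

```bash
# Point the LASER environment variable at your local LASER checkout
export LASER=/path/to/LASER
```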
- We run our experiments on CLEF for document-level retrieval and on samples of Europarl for sentence-level retrieval. Download and extract the files into `PROJECT_ROOT/data/corpus`.
- Download the following embeddings and place them into `PROJECT_ROOT/data/embedding_spaces`:
| Method | XLM (layer) | mBERT (layer) |
|---|---|---|
| ISO | L1 | L0 |
| AOC | L12, L15 | L9 |
- We further make the following cross-lingual word embedding spaces available:
| Type | CLWE Space |
|---|---|
| Supervised | CCA, proc, procB, RCSLS |
| Unsupervised | VecMap, Muse, ICP |
- Optional: Download model checkpoints here and extract them into `PROJECT_ROOT/data/checkpoints`.
- Note: Reference files for significance tests (`PROJECT_ROOT/data/ttest-references`) are not available at the moment.
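After these steps, the data directory should look roughly as follows (a sketch based on the paths above):

```
PROJECT_ROOT/
└── data/
    ├── corpus/               # CLEF and Europarl files
    ├── embedding_spaces/     # AOC/ISO embeddings and CLWE spaces
    ├── checkpoints/          # optional model checkpoints
    └── ttest-references/     # significance-test references (not yet available)
```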
We provide the following two scripts for running cross-lingual word embedding (CLWE) experiments and multilingual encoder experiments, respectively:
```bash
python run_CLWE_exp.py    # Run baseline/CLWE/AOC/ISO experiments
    --dataset             # One of: europarl,clef
    --emb_spaces          # Zero or more: cca,proc,procb,rcsls,icp,muse,vecmap,xlm_aoc,mbert_aoc,xlm_iso,mbert_iso
    --retrieval_models    # One or more: IDF-SUM,TbT-QT
    --baselines           # Zero or more: unigram,gtranslate
    --lang_pairs          # One or more: enfi,enit,ende,enru,defi,deit,deru,fiit,firu
```
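An illustrative invocation (flag values are taken from the lists above; whether multi-value flags expect space- or comma-separated values depends on the script's argument parsing, so consult `python run_CLWE_exp.py --help`):

```bash
python run_CLWE_exp.py --dataset clef --emb_spaces procb --retrieval_models IDF-SUM --lang_pairs enfi ende
```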
```bash
python run_ENC_exp.py     # Run SEMB or similarity-specialized sentence encoder experiments
    --processes           # Number of processes (for parallel idf calculations), e.g. 10
    --gpu                 # Cuda device
    --name                # Arbitrary experiment name, results stored in PROJECT_HOME/results/{name}/
    --dataset             # One of: europarl, clef
    --encoder             # One of: mbert,xlm,laser,labse,muse,distil_mbert,distil_xlmr,distil_muse
    --lang_pairs          # One or more: enfi,enit,enru,ende,defi,deit,deru,fiit,firu
    --maxlen              # 1 <= max_sequence_length <= 512
```
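Again for illustration only (the experiment name is arbitrary, and the exact multi-value flag syntax should be checked via `--help`):

```bash
python run_ENC_exp.py --name labse_europarl --processes 10 --gpu 0 --dataset europarl --encoder labse --lang_pairs enfi --maxlen 128
```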
If you use this repository, please consider citing our paper:

```bibtex
@inproceedings{litschko2021encoderclir,
    author    = {Litschko, Robert and Vuli{\'c}, Ivan and Ponzetto, Simone Paolo and Glava{\v{s}}, Goran},
    booktitle = {Proceedings of ECIR},
    title     = {Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval},
    year      = {2021}
}
```