Evaluating Multilingual Text Encoders for Unsupervised CLIR

Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš. Evaluating Multilingual Text Encoders for Unsupervised CLIR. arXiv preprint arXiv:2101.08370

Installation instructions

We recommend installing the dependencies with Anaconda as follows:

conda create --name clir python=3.7
conda activate clir
conda install faiss-gpu cudatoolkit=10.0 -c pytorch

pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

(Optional) Install LASER and set the LASER environment variable, then set the LASER_EMB variable in config.py. You need to manually adjust the sequence length in LASER_HOME/source/embed.py.
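
For example, assuming LASER was cloned to /path/to/LASER (a placeholder path), the optional setup boils down to:

export LASER=/path/to/LASER      # placeholder: your local LASER clone
# then set the LASER_EMB variable in config.py
# and adjust the sequence length in $LASER/source/embed.py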

Resources

  • We run our experiments on CLEF for document-level retrieval and on samples of Europarl for sentence-level retrieval. Download and extract the files into PROJECT_ROOT/data/corpus (the full expected data layout is sketched after this list).
  • Download the following embeddings and place them into PROJECT_ROOT/data/embedding_spaces:
Model   XLM        mBERT
ISO     L1         L0
AOC     L12, L15   L9
  • We further make the following cross-lingual word embedding spaces available:
Type          CLWE Space
Supervised    CCA, proc, procB, RCSLS
Unsupervised  VecMap, Muse, ICP
  • Optional: Download model checkpoints here and extract into PROJECT_ROOT/data/checkpoints.
  • Note: Reference files for significance tests (PROJECT_ROOT/data/ttest-references) are not available at the moment.
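
Putting the resources above together, the data directory is expected to look roughly as follows (folder names are taken from this README; the files inside each folder depend on the individual downloads):

PROJECT_ROOT/
└── data/
    ├── corpus/              # CLEF documents and Europarl samples
    ├── embedding_spaces/    # CLWE spaces and AOC/ISO embeddings
    ├── checkpoints/         # optional model checkpoints
    └── ttest-references/    # significance-test references (currently unavailable)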

Example usage

We provide the following two scripts for running cross-lingual word embedding (CLWE) experiments and multilingual encoder experiments, respectively:

python run_CLWE_exp.py     # Run baseline/CLWE/AOC/ISO experiments
  --dataset                # One of: europarl,clef
  --emb_spaces             # Zero or more: cca,proc,procb,rcsls,icp,muse,vecmap,xlm_aoc,mbert_aoc,xlm_iso,mbert_iso
  --retrieval_models       # One or more: IDF-SUM,TbT-QT
  --baselines              # Zero or more: unigram,gtranslate
  --lang_pairs             # One or more: enfi,enit,ende,enru,defi,deit,deru,fiit,firu
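
For example, a single CLEF run with the procrustes space, the IDF-SUM retrieval model and one language pair could be launched as follows (how multiple values per flag are passed depends on the script's argument parser, so only single values are shown):

python run_CLWE_exp.py --dataset clef --emb_spaces proc --retrieval_models IDF-SUM --lang_pairs enfi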

python run_ENC_exp.py     # Run SEMB or similarity-specialized sentence encoder experiments
  --processes             # Number of processes (for parallel idf calculations), e.g. 10
  --gpu                   # Cuda device
  --name                  # Arbitrary experiment name, results stored in PROJECT_HOME/results/{name}/
  --dataset               # One of: europarl, clef
  --encoder               # One of: mbert,xlm,laser,labse,muse,distil_mbert,distil_xlmr,distil_muse
  --lang_pairs            # One or more: enfi,enit,enru,ende,defi,deit,deru,fiit,firu
  --maxlen                # 1 <= max_sequence_length <= 512
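
Analogously, an example LaBSE run on Europarl might look like this (the experiment name europarl_labse is arbitrary, and --maxlen 128 is just an illustrative value within the allowed range):

python run_ENC_exp.py --processes 10 --gpu 0 --name europarl_labse --dataset europarl --encoder labse --lang_pairs enfi --maxlen 128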

Citing

If you use this repository, please consider citing our paper:

@inproceedings{litschko2021encoderclir,
 author = {Litschko, Robert and Vuli{\'c}, Ivan and Ponzetto, Simone Paolo and Glava{\v{s}}, Goran},
 booktitle = {Proceedings of ECIR},
 title = {Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval},
 year = {2021}
}
