These experiments can be expensive because of the large number of inference calls they require, so we design them to be effective yet cheap.
We first collect all the language model responses: for every language and every model, we run inference to generate a response to each query. We store the responses and classify them into answer choices (we use classification tasks; with other accuracy or reward metrics you could instead classify them as correct/incorrect). We can then treat LSKExtractor and each of the baselines as a language selection method: the method only selects the language of the query, and since the corresponding language model response is already stored, we retrieve it from our stored responses rather than re-running inference. Below is the rough pipeline for running our code.
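As a concrete illustration, the stored responses can be thought of as a lookup table keyed by model, language, and query; a selection method then only needs to pick a language per query and read the cached result. The sketch below uses hypothetical field names and file layout, not the exact format our scripts produce:

```python
import json

# Hypothetical cache layout: one record per (model, language, query).
# Field names are illustrative and may differ from the actual script output.
def load_response_cache(path):
    """Index stored, already-classified responses by (model, language, query_id)."""
    with open(path) as f:
        records = json.load(f)
    return {(r["model"], r["language"], r["query_id"]): r for r in records}

def get_cached_response(cache, model, language, query_id):
    """Retrieve a stored response instead of calling the language model again."""
    return cache[(model, language, query_id)]
```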
To start, process all the data using the .ipynb notebooks in data/.
You can use our script translate_gpt.py to translate queries into different languages with a GPT model.
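For reference, translating a query with a GPT model boils down to a single chat completion call like the one below. This is a simplified sketch using the OpenAI Python client with an assumed prompt and model name, not the exact code in translate_gpt.py:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_query(query, target_language, model="gpt-4o-mini"):
    """Translate one query into the target language (illustrative prompt, not the script's)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. Return only the translation."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.strip()
```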
Now, you can run inference with run_inference.py. If you want to run inference without reasoning, use run_inference_nr.py.
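Conceptually, the inference step generates one response per (language, query) pair and saves the raw text for later classification. The sketch below uses the Hugging Face transformers pipeline with an assumed model name, file layout, and record fields; the real scripts additionally handle prompting, batching, and the reasoning/no-reasoning modes:

```python
import json
from transformers import pipeline

# Illustrative only: model name, paths, and record fields are assumptions.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def run_inference(queries, language, out_path):
    """Generate and store one raw response per query in the given language."""
    records = []
    for q in queries:
        out = generator(q["prompt"], max_new_tokens=256, do_sample=False)
        records.append({
            "query_id": q["id"],
            "language": language,
            "response": out[0]["generated_text"],
        })
    with open(out_path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```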
It is crucial to then run parse_generations_to_classify.py, which maps the LLM responses to answer choices for our classification tasks.
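The parsing step amounts to mapping a free-form generation onto one of the task's answer choices. Below is a minimal sketch of that idea; the real script's matching rules are more careful, and the choice labels here are assumptions:

```python
import re

def classify_generation(generation, choices=("A", "B", "C", "D")):
    """Map a free-form LLM response to an answer choice, or None if no match is found."""
    # Look for a standalone choice letter, e.g. "The answer is B." or "(C)".
    match = re.search(r"\b([A-D])\b", generation.upper())
    if match and match.group(1) in choices:
        return match.group(1)
    return None
```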
Finally, you can run the EVALUATION_... scripts to evaluate LSKExtractor and the baselines.
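At a high level, each evaluation treats a method as a function from query to language and scores it against the cached responses, so no new inference is needed. A hedged sketch, reusing the hypothetical cache from above (method and field names are assumptions):

```python
def evaluate(select_language, cache, model, queries):
    """Accuracy of a language selection method, computed purely from cached responses."""
    correct = 0
    for q in queries:
        lang = select_language(q)                   # e.g., LSKExtractor or a baseline
        record = cache[(model, lang, q["id"])]      # retrieved, not re-generated
        correct += int(record["answer"] == q["gold_answer"])
    return correct / len(queries)

# Example baselines: always query in English vs. always in the query's source language.
# english_acc  = evaluate(lambda q: "en", cache, "gpt-4o-mini", queries)
# original_acc = evaluate(lambda q: q["source_language"], cache, "gpt-4o-mini", queries)
```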
