Software for the paper: "Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation"
This is the official code release for "Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation", published in the Proceedings of the 2nd ACM Workshop in AI-Powered Question & Answering Systems (AIQAM '25), October 27–28, 2025, Dublin, Ireland. ACM, New York, NY, USA.
This codebase has the following minimum framework requirements:
- Python 3.10
- CUDA 12.1
- NVIDIA Driver 560.35.03
- cuDNN 8.9.2
Install the required libraries by running `pip install -r requirements.txt`. You will also need to download the English pipeline model from spaCy by running `python -m spacy download en_core_web_sm`.
Alternatively, we also provide an Anaconda environment file (environment.yml) that you can use to bootstrap the working environment: run `make prep` or `conda env create -f environment.yml` from the project's root. This installation method requires an Anaconda installation.
Our experimental design studies whether model-based evaluation methods offer advantages over word-overlap metrics in assessing topic consistency. We utilized three text corpora, each providing ground-truth consistency labels:
- Wizard of Wikipedia (WOW) (Dinan et al., 2018): A dialogue dataset where bot responses are grounded in Wikipedia sentences relevant to the conversation. We used a subset with 600 sentence pairs manually labelled as consistent (1) or inconsistent (0) (Honovich et al., 2021).
- Topical Chat (Gopalakrishnan et al., 2023; Mehri and Eskenazi, 2020): Contains human annotations for system- and human-generated responses across 60 dialogue contexts, with 204 sentence pairs labelled with "Uses knowledge". This "Uses knowledge" label is assigned based on the following criterion: "Given the fact that the response is conditioned on, how well does the response use that fact?"
- Persona Chat (Zhang et al., 2018; Mehri and Eskenazi, 2020): Includes human annotations for a corpus of persona-conditioned conversations across 10,907 dialogues, with 240 sentence pairs labelled as "Uses knowledge". Knowledge labels in Persona Chat were obtained using the same instructions as those in Topical Chat.
- Dialogue NLI (Welleck et al., 2018): Consists of pairs including either a personality description sentence or an utterance from the dialogue history (the premise) and a subsequent dialogue utterance (the hypothesis). The pairs are classified as entailing (consistent) or contradicting (inconsistent), totalling 8,368 pairs. Neutral labels are not included.
All datasets contain labels that allow us to discriminate between consistent and inconsistent samples (pairs of sentences).
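Because every dataset reduces to binary consistency labels, any evaluation metric can be compared by how well its scores separate the two classes. As a minimal, hypothetical sketch (the function name, scores, and labels below are illustrative, not values from the paper), thresholded accuracy over per-pair metric scores can be computed like this:

```python
# Minimal sketch: judge how well a metric's scores separate consistent (1)
# from inconsistent (0) sentence pairs. Scores and labels are illustrative.

def threshold_accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the gold label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical metric scores for six sentence pairs and their gold labels.
scores = [0.91, 0.15, 0.72, 0.40, 0.88, 0.05]
labels = [1, 0, 1, 1, 1, 0]

print(threshold_accuracy(scores, labels))  # 5 of 6 predictions match
```

Threshold-free summaries such as ROC-AUC work the same way, taking the scores and binary labels as input.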
Assuming all datasets are available, the evaluation pipeline can be launched using the following:
python scripts/scores.py \
--infile <file_path> \
--config_path <config_path> \
--exp_name <experiment_name> \
--verbose

Check scripts/run_tc.sh for examples on how to run the pipeline.
The evaluation pipeline loops over the pairs of sentences in the file passed via --infile and evaluates them.
The remaining options are the following:
- <config_path>: Path to the configuration file.
- <experiment_name>: Name of the experiment, and of the folder where results are logged.
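For intuition, the question-generation/question-answering loop that these options drive can be caricatured in plain Python. The sketch below is not the paper's implementation: `generate_questions` and `answer_question` are toy stand-ins for real QG and QA models, and only the overall structure (generate questions from one sentence, try to answer them against the other, discard low-confidence answers) mirrors the pipeline's general idea.

```python
# Hedged sketch of a QG/QA consistency check. The two model functions are
# placeholders; a real pipeline would call trained QG and QA models.

def generate_questions(sentence, n_return_seq=3):
    # Toy stand-in: a real QG model would return n_return_seq questions.
    return [f"Question {i} about: {sentence}" for i in range(n_return_seq)]

def answer_question(question, context):
    # Toy stand-in: a real QA model returns an answer span and a confidence.
    # Here, confidence is just word overlap between question and context.
    overlap = len(set(question.lower().split()) & set(context.lower().split()))
    confidence = min(1.0, overlap / 5)
    return "some span", confidence

def consistency_score(source, response, n_return_seq=3, no_answer_threshold=0.5):
    """Fraction of generated questions answerable from the response."""
    questions = generate_questions(source, n_return_seq)
    answered = sum(
        1 for q in questions
        if answer_question(q, response)[1] >= no_answer_threshold
    )
    return answered / len(questions)

score = consistency_score("cats are small domesticated mammals",
                          "cats are small domesticated mammals kept as pets")
print(score)  # all questions answerable -> 1.0
```

The real pipeline exposes the equivalent knobs (number of generated questions, missing-answer threshold) through the experiment configuration file described next.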
To modify the experiment parameters, update the following variables in the experiment configuration file (config/config.json):
{
  "evaluation": {
    "qg_model": {
      // Question maximum length
      "max_length": 512,
      // Number of generated questions
      "n_return_seq": 3
    },
    "qa_model": {
      // Missing answer threshold
      "no_answer_threshold": 0.5
    }
  }
}

If you use this work, please cite:

@inproceedings{10.1145/3746274.3760399,
author = {Vilaça, Luís and Viana, Paula},
title = {Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation},
year = {2025},
isbn = {9798400720567},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746274.3760399},
doi = {10.1145/3746274.3760399},
booktitle = {Proceedings of the 2nd ACM Workshop in AI-Powered Question \& Answering Systems},
pages = {12–18},
numpages = {7},
location = {Ireland},
series = {AIQAM '25}
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Any feedback is appreciated. If you observe any issues, please contact us: all project-related issues and feature requests should be submitted through our GitHub Issues page.
