Software for the paper: "Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation"
This is the official code release for "Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation", published in the Proceedings of the 2nd ACM Workshop in AI-Powered Question & Answering Systems (AIQAM '25), October 27–28, 2025, Dublin, Ireland. ACM, New York, NY, USA.
This codebase has the following minimum framework requirements:
- Python 3.10
- CUDA 12.1
- NVIDIA Driver 560.35.03
- cuDNN 8.9.2
Install the required libraries by running `pip install -r requirements.txt`. You will also need to download the English pipeline model from spaCy by running `python -m spacy download en_core_web_sm`.
Alternatively, we also provide an Anaconda environment file (environment.yml) that you can use to bootstrap the working environment: run `make prep` or `conda env create -f environment.yml` from the project's root. This installation method requires an Anaconda installation.
Our experimental design studies whether model-based evaluation methods offer advantages over word-overlap metrics in assessing topic consistency. We utilized three text corpora, each providing ground-truth consistency labels:
- Wizard of Wikipedia (WOW) (Dinan et al., 2018): A dialogue dataset where bot responses are grounded in Wikipedia sentences relevant to the conversation. We used a subset with 600 sentence pairs manually labelled as consistent (1) or inconsistent (0) (Honovich et al., 2021).
- Topical Chat (Gopalakrishnan et al., 2023; Mehri and Eskenazi, 2020): Contains human annotations for system- and human-generated responses across 60 dialogue contexts, with 204 sentence pairs labelled with "Uses knowledge". This "Uses knowledge" label is assigned based on the following criterion: "Given the fact that the response is conditioned on, how well does the response use that fact?"
- Persona Chat (Zhang et al., 2018; Mehri and Eskenazi, 2020): Includes human annotations for a corpus of persona-conditioned conversations across 10,907 dialogues, with 240 sentence pairs labelled as "Uses knowledge". Knowledge labels in Persona Chat were obtained using the same instructions as those in Topical Chat.
- Dialogue NLI (Welleck et al., 2018): Consists of pairs including either a personality description sentence or an utterance from the dialogue history (the premise) and a subsequent dialogue utterance (the hypothesis). The pairs are classified as entailing (consistent) or contradicting (inconsistent), totalling 8,368 pairs. Neutral labels are not included.
All datasets contain labels that allow us to discriminate between consistent and inconsistent samples (pairs of sentences).
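Because every dataset reduces to binary consistency labels, any evaluation metric can be compared by how well its scores separate the two classes. As a minimal, hypothetical sketch (the function name, scores, and labels below are illustrative, not values from the paper), thresholded accuracy over per-pair metric scores can be computed like this:

```python
# Minimal sketch: judge how well a metric's scores separate consistent (1)
# from inconsistent (0) sentence pairs. Scores and labels are illustrative.

def threshold_accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the gold label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical metric scores for six sentence pairs and their gold labels.
scores = [0.91, 0.15, 0.72, 0.40, 0.88, 0.05]
labels = [1, 0, 1, 1, 1, 0]

print(threshold_accuracy(scores, labels))  # 5 of 6 predictions match
```

Threshold-free summaries such as ROC-AUC work the same way, taking the scores and binary labels as input.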
Assuming all datasets are available, the evaluation pipeline can be launched using the following:
python scripts/scores.py \
--infile <file_path> \
--config_path <config_path> \
--exp_name <experiment_name> \
--verbose

Check scripts/run_tc.sh for examples on how to run the pipeline.
The evaluation pipeline loops over the pairs of sentences in the file passed via --infile and evaluates them.
The remaining options are the following:
- <config_path>: Path to the configuration file.
- <experiment_name>: Name of the experiment, and of the folder where results are logged.
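For intuition, the question-generation/question-answering loop that these options drive can be caricatured in plain Python. The sketch below is not the paper's implementation: `generate_questions` and `answer_question` are toy stand-ins for real QG and QA models, and only the overall structure (generate questions from one sentence, try to answer them against the other, discard low-confidence answers) mirrors the pipeline's general idea.

```python
# Hedged sketch of a QG/QA consistency check. The two model functions are
# placeholders; a real pipeline would call trained QG and QA models.

def generate_questions(sentence, n_return_seq=3):
    # Toy stand-in: a real QG model would return n_return_seq questions.
    return [f"Question {i} about: {sentence}" for i in range(n_return_seq)]

def answer_question(question, context):
    # Toy stand-in: a real QA model returns an answer span and a confidence.
    # Here, confidence is just word overlap between question and context.
    overlap = len(set(question.lower().split()) & set(context.lower().split()))
    confidence = min(1.0, overlap / 5)
    return "some span", confidence

def consistency_score(source, response, n_return_seq=3, no_answer_threshold=0.5):
    """Fraction of generated questions answerable from the response."""
    questions = generate_questions(source, n_return_seq)
    answered = sum(
        1 for q in questions
        if answer_question(q, response)[1] >= no_answer_threshold
    )
    return answered / len(questions)

score = consistency_score("cats are small domesticated mammals",
                          "cats are small domesticated mammals kept as pets")
print(score)  # all questions answerable -> 1.0
```

The real pipeline exposes the equivalent knobs (number of generated questions, missing-answer threshold) through the experiment configuration file described next.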
To modify the experiment parameters, update the following variables in the experiment configuration file (config/config.json):
{
  "evaluation": {
    "qg_model": {
      // Question maximum length
      "max_length": 512,
      // Number of generated questions
      "n_return_seq": 3
    },
    "qa_model": {
      // Missing answer threshold
      "no_answer_threshold": 0.5
    }
  }
}

If you use this work, please cite:

@inproceedings{10.1145/3746274.3760399,
author = {Vilaça, Luís and Viana, Paula},
title = {Is it Enough to Ask Questions? Dialogue Evaluation through Question Answering and Generation},
year = {2025},
isbn = {9798400720567},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746274.3760399},
doi = {10.1145/3746274.3760399},
booktitle = {Proceedings of the 2nd ACM Workshop in AI-Powered Question \& Answering Systems},
pages = {12–18},
numpages = {7},
location = {Ireland},
series = {AIQAM '25}
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Any feedback is appreciated. If you observe any issues, please contact us: all project-related issues and feature requests should be submitted through our GitHub Issues page.
