This repository contains the source code used for the experiments presented in the paper "Neural Prioritisation for Web Crawling" by Francesca Pezzuti, Sean MacAvaney and Nicola Tonellotto, published at ICTIR 2025.
You can install the requirements using pip:
pip install -r requirements.txt
Web collection:
- ClueWeb22-B (eng)
Query sets:
- MSM-WS (MS MARCO Web Search): Link to the dataset
- RQ (Researchy Questions): Link to the dataset
To preprocess the MSM-WS query set:
- make sure that queries are stored under "/data/queries/msmarco-ws/msmarco-ws-queries.tsv"
- make sure that qrels are stored under "./../data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"
Then, run:
python preproc_querysets.py
To preprocess ClueWeb22-B, run:
python preproc_cw22b.py
To crawl with BFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type bfs
To crawl with DFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type dfs
To crawl with QFirst, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name first --frontier_type quality
To crawl with QOracle, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type oracle-quality
To crawl with QMin, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name min --frontier_type quality --updates_enabled 1
To index the set of documents crawled up to time t=limit by a crawler whose experiment name is 'crawler', and to rank them with BM25, use:
python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws
By default, this script also retrieves the top-k scoring documents for each query in the specified query set and writes the results to the runs directory. To disable this step, launch the script with --evaluate False.
To re-rank documents with MonoElectra:
CUDA_VISIBLE_DEVICES=0 python rerank.py --exp_name crawler --subexp_name limit_2500000 --benchmark msmarco-ws
To evaluate crawling and retrieval effectiveness, use the Jupyter Notebook plot-metrics.ipynb.
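For readers unfamiliar with the frontier types passed to crawl.py above, the following is a minimal sketch of how BFS, DFS and score-prioritised frontiers differ in the order they release URLs. It is an illustration only and not the code in this repository: the class names (BFSFrontier, DFSFrontier, PriorityFrontier), the drain helper and the example scores are all hypothetical, and the way the quality-based crawlers actually compute and update priorities is not shown here.

```python
import heapq
from collections import deque
from itertools import count


class BFSFrontier:
    """FIFO frontier: URLs are released in the order they were discovered."""

    def __init__(self):
        self._queue = deque()

    def push(self, url, score=None):
        self._queue.append(url)  # score is ignored by BFS

    def pop(self):
        return self._queue.popleft()


class DFSFrontier:
    """LIFO frontier: the most recently discovered URL is released first."""

    def __init__(self):
        self._stack = []

    def push(self, url, score=None):
        self._stack.append(url)  # score is ignored by DFS

    def pop(self):
        return self._stack.pop()


class PriorityFrontier:
    """Priority frontier: the URL with the highest score is released first."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # insertion counter used to break score ties

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def drain(frontier, items):
    """Push all (url, score) pairs, then pop them all and return the order."""
    for url, score in items:
        frontier.push(url, score)
    return [frontier.pop() for _ in items]


# Hypothetical URLs with made-up quality scores.
discovered = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
bfs_order = drain(BFSFrontier(), discovered)
dfs_order = drain(DFSFrontier(), discovered)
priority_order = drain(PriorityFrontier(), discovered)
print(bfs_order, dfs_order, priority_order)
```

In this toy setting the three frontiers see the same discovered URLs and differ only in pop order, which is the aspect that the --frontier_type option selects.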