Neural Prioritisation for Web Crawling

This repository contains the source code used for the experiments presented in the paper "Neural Prioritisation for Web Crawling" by Francesca Pezzuti, Sean MacAvaney and Nicola Tonellotto, published at ICTIR 2025.

Usage

Installation

You can install the requirements using pip:

pip install -r requirements.txt

Supported datasets

Web collection:

ClueWeb22-B (eng)

Query sets:

MSM-WS (MS MARCO Web Search)
RQ (Researchy Questions)

Pre-processing

To preprocess the MSM-WS query set, first make sure that:

queries are stored under "/data/queries/msmarco-ws/msmarco-ws-queries.tsv"
qrels are stored under "./../data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"

Then, run:

python preproc_querysets.py

To preprocess ClueWeb22-B, run:

python preproc_cw22b.py

Crawling

BreadthFirstSearch (BFS)

To crawl with BFS, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type bfs

DepthFirstSearch (DFS)

To crawl with DFS, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type dfs
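The two baselines differ only in the order in which the frontier releases URLs: BFS treats it as a FIFO queue, DFS as a LIFO stack. The following is a minimal sketch of that idea on a toy link graph. It is not the code in crawl.py; the function `crawl_order` and the graph `toy_graph` are hypothetical names introduced for illustration.

```python
from collections import deque


def crawl_order(graph, seeds, strategy="bfs"):
    """Return the order in which pages are visited on a toy link graph.

    graph: dict mapping a URL to the list of URLs it links to (toy data).
    strategy: "bfs" pops the oldest frontier entry, "dfs" the newest.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        # FIFO for BFS, LIFO for DFS: the only difference between the two.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical five-page graph used only to show the two visiting orders.
toy_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}
print(crawl_order(toy_graph, ["a"], "bfs"))  # ['a', 'b', 'c', 'd', 'e']
print(crawl_order(toy_graph, ["a"], "dfs"))  # ['a', 'c', 'e', 'b', 'd']
```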

QFirst

To crawl with QFirst, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name first --frontier_type quality

QOracle

To crawl with QOracle, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type oracle-quality

QMin

To crawl with QMin, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name min --frontier_type quality --updates_enabled 1
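This README does not describe how the quality frontier is implemented, nor what exactly `--updates_enabled 1` changes. As a rough illustration only, the sketch below shows one common way to build a priority frontier whose entries can be re-prioritised after insertion: a binary heap with lazy deletion of stale entries. Everything in it is an assumption: the class `PriorityFrontier`, the convention that a higher score is crawled earlier, and the update rule (keep the minimum score seen so far, a guess suggested only by the name QMin).

```python
import heapq
import itertools


class PriorityFrontier:
    """Toy max-priority frontier with optional priority updates (hypothetical)."""

    def __init__(self, updates_enabled=False):
        self.updates_enabled = updates_enabled
        self._heap = []                    # entries: (-score, tie_breaker, url)
        self._score = {}                   # current score of each queued URL
        self._crawled = set()
        self._counter = itertools.count()  # FIFO order among equal scores

    def push(self, url, score):
        if url in self._crawled:
            return
        if url in self._score:
            if not self.updates_enabled:
                return                     # keep the score of the first discovery
            score = min(score, self._score[url])  # assumed update rule
            if score == self._score[url]:
                return
        self._score[url] = score
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            # Skip stale entries left behind by an update (lazy deletion).
            if self._score.get(url) == -neg_score:
                del self._score[url]
                self._crawled.add(url)
                return url
        raise IndexError("pop from an empty frontier")


first = PriorityFrontier(updates_enabled=False)
updated = PriorityFrontier(updates_enabled=True)
for frontier in (first, updated):
    frontier.push("x", 0.9)
    frontier.push("y", 0.6)
    frontier.push("x", 0.4)  # "x" is re-discovered with a lower score
print(first.pop(), updated.pop())  # x y
```

Without updates, "x" keeps the score of its first discovery and is crawled first; with updates, its score drops to 0.4 and "y" overtakes it.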

Ranking with BM25

To index the set of documents crawled up to time t=limit by a crawler whose experiment name is 'crawler', and to rank them with BM25, use:

python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws

By default, the script also retrieves the top-k scoring documents for each query in the specified query set and writes the results to the runs directory; to disable this step, launch the script with --evaluate False.
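For readers unfamiliar with BM25, the sketch below scores a handful of documents with the textbook Okapi BM25 formula and returns the top k. It is a self-contained illustration, not the indexing code in index.py (which toolkit and parameters the repository uses is not stated here); the function `bm25_rank`, the whitespace tokeniser, the defaults `k1=1.2` and `b=0.75`, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k=10, k1=1.2, b=0.75):
    """Rank whitespace-tokenised docs for a query with textbook Okapi BM25.

    docs: dict mapping a document id to its text (toy data).
    Returns the top-k (doc_id, score) pairs, best first.
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokens.values() for term in set(t))

    scores = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical three-document collection.
docs = {
    "d1": "web crawling with neural prioritisation",
    "d2": "cooking pasta at home",
    "d3": "crawling the web",
}
for doc_id, score in bm25_rank("web crawling", docs):
    print(doc_id, round(score, 3))
```

Both "d1" and "d3" contain the two query terms once each; "d3" ranks first because it is shorter than average, and "d2" scores zero.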

Re-ranking with MonoElectra

To re-rank documents with MonoElectra:

CUDA_VISIBLE_DEVICES=0 python rerank.py --exp_name crawler --subexp_name limit_2500000 --benchmark msmarco-ws
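When comparing a crawler at several points in time, the indexing and re-ranking commands have to be repeated once per limit. The helper below is a small sketch that builds those command lines from the flags shown in this README. It rests on two assumptions that are not confirmed here: that `--limit` also accepts a plain integer without underscores, and that `--subexp_name` always follows the `limit_<n>` pattern seen in the single example above. The function `pipeline_commands` and the limit 1_000_000 are hypothetical.

```python
def pipeline_commands(exp_name, limits, benchmark="msmarco-ws"):
    """Build the index.py and rerank.py argument lists for several crawl limits."""
    commands = []
    for limit in limits:
        commands.append([
            "python", "index.py", "--periodic", "1",
            "--limit", str(limit),
            "--exp_name", exp_name, "--benchmark", benchmark,
        ])
        commands.append([
            "python", "rerank.py", "--exp_name", exp_name,
            # Assumed naming convention, inferred from the one example above.
            "--subexp_name", f"limit_{limit}",
            "--benchmark", benchmark,
        ])
    return commands


# Print the commands for inspection instead of running them.
for command in pipeline_commands("crawler", [1_000_000, 2_500_000]):
    print(" ".join(command))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run`; setting `CUDA_VISIBLE_DEVICES` for the re-ranking step is left to the caller.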

Evaluation

To evaluate crawling and retrieval effectiveness, use the Jupyter Notebook plot-metrics.ipynb.
