Execute a query on the arXiv and SemanticScholar databases, fetch the papers, and manually quantify whether they are reproducible.
- Python: 3.13.0
- Requirements: see `requirements.txt`
Running `python main.py` downloads all PDFs that correspond to the query configuration, puts them into the `paper_pdfs` directory, and conducts a text-based search in the PDFs for keywords that indicate a reproduction package (see the configuration params in `main.py`).
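A minimal sketch of what that text-based search could look like, assuming `pypdf` for text extraction; the keyword lists here are placeholders, the real ones are `PLATFORMS` and `CODE_KEYWORDS` in `main.py`:

```python
from pathlib import Path

from pypdf import PdfReader

# Placeholder lists -- the real ones are PLATFORMS and CODE_KEYWORDS in main.py.
PLATFORMS = ["github.com", "gitlab.com", "zenodo.org"]
CODE_KEYWORDS = ["reproduction package", "source code", "our implementation"]


def find_repro_hints(pdf_path: Path) -> list[tuple[str, str, int, str]]:
    """Return (hint_source, hint, line_number, line_text) for every match."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for platform in PLATFORMS:
            if platform in lowered:
                hits.append(("platform", platform, line_no, line.strip()))
        for keyword in CODE_KEYWORDS:
            if keyword in lowered:
                hits.append(("keyword", keyword, line_no, line.strip()))
    return hits


for pdf in Path("paper_pdfs").glob("*.pdf"):
    for hit in find_repro_hints(pdf):
        print(pdf.name, *hit)
```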
Alongside, a BibTeX file `references.bib` is created, which is based on the journal reference if one is provided and otherwise uses the arXiv DOI.
`references.csv` contains a list of the bib keys, together with the arXiv/SemanticScholar IDs and some further metadata.
The current contents of `references.{csv,bib}` were obtained with the following query for arXiv:
(cat:quant-ph) AND (((ti:"quantum computing" OR ti:"quantum_computer") AND (ti:"algorithm" OR ti:"software") AND (ti:"NISQ")) OR ((abs:"quantum computing" OR abs:"quantum_computer") AND (abs:"algorithm" OR abs:"software") AND (abs:"NISQ"))) AND submittedDate:[20210101 TO 20260327] ANDNOT (ti:"survey" OR ti:"review" OR abs:"survey" OR abs:"review")
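For reference, this is roughly how such a query could be issued with the `arxiv` Python package; the query string is the one above, but `main.py` may construct and page through the request differently:

```python
import os

import arxiv

QUERY = (
    '(cat:quant-ph) AND (((ti:"quantum computing" OR ti:"quantum_computer") '
    'AND (ti:"algorithm" OR ti:"software") AND (ti:"NISQ")) OR '
    '((abs:"quantum computing" OR abs:"quantum_computer") AND '
    '(abs:"algorithm" OR abs:"software") AND (abs:"NISQ"))) '
    'AND submittedDate:[20210101 TO 20260327] '
    'ANDNOT (ti:"survey" OR ti:"review" OR abs:"survey" OR abs:"review")'
)

os.makedirs("paper_pdfs", exist_ok=True)

client = arxiv.Client(page_size=100, delay_seconds=3.0)  # stay within arXiv API rate limits
search = arxiv.Search(query=QUERY, sort_by=arxiv.SortCriterion.SubmittedDate)

for result in client.results(search):
    print(result.entry_id, result.title)
    result.download_pdf(dirpath="paper_pdfs")  # same directory main.py uses by default
```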
For SemanticScholar, the following parameters for the API call were used:
"query": '(("quantum computing" | "quantum_computer") + ("algorithm" | "software") + ("NISQ")) + (- "survey" + -"review"),
"publicationDateOrYear": '2021-01-01:2026-03-27',
"fieldsOfStudy": 'Physics,Mathematics,Computer Science,Engineering',
"publicationTypes": 'JournalArticle,CaseReport,Conference,Study',
When running the text filter for the reproduction keywords (specified in `main.py`) on the queries above, we get:
- 218 out of 424 potentially experimental papers have some indication of a reproduction package.
- 127 out of 249 potentially experimental and peer-reviewed papers have some indication of a reproduction package.
- 66 out of 127 potentially experimental and peer-reviewed papers that were found on both arXiv and SemanticScholar have some indication of a reproduction package.
- 42 out of 66 manually scanned papers actually do contain some kind of reproduction package.
- 31 out of 42 actually contain code, of which 28 have some kind of documentation and 20 have some kind of environment specification.
The result data and metadata are stored in the `results_data.csv` and `results_meta.csv` files.
(Output of `python main.py -h`)
usage: ArXiv & SemanticScholar Scraper [-h] [-rn REF_NAME] [-dd DOWNLOAD_DIR] [-r RESULTS_DIR] [-mr MAX_RESULTS] [-s {arxiv,semantic_scholar,both,none}] [-nd] [-c] [-l] [-rd RESULT_DATA_FILE] [-res RESULT_META_FILE] [-os] [-rs RESULT_MANUAL_FILE]
[-or] [-rr RESULT_REPRO_FILE]
Fetches bib entries and PDFs from arXiv and SemanticScholar.
options:
-h, --help show this help message and exit
-rn, --ref_name REF_NAME
Bib and corresponding CSV file name (without file ending) defaults to 'references', which ends up as 'references.bib' and 'references.csv'.
-dd, --download_dir DOWNLOAD_DIR
Directory to which to store PDFs, defaults to 'paper_pdfs'.
-r, --results_dir RESULTS_DIR
Directory to which to store result CSV/BIB, defaults to 'results'.
-mr, --max_results MAX_RESULTS
Maximum results per query. Default is unlimited (only applies to ArXiv)
-s, --source {arxiv,semantic_scholar,both,none}
Source from which to download.
-nd, --no_download If provided, no PDFs are downloaded.
-c, --cleanup_pdfs If provided, PDFs are deleted that are not in bib data.
-l, --load_existing_results
Whether to load result data from existing result files.
-rd, --result_data_file RESULT_DATA_FILE
File to which result data get stored. Defaults to results_data.csv.
-res, --result_meta_file RESULT_META_FILE
File to which meta and summary data of the results get stored. Defaults to results_meta.csv.
-os, --overwrite_manual_results
Whether to overwrite already existing manually scan results for reproduction packages
-rs, --result_manual_file RESULT_MANUAL_FILE
File to which results for the manual scan are stored. Defaults to results_manual.csv.
-or, --overwrite_repro_results
Whether to overwrite already existing repository scan results for reproduction packages
-rr, --result_repro_file RESULT_REPRO_FILE
File to which results for the specific repository repro scan are stored. Defaults to results_repro.csv.
- The query above is executed either on `arxiv`, on `semantic_scholar`, or on `both` (see the `-s` argument above). This step can also be skipped (`none` provided to `-s`) when the query has already been executed, the PDFs have been downloaded, and the bib file (`references.bib`) and bib meta file (`references.csv`) have been created.
- Duplicates in the bib data are removed based on a matching title and authors; entries that have a non-arXiv DOI are prioritised here (see the deduplication sketch after this list). With the `-c` flag, duplicated PDFs can be deleted as well.
- With all redundant bib entries removed, a text-based search is conducted on the PDFs. Papers get classified as `"experimental"` if certain keywords occur in the text (see `EXPERIMENTAL_PAPER_KEYWORDS` in `main.py`), as `"review"` papers if certain keywords occur in the title or in the first 1500 characters of the text (see `REVIEW_PAPER_KEYWORDS` in `main.py`), or as `"unknown"` if none match.
- Additionally, the text-based search looks for keywords that indicate a reproduction package, such as certain platforms (see `PLATFORMS` in `main.py`) or keywords (see `CODE_KEYWORDS` in `main.py`).
- Finally, two CSV files are created:
  - `results_meta.csv` with the columns:
    - `bib_key`: key of the bib entry in `references.bib`.
    - `id`: arXiv or SemanticScholar ID.
    - `source`: `"arxiv"`, `"semantic_scholar"`, or `"both"` (if the paper was found on both platforms).
    - `downloaded_pdf`: whether a PDF was downloadable (some publications are not open access). Currently this is always `True`, as we do not store the closed-source papers.
    - `has_doi`: whether the paper has a non-arXiv DOI, which indicates a peer-reviewed paper.
    - `paper_class`: `"experimental"`, `"review"`, or `"unknown"` (see the classification bullet above).
    - `has_repro_hint`: whether a hint for a reproduction package was found (see the previous bullet).
  - `results_data.csv` with the columns:
    - `bib_key`, `id`, `source`, `has_doi`: same as for `results_meta.csv`. Note, however, that `results_data.csv` can contain multiple rows for the same paper, as it contains details on each individual hit that matches the reproduction keyword criteria.
    - `repro_hint_source`: `"platform"` or `"keyword"`, depending on the matching criterion (`False` if none was found).
    - `repro_hint`: the exact platform or keyword that matched (`False` if none was found).
    - `repro_hint_line`: line of the hint in the PDF file (`False` if none was found).
    - `repro_hint_text`: the full line of text that contains the hint (`False` if none was found).
- Based on all papers that are classified as `"experimental"` and the flags `has_doi` and `has_repro_hint`, an interactive shell is opened to manually scan these results for actual reproduction packages, as there can be many false positives: for instance, some other GitHub repository that is not a reproduction package may be mentioned in the references, or the `"experimental"` class may have been assigned incorrectly and the paper is in fact conceptual. The scan has the following options and consequences (a condensed sketch of this prompt follows the list):
  - `[y]` yes: there is a reproduction package. If the reproduction hint contains a link, it is automatically saved in the results as well.
  - `[n]` no: there is no reproduction package.
  - `[u]` unclear: not apparent from the hint alone; the PDF gets opened. This can also be useful for looking up the correct link. The prompt is repeated afterwards.
  - `[q]` quit: stop the scan. Intermediate results are stored.
  - `[l]` yes and manually enter [l]ink: there is a reproduction package, but the link in the reproduction text hint is only partially available, or not available at all but known.
  - `[i]` ignore paper: continues with the next paper.
  - `[c]` re-[c]lassify paper: the paper is not `"experimental"` but something else, for instance `"conceptual"` or a `"thesis"`.

  Results for the scan are stored in `results_manual.csv`. Previously scanned results are loaded by default; to start the scan from scratch, use the `-os` flag.
- An additional CLI application is then launched to specify whether a given code repository has the following (a sketch of this pass also follows the list). Available options are `[y]` yes / `[n]` no / `[p]` partially / `[u]` unknown / `[na]` [n]ot [a]pplicable:
  - code available
  - an environment
  - documentation
  - a version update
  - a Docker image
  - hardware specification (`experimental_params`)
  - executable
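To make the deduplication step above concrete, here is a rough sketch under assumed field names (`title`, `authors`, `doi`); `main.py` may key and compare entries differently:

```python
def normalised_key(entry: dict) -> tuple[str, str]:
    """Title + authors with case and whitespace stripped, used to detect duplicates."""
    title = "".join(entry["title"].lower().split())
    authors = "".join(entry.get("authors", "").lower().split())
    return title, authors


def has_journal_doi(entry: dict) -> bool:
    """True for a non-arXiv DOI (arXiv DOIs carry the 10.48550 prefix)."""
    doi = (entry.get("doi") or "").lower()
    return bool(doi) and not doi.startswith("10.48550/arxiv")


def deduplicate(entries: list[dict]) -> list[dict]:
    best: dict[tuple[str, str], dict] = {}
    for entry in entries:
        key = normalised_key(entry)
        current = best.get(key)
        # Keep the first entry seen, unless a later duplicate has a journal DOI
        # and the stored one does not.
        if current is None or (has_journal_doi(entry) and not has_journal_doi(current)):
            best[key] = entry
    return list(best.values())
```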
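A condensed sketch of the manual-scan prompt; `extract_link` here is a simple stand-in, the PDF viewer is only stubbed out with a `print`, and the real shell in `main.py` shows more context and writes `results_manual.csv` incrementally:

```python
import re

OPTIONS = "[y]es / [n]o / [u]nclear / [q]uit / [l]ink / [i]gnore / re-[c]lassify"


def extract_link(hint: str) -> str:
    """Pull the first URL out of the hint text, if any (assumed behaviour of [y])."""
    match = re.search(r"https?://\S+", hint)
    return match.group(0) if match else ""


def scan_paper(bib_key: str, hint: str) -> dict | None:
    """Ask about a single repro hint; returns a result row, or None to ignore the paper."""
    while True:
        answer = input(f"{bib_key}: '{hint}' -- reproduction package? {OPTIONS}: ").strip().lower()
        if answer == "y":
            return {"bib_key": bib_key, "has_repro_package": True, "link": extract_link(hint)}
        if answer == "n":
            return {"bib_key": bib_key, "has_repro_package": False}
        if answer == "l":
            return {"bib_key": bib_key, "has_repro_package": True, "link": input("link: ").strip()}
        if answer == "u":
            print(f"(main.py would open the PDF for {bib_key} here)")  # placeholder for the PDF viewer
            continue  # then repeat the prompt
        if answer == "c":
            return {"bib_key": bib_key, "paper_class": input("new class: ").strip()}
        if answer == "i":
            return None  # skip this paper and continue with the next one
        if answer == "q":
            raise KeyboardInterrupt  # the caller stores the intermediate results
```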
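And a similarly condensed sketch of the second pass over the repositories; the attribute names follow the list above, but the exact column names in `results_repro.csv` may differ:

```python
ATTRIBUTES = [
    "code_available", "environment", "documentation",
    "version_update", "docker_image", "experimental_params", "executable",
]
CHOICES = {"y", "n", "p", "u", "na"}


def scan_repository(repo_url: str) -> dict[str, str]:
    """Record y/n/p/u/na for every attribute of one repository."""
    row = {"repository": repo_url}
    for attribute in ATTRIBUTES:
        answer = ""
        while answer not in CHOICES:
            answer = input(f"{repo_url} -- {attribute} [y/n/p/u/na]: ").strip().lower()
        row[attribute] = answer
    return row
```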
Just want to continue the manual scan with some existing results already available? Just run:

`python main.py -s none -l`