lfd/qce26_repro_analysis


Reproducibility Assessment

Execute a query on the arXiv and Semantic Scholar databases, fetch the matching papers, and manually quantify whether they are reproducible.

Python

  • Python: 3.13.0
  • Requirements: see requirements.txt

Scrape

python main.py

downloads all PDFs that correspond to the query configuration, puts them into the paper_pdfs directory, and conducts a text-based search of each PDF for keywords that indicate a reproduction package (see the configuration parameters in main.py).

Alongside, a BibTeX file references.bib is created, based on a journal reference if one is provided and otherwise on the arXiv DOI.

references.csv lists the bib_keys together with the arXiv/Semantic Scholar IDs and some further metadata.

The current content of references{.csv, .bib} was obtained with the following arXiv query:

(cat:quant-ph) AND (((ti:"quantum computing" OR ti:"quantum_computer") AND (ti:"algorithm" OR ti:"software") AND (ti:"NISQ")) OR ((abs:"quantum computing" OR abs:"quantum_computer") AND (abs:"algorithm" OR abs:"software") AND (abs:"NISQ"))) AND submittedDate:[20210101 TO 20260327] ANDNOT (ti:"survey" OR ti:"review" OR abs:"survey" OR abs:"review")
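For reference, such a query can be issued against the public arXiv export API by URL-encoding it into the search_query parameter. The sketch below only constructs the request URL; the parameter names follow the arXiv API documentation, while the function name and pagination defaults are illustrative, not the code in main.py:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_url(search_query: str, start: int = 0, max_results: int = 100) -> str:
    """Build a paginated request URL for the arXiv export API."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"

# e.g. with an abbreviated version of the query above:
url = build_arxiv_url('(cat:quant-ph) AND (ti:"NISQ")', max_results=50)
```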

and for semantic_scholar, the following parameters for the API call were used:

"query": '(("quantum computing" | "quantum_computer") + ("algorithm" | "software") + ("NISQ")) + (- "survey" + -"review"),
"publicationDateOrYear": '2021-01-01:2026-03-27',
"fieldsOfStudy": 'Physics,Mathematics,Computer Science,Engineering',
"publicationTypes": 'JournalArticle,CaseReport,Conference,Study',

When running the text filter for the reproduction keywords (specified in main.py), we get:

218 out of 424 potentially experimental papers have some indication for a reproduction package.

127 out of 249 potentially experimental and peer-reviewed papers have some indication for a reproduction package.

66 out of 127 potentially experimental and peer-reviewed papers that were found on both arXiv and SemanticScholar have some indication for a reproduction package.

42 out of 66 manually scanned papers do actually contain some kind of reproduction package.

31 out of 42 actually contain code, of which 28 have some kind of documentation and 20 have some kind of environment specification.

The result data and metadata are stored in the results_data.csv and results_meta.csv files.
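Summary counts like the ones above can be derived from the meta CSV. A minimal sketch, assuming rows loaded via csv.DictReader with the columns described below; the helper name repro_hint_ratio is illustrative and not part of main.py:

```python
def repro_hint_ratio(rows, **filters):
    """Return (hinted, total) over all rows matching the given column filters.

    `rows` are dicts as produced by csv.DictReader over results_meta.csv,
    with at least the columns 'paper_class', 'has_doi', 'source', and
    'has_repro_hint' (booleans are stored as the strings 'True'/'False').
    """
    matching = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    hinted = [r for r in matching if r["has_repro_hint"] == "True"]
    return len(hinted), len(matching)

# e.g. the "127 out of 249" figure would correspond to:
# hinted, total = repro_hint_ratio(rows, paper_class="experimental", has_doi="True")
```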

Arguments

(Output of python main.py -h)

usage: ArXiv & SemanticScholar Scraper [-h] [-rn REF_NAME] [-dd DOWNLOAD_DIR] [-r RESULTS_DIR] [-mr MAX_RESULTS] [-s {arxiv,semantic_scholar,both,none}] [-nd] [-c] [-l] [-rd RESULT_DATA_FILE] [-res RESULT_META_FILE] [-os] [-rs RESULT_MANUAL_FILE]
                                       [-or] [-rr RESULT_REPRO_FILE]

Fetches bib entries and PDFs from arXiv and SemanticScholar.

options:
  -h, --help            show this help message and exit
  -rn, --ref_name REF_NAME
                        Bib and corresponding CSV file name (without file ending) defaults to 'references', which ends up as 'references.bib' and 'references.csv'.
  -dd, --download_dir DOWNLOAD_DIR
                        Directory to which to store PDFs, defaults to 'paper_pdfs'.
  -r, --results_dir RESULTS_DIR
                        Directory to which to store result CSV/BIB, defaults to 'results'.
  -mr, --max_results MAX_RESULTS
                        Maximum results per query. Default is unlimited (only applies to ArXiv)
  -s, --source {arxiv,semantic_scholar,both,none}
                        Source from which to download.
  -nd, --no_download    If provided, no PDFs are downloaded.
  -c, --cleanup_pdfs    If provided, PDFs are deleted that are not in bib data.
  -l, --load_existing_results
                        Whether to load result data from existing result files.
  -rd, --result_data_file RESULT_DATA_FILE
                        File to which result data get stored. Defaults to results_data.csv.
  -res, --result_meta_file RESULT_META_FILE
                        File to which meta and summary data of the results get stored. Defaults to results_meta.csv.
  -os, --overwrite_manual_results
                        Whether to overwrite already existing manually scan results for reproduction packages
  -rs, --result_manual_file RESULT_MANUAL_FILE
                        File to which results for the manual scan are stored. Defaults to results_manual.csv.
  -or, --overwrite_repro_results
                        Whether to overwrite already existing repository scan results for reproduction packages
  -rr, --result_repro_file RESULT_REPRO_FILE
                        File to which results for the specific repository repro scan are stored. Defaults to results_repro.csv.

What happens when calling main.py?

  1. The query above is executed, either on arxiv, on semantic_scholar, or on both (see the -s argument above). This step can also be skipped (pass none to -s) when the query has already been executed, the PDFs have been downloaded, and the bib file (references.bib) and bib meta file (references.csv) have been created.
  2. Duplicates in the bib data are removed based on matching title and authors. Entries with a non-arXiv DOI are prioritised here. With the -c flag, duplicated PDFs are deleted as well.
  3. After all redundant bib entries have been removed, a text-based search is conducted on the PDFs. Papers are classified as "experimental" if certain keywords occur in the text (see EXPERIMENTAL_PAPER_KEYWORDS in main.py), as "review" if certain keywords occur in the title or in the first 1500 characters of the text (see REVIEW_PAPER_KEYWORDS in main.py), or as "unknown" if none match.
  4. Additionally, the text-based search looks for keywords that indicate a reproduction package, such as certain platforms (see PLATFORMS in main.py), or keywords (see CODE_KEYWORDS in main.py).
  5. Finally, two CSV files are created:
    1. results_meta.csv with the columns:
      • bib_key: key of the bib entry in references.bib.
      • id: arXiv or Semantic Scholar ID.
      • source: "arxiv", "semantic_scholar", or "both" (If the paper was found on both platforms).
      • downloaded_pdf: whether a PDF was downloadable (some publications are not open access). Currently this is always True, as closed-access papers are not stored.
      • has_doi: If the paper has a non-arxiv DOI, which indicates a peer-reviewed paper.
      • paper_class: "experimental", "review", or "unknown" (see point 3).
      • has_repro_hint: If there was a hint for a reproduction package found (see previous point).
    2. results_data.csv with the columns:
      • bib_key, id, source, has_doi: same as in results_meta.csv. Note, however, that results_data.csv can contain multiple rows per paper, as it lists each individual hit matching the reproduction keyword criteria.
      • repro_hint_source: "platform" or "keyword", depending on the matching criterion (False if none was found).
      • repro_hint: the exact platform or keyword that matched (False if none was found).
      • repro_hint_line: line of the hint in the PDF file (False if none was found).
      • repro_hint_text: The full line of text that contains the hint (False if none was found).
  6. Based on all papers classified as "experimental", and on the flags has_doi and has_repro_hint, an interactive shell is opened to manually scan these results for actual reproduction packages. There can be many false positives, for instance when a GitHub repository that is not a reproduction package is mentioned in the references, or when the "experimental" class was assigned incorrectly and the paper is in fact conceptual. The scan has the following options and consequences:
    • [y] yes: there is a reproduction package. If the reproduction hint contains a link, it is automatically saved in the results as well.
    • [n] no: there is no reproduction package.
    • [u] unclear: not apparent from the hint alone; the PDF is opened. This can also be useful for looking up the correct link. The prompt is repeated afterwards.
    • [q] quit: Stop the scan. Intermediate results are stored.
    • [l] yes and manually enter [l]ink: there is a reproduction package, but the link in the text hint is only partially present or missing entirely, yet known.
    • [i] ignore paper: continues with the next paper
    • [c] re-[c]lassify paper: if the paper is not "experimental" but something else, for instance "conceptual" or a "thesis".

  Results for the manual scan are stored in results_manual.csv. Previously scanned results are loaded by default. To start the scan from scratch, use the -os flag.
  7. An additional CLI application is then launched to specify whether a given code repository has the following properties. Available options are: [y] yes / [n] no / [p] partially / [u] unknown / [na] [n]ot [a]pplicable.
    • code available
    • an environment
    • documentation
    • a version update
    • a Docker image
    • hardware specification (experimental_params)
    • executable

Want to continue the manual scan with some results already existing? Just run:

python main.py -s none -l

About

Pipeline for quantifying reproducibility artifacts in papers
