biorxiv-retriever is a resilient wrapper around the Biorxiv API. It consists of two main classes: BiorxivDataGenerator and BiorxivRetriever. BiorxivDataGenerator uses resilient HTTP requests to build a dataset of the preprints available in Biorxiv. BiorxivRetriever is an API wrapper that can call any of the services supported by the Biorxiv API.
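To give an idea of what "resilient HTTP requests" can mean in practice, here is a minimal sketch of retrying a request with exponential backoff. It is not the code used inside biorxiv-retriever: the names `details_url`, `http_get_json` and `get_with_retries` are hypothetical, and the URL layout is an assumption based on the public Biorxiv API (`/details/<server>/<start>/<end>/<cursor>`).

```python
import json
import time
import urllib.error
import urllib.request
from datetime import date

API_ROOT = "https://api.biorxiv.org"  # assumed public API host


def details_url(server, start_date, end_date=None, cursor=0):
    """Build a details-service URL (assumed layout, see the note above)."""
    end_date = end_date or date.today().isoformat()
    return f"{API_ROOT}/details/{server}/{start_date}/{end_date}/{cursor}"


def http_get_json(url, timeout=30):
    """Fetch a URL once and decode its JSON body (no retry here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def get_with_retries(url, fetch=http_get_json, attempts=5,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * 2 ** attempt)
```

A call such as `get_with_retries(details_url("biorxiv", "2022-05-01"))` would then tolerate a few transient failures before giving up. Note that this sketch also retries on HTTP errors such as 404, which a real implementation would probably want to treat differently.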
Clone the repository and set up a Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
From the directory root you can get CLI help on how to call the commands using:
# To use BiorxivRetriever
python -m src.cli.search.search --help
# To use BiorxivDataGenerator
python -m src.cli.create_data.create_data --help
Use the details service of the Biorxiv API to find all papers between 1st May 2022 and the current date:
python -m src.cli.search.search details biorxiv \
--start_date=2022-05-01
Same as in the previous example, with data from Medrxiv:
python -m src.cli.search.search details medrxiv \
--start_date=2022-05-01
Search for details of articles by publisher. In this case, the publisher with the DOI prefix 10.15252:
python -m src.cli.search.search publisher biorxiv \
--prefix=10.15252 \
--start_date=2021-05-01
Show the summary of content statistics in Biorxiv:
python -m src.cli.search.search sum biorxiv \
--interval=m
Get all the available metadata in Biorxiv since 4th May 2022 <(-_-)> may the force be with you.
python -m src.cli.create_data.create_data biorxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme
Same as above for Medrxiv:
python -m src.cli.create_data.create_data medrxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme
Retrieve all the metadata available since 4th May 2022, together with the source XML text:
python -m src.cli.create_data.create_data biorxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme \
--xml=True
The functionality of biorxiv-retriever can also be used as regular Python modules when needed. The last command above can be run from a Python script using:
from src.dataset_generator import BiorxivDataGenerator
data = BiorxivDataGenerator(start_date='2022-05-04',
                            email='your.email@company.acme',
                            xml=True)
data()
If you only want to download the metadata now and fetch the source XML files at a later stage, use the BiorxivDataGenerator.dl_source_xml method. It accepts the path to the JSON file with the generated metadata and downloads the corresponding source files.
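To illustrate the idea behind this two-step workflow, the following sketch reads a metadata JSON file and lists the source XML links that still need downloading. It is only an illustration, not the implementation of dl_source_xml: the helper `pending_xml_urls` is hypothetical, and it assumes the file holds a JSON array of records that each carry the XML link in a `jatsxml` field, as the Biorxiv details service does. The actual layout of the generated file may differ.

```python
import json
from pathlib import Path


def pending_xml_urls(metadata_path, url_field="jatsxml"):
    """Return the distinct source XML URLs listed in a metadata JSON file.

    Assumes a JSON array of records; `url_field` is an assumed field name.
    """
    records = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    seen = set()
    urls = []
    for record in records:
        url = record.get(url_field)
        # Skip records without a link and links already collected.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list keeps the order of the metadata file, so a download loop can be interrupted and resumed by skipping the URLs whose files already exist on disk.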