---
output: github_document
---
# scrapeR
## Overview
scrapeR is an R framework inspired by the Python `scrapy` framework. It is designed for batching multiple crawlers and reusing item pipelines between them.
The basic components of a scrapeR project are `queues`, `steps`, `pipelines`, and `runners`.
**Pipelines** organize a sequence of scraping and processing steps that are applied to the original *queue* of data. Each pipeline takes a list as input and runs each of its steps in order. Pipeline steps resemble ordinary functions in functional programming, but add delayed execution, reusability, parallelization, and documentation.
The **steps** that comprise *pipelines* are either `parsers` or `transformers`. **Parsers** are *steps* applied to each item of the *queue* in parallel. **Transformers** are *steps* applied to the entire *queue* at once, which makes them useful for summarizing and consolidating results.
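
Concretely, a pipeline might pair a per-item parser with a whole-queue transformer. The sketch below illustrates that split; `add_transformer()` is a hypothetical step name chosen for illustration (only `spider()`, `add_queue()`, `add_parser()`, and `run()` appear in the Usage section below), so treat it as an assumption about the API rather than confirmed behavior:

```{r, eval = FALSE}
library(scrapeR)

spider("lengths") %>%
  add_queue(c("https://example.com/a", "https://example.com/b")) %>%
  # Parser: runs once per queue item, in parallel.
  add_parser(~ nchar(.x)) %>%
  # Transformer (hypothetical name): runs once over the whole queue,
  # consolidating the per-item results into a single summary value.
  add_transformer(~ sum(unlist(.x))) %>%
  run()
```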
## Installation
```{r, eval = FALSE}
# Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("quartzsoftwarellc/scrapeR")
```
# Usage
```{r}
library(rvest)
library(scrapeR)
start_url <- "https://en.wikipedia.org/wiki/Cat"
spider("cats") %>%
  add_queue(start_url) %>%
  add_parser(~ {
    read_html(.x) %>%
      html_nodes(".mw-headline") %>%
      html_text()
  }) %>%
  run() %>%
  paste(sep = "")
```