
Commit 33454b9

Some updates to default.json, readme explaining the updates, and gitignoring any config file besides default for now
1 parent c89a88c commit 33454b9

3 files changed

Lines changed: 116 additions & 4 deletions


.gitignore

Lines changed: 5 additions & 1 deletion
```diff
@@ -26,4 +26,8 @@ go.work.sum
 .idea
 
 # Environment variables
-.env
+.env
+
+# Ignore all configs except default.json
+configs/*
+!configs/default.json
```

README.md

Lines changed: 86 additions & 0 deletions
`@@ -108,6 +108,92 @@ scrapey-cli/`

---

## 🔧 Configuration Options

Scrapey CLI is configured using a JSON file that defines how websites are crawled and scraped. Below is a detailed breakdown of the available configuration options.

### 🌍 URL Configuration

```json
"url": {
  "base": "https://example.com",
  "routes": [
    "/route1",
    "/route2",
    "*"
  ],
  "includeBase": false
}
```

- **base**: The primary domain to scrape.
- **routes**: List of specific paths to scrape. Supports `*` as a wildcard for full-site crawling (expanded in the sketch below).
- **includeBase**: Whether to include the base URL itself in the scrape.

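To make the expansion concrete, here is a minimal Go sketch, assuming a hypothetical `URLConfig` struct and `buildSeeds` helper (not Scrapey CLI's actual code), of how `base`, `routes`, and `includeBase` could be turned into a list of start URLs:

```go
package main

import (
	"fmt"
	"strings"
)

// URLConfig mirrors the "url" block above; the names here are illustrative.
type URLConfig struct {
	Base        string   `json:"base"`
	Routes      []string `json:"routes"`
	IncludeBase bool     `json:"includeBase"`
}

// buildSeeds joins base and routes into absolute start URLs; a "*" route is
// treated as a crawl-the-whole-site flag rather than a literal path.
func buildSeeds(cfg URLConfig) (seeds []string, crawlAll bool) {
	base := strings.TrimRight(cfg.Base, "/")
	if cfg.IncludeBase {
		seeds = append(seeds, base)
	}
	for _, route := range cfg.Routes {
		if route == "*" {
			crawlAll = true
			continue
		}
		seeds = append(seeds, base+"/"+strings.TrimLeft(route, "/"))
	}
	return seeds, crawlAll
}

func main() {
	cfg := URLConfig{Base: "https://example.com", Routes: []string{"/route1", "/route2", "*"}}
	seeds, crawlAll := buildSeeds(cfg)
	fmt.Println(seeds, crawlAll) // [https://example.com/route1 https://example.com/route2] true
}
```
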
### 🔍 Parsing Rules

```json
"parseRules": {
  "title": "title",
  "metaDescription": "meta[name='description']",
  "articleContent": "article",
  "author": ".author-name",
  "datePublished": "meta[property='article:published_time']"
}
```

Each rule maps an output field to the CSS selector used to extract it:

- **title**: Extracts the page title.
- **metaDescription**: Extracts the meta description.
- **articleContent**: Selects the main article section.
- **author**: Selector for extracting author names.
- **datePublished**: Extracts the publication date from the meta property.

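As a small illustration, assuming an HTML parser such as `github.com/PuerkitoBio/goquery` (the project's actual parsing code may differ), element selectors yield inner text while the `meta[...]` selectors read the `content` attribute:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<html><head><title>Hello</title>
	<meta name="description" content="An example page">
	<meta property="article:published_time" content="2024-01-01"></head>
	<body><article>Body text</article><span class="author-name">Jane Doe</span></body></html>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}

	// Element and class selectors return the matched node's text content.
	fmt.Println(doc.Find("title").Text())        // Hello
	fmt.Println(doc.Find("article").Text())      // Body text
	fmt.Println(doc.Find(".author-name").Text()) // Jane Doe

	// Meta selectors read the "content" attribute instead of inner text.
	fmt.Println(doc.Find("meta[name='description']").AttrOr("content", ""))
	fmt.Println(doc.Find("meta[property='article:published_time']").AttrOr("content", ""))
}
```
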
### 💾 Storage Options

```json
"storage": {
  "outputFormats": ["json", "csv", "xml"],
  "savePath": "output/",
  "fileName": "scraped_data"
}
```

- **outputFormats**: List of formats in which data will be stored.
- **savePath**: Directory where scraped content is saved.
- **fileName**: Base name for output files.

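As a hypothetical example of how these three fields combine, each requested format would get its own file under `savePath`:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// outputPaths is an illustrative helper, not part of Scrapey CLI.
func outputPaths(savePath, fileName string, formats []string) []string {
	paths := make([]string, 0, len(formats))
	for _, format := range formats {
		paths = append(paths, filepath.Join(savePath, fileName+"."+format))
	}
	return paths
}

func main() {
	// With the defaults above: output/scraped_data.json, output/scraped_data.csv, output/scraped_data.xml
	fmt.Println(outputPaths("output/", "scraped_data", []string{"json", "csv", "xml"}))
}
```
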
### ⚡ Scraping Behavior

```json
"scrapingOptions": {
  "maxDepth": 2,
  "rateLimit": 1.5,
  "retryAttempts": 3,
  "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
```

- **maxDepth**: Defines how deep the scraper should follow links.
- **rateLimit**: Time delay (in seconds) between requests to avoid rate-limiting.
- **retryAttempts**: Number of retries for failed requests.
- **userAgent**: Custom user-agent string to mimic a browser.

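A rough standard-library sketch, with made-up helper names, of how `rateLimit`, `retryAttempts`, and `userAgent` might shape each request; the real crawler's behaviour may differ:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry sends a GET with the configured User-Agent, retries up to
// retryAttempts times on failure, and waits rateLimit seconds between attempts.
func fetchWithRetry(url, userAgent string, retryAttempts int, rateLimit float64) (*http.Response, error) {
	delay := time.Duration(rateLimit * float64(time.Second))
	var lastErr error
	for attempt := 0; attempt <= retryAttempts; attempt++ {
		if attempt > 0 {
			time.Sleep(delay) // back off before retrying
		}
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("User-Agent", userAgent)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode >= 500 {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
			continue
		}
		return resp, nil
	}
	return nil, fmt.Errorf("all attempts failed: %w", lastErr)
}

func main() {
	resp, err := fetchWithRetry("https://example.com", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", 3, 1.5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

The same delay could also be applied between requests to different pages to keep the crawl under the configured rate.
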
### 🛠 Data Formatting

```json
"dataFormatting": {
  "cleanWhitespace": true,
  "removeHTML": true
}
```

- **cleanWhitespace**: Removes unnecessary whitespace in extracted content.
- **removeHTML**: Strips HTML tags from extracted content for cleaner output.

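A rough illustration, not the project's actual implementation, of what these two flags could do to a piece of extracted text (the tag-stripping regex is deliberately crude):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	tagPattern        = regexp.MustCompile(`<[^>]*>`) // crude tag stripper, for illustration only
	whitespacePattern = regexp.MustCompile(`\s+`)
)

// formatText applies the two dataFormatting flags to extracted content.
func formatText(s string, cleanWhitespace, removeHTML bool) string {
	if removeHTML {
		s = tagPattern.ReplaceAllString(s, " ")
	}
	if cleanWhitespace {
		s = whitespacePattern.ReplaceAllString(s, " ")
		s = strings.TrimSpace(s)
	}
	return s
}

func main() {
	raw := "<p>  Hello,\n\n   <b>world</b>!  </p>"
	fmt.Println(formatText(raw, true, true)) // Hello, world !
}
```
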
This configuration file allows fine-tuning of scraping behavior, data extraction, and storage formats for flexible web scraping.

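For reference, a minimal sketch of reading `configs/default.json` with Go's standard `encoding/json`; the struct shape below is inferred from the JSON keys and is an assumption, not necessarily the project's real types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config mirrors the sections documented above.
type Config struct {
	URL struct {
		Base        string   `json:"base"`
		Routes      []string `json:"routes"`
		IncludeBase bool     `json:"includeBase"`
	} `json:"url"`
	ParseRules map[string]string `json:"parseRules"`
	Storage    struct {
		OutputFormats []string `json:"outputFormats"`
		SavePath      string   `json:"savePath"`
		FileName      string   `json:"fileName"`
	} `json:"storage"`
	ScrapingOptions struct {
		MaxDepth      int     `json:"maxDepth"`
		RateLimit     float64 `json:"rateLimit"`
		RetryAttempts int     `json:"retryAttempts"`
		UserAgent     string  `json:"userAgent"`
	} `json:"scrapingOptions"`
	DataFormatting struct {
		CleanWhitespace bool `json:"cleanWhitespace"`
		RemoveHTML      bool `json:"removeHTML"`
	} `json:"dataFormatting"`
}

func main() {
	data, err := os.ReadFile("configs/default.json")
	if err != nil {
		panic(err)
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("crawling %s at depth %d\n", cfg.URL.Base, cfg.ScrapingOptions.MaxDepth)
}
```
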
---
## 🛠 Usage

- **Basic Execution:**

configs/default.json

Lines changed: 25 additions & 3 deletions
```diff
@@ -1,7 +1,29 @@
 {
-  "url": "https://MyExample.com",
+  "url": {
+    "base": "https://example.com",
+    "routes": ["/route1", "/route2", "*"],
+    "includeBase": false
+  },
   "parseRules": {
-    "title": "My Example App",
-    "metaDescription": "meta[name='Example App Loading Config']"
+    "title": "title",
+    "metaDescription": "meta[name='description']",
+    "articleContent": "article",
+    "author": ".author-name",
+    "datePublished": "meta[property='article:published_time']"
+  },
+  "storage": {
+    "outputFormats": ["json", "csv", "xml"],
+    "savePath": "output/",
+    "fileName": "scraped_data"
+  },
+  "scrapingOptions": {
+    "maxDepth": 2,
+    "rateLimit": 1.5,
+    "retryAttempts": 3,
+    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
+  },
+  "dataFormatting": {
+    "cleanWhitespace": true,
+    "removeHTML": true
   }
 }
```
