ScrapeX Web Scraper

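A web scraping utility built with Rust that drives a Chrome/Chromium browser to extract content from a given web page. The browser window is visible by default and can be switched to headless mode (see Configuration).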

Features

🌐 Cross-platform compatibility - Works on Windows and Linux with platform-specific optimizations
🔍 Automatic browser detection - Finds installed Chrome/Chromium browser instances
📄 Content extraction - Captures main article content from web pages
📸 Screenshot capability - Takes screenshots of pages for verification
💾 Automatic file saving - Stores content and screenshots with timestamps
🛠️ Error handling - Comprehensive error handling for browser operations

Prerequisites

Rust (latest stable version)
Chrome or Chromium browser installed on your system

Installation

Clone this repository:

git clone https://github.com/RGGH/scrapeX
cd scrapeX

Build the project:
```
cargo build --release
```

Usage

Run the application with:

cargo run -- --url https://www.motorsport.com

Wait 10-20 seconds after launch; the scraper then proceeds on its own without further input.
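
As an illustration of how the `--url` flag shown above could be read, here is a minimal sketch that uses only the standard library. The helper name `parse_url_arg` is hypothetical, and the project may well use an argument-parsing crate instead; this is not taken from its source.

```rust
use std::env;

/// Returns the value that follows `--url`, if any.
/// Hypothetical helper for illustration; not the project's actual parser.
fn parse_url_arg<I: IntoIterator<Item = String>>(args: I) -> Option<String> {
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        if arg == "--url" {
            // The next token, if present, is taken as the URL.
            return iter.next();
        }
    }
    None
}

fn main() {
    // Skip the first argument, which is the program name.
    match parse_url_arg(env::args().skip(1)) {
        Some(url) => println!("Scraping {url}"),
        None => eprintln!("usage: cargo run -- --url <URL>"),
    }
}
```

The sketch returns `None` both when `--url` is missing and when it is the last token with no value after it, so the caller can print a usage message in either case.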

The application will:

Automatically detect your Chrome/Chromium browser
Launch a browser instance with appropriate settings for your OS
Navigate to the chosen website
Extract the page title and main content
Save the content to a text file in the output directory
Take a screenshot and save it to the output directory

Output

All outputs are saved in the output directory with timestamp-based filenames:

Content: output/_content_[timestamp].txt
Screenshots: output/_screenshot_[timestamp].png
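
The naming scheme above can be sketched as follows. This assumes the timestamp is a Unix time in seconds, which the README does not specify; the real format may differ, and `output_paths` is a hypothetical helper name.

```rust
use std::path::PathBuf;
use std::time::{SystemTime, UNIX_EPOCH};

/// Builds the content and screenshot paths for a given timestamp.
/// Assumption: the timestamp is Unix seconds; the project's format may differ.
fn output_paths(timestamp: u64) -> (PathBuf, PathBuf) {
    let dir = PathBuf::from("output");
    (
        dir.join(format!("_content_{timestamp}.txt")),
        dir.join(format!("_screenshot_{timestamp}.png")),
    )
}

fn main() {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before 1970")
        .as_secs();
    let (content, screenshot) = output_paths(ts);
    println!("{}", content.display());
    println!("{}", screenshot.display());
}
```

Using the same timestamp for both files keeps each text capture paired with its screenshot.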

Technical Details

This project utilizes:

chromiumoxide - For browser automation
tokio - For asynchronous runtime
Custom Chrome/Chromium detection logic
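
To give a rough idea of what such detection logic might look like, the sketch below checks a list of common install locations and returns the first one that exists. The candidate paths and the names `candidate_paths` and `find_browser` are assumptions for illustration, not the project's actual list or API.

```rust
use std::path::{Path, PathBuf};

/// Common default install locations (assumed; the project's list may differ).
fn candidate_paths() -> Vec<PathBuf> {
    if cfg!(target_os = "windows") {
        vec![
            PathBuf::from(r"C:\Program Files\Google\Chrome\Application\chrome.exe"),
            PathBuf::from(r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"),
        ]
    } else {
        vec![
            PathBuf::from("/usr/bin/google-chrome"),
            PathBuf::from("/usr/bin/chromium"),
            PathBuf::from("/usr/bin/chromium-browser"),
        ]
    }
}

/// Returns the first candidate for which `exists` reports true.
/// The existence check is injected so the search can be tested without a browser.
fn find_browser<F: Fn(&Path) -> bool>(candidates: &[PathBuf], exists: F) -> Option<PathBuf> {
    candidates.iter().find(|p| exists(p.as_path())).cloned()
}

fn main() {
    match find_browser(&candidate_paths(), |p| p.is_file()) {
        Some(path) => println!("Found browser at {}", path.display()),
        None => eprintln!("No Chrome/Chromium found in the standard locations"),
    }
}
```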

Configuration

The browser window is visible by default for debugging purposes. To make it headless, modify the BrowserConfig::builder() line in main.rs by removing the .with_head() call.

Troubleshooting

Browser not found: Ensure Chrome or Chromium is installed and accessible in the standard installation locations
Permission errors on Linux: The application passes the --no-sandbox flag to the browser on Linux to avoid permission issues in certain environments
Slow performance: Increase the navigation timeout in the code if needed
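
The Linux-specific flag mentioned above could be selected along these lines. This is a sketch only: `launch_args` is a hypothetical helper, and how the project actually passes flags to the browser is not shown in this README.

```rust
/// Returns extra browser launch flags for the current platform.
/// `is_linux` is a parameter so the logic can be exercised on any OS.
/// Hypothetical helper; not the project's actual code.
fn launch_args(is_linux: bool) -> Vec<&'static str> {
    let mut args = Vec::new();
    if is_linux {
        // Disables Chrome's sandbox, which otherwise fails to start in some
        // containers and restricted environments.
        args.push("--no-sandbox");
    }
    args
}

fn main() {
    let args = launch_args(cfg!(target_os = "linux"));
    println!("extra browser flags: {args:?}");
}
```

Running without the sandbox reduces the browser's isolation from the host, so the flag is best limited to the environments that need it.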

License

MIT License
