A robust web scraping utility built with Rust that uses headless Chrome/Chromium to extract content from any website.
- 🌐 Cross-platform compatibility - Works on Windows and Linux with platform-specific optimizations
- 🔍 Automatic browser detection - Finds installed Chrome/Chromium browser instances
- 📄 Content extraction - Captures main article content from web pages
- 📸 Screenshot capability - Takes screenshots of pages for verification
- 💾 Automatic file saving - Stores content and screenshots with timestamps
- 🛠️ Error handling - Comprehensive error handling for browser operations
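The browser-detection feature can be illustrated with a minimal sketch. The candidate paths below are common install locations and are assumptions for illustration, not necessarily the exact list this project checks.

```rust
use std::path::PathBuf;

/// Return the first existing browser executable from a list of
/// platform-specific candidate paths (illustrative defaults only).
fn find_browser() -> Option<PathBuf> {
    let candidates: &[&str] = if cfg!(target_os = "windows") {
        &[
            r"C:\Program Files\Google\Chrome\Application\chrome.exe",
            r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
        ]
    } else {
        &[
            "/usr/bin/google-chrome",
            "/usr/bin/google-chrome-stable",
            "/usr/bin/chromium",
            "/usr/bin/chromium-browser",
        ]
    };
    candidates
        .iter()
        .map(|p| PathBuf::from(p))
        .find(|p| p.exists())
}

fn main() {
    match find_browser() {
        Some(path) => println!("found browser at {}", path.display()),
        None => println!("no Chrome/Chromium found in the candidate paths"),
    }
}
```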
- Rust (latest stable version)
- Chrome or Chromium browser installed on your system
- Clone this repository:

  ```shell
  git clone https://github.com/RGGH/scrapeX
  cd scrapeX
  ```

- Build the project:

  ```shell
  cargo build --release
  ```

Run the application with:

```shell
cargo run -- --url https://www.motorsport.com
```

Wait 10-20 seconds; the application will proceed on its own.
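Passing the URL via `--url` can be sketched with plain `std::env` argument handling. This is a hypothetical illustration; the project may parse arguments differently, and the default URL below is just the one from the example command.

```rust
use std::env;

/// Return the value following `--url` in the argument list, if any.
fn parse_url(args: &[String]) -> Option<String> {
    args.windows(2)
        .find(|pair| pair[0] == "--url")
        .map(|pair| pair[1].clone())
}

fn main() {
    let args: Vec<String> = env::args().collect();
    // Fall back to the example URL when no --url flag is given.
    let url = parse_url(&args)
        .unwrap_or_else(|| "https://www.motorsport.com".to_string());
    println!("scraping {url}");
}
```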
The application will:
- Automatically detect your Chrome/Chromium browser
- Launch a browser instance with appropriate settings for your OS
- Navigate to the chosen website
- Extract the page title and main content
- Save the content to a text file in the `output` directory
- Take a screenshot and save it to the `output` directory
All outputs are saved in the output directory with timestamp-based filenames:
- Content: `output/_content_[timestamp].txt`
- Screenshots: `output/_screenshot_[timestamp].png`
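The timestamped naming scheme can be sketched as follows. Unix seconds from the standard library stand in for the timestamp here; the project's actual timestamp format may differ.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Build the pair of timestamped output paths described above.
fn output_paths() -> (String, String) {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_secs();
    (
        format!("output/_content_{ts}.txt"),
        format!("output/_screenshot_{ts}.png"),
    )
}

fn main() {
    let (content, screenshot) = output_paths();
    println!("{content}\n{screenshot}");
}
```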
This project utilizes:

- `chromiumoxide` - for browser automation
- `tokio` - for the asynchronous runtime
- Custom Chrome/Chromium detection logic
The browser window is visible by default for debugging purposes. To make it headless, modify the `BrowserConfig::builder()` line in `main.rs` by removing the `.with_head()` call.
- Browser not found: Ensure Chrome or Chromium is installed and accessible in the standard installation locations
- Permission errors on Linux: The application uses the `--no-sandbox` flag on Linux to avoid permission issues in certain environments
- Slow performance: Increase the navigation timeout in the code if needed
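When diagnosing a "browser not found" error on Linux, a quick shell check can confirm whether a browser binary is on `PATH`. The binary names below are common defaults, not an exhaustive list of what the application probes.

```shell
# Print the first Chrome/Chromium binary found on PATH, or a notice if none.
found=""
for bin in google-chrome google-chrome-stable chromium chromium-browser; do
    if command -v "$bin" >/dev/null 2>&1; then
        found="$bin"
        break
    fi
done
echo "${found:-no chrome/chromium found on PATH}"
```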