"scraper.py" is a Python script for web scraping and text processing. It extracts text from a list of URLs, cleans the text, and writes the cleaned text to a file.
To use this script, follow the instructions below:
Make sure you have the following libraries installed:
- pandas
- spacy
- goose3
- textblob
You can install these libraries using pip:
pip install pandas spacy goose3 textblob
- Clone the repository or download the "scraper.py" file to your local machine.
- Create a file named "URL.txt" and add the list of URLs you want to scrape, each URL on a separate line.
- Open a terminal or command prompt and navigate to the directory where the "scraper.py" file is located.
- Run the script using the following command:
python scraper.py
- The script will extract the text from each URL, clean it, and print the cleaned text to the console.
- The cleaned text will also be saved in a file named "Output.txt" in the same directory.
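The file handling in the steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in "scraper.py": the helper names `read_urls`, `clean_text`, and `write_output` are hypothetical, the whitespace cleanup is only a stand-in for whatever cleaning the script really performs, and the extraction step itself is left out.

```python
import re
from pathlib import Path


def read_urls(path="URL.txt"):
    """Return the non-empty lines of the URL file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def clean_text(text):
    """Collapse runs of whitespace (placeholder for the script's real cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def write_output(texts, path="Output.txt"):
    """Write one cleaned text per line to the output file."""
    Path(path).write_text("".join(t + "\n" for t in texts), encoding="utf-8")
```

With a "URL.txt" that holds one URL per line, `read_urls()` returns the list to iterate over. Each extracted text would then pass through `clean_text` before `write_output` saves the results.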
- Make sure you have a stable internet connection to access the URLs.
- The script uses the "en_core_web_sm" model from spaCy for text processing. If you don't have it downloaded, the script will download it automatically.
- Feel free to modify the code to suit your specific requirements or add more functionalities.
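The automatic download of "en_core_web_sm" noted above is commonly done by catching the error spaCy raises when a model is missing. The sketch below shows that pattern. It is an assumption about how "scraper.py" behaves, not a copy of its code, and `load_model` is a hypothetical helper name.

```python
def load_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it first if it is not installed."""
    # Imported here so spaCy is only required when a model is actually loaded.
    import spacy
    from spacy.cli import download

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        download(name)
        return spacy.load(name)
```

Calling `nlp = load_model()` once at startup returns a ready pipeline. In some environments the interpreter may need a restart before a freshly downloaded model can be loaded.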
This project is licensed under the MIT License.