This project provides an example of how to run a Scrapy-based spider on multiple Scrapy-Splash instances using Nginx as a load balancer. The spider and its services are managed by Docker and Docker Compose.
The spider scrapes data from the well-known Quotes to Scrape website, extracting some basic information about quotes.
There are 3 Splash instances defined in the docker-compose.yml file. However, you can easily scale up or down the number of instances by modifying the docker-compose.yml and nginx.conf files accordingly.
Note: a more flexible way to scale the number of Splash instances up or down is Docker Swarm's deploy feature, since it internally provides load balancing between the replicas. For simplicity, however, this implementation uses manual scaling, with Nginx as the load balancer for the specified Splash instances.
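To illustrate what "modifying nginx.conf accordingly" can look like, here is a minimal sketch that renders an Nginx upstream block for a given number of Splash instances. The service names (splash1, splash2, ...), the upstream name, and the helper function itself are hypothetical and may not match this repository's files; port 8050 is Splash's default HTTP port.

```python
def render_upstream(n_instances: int, port: int = 8050, upstream_name: str = "splash") -> str:
    """Render an Nginx upstream block listing one server per Splash instance.

    Assumption: the Compose services are named <upstream_name>1..N. Adjust
    the names to whatever docker-compose.yml actually defines.
    """
    if n_instances < 1:
        raise ValueError("at least one Splash instance is required")
    # One "server host:port;" line per instance. Without further directives,
    # Nginx balances requests across these servers round-robin.
    servers = "\n".join(
        f"    server {upstream_name}{i}:{port};" for i in range(1, n_instances + 1)
    )
    return f"upstream {upstream_name} {{\n{servers}\n}}\n"


if __name__ == "__main__":
    # Print the block for the three instances described above.
    print(render_upstream(3))
```

The generated block would go inside the http context of nginx.conf, with a matching service entry added to or removed from docker-compose.yml for each server line.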
- Docker
- Docker Compose plugin
- Python 3.9+
Once the environment is set up, you can run the spider and its services using Docker Compose:
    docker compose up --abort-on-container-exit
As a result, the spider will start scraping data from the target website, distributing requests across the available Splash instances using Nginx as a load balancer.
Note: the --abort-on-container-exit flag stops all services when the spider finishes its job.
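From the spider's point of view, the load balancer is just a single Splash endpoint. The following settings.py fragment is a rough sketch of how that wiring typically looks with scrapy-splash. The hostname nginx and port 8050 are assumptions about how the Nginx service is named and exposed in docker-compose.yml, and the middleware entries follow the scrapy-splash documentation rather than this project's actual settings.

```python
# Hypothetical settings.py fragment; not necessarily identical to this project's file.

# Point Scrapy at the Nginx load balancer instead of a single Splash container.
# "nginx" is an assumed Compose service name; 8050 is an assumed listen port.
SPLASH_URL = "http://nginx:8050"

# Downloader middlewares and priorities as recommended by the scrapy-splash docs.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
```

Because every request goes to the same URL, adding or removing Splash instances should not require any change on the Scrapy side, only in the Nginx and Compose configuration.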
This project is licensed under the MIT License - see the LICENSE file for details.
