serhiiur/Scrapy-Splash-With-Nginx-Load-Balancer

Introduction

This project provides an example of how to run a Scrapy-based spider on multiple Scrapy-Splash instances using Nginx as a load balancer. The spider and its services are managed by Docker and Docker Compose.

Prerequisites

The spider scrapes data from the well-known Quotes to Scrape website, extracting some basic information about quotes.

How it works

Architecture Diagram

There are 3 Splash instances defined in the docker-compose.yml file. However, you can easily scale up or down the number of instances by modifying the docker-compose.yml and nginx.conf files accordingly.
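As an illustration of what scaling involves on the Nginx side, here is a minimal sketch that renders an upstream block for a given number of Splash instances. The service names (splash1, splash2, ...), the upstream name splash_backend, and the helper render_upstream are assumptions made for this example and are not taken from the repository's nginx.conf; 8050 is Splash's default HTTP port.

```python
def render_upstream(n_instances, service_prefix="splash", port=8050,
                    upstream_name="splash_backend"):
    """Render an Nginx upstream block listing n_instances Splash servers.

    All names here are hypothetical; adjust them to match the service
    names actually declared in docker-compose.yml.
    """
    if n_instances < 1:
        raise ValueError("at least one Splash instance is required")
    # One "server host:port;" line per instance. With no balancing
    # directive in the block, Nginx defaults to round-robin.
    servers = "\n".join(
        f"    server {service_prefix}{i}:{port};"
        for i in range(1, n_instances + 1)
    )
    return f"upstream {upstream_name} {{\n{servers}\n}}\n"


if __name__ == "__main__":
    print(render_upstream(3))
```

Each server line must correspond to a Splash service of the same name in docker-compose.yml, which is why both files have to be changed together when scaling.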

Note: a more flexible way to scale the number of Splash instances up or down is Docker Swarm's deploy feature, since it internally provides a load-balancing mechanism between the replicas. However, for simplicity, this implementation uses manual scaling and Nginx as the load balancer for the specified Splash instances.

System Requirements

  • Docker
  • Docker Compose plugin
  • Python 3.9+

Usage

Once the environment is set up, you can run the spider and its services using Docker Compose:

  docker compose up --abort-on-container-exit 

As a result, the spider will start scraping data from the target website, distributing requests across the available Splash instances using Nginx as a load balancer.
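On the Scrapy side, the spider only needs to know the address of the load balancer, not of the individual Splash instances. A minimal settings sketch follows, assuming the Nginx service is reachable at the hostname nginx on port 8050 (both are assumptions; check docker-compose.yml and nginx.conf for the actual values). The middleware entries follow the configuration documented by the scrapy-splash package.

```python
# settings.py (sketch): point scrapy-splash at the Nginx load balancer.

# Assumption: Nginx is exposed to the spider container as "nginx:8050".
SPLASH_URL = "http://nginx:8050"

# Middlewares and priorities as documented by scrapy-splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes the duplicate filter take Splash request arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

Because every Splash request goes to SPLASH_URL, changing the number of Splash instances behind Nginx requires no change to the spider's settings.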

Note: the --abort-on-container-exit flag stops all containers as soon as any one of them exits, so the Splash and Nginx services shut down when the spider finishes its job.

License

This project is licensed under the MIT License - see the LICENSE file for details.
