A Spring Boot-based web scraper that extracts job listings from various job portals and stores them in a MySQL database. This is a sub-project of a larger Job Portal application, designed to continuously scrape and populate job data that the main job portal can fetch and display to users.
The Job Portal Scraper is a Spring Boot application that automates the collection of job postings from job portals (currently focusing on Infopark). It scrapes job data, processes it, and stores it in a MySQL database where the main Job Portal application can access it for displaying job listings to end users.
- Automated Job Scraping: Automatically scrapes job listings from the Infopark job portal
- Database Storage: Stores scraped job data in a MySQL database using Spring Data JPA
- Spring Boot Integration: Built with Spring Boot for easy deployment and configuration
- Job Portal Integration: Designed as a backend service for the main Job Portal application
- Structured Data Model: Clean job data model with JPA entities for database persistence
- Automatic Execution: Runs as a CommandLineRunner each time the application starts (periodic scheduling is a planned enhancement)
- Spring Boot Application: Fully configured Spring Boot 4.0 application
- Infopark Scraper: Active scraper for the Infopark job portal (https://infopark.in/companies/job-search)
- Database Integration: MySQL database connection with Spring Data JPA
- Job Entity Model: Complete job data model with fields:
  - Job title, company name, requirements, experience, location
  - Posted date, deadline, apply link
- JPA Repository: JobRepo interface for database operations
- Auto-execution: Scraper runs automatically on application startup using CommandLineRunner
This scraper serves as a data collection service for the main Job Portal application:
- Scraper runs and collects job data from supported portals
- Data is stored in the shared MySQL database (jobportaldatabase)
- Main Job Portal application fetches this data to display to users
- Both applications connect to the same database for seamless data sharing
- Language: Java 21
- Framework: Spring Boot 4.0.0
- Database: MySQL 8.0+
- ORM: Spring Data JPA with Hibernate
- HTML Parsing: Jsoup 1.21.2
- Build Tool: Maven
- Utilities: Lombok for boilerplate code reduction
Scrapper-java/
├── src/
│   ├── main/
│   │   ├── java/com/scraper/jobportal/jobportal_scraper/
│   │   │   ├── JobportalScraperApplication.java   # Main Spring Boot application
│   │   │   ├── model/
│   │   │   │   └── Jobs.java                      # JPA entity for job data
│   │   │   ├── repository/
│   │   │   │   └── JobRepo.java                   # JPA repository interface
│   │   │   └── scraper/
│   │   │       └── InfoparkScraper.java           # Infopark job portal scraper
│   │   └── resources/
│   │       └── application.properties             # Database and app configuration
│   └── test/
│       └── java/
│           └── JobportalScraperApplicationTests.java
├── pom.xml                                        # Maven dependencies and build config
└── README.md
- Java 21 or higher
- Maven 3.6+
- MySQL 8.0+
- Git
- Clone the repository
    git clone https://github.com/Akhil-vk18/Scrapper-java.git
    cd Scrapper-java
- Set up MySQL Database
    CREATE DATABASE jobportal;
- Configure Database Connection
  Update src/main/resources/application.properties with your MySQL credentials:
    spring.application.name=jobportal-scraper
    spring.datasource.url=jdbc:mysql://127.0.0.1:3306/jobportal
    spring.datasource.username=root
    spring.datasource.password=your_password
    spring.jpa.hibernate.ddl-auto=update
    spring.jpa.show-sql=true
    spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.MySQLDialect
- Build the project
    mvn clean install
- Run the application
    mvn spring-boot:run
The scraper will automatically run on startup and populate the database with job data.
The Jobs JPA entity stores scraped job information with the following fields:
- id (Integer): Auto-generated primary key
- title (String): Job title
- companyname (String): Company name
- requirements (String): Job requirements
- experience (String): Required experience
- location (String): Job location
- postedDate (LocalDate): Date when job was posted
- deadline (LocalDate): Application deadline
- applylink (String): URL to apply for the job
This data is stored in the jobs table in the MySQL database and can be accessed by the main Job Portal application.
- Application Starts: The Spring Boot application starts and initializes the InfoparkScraper component
- Scraper Executes: The scraper connects to Infopark's job search page using Jsoup
- Data Extraction: Job details are extracted from the HTML table structure
- Database Storage: Each job is saved to the MySQL database via Spring Data JPA
- Job Portal Access: The main Job Portal application queries this database to display jobs to users
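The extraction and storage steps above can be illustrated in plain Java, without Spring or Jsoup. This is only a minimal sketch under stated assumptions, not the project's InfoparkScraper: it assumes each table row has already been reduced to eight cell strings in the field order of the Jobs entity, it assumes dates arrive as `dd-MM-yyyy` (the portal's real format may differ), and it uses an in-memory list where the real application would call JobRepo. The names `ScrapedJob`, `RowMapper`, and `fromCells` are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java stand-in for the Jobs entity (no JPA annotations, no id).
record ScrapedJob(String title, String companyname, String requirements,
                  String experience, String location,
                  LocalDate postedDate, LocalDate deadline, String applylink) {}

final class RowMapper {
    // Assumed date format; adjust to whatever the scraped page actually uses.
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("dd-MM-uuuu");

    private RowMapper() {}

    // Maps one table row, given as its cell texts, to a job object.
    static ScrapedJob fromCells(List<String> cells) {
        if (cells.size() != 8) {
            throw new IllegalArgumentException("Expected 8 cells but got " + cells.size());
        }
        return new ScrapedJob(
                cells.get(0).trim(),
                cells.get(1).trim(),
                cells.get(2).trim(),
                cells.get(3).trim(),
                cells.get(4).trim(),
                LocalDate.parse(cells.get(5).trim(), DATE),
                LocalDate.parse(cells.get(6).trim(), DATE),
                cells.get(7).trim());
    }
}

class RowMappingDemo {
    public static void main(String[] args) {
        // The list stands in for the repository save step.
        List<ScrapedJob> store = new ArrayList<>();
        store.add(RowMapper.fromCells(List.of(
                "Java Developer", "Acme", "Spring Boot", "2 years", "Kochi",
                "01-12-2025", "31-12-2025", "https://example.com/apply")));
        System.out.println(store.get(0));
    }
}
```

Keeping the row-to-object mapping in its own small class makes it testable without a network connection or a database, which is useful when the portal's table layout changes.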
Run tests using Maven:
    mvn test
- Add more job portal scrapers (Naukri, LinkedIn, Indeed, etc.)
- Implement scheduling for periodic scraping (Spring @Scheduled)
- Add duplicate job detection
- Implement error notifications
- Add REST API endpoints for manual scraping triggers
- Enhance data validation and sanitization
- Add support for job categories and tags
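As one possible approach to the duplicate-detection item above, a scraper could derive a normalized key for each job and skip any job whose key has already been seen. The sketch below is an assumption, not existing project code: the class `JobDeduplicator` and the choice of key (title, company name, and apply link) are hypothetical, and the in-memory set stands in for what would more likely be a repository lookup or a unique constraint in the database.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class JobDeduplicator {
    private final Set<String> seenKeys = new HashSet<>();

    // Builds a comparison key; title and company ignore case and extra whitespace,
    // the apply link is only trimmed because URL paths can be case-sensitive.
    static String keyOf(String title, String company, String applyLink) {
        String link = applyLink == null ? "" : applyLink.trim();
        return normalize(title) + "|" + normalize(company) + "|" + link;
    }

    private static String normalize(String value) {
        if (value == null) {
            return "";
        }
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Returns true the first time a job is seen and false for later duplicates.
    boolean markIfNew(String title, String company, String applyLink) {
        return seenKeys.add(keyOf(title, company, applyLink));
    }
}

class DeduplicationDemo {
    public static void main(String[] args) {
        JobDeduplicator dedup = new JobDeduplicator();
        System.out.println(dedup.markIfNew("Java Developer", "Acme", "https://example.com/1"));
        System.out.println(dedup.markIfNew(" java  developer", "ACME", "https://example.com/1"));
    }
}
```

Because the scraper runs on every startup, some check of this kind would keep repeated runs from inserting the same listing again.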
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Ensure you have permission to scrape target websites
- Respect the robots.txt file and website Terms of Service
- Implement appropriate delays between requests
- Use proper User-Agent headers
- Consider the website's server load and implement responsible scraping practices
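Two of the practices above, request delays and an identifying User-Agent header, can be sketched with the JDK's own `java.net.http` API. This is an illustration under assumptions rather than the project's code (the project fetches pages with Jsoup): the class `PoliteFetcher`, the two-second delay, and the User-Agent string are all hypothetical, and the sketch only builds the request and computes the wait time without sending anything.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class PoliteFetcher {
    private final String userAgent;
    private final long minDelayMillis;
    private long lastRequestAt = -1; // -1 means no request has been recorded yet

    PoliteFetcher(String userAgent, Duration minDelay) {
        this.userAgent = userAgent;
        this.minDelayMillis = minDelay.toMillis();
    }

    // Builds a GET request that identifies the scraper through its User-Agent header.
    HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Milliseconds still to wait before the next request may be sent.
    long millisToWait(long nowMillis) {
        if (lastRequestAt < 0) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minDelayMillis - elapsed);
    }

    // Call right after sending a request so the next one is spaced out.
    void recordRequest(long nowMillis) {
        lastRequestAt = nowMillis;
    }
}

class PoliteFetcherDemo {
    public static void main(String[] args) {
        PoliteFetcher fetcher = new PoliteFetcher(
                "JobPortalScraper/1.0 (contact: you@example.com)", Duration.ofSeconds(2));
        HttpRequest request = fetcher.buildRequest("https://infopark.in/companies/job-search");
        fetcher.recordRequest(System.currentTimeMillis());
        System.out.println(request.headers().map());
        System.out.println("Wait before next request (ms): "
                + fetcher.millisToWait(System.currentTimeMillis()));
    }
}
```

A caller would sleep for `millisToWait(...)` before each request and call `recordRequest(...)` after it; passing the clock value in as a parameter keeps the delay logic easy to test.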
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or suggestions, please open an issue on the GitHub repository.
Akhil VK
- GitHub: @Akhil-vk18
Last Updated: December 16, 2024
Status: 🟢 Active - Core functionality implemented
This scraper is a sub-project of the larger Job Portal application ecosystem. For the main Job Portal repository, please visit the parent project.