This is an automated bot that scrapes IT job postings from multiple sources, processes them through an advanced filtering and scoring system, and posts the most relevant results to a Telegram channel.
The main goal is to filter out the "noise" from job portals and present only the offers that fit a specific search profile, which is easily configurable (by default, it's set for entry-level roles). This project also serves as a data backend for a separate web application.
- Multi-Source Scraping: Fetches jobs from Get on Board, Educación IT, Empleos IT, and JobSpy (which in turn scrapes LinkedIn, Indeed, and others).
- Firestore Database: Uses Google Firestore to store processed jobs, preventing duplicate notifications and tracking data over time.
- Advanced Filtering (see the first sketch after this list):
  - Automatically discards jobs from non-IT fields (e.g., HR, Marketing, Finance).
  - Excludes roles that do not match the desired seniority level (e.g., Senior, Lead, Manager).
- Scoring System (see the second sketch after this list):
  - Each job is run through an algorithm that assigns a relevance score from 0 to 100.
  - The system analyzes the job title and description for keywords related to technologies, roles, and seniority.
  - Applies bonuses and penalties based on configurable rules.
- Tag Generation: Extracts and assigns the most important keywords to each job (e.g., `react`, `python`, `aws`, `backend`) for easy identification.
- Telegram Notifications: Sends jobs that exceed a minimum score threshold to a designated Telegram channel.
- Web Frontend Integration: Triggers a cache revalidation on a separate web application to keep its data up-to-date.
- Automated Execution: Designed to be run automatically via GitHub Actions on a schedule.
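As a minimal sketch of what the pre-filtering step might look like — the function name and keyword lists below are illustrative assumptions, not the project's actual rules:

```python
import re

# Hypothetical exclusion lists; the real rules are configurable in the project.
EXCLUDED_AREAS = {"hr", "human resources", "marketing", "finance"}
EXCLUDED_SENIORITY = {"senior", "lead", "manager"}

def _has_keyword(text: str, keywords: set[str]) -> bool:
    """Whole-word match to avoid false positives (e.g., 'hr' inside 'through')."""
    return any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)

def passes_prefilter(title: str, description: str) -> bool:
    """Return False if the job clearly falls outside the search profile."""
    text = f"{title} {description}".lower()
    if _has_keyword(text, EXCLUDED_AREAS):
        return False
    if _has_keyword(title.lower(), EXCLUDED_SENIORITY):
        return False
    return True
```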
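Similarly, a sketch of the scoring and tag-generation idea; the weights, bonuses, and penalties here are made-up examples of the kind of configurable rules described above:

```python
# Illustrative rules only; the actual values live in the project's configuration.
KEYWORD_WEIGHTS = {"python": 15, "react": 15, "aws": 10, "backend": 10}
BONUSES = {"junior": 20, "trainee": 20, "entry level": 20}
PENALTIES = {"senior": -40, "lead": -30}

def score_job(title: str, description: str) -> tuple[int, list[str]]:
    """Return a 0-100 relevance score and the tags that matched."""
    text = f"{title} {description}".lower()
    score, tags = 0, []
    for kw, weight in KEYWORD_WEIGHTS.items():
        if kw in text:
            score += weight
            tags.append(kw)  # matched technology keywords double as tags
    for kw, bonus in BONUSES.items():
        if kw in text:
            score += bonus
    for kw, penalty in PENALTIES.items():
        if kw in text:
            score += penalty
    return max(0, min(100, score)), tags
```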
- Scrape: The bot runs and fetches the latest job postings from all enabled sources.
- Deduplication: It checks Firestore to see if the jobs have already been processed (see the first sketch after this list).
- Pre-filtering: Applies a first layer of filters to discard jobs based on area and undesired seniority.
- Scoring & Final Filtering: Assigns a score to the remaining jobs. Only those that surpass the
MIN_SCOREare accepted. - Notification: Accepted jobs are sent to the Telegram channel.
- Storage: All processed jobs (both accepted and rejected) are saved to Firestore for future reference and deduplication.
- Cache Revalidation: A request is sent to the web frontend to revalidate its cache, ensuring the new job data is reflected (see the second sketch after this list).
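As a rough illustration of the deduplication check, using the official `google-cloud-firestore` client (the `jobs` collection name and document-ID scheme are assumptions; the real layout is defined by the project):

```python
from google.cloud import firestore

db = firestore.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

def already_processed(job_id: str) -> bool:
    """Check whether a job document already exists in Firestore.

    The 'jobs' collection and the job_id scheme are hypothetical examples.
    """
    return db.collection("jobs").document(job_id).get().exists
```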
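The cache-revalidation step could be as simple as a signed HTTP request; the `/api/revalidate` path and header name below are purely hypothetical, so check the frontend's actual revalidation API:

```python
import os
import requests

def trigger_revalidation() -> None:
    """Ask the web frontend to refresh its cached job data.

    The endpoint path and header name are illustrative assumptions.
    """
    resp = requests.post(
        f"{os.environ['BASE_URL']}/api/revalidate",
        headers={"x-revalidation-secret": os.environ["REVALIDATION_SECRET"]},
        timeout=30,
    )
    resp.raise_for_status()
```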
- Python 3.11 or higher
- A Google Cloud project with Firestore enabled
- A Telegram Bot Token and Channel ID
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/job-alert-bot.git
  cd job-alert-bot
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The project is configured through environment variables. You can create a `.env` file in the project root to manage them locally (a sample is shown at the end of this section).
- `BOT_TOKEN` (Required): Your Telegram bot token.
- `TELEGRAM_CHANNEL_ID` (Required): The ID of the Telegram channel where notifications will be sent.
- `JOB_SOURCES`: A comma-separated list of sources to use. If not set, all available sources will be used.
  - Available Sources: `getonboard`, `educacionit`, `jobspy`, `empleosit`
  - Example: `JOB_SOURCES=getonboard,jobspy`
- `GOOGLE_APPLICATION_CREDENTIALS`: The absolute path to your Google Cloud service account JSON key file. This is required for local development. When deployed (e.g., on GitHub Actions), you might use a different authentication method (like Workload Identity Federation or a base64-encoded secret).
- `BASE_URL`: The base URL of the web application to trigger cache revalidation.
- `REVALIDATION_SECRET`: The secret token required by the frontend's revalidation endpoint.
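A sample `.env` with placeholder values (all values below are illustrative; substitute your own):

```env
BOT_TOKEN=123456:ABC-your-telegram-bot-token
TELEGRAM_CHANNEL_ID=-1001234567890
JOB_SOURCES=getonboard,jobspy
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json
BASE_URL=https://your-frontend.example.com
REVALIDATION_SECRET=your-secret-token
```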
To run the bot manually:
- Set up your environment variables:
  - Create a `.env` file with the necessary variables (see the Configuration section).
  - Or export them in your shell:

    ```bash
    export BOT_TOKEN="YOUR_TELEGRAM_BOT_TOKEN"
    export TELEGRAM_CHANNEL_ID="YOUR_TELEGRAM_CHANNEL_ID"
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
    export JOB_SOURCES="getonboard,jobspy"
    ```

- Run the main script:

  ```bash
  python main.py
  ```
The bot is also designed to run automatically. The `.github/workflows/scraper.yml` file contains a GitHub Actions workflow that runs the script on a schedule (a minimal sketch is shown below).
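The actual workflow lives in the repository; as a rough sketch, a scheduled run might look like this (the cron expression, secret names, and action versions are assumptions, not the file's real contents):

```yaml
name: Job Scraper

on:
  schedule:
    - cron: "0 */6 * * *"  # every 6 hours (illustrative)
  workflow_dispatch: {}     # allow manual runs from the Actions tab

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py
        env:
          BOT_TOKEN: ${{ secrets.BOT_TOKEN }}
          TELEGRAM_CHANNEL_ID: ${{ secrets.TELEGRAM_CHANNEL_ID }}
```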
- Language: Python 3.11
- Scraping: `requests`, `beautifulsoup4`, `python-jobspy`
- Database: Google Firestore
- Notifications: `python-telegram-bot`
- Orchestration: GitHub Actions
- Data Handling: `pandas`, `dateparser`