JobHunter

Automated job scraping and cold email outreach system. Scrapes job listings from multiple boards, extracts recruiter contact information, generates personalized cold emails, and sends them with your resume attached -- all on autopilot.

Built on python-jobspy for multi-board job scraping.

Features

Multi-board scraping -- Indeed, LinkedIn, Glassdoor, and more via python-jobspy
Two-phase pipeline -- scrape and persist first, then process and email, with per-row status tracking for crash resilience
Personalized emails -- LLM-generated cold emails via OpenRouter with automatic fallback to template-based generation
Resume attachment -- sends your PDF resume with every outreach email
Google Sheets storage -- full audit trail with separate worksheets for scraped jobs, sent emails, and run statistics
Smart deduplication -- exact match and company cooldown prevent duplicate outreach
DNS email validation -- filters non-existent mailboxes using emval (Rust-based DNS/MX lookup) before sending, reducing mailer daemon bounces
Title and email filtering -- skip irrelevant job titles and generic email addresses automatically
Dry run mode -- test the full pipeline without sending any emails
Configurable scheduling -- built-in APScheduler or external cron/GitHub Actions
Weekend skip -- automatically skips execution on weekends
Summary reports -- email yourself a run summary after each execution

How It Works

JobHunter runs a two-phase pipeline:

Phase 1: Scrape + Save
  Search terms x Locations x Boards --> python-jobspy
  --> Batch-write all results to "Scraped Jobs" worksheet (status: Pending)

Phase 2: Process + Email
  For each scraped job:
    --> Extract valid recipient emails
    --> Check deduplication rules
    --> Generate personalized email (LLM or fallback template)
    --> Send with resume attached
    --> Update row status (Yes / Skipped / Failed / DryRun)

Report:
  --> Write run statistics to "Run Stats" worksheet
  --> Email summary report to yourself

Every job gets a status update in the spreadsheet regardless of outcome, giving you a complete audit trail of what was sent, what was skipped, and why.

Quick Start (Zero to First Run)

This section walks you through everything from a fresh clone to your first successful dry run. Follow these steps in order.

Step 1: Clone and Install

You need Python 3.10+ and uv (a fast Python package manager).

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/arinbalyan/JobHunter.git
cd JobHunter

# Install all dependencies
uv sync

After uv sync, you should see a .venv directory in the project root. All commands use uv run to automatically pick up this virtual environment.

Step 2: Create Your Applicant Profile

JobHunter uses an applicant profile to generate personalized cold emails. This is a plain text or markdown file that describes who you are, what you have done, and what you are looking for. The LLM reads this file and uses it as context when writing emails on your behalf.

Create the file at contexts/profile.md:

mkdir -p contexts

Then create contexts/profile.md with your information. Here is an example structure:

# Your Name

## Summary
Recent CS graduate / 3 years of experience in backend engineering / looking
for roles in machine learning, data science, or software engineering.

## Skills
- Languages: Python, JavaScript, SQL, Go
- ML/AI: PyTorch, scikit-learn, pandas, NLP, computer vision
- Web: FastAPI, React, Node.js, PostgreSQL
- Cloud: AWS (EC2, S3, Lambda), Docker, Kubernetes

## Experience
- Software Engineer at CompanyX (2022-2024): Built data pipelines processing
  10M+ records daily. Deployed ML models for recommendation engine.
- Intern at StartupY (Summer 2021): Developed REST API for internal tools.

## Education
- B.Tech in Computer Science, University Name, 2024
- Relevant coursework: Machine Learning, Data Structures, Distributed Systems

## Projects
- Built a job search automation tool using Python and LLMs
- Open-source contributor to [project name]
- Personal portfolio: yoursite.com

## What I'm Looking For
Entry-level to mid-level roles in ML engineering, data science, or full-stack
development. Open to onsite (India) and remote positions.

Write it in first person or third person, either works. The LLM will adapt. The more specific you are about your skills and experience, the better the generated emails will be.

This file is gitignored and never committed. For GitHub Actions deployment, you will base64-encode it as a secret (covered in the Deployment section).

Step 3: Set Up Google Sheets

JobHunter stores everything in Google Sheets -- scraped jobs, sent emails, and run statistics. Here is how to set it up:

Go to the Google Cloud Console
Create a new project (or select an existing one)
In the left sidebar, go to APIs & Services > Library
Search for and enable both:
- Google Sheets API
- Google Drive API
Go to APIs & Services > Credentials
Click Create Credentials > Service Account
Give it a name (e.g., "jobhunter-bot") and click through to finish
Click on the new service account, go to the Keys tab
Click Add Key > Create New Key > JSON and download it

Base64-encode the downloaded JSON file:

# Linux/macOS/WSL
base64 -w 0 your-credentials-file.json
# Copy the entire output (it will be a long string)

Go to Google Sheets and create a new spreadsheet. Name it whatever you want (e.g., "JobHunter")
Share the spreadsheet with the service account email. You can find this email in the JSON file under client_email (it looks like jobhunter-bot@your-project.iam.gserviceaccount.com). Give it Editor access.

JobHunter will automatically create the required worksheets (Scraped Jobs, Sent Emails, Run Stats) on the first run.

Step 4: Configure Environment

Copy the example file and fill in your values:

cp .env.example .env

Open .env in your editor. Here is what to fill in, grouped by priority:

Must set (will not work without these):

Variable	What to Put
`GMAIL_EMAIL`	Your Gmail address
`GMAIL_APP_PASSWORD`	A Gmail App Password (generate here) -- this is NOT your Gmail password
`CONTACT_NAME`	Your full name
`GOOGLE_CREDENTIALS_JSON`	The base64 string from Step 3
`GOOGLE_SHEET_NAME`	The name of the spreadsheet from Step 3

Should set (for better email quality):

Variable	What to Put
`OPENROUTER_API_KEY`	Get a free key at openrouter.ai/keys -- without this, emails use basic templates instead of LLM
`CONTACT_EMAIL`	Your contact email for email signatures
`CONTACT_PHONE`	Your phone number
`CONTACT_PORTFOLIO`	Your portfolio URL
`CONTACT_GITHUB`	Your GitHub username
`RESUME_DRIVE_LINK`	Google Drive link to your resume

Optional (defaults work fine):

Variable	Default	What It Does
`CONTEXT_FILE_PATH`	`contexts/profile.md`	Path to your applicant profile from Step 2
`RESUME_FILE_PATH`	`ArinBalyan.pdf`	Path to your resume PDF (see Step 5)
`REPORT_EMAIL`	Same as `GMAIL_EMAIL`	Where to send run summary reports
`STORAGE_BACKEND`	`sheets`	Use `csv` for local testing without Google Sheets
`DRY_RUN`	`false`	Set to `true` to test without sending emails
`ONSITE_RESULTS_WANTED`	`1000`	Max jobs per search term per board (200-500 recommended)

Customize your job search:

Variable	Example	Description
`ONSITE_SEARCH_TERMS`	`machine learning,data scientist,AI engineer`	Comma-separated search queries
`ONSITE_LOCATIONS`	`Delhi,Mumbai,Bangalore`	Comma-separated locations
`ONSITE_JOB_BOARDS`	`indeed,glassdoor,linkedin`	Which boards to scrape
`REJECT_TITLES`	`senior,lead,manager,director`	Skip jobs with these words in the title

See .env.example for the complete list with descriptions.

Step 5: Add Your Resume

Place your resume PDF in the project root with its real filename:

cp /path/to/YourResume.pdf ArinBalyan.pdf

The file is tracked in git (.gitignore has an explicit !ArinBalyan.pdf exception), so it is committed with the repo and available in GitHub Actions without any decryption step. It gets attached to every outreach email.

If you use a different filename, update RESUME_FILE_PATH in .env and add a matching exception to .gitignore:

# .env
RESUME_FILE_PATH=YourResume.pdf

# .gitignore
!YourResume.pdf

To update your resume in the future, replace the file and push:

cp /path/to/updated-resume.pdf ArinBalyan.pdf
git add ArinBalyan.pdf && git commit -m "chore: update resume" && git push

Step 6: Test With a Dry Run

Before sending real emails, run a dry run to verify everything works:

DRY_RUN=true uv run python -m jobspy_v2 onsite

What to expect:

The scraper will connect to job boards and download listings (this takes a few minutes depending on RESULTS_WANTED)
Results are written to the "Scraped Jobs" worksheet in your Google Sheet
For each job with a valid email, it will generate an email but NOT send it
Rows are updated with status "DryRun" in the spreadsheet
A summary report is emailed to your REPORT_EMAIL
You will see detailed logs in the terminal showing each step

Check your Google Sheet after the run. You should see:

"Scraped Jobs" tab with all jobs and their status
"Sent Emails" tab with the would-have-been-sent emails
"Run Stats" tab with a summary row

If something fails, check the logs for error messages. Common issues:

Error	Cause	Fix
SMTP authentication failed	Wrong app password	Regenerate at myaccount.google.com/apppasswords
Google Sheets API error	Credentials issue	Make sure you shared the sheet with the service account email
No jobs with emails found	Normal for small searches	Increase `RESULTS_WANTED` or add more search terms
LLM request failed	API key issue	Check `OPENROUTER_API_KEY` -- fallback templates will be used automatically

Step 7: Go Live

Once your dry run looks good:

Set DRY_RUN=false in .env (or remove the line -- false is the default)
Run:
```
uv run python -m jobspy_v2 onsite
```
Emails will be sent for real, with a 30-second pause between each one
Monitor the terminal and your Google Sheet for results

Usage

Run Modes

JobHunter supports two modes with separate search configurations:

# Scrape onsite/hybrid jobs (uses ONSITE_* settings in .env)
uv run python -m jobspy_v2 onsite

# Scrape remote jobs (uses REMOTE_* settings in .env)
uv run python -m jobspy_v2 remote

Each mode uses its own search terms, locations, boards, and limits defined in your .env file.

Dry Run

Test the full pipeline without sending any emails:

DRY_RUN=true uv run python -m jobspy_v2 onsite

Or set DRY_RUN=true in your .env file. Dry runs still write to Google Sheets (with status "DryRun") and track deduplication, so you can verify everything works before going live.

Scheduled Runs

Use the built-in scheduler to run automatically:

uv run python -m jobspy_v2 schedule

This starts a long-running process that triggers at your configured times. Keep it running in a terminal, tmux session, or screen session.

Configure the schedule in your .env:

SCHEDULER_ONSITE_CRON=30 2 * * *
SCHEDULER_REMOTE_CRON=0 13 * * *

These are standard cron expressions (minute hour day month weekday). The defaults above correspond to 8:00 AM IST (onsite) and 8:00 AM US Eastern (remote).

Deployment

GitHub Actions (Full Setup)

GitHub Actions lets you run JobHunter automatically without keeping your computer on. The repository includes three workflows:

Workflow	File	What It Does
`JobHunter Tests`	`tests.yml`	Runs the test suite on every push/PR to `main`
`JobHunter Onsite`	`onsite.yml`	Scrapes onsite jobs daily at 8:00 AM IST
`JobHunter Remote`	`remote.yml`	Scrapes remote jobs daily at 8:00 AM US Eastern

Step-by-Step Setup

1. Push the repo to GitHub (if not already done):

git remote add origin https://github.com/YOUR_USERNAME/JobHunter.git
git push -u origin main

2. Generate base64 strings for your private files.

Your resume PDF is committed directly to the repo (see Step 5), so only the applicant profile and Google credentials need to be base64-encoded as secrets.

# Applicant profile
base64 -w 0 contexts/profile.md
# Copy the ENTIRE output

# Google credentials JSON (if not already base64-encoded)
base64 -w 0 your-credentials-file.json
# Copy the ENTIRE output

On macOS, use base64 -i instead of base64 -w 0.

3. Add repository secrets in GitHub.

Go to your repo on GitHub, then Settings > Secrets and variables > Actions > New repository secret. Add each of the following:

Credentials:

Secret Name	Value
`GMAIL_EMAIL`	Your Gmail address
`GMAIL_APP_PASSWORD`	Your Gmail App Password
`OPENROUTER_API_KEY`	Your OpenRouter API key
`GOOGLE_CREDENTIALS_JSON`	Base64 of your service account JSON (from step 2)

Files (base64-encoded):

Secret Name	Value
`CONTEXT_BASE64`	Base64 of your `contexts/profile.md` (from step 2)

Identity:

Secret Name	Value
`CONTACT_NAME`	Your full name
`CONTACT_EMAIL`	Your contact email
`CONTACT_PHONE`	Your phone number
`CONTACT_PORTFOLIO`	Your portfolio URL
`CONTACT_GITHUB`	Your GitHub username
`CONTACT_LINKEDIN`	Your LinkedIn URL
`CONTACT_CODOLIO`	Your Codolio username (or leave empty)
`RESUME_DRIVE_LINK`	Google Drive link to your resume

Storage:

Secret Name	Value
`STORAGE_BACKEND`	`sheets`
`GOOGLE_SHEET_NAME`	Name of your Google Sheets spreadsheet
`REPORT_EMAIL`	Email address for run reports

LLM:

Secret Name	Value
`LLM_BASE_URL`	`https://openrouter.ai/api/v1`
`LLM_MODEL`	Your model name (e.g., `google/gemini-2.0-flash-exp:free`)
`EMAIL_GENERATOR_MODE`	`llm`

Email Settings:

Secret Name	Value
`APPLICATION_SENDER_NAME`	Your name as it appears in the From field
`FALLBACK_EMAIL_SUBJECT`	Fallback subject line if LLM fails
`ONSITE_FALLBACK_EMAIL_BODY`	Fallback email body for onsite workflow if LLM fails
`REMOTE_FALLBACK_EMAIL_BODY`	Fallback email body for remote workflow if LLM fails
`REJECT_TITLES`	Comma-separated title filter patterns
`EMAIL_FILTER_PATTERNS`	Comma-separated email filter patterns
`MIN_EMAIL_WORDS`	`120`
`MAX_EMAIL_WORDS`	`300`
`EMAIL_INTERVAL_SECONDS`	`30`

Onsite Job Search Config:

Secret Name	Example Value
`ONSITE_SEARCH_TERMS`	`machine learning,data scientist`
`ONSITE_LOCATIONS`	`Delhi,Mumbai,Bangalore`
`ONSITE_JOB_TYPE`	`fulltime`
`ONSITE_JOB_BOARDS`	`indeed,glassdoor,linkedin`
`ONSITE_COUNTRY_INDEED`	`India`
`ONSITE_RESULTS_WANTED`	`300`
`ONSITE_MAX_EMAILS_PER_DAY`	`50`

Remote Job Search Config:

Secret Name	Example Value
`REMOTE_SEARCH_TERMS`	`machine learning,AI engineer`
`REMOTE_LOCATIONS`	`USA,UK,Canada`
`REMOTE_IS_REMOTE`	`true`
`REMOTE_JOB_TYPE`	`fulltime`
`REMOTE_JOB_BOARDS`	`indeed,glassdoor,linkedin`
`REMOTE_COUNTRIES_INDEED`	`USA,UK`
`REMOTE_RESULTS_WANTED`	`300`
`REMOTE_MAX_EMAILS_PER_DAY`	`50`

Remote pairing note:

When REMOTE_IS_REMOTE=true, scraping uses location=Remote and iterates countries.

When REMOTE_IS_REMOTE=false, REMOTE_LOCATIONS and REMOTE_COUNTRIES_INDEED are paired 1:1 (zip mapping), not as a Cartesian product.

4. Test it.

Go to Actions in your GitHub repo. Click on JobHunter Onsite (or Remote), then click Run workflow. Check the dry_run box, then click Run workflow. Monitor the logs to make sure everything works.

5. Enable automatic scheduling.

The workflows are already configured with cron schedules. Once your test run passes, the scraper will run automatically every day. You can edit the cron times in .github/workflows/onsite.yml and remote.yml.

How File Secrets Work

The workflows handle private files as follows:

Resume (PDF — committed directly):
- Your resume PDF (ArinBalyan.pdf) is committed to the repo. The .gitignore has an explicit !ArinBalyan.pdf exception so git tracks it.
- The workflow reads it directly from the checkout — no decryption step.
- To update your resume, replace the file locally and push:
```
cp /path/to/new-resume.pdf ArinBalyan.pdf
git add ArinBalyan.pdf && git commit -m "chore: update resume" && git push
```
- If you rename the file, update RESUME_FILE_PATH in both workflow files and add a matching !YourFilename.pdf line to .gitignore.
Applicant Context (Text File):
- Encode your contexts/profile.md to base64:
```
base64 -w0 contexts/profile.md
```
- Store the output as a GitHub Secret: CONTEXT_BASE64.
- The workflow decodes it to disk at runtime.

The context file exists only for the duration of the workflow run.

Local Cron

On Linux/macOS, use crontab for lightweight scheduling:

crontab -e

Add entries like:

0 9 * * 1-5  cd /path/to/JobHunter && uv run python -m jobspy_v2 onsite >> /var/log/jobhunter.log 2>&1
0 10 * * 1-5 cd /path/to/JobHunter && uv run python -m jobspy_v2 remote >> /var/log/jobhunter.log 2>&1

Architecture

Project Structure

JobHunter/
  src/jobspy_v2/
    __main__.py          # CLI entry point
    config/
      settings.py        # Pydantic settings (all from ENV)
      defaults.py        # Footer templates, reject lists
    core/
      scraper.py         # Multi-board job scraping via python-jobspy
      email_gen.py       # LLM + fallback email generation
      email_sender.py    # SMTP email sending with resume attachment
      dedup.py           # Deduplication with domain/company cooldowns
      reporter.py        # Run statistics and summary reports
    storage/
      base.py            # Storage protocol and column schemas
      sheets_backend.py  # Google Sheets implementation
      csv_backend.py     # Local CSV fallback
    workflows/
      base.py            # Two-phase pipeline orchestration
      onsite.py          # Onsite/hybrid mode
      remote.py          # Remote mode
    scheduler.py         # APScheduler-based cron scheduling
  contexts/
    profile.md           # Your applicant profile (gitignored)
  tests/
    unit/                # 200 unit tests
  .github/workflows/     # CI and deployment workflows
  .env.example           # Configuration template
  pyproject.toml         # Project metadata and dependencies

Two-Phase Pipeline

The pipeline is split into two phases for reliability:

Phase 1 -- Scrape and Save: All search term and location combinations are dispatched to python-jobspy. Results are collected into a DataFrame and batch-written to the "Scraped Jobs" worksheet with every row set to email_sent=Pending. This ensures that even if the process crashes during email sending, all scraped data is preserved.

Phase 2 -- Process and Email: Each scraped job is processed sequentially. Valid recipient emails are extracted, deduplication rules are checked, a personalized email is generated, and the email is sent with your resume. After each job, the corresponding row in "Scraped Jobs" is updated with the outcome (Yes, Skipped, Failed, or DryRun) along with the skip reason or recipient address.

A 30-second interval is enforced between sends to avoid rate limiting.

Storage Backends

Backend	When to use	Configuration
Google Sheets	Production use, full audit trail	Set `GOOGLE_CREDENTIALS_JSON` and `GOOGLE_SHEET_NAME`
CSV	Local testing, no Google account	Set `STORAGE_BACKEND=csv`

Both backends implement the same protocol with three data stores:

Scraped Jobs (22 columns) -- every job found during scraping
Sent Emails (12 columns) -- every email successfully sent
Run Stats (15 columns) -- summary statistics per run

Job Filtering

Title Filters

Jobs with titles matching any pattern in REJECT_TITLES are automatically skipped. Default patterns filter out senior, lead, manager, director, and other non-target roles. Customize in your .env:

REJECT_TITLES=senior,lead,manager,director,principal,staff,vp,chief

Email Filters

Generic email addresses (info@, hr@, support@, noreply@, etc.) are filtered out by default. Only emails that look like personal recruiter addresses are kept. Customize with EMAIL_FILTER_PATTERNS in your .env.

Deduplication

Two levels of deduplication prevent repeat outreach:

Rule	Description
Exact email match	Never send to the same email address twice (permanent)
Company cooldown	Never email two different people at the same company on the same day

Configuration Reference

All configuration is done through environment variables. See .env.example for the complete list with descriptions and default values. Key categories:

Category	Variables
Email	`GMAIL_EMAIL`, `GMAIL_APP_PASSWORD`, `REPORT_EMAIL`
Identity	`CONTACT_NAME`, `CONTACT_PHONE`, `CONTACT_PORTFOLIO`, `CONTACT_GITHUB`, `CONTACT_CODOLIO`
Resume	`RESUME_FILE_PATH`, `RESUME_DRIVE_LINK`
Profile	`CONTEXT_FILE_PATH` (default: `contexts/profile.md`)
LLM	`OPENROUTER_API_KEY`, `LLM_MODEL`, `LLM_BASE_URL`
Storage	`STORAGE_BACKEND`, `GOOGLE_CREDENTIALS_JSON`, `GOOGLE_SHEET_NAME`
Onsite Search	`ONSITE_SEARCH_TERMS`, `ONSITE_LOCATIONS`, `ONSITE_JOB_BOARDS`, `ONSITE_RESULTS_WANTED`
Remote Search	`REMOTE_SEARCH_TERMS`, `REMOTE_LOCATIONS`, `REMOTE_JOB_BOARDS`, `REMOTE_RESULTS_WANTED`
Filtering	`REJECT_TITLES`, `EMAIL_FILTER_PATTERNS`
Limits	`ONSITE_MAX_EMAILS_PER_DAY`, `REMOTE_MAX_EMAILS_PER_DAY`, `EMAIL_INTERVAL_SECONDS`, `MIN_EMAIL_WORDS`, `MAX_EMAIL_WORDS`
Schedule	`SCHEDULER_ENABLED`, `SCHEDULER_ONSITE_CRON`, `SCHEDULER_REMOTE_CRON`
Debug	`DRY_RUN`

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=jobspy_v2

# Run a specific test file
uv run pytest tests/unit/test_workflows.py

The test suite contains 200 unit tests covering configuration, storage backends, core logic, and workflow orchestration.

Code Style

The project uses Ruff for linting and formatting:

# Check for issues
uv run ruff check src/ tests/

# Auto-fix issues
uv run ruff check --fix src/ tests/

Line length is set to 88 characters. Target Python version is 3.10.

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for guidelines on how to get started, submit issues, and open pull requests.

Credits

python-jobspy by SpeedyApply -- the multi-board job scraping engine that powers JobHunter's data collection
OpenRouter -- LLM API gateway used for email generation
gspread -- Google Sheets API client for Python

License

This project is licensed under the MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.github		.github
scripts		scripts
src/jobspy_v2		src/jobspy_v2
tests		tests
.env.example		.env.example
.gitignore		.gitignore
ArinBalyan.pdf		ArinBalyan.pdf
ArinBalyan.pdf.gpg		ArinBalyan.pdf.gpg
ArinBalyan_Resume.pdf		ArinBalyan_Resume.pdf
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
blog.md		blog.md
blog2.md		blog2.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

JobHunter

Table of Contents

Features

How It Works

Quick Start (Zero to First Run)

Step 1: Clone and Install

Step 2: Create Your Applicant Profile

Step 3: Set Up Google Sheets

Step 4: Configure Environment

Step 5: Add Your Resume

Step 6: Test With a Dry Run

Step 7: Go Live

Usage

Run Modes

Dry Run

Scheduled Runs

Deployment

GitHub Actions (Full Setup)

Step-by-Step Setup

How File Secrets Work

Local Cron

Architecture

Project Structure

Two-Phase Pipeline

Storage Backends

Job Filtering

Title Filters

Email Filters

Deduplication

Configuration Reference

Development

Running Tests

Code Style

Contributing

Credits

License

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages