GitHub - Catafal/lead-gen-factory: Find your ICP :)


Lead Gen Factory

Free-text ICP → scored B2B leads CSV — fully automated, zero manual research.

A headless CLI pipeline for B2B lead research. Give it an Ideal Customer Profile in plain English, and it searches the internet, crawls relevant pages, extracts structured contact data via LLM, scores ICP fit, deduplicates, and outputs a clean CSV ready for outreach — with optional Google Sheets CRM sync and email enrichment.

Key Features:

🔍 Intelligent Search — LLM-generated queries, Tavily-powered, parallel execution
🧠 LLM Extraction — Structured lead data pulled from raw web pages via OpenRouter
📊 ICP Scoring — Multi-dimensional 0–10 scoring with weighted rubrics and reasoning
👤 Contact Discovery — 3-tier champion search: role-based → HR-targeted → enrichment
📧 Email Enrichment — Apollo.io (v1 name-match or v2 role-search) + Hunter.io fallback
🔁 Smart Dedup — Fuzzy company name matching across runs (85% threshold)
📋 CRM Sync — Google Sheets append with pre-run dedup (skips companies already in sheet)
⚙️ Full Config CLI — lgf config set for any setting; lgf config to inspect all

See CHANGELOG.md for version history and release notes.

Quick Start

Installation

# From GitHub (recommended — always latest)
pipx install git+https://github.com/Catafal/lead-gen-factory.git

# Or from source (for development)
git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
pip install -e .
playwright install chromium

Note: pipx installs lgf globally without touching your system Python. Install it with pip install pipx if you don't have it.

First-Time Setup

lgf init

The wizard collects your API keys and writes them to ~/.lgf/.env. You'll need:

Key	Source	Required
`TAVILY_API_KEY`	tavily.com — free tier: 1000/month	✅
`OPENROUTER_API_KEY`	openrouter.ai	✅
`FIRECRAWL_API_KEY`	firecrawl.dev — crawl fallback	optional
`APOLLO_API_KEY`	apollo.io — email enrichment	optional
`HUNTER_API_KEY`	hunter.io — email enrichment fallback	optional
`GOOGLE_SHEET_ID`	Sheet URL between `/d/` and `/edit`	optional
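
The `GOOGLE_SHEET_ID` value can also be pulled out of a sheet URL programmatically. This is a minimal sketch, assuming the usual `https://docs.google.com/spreadsheets/d/<ID>/edit` URL shape; `extract_sheet_id` is a hypothetical helper and not part of lgf.

```python
import re

# Matches the path segment that follows "/d/" in a Google Sheets URL.
_SHEET_ID_RE = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_sheet_id(url: str) -> str:
    """Return the ID found between /d/ and /edit (hypothetical helper)."""
    match = _SHEET_ID_RE.search(url)
    if match is None:
        raise ValueError(f"no sheet ID found in URL: {url!r}")
    return match.group(1)


if __name__ == "__main__":
    # The ID below is a made-up placeholder, not a real sheet.
    example = "https://docs.google.com/spreadsheets/d/1AbC-dEf_123/edit#gid=0"
    print(extract_sheet_id(example))
```

The result can then be stored with `lgf config set GOOGLE_SHEET_ID <value>`.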

Verify your setup:

lgf doctor

Write Your ICP

Create a markdown file describing your Ideal Customer Profile. See icp_examples/skillia_spain.md for a full example.

Key sections your ICP should include:

Section	Description
Target Roles	Decision maker job titles (e.g. "HR Director", "Head of People")
Company Size	Employee range (e.g. "50 to 500 employees")
Industries	Verticals / sectors to target
Geography	Countries, regions, or cities
Buying Signals	Observable behaviors that signal they need you
Exclude	Terms or company types that disqualify a lead
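
To make the six sections concrete, here is a rough sketch of what a structured ICP could look like once parsed. The project describes its parsed ICP as a Pydantic model; this stand-in uses a plain dataclass, and every field and method name here is an assumption for illustration, not the actual schema in src/schemas/icp.py.

```python
from dataclasses import dataclass, field


@dataclass
class ICPSketch:
    """Illustrative stand-in for a parsed ICP; field names are assumptions."""

    target_roles: list[str] = field(default_factory=list)
    company_size: tuple[int, int] = (0, 0)  # (min, max) employees, inclusive
    industries: list[str] = field(default_factory=list)
    geography: list[str] = field(default_factory=list)
    buying_signals: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)

    def is_excluded(self, text: str) -> bool:
        """True if any exclusion term appears in the text (case-insensitive)."""
        lowered = text.lower()
        return any(term.lower() in lowered for term in self.exclude)

    def size_matches(self, employees: int) -> bool:
        """True if the headcount falls inside the configured range."""
        low, high = self.company_size
        return low <= employees <= high


if __name__ == "__main__":
    icp = ICPSketch(
        target_roles=["HR Director", "Head of People"],
        company_size=(50, 500),
        industries=["SaaS"],
        geography=["Spain"],
        exclude=["agency"],
    )
    print(icp.is_excluded("Digital Marketing Agency"), icp.size_matches(120))
```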

Run the Pipeline

# Full run from an ICP file
lgf research --icp icp_examples/skillia_spain.md

# Inline ICP (no file needed)
lgf research --icp-text "HR Directors at 50-200 person tech companies in Spain"

# Use a saved profile
lgf research --profile skillia-spain

# Dry run — generate queries only, no search or crawl
lgf research --icp icp_examples/skillia_spain.md --dry-run

# Narrow an existing ICP without editing the file
lgf research --profile skillia-spain --focus "SaaS companies only, no agencies"

Output is saved to leads_YYYYMMDD.csv by default (or specify --output path/to/file.csv).
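
For downstream use, the CSV can be read back with the standard library. The sketch below assumes the header names match the fields listed under Output Schema and that `ICP Fit Score` parses as a number; `default_output_name`, `top_leads`, and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import date
from typing import TextIO


def default_output_name(day: date) -> str:
    """Build a leads_YYYYMMDD.csv filename for the given day."""
    return f"leads_{day:%Y%m%d}.csv"


def top_leads(handle: TextIO, min_score: float) -> list[dict[str, str]]:
    """Keep rows at or above min_score, sorted from best to worst score."""
    rows = list(csv.DictReader(handle))
    kept = [row for row in rows if float(row["ICP Fit Score"]) >= min_score]
    return sorted(kept, key=lambda row: float(row["ICP Fit Score"]), reverse=True)


# Fabricated sample data with a subset of the output columns.
SAMPLE_CSV = (
    "Business,First,Last,ICP Fit Score\n"
    "Acme,Ana,Lopez,8.5\n"
    "Globex,Luis,Marti,5.0\n"
    "Initech,Eva,Ruiz,9.0\n"
)

if __name__ == "__main__":
    print(default_output_name(date(2025, 1, 31)))
    for lead in top_leads(io.StringIO(SAMPLE_CSV), min_score=7.0):
        print(lead["Business"], lead["ICP Fit Score"])
```

With a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` wrapper.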

Commands

Setup & Diagnostics

lgf init        # Guided first-time setup wizard — writes ~/.lgf/.env
lgf doctor      # Health check: API keys + live connectivity probes

Research Pipeline

lgf research --icp PATH               # Full pipeline from ICP file
lgf research --icp-text TEXT          # Inline ICP description
lgf research --profile NAME           # Use a saved ICP profile
lgf research --focus TEXT             # Add a narrowing constraint (stacks on ICP)
lgf research --dry-run                # Generate queries only: no search or crawl
lgf research --min-score INT          # Override minimum ICP fit score (default: from ICP)
lgf research --output PATH            # Custom CSV output path
lgf research --verbose                # Show debug logs

ICP Profiles

lgf profile add NAME --icp PATH       # Save an ICP file as a reusable profile
lgf profile list                      # List all saved profiles
lgf profile remove NAME               # Delete a profile

Profiles are stored as markdown files in ~/.lgf/profiles/.

Configuration

lgf config                            # Show all effective settings (API keys masked)
lgf config set KEY VALUE              # Update any setting in ~/.lgf/.env

Commonly changed settings:

lgf config set DEFAULT_LLM_MODEL anthropic/claude-3-5-sonnet
lgf config set FALLBACK_LLM_MODEL google/gemini-2.0-flash-001
lgf config set MAX_SEARCH_RESULTS 15
lgf config set APOLLO_ENRICHER_VERSION v2
lgf config set MULTI_THREAD_SCORE_THRESHOLD 7.0

ICP Validation

lgf validate-icp --icp PATH           # Parse and display ICP without running the pipeline
lgf validate-icp --icp-text TEXT      # Same for inline text

Pipeline Architecture

ICP (file / inline / profile)
    ↓
🧠  ICP Parsing        — free-text → structured Pydantic model
    ↓
🔍  Query Generation   — LLM → 8-12 targeted search queries
    ↓
🌐  Web Search         — Tavily, async parallel
    ↓
🧹  Snippet Pre-filter — LLM heuristic drops irrelevant results
    ↓
📄  Web Crawl          — crawl4ai → Scrapling → Firecrawl (3-tier fallback)
    ↓
🏢  Lead Extraction    — LLM → structured CompanyLead + contacts
    ↓
📊  ICP Scoring        — multi-dimensional weighted rubrics (0-10)
    ↓
🔁  Deduplication      — fuzzy company name match (85% threshold, cross-run)
    ↓
👤  Contact Discovery  — role search → HR search → enrichment (3-tier)
    ↓
📧  Email Enrichment   — Apollo v1/v2 + Hunter.io fallback (optional)
    ↓
📁  CSV Output + CRM   — filtered, sorted, Google Sheets sync (optional)

The pipeline runs three passes in sequence. Passes 2 and 3 process companies in parallel, using asyncio.gather with a semaphore to cap how many run at once.
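
The gather-plus-semaphore pattern can be sketched as follows. This is not the code in pipeline/runner.py; `process_company` and `run_pass` are hypothetical names, and the sleep stands in for real network or LLM work.

```python
import asyncio


async def process_company(name: str, semaphore: asyncio.Semaphore) -> str:
    """Placeholder for per-company work such as contact discovery."""
    async with semaphore:  # at most max_concurrent coroutines enter at once
        await asyncio.sleep(0.01)  # stands in for network / LLM latency
        return f"processed:{name}"


async def run_pass(companies: list[str], max_concurrent: int = 5) -> list[str]:
    """Run one pass over all companies with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [process_company(name, semaphore) for name in companies]
    # gather returns results in the same order as the input awaitables.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(run_pass(["Acme", "Globex", "Initech"], max_concurrent=2)))
```

All coroutines are scheduled immediately, but the semaphore lets only a fixed number past the `async with` line at a time, which is what keeps API usage bounded.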

Output Schema

Each lead row in the CSV:

Field	Description	Notes
`Business`	Company name
`First`	Decision maker first name
`Last`	Decision maker last name
`Email`	Champion email	Filled by Apollo/Hunter if configured
`LinkedIn`	Contact LinkedIn URL
`Website`	Company website URL
`Phone`	Contact phone	Filled by Apollo if configured
`Date`	Date discovered (ISO 8601)
`Place of Work`	Company + city/country
`ICP Fit Score`	0–10 match score
`ICP Fit Reason`	LLM reasoning for the score
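
One way to read the `ICP Fit Score` column is as a weighted mean of per-dimension scores. The sketch below illustrates that idea only; the dimension names, the weights, and the `weighted_score` function are assumptions, and the project's actual rubric may combine scores differently.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension 0-10 scores into one 0-10 score (weighted mean)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = sum(scores[dim] * weights[dim] for dim in scores) / total_weight
    return round(combined, 1)


if __name__ == "__main__":
    # Hypothetical dimensions and weights, chosen only for the example.
    weights = {"role": 4, "size": 2, "industry": 2, "geography": 2}
    scores = {"role": 9, "size": 6, "industry": 8, "geography": 10}
    print(weighted_score(scores, weights))
```

Because the weights are normalized by their sum, the result stays on the same 0-10 scale as the inputs.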

Configuration Reference

All settings live in ~/.lgf/.env (written by lgf init, edited by lgf config set).

Required

TAVILY_API_KEY=tvly-your-key
OPENROUTER_API_KEY=sk-or-your-key

LLM

DEFAULT_LLM_MODEL=google/gemini-2.0-flash-001    # Primary model for all LLM calls
FALLBACK_LLM_MODEL=anthropic/claude-3-5-haiku    # Used if primary fails
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
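
The primary/fallback relationship between the two model settings can be pictured like this. It is a sketch under assumptions: `call_model` is a hypothetical callable supplied by the caller, and how lgf actually decides that the primary has failed is not documented here.

```python
from typing import Callable


def complete_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    primary: str = "google/gemini-2.0-flash-001",
    fallback: str = "anthropic/claude-3-5-haiku",
) -> tuple[str, str]:
    """Try the primary model; on any exception, retry once with the fallback.

    call_model(model, prompt) is a hypothetical callable, not an lgf API.
    Returns (model_used, response_text).
    """
    try:
        return primary, call_model(primary, prompt)
    except Exception:
        # Primary failed; an error from the fallback propagates to the caller.
        return fallback, call_model(fallback, prompt)


if __name__ == "__main__":
    def fake_call(model: str, prompt: str) -> str:
        # Simulates an outage of the primary model.
        if model.startswith("google/"):
            raise RuntimeError("primary unavailable")
        return f"{model} answered: {prompt}"

    print(complete_with_fallback("ping", fake_call))
```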

Pipeline Tuning

MAX_SEARCH_RESULTS=10           # Tavily results per query
MAX_CONCURRENT_CRAWLS=5         # Parallel crawl4ai sessions
MAX_CONTACT_SEARCHES=3          # Tavily results per company in contact search
MAX_CONCURRENT_EXTRACTIONS=5    # Parallel LLM calls
MULTI_THREAD_SCORE_THRESHOLD=6.0  # Min rubric score to trigger multi-contact search
SEEN_COMPANIES_PATH=~/.lgf/data/seen_companies.json  # Cross-run dedup file
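
The fuzzy matching behind deduplication can be illustrated with the standard library. The project lists thefuzz for this; the sketch below swaps in `difflib.SequenceMatcher` so it runs without extra dependencies, which means borderline pairs may score differently than in lgf. The function names are hypothetical, and loading the file at `SEEN_COMPANIES_PATH` is left out because its format is not documented here.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 100]; difflib's ratio is a stand-in for thefuzz."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(names: list[str], seen: list[str], threshold: float = 85.0) -> list[str]:
    """Keep names that match neither a previously seen nor an already kept name."""
    kept: list[str] = []
    for name in names:
        known = seen + kept
        if all(similarity(name, other) < threshold for other in known):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # "seen" plays the role of companies recorded by earlier runs.
    print(dedupe(["Acme Corp", "ACME  Corp", "Globex"], seen=["Initech"]))
```

Passing previous results as `seen` gives cross-run behavior, while the growing `kept` list handles duplicates within a single run.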

Email Enrichment (optional)

APOLLO_API_KEY=your-master-key         # Apollo.io master key (Settings > Integrations)
APOLLO_ENRICHER_VERSION=v1             # v1=match by name (fast), v2=search by role (smarter)
HUNTER_API_KEY=your-hunter-key         # Hunter.io fallback
FIRECRAWL_API_KEY=your-firecrawl-key   # 3rd-tier crawl fallback

CRM Sync (optional)

GOOGLE_SHEET_ID=1abc...xyz             # From sheet URL: /d/<ID>/edit
GOOGLE_SHEET_TAB=Sheet1                # Tab name shown at the bottom of the sheet
GOOGLE_CREDENTIALS_PATH=client_secret.json  # OAuth 2.0 Desktop App JSON

Download credentials from Google Cloud Console → APIs & Services → Credentials → OAuth 2.0 Client IDs.

Project Structure

lead-gen-factory/
├── src/
│   ├── main.py              # CLI assembler — typer app with all commands
│   ├── config.py            # All settings — single source of truth (never os.environ)
│   ├── commands/            # CLI sub-commands
│   │   ├── init.py          # lgf init wizard
│   │   ├── doctor.py        # lgf doctor health check
│   │   ├── profile.py       # lgf profile add/list/remove
│   │   └── config.py        # lgf config show/set
│   ├── agents/              # LLM-powered pipeline stages (no cross-calls)
│   │   ├── icp_parser.py
│   │   ├── query_generator.py
│   │   ├── snippet_filter.py
│   │   ├── lead_extractor.py
│   │   ├── icp_scorer.py
│   │   └── contact_finder.py
│   ├── pipeline/            # Orchestration layer
│   │   ├── runner.py        # Full pipeline orchestrator (3 passes)
│   │   ├── searcher.py      # Tavily search (async)
│   │   ├── crawler.py       # crawl4ai + Scrapling + Firecrawl fallback
│   │   ├── deduplicator.py  # Fuzzy within-run dedup
│   │   ├── seen_tracker.py  # Cross-run dedup (seen_companies.json)
│   │   └── crm_sync.py      # Google Sheets append
│   ├── schemas/             # Pydantic models
│   │   ├── icp.py
│   │   └── lead.py
│   └── utils/
│       └── output.py        # CSV writer + Rich display helpers
├── tests/                   # Test suite mirroring src/
├── icp_examples/            # Example ICP files
├── docs/                    # Architecture docs (Diataxis)
└── pyproject.toml           # Package definition + lgf entrypoint

Layer rule: Agents never call other agents. All inter-agent communication goes through pipeline/runner.py. See BACKEND_STRUCTURE.md for details.

Tech Stack

Component	Tool
Language	Python 3.12
CLI	Typer
Web Search	Tavily
Web Crawl	crawl4ai → Scrapling → Firecrawl
LLM Provider	OpenRouter via OpenAI SDK
Default Model	`google/gemini-2.0-flash-001`
Email Enrichment	Apollo.io + Hunter.io
CRM Sync	gspread → Google Sheets
Validation	Pydantic v2
Deduplication	thefuzz (Levenshtein)
Terminal UI	Rich
Async	asyncio + semaphore-controlled concurrency

Troubleshooting

Run diagnostics first:

lgf doctor

Problem	Solution
`lgf: command not found`	`pip install -e .` or `pipx install .` from the repo root
Missing API keys	`lgf init` to reconfigure, or `lgf config set KEY VALUE`
crawl4ai browser errors	Run `playwright install chromium` once after install (required for crawl4ai)
0 leads found	Broaden ICP geography/roles; use `--dry-run` first to check queries
Tavily rate limit	Free tier: 1000/month — use `--dry-run` to test; avoid re-running the same ICP
Google Sheets auth error	Credentials file missing or wrong path — check `lgf doctor`
LLM extraction issues	`lgf research --verbose` to see prompts and raw responses
CRM pre-filter skipped	Warning is shown; pipeline runs normally — post-run dedup still prevents duplicates

Development

Setup

git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
python -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install chromium

Workflow

python -m pytest tests/         # Run all tests
python -m pytest tests/unit/    # Unit tests only
lgf doctor                      # Verify API keys and connectivity
lgf research --icp icp_examples/skillia_spain.md --dry-run  # Test query generation

Documentation

Document	Description
`docs/pipeline/`	Full architecture docs (Diataxis)
`docs/pipeline/reference/cli-commands.md`	Complete CLI reference
`docs/pipeline/reference/env-vars.md`	All environment variables
`docs/pipeline/explanation/`	Architecture rationale
`docs/pipeline/adr/`	Architecture decision records
`icp_examples/`	Example ICP files

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new features
Submit a pull request

See CONTRIBUTING.md for code style, test requirements, and commit format.

Legal

License

Licensed under CC BY-NC 4.0 — free to use and modify, not for commercial purposes. See LICENSE.

Security

Report vulnerabilities via SECURITY.md.

Built by Jordi Catafal for Skillia
