Lead Gen Factory
Free-text ICP → scored B2B leads CSV. Fully automated, zero manual research.
A headless CLI pipeline for B2B lead research. Give it an Ideal Customer Profile in plain English, and it searches the internet, crawls relevant pages, extracts structured contact data via LLM, scores ICP fit, deduplicates, and outputs a clean CSV ready for outreach, with optional Google Sheets CRM sync and email enrichment.
Key Features:
- Intelligent Search: LLM-generated queries, Tavily-powered, parallel execution
- LLM Extraction: Structured lead data pulled from raw web pages via OpenRouter
- ICP Scoring: Multi-dimensional 0-10 scoring with weighted rubrics and reasoning
- Contact Discovery: 3-tier champion search (role-based → HR-targeted → enrichment)
- Email Enrichment: Apollo.io (v1 name-match or v2 role-search) + Hunter.io fallback
- Smart Dedup: Fuzzy company name matching across runs (85% threshold)
- CRM Sync: Google Sheets append with pre-run dedup (skips companies already in sheet)
- Full Config CLI: `lgf config set` for any setting; `lgf config` to inspect all
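To make the "weighted rubrics" idea in the ICP Scoring feature concrete, here is a minimal sketch of how per-dimension scores could be combined into one 0-10 value. The `RubricScore` class, the dimension names, and the weights are all hypothetical; the project's real rubric and scoring logic are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """One scoring dimension (hypothetical structure, not the project's schema)."""
    name: str
    score: float   # 0-10 for this dimension
    weight: float  # relative importance, any positive number


def weighted_fit_score(rubrics: list[RubricScore]) -> float:
    """Combine per-dimension scores into a single 0-10 value (weighted mean)."""
    total_weight = sum(r.weight for r in rubrics)
    if total_weight <= 0:
        raise ValueError("at least one rubric needs a positive weight")
    combined = sum(r.score * r.weight for r in rubrics) / total_weight
    # Clamp so malformed inputs cannot escape the 0-10 range.
    return round(max(0.0, min(10.0, combined)), 1)


# Illustrative dimensions and weights only.
example = [
    RubricScore("role match", 8.0, 3.0),
    RubricScore("company size", 6.0, 2.0),
    RubricScore("geography", 10.0, 1.0),
]
print(weighted_fit_score(example))  # (24 + 12 + 10) / 6, rounded to one decimal
```

A weighted mean keeps the result on the same 0-10 scale as the inputs, so a threshold such as `--min-score` can be compared against it directly.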
See CHANGELOG.md for version history and release notes.
# From GitHub (recommended, always latest)
pipx install git+https://github.com/Catafal/lead-gen-factory.git
# Or from source (for development)
git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
pip install -e .
playwright install chromium

Note: `pipx` installs `lgf` globally without touching your system Python. Install it with `pip install pipx` if you don't have it.

lgf init

The wizard collects your API keys and writes them to ~/.lgf/.env. You'll need:
| Key | Source | Required |
|---|---|---|
| `TAVILY_API_KEY` | tavily.com (free tier: 1000/month) | yes |
| `OPENROUTER_API_KEY` | openrouter.ai | yes |
| `FIRECRAWL_API_KEY` | firecrawl.dev (crawl fallback) | optional |
| `APOLLO_API_KEY` | apollo.io (email enrichment) | optional |
| `HUNTER_API_KEY` | hunter.io (email enrichment fallback) | optional |
| `GOOGLE_SHEET_ID` | Sheet URL between /d/ and /edit | optional |
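The required/optional split in the table above can be expressed as a small check. This is only a sketch of the idea, not how `lgf doctor` is implemented; the function name `check_keys` and the shape of its return value are assumptions made for illustration.

```python
# Key names come from the table above; everything else here is hypothetical.
REQUIRED_KEYS = ("TAVILY_API_KEY", "OPENROUTER_API_KEY")
OPTIONAL_KEYS = (
    "FIRECRAWL_API_KEY",
    "APOLLO_API_KEY",
    "HUNTER_API_KEY",
    "GOOGLE_SHEET_ID",
)


def check_keys(settings: dict[str, str]) -> dict[str, list[str]]:
    """Report which required and optional keys are unset or blank."""

    def is_missing(key: str) -> bool:
        return not settings.get(key, "").strip()

    return {
        "missing_required": [k for k in REQUIRED_KEYS if is_missing(k)],
        "missing_optional": [k for k in OPTIONAL_KEYS if is_missing(k)],
    }


report = check_keys({"TAVILY_API_KEY": "tvly-your-key", "HUNTER_API_KEY": " "})
print(report["missing_required"])
```

Treating a whitespace-only value as missing catches the common case of a key that was added to the file but never filled in.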
Verify your setup:
lgf doctor

Create a markdown file describing your Ideal Customer Profile. See icp_examples/skillia_spain.md for a full example.
Key sections your ICP should include:
| Section | Description |
|---|---|
| Target Roles | Decision maker job titles (e.g. "HR Director", "Head of People") |
| Company Size | Employee range (e.g. "50 to 500 employees") |
| Industries | Verticals / sectors to target |
| Geography | Countries, regions, or cities |
| Buying Signals | Observable behaviors that signal they need you |
| Exclude | Terms or company types that disqualify a lead |
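As a rough picture of what the parsed form of these sections might look like, the sketch below models them with a standard-library dataclass. The project itself uses a Pydantic model (see the pipeline overview); the class name, field names, and the `is_excluded` helper here are hypothetical and chosen only to mirror the table above.

```python
from dataclasses import dataclass, field


@dataclass
class IdealCustomerProfile:
    """Hypothetical structured form of the ICP sections listed above."""
    target_roles: list[str]
    company_size: tuple[int, int]  # (min_employees, max_employees)
    industries: list[str] = field(default_factory=list)
    geography: list[str] = field(default_factory=list)
    buying_signals: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)

    def is_excluded(self, text: str) -> bool:
        """True if any exclusion term appears in the text (case-insensitive)."""
        lowered = text.lower()
        return any(term.lower() in lowered for term in self.exclude)


icp = IdealCustomerProfile(
    target_roles=["HR Director", "Head of People"],
    company_size=(50, 500),
    geography=["Spain"],
    exclude=["agency"],
)
print(icp.is_excluded("Recruiting Agency Madrid"))
```

A plain substring match is the simplest possible exclusion test; a real implementation would probably need something more careful to avoid false positives.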
# Full run from an ICP file
lgf research --icp icp_examples/skillia_spain.md
# Inline ICP (no file needed)
lgf research --icp-text "HR Directors at 50-200 person tech companies in Spain"
# Use a saved profile
lgf research --profile skillia-spain
# Dry run: generate queries only, no search or crawl
lgf research --icp icp_examples/skillia_spain.md --dry-run
# Narrow an existing ICP without editing the file
lgf research --profile skillia-spain --focus "SaaS companies only, no agencies"

Output is saved to leads_YYYYMMDD.csv by default (or specify --output path/to/file.csv).
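The default file name pattern can be reproduced in a couple of lines. This is a sketch under the assumption that the date part is the local run date; the helper name `default_output_path` is hypothetical and not part of the CLI.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def default_output_path(today: Optional[date] = None, directory: Path = Path(".")) -> Path:
    """Build a leads_YYYYMMDD.csv path for the given (or current) date."""
    today = today or date.today()
    return directory / f"leads_{today:%Y%m%d}.csv"


print(default_output_path(date(2025, 3, 7)))
```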
lgf init # Guided first-time setup wizard; writes ~/.lgf/.env
lgf doctor # Health check: API keys + live connectivity probes

lgf research --icp PATH # Full pipeline from ICP file
lgf research --icp-text TEXT # Inline ICP description
lgf research --profile NAME # Use a saved ICP profile
lgf research --focus TEXT # Add a narrowing constraint (stacks on ICP)
lgf research --dry-run # Generate queries only; no API calls beyond ICP parse
lgf research --min-score INT # Override minimum ICP fit score (default: from ICP)
lgf research --output PATH # Custom CSV output path
lgf research --verbose # Show debug logs

lgf profile add NAME --icp PATH # Save an ICP file as a reusable profile
lgf profile list # List all saved profiles
lgf profile remove NAME # Delete a profile

Profiles are stored as markdown files in ~/.lgf/profiles/.

lgf config # Show all effective settings (API keys masked)
lgf config set KEY VALUE # Update any setting in ~/.lgf/.env

Commonly changed settings:

lgf config set DEFAULT_LLM_MODEL anthropic/claude-3-5-sonnet
lgf config set FALLBACK_LLM_MODEL google/gemini-2.0-flash-001
lgf config set MAX_SEARCH_RESULTS 15
lgf config set APOLLO_ENRICHER_VERSION v2
lgf config set MULTI_THREAD_SCORE_THRESHOLD 7.0

lgf validate-icp --icp PATH # Parse and display ICP without running the pipeline
lgf validate-icp --icp-text TEXT # Same for inline text

ICP (file / inline / profile)
↓
ICP Parsing: free-text → structured Pydantic model
↓
Query Generation: LLM → 8-12 targeted search queries
↓
Web Search: Tavily, async parallel
↓
Snippet Pre-filter: LLM heuristic drops irrelevant results
↓
Web Crawl: crawl4ai → Scrapling → Firecrawl (3-tier fallback)
↓
Lead Extraction: LLM → structured CompanyLead + contacts
↓
ICP Scoring: multi-dimensional weighted rubrics (0-10)
↓
Deduplication: fuzzy company name match (85% threshold, cross-run)
↓
Contact Discovery: role search → HR search → enrichment (3-tier)
↓
Email Enrichment: Apollo v1/v2 + Hunter.io fallback (optional)
↓
CSV Output + CRM: filtered, sorted, Google Sheets sync (optional)
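The 3-tier crawl fallback in the Web Crawl stage follows a common pattern: try each backend in order and keep the first usable result. The sketch below shows that pattern with stand-in fetchers; it does not call crawl4ai, Scrapling, or Firecrawl, and the function and tier names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Fetcher = Callable[[str], Awaitable[str]]


async def crawl_with_fallback(
    url: str, fetchers: Sequence[tuple[str, Fetcher]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each fetcher in order; return (tier_name, content) from the first that works."""
    for name, fetch in fetchers:
        try:
            content = await fetch(url)
        except Exception:
            continue  # this tier failed, fall through to the next one
        if content.strip():
            return name, content
    return None, None  # every tier failed or returned empty content


# Stand-in fetchers for demonstration only; real tiers would wrap the crawl libraries.
async def failing_fetcher(url: str) -> str:
    raise RuntimeError("simulated crawl failure")


async def empty_fetcher(url: str) -> str:
    return ""


async def working_fetcher(url: str) -> str:
    return f"<html>content of {url}</html>"


tiers = [("tier-1", failing_fetcher), ("tier-2", empty_fetcher), ("tier-3", working_fetcher)]
result = asyncio.run(crawl_with_fallback("https://example.com", tiers))
print(result[0])
```

Treating empty content the same as an exception matters here, because a crawler can "succeed" on a JavaScript-heavy page while returning nothing useful.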
The pipeline runs Passes 1, 2, and 3 in sequence. Passes 2 and 3 use asyncio.gather with a semaphore for parallel execution per company.
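The `asyncio.gather` plus semaphore combination mentioned above can be sketched as follows. The per-company work is replaced by a short sleep, and the names `process_company` and `run_pass` are hypothetical; only the concurrency pattern is the point.

```python
import asyncio


async def process_company(name: str, semaphore: asyncio.Semaphore) -> str:
    """Do one company's work while holding a semaphore slot."""
    async with semaphore:
        await asyncio.sleep(0.01)  # placeholder for a search or LLM call
        return f"{name}: done"


async def run_pass(companies: list[str], max_concurrent: int = 5) -> list[str]:
    """Run all companies concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [process_company(company, semaphore) for company in companies]
    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_pass(["Acme", "Globex", "Initech"], max_concurrent=2))
print(results)
```

All tasks are scheduled at once, but the semaphore caps how many are doing real work at any moment, which is presumably what settings such as `MAX_CONCURRENT_CRAWLS` and `MAX_CONCURRENT_EXTRACTIONS` control.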
Each lead row in the CSV:
| Field | Description | Notes |
|---|---|---|
| `Business` | Company name | |
| `First` | Decision maker first name | |
| `Last` | Decision maker last name | |
| `Email` | Champion email | Filled by Apollo/Hunter if configured |
| `LinkedIn` | Contact LinkedIn URL | |
| `Website` | Company website URL | |
| `Phone` | Contact phone | Filled by Apollo if configured |
| `Date` | Date discovered (ISO 8601) | |
| `Place of Work` | Company + city/country | |
| `ICP Fit Score` | 0-10 match score | |
| `ICP Fit Reason` | LLM reasoning for the score | |
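For readers who want to post-process or reproduce this layout, here is a minimal sketch that filters leads by minimum score, sorts them, and writes the columns above with the standard `csv` module. Sorting by score in descending order is an assumption (the README only says "sorted"), the helper name is hypothetical, and the sample rows are made up.

```python
import csv
import io

# Column names taken from the table above.
FIELDNAMES = [
    "Business", "First", "Last", "Email", "LinkedIn", "Website",
    "Phone", "Date", "Place of Work", "ICP Fit Score", "ICP Fit Reason",
]


def leads_to_csv(leads: list[dict], min_score: float) -> str:
    """Keep leads at or above min_score, sort best-first, and render as CSV text."""
    kept = [lead for lead in leads if float(lead["ICP Fit Score"]) >= min_score]
    kept.sort(key=lambda lead: float(lead["ICP Fit Score"]), reverse=True)
    buffer = io.StringIO()
    # restval fills columns that a lead does not provide (e.g. no email found).
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    writer.writerows(kept)
    return buffer.getvalue()


# Made-up sample data for illustration.
sample = [
    {"Business": "Acme SL", "Date": "2025-01-15", "ICP Fit Score": 6.5},
    {"Business": "Beta Ltd", "Date": "2025-01-15", "ICP Fit Score": 9.0},
    {"Business": "Gamma GmbH", "Date": "2025-01-15", "ICP Fit Score": 3.0},
]
csv_text = leads_to_csv(sample, min_score=6.0)
print(csv_text)
```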
All settings live in ~/.lgf/.env (written by lgf init, edited by lgf config set).
TAVILY_API_KEY=tvly-your-key
OPENROUTER_API_KEY=sk-or-your-key

DEFAULT_LLM_MODEL=google/gemini-2.0-flash-001 # Primary model for all LLM calls
FALLBACK_LLM_MODEL=anthropic/claude-3-5-haiku # Used if primary fails
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

MAX_SEARCH_RESULTS=10 # Tavily results per query
MAX_CONCURRENT_CRAWLS=5 # Parallel crawl4ai sessions
MAX_CONTACT_SEARCHES=3 # Tavily results per company in contact search
MAX_CONCURRENT_EXTRACTIONS=5 # Parallel LLM calls
MULTI_THREAD_SCORE_THRESHOLD=6.0 # Min rubric score to trigger multi-contact search
SEEN_COMPANIES_PATH=~/.lgf/data/seen_companies.json # Cross-run dedup file

APOLLO_API_KEY=your-master-key # Apollo.io master key (Settings > Integrations)
APOLLO_ENRICHER_VERSION=v1 # v1=match by name (fast), v2=search by role (smarter)
HUNTER_API_KEY=your-hunter-key # Hunter.io fallback
FIRECRAWL_API_KEY=your-firecrawl-key # 3rd-tier crawl fallback

GOOGLE_SHEET_ID=1abc...xyz # From sheet URL: /d/<ID>/edit
GOOGLE_SHEET_TAB=Sheet1 # Tab name at the bottom
GOOGLE_CREDENTIALS_PATH=client_secret.json # OAuth 2.0 Desktop App JSON

Download credentials from Google Cloud Console → APIs & Services → Credentials → OAuth 2.0 Client IDs.
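The settings above are written as `KEY=value # comment` lines. As an illustration of that format, and of the key masking that `lgf config` is described as doing, here is a deliberately simple sketch. It is not the project's loader: real dotenv parsers handle quoting and escapes, and both function names are hypothetical.

```python
from typing import Optional


def parse_env_line(line: str) -> Optional[tuple[str, str]]:
    """Parse one KEY=value line, dropping a trailing ' # comment' if present."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None  # blank line, full-line comment, or not an assignment
    key, _, raw_value = stripped.partition("=")
    # Assumption: " #" starts an inline comment, as in the examples above.
    value = raw_value.split(" #", 1)[0].strip()
    return key.strip(), value


def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the first few characters of a secret."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


print(parse_env_line("MAX_SEARCH_RESULTS=10 # Tavily results per query"))
print(mask_secret("tvly-your-key"))
```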
lead-gen-factory/
├── src/
│   ├── main.py # CLI assembler: Typer app with all commands
│   ├── config.py # All settings, single source of truth (never os.environ)
│   ├── commands/ # CLI sub-commands
│   │   ├── init.py # lgf init wizard
│   │   ├── doctor.py # lgf doctor health check
│   │   ├── profile.py # lgf profile add/list/remove
│   │   └── config.py # lgf config show/set
│   ├── agents/ # LLM-powered pipeline stages (no cross-calls)
│   │   ├── icp_parser.py
│   │   ├── query_generator.py
│   │   ├── snippet_filter.py
│   │   ├── lead_extractor.py
│   │   ├── icp_scorer.py
│   │   └── contact_finder.py
│   ├── pipeline/ # Orchestration layer
│   │   ├── runner.py # Full pipeline orchestrator (3 passes)
│   │   ├── searcher.py # Tavily search (async)
│   │   ├── crawler.py # crawl4ai + Scrapling + Firecrawl fallback
│   │   ├── deduplicator.py # Fuzzy within-run dedup
│   │   ├── seen_tracker.py # Cross-run dedup (seen_companies.json)
│   │   └── crm_sync.py # Google Sheets append
│   ├── schemas/ # Pydantic models
│   │   ├── icp.py
│   │   └── lead.py
│   └── utils/
│       └── output.py # CSV writer + Rich display helpers
├── tests/ # Test suite mirroring src/
├── icp_examples/ # Example ICP files
├── docs/ # Architecture docs (Diataxis)
└── pyproject.toml # Package definition + lgf entrypoint
Layer rule: Agents never call other agents. The pipeline/runner.py orchestrates all inter-agent communication. See BACKEND_STRUCTURE.md for details.
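The layer rule can be illustrated with a toy orchestrator: each "agent" is an independent function, and only the runner passes one agent's output to the next. The functions below are hypothetical stand-ins with trivial logic, not the project's agents, and the search step is injected so the example runs offline.

```python
from typing import Callable


# Hypothetical stand-ins for agents: each is a plain function that knows nothing
# about the others.
def parse_icp(text: str) -> dict:
    return {"description": text.strip()}


def generate_queries(icp: dict) -> list[str]:
    return [f"{icp['description']} companies", f"{icp['description']} hiring"]


def score_lead(lead: dict, icp: dict) -> float:
    return 10.0 if icp["description"].lower() in lead["summary"].lower() else 0.0


def run_pipeline(icp_text: str, find_leads: Callable[[list[str]], list[dict]]) -> list[dict]:
    """Orchestrator: the only place where one agent's output reaches another."""
    icp = parse_icp(icp_text)
    queries = generate_queries(icp)
    leads = find_leads(queries)
    return [{**lead, "score": score_lead(lead, icp)} for lead in leads]


def fake_search(queries: list[str]) -> list[dict]:
    # Offline stand-in for the search and crawl stages.
    return [{"company": "Acme", "summary": "HR software vendor"}]


scored = run_pipeline("HR software", fake_search)
print(scored)
```

Because no agent imports another, each one can be tested in isolation and reordered or replaced by changing the runner alone.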
| Component | Tool |
|---|---|
| Language | Python 3.12 |
| CLI | Typer |
| Web Search | Tavily |
| Web Crawl | crawl4ai → Scrapling → Firecrawl |
| LLM Provider | OpenRouter via OpenAI SDK |
| Default Model | google/gemini-2.0-flash-001 |
| Email Enrichment | Apollo.io + Hunter.io |
| CRM Sync | gspread → Google Sheets |
| Validation | Pydantic v2 |
| Deduplication | thefuzz (Levenshtein) |
| Terminal UI | Rich |
| Async | asyncio + semaphore-controlled concurrency |
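The stack lists thefuzz (Levenshtein) for deduplication with an 85% threshold. To stay dependency-free, the sketch below swaps in `difflib.SequenceMatcher` from the standard library, which uses a different similarity measure, so its scores will not match thefuzz exactly. The normalization step and function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and strip simple punctuation so 'ACME Corp.' matches 'Acme Corp'."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib ratio, not Levenshtein)."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe_companies(names: list[str], threshold: float = 85.0) -> list[str]:
    """Keep the first occurrence of each company; drop later near-duplicates."""
    unique: list[str] = []
    for name in names:
        if not any(similarity(name, kept) >= threshold for kept in unique):
            unique.append(name)
    return unique


print(dedupe_companies(["Acme Corp", "ACME Corp.", "Globex"]))
```

This pairwise comparison is quadratic in the number of companies, which is fine for a single run's output but may need indexing or blocking for a large cross-run history.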
Run diagnostics first:
lgf doctor

| Problem | Solution |
|---|---|
| `lgf: command not found` | `pip install -e .` or `pipx install .` from the repo root |
| Missing API keys | `lgf init` to reconfigure, or `lgf config set KEY VALUE` |
| `playwright install chromium` | Required for crawl4ai; run once after install |
| 0 leads found | Broaden ICP geography/roles; use `--dry-run` first to check queries |
| Tavily rate limit | Free tier: 1000/month. Use `--dry-run` to test; avoid re-running the same ICP |
| Google Sheets auth error | Credentials file missing or wrong path; check `lgf doctor` |
| LLM extraction issues | `lgf research --verbose` to see prompts and raw responses |
| CRM pre-filter skipped | Warning is shown; pipeline runs normally, and post-run dedup still prevents duplicates |
git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
python -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install chromium

python -m pytest tests/ # Run all tests
python -m pytest tests/unit/ # Unit tests only
lgf doctor # Verify API keys and connectivity
lgf research --icp icp_examples/skillia_spain.md --dry-run # Test query generation

| Document | Description |
|---|---|
| `docs/pipeline/` | Full architecture docs (Diataxis) |
| `docs/pipeline/reference/cli-commands.md` | Complete CLI reference |
| `docs/pipeline/reference/env-vars.md` | All environment variables |
| `docs/pipeline/explanation/` | Architecture rationale |
| `docs/pipeline/adr/` | Architecture decision records |
| `icp_examples/` | Example ICP files |
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
See CONTRIBUTING.md for code style, test requirements, and commit format.
Licensed under CC BY-NC 4.0: free to use and modify, not for commercial purposes. See LICENSE.
Report vulnerabilities via SECURITY.md.
Built by Jordi Catafal for Skillia