Catafal/lead-gen-factory

██╗      ██████╗ ███████╗
██║     ██╔════╝ ██╔════╝
██║     ██║  ███╗█████╗
██║     ██║   ██║██╔══╝
███████╗╚██████╔╝██║
╚══════╝ ╚═════╝ ╚═╝

Lead Gen Factory

Free-text ICP → scored B2B leads CSV: fully automated, zero manual research.

A headless CLI pipeline for B2B lead research. Give it an Ideal Customer Profile in plain English, and it searches the internet, crawls relevant pages, extracts structured contact data via LLM, scores ICP fit, deduplicates, and outputs a clean CSV ready for outreach, with optional Google Sheets CRM sync and email enrichment.

Key Features:

  • 🔍 Intelligent Search: LLM-generated queries, Tavily-powered, parallel execution
  • 🧠 LLM Extraction: structured lead data pulled from raw web pages via OpenRouter
  • 📊 ICP Scoring: multi-dimensional 0–10 scoring with weighted rubrics and reasoning
  • 👀 Contact Discovery: 3-tier champion search (role-based → HR-targeted → enrichment)
  • 📧 Email Enrichment: Apollo.io (v1 name-match or v2 role-search) + Hunter.io fallback
  • 🔁 Smart Dedup: fuzzy company name matching across runs (85% threshold)
  • 📋 CRM Sync: Google Sheets append with pre-run dedup (skips companies already in the sheet)
  • ⚙️ Full Config CLI: lgf config set for any setting; lgf config to inspect all
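
As a rough illustration of the ICP Scoring idea (a 0–10 score built from weighted rubric dimensions), here is a minimal sketch. The dimension names and weights are hypothetical and chosen only for the example; the project's actual rubric may differ.

```python
# Hypothetical rubric: dimension names and weights are illustrative assumptions,
# not the project's real scoring configuration.
WEIGHTS = {"role": 0.4, "size": 0.2, "industry": 0.2, "geography": 0.2}


def weighted_icp_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-10) into a single 0-10 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing dimensions count as 0; scores are clamped to the 0-10 range.
        score = min(10.0, max(0.0, dimension_scores.get(name, 0.0)))
        total += weight * score
    return round(total, 1)


example = {"role": 9, "size": 7, "industry": 8, "geography": 10}
print(weighted_icp_score(example))
```

Because the weights sum to 1.0, the combined score stays on the same 0–10 scale as the individual dimensions.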

See CHANGELOG.md for version history and release notes.


Quick Start

Installation

# From GitHub (recommended; always latest)
pipx install git+https://github.com/Catafal/lead-gen-factory.git

# Or from source (for development)
git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
pip install -e .
playwright install chromium

Note: pipx installs lgf globally without touching your system Python. Install it with pip install pipx if you don't have it.

First-Time Setup

lgf init

The wizard collects your API keys and writes them to ~/.lgf/.env. You'll need:

| Key | Source | Required |
|-----|--------|----------|
| TAVILY_API_KEY | tavily.com (free tier: 1000/month) | ✅ |
| OPENROUTER_API_KEY | openrouter.ai | ✅ |
| FIRECRAWL_API_KEY | firecrawl.dev (crawl fallback) | optional |
| APOLLO_API_KEY | apollo.io (email enrichment) | optional |
| HUNTER_API_KEY | hunter.io (email enrichment fallback) | optional |
| GOOGLE_SHEET_ID | Sheet URL, between /d/ and /edit | optional |

Verify your setup:

lgf doctor

Write Your ICP

Create a markdown file describing your Ideal Customer Profile. See icp_examples/skillia_spain.md for a full example.

Key sections your ICP should include:

| Section | Description |
|---------|-------------|
| Target Roles | Decision maker job titles (e.g. "HR Director", "Head of People") |
| Company Size | Employee range (e.g. "50 to 500 employees") |
| Industries | Verticals / sectors to target |
| Geography | Countries, regions, or cities |
| Buying Signals | Observable behaviors that signal they need you |
| Exclude | Terms or company types that disqualify a lead |
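
To make these sections concrete, here is a minimal sketch of how they might look once parsed into a structured object. It uses a standard-library dataclass as a stand-in for the project's Pydantic model, and every field name below is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ICP:
    """Structured view of the ICP sections (field names are assumptions)."""
    target_roles: list[str]
    company_size: tuple[int, int]  # (min_employees, max_employees)
    industries: list[str] = field(default_factory=list)
    geography: list[str] = field(default_factory=list)
    buying_signals: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)

    def is_excluded(self, company_description: str) -> bool:
        """True if any exclude term appears in the description (case-insensitive)."""
        text = company_description.lower()
        return any(term.lower() in text for term in self.exclude)


icp = ICP(
    target_roles=["HR Director", "Head of People"],
    company_size=(50, 500),
    industries=["SaaS"],
    geography=["Spain"],
    exclude=["staffing agency"],
)
print(icp.is_excluded("A staffing agency based in Madrid"))
```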

Run the Pipeline

# Full run from an ICP file
lgf research --icp icp_examples/skillia_spain.md

# Inline ICP (no file needed)
lgf research --icp-text "HR Directors at 50-200 person tech companies in Spain"

# Use a saved profile
lgf research --profile skillia-spain

# Dry run: generate queries only, no search or crawl
lgf research --icp icp_examples/skillia_spain.md --dry-run

# Narrow an existing ICP without editing the file
lgf research --profile skillia-spain --focus "SaaS companies only, no agencies"

Output is saved to leads_YYYYMMDD.csv by default (or specify --output path/to/file.csv).


Commands

Setup & Diagnostics

lgf init        # Guided first-time setup wizard (writes ~/.lgf/.env)
lgf doctor      # Health check: API keys + live connectivity probes

Research Pipeline

lgf research --icp PATH               # Full pipeline from ICP file
lgf research --icp-text TEXT          # Inline ICP description
lgf research --profile NAME           # Use a saved ICP profile
lgf research --focus TEXT             # Add a narrowing constraint (stacks on ICP)
lgf research --dry-run                # Generate queries only; no API calls beyond ICP parse
lgf research --min-score INT          # Override minimum ICP fit score (default: from ICP)
lgf research --output PATH            # Custom CSV output path
lgf research --verbose                # Show debug logs

ICP Profiles

lgf profile add NAME --icp PATH       # Save an ICP file as a reusable profile
lgf profile list                      # List all saved profiles
lgf profile remove NAME               # Delete a profile

Profiles are stored as markdown files in ~/.lgf/profiles/.

Configuration

lgf config                            # Show all effective settings (API keys masked)
lgf config set KEY VALUE              # Update any setting in ~/.lgf/.env

Commonly changed settings:

lgf config set DEFAULT_LLM_MODEL anthropic/claude-3-5-sonnet
lgf config set FALLBACK_LLM_MODEL google/gemini-2.0-flash-001
lgf config set MAX_SEARCH_RESULTS 15
lgf config set APOLLO_ENRICHER_VERSION v2
lgf config set MULTI_THREAD_SCORE_THRESHOLD 7.0
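
For readers curious what updating a KEY=VALUE file and masking secrets can look like, here is a small sketch. Both helper functions are hypothetical and are not taken from the lgf source; they only illustrate the general technique.

```python
def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return env_text with KEY=VALUE updated in place, or appended if absent."""
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            break
    else:
        # No existing line for this key: append a new one.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters of a secret for display."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


env = "TAVILY_API_KEY=tvly-abc123\nMAX_SEARCH_RESULTS=10\n"
env = set_env_value(env, "MAX_SEARCH_RESULTS", "15")
print(env)
print(mask_secret("tvly-abc123"))
```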

ICP Validation

lgf validate-icp --icp PATH           # Parse and display ICP without running the pipeline
lgf validate-icp --icp-text TEXT      # Same for inline text

Pipeline Architecture

ICP (file / inline / profile)
    ↓
🧠  ICP Parsing        - free-text → structured Pydantic model
    ↓
🔍  Query Generation   - LLM → 8-12 targeted search queries
    ↓
🌐  Web Search         - Tavily, async parallel
    ↓
🧹  Snippet Pre-filter - LLM heuristic drops irrelevant results
    ↓
📄  Web Crawl          - crawl4ai → Scrapling → Firecrawl (3-tier fallback)
    ↓
🏢  Lead Extraction    - LLM → structured CompanyLead + contacts
    ↓
📊  ICP Scoring        - multi-dimensional weighted rubrics (0-10)
    ↓
🔁  Deduplication      - fuzzy company name match (85% threshold, cross-run)
    ↓
👀  Contact Discovery  - role search → HR search → enrichment (3-tier)
    ↓
📧  Email Enrichment   - Apollo v1/v2 + Hunter.io fallback (optional)
    ↓
📁  CSV Output + CRM   - filtered, sorted, Google Sheets sync (optional)

The pipeline runs Passes 1, 2, and 3 in sequence. Passes 2 and 3 use asyncio.gather with a semaphore for parallel execution per company.
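
The asyncio.gather plus semaphore pattern mentioned above can be sketched like this. The function names and the sleep call are placeholders for real per-company work, not code from the project.

```python
import asyncio


async def process_company(name: str, sem: asyncio.Semaphore) -> str:
    """Placeholder for per-company work (e.g. a contact search)."""
    async with sem:                # at most `limit` coroutines run this body at once
        await asyncio.sleep(0.01)  # stands in for a network or LLM call
        return f"{name}: done"


async def run_pass(companies: list[str], limit: int = 5) -> list[str]:
    sem = asyncio.Semaphore(limit)
    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(process_company(c, sem) for c in companies))


results = asyncio.run(run_pass(["Acme", "Globex", "Initech"], limit=2))
print(results)
```

The semaphore caps how many companies are processed at the same time, which is useful when the underlying APIs enforce rate limits.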


Output Schema

Each lead row in the CSV:

| Field | Description | Notes |
|-------|-------------|-------|
| Business | Company name | |
| First | Decision maker first name | |
| Last | Decision maker last name | |
| Email | Champion email | Filled by Apollo/Hunter if configured |
| LinkedIn | Contact LinkedIn URL | |
| Website | Company website URL | |
| Phone | Contact phone | Filled by Apollo if configured |
| Date | Date discovered (ISO 8601) | |
| Place of Work | Company + city/country | |
| ICP Fit Score | 0–10 match score | |
| ICP Fit Reason | LLM reasoning for the score | |
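
A minimal sketch of writing one row with these columns using the standard csv module. The column names come from the schema above; the row itself is made-up example data, and the column order is an assumption.

```python
import csv
import io

# Column names taken from the schema above; the order is assumed.
FIELDS = [
    "Business", "First", "Last", "Email", "LinkedIn", "Website",
    "Phone", "Date", "Place of Work", "ICP Fit Score", "ICP Fit Reason",
]

# Made-up example row, not real data.
lead = {
    "Business": "Example Corp",
    "First": "Ana",
    "Last": "Garcia",
    "Website": "https://example.com",
    "Date": "2025-01-15",
    "Place of Work": "Example Corp, Madrid, Spain",
    "ICP Fit Score": 8,
    "ICP Fit Reason": "HR leader at a 200-person SaaS company",
}

buffer = io.StringIO()
# restval="" leaves optional columns (Email, Phone, ...) blank when not enriched
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerow(lead)
print(buffer.getvalue())
```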

Configuration Reference

All settings live in ~/.lgf/.env (written by lgf init, edited by lgf config set).

Required

TAVILY_API_KEY=tvly-your-key
OPENROUTER_API_KEY=sk-or-your-key

LLM

DEFAULT_LLM_MODEL=google/gemini-2.0-flash-001    # Primary model for all LLM calls
FALLBACK_LLM_MODEL=anthropic/claude-3-5-haiku    # Used if primary fails
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
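
The primary/fallback behavior can be pictured with a short sketch. The call_model callable is a hypothetical stand-in for whatever client the project uses, and the retry policy shown (one fallback attempt on any exception) is an assumption.

```python
from typing import Callable


def complete_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    primary: str = "google/gemini-2.0-flash-001",
    fallback: str = "anthropic/claude-3-5-haiku",
) -> str:
    """Try the primary model; on any exception, retry once with the fallback."""
    try:
        return call_model(primary, prompt)
    except Exception:
        return call_model(fallback, prompt)


# Stub client for demonstration: the primary model always fails.
def fake_client(model: str, prompt: str) -> str:
    if model.startswith("google/"):
        raise RuntimeError("simulated outage")
    return f"[{model}] {prompt}"


print(complete_with_fallback("hello", fake_client))
```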

Pipeline Tuning

MAX_SEARCH_RESULTS=10           # Tavily results per query
MAX_CONCURRENT_CRAWLS=5         # Parallel crawl4ai sessions
MAX_CONTACT_SEARCHES=3          # Tavily results per company in contact search
MAX_CONCURRENT_EXTRACTIONS=5    # Parallel LLM calls
MULTI_THREAD_SCORE_THRESHOLD=6.0  # Min rubric score to trigger multi-contact search
SEEN_COMPANIES_PATH=~/.lgf/data/seen_companies.json  # Cross-run dedup file
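
To show what fuzzy company-name matching at an 85% threshold means in practice, here is a sketch using difflib from the standard library. This is a stand-in: the project lists thefuzz for this job, whose scores can differ from difflib's, and the normalization step is an assumption.

```python
from difflib import SequenceMatcher

THRESHOLD = 85  # percent, matching the threshold stated in this README


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'Acme, Inc.' and 'acme inc' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()


def is_duplicate(name: str, seen: list[str], threshold: int = THRESHOLD) -> bool:
    """True if `name` is at least `threshold`% similar to any seen company name."""
    candidate = normalize(name)
    return any(
        SequenceMatcher(None, candidate, normalize(s)).ratio() * 100 >= threshold
        for s in seen
    )


seen = ["Acme Robotics", "Globex Corporation"]
print(is_duplicate("ACME Robotics.", seen))
print(is_duplicate("Initech", seen))
```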

Email Enrichment (optional)

APOLLO_API_KEY=your-master-key         # Apollo.io master key (Settings > Integrations)
APOLLO_ENRICHER_VERSION=v1             # v1=match by name (fast), v2=search by role (smarter)
HUNTER_API_KEY=your-hunter-key         # Hunter.io fallback
FIRECRAWL_API_KEY=your-firecrawl-key   # 3rd-tier crawl fallback

CRM Sync (optional)

GOOGLE_SHEET_ID=1abc...xyz             # From sheet URL: /d/<ID>/edit
GOOGLE_SHEET_TAB=Sheet1                # Tab name at the bottom
GOOGLE_CREDENTIALS_PATH=client_secret.json  # OAuth 2.0 Desktop App JSON

Download credentials from Google Cloud Console → APIs & Services → Credentials → OAuth 2.0 Client IDs.


Project Structure

lead-gen-factory/
├── src/
│   ├── main.py              # CLI assembler: Typer app with all commands
│   ├── config.py            # All settings: single source of truth (never os.environ)
│   ├── commands/            # CLI sub-commands
│   │   ├── init.py          # lgf init wizard
│   │   ├── doctor.py        # lgf doctor health check
│   │   ├── profile.py       # lgf profile add/list/remove
│   │   └── config.py        # lgf config show/set
│   ├── agents/              # LLM-powered pipeline stages (no cross-calls)
│   │   ├── icp_parser.py
│   │   ├── query_generator.py
│   │   ├── snippet_filter.py
│   │   ├── lead_extractor.py
│   │   ├── icp_scorer.py
│   │   └── contact_finder.py
│   ├── pipeline/            # Orchestration layer
│   │   ├── runner.py        # Full pipeline orchestrator (3 passes)
│   │   ├── searcher.py      # Tavily search (async)
│   │   ├── crawler.py       # crawl4ai + Scrapling + Firecrawl fallback
│   │   ├── deduplicator.py  # Fuzzy within-run dedup
│   │   ├── seen_tracker.py  # Cross-run dedup (seen_companies.json)
│   │   └── crm_sync.py      # Google Sheets append
│   ├── schemas/             # Pydantic models
│   │   ├── icp.py
│   │   └── lead.py
│   └── utils/
│       └── output.py        # CSV writer + Rich display helpers
├── tests/                   # Test suite mirroring src/
├── icp_examples/            # Example ICP files
├── docs/                    # Architecture docs (Diataxis)
└── pyproject.toml           # Package definition + lgf entrypoint

Layer rule: Agents never call other agents. The pipeline/runner.py orchestrates all inter-agent communication. See BACKEND_STRUCTURE.md for details.


Tech Stack

| Component | Tool |
|-----------|------|
| Language | Python 3.12 |
| CLI | Typer |
| Web Search | Tavily |
| Web Crawl | crawl4ai → Scrapling → Firecrawl |
| LLM Provider | OpenRouter via OpenAI SDK |
| Default Model | google/gemini-2.0-flash-001 |
| Email Enrichment | Apollo.io + Hunter.io |
| CRM Sync | gspread → Google Sheets |
| Validation | Pydantic v2 |
| Deduplication | thefuzz (Levenshtein) |
| Terminal UI | Rich |
| Async | asyncio + semaphore-controlled concurrency |

Troubleshooting

Run diagnostics first:

lgf doctor

| Problem | Solution |
|---------|----------|
| lgf: command not found | pip install -e . or pipx install . from the repo root |
| Missing API keys | lgf init to reconfigure, or lgf config set KEY VALUE |
| playwright install chromium | Required for crawl4ai; run once after install |
| 0 leads found | Broaden ICP geography/roles; use --dry-run first to check queries |
| Tavily rate limit | Free tier: 1000/month. Use --dry-run to test; avoid re-running the same ICP |
| Google Sheets auth error | Credentials file missing or wrong path; check lgf doctor |
| LLM extraction issues | lgf research --verbose to see prompts and raw responses |
| CRM pre-filter skipped | Warning is shown; pipeline runs normally, and post-run dedup still prevents duplicates |

Development

Setup

git clone https://github.com/Catafal/lead-gen-factory.git
cd lead-gen-factory
python -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install chromium

Workflow

python -m pytest tests/         # Run all tests
python -m pytest tests/unit/    # Unit tests only
lgf doctor                      # Verify API keys and connectivity
lgf research --icp icp_examples/skillia_spain.md --dry-run  # Test query generation

Documentation

Document Description
docs/pipeline/ Full architecture docs (Diataxis)
docs/pipeline/reference/cli-commands.md Complete CLI reference
docs/pipeline/reference/env-vars.md All environment variables
docs/pipeline/explanation/ Architecture rationale
docs/pipeline/adr/ Architecture decision records
icp_examples/ Example ICP files

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

See CONTRIBUTING.md for code style, test requirements, and commit format.


Legal

License

Licensed under CC BY-NC 4.0: free to use and modify, but not for commercial purposes. See LICENSE.

Security

Report vulnerabilities via SECURITY.md.


Built by Jordi Catafal for Skillia
