A machine learning tool that compares traditional rules-based lead scoring with predictive models to demonstrate the ROI of data-driven marketing operations.
Most B2B companies use traditional lead scoring - arbitrary point values assigned to demographic and behavioral attributes:
- Enterprise company = +25 points
- Email click = +5 points
- MQL threshold = 50 points
The issues:
- Arbitrary rules - Point values based on intuition, not data
- Poor accuracy - 35-40% of "qualified" leads never convert
- Wasted sales time - Reps spend hours on leads that won't close
- Missed opportunities - Good leads slip through with low scores
ML models analyze historical conversion patterns to predict which leads actually buy:
| Metric | Traditional Scoring | ML Model | Improvement |
|---|---|---|---|
| Accuracy | 35.3% | 79.0% | +43.7 pts |
| False Positives | 963 bad leads | 77 bad leads | -886 (-92%) |
| Sales Time Wasted | 1,926 hours | 154 hours | 1,772 hours saved |
| Cost Savings | - | $132,900/year | - |
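The wasted-time figures above reduce to simple arithmetic. A minimal sketch, assuming roughly 2 hours of sales effort per false-positive lead at a $75/hour loaded cost (hypothetical rates chosen to be consistent with the table):

```python
HOURS_PER_LEAD = 2    # assumed sales effort spent per false-positive lead
COST_PER_HOUR = 75    # assumed fully loaded hourly cost of a sales rep

def wasted_cost(false_positives):
    """Hours and dollars spent chasing leads that never convert."""
    hours = false_positives * HOURS_PER_LEAD
    return hours, hours * COST_PER_HOUR

trad_hours, trad_cost = wasted_cost(963)  # traditional scoring
ml_hours, ml_cost = wasted_cost(77)       # ML model

print(trad_hours - ml_hours)  # hours saved per year
print(trad_cost - ml_cost)    # dollars saved per year
```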
Creates 5,000 synthetic B2B leads with:
- Firmographics: Company size, industry, source
- Engagement: Email opens, content downloads, webinar attendance
- Conversion outcome: Did they become a customer?
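A minimal sketch of this kind of generator (field names and distributions are illustrative, not the exact ones in `lead_scoring_builder.py`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

leads = pd.DataFrame({
    'lead_id': [f'L{i:06d}' for i in range(n)],
    'company_size': rng.choice(['SMB', 'Mid-Market', 'Enterprise'], n),
    'industry': rng.choice(['Tech', 'Retail', 'Finance', 'Healthcare'], n),
    'email_opens': rng.poisson(3, n),
    'page_views': rng.poisson(5, n),
    'webinar_attended': rng.integers(0, 2, n),
})

# Conversion is loosely driven by engagement, so the ML model has real signal to find
p = 1 / (1 + np.exp(-(0.2 * leads['email_opens'] + 0.15 * leads['page_views'] - 2.5)))
leads['converted'] = (rng.random(n) < p).astype(int)
```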
Implements a typical rules-based scoring model:

```python
score = (company_size_points +
         industry_points +
         email_opens * 2 +
         demo_request * 30)
```

Then trains two ML models for comparison:
- Logistic Regression - Simple, interpretable baseline
- Random Forest - More sophisticated pattern detection
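Training the two models is standard scikit-learn. A sketch using stand-in data (the real pipeline trains on the generated lead features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Stand-in feature matrix and labels; the pipeline builds these from the lead DataFrame
X, y = make_classification(n_samples=5000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print(forest.score(X_test, y_test))  # holdout accuracy
```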
Side-by-side analysis of:
- Accuracy, precision, recall
- ROC curves
- Business impact ($ savings)
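The comparison metrics come from `sklearn.metrics`. A sketch with a stand-in model and data (the pipeline evaluates both models on a holdout set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Stand-in data and model; the real pipeline uses the lead features
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]  # positive-class probability for ROC AUC

metrics = {
    'accuracy': accuracy_score(y_te, pred),
    'precision': precision_score(y_te, pred),
    'recall': recall_score(y_te, pred),
    'roc_auc': roc_auc_score(y_te, proba),
}
```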
Shows which features matter most (often surprising):
Top Predictive Features:
1. page_views (engagement depth)
2. email_opens (interest level)
3. email_clicks (intent)
4. company_size (fit)
5. days_since_first_touch (timing)
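Importances read straight off the fitted Random Forest. A sketch with the feature names above attached to stand-in data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ['page_views', 'email_opens', 'email_clicks',
         'company_size', 'days_since_first_touch']

# Stand-in data; the pipeline fits on the actual lead features
X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

importance = (pd.Series(model.feature_importances_, index=names)
                .sort_values(ascending=False))
print(importance)  # highest-importance features first
```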
| Model | Accuracy | Precision | Recall | ROC AUC |
|---|---|---|---|---|
| Traditional Scoring | 0.353 | 0.289 | 0.950 | 0.572 |
| Logistic Regression | 0.771 | 0.679 | 0.783 | 0.852 |
| Random Forest | 0.790 | 0.710 | 0.788 | 0.871 |
| Metric | Traditional | ML Model | Improvement |
|---|---|---|---|
| False Positives (Bad Leads to Sales) | 963 | 77 | 886 fewer (-92.0%) |
| Wasted Sales Hours | 1,926 | 154 | 1,772 hours saved |
| Wasted Sales Cost | $144,450 | $11,550 | $132,900 saved |
| Total Financial Impact | -$3,430,650 | -$113,550 | $3,317,100 improvement |
Every lead gets both scores for comparison:
| lead_id | company_size | industry | traditional_score | ml_probability | converted |
|---|---|---|---|---|---|
| L000123 | Enterprise | Tech | 78 | 0.85 | 1 |
| L000456 | SMB | Retail | 52 | 0.12 | 0 |
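The side-by-side view is just the model's positive-class probability appended next to the rule-based score. A sketch with stand-in features and placeholder rule scores:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in features and labels; the real pipeline uses the generated leads
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = (X[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

scored = pd.DataFrame({
    'lead_id': [f'L{i:06d}' for i in range(len(X))],
    'traditional_score': rng.integers(0, 100, len(X)),  # placeholder rule score
    'ml_probability': model.predict_proba(X)[:, 1].round(2),
})
scored.to_csv('scored_leads.csv', index=False)
```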
Scenario: Sales complains 60% of MQLs are junk.
Action:
- Run this analysis on your historical data
- Compare traditional vs ML accuracy
- Present ROI to leadership ($130K+ savings)
- Deploy ML scoring in HubSpot/Marketo
Outcome: Reduce false-positive MQLs by 90%, save 1,700+ sales hours/year.
Scenario: CMO asks "which lead sources actually convert?"
Action:
- Train ML model on historical data
- Check feature importance chart
- See which sources predict conversion
- Reallocate budget away from low-value sources
Outcome: Data-driven budget allocation, not guesswork.
Scenario: Sales won't follow up on leads because "marketing's scores are wrong."
Action:
- Show ML model accuracy (79%) vs traditional (35%)
- Demonstrate 92% reduction in bad leads
- Calculate sales time savings (1,700+ hours)
- Get buy-in for implementation
Outcome: Sales actually trusts the scoring model.
- Python 3.8+
- pip
```bash
pip install -r requirements.txt
cd scripts
python lead_scoring_builder.py
```

This will:
- Generate 5,000 synthetic leads
- Apply traditional scoring
- Train ML models (Logistic Regression + Random Forest)
- Evaluate and compare performance
- Calculate business impact
- Export results and visualizations
Runtime: ~30 seconds on a standard laptop.
```
/data/
└── leads.csv                        # Raw lead data

/output/
├── scored_leads.csv                 # All leads with both scores
├── model_performance_metrics.csv    # Accuracy, precision, recall
├── business_impact_comparison.csv   # $ savings analysis
├── roc_curve_comparison.png         # Visual performance comparison
├── feature_importance.png           # What actually predicts conversion
└── score_distributions.png          # Score distributions by outcome

/models/
├── random_forest_model.pkl          # Trained model (reusable)
└── logistic_regression_model.pkl    # Trained model (reusable)
```
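The `.pkl` files are saved with joblib and can be reloaded to score new leads without retraining. A sketch that round-trips a stand-in model (the pipeline saves its own to `/models/`):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; feature columns must match training order when reused
X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
joblib.dump(model, 'random_forest_model.pkl')

# Later: reload and score new leads without retraining
reloaded = joblib.load('random_forest_model.pkl')
proba = reloaded.predict_proba(X[:5])[:, 1]  # conversion probabilities
```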
- roc_curve_comparison.png - Shows the ML model dramatically outperforms traditional scoring (AUC: 0.871 vs 0.572)
- feature_importance.png - Reveals which attributes actually predict conversion (often surprising - e.g., page views matter more than company size)
- score_distributions.png - Shows traditional scoring poorly separates converters from non-converters
Pull from HubSpot:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hubspot import HubSpot

api_client = HubSpot(access_token='your_token')

# Fetch contacts with lifecycle data
contacts = api_client.crm.contacts.basic_api.get_page(
    properties=['lifecyclestage', 'company', 'hs_analytics_source']
)

# Convert to DataFrame
leads_df = pd.DataFrame([
    {
        'company_size': contact.properties.get('company_size'),
        'industry': contact.properties.get('industry'),
        'email_opens': contact.properties.get('hs_email_opens'),
        'converted': 1 if contact.properties.get('lifecyclestage') == 'customer' else 0
    }
    for contact in contacts.results
])

# Train model on YOUR data
X, y, features = prepare_features(leads_df)
model = RandomForestClassifier()
model.fit(X, y)
```

Or pull from Salesforce:

```python
from simple_salesforce import Salesforce

sf = Salesforce(username='user', password='pass', security_token='token')

# Query leads
leads = sf.query_all("""
    SELECT Id, Company, Industry, Email_Opens__c, IsConverted
    FROM Lead
    WHERE CreatedDate >= LAST_YEAR
""")

# Train model on Salesforce data
```

Based on 5,000 synthetic leads:
- Traditional Accuracy: 35.3% (barely better than random)
- ML Accuracy: 79.0% (2.2x better)
- ROC AUC: 0.871 (excellent discrimination)
- 886 fewer false positives (bad leads filtered out)
- 1,772 sales hours saved per year
- $132,900 cost savings in wasted sales time
- $3.3M total impact from better lead qualification
- Engagement depth (page views) - not just one visit
- Email engagement - opens AND clicks matter
- Company fit - size and industry combined
- Timing - days since first touch
- Intent signals - pricing page views, demo requests
| Tool | Purpose |
|---|---|
| Python | Core analysis |
| Pandas | Data manipulation |
| scikit-learn | Machine learning models |
| Matplotlib | Visualizations |
| Joblib | Model persistence |
1. Deploy to Production:
   - Integrate with HubSpot/Marketo API
   - Schedule weekly retraining
   - Set up Slack alerts for model performance
2. Advanced Features:
   - Multi-class scoring (MQL, SQL, Opportunity tiers)
   - Real-time scoring via API endpoint
   - A/B test ML vs traditional in production
3. Expand Scope:
   - Predict deal size (regression)
   - Predict time to close
   - Churn prediction for customers