BYU-PCCL/scam-call-identification

Scam Call Identification

(In development) A machine learning system for detecting scam phone calls in real time. It takes call transcripts as raw input, combines LLM-derived behavioral features with natural language processing techniques, and classifies each call as legitimate or fraudulent.

Project Overview

This research project, developed by the Perception, Control and Cognition Lab (BYU-PCCL), aims to automatically identify scam phone calls by analyzing conversation patterns, behavioral cues, and linguistic features in call transcripts. The system uses LLM-based feature extraction and machine learning classifiers to detect common scam tactics such as pressure, urgency, and authority impersonation.

Key Features

  • Multi-source Data Processing: Handles transcripts from YouTube scam calls, legitimate call datasets, and real-world call recordings
  • LLM-powered Feature Extraction: Uses ChatGPT and Gemini models to extract behavioral and linguistic features
  • Behavioral Analysis: Identifies pressure tactics, urgency patterns, information requests, and authority impersonation
  • Automated Transcription: Converts audio files to text using state-of-the-art transcription models
  • Docker Support: Containerized environment for reproducible results
  • Rate Limiting: Built-in rate limiting for API calls to LLM services

📊 Dataset Sources

The project integrates multiple datasets:

  • YouTube Scam Calls: 243+ transcripts from scam baiting videos and real scam calls
  • Candor Dataset: Legitimate phone call recordings for comparison
  • Switchboard Dataset: Standard conversational telephone speech corpus
  • Thai Call Center Dataset: Additional call center conversation data
  • Internet Search Calls: Curated collection of scam call examples
  • (and more)

Project Structure

scam-call-identification/
├── src/                              # Core source code
│   ├── data_processing/              # Data ingestion, preprocessing
│   ├── general_file_utils/           # File handling utils
│   ├── llm_tools/                    # LLM integration and feature extraction
│   ├── ml_scam_classification/       # Main ML classification components
│   │   ├── data/                     # Processed datasets
│   │   ├── models/                   # Model definitions and utils
│   │   ├── prompting/                # LLM prompts and feature definitions
│   │   ├── settings/                 # Configuration files
│   │   └── utils/                    # Utility functions
│   └── rate_limits/                  # API rate limiting
├── scripts/                          # Executable scripts
│   ├── ETL/                          # Extract, Transform, Load operations
│   ├── EDA/                          # Exploratory Data Analysis
│   ├── feature_engineering/          # Feature extraction scripts
│   └── generating-synthetic-calls/   # Synthetic data generation
├── outputs/                          # Generated results and models
├── Dockerfile                        # Container configuration
├── requirements.txt                  # Python dependencies
└── build_image_docker_scams.sh       # Docker build script

Using the Repo

Prerequisites

  • Python 3.11+
  • Docker (optional, for containerized deployment)
  • API keys for OpenAI GPT and Google Gemini (for feature extraction)

Installation

  1. Clone the repository:

    git clone https://github.com/BYU-PCCL/scam-call-identification.git
    cd scam-call-identification

  2. Set up Python environment:

    python -m venv scams_env
    # On Windows:
    .\scams_env\Scripts\activate
    # On Linux/Mac:
    source scams_env/bin/activate

  3. Install dependencies:

    pip install -r requirements.txt

Docker Setup (Alternative)

Build and run using Docker:

chmod +x build_image_docker_scams.sh
./build_image_docker_scams.sh

API Configuration

  1. Create an api_keys folder in the project root
  2. Add your API keys for:
    • OpenAI GPT models
    • Google Gemini models
  3. Configure rate limits in the src/rate_limits/ directory
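
To illustrate what rate limiting for LLM API calls involves, here is a minimal sliding-window sketch. It is not the code in src/rate_limits/, whose contents are not shown here; the class name, its parameters, and the injectable clock and sleep functions are all assumptions made for illustration.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Hypothetical helper: allow at most `max_calls` calls in any `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def acquire(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
```

Calling `acquire()` immediately before each API request keeps the request rate under the configured limit. The actual limits depend on the provider and account tier, so no numbers are suggested here.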

Usage

Running Scripts

Important: Always run scripts as modules to ensure proper imports:

python -m scripts.feature_engineering.run_chatgpt_behavioral_analysis
python -m scripts.ETL.aggregate_all_transcripts
python -m scripts.EDA.compiled.inspect_compiled_transcripts

Feature Extraction

Extract behavioral features from call transcripts:

python -m scripts.feature_engineering.run_chatgpt_behavioral_analysis [prompt_path] [continuation_prompt_path]
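
The square brackets suggest that both paths are optional positional arguments. The sketch below shows one way a script could accept them with `argparse`. The function name and the `None` defaults are assumptions, and the real script may parse its arguments differently.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse two optional positional paths (hypothetical interface)."""
    parser = argparse.ArgumentParser(
        description="Extract behavioral features from call transcripts."
    )
    parser.add_argument(
        "prompt_path", nargs="?", type=Path, default=None,
        help="Main analysis prompt file (assumed optional).",
    )
    parser.add_argument(
        "continuation_prompt_path", nargs="?", type=Path, default=None,
        help="Continuation prompt for long conversations (assumed optional).",
    )
    return parser.parse_args(argv)
```

With this shape, omitting both arguments leaves the script free to fall back to whatever default prompts it defines.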

Data Processing

Process raw audio files and generate transcripts:

python -m scripts.ETL.transcribe_audio
python -m scripts.ETL.transform_parquet_to_audio

Behavioral Analysis

The system analyzes calls across multiple behavioral dimensions:

  1. Pressure & Urgency: Detecting time pressure and fear tactics
  2. Information Elicitation: Identifying requests for sensitive data
  3. Authority Impersonation: Recognizing false authority claims
  4. Financial Request Patterns: Detecting payment solicitations
  5. Conversation Flow: Analyzing dialogue patterns
  6. Scam-Specific Signatures: Identifying known scam types

📈 Model Performance

The system classifies calls in two stages and validates the result across datasets:

  • Feature Extraction: LLM-based behavioral feature extraction
  • Classification: Traditional ML models trained on extracted features
  • Validation: Cross-validation on multiple datasets
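
The classification and validation steps above can be sketched in plain Python. This README does not name the models the project trains, so a nearest-centroid classifier stands in for them here. All function names are hypothetical, and `X` is assumed to be a list of numeric feature vectors with one label per row in `y`.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in model: the per-label mean of each feature column."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies
```

Because the classifier is decoupled from feature extraction, a stronger model can replace `fit_centroids` and `predict` without touching the LLM stage.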

🔬 Research Features

Behavioral Feature Categories

The system analyzes calls across 9+ behavioral categories:

  1. Pressure & Urgency Tactics
  2. Information Elicitation Patterns
  3. False Authority & Impersonation
  4. True Authority & Legitimacy
  5. Financial Request Patterns
  6. Conversation Flow & Meta-Communication
  7. Question Patterns & Information Seeking
  8. Response Patterns & Compliance
  9. Scam Signature Behaviors
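
To show how per-category LLM output could feed a classifier, the sketch below maps a JSON object of category scores onto a fixed-order numeric vector. The key names and the one-score-per-category schema are assumptions for illustration. The project's actual feature definitions live in features.json / features_v2.json and may be finer-grained.

```python
import json

# Hypothetical keys, one per category listed above, in the same order.
CATEGORIES = [
    "pressure_urgency",
    "information_elicitation",
    "false_authority",
    "true_authority",
    "financial_request",
    "conversation_flow",
    "question_patterns",
    "response_patterns",
    "scam_signature",
]


def to_feature_vector(llm_json):
    """Convert a JSON object of category scores into a fixed-order list of floats.

    Categories missing from the LLM output default to 0.0.
    """
    scores = json.loads(llm_json)
    return [float(scores.get(name, 0.0)) for name in CATEGORIES]
```

A fixed ordering matters because traditional ML models expect every transcript to produce a vector of the same length and layout.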

Advanced Features

  • Multi-LLM Analysis: Compares results from different language models
  • Temporal Analysis: Tracks behavioral patterns over conversation duration
  • Synthetic Data Generation: Creates training data using demographic models
  • Rate-Limited Processing: Manages API calls efficiently

📝 Configuration

Prompt Engineering

Prompts are versioned and stored in src/ml_scam_classification/prompting/:

  • features.json / features_v2.json: Behavioral feature definitions
  • prompt_conner_v*.txt: Main analysis prompts
  • prompt_*_contd.txt: Continuation prompts for long conversations
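
Since prompts are versioned by filename, a script may want the newest main prompt. The sketch below assumes the `*` in prompt_conner_v*.txt is a plain integer, which the README does not confirm, and `latest_prompt` is a hypothetical helper rather than a function from the repository.

```python
import re
from pathlib import Path

PROMPT_DIR = Path("src/ml_scam_classification/prompting")
# Assumption: the version suffix is an integer, e.g. prompt_conner_v3.txt.
_VERSION = re.compile(r"^prompt_conner_v(\d+)\.txt$")


def latest_prompt(prompt_dir=PROMPT_DIR):
    """Return the main prompt file with the highest version number, or None."""
    candidates = []
    for path in Path(prompt_dir).glob("prompt_conner_v*.txt"):
        match = _VERSION.match(path.name)
        if match:  # skips continuation prompts such as *_contd.txt
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing the version as an integer avoids the string-sort pitfall where v10 would sort before v2.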

Settings

Configuration files in src/ml_scam_classification/settings/:

  • global_settings.py: Global configuration
  • supported_transcription_models.json: Available transcription models
  • Rate limiting configurations

🧪 Testing and Validation

The project includes comprehensive testing utilities:

  • File validation: Ensures data integrity
  • JSON validation: Validates structured outputs
  • Path validation: Confirms file system operations
  • Model validation: Tests classification performance
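
As an illustration of the JSON validation step, here is a minimal check that an LLM response parses and contains the expected keys. The function name and the error-message format are assumptions, and the repository's own validators may check more (for example, value types or ranges).

```python
import json


def validate_feature_json(text, required_keys):
    """Return a list of problems found in `text`; an empty list means it passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    return [f"missing key: {key}" for key in required_keys if key not in data]
```

Returning a list of problems instead of raising on the first one makes it easy to log every issue with a response before deciding whether to retry the LLM call.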

📊 Output Formats

The system generates:

  • JSON Feature Files: Structured behavioral analysis results
  • CSV Reports: Aggregated classification results
  • Transcription Files: Processed audio-to-text conversions
  • Model Artifacts: Trained classification models
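
The sketch below shows one way per-transcript JSON files could be rolled up into a CSV. It assumes each JSON file holds a single flat object and that the file stem identifies the transcript. Neither is confirmed by this README, and `write_report` and the `transcript_id` column are hypothetical names.

```python
import csv
import json
from pathlib import Path


def write_report(json_dir, csv_path):
    """Merge flat JSON feature files into one CSV and return the row count."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        record["transcript_id"] = path.stem  # assumed identifier
        rows.append(record)
    # Use the union of all keys so files with different features still line up.
    feature_names = sorted({key for row in rows for key in row} - {"transcript_id"})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["transcript_id"] + feature_names, restval=""
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Features missing from a given file are written as empty cells, which keeps the columns aligned across transcripts.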

🤝 Contributing

This is a research project. For contributions:

  1. Follow the existing code structure
  2. Use the module-based import system
  3. Add appropriate error handling and validation
  4. Include comprehensive documentation
  5. Test with multiple datasets

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ›οΈ Institution

Brigham Young University - Perception, Control and Cognition Lab (BYU-PCCL)

📚 Citation

If you use this work in your research, please cite:

@software{scam_call_identification,
  title={Scam Call Identification System},
  author={BYU Perception, Control and Cognition Lab},
  year={2025},
  url={https://github.com/BYU-PCCL/scam-call-identification}
}

Research Applications

Once fully developed, this system could be applied to:

  • Telecommunications Security: Real-time scam call detection
  • Consumer Protection: Educational tools for scam awareness
  • Law Enforcement: Analysis of fraud patterns
  • Academic Research: Study of deceptive communication patterns
  • Industry Applications: Call center quality assurance

For questions, issues, or research collaboration opportunities, please open an issue or contact the BYU-PCCL team.
