(In development) A comprehensive machine learning system for detecting scam phone calls in real time using LLM-derived behavioral features combined with natural language processing techniques. Classifies calls as legitimate or fraudulent, using their transcripts as the raw input.
This research project, developed by the Perception, Control and Cognition Lab (BYU-PCCL), aims to automatically identify scam phone calls by analyzing conversation patterns, behavioral cues, and linguistic features in call transcripts. The system uses advanced NLP techniques and machine learning models to detect common scam tactics and patterns.
- Multi-source Data Processing: Handles transcripts from YouTube scam calls, legitimate call datasets, and real-world call recordings
- LLM-powered Feature Extraction: Uses ChatGPT and Gemini models to extract behavioral and linguistic features
- Behavioral Analysis: Identifies pressure tactics, urgency patterns, information requests, and authority impersonation
- Automated Transcription: Converts audio files to text using state-of-the-art transcription models
- Docker Support: Containerized environment for reproducible results
- Rate Limiting: Built-in rate limiting for API calls to LLM services
The project integrates multiple datasets:
- YouTube Scam Calls: 243+ transcripts from scam baiting videos and real scam calls
- Candor Dataset: Legitimate phone call recordings for comparison
- Switchboard Dataset: Standard conversational telephone speech corpus
- Thai Call Center Dataset: Additional call center conversation data
- Internet Search Calls: Curated collection of scam call examples
- (and more)
scam-call-identification/
├── src/ # Core source code
│   ├── data_processing/ # Data ingestion, preprocessing
│   ├── general_file_utils/ # File handling utils
│   ├── llm_tools/ # LLM integration and feature extraction
│   ├── ml_scam_classification/ # Main ML classification components
│   │   ├── data/ # Processed datasets
│   │   ├── models/ # Model definitions and utils
│   │   ├── prompting/ # LLM prompts and feature definitions
│   │   ├── settings/ # Configuration files
│   │   └── utils/ # Utility functions
│   └── rate_limits/ # API rate limiting
├── scripts/ # Executable scripts
│   ├── ETL/ # Extract, Transform, Load operations
│   ├── EDA/ # Exploratory Data Analysis
│   ├── feature_engineering/ # Feature extraction scripts
│   └── generating-synthetic-calls/ # Synthetic data generation
├── outputs/ # Generated results and models
├── Dockerfile # Container configuration
├── requirements.txt # Python dependencies
└── build_image_docker_scams.sh # Docker build script
- Python 3.11+
- Docker (optional, for containerized deployment)
- API keys for OpenAI GPT and Google Gemini (for feature extraction)
- Clone the repository:
git clone https://github.com/BYU-PCCL/scam-call-identification.git
cd scam-call-identification
- Set up Python environment:
python -m venv scams_env
# On Windows:
.\scams_env\Scripts\activate
# On Linux/Mac:
source scams_env/bin/activate
- Install dependencies:
pip install -r requirements.txt
Build and run using Docker:
chmod +x build_image_docker_scams.sh
./build_image_docker_scams.sh
- Create an api_keys folder in the project root
- Add your API keys for:
  - OpenAI GPT models
  - Google Gemini models
- Configure rate limits in the src/rate_limits/ directory
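As a rough illustration of how a script might read a key from the api_keys folder, here is a minimal sketch. The one-file-per-provider layout (`openai.txt`, `gemini.txt`) and the `load_api_key` helper are assumptions made for this example; check the code in src/llm_tools/ for the names the project actually expects.

```python
from pathlib import Path


def load_api_key(provider: str, keys_dir: Path = Path("api_keys")) -> str:
    """Read one provider's key from <keys_dir>/<provider>.txt.

    The file naming scheme is hypothetical, not taken from the project.
    """
    key_file = keys_dir / f"{provider}.txt"
    if not key_file.is_file():
        raise FileNotFoundError(f"Expected an API key file at {key_file}")
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} is empty")
    return key
```

Keeping keys in files outside version control (and listing the folder in .gitignore) avoids committing secrets by accident.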
Important: Always run scripts as modules to ensure proper imports:
python -m scripts.feature_engineering.run_chatgpt_behavioral_analysis
python -m scripts.ETL.aggregate_all_transcripts
python -m scripts.EDA.compiled.inspect_compiled_transcripts
Extract behavioral features from call transcripts:
python -m scripts.feature_engineering.run_chatgpt_behavioral_analysis [prompt_path] [continuation_prompt_path]
Process raw audio files and generate transcripts:
python -m scripts.ETL.transcribe_audio
python -m scripts.ETL.transform_parquet_to_audio
The system analyzes calls across multiple behavioral dimensions:
- Pressure & Urgency: Detecting time pressure and fear tactics
- Information Elicitation: Identifying requests for sensitive data
- Authority Impersonation: Recognizing false authority claims
- Financial Request Patterns: Detecting payment solicitations
- Conversation Flow: Analyzing dialogue patterns
- Scam-Specific Signatures: Identifying known scam types
The system classifies calls in three stages:
- Feature Extraction: LLM-based behavioral feature extraction
- Classification: Traditional ML models trained on extracted features
- Validation: Cross-validation on multiple datasets
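The three stages above can be sketched end to end in plain Python. This is an illustration only: the README does not say which traditional ML models are used, so a nearest-centroid classifier stands in for them, and the feature vectors are assumed to be numeric scores already extracted by the LLM step. All function names here are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit(features, labels):
    """Compute one centroid per class label (stand-in for a real ML model)."""
    return {
        label: centroid([f for f, y in zip(features, labels) if y == label])
        for label in set(labels)
    }


def predict(model, vector):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vector, center))
    return min(model, key=lambda label: dist(model[label]))


def cross_validate(features, labels, k=3):
    """Mean accuracy over k folds; fold i holds out every k-th sample from index i.

    Assumes there are at least k samples so that no fold is empty.
    """
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(features), k))
        train = [
            (f, y)
            for j, (f, y) in enumerate(zip(features, labels))
            if j not in test_idx
        ]
        model = fit([f for f, _ in train], [y for _, y in train])
        hits = [predict(model, features[j]) == labels[j] for j in sorted(test_idx)]
        scores.append(sum(hits) / len(hits))
    return sum(scores) / len(scores)
```

In practice a library such as scikit-learn would replace the hand-written classifier and fold logic; the sketch only shows how extracted features, a classifier, and held-out evaluation fit together.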
The system analyzes calls across 9+ behavioral categories:
- Pressure & Urgency Tactics
- Information Elicitation Patterns
- False Authority & Impersonation
- True Authority & Legitimacy
- Financial Request Patterns
- Conversation Flow & Meta-Communication
- Question Patterns & Information Seeking
- Response Patterns & Compliance
- Scam Signature Behaviors
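One way to turn per-category LLM output into classifier input is a fixed-order numeric vector. The sketch below assumes the LLM replies with a flat JSON object holding one numeric score per category. The snake_case keys are hypothetical renderings of the list above, and the real schema is defined in the project's feature definition files.

```python
import json

# Hypothetical keys derived from the category list above;
# the project's actual feature names may differ.
CATEGORIES = [
    "pressure_urgency",
    "information_elicitation",
    "false_authority",
    "true_authority",
    "financial_requests",
    "conversation_flow",
    "question_patterns",
    "response_patterns",
    "scam_signatures",
]


def to_feature_vector(llm_output: str) -> list[float]:
    """Parse an LLM JSON reply into a fixed-order vector, one score per category."""
    scores = json.loads(llm_output)
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"LLM output is missing categories: {missing}")
    return [float(scores[c]) for c in CATEGORIES]
```

Failing loudly on a missing category keeps malformed LLM replies from silently producing misaligned feature vectors.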
- Multi-LLM Analysis: Compares results from different language models
- Temporal Analysis: Tracks behavioral patterns over conversation duration
- Synthetic Data Generation: Creates training data using demographic models
- Rate-Limited Processing: Manages API calls efficiently
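Rate-limited processing can be illustrated with a sliding-window limiter. This is a sketch of the general technique, not the code in src/rate_limits/; the class name and parameters are hypothetical, and real limits depend on your OpenAI and Gemini account tiers.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `period` seconds."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._stamps = deque()  # times of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period - (now - self._stamps[0])

    def acquire(self, sleep=time.sleep) -> None:
        """Block until a call slot is free, then record the call."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
        self._stamps.append(self.clock())
```

Calling `limiter.acquire()` before each LLM request keeps a batch job under the configured limit. The injectable `clock` and `sleep` arguments exist so the limiter can be tested without real waiting.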
Prompts are versioned and stored in src/ml_scam_classification/prompting/:
- features.json / features_v2.json: Behavioral feature definitions
- prompt_conner_v*.txt: Main analysis prompts
- prompt_*_contd.txt: Continuation prompts for long conversations
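To show how a continuation prompt might be used for a long conversation, the sketch below splits a transcript into size-bounded chunks and pairs the first chunk with the main prompt and each later chunk with the continuation prompt. The chunking rule, the 4000-character default, and both function names are assumptions; the project's actual splitting logic may differ.

```python
def chunk_transcript(lines: list[str], max_chars: int) -> list[str]:
    """Group transcript lines into chunks of at most `max_chars` characters.

    A single line longer than `max_chars` becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        added = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(line)
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks


def build_messages(lines, main_prompt, continuation_prompt, max_chars=4000):
    """Use the main prompt for the first chunk and the continuation prompt after it."""
    chunks = chunk_transcript(lines, max_chars)
    return [
        f"{main_prompt if i == 0 else continuation_prompt}\n\n{chunk}"
        for i, chunk in enumerate(chunks)
    ]
```

Splitting on line boundaries keeps each speaker turn intact, which matters when the features describe conversational behavior.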
Configuration files in src/ml_scam_classification/settings/:
- global_settings.py: Global configuration
- supported_transcription_models.json: Available transcription models
- Rate limiting configurations
The project includes comprehensive testing utilities:
- File validation: Ensures data integrity
- JSON validation: Validates structured outputs
- Path validation: Confirms file system operations
- Model validation: Tests classification performance
The system generates:
- JSON Feature Files: Structured behavioral analysis results
- CSV Reports: Aggregated classification results
- Transcription Files: Processed audio-to-text conversions
- Model Artifacts: Trained classification models
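The relationship between the JSON feature files and the CSV reports can be sketched as a small aggregation step. This example assumes each JSON file holds a flat object of feature names and values, and it uses the file name as a call identifier; both are assumptions for illustration, as is the `aggregate_features` name.

```python
import csv
import json
from pathlib import Path


def aggregate_features(json_dir: Path, csv_path: Path) -> int:
    """Write one CSV row per JSON feature file; return the number of rows written."""
    records = []
    for path in sorted(json_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        records.append({"call_id": path.stem, **record})

    # Use the union of keys so files with differing features still fit one table.
    fieldnames = ["call_id"] + sorted({k for r in records for k in r} - {"call_id"})
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Features missing from a given file are written as empty cells, so downstream analysis can tell "not extracted" apart from a score of zero.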
This is a research project. For contributions:
- Follow the existing code structure
- Use the module-based import system
- Add appropriate error handling and validation
- Include comprehensive documentation
- Test with multiple datasets
This project is licensed under the MIT License - see the LICENSE file for details.
Brigham Young University - Perception, Control and Cognition Lab (BYU-PCCL)
If you use this work in your research, please cite:
@software{scam_call_identification,
title={Scam Call Identification System},
author={BYU Perception, Control and Cognition Lab},
year={2025},
url={https://github.com/BYU-PCCL/scam-call-identification}
}
Once fully developed, this system could be applied to:
- Telecommunications Security: Real-time scam call detection
- Consumer Protection: Educational tools for scam awareness
- Law Enforcement: Analysis of fraud patterns
- Academic Research: Study of deceptive communication patterns
- Industry Applications: Call center quality assurance
For questions, issues, or research collaboration opportunities, please open an issue or contact the BYU-PCCL team.