A lightweight, deployment-ready machine learning system for detecting malicious URLs and domains at the IoT gateway edge. The system achieves 72.30% accuracy with a memory footprint of only 3.5 MB and an end-to-end latency of 10.50 ms, making it suitable for resource-constrained edge devices.
URL Input → Feature Extraction → Normalization → Random Forest → Decision
(0 ms)      (3.19 ms)            (0 ms)          (7.31 ms)       (10.50 ms total)
See detailed architecture documentation for component specifications and deployment modes.
Traditional cloud-based URL detection systems introduce unacceptable latency and privacy concerns for IoT deployments. This project implements a complete Edge-AI solution that processes URLs locally at the gateway level, providing real-time threat detection without requiring cloud connectivity.
- Real-time malicious URL detection at IoT gateways
- Lightweight Random Forest model (1.8 MB model size, 3.5 MB memory usage)
- Fast inference (7.31 ms model prediction, 10.50 ms end-to-end)
- 31-dimensional feature engineering framework
- Containerized deployment with Docker support
- RESTful API for easy integration
- Comprehensive evaluation across 5 ML algorithms
- Production-ready with monitoring and logging
| Metric | Value |
|---|---|
| Accuracy | 72.30% |
| F1-Score | 0.7089 |
| Precision | 0.7156 |
| Recall | 0.7023 |
| Model Prediction Time | 7.31 ms |
| Feature Extraction Time | 3.19 ms |
| End-to-End Latency | 10.50 ms |
| Memory Footprint | 3.5 MB |
| Model Size | 1.8 MB |
| Throughput | 137 samples/sec |
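The derived rows in this table can be cross-checked against the others. The sketch below recomputes the F1-score from precision and recall, and the end-to-end latency from its two stages. It assumes F1 is the usual harmonic mean and that normalization adds negligible time, as the pipeline timings suggest.

```python
# Values copied from the metrics table above.
precision = 0.7156
recall = 0.7023
feature_extraction_ms = 3.19
model_prediction_ms = 7.31

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# End-to-end latency is the sum of the two timed pipeline stages.
end_to_end_ms = feature_extraction_ms + model_prediction_ms

print(f"F1-score: {f1:.4f}")
print(f"End-to-end latency: {end_to_end_ms:.2f} ms")
```

Both results agree with the reported 0.7089 and 10.50 ms after rounding.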
The system consists of four main components:
- Feature Extraction Module: Extracts 31 features from URLs, including lexical patterns, DNS metadata, SSL certificate information, and domain registration data
- ML Detection Engine: Random Forest classifier optimized for edge deployment with minimal resource requirements
- RESTful API: FastAPI-based service exposing prediction endpoints for integration with network security infrastructure
- Monitoring Stack: Prometheus metrics collection and logging for production deployment
- Python 3.8 or higher
- Docker and Docker Compose (for containerized deployment)
- 512 MB RAM minimum (recommended: 1 GB)
- Linux-based system (tested on Ubuntu 20.04+)
- Clone the repository:
git clone https://github.com/Huy-VNNIC/Edge-AI-URL-Detection.git
cd Edge-AI-URL-Detection
- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Download datasets and build features:
python scripts/build_dataset.py
python scripts/extract_features.py
- Train the model:
python scripts/train_model.py

from src.models.predictor import URLPredictor
# Initialize predictor
predictor = URLPredictor(model_path='models/rf_model.joblib')
# Predict single URL
url = "http://suspicious-domain.com/malware.exe"
result = predictor.predict(url)
print(f"Malicious probability: {result['probability']:.2f}")
print(f"Prediction: {result['prediction']}")
# Batch prediction
urls = ["http://example.com", "http://phishing-site.ru/login"]
results = predictor.predict_batch(urls)

Start the API server:
python src/api/main.py

Make predictions via HTTP:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"url": "http://suspicious-domain.com/malware.exe"}'

Response:
{
  "url": "http://suspicious-domain.com/malware.exe",
  "prediction": "malicious",
  "probability": 0.87,
  "processing_time_ms": 10.2,
  "features": {
    "url_length": 45,
    "entropy": 3.42,
    "has_ip": false
  }
}

Deploy the complete system with Docker Compose:
docker-compose up -d

This starts:
- API service on port 8000
- Prometheus metrics on port 9090
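Once the API service is up, the /predict endpoint can also be called from Python. The following is a minimal standard-library sketch. The helper names `build_predict_request` and `predict` are illustrative and not part of the repository, and the request and response shapes are assumed to match the curl example in this README.

```python
import json
import urllib.request

# Assumption: the API listens on port 8000 as described in this README.
API_BASE = "http://localhost:8000"


def build_predict_request(url, base=API_BASE):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, base=API_BASE, timeout=5.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(url, base)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (requires a running API service):
# result = predict("http://suspicious-domain.com/malware.exe")
# print(result["prediction"], result["probability"])
```

Separating request construction from sending keeps the network call in one place, which makes the client easier to test without a live server.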
Check service health:
curl http://localhost:8000/health

Edit config/config.yaml to customize system behavior:
model:
  type: "random_forest"
  path: "models/rf_model.joblib"
  threshold: 0.5
api:
  host: "0.0.0.0"
  port: 8000
  workers: 4
feature_extraction:
  timeout: 5
  dns_resolution: true
  ssl_verification: true
logging:
  level: "INFO"
  file: "logs/detection.log"

Edge-AI-URL-Detection/
├── config/                               # Configuration files
│   └── config.yaml                       # System configuration
│
├── datasets/                             # Dataset storage (not in git)
│   ├── original/                         # Raw datasets
│   │   ├── cic_trap4phish/               # CIC Trap4Phish 2025
│   │   ├── malicious_domain_features/    # 12-feature domain dataset
│   │   ├── malicious_phish_dataset/      # Malicious phish URLs
│   │   ├── base_json/                    # Base JSON datasets
│   │   ├── CSV_benign.csv                # Benign URL samples
│   │   ├── CSV_malware.csv               # Malware URL samples
│   │   ├── CSV_phishing.csv              # Phishing URL samples
│   │   └── CSV_spam.csv                  # Spam URL samples
│   └── processed/                        # Processed features and splits
│       ├── features/                     # Extracted feature sets
│       └── splits/                       # Train/validation/test splits
│
├── data/                                 # Working data directories
│   ├── ablation/                         # Ablation study results
│   ├── hashed_domain_names/
│   ├── processed/                        # Processing outputs
│   ├── regular_domain_names/
│   └── splits/                           # Dataset splits
│
├── deployment/                           # Deployment configurations
│
├── docker/                               # Docker containerization
│   ├── Dockerfile.api                    # API service container
│   ├── Dockerfile.processor              # Processing service container
│   └── prometheus.yml                    # Prometheus monitoring config
│
├── docs/                                 # Documentation
│   └── architecture.md                   # System architecture details
│
├── models/                               # Trained ML models
│   ├── rf_model.joblib                   # Random Forest model
│   ├── rf_scaler.joblib                  # Feature scaler
│   ├── rf_features.txt                   # Feature names
│   └── test/                             # Test models
│
├── notebooks/                            # Jupyter notebooks for analysis
│
├── paper/                                # FJCAI 2026 Research Paper
│   ├── latex/                            # LaTeX source files
│   └── img/                              # Paper figures
│
├── reports/                              # Evaluation reports and metrics
│   ├── api_benchmark_results.json
│   ├── evaluation_results.json
│   ├── paper_metrics_complete.json
│   └── cv_evaluation/                    # Cross-validation results
│
├── scripts/                              # Training and evaluation scripts
│   ├── build_dataset.py                  # Dataset construction
│   ├── extract_features.py               # Feature extraction
│   ├── train_model.py                    # Model training
│   ├── evaluate_system.py                # System evaluation
│   ├── comprehensive_cv.py               # Cross-validation
│   ├── api_benchmark.py                  # API performance test
│   ├── feature_importance_analysis.py
│   └── ablation_study.py                 # Ablation experiments
│
├── src/                                  # Source code
│   ├── api/                              # REST API implementation (FastAPI)
│   ├── data/                             # Data processing pipelines
│   ├── features/                         # Feature extraction modules
│   ├── models/                           # ML model implementations
│   └── utils/                            # Utility functions
│
├── tests/                                # Unit and integration tests
│
├── docker-compose.yml                    # Docker Compose configuration
├── requirements.txt                      # Python dependencies
├── .gitignore                            # Git ignore patterns
├── LICENSE                               # MIT License
└── README.md                             # This file
The system implements a 31-dimensional feature framework:
- URL length, path length, query length
- Character entropy
- Digit ratio, special character ratio
- Presence of IP address
- URL structure indicators
- Domain age
- Registration period
- TLD characteristics
- WHOIS information
- Domain reputation metrics
- DNS query response time
- NXDOMAIN ratio
- TTL values
- A record count
- Fast-flux indicators
- Certificate validity period
- Certificate age
- Issuer information
- Subject Alternative Names count
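Several of the lexical features listed above can be computed from the URL string alone. The sketch below is illustrative and is not the repository's extractor: `url_length`, `entropy`, and `has_ip` follow the field names in the sample API response, while the remaining keys and the exact definitions (for example, treating every non-alphanumeric character as "special") are assumptions.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(host):
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute a handful of lexical features for one URL."""
    parts = urlsplit(url)
    length = len(url) or 1  # avoid division by zero on empty input
    return {
        "url_length": len(url),
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        "entropy": shannon_entropy(url),
        "digit_ratio": sum(c.isdigit() for c in url) / length,
        "special_char_ratio": sum(not c.isalnum() for c in url) / length,
        "has_ip": host_is_ip(parts.hostname or ""),
    }


print(lexical_features("http://192.168.0.1/a?b=1"))
```

The DNS, WHOIS, and SSL features need network lookups and are not covered by this sketch.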
The system was evaluated with 5 different ML algorithms:
| Model | Accuracy | F1-Score | Prediction Time | Memory | Throughput |
|---|---|---|---|---|---|
| Random Forest | 72.30% | 0.7089 | 7.31 ms | 3.5 MB | 137 samples/s |
| Logistic Regression | 72.37% | 0.7065 | 2.45 ms | 0.8 MB | 408 samples/s |
| Neural Network | 72.03% | 0.7050 | 3.12 ms | 1.5 MB | 321 samples/s |
| SVM | 71.80% | 0.6875 | 8.95 ms | 1.2 MB | 112 samples/s |
| XGBoost | 63.57% | 0.4941 | 4.28 ms | 2.1 MB | 234 samples/s |
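Random Forest was selected for deployment because it has the highest F1-score of the five models (0.7089), stays within 0.07 percentage points of the best accuracy (Logistic Regression, 72.37%), and offers good interpretability. The trade-off is a larger memory footprint (3.5 MB) and slower prediction (7.31 ms) than Logistic Regression or the Neural Network, both of which remain small enough for edge deployment.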
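The throughput column in the comparison table appears to follow directly from the per-sample prediction time, assuming samples are scored one at a time: throughput is 1000 ms divided by the prediction time. This relationship is inferred from the numbers rather than stated by the evaluation scripts. A short check:

```python
# Prediction times (ms) copied from the model comparison table.
PREDICTION_MS = {
    "Random Forest": 7.31,
    "Logistic Regression": 2.45,
    "Neural Network": 3.12,
    "SVM": 8.95,
    "XGBoost": 4.28,
}


def throughput_per_second(prediction_ms):
    """Samples per second when predictions run sequentially, one at a time."""
    return 1000.0 / prediction_ms


for model, ms in PREDICTION_MS.items():
    print(f"{model}: {throughput_per_second(ms):.0f} samples/s")
```

Rounded to whole samples, each result matches the corresponding table entry.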
Minimum:
- CPU: ARM Cortex-A53 or equivalent
- RAM: 512 MB
- Storage: 10 MB (model + dependencies)
Recommended:
- CPU: ARM Cortex-A72 or x86-64
- RAM: 1 GB
- Storage: 50 MB
- Raspberry Pi 4 (4GB RAM)
- NVIDIA Jetson Nano
- AWS EC2 t3.micro
- Ubuntu 20.04 LTS / 22.04 LTS
- Docker containers
- DNS Resolver Integration: Inline filtering at DNS level
- Firewall Mode: Block/alert based on detection scores
- SIEM Integration: Forward detections to Splunk/ELK
- Proxy Mode: HTTP/HTTPS traffic inspection
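For score-based modes such as the firewall mode, the malicious probability has to be mapped to an action. The sketch below shows one possible policy. The `decide` function and the 0.9 block threshold are hypothetical and not part of this project; only the 0.5 alert threshold mirrors the `threshold` value in config/config.yaml.

```python
def decide(probability, alert_threshold=0.5, block_threshold=0.9):
    """Map a malicious probability to 'allow', 'alert', or 'block'.

    Both thresholds are illustrative defaults; tune them per deployment.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= block_threshold:
        return "block"
    if probability >= alert_threshold:
        return "alert"
    return "allow"


for score in (0.20, 0.87, 0.95):
    print(f"{score:.2f} -> {decide(score)}")
```

A separate, higher block threshold is one way to limit the impact of false positives, which matters at the accuracy level reported here.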
| Configuration | Accuracy | F1-Score | Feature Count | Extraction Time |
|---|---|---|---|---|
| Full (31 features) | 72.30% | 0.7089 | 31 | 3.19 ms |
| Without DNS | 71.95% | 0.7045 | 26 | 2.87 ms |
| Without SSL | 71.42% | 0.6982 | 28 | 2.94 ms |
| URL + Domain Only | 69.73% | 0.6782 | 19 | 2.12 ms |
| Lexical Only | 65.24% | 0.6142 | 12 | 1.48 ms |
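To make the trade-off in the ablation table easier to read, the sketch below computes, for each reduced configuration, the accuracy lost and the extraction time saved relative to the full 31-feature set. It only rearranges the numbers from the table; the `tradeoff` helper is illustrative.

```python
# Accuracy (%) and feature extraction time (ms) copied from the ablation table.
FULL = {"accuracy": 72.30, "extract_ms": 3.19}
ABLATIONS = {
    "Without DNS": {"accuracy": 71.95, "extract_ms": 2.87},
    "Without SSL": {"accuracy": 71.42, "extract_ms": 2.94},
    "URL + Domain Only": {"accuracy": 69.73, "extract_ms": 2.12},
    "Lexical Only": {"accuracy": 65.24, "extract_ms": 1.48},
}


def tradeoff(config):
    """Return (accuracy points lost, extraction ms saved) versus the full set."""
    row = ABLATIONS[config]
    lost = round(FULL["accuracy"] - row["accuracy"], 2)
    saved = round(FULL["extract_ms"] - row["extract_ms"], 2)
    return lost, saved


for name in ABLATIONS:
    lost, saved = tradeoff(name)
    print(f"{name}: {lost:.2f} accuracy points lost, {saved:.2f} ms saved")
```

For example, dropping the DNS features costs 0.35 accuracy points and saves 0.32 ms of extraction time, while the lexical-only configuration costs 7.06 points and saves 1.71 ms.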
Common false negative patterns:
- Sophisticated phishing with domain mimicry
- Legitimate-looking URLs with malicious payloads
- Fresh domains with no historical data
- Parameter-level obfuscation (Base64, hex encoding)
Common false positive patterns:
- Developer tools and CDN URLs
- API documentation with extensive parameters
- Regional TLDs (.ru, .cn) for legitimate sites
This work has been accepted for publication at FJCAI 2026 (Conference on Artificial Intelligence):
Title: Edge-AI Malicious Domain and URL Detection for IoT Gateway Security: A Lightweight Random Forest Approach
Authors: Tung Phan Luu, Huy Nguyen Nhat, Bao Tran Minh, Gia Nhu Nguyen
Paper ID: 157
Camera-ready paper available in paper/latex/main.pdf
Run unit tests:
pytest tests/

Run integration tests:
python tests/test_api.py
python tests/test_pipeline.py

Performance benchmarking:
python scripts/api_benchmark.py

Metrics are available at http://localhost:9090/metrics:
- url_detection_requests_total: Total prediction requests
- url_detection_latency_seconds: Request latency histogram
- url_detection_malicious_total: Malicious URL count
- model_inference_time_seconds: Model prediction time
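The counters above can be combined into simple health indicators, such as the share of requests flagged as malicious. The sketch below parses a small sample in the Prometheus text exposition format. The sample values are made up for illustration, and the parser is deliberately simplified: it assumes lines carry no trailing timestamps.

```python
# Hypothetical scrape output; the numbers are invented for this example.
SAMPLE = """\
# HELP url_detection_requests_total Total prediction requests
# TYPE url_detection_requests_total counter
url_detection_requests_total 120
# HELP url_detection_malicious_total Malicious URL count
# TYPE url_detection_malicious_total counter
url_detection_malicious_total 18
"""


def parse_metrics(text):
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token (no timestamps assumed).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
malicious_share = (
    metrics["url_detection_malicious_total"]
    / metrics["url_detection_requests_total"]
)
print(f"Malicious share: {malicious_share:.1%}")
```

In a real deployment, a query against Prometheus itself is usually the better tool; this sketch is only meant to show how the metric names relate to each other.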
- Application logs: logs/detection.log
- API access logs: logs/api_access.log
- Error logs: logs/errors.log
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/YourFeature)
- Commit your changes (git commit -m 'Add YourFeature')
- Push to the branch (git push origin feature/YourFeature)
- Open a Pull Request
Please ensure:
- Code follows PEP 8 style guidelines
- All tests pass
- New features include unit tests
- Documentation is updated
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
@inproceedings{luu2026edge,
  title={Edge-AI Malicious Domain and URL Detection for IoT Gateway Security: A Lightweight Random Forest Approach},
  author={Luu, Tung Phan and Nguyen, Huy Nhat and Tran, Bao Minh and Nguyen, Gia Nhu},
  booktitle={Proceedings of the Conference on Artificial Intelligence (FJCAI)},
  year={2026}
}

- URLhaus (abuse.ch) for malicious URL dataset
- Majestic Million for benign URL dataset
- FJCAI 2026 conference organizers and reviewers
- Author: Huy Nguyen Nhat
- Email: nguyennhathuy11@dtu.edu.vn
- GitHub: https://github.com/Huy-VNNIC
- Repository: https://github.com/Huy-VNNIC/Edge-AI-URL-Detection
Future enhancements planned:
- Real-time model updates via federated learning
- Adversarial robustness improvements
- Extended feature set for zero-day detection
- Support for additional IoT gateway platforms
- Web-based management interface
- Automated retraining pipeline
- URLNet: Character-level CNN for URL classification
- PhishDef: URL-based phishing detection
- BERT-URL: Transformer-based URL analysis
- Initial release
- Random Forest implementation with 31 features
- RESTful API with FastAPI
- Docker containerization
- FJCAI 2026 camera-ready paper
For issues, questions, or feature requests:
- Open an issue on GitHub
- Email the maintainers
- Check documentation in the paper/ directory
This system is designed as a defense-in-depth component and should not be relied upon as the sole security measure. Always deploy multiple layers of security controls in production environments.
The accuracy metrics reported (72.30%) reflect realistic performance under challenging cybersecurity conditions including label noise and edge deployment constraints. Performance may vary depending on the threat landscape and deployment environment. Regular model retraining is recommended.
