A comprehensive machine learning framework for detecting intrusions in automotive CAN bus networks using the Car Hacking: Attack & Defense Challenge 2020 dataset from IEEE Dataport. This project implements cutting-edge deep learning approaches including Graph Neural Networks, Transformers, and hybrid architectures for automotive cybersecurity.
This project addresses the critical challenge of securing modern vehicles against cyber attacks by developing advanced intrusion detection systems (IDS) for Controller Area Network (CAN) bus communications. The framework implements multiple state-of-the-art machine learning approaches:
- Graph Convolutional Networks (GCN) [WIP] for network topology-based anomaly detection
- Hybrid ML Framework [WIP] combining sequence transformers, graph neural networks, and contrastive learning
- Traditional ML Models (Random Forest & XGBoost) for baseline comparison and ensemble methods
- Interactive Streamlit Dashboard for real-time analysis and model evaluation
The project utilizes the Car Hacking: Attack & Defense Challenge 2020 Dataset which contains:
- Total Messages: 8,694,507 CAN bus messages
- Training Data: 3,672,151 messages
- Test Data: 3,752,046 messages
- Validation Data: 1,270,310 messages
The dataset includes four main attack categories alongside normal traffic:
| Attack Type | Count | Description |
|---|---|---|
| Flooding | 345,859 | High-frequency message injection attacks |
| Fuzzing | 216,571 | Random payload injection attacks |
| Spoofing | 200,338 | Message impersonation attacks |
| Replay | 110,474 | Previously captured message replay attacks |
| Normal | 7,821,265 | Legitimate vehicle communication |
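The counts above imply a strong class imbalance: normal traffic makes up roughly 90% of all messages. The following is a minimal sketch of how that imbalance could be quantified and turned into inverse-frequency class weights. The weighting formula `total / (n_classes * count)` is the common "balanced" heuristic and is an assumption here, not something the project is confirmed to use; the function names are illustrative.

```python
# Class counts copied from the attack-type table in this README.
CLASS_COUNTS = {
    "Normal": 7_821_265,
    "Flooding": 345_859,
    "Fuzzing": 216_571,
    "Spoofing": 200_338,
    "Replay": 110_474,
}


def class_shares(counts):
    """Return each class's fraction of all messages."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


for name, share in class_shares(CLASS_COUNTS).items():
    print(f"{name:<9} {share:6.2%}")
```

Rare classes such as Replay receive the largest weights, which is one way to keep a classifier from simply predicting Normal for every message.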

    ├───0_Preliminary/
    │   ├───0_Training/       # Training Files
    │   │       Pre_train_D_0.csv
    │   │       Pre_train_D_1.csv
    │   │       Pre_train_D_2.csv
    │   │       Pre_train_S_0.csv
    │   │       Pre_train_S_1.csv
    │   │       Pre_train_S_2.csv
    │   │
    │   └───1_Submission/     # Test Files
    │           Pre_submit_D.csv
    │           Pre_submit_S.csv
    │
    └───1_Final/              # Validation Files
            Fin_host_session_submit_S.csv

Each CSV contains CAN bus messages with the following structure:
- Timestamp: Unix timestamp of message transmission
- Arbitration_ID: CAN message identifier (hex format)
- DLC: Data Length Code (0-8 bytes)
- Data: Hexadecimal payload data (up to 16 hex characters)
- Class: Primary classification (Normal/Attack)
- SubClass: Detailed attack type (Normal/Flooding/Fuzzing/Spoofing/Replay)
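As a rough illustration of the record layout described above, the sketch below parses CSV rows into typed records using only the standard library. It assumes the CSV header uses exactly the field names listed above and that the payload may be written with or without spaces between bytes; the `CanMessage` class, the helper names, and the sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class CanMessage:
    timestamp: float
    arbitration_id: int
    dlc: int
    payload: bytes
    label: str
    subclass: str


def parse_row(row):
    """Convert one CSV row (a dict of strings) into a CanMessage."""
    return CanMessage(
        timestamp=float(row["Timestamp"]),
        arbitration_id=int(row["Arbitration_ID"], 16),  # hex string -> int
        dlc=int(row["DLC"]),
        payload=bytes.fromhex(row["Data"].replace(" ", "")),
        label=row["Class"],
        subclass=row["SubClass"],
    )


def read_messages(handle):
    """Yield CanMessage objects from an open CSV file or file-like object."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Made-up sample rows for illustration only.
SAMPLE = """Timestamp,Arbitration_ID,DLC,Data,Class,SubClass
1597707827.10,260,8,06 25 05 30 FF CF 71 55,Normal,Normal
1597707827.11,000,8,0000000000000000,Attack,Flooding
"""

messages = list(read_messages(io.StringIO(SAMPLE)))
```

The same `read_messages` function would accept a real file handle, for example one opened with `open(path, newline="")`.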

    .
    │   .gitignore
    │   .python-version
    │   main.py                         # main file
    │   pyproject.toml
    │   README.md
    │   uv.lock
    │
    ├───data
    │   ├───0_Preliminary
    │   │   ├───0_Training
    │   │   │       Pre_train_D_0.csv
    │   │   │       Pre_train_D_1.csv
    │   │   │       Pre_train_D_2.csv
    │   │   │       Pre_train_S_0.csv
    │   │   │       Pre_train_S_1.csv
    │   │   │       Pre_train_S_2.csv
    │   │   │
    │   │   └───1_Submission
    │   │           Pre_submit_D.csv
    │   │           Pre_submit_S.csv
    │   │
    │   └───1_Final
    │           Fin_host_session_submit_S.csv
    │
    ├───helpers
    │       data_viewer.py
    │       schema_viewer.py
    │
    ├───out                             # EDA outputs
    │   ├───eda_out
    │   │       eda_summary.json
    │   │       sample_head.csv
    │   │
    │   └───schema_debug
    │           schema_report.json
    │
    ├───src
    │       ensemble_trial.py
    │       GCNN.py                     # Graph Convolutional Neural Network (WIP)
    │       ML.py                       # ML implementations, specifically Random Forest and XGBoost
    │
    └───utils
            can_ids_streamlit_app.py    # An interactive dashboard for visualisation

- Python: 3.13+ (specified in .python-version)
- Package Manager: uv (recommended) or pip
- Memory: 8GB+ RAM recommended for full dataset processing
- Storage: 2GB+ free space for dataset and outputs

    # Clone the repository
    git clone https://github.com/Anmol-G-K/IEEE-EV-Hackathon.git
    cd IEEE-EV-Hackathon
    # Install dependencies using uv (recommended)
    uv sync
    # Or using pip
    pip install -e .

| Package | Version | Purpose |
|---|---|---|
| PyTorch | ≥2.8.0 | Deep learning framework |
| torch-geometric | Latest | Graph neural networks |
| scikit-learn | ≥1.7.2 | Traditional ML algorithms |
| XGBoost | ≥3.0.5 | Gradient boosting |
| Polars | ≥1.33.1 | Fast data processing |
| Streamlit | ≥1.49.1 | Interactive dashboard |
| NetworkX | Latest | Graph analysis |
| Matplotlib/Seaborn | Latest | Visualization |
Start by analyzing the dataset structure and characteristics:

    # Generate comprehensive EDA report
    python helpers/data_viewer.py
    # Validate dataset schema
    python helpers/schema_viewer.py

This generates detailed reports in out/eda_out/ including:
- Dataset statistics and distributions
- Missing data analysis
- Attack type distributions
- Message frequency patterns
- Arbitration ID statistics
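To give a sense of what such a report computes, here is a minimal sketch that summarizes attack-type shares and the most frequent Arbitration IDs. It is not the logic of `helpers/data_viewer.py`; the `summarize` function and the sample rows are hypothetical, and the rows are assumed to be dicts keyed by the CSV column names.

```python
from collections import Counter


def summarize(rows, top_n=3):
    """Summarize rows that carry 'Arbitration_ID' and 'SubClass' keys."""
    subclass_counts = Counter()
    id_counts = Counter()
    for row in rows:
        subclass_counts[row["SubClass"]] += 1
        id_counts[row["Arbitration_ID"]] += 1
    total = sum(subclass_counts.values())
    return {
        "total_messages": total,
        "subclass_share": {k: v / total for k, v in subclass_counts.items()},
        "top_ids": id_counts.most_common(top_n),
    }


# Made-up rows for illustration only.
rows = [
    {"Arbitration_ID": "260", "SubClass": "Normal"},
    {"Arbitration_ID": "260", "SubClass": "Normal"},
    {"Arbitration_ID": "260", "SubClass": "Normal"},
    {"Arbitration_ID": "000", "SubClass": "Flooding"},
]
summary = summarize(rows)
```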
Train a GCN for anomaly detection:

    python src/GCNN.py

Features:
- Converts CAN messages to graph representations
- Learns node embeddings for Arbitration IDs
- Builds correlation-based adjacency matrices
- Generates anomaly scores for each message
- Creates visualizations: PCA plots, score distributions, graph structures
Outputs:
- outputs/X.npy: Node feature matrix
- outputs/edge_index.npy: Graph adjacency matrix
- outputs/node_embeddings_cpu.npy: Learned embeddings
- outputs/node_anomaly_score_cpu.npy: Anomaly scores
- Visualization plots (PCA, histograms, graph structures)

This component is currently a work in progress.
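One simple way to derive anomaly scores from learned embeddings is to measure each embedding's distance from the centroid of all embeddings. The sketch below assumes that scoring rule for illustration; the README does not specify which distance the GCN script uses, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def anomaly_scores(embeddings):
    """Euclidean distance to the centroid, scaled so the largest score is 1."""
    if not embeddings:
        return []
    center = centroid(embeddings)
    dists = [math.dist(v, center) for v in embeddings]
    top = max(dists)
    return [d / top if top > 0 else 0.0 for d in dists]


# Toy embeddings: three nodes cluster together, the fourth is far away.
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
scores = anomaly_scores(toy)
```

Scaling by the maximum keeps scores comparable across runs, at the cost of making them relative to the most extreme node in each run.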
Run baseline and ensemble models:

    python src/ML.py

Models:
- Random Forest classifier with feature engineering
- XGBoost classifier with hyperparameter optimization
- Comprehensive feature extraction pipeline
- Cross-validation and performance metrics
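Because roughly 90% of the messages are normal, accuracy alone can look good for a model that never flags an attack. The following sketch computes precision, recall, and F1 for the attack class from label lists. It is illustrative only and does not reproduce the metrics code in `src/ML.py`; the function name and the toy labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive="Attack"):
    """Accuracy, precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A model that always predicts "Normal" scores 80% accuracy here
# but detects no attacks at all.
y_true = ["Normal"] * 8 + ["Attack"] * 2
always_normal = binary_metrics(y_true, ["Normal"] * 10)
```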
Train the advanced hybrid model:

    python src/ensemble_trial.py

Architecture Components:
- Sequence Transformer: Captures temporal patterns in message sequences
- Graph Neural Network: Models network topology and message relationships
- Contrastive Learning: Learns robust message representations
- Fusion Classifier: Combines all modalities for final predictions
Features:
- Sliding window approach for sequence modeling
- Multi-modal feature fusion
- PyTorch AMP for efficient training
- Comprehensive evaluation metrics
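The sliding window approach mentioned above can be sketched as follows. The window length, the stride, and the rule that a window is labeled as an attack if any message in it is an attack are assumptions chosen for illustration; the actual values in `src/ensemble_trial.py` may differ.

```python
def sliding_windows(items, labels, window=4, stride=2):
    """Yield (window_items, window_label) pairs over a message sequence.

    A window gets label 1 if any message inside it is labeled 1.
    """
    for start in range(0, len(items) - window + 1, stride):
        chunk = items[start:start + window]
        label = int(any(labels[start:start + window]))
        yield chunk, label


# Ten toy messages, with a single attack at position 5.
items = list(range(10))
labels = [0] * 10
labels[5] = 1
windows = list(sliding_windows(items, labels, window=4, stride=2))
```

Overlapping windows (stride smaller than the window length) mean a single attack message appears in several training samples, which partly offsets the class imbalance.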
Launch the Streamlit application:

    streamlit run utils/visual.py

Dashboard Features:
- Interactive data upload and preprocessing
- Real-time model training and evaluation
- Confusion matrix visualization
- Feature importance analysis
- Performance comparison charts
The GCN implementation treats CAN messages as nodes in a graph where:
- Node Features:
  - Arbitration ID embeddings
  - Payload byte statistics (mean, frequency)
  - Message timing characteristics
- Edge Construction:
  - Correlation-based adjacency matrix
  - Top-k neighborhood selection
  - Threshold-based edge pruning
- Architecture:
  - 2-layer Graph Convolutional Network
  - Reconstruction loss for unsupervised learning
  - Anomaly scoring through embedding distances
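The edge construction steps above (correlation, top-k selection, threshold pruning) can be sketched in plain Python. This is an illustrative version under assumptions: Pearson correlation on per-node feature vectors, absolute correlation as the edge score, and symmetric edges. The values of `k` and `threshold` and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def build_edges(features, k=2, threshold=0.5):
    """Return sorted directed edges (src, dst) from a node -> vector mapping.

    Each node keeps its k most correlated neighbors, and edges whose
    absolute correlation falls below `threshold` are pruned.
    """
    nodes = list(features)
    edges = set()
    for src in nodes:
        scored = [(abs(pearson(features[src], features[dst])), dst)
                  for dst in nodes if dst != src]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for score, dst in scored[:k]:
            if score >= threshold:
                edges.add((src, dst))
                edges.add((dst, src))  # store both directions
    return sorted(edges)


# Toy feature vectors: A, B, and C are strongly correlated, D is not.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, -1.0, 1.0, -1.0],
}
edges = build_edges(toy, k=2, threshold=0.5)
```

A list of `(src, dst)` pairs like this maps directly onto the two-row edge index format that graph libraries such as PyTorch Geometric expect.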
The hybrid approach combines multiple modalities:
- Sequence Component:
  - Transformer encoder for temporal patterns
  - Multi-head attention mechanism
  - Positional encoding for message sequences
- Graph Component:
  - GCN layers for network topology
  - Global mean pooling for graph-level features
  - Message relationship modeling
- Contrastive Component:
  - Self-supervised representation learning
  - Message similarity modeling
  - Robust feature extraction
- Fusion Strategy:
  - Multi-modal feature concatenation
  - Dropout for regularization
  - Binary classification head
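The positional encoding used by the sequence component is not specified here. As one common choice, the sketch below builds the fixed sinusoidal encoding from the original Transformer design; treating it as the encoding used in this project is an assumption, and the function name is illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as a seq_len x d_model nested list.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following odd
    column holds the matching cosine, where i is the even column index.
    """
    if d_model % 2:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=8)
```

Each message position in a window gets a distinct vector, which is added to the message embedding so that the attention layers can tell message order apart.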
Comprehensive feature extraction pipeline:
- Payload Features: Byte-level analysis, entropy calculation, statistical moments
- Timing Features: Inter-arrival times, frequency estimation, burst detection
- Network Features: Message frequency per ID, traffic patterns
- Statistical Features: Mean, standard deviation, correlations, distributions
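Two of the features listed above, payload entropy and inter-arrival times, can be sketched with the standard library alone. This is an illustration under assumptions, not the project's feature pipeline: entropy is computed over the byte values of a single payload, and timestamps are assumed to be sorted and to belong to one Arbitration ID. The function names are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(payload):
    """Shannon entropy, in bits, of the byte values in one payload."""
    if not payload:
        return 0.0
    n = len(payload)
    counts = Counter(payload)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def inter_arrival_times(timestamps):
    """Gaps between consecutive timestamps of one Arbitration ID."""
    return [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]


constant_payload = bytes.fromhex("0000000000000000")
varied_payload = bytes.fromhex("0011223344556677")
low = byte_entropy(constant_payload)   # every byte identical
high = byte_entropy(varied_payload)    # eight distinct byte values
gaps = inter_arrival_times([0.00, 0.01, 0.02, 0.50])
```

A payload of identical bytes has zero entropy, while random-looking payloads (as in fuzzing) score higher; unusually short gaps can point to flooding.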
- Create Model File: Add new implementation in src/ directory
- Follow Patterns: Use existing data loading and preprocessing utilities
- Add Evaluation: Include comprehensive metrics and visualizations
- Update Documentation: Document new approaches and results
- Custom EDA: Modify helpers/data_viewer.py for specialized analysis
- Preprocessing: Update functions in model files or MISC/preprocess.py
- Feature Engineering: Add new feature extraction methods
- Validation: Use helpers/schema_viewer.py for data quality checks
- Dashboard: Extend utils/visual.py with new Streamlit components
- Plotting: Add model-specific visualization functions
- Real-time: Implement live monitoring capabilities
- Export: Add report generation and export functionality
- Graph Neural Networks for CAN Bus Intrusion Detection
- Transformer-based Sequence Modeling for Automotive Security
- Multi-modal Fusion for Vehicle Cybersecurity
- PyTorch Geometric - Graph neural networks
- Streamlit - Interactive dashboards
- Polars - Fast data processing
- scikit-learn - Machine learning algorithms
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add comprehensive docstrings
- Include unit tests for new features
- Update documentation for API changes
- Ensure backward compatibility
This project is developed for educational and research purposes only. Always ensure compliance with local regulations and ethical guidelines when working with automotive systems and cybersecurity research.
- Amrita Vishwa Vidyapeetham IEEE Student Branch for organising the hackathon
- IEEE Dataport for providing the Car Hacking dataset
- PyTorch Community for excellent deep learning frameworks
- Automotive Security Research Community for ongoing contributions
- Open Source Contributors who make projects like this possible
| Name | GitHub |
|---|---|
| Aryan jaljith | GitHub |
| Mauli Rajguru | GitHub |
| Anmol | GitHub |
Securing the Future of Connected Vehicles
Advanced Machine Learning for Automotive Cybersecurity