Skip to content

Anmol-G-K/IEEE-EV-Hackathon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

6 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿš— IEEE EV - Car Hacking Intrusion Detection System

Python PyTorch Streamlit Polars uv

A comprehensive machine learning framework for detecting intrusions in automotive CAN bus networks using the Car Hacking: Attack & Defense Challenge 2020 dataset from IEEE Dataport. This project implements cutting-edge deep learning approaches including Graph Neural Networks, Transformers, and hybrid architectures for automotive cybersecurity.

๐ŸŽฏ Overview

This project addresses the critical challenge of securing modern vehicles against cyber attacks by developing advanced intrusion detection systems (IDS) for Controller Area Network (CAN) bus communications. The framework implements multiple state-of-the-art machine learning approaches:

  • ๐Ÿ”— Graph Convolutional Networks (GCN) [WIP] for network topology-based anomaly detection

  • Hybrid ML Framework [WIP] combining sequence transformers, graph neural networks, and contrastive learning

  • ๐ŸŒณ Traditional ML Models (Random Forest & XGBoost) for baseline comparison and ensemble methods

  • Interactive Streamlit Dashboard for real-time analysis and model evaluation

๐Ÿ“Š Dataset

The project utilizes the Car Hacking: Attack & Defense Challenge 2020 Dataset which contains:

Dataset Statistics

  • Total Messages: 8,694,507 CAN bus messages
  • Training Data: 3,672,151 messages
  • Test Data: 3,752,046 messages
  • Validation Data: 1,270,310 messages

Attack Types

The dataset includes four main attack categories:

Attack Type Count Description
Flooding 345,859 High-frequency message injection attacks
Fuzzing 216,571 Random payload injection attacks
Spoofing 200,338 Message impersonation attacks
Replay 110,474 Previously captured message replay attacks
Normal 7,821,265 Legitimate vehicle communication

Data Structure

โ”œโ”€โ”€โ”€0_Preliminary/
โ”‚   โ”œโ”€โ”€โ”€0_Training/ # Training Files
โ”‚   โ”‚       Pre_train_D_0.csv
โ”‚   โ”‚       Pre_train_D_1.csv
โ”‚   โ”‚       Pre_train_D_2.csv
โ”‚   โ”‚       Pre_train_S_0.csv
โ”‚   โ”‚       Pre_train_S_1.csv
โ”‚   โ”‚       Pre_train_S_2.csv
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€โ”€1_Submission/ # Test Files
โ”‚           Pre_submit_D.csv
โ”‚           Pre_submit_S.csv
โ”‚
โ””โ”€โ”€โ”€1_Final/ # Validation Files
        Fin_host_session_submit_S.csv

CAN Message Format

Each CSV contains CAN bus messages with the following structure:

  • Timestamp: Unix timestamp of message transmission
  • Arbitration_ID: CAN message identifier (hex format)
  • DLC: Data Length Code (0-8 bytes)
  • Data: Hexadecimal payload data (up to 16 hex characters)
  • Class: Primary classification (Normal/Attack)
  • SubClass: Detailed attack type (Normal/Flooding/Fuzzing/Spoofing/Replay)

๐Ÿ—๏ธ Project Architecture

.
โ”‚   .gitignore
โ”‚   .python-version
โ”‚   main.py # main file
โ”‚   pyproject.toml
โ”‚   README.md
โ”‚   uv.lock
โ”‚   
โ”œโ”€โ”€โ”€data
โ”‚   โ”œโ”€โ”€โ”€0_Preliminary
โ”‚   โ”‚   โ”œโ”€โ”€โ”€0_Training
โ”‚   โ”‚   โ”‚       Pre_train_D_0.csv
โ”‚   โ”‚   โ”‚       Pre_train_D_1.csv
โ”‚   โ”‚   โ”‚       Pre_train_D_2.csv
โ”‚   โ”‚   โ”‚       Pre_train_S_0.csv
โ”‚   โ”‚   โ”‚       Pre_train_S_1.csv
โ”‚   โ”‚   โ”‚       Pre_train_S_2.csv
โ”‚   โ”‚   โ”‚
โ”‚   โ”‚   โ””โ”€โ”€โ”€1_Submission
โ”‚   โ”‚           Pre_submit_D.csv
โ”‚   โ”‚           Pre_submit_S.csv
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€โ”€1_Final
โ”‚           Fin_host_session_submit_S.csv
โ”‚
โ”œโ”€โ”€โ”€helpers
โ”‚       data_viewer.py
โ”‚       schema_viewer.py
โ”‚
โ”œโ”€โ”€โ”€out # EDA's
โ”‚   โ”œโ”€โ”€โ”€eda_out
โ”‚   โ”‚       eda_summary.json
โ”‚   โ”‚       sample_head.csv
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€โ”€schema_debug
โ”‚           schema_report.json
โ”‚
โ”œโ”€โ”€โ”€src
โ”‚       ensemble_trial.py
โ”‚       GCNN.py # Graph Convolutional Neural Network (WIP) 
โ”‚       ML.py # ML implementations specifically Random Forest and XGBoost
โ”‚
โ””โ”€โ”€โ”€utils
        can_ids_streamlit_app.py # An interactive dashboard for visualisation

๐Ÿš€ Quick Start

Prerequisites

  • Python: 3.13+ (specified in .python-version)
  • Package Manager: uv (recommended) or pip
  • Memory: 8GB+ RAM recommended for full dataset processing
  • Storage: 2GB+ free space for dataset and outputs

Installation

# Clone the repository
git clone https://github.com/Anmol-G-K/IEEE-EV-Hackathon.git
cd IEEE_EV

# Install dependencies using uv (recommended)
uv sync

# Or using pip
pip install -e .

Key Dependencies

Package Version Purpose
PyTorch โ‰ฅ2.8.0 Deep learning framework
torch-geometric Latest Graph neural networks
scikit-learn โ‰ฅ1.7.2 Traditional ML algorithms
XGBoost โ‰ฅ3.0.5 Gradient boosting
Polars โ‰ฅ1.33.1 Fast data processing
Streamlit โ‰ฅ1.49.1 Interactive dashboard
NetworkX Latest Graph analysis
Matplotlib/Seaborn Latest Visualization

๐ŸŽฎ Usage

1. Exploratory Data Analysis

Start by analyzing the dataset structure and characteristics:

# Generate comprehensive EDA report
python helpers/data_viewer.py

# Validate dataset schema
python helpers/schema_viewer.py

This generates detailed reports in out/eda_out/ including:

  • Dataset statistics and distributions
  • Missing data analysis
  • Attack type distributions
  • Message frequency patterns
  • Arbitration ID statistics

2. Graph Convolutional Network (WIP)

Train a GCN for anomaly detection:

python src/GCNN.py

Features:

  • Converts CAN messages to graph representations
  • Learns node embeddings for Arbitration IDs
  • Builds correlation-based adjacency matrices
  • Generates anomaly scores for each message
  • Creates visualizations: PCA plots, score distributions, graph structures

Outputs:

  • outputs/X.npy: Node feature matrix
  • outputs/edge_index.npy: Graph adjacency matrix
  • outputs/node_embeddings_cpu.npy: Learned embeddings
  • outputs/node_anomaly_score_cpu.npy: Anomaly scores
  • Visualization plots (PCA, histograms, graph structures) Currently a Work in progress

3. Traditional ML Pipeline

Run baseline and ensemble models:

python src/ML.py

Models:

  • Random Forest classifier with feature engineering
  • XGBoost classifier with hyperparameter optimization
  • Comprehensive feature extraction pipeline
  • Cross-validation and performance metrics

4. Hybrid ML Framework (WIP)

Train the advanced hybrid model:

python src/ensemble_trial.py

Architecture Components:

  • Sequence Transformer: Captures temporal patterns in message sequences
  • Graph Neural Network: Models network topology and message relationships
  • Contrastive Learning: Learns robust message representations
  • Fusion Classifier: Combines all modalities for final predictions

Features:

  • Sliding window approach for sequence modeling
  • Multi-modal feature fusion
  • PyTorch AMP for efficient training
  • Comprehensive evaluation metrics

5. Interactive Dashboard

Launch the Streamlit application:

streamlit run utils/visual.py

Dashboard Features:

  • Interactive data upload and preprocessing
  • Real-time model training and evaluation
  • Confusion matrix visualization
  • Feature importance analysis
  • Performance comparison charts

๐Ÿ”ฌ Technical Methodology

Graph Neural Network Approach

The GCN implementation treats CAN messages as nodes in a graph where:

  1. Node Features:

    • Arbitration ID embeddings
    • Payload byte statistics (mean, frequency)
    • Message timing characteristics
  2. Edge Construction:

    • Correlation-based adjacency matrix
    • Top-k neighborhood selection
    • Threshold-based edge pruning
  3. Architecture:

    • 2-layer Graph Convolutional Network
    • Reconstruction loss for unsupervised learning
    • Anomaly scoring through embedding distances

Hybrid Framework

The hybrid approach combines multiple modalities:

  1. Sequence Component:

    • Transformer encoder for temporal patterns
    • Multi-head attention mechanism
    • Positional encoding for message sequences
  2. Graph Component:

    • GCN layers for network topology
    • Global mean pooling for graph-level features
    • Message relationship modeling
  3. Contrastive Component:

    • Self-supervised representation learning
    • Message similarity modeling
    • Robust feature extraction
  4. Fusion Strategy:

    • Multi-modal feature concatenation
    • Dropout for regularization
    • Binary classification head

Feature Engineering

Comprehensive feature extraction pipeline:

  • Payload Features: Byte-level analysis, entropy calculation, statistical moments
  • Timing Features: Inter-arrival times, frequency estimation, burst detection
  • Network Features: Message frequency per ID, traffic patterns
  • Statistical Features: Mean, standard deviation, correlations, distributions

๐Ÿ› ๏ธ Development

Adding New Models

  1. Create Model File: Add new implementation in src/ directory
  2. Follow Patterns: Use existing data loading and preprocessing utilities
  3. Add Evaluation: Include comprehensive metrics and visualizations
  4. Update Documentation: Document new approaches and results

Extending Data Processing

  1. Custom EDA: Modify helpers/data_viewer.py for specialized analysis
  2. Preprocessing: Update functions in model files or MISC/preprocess.py
  3. Feature Engineering: Add new feature extraction methods
  4. Validation: Use helpers/schema_viewer.py for data quality checks

Enhancing Visualizations

  1. Dashboard: Extend utils/visual.py with new Streamlit components
  2. Plotting: Add model-specific visualization functions
  3. Real-time: Implement live monitoring capabilities
  4. Export: Add report generation and export functionality

๐Ÿ“š Research & References

Dataset

Key Papers

  • Graph Neural Networks for CAN Bus Intrusion Detection
  • Transformer-based Sequence Modeling for Automotive Security
  • Multi-modal Fusion for Vehicle Cybersecurity

Libraries & Frameworks

๐Ÿค Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add comprehensive docstrings
  • Include unit tests for new features
  • Update documentation for API changes
  • Ensure backward compatibility

โš ๏ธ Disclaimer

This project is developed for educational and research purposes only. Always ensure compliance with local regulations and ethical guidelines when working with automotive systems and cybersecurity research.

๐Ÿค๐Ÿ™Œ Acknowledgments

  • Amrita Vishwa Vidyapeetham IEEE Student Branch on organising the hackathon.
  • IEEE Dataport for providing the Car Hacking dataset
  • PyTorch Community for excellent deep learning frameworks
  • Automotive Security Research Community for ongoing contributions
  • Open Source Contributors who make projects like this possible

๐Ÿ‘ฅ Team Members

Name GitHub LinkedIn
Aryan jaljith GitHub LinkedIn
Mauli Rajguru GitHub LinkedIn
Anmol GitHub LinkedIn

๐Ÿ”’ Securing the Future of Connected Vehicles ๐Ÿ”’

Advanced Machine Learning for Automotive Cybersecurity

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages