
Graph Neural Network Applications: Node Classification & Graph Generation

This repository contains the implementation and analysis of two distinct Graph Neural Network (GNN) projects: Node Classification on a real-world citation network and Graph Generation using state-of-the-art generative models on synthetic data.


Project Structure

| File | Description |
| --- | --- |
| Task_1_Node_Classification_Report.pdf | Comprehensive report on the Node Classification task (Objective, GCN/GAT Architectures, Results, Ablation Study). |
| Task_1_Node_Classification.ipynb | Jupyter Notebook containing the full code for the Node Classification task. |
| Task_2_Graph_Generation_Report.pdf | Comprehensive report on the Graph Generation task (Diffusion Model and GraphRNN Architectures, Comparison, and Statistical Analysis). |
| Task_2_Graph_Generation.ipynb | Jupyter Notebook containing the full code for the Graph Generation task. |
| Presentation.pdf | Slide deck summarizing the project findings (not provided here). |

Task 1: Node Classification on OGBN-Arxiv

The objective of Task 1 was to build a GNN model for node property prediction on OGBN-Arxiv, a large-scale transductive node classification problem.

Dataset and Preprocessing

  • Dataset: OGBN-Arxiv (Open Graph Benchmark), a large-scale academic citation network.
  • Task: Classify each paper (node) into one of 40 subject categories.
  • Scale: The graph has 169,343 nodes and 1,166,243 non-zero edges.
  • Preprocessing: The graph's edge list was converted to a memory-efficient torch_sparse.SparseTensor and explicitly made symmetric to ensure proper message passing (a minimal sketch follows this list).
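
A minimal sketch of this preprocessing step, assuming the standard ogb and torch_sparse packages (variable names such as adj_t are illustrative, not taken from the notebook):

    from ogb.nodeproppred import PygNodePropPredDataset
    from torch_sparse import SparseTensor

    dataset = PygNodePropPredDataset(name="ogbn-arxiv")
    data = dataset[0]  # one large citation graph

    # Build a sparse adjacency from the directed edge list, then symmetrize
    # so messages flow along both citation directions.
    row, col = data.edge_index
    adj_t = SparseTensor(row=col, col=row,
                         sparse_sizes=(data.num_nodes, data.num_nodes))
    adj_t = adj_t.to_symmetric()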

Models Implemented

  1. Baseline Model: Graph Convolutional Network (GCN)

    • Architecture: A deep 4-layer GCN with Batch Normalization (BN) layers between hidden layers for stabilization, and Dropout (p=0.4) for regularization.
    • Ablation Study: A 20-layer GCN without Batch Normalization was tested, resulting in a significant drop in accuracy (30.17%).
  2. Improved Model: Graph Attention Network (GAT)

    • Architecture: A shallower 2-layer GAT utilizing a learnable, differentiable attention mechanism that weights each neighbor's contribution. Multi-head attention (heads=2) with concatenation was used in the hidden layer. Both models are sketched below.
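
The following sketch shows how the two architectures described above might be written with PyTorch Geometric; layer widths and class names are illustrative placeholders, not the notebook's exact values:

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv, GATConv

    class GCN(torch.nn.Module):
        # Baseline: 4-layer GCN, BatchNorm between hidden layers, Dropout p=0.4
        def __init__(self, in_dim, hid_dim, out_dim, num_layers=4):
            super().__init__()
            dims = [in_dim] + [hid_dim] * (num_layers - 1) + [out_dim]
            self.convs = torch.nn.ModuleList(
                GCNConv(dims[i], dims[i + 1]) for i in range(num_layers))
            self.bns = torch.nn.ModuleList(
                torch.nn.BatchNorm1d(d) for d in dims[1:-1])

        def forward(self, x, adj_t):
            for conv, bn in zip(self.convs[:-1], self.bns):
                x = F.dropout(F.relu(bn(conv(x, adj_t))),
                              p=0.4, training=self.training)
            return self.convs[-1](x, adj_t)

    class GAT(torch.nn.Module):
        # Improved model: 2 layers; hidden layer concatenates 2 attention heads
        def __init__(self, in_dim, hid_dim, out_dim, heads=2):
            super().__init__()
            self.conv1 = GATConv(in_dim, hid_dim, heads=heads, concat=True)
            self.conv2 = GATConv(hid_dim * heads, out_dim, heads=1)

        def forward(self, x, adj_t):
            x = F.elu(self.conv1(x, adj_t))
            return self.conv2(x, adj_t)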

Takeaways

  • Both the deeper GCN and the shallower GAT achieved competitive test accuracy (approximately 71%).
  • The GAT demonstrated superior generalization (less divergence between train/validation/test curves) and faster initial convergence.
  • The ablation study showed that Batch Normalization is essential for training deep GCNs, helping mitigate issues such as over-smoothing and over-squashing.

Task 2: Graph Generation

The objective of Task 2 was to build a GNN-based generative model capable of creating synthetic graphs that reproduce specific statistical properties.

Dataset

  • Dataset: Synthetic Erdős–Rényi (ER) graphs G(N, p) with N=25 nodes and edge probability p=0.3 (generation sketched after this list).
  • Purpose: ER graphs were chosen because their target statistical properties (e.g., degree distribution, clustering coefficient) are known analytically, allowing for clear, quantitative evaluation.
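
Generating such a dataset is straightforward with networkx; the sketch below assumes a training set of 800 graphs (the figure cited for the Diffusion Model in the limitations section):

    import networkx as nx

    N, p, num_graphs = 25, 0.3, 800
    graphs = [nx.erdos_renyi_graph(N, p) for _ in range(num_graphs)]

    # Analytic targets for evaluation: degree ~ Binomial(N - 1, p),
    # expected clustering coefficient = p = 0.3.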

Approach 1: Discrete Denoising Diffusion Model

  • Choice: Selected for its robust capacity to model complex, high-dimensional discrete data distributions (adjacency matrices).
  • Architecture: Utilizes a Graph Transformer Denoiser (non-autoregressive) that processes the entire noisy adjacency matrix at each reverse-diffusion step. This approach inherently captures the global context of the graph (the forward-noising idea is sketched after this list).
  • Key Result: The empirical degree distribution showed strong alignment with the theoretical Binomial distribution. The empirical average Clustering Coefficient (~0.299) was extremely close to the theoretical value (p=0.300).
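
To illustrate the discrete-diffusion idea (a generic sketch, not the notebook's exact transition kernel): a forward-noising step can resample each potential edge toward a uniform Bernoulli prior with step-dependent probability beta_t, and the Graph Transformer denoiser is trained to invert this corruption:

    import torch

    def forward_noise(adj, beta_t):
        """adj: (N, N) 0/1 adjacency tensor; beta_t: resampling prob. at step t."""
        adj = adj.float()
        resample = torch.rand_like(adj) < beta_t      # entries to corrupt
        coin = (torch.rand_like(adj) < 0.5).float()   # uniform Bernoulli prior
        noisy = torch.where(resample, coin, adj)
        noisy = torch.triu(noisy, diagonal=1)         # simple graph: no self-loops
        return noisy + noisy.t()                      # restore symmetry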

Approach 2: GraphRNN Model

  • Choice: Implemented as an autoregressive benchmark.
  • Architecture: Uses two decoupled GRUCells (Node RNN and Edge RNN) to sequentially generate nodes and their edge sequences, relying on sequential/local context (sketched after this list).
  • Key Result: The empirical degree distribution was visibly shifted to the left compared to the theoretical curve, showing the model underestimated the frequency of higher-degree nodes. The empirical average Clustering Coefficient (~0.245) was lower than the theoretical value (p=0.300).
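
A condensed sketch of the two decoupled recurrences at sampling time, with illustrative sizes (N nodes, context window M, hidden size H); training with teacher forcing is omitted:

    import torch

    N, M, H = 25, 25, 64
    node_rnn = torch.nn.GRUCell(input_size=M, hidden_size=H)  # graph-level state
    edge_rnn = torch.nn.GRUCell(input_size=1, hidden_size=H)  # per-edge decisions
    edge_out = torch.nn.Linear(H, 1)

    h_node = torch.zeros(1, H)
    prev_edges = torch.zeros(1, M)                # edge vector of the previous node
    for i in range(1, N):
        h_node = node_rnn(prev_edges, h_node)     # absorb the last node's edges
        h_edge, inp, edges = h_node, torch.ones(1, 1), []
        for j in range(min(i, M)):                # connect node i to earlier nodes
            h_edge = edge_rnn(inp, h_edge)
            inp = torch.bernoulli(torch.sigmoid(edge_out(h_edge)))
            edges.append(inp)
        prev_edges = torch.cat(edges + [torch.zeros(1, M - len(edges))], dim=1)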

Conclusion on Performance

The Diffusion Model proved superior because its global context view at every step allowed it to learn the holistic statistical properties of the ER graphs more effectively. The GraphRNN's sequential and limited-context generation introduced a bias towards simpler structures, resulting in a misaligned degree distribution. The Diffusion Model is also inherently more parallelizable and efficient for large-scale generation than the O(N^2) sequential steps of GraphRNN.
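
The statistical comparison behind these conclusions can be reproduced along the following lines, assuming networkx graphs and scipy for the Binomial reference (function names are illustrative):

    import networkx as nx
    import numpy as np
    from scipy.stats import binom

    def compare_to_er(generated, N=25, p=0.3):
        # Empirical degree histogram, pooled over all generated graphs
        degrees = [d for g in generated for _, d in g.degree()]
        emp_hist = np.bincount(degrees, minlength=N) / len(degrees)
        theo_hist = binom.pmf(np.arange(N), N - 1, p)  # degrees ~ Binomial(N-1, p)
        # Mean clustering coefficient vs. the analytic value C = p
        emp_clust = np.mean([nx.average_clustering(g) for g in generated])
        return emp_hist, theo_hist, emp_clust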


How to run the code

The code for this project is contained within the two Jupyter Notebooks. To run the experiments and reproduce the results, follow these steps:

Prerequisites

  • Python 3.8+
  • A suitable environment manager (Anaconda/Miniconda recommended)

Setup and Installation

  1. Clone the repository:
    git clone https://github.com/Ayan1349/Project_Graph_Learning.git
    cd Project_Graph_Learning
  2. Install Core Dependencies: The project relies on the PyTorch and PyTorch Geometric (PyG) ecosystem. It is recommended to install PyTorch first, ensuring compatibility with your CUDA version (if using a GPU).
    # Install PyTorch (follow official instructions for your CUDA version)
    # pip install torch torchvision torchaudio 
    
    # Install PyTorch Geometric (PyG) and its dependencies
    # Consult PyG documentation for the most stable command for your environment
    pip install torch_geometric torch_sparse
    
    # Install other required libraries
    pip install numpy pandas matplotlib scikit-learn ogb
    Note: The ogb library is specifically needed for downloading and pre-processing the OGBN-Arxiv dataset used in Task 1.

Execution

  1. Launch a Jupyter environment (e.g., JupyterLab or Jupyter Notebook):
    jupyter notebook
  2. Open the following files and run the cells sequentially:
    • Task 1: Task_1_Node_Classification.ipynb to train and evaluate the GCN and GAT models on OGBN-Arxiv.
    • Task 2: Task_2_Graph_Generation.ipynb to train and sample from the Discrete Diffusion Model and the GraphRNN model on synthetic ER graphs.

Limitations and Future Work

  • Task 1 (Node Classification): The GAT model was limited to only 2 layers due to hardware constraints. Future work should involve implementing a deeper GAT (e.g., 4 layers with more attention heads) and conducting rigorous hyperparameter tuning.
  • Task 2 (Graph Generation): GraphRNN requires a larger training dataset (e.g., the 800-graph dataset used for the Diffusion Model) and a larger context window M for a fairer comparison. A potential improvement for both models is to introduce a loss term that penalizes disconnected components, encouraging a specific target connectivity.
