
Graph Neural Network Applications: Node Classification & Graph Generation

This repository contains the implementation and analysis of two distinct Graph Neural Network (GNN) projects: Node Classification on a real-world citation network and Graph Generation using state-of-the-art generative models on synthetic data.


Project Structure

| File | Description |
| --- | --- |
| Task_1_Node_Classification_Report.pdf | Comprehensive report on the Node Classification task (Objective, GCN/GAT Architectures, Results, Ablation Study). |
| Task_1_Node_Classification.ipynb | Jupyter Notebook containing the full code for the Node Classification task. |
| Task_2_Graph_Generation_Report.pdf | Comprehensive report on the Graph Generation task (Diffusion Model and GraphRNN Architectures, Comparison, and Statistical Analysis). |
| Task_2_Graph_Generation.ipynb | Jupyter Notebook containing the full code for the Graph Generation task. |
| Presentation.pdf | Slide deck summarizing the project findings (not provided here). |

Task 1: Node Classification on OGBN-Arxiv

The objective of Task 1 was to build a GNN model for node property prediction on OGBN-Arxiv, a large-scale transductive node classification problem.

Dataset and Preprocessing

  • Dataset: OGBN-Arxiv (Open Graph Benchmark), a large-scale academic citation network.
  • Task: Classify each paper (node) into one of 40 subject categories.
  • Scale: The graph has 169,343 nodes and 1,166,243 non-zero edges.
  • Preprocessing: The graph's edge list was converted to a memory-efficient torch_sparse.SparseTensor and explicitly made symmetric to ensure proper message passing (a minimal sketch follows this list).
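
A minimal sketch of this preprocessing step, assuming the standard ogb and torch_sparse packages (variable names such as adj_t are illustrative, not taken from the notebook):

    from ogb.nodeproppred import PygNodePropPredDataset
    from torch_sparse import SparseTensor

    dataset = PygNodePropPredDataset(name="ogbn-arxiv")
    data = dataset[0]  # one large citation graph

    # Build a sparse adjacency from the directed edge list, then symmetrize
    # so messages flow along both citation directions.
    row, col = data.edge_index
    adj_t = SparseTensor(row=col, col=row,
                         sparse_sizes=(data.num_nodes, data.num_nodes))
    adj_t = adj_t.to_symmetric()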

Models Implemented

  1. Baseline Model: Graph Convolutional Network (GCN)

    • Architecture: A deep 4-layer GCN with Batch Normalization (BN) layers between hidden layers for stabilization, and Dropout (p=0.4) for regularization.
    • Ablation Study: A 20-layer GCN without Batch Normalization was tested, resulting in a significant drop in accuracy (30.17%).
  2. Improved Model: Graph Attention Network (GAT)

    • Architecture: A shallower 2-layer GAT utilizing a learnable, differentiable attention mechanism that weights each neighbor's contribution. Multi-head attention (heads=2) with concatenation was used in the hidden layer. Both models are sketched below.
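
The following sketch shows how the two architectures described above might be written with PyTorch Geometric; layer widths and class names are illustrative placeholders, not the notebook's exact values:

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv, GATConv

    class GCN(torch.nn.Module):
        # Baseline: 4-layer GCN, BatchNorm between hidden layers, Dropout p=0.4
        def __init__(self, in_dim, hid_dim, out_dim, num_layers=4):
            super().__init__()
            dims = [in_dim] + [hid_dim] * (num_layers - 1) + [out_dim]
            self.convs = torch.nn.ModuleList(
                GCNConv(dims[i], dims[i + 1]) for i in range(num_layers))
            self.bns = torch.nn.ModuleList(
                torch.nn.BatchNorm1d(d) for d in dims[1:-1])

        def forward(self, x, adj_t):
            for conv, bn in zip(self.convs[:-1], self.bns):
                x = F.dropout(F.relu(bn(conv(x, adj_t))),
                              p=0.4, training=self.training)
            return self.convs[-1](x, adj_t)

    class GAT(torch.nn.Module):
        # Improved model: 2 layers; hidden layer concatenates 2 attention heads
        def __init__(self, in_dim, hid_dim, out_dim, heads=2):
            super().__init__()
            self.conv1 = GATConv(in_dim, hid_dim, heads=heads, concat=True)
            self.conv2 = GATConv(hid_dim * heads, out_dim, heads=1)

        def forward(self, x, adj_t):
            x = F.elu(self.conv1(x, adj_t))
            return self.conv2(x, adj_t)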

Takeaways

  • Both the deeper GCN and the shallower GAT achieved competitive test accuracy (approximately 71%).
  • The GAT demonstrated superior generalization (less divergence between train/validation/test curves) and faster initial convergence.
  • The ablation study showed that Batch Normalization is essential for training deep GCNs, helping mitigate issues such as over-smoothing and over-squashing.

Task 2: Graph Generation

The objective of Task 2 was to build a GNN-based generative model capable of creating synthetic graphs that reproduce specific statistical properties.

Dataset

  • Dataset: Synthetic Erdős–Rényi (ER) graphs G(N, p) with N=25 nodes and edge probability p=0.3 (generation sketched after this list).
  • Purpose: ER graphs were chosen because their target statistical properties (e.g., degree distribution, clustering coefficient) are known analytically, allowing for clear, quantitative evaluation.
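
Generating such a dataset is straightforward with networkx; the sketch below assumes a training set of 800 graphs (the figure cited for the Diffusion Model in the limitations section):

    import networkx as nx

    N, p, num_graphs = 25, 0.3, 800
    graphs = [nx.erdos_renyi_graph(N, p) for _ in range(num_graphs)]

    # Analytic targets for evaluation: degree ~ Binomial(N - 1, p),
    # expected clustering coefficient = p = 0.3.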

Approach 1: Discrete Denoising Diffusion Model

  • Choice: Selected for its robust capacity to model complex, high-dimensional discrete data distributions (adjacency matrices).
  • Architecture: Utilizes a Graph Transformer Denoiser (non-autoregressive) that processes the entire noisy adjacency matrix at each reverse-diffusion step. This approach inherently captures the global context of the graph (the forward-noising idea is sketched after this list).
  • Key Result: The empirical degree distribution showed strong alignment with the theoretical Binomial distribution. The empirical average Clustering Coefficient (~0.299) was extremely close to the theoretical value (p=0.300).
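
To illustrate the discrete-diffusion idea (a generic sketch, not the notebook's exact transition kernel): a forward-noising step can resample each potential edge toward a uniform Bernoulli prior with step-dependent probability beta_t, and the Graph Transformer denoiser is trained to invert this corruption:

    import torch

    def forward_noise(adj, beta_t):
        """adj: (N, N) 0/1 adjacency tensor; beta_t: resampling prob. at step t."""
        adj = adj.float()
        resample = torch.rand_like(adj) < beta_t      # entries to corrupt
        coin = (torch.rand_like(adj) < 0.5).float()   # uniform Bernoulli prior
        noisy = torch.where(resample, coin, adj)
        noisy = torch.triu(noisy, diagonal=1)         # simple graph: no self-loops
        return noisy + noisy.t()                      # restore symmetry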

Approach 2: GraphRNN Model

  • Choice: Implemented as an autoregressive benchmark.
  • Architecture: Uses two decoupled GRUCells (Node RNN and Edge RNN) to sequentially generate nodes and their edge sequences, relying on sequential/local context (sketched after this list).
  • Key Result: The empirical degree distribution was visibly shifted to the left compared to the theoretical curve, showing the model underestimated the frequency of higher-degree nodes. The empirical average Clustering Coefficient (~0.245) was lower than the theoretical value (p=0.300).
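
A condensed sketch of the two decoupled recurrences at sampling time, with illustrative sizes (N nodes, context window M, hidden size H); training with teacher forcing is omitted:

    import torch

    N, M, H = 25, 25, 64
    node_rnn = torch.nn.GRUCell(input_size=M, hidden_size=H)  # graph-level state
    edge_rnn = torch.nn.GRUCell(input_size=1, hidden_size=H)  # per-edge decisions
    edge_out = torch.nn.Linear(H, 1)

    h_node = torch.zeros(1, H)
    prev_edges = torch.zeros(1, M)                # edge vector of the previous node
    for i in range(1, N):
        h_node = node_rnn(prev_edges, h_node)     # absorb the last node's edges
        h_edge, inp, edges = h_node, torch.ones(1, 1), []
        for j in range(min(i, M)):                # connect node i to earlier nodes
            h_edge = edge_rnn(inp, h_edge)
            inp = torch.bernoulli(torch.sigmoid(edge_out(h_edge)))
            edges.append(inp)
        prev_edges = torch.cat(edges + [torch.zeros(1, M - len(edges))], dim=1)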

Conclusion on Performance

The Diffusion Model proved superior because its global context view at every step allowed it to learn the holistic statistical properties of the ER graphs more effectively. The GraphRNN's sequential and limited-context generation introduced a bias towards simpler structures, resulting in a misaligned degree distribution. The Diffusion Model is also inherently more parallelizable and efficient for large-scale generation than the O(N^2) sequential steps of GraphRNN.
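
The statistical comparison behind these conclusions can be reproduced along the following lines, assuming networkx graphs and scipy for the Binomial reference (function names are illustrative):

    import networkx as nx
    import numpy as np
    from scipy.stats import binom

    def compare_to_er(generated, N=25, p=0.3):
        # Empirical degree histogram, pooled over all generated graphs
        degrees = [d for g in generated for _, d in g.degree()]
        emp_hist = np.bincount(degrees, minlength=N) / len(degrees)
        theo_hist = binom.pmf(np.arange(N), N - 1, p)  # degrees ~ Binomial(N-1, p)
        # Mean clustering coefficient vs. the analytic value C = p
        emp_clust = np.mean([nx.average_clustering(g) for g in generated])
        return emp_hist, theo_hist, emp_clust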


How to run the code

The code for this project is contained within the two Jupyter Notebooks. To run the experiments and reproduce the results, follow these steps:

Prerequisites

  • Python 3.8+
  • A suitable environment manager (Anaconda/Miniconda recommended)

Setup and Installation

  1. Clone the repository:
    git clone https://github.com/Ayan1349/Project_Graph_Learning.git
    cd Project_Graph_Learning
  2. Install Core Dependencies: The project relies on the PyTorch and PyTorch Geometric (PyG) ecosystem. It is recommended to install PyTorch first, ensuring compatibility with your CUDA version (if using a GPU).
    # Install PyTorch (follow official instructions for your CUDA version)
    # pip install torch torchvision torchaudio 
    
    # Install PyTorch Geometric (PyG) and its dependencies
    # Consult PyG documentation for the most stable command for your environment
    pip install torch_geometric torch_sparse
    
    # Install other required libraries
    pip install numpy pandas matplotlib scikit-learn ogb
    Note: The ogb library is specifically needed for downloading and pre-processing the OGBN-Arxiv dataset used in Task 1.

Execution

  1. Launch a Jupyter environment (e.g., JupyterLab or Jupyter Notebook):
    jupyter notebook
  2. Open the following files and run the cells sequentially:
    • Task 1: Task_1_Node_Classification.ipynb to train and evaluate the GCN and GAT models on OGBN-Arxiv.
    • Task 2: Task_2_Graph_Generation.ipynb to train and sample from the Discrete Diffusion Model and the GraphRNN model on synthetic ER graphs.

Limitations and Future Work

  • Task 1 (Node Classification): The GAT model was limited to only 2 layers due to hardware constraints. Future work should involve implementing a deeper GAT (e.g., 4 layers with more attention heads) and conducting rigorous hyperparameter tuning.
  • Task 2 (Graph Generation): GraphRNN requires a larger training dataset (e.g., the 800-graph dataset used for the Diffusion Model) and a larger context window M for a fairer comparison. A potential improvement for both models is to introduce a loss term that penalizes disconnected components, encouraging a specific target connectivity.
