Welcome to the GNN HIV Challenge! This competition focuses on predicting the molecular properties of chemical compounds to identify potential inhibitors of HIV.
The task is to classify molecular graphs to predict anti-HIV activity.
- Input: A molecular graph structure (atoms as nodes, bonds as edges) and atomic-level features
- Output: A probability score indicating the likelihood that the molecule inhibits HIV replication
- Goal: Develop Graph Neural Network models (GCN, GAT, GIN) that generalize to unseen molecular structures
- 0: Non-Inhibitor (Inactive)
- 1: Inhibitor (Active)
- Non-Euclidean Data: Molecules are graph-structured data with varying sizes and complex topologies.
- Class Imbalance: The dataset is imbalanced (~25% positive, ~75% negative), so ROC-AUC is preferred over accuracy.
- Feature Sparsity: Models must learn meaningful molecular representations from limited atomic features.
- Generalization: Models must capture biochemical patterns without overfitting.
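One common way to handle the imbalance noted above is to weight the positive class in the loss. The following is a minimal sketch, assuming binary 0/1 labels as in the `target` column; the helper name `positive_class_weight` is hypothetical and not part of the starter code.

```python
def positive_class_weight(labels):
    """Return n_negative / n_positive for a sequence of 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive labels found")
    return n_neg / n_pos

# Toy labels mirroring the approximate 25/75 split described above.
toy_labels = [1] * 25 + [0] * 75
weight = positive_class_weight(toy_labels)  # 3.0
```

A value like this could be passed, for example, as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. Whether weighting helps ROC-AUC on this dataset is something to verify on a validation split.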
The dataset consists of molecular graphs derived from chemical compound databases.
- Total Graphs: 5,000
- Training: 4,000 graphs
- Test: 1,000 graphs
- Features: Node-level descriptors (atomic properties) and adjacency matrices (bonds)
- Format: Separate files for metadata, graph structure, and node features
`train.csv`:

| Column Name | Type | Description |
|---|---|---|
| graph_id | int | Unique identifier for the molecular graph (0 to 3999) |
| target | int | Ground truth label (0 = Inactive, 1 = Active) |
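Since `test.csv` carries no labels, a local validation split of the training data is useful for model selection. The sketch below shows one possible stratified split over `(graph_id, target)` pairs. The function name `stratified_split`, the 20% validation fraction, and the seed are illustrative assumptions, not part of the starter code.

```python
import random

def stratified_split(pairs, val_fraction=0.2, seed=0):
    """Split (graph_id, target) pairs so both parts keep the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for gid, target in pairs:
        by_class.setdefault(target, []).append(gid)
    train_ids, val_ids = [], []
    for target in sorted(by_class):
        ids = by_class[target][:]
        rng.shuffle(ids)  # shuffle within each class only
        n_val = round(len(ids) * val_fraction)
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids

# Toy example: 100 graphs, every fourth one labeled active.
toy_pairs = [(gid, 1 if gid % 4 == 0 else 0) for gid in range(100)]
train_ids, val_ids = stratified_split(toy_pairs)
```

With the real data, the pairs could be built as `list(zip(train_df['graph_id'], train_df['target']))`.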
`test.csv`:

| Column Name | Type | Description |
|---|---|---|
| graph_id | int | Unique identifier for the molecular graph (4000 to 4999) |
`node_features.pkl`: A dictionary mapping `graph_id` to a NumPy array of node features.
- Shape: `(num_nodes, num_node_features)`
- Content: Atomic properties (e.g., atomic number, degree, hybridization)
`graph_structures.pkl`: A dictionary mapping `graph_id` to adjacency information.
- Key: `edge_list`, a list of tuples `[(node_u, node_v), ...]` representing bonds
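Many GNN libraries expect edges as two parallel index lists, with each undirected bond present in both directions. Whether `edge_list` already stores both directions is not stated here, so this sketch adds the reverse edges and removes duplicates. `to_edge_index` is a hypothetical helper, not a function from the starter code.

```python
def to_edge_index(edge_list):
    """Convert [(u, v), ...] into symmetric [sources, targets] index lists."""
    directed = set()
    for u, v in edge_list:
        directed.add((u, v))
        directed.add((v, u))  # bonds are undirected, so add the reverse
    ordered = sorted(directed)
    sources = [u for u, _ in ordered]
    targets = [v for _, v in ordered]
    return [sources, targets]

# Toy three-atom chain 0-1-2 with each bond listed once.
edge_index = to_edge_index([(0, 1), (1, 2)])
# edge_index == [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The result can be wrapped with `torch.tensor(edge_index, dtype=torch.long)` to get the `(2, num_edges)` layout that PyTorch Geometric uses.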
To load a single training sample:
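import pickle
import pandas as pd
# Read the metadata and pick the first training graph
train_df = pd.read_csv('data/train.csv')
row = train_df.iloc[0]
gid = int(row['graph_id'])
# Load the per-graph node features and bond structure
with open('data/node_features.pkl', 'rb') as f:
    feats = pickle.load(f)
with open('data/graph_structures.pkl', 'rb') as f:
    structs = pickle.load(f)
x = feats[gid]                      # array of shape (num_nodes, num_node_features)
edges = structs[gid]['edge_list']   # list of (node_u, node_v) bonds
y = int(row['target'])              # 0 = Inactive, 1 = Active

Primary Metric: ROC-AUC (Area Under the Receiver Operating Characteristic Curve)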
- Range: 0.0 to 1.0
- 1.0: Perfect classifier
- 0.5: Random guessing
- < 0.5: Worse than random
ROC-AUC is threshold-independent and robust to class imbalance, making it ideal for screening tasks.
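To make the metric concrete: ROC-AUC equals the probability that a randomly chosen active molecule receives a higher score than a randomly chosen inactive one, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustration rather than the repository's scoring script, and the quadratic pairwise loop is meant for small examples only.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC-AUC for 0/1 labels and real-valued scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)  # 0.75
```

Because only the ordering of scores matters, rescaling all predictions by the same monotonic function leaves the result unchanged.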
git clone https://github.com/faranbutt/GNN-HIV-Challenge-2.git
cd GNN-HIV-Challenge-2
pip install -r requirements.txt
Starter code is provided for the following architectures:
- RFC (Random Forest Classifier)
- GCN (Graph Convolutional Network)
- GIN (Graph Isomorphism Network)
- GAT (Graph Attention Network)
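For orientation, the propagation step at the heart of a GCN layer mixes each atom's features with those of its bonded neighbors using symmetric degree normalization, `D^{-1/2} (A + I) D^{-1/2} X`. The sketch below implements only that step in plain Python, omitting the learned weight matrix and the nonlinearity. It is not the code in `starter_code/gnn_models.py`, and `gcn_propagate` is a hypothetical name.

```python
import math

def gcn_propagate(features, edge_list):
    """Apply D^{-1/2} (A + I) D^{-1/2} to a list of per-node feature rows."""
    n = len(features)
    neighbors = [{i} for i in range(n)]  # start with self-loops (the + I term)
    for u, v in edge_list:
        neighbors[u].add(v)
        neighbors[v].add(u)
    deg = [len(nb) for nb in neighbors]  # degree including the self-loop
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            norm = 1.0 / math.sqrt(deg[i] * deg[j])
            for k in range(dim):
                row[k] += norm * features[j][k]
        out.append(row)
    return out

# Two bonded atoms with one feature each: both rows become [2.0].
smoothed = gcn_propagate([[1.0], [3.0]], [(0, 1)])
```

GAT replaces the fixed normalization with learned attention coefficients, and GIN replaces it with a sum aggregation followed by an MLP.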
#default model
python starter_code/train.py
#GCN
python starter_code/train.py --model gcn --epochs 15
#GAT
python starter_code/train.py --model gat --epochs 15
#GIN
python starter_code/train.py --model gin --epochs 15

This process will:
- Train on `data/train.csv`
- Generate predictions for `data/test.csv`
- Save the submission file to `submissions/pyg_gcn.csv`
- Fork this repository
- Develop your model in a new branch or in your fork
- Generate a CSV file `submissions/<your_username>.csv` with the following columns:
  - `graph_id`: Integer ID
  - `probability`: Float prediction (0.0 to 1.0)
- Commit the file to the `submissions/` folder
- Open a Pull Request to the main branch
- GitHub Actions will automatically evaluate your submission and comment on the PR with your score
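Before opening a Pull Request, it can help to check the file locally against the format above. This is a minimal sketch using only the standard library. The functions `write_submission` and `validate_submission` are hypothetical helpers, and the checks that the actual scoring workflow performs may differ.

```python
import csv
import io

REQUIRED_COLUMNS = ["graph_id", "probability"]

def write_submission(predictions, handle):
    """Write (graph_id, probability) pairs as CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(REQUIRED_COLUMNS)
    for graph_id, prob in predictions:
        writer.writerow([int(graph_id), float(prob)])

def validate_submission(handle):
    """Return a list of problems; an empty list means no issue was found."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        try:
            gid = int(row["graph_id"])
            prob = float(row["probability"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric value")
            continue
        if gid in seen:
            problems.append(f"line {line_no}: duplicate graph_id {gid}")
        seen.add(gid)
        if not 0.0 <= prob <= 1.0:
            problems.append(f"line {line_no}: probability {prob} out of range")
    return problems

# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_submission([(4000, 0.12), (4001, 0.87)], buffer)
buffer.seek(0)
issues = validate_submission(buffer)  # []
```

For a real file, the same functions accept a handle opened with `open(path, 'w', newline='')` for writing or `open(path, newline='')` for reading.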
| Rank | User | Submission File | ROC-AUC | Date |
|---|---|---|---|---|
| 1 | faranbutt | submissions/default.csv | 0.4747 | 2026-01-16 |
├── .github
│   └── workflows
│       └── score_submission.yml
├── data
│   ├── graph_structures.pkl
│   ├── node_features.pkl
│   ├── test.csv
│   ├── test_labels.csv  # (not accessible)
│   └── train.csv
├── scoring
│   ├── generate_html_leaderboard.py
│   ├── scoring_script.py
│   └── update_leaderboard.py
├── starter_code
│   ├── baseline.py
│   ├── data_loader.py
│   ├── gnn_models.py
│   └── train.py
├── submissions
│   └── submission_samples.csv
├── .gitignore
├── README.md
├── index.html
├── leaderboard.csv
├── leaderboard.html
├── leaderboard.md
├── pyproject.toml
└── requirements.txt