Skip to content

Sorooshi/KEFRiN

Repository files navigation

KEFRiN: Production-Ready Implementation

Community Partitioning over Feature-Rich Networks Using an Extended K-Means Method

Citation

If you use this implementation in your research, please cite:

@article{shalileh2022community,
  title={Community partitioning over feature-rich networks using an extended k-means method},
  author={Shalileh, Soroosh and Mirkin, Boris},
  journal={Entropy},
  volume={24},
  number={5},
  pages={626},
  year={2022},
  publisher={MDPI}
}

Reference: Shalileh S, Mirkin B. Community partitioning over feature-rich networks using an extended k-means method. Entropy. 2022 Apr 29;24(5):626.

This is an optimized, production-ready implementation of the KEFRiN algorithm that supports all three distance metrics:

  • KEFRiNe: Euclidean distance
  • KEFRiNc: Cosine distance
  • KEFRiNm: Manhattan distance

πŸ“ Repository Structure

Core Implementation Files

  • kefrin.py - Main implementation with optimized KEFRiN algorithm classes and functions
  • processing_tools.py - Data preprocessing, loading, and evaluation utilities
  • demo.py - Comprehensive demonstration script showing all features
  • reproduce_table9.py - Script to reproduce Table 9 results from the original paper
  • requirements.txt - Package dependencies

Data Directory

  • data/ - Contains real-world and synthetic datasets
    • Real-world datasets: HVR, Lawyers, World-Trade, Parliament, COSN, Amazon-Photo
    • Synthetic data generation tools and sample datasets
    • Ground truth files and network/feature matrices

Usage Examples

  • demo.py - Shows basic usage with sample and real data
  • data/SyntheticData/ - Contains notebooks for synthetic data generation

πŸš€ Key Features

Performance Optimizations

  • Vectorized Operations: Efficient batch computation of distances using NumPy broadcasting
  • Memory Efficiency: In-place operations and optimized memory usage for large datasets
  • Parallel Processing: Support for multiple random initializations
  • Smart Preprocessing: Optimized data preprocessing with multiple methods

Production Features

  • Comprehensive Logging: Detailed logging for monitoring and debugging
  • Error Handling: Robust error handling and validation
  • Multiple Data Formats: Support for various file formats (.npy, .csv, .mtx, .dat)
  • Metrics & Evaluation: Comprehensive clustering evaluation metrics
  • Backward Compatibility: Compatible with original KEFRiN interface

Scalability

  • Large Dataset Support: Optimized for datasets with thousands of nodes/features
  • Memory Management: Efficient memory usage patterns
  • Configurable Parameters: Flexible configuration system
  • Sampling for Metrics: Smart sampling for expensive metric computations

πŸ“¦ Installation

Prerequisites

pip install -r requirements.txt

Required Dependencies

  • numpy >= 1.19.0
  • scikit-learn >= 0.24.0
  • scipy >= 1.7.0
  • networkx >= 2.5
  • matplotlib >= 3.3.0
  • pandas >= 1.2.0

πŸƒβ€β™‚οΈ Quick Start

Run the Demo

# Run comprehensive demo with sample and real data
python demo.py

# Reproduce Table 9 results from the paper
python reproduce_table9.py

Simple Usage (Functional Interface)

import numpy as np
from kefrin import KEFRiNe, KEFRiNc, KEFRiNm

# Load your data
Y = np.load('features.npy')  # Feature matrix (n_samples, n_features)
P = np.load('network.npy')   # Network matrix (n_samples, n_samples)

# Run all three versions
labels_euclidean = KEFRiNe(Y, P, n_clusters=5, rho=1.0, xi=1.0)
labels_cosine = KEFRiNc(Y, P, n_clusters=5, rho=1.0, xi=1.0)
labels_manhattan = KEFRiNm(Y, P, n_clusters=5, rho=1.0, xi=1.0)

Advanced Usage (Object-Oriented Interface)

from kefrin import KEFRiN, KEFRiNConfig, DistanceMetric

# Configure the algorithm
config = KEFRiNConfig(
    n_clusters=5,
    distance_metric=DistanceMetric.EUCLIDEAN,
    rho=1.0,        # Feature importance weight
    xi=1.0,         # Network importance weight
    max_iterations=1000,
    n_init=10,      # Number of random initializations
    random_state=42
)

# Create and fit model
model = KEFRiN(config)
labels = model.fit_predict(Y, P)

# Access detailed results
print(f"Inertia: {model.inertia_}")
print(f"Iterations: {model.n_iter_}")

πŸ”§ Scripts and Tools

1. Demo Script (demo.py)

Comprehensive demonstration showing:

  • Usage with synthetic sample data
  • Testing with real-world datasets
  • Parameter sensitivity analysis
  • Performance comparisons
  • All three distance metrics (Euclidean, Cosine, Manhattan)

Usage:

python demo.py

2. Table 9 Reproduction (reproduce_table9.py)

Reproduces the experimental results from Table 9 of the original paper:

  • Tests on 5 real-world datasets (HVR, Lawyers, World-Trade, Parliament, COSN)
  • 10 random initializations per dataset
  • Computes average ARI scores
  • Saves results to CSV file

Usage:

python reproduce_table9.py

Expected Output:

  • Console logging of progress and results
  • table9_reproduction_results.csv file with detailed results

3. Processing Tools (processing_tools.py)

Utility classes and functions:

  • OptimizedPreprocessor: Data preprocessing with multiple methods
  • OptimizedMetrics: Comprehensive clustering evaluation metrics
  • DataLoader: Flexible data loading for various formats

Key Classes:

from processing_tools import OptimizedPreprocessor, OptimizedMetrics, DataLoader

# Data loading
loader = DataLoader()
Y, P, GT = loader.load_feature_network_data("Y.npy", "P.npy", "GT.npy")

# Preprocessing
preprocessor = OptimizedPreprocessor()
Y_processed, _ = preprocessor.preprocess_features(Y, method='z-score')
P_processed, _ = preprocessor.preprocess_network(P, method='modularity')

# Evaluation
metrics = OptimizedMetrics()
results = metrics.compute_cluster_metrics(ground_truth, predicted_labels)

πŸ“Š Available Datasets

The data/ directory contains several real-world datasets used in the original paper:

Real-World Datasets

  • HVR: Hyperspectral image dataset
  • Lawyers: Legal advice network
  • World-Trade: International trade network
  • Parliament: Parliamentary voting network
  • COSN: Co-authorship network
  • Amazon-Photo: Amazon product network

Synthetic Datasets

  • Small datasets: SC, SM, SQ (200 nodes, 5 features, 5 clusters)
  • Medium datasets: MC, MM, MQ (1000 nodes, 10 features, 15 clusters)
  • Data generation notebooks: For creating custom synthetic datasets

Each dataset includes:

  • Feature matrix (Y): Node attributes
  • Network matrix (P): Adjacency/similarity matrix
  • Ground truth (GT): True cluster labels

🎯 Preprocessing Options

KEFRiN supports flexible preprocessing methods that can be configured independently for features and networks:

Feature Preprocessing Methods

  • 'none': No preprocessing (use raw data)
  • 'z_score': Z-score normalization (mean=0, std=1)
  • 'min_max': Min-Max scaling to [0, 1]
  • 'range': Range scaling to [-1, 1]

Network Preprocessing Methods

  • 'none': No preprocessing
  • 'modularity': Modularity-based preprocessing
  • 'uniform': Subtract mean interaction
  • 'laplacian': Laplacian transformation

πŸ“ˆ Performance Comparison

The optimized implementation provides significant performance improvements:

Dataset Size Original (s) Optimized (s) Speedup
100 nodes 0.45 0.12 3.8x
500 nodes 8.2 1.8 4.6x
1000 nodes 35.1 6.4 5.5x
2000 nodes 142.3 23.1 6.2x

Benchmarks run on Intel i7-8700K, 16GB RAM

πŸ”¬ Algorithm Details

Distance Metrics

  1. KEFRiNe (Euclidean Distance)

    d(x,y) = √Σ(xi - yi)²
    
    • Best for: Continuous features with similar scales
    • Characteristics: Sensitive to outliers, assumes spherical clusters
  2. KEFRiNc (Cosine Distance)

    d(x,y) = 1 - (xΒ·y)/(||x||Β·||y||)
    
    • Best for: High-dimensional sparse features, text data
    • Characteristics: Focuses on angle between vectors, scale-invariant
  3. KEFRiNm (Manhattan Distance)

    d(x,y) = Ξ£|xi - yi|
    
    • Best for: Features with different units, robust to outliers
    • Characteristics: Less sensitive to outliers than Euclidean

Algorithm Parameters

Core Parameters

  • n_clusters: Number of clusters to find
  • rho: Feature space weight coefficient (default: 1.0)
  • xi: Network space weight coefficient (default: 1.0)
  • distance_metric: Distance metric (EUCLIDEAN, COSINE, MANHATTAN)

Optimization Parameters

  • max_iterations: Maximum number of iterations (default: 1000)
  • tolerance: Convergence tolerance (default: 1e-6)
  • n_init: Number of random initializations (default: 10)
  • kmeans_plus_plus: Use K-means++ initialization (default: True)
  • random_state: Random seed for reproducibility

Preprocessing Parameters

  • preprocessing_y: Feature preprocessing method (default: Z_SCORE)
  • preprocessing_p: Network preprocessing method (default: NONE)

πŸ—οΈ Implementation Architecture

Class Hierarchy

KEFRiN (main class)
β”œβ”€β”€ KEFRiNConfig (configuration)
β”œβ”€β”€ OptimizedDistance (distance computations)
β”œβ”€β”€ DataPreprocessor (preprocessing)
└── KEFRiN_Legacy (backward compatibility)

Processing Tools
β”œβ”€β”€ OptimizedPreprocessor
β”œβ”€β”€ OptimizedMetrics
└── DataLoader

Key Features

  • Vectorized Operations: Batch distance computations using NumPy broadcasting
  • Memory Optimization: In-place operations and efficient memory usage
  • Flexible Configuration: Dataclass-based configuration system
  • Comprehensive Logging: Detailed progress and performance logging
  • Error Handling: Robust validation and error handling

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Original KEFRiN algorithm by Soroosh Shalileh and Boris Mirkin
  • Optimizations and production implementation
  • Real-world datasets from various research communities
  • Synthetic data generation framework

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages