asadnoor/ScholarNodes

ScholarNodes

Overview

ScholarNode is a researcher collaboration recommendation framework built using publication corpora alone, combining topic modeling techniques with social network analysis. Rather than relying on explicit relationships such as prior co-authorships or citation links, ScholarNode focuses exclusively on publication content, under the assumption that thematic similarity reflects latent research affinity.

By modeling researchers through the topics present in their scholarly outputs, the framework aims to identify potential collaboration opportunities, particularly interdisciplinary connections that are often missed by approaches based solely on observed collaboration or citation patterns.

To improve the interpretability of recommendations, ScholarNode integrates community detection techniques that allow researchers to explore their broader topical communities. This enables users to visualize where top recommendations emerge within the network and to interpret recommendations at both the individual and community levels through topic-based visualizations.

Conceptual Framework

ScholarNode follows a content-driven workflow to model and recommend potential research collaborations. The high-level pipeline consists of the following steps:

  1. Corpus Preparation: Publication records for researchers are collected and preprocessed to extract textual content (titles and abstracts).

  2. Topic Modeling: Global and researcher-specific topic models are generated using both probabilistic models (LDA) and embedding-based techniques (BERTopic) to capture thematic structures in the corpus.

  3. Researcher Similarity Computation: Each researcher's topic probability distribution is extracted from the learned topic models, and pairwise similarity scores are computed from the Jensen-Shannon divergence between these distributions.

  4. Researcher Network Construction and Pruning: A fully connected network is constructed from the similarity matrix and subsequently pruned to refine the network structure, reducing clutter and improving suitability for community detection.

  5. Community Detection: Network-based community detection is applied to group researchers into topical communities, providing interpretable structure for recommendations.

  6. Recommendation & Visualization: Top recommendations for each researcher are highlighted within their community context. Topic-level visualizations allow interpretation at both individual and community levels.

This framework emphasizes content-driven, interpretable collaboration recommendations, capturing potential interdisciplinary connections that are often missed by conventional network-based methods.
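
Steps 3 and 4 of the pipeline can be illustrated with a minimal sketch. It assumes that similarity is taken as 1 minus the base-2 Jensen-Shannon divergence and that pruning keeps each researcher's k most similar neighbors. Neither choice is specified above, and the function names and toy topic distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(topic_dists):
    """Pairwise similarity, assumed here to be 1 - JSD (an illustrative choice)."""
    names = sorted(topic_dists)
    return {
        a: {b: 1.0 - js_divergence(topic_dists[a], topic_dists[b])
            for b in names if b != a}
        for a in names
    }


def prune_top_k(sim, k):
    """Keep an undirected edge if it is among the top k of either endpoint."""
    edges = {}
    for a, row in sim.items():
        top = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:k]
        for b, score in top:
            edges[tuple(sorted((a, b)))] = score
    return edges


# Hypothetical researcher topic distributions over three topics.
topic_dists = {
    "r1": [0.70, 0.20, 0.10],
    "r2": [0.60, 0.30, 0.10],
    "r3": [0.10, 0.10, 0.80],
    "r4": [0.05, 0.15, 0.80],
}

sim = similarity_matrix(topic_dists)   # fully connected similarity network
edges = prune_top_k(sim, k=1)          # pruned edge list
print(sorted(edges))
```

With k=1 on this toy input, only the edges r1-r2 and r3-r4 survive pruning, which separates the two thematic groups before community detection.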

Repository Structure

The repository is organized as follows:

  • src/: Core Python implementations of ScholarNode. Subfolders contain modules for topic modeling, network construction, community detection, topic analysis, and visualization.

  • data/: Storage for preprocessed datasets and model outputs, including:

    • TEST/: An anonymized demonstration dataset with sample publication records, suitable for reproducing the pipeline. This folder also contains the same kinds of results as the institution-specific cases. For example, the communities found by the community detection algorithm are stored as GEXF files for visualization in Gephi, located under TEST/networks/nh-louvain_community.gexf.
    • Institution-specific folders (MSU/, WSU/, CSU/): Contain full datasets and model results used in published research. These are not included in this public repository.
  • backend/: Optional backend code for web-based visualization of ScholarNode results. Not required to run the demonstration pipeline.

  • lib/: External libraries and dependencies used for topic modeling (e.g., Mallet). These are not included in the repository; users may need to install them separately.

This structure separates research logic, data, and optional visualization components, making it easy to run the demonstration pipeline with the TEST dataset while clearly showing the organization of the full research workflow.

Data & Reproducibility

The data/TEST/ folder contains an anonymized demonstration dataset designed to reproduce the ScholarNode workflow. It includes both raw sample data and precomputed results to allow users to explore the methodology without requiring full institutional datasets.

Folder structure

  • data/TEST/all_works.csv: Sample publication records with anonymized researcher IDs and publication metadata.
  • data/TEST/models/: Precomputed topic models for demonstration purposes:
    • LDA_<best_topic_number>/: LDA topic model outputs for the sampled data.
      • networks/: Community detection results on the LDA-based similarity network (GEXF files for visualization in Gephi).
    • scibert/: BERTopic embedding-based topic model outputs.
      • networks/: Community detection results on the BERT-based similarity network (GEXF files).

These precomputed models and network files allow users to reproduce the topic modeling, researcher similarity computation, network construction, and community detection steps without having to run computationally intensive model training.

Note: Full institutional datasets (MSU/, WSU/, CSU/) are not included in this repository. The TEST folder is sufficient to explore the methodology and reproduce the pipeline for demonstration purposes.

How to Run

The ScholarNode workflow can be reproduced with the included driver script, src/driver.py.

1. Install dependencies

Install the required Python packages as listed in the ScholarNode Imported Package.pdf included in this repository.

Optional: Some components, like Mallet (for LDA) or Gephi (for network visualization), may require external installation if you want to retrain models or visualize networks.

2. Run the pipeline

# Run ScholarNode on the TEST dataset (default)
python src/driver.py

By default, this runs on the anonymized TEST dataset. To run on a specific institution dataset, replace TEST with the institution identifier (e.g., MSU, WSU, CSU).

All hyperparameters and decision settings (e.g., minimum publications per researcher, LDA model type, pretrained BERTopic embedder location) are configured in config/config.py.

If you do not want to install the Mallet Java library, the pipeline also supports a Gensim LDA model. You can select which model to use in config/config.py.

This setup allows you to reproduce the workflow end-to-end, explore precomputed models, and generate network/community visualizations using the TEST dataset or your own institution-specific data.
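
The kinds of settings mentioned above can be pictured with a small sketch. The actual contents of config/config.py are not shown in this README, so every field name, default value, and helper below is a hypothetical placeholder rather than the repository's real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings object; names and defaults are placeholders."""
    dataset: str = "TEST"          # or an institution identifier such as "MSU"
    min_publications: int = 5      # minimum publications per researcher (placeholder value)
    lda_backend: str = "gensim"    # "mallet" or "gensim"
    embedder_path: str = "path/to/pretrained/embedder"  # placeholder location

    def __post_init__(self):
        # Reject values the sketch does not know how to handle.
        if self.lda_backend not in ("mallet", "gensim"):
            raise ValueError(f"unknown LDA backend: {self.lda_backend!r}")
        if self.min_publications < 1:
            raise ValueError("min_publications must be at least 1")


def eligible_researchers(pub_counts, config):
    """Return researcher IDs with at least config.min_publications works."""
    return sorted(r for r, n in pub_counts.items()
                  if n >= config.min_publications)


# Choose the Gensim LDA model to avoid the Mallet dependency.
config = PipelineConfig(dataset="TEST", lda_backend="gensim", min_publications=3)
eligible = eligible_researchers({"r1": 5, "r2": 2, "r3": 3}, config)
print(eligible)
```

Keeping such settings in one validated object makes it harder for a mistyped backend name or threshold to reach model training.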

Visualization

ScholarNode provides multiple ways to visualize collaboration communities and topic-level information, at both the individual researcher and community levels.

Community-level visualization

  • Detected communities from the network are stored in GEXF format and can be opened in Gephi.

    • Example files:
      • data/TEST/LDA_<best_topic_number>/networks/nh-louvain_community.gexf
      • data/TEST/BERT/networks/nh-louvain_community.gexf
  • Example network figure:

Community Network

Detected research collaboration network visualized with community structure. Node size represents node degree, and node color indicates each researcher's department, highlighting potential interdisciplinary collaborations.

  • Community wordclouds illustrate dominant topics within each detected community.

Community Wordcloud

Example wordcloud showing dominant topics in detected research communities.
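
GEXF community files like those listed above can also be produced and inspected programmatically. The sketch below uses NetworkX to run Louvain community detection on a toy weighted network, write the result to GEXF, and read it back. The node attribute name "community", the toy edges, and the output file name are assumptions for illustration; the attribute names in the repository's own GEXF files may differ.

```python
import os
import tempfile

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy pruned similarity network: two tight groups joined by one weak edge.
G = nx.Graph()
G.add_weighted_edges_from([
    ("r1", "r2", 0.90), ("r1", "r3", 0.80), ("r2", "r3", 0.85),
    ("r4", "r5", 0.90), ("r4", "r6", 0.80), ("r5", "r6", 0.85),
    ("r3", "r4", 0.10),
])

# Louvain community detection on edge weights; the seed fixes the random order.
communities = louvain_communities(G, weight="weight", seed=0)

# Store each node's community ID as a node attribute (hypothetical name).
for cid, members in enumerate(communities):
    for node in members:
        G.nodes[node]["community"] = cid

# Write to GEXF (readable by Gephi) and load it back.
path = os.path.join(tempfile.mkdtemp(), "louvain_community.gexf")
nx.write_gexf(G, path)
H = nx.read_gexf(path)

# Group researchers by the stored community attribute.
by_community = {}
for node, data in H.nodes(data=True):
    by_community.setdefault(data["community"], []).append(node)
print(by_community)
```

The same read_gexf call can be pointed at an existing GEXF file to list community members without opening Gephi, provided the attribute name is adjusted to match the file.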

Web application profile view

  • ScholarNode optionally provides a web-based interface to explore recommendations and topic profiles interactively.

Web App Profile View

Web application profile page showing the community network, top 10 recommended collaborators, and topic distribution.

References

The ScholarNode framework and this repository are based on the following works:

  1. Noor, Md Asaduzzaman, Sheppard, John, & Clark, Jason. (2023). Finding Potential Research Collaborations from Social Networks Derived from Topic Models. 10th IEEE International Conference on Behavioural and Social Computing (BESC), 1–7.

  2. Noor, Md Asaduzzaman, Clark, Jason A., & Sheppard, John W. (2024). ScholarNodes: Applying Content-based Filtering to Recommend Interdisciplinary Communities within Scholarly Social Networks. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2791–2795.

  3. Noor, Md Asaduzzaman, Sheppard, John, & Clark, Jason. (2024). Identifying Hierarchical Community Structures in Content-based Scholarly Social Networks. 23rd IEEE International Conference on Machine Learning and Applications (ICMLA), 1–8.

These three works form the foundation of the implementation included in this repository.

Additional extensions of the framework, such as handling publication imbalance through cloning-based approaches, have been explored in later work:

  1. Noor, Md Asaduzzaman, Sheppard, John, & Clark, Jason. (2025). Handling Publication Imbalance for Effective Community Detection in Scholarly Networks. Proceedings of the 2025 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). [Not included in this repository]

This later work extends the ScholarNode framework and reflects ongoing research directions beyond what is included in this repository.
