Emysha99/Stackoverflow_Technical_Knowledge_Depth_Analysis

Stack Overflow Knowledge Depth Analysis

🔎 Overview

This module is part of DevLink AI, our final-year research project, which analyzes project–developer compatibility by aggregating multi-source signals (GitHub, Stack Overflow, social signals, and personality cues) to match developers with projects.

My contribution was the development of the Stack Overflow Knowledge Depth Analysis Module.

🧠 Problem Statement

Most existing systems evaluate developers based on activity volume or reputation. However, they often fail to measure how conceptually deep a developer's knowledge is. This project addresses that gap by:

  • Classifying Stack Overflow questions into Basic, Intermediate, or Advanced levels.
  • Scoring developers based on the difficulty and impact of their contributions using a log-scaled XP algorithm.
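A minimal sketch of what a log-scaled XP score could look like is given here. The README does not publish the actual formula or weights, so everything named in the code is an assumption made for illustration: the `DIFFICULTY_XP` and `POST_TYPE_WEIGHT` tables, the use of vote counts as the "impact" signal, and the function names `post_xp` and `knowledge_depth`.

```python
import math

# Hypothetical weights: the project does not state its real values.
DIFFICULTY_XP = {"Basic": 1.0, "Intermediate": 2.0, "Advanced": 4.0}
POST_TYPE_WEIGHT = {"question": 1.0, "answer": 1.5}


def post_xp(difficulty, post_type, votes=0):
    """XP for one post: difficulty weight x post-type weight x log-damped votes."""
    # Assumption: votes stand in for "impact"; negative counts are clamped to 0.
    impact = 1.0 + math.log1p(max(votes, 0))
    return DIFFICULTY_XP[difficulty] * POST_TYPE_WEIGHT[post_type] * impact


def knowledge_depth(posts):
    """Log-scale the summed XP so that sheer volume has diminishing returns."""
    total = sum(post_xp(difficulty, post_type, votes)
                for difficulty, post_type, votes in posts)
    return math.log1p(total)


if __name__ == "__main__":
    prolific = [("Basic", "question", 0)] * 50   # many low-depth posts
    expert = [("Advanced", "answer", 20)] * 5    # few high-depth posts
    print(knowledge_depth(prolific), knowledge_depth(expert))
```

Under these assumed weights, five well-received Advanced answers score higher than fifty Basic questions, which is the "quality over quantity" behaviour the scoring is meant to encourage.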

โ— Problem โ€“ Solution โ€“ Novelty

Problem
Traditional Stack Overflow metrics (reputation, counts, upvotes) emphasize activity, not the depth of technical knowledge demonstrated in content.

Solution
Automatically classify the difficulty level of posts (Basic / Intermediate / Advanced) and compute a knowledge depth score that jointly considers content complexity and contribution type (question vs. answer).

Novelty

  • Hybrid modeling: combine rule-based conceptual features (36 macro categories) with statistical text features (TF-IDF).
  • Log-scaled scoring: emphasizes quality over quantity, rewarding advanced, high-signal contributions.
  • Beyond reputation: aims to capture demonstrated expertise rather than engagement-only proxies.
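To make the rule-based side of the hybrid model concrete, here is a small sketch of keyword-driven conceptual features. It is not the module's actual rule set: the three categories and their keyword lists are hypothetical stand-ins for the 36 macro categories, and `conceptual_features` is an illustrative name.

```python
import re
from collections import Counter
from typing import List

# Three illustrative macro categories with hypothetical keyword lists;
# the real module defines 36 categories.
MACRO_KEYWORDS = {
    "syntax": {"syntaxerror", "indentation", "semicolon", "typo"},
    "data_structures": {"hashmap", "heap", "trie", "graph", "queue"},
    "system_design": {"sharding", "replication", "latency", "scalability"},
}
# Fixed column order so every post maps to the same feature layout.
CATEGORY_ORDER = sorted(MACRO_KEYWORDS)


def conceptual_features(text: str) -> List[int]:
    """Count keyword hits per macro category, in CATEGORY_ORDER."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [sum(tokens[k] for k in MACRO_KEYWORDS[c]) for c in CATEGORY_ORDER]


if __name__ == "__main__":
    # Columns: data_structures, syntax, system_design
    print(conceptual_features("How do I reduce latency with sharding and replication?"))
    # prints [0, 0, 3]
```

Each post becomes a short, human-readable count vector, which can be inspected directly and later concatenated with the TF-IDF vector.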

🛠️ Approach

  1. Data Preprocessing

    • Handle missing/duplicate records.
    • Clean HTML/links/punctuation; tokenize; lemmatize with POS tagging.
  2. Feature Extraction

    • Conceptual (rule-based): 36 macro features spanning Basic, Intermediate, Advanced knowledge areas (e.g., Syntax, Data Structures, System Design).
    • Statistical: TF-IDF vectorization of post text.
  3. Classification

    • Train an SVM using TF-IDF + rule-based features.
    • Benchmark against a TF-IDF–only baseline.
  4. Knowledge Depth Scoring

    • Log-scaled weighted score combining predicted difficulty and post type (Q/A) to produce a per-user knowledge depth metric.
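The steps above can be sketched end to end with scikit-learn. This is a simplified illustration under several assumptions, not the project's code: lemmatization with POS tagging is omitted, the `RULES` table is a hypothetical stand-in for the 36 macro features, `LinearSVC` is assumed because the README does not name the SVM variant, and the helper names (`clean`, `rule_features`, `fit_hybrid`, `predict_hybrid`) are made up for this example.

```python
import html
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical keyword rules standing in for the 36 macro features.
RULES = {
    "basic": ("syntax", "print", "loop"),
    "intermediate": ("recursion", "hashmap", "regex"),
    "advanced": ("sharding", "deadlock", "consensus"),
}


def clean(text):
    """Strip HTML tags, links, and punctuation; lowercase; collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    text = html.unescape(text)                    # decode entities such as &amp;
    text = re.sub(r"https?://\S+", " ", text)     # drop bare links
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def rule_features(cleaned_texts):
    """One column per rule group: number of keyword hits in each text."""
    rows = [[sum(t.split().count(k) for k in kws) for kws in RULES.values()]
            for t in cleaned_texts]
    return csr_matrix(np.array(rows, dtype=float))


def fit_hybrid(texts, labels):
    """Fit TF-IDF, append the rule-based columns, and train a linear SVM."""
    cleaned = [clean(t) for t in texts]
    tfidf = TfidfVectorizer()
    X = hstack([tfidf.fit_transform(cleaned), rule_features(cleaned)]).tocsr()
    return tfidf, LinearSVC().fit(X, labels)


def predict_hybrid(tfidf, model, texts):
    """Apply the same cleaning and feature layout, then predict difficulty."""
    cleaned = [clean(t) for t in texts]
    X = hstack([tfidf.transform(cleaned), rule_features(cleaned)]).tocsr()
    return model.predict(X)
```

Leaving out the `rule_features` columns in both functions gives the TF-IDF–only baseline, so the two variants can be compared on the same train/test split. The predicted labels, together with each post's type (question or answer), are what the scoring step would consume.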

📊 Technologies Used

  • Python
  • Scikit-learn
  • Pandas / NumPy
  • OpenAI GPT

📚 Research Foundation

  • Inspired by Bloom's Taxonomy for CS education
  • Grounded in prior work on difficulty classification and developer modeling
  • Designed for explainability, transparency, and real-world applicability

About

This is my individual contribution to our final-year research project.
