Automated AI-Driven Health Monitor & Decision Engine for Open Source Ecosystems
Traditional OSS metrics like Star counts and Forks are lagging indicators. They do not reflect current operational reality, maintenance burden, or developer satisfaction of a project.
OSS Sentinel addresses this gap by analyzing the "heartbeat" of a repository: its Issues. By leveraging NLP to classify sentiment and urgency, we move beyond vanity metrics to actionable insights about project stability and technical debt.
The pipeline follows a rigorous data engineering flow:
Ingestion Layer: Connects to GitHub Search API to fetch raw issue data based on temporal and repository targets.
Processing Layer: Uses Pandas to clean, normalize, and flatten nested JSON structures into a structured schema.
Enrichment Layer: Employs OpenAI's GPT-4o-mini to perform deep semantic classification on every issue: Sentiment (Positive / Neutral / Negative), Category (Bug / Feature / Documentation / Other), Urgency (High / Medium / Low).
Analytics Layer: Computes a proprietary Pain Index (Sentiment × Urgency) and generates diagnostic heatmaps.
As a Proof of Concept, OSS Sentinel analyzed the health of three major Business Intelligence tools (Apache Superset, Grafana, and Metabase) over the last 6 months.
Window: 180 Days | Sample: 100 issues/repo
| Repository | Pain Index | Sentiment Distribution | High Urgency Rate |
|---|---|---|---|
| Grafana | -1.03 |
Balanced (51% Neg / 12% Pos) | 25% |
| Metabase | -1.54 |
Mixed (67% Neg / 7% Pos) | 41% |
| Apache Superset | -2.21 |
Critical (87% Neg) | 53% |
Pain Index Formula: (-1 to +1) × (Low:1 / Med:2 / High:3). Lower is "worse".
Exhibits the lowest Pain Score. While issues exist, they tend to be of medium urgency. The higher positive sentiment ratio indicates a healthier community response to issues.
The data reveals a demanding technical debt load. The overwhelming negative sentiment (87%) coupled with the highest Urgency rate suggests the project is in a constant state of triage. Adoption requires a strong internal engineering team.
Sits between the two. High urgency bugs are prevalent, but the community is slightly more positive than Apache, indicating a resilient but strained support ecosystem.
- Python 3.9+
- GitHub Personal Access Token (Classic) with
public_reposcope - OpenAI API Key
Clone the repository:
git clone https://github.com/cesaremcasa/oss-sentinel.git
cd oss-sentinelCreate and activate virtual environment:
python3.9 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activateInstall dependencies:
pip install -r requirements.txtSet your environment variables:
export GITHUB_TOKEN="your_github_token"
export OPENAI_API_KEY="your_openai_key"Or create a .env file in the root directory:
GITHUB_TOKEN=your_github_token
OPENAI_API_KEY=your_openai_key
Fetch raw issue data from GitHub repositories:
python src/ingestion.pyThis step queries the GitHub Search API and saves raw JSON data to data/raw/.
Clean and normalize the raw data into a structured format:
python src/processing.pyProcessed data will be saved to data/processed/.
Perform semantic classification using OpenAI GPT-4o-mini:
python src/enrichment.pyEach issue will be classified by Sentiment, Category, and Urgency. Results are saved to data/enriched/.
Generate Pain Index calculations and diagnostic heatmaps:
python src/analyze.pyResults and plots will be saved in assets/plots/ and data/analysis/.
To run all steps sequentially:
python main.py.
├── src/
│ ├── ingestion.py # GitHub API data fetching
│ ├── processing.py # Data cleaning & normalization
│ ├── enrichment.py # AI-powered classification
│ └── analyze.py # Pain Index calculation & visualization
├── data/
│ ├── raw/ # Raw GitHub API responses
│ ├── processed/ # Cleaned & structured data
│ ├── enriched/ # AI-classified data
│ └── analysis/ # Final metrics & reports
├── assets/
│ └── plots/ # Generated visualizations
├── main.py # Full pipeline orchestrator
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
The Pain Index is calculated as:
Pain Index = Sentiment_Score × Urgency_Weight
Where:
- Sentiment Score: Positive (+1), Neutral (0), Negative (-1)
- Urgency Weight: Low (1), Medium (2), High (3)
This metric provides a quantitative measure of project health, where lower (more negative) values indicate higher technical debt and community frustration.
GitHub API has rate limits. The system includes exponential backoff and retry logic to handle rate limiting gracefully. For unauthenticated requests: 60 requests/hour. For authenticated requests: 5,000 requests/hour.
The system uses OpenAI's GPT-4o-mini for classification due to its optimal cost/performance ratio for structured extraction tasks. Each issue is processed individually with a structured prompt to ensure consistent classification.
MIT License
Copyright (c) 2025 Cesar Augusto
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributions are welcome! Please open an issue or submit a pull request.
For questions or collaboration opportunities, please reach out via GitHub Issues.
Cesar Augusto
Data Engineer & AI Systems Architect