Bot detection: author-level analysis (H8-H13) #44
jeffreyksmithjr wants to merge 10 commits into main
Conversation
Adds rules to avoid common LLM writing tics (filler preambles, weasel words, manufactured drama, etc.) to keep generated text natural.
A 4-stage pipeline evaluating whether cross-repo behavioral signals predict non-merge outcomes. Imports data from the neoteny DuckDB cache (32K+ PRs) and the PR 27 dataset, extracts burstiness/engagement/cross-repo features with strict anti-lookahead, and evaluates via logistic regression with StratifiedGroupKFold CV. 52 unit tests, lint clean.
Two bugs surfaced in the initial full run, plus one cleanup:
- DuckDB returns `pd.NaT` for NULL timestamps, not `None`. `pd.NaT is not None` is `True`, so every PR was classified as merged. Use `pd.notna()` instead.
- The neoteny review key is `submitted_at` (underscore), not `submittedAt` (camelCase). Accept both for robustness.
- Simplify bot filtering so it counts correctly via a before/after PR count.
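The NaT pitfall is easy to reproduce in isolation:

```python
import pandas as pd

merged_at = pd.NaT  # what DuckDB hands back for a NULL timestamp

# Buggy check: pd.NaT is a real object, so `is not None` is True
# and the PR is miscounted as merged.
looks_merged = merged_at is not None
print(looks_merged)  # True

# Correct check: pd.notna() treats NaT (and NaN/None) as missing.
is_merged = bool(pd.notna(merged_at))
print(is_merged)  # False
```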
- Handle all-NaN columns in `_fill_and_scale` by falling back to 0.0 (`abandoned_pr_rate` and `ge_score` are 100% NaN in the current dataset)
- Use `C=np.inf` instead of deprecated `penalty=None` for `LogisticRegression`
- Skip H5 (GE complement) gracefully when `ge_score` is entirely NaN
- Skip unavailable baselines in Stage 4 when author metadata is missing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
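`_fill_and_scale` itself is not shown in this PR, but the all-NaN fallback described above might look roughly like this sketch (function name reused from the commit message; the median-imputation strategy is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fill_and_scale(df: pd.DataFrame) -> np.ndarray:
    """Impute missing values, falling back to 0.0 for columns that are
    100% NaN (e.g. abandoned_pr_rate, ge_score in the current dataset),
    since there is nothing to impute from."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isna().all():
            filled[col] = 0.0
        else:
            filled[col] = filled[col].fillna(filled[col].median())
    return StandardScaler().fit_transform(filled)

df = pd.DataFrame({"ok": [1.0, 2.0, np.nan], "all_nan": [np.nan] * 3})
X = fill_and_scale(df)
print(X.shape)  # (3, 2)
```

Without the fallback, an all-NaN column survives median imputation as NaN and poisons the scaler and the downstream fit.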
- Compute GE trust scores using good-egg's v1 and v2 scoring models directly from cached PR data (no API calls), with anti-lookahead: only prior merged PRs on other repos before time T
- Fix author metadata import from PR 27 data (profile fields were nested under `contribution_data.profile`, not top-level)
- Add `ge_score_v1` and `ge_score_v2` to `FeatureRow` (replaces `ge_score`)
- H5 now evaluates both v1 and v2 GE complement independently
- Stage 4 baselines include both GE models as references
- GE v1 AUC=0.531, v2 AUC=0.536 as standalone predictors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
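The anti-lookahead rule ("only prior merged PRs on other repos before time T") can be sketched as a plain filter; the record shape below is hypothetical, not the pipeline's actual schema:

```python
from datetime import datetime, timezone

def prior_merged_prs(prs, author, repo, t):
    """Anti-lookahead: only PRs by `author` merged on *other* repos
    strictly before time T are eligible for the GE score at time T."""
    return [p for p in prs
            if p["author"] == author
            and p["repo"] != repo
            and p["merged_at"] is not None
            and p["merged_at"] < t]

prs = [
    {"author": "a", "repo": "x/y",
     "merged_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"author": "a", "repo": "x/y",
     "merged_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},  # after T: excluded
    {"author": "a", "repo": "z/z", "merged_at": None},          # not merged: excluded
]
t = datetime(2025, 1, 1, tzinfo=timezone.utc)
eligible = prior_merged_prs(prs, "a", "z/z", t)
print(len(eligible))  # 1
```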
Key finding: burstiness signals predict merge, not non-merge. Bursty authors (24h count >= 3) have a 12.7% non-merge rate vs a 25.9% baseline. Cross-repo activity is a marker of experience, not spam. GE v2 is the best standalone predictor (AUC=0.536).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…y (H6/H7)

Fix the `^app/` bot pattern that let 4,034 `app/*` PRs through. Add H6 interaction features (burst x novelty) to separate spammers from power users, and H7 burst content homogeneity features (size CV, repo entropy). Add an account status check script for GitHub API ground truth labels.

Key finding: the bot filter fix flipped H3 from inverted to correctly oriented (0.479 -> 0.512) and made bot signals genuinely complement GE scores. The interaction features correctly flag known suspects, but the base rate is too low (78/38K) to move AUC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
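The H7 "repo entropy" homogeneity feature is presumably Shannon entropy over the repos a burst targets; a sketch under that assumption:

```python
import math
from collections import Counter

def repo_entropy(repos):
    """Shannon entropy (bits) of a burst's PRs over target repos:
    0 = every PR hits one repo; log2(k) = spread evenly over k repos.
    Low entropy across many PRs suggests homogeneous, templated activity."""
    counts = Counter(repos)
    n = len(repos)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(repo_entropy(["a", "a", "a", "a"]) == 0.0)  # True: single repo
print(repo_entropy(["a", "b", "c", "d"]))         # 2.0: uniform over 4 repos
```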
Add `import_oss_parquet()` to ingest neoteny parquet files as the primary data source (238K PRs from 59 repos). Repos with <10% merge rate are auto-excluded (apache/spark, pytorch/pytorch, detectron2). DuckDB indexes added for Stage 2 performance (~74 min vs hours). Cross-repo coverage improved from 3.9% to 10.3%.

All hypothesis AUCs remain 0.479-0.503 (random/inverted). The GE v2 baseline strengthened to 0.533. Behavioral PR features conclusively don't predict merge outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
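The <10% merge-rate auto-exclusion amounts to a simple groupby; toy data below, not the real corpus:

```python
import pandas as pd

prs = pd.DataFrame({
    "repo": ["a/a"] * 10 + ["b/b"] * 10,
    "merged": [True] * 8 + [False] * 2 + [False] * 10,
})

# Repos whose overall merge rate is below 10% are dropped, since their
# non-merge outcomes say more about repo policy than about the author.
merge_rate = prs.groupby("repo")["merged"].mean()
keep = merge_rate[merge_rate >= 0.10].index
filtered = prs[prs["repo"].isin(keep)]
print(sorted(keep))  # ['a/a']
```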
Pivot from PR-level analysis (AUC ~0.50) to author-level analysis with `account_status=suspended` as ground truth. 27 suspended accounts found among 1,000 checked suspicious authors.

New stages 5-8:
- Stage 5: Author aggregates (H8) + bipartite network graph (H10)
- Stage 6: Time-series (H9), LLM content via Gemini (H11), semi-supervised k-NN + Isolation Forest (H13)
- Stage 7: Author-level evaluation (precision@k, AUC-ROC/PR)
- Stage 8: Campaign detection (609 authors in 101 anomalous months)

Key results against suspended accounts:
- H11 (LLM): P@10=0.50, AUC-ROC=0.976 (unsupervised)
- H10 (network): P@10=0.10, AUC-ROC=0.958
- Combined: P@10=0.90, P@25=0.80, AUC-ROC=1.000, AUC-PR=0.880
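Precision@k, used in the Stage 7 evaluation, ranks authors by detector score and measures what fraction of the top k are truly suspended; a minimal implementation on toy scores:

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the top-k highest-scored authors whose ground-truth
    label is positive (1 = suspended)."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    return float(np.mean(np.asarray(labels)[order[:k]]))

scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 1, 0, 0, 1]
print(precision_at_k(scores, labels, 2))  # 1.0
```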
…61 suspended)

The previous run checked 1,000 authors and found 27 suspended. This run checks all 3,216 multi-repo authors, finding 61 suspended (1.9%).

Key finding: H10 (network) is robust to expansion (AUC 0.952 vs 0.958), while H11 (LLM) drops from 0.976 to 0.704 -- newly discovered suspensions don't have obviously spammy titles. Combined score P@50 = 0.56, P@100 = 0.43.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🥚 Better Egg: HIGH TrustScore: 79%
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request marks a significant advancement in bot detection by shifting the analytical focus from individual pull requests to the authors themselves. The previous PR-level approach proved insufficient, leading to the development and evaluation of new author-centric hypotheses. The new methodology leverages ground truth from suspended GitHub accounts and shows promising results, particularly with network topology and LLM-based content analysis, paving the way for more accurate and actionable bot identification.
Code Review
This PR introduces a comprehensive author-level bot detection pipeline, including new data models, experiment stages, statistical analyses, and documentation. The changes are extensive and well-structured. My review focuses on correctness and potential issues in the new experiment code. I've identified a few areas for improvement, including a typo in the experiment design, anomalous timestamps in generated data files, and some inconsistencies and robustness issues in the implementation.
Note: Security Review did not run due to the size of the PR.
```
@@ -0,0 +1,21 @@
{
  "stage": "stage1",
  "timestamp": "2026-03-06T01:44:57.070310+00:00",
```
This checkpoint file, along with others in this PR, contains a future timestamp (2026-03-06...). This is highly unusual and might indicate an issue with the environment where the experiment was run (e.g., incorrect system clock) or a bug in how the data was generated. Please verify that this is intentional and not a sign of a problem.
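One cheap guard against this class of problem is to validate checkpoint timestamps at load time; the helper name below is illustrative, not part of the PR:

```python
from datetime import datetime, timezone

def check_not_future(ts_str: str, now: datetime) -> bool:
    """Return False for checkpoint timestamps that claim to be from
    the future relative to `now` (a sign of clock skew or a data bug)."""
    return datetime.fromisoformat(ts_str) <= now

# The checkpoint above fails this check for any run before 2026-03-06.
ok = check_not_future("2026-03-06T01:44:57.070310+00:00",
                      datetime(2025, 1, 1, tzinfo=timezone.utc))
print(ok)  # False
```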
**Primary corpus: neoteny DuckDB (998 MB)**
- 52 repos, 32,374 PRs, 3,308 unique authors
- Reviews present on ~78% of PRs, commits on 100%
- Date range: 2014--2026
```python
rows.append({
    "login": author,
    "hub_score": float(hub_score),
```
The feature is named `hub_score`, but the implementation uses `degree_centrality` (as noted in the comment on lines 71-75). This is potentially confusing. To improve clarity, consider renaming this feature to `degree_centrality` so it accurately reflects what is being calculated.
```diff
-"hub_score": float(hub_score),
+"degree_centrality": float(hub_score),
```
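For reference, degree centrality (the quantity the code actually computes) is just a node's degree normalized by n-1. A dependency-free sketch on a toy author-repo graph, matching networkx's `degree_centrality` definition:

```python
from collections import defaultdict

# Toy bipartite edges: an edge means the author opened a PR on that repo.
edges = [("alice", "org/repo1"), ("alice", "org/repo2"), ("bob", "org/repo1")]

degree = defaultdict(int)
for author, repo in edges:
    degree[author] += 1
    degree[repo] += 1

n = len(degree)  # 4 nodes in the graph
centrality = {node: d / (n - 1) for node, d in degree.items()}
print(centrality["alice"])  # 2 edges / 3 possible neighbors
```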
```python
text = response.strip()
if text.startswith("```"):
    lines = text.split("\n")
    # Remove first and last lines (fences)
    lines = [ln for ln in lines if not ln.strip().startswith("```")]
    text = "\n".join(lines)
```
The current logic for stripping markdown code fences removes any line that starts with ```, which could corrupt valid JSON if a string inside it contains this pattern. A more robust approach would be to specifically remove the opening and closing fence lines.
```diff
-text = response.strip()
-if text.startswith("```"):
-    lines = text.split("\n")
-    # Remove first and last lines (fences)
-    lines = [ln for ln in lines if not ln.strip().startswith("```")]
-    text = "\n".join(lines)
+text = response.strip()
+if text.startswith("```json"):
+    text = text.removeprefix("```json")
+elif text.startswith("```"):
+    text = text.removeprefix("```")
+if text.endswith("```"):
+    text = text.removesuffix("```")
+text = text.strip()
```
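Wrapped as a function, the suggested version round-trips a fenced LLM response cleanly (the helper name and sample payload are illustrative):

```python
import json

def strip_fences(response: str) -> str:
    """Strip only the opening and closing markdown fence around an LLM
    response, rather than deleting every fence-like line."""
    text = response.strip()
    if text.startswith("```json"):
        text = text.removeprefix("```json")
    elif text.startswith("```"):
        text = text.removeprefix("```")
    if text.endswith("```"):
        text = text.removesuffix("```")
    return text.strip()

raw = '```json\n{"spam_score": 0.9, "reason": "template title"}\n```'
payload = json.loads(strip_fences(raw))
print(payload["spam_score"])  # 0.9
```

Note `str.removeprefix`/`str.removesuffix` require Python 3.9+.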
```python
"H10_network": {
    "column": "hub_score",
    "transform": None,
    "description": "HITS hub score from bipartite graph",
},
```
The description for `H10_network` states it uses the "HITS hub score", but the implementation in `stage5_author_features.py` actually uses `degree_centrality`. The description should be updated to match the implementation. Ideally, the feature column name would also be changed from `hub_score` to `degree_centrality` for consistency.
```diff
 "H10_network": {
     "column": "hub_score",
     "transform": None,
-    "description": "HITS hub score from bipartite graph",
+    "description": "Degree centrality from bipartite author-repo graph",
 },
```
Summary
- `account_status == 'suspended'` as ground truth

Test plan
- `uv run pytest experiments/bot_detection/tests/ -x -q` (55 tests pass)