
Bot detection: author-level analysis (H8-H13)#44

Draft
jeffreyksmithjr wants to merge 10 commits into main from bot-detection

Conversation

@jeffreyksmithjr
Contributor

Summary

  • Pivoted from PR-level (AUC ~0.50) to author-level bot detection using account_status == 'suspended' as ground truth
  • Checked all 3,216 multi-repo authors via GitHub API: found 61 suspended (1.9%)
  • H10 (network topology) AUC 0.952, H11 (LLM content) AUC 0.704, Combined P@50 = 0.56
  • 6 new hypotheses (H8-H13), stages 5-8 pipeline, campaign detection (609 Hacktoberfest-era authors)

Test plan

  • uv run pytest experiments/bot_detection/tests/ -x -q (55 tests pass)
  • Full pipeline run with 61 suspended seeds
  • Single-repo author ground truth round (pending)
  • Cross-validation with expanded ground truth

jeffreyksmithjr and others added 10 commits March 5, 2026 12:07
Adds rules to avoid common LLM writing tics (filler preambles, weasel
words, manufactured drama, etc.) to keep generated text natural.
4-stage pipeline evaluating whether cross-repo behavioral signals predict
non-merge outcomes. Imports data from neoteny DuckDB cache (32K+ PRs) and
PR 27 dataset, extracts burstiness/engagement/cross-repo features with
strict anti-lookahead, and evaluates via logistic regression with
StratifiedGroupKFold CV.

52 unit tests, lint clean.
Fixes from the initial full run, including two outright bugs:
- DuckDB returns pd.NaT for NULL timestamps, not None. `pd.NaT is not
  None` is True, so all PRs were classified as merged. Use pd.notna().
- Neoteny review key is `submitted_at` (underscore), not `submittedAt`
  (camelCase). Accept both for robustness.
- Simplify bot filtering to count correctly via before/after PR count.
- Handle all-NaN columns in _fill_and_scale by falling back to 0.0
  (abandoned_pr_rate, ge_score are 100% NaN in current dataset)
- Use C=np.inf instead of deprecated penalty=None for LogisticRegression
- Skip H5 (GE complement) gracefully when ge_score is entirely NaN
- Skip unavailable baselines in Stage 4 when author metadata is missing
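The NaT pitfall above is easy to reproduce in isolation (a minimal sketch; `is_merged` is an illustrative helper, not the pipeline's actual function):

```python
import pandas as pd

merged_at = pd.NaT  # what DuckDB returns for a NULL timestamp

# The buggy check: NaT is a real object, so "is not None" is True
# and every PR looks merged.
assert (merged_at is not None) is True

# The fix: pd.notna() treats NaT (and NaN) as missing.
def is_merged(merged_at) -> bool:
    return bool(pd.notna(merged_at))

assert is_merged(pd.NaT) is False
assert is_merged(pd.Timestamp("2024-01-01")) is True
```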

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Compute GE trust scores using good-egg's v1 and v2 scoring models
  directly from cached PR data (no API calls), with anti-lookahead:
  only prior merged PRs on other repos before time T
- Fix author metadata import from PR 27 data (profile fields were
  nested under contribution_data.profile, not top-level)
- Add ge_score_v1 and ge_score_v2 to FeatureRow (replaces ge_score)
- H5 now evaluates both v1 and v2 GE complement independently
- Stage 4 baselines include both GE models as references
- GE v1 AUC=0.531, v2 AUC=0.536 as standalone predictors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Key finding: burstiness signals predict merge, not non-merge.
Bursty authors (24h count >= 3) have 12.7% non-merge rate vs 25.9%
baseline. Cross-repo activity is a marker of experience, not spam.
GE v2 is the best standalone predictor (AUC=0.536).
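The 24h burst count behind this finding can be sketched as follows (illustrative only; the window semantics and field names are assumptions, not the pipeline's exact definition):

```python
from datetime import datetime, timedelta

def count_in_24h(created_at: list[datetime], t: datetime) -> int:
    """Number of PRs an author opened in the 24h window ending at t."""
    return sum(1 for c in created_at if t - timedelta(hours=24) < c <= t)

prs = [datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 14),
       datetime(2024, 5, 1, 20), datetime(2024, 4, 20, 10)]
burst = count_in_24h(prs, datetime(2024, 5, 1, 20))
assert burst == 3  # meets the "24h count >= 3" bursty threshold
```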

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…y (H6/H7)

Fix ^app/ bot pattern that let 4,034 app/* PRs through. Add H6 interaction
features (burst x novelty) to separate spammers from power users, and H7
burst content homogeneity features (size CV, repo entropy). Add account
status check script for GitHub API ground truth labels.

Key finding: bot filter fix flipped H3 from inverted to correctly oriented
(0.479 -> 0.512) and made bot signals genuinely complement GE scores.
Interaction features correctly flag known suspects but base rate is too low
(78/38K) to move AUC.
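The bot filter fix is about pattern matching against author logins; a hypothetical sketch of the filter's shape (`BOT_PATTERNS` and `is_bot` are illustrative names, not the pipeline's; per the changelog, the real patterns live in study_config.yaml):

```python
import re

# Hypothetical bot-login patterns; the real list is configured elsewhere.
BOT_PATTERNS = [r".*\[bot\]$", r"^app/"]

def is_bot(login: str) -> bool:
    return any(re.search(p, login) for p in BOT_PATTERNS)

assert is_bot("dependabot[bot]")
assert is_bot("app/github-actions")
assert not is_bot("jeffreyksmithjr")
```

A subtle failure mode of anchored patterns is that a small mistake can silently let an entire class of logins through, which is why the commit verifies the fix by comparing filtered PR counts before and after.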

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add import_oss_parquet() to ingest neoteny parquet files as primary
data source (238K PRs from 59 repos). Repos with <10% merge rate
auto-excluded (apache/spark, pytorch/pytorch, detectron2). DuckDB
indexes added for stage 2 performance (~74 min vs hours).

Cross-repo coverage improved from 3.9% to 10.3%. All hypothesis AUCs
remain 0.479-0.503 (random/inverted). GE v2 baseline strengthened to
0.533. Behavioral PR features conclusively don't predict merge outcome.
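The <10% merge-rate auto-exclusion can be sketched without the real data (repo names and counts here are made up):

```python
def low_merge_rate_repos(pr_outcomes: dict[str, list[bool]],
                         threshold: float = 0.10) -> set[str]:
    """Repos whose fraction of merged PRs falls below threshold."""
    return {
        repo for repo, merged in pr_outcomes.items()
        if sum(merged) / len(merged) < threshold
    }

outcomes = {
    "apache/spark": [True] + [False] * 19,     # 5% merge rate: excluded
    "small/repo": [True, True, False, False],  # 50% merge rate: kept
}
assert low_merge_rate_repos(outcomes) == {"apache/spark"}
```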

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pivot from PR-level (AUC ~0.50) to author-level analysis with
account_status=suspended as ground truth. 27 suspended accounts
found among 1,000 checked suspicious authors.

New stages 5-8:
- Stage 5: Author aggregates (H8) + bipartite network graph (H10)
- Stage 6: Time-series (H9), LLM content via Gemini (H11),
  semi-supervised k-NN + Isolation Forest (H13)
- Stage 7: Author-level evaluation (precision@k, AUC-ROC/PR)
- Stage 8: Campaign detection (609 authors in 101 anomalous months)

Key results against suspended accounts:
- H11 (LLM): P@10=0.50, AUC-ROC=0.976 (unsupervised)
- H10 (network): P@10=0.10, AUC-ROC=0.958
- Combined: P@10=0.90, P@25=0.80, AUC-ROC=1.000, AUC-PR=0.880
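Precision@k, the headline metric here, is just the fraction of the k highest-scored authors whose label is positive (a minimal sketch; scores and labels are toy values):

```python
def precision_at_k(scores: list[float], labels: list[bool], k: int) -> float:
    """Fraction of the k highest-scored items that are true positives."""
    ranked = sorted(zip(scores, labels), key=lambda x: x[0], reverse=True)
    return sum(lab for _, lab in ranked[:k]) / k

scores = [0.99, 0.95, 0.90, 0.40, 0.10]
labels = [True, True, False, True, False]
assert precision_at_k(scores, labels, 2) == 1.0
assert precision_at_k(scores, labels, 5) == 0.6
```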
…61 suspended)

Previous run checked 1,000 authors and found 27 suspended. This run checks
all 3,216 multi-repo authors, finding 61 suspended (1.9%). Key finding:
H10 (network) is robust to expansion (AUC 0.952 vs 0.958), while H11 (LLM)
drops from 0.976 to 0.704 -- newly discovered suspensions don't have
obviously spammy titles. Combined score P@50 = 0.56, P@100 = 0.43.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Mar 6, 2026

🥚 Better Egg: HIGH Trust

Score: 79%

Score Breakdown

| Component   | Value           |
| ----------- | --------------- |
| Graph Score | 82%             |
| Merge Rate  | 86% (44/51 PRs) |
| Account Age | 4,936 days      |

Top Contributions

| Repository | PRs | Language | Stars |
| --- | --- | --- | --- |
| 2ndSetAI/good-egg | 19 | Python | 20 |
| jeffreyksmithjr/verskyt | 9 | Python | 2 |
| jeffreyksmithjr/galapagos_nao | 7 | Elixir | 21 |
| aws-samples/aws-big-data-blog | 3 | Java | 894 |
| pytorch/pytorch.github.io | 2 | HTML | 278 |
| melissawm/open-source-ai-contribution-policies | 1 | N/A | 114 |
| nerves-project/nerves_examples | 1 | Elixir | 402 |
| kilimchoi/engineering-blogs | 1 | Ruby | 37,285 |
| kdeldycke/plumage | 1 | CSS | 55 |

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request marks a significant advancement in bot detection by shifting the analytical focus from individual pull requests to the authors themselves. The previous PR-level approach proved insufficient, leading to the development and evaluation of new author-centric hypotheses. The new methodology leverages ground truth from suspended GitHub accounts and shows promising results, particularly with network topology and LLM-based content analysis, paving the way for more accurate and actionable bot identification.

Highlights

  • Shift to Author-Level Bot Detection: The bot detection strategy has pivoted from PR-level analysis (H1-H7) to a more effective author-level approach (H8-H13), significantly improving detection accuracy.
  • Ground Truth Established: A robust ground truth dataset of 61 suspended GitHub accounts was identified by checking 3,216 multi-repo authors via the GitHub API, representing 1.9% of checked accounts.
  • High-Performing Hypotheses: New author-level hypotheses, specifically H10 (network topology) with an AUC of 0.952 and H11 (LLM content analysis) with an AUC of 0.704, demonstrated strong predictive power for bot detection.
  • Combined Model Effectiveness: A combined model of author-level features achieved a Precision@50 of 0.56, meaning 28 out of the top 50 ranked authors were confirmed suspended accounts, indicating actionable results for trust scoring.
  • PR-Level Ineffectiveness Confirmed: Previous PR-level features (H1-H7) were found to be largely ineffective, yielding AUCs between 0.479-0.503, reinforcing the need for the author-level pivot.
  • New Pipeline Stages Introduced: The experiment pipeline has been extended with new stages (5-8) to support author-level feature extraction, LLM content analysis, semi-supervised learning, and campaign detection.
Changelog
  • CLAUDE.md
    • Added a 'Writing Style' section with guidelines to avoid common LLM writing tics and promote human-like communication.
  • experiments/bot_detection/DOE.md
    • Added a new Design of Experiments document outlining the bot detection experiment, including overview, data sources, anti-lookahead protocol, hypotheses (H1-H5), statistical methodology, outcome classification, ramp-up procedure, red team checkpoints, and baselines.
  • experiments/bot_detection/RED_TEAM_AUDIT.md
    • Added a new document for logging red team audit findings across various checkpoints (Data Integrity, Anti-Lookahead, Statistical Methods, Scale-Up, Final Audit).
  • experiments/bot_detection/RESULTS.md
    • Added a new document detailing the results of the bot detection experiment, including dataset iteration 4, post-mortem analysis of PR-level features (H1-H7), and iteration 5 author-level bot detection (H8-H13) results, analysis, and caveats.
  • experiments/bot_detection/cache.py
    • Added a new Python module for DuckDB-backed storage, including schema definition for PRs, reviews, commits, and authors.
    • Implemented functions for importing data from Neoteny DuckDB caches and PR27 JSONL files.
    • Provided query helpers for retrieving PRs, reviews, commits, and author-specific data.
    • Added functionality for filtering PRs from known bot accounts and managing author account statuses.
  • experiments/bot_detection/checkpoint.py
    • Added a new Python module for writing and reading stage completion checkpoints, facilitating pipeline state management.
  • experiments/bot_detection/data/results/SUMMARY.md
    • Added a new markdown file summarizing Stage 3 evaluation results for PR-level hypotheses (H1-H7), including AUC-ROC, AUC-PR, Mann-Whitney U, and LRT comparisons.
  • experiments/bot_detection/data/results/baseline_comparison.json
    • Added a new JSON file containing baseline comparison results for PR-level features, including AUC-ROC, AUC-PR, and DeLong tests against a reference baseline.
  • experiments/bot_detection/data/results/statistical_tests.json
    • Added a new JSON file containing detailed statistical test results for PR-level features (H1-H7), including AUC-ROC with confidence intervals, Mann-Whitney U, Holm-Bonferroni sweep results, and nested LRTs.
  • experiments/bot_detection/data/stage1_complete.json
    • Added a new JSON file marking the completion of Stage 1, detailing row counts for PRs, repos, reviews, commits, authors, filtered bots, and outcome classifications.
  • experiments/bot_detection/data/stage2_complete.json
    • Added a new JSON file marking the completion of Stage 2, detailing the number of feature rows and the columns extracted.
  • experiments/bot_detection/data/stage3_complete.json
    • Added a new JSON file marking the completion of Stage 3, detailing the hypotheses evaluated.
  • experiments/bot_detection/data/stage4_complete.json
    • Added a new JSON file marking the completion of Stage 4, detailing the baselines evaluated.
  • experiments/bot_detection/embedding.py
    • Added a new Python module for Gemini text embedding with local file caching to reduce API calls.
    • Implemented functions for loading, saving, and batch embedding texts, with a fallback to zero vectors if the generative AI library is not installed.
    • Included a utility function for calculating cosine similarity between embeddings.
  • experiments/bot_detection/llm_client.py
    • Added a new Python module for LLM classification via LiteLLM, incorporating file-based caching and retry logic for API calls.
  • experiments/bot_detection/models.py
    • Added a new Python module defining Pydantic models for various data structures used in the bot detection experiment, including PR outcomes, PR records, review and commit records, author metadata, and different feature sets (Burstiness, Engagement, Cross-Repo, FeatureRow, AuthorFeatureRow).
    • Introduced models for StageCheckpoint and StudyConfig to standardize data flow and configuration.
  • experiments/bot_detection/pipeline.py
    • Added a new Python module for the bot detection experiment's command-line interface (CLI).
    • Implemented commands to run all pipeline stages, specific stages, or author-level pipeline stages (5-8) sequentially.
  • experiments/bot_detection/scripts/backfill_embeddings.py
    • Added a new Python script for asynchronously backfilling burst_title_embedding_sim into feature parquet files using Gemini embeddings, improving content homogeneity analysis.
  • experiments/bot_detection/scripts/check_account_status.py
    • Added a new Python script for checking GitHub account status of PR authors via the GitHub API.
    • Implemented logic for rate limiting, upserting profile fields, and targeting suspicious authors for status checks.
  • experiments/bot_detection/scripts/smoke_test.py
    • Added a new Python script for an end-to-end smoke test of the bot detection pipeline on a micro scale, ensuring basic functionality across stages.
  • experiments/bot_detection/stages/stage1_build_corpus.py
    • Added a new Python module for Stage 1 of the pipeline, responsible for building the PR corpus from various data sources.
    • Implemented functions for computing per-repo stale thresholds, classifying PR outcomes (merged, rejected, pocket_veto), and filtering known bot accounts.
  • experiments/bot_detection/stages/stage2_extract_signals.py
    • Added a new Python module for Stage 2, focusing on extracting PR-level signal features.
    • Implemented functions to compute burstiness (H1), engagement (H2), cross-repo fingerprinting (H3), GE scores (v1 and v2), interaction features (H6), and burst content homogeneity (H7).
  • experiments/bot_detection/stages/stage3_evaluate.py
    • Added a new Python module for Stage 3, evaluating PR-level hypotheses (H1-H7).
    • Implemented cross-validation, logistic regression, Mann-Whitney U tests, Holm-Bonferroni correction for parameter sweeps, DeLong AUC comparisons, and likelihood ratio tests.
  • experiments/bot_detection/stages/stage4_baselines.py
    • Added a new Python module for Stage 4, evaluating baseline models against PR-level outcomes.
    • Implemented functions for GE score, account age, zero followers, zero repos, and random baselines, with DeLong tests for comparison.
  • experiments/bot_detection/stages/stage5_author_features.py
    • Added a new Python module for Stage 5, computing author-level aggregate (H8) and network (H10) features.
    • Implemented functions to build bipartite graphs, compute hub scores, bipartite clustering, and various other network metrics.
  • experiments/bot_detection/stages/stage6_llm_content.py
    • Added a new Python module for Stage 6 (part 1), performing LLM-based content analysis (H11) on PR titles.
    • Implemented logic to pre-filter authors, generate prompts, call LLMs, and parse scores.
  • experiments/bot_detection/stages/stage6_semi_supervised.py
    • Added a new Python module for Stage 6 (part 2), computing semi-supervised features (H13).
    • Implemented functions for k-Nearest Neighbors (k-NN) distance to seeds and Isolation Forest anomaly scores.
  • experiments/bot_detection/stages/stage6_time_series.py
    • Added a new Python module for Stage 6 (part 3), computing time-series features (H9) for authors.
    • Implemented functions to calculate inter-PR coefficient of variation, median intervals, max dormancy, burst episode count, weekend ratio, and hour entropy.
  • experiments/bot_detection/stages/stage7_author_evaluate.py
    • Added a new Python module for Stage 7, evaluating author-level hypotheses (H8-H13) against ground truth.
    • Implemented metrics like Precision@k, Recall@k, AUC-ROC, AUC-PR, Mann-Whitney U, and cross-validation for individual and combined scores.
  • experiments/bot_detection/stages/stage8_campaigns.py
    • Added a new Python module for Stage 8, detecting coordinated spam campaigns (H12) through temporal patterns.
    • Implemented functions to compute monthly rejection rates, flag anomalous months, identify campaign authors, and cross-reference with suspended accounts.
  • experiments/bot_detection/stats.py
    • Added a new Python module containing statistical utility functions.
    • Included implementations for DeLong AUC test, Holm-Bonferroni correction, likelihood ratio test, Kruskal-Wallis with Dunn's post-hoc, Chi-squared test, Cochran-Armitage trend test, and odds ratio.
  • experiments/bot_detection/study_config.yaml
    • Added a new YAML configuration file for the bot detection experiment, centralizing settings for data sources, classification, burstiness sweep, analysis, scale definitions, bot patterns, embedding, author-level analysis, and file paths.
  • experiments/bot_detection/tests/conftest.py
    • Added a new Python file for pytest configuration.
  • experiments/bot_detection/tests/test_author_evaluate.py
    • Added new unit tests for author evaluation metrics, specifically precision_at_k and lift over random baselines.
  • experiments/bot_detection/tests/test_author_features.py
    • Added new unit tests for author aggregate statistics and the identification of suspicious authors based on merge rates and repository counts.
  • experiments/bot_detection/tests/test_cache.py
    • Added new unit tests for the DuckDB cache, covering schema creation, query helpers, bot filtering, account status management, and parquet data import with deduplication and low merge rate skipping.
  • experiments/bot_detection/tests/test_evaluation.py
    • Added new unit tests for statistical evaluation methods, including Holm-Bonferroni correction, DeLong AUC test, AUC-ROC with confidence intervals, and likelihood ratio tests.
  • experiments/bot_detection/tests/test_llm_content.py
    • Added new unit tests for LLM content parsing, score clamping, and caching mechanisms, along with an integration test for run_stage6_llm_content.
  • experiments/bot_detection/tests/test_models.py
    • Added new unit tests for Pydantic models, ensuring correct instantiation, default values, and serialization for PROutcome, BotDetectionPR, FeatureRow, BurstinessFeatures, EngagementFeatures, CrossRepoFeatures, and StageCheckpoint.
  • experiments/bot_detection/tests/test_network.py
    • Added new unit tests for network feature computation, including get_all_author_repo_pairs, hub scores, and bipartite clustering.
  • experiments/bot_detection/tests/test_semi_supervised.py
    • Added new unit tests for semi-supervised learning components, specifically compute_knn_distances and compute_isolation_forest_scores.
  • experiments/bot_detection/tests/test_signals.py
    • Added new unit tests for PR-level signal feature extraction, covering burstiness, engagement, cross-repo TF-IDF, burst content homogeneity, embedding similarity, and interaction features.
  • pyproject.toml
    • Updated pyproject.toml to include new experiments optional dependencies for duckdb, scikit-learn, pandas, pyarrow, and litellm.
    • Added google-generativeai to development dependencies.
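Several of the modules above lean on the multiple-comparison machinery in stats.py. As a self-contained illustration of the Holm-Bonferroni step-down procedure used for the parameter sweeps (a minimal sketch; the p-values are toy numbers, not experiment output):

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep decision per hypothesis via Holm's step-down."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject

# Toy sweep: only the smallest p-value survives correction.
assert holm_bonferroni([0.001, 0.04, 0.03]) == [True, False, False]
```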
Activity
  • The pull request description indicates that 55 tests passed for experiments/bot_detection/tests/.
  • A full pipeline run was executed using 61 suspended accounts as seeds for ground truth.
  • Further work is pending for a single-repo author ground truth round and cross-validation with expanded ground truth.

@gemini-code-assist bot left a comment

Code Review

This PR introduces a comprehensive author-level bot detection pipeline, including new data models, experiment stages, statistical analyses, and documentation. The changes are extensive and well-structured. My review focuses on correctness and potential issues in the new experiment code. I've identified a few areas for improvement, including a typo in the experiment design, anomalous timestamps in generated data files, and some inconsistencies and robustness issues in the implementation.

Note: Security Review did not run due to the size of the PR.

@@ -0,0 +1,21 @@
```json
{
  "stage": "stage1",
  "timestamp": "2026-03-06T01:44:57.070310+00:00",
```


Severity: high

This checkpoint file, along with others in this PR, contains a future timestamp (2026-03-06...). This is highly unusual and might indicate an issue with the environment where the experiment was run (e.g., incorrect system clock) or a bug in how the data was generated. Please verify that this is intentional and not a sign of a problem.

**Primary corpus: neoteny DuckDB (998 MB)**
- 52 repos, 32,374 PRs, 3,308 unique authors
- Reviews present on ~78% of PRs, commits on 100%
- Date range: 2014--2026


Severity: medium

The date range for the primary corpus is specified as 2014--2026. The end year 2026 appears to be a typo and should likely be 2024.


```python
rows.append({
    "login": author,
    "hub_score": float(hub_score),
```


Severity: medium

The feature is named hub_score, but the implementation uses degree_centrality (as noted in the comment on lines 71-75). This is potentially confusing. To improve clarity, consider renaming this feature to degree_centrality to accurately reflect what is being calculated.

Suggested change

```diff
-    "hub_score": float(hub_score),
+    "degree_centrality": float(hub_score),
```

Comment on lines +33 to +38

````python
text = response.strip()
if text.startswith("```"):
    lines = text.split("\n")
    # Remove first and last lines (fences)
    lines = [ln for ln in lines if not ln.strip().startswith("```")]
    text = "\n".join(lines)
````


Severity: medium

The current logic for stripping markdown code fences removes any line that starts with ```, which could corrupt valid JSON if a string inside it contains this pattern. A more robust approach would be to specifically remove the opening and closing fence lines.

Suggested change

````diff
-text = response.strip()
-if text.startswith("```"):
-    lines = text.split("\n")
-    # Remove first and last lines (fences)
-    lines = [ln for ln in lines if not ln.strip().startswith("```")]
-    text = "\n".join(lines)
+text = response.strip()
+if text.startswith("```json"):
+    text = text.removeprefix("```json")
+elif text.startswith("```"):
+    text = text.removeprefix("```")
+if text.endswith("```"):
+    text = text.removesuffix("```")
+text = text.strip()
````

Comment on lines +35 to +39

```python
"H10_network": {
    "column": "hub_score",
    "transform": None,
    "description": "HITS hub score from bipartite graph",
},
```


Severity: medium

The description for H10_network states it uses "HITS hub score", but the implementation in stage5_author_features.py actually uses degree_centrality. The description should be updated to match the actual implementation. Ideally, the feature column name would also be changed from hub_score to degree_centrality for consistency.

Suggested change

```diff
 "H10_network": {
     "column": "hub_score",
     "transform": None,
-    "description": "HITS hub score from bipartite graph",
+    "description": "Degree centrality from bipartite author-repo graph",
 },
```
