Bot detection: author-level analysis (H8-H13) #44
jeffreyksmithjr wants to merge 10 commits into main
Conversation
Adds rules to avoid common LLM writing tics (filler preambles, weasel words, manufactured drama, etc.) to keep generated text natural.
A 4-stage pipeline evaluating whether cross-repo behavioral signals predict non-merge outcomes. Imports data from the neoteny DuckDB cache (32K+ PRs) and the PR 27 dataset, extracts burstiness/engagement/cross-repo features with strict anti-lookahead, and evaluates via logistic regression with StratifiedGroupKFold CV. 52 unit tests, lint clean.
Two bugs surfaced in the initial full run, plus one cleanup:
- DuckDB returns `pd.NaT` for NULL timestamps, not `None`. `pd.NaT is not None` is `True`, so every PR was classified as merged. Use `pd.notna()` instead.
- The neoteny review key is `submitted_at` (underscore), not `submittedAt` (camelCase). Accept both for robustness.
- Simplify bot filtering so it counts correctly via a before/after PR count.
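The NaT pitfall is easy to reproduce in isolation:

```python
import pandas as pd

merged_at = pd.NaT  # what DuckDB hands back for a NULL timestamp

# Buggy check: pd.NaT is a real object, so `is not None` is True
# and the PR is miscounted as merged.
looks_merged = merged_at is not None
print(looks_merged)  # True

# Correct check: pd.notna() treats NaT (and NaN/None) as missing.
is_merged = bool(pd.notna(merged_at))
print(is_merged)  # False
```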
- Handle all-NaN columns in `_fill_and_scale` by falling back to 0.0 (`abandoned_pr_rate` and `ge_score` are 100% NaN in the current dataset)
- Use `C=np.inf` instead of deprecated `penalty=None` for `LogisticRegression`
- Skip H5 (GE complement) gracefully when `ge_score` is entirely NaN
- Skip unavailable baselines in Stage 4 when author metadata is missing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
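`_fill_and_scale` itself is not shown in this PR, but the all-NaN fallback described above might look roughly like this sketch (function name reused from the commit message; the median-imputation strategy is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fill_and_scale(df: pd.DataFrame) -> np.ndarray:
    """Impute missing values, falling back to 0.0 for columns that are
    100% NaN (e.g. abandoned_pr_rate, ge_score in the current dataset),
    since there is nothing to impute from."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isna().all():
            filled[col] = 0.0
        else:
            filled[col] = filled[col].fillna(filled[col].median())
    return StandardScaler().fit_transform(filled)

df = pd.DataFrame({"ok": [1.0, 2.0, np.nan], "all_nan": [np.nan] * 3})
X = fill_and_scale(df)
print(X.shape)  # (3, 2)
```

Without the fallback, an all-NaN column survives median imputation as NaN and poisons the scaler and the downstream fit.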
- Compute GE trust scores using good-egg's v1 and v2 scoring models directly from cached PR data (no API calls), with anti-lookahead: only prior merged PRs on other repos before time T
- Fix author metadata import from PR 27 data (profile fields were nested under `contribution_data.profile`, not top-level)
- Add `ge_score_v1` and `ge_score_v2` to `FeatureRow` (replaces `ge_score`)
- H5 now evaluates both v1 and v2 GE complement independently
- Stage 4 baselines include both GE models as references
- GE v1 AUC=0.531, v2 AUC=0.536 as standalone predictors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
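The anti-lookahead rule ("only prior merged PRs on other repos before time T") can be sketched as a plain filter; the record shape below is hypothetical, not the pipeline's actual schema:

```python
from datetime import datetime, timezone

def prior_merged_prs(prs, author, repo, t):
    """Anti-lookahead: only PRs by `author` merged on *other* repos
    strictly before time T are eligible for the GE score at time T."""
    return [p for p in prs
            if p["author"] == author
            and p["repo"] != repo
            and p["merged_at"] is not None
            and p["merged_at"] < t]

prs = [
    {"author": "a", "repo": "x/y",
     "merged_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"author": "a", "repo": "x/y",
     "merged_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},  # after T: excluded
    {"author": "a", "repo": "z/z", "merged_at": None},          # not merged: excluded
]
t = datetime(2025, 1, 1, tzinfo=timezone.utc)
eligible = prior_merged_prs(prs, "a", "z/z", t)
print(len(eligible))  # 1
```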
Key finding: burstiness signals predict merge, not non-merge. Bursty authors (24h count >= 3) have a 12.7% non-merge rate vs a 25.9% baseline. Cross-repo activity is a marker of experience, not spam. GE v2 is the best standalone predictor (AUC=0.536).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…y (H6/H7)

Fix the `^app/` bot pattern that let 4,034 `app/*` PRs through. Add H6 interaction features (burst x novelty) to separate spammers from power users, and H7 burst content homogeneity features (size CV, repo entropy). Add an account status check script for GitHub API ground truth labels.

Key finding: the bot filter fix flipped H3 from inverted to correctly oriented (0.479 -> 0.512) and made bot signals genuinely complement GE scores. The interaction features correctly flag known suspects, but the base rate is too low (78/38K) to move AUC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
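The H7 "repo entropy" homogeneity feature is presumably Shannon entropy over the repos a burst targets; a sketch under that assumption:

```python
import math
from collections import Counter

def repo_entropy(repos):
    """Shannon entropy (bits) of a burst's PRs over target repos:
    0 = every PR hits one repo; log2(k) = spread evenly over k repos.
    Low entropy across many PRs suggests homogeneous, templated activity."""
    counts = Counter(repos)
    n = len(repos)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(repo_entropy(["a", "a", "a", "a"]) == 0.0)  # True: single repo
print(repo_entropy(["a", "b", "c", "d"]))         # 2.0: uniform over 4 repos
```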
Add `import_oss_parquet()` to ingest neoteny parquet files as the primary data source (238K PRs from 59 repos). Repos with <10% merge rate are auto-excluded (apache/spark, pytorch/pytorch, detectron2). DuckDB indexes added for Stage 2 performance (~74 min vs hours). Cross-repo coverage improved from 3.9% to 10.3%.

All hypothesis AUCs remain 0.479-0.503 (random/inverted). The GE v2 baseline strengthened to 0.533. Behavioral PR features conclusively don't predict merge outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
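The <10% merge-rate auto-exclusion amounts to a simple groupby; toy data below, not the real corpus:

```python
import pandas as pd

prs = pd.DataFrame({
    "repo": ["a/a"] * 10 + ["b/b"] * 10,
    "merged": [True] * 8 + [False] * 2 + [False] * 10,
})

# Repos whose overall merge rate is below 10% are dropped, since their
# non-merge outcomes say more about repo policy than about the author.
merge_rate = prs.groupby("repo")["merged"].mean()
keep = merge_rate[merge_rate >= 0.10].index
filtered = prs[prs["repo"].isin(keep)]
print(sorted(keep))  # ['a/a']
```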
Pivot from PR-level analysis (AUC ~0.50) to author-level analysis with `account_status=suspended` as ground truth. 27 suspended accounts found among 1,000 checked suspicious authors.

New stages 5-8:
- Stage 5: Author aggregates (H8) + bipartite network graph (H10)
- Stage 6: Time-series (H9), LLM content via Gemini (H11), semi-supervised k-NN + Isolation Forest (H13)
- Stage 7: Author-level evaluation (precision@k, AUC-ROC/PR)
- Stage 8: Campaign detection (609 authors in 101 anomalous months)

Key results against suspended accounts:
- H11 (LLM): P@10=0.50, AUC-ROC=0.976 (unsupervised)
- H10 (network): P@10=0.10, AUC-ROC=0.958
- Combined: P@10=0.90, P@25=0.80, AUC-ROC=1.000, AUC-PR=0.880
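Precision@k, used in the Stage 7 evaluation, ranks authors by detector score and measures what fraction of the top k are truly suspended; a minimal implementation on toy scores:

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the top-k highest-scored authors whose ground-truth
    label is positive (1 = suspended)."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    return float(np.mean(np.asarray(labels)[order[:k]]))

scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 1, 0, 0, 1]
print(precision_at_k(scores, labels, 2))  # 1.0
```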
…61 suspended)

The previous run checked 1,000 authors and found 27 suspended. This run checks all 3,216 multi-repo authors, finding 61 suspended (1.9%).

Key finding: H10 (network) is robust to expansion (AUC 0.952 vs 0.958), while H11 (LLM) drops from 0.976 to 0.704 -- newly discovered suspensions don't have obviously spammy titles. Combined score P@50 = 0.56, P@100 = 0.43.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🥚 Better Egg: HIGH TrustScore: 79%
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request marks a significant advancement in bot detection by shifting the analytical focus from individual pull requests to the authors themselves. The previous PR-level approach proved insufficient, leading to the development and evaluation of new author-centric hypotheses. The new methodology leverages ground truth from suspended GitHub accounts and shows promising results, particularly with network topology and LLM-based content analysis, paving the way for more accurate and actionable bot identification.
Code Review
This PR introduces a comprehensive author-level bot detection pipeline, including new data models, experiment stages, statistical analyses, and documentation. The changes are extensive and well-structured. My review focuses on correctness and potential issues in the new experiment code. I've identified a few areas for improvement, including a typo in the experiment design, anomalous timestamps in generated data files, and some inconsistencies and robustness issues in the implementation.
Note: Security Review did not run due to the size of the PR.
```
@@ -0,0 +1,21 @@
{
  "stage": "stage1",
  "timestamp": "2026-03-06T01:44:57.070310+00:00",
```
This checkpoint file, along with others in this PR, contains a future timestamp (2026-03-06...). This is highly unusual and might indicate an issue with the environment where the experiment was run (e.g., incorrect system clock) or a bug in how the data was generated. Please verify that this is intentional and not a sign of a problem.
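One cheap guard against this class of problem is to validate checkpoint timestamps at load time; the helper name below is illustrative, not part of the PR:

```python
from datetime import datetime, timezone

def check_not_future(ts_str: str, now: datetime) -> bool:
    """Return False for checkpoint timestamps that claim to be from
    the future relative to `now` (a sign of clock skew or a data bug)."""
    return datetime.fromisoformat(ts_str) <= now

# The checkpoint above fails this check for any run before 2026-03-06.
ok = check_not_future("2026-03-06T01:44:57.070310+00:00",
                      datetime(2025, 1, 1, tzinfo=timezone.utc))
print(ok)  # False
```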
**Primary corpus: neoteny DuckDB (998 MB)**
- 52 repos, 32,374 PRs, 3,308 unique authors
- Reviews present on ~78% of PRs, commits on 100%
- Date range: 2014--2026
```python
rows.append({
    "login": author,
    "hub_score": float(hub_score),
```
The feature is named `hub_score`, but the implementation uses `degree_centrality` (as noted in the comment on lines 71-75). This is potentially confusing. To improve clarity, consider renaming this feature to `degree_centrality` so it accurately reflects what is being calculated.
```diff
-"hub_score": float(hub_score),
+"degree_centrality": float(hub_score),
```
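For reference, degree centrality (the quantity the code actually computes) is just a node's degree normalized by n-1. A dependency-free sketch on a toy author-repo graph, matching networkx's `degree_centrality` definition:

```python
from collections import defaultdict

# Toy bipartite edges: an edge means the author opened a PR on that repo.
edges = [("alice", "org/repo1"), ("alice", "org/repo2"), ("bob", "org/repo1")]

degree = defaultdict(int)
for author, repo in edges:
    degree[author] += 1
    degree[repo] += 1

n = len(degree)  # 4 nodes in the graph
centrality = {node: d / (n - 1) for node, d in degree.items()}
print(centrality["alice"])  # 2 edges / 3 possible neighbors
```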
```python
text = response.strip()
if text.startswith("```"):
    lines = text.split("\n")
    # Remove first and last lines (fences)
    lines = [ln for ln in lines if not ln.strip().startswith("```")]
    text = "\n".join(lines)
```
The current logic for stripping markdown code fences removes any line that starts with ```, which could corrupt valid JSON if a string inside it contains this pattern. A more robust approach would be to specifically remove the opening and closing fence lines.
```diff
-text = response.strip()
-if text.startswith("```"):
-    lines = text.split("\n")
-    # Remove first and last lines (fences)
-    lines = [ln for ln in lines if not ln.strip().startswith("```")]
-    text = "\n".join(lines)
+text = response.strip()
+if text.startswith("```json"):
+    text = text.removeprefix("```json")
+elif text.startswith("```"):
+    text = text.removeprefix("```")
+if text.endswith("```"):
+    text = text.removesuffix("```")
+text = text.strip()
```
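Wrapped as a function, the suggested version round-trips a fenced LLM response cleanly (the helper name and sample payload are illustrative):

```python
import json

def strip_fences(response: str) -> str:
    """Strip only the opening and closing markdown fence around an LLM
    response, rather than deleting every fence-like line."""
    text = response.strip()
    if text.startswith("```json"):
        text = text.removeprefix("```json")
    elif text.startswith("```"):
        text = text.removeprefix("```")
    if text.endswith("```"):
        text = text.removesuffix("```")
    return text.strip()

raw = '```json\n{"spam_score": 0.9, "reason": "template title"}\n```'
payload = json.loads(strip_fences(raw))
print(payload["spam_score"])  # 0.9
```

Note `str.removeprefix`/`str.removesuffix` require Python 3.9+.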
```python
"H10_network": {
    "column": "hub_score",
    "transform": None,
    "description": "HITS hub score from bipartite graph",
},
```
The description for `H10_network` states it uses the "HITS hub score", but the implementation in `stage5_author_features.py` actually uses `degree_centrality`. The description should be updated to match the implementation. Ideally, the feature column name would also be changed from `hub_score` to `degree_centrality` for consistency.
```diff
 "H10_network": {
     "column": "hub_score",
     "transform": None,
-    "description": "HITS hub score from bipartite graph",
+    "description": "Degree centrality from bipartite author-repo graph",
 },
```
Summary
- `account_status == 'suspended'` as ground truth

Test plan
- `uv run pytest experiments/bot_detection/tests/ -x -q` (55 tests pass)