
Conversation

@BKHMSI (Contributor) commented Feb 10, 2026

No description provided.

@KartikP (Contributor) commented Feb 10, 2026

Hi @BKHMSI, did you intend to remove Pereira2018.243sentences-linear?

@BKHMSI (Contributor, Author) commented Feb 10, 2026

> Hi @BKHMSI, did you intend to remove Pereira2018.243sentences-linear?

Yes, I did intend to change all benchmarks to use ridge regression instead of linear regression.
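
For context, the switch only concerns the readout that maps model representations onto the recorded data. A rough sketch of the two readouts using scikit-learn on synthetic data (the actual Brain-Score metrics wrap their own cross-validated regression, so the estimators and alphas below are illustrative rather than the benchmarks' exact settings):

        import numpy as np
        from sklearn.linear_model import LinearRegression, RidgeCV

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 300))                      # model activations: stimuli x features
        y = X @ rng.normal(size=300) + rng.normal(size=200)  # synthetic neural recordings

        # Ordinary least squares: no regularization, tends to overfit when the
        # number of features approaches or exceeds the number of stimuli.
        linear = LinearRegression().fit(X, y)

        # Ridge with built-in cross-validation over the regularization strengths.
        ridge = RidgeCV(alphas=np.logspace(-3, 3, 7)).fit(X, y)
        print(linear.score(X, y), ridge.score(X, y), ridge.alpha_)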

@mschrimpf (Member)

Let's keep the original ones for reference (at least in code; they don't have to be displayed on the website). I agree we should use ridge going forward.

@BKHMSI (Contributor, Author) commented Feb 10, 2026

Re-added the linear metrics for all benchmarks and fixed the ceiling for ridge.

Note that for Pereira2018, we need to cache ceilings for the new metrics.
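
Keeping both variants available could look roughly like the usual brainscore_language plugin registration, with the linear identifiers retained alongside new ridge ones; the factory functions below are placeholders and the "-ridge" suffix is an assumed naming choice for illustration:

        from brainscore_language import benchmark_registry

        def _pereira2018_243sentences_linear():   # placeholder factory for the original linear benchmark
            ...

        def _pereira2018_243sentences_ridge():    # placeholder factory for the ridge variant
            ...

        # keep the original linear readout registered for reference/reproducibility ...
        benchmark_registry['Pereira2018.243sentences-linear'] = _pereira2018_243sentences_linear
        # ... while the ridge variant is the one surfaced on the leaderboard going forward
        benchmark_registry['Pereira2018.243sentences-ridge'] = _pereira2018_243sentences_ridge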

@mike-ferguson (Member) commented Feb 10, 2026

Hi all,

Please note a few things about the state of the language repo:

  1. Automerging is currently disabled for benchmark-only PRs. Merging will require manual approval by a Brain-Score admin (most likely Kartik, me, or Martin).
  2. Scoring is disabled as well until the language scoring infrastructure comes online (~end of next week).

KartikP added a commit that referenced this pull request Feb 11, 2026
@KartikP (Contributor) commented Feb 12, 2026

Hi @BKHMSI, thanks for the PR. I had a chance to take a look and noticed a few things that require attention:

  1. Please provide a description. A simple explanation of what you've done and why would suffice.

  2. In the benchmark factory, you pass the groupkfold setting to the linear benchmarks via CV_kwargs, which breaks backwards compatibility with the linear variants of the benchmarks. Given that the intention is to hide linear on the leaderboard and use RidgeCV moving forward, could you just not pass any kwargs? Otherwise, the tests should also be updated to reflect the change.

  3. Missing numpy import in blank2014/benchmark.py, fedorenko2016/benchmark.py, and tuckute2024/benchmark.py.

  4. Benchmarks return a dict instead of a Score object. This breaks the way the Score object is parsed to populate the DB -> leaderboard. My recommendation (a fuller sketch follows this list):

        score = Score(np.mean(list(layer_scores.values())))
        score.attrs['layer_scores'] = layer_scores

  5. You've added a substantial amount of code (RidgeGCV, Ridge benchmark variants, etc.) yet no tests. To ensure that your additions continue to operate as expected, please consider adding tests for them.
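
Spelled out slightly more, assuming Score from brainscore_core.metrics and a layer_scores dict mapping layer names to per-layer scores (the helper name below is illustrative, not existing code):

        import numpy as np
        from brainscore_core.metrics import Score

        def aggregate_layer_scores(layer_scores: dict) -> Score:
            # illustrative helper: collapse per-layer scores into one benchmark Score
            score = Score(np.mean(list(layer_scores.values())))
            # keep the per-layer breakdown without breaking the Score parsing
            # that populates the DB -> leaderboard
            score.attrs['layer_scores'] = layer_scores
            return score

        score = aggregate_layer_scores({'layer1': 0.31, 'layer2': 0.44})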

I've attempted to address all of these issues in #361. The most significant differences are:

  1. Return a Score object with layer_scores, raw, and ceiling as attributes. This was necessary because the dict was breaking the downstream benchmark API.
  2. Default.kfold was set to False instead of "group" to ensure backwards compatibility of the cross-validation (illustrated after this list). This was the main data-integrity risk.
  3. Added the missing imports (numpy and scipy.linalg).
  4. Added the missing coords (Blank2014 never added a story coord and Fedorenko2016 never added a sentence_id coord).
  5. Removed CV_kwargs from the linear benchmarks.
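
For anyone unfamiliar with the distinction behind point 2: plain k-fold splits stimuli freely across folds, whereas group k-fold keeps all stimuli sharing a group label (e.g. the same story or sentence) in a single fold, which is also why the missing story / sentence_id coords mattered. A minimal scikit-learn illustration with made-up data (the benchmarks' actual cross-validation lives in the metric code, so this is only meant to show the difference):

        import numpy as np
        from sklearn.model_selection import KFold, GroupKFold

        X = np.arange(12).reshape(12, 1)      # 12 stimuli, 1 feature
        groups = np.tile([0, 1, 2, 3], 3)     # e.g. 4 stories, sentences interleaved

        # Plain k-fold: sentences from the same story land in both train and test.
        train, test = next(iter(KFold(n_splits=4).split(X)))
        assert set(groups[train]) & set(groups[test])    # story overlap across the split

        # Group k-fold: each story is held out as a whole, never split across folds.
        for train, test in GroupKFold(n_splits=4).split(X, groups=groups):
            assert len(set(groups[test])) == 1           # each test fold is a single story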

If #361 looks good to you, please let me know; otherwise, I hope it can still be of benefit to you.
