
Conversation

@timurcarstensen
Collaborator

@timurcarstensen timurcarstensen commented Oct 13, 2025

Adds lighteval support

Stuff addressed in this PR:

  • remove useless tests
  • add task groups that bundle tasks and can be run like `oellm schedule-eval --model "somemodel" --task-groups "oellm-multilingual"`:

belebele-eu:
  tasks:
    - lang a
    - lang b
flores200-eu:
  tasks:
    - lang c
    - lang d

oellm-multilingual:
  tasks:
    - belebele-eu
    - flores200-eu
  • simplify the results-checking workflow: we now just check which results are missing instead of parsing slurm logs, which was error-prone and didn't work properly
  • speed up container image build times: set the compression algorithm to zstd and the compression level to 1 (i.e., trade storage for faster builds)
  • remove the logic for calculating per-task runtime estimates and just go with num_tasks * 10 minutes for now
  • add a per-user cache with a 30-day TTL (arbitrarily chosen) that captures all calls to hf_hub_download or load_dataset and stores the call args in a JSON file. When running oellm schedule-eval again, we first check the cache and use datasets.load_dataset directly to load the dataset / make sure it is on disk, without the overhead of importing lighteval or lm-eval-harness (this implementation lives in oellm/task_cache.py)
  • add a context manager that suppresses 3rd-party print/log calls, so 3rd-party libraries can only write to stdout/stderr if the user passes --verbose=True from the CLI (lives in oellm/utils.py)


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@timurcarstensen
Collaborator Author

@codex fix the linter issues (run pre-commit and then commit the changes)

@chatgpt-codex-connector

Summary

  • Moved the LightEval benchmark copy/replace step ahead of exporting the jobs CSV so SBATCH arrays reference the copied files instead of unavailable originals. oellm/main.pyL574-L600
  • Applied the repository’s pre-commit formatting fixes across documentation, workflow, and Apptainer helper files to clear whitespace/end-of-file lint issues. README.mdL1-L6 apptainer/build_sif_local.shL1-L10

Testing

  • ⚠️ pre-commit run --all-files (fails because the mypy hook’s types-all dependency depends on the unavailable types-pkg-resources package)
  • SKIP=mypy pre-commit run --all-files


@timurcarstensen
Collaborator Author

@codex remove the mypy hook from .pre-commit-config.yaml

@chatgpt-codex-connector

Summary

  • Removed the mypy mirror repository and hook from the pre-commit configuration, leaving only the existing pre-commit-hooks and Ruff checks. .pre-commit-config.yamlL1-L21

Testing

  • ⚠️ pre-commit run --all-files (fails because pre-commit is not installed in the container).


@timurcarstensen
Collaborator Author

@codex do these two things:

  1. pull latest changes from main (which adds task groups) and adjust code here as needed
  2. make sure you install pre-commit in the container so you can actually run it; you may also add it as a dev dependency to pyproject.toml if that is more convenient for the future

@chatgpt-codex-connector

Summary

  • Updated the scheduler to carry evaluation-suite metadata through task-group expansion, pre-download only the applicable harness datasets, and relocate copied LightEval benchmark lists so array jobs read the mounted paths. oellm/main.pyL552-L761
  • Extended the interactive CSV builder to track suites for both grouped and manual tasks, avoid duplicate combinations, and include the suite column in previews and saved configs. oellm/interactive_csv_builder.pyL66-L357
  • Propagated suite information while resolving and flattening task groups and refreshed the bundled group definitions to include both nested groups and LightEval entries. oellm/task_groups.pyL74-L121 oellm/task-groups.yamlL5-L150
  • Removed the mypy pre-commit hook and added pre-commit to the development dependency set so formatting hooks can run locally. .pre-commit-config.yamlL1-L18 pyproject.tomlL74-L78

Testing

  • pre-commit run --all-files


@timurcarstensen timurcarstensen changed the title from "Use lm-eval harness for INCLUDE and global MMLU" to "feat: add lighteval support" Oct 20, 2025
@timurcarstensen timurcarstensen requested review from geoalgo and removed request for geoalgo October 20, 2025 11:23
@timurcarstensen
Collaborator Author

@geoalgo please take a look at the PR description and lmk if you'd like anything else in there :)

The overarching goal of this PR, apart from lighteval support, is a general clean-up of the codebase: I added a bunch of AI slop during the initial release and am trying to remove most of that now. I'm also trying to make things more modular so they're easier to test.

Contributor


We added this test because this logic was broken at some point, so the deployed CLI could not run.
Can we add tests to cover those cases, since we do not have regression tests?

fi
fi

RESULTS_SUBDIR="{evals_dir}/$(openssl rand -hex 5)"
Contributor


Can we share the subdir logic? There is one above, too, which calls openssl.

Collaborator Author


this is to create a random subdir name so that we are very unlikely to get subdir conflicts in the results directory

Collaborator Author


e.g., in the output we'd have

results
    resultdir1
    resultdir2
    ...
slurm_logs
jobs.csv

Since the naming of the result dirs by the different frameworks can be a little messy (I think that was the reason at the time; this is actually from an earlier PR, IIRC), I decided to go with random folders. That shouldn't matter too much, since we use oellm collect-results to aggregate the outputs into a csv anyway, so the user wouldn't have to go hunting for the original json files.
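The shared subdir helper the reviewer asks for could be a small function like this sketch. It mirrors the `openssl rand -hex 5` call in the sbatch template; the name `make_results_subdir` is hypothetical, not actual oellm code.

```python
import secrets
from pathlib import Path


def make_results_subdir(evals_dir: str) -> Path:
    """Create a collision-resistant results subdir (like `openssl rand -hex 5`).

    5 random bytes give 10 hex characters, i.e. 16**10 ≈ 1.1e12 possible
    names, so conflicts between concurrent array jobs are very unlikely.
    """
    subdir = Path(evals_dir) / secrets.token_hex(5)
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir
```

Both call sites in the template could then reference this one helper instead of duplicating the openssl invocation.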

if [[ -f "$LIGHT_TASK" ]]; then
    LIGHT_TASK_ARG="$LIGHT_TASK"
else
    last_segment="${{LIGHT_TASK##*|}}"
Contributor


Can you add a comment to explain which format you expect in english?
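For reference, LightEval task specs are pipe-delimited strings roughly of the form `suite|task|num_fewshot|0` (the exact fields are an assumption here, not confirmed by the PR), while lm-eval-harness tasks are plain names or file paths. The shell expansion `${LIGHT_TASK##*|}` strips everything up to the last `|`; the Python equivalent is a `rsplit`:

```python
def last_pipe_segment(task_spec: str) -> str:
    """Return the text after the last '|' (the whole string if no '|' present)."""
    return task_spec.rsplit("|", 1)[-1]
```

The file-existence check above handles the path case first, and only falls back to parsing the pipe-delimited spec.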

@timurcarstensen timurcarstensen marked this pull request as draft October 21, 2025 09:53
timurcarstensen and others added 3 commits October 21, 2025 22:32
Co-authored-by: David Salinas <geoalgo@users.noreply.github.com>
- name: Build SIF from definition file
run: |
apptainer --verbose build --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
apptainer --verbose build --mksquashfs-args="-comp gzip -Xcompression-level 1" --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
Collaborator Author


explanation: trades off storage space vs. AWS EC2 build time


# Optional: keep tool envs under /opt to avoid $HOME
export UV_TOOL_DIR=/opt/uv-tools
uv tool install --python 3.12 "lighteval[multilingual] @ git+https://github.com/huggingface/lighteval.git@63424f4e795ecc577b90646381b374af3a627978"
Collaborator Author


all this is to make lighteval happy

@@ -0,0 +1,197 @@
task_groups:
Collaborator Author


explanation: split this file into task_groups and task_super_groups. The latter lets you aggregate task_groups into something like oellm-multilingual.

@@ -0,0 +1,548 @@
import builtins
Collaborator Author


TLDR: I moved a bunch of helper funcs into this utils.py

@timurcarstensen timurcarstensen marked this pull request as ready for review November 6, 2025 12:29

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +197 to +206
df["eval_suite"] = df["eval_suite"].str.lower()

# Ensure that all datasets required by the tasks are cached locally to avoid
# network access on compute nodes.
if not skip_checks:
    _pre_download_task_datasets(
        df["task_path"].unique(), trust_remote_code=trust_remote_code
    )
    lm_eval_tasks = df[df["eval_suite"].isin({"lm-eval-harness"})][
        "task_path"
    ].unique()
    if len(lm_eval_tasks) > 0:
        _pre_download_task_datasets(


P1: Re-enable dataset prefetch for lm_eval jobs

The new dataset pre-download block now filters rows with eval_suite == "lm-eval-harness", but all jobs constructed from --tasks default to "lm_eval" (see the comprehension that hard-codes eval_suite="lm_eval"). As a result _pre_download_task_datasets is never invoked for the common case of lm-eval-harness tasks provided via the CLI, so datasets are no longer cached before scheduling and offline nodes will fail once they try to download at runtime. Consider matching both "lm_eval" and "lm-eval-harness" (or normalising earlier) so the cache warm-up still runs.
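The normalisation Codex suggests could be sketched as below. This is a hedged illustration, not the actual oellm code: the alias table and the sample DataFrame are made up, but the idea is to fold both spellings of the lm-eval suite into one canonical value before filtering, so the prefetch also runs for tasks passed via `--tasks`.

```python
import pandas as pd

# Map the CLI default spelling onto the canonical suite name (illustrative).
SUITE_ALIASES = {"lm_eval": "lm-eval-harness"}

# Hypothetical jobs frame mixing CLI tasks, task-group tasks, and lighteval.
df = pd.DataFrame({
    "eval_suite": ["lm_eval", "LM-Eval-Harness", "lighteval"],
    "task_path": ["hellaswag", "arc_easy", "lighteval|belebele|0|0"],
})

# Normalise once, early, so every later filter sees one canonical spelling.
df["eval_suite"] = (
    df["eval_suite"].str.lower().map(lambda s: SUITE_ALIASES.get(s, s))
)
lm_eval_tasks = df[df["eval_suite"] == "lm-eval-harness"]["task_path"].unique()
```

With this in place, the `isin({"lm-eval-harness"})` filter in the snippet above would match the CLI-supplied jobs too, and the cache warm-up would run before nodes go offline.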


@timurcarstensen timurcarstensen merged commit 3610171 into main Nov 10, 2025
2 checks passed
@timurcarstensen timurcarstensen deleted the codex/add-oellm-multilingual-task-group branch November 10, 2025 16:04
