
Conversation

@timurcarstensen
Collaborator

@timurcarstensen timurcarstensen commented Oct 13, 2025

Adds lighteval support

Stuff addressed in this PR:

  • remove useless tests
  • add task groups that bundle tasks and can be run like `oellm schedule-eval --model "somemodel" --task-groups "oellm-multilingual"`:

belebele-eu:
  tasks:
    - lang a
    - lang b
flores200-eu:
  tasks:
    - lang c
    - lang d

oellm-multilingual:
  tasks:
    - belebele-eu
    - flores200-eu
  • simplify the results-checking workflow: we now just check which results are missing instead of parsing slurm logs, which was error-prone and didn't work properly
  • speed up container image build times: set the compression algorithm to zstd and the compression level to 1 (i.e., trade storage for faster builds)
  • remove the logic for calculating per-task runtime estimates and just go with num_tasks * 10 minutes for now
  • add a per-user cache with a 30-day TTL (arbitrarily chosen) that captures all calls to hf_hub_download or load_dataset and stores the call args in a JSON file. When running oellm schedule-eval again, we first check the cache and use datasets.load_dataset directly to load the dataset / make sure it is on disk, without the overhead of importing lighteval or lm-eval-harness (this implementation lives in oellm/task_cache.py)
  • add a context manager that suppresses 3rd-party print/log calls, so 3rd-party libraries can only write to stdout/stderr if the user passes --verbose=True from the CLI (lives in oellm/utils.py)


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@timurcarstensen
Collaborator Author

@codex fix the linter issues (run pre-commit and then commit the changes)

@chatgpt-codex-connector

Summary

  • Moved the LightEval benchmark copy/replace step ahead of exporting the jobs CSV so SBATCH arrays reference the copied files instead of unavailable originals. oellm/main.pyL574-L600
  • Applied the repository’s pre-commit formatting fixes across documentation, workflow, and Apptainer helper files to clear whitespace/end-of-file lint issues. README.mdL1-L6 apptainer/build_sif_local.shL1-L10

Testing

  • ⚠️ pre-commit run --all-files (fails because the mypy hook’s types-all dependency depends on the unavailable types-pkg-resources package)
  • SKIP=mypy pre-commit run --all-files


@timurcarstensen
Collaborator Author

@codex remove the mypy hook from .pre-commit-config.yaml

@chatgpt-codex-connector

Summary

  • Removed the mypy mirror repository and hook from the pre-commit configuration, leaving only the existing pre-commit-hooks and Ruff checks. .pre-commit-config.yamlL1-L21

Testing

  • ⚠️ pre-commit run --all-files (fails because pre-commit is not installed in the container).


@timurcarstensen
Collaborator Author

@codex do these two things:

  1. pull latest changes from main (which adds task groups) and adjust code here as needed
  2. make sure you install pre-commit in the container so you can actually run it; you may also add it as a dev dependency to pyproject.toml if that is more convenient for the future

@chatgpt-codex-connector

Summary

  • Updated the scheduler to carry evaluation-suite metadata through task-group expansion, pre-download only the applicable harness datasets, and relocate copied LightEval benchmark lists so array jobs read the mounted paths. oellm/main.pyL552-L761
  • Extended the interactive CSV builder to track suites for both grouped and manual tasks, avoid duplicate combinations, and include the suite column in previews and saved configs. oellm/interactive_csv_builder.pyL66-L357
  • Propagated suite information while resolving and flattening task groups and refreshed the bundled group definitions to include both nested groups and LightEval entries. oellm/task_groups.pyL74-L121 oellm/task-groups.yamlL5-L150
  • Removed the mypy pre-commit hook and added pre-commit to the development dependency set so formatting hooks can run locally. .pre-commit-config.yamlL1-L18 pyproject.tomlL74-L78

Testing

  • pre-commit run --all-files


@timurcarstensen timurcarstensen changed the title from "Use lm-eval harness for INCLUDE and global MMLU" to "feat: add lighteval support" Oct 20, 2025
@timurcarstensen timurcarstensen requested review from geoalgo and removed request for geoalgo October 20, 2025 11:23
@timurcarstensen
Collaborator Author

@geoalgo please take a look at the PR description and lmk if you'd like anything else in there :)

The overarching goal of this PR, apart from lighteval support, is a general clean-up of the codebase: I added a bunch of AI slop during the initial release and am trying to remove most of that now. I'm also trying to make things more modular so they're easier to test.

Contributor


We added this test because this logic was broken at some point, so the deployed CLI could not run.
Can we add tests to cover those cases, since we do not have regression tests?

fi
fi

RESULTS_SUBDIR="{evals_dir}/$(openssl rand -hex 5)"
Contributor


Can we share the subdir logic? There is one above, too, which calls openssl.

Collaborator Author


this is to create a random subdir name so that we are very unlikely to get subdir conflicts in the results directory

Collaborator Author


e.g., in the output we'd have

results
    resultdir1
    resultdir2
    ...
slurm_logs
jobs.csv

Since the naming of the result dirs by the different frameworks can be a little messy (I think that was the reason at the time; this is actually from an earlier PR, IIRC), I decided to go with random folders. That shouldn't matter too much, since we use oellm collect-results to aggregate the outputs into a csv anyway, so the user wouldn't have to go hunting for the original json files.
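The shared subdir helper the reviewer asks for could be a small function like this sketch. It mirrors the `openssl rand -hex 5` call in the sbatch template; the name `make_results_subdir` is hypothetical, not actual oellm code.

```python
import secrets
from pathlib import Path


def make_results_subdir(evals_dir: str) -> Path:
    """Create a collision-resistant results subdir (like `openssl rand -hex 5`).

    5 random bytes give 10 hex characters, i.e. 16**10 ≈ 1.1e12 possible
    names, so conflicts between concurrent array jobs are very unlikely.
    """
    subdir = Path(evals_dir) / secrets.token_hex(5)
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir
```

Both call sites in the template could then reference this one helper instead of duplicating the openssl invocation.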

if [[ -f "$LIGHT_TASK" ]]; then
    LIGHT_TASK_ARG="$LIGHT_TASK"
else
    last_segment="${{LIGHT_TASK##*|}}"
Contributor


Can you add a comment to explain which format you expect in english?
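For reference, LightEval task specs are pipe-delimited strings roughly of the form `suite|task|num_fewshot|0` (the exact fields are an assumption here, not confirmed by the PR), while lm-eval-harness tasks are plain names or file paths. The shell expansion `${LIGHT_TASK##*|}` strips everything up to the last `|`; the Python equivalent is a `rsplit`:

```python
def last_pipe_segment(task_spec: str) -> str:
    """Return the text after the last '|' (the whole string if no '|' present)."""
    return task_spec.rsplit("|", 1)[-1]
```

The file-existence check above handles the path case first, and only falls back to parsing the pipe-delimited spec.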

@timurcarstensen timurcarstensen marked this pull request as draft October 21, 2025 09:53
timurcarstensen and others added 3 commits October 21, 2025 22:32
Co-authored-by: David Salinas <geoalgo@users.noreply.github.com>
- name: Build SIF from definition file
run: |
apptainer --verbose build --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
apptainer --verbose build --mksquashfs-args="-comp gzip -Xcompression-level 1" --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
Collaborator Author


explanation: trades off storage space vs. AWS EC2 build time


# Optional: keep tool envs under /opt to avoid $HOME
export UV_TOOL_DIR=/opt/uv-tools
uv tool install --python 3.12 "lighteval[multilingual] @ git+https://github.com/huggingface/lighteval.git@63424f4e795ecc577b90646381b374af3a627978"
Collaborator Author


all this is to make lighteval happy

@@ -0,0 +1,197 @@
task_groups:
Collaborator Author


explanation: split this file into task_groups and task_super_groups. The latter lets you aggregate task_groups into something like oellm-multilingual.

@@ -0,0 +1,548 @@
import builtins
Collaborator Author


TLDR: I moved a bunch of helper funcs into this utils.py

@timurcarstensen timurcarstensen marked this pull request as ready for review November 6, 2025 12:29

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +197 to +206
df["eval_suite"] = df["eval_suite"].str.lower()

# Ensure that all datasets required by the tasks are cached locally to avoid
# network access on compute nodes.
if not skip_checks:
    _pre_download_task_datasets(
        df["task_path"].unique(), trust_remote_code=trust_remote_code
    )
    lm_eval_tasks = df[df["eval_suite"].isin({"lm-eval-harness"})][
        "task_path"
    ].unique()
    if len(lm_eval_tasks) > 0:
        _pre_download_task_datasets(


P1: Re-enable dataset prefetch for lm_eval jobs

The new dataset pre-download block now filters rows with eval_suite == "lm-eval-harness", but all jobs constructed from --tasks default to "lm_eval" (see the comprehension that hard-codes eval_suite="lm_eval"). As a result _pre_download_task_datasets is never invoked for the common case of lm-eval-harness tasks provided via the CLI, so datasets are no longer cached before scheduling and offline nodes will fail once they try to download at runtime. Consider matching both "lm_eval" and "lm-eval-harness" (or normalising earlier) so the cache warm-up still runs.
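The normalisation Codex suggests could be sketched as below. This is a hedged illustration, not the actual oellm code: the alias table and the sample DataFrame are made up, but the idea is to fold both spellings of the lm-eval suite into one canonical value before filtering, so the prefetch also runs for tasks passed via `--tasks`.

```python
import pandas as pd

# Map the CLI default spelling onto the canonical suite name (illustrative).
SUITE_ALIASES = {"lm_eval": "lm-eval-harness"}

# Hypothetical jobs frame mixing CLI tasks, task-group tasks, and lighteval.
df = pd.DataFrame({
    "eval_suite": ["lm_eval", "LM-Eval-Harness", "lighteval"],
    "task_path": ["hellaswag", "arc_easy", "lighteval|belebele|0|0"],
})

# Normalise once, early, so every later filter sees one canonical spelling.
df["eval_suite"] = (
    df["eval_suite"].str.lower().map(lambda s: SUITE_ALIASES.get(s, s))
)
lm_eval_tasks = df[df["eval_suite"] == "lm-eval-harness"]["task_path"].unique()
```

With this in place, the `isin({"lm-eval-harness"})` filter in the snippet above would match the CLI-supplied jobs too, and the cache warm-up would run before nodes go offline.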


@timurcarstensen timurcarstensen merged commit 3610171 into main Nov 10, 2025
2 checks passed
@timurcarstensen timurcarstensen deleted the codex/add-oellm-multilingual-task-group branch November 10, 2025 16:04
