feat: add lighteval support #23
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
@codex fix the linter issues (run pre commit and then commit the changes)
@codex remove the mypy hook from .pre-commit-config.yaml

@codex do these two things:
@geoalgo please take a look at the PR description and lmk if you'd like anything else in there :) The overarching goal of this PR, apart from lighteval support, is a general clean-up of the codebase, since I added a bunch of AI slop during the initial release and am trying to remove most of that now. I'm also trying to make things more modular so it's easier to test.
We added this test because this logic was broken at some point, so the deployed CLI could not run.
Can we add tests to cover those cases, since we do not have regression tests?
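A minimal smoke test along these lines might already catch that failure mode (sketch only; it assumes the package can be run as `python -m oellm`, otherwise the installed `oellm` entry point would be invoked instead):

```python
import subprocess
import sys


def test_cli_help_runs():
    # Smoke test: the deployed CLI should at least be able to print its help
    # text without crashing, which is the failure mode described above.
    result = subprocess.run(
        [sys.executable, "-m", "oellm", "--help"],  # assumes module invocation works
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0
```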
```sh
fi

RESULTS_SUBDIR="{evals_dir}/$(openssl rand -hex 5)"
```
Can we share the subdir logic? There is one above too which calls openssl.
This is to create a random subdir name so that we are very unlikely to get subdir conflicts in the results directory.
E.g., in the output we'd have:

```
results/
  resultdir1/
  resultdir2/
  ...
slurm_logs/
jobs.csv
```

Since the naming of the result dirs by the different frameworks can be a little messy (I think that was the reason at the time; this is actually from an earlier PR IIRC), I decided to go with random folders. That shouldn't matter too much, since we use oellm collect-results to aggregate the outputs into a CSV anyway, so the user wouldn't have to go hunting for the original JSON files.
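One way to share that logic would be a small helper used by both call sites when the job scripts are rendered; a sketch (the function name is hypothetical, and `secrets.token_hex(5)` mirrors `openssl rand -hex 5`):

```python
import secrets
from pathlib import Path


def make_results_subdir(evals_dir: str, n_bytes: int = 5) -> str:
    """Return a results subdirectory path with a random hex suffix.

    Collisions are extremely unlikely, which is all we need here since
    oellm collect-results aggregates the outputs afterwards anyway.
    """
    return str(Path(evals_dir) / secrets.token_hex(n_bytes))
```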
```sh
if [[ -f "$LIGHT_TASK" ]]; then
  LIGHT_TASK_ARG="$LIGHT_TASK"
else
  last_segment="${{LIGHT_TASK##*|}}"
```
Can you add a comment to explain, in English, which format you expect?
Co-authored-by: David Salinas <geoalgo@users.noreply.github.com>
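For reference, a sketch of what such a comment could say; it assumes the usual pipe-separated lighteval task string (e.g. something like `leaderboard|arc:challenge|25|0`), so the exact field meanings should be double-checked against the lighteval revision pinned in this PR:

```python
# The shell branch above distinguishes two cases for $LIGHT_TASK:
#   * a path to a custom task file on disk, or
#   * a pipe-separated lighteval task spec such as
#       "<suite>|<task>|<num_fewshot>|<truncate>"   (field meanings assumed)
# "${LIGHT_TASK##*|}" therefore extracts the segment after the last "|",
# which this Python equivalent illustrates.
def last_segment(task_spec: str) -> str:
    return task_spec.rsplit("|", 1)[-1]
```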
```diff
   - name: Build SIF from definition file
     run: |
-      apptainer --verbose build --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
+      apptainer --verbose build --mksquashfs-args="-comp gzip -Xcompression-level 1" --fakeroot eval_env-${{ matrix.image }}.sif apptainer/${{ matrix.image }}.def
```
Explanation: this trades off storage space vs. AWS EC2 build time.
```sh
# Optional: keep tool envs under /opt to avoid $HOME
export UV_TOOL_DIR=/opt/uv-tools
uv tool install --python 3.12 "lighteval[multilingual] @ git+https://github.com/huggingface/lighteval.git@63424f4e795ecc577b90646381b374af3a627978"
```
All of this is to make lighteval happy.
```diff
@@ -0,0 +1,197 @@
+task_groups:
```
Explanation: I split this file into `task_groups` and `task_super_groups`. The latter lets you aggregate task_groups into something like `oellm-multilingual`.
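A rough sketch of the idea, expressed in Python rather than the actual YAML (the group names and schema below are made up for illustration, not the real contents of the file):

```python
# Hypothetical contents, just to show how a super group aggregates task groups.
task_groups = {
    "belebele": ["belebele_eng_Latn", "belebele_deu_Latn"],
    "xnli": ["xnli_en", "xnli_de"],
}
task_super_groups = {
    "oellm-multilingual": ["belebele", "xnli"],
}


def expand_super_group(name: str) -> list[str]:
    """Flatten a super group into the union of its task groups' tasks."""
    return [task for group in task_super_groups[name] for task in task_groups[group]]
```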
```diff
@@ -0,0 +1,548 @@
+import builtins
```
TL;DR: I moved a bunch of helper funcs into this utils.py.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| df["eval_suite"] = df["eval_suite"].str.lower() | ||
|
|
||
| # Ensure that all datasets required by the tasks are cached locally to avoid | ||
| # network access on compute nodes. | ||
| if not skip_checks: | ||
| _pre_download_task_datasets( | ||
| df["task_path"].unique(), trust_remote_code=trust_remote_code | ||
| ) | ||
| lm_eval_tasks = df[df["eval_suite"].isin({"lm-eval-harness"})][ | ||
| "task_path" | ||
| ].unique() | ||
| if len(lm_eval_tasks) > 0: | ||
| _pre_download_task_datasets( |
Re-enable dataset prefetch for lm_eval jobs
The new dataset pre-download block now filters rows with `eval_suite == "lm-eval-harness"`, but all jobs constructed from `--tasks` default to `"lm_eval"` (see the comprehension that hard-codes `eval_suite="lm_eval"`). As a result `_pre_download_task_datasets` is never invoked for the common case of lm-eval-harness tasks provided via the CLI, so datasets are no longer cached before scheduling and offline nodes will fail once they try to download at runtime. Consider matching both `"lm_eval"` and `"lm-eval-harness"` (or normalising earlier) so the cache warm-up still runs.
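A possible way to act on that suggestion, sketched against the snippet above (it reuses `df`, `trust_remote_code`, and `_pre_download_task_datasets` from that context and is not the actual fix):

```python
# Normalise suite aliases to one canonical name before filtering, so jobs
# built from --tasks (which default to eval_suite="lm_eval") still get
# their datasets prefetched.
df["eval_suite"] = (
    df["eval_suite"].str.lower().replace({"lm_eval": "lm-eval-harness"})
)

lm_eval_tasks = df[df["eval_suite"] == "lm-eval-harness"]["task_path"].unique()
if len(lm_eval_tasks) > 0:
    _pre_download_task_datasets(lm_eval_tasks, trust_remote_code=trust_remote_code)
```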
Adds lighteval support
Stuff addressed in this PR:
- Image build compression set to `zstd` and compression level to 1 (i.e., trade-off storage for faster builds)
- Job time limit set to `num_tasks * 10 minutes` for now
- Dataset caching: the first run records each `hf_hub_download` or `load_dataset` call and stores the call args in a JSON. When running `oellm schedule-eval` again, we first check the cache and directly use `datasets.load_dataset` to load/make sure the dataset is on disk, without having to endure the overhead that comes with importing lighteval or lm-eval-harness (this implementation lives in `oellm/task_cache.py`; a rough sketch follows below)
- Output verbosity can be controlled with `--verbose=True` from the CLI (lives in `oellm/utils.py`)
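A very rough sketch of the caching idea described above (the real implementation is in `oellm/task_cache.py`; the cache location, function names, and JSON layout below are made up for illustration):

```python
import json
from pathlib import Path

from datasets import load_dataset

# Illustrative cache location, not the real one.
CACHE_FILE = Path.home() / ".cache" / "oellm" / "dataset_calls.json"


def record_call(path: str, name: str | None = None, **kwargs) -> None:
    """First run: remember the args of a load_dataset call that a task triggered."""
    calls = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else []
    calls.append({"path": path, "name": name, "kwargs": kwargs})
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(calls, indent=2))


def warm_cache() -> None:
    """Later runs: replay the recorded calls so the data is on disk before
    scheduling, without importing lighteval or lm-eval-harness."""
    if not CACHE_FILE.exists():
        return
    for call in json.loads(CACHE_FILE.read_text()):
        load_dataset(call["path"], call["name"], **call["kwargs"])
```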
zstdand compression level to 1 (i.e., trade-off storage for faster builds)num_tasks * 10 minutesfor nowhf_hub_downloadorload_datasetand stores the call args in a json. When runningoellm schedule-evalagain we first check the cache and directly usedatasets.load_datasetto load/make sure the dataset is on disk without having to endure the overhead that comes with importing lighteval or lm-eval-harness (this implementation lives inoellm/task_cache.py--verbose=Truefrom the CLI (lives inoellm/utils.py)