Add deduplication logic #1
Conversation
Thanks! Mind sharing some more stats on the real dataset? The code is quite long to review, but the numbers above should at least give us some more confidence before merging.
    # For leaderboard mode with successful runs, prefer higher scores
    if run_mode == 'leaderboard' and row.get('run_passed') == True:
        if row.get('run_score', 0) > existing_row.get('run_score', 0):
            unique_entries[content_hash] = row
I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks
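If lower scores are indeed better, the comparison in the snippet above would need to flip. A minimal sketch of what that could look like (same variable names as the snippet; the `float('inf')` default is my assumption so that a missing score never wins):

```python
# Sketch only: assumes lower run_score is better, per the comment above.
if run_mode == 'leaderboard' and row.get('run_passed') is True:
    # Default to +inf so entries with a missing score never replace a scored one.
    if row.get('run_score', float('inf')) < existing_row.get('run_score', float('inf')):
        unique_entries[content_hash] = row
```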
Oops, I see. I'll rerun things and re-upload.
@msaroufim for the first two questions we have these:
✓ Loaded submissions.parquet: 40,095 entries
@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right) compared to two other entries that remained in the dataset. We can be a bit less aggressive with deduping, as they look sort of different.
This change modifies the extraction process so that it uses a lot less memory. In particular, the process no longer loads the whole dataset into memory before exporting to parquet files. Instead, it processes the dataset into small, incremental parquet files, and then consolidates these files into a single file as the final step.
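A rough sketch of this incremental pattern with pandas/pyarrow (the directory, file names, and helper names here are illustrative, not the actual export.py implementation):

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

CHUNK_DIR = "data/chunks"                # assumed intermediate output directory
FINAL_PATH = "data/submissions.parquet"  # assumed final output path

def write_chunk(rows, index):
    """Write one processed batch of rows to its own small parquet file."""
    os.makedirs(CHUNK_DIR, exist_ok=True)
    pd.DataFrame(rows).to_parquet(f"{CHUNK_DIR}/part-{index:05d}.parquet", index=False)

def consolidate():
    """Stream the incremental files into a single parquet file without
    materialising the whole dataset in memory at once."""
    dataset = ds.dataset(CHUNK_DIR, format="parquet")
    with pq.ParquetWriter(FINAL_PATH, dataset.schema) as writer:
        for batch in dataset.to_batches():
            writer.write_table(pa.Table.from_batches([batch]))
```

The consolidation step streams record batches rather than concatenating dataframes, which is what keeps peak memory low.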
# Conflicts:
#   export.py
Updated values:

Deduplication results summary:
Original rows: 60357
After hash-based dedup: 22718 rows
Final rows: 9281
Removed 51076 duplicates (84.6%)
Saved to data/successful_submissions_deduplicated.parquet

Submissions: Flattening and saving...
Deduplication results summary:
Original rows: 109709
After hash-based dedup: 47362 rows
Final rows: 19012
Removed 90697 duplicates (82.7%)
Saved to data/submissions_deduplicated.parquet
This pull request introduces deduplication functionality to the export.py script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic.

The deduplication functionality specifically does 3 things, among them restricting processing to the columns ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id'] to make processing manageable; otherwise, I was running into issues with pandas loading the submissions dataframe. There is also a good number of tests added to make sure this stuff actually works.
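For context, an illustrative sketch of the hash-based dedup step over that column subset (function and variable names are mine, not necessarily those in export.py; only the column list comes from the description above):

```python
import hashlib
import pandas as pd

# Columns kept for dedup processing, per the PR description.
DEDUP_COLUMNS = ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id']

def dedup_by_code_hash(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per unique hash of the submission code."""
    unique_entries = {}
    for _, row in df[DEDUP_COLUMNS].iterrows():
        content_hash = hashlib.sha256(str(row['code']).encode('utf-8')).hexdigest()
        # First occurrence wins here; the real logic may prefer a "better" run
        # (e.g. a passing leaderboard entry) when hashes collide.
        unique_entries.setdefault(content_hash, row)
    return pd.DataFrame(unique_entries.values()).reset_index(drop=True)
```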