Easy optimisations #141
... to speed up data collection. AI-generated: GPT-5.3-Codex (via GitHub Copilot) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
4c7a7a5 to 28e347f
samcunliffe
left a comment
I believe the main saving comes from fetching issue data concurrently, because there are many issues across many repos.
    export const mapWithConcurrency = async <T, R>(
      items: T[],
      concurrency: number,
      mapper: (item: T) => Promise<R>,
    ): Promise<R[]> => {
      const results: R[] = new Array(items.length);
      let nextIndex = 0;

      const worker = async () => {
        while (nextIndex < items.length) {
          const currentIndex = nextIndex;
          nextIndex += 1;
          results[currentIndex] = await mapper(items[currentIndex]);
        }
      };

      await Promise.all(
        Array.from({ length: Math.min(concurrency, items.length) }, () => worker()),
      );

      return results;
    };
Human Sam wrote this:
This function is effectively map(f, list), but because of JavaScript conventions and nicer lambda functions, the argument order is: first the items, then the number of mapper calls to run at once, then the function to map.
Later, this is used in the fetchers so we can fetch data for ~4 repos (and their issues) at once.
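To make the behaviour described above concrete, here is a minimal sketch that calls the helper with a concurrency of 4. The helper is copied from the diff so the example is self-contained; `fakeFetch`, `inFlight`, `peakInFlight`, `demo` and the repository names are hypothetical stand-ins, not code from this PR.

```typescript
// Copied from the diff above so this sketch runs on its own.
const mapWithConcurrency = async <T, R>(
  items: T[],
  concurrency: number,
  mapper: (item: T) => Promise<R>,
): Promise<R[]> => {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  const worker = async () => {
    while (nextIndex < items.length) {
      const currentIndex = nextIndex;
      nextIndex += 1;
      results[currentIndex] = await mapper(items[currentIndex]);
    }
  };

  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, () => worker()),
  );

  return results;
};

// Hypothetical stand-in for a real fetcher: it tracks how many calls are
// in flight, waits briefly, then returns a string for the given repo.
let inFlight = 0;
let peakInFlight = 0;
const fakeFetch = async (repoName: string): Promise<string> => {
  inFlight += 1;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 10));
  inFlight -= 1;
  return `${repoName}: done`;
};

// Map six hypothetical repo names with at most four calls running at once.
const demo = async (): Promise<string[]> => {
  const repoNames = ['repo-a', 'repo-b', 'repo-c', 'repo-d', 'repo-e', 'repo-f'];
  return mapWithConcurrency(repoNames, 4, fakeFetch);
};
```

Results come back in input order even though the calls finish at different times, because each worker writes to the index it claimed. No locking is needed around `nextIndex`: the workers are async functions on a single JavaScript thread, and the index is read and incremented with no `await` in between.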
    const {
      issuesAverageAge: openIssuesAverageAge,
      issuesMedianAge: openIssuesMedianAge,
    } = await calculateIssueMetricsPerRepo(repoName, 'open', octokit, config);
Here we're using mapWithConcurrency to map calculateIssueMetricsPerRepo over the list of repository names (for example).
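One thing to watch when mapping a fetcher over many repos: if the mapper throws for a single repo, the whole mapping rejects. A rough sketch of one way to handle that is to wrap the mapper so a failure is logged and turned into `null`. The names `withErrorLogging`, `IssueMetrics`, `fakeIssueMetrics` and `collectMetrics` are hypothetical, and this is not necessarily how `issues.ts` does it; plain `Promise.all` is used here only to keep the sketch short.

```typescript
// Hypothetical shape of the per-repo result, for illustration only.
type IssueMetrics = { issuesAverageAge: number; issuesMedianAge: number };

// Wrap a mapper so that a failure for one item is logged and becomes null
// instead of rejecting the whole batch.
const withErrorLogging =
  <T, R>(mapper: (item: T) => Promise<R>) =>
  async (item: T): Promise<R | null> => {
    try {
      return await mapper(item);
    } catch (error) {
      console.error(`Failed to process ${String(item)}:`, error);
      return null;
    }
  };

// Hypothetical stand-in for calculateIssueMetricsPerRepo: fails for one repo.
const fakeIssueMetrics = async (repoName: string): Promise<IssueMetrics> => {
  if (repoName === 'broken-repo') {
    throw new Error('simulated API failure');
  }
  return { issuesAverageAge: 12, issuesMedianAge: 7 };
};

// One failing repo no longer prevents the others from being collected.
const collectMetrics = (repoNames: string[]): Promise<(IssueMetrics | null)[]> =>
  Promise.all(repoNames.map(withErrorLogging(fakeIssueMetrics)));
```

The wrapped mapper has the same one-argument shape as the original, so it could equally be passed as the third argument to `mapWithConcurrency`; callers then filter out the `null` entries.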
Pull request overview
Speeds up metrics collection by introducing a small concurrency utility and using it to parallelize backend fetchers; also tweaks the CI workflow and updates the npm lockfile.
Changes:
- Add `mapWithConcurrency` worker-pool helper and use it to parallelize repository contributor stats fetching.
- Parallelize per-repository issue metric collection.
- Adjust GitHub Actions workflow to cache npm deps, move token generation later, and run backend via npm workspaces (plus resulting `package-lock.json` updates).
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `backend/src/fetchers/fetcher_utils.ts` | Introduces `mapWithConcurrency` helper for bounded parallel mapping. |
| `backend/src/fetchers/repository.ts` | Uses bounded concurrency + retry loop to fetch contributor stats more quickly. |
| `backend/src/fetchers/issues.ts` | Runs per-repo issue metrics collection with bounded concurrency and per-repo error logging. |
| `.github/workflows/nextjs.yml` | Adds npm caching, moves app-token creation later, and uses workspace-based backend install/run. |
| `package-lock.json` | Updates lockfile (notably includes large dependency bumps). |
    - uses: actions/create-github-app-token@29824e69f54612133e76f7eaac726eef6c875baf # v2.2.1
      id: generate_token
      with:
        app-id: ${{ secrets.APP_ID }}
        private-key: ${{ secrets.APP_PRIVATE_KEY }}
Is the move meant to keep the key alive for as long as possible (even if it's just a few extra seconds)? (Edit: I just saw the human comment in the first post.) Here is a suggestion for automatically refreshing the key for longer-running tasks, but it'd require some substantial changes here. Having an overall quicker workflow would be nicer, though.
Yeah, exactly.
Though if the dependencies change it can actually take ones of minutes to do the npm install.
This saving is essentially negligible compared to the parallelisation and non-blocking concurrency.
Package updates can come in other Dependabot PRs.
    const CONTRIBUTORS_RETRY_DELAY_MS = 15_000;
    // GitHub's stats/contributors endpoint often returns 202 for several minutes
    // while repository statistics are being generated, so allow a longer retry window.
    const CONTRIBUTORS_MAX_RETRIES = 12;
    - const CONTRIBUTORS_RETRY_DELAY_MS = 15_000;
    - // GitHub's stats/contributors endpoint often returns 202 for several minutes
    - // while repository statistics are being generated, so allow a longer retry window.
    - const CONTRIBUTORS_MAX_RETRIES = 12;
    + // GitHub's stats/contributors endpoint often returns 202 for several minutes
    + // while repository statistics are being generated, so allow a longer retry window.
    + const CONTRIBUTORS_RETRY_DELAY_MS = 30_000;
    + const CONTRIBUTORS_MAX_RETRIES = 4;
@UCL/open-source-impact-seed
So, this business will likely need a bit of tweaking.
It might be nice to run with a lot more retries in the nightly cron but only a few when we're testing dependencies. I.e. have these numbers passed in as configuration parameters to the job.
For now though, this works. Some repos still return 202 for the contributor count, but these are not the "stellar" repos that are at the top of the showcase. So my feeling is: leave as is.
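As a rough sketch of the "pass these numbers in as configuration" idea: the retry settings could be read from environment variables set per workflow (more retries for the nightly cron, fewer for dependency checks). Everything here is an assumption for illustration: the environment variable names, the `RetryConfig` type, and the helpers `readRetryConfig` and `fetchWithRetryOn202` are hypothetical, and the defaults are simply the values from the suggested change above.

```typescript
// Hypothetical config shape; not part of the PR.
type RetryConfig = { maxRetries: number; delayMs: number };

// Read retry settings from an environment-like object, falling back to
// defaults when a variable is missing or is not a non-negative integer.
const readRetryConfig = (env: Record<string, string | undefined>): RetryConfig => {
  const parse = (value: string | undefined, fallback: number): number => {
    const parsed = Number.parseInt(value ?? '', 10);
    return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
  };
  return {
    maxRetries: parse(env['CONTRIBUTORS_MAX_RETRIES'], 4),
    delayMs: parse(env['CONTRIBUTORS_RETRY_DELAY_MS'], 30_000),
  };
};

// Call `request` until it returns something other than 202, or until the
// retries are used up. Returns null if the data never became available.
const fetchWithRetryOn202 = async <T>(
  request: () => Promise<{ status: number; data: T }>,
  { maxRetries, delayMs }: RetryConfig,
): Promise<T | null> => {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const response = await request();
    if (response.status !== 202) {
      return response.data;
    }
    // Only wait if another attempt will follow.
    if (attempt < maxRetries) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return null;
};
```

In a Node job this would be called as `readRetryConfig(process.env)`, with each workflow setting its own values under `env:`. Returning `null` after the last 202 matches the "leave as is" position: a repo whose stats never materialise is skipped without failing the run.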
Caution
Before you even bother reviewing this: how do we feel about AI-assisted code in this repo?
The core commit here, 28e347f, was heavily AI-assisted. I've reviewed and added some human comments (you have to believe me when I say the human wrote them).
Solves
What
Instead of running the data collection in series, run it with 4 concurrent workers. This doesn't hit rate limits (or at least not any more than the original) and speeds the data collection up from > 1 hour to ~10 minutes (❗❗❗).
A factor 6 speed increase for ~~600~~ 100 new lines of code.
I (human Sam) also moved the token generation to just before we call the data collection (so the cache and dependencies etc. all happen before). This is a very marginal saving.
Why
Because GitHub app tokens have a hard lifetime limit of 1 hour, and there doesn't appear to be a way to extend their validity. Also, the speedup is probably a bit better for developer quality of life when handling Dependabot updates.