-
Notifications
You must be signed in to change notification settings - Fork 683
Add monthly and weekly downloads to recent downloads calculation #12420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Turbo87
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refreshing this materialized view is one of the most expensive operations that our database has to deal with. could you check what the before/after performance impact of this change is with a production-size database dump?
|
Refresh using the current query was taking 13.3s SELECT crate_id, SUM(version_downloads.downloads) FROM version_downloads
INNER JOIN versions
ON version_downloads.version_id = versions.id
WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crate_id;The query in this PR is taking 21.5s SELECT
crate_id,
SUM(version_downloads.downloads),
SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '30 days')),
SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '7 days'))
FROM version_downloads
INNER JOIN versions
ON version_downloads.version_id = versions.id
WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crate_id;I improved it to take 13s by removing the timestamp conversions. SELECT
crate_id,
SUM(version_downloads.downloads),
SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > CURRENT_DATE - 30),
SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > CURRENT_DATE - 7)
FROM version_downloads
INNER JOIN versions
ON version_downloads.version_id = versions.id
WHERE version_downloads.date > CURRENT_DATE - 90
GROUP BY crate_id;NOTE1: Removing the timestamp conversions in the older query made it take 11.5s NOTE2: We should probably remove the |
We calculate the recent downloads using a materialized view that
we refresh every day. But it's only for last 90 days.
Most of the other ecosystems like NPM, PyPi, Packagist, Hex
provide daily, weekly and monthly downloads. So, I added
a weekly and monthly column in the recent downloads view
and surfaced the data in the `/api/v1/crates/{name}` endpoint.
1c9b449 to
cd32bcf
Compare
|
interesting. on our production database this operation often take multiple minutes 😅 @eth3lbert any thoughts on this? |
I'm also a bit worried about the time this will take in production, so if we're sure it won't take longer than before, I'm not opposed to it! :D" Referring to the downloads table, I should probably find time to migrate our downloads table to |
That would mean that we get more savings on actual numbers :) Do you need to confirm the benchmark yourself before merging this? Or is there anything else I can do? |
|
I thought about this some more. The 90-day download data for the crates is already available on the API, just not in summed up form, but anyone who wants to calculate these values for a certain crate can already do so. If shields.io wants to display such values, can't they subset the per-day download numbers themselves and sum them up? As I mentioned above, the materialized view is one of the more expensive pieces of our database and I'm slightly hesitant to extend it since we can't go back once we offer these values in our API. I'm adding this issue to the agenda for our crates.io team meeting today. |
How would they get it though? I saw some endpoints but don't think any of them solved this. |
|
|
👍. If deciding to use it, it would need a filter parameter for dates. |
why? you could filter it on the client side, no? |
|
It would be better to pass the filter to the db to reduce the load. In fact, I am wondering how much load constantly requesting this endpoint would add to the db versus the current PR. But, I can't prove anything without having metrics. Since the current PR keeps the time taken for refresh at least the same, without knowing metrics for that endpoint, I would prefer the current PR from a pure performance viewpoint. |
We calculate the recent downloads using a materialized view that we refresh every day. But it's only for last 90 days.
Most of the other ecosystems like NPM, PyPi, Packagist, Hex provide daily, weekly and monthly downloads. So, I added a weekly and monthly column in the recent downloads view and surfaced the data in the
/api/v1/crates/{name}endpoint.I have attached links for shields.io badges that show weekly and monthly downloads for each ecosystem. Once this is merged, I will add the same for Crates.io badges.