Skip to content

Conversation

@pksunkara
Copy link

We calculate the recent downloads using a materialized view that we refresh every day. But it's only for last 90 days.

Most of the other ecosystems like NPM, PyPi, Packagist, Hex provide daily, weekly and monthly downloads. So, I added a weekly and monthly column in the recent downloads view and surfaced the data in the /api/v1/crates/{name} endpoint.

I have attached links for shields.io badges that show weekly and monthly downloads for each ecosystem. Once this is merged, I will add the same for Crates.io badges.

@pksunkara

This comment was marked as outdated.

Copy link
Member

@Turbo87 Turbo87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refreshing this materialized view is one of the most expensive operations that our database has to deal with. could you check what the before/after performance impact of this change is with a production-size database dump?

@Turbo87 Turbo87 added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Dec 1, 2025
@pksunkara
Copy link
Author

pksunkara commented Dec 1, 2025

Refresh using the current query was taking 13.3s

SELECT crate_id, SUM(version_downloads.downloads) FROM version_downloads
INNER JOIN versions
  ON version_downloads.version_id = versions.id
WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crate_id;

The query in this PR is taking 21.5s

SELECT
  crate_id,
  SUM(version_downloads.downloads),
  SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '30 days')),
  SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '7 days'))
FROM version_downloads
  INNER JOIN versions
    ON version_downloads.version_id = versions.id
  WHERE version_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
  GROUP BY crate_id;

I improved it to take 13s by removing the timestamp conversions.

SELECT
  crate_id,
  SUM(version_downloads.downloads),
  SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > CURRENT_DATE - 30),
  SUM(version_downloads.downloads) FILTER (WHERE version_downloads.date > CURRENT_DATE - 7)
FROM version_downloads
  INNER JOIN versions
    ON version_downloads.version_id = versions.id
  WHERE version_downloads.date > CURRENT_DATE - 90
  GROUP BY crate_id;

NOTE1: Removing the timestamp conversions in the older query made it take 11.5s

NOTE2: We should probably remove the (version_id, date) index in version_downloads table because it's not actually being used since older data is archived and almost all the rows are used for calculation, thus postgres uses seq scan. That way, you can save some space. But, I will leave it upto you guys.

We calculate the recent downloads using a materialized view that
we refresh every day. But it's only for last 90 days.

Most of the other ecosystems like NPM, PyPi, Packagist, Hex
provide daily, weekly and monthly downloads. So, I added
a weekly and monthly column in the recent downloads view
and surfaced the data in the `/api/v1/crates/{name}` endpoint.
@Turbo87
Copy link
Member

Turbo87 commented Dec 1, 2025

interesting. on our production database this operation often take multiple minutes 😅

@eth3lbert any thoughts on this?

@eth3lbert
Copy link
Contributor

interesting. on our production database this operation often take multiple minutes 😅

I'm also a bit worried about the time this will take in production, so if we're sure it won't take longer than before, I'm not opposed to it! :D"


Referring to the downloads table, I should probably find time to migrate our downloads table to Declarative Partitioning. However, before actually doing this, it would be better to test whether a new database with the same scale, specifications, and data runs much faster than the current one. This would suggest that our performance is likely being affected by the (BRIN) index.

@pksunkara
Copy link
Author

pksunkara commented Dec 3, 2025

interesting. on our production database this operation often take multiple minutes 😅

That would mean that we get more savings on actual numbers :)

Do you need to confirm the benchmark yourself before merging this? Or is there anything else I can do?

@Turbo87
Copy link
Member

Turbo87 commented Dec 5, 2025

I thought about this some more. The 90-day download data for the crates is already available on the API, just not in summed up form, but anyone who wants to calculate these values for a certain crate can already do so. If shields.io wants to display such values, can't they subset the per-day download numbers themselves and sum them up?

As I mentioned above, the materialized view is one of the more expensive pieces of our database and I'm slightly hesitant to extend it since we can't go back once we offer these values in our API.

I'm adding this issue to the agenda for our crates.io team meeting today.

@Turbo87 Turbo87 moved this to For next meeting in crates.io team meetings Dec 5, 2025
@pksunkara
Copy link
Author

If shields.io wants to display such values, can't they subset the per-day download numbers themselves and sum them up?

How would they get it though? I saw some endpoints but don't think any of them solved this.

@Turbo87
Copy link
Member

Turbo87 commented Dec 5, 2025

I saw some endpoints but don't think any of them solved this.

https://crates.io/api/v1/crates/syn/downloads

@pksunkara
Copy link
Author

👍. If deciding to use it, it would need a filter parameter for dates.

@Turbo87
Copy link
Member

Turbo87 commented Dec 5, 2025

If deciding to use it, it would need a filter parameter for dates.

why? you could filter it on the client side, no?

@pksunkara
Copy link
Author

It would be better to pass the filter to the db to reduce the load. In fact, I am wondering how much load constantly requesting this endpoint would add to the db versus the current PR. But, I can't prove anything without having metrics.

Since the current PR keeps the time taken for refresh at least the same, without knowing metrics for that endpoint, I would prefer the current PR from a pure performance viewpoint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Projects

Status: For next meeting

Development

Successfully merging this pull request may close these issues.

4 participants