
Implement pagination for tsv export #14

Merged

ainefairbrother merged 2 commits into igsr:main from likhitha-surapaneni:feature/pagination-export-tsv on Mar 16, 2026

Conversation

@likhitha-surapaneni
Collaborator

IGSR-618
This implements the pagination needed for the TSV download requested in igsr/gca_1000genomes_website#72.

@ainefairbrother ainefairbrother self-requested a review March 6, 2026 10:32
Collaborator

@ainefairbrother left a comment

Thanks for this @likhitha-surapaneni.

Thinking about this a bit more, and testing the feature, this is perhaps a heavier approach than we need.

Previously, this file download endpoint made one regular ES search request. Because ES caps a single search request at 10k results, downloads stopped at 10k rows.

In this PR, the backend pages through the full result set in 10k chunks using scroll, which does solve the original 10k limit and allows much larger downloads. The problem is that it still builds the entire TSV in memory and only then returns it. So if the download is very large, the server has to hold the whole file in memory at once, which increases memory use and makes the download slower; this will become more problematic as the number of files in the db grows.
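The shape of the approach being described, as a minimal sketch: a plain list of pages stands in for the real ES scroll calls, and the column names (`url`, `md5`, `size`) are hypothetical, not the PR's actual schema.

```python
def export_tsv_in_memory(pages):
    """Page through the full result set, but buffer every row before
    returning anything - memory use grows with the total number of hits."""
    lines = ["url\tmd5\tsize"]  # hypothetical TSV header
    for page in pages:  # each page represents one 10k scroll batch
        for hit in page:
            lines.append("\t".join(str(hit[k]) for k in ("url", "md5", "size")))
    # The entire export is held in memory here before a single byte is sent.
    return "\n".join(lines) + "\n"
```

Nothing reaches the client until the final `join`, which is why large exports are both slow to start and memory-hungry.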

Also, the ES_EXPORT_SIZE_CAP change from 10k to 1m does not change behaviour. Each ES request is still capped at 10k, but the code fetches 10k pages for as long as it needs to, so the export is effectively unlimited. I do think it's correct for us to permit an unlimited export, as we would like users to be able to download a complete list of our files if they need to.

I think streaming the TSV response would be a better approach: it would let us send each batch as it is fetched instead of holding the entire export in memory. Let me know what you think.
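A minimal sketch of what the streaming suggestion could look like, again with a placeholder page source and hypothetical column names. The generator would then be handed to whatever streaming response type the web framework provides (e.g. Django's `StreamingHttpResponse` or a Flask `Response`):

```python
def stream_tsv(pages):
    """Yield the TSV incrementally: each scroll batch is formatted and sent
    as soon as it arrives, so at most one batch is held in memory."""
    yield "url\tmd5\tsize\n"  # hypothetical header row
    for page in pages:  # each page represents one 10k scroll batch
        yield "".join(
            "\t".join(str(hit[k]) for k in ("url", "md5", "size")) + "\n"
            for hit in page
        )
```

Because each chunk is released as soon as it is built, peak memory is bounded by one batch rather than the whole export, and the client starts receiving data immediately.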

@likhitha-surapaneni
Collaborator Author


This approach makes sense to me Aine, thank you. I've implemented streaming for the TSV response, which avoids holding the whole export in memory. Let me know what you think.

Collaborator

@ainefairbrother left a comment

Hi @likhitha-surapaneni, thank you for these changes. The export is now very quick and works well for me. It also means that ES_EXPORT_SIZE_CAP is now the total export size (total number of rows). We should keep an eye on this number as our data grows, but it works well as a guardrail for now.
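For illustration, a cap on total rows applied to the streaming loop might look like this; the function and column names are hypothetical, not the PR's actual code:

```python
def stream_tsv_capped(pages, cap):
    """Stream the TSV but stop once `cap` rows have been written, so the
    cap bounds the total export size rather than a single ES request."""
    yield "url\tmd5\tsize\n"  # hypothetical header row
    written = 0
    for page in pages:  # each page represents one 10k scroll batch
        for hit in page:
            if written >= cap:
                return  # guardrail reached: end the export cleanly
            yield "\t".join(str(hit[k]) for k in ("url", "md5", "size")) + "\n"
            written += 1
```

The cap is checked per row here for clarity; checking per batch would also work and touches the counter less often.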

@ainefairbrother ainefairbrother merged commit e71c741 into igsr:main Mar 16, 2026