@natmaka
@natmaka natmaka commented Aug 28, 2025

My goal is to speed things up when the database is "huge" (total size way above the host's RAM size).

Approach number 1: give the user a way to make pg_sample use the "SYSTEM" sampling method instead of "BERNOULLI". This patch does that. It seems OK, but I have not tested it thoroughly.
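For readers unfamiliar with the two methods: in PostgreSQL, `TABLESAMPLE BERNOULLI` scans the whole table and keeps each row with the given probability, while `TABLESAMPLE SYSTEM` picks whole blocks, which reads far less data on a large table but yields a more clustered sample. A minimal sketch of how the choice could be exposed is below. It is written in Python for illustration only and is not the code in this patch; the function name `sample_query` and its parameters are hypothetical, and the table name is assumed to be already quoted.

```python
# Illustrative sketch only: builds the SQL text for a sampled read.
# `sample_query` is a hypothetical helper, not part of pg_sample.

VALID_METHODS = ("BERNOULLI", "SYSTEM")


def sample_query(quoted_table: str, percent: float, method: str = "BERNOULLI") -> str:
    """Return a SELECT that samples roughly `percent` percent of the table.

    `quoted_table` is assumed to be a safely quoted identifier already.
    """
    method = method.upper()
    if method not in VALID_METHODS:
        raise ValueError(f"unsupported sampling method: {method}")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # BERNOULLI: row-level sampling, full table scan.
    # SYSTEM: block-level sampling, much cheaper on huge tables.
    return f"SELECT * FROM {quoted_table} TABLESAMPLE {method} ({percent})"


if __name__ == "__main__":
    print(sample_query('"public"."orders"', 5, "system"))
```

The trade-off is the usual one: SYSTEM is faster when the table does not fit in RAM, at the cost of a less uniform sample.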

Approach number 2: obtain the number of tuples in a table from the metadata collected and stored during an ANALYZE pass, instead of the usual SELECT COUNT(*), which often means reading the whole table. The patch lays the groundwork for this, but most of the work remains to be done. Let me know if it seems interesting to you.
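As a rough illustration of the idea, the planner's row estimate lives in `pg_class.reltuples` and can be read without touching the table itself. The sketch below is in Python for illustration only and is not the code in this patch; the helper names are hypothetical, the `%s` placeholder assumes a psycopg-style driver, and the table name is assumed to be already quoted. It treats a negative estimate as "never analyzed" (recent PostgreSQL versions report -1 in that case), so the caller can fall back to an exact count.

```python
# Illustrative sketch only: choosing between the stored estimate and an exact count.
# All names here are hypothetical helpers, not part of pg_sample.
from typing import Optional

# Reads the estimate maintained by VACUUM / ANALYZE; no table scan needed.
# The %s placeholder assumes a psycopg-style driver.
APPROX_COUNT_SQL = "SELECT reltuples::bigint FROM pg_class WHERE oid = %s::regclass"


def exact_count_sql(quoted_table: str) -> str:
    """Exact count; may read the whole table."""
    return f"SELECT count(*) FROM {quoted_table}"


def usable_estimate(reltuples: Optional[float]) -> Optional[int]:
    """Return the estimate as an int, or None if it should not be trusted.

    A missing or negative value is taken to mean the table has no statistics yet.
    """
    if reltuples is None or reltuples < 0:
        return None
    return int(reltuples)


if __name__ == "__main__":
    for value in (-1, 0, 1234567.0):
        est = usable_estimate(value)
        print(value, "->", est if est is not None else "fall back to exact count")
```

The estimate is only as fresh as the last ANALYZE or VACUUM, so it suits sizing decisions rather than anything that needs an exact figure.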

There are also various modifications made in "janitor" mode.

@mla
Owner

mla commented Aug 28, 2025

Awesome, thank you @natmaka! Will review this weekend.

@mla
Owner

mla commented Nov 23, 2025

Hey @natmaka, check out the dev branch if you would and see if that's what you're going for.

@natmaka
Author

natmaka commented Nov 24, 2025

Hi @mla, I could not run it on a huge DB, but it worked on a small DB.

I hacked a way to:

  • ignore any table that vanishes during a run (on the fly)
  • avoid repeating a notice about the same table if all its tuples are already imported
  • complete the code enabling an '--approxcount' option (now probably useless), which lets the code use pg_class instead of 'SELECT COUNT(*)...'

natmaka@df2d882

If it seems useful to you, I can submit the worthwhile changes as a pull request against the dev branch.

Thank you!
