Service: Use iterator to avoid high space complexity #3415
flyrain wants to merge 3 commits into apache:main
Conversation
```java
.filter(mf -> seenPaths.add(mf.path()))
.filter(mf -> TaskUtils.exists(mf.path(), fileIO))
```
```java
Set<String> uniquePaths = tableMetadata.snapshots().stream()
    .flatMap(sn -> sn.allManifests(fileIO).stream())
    .map(ManifestFile::path)
    .collect(Collectors.toSet());
return uniquePaths.parallelStream() // Parallel here!
    .filter(path -> TaskUtils.exists(path, fileIO))
    .map(path -> createManifestTask(...));
```
Once we call .collect(Collectors.toSet()), the stream is fully materialized, which loses the benefit of lazy execution. Here we are trying to lower the memory footprint through lazy execution.
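To make the difference concrete, here is a standalone sketch (illustrative only, not the PR's code): a stateful `filter(set::add)` dedupes element by element and respects downstream short-circuiting, while `collect(Collectors.toSet())` forces the entire source to be traversed before anything downstream runs.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDedupeDemo {
    // How many source elements are pulled when deduping lazily and
    // short-circuiting after the first 2 distinct elements.
    static int visitedWhenLazy() {
        AtomicInteger visited = new AtomicInteger();
        Set<String> seen = new HashSet<>();
        Stream.of("a", "b", "a", "c", "d")
            .peek(p -> visited.incrementAndGet())
            .filter(seen::add)   // stateful dedupe, one element at a time
            .limit(2)            // downstream short-circuit
            .count();
        return visited.get();    // only 2 elements were ever visited
    }

    // Same pipeline, but materialized into a Set first: every source
    // element is visited before any downstream operation runs.
    static int visitedWhenEager() {
        AtomicInteger visited = new AtomicInteger();
        Stream.of("a", "b", "a", "c", "d")
            .peek(p -> visited.incrementAndGet())
            .collect(Collectors.toSet())
            .stream()
            .limit(2)
            .count();
        return visited.get();    // all 5 elements were visited
    }

    public static void main(String[] args) {
        System.out.println(visitedWhenLazy() + " vs " + visitedWhenEager()); // 2 vs 5
    }
}
```

The PR's pipeline is not short-circuited by a `limit`, but the same property is what lets a batch iterator consume the stream incrementally instead of all at once.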
```java
  createAndRegisterTasks(batch, metaStoreManager, polarisCallContext, tableEntity);
  totalCount += batch.size();
}
```
Can we explicitly call batch.clear()?
We could, but we don't have to, as this is the last batch.
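For reference, the batching loop under discussion can be sketched like this (a hypothetical helper, names are illustrative and not the PR's actual code): the last partial batch is flushed after the loop, and the list simply becomes unreachable afterwards, so an explicit `clear()` would add nothing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchDrain {
    // Drains an iterator into fixed-size batches and hands each one to a sink.
    static int drain(Iterator<String> items, int batchSize, Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>(batchSize);
        int totalCount = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                totalCount += batch.size();
                batch = new ArrayList<>(batchSize); // fresh list; the old one is collectible
            }
        }
        if (!batch.isEmpty()) {      // final (possibly partial) batch
            sink.accept(batch);      // no clear() needed: batch goes out of scope here
            totalCount += batch.size();
        }
        return totalCount;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        int total = drain(List.of("a", "b", "c", "d", "e").iterator(), 2,
                          b -> sizes.add(b.size()));
        System.out.println(total + " " + sizes); // 5 [2, 2, 1]
    }
}
```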
singhpk234 left a comment
LGTM, this seems like a nice improvement thanks @flyrain !
@pingtimeout: what is your take on this PR?
@dimas-b This PR is very confusing to me; after review, I do not think it fixes anything at all...
pingtimeout left a comment
Thanks @flyrain for the attempt at fixing the high space complexity issue. This is a good start, but I don't think we are quite there yet.
As far as I can tell, the space complexity of table cleanup was O(UM + PM + S + ST + PST + T). And with this change, it is O(UM + PM + S + ST + PST + batchSize) where:
- `PM` = number of previous metadata files
- `S` = number of snapshots
- `ST` = number of statistics files
- `PST` = number of partition statistics files
- `UM` = number of unique manifest files across all snapshots
- `T` = total number of created TaskEntities
You can see that by running the code with a large number of files under constrained memory. With the current code, there is always a number of files that results in an OOME, proving that the space complexity issue has not been solved by this change. You may want to use realistic (longer) paths to surface the issue faster.
I want to emphasize one critical point that must be addressed before this PR is merged. In #3256, you said the following:
> please take a look to see if that solves the problem. It'd be really nice to run this with the same setup we used to validate the current PR which is this PR fixed the issue
Which contradicts the box that you checked in the description of this PR: Added/updated tests with good coverage, or manually tested (and explained how). Were you able to reproduce the issue before attempting to write a fix?
To summarize: based on my review of the code, I am convinced that this does not solve the underlying issue. And based on the lack of testing, I do not think this PR is ready. I appreciate the desire to provide an alternative to #3256. But I think #3256 is the best option we have, all things considered.
```java
}

@Test
public void testMetadataFileBatchingWithManyFiles() throws IOException {
```
This test is named testMetadataFileBatchingWithManyFiles but only creates 24 files in total. Unfortunately that does not prove that the code is better at handling large tables.
The intent of this unit test is not to simulate a truly large table, but to validate the batching behavior and correctness when metadata files are processed incrementally. As is common practice, we avoid stress or scale tests in unit tests, since they would significantly slow down CI execution and are better suited for a dedicated benchmark.
I understand that unit tests should be quick to avoid slowing down CI. My main concern here is whether this code change has been tested at scale. And if so, how?
Thanks for bringing it up. I think it's a good idea to have a benchmark; more details are in #3256 (comment).
```java
.stream()
// distinct by manifest path, since multiple snapshots will contain the same manifest
// Use stateful filter to dedupe while streaming
.filter(mf -> seenPaths.add(mf.path()))
```
This line adds all unique manifest files across all snapshots to a set that is maintained in memory. Even though the stream is lazy, all unique manifest paths are materialized on the heap. This means that the space complexity does not change.
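A standalone sketch of that point (illustrative only): even though the pipeline is lazy, the stateful dedupe set ends up retaining every unique path on the heap once the stream has been consumed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class SeenPathsGrowth {
    // Lazy evaluation controls WHEN elements flow through the pipeline,
    // not how much state the stateful filter accumulates along the way.
    static int seenSizeAfterStreaming() {
        Set<String> seenPaths = new HashSet<>();
        Stream.of("m1.avro", "m2.avro", "m1.avro", "m3.avro", "m2.avro")
            .filter(seenPaths::add)  // lazy, but state accumulates here
            .forEach(p -> { });      // consume the stream
        return seenPaths.size();     // every unique path is retained
    }

    public static void main(String[] args) {
        System.out.println(seenSizeAfterStreaming()); // 3
    }
}
```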
Thanks for the detailed analysis. I agree that the only remaining unbounded structure here is the in-memory set used to dedupe manifest paths. I do not think this is a practical concern.
To put concrete numbers on it: in an extreme case of 1 million file paths, at an estimated 50 to 100 bytes per path including object and set overhead, the memory footprint would be roughly 40 MB to 95 MB, which is acceptable. That is already a very large table cleanup scenario. At that scale, the question becomes whether we even want the Polaris server itself to handle such a task synchronously in memory; a delegation service would be a better fit in that case.
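For reference, the back-of-envelope arithmetic behind that estimate (a sketch only; the real per-entry cost varies with JVM settings, path length, String encoding, and HashSet node overhead):

```java
public class PathSetFootprint {
    // Rough estimate: entries * bytes-per-entry, reported in MiB.
    static long estimateMiB(long paths, long bytesPerEntry) {
        return paths * bytesPerEntry / (1024 * 1024);
    }

    public static void main(String[] args) {
        // 1 million paths at an assumed 50-100 bytes each
        System.out.println(estimateMiB(1_000_000, 50) + " MiB to "
            + estimateMiB(1_000_000, 100) + " MiB"); // 47 MiB to 95 MiB
    }
}
```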
```java
int batchSize = callContext.getRealmConfig().getConfig(BATCH_SIZE_CONFIG_KEY, 10);
return getMetadataFileBatches(tableMetadata, batchSize).stream()

// Stream all metadata files without materializing them all at once
```
The only thing this change does is postpone the call to the .map(...) methods; afaict the memory consumption stays identical.
The main change is that stream().toList() has been removed to avoid fully materializing the results in memory. Instead, an iterator is used together with a configurable batch size (taskPersistenceBatchSize) to read and process items incrementally. This bounds memory usage, as shown in lines 169 to 175.
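A minimal sketch of the iterator-based consumption (illustrative, not the PR's code): pulling one batch from a stream's iterator evaluates only as many elements as that batch needs, even when the source is unbounded, which is what keeps the per-batch memory bounded.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class IteratorBatching {
    // Counts how many source elements were produced while filling one batch.
    static int producedForOneBatch(int batchSize) {
        AtomicInteger produced = new AtomicInteger();
        Iterator<Integer> it = Stream.iterate(0, i -> i + 1) // unbounded source
            .peek(i -> produced.incrementAndGet())
            .iterator();
        List<Integer> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && it.hasNext()) {
            batch.add(it.next());
        }
        return produced.get(); // only batchSize elements were ever evaluated
    }

    public static void main(String[] args) {
        System.out.println(producedForOneBatch(10));
    }
}
```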
The parameters of the stream construction are eager, so I am afraid the only thing lazily evaluated here is the call to .flatMap(Function.identity())
The comment is misleading; removed. However, all file paths here are part of the metadata.json file, and we've already loaded metadata.json into memory as a table metadata object. Applying lazy evaluation doesn't make sense here.
6ac5e4e to 789a4b8
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Fix #2365 (comment)
Checklist
- `CHANGELOG.md` (if needed)
- `site/content/in-dev/unreleased` (if needed)