Introduce CachingCollectorManager to parallelize search when using CachingCollector #16247

Open
gaobinlong wants to merge 10 commits into apache:main from gaobinlong:cachingCollectorManager

Conversation

@gaobinlong

@gaobinlong gaobinlong commented Jun 11, 2026

Contributor

Description

This PR introduces CachingCollectorManager and switches GroupingSearch to concurrent search, moving it away from the deprecated search(Query, Collector) method.

In addition, the useless constructor GroupingSearch(GroupSelector<?> groupSelector) is removed.

Relates to #12892.

…chingCollector

Signed-off-by: Binlong Gao <gbinlong@amazon.com>
Signed-off-by: Binlong Gao <gbinlong@amazon.com>

@javanna javanna left a comment

Contributor

Great to see movement in this area, thanks for working on this @gaobinlong! I left some comments.

Comment thread lucene/core/src/java/org/apache/lucene/search/CachingCollectorManager.java Outdated
Comment thread lucene/core/src/java/org/apache/lucene/search/CachingCollectorManager.java Outdated
private final Integer maxDocsToCache;

// One CachingCollector per slice, thread-safe for concurrent newCollector() calls.
private final List<CachingCollector> cachingCollectors = new CopyOnWriteArrayList<>();

Contributor

Looks like this list is needed for the replay functionality. Note that, as we discussed in other PRs, newCollector is never called concurrently. It is called sequentially by the coordinating thread, which also calls reduce at the end.

What I do worry about when it comes to concurrency, though, is the mutable isCached flag, which is modified by the worker threads and read by the main thread at the end. There isn't a concurrent access problem with it, but there may be visibility issues. The list does not need to handle concurrency though.
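
To illustrate the two points above, here is a minimal, self-contained sketch. It does not use Lucene: `SliceCollector`, `SliceCollectorManager` and `VisibilitySketch` are hypothetical stand-ins, and the PR's actual fix is not shown in this thread. The sketch assumes collectors are created sequentially by one coordinating thread (so a plain `ArrayList` is enough) and declares the flag `volatile`, which is one common way to rule out visibility problems.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical driver; stands in for the coordinating (main) search thread. */
class VisibilitySketch {
  /** Runs one worker per slice and reports whether every slice stayed within the cache limit. */
  static boolean run(int[] docsPerSlice, int maxDocsToCache) {
    SliceCollectorManager manager = new SliceCollectorManager();
    List<Thread> workers = new ArrayList<>();
    for (int docs : docsPerSlice) {
      // Collectors are created one after another on the coordinating thread.
      SliceCollector collector = manager.newCollector();
      Thread worker = new Thread(() -> collector.collect(docs, maxDocsToCache));
      workers.add(worker);
      worker.start();
    }
    try {
      for (Thread worker : workers) {
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return manager.allCached();
  }

  public static void main(String[] args) {
    System.out.println(run(new int[] {3, 5}, 10));
    System.out.println(run(new int[] {3, 50}, 10));
  }
}

/** Hypothetical per-slice collector; not Lucene's CachingCollector. */
class SliceCollector {
  // Written by a worker thread, read by the coordinating thread.
  // join() above already establishes happens-before; volatile makes the
  // guarantee independent of how worker completion is signalled.
  private volatile boolean cached = true;

  void collect(int docCount, int maxDocsToCache) {
    if (docCount > maxDocsToCache) {
      cached = false; // this slice exceeded the cache limit
    }
  }

  boolean isCached() {
    return cached;
  }
}

/** Hypothetical manager; mirrors only the "one collector per slice" idea. */
class SliceCollectorManager {
  // Plain ArrayList: only the coordinating thread ever touches this list.
  private final List<SliceCollector> collectors = new ArrayList<>();

  SliceCollector newCollector() {
    SliceCollector collector = new SliceCollector();
    collectors.add(collector);
    return collector;
  }

  boolean allCached() {
    for (SliceCollector collector : collectors) {
      if (!collector.isCached()) {
        return false;
      }
    }
    return true;
  }
}
```

With the inputs in `main`, the first call stays under the limit on every slice and the second exceeds it on one slice, so the program prints `true` and then `false`.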

Comment thread lucene/grouping/src/java/org/apache/lucene/search/grouping/GroupingSearch.java Outdated
Comment thread lucene/core/src/java/org/apache/lucene/search/CachingCollectorManager.java Outdated
Signed-off-by: Binlong Gao <gbinlong@amazon.com>
@github-actions github-actions Bot modified the milestones: 10.5.0, 11.0.0 Jun 17, 2026
@gaobinlong

Contributor Author

@javanna all comments have been addressed, please help to review again, thanks!

@javanna javanna left a comment

Contributor

A couple more comments, this is close! Thanks again!

() ->
new CachingCollectorManager<>(
new TopScoreDocCollectorManager(10, Integer.MAX_VALUE), false, null, null));
}

Contributor

Could we also add a test for the happy path?

Contributor Author

Added.

Contributor

Sorry, where? I mean, these tests never verify the normal functioning of the collector manager, i.e. calling search against it without exceptions. Or am I not looking in the right place?

Contributor Author

Oh, sorry for the misunderstanding, added a new test method testBasic() to test the happy path, thanks!

@javanna javanna modified the milestones: 10.5.0, 10.6.0 Jun 25, 2026

@javanna javanna left a comment

Contributor

Thanks for all the work. I left a new batch of comments; I'd expect these to be the last ones.

addGroupField(doc, groupField, "author3", canUseIDV);
doc.add(new TextField("content", "random", Field.Store.YES));
doc.add(new Field("id", "6", customType));
doc.add(new Field("id", "5", customType));

Contributor

Sorry for being pedantic, but can we revert these changes and leave things as they are? Or are these changes necessary? Otherwise, could they be made in a follow-up?

Contributor Author

Reverted.

Comment thread lucene/CHANGES.txt Outdated

* GITHUB#15660: Introduce LargeNumHitsTopDocsCollectorManager to parallelize search when using LargeNumHitsTopDocsCollector. (Binlong Gao)

* GITHUB#16247: Introduce CachingCollectorManager to parallelize search when using CachingCollector. (Binlong Gao)

Contributor

Can you move this to the 10.6 section please, given 10.5 shipped?

Contributor Author

Moved.

Integer maxDocsToCache) {
if (maxRAMMB == null && maxDocsToCache == null) {
throw new IllegalArgumentException("Either maxRAMMB or maxDocsToCache must be set");
}

Contributor

Should we also throw if both are non-null, given that only one will be used and the other silently ignored?

Contributor Author

Changed; it now throws an exception if both are non-null.
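
The validation agreed on in this thread can be sketched as follows. This is only an illustration, not the PR's code: `CacheLimits` and `isRejected` are hypothetical names, the type of `maxRAMMB` is assumed to be a nullable `Double` (only `Integer maxDocsToCache` is visible in the excerpt), and the second error message is made up.

```java
/** Hypothetical holder for the two cache limits; not the PR's actual class. */
class CacheLimits {
  final Double maxRAMMB; // assumption: nullable Double, the real type is not shown in the excerpt
  final Integer maxDocsToCache;

  CacheLimits(Double maxRAMMB, Integer maxDocsToCache) {
    // Neither limit set: nothing bounds the cache.
    if (maxRAMMB == null && maxDocsToCache == null) {
      throw new IllegalArgumentException("Either maxRAMMB or maxDocsToCache must be set");
    }
    // Both limits set: one would be silently ignored, so reject it up front.
    if (maxRAMMB != null && maxDocsToCache != null) {
      // Hypothetical message; the PR's wording is not shown in the thread.
      throw new IllegalArgumentException("Only one of maxRAMMB or maxDocsToCache may be set");
    }
    this.maxRAMMB = maxRAMMB;
    this.maxDocsToCache = maxDocsToCache;
  }

  /** Helper for the demo: true if the constructor rejects this combination. */
  static boolean isRejected(Double maxRAMMB, Integer maxDocsToCache) {
    try {
      new CacheLimits(maxRAMMB, maxDocsToCache);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(isRejected(null, null));  // neither set
    System.out.println(isRejected(64.0, 1000));  // both set
    System.out.println(isRejected(64.0, null));  // exactly one set
    System.out.println(isRejected(null, 1000));  // exactly one set
  }
}
```

Only the two "exactly one set" combinations are accepted, so a caller can never pass a limit that is then ignored.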

* @param groupSelector a {@link GroupSelector} that defines groups for this GroupingSearch
* @param grouperFactory a factory that creates fresh {@link GroupSelector} instances
*/
public GroupingSearch(GroupSelector<?> groupSelector) {

Contributor

Removing this constructor is a good call, like we previously discussed. Could you perhaps mention this explicitly in the changes entry and the PR description?

Contributor Author

Added.

Signed-off-by: Binlong Gao <gbinlong@amazon.com>
Signed-off-by: Binlong Gao <gbinlong@amazon.com>