Allow merge integrity checks to be aborted sooner#16086
Conversation
This change introduces a merge AbortChecker that is threaded through MergeState into StoredFieldsReader.checkIntegrity and CodecUtil.checksumEntireFile, allowing long-running integrity checks like in apache#13354 to be aborted sooner after a merge is aborted. It is enabled via IndexWriterConfig.setMergeAbortCheckIntervalBytes, which indicate a number of bytes to read during checksumming before checking if the corresponding merge has been aborted. Only stored fields are wired up but other files like postings and doc values could benefit from the same treatment in a follow-up. Having MergeState expose the AbortChecker.checkAborted method can also be useful to other merging logic I think.
romseygeek
left a comment
There was a problem hiding this comment.
Thanks @tlrx, this looks great. I left a couple of comments.
| } | ||
| } | ||
|
|
||
| final MergePolicy.AbortChecker mergeAbortChecker; |
There was a problem hiding this comment.
Maybe add a factory method that takes a OneMerge and a check interval and returns NO_OP if the interval is 0?
There was a problem hiding this comment.
Good suggestion, I added MergePolicy.AbortChecker.create() in 227c021.
|
|
||
| /** Checks if the merge should be aborted, throwing MergeAbortedException if so. */ | ||
| public void checkAborted() throws MergeAbortedException { | ||
| oneMerge.checkAborted(); |
There was a problem hiding this comment.
This will throw an NPE if it's called on NO_OP which I think is trappy.
|
Related: #13354 |
| private final OneMerge oneMerge; | ||
| private final int abortCheckIntervalBytes; | ||
|
|
||
| public AbortChecker(OneMerge oneMerge, int abortCheckIntervalBytes) { |
There was a problem hiding this comment.
Let's make this private and do all construction via the factory method.
mikemccand
left a comment
There was a problem hiding this comment.
I don't like that we need a whole new class/interface, touching so many existing classes, for this. If it's just a "check every N bytes" why not implement that more directly -- store that int/long somewhere and fix checkIntegrity.
I'm also not a fan of only doing stored fields -- other Lucene index files can be ginormous (vectors) -- can we impl for those as well? Or if it really must only be stored fields, let's confess so in the docs for this?
Finally, I'm not convinced we understand the root cause here, so I think we may be fixing a "not actually a problem". At Amazon we've also seen some evidence that merge abort was insanely slow, but somehow never followed up on it / got to root cause. Can we first spend some time back on the issue doing that? For example, is there maybe a bug that's turned off the entire abort checking best effort system? (It is similar to Thread.interrupt best effort checking, but hopefully better!).
It's hard for me to believe checkIntegrity can take minutes on any Lucene index file unless the storage system holding the index is insanely slow. Or, the index is insanely massive. Both are possible :)
| * | ||
| * @param intervalBytes the interval in bytes, must be positive | ||
| */ | ||
| public IndexWriterConfig setMergeAbortCheckIntervalBytes(int intervalBytes) { |
There was a problem hiding this comment.
Let's make this long? In general if something is measuring bytes let's try to use long -- we do math on such things that may lead to overflow?
There was a problem hiding this comment.
does not need a configurable value at all, just a reasonable one (e.g. 1MB). If someone is doing s3, even setting this to 1 could take "forever" due to network latency. We shouldn't overengineer here.
If we can simplify how we do it (see idea on CodecUtil.java), i'd suggest overloading the implementing function with a parameter, to allow unit tests to use a small value, but it doesn't need to be public.
| * Returns the interval in bytes between abort checks during merge integrity verification. See | ||
| * {@link IndexWriterConfig#setMergeAbortCheckIntervalBytes(int)}. | ||
| */ | ||
| public int getMergeAbortCheckIntervalBytes() { |
| * <p>The check is performed in {@link #readBytes} since that is what {@link | ||
| * ChecksumIndexInput#seek} calls in a loop to compute the checksum. | ||
| */ | ||
| private static final class AbortableBufferedChecksumIndexInput |
There was a problem hiding this comment.
Hmm, you're adding another level of buffering / copying, when someone turns on this new setting? I think? Can we subclass FilterIndexInput instead, so it's simply/only the added accounting (tracking total bytes written) and not more buffering / copying?
| protected IndexWriterEventListener eventListener; | ||
|
|
||
| /** Interval in bytes between abort checks during merge integrity verification. */ | ||
| protected volatile int mergeAbortCheckIntervalBytes; |
There was a problem hiding this comment.
Is it only checkIntegrity that we are newly instrumenting here?
| abstract void checkIntegrity() throws IOException; | ||
|
|
||
| /** Check the integrity of the index, with periodic merge abort checking. */ | ||
| public abstract void checkIntegrity(MergePolicy.AbortChecker abortChecker) throws IOException; |
There was a problem hiding this comment.
Since we are only adding "check every N bytes", can we rename AbortChecker to something more specific? Is this a public API, or will Lucene users only interact with the setter/getter on IWC (taking just int or long)?
|
Also, since things can vary drastically depending on the env, let's use time/seconds as the "check every", not bytes? We can hardwire this to something reasonable that's surely in the noise of overall merge cost. For merges, We should be able to make this fix (once we really understand root cause -- do any of our unit tests check for prompt merge aborts?) without having to expose another knob for Lucene users to tune ... |
|
Thanks @romseygeek and @mikemccand for your feedback! Before addressing code review comments, I'll try to reply to the higher-level questions:
Yes, the storage systems were indeed insanely slow when it happened. I've seen this in two situations: disks saturated with IO (caused I think by many concurrent force merges running on non top notch SSDs) and storage backed by object storage like S3 with high read latency hiccups, as David describes in #13354 (comment). In those situations, the Lucene index needed to be closed for system maintenance purpose or to be reopened on a more performant machine.
I don't think there is a bug in the existing abort checking system itself. The existing Note that this change adds the
I found some tests that abort merges but they don't check that the abort happens promptly. Specifically, there is no
Yes I mentioned in the PR description that other files could benefit from this but wanted to keep the radius of changes low at first. For vectors, I wonder if more
I explored the I also explored wrapping the readers (
If we're only accounting for time spent reading bytes for checksumming purposes without tying it to the So I think this leaves us with different options:
I think we could start with option 2 as a simple first improvement in this PR (doesn't interrupt in-progress checksums but avoids starting new ones on an already-aborted merge) and follow up with option 1 or 3 to interrupt long-running checksums. Does that sounds reasonable? If so I can rework the PR. |
| + footerLength(), | ||
| input); | ||
| } | ||
| in.seek(in.length() - footerLength()); |
There was a problem hiding this comment.
This is the call that will cause your i/o. I wonder if, instead of wrapping in a subclass, if we could just change it to a small loop, that only seek()s say, 1MB at once, and calls checkabort() after each iteration.
There was a problem hiding this comment.
Thanks @rmuir, checking every ~1MB seek makes sense and the public configurable value on IndexWriterConfig would become unnecessary.
We'd still need to wire the OneMerge::checkAborted call down to CodecUtil.checksumEntireFile though, so the checkIntegrity() signature on the abstract reader classes would need to change (or be overloaded) to accept the abort callback, something that @mikemccand was concerned about.
There was a problem hiding this comment.
I understand his point too. this was just a brainstorm to try to minimize it practically. I feel like wrapping all the reads might be overkill: if we could contain the solution here, I think it would make the merge abort a lot better.
Otherwise, after we checksum, for the most part merge is no longer just "reading" but also doing writes at the same time, so the existing checks should work there.
Actually does vectors merging today ever check for aborted, e.g. during the concurrent HNSW graph building, except at the very start/end?
Got it, +1 for PnP ("progress not perfection") for sure. I think a lower level approach (wrapping all
Hmm I see, that is a wrinkle. I was thinking we could wrap the output path, but you're right,
That is so tricky -- how did you catch that bulk optos broke? I hope we have tests that get angry?
+1 to at least start with #2 -- that's an easy win! I also ... dare say ... is there any chance |
|
I think during merging we wrap all |
The symmetry with the read path is appealing! But I think it would require wrapping (or reopening) the IndexInputs used by the merge instances through the wrapped merge directory. And those merge instances are obtained via getMergeInstance(), so we'd need a way to pass the merge directory to them... I suspect the required changes would be as invasive as what this PR proposes. Unless you were thinking of a different solution? Overall I think the simplest path is a checkIntegrity(OneMerge) overload that propagates the abort check down to checksumEntireFile, where it checks every ~1MB seek as Robert suggested. |
+1, that's a delightfully simple solution.
Ahh, hrpmph, yes. And these merges will often re-use an already opened pooled reader for each segment, and those were opened well before any merging plans (so no chance to wrap the |
|
Thanks @mikemccand for the feedback. I opened #16264 to add additional merge abort checks before executing file integrity checksums (the point 2 that I suggested in #16086 (comment)). Hopefully it's a less controversial change. It also threads
|
|
Closed in favor of #16281 |
This change introduces a merge
AbortCheckerthat is threaded throughMergeState intoStoredFieldsReader.checkIntegrityandCodecUtil.checksumEntireFile, allowing long-running integrity checks like described in #13354 to be aborted sooner after a merge is aborted.It is enabled via
IndexWriterConfig.setMergeAbortCheckIntervalBytes, which indicates a number of bytes to read during checksumming before checking if the corresponding merge has been aborted.Only stored fields are wired up but other files like postings and doc values could benefit from the same treatment in a follow-up. Having
MergeStateexpose theAbortChecker.checkAbortedmethod could also be useful to other merging logic like HNSW graphs maybe.