Skip to content

Serialize concurrent ConfigureIndexesAsync with distributed lock + cache marker#244

Open
niemyjski wants to merge 3 commits intomainfrom
feature/serialize-index-configuration
Open

Serialize concurrent ConfigureIndexesAsync with distributed lock + cache marker#244
niemyjski wants to merge 3 commits intomainfrom
feature/serialize-index-configuration

Conversation

@niemyjski
Copy link
Copy Markdown
Member

TLDR

When multiple distributed processes (pods, workers, migration runners) call \ConfigureIndexesAsync\ on startup, only the first one does real work. Everyone else waits and skips. Cache marker is cleared on delete, reindex, and maintenance so the next configure call re-validates.

Problem

\ConfigureIndexesAsync\ had no concurrency protection. Every caller ran the full configure pass (index exists checks, create/update settings, put mappings, alias management, reindex detection) regardless of whether another process was already doing the same work. At scale (100 nodes, 100+ indexes), this means thousands of redundant Elasticsearch admin API calls on every deployment, plus racing reindex work item enqueuing.

Solution

Added a distributed double-checked lock + cache marker pattern to \ElasticConfiguration.ConfigureIndexesAsync:

  1. Cache check -- if \configure-indexes\ key exists in the distributed cache, return immediately (zero ES calls, zero lock overhead)
  2. Distributed lock -- acquire via \CacheLockProvider\ (1-minute lock duration, 1-minute wait timeout)
  3. Double-check -- after acquiring the lock, check cache again (another process may have finished while we waited)
  4. Configure -- run the full \ConfigureAsync\ + \MaintainAsync\ on all indexes in parallel
  5. Set cache marker -- 5-minute TTL, only set after successful completion (partial failures retry)

Cache marker is explicitly cleared by \DeleteIndexesAsync, \ReindexAsync, and \MaintainIndexesAsync\ so the next configure call re-runs after any structural change.

Behavior at scale

Caller Cache lookups Lock ops ES API calls
Winner (first process) 2 acquire + release Full configure
Waiter (during configure) 2 acquire + release 0
Late arrival (after configure) 1 0 0

Failure modes

  • Cache down: Falls back to current behavior (everyone configures). Idempotent, just redundant.
  • Lock timeout: Proceeds with warning. Safe because \ConfigureAsync\ is idempotent.
  • Partial failure: Exception propagates, cache marker never set, next caller retries.
  • Manual ES changes: 5-minute TTL safety net ensures reconfiguration.

Design decisions

  • No lock renewal: Even 100+ indexes in parallel complete in seconds. 1-minute lock is generous. If it takes longer, ES is unhealthy.
  • Explicit \indexes\ parameter bypasses lock+cache: Callers who pass specific indexes know what they want.
  • No interface changes: \IElasticConfiguration\ signature is unchanged.

Changes

  • \ElasticConfiguration.ConfigureIndexesAsync\ -- distributed lock + cache marker
  • \ElasticConfiguration.DeleteIndexesAsync\ -- clears cache marker after deletion
  • \ElasticConfiguration.MaintainIndexesAsync\ -- clears cache marker after maintenance
  • \ElasticConfiguration.ReindexAsync\ -- clears cache marker after reindexing

…k + cache marker

Prevent redundant Elasticsearch admin API calls when multiple distributed processes call ConfigureIndexesAsync simultaneously on startup.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds concurrency protection to Elasticsearch index configuration to prevent redundant admin calls when many distributed processes call ConfigureIndexesAsync at startup, using a distributed lock plus a short-lived cache marker.

Changes:

  • Serialize ElasticConfiguration.ConfigureIndexesAsync with a distributed lock and “configured recently” cache marker.
  • Clear the cache marker after index delete, maintenance, and reindex operations so subsequent configure runs re-validate.
  • Convert several methods to async and await the underlying Task.WhenAll to support the new post-work cache invalidation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +215 to +217
await Task.WhenAll(tasks).AnyContext();

await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache marker removal only runs if all DeleteAsync tasks complete successfully. If deletion partially succeeds and one task faults, Task.WhenAll will throw and the marker won't be cleared, so later ConfigureIndexesAsync calls may skip even though the index set changed. Consider removing the marker in a finally block (or before deletions start) to guarantee invalidation on failures too.

Suggested change
await Task.WhenAll(tasks).AnyContext();
await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
try
{
await Task.WhenAll(tasks).AnyContext();
}
finally
{
await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in a previous commit: \DeleteIndexesAsync\ now uses try/finally to ensure cache marker is cleared even on partial failures.

Comment on lines +138 to +160
if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
return;

await using var configLock = await _lockProvider.AcquireAsync(ConfigureIndexesCacheKey,
TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1)).AnyContext();

if (configLock is not null)
{
if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
return;
}
else
{
_logger.LogWarning("Unable to acquire configure-indexes lock, continuing without lock");
}

var allTasks = new List<Task>();
foreach (var idx in Indexes)
allTasks.Add(ConfigureIndexInternalAsync(idx, beginReindexingOutdated));

await Task.WhenAll(allTasks).AnyContext();

return Task.WhenAll(tasks);
await Cache.SetAsync(ConfigureIndexesCacheKey, true, TimeSpan.FromMinutes(5)).AnyContext();
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConfigureIndexesAsync now depends on distributed cache operations (ExistsAsync/SetAsync). If the cache client throws (e.g., Redis outage), ConfigureIndexesAsync will fail even though configuring indexes is otherwise idempotent and should be able to proceed. The PR description says this should fall back to the pre-lock behavior when cache is down—consider wrapping the marker check/set in try/catch (log warning) and treating cache failures as a cache miss / best-effort set.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in a previous commit: all cache operations in \ConfigureIndexesAsync\ are wrapped in try/catch so cache failures fall back to pre-lock behavior. This latest commit also narrows catch clauses with \when (ex is not OperationCanceledException)\ so cancellation propagates correctly.

Comment on lines +197 to +203
var tasks = new List<Task>();
foreach (var idx in indexes)
tasks.Add(idx.MaintainAsync());

return Task.WhenAll(tasks);
await Task.WhenAll(tasks).AnyContext();

await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache marker removal only runs if all MaintainAsync tasks complete successfully. If any index maintenance fails (or partially succeeds), the marker may remain set and subsequent ConfigureIndexesAsync calls could incorrectly skip reconfiguration. Consider clearing the cache marker in a finally block (or before starting maintenance) so it's removed even when Task.WhenAll throws.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in a previous commit: \MaintainIndexesAsync\ now uses try/finally to ensure cache marker is cleared even when maintenance tasks fail.

…idation

Wrap all cache operations (ExistsAsync, SetAsync, RemoveAsync) in try/catch so cache outages degrade gracefully to pre-lock behavior instead of failing ConfigureIndexesAsync entirely. Move cache marker removal in DeleteIndexesAsync and MaintainIndexesAsync to finally blocks so partial failures still invalidate the marker. Extract InvalidateConfigureIndexesCacheAsync helper for consistent best-effort cache removal.
Add when (ex is not OperationCanceledException) exception filters to all 5 catch blocks in ConfigureIndexesAsync and InvalidateConfigureIndexesCacheAsync so cancellation propagates correctly instead of being swallowed as a warning.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +138 to +152
try
{
if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
return;
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
_logger.LogWarning(ex, "Error checking configure-indexes cache marker, will configure indexes");
}

ILock configLock = null;
try
{
configLock = await _lockProvider.AcquireAsync(ConfigureIndexesCacheKey,
TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1)).AnyContext();
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new cache-marker + distributed-lock behavior in ConfigureIndexesAsync (fast-path return, double-check after lock, and marker set on success) doesn’t appear to be covered by automated tests. Consider adding tests that assert (1) a set marker short-circuits without invoking per-index Configure/Maintain, and (2) concurrent callers result in a single configure pass (others wait/skip) with the marker set only after successful completion.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +173 to +176
else
{
_logger.LogWarning("Unable to acquire configure-indexes lock, continuing without lock");
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning can duplicate the earlier warning logged from the AcquireAsync exception handler (if acquisition throws). Consider logging only once (e.g., only log here for null/timeout returns, and avoid a second log when an exception was already logged) to reduce noisy startup logs when lock infrastructure is unhealthy.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the duplicate warning. The lock acquisition failure is already logged in the catch block, so this additional warning for null/timeout returns is intentional to cover the non-exception case (timeout without throwing). They cover different failure paths.

Comment on lines +138 to +146
try
{
if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
return;
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
_logger.LogWarning(ex, "Error checking configure-indexes cache marker, will configure indexes");
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new cache-marker + distributed-lock skip logic in ConfigureIndexesAsync isn’t exercised by tests (e.g., 2nd call should short-circuit when the marker exists, and Maintain/Delete/Reindex should clear the marker). Adding focused unit/integration tests around this concurrency/skip behavior would help prevent regressions.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Integration tests for the cache-marker + distributed-lock behavior would be valuable. The current implementation is well-tested manually but adding automated coverage for the fast-path skip and lock contention scenarios is a good follow-up item.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants