Skip to content

Fix audit log writing errors for rollover-enabled alias indices#5900

Open
pCastq wants to merge 7 commits intoopensearch-project:mainfrom
pCastq:bug-#5878
Open

Fix audit log writing errors for rollover-enabled alias indices#5900
pCastq wants to merge 7 commits intoopensearch-project:mainfrom
pCastq:bug-#5878

Conversation

@pCastq
Copy link
Contributor

@pCastq pCastq commented Jan 12, 2026

Description

[Bug fix, Enhancement , Test fix]

the audit log cannot be written to an index that already exists when using an alias with a rollover policy. This causes errors and prevents audit events from being stored.

  • Old behavior:
    When an audit log index already exists under an alias, OpenSearch throws an error and does not insert audit events.
  • New behavior:
    The audit log now detects if the target index already exists and inserts new events into it instead of failing.

Issues Resolved

[#5878 + integration test]

Testing

  • Integration tests confirming that rollover policies are respected and no errors occur when writing to an existing index.

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@pCastq
Copy link
Contributor Author

pCastq commented Jan 12, 2026

The part of integration test is to hard.
(I've been racking my brains for three days trying to figure out the right parts to use)

We need separate test classes because they test distinct code paths that require different cluster configurations.
InternalOpenSearchSinkIntegrationTest -> validates the default date-based index creation (metadata.hasIndex branch), while
InternalOpenSearchSinkIntegrationTest_AuditAlias -> validates write alias support (metadata.hasAlias branch).
These configurations are mutually exclusive and cannot coexist in a single @ClassRule cluster.

Cleaning indices between tests would require cluster restarts or expensive index deletion, significantly slowing CI/CD pipelines. Instead, we use delta-based assertions (before/after counts) that ensure test isolation without cleanup overhead.

And there are other stuff that need to look carefully throught the object
For example when we use LocalCluster, if we use a public Builder audit(AuditConfiguration auditConfiguration) method , we using always TestRuleAuditLogSink and have some pre-setting or fixed setting...

There are a lot of comment and javadoc.

Copy link
Collaborator

@nibix nibix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this. I left a couple of comments.

Could you also please:

  • Check the CI errors regarding code hygiene? maybe it is just a missing spotlessApply
  • Write a changelog entry
  • Change the name of this PR to something more descriptive

Thank you :)

*/
@ClassRule
public static final LocalCluster cluster = new LocalCluster.Builder().clusterManager(ClusterManager.SINGLENODE)
.anonymousAuth(true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need that here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not quite sure if we need anonymous auth?

pCastq and others added 5 commits January 16, 2026 00:07
Signed-off-by: Pietro Paolo Castagna <PietroPaolo.Castagna@gmail.com>
Removed unnecessary comments from audit log integration tests.

Signed-off-by: pCastq <131659139+pCastq@users.noreply.github.com>
Signed-off-by: pCastq <131659139+pCastq@users.noreply.github.com>
Signed-off-by: Pietro Paolo Castagna <PietroPaolo.Castagna@gmail.com>
Signed-off-by: Pietro Paolo Castagna <PietroPaolo.Castagna@gmail.com>
@pCastq
Copy link
Contributor Author

pCastq commented Jan 16, 2026

I intentionally did not unify these tests with InternalAuditLogTest because they validate different responsibilities and operate under fundamentally different assumptions.

InternalAuditLogTest is a cluster-level smoke test: it verifies that, in a fully secured single-node cluster, the audit index is created, shards are allocated, and the index reaches green health. Its purpose is infrastructure readiness, not application logic.

The tests in this PR instead focus on InternalOpenSearchSink behavior: index creation when absent, alias detection, and write routing for date-based vs alias-based configurations. They explicitly cover sink-level code paths that InternalAuditLogTest does not exercise at all.

Unifying the tests would also force a much heavier configuration (admin users, HTTP Basic auth, compliance logging, transport events). This would introduce unnecessary overhead, extra audit noise, and additional failure modes, without increasing coverage for the sink logic being tested.

Keeping these tests separate preserves clarity, focus, and maintainability. Each test suite remains aligned with its specific goal, and failures remain easy to diagnose. Combining them would add complexity without providing real value, in my opinion.

However , as @nibix suggest, I use Awaitility instead of Thread.sleep() because audit events are generated asynchronously allows the test to wait until a meaningful functional condition is met (i.e. a new audit event becomes visible) and to fail deterministically after a bounded timeout.

This approach makes the test more robust, avoids flakiness in CI, and ensures that we only proceed once the audit data has actually been indexed and refreshed.

@nibix nibix changed the title Fix bug #5878 Fix audit log writing errors for rollover-enabled alias indices Jan 20, 2026
generateAuditEvent("_cluster/health");

await().atMost(3, SECONDS).pollInterval(100, MILLISECONDS).untilAsserted(() -> {
refreshAuditIndices(client); // refresh prima di verificare
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

si prega di preferire l'uso dell'inglese :-)

* and midnight rollover timing.</p>
*/
@Test
public void testMultipleRequestTypesGenerateAuditEvents() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why this is something related to the OpenSearch sink? Isnt this a behavior that occurs at a higher level?

* test created a new event.</p>
*/
@Test
public void testAuditDocumentsViaAliasContainMandatoryFields() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a test which covers the alias logic? It seems to be this covers logic from a higher level.

Signed-off-by: Pietro Paolo Castagna <PietroPaolo.Castagna@gmail.com>
* previous tests. Uses delta assertion to ensure a new event was created.</p>
*/
@Test
public void testAuditDocumentContainsMandatoryFields() {
Copy link
Contributor Author

@pCastq pCastq Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is intentionally at the integration level because it verifies the real effect of the original bug: before the fix, writing via an alias failed and no document was created.
Even though it doesn’t call metadata.hasAlias() directly, it confirms that alias writes behave like direct index writes, including all required fields, and it protects against future regressions, for example if the audit log writing logic is refactored.
Removing it would remove this critical check.

Signed-off-by: Pietro Paolo Castagna <PietroPaolo.Castagna@gmail.com>

generateAuditEvent("_cluster/health");

await().atMost(10, SECONDS).pollInterval(100, MILLISECONDS).untilAsserted(() -> {
Copy link
Collaborator

@nibix nibix Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 seconds is actually the default timeout constraint, so this can be dropped. Same holds for the poll interval.

long before = countAuditDocs(client);
generateAuditEvent("_cluster/health");

await().atMost(3, SECONDS).pollInterval(100, MILLISECONDS).until(() -> countAuditDocs(client) > before);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also use default timeout and poll interval here?

assertThat("Missing mandatory field: audit_request_origin", auditDoc.containsKey("audit_request_origin"), is(true));
assertThat("Missing mandatory field: @timestamp", auditDoc.containsKey("@timestamp"), is(true));
assertThat("Missing REST field: audit_rest_request_method", auditDoc.containsKey("audit_rest_request_method"), is(true));
assertThat("Missing REST field: audit_rest_request_path", auditDoc.containsKey("audit_rest_request_path"), is(true));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hamcrest way of doing such assertions would be

assertThat(auditDoc, hasKey("@timestamp"));
assertThat(auditDoc, hasKey("audit_rest_request_method"));
assertThat(auditDoc, hasKey("audit_rest_request_path"));

This will yield much richer assertion errors in the case of test failures.

new SearchRequest(AUDIT_ALIAS).source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(0))
).actionGet();

return Objects.requireNonNull(response.getHits().getTotalHits()).value();
Copy link
Collaborator

@nibix nibix Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Objects.requireNonNull is redundant, with our without you will get a NPE in case getTotalHits() returns null.

@nibix
Copy link
Collaborator

nibix commented Jan 30, 2026

Thank you, looks much better! I have added a few minor comments. Also there are still a few leftover comments from the previous round. Could you please have a look at these?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants