WIP: OCPBUGS-57474: Authorization Cache V2#530

Closed
jacobsee wants to merge 6 commits into openshift:main from jacobsee:hash-based-authorization-cache

@jacobsee jacobsee commented Jun 24, 2025

  • Move existing (periodic-sync based) AuthorizationCache from cache.go to AuthorizationCacheV1 in cachev1.go (no longer instantiated anywhere currently)
  • Add generic interface for an AuthorizationCache to cache.go
  • Add AuthorizationCacheV2, which uses periodic cluster and namespace-specific hashing to track changes and invalidate the cache/notify watchers
  • Add ReactiveAuthorizationCacheV2, which wraps AuthorizationCacheV2 in event-handling so that heavy sync checks can happen less frequently and changes are still processed in a timely manner
  • Switch the Project API to instantiating a ReactiveAuthorizationCacheV2
  • Significant testing

Motivated by the fact that the current auth cache does not appear to notify properly when permissions are removed during incremental synchronizations (https://issues.redhat.com/browse/OCPBUGS-57474). This looks to be an old issue in a system that is rather difficult to reason about.
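
A minimal sketch of the hash-based change detection described in the list above. It is not the PR's implementation: the `resourceRef` type and `hashRefs` function are hypothetical names, and the choice of SHA-256 over sorted (UID, resourceVersion) pairs is an assumption. The point it illustrates is that a removed object changes the digest just as an added or updated one does, which is the case the motivating bug is about.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// resourceRef is a hypothetical stand-in for the identity of one RBAC object.
type resourceRef struct {
	UID             string
	ResourceVersion string
}

// hashRefs returns a digest that changes whenever an object is added,
// removed, or updated. Sorting a copy first makes the result independent
// of listing order.
func hashRefs(refs []resourceRef) string {
	sorted := append([]resourceRef(nil), refs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].UID < sorted[j].UID })

	h := sha256.New()
	for _, r := range sorted {
		// NUL separators keep adjacent fields from running together.
		fmt.Fprintf(h, "%s\x00%s\x00", r.UID, r.ResourceVersion)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := []resourceRef{{"uid-a", "10"}, {"uid-b", "11"}}
	after := []resourceRef{{"uid-a", "10"}} // uid-b was deleted

	fmt.Println("changed after removal:", hashRefs(before) != hashRefs(after))
}
```

Comparing one digest per scope (cluster-wide or per namespace) against the previously stored digest is then enough to decide whether that scope needs to be re-evaluated.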

…orizationCacheV2 implementation and switch all internal usage to an interface. Move the original resource-version-based cache to cachev1.go, update wiring and tests, and ensure all consumers use the new interface type.
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2025
@openshift-ci
Contributor

openshift-ci Bot commented Jun 24, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jacobsee
Member Author

/test all

@openshift-ci
Contributor

openshift-ci Bot commented Jun 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacobsee
Once this PR has been reviewed and has the lgtm label, please assign benluddy for approval. For more information see the Code Review Process.

@jacobsee
Member Author

/retest

@jacobsee
Member Author

/retest

@jacobsee jacobsee changed the title from "WIP: Project authorization cache V2" to "WIP: Authorization Cache V2" Jun 25, 2025
@jacobsee
Member Author

/retest

@jacobsee
Member Author

/retest

…uthorizationCacheV2 to use it. Switch to Workqueue instead of custom queueing logic. Use LastSyncResourceVersion in global rbac cache to short circuit hashing.
@jacobsee
Member Author

/retest

@jacobsee
Member Author

/retest

@benluddy benluddy left a comment

Thanks for all your work on this, it looks really promising!

@@ -0,0 +1,645 @@
package auth

Putting the cache.go/cache_test.go renames in a clean commit would make it obvious that there are no little changes mixed in.

informers.RoleBindings().Informer(),
}

globalRBACCache := NewGlobalRBACCache(scrLister, scrbLister)

Does it make sense to pass this in rather than constructing it here? You'd be able to write tests that simulate varying the global RBAC without setting up phony informers.
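
One way the suggestion above could look, as a sketch only. The `globalRBACHasher` interface, `authCache` struct, and `fakeGlobal` type are hypothetical and much smaller than the real types; the assumption is that the authorization cache only needs to ask the global RBAC cache for its current hash.

```go
package main

import "fmt"

// globalRBACHasher is a hypothetical interface capturing what the
// authorization cache needs from the global RBAC cache.
type globalRBACHasher interface {
	CurrentHash() string
}

// authCache is a simplified stand-in for AuthorizationCacheV2.
type authCache struct {
	global   globalRBACHasher
	lastHash string
}

// newAuthCache receives its dependency instead of constructing it.
func newAuthCache(global globalRBACHasher) *authCache {
	return &authCache{global: global}
}

// globalChanged reports whether the global hash moved since the last call.
func (c *authCache) globalChanged() bool {
	h := c.global.CurrentHash()
	changed := h != c.lastHash
	c.lastHash = h
	return changed
}

// fakeGlobal lets a test vary the global RBAC state directly,
// with no informers or listers involved.
type fakeGlobal struct{ hash string }

func (f *fakeGlobal) CurrentHash() string { return f.hash }

func main() {
	fake := &fakeGlobal{hash: "v1"}
	c := newAuthCache(fake)

	fmt.Println("first observation:", c.globalChanged())
	fmt.Println("no change:", c.globalChanged())

	fake.hash = "v2" // simulate a ClusterRole change
	fmt.Println("after change:", c.globalChanged())
}
```

With the dependency injected, production wiring can still build the real cache from listers, while tests swap in the fake.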

}

// Run begins watching and synchronizing the cache
func (ac *AuthorizationCacheV2) Run(period time.Duration) {

Can the periodic "syncs" be entirely removed in favor of the informer event handlers with periodic resync intervals?

Comment on lines +203 to +219
// Start periodic synchronization
go func() {
	// Initial sync
	rac.synchronize()

	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			rac.synchronize()
		case <-rac.stopCh:
			return
		}
	}
}()

Why is this still needed?

Comment on lines +19 to +22
const (
	globalSyncKey      = "global"
	namespaceKeyPrefix = "namespace:"
)

IMO, it would be at least as clear to use "" as a sentinel key for all namespaces and get rid of the string prefixing.
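
A small sketch of the sentinel-key idea, under the assumption that queue keys are plain strings and that a real namespace name is never empty. The `allNamespaces` constant and `describe` function are hypothetical; Kubernetes itself uses the empty string for the same purpose in `metav1.NamespaceAll`.

```go
package main

import "fmt"

// allNamespaces is the sentinel key. A real namespace name is never empty,
// so "" cannot collide with one.
const allNamespaces = ""

// describe shows how a worker could branch on the key without building
// or parsing a "namespace:" prefix.
func describe(key string) string {
	if key == allNamespaces {
		return "sync global RBAC"
	}
	return "sync namespace " + key
}

func main() {
	for _, key := range []string{allNamespaces, "default", "openshift-config"} {
		fmt.Println(describe(key))
	}
}
```

This also sidesteps the corner case where a namespace is literally named `global`, which the prefix scheme exists to disambiguate.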

refs = append(refs, rbacResourceRef{
	uid:             string(cr.UID),
	resourceVersion: cr.ResourceVersion,
	kind:            "ClusterRole",

Is kind necessary? We shouldn't have any two objects with the same UID.

Comment on lines +101 to +104
// If we don't have previous versions cached, consider it changed
if g.lastClusterRoleVersion == "" || g.lastClusterRoleBindingVersion == "" {
	return true
}

This looks redundant. When would we return true from this path but not the path that immediately follows?
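
To make the question concrete, here is a sketch that assumes the path following the excerpt is a plain inequality comparison between the last-seen and current versions; the excerpt does not show that code, so this is a guess. The `versions` type and `changed` function are hypothetical.

```go
package main

import "fmt"

// versions is a hypothetical pair of last-seen resource versions.
type versions struct{ clusterRole, clusterRoleBinding string }

// changed compares current against last with no special case for "".
func changed(last, current versions) bool {
	return last != current
}

func main() {
	var never versions // zero value: nothing cached yet
	cur := versions{"100", "200"}

	fmt.Println("nothing cached, informers synced:", changed(never, cur))
	fmt.Println("cached and unchanged:", changed(cur, cur))
	fmt.Println("nothing cached, informers empty:", changed(never, never))
}
```

Under that assumption, the explicit empty-string check only changes the outcome in the last case, where the current versions are empty as well.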

Comment on lines +287 to +290
// GetQueueStatus returns information about the queue
func (rac *ReactiveAuthorizationCacheV2) GetQueueStatus() (queued int) {
	return rac.queue.Len()
}

It seems brittle to have tests that directly depend on this... can't we test the observable results of the workqueue processor without knowing about it?

}

// GetCurrentHash returns the current global RBAC hash
func (g *GlobalRBACCache) GetCurrentHash() string {

Do callers need to be responsible for detecting that this hash might be stale and for updating it? I think correctness depends on always getting the latest hash. Caching based on the last sync RV makes sense as an optimization, but I think it's an implementation detail and not something callers care about.
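
A sketch of a getter that keeps the resource-version short circuit internal, so callers always receive a hash that matches the latest state. Everything here is hypothetical: `selfRefreshingHash`, the `version` callback (standing in for the informer's last synced resource version), and the `compute` callback (standing in for the full hash over listed objects).

```go
package main

import (
	"fmt"
	"sync"
)

// selfRefreshingHash hides the resource-version optimization from callers.
type selfRefreshingHash struct {
	mu          sync.Mutex
	version     func() string // hypothetical: last synced resource version
	compute     func() string // hypothetical: expensive full hash
	lastVersion string
	lastHash    string
	valid       bool
}

// Hash returns a hash matching the current version. It recomputes only
// when the version moved since the previous call.
func (s *selfRefreshingHash) Hash() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v := s.version(); !s.valid || v != s.lastVersion {
		s.lastHash = s.compute()
		s.lastVersion = v
		s.valid = true
	}
	return s.lastHash
}

func main() {
	rv, computes := "1", 0
	h := &selfRefreshingHash{
		version: func() string { return rv },
		compute: func() string { computes++; return "hash@" + rv },
	}

	first, second := h.Hash(), h.Hash()
	fmt.Println(first, second, "computes:", computes)

	rv = "2" // simulate an RBAC change observed by the informer
	third := h.Hash()
	fmt.Println(third, "computes:", computes)
}
```

In this shape there is no separate "has it changed" call for callers to remember, and the caching stays an implementation detail.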

@@ -129,7 +54,7 @@ func TestSyncNamespace(t *testing.T) {
nsIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
nsLister := corev1listers.NewNamespaceLister(nsIndexer)

authorizationCache := NewAuthorizationCache(

Are there tests that can be run against both implementations? Or a shim that runs both implementations for real and detects behavior differences?
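
A sketch of the shim idea, assuming both implementations satisfy a shared interface. The `projectLister` interface, `diffLister` wrapper, and `staticLister` fake are hypothetical; the real AuthorizationCache interface is larger. The wrapper answers from one implementation and reports whenever the other disagrees.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// projectLister is a hypothetical slice of the shared cache interface.
type projectLister interface {
	List(user string) []string
}

// diffLister answers from primary and reports when shadow disagrees.
type diffLister struct {
	primary, shadow projectLister
	report          func(user string, primary, shadow []string)
}

func (d diffLister) List(user string) []string {
	p, s := sortedCopy(d.primary.List(user)), sortedCopy(d.shadow.List(user))
	if !reflect.DeepEqual(p, s) {
		d.report(user, p, s)
	}
	return p
}

// sortedCopy normalizes order (and nil versus empty) before comparing.
func sortedCopy(in []string) []string {
	out := append([]string{}, in...)
	sort.Strings(out)
	return out
}

// staticLister is a fake implementation backed by a map.
type staticLister map[string][]string

func (s staticLister) List(user string) []string { return s[user] }

func main() {
	v1 := staticLister{"alice": {"team-a", "team-b"}} // stale: still lists team-b
	v2 := staticLister{"alice": {"team-a"}}

	d := diffLister{primary: v2, shadow: v1, report: func(u string, p, s []string) {
		fmt.Printf("mismatch for %s: primary=%v shadow=%v\n", u, p, s)
	}}
	fmt.Println(d.List("alice"))
}
```

The same interface also allows a table-driven test that runs one set of cases against each constructor in turn.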

@jacobsee jacobsee changed the title from "WIP: Authorization Cache V2" to "WIP: OCPBUGS-57474: Authorization Cache V2" Jul 1, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 1, 2025
@openshift-ci-robot

@jacobsee: This pull request references Jira Issue OCPBUGS-57474, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2025
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2025
@openshift-ci openshift-ci Bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 14, 2025
@coderabbitai

coderabbitai Bot commented Nov 14, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

@openshift-ci
Contributor

openshift-ci Bot commented Dec 4, 2025

@jacobsee: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-cmd | e1ab0ec | link | false | /test e2e-aws-ovn-cmd |
| ci/prow/e2e-aws-ovn-serial | e1ab0ec | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-ovn-serial-1of2 | e1ab0ec | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-aws-ovn-serial-2of2 | e1ab0ec | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/go-verify-deps | e1ab0ec | link | true | /test go-verify-deps |

@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci Bot closed this Jan 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Jan 4, 2026

@openshift-bot: Closed this PR.

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
