Conversation

@hedibouattour (Collaborator) commented:

Previously, entries were cached by entry key only. This caused issues when multiple services referenced the same entry.

This update changes the caching logic to:

  • Cache each entry by service ID while allowing multiple services to target the same entry.

  • When multiple services reference the same entry, override the previous value and keep the latest.

  • On service entry deletion, check whether other services still reference the same entry; if so, recreate the entry.

This ensures consistent behavior when entries are shared across services.
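
For illustration, here is a minimal sketch of the delete path described above (simplified shapes; deleteEntryFromVpp and addEntryToVpp are placeholder helpers, not the PR's actual functions):

package services // illustrative package name

// Simplified shapes for illustration; the real types live in the PR.
type cnatEntry struct {
	key   string // (IP;port;proto) entry key
	vppID uint32
}

type Server struct {
	// entry key -> service ID -> entry in VPP
	cnatEntriesByService map[string]map[string]*cnatEntry
}

// Placeholders standing in for the actual VPP add/delete calls.
func (s *Server) deleteEntryFromVpp(e *cnatEntry) error { return nil }
func (s *Server) addEntryToVpp(e *cnatEntry) error      { return nil }

// delServiceEntry drops serviceID's reference to an entry. If other services
// still reference the same key, the entry is recreated from one of them.
func (s *Server) delServiceEntry(serviceID string, entry *cnatEntry) error {
	refs := s.cnatEntriesByService[entry.key]
	delete(refs, serviceID)
	if err := s.deleteEntryFromVpp(entry); err != nil {
		return err
	}
	for _, remaining := range refs {
		// Another service still needs this (IP;port;proto): re-add it.
		return s.addEntryToVpp(remaining)
	}
	return nil
}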

@hedibouattour marked this pull request as draft on December 9, 2025 16:58
@hedibouattour force-pushed the refactor-service-caching branch from 33935b7 to f8d2794 on December 9, 2025 17:10
@hedibouattour self-assigned this on December 9, 2025
@hedibouattour force-pushed the refactor-service-caching branch from f8d2794 to ea85208 on December 10, 2025 16:54
@hedibouattour marked this pull request as ready for review on December 10, 2025 16:55
@hedibouattour requested a review from sknat on December 10, 2025 17:22
@hedibouattour force-pushed the refactor-service-caching branch from ea85208 to 35634e5 on December 15, 2025 15:20

@sknat (Collaborator) left a comment:

Many thanks for taking a stab at this. This looks good, but I think we might still be vulnerable to overrides in VPP that would cause a double free. Details inline.

// the entry is still referenced by another service, recreate another service entry randomly
s.log.Warnf("svc(del) entry %s was referenced by multiple services", entry.Key())
var randomService string
for randomService = range s.cnatEntriesByService[entry.Key()] {

@sknat (Collaborator):

It might be good to have a way to deterministically pick the service we want to install.
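
For instance (an illustrative sketch, not code from the PR), sorting the referencing service IDs and taking the first would make the choice deterministic:

import "sort"

// pickService deterministically chooses which remaining service's entry to
// reinstall: sort the referencing service IDs and take the first.
// (cnatEntry as in the diff above.)
func pickService(refs map[string]*cnatEntry) string {
	if len(refs) == 0 {
		return ""
	}
	ids := make([]string, 0, len(refs))
	for id := range refs {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids[0]
}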

entry: entry,
vppID: entryID,
}
if len(s.cnatEntriesByService[entry.Key()]) > 1 {

@sknat (Collaborator):

An issue I see is that if that happens, we might try to install two services with the same Key in VPP, which will cause issues during cleanup, as VPP's hash table only contains a single slot for (IP;port;proto).

@hedibouattour (Collaborator, Author):

This step overrides the existing entry because it has the same key, and keeps the latest created one. It's like an update of the existing entry.
So from VPP's perspective we always keep one entry per (IP;port;proto).

@hedibouattour (Collaborator, Author):

Maybe ideally in VPP we should extend the hash to something more than just (IP;port;proto), but for now can you think of a better way to deal with multiple services creating the same entry?


serviceStateMap map[string]ServiceState
// map entry key to a map of service id to entry in vpp
cnatEntriesByService map[string]map[string]*cnatEntry

@sknat (Collaborator):

Should we maybe do two maps instead?

// cnatEntriesByServiceName is a map of the k8s service ID to the entry,
// that is the object representing the translation in VPP
cnatEntriesByServiceName map[string]*cnatEntry
// serviceIdByKey is a map of (proto;port;ip) to the active k8s serviceID
serviceIdByKey map[string]string

On Add, this would enable us to:

  • if serviceIdByKey is empty for the key, just call cnatAdd()
  • if serviceIdByKey has a value, call cnatDelete() for the previous entry and cnatAdd() for the new one

On Delete:

  • we only delete from VPP if serviceIdByKey contains our service ID

This makes the choice deterministic (last takes precedence) and prevents double adds.
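
A sketch of that flow (cnatAdd/cnatDelete stand in for the real VPP calls, the Server shape is an assumption, and cnatEntry is as in the earlier sketch):

type Server struct {
	cnatEntriesByServiceName map[string]*cnatEntry
	serviceIdByKey           map[string]string
}

// Stubs standing in for the real VPP translation calls.
func (s *Server) cnatAdd(e *cnatEntry) error    { return nil }
func (s *Server) cnatDelete(e *cnatEntry) error { return nil }

// onAddService: the last service to add a key takes precedence; the previous
// owner's translation is deleted first so VPP never sees a double add.
func (s *Server) onAddService(serviceID string, entry *cnatEntry) error {
	if prevID, ok := s.serviceIdByKey[entry.key]; ok && prevID != serviceID {
		if err := s.cnatDelete(s.cnatEntriesByServiceName[prevID]); err != nil {
			return err
		}
	}
	if err := s.cnatAdd(entry); err != nil {
		return err
	}
	s.cnatEntriesByServiceName[serviceID] = entry
	s.serviceIdByKey[entry.key] = serviceID
	return nil
}

// onDelService: only delete from VPP if we are the active owner of the key,
// which prevents freeing a translation another service still relies on.
func (s *Server) onDelService(serviceID string) error {
	entry, ok := s.cnatEntriesByServiceName[serviceID]
	if !ok {
		return nil
	}
	delete(s.cnatEntriesByServiceName, serviceID)
	if s.serviceIdByKey[entry.key] != serviceID {
		return nil // another service overrode this key; nothing of ours is in VPP
	}
	delete(s.serviceIdByKey, entry.key)
	return s.cnatDelete(entry)
}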

@hedibouattour (Collaborator, Author):

The current code does this I think. We just re-add the entry and it's like an update.

func (v *VppLink) CnatTranslateAdd(tr *types.CnatTranslateEntry) (uint32, error) {
	if len(tr.Backends) == 0 {
		return InvalidID, nil
	}
	client := cnat.NewServiceClient(v.GetConnection())

	paths := make([]cnat.CnatEndpointTuple, 0, len(tr.Backends))
	for _, backend := range tr.Backends {
		paths = append(paths, cnat.CnatEndpointTuple{
			SrcEp: types.ToCnatEndpoint(backend.SrcEndpoint),
			DstEp: types.ToCnatEndpoint(backend.DstEndpoint),
			Flags: backend.Flags,
		})
	}

	response, err := client.CnatTranslationUpdate(v.GetContext(), &cnat.CnatTranslationUpdate{
		Translation: cnat.CnatTranslation{
			Vip:            types.ToCnatEndpoint(tr.Endpoint),
			IPProto:        types.ToVppIPProto(tr.Proto),
			Paths:          paths,
			IsRealIP:       BoolToU8(tr.IsRealIP),
			Flags:          uint8(cnat.CNAT_TRANSLATION_ALLOC_PORT),
			LbType:         cnat.CnatLbType(tr.LbType),
			FlowHashConfig: ip.IPFlowHashConfigV2(tr.HashConfig),
		},
	})
	if err != nil {
		return InvalidID, fmt.Errorf("add/upd CnatTranslate failed: %w", err)
	}
	return response.ID, nil
}

This actually is a CnatTranslationUpdate call.
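
So a second add with the same (IP;port;proto) behaves like an in-place update. A usage sketch (entryA, entryB, and the vpp handle are illustrative; the entries share the same key):

// Since CnatTranslateAdd issues CnatTranslationUpdate under the hood, the
// second call replaces the first translation instead of occupying a second
// slot for the same key in VPP's hash table.
idA, err := vpp.CnatTranslateAdd(entryA)
if err != nil {
	return err
}
idB, err := vpp.CnatTranslateAdd(entryB)
if err != nil {
	return err
}
s.log.Infof("translation ids: A=%d B=%d", idA, idB)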
