Description
The ai-proxy-multi plugin's health check mechanism has a structural bug: the construct_upstream function is called from both request context and timer context (healthcheck_manager.timer_create_checker), but only works correctly in request context.
In timer context, construct_upstream always returns nil because the _dns_value runtime field does not exist on instance configs read from etcd, causing health checker creation to permanently fail.
Current Behavior
- When a request hits pick_target, resolve_endpoint is called, which sets instance._dns_value (a runtime-only field on the in-memory config object).
- fetch_checker is called, which returns nil (checker not yet created) and adds the resource to waiting_pool.
- The timer_create_checker timer fires, reads config from etcd via resource.fetch_latest_conf, extracts the instance config via jsonpath, and calls plugin.construct_upstream(instance_config).
- construct_upstream checks instance._dns_value; this field does not exist on configs read from etcd (it's only set in request context by resolve_endpoint).
- construct_upstream returns nil, so create_checker is never called.
- The resource is removed from waiting_pool (waiting_pool[resource_path] = nil at line 211), so it will never be retried.
- Subsequent calls to fetch_checker from request context see the resource is neither in working_pool nor waiting_pool, so it gets re-added to waiting_pool, but the same cycle repeats on the next timer tick.
Net effect: Health checkers are never successfully created through the timer path. Unhealthy instances are never filtered out by the load balancer.
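The cycle above can be reproduced in isolation. The following is a minimal standalone simulation, not APISIX code: the pools, fetch_latest_conf, and the shape of the returned upstream are simplified stand-ins assumed for illustration, and only the _dns_value guard follows the snippet quoted in this issue.

```lua
-- Standalone simulation; does not load APISIX. All tables and helpers are
-- simplified, hypothetical stand-ins for the behavior described above.
local waiting_pool, working_pool = {}, {}

-- Same guard as the quoted construct_upstream: no _dns_value, no upstream.
local function construct_upstream(instance)
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end
    return { nodes = { node } }  -- upstream shape is assumed
end

-- Stand-in for resource.fetch_latest_conf: etcd data never carries _dns_value.
local function fetch_latest_conf()
    return { name = "instance-a", provider = "openai" }
end

-- Request path: no checker exists yet, so the resource is queued.
local function fetch_checker(resource_path)
    if working_pool[resource_path] then
        return working_pool[resource_path]
    end
    waiting_pool[resource_path] = true
    return nil
end

-- Timer path: builds the upstream from etcd config, then clears the entry
-- whether or not a checker was created.
local function timer_create_checker()
    for resource_path in pairs(waiting_pool) do
        local upstream = construct_upstream(fetch_latest_conf())
        if upstream then
            working_pool[resource_path] = { upstream = upstream }
        end
        waiting_pool[resource_path] = nil
    end
end

-- Three request/timer rounds: the checker is never created.
local created = 0
for _ = 1, 3 do
    fetch_checker("/routes/1")
    timer_create_checker()
    if working_pool["/routes/1"] then
        created = created + 1
    end
end
print(created)  -- prints 0
```

Each round re-queues the resource and each timer tick drops it again, which matches the "never successfully created" outcome described above.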
Expected Behavior
construct_upstream should be able to compute the upstream node info from the instance's static configuration (endpoint URL or provider defaults) without relying on _dns_value, so that health checkers can be created successfully in timer context.
Code References
construct_upstream requiring _dns_value (ai-proxy-multi.lua#L302-L306):

    function _M.construct_upstream(instance)
        local upstream = {}
        local node = instance._dns_value
        if not node then
            return nil, "failed to resolve endpoint for instance: " .. instance.name
        end

_dns_value is only set in request context by resolve_endpoint (ai-proxy-multi.lua#L215):

    instance_conf._dns_value = new_node

Timer calls construct_upstream with etcd config (no _dns_value) (healthcheck_manager.lua#L165-L179):

    local res_conf = resource.fetch_latest_conf(resource_path)
    -- ...
    local upstream_constructor_config = jp.value(res_conf.value, json_path)
    upstream = plugin.construct_upstream(upstream_constructor_config)  -- _dns_value missing

Resource permanently removed from waiting_pool after failure (healthcheck_manager.lua#L201-L211):

    local checker = create_checker(upstream)  -- upstream is nil, so checker is nil
    if not checker then
        goto continue  -- skips add_working_pool
    end
    -- ...
    ::continue::
    waiting_pool[resource_path] = nil  -- permanently removed

Suggested Fix Direction
Add a fallback in construct_upstream that computes the node from static config (endpoint URL or provider's default host/port) when _dns_value is not available. The existing resolve_endpoint function already contains this logic — it can be extracted into a pure function like calculate_dns_node(instance_conf) that returns {host, port, scheme} without modifying the input.
Important: Any fix must preserve the ai_driver.get_node() interface used by providers like vertex-ai that compute host dynamically (e.g., based on region).
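A rough sketch of what such a fallback could look like, written as standalone Lua rather than a patch. Several things here are assumptions made for illustration and are not taken from the APISIX source: the drivers registry and its host values, the override.endpoint field name, the get_node(instance_conf) signature and return shape, and the shape of the returned upstream. DNS resolution of the host is deliberately left out.

```lua
-- Hypothetical driver registry standing in for the real provider modules.
local drivers = {
    openai = { host = "api.openai.com", port = 443 },
    ["vertex-ai"] = {
        -- Assumed signature: get_node(instance_conf) -> { host, port, scheme }
        get_node = function(conf)
            local region = conf.region or "us-central1"
            return {
                host = region .. "-aiplatform.googleapis.com",
                port = 443,
                scheme = "https",
            }
        end,
    },
}

-- Pure function: derives {host, port, scheme} from static config only and
-- never writes to instance_conf.
local function calculate_dns_node(instance_conf)
    -- override.endpoint is an assumed field name for the endpoint URL.
    local endpoint = instance_conf.override and instance_conf.override.endpoint
    if endpoint then
        local scheme, host, port = endpoint:match("^(https?)://([^:/]+):?(%d*)")
        if not host then
            return nil, "invalid endpoint: " .. endpoint
        end
        port = tonumber(port) or (scheme == "https" and 443 or 80)
        return { host = host, port = port, scheme = scheme }
    end

    local driver = drivers[instance_conf.provider]
    if not driver then
        return nil, "unknown provider: " .. tostring(instance_conf.provider)
    end
    -- Keep dynamic hosts working: defer to get_node when the driver has one.
    if driver.get_node then
        return driver.get_node(instance_conf)
    end
    return { host = driver.host, port = driver.port, scheme = "https" }
end

-- Prefer the request-time value, fall back to static config in timer context.
local function construct_upstream(instance)
    local node = instance._dns_value
    if not node then
        local err
        node, err = calculate_dns_node(instance)
        if not node then
            return nil, "failed to resolve endpoint for instance: "
                        .. instance.name .. ": " .. tostring(err)
        end
    end
    return { nodes = { node } }  -- upstream shape is assumed
end
```

With this shape, a config read from etcd (no _dns_value) still yields a node, while a request-time _dns_value continues to take precedence when present.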
Environment
- APISIX version: master (current as of 2026-03-18)
- Affects all deployment modes where ai-proxy-multi is used with health checks enabled
Context
This issue was identified during analysis of PR #12968, which attempts to fix this problem but has additional issues (removes get_node support, couples to resty.healthcheck SHM internals, includes unrelated changes). This issue is filed to track the core bug independently.