fix: use shdict instead of events module for nodes data exchange #13066

Open
shreemaan-abhishek wants to merge 6 commits into apache:master from shreemaan-abhishek:fix/consul-uses-shdict-for-nodes-data

Conversation

@shreemaan-abhishek
Contributor

Description

Fixes #12398

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Mar 4, 2026
f
Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
Baoyuantop
Baoyuantop previously approved these changes Mar 6, 2026

Copilot AI left a comment

Pull request overview

This PR addresses intermittent 503s after APISIX restarts with Consul-based service discovery by switching Consul node data exchange from per-worker in-memory/events-based propagation to a shared dictionary (lua_shared_dict) approach.

Changes:

  • Replace Consul discovery's cross-worker data propagation (events + local table) with ngx.shared.consul storage.
  • Add lua_shared_dict consul 10m; to the generated Nginx config template and test harness Nginx config.
  • Update Consul dump test expectations to match the new error log message when a service is missing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| apisix/discovery/consul/init.lua | Stores/reads discovered Consul nodes via ngx.shared.consul and limits active fetching to the privileged agent. |
| apisix/cli/ngx_tpl.lua | Declares lua_shared_dict consul in the generated Nginx configuration. |
| t/APISIX.pm | Declares lua_shared_dict consul in the test harness Nginx config. |
| t/discovery/consul_dump.t | Updates expected error log output for missing Consul service nodes. |

Comment on lines +165 to +173
    for k, v in pairs(entity.services) do
        local content, json_err = core.json.encode(v)
        if content then
            consul_dict:set(k, content)
        else
            log.error("failed to encode dump service: ", k, ", error: ", json_err)
        end
    end
    log.info("load dump file into shared dict success")

Copilot AI Mar 6, 2026

read_dump_services() writes services into ngx.shared.consul without clearing existing keys first. Since lua_shared_dict contents survive nginx reloads, this can retain stale keys (or temporarily overwrite newer values) when the dump file is loaded during startup/reload. Consider clearing the dict (or using a generation/version namespace) before loading the dump, so the shared dict reflects exactly the dump contents until Consul updates arrive.

Contributor Author

Since the values are set using service keys, I don't think this should be a concern.

Comment on lines +103 to +104
     log.info("process id: ", ngx_worker_id(), ", all_services[", service_name, "] = ",
-        json_delay_encode(resp_list, true))
+        json_delay_encode(nodes, true))

Copilot AI Mar 6, 2026

The log message still references all_services[...], but the data source is now ngx.shared.consul and a decoded nodes table. Updating the message would avoid confusion when debugging (especially since all_services no longer exists).

Comment on lines 90 to +106
 function _M.nodes(service_name)
-    if not all_services then
-        log.error("all_services is nil, failed to fetch nodes for : ", service_name)
-        return
+    local value = consul_dict:get(service_name)
+    if not value then
+        log.error("consul service not found: ", service_name, ", return default service")
+        return default_service and {default_service}
     end

-    local resp_list = all_services[service_name]
-
-    if not resp_list then
-        log.error("fetch nodes failed by ", service_name, ", return default service")
+    local nodes, err = core.json.decode(value)
+    if not nodes then
+        log.error("fetch nodes failed by ", service_name, ", error: ", err)
         return default_service and {default_service}
     end

     log.info("process id: ", ngx_worker_id(), ", all_services[", service_name, "] = ",
-        json_delay_encode(resp_list, true))
+        json_delay_encode(nodes, true))

-    return resp_list
+    return nodes

Copilot AI Mar 6, 2026

nodes() now reads JSON from ngx.shared.consul and decodes it on every call, which makes it return a fresh table each time. In apisix/upstream.lua, this defeats the compare_upstream_node fast-path (old_t == new_t) and forces the slow path (sorting/comparing) on every request, which is both a performance regression and can change node ordering (the slow path sorts by host in-place). Consider caching decoded nodes per-worker (e.g., LRU keyed by service name + raw JSON string/version) and returning the same table instance while the shared-dict value is unchanged.
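
One way to picture the per-worker caching idea is the sketch below. It is only an illustration under stated assumptions: `new_mock_dict`, `decode`, and `new_nodes_reader` are hypothetical names that do not appear in the PR, the mock dict stands in for `ngx.shared.consul` so the snippet runs in plain Lua, and the toy decoder replaces `core.json.decode`. The reader returns the same table instance until the raw shared-dict string changes, which is what an identity check such as `old_t == new_t` would need.

```lua
-- In-memory stand-in for an ngx.shared.DICT so the sketch runs in plain Lua
-- (hypothetical; only get/set are modeled).
local function new_mock_dict()
    local store = {}
    local dict = {}
    function dict:get(key) return store[key] end
    function dict:set(key, value) store[key] = value; return true, nil, false end
    return dict
end

-- Toy decoder for "host:port,host:port" strings (an assumption for this
-- sketch); real code would decode JSON instead.
local function decode(raw)
    local nodes = {}
    for host, port in raw:gmatch("([^:,]+):(%d+)") do
        nodes[#nodes + 1] = { host = host, port = tonumber(port) }
    end
    return nodes
end

-- Builds a reader that decodes a service's value only when the raw string
-- differs from the one cached in this worker.
local function new_nodes_reader(dict, decode_fn)
    local cache = {}  -- service_name -> { raw = <string>, nodes = <table> }

    return function(service_name)
        local raw = dict:get(service_name)
        if not raw then
            cache[service_name] = nil  -- drop entries for removed services
            return nil
        end

        local entry = cache[service_name]
        if entry and entry.raw == raw then
            return entry.nodes  -- unchanged: same table instance as before
        end

        local nodes, err = decode_fn(raw)
        if not nodes then
            return nil, err
        end
        cache[service_name] = { raw = raw, nodes = nodes }
        return nodes
    end
end

-- Usage: two reads of an unchanged value share one table; an update yields
-- a new one.
local dict = new_mock_dict()
dict:set("svc-a", "10.0.0.1:8080,10.0.0.2:8080")
local nodes_of = new_nodes_reader(dict, decode)
local first = nodes_of("svc-a")
local second = nodes_of("svc-a")
dict:set("svc-a", "10.0.0.3:8080")
local third = nodes_of("svc-a")
```

The cache here is unbounded; a bounded cache (for example an LRU, as the comment suggests) would be a natural refinement if the number of services is large.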

Comment on lines 111 to 116
     -- clean old unused data
     local old_services = consul_services[consul_server_url] or {}
     for k, _ in pairs(old_services) do
-        all_services[k] = nil
+        consul_dict:delete(k)
     end
     core.table.clear(old_services)

Copilot AI Mar 6, 2026

update_all_services() only deletes keys found in consul_services[consul_server_url]. After an nginx reload (shared dict persists but consul_services is reinitialized), any stale keys already present in ngx.shared.consul will never be deleted, so removed Consul services can keep being returned by nodes(). Consider clearing the shared dict on privileged-agent init, or persisting/deriving the previous key set from the shared dict (e.g., maintain a stored service list/version and delete keys not present in the new update).
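
The "stored service list" option could look roughly like the following sketch. Everything named here is an assumption for illustration: `KEYS_PREFIX`, `load_known_keys`, and `update_services` are hypothetical, the mock dict replaces `ngx.shared.consul`, values are treated as already-encoded strings, and service names are assumed not to contain newlines. The point is that the previous key set is read back from the dict itself, so it survives a reload even though Lua-side tables are rebuilt empty.

```lua
-- In-memory stand-in for an ngx.shared.DICT (hypothetical; only the methods
-- used below are modeled).
local function new_mock_dict()
    local store = {}
    local dict = {}
    function dict:get(key) return store[key] end
    function dict:set(key, value) store[key] = value; return true, nil, false end
    function dict:delete(key) store[key] = nil end
    return dict
end

-- Hypothetical reserved prefix under which each Consul server's key list
-- is stored.
local KEYS_PREFIX = "__keys__:"

-- Reads the newline-joined key list previously stored for this server.
local function load_known_keys(dict, server_url)
    local joined = dict:get(KEYS_PREFIX .. server_url)
    local keys = {}
    if joined then
        for key in joined:gmatch("[^\n]+") do
            keys[key] = true
        end
    end
    return keys
end

-- up_services maps service name -> encoded node list (a string).
local function update_services(dict, server_url, up_services)
    local old_keys = load_known_keys(dict, server_url)

    local names = {}
    for name, content in pairs(up_services) do
        dict:set(name, content)
        names[#names + 1] = name
    end

    -- Delete only keys this server published earlier and no longer reports.
    for key in pairs(old_keys) do
        if up_services[key] == nil then
            dict:delete(key)
        end
    end

    table.sort(names)
    dict:set(KEYS_PREFIX .. server_url, table.concat(names, "\n"))
end

-- Usage: the second call starts with no Lua-side state, as after a reload,
-- yet still removes the service that disappeared. The URL is an example.
local dict = new_mock_dict()
local url = "http://127.0.0.1:8500"
update_services(dict, url, { ["svc-a"] = "v1", ["svc-b"] = "v1" })
update_services(dict, url, { ["svc-a"] = "v2" })
```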

-        all_services[k] = v
+        local content, err = core.json.encode(v)
+        if content then
+            consul_dict:set(k, content)

Copilot AI Mar 6, 2026

consul_dict:set(k, content) return values are ignored. If the shared dict runs out of memory, set can fail (or evict other keys via forcible), which would silently drop discovery data and lead to intermittent 503s. Please capture and handle ok, err, forcible from set (and log/metric when it fails or is forcible).

Suggested change
-            consul_dict:set(k, content)
+            local ok, set_err, forcible = consul_dict:set(k, content)
+            if not ok then
+                log.error("failed to set nodes for service: ", k, " into consul_dict, error: ", set_err)
+            elseif forcible then
+                log.warn("consul_dict is out of memory, key(s) evicted when setting service: ", k)
+            end

…ce discovery nodes and flush consul_dict on init to prevent stale data persistence

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


Comment on lines +659 to +662
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()


Copilot AI Mar 6, 2026

consul_dict:flush_all() runs in the privileged agent during init_worker(), but it executes after read_dump_services() may have populated the shared dict (and workers also call read_dump_services() before the privileged-agent guard). This can wipe the dump-loaded nodes and leave the dict empty until Consul fetch completes, defeating the dump-on-reload/startup mitigation and potentially causing 503s. Consider either (a) moving the flush earlier (before any dump load) and performing dump load only in the privileged agent, or (b) removing flush_all() and instead cleaning stale keys during update_all_services() in a way that survives reloads.

Comment on lines 121 to +133
 local function update_all_services(consul_server_url, up_services)
     -- clean old unused data
     local old_services = consul_services[consul_server_url] or {}
     for k, _ in pairs(old_services) do
-        all_services[k] = nil
+        consul_dict:delete(k)
     end
     core.table.clear(old_services)

     for k, v in pairs(up_services) do
-        all_services[k] = v
+        local content, err = core.json.encode(v)
+        if content then
+            consul_dict:set(k, content)
+        else

Copilot AI Mar 6, 2026

update_all_services() deletes all previously-known keys before setting the new ones. While this runs, workers can observe a missing service_name in ngx.shared.consul and fall back to the default (or nil), causing transient request failures during an update window. A safer approach is to write/set new values first (optionally under a generation/version namespace) and only then delete obsolete keys, so reads never see an empty/missing service during an update.
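
A minimal sketch of the write-first ordering, assuming a simplified signature: the `dict`, `known`, and `up_services` parameters and the operation log on the mock dict are hypothetical and exist only so the ordering can be observed in plain Lua. Values are treated as already-encoded strings.

```lua
-- In-memory stand-in for an ngx.shared.DICT that also records the order of
-- writes (hypothetical, for illustration only).
local function new_mock_dict()
    local store, ops = {}, {}
    local dict = { ops = ops }
    function dict:get(key) return store[key] end
    function dict:set(key, value)
        store[key] = value
        ops[#ops + 1] = "set " .. key
        return true, nil, false
    end
    function dict:delete(key)
        store[key] = nil
        ops[#ops + 1] = "delete " .. key
    end
    return dict
end

-- known: set of service names written by earlier updates (name -> true).
-- up_services: service name -> encoded node list.
local function update_all_services(dict, known, up_services)
    -- 1. Write every current service first, so a reader never finds a
    --    still-existing service missing.
    for name, content in pairs(up_services) do
        dict:set(name, content)
    end

    -- 2. Only then delete services that are absent from the new update.
    for name in pairs(known) do
        if up_services[name] == nil then
            dict:delete(name)
            known[name] = nil
        end
    end

    -- 3. Remember what is now stored.
    for name in pairs(up_services) do
        known[name] = true
    end
end

-- Usage: svc-a stays readable throughout; svc-b is removed after the write.
local dict = new_mock_dict()
local known = {}
update_all_services(dict, known, { ["svc-a"] = "v1", ["svc-b"] = "v1" })
update_all_services(dict, known, { ["svc-a"] = "v2" })
```

In the second update the recorded operations are a `set` of `svc-a` followed by a `delete` of `svc-b`, so there is no window in which `svc-a` is absent from the dict.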

Comment on lines 638 to 645
@@ -621,25 +644,24 @@ function _M.init_worker()
end
end

Copilot AI Mar 6, 2026

read_dump_services() now writes into ngx.shared.consul, but init_worker() calls it in every process (workers + privileged agent). That means multiple processes can concurrently read/encode/write the same dump data at startup, adding contention and making the final dict contents depend on startup ordering. Consider guarding dump load so only the privileged agent (or a single designated worker) performs the shared-dict writes, and other workers only read from the dict.

  - Reorder update_all_services() to write new values before deleting
    obsolete keys, so workers never observe a missing service mid-update
  - Move dump loading after flush_all() and into the privileged agent
    only, preventing the privileged agent from wiping dump-loaded data
    that workers depend on during startup

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.


-        log.error("all_services is nil, failed to fetch nodes for : ", service_name)
-        return
+    local value = consul_dict:get(service_name)
+    if not value then

Copilot AI Mar 6, 2026

nodes() uses consul_dict:get(service_name). During reloads or refresh patterns that use flush_all(), get() returns nil even though stale values may still exist, causing avoidable 503s. Consider using get_stale() (and only falling back to default when both fresh+stale are missing). Also consider clearing nodes_cache[service_name] when the dict has no value to avoid retaining entries for removed services indefinitely.

Suggested change
-    if not value then
+    if value == nil then
+        -- try to use stale value during reloads or after flush_all()
+        value = consul_dict:get_stale(service_name)
+    end
+    if value == nil then
+        -- no fresh or stale value in shared dict; clear per-worker cache entry
+        nodes_cache[service_name] = nil

Comment on lines 191 to 195
 local function write_dump_services()
     local entity = {
-        services = all_services,
+        services = _M.all_nodes(),
         last_update = ngx.time(),
         expire = dump_params.expire, -- later need handle it

Copilot AI Mar 6, 2026

write_dump_services() now calls _M.all_nodes() which scans all shdict keys and JSON-decodes every service on each dump write. If dump is enabled and service count is high, this can add noticeable CPU overhead at each Consul update. Consider maintaining a privileged-agent-only in-memory snapshot for dump generation, or passing the updated snapshot into write_dump_services to avoid full scans/decodes.

Comment on lines +654 to +657
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()


Copilot AI Mar 6, 2026

This PR changes the cross-worker data sharing mechanism for Consul nodes to rely on lua_shared_dict, but there isn’t a regression test validating the restart/reload scenario from #12398 (e.g., repeated HUP reloads / restarts while sending requests should not produce intermittent 503s once Consul has data). Adding a Test::Nginx case similar to existing HUP-based tests would help prevent regressions.

Suggested change
-        -- flush stale data that may persist across reloads,
-        -- since consul_services is re-initialized empty
-        consul_dict:flush_all()

Comment on lines +654 to +656
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()

Copilot AI Mar 6, 2026

consul_dict:flush_all() runs in the privileged agent during init_worker, which will clear the shared dict for all workers during an nginx reload. Since nodes() reads via consul_dict:get(), in-flight requests handled by old workers during a graceful reload can suddenly see missing services and return 503s. Consider avoiding a global flush on reload (e.g., use a generation/version keyspace and swap, or rebuild consul_services from existing dict keys), or at minimum ensure readers can still access stale values during refresh and free memory after flushing (see flush_expired).
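
The generation/version keyspace alternative mentioned above might be sketched as follows. This is an illustration only: `GEN_KEY`, `publish`, and `lookup` are hypothetical names, the mock dict stands in for `ngx.shared.consul`, and a single writer is assumed. Readers resolve keys through the current generation number, so a new data set becomes visible in one step and the previous generation is removed only after the switch.

```lua
-- In-memory stand-in for an ngx.shared.DICT (hypothetical; only the methods
-- used below are modeled).
local function new_mock_dict()
    local store = {}
    local dict = {}
    function dict:get(key) return store[key] end
    function dict:set(key, value) store[key] = value; return true, nil, false end
    function dict:delete(key) store[key] = nil end
    function dict:get_keys()
        local keys = {}
        for key in pairs(store) do
            keys[#keys + 1] = key
        end
        return keys
    end
    return dict
end

local GEN_KEY = "__generation__"  -- hypothetical reserved key

-- services maps service name -> encoded node list.
local function publish(dict, services)
    local old_gen = dict:get(GEN_KEY) or 0
    local new_gen = old_gen + 1

    -- Write the complete new generation under its own key prefix.
    for name, content in pairs(services) do
        dict:set(new_gen .. ":" .. name, content)
    end

    -- Switch readers over in a single write.
    dict:set(GEN_KEY, new_gen)

    -- Clean up keys that belong to the previous generation.
    local old_prefix = old_gen .. ":"
    for _, key in ipairs(dict:get_keys(0)) do
        if key:sub(1, #old_prefix) == old_prefix then
            dict:delete(key)
        end
    end
end

local function lookup(dict, name)
    local gen = dict:get(GEN_KEY)
    if not gen then
        return nil
    end
    return dict:get(gen .. ":" .. name)
end

-- Usage: the second publish replaces the whole data set.
local dict = new_mock_dict()
publish(dict, { ["svc-a"] = "v1", ["svc-b"] = "v1" })
publish(dict, { ["svc-a"] = "v2" })
```

A reader that fetched the generation number just before the switch could still miss a key deleted during cleanup, so a real implementation would likely delay the cleanup or retry the lookup; the sketch leaves that out.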

Suggested change
-        -- flush stale data that may persist across reloads,
-        -- since consul_services is re-initialized empty
-        consul_dict:flush_all()
+        -- flush expired stale data that may persist across reloads,
+        -- since consul_services is re-initialized empty but existing
+        -- unexpired shared dict entries may still be in use by workers
+        consul_dict:flush_expired()

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

3 participants