fix: use shdict instead of events module for nodes data exchange#13066
shreemaan-abhishek wants to merge 6 commits into apache:master
Conversation
Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
Pull request overview
This PR addresses intermittent 503s after APISIX restarts with Consul-based service discovery by switching Consul node data exchange from per-worker in-memory/events-based propagation to a shared dictionary (lua_shared_dict) approach.
Changes:
- Replace Consul discovery's inter-worker data propagation (events + local table) with `ngx.shared.consul` storage.
- Add `lua_shared_dict consul 10m;` to the generated Nginx config template and the test harness Nginx config.
- Update Consul dump test expectations to match the new error log message when a service is missing.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| apisix/discovery/consul/init.lua | Stores/reads discovered Consul nodes via ngx.shared.consul and limits active fetching to the privileged agent. |
| apisix/cli/ngx_tpl.lua | Declares lua_shared_dict consul in the generated Nginx configuration. |
| t/APISIX.pm | Declares lua_shared_dict consul in the test harness Nginx config. |
| t/discovery/consul_dump.t | Updates expected error log output for missing Consul service nodes. |
```lua
for k, v in pairs(entity.services) do
    local content, json_err = core.json.encode(v)
    if content then
        consul_dict:set(k, content)
    else
        log.error("failed to encode dump service: ", k, ", error: ", json_err)
    end
end
log.info("load dump file into shared dict success")
```
read_dump_services() writes services into ngx.shared.consul without clearing existing keys first. Since lua_shared_dict contents survive nginx reloads, this can retain stale keys (or temporarily overwrite newer values) when the dump file is loaded during startup/reload. Consider either clearing the dict (or using a generation/version namespace) before loading the dump so the shared dict reflects exactly the dump contents until Consul updates arrive.
Since the values are set using service keys, I don't think this is a concern.
apisix/discovery/consul/init.lua
Outdated
```diff
 log.info("process id: ", ngx_worker_id(), ", all_services[", service_name, "] = ",
-         json_delay_encode(resp_list, true))
+         json_delay_encode(nodes, true))
```
The log message still references all_services[...], but the data source is now ngx.shared.consul and a decoded nodes table. Updating the message would avoid confusion when debugging (especially since all_services no longer exists).
```diff
 function _M.nodes(service_name)
-    if not all_services then
-        log.error("all_services is nil, failed to fetch nodes for : ", service_name)
-        return
+    local value = consul_dict:get(service_name)
+    if not value then
+        log.error("consul service not found: ", service_name, ", return default service")
+        return default_service and {default_service}
     end

-    local resp_list = all_services[service_name]
-
-    if not resp_list then
-        log.error("fetch nodes failed by ", service_name, ", return default service")
+    local nodes, err = core.json.decode(value)
+    if not nodes then
+        log.error("fetch nodes failed by ", service_name, ", error: ", err)
         return default_service and {default_service}
     end

     log.info("process id: ", ngx_worker_id(), ", all_services[", service_name, "] = ",
-             json_delay_encode(resp_list, true))
+             json_delay_encode(nodes, true))

-    return resp_list
+    return nodes
 end
```
nodes() now reads JSON from ngx.shared.consul and decodes it on every call, which makes it return a fresh table each time. In apisix/upstream.lua, this defeats the compare_upstream_node fast-path (old_t == new_t) and forces the slow path (sorting/comparing) on every request, which is both a performance regression and can change node ordering (the slow path sorts by host in-place). Consider caching decoded nodes per-worker (e.g., LRU keyed by service name + raw JSON string/version) and returning the same table instance while the shared-dict value is unchanged.
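A per-worker decode cache along those lines can be sketched outside OpenResty. The Python model below is only illustrative (class and field names are hypothetical; the real code would use something like `lua-resty-lrucache` over `ngx.shared` in Lua): re-decode only when the raw JSON string changes, so unchanged services keep returning the identical table and the `old_t == new_t` fast-path still fires.

```python
import json

class NodesCache:
    """Per-worker cache: re-decode only when the raw JSON payload changes,
    so an unchanged service returns the identical object on every call."""

    def __init__(self):
        self._cache = {}  # service_name -> (raw_json, decoded_nodes)

    def get_nodes(self, service_name, raw_json):
        entry = self._cache.get(service_name)
        if entry is not None and entry[0] == raw_json:
            return entry[1]           # same object as last call -> identity check holds
        nodes = json.loads(raw_json)  # payload changed (or first read): decode once
        self._cache[service_name] = (raw_json, nodes)
        return nodes

cache = NodesCache()
raw = '[{"host": "10.0.0.1", "port": 8080, "weight": 1}]'
a = cache.get_nodes("svc", raw)
b = cache.get_nodes("svc", raw)
assert a is b  # identity preserved while the shared-dict value is unchanged
c = cache.get_nodes("svc", '[{"host": "10.0.0.2", "port": 8080, "weight": 1}]')
assert a is not c  # a changed payload produces a fresh table
```

The raw JSON string doubles as the version key, so no separate version counter is needed in the dict.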
apisix/discovery/consul/init.lua
Outdated
```diff
 -- clean old unused data
 local old_services = consul_services[consul_server_url] or {}
 for k, _ in pairs(old_services) do
-    all_services[k] = nil
+    consul_dict:delete(k)
 end
 core.table.clear(old_services)
```
update_all_services() only deletes keys found in consul_services[consul_server_url]. After an nginx reload (shared dict persists but consul_services is reinitialized), any stale keys already present in ngx.shared.consul will never be deleted, so removed Consul services can keep being returned by nodes(). Consider clearing the shared dict on privileged-agent init, or persisting/deriving the previous key set from the shared dict (e.g., maintain a stored service list/version and delete keys not present in the new update).
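One reload-safe variant of the cleanup is to derive the stale key set from the store itself (a shared dict exposes its keys via `get_keys`) rather than from the per-process `consul_services` table, which resets on reload. A minimal Python model of the idea, with hypothetical names:

```python
def sync_services(store, up_services):
    """Reload-safe update: keys to delete are derived from the store
    itself (cf. shdict get_keys), not from a per-process table that is
    re-initialized empty after an nginx reload."""
    new_keys = set(up_services)
    for stale_key in set(store) - new_keys:  # keys only the store remembers
        del store[stale_key]
    for k, v in up_services.items():
        store[k] = v

# a store that survived a reload, still holding a removed service
store = {"old-svc": "[...]", "kept-svc": "old"}
sync_services(store, {"kept-svc": "new", "new-svc": "[...]"})
assert "old-svc" not in store   # stale key is removed even after reload
assert store["kept-svc"] == "new"
```

With multiple Consul servers writing into one dict, the real code would also need to scope deletion to keys owned by the server being updated, e.g. via a key prefix.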
apisix/discovery/consul/init.lua
Outdated
```diff
-    all_services[k] = v
+    local content, err = core.json.encode(v)
+    if content then
+        consul_dict:set(k, content)
```
consul_dict:set(k, content) return values are ignored. If the shared dict runs out of memory, set can fail (or evict other keys via forcible), which would silently drop discovery data and lead to intermittent 503s. Please capture and handle ok, err, forcible from set (and log/metric when it fails or is forcible).
Suggested change:

```diff
-consul_dict:set(k, content)
+local ok, set_err, forcible = consul_dict:set(k, content)
+if not ok then
+    log.error("failed to set nodes for service: ", k, " into consul_dict, error: ", set_err)
+elseif forcible then
+    log.warn("consul_dict is out of memory, key(s) evicted when setting service: ", k)
+end
```
…ce discovery nodes and flush consul_dict on init to prevent stale data persistence
Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
```lua
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()
```
consul_dict:flush_all() runs in the privileged agent during init_worker(), but it executes after read_dump_services() may have populated the shared dict (and workers also call read_dump_services() before the privileged-agent guard). This can wipe the dump-loaded nodes and leave the dict empty until Consul fetch completes, defeating the dump-on-reload/startup mitigation and potentially causing 503s. Consider either (a) moving the flush earlier (before any dump load) and performing dump load only in the privileged agent, or (b) removing flush_all() and instead cleaning stale keys during update_all_services() in a way that survives reloads.
```diff
 local function update_all_services(consul_server_url, up_services)
     -- clean old unused data
     local old_services = consul_services[consul_server_url] or {}
     for k, _ in pairs(old_services) do
-        all_services[k] = nil
+        consul_dict:delete(k)
     end
     core.table.clear(old_services)

     for k, v in pairs(up_services) do
-        all_services[k] = v
+        local content, err = core.json.encode(v)
+        if content then
+            consul_dict:set(k, content)
+        else
```
update_all_services() deletes all previously-known keys before setting the new ones. While this runs, workers can observe a missing service_name in ngx.shared.consul and fall back to the default (or nil), causing transient request failures during an update window. A safer approach is to write/set new values first (optionally under a generation/version namespace) and only then delete obsolete keys, so reads never see an empty/missing service during an update.
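The write-first ordering suggested here can be sketched as follows; this is a Python model with hypothetical names (in the real code the store would be `ngx.shared.consul` and the value a JSON string):

```python
def update_services(store, old_keys, up_services):
    """Write/overwrite the new values first, then delete only keys that
    truly disappeared, so a concurrent reader never observes a missing
    service during the update window."""
    for k, v in up_services.items():
        store[k] = v                       # readers now see fresh values
    for k in old_keys - set(up_services):  # remove only obsolete services
        store.pop(k, None)
    return set(up_services)                # becomes old_keys for the next round

store = {"a": "v1", "b": "v1"}
old = {"a", "b"}
old = update_services(store, old, {"a": "v2", "c": "v1"})
assert store == {"a": "v2", "c": "v1"}  # "b" removed only after new data landed
```

At no point between the two loops is a still-live service absent from the store, which is the property the delete-first ordering lacks.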
apisix/discovery/consul/init.lua
Outdated
```diff
@@ -621,25 +644,24 @@ function _M.init_worker()
     end
 end
```
read_dump_services() now writes into ngx.shared.consul, but init_worker() calls it in every process (workers + privileged agent). That means multiple processes can concurrently read/encode/write the same dump data at startup, adding contention and making the final dict contents depend on startup ordering. Consider guarding dump load so only the privileged agent (or a single designated worker) performs the shared-dict writes, and other workers only read from the dict.
- Reorder update_all_services() to write new values before deleting obsolete keys, so workers never observe a missing service mid-update
- Move dump loading after flush_all() and into the privileged agent only, preventing the privileged agent from wiping dump-loaded data that workers depend on during startup
Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
```diff
-    log.error("all_services is nil, failed to fetch nodes for : ", service_name)
-    return
+    local value = consul_dict:get(service_name)
+    if not value then
```
nodes() uses consul_dict:get(service_name). During reloads or refresh patterns that use flush_all(), get() returns nil even though stale values may still exist, causing avoidable 503s. Consider using get_stale() (and only falling back to default when both fresh+stale are missing). Also consider clearing nodes_cache[service_name] when the dict has no value to avoid retaining entries for removed services indefinitely.
Suggested change:

```diff
-if not value then
+if value == nil then
+    -- try to use stale value during reloads or after flush_all()
+    value = consul_dict:get_stale(service_name)
+end
+if value == nil then
+    -- no fresh or stale value in shared dict; clear per-worker cache entry
+    nodes_cache[service_name] = nil
```
```diff
 local function write_dump_services()
     local entity = {
-        services = all_services,
+        services = _M.all_nodes(),
         last_update = ngx.time(),
         expire = dump_params.expire, -- later need handle it
```
write_dump_services() now calls _M.all_nodes() which scans all shdict keys and JSON-decodes every service on each dump write. If dump is enabled and service count is high, this can add noticeable CPU overhead at each Consul update. Consider maintaining a privileged-agent-only in-memory snapshot for dump generation, or passing the updated snapshot into write_dump_services to avoid full scans/decodes.
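The snapshot idea can be modeled like this (a Python sketch with hypothetical names): the privileged-agent updater keeps the decoded services it just fetched, so dump generation needs no full shdict scan and no re-decode of every value.

```python
import json
import time

class DumpWriter:
    """Model of a privileged-agent-only snapshot: the updater retains the
    decoded services from its last fetch, so writing the dump serializes
    the snapshot directly instead of scanning and decoding the shared dict."""

    def __init__(self):
        self.snapshot = {}  # service_name -> nodes (plain tables)

    def on_update(self, up_services):
        # the updater already holds decoded data at this point
        self.snapshot = dict(up_services)

    def write_dump(self):
        return json.dumps({"services": self.snapshot,
                           "last_update": int(time.time())})

w = DumpWriter()
w.on_update({"svc": [{"host": "10.0.0.1", "port": 8080}]})
dump = json.loads(w.write_dump())
assert dump["services"]["svc"][0]["host"] == "10.0.0.1"
```

Since only the privileged agent performs Consul fetches in this PR, keeping the snapshot there adds no cross-process coordination.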
```lua
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()
```
This PR changes the cross-worker data sharing mechanism for Consul nodes to rely on lua_shared_dict, but there isn’t a regression test validating the restart/reload scenario from #12398 (e.g., repeated HUP reloads / restarts while sending requests should not produce intermittent 503s once Consul has data). Adding a Test::Nginx case similar to existing HUP-based tests would help prevent regressions.
```lua
-- flush stale data that may persist across reloads,
-- since consul_services is re-initialized empty
consul_dict:flush_all()
```
consul_dict:flush_all() runs in the privileged agent during init_worker, which will clear the shared dict for all workers during an nginx reload. Since nodes() reads via consul_dict:get(), in-flight requests handled by old workers during a graceful reload can suddenly see missing services and return 503s. Consider avoiding a global flush on reload (e.g., use a generation/version keyspace and swap, or rebuild consul_services from existing dict keys), or at minimum ensure readers can still access stale values during refresh and free memory after flushing (see flush_expired).
Suggested change:

```diff
--- flush stale data that may persist across reloads,
--- since consul_services is re-initialized empty
-consul_dict:flush_all()
+-- flush expired stale data that may persist across reloads,
+-- since consul_services is re-initialized empty but existing
+-- unexpired shared dict entries may still be in use by workers
+consul_dict:flush_expired()
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Fixes #12398
Checklist