fix: zombie process on metrics server fail #1926

Open
toddbaert wants to merge 2 commits into main from fix/zombie-on-metrics-serve-crash

@toddbaert toddbaert commented Apr 2, 2026

Fixes an issue where a failure of the metrics server to start can cause a zombie process.

This is hard and relatively low value to test with unit tests, but I was able to easily reproduce it with this simple script:

#!/bin/bash

cd flagd && go build -o flagd .

# --port and --sync-port both set to 8015; flagd should exit due to the conflict
./flagd start --port 8015 --sync-port 8015 --uri file:../config/samples/example_flags.flagd.json &
PID=$!
sleep 5

if ps -p $PID > /dev/null 2>&1; then
    echo "OH NO, BUG! Process is still alive despite port conflict."
    kill -9 $PID
    exit 1
fi

echo "NO BUG"

With the simple context-handling change in this PR applied, the script confirms the bug is fixed.

Fixes: #1807

@toddbaert toddbaert requested review from a team as code owners April 2, 2026 20:37

netlify bot commented Apr 2, 2026

Deploy Preview for polite-licorice-3db33c canceled.

Name | Link
🔨 Latest commit | a37ab39
🔍 Latest deploy log | https://app.netlify.com/projects/polite-licorice-3db33c/deploys/69cedb80313b6f0008340d78

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 2, 2026
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the fix/zombie-on-metrics-serve-crash branch from e419435 to 7d5aff6 Compare April 2, 2026 20:37
@toddbaert toddbaert changed the title from "Fix/zombie on metrics serve crash" to "fix: zombie process on metrics server fail" Apr 2, 2026
gemini-code-assist[bot]

This comment was marked as outdated.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 2, 2026
@toddbaert toddbaert force-pushed the fix/zombie-on-metrics-serve-crash branch 2 times, most recently from fd96e57 to 181d80b Compare April 2, 2026 21:05
Comment on lines -112 to -113
s.serverMtx.RLock()
defer s.serverMtx.RUnlock()
Member Author

The mutex+nil-check in the old shutdown goroutines was needed because those goroutines ran concurrently with server setup; the server field might not have been assigned yet when ctx.Done() fired. In the new code, serveWithShutdown is only called after the server is already assigned and receives it as a direct argument, so the race doesn't exist.

}

func (s *ConnectService) startServer(svcConf service.Configuration) error {
func serveWithShutdown(ctx context.Context, server *http.Server, serveFn func() error) error {
Member Author

I extracted this because we use it in a couple of places to handle server shutdowns.
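For readers without the diff open, here is a minimal sketch of what a helper with this signature might look like. It assumes the helper runs serveFn in a goroutine and calls server.Shutdown when ctx is cancelled. The function body, the 5-second grace period, and the demo helpers startFailureIsReturned and cancelStopsServer are illustrative assumptions, not the code from this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serveWithShutdown runs serveFn and shuts the server down when ctx is
// cancelled. Hypothetical body; only the signature comes from the PR.
func serveWithShutdown(ctx context.Context, server *http.Server, serveFn func() error) error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { errCh <- serveFn() }()

	select {
	case err := <-errCh:
		// The server stopped on its own, e.g. the port was already in use.
		if errors.Is(err, http.ErrServerClosed) {
			return nil
		}
		return err
	case <-ctx.Done():
		// Normal shutdown: give in-flight requests a bounded grace period
		// (5 seconds is an assumed value).
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return server.Shutdown(shutdownCtx)
	}
}

// startFailureIsReturned occupies a port, then tries to serve on the same
// address and reports whether the bind error reaches the caller.
func startFailureIsReturned() bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	defer ln.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	server := &http.Server{Addr: ln.Addr().String()}
	return serveWithShutdown(ctx, server, server.ListenAndServe) != nil
}

// cancelStopsServer reports whether a cancelled context leads to a clean stop.
func cancelStopsServer() bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	server := &http.Server{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err = serveWithShutdown(ctx, server, func() error { return server.Serve(ln) })
	return err == nil
}

func main() {
	fmt.Println("start failure returned:", startFailureIsReturned())
	fmt.Println("cancel stops server:", cancelStopsServer())
}
```

Because the server is passed in as an argument, the helper never reads a shared field, which is why a sketch like this needs no mutex or nil check.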

@toddbaert
Member Author

/gemini review

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the fix/zombie-on-metrics-serve-crash branch from 181d80b to a37ab39 Compare April 2, 2026 21:11

sonarqubecloud bot commented Apr 2, 2026

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the server lifecycle management in ConnectService by introducing a serveWithShutdown helper function, which simplifies the Serve method and centralizes graceful shutdown logic. Feedback was provided regarding a potential race condition in serveWithShutdown where a server error could be masked if the context is cancelled at the same time, suggesting a check for errors after the shutdown process completes.
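One way the suggested check could be applied is sketched below, under assumptions: after Shutdown returns, the helper also collects the result of serveFn and prefers a real serve error over the shutdown result. The names serveWithShutdownChecked, ignoreClosed, and serveErrorSurvivesCancel are hypothetical, the 5-second timeout is an assumed value, and the sketch assumes serveFn returns once the server has been shut down (as http.Server.Serve and ListenAndServe do).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// ignoreClosed maps the expected http.ErrServerClosed to nil.
func ignoreClosed(err error) error {
	if errors.Is(err, http.ErrServerClosed) {
		return nil
	}
	return err
}

// serveWithShutdownChecked is a hypothetical variant that does not drop a
// serve error when the context is cancelled at about the same time.
func serveWithShutdownChecked(ctx context.Context, server *http.Server, serveFn func() error) error {
	errCh := make(chan error, 1)
	go func() { errCh <- serveFn() }()

	select {
	case err := <-errCh:
		return ignoreClosed(err)
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		shutdownErr := server.Shutdown(shutdownCtx)
		// Assumption: serveFn returns once the server is shut down, so this
		// receive completes. A genuine serve error wins over shutdownErr.
		if serveErr := ignoreClosed(<-errCh); serveErr != nil {
			return serveErr
		}
		return shutdownErr
	}
}

// serveErrorSurvivesCancel simulates a failing serveFn together with an
// already cancelled context and reports whether the serve error is returned.
func serveErrorSurvivesCancel() bool {
	boom := errors.New("listen failed")
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := serveWithShutdownChecked(ctx, &http.Server{}, func() error { return boom })
	return errors.Is(err, boom)
}

func main() {
	fmt.Println("serve error survives cancel:", serveErrorSurvivesCancel())
}
```

Whichever select branch fires first, the serve error is returned, so the outcome no longer depends on timing.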

Labels

size:M This PR changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[BUG] flagd process becomes zombie when ConnectService.startServer() fails

1 participant