JsonRpcConnection: Don't drop client from cache prematurely #10210
julianbrost merged 3 commits into master from
Conversation
Al2Klimov left a comment
Do we actually need two warnings?
Actually, no. However, I don't know which log level to use for which of these. So if you have any better ideas, please suggest them!
Force-pushed from 99be923 to d51d6a7
I have now lowered the first log message to info.
Al2Klimov left a comment
Now that I think about it: in monitoring in general, WARNING is used as an early indicator (before CRITICAL) that something may not be right. But let's also let Julian weigh in...
I'm not really sure how well that comparison works. But yes, the question for choosing the log severity should also be "does it require attention?". So during a normal reload/config deployment, there ideally shouldn't be any warnings (I do know that we aren't there yet). For a user, having both messages doesn't sound too helpful: they say pretty much the same thing twice within at most 10 seconds, so I'd go even further and log the first one at notice.
That makes it sound like something that should only be necessary to debug very specific issues, not something that would be useful logging for every user.
Force-pushed from d51d6a7 to b7dfdd3
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
	ApiListener::GetInstance()->RemoveAnonymousClient(this);
}

Log(LogNotice, "JsonRpcConnection")
	<< "Disconnecting API client for identity '" << m_Identity << "'";
```
Force-pushed from b7dfdd3 to ecc4b43
Neither the PR description nor the commit message really tell the purpose of moving the
Here, does "underlying problem" refer to the possibly blocking TLS shutdown without a timeout, i.e. something that's fixed already and not supposed to be fixed by this PR?
Why do we want to wait? The PR claims that currently it's premature, but not why that's the case and what's improved by changing this.
Something that should have been fixed already!
Firstly, you don't want to mark an endpoint as disconnected if it actually isn't. Before #7445, the shutdown flow was as it should be, i.e. first the socket is completely shut down, then the endpoint is marked as disconnected. However, this was changed with #7445 due to an incorrect assumption: disconnected clients never being reconnected was attributed to this ordering, when in fact the connection might've been stuck somewhere in
In contrast, do you want to keep treating an endpoint as connected after the code decided that the connection is dead (for example in the "no messages received" case)? It's more in some in-between state of cleaning up the connection.
In itself, that doesn't sound bad. Once the connection is declared dead, it should be fine to establish a new one; it's just problematic if that would turn into a resource leak. Is that the main change here, that if the code had a problem in
Though that's not related to moving the
Interestingly, there seems to be quite a connection to #10005, so this might be yet another reason to revive that PR; quoting from that PR's description:
Now thinking about this in context, my intuition says yes. If we consider a connection failed, why should we bother attempting a clean TLS shutdown instead of just killing it with fire? That would then also change things here, as a (forceful) disconnect should be pretty much instant.
Yes, generally I would treat an endpoint as connected as long as its socket is not fully shut down, but the
Yes. If you just drop the client before anything else and log that it's disconnected when in fact it's not, we won't be able to tell for sure if the shutdown is complete afterwards. Unless there's something that we really messed up, the endpoint should be kept as connected for a maximum of
When looking just at the log entry, then yes, but why should the endpoint be allowed to initiate another connection at all if the current one is not completely closed? I'm just trying to recreate the (at least for me) logical flow as it should be, i.e. first either gracefully or forcibly close the current connection before marking the endpoint as disconnected.
As I have already talked to you lately about that PR, I'm perfectly happy with it regardless of this one, and forcibly closing such a dead connection doesn't sound like a bad idea either.
What exactly would change here otherwise? For me, the referenced PR is just something I would include on top of this, but I don't see how the two exclude one another.
That's what I meant by saying it's in some in-between state. The rest of the code also distinguishes between connected and connecting; likewise there's a difference between disconnecting and disconnected (though I don't think that distinction is made explicitly in the code). Now the question is whether a new connection should already be attempted while in that disconnecting state. Currently it is; with this PR you suggest waiting until fully disconnected. I think neither is wrong and would probably tend towards the suggested change. You just use a quite strong "should" here, and whether the change is a good idea boils down to how sure we are that this doesn't delay the
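The connected/connecting vs. disconnecting/disconnected distinction being debated can be sketched as a tiny state model. This is purely illustrative and not Icinga's actual code; the names `ConnState` and `MayAttemptReconnect*` are made up for this sketch:

```cpp
#include <cassert>

// Hypothetical state model for an endpoint's client connection: a client that
// has been declared dead but whose socket is not yet fully shut down sits in a
// distinct "Disconnecting" state, between "Connected" and "Disconnected".
enum class ConnState { Disconnected, Connecting, Connected, Disconnecting };

// Behavior suggested by this PR: only allow a new connection attempt once the
// previous socket is completely shut down.
inline bool MayAttemptReconnect(ConnState s)
{
	return s == ConnState::Disconnected;
}

// Pre-PR behavior: the client was dropped from the cache as soon as the
// disconnect was initiated, so "Disconnecting" also permitted a reconnect.
inline bool MayAttemptReconnectPrePR(ConnState s)
{
	return s == ConnState::Disconnected || s == ConnState::Disconnecting;
}
```

The open question in the thread maps directly onto which of the two predicates the reconnect timer should use, and how long a client may legitimately stay in the `Disconnecting` state.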
Ideally, that forceful disconnect would only be a
Force-pushed from 64d55df to 44954ce
See the updated PR description!
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
// In case the writer coroutine is not done yet, which might have got stuck somewhere in async_write
// or async_flush, cancel all operations on the underlying socket to unblock it.
m_Stream->lowest_layer().cancel(ec);
```
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
// The writer coroutine could not finish soon enough to unblock the waiter down below,
// so we have to do this on our own, and the coroutine will be terminated forcibly when
// the ops on the underlying socket are cancelled.
m_WriterDone.Set();
```
... to shorten the disconnection process once it took 5s? But then, shouldn't you call `m_Stream->lowest_layer().cancel(ec);` instead, so that the writer terminates, and let `m_WriterDone.Wait(yc);` just do its thing?
I can do that, but it shouldn't behave any differently!
44954ce to
ae11193
Compare
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
{
	asio::deadline_timer writerTimeout(m_IoStrand.context(), boost::posix_time::seconds(5));

	writerTimeout.async_wait(asio::bind_executor(m_IoStrand, [this](boost::system::error_code ec) {
		if (!ec) {
```
This is so similar to #10254; in fact, you could re-use that class instead of implementing it yourself. Just an idea.
I'm afraid the self-made timeout here is indeed similar enough to a previous version of #10254 to share its weakness, as dug out in #10254 (comment) and fixed in 8dba2bb.
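The class of weakness being referred to can be illustrated without Boost.Asio: a hand-rolled timeout whose callback doesn't re-check completion may fire after the guarded operation has already finished. A minimal sketch of the "settle once" fix follows; the names `TimeoutGuard`, `FireTimeout`, and `Complete` are invented for this sketch and are not the actual API of #10254:

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Both the completion path and the timeout path race on a single flag;
// whoever flips it first "wins", so the timeout callback can never act
// after the operation has already completed.
struct TimeoutGuard
{
	std::atomic<bool> m_Settled{false};
	std::function<void()> m_OnTimeout;

	// Called by the timer. Runs the callback only if the operation
	// has not completed in the meantime.
	void FireTimeout()
	{
		if (!m_Settled.exchange(true) && m_OnTimeout)
			m_OnTimeout();
	}

	// Called when the guarded operation finishes. Returns true if it
	// beat the timeout (i.e. the timeout callback will never run).
	bool Complete()
	{
		return !m_Settled.exchange(true);
	}
};
```

In a strand-based design like Icinga's, posting both paths to the same strand achieves the same mutual exclusion without the atomic, which is roughly what cancelling the timer via a Defer does.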
lib/remote/jsonrpcconnection.cpp

```cpp
shutdownTimeout->Cancel();

m_Stream->lowest_layer().shutdown(m_Stream->lowest_layer().shutdown_both, ec);
```
... i.e. it could be called while async_shutdown() here yields. This is obviously not what we want, so I'd like the same check here. This would make it a 99% copy of #10254 w/o even unit tests, so... 🤷♂️
Alternatively to #10254 you can also use the current `Timeout` class.
Last year I just wanted to express that #10254 is not strictly necessary; one could also just Defer-cancel the timeout like Julian did.
Force-pushed from ae11193 to 802ad42
A simple rebase should fix the GHAs once #10254 is merged.
PR #7445 incorrectly assumed that a peer that had already disconnected and never reconnected was due to the endpoint client being dropped after a successful socket shutdown. However, the issue at that time was that there was no timeout guard that could cancel the `async_shutdown` call, potentially blocking indefinitely. Although removing the client from cache early might have allowed the endpoint to reconnect, it did not resolve the underlying problem. Now that we have a proper cancellation timeout, we can wait until the currently used socket is fully closed before dropping the client from our cache. When our socket termination works reliably, the `ApiListener` reconnect timer should attempt to reconnect this endpoint after the next tick. Additionally, we now have logs both before and after socket termination, which may help identify if it is hanging somewhere in between.
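The "wait until fully closed, but not forever" idea in the description can be sketched with a plain condition variable. This is an analogy only, not Icinga's actual `AsioConditionVariable`; the `DoneFlag` name and API are made up for illustration:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Analogy for m_WriterDone: the disconnect path waits up to a deadline for the
// writer to signal completion; if the deadline expires, the caller proceeds to
// cancel the socket operations forcibly instead of blocking indefinitely.
class DoneFlag
{
public:
	void Set()
	{
		{
			std::lock_guard<std::mutex> lock(m_Mutex);
			m_Done = true;
		}
		m_CV.notify_all();
	}

	// Returns true if Set() was called before the timeout expired.
	bool WaitFor(std::chrono::milliseconds timeout)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		return m_CV.wait_for(lock, timeout, [this] { return m_Done; });
	}

private:
	std::mutex m_Mutex;
	std::condition_variable m_CV;
	bool m_Done = false;
};
```

In the PR's terms, a wait returning false after the 5s grace period is the point where `lowest_layer().cancel(ec)` would be issued, so the writer coroutine terminates and the disconnect can complete.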
Force-pushed from 802ad42 to 1425641
The writer coroutine is given an additional `5s` to finish on its own. If it exceeds that additional time, it will be terminated forcibly instead of waiting endlessly for it to finish normally.

Tests