JsonRpcConnection: Don't drop client from cache prematurely #10210
julianbrost merged 3 commits into master from
Conversation
Al2Klimov left a comment
Do we actually need two warnings?
Actually, no. However, I don't know which log level to use for which of these. So if you have any better ideas, please suggest them!
Force-pushed from 99be923 to d51d6a7
I have now lowered the first log message to info.
Al2Klimov left a comment
Now that I think about it: in monitoring in general, WARNING is used as an early indicator (before CRITICAL) that something may not be right. But let's also let Julian weigh in...
I'm not really sure how well that comparison works. But yes, the question for choosing the log severity should also be "does it require attention?". So during a normal reload/config deployment, there ideally shouldn't be any warnings (I do know that we aren't there yet). For a user, having both messages doesn't sound too helpful: they say pretty much the same thing twice within at most 10 seconds, so I'd go even further and log the first one at notice.
That makes it sound like something that should only be necessary to debug very specific issues, not something that would be useful logging for every user.
Force-pushed from d51d6a7 to b7dfdd3
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
	ApiListener::GetInstance()->RemoveAnonymousClient(this);
}

Log(LogNotice, "JsonRpcConnection")
	<< "Disconnecting API client for identity '" << m_Identity << "'";
```
Force-pushed from b7dfdd3 to ecc4b43
Neither the PR description nor the commit message really tell the purpose of moving the
Here, does "underlying problem" refer to the possibly blocking TLS shutdown without a timeout, i.e. something that's fixed already and not supposed to be fixed by this PR?
Why do we want to wait? The PR claims that currently it's premature, but not why that's the case and what's improved by changing this.
Something that should have been fixed already!
Firstly, you don't want to mark an endpoint as disconnected if it actually isn't. Before #7445, the shutdown flow was as it should be, i.e. first the socket is completely shut down, then the endpoint is marked as disconnected. However, this was changed with #7445 due to an incorrect assumption: disconnected clients never being reconnected was attributed to this ordering, when in fact the connection might've been stuck somewhere in
In contrast, do you want to keep treating an endpoint as connected after the code decided that the connection is dead (for example in the "no messages received" case)? It's more in some in-between state of cleaning up the connection.
In itself, that doesn't sound bad. Once the connection is declared dead, it should be fine to establish a new one; it's just problematic if that would turn into a resource leak. Is that the main change here, that if the code had a problem in
Though that's not related to moving the
Interestingly, there seems to be quite a connection to #10005, so this might be yet another reason to revive that PR; quoting from that PR's description:
Now thinking about this in context, my intuition says yes. If we consider a connection failed, why should we bother attempting a clean TLS shutdown instead of just killing it with fire? That would then also change things here, as a (forceful) disconnect should be pretty much instant.
Yes, generally I would treat an endpoint as connected as long as its socket is not fully shut down, but the
Yes. If you just drop the client before anything else and log that it's disconnected when in fact it's not, we won't be able to tell for sure if the shutdown is complete afterwards. Unless there's something that we really messed up, the endpoint should be kept as connected for a maximum of
When looking just at the log entry, then yes, but why should the endpoint be allowed to initiate another connection at all if the current one is not completely closed? I'm just trying to recreate the (at least for me) logical flow as it should be, i.e. first either gracefully or forcibly close the current connection before marking the endpoint as disconnected.
As I have already talked to you lately about that PR, I'm perfectly happy with it regardless of this one, and forcibly closing such a dead connection doesn't sound like a bad idea either.
What exactly would change here otherwise? For me, the referenced PR is just something I would include on top of this, but I don't see how the two exclude one another.
That's what I meant by saying it's in some in-between state. The rest of the code also distinguishes between connected and connecting; likewise there's a difference between disconnecting and disconnected (though I don't think that distinction is made explicitly in the code). Now the question is whether a new connection should already be attempted while in that disconnecting state. Currently it is; with this PR you suggest waiting until fully disconnected. I think neither is wrong and would probably tend towards the suggested change. You just use a quite strong "should" here, and whether the change is a good idea boils down to how sure we are that this doesn't delay the
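The connected/connecting vs. disconnecting/disconnected distinction being debated can be sketched as a tiny state model. This is purely illustrative and not Icinga's actual code; the names `ConnState` and `MayAttemptReconnect*` are made up for this sketch:

```cpp
#include <cassert>

// Hypothetical state model for an endpoint's client connection: a client that
// has been declared dead but whose socket is not yet fully shut down sits in a
// distinct "Disconnecting" state, between "Connected" and "Disconnected".
enum class ConnState { Disconnected, Connecting, Connected, Disconnecting };

// Behavior suggested by this PR: only allow a new connection attempt once the
// previous socket is completely shut down.
inline bool MayAttemptReconnect(ConnState s)
{
	return s == ConnState::Disconnected;
}

// Pre-PR behavior: the client was dropped from the cache as soon as the
// disconnect was initiated, so "Disconnecting" also permitted a reconnect.
inline bool MayAttemptReconnectPrePR(ConnState s)
{
	return s == ConnState::Disconnected || s == ConnState::Disconnecting;
}
```

The open question in the thread maps directly onto which of the two predicates the reconnect timer should use, and how long a client may legitimately stay in the `Disconnecting` state.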
Ideally, that forceful disconnect would only be a
Force-pushed from 64d55df to 44954ce
See the updated PR description!
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
// In case the writer coroutine is not done yet, which might have got stuck somewhere in async_write
// or async_flush, cancel all operations on the underlying socket to unblock it.
m_Stream->lowest_layer().cancel(ec);
```
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
// The writer coroutine could not finish soon enough to unblock the waiter down below,
// so we have to do this on our own, and the coroutine will be terminated forcibly when
// the ops on the underlying socket are cancelled.
m_WriterDone.Set();
```
... to shorten the disconnection process once it took 5s? But then, shouldn't you call `m_Stream->lowest_layer().cancel(ec);` instead, so that the writer terminates, and let `m_WriterDone.Wait(yc);` just do its thing?
I can do that, but it shouldn't behave any differently!
44954ce to
ae11193
Compare
lib/remote/jsonrpcconnection.cpp (outdated)

```cpp
{
	asio::deadline_timer writerTimeout(m_IoStrand.context(), boost::posix_time::seconds(5));

	writerTimeout.async_wait(asio::bind_executor(m_IoStrand, [this](boost::system::error_code ec) {
		if (!ec) {
```
This is so similar to #10254; in fact, you could re-use that class instead of implementing it yourself. Just an idea.
I'm afraid the self-made timeout here is indeed similar enough to a previous version of #10254 to share its weakness, as dug out in #10254 (comment) and fixed in 8dba2bb.
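The class of weakness being referred to can be illustrated without Boost.Asio: a hand-rolled timeout whose callback doesn't re-check completion may fire after the guarded operation has already finished. A minimal sketch of the "settle once" fix follows; the names `TimeoutGuard`, `FireTimeout`, and `Complete` are invented for this sketch and are not the actual API of #10254:

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Both the completion path and the timeout path race on a single flag;
// whoever flips it first "wins", so the timeout callback can never act
// after the operation has already completed.
struct TimeoutGuard
{
	std::atomic<bool> m_Settled{false};
	std::function<void()> m_OnTimeout;

	// Called by the timer. Runs the callback only if the operation
	// has not completed in the meantime.
	void FireTimeout()
	{
		if (!m_Settled.exchange(true) && m_OnTimeout)
			m_OnTimeout();
	}

	// Called when the guarded operation finishes. Returns true if it
	// beat the timeout (i.e. the timeout callback will never run).
	bool Complete()
	{
		return !m_Settled.exchange(true);
	}
};
```

In a strand-based design like Icinga's, posting both paths to the same strand achieves the same mutual exclusion without the atomic, which is roughly what cancelling the timer via a Defer does.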
lib/remote/jsonrpcconnection.cpp

```cpp
shutdownTimeout->Cancel();

m_Stream->lowest_layer().shutdown(m_Stream->lowest_layer().shutdown_both, ec);
```
... i.e. it could be called while async_shutdown() here yields. This is obviously not what we want, so I'd like the same check here. This would make it a 99% copy of #10254 w/o even unit tests, so... 🤷♂️
Alternatively to #10254 you can also use the current `Timeout` class.
Last year I just wanted to express that #10254 is not strictly necessary; one could also just Defer-cancel the timeout like Julian did.
Force-pushed from ae11193 to 802ad42
A simple rebase should fix the GHAs once #10254 is merged.
PR #7445 incorrectly assumed that a peer that had already disconnected and never reconnected was due to the endpoint client being dropped after a successful socket shutdown. However, the issue at that time was that there was no timeout guard that could cancel the `async_shutdown` call, potentially blocking indefinitely. Although removing the client from cache early might have allowed the endpoint to reconnect, it did not resolve the underlying problem. Now that we have a proper cancellation timeout, we can wait until the currently used socket is fully closed before dropping the client from our cache. When our socket termination works reliably, the `ApiListener` reconnect timer should attempt to reconnect this endpoint after the next tick. Additionally, we now have logs both before and after socket termination, which may help identify if it is hanging somewhere in between.
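The "wait until fully closed, but not forever" idea in the description can be sketched with a plain condition variable. This is an analogy only, not Icinga's actual `AsioConditionVariable`; the `DoneFlag` name and API are made up for illustration:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Analogy for m_WriterDone: the disconnect path waits up to a deadline for the
// writer to signal completion; if the deadline expires, the caller proceeds to
// cancel the socket operations forcibly instead of blocking indefinitely.
class DoneFlag
{
public:
	void Set()
	{
		{
			std::lock_guard<std::mutex> lock(m_Mutex);
			m_Done = true;
		}
		m_CV.notify_all();
	}

	// Returns true if Set() was called before the timeout expired.
	bool WaitFor(std::chrono::milliseconds timeout)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		return m_CV.wait_for(lock, timeout, [this] { return m_Done; });
	}

private:
	std::mutex m_Mutex;
	std::condition_variable m_CV;
	bool m_Done = false;
};
```

In the PR's terms, a wait returning false after the 5s grace period is the point where `lowest_layer().cancel(ec)` would be issued, so the writer coroutine terminates and the disconnect can complete.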
Force-pushed from 802ad42 to 1425641
The writer coroutine is given an additional `5s` to finish on its own. If it exceeds that additional time, it will be terminated forcibly instead of waiting endlessly for it to finish normally.

Tests