CpuBoundWork#CpuBoundWork(): don't spin on atomic int to acquire slot by Al2Klimov · Pull Request #9990 · Icinga/icinga2

Al2Klimov · 2024-02-07T10:42:13Z

This is inefficient and implies
possible bad surprises regarding waiting durations on busy nodes. Instead,
use AsioConditionVariable#Wait() if there are no free slots. It's notified
by others' CpuBoundWork#~CpuBoundWork() once finished.

fixes #9988

Also, the current implementation is a spin-lock. 🙈 #10117 (comment)

Al2Klimov · 2024-02-07T11:18:27Z

Low load test

[2024-02-07 12:16:44 +0100] information/ApiListener: New client connection from [::ffff:127.0.0.1]:51732 (no client certificate)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: Request: GET /v1/objects/services/3d722963ae43!4272 (from [::ffff:127.0.0.1]:51732), user: root, agent: curl/8.4.0, status: Not Found).
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: HTTP client disconnected (from [::ffff:127.0.0.1]:51732)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12

Just a few increments/decrements. 👍

Al2Klimov · 2024-02-07T11:24:13Z

High load test

If I literally DoS Icinga with https://github.com/Al2Klimov/i2all.tf/tree/master/i2dos, I get a few of these:

[2024-02-07 12:19:37 +0100] information/CpuBoundWork: Handing over one used slot, free: 0 => 0

After I stop that program and fire one curl as in my low load test above, I get the same picture: still 12 free slots. 👍

Logs

--- lib/base/io-engine.cpp
+++ lib/base/io-engine.cpp
@@ -24,6 +24,7 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)
        std::unique_lock<std::mutex> lock (sem.Mutex);

        if (sem.FreeSlots) {
+               Log(LogInformation, "CpuBoundWork") << "Using one free slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots - 1u;
                --sem.FreeSlots;
                return;
        }
@@ -32,7 +33,9 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)

        sem.Waiting.emplace(&cv);
        lock.unlock();
+       Log(LogInformation, "CpuBoundWork") << "Waiting...";
        cv.Wait(yc);
+       Log(LogInformation, "CpuBoundWork") << "Waited!";
 }

 void CpuBoundWork::Done()
@@ -42,8 +45,10 @@ void CpuBoundWork::Done()
                std::unique_lock<std::mutex> lock (sem.Mutex);

                if (sem.Waiting.empty()) {
+                       Log(LogInformation, "CpuBoundWork") << "Releasing one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots + 1u;
                        ++sem.FreeSlots;
                } else {
+                       Log(LogInformation, "CpuBoundWork") << "Handing over one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots;
                        sem.Waiting.front()->Set();
                        sem.Waiting.pop();
                }

lib/base/io-engine.cpp

julianbrost · 2024-02-19T16:07:37Z

Is the way boost::asio::deadline_timer is used by AsioConditionVariable here actually safe? It's documentation says that shared objects are not thread-safe.

lib/base/io-engine.cpp

Al2Klimov · 2024-11-06T15:00:39Z

In addition, v2.14.2 could theoretically misbehave once the free slot amount falls "temporarily" noticeably below zero. Like, three requestors achieve an ioEngine.m_CpuBoundSemaphore.fetch_sub(1) while it's zero (0 - 3 x 1 = -3). Now, requestor A realizes that it's not allowed to take that slot and adds 1 again (-2). And so does requestor B (-1). But A subtracts again (-2) before C also adds 1 again (-1). And so on.

https://github.com/Icinga/icinga2/blob/v2.14.2/lib/base/io-engine.cpp#L24-L31

So that spinlock blocks not only CPU time, but also slots from legit requestors. The father of all spinlocks, so to say. 🙈 #10117 (comment)

Al2Klimov · 2024-11-28T13:51:34Z

lib/remote/eventshandler.cpp

 	response.set(http::field::content_type, "application/json");

-	IoBoundWorkSlot dontLockTheIoThread (yc);
+	handlingRequest.Done();


I've figured out and pushed 75271fe how to get rid of this and the 38 changed files.

$ git diff --stat 74009f0fc..a663f98b2 lib/base/io-engine.cpp | 77 +++++++++++++++++++++++++++++++++++++++-------------------------------------- lib/base/io-engine.hpp | 64 +++++++++++++++++++++++++++++++++------------------------------- lib/remote/eventshandler.cpp | 2 -- lib/remote/httpserverconnection.cpp | 19 +++++++++++++------ lib/remote/httpserverconnection.hpp | 2 ++ lib/remote/jsonrpcconnection.cpp | 2 +- 6 files changed, 88 insertions(+), 78 deletions(-) $

Please let me know whether you consider this better.

CC @julianbrost @yhabteab @oxzi

I've figured out and pushed 75271fe how to get rid of this and the 38 changed files.

424e1bc (#9990) is an annoying commit, but not something that would have to block this PR.

On the change suggested in 75271fe: that sounds like moving into the direction I suggested in #10142:

Related: when doing bigger changes to the interface there, one other improvement that comes to mind is how HttpServerConnection::StartStreaming() works: currently, to take control over the whole connection, this has to be called, but the underlying ASIO stream is still passed to every handler but it must not be used without calling StartStreaming(), otherwise, there's a good chance the connection ends up in a broken state. This could be improved by only exposing the underlying stream as a return value of the StartStreaming() method, similar to how it works in Go's net/http package.

So if the point of StartStreaming() is to transfer ownership of the connection to the caller, for me it makes sense to release other resources related to the connection like the CpuBoundWork.

Thus, overall I'd say this is a sane change for StartStreaming(), so feel free to keep it in (but the commit history should be cleaned up, that doesn't need that revert commit in there).

Al2Klimov · 2024-11-28T14:22:35Z

lib/base/io-engine.cpp

-}
+	try {
+		cv->Wait(yc);
+	} catch (...) {


Unfortunately, yes – this can throw.

--- test/base-shellescape.cpp +++ test/base-shellescape.cpp @@ -1,6 +1,8 @@ /* Icinga 2 | (c) 2012 Icinga GmbH | GPLv2+ */ #include "base/utility.hpp" +#include <boost/asio.hpp> +#include <boost/asio/spawn.hpp> #include <BoostTestTargetConfig.h> #include <iostream> @@ -16,6 +18,26 @@ BOOST_AUTO_TEST_CASE(escape_basic) BOOST_CHECK(Utility::EscapeShellCmd("$PATH") == "\\$PATH"); BOOST_CHECK(Utility::EscapeShellCmd("\\$PATH") == "\\\\\\$PATH"); #endif /* _WIN32 */ + + auto io = new boost::asio::io_context; + boost::asio::spawn(*io, [io](boost::asio::yield_context yc) { + boost::asio::deadline_timer timer(*io, boost::posix_time::seconds(3)); + boost::system::error_code ec; + + try { + timer.async_wait(yc[ec]); + } catch (const boost::coroutines::detail::forced_unwind&) { + BOOST_CHECK(false); // error: in "base_shellescape/escape_basic": check false has failed + throw; + } + }); + boost::asio::deadline_timer timer(*io, boost::posix_time::seconds(1)); + timer.async_wait([io](boost::system::error_code ec) { + if (!ec) { + delete io; + } + }); + io->run(); } BOOST_AUTO_TEST_CASE(escape_quoted)

Unfortunately, yes – this can throw.

You can't be serious :), you deliberately deleted the I/O object with delete io;, what do you expect to happen in that case then? This is not a realistic test case, I mean where do we perform a questionable operation like this in Icinga 2 code? @julianbrost and I tried to trigger this exception last week but weren't able to, and reading the detailed 🙃 boost docs about it didn't help to understand this either.

If for some reason the I/O context gets deleted in Icinga 2, do you think you can ever recover from it? If something like this happens in Icinga 2, then you have far more severe problems than unreleased CPU semaphores.

If something like this happens in Icinga 2, then you have far more severe problems than unreleased CPU semaphores.

So you'd not catch it at all?

So you'd not catch it at all?

No, I didn't say that! I just want to understand under what normal circumstances such an exception would be triggered and not by deleting the global io object. For instance, how can you explicitly trigger a stack unwinding of a coroutine? If one can force the destruction of a coroutine, then you can also verify that it enters into that new catch-all handler as intended, but none of us is able to do that, and that is the puzzle that needs to be solved.

the puzzle that needs to be solved

Does it need to be resolved? To me, the bare lack of noexcept in async_wait is enough. Yes, I'm serious! If something could be thrown, I catch it and clean up after myself. As I said to Julian, if you want to test it, add a throw 1;. The most often used function that theoretically can throw an exception is operator new, but I don't know yet how to temporarily override malloc(3) locally. But I think I don't even need that, as #9990 (comment) already shows enough. (Also, I didn't test what happens across fork(2), I hope on_before_fork() (or whatever it's called in ASIO) preserves coroutine stacks.)

Though if even deleting the io object triggers this exception (also tested with IoEngine::SpawnCoroutine()), I guess it's an indicator for that if the coroutine gets destroyed for whatever reason, we'll be able to intercept it and handle it accordingly.

Didn't see your intermediate comment above while sending my previous comment!

As I said to Julian, if you want to test it, add a throw 1;.

Where should I put that throw expression? Throwing an exception within the coroutine handler (the provided callback that is called within the coroutine) does not cause the coroutine to be destroyed.

Does it need to be resolved? To me, the bare lack of noexcept in async_wait is enough

If you don't want to make decisions just based on assumptions, then yes you need to understand when this exception could be triggered. I'm not talking about just any exception, but the specific force_unwind exception. As I said before, if the user-supplied callback throws an exception, it will never hit that new catch-all handler here, nor will it destroy the coroutine.

As I said to Julian, if you want to test it, add a throw 1;.

Where should I put that throw expression?

Just inside the try catch instead of cv->Wait(yc);. Seriously. Actually I don't worry especially about a particular exception type. I just easily got forced_unwind in my comment above. Who knows, maybe a malloc(3) fails? It's just: theoretically, the method could throw – I handle it. Especially in this case the code above in CpuBoundWork#CpuBoundWork() has already deployed some pointers to stack variables to IoEngine#m_CpuBoundSemaphore.Waiting. If our super unlikely exception hits someone w/o my try/catch once in 10y, happy debugging!

Now "better", colleagues?

--- a/lib/base/io-engine.cpp +++ b/lib/base/io-engine.cpp @@ -30,2 +30,5 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc, boost::asio::io_contex if (ie.m_CpuBoundSemaphore.compare_exchange_weak(freeSlots, freeSlots - 1)) { + if (cv) { + Log(LogCritical, "LOLCAT", "Got slot via ASIO!"); + } return; @@ -36,3 +39,3 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc, boost::asio::io_contex cv = Shared<AsioConditionVariable>::Make(ie.GetIoContext()); - continue; + //continue; }

[2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO! [2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO! [2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO!

I mean, it works well, a DoS even triggers the above logs, but the fair scheduling is unsurprisingly gone. :(

Also, FWIW, I gave up on df63a78 (boost::asio::async_result).

jschmidt-icinga

Since I last looked at this I got an idea that can make this slightly simpler and easier to use.

Otherwise I still think this works for our immediate needs, but @julianbrost will still need to take a look at it.

lib/base/io-engine.cpp

jschmidt-icinga · 2026-03-16T12:54:21Z

lib/base/io-engine.hpp


 private:
-	boost::asio::yield_context yc;
+	boost::asio::deadline_timer m_Timer;


Can you change this to use an boost::asio::steady_timer instead? I know we don't actually use the timer functionality here, but deadline_timer has been deprecated for a while, maybe we shouldn't use it in new code.

In case you're wondering why the build fails: The IDE will tell you steady_timer is available without including the file explicitly, likely because starting from some boost version one of our often used headers already includes it transitively. But for older distros you'll have to include the header manually where you're using it.

lib/base/io-engine.cpp

lib/remote/httpmessage.cpp

lib/remote/httpmessage.hpp

lib/remote/httpmessage.cpp

yhabteab

This works perfectly fine for me! So, it's up to others to decide if we want to merge this or not.

yhabteab · 2026-03-19T09:28:50Z

Wait, I've actually missed something. Right now, the CPU-slot is only freed when the handlers want to send chunk-encoded responses. But what if they just send a normal response without ever calling StartStreaming()? Currently, most of the HTTP errors are sent like that, and the CPU-slot will still be occupied until the response is sent.

icinga2/lib/remote/httputility.cpp

Line 102 in 50b4bc5

HttpUtility::SendJsonBody(response, params, result);

icinga2/lib/remote/httputility.cpp

Lines 82 to 83 in 50b4bc5

    
           response.set(http::field::content_type, "application/json"); 
        
           response.GetJsonEncoder(params && GetLastParameter(params, "pretty")).Encode(val);

icinga2/lib/remote/httpmessage.cpp

Lines 59 to 64 in 50b4bc5

    
           void MayFlush(boost::asio::yield_context& yield) override 
        
           { 
        
           	if (m_Message.body().Size() >= l_FlushThreshold) { 
        
           		m_Message.Flush(yield); 
        
           	} 
        
           }

So, I think you should perform the same Done call in Flush as well.

Al2Klimov · 2026-03-19T10:25:18Z

send a normal response without ever calling StartStreaming

I fixed it.

yhabteab · 2026-03-19T11:00:16Z

Please open a separate PR for 9ffbba6 and drop it from here! With my comment, I meant to say that you should call Done in Flush instead of/or StartStreaming method, as that's the common path for all handlers.

So, I think you should perform the same Done call in Flush as well.

so that `/v1/events` doesn't have to use `IoBoundWorkSlot`. `IoBoundWorkSlot#~IoBoundWorkSlot()` will wait for a free semaphore slot which will be almost immediately released by `CpuBoundWork#~CpuBoundWork()`. Just releasing the already aquired slot manually is more efficient.

This is inefficient and involves possible bad surprises regarding waiting durations on busy nodes. Instead, use `AsioConditionVariable#Wait()` if there are no free slots. It's notified by others' `CpuBoundWork#~CpuBoundWork()` once finished.

yhabteab

See #9990 (review)

Al2Klimov requested a review from yhabteab February 7, 2024 10:42

Al2Klimov self-assigned this Feb 7, 2024

cla-bot bot added the cla/signed label Feb 7, 2024

icinga-probot bot added area/api REST API area/distributed Distributed monitoring (master, satellites, clients) core/quality Improve code, libraries, algorithms, inline docs ref/IP labels Feb 7, 2024

Al2Klimov removed their assignment Feb 7, 2024

Al2Klimov marked this pull request as ready for review February 7, 2024 11:24

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from c11989f to 8d24525 Compare February 9, 2024 13:03

julianbrost reviewed Feb 9, 2024

View reviewed changes

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 8d24525 to bf74280 Compare February 20, 2024 12:21

julianbrost requested changes Feb 20, 2024

View reviewed changes

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from bf74280 to 9062934 Compare February 21, 2024 11:13

yhabteab reviewed Feb 22, 2024

View reviewed changes

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 9062934 to a00262f Compare February 22, 2024 09:58

Al2Klimov requested a review from julianbrost February 22, 2024 09:58

Al2Klimov added this to the 2.15.0 milestone Apr 5, 2024

Al2Klimov mentioned this pull request Jun 20, 2024

Two separate CpuBoundWork semaphores – HTTP | JSONRPC #10046

Open

yhabteab mentioned this pull request Sep 27, 2024

Log HTTP/RPC message processing stats #10141

Merged

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from a00262f to 26ef66e Compare September 27, 2024 09:51

Al2Klimov mentioned this pull request Nov 27, 2024

Add a dedicated method for disconnecting TLS connections #10005

Merged

Al2Klimov commented Nov 28, 2024

View reviewed changes

Al2Klimov requested review from oxzi and yhabteab November 29, 2024 09:19

Al2Klimov assigned Al2Klimov and unassigned Al2Klimov Jan 23, 2026

Al2Klimov requested a review from jschmidt-icinga January 23, 2026 15:19

Al2Klimov marked this pull request as ready for review January 23, 2026 15:19

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 49b95bc to 81a22ae Compare January 23, 2026 15:19

julianbrost mentioned this pull request Feb 17, 2026

API denial of service in master branch due to CpuBoundWork held while writing responses to clients #10719

Open

jschmidt-icinga reviewed Mar 16, 2026

View reviewed changes

lib/base/io-engine.cpp Show resolved Hide resolved

Al2Klimov requested a review from jschmidt-icinga March 16, 2026 12:16

jschmidt-icinga reviewed Mar 16, 2026

View reviewed changes

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 81a22ae to 6877c90 Compare March 17, 2026 14:50

Al2Klimov requested a review from jschmidt-icinga March 17, 2026 14:50

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 6877c90 to d35b9a2 Compare March 18, 2026 09:08

yhabteab reviewed Mar 18, 2026

View reviewed changes

lib/base/io-engine.cpp Outdated Show resolved Hide resolved

lib/remote/httpmessage.cpp Outdated Show resolved Hide resolved

lib/remote/httpmessage.hpp Outdated Show resolved Hide resolved

Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from d35b9a2 to 1150280 Compare March 18, 2026 14:26

yhabteab reviewed Mar 18, 2026

View reviewed changes

lib/remote/httpmessage.cpp Outdated Show resolved Hide resolved

yhabteab previously approved these changes Mar 19, 2026

View reviewed changes

Al2Klimov added 9 commits March 19, 2026 14:51

Remove unused IoBoundWorkSlot

844b341

CpuBoundWork#~CpuBoundWork(): just call #Done() to deduplicate code

149aecf

[Refactor] CpuBoundWork#CpuBoundWork(): require an io_context::strand

e03db5f

Introduce AsioConditionVariable

c40c15e

Make IoEngine#m_CpuBoundSemaphore unsigned

cefa942

Remove unused IoEngine::YieldCurrentCoroutine()

7b45d8d

Remove unused IoEngine#m_AlreadyExpiredTimer

85b16d2

yhabteab approved these changes Mar 19, 2026

View reviewed changes

Al2Klimov mentioned this pull request Mar 19, 2026

Always call HttpUtility::SendJsonBody() with a boost::asio::yield_context #10764

Open

Conversation

Al2Klimov commented Feb 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Al2Klimov commented Feb 7, 2024

Low load test

Uh oh!

Al2Klimov commented Feb 7, 2024

High load test

Logs

Uh oh!

Uh oh!

julianbrost commented Feb 19, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Al2Klimov commented Nov 6, 2024

Uh oh!

Al2Klimov Nov 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yhabteab Nov 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jschmidt-icinga left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jschmidt-icinga Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yhabteab left a comment

Choose a reason for hiding this comment

Uh oh!

yhabteab commented Mar 19, 2026

Uh oh!

Al2Klimov commented Mar 19, 2026

Uh oh!

yhabteab commented Mar 19, 2026

Uh oh!

yhabteab left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Al2Klimov commented Feb 7, 2024 •

edited

Loading

Al2Klimov Nov 28, 2024 •

edited

Loading

yhabteab Nov 29, 2024 •

edited

Loading

jschmidt-icinga Mar 18, 2026 •

edited

Loading