Replies: 1 comment
I think this seems reasonable to me, and we have deployed similar approaches in the past, including the S3 output plus another output for the exact needs you describe. That also gives you the option to replay from S3 if the less-available stack falls over.
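For reference, the dual fan-out the comment describes might look roughly like this in Fluent Bit's classic config format. This is a sketch only; the bucket name, OpenSearch host, and paths are placeholders, not values from the thread:

```conf
# Sketch: one Fluent Bit log router writing every record to both S3
# (immutable archive) and OpenSearch (day-to-day search).
[SERVICE]
    storage.path    /var/lib/fluent-bit/buffer   # filesystem buffering for retries

[OUTPUT]
    Name            s3
    Match           *
    bucket          example-log-archive          # placeholder bucket
    region          eu-west-1
    total_file_size 50M
    upload_timeout  10m
    store_dir       /var/lib/fluent-bit/s3

[OUTPUT]
    Name            opensearch
    Match           *
    Host            opensearch.example.internal  # placeholder host
    Port            9200
    Index           logs
    tls             On
```

With filesystem storage enabled, either output can fall behind or go down temporarily without the other losing data, which is what makes the S3 replay option possible.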
Hello!
I am currently designing a logging pipeline for our organization. We need to ingest a variety of log inputs, including syslog, and we need to make the pipeline as robust as possible against log loss caused by outages within the pipeline itself.
The general architecture I'm considering looks something like this:
```mermaid
graph TD
  subgraph server [Application server]
    server_logfiles[(Log files)] -->|tail| server_fb
    eventlog[(Event log)] -->|winlog| server_fb
    server_fb[fluentbit]
  end
  subgraph syslog_cluster["Syslog Receiver Cluster"]
    syslog-collector-vip((Syslog VIP))
    syslog_active[Active Syslog Receiver]
    syslog_standby[Standby Syslog Receiver]
  end
  syslog-collector-vip -->|syslog| syslog_active -->|forward| logrouter
  syslog-collector-vip -.->|syslog| syslog_standby -->|forward| logrouter
  syslog-sender[Network device or appliance]
  server_fb -->|forward| logrouter
  syslog-sender -->|syslog| syslog-collector-vip
  logrouter[Log router / concentrator]
  logrouter -->|S3| s3[(S3 Bucket with Object Lock)]
  logrouter -->|OpenSearch| opensearch[(OpenSearch cluster)]
```

(Incidentally, the reason for the dual write to S3 and OpenSearch is to fulfill two requirements that are not easily combined into one system: storing the logs immutably for forensic reasons, and being able to query them efficiently for day-to-day operations. But that's not what's important right now.)
The idea is that, for systems where this is possible, we should run fluentbit directly on those systems in order to collect log data in a distributed fashion. But this won't be possible for every system out there - some systems will only be able to send syslog.
There are many advantages to running Fluent Bit directly on each managed host, the main one being that it allows for distributed buffering and backpressure in case the central log router / concentrator experiences a spike or is down for a few minutes for maintenance. That's not a big deal: the log files will still be there when the log router is back up and running again.
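The on-host buffering described above can be sketched with Fluent Bit's filesystem storage, which spools chunks to disk and retries until the router is reachable again. Paths and the upstream address below are placeholders:

```conf
# Sketch: on-host Fluent Bit that survives a log-router outage by
# buffering chunks on disk instead of only in memory.
[SERVICE]
    storage.path              /var/lib/fluent-bit/buffer
    storage.backlog.mem_limit 50M

[INPUT]
    Name          tail
    Path          /var/log/app/*.log           # placeholder path
    storage.type  filesystem                   # spool to disk under storage.path

[OUTPUT]
    Name          forward
    Match         *
    Host          logrouter.example.internal   # placeholder host
    Port          24224
    Retry_Limit   False                        # retry indefinitely, don't drop
```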
But that's simply not the case with syslog. Syslog is akin to your network devices screaming into the void, throwing a UDP packet in the general direction of a syslog server and hoping a process is there to receive it. No matter what you do, some event loss will happen. I understand that, I just want to minimize how much loss there is. That's why the architecture diagram above calls for syslog receivers specifically to be run on two dedicated servers in an active/passive HA cluster, sharing a virtual IP using VRRP (probably using something like keepalived).
The path most travelled seems to be to set up something like rsyslog or syslog-ng and then have Fluent Bit ingest its logs, but that is an extra hop that adds complexity and another potential point of loss, and I'd rather have a simpler approach if one is workable. Fluent Bit can receive syslog directly, so why not just use that?
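For what it's worth, the direct-ingest variant would be a config along these lines, using Fluent Bit's built-in syslog input. This is a sketch under assumptions: port 5140 is chosen because binding 514 needs elevated privileges, and the forward target is a placeholder:

```conf
# Sketch: Fluent Bit receiving syslog directly, with no rsyslog/syslog-ng
# in between, and disk buffering toward the central log router.
[SERVICE]
    storage.path  /var/lib/fluent-bit/buffer

[INPUT]
    Name          syslog
    Mode          udp                          # tcp and unix sockets also supported
    Listen        0.0.0.0
    Port          5140
    Parser        syslog-rfc3164
    storage.type  filesystem

[OUTPUT]
    Name          forward
    Match         *
    Host          logrouter.example.internal   # placeholder host
    Port          24224
```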
And that's the question: has anyone actually deployed Fluent Bit this way, with a sustainable approach to monitoring its health and its capacity to receive and process syslog messages, in a form that keepalived can poll for its health checks?
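One way to wire this up, assuming Fluent Bit's built-in monitoring HTTP server is enabled (`HTTP_Server On` and `Health_Check On` in the `[SERVICE]` section, which exposes `GET /api/v1/health` on port 2020 by default), is a small probe script that keepalived invokes via `vrrp_script`. The script name and URL here are illustrative, not from the thread:

```python
# Hypothetical health probe for a Fluent Bit syslog receiver. keepalived
# would run this via a vrrp_script block and fail the VIP over to the
# standby node when the probe starts returning a non-zero exit code.
import urllib.error
import urllib.request


def fluentbit_healthy(url="http://127.0.0.1:2020/api/v1/health", timeout=2.0):
    """Return True if Fluent Bit's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False
```

A wrapper that calls `fluentbit_healthy()` and exits 0 on success / 1 on failure is all keepalived needs, since `vrrp_script` treats exit code 0 as healthy. Note this only proves the HTTP server is alive; if you want to catch a wedged input or a full buffer, you could extend the probe to inspect `/api/v1/metrics` for stalled record counts as well.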
And is this even a reasonable approach? Or would I be better off trying to poll logs from my devices with a perl script and a cron job?