
fix: race condition#3

Merged
demurky merged 9 commits into main from fix-race
Dec 16, 2025

Conversation

@ParsaJR
Contributor

@ParsaJR ParsaJR commented Dec 12, 2025

Since the metrics are likely to be read more frequently than written, an RWMutex seems to be a better fit for this situation.

closes #2

@ParsaJR ParsaJR requested a review from demurky December 12, 2025 11:46
Contributor

@demurky demurky left a comment


Following my comments on the commit: ideally, we update the server's metrics every n seconds (Prometheus's typical scrape interval is 60s). At 0s, the producer thread begins writing data to a buffer. At 60s, it begins writing to a new buffer, and so on. Incoming consumers begin reading from the latest buffer, incrementing that buffer's "reference counter". We periodically "remove" old buffers whose reference counters are 0, and we put a timeout on consumers to limit the maximum number of buffers.

Synchronization is required to ensure:

  • No consumer begins reading from an old buffer.
  • No buffer is removed while being consumed.
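A minimal sketch of the rotating-buffer scheme described above (all names here — `buffer`, `rotating`, `Acquire`, `Sweep` — are illustrative, not from the PR; this is one possible shape, assuming a mutex guards the buffer list):

```go
package main

import (
	"fmt"
	"sync"
)

// buffer pairs a metrics snapshot with a reference counter so it is only
// reclaimed once no consumer is still reading it.
type buffer struct {
	data []byte
	refs int
}

// rotating sketches the scheme: the producer writes to the newest buffer,
// consumers read the latest sealed one, and Sweep drops old sealed buffers
// whose refcount has reached zero.
type rotating struct {
	mu   sync.Mutex
	bufs []*buffer // last element is being written; earlier ones are sealed
}

func newRotating() *rotating {
	return &rotating{bufs: []*buffer{{}}}
}

// Write appends to the buffer currently being produced.
func (r *rotating) Write(p []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.bufs[len(r.bufs)-1]
	b.data = append(b.data, p...)
}

// Rotate seals the current buffer and starts a fresh one.
func (r *rotating) Rotate() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bufs = append(r.bufs, &buffer{})
}

// Acquire returns the latest sealed buffer with its refcount incremented,
// or nil during the first interval when nothing is sealed yet. The caller
// must call Release when done reading.
func (r *rotating) Acquire() *buffer {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.bufs) < 2 {
		return nil
	}
	b := r.bufs[len(r.bufs)-2]
	b.refs++
	return b
}

func (r *rotating) Release(b *buffer) {
	r.mu.Lock()
	defer r.mu.Unlock()
	b.refs--
}

// Sweep removes sealed buffers that are neither the latest sealed one nor
// still referenced by a consumer.
func (r *rotating) Sweep() {
	r.mu.Lock()
	defer r.mu.Unlock()
	kept := make([]*buffer, 0, len(r.bufs))
	for i, b := range r.bufs {
		if i >= len(r.bufs)-2 || b.refs > 0 {
			kept = append(kept, b)
		}
	}
	r.bufs = kept
}

func main() {
	r := newRotating()
	r.Write([]byte("interval-0 metrics"))
	r.Rotate() // seal interval 0; consumers may now read it
	b := r.Acquire()
	fmt.Printf("%s\n", b.data)
	r.Release(b)
}
```

The two synchronization requirements above map onto this sketch as follows: `Acquire` only ever returns a sealed buffer (never the one being written), and `Sweep` skips any buffer with a nonzero refcount.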

main.go Outdated
Comment on lines 53 to 55
mu.RLock()
defer mu.RUnlock()
w.Write(metrics)
Contributor


This opens up the possibility of DoS. w.Write() could potentially take a long time to finish, during which the gather goroutine would be blocked. At best, the now-incorrect timings of the p.Step() calls would produce incorrect metrics. At worst, a rogue, slow connection could take down the entire server.

Contributor Author


So, if the problem is the duration of the HTTP I/O, can't we just make a copy of the latest metrics variable? That way, mu.RUnlock() doesn't have to wait on the HTTP write. Here:

	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		metricsCopy := make([]byte, len(metrics))
		copy(metricsCopy, metrics)
		mu.RUnlock()

		w.Write(metricsCopy)
	})

Contributor


This would create a copy of metrics for every HTTP connection, using up memory, which again would allow a potential DoS with lots of connections.

Contributor Author


Regarding the HTTP server in general: the program could benefit from deadlines on both read and write operations.

@ParsaJR
Contributor Author

ParsaJR commented Dec 13, 2025

I can't really wrap my mind around a solution that doesn't involve making a copy of the latest value.

Incoming consumers begin reading from the latest buffer, incrementing that buffer's "reference counter"

So if I understand your suggestion correctly, the program always serves a buffer from a minute ago; that buffer is concurrency-safe because nothing is written to it anymore.

And with this kind of synchronization, the program has nothing to serve for the first 60 seconds after it starts. (Which is not a concern.)

@demurky
Contributor

demurky commented Dec 13, 2025

I wrote "Incoming consumers begin reading from the latest buffer", but I really meant the last buffer not being written to, i.e. the second-to-last buffer.

this way, that buffer is concurrent safe because nothing gets written to it.

Yes.

And by using this kind of synchronization, the program doesn't have anything to provide for the first 60 seconds when it starts working

Yes.

This is similar to the way file systems handle references to inodes. I think it's called the read-copy-update mechanism? I don't know.

@ParsaJR
Contributor Author

ParsaJR commented Dec 15, 2025

I've done some work based on your suggestion: I've introduced a struct `RCU` that implements something similar to this.

Do you have any insights on this implementation? I'm not sure about its performance, mainly because of the heavy mutex usage nearly everywhere.

We periodically "remove" old buffers whose reference counters are 0.

I'm not quite sure when this should happen.

Also, who should be responsible for it: the gather() function, or something integrated into the RCU structure?

Currently it happens in gather(): the ticker fires every minute, and the non-blocking select statement checks it:

func gather() {
	p := pipeline.New([]int{1, 5, 10, 15, 30, 60})

	timer := time.NewTicker(60 * time.Second)

	for {
		data, err := netdev.ReadNetDev()
		if err != nil {
			panic(fmt.Errorf("could not read netdev: %w", err))
		}

		recv, trns, err := netdev.GetTraffic(data)
		if err != nil {
			panic(fmt.Errorf("could not get traffic: %w", err))
		}

		m := p.Step(recv, trns)

		// Non-blocking; this should complete quickly.
		rcuSlice.Assign(m)

		select {
		case <-timer.C:
			rcuSlice.Rotate()
		default:
		}

		time.Sleep(time.Second)
	}
}
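For context, here is my guess at the shape of an `RCU` type with the `Assign` and `Rotate` methods used above — a sketch, not the PR's actual implementation (only the method names come from the thread):

```go
package main

import (
	"fmt"
	"sync"
)

// RCU holds two values: the one the producer is currently updating, and
// the last published (sealed) one, which is safe to hand to readers.
type RCU struct {
	mu        sync.Mutex
	current   []byte // latest value from the producer, not yet published
	published []byte // sealed value; safe for concurrent readers
}

// Assign records the newest value. It only holds the lock briefly, so
// the producer loop is never blocked by slow readers.
func (r *RCU) Assign(data []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.current = data
}

// Rotate publishes the current value, making it the one readers see.
func (r *RCU) Rotate() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.published = r.current
}

// Read returns the latest published value (nil before the first Rotate).
func (r *RCU) Read() []byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.published
}

func main() {
	var r RCU
	r.Assign([]byte("m1"))
	r.Rotate()
	fmt.Println(string(r.Read())) // prints "m1"
}
```

With this shape, readers never see a value that is still being assigned, which matches the "serve the sealed buffer" idea from the earlier discussion.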

@ParsaJR ParsaJR requested a review from demurky December 15, 2025 21:02
@demurky
Contributor

demurky commented Dec 15, 2025

We seem to have forgotten that Go has a garbage collector.
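As I read this remark: with a garbage collector, the reference counting and explicit removal can go away entirely. The producer just publishes each snapshot behind an atomically swapped pointer, and old snapshots are collected once no reader holds them. A sketch using `sync/atomic` (Go 1.19+; the names are mine, not from the PR):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// latest holds the most recent published metrics snapshot. Readers just
// load the pointer; the garbage collector reclaims old snapshots once no
// goroutine still references them, so no refcounting or sweeping is needed.
var latest atomic.Pointer[[]byte]

// publish atomically swaps in a new immutable snapshot.
func publish(snapshot []byte) {
	latest.Store(&snapshot)
}

// read returns the latest snapshot, or nil before the first publish.
func read() []byte {
	if p := latest.Load(); p != nil {
		return *p
	}
	return nil
}

func main() {
	publish([]byte("snapshot-1"))
	fmt.Println(string(read())) // prints "snapshot-1"
}
```

The one invariant to keep: published snapshots must be treated as immutable, since an arbitrary number of readers may still hold an old one.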

@demurky demurky merged commit 61d54eb into main Dec 16, 2025
1 check passed
Contributor

@demurky demurky left a comment


lgtm

@ParsaJR
Contributor Author

ParsaJR commented Dec 16, 2025

Thanks. Consider making this repository public whenever you want.

It looks production-grade to me.

@ParsaJR ParsaJR deleted the fix-race branch December 16, 2025 20:19
demurky added a commit that referenced this pull request Dec 17, 2025
demurky added a commit that referenced this pull request Dec 17, 2025
demurky added a commit that referenced this pull request Dec 17, 2025


Development

Successfully merging this pull request may close these issues.

Race condition related to the global variable "metrics" in "main.go"
