subtle GC finalizer (?) issue in recovery from prolonged out-of-disk condition #282

@jgraettinger

Description

The storage bucket(s) backing a large and very busy cluster had a configuration change such that fragment uploads were refused for an extended period (hours), and the disks of many brokers filled to 100%.

As intended, brokers paused accepting new appends until more disk was available, and also as intended, once uploads to the backing bucket resumed, all but one of the brokers reclaimed disk space and were able to continue accepting appends.

The specific mechanism by which this works is that, once a broker sees the remote bucket contains a fragment covering the span of a local fragment, the local fragment (and its *os.File) is dropped for the GC to finalize. It's not closed explicitly, because it may still be accessed by a concurrent read.
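The pattern above can be sketched as follows. Note this is a generic illustration, not gazette's actual code: in the real mechanism the os package's own built-in finalizer closes the descriptor once the *os.File becomes unreachable, whereas this sketch uses a hypothetical wrapper type so the finalizer can be observed directly (setting a finalizer on the *os.File itself would replace the os package's).

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// fragmentFile is a hypothetical stand-in for a broker's local fragment:
// an unlinked file whose disk space is freed only when its fd closes.
type fragmentFile struct {
	f *os.File
}

func main() {
	f, err := os.CreateTemp("", "fragment-")
	if err != nil {
		panic(err)
	}
	os.Remove(f.Name()) // Unlink: space is reclaimed only when the fd closes.

	done := make(chan struct{})
	frag := &fragmentFile{f: f}
	runtime.SetFinalizer(frag, func(fr *fragmentFile) {
		fr.f.Close() // Reclaims the descriptor, and thus the disk space.
		close(done)
	})

	frag = nil   // Drop the last reference; no explicit Close, so a
	             // concurrent reader's view of the file stays valid.
	runtime.GC() // Marks the wrapper unreachable and queues the finalizer.
	<-done       // The finalizer goroutine closed the fd.
	fmt.Println("finalizer ran; fd closed")
}
```

The design trade-off this models is exactly the one the issue describes: correctness for concurrent readers is bought by making disk reclamation dependent on the GC actually finalizing the dropped references.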

One broker, for unknown reasons, was unable to reclaim dangling *os.File references, and never escaped the 100% disk-full condition. Before forcibly killing it, I was able to verify 1) GC was still running regularly, 2) goroutine traces showed that refreshes of the fragment index from the bucket -- the mechanism by which *os.File references are dropped -- were proceeding normally, and 3) there were no other wedged goroutines that could explain a very large number of dangling references. Beyond that, I'm currently scratching my head.
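For what it's worth, the dangling-reference symptom is directly observable on Linux: an unlinked file held open by a leaked *os.File shows up under /proc/&lt;pid&gt;/fd with a " (deleted)" suffix. A minimal, Linux-only sketch (the function name is mine, not gazette's) that counts such descriptors in the current process:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// countDeletedFDs reports how many of this process's open descriptors
// point at unlinked files -- the signature of dangling *os.File
// references still pinning disk space. Linux-specific (/proc).
func countDeletedFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	var n int
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join("/proc/self/fd", e.Name()))
		if err != nil {
			continue // The fd may have been closed concurrently.
		}
		if strings.HasSuffix(target, " (deleted)") {
			n++
		}
	}
	return n, nil
}

func main() {
	// Simulate a leaked fragment: create, unlink, keep the fd open.
	f, err := os.CreateTemp("", "fragment-")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	os.Remove(f.Name())

	n, err := countDeletedFDs()
	if err != nil {
		panic(err)
	}
	fmt.Println("deleted-but-open fds:", n)
}
```

Run against the wedged broker's pid (rather than self), this would distinguish "the fds really are leaked" from "disk is held by something else entirely".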
