subtle GC finalizer (?) issue in recovery from prolonged out-of-disk condition #282

@jgraettinger

Description

The storage bucket(s) backing a large and very busy cluster had a configuration change such that fragment uploads were refused for an extended period (hours), and the disks of many brokers filled to 100%.

As intended, brokers paused accepting new appends until more disk was available, and also as intended, once uploads to the backing bucket resumed, all but one of the brokers reclaimed disk space and were able to continue accepting appends.

The specific mechanism by which this works is that, once a broker sees the remote bucket contains a fragment covering the span of a local fragment, the local fragment (and its *os.File) is dropped for the GC to finalize. It's not closed explicitly, because it may still be accessed by a concurrent read.
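The pattern above can be sketched as follows. Note this is a generic illustration, not gazette's actual code: in the real mechanism the os package's own built-in finalizer closes the descriptor once the *os.File becomes unreachable, whereas this sketch uses a hypothetical wrapper type so the finalizer can be observed directly (setting a finalizer on the *os.File itself would replace the os package's).

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// fragmentFile is a hypothetical stand-in for a broker's local fragment:
// an unlinked file whose disk space is freed only when its fd closes.
type fragmentFile struct {
	f *os.File
}

func main() {
	f, err := os.CreateTemp("", "fragment-")
	if err != nil {
		panic(err)
	}
	os.Remove(f.Name()) // Unlink: space is reclaimed only when the fd closes.

	done := make(chan struct{})
	frag := &fragmentFile{f: f}
	runtime.SetFinalizer(frag, func(fr *fragmentFile) {
		fr.f.Close() // Reclaims the descriptor, and thus the disk space.
		close(done)
	})

	frag = nil   // Drop the last reference; no explicit Close, so a
	             // concurrent reader's view of the file stays valid.
	runtime.GC() // Marks the wrapper unreachable and queues the finalizer.
	<-done       // The finalizer goroutine closed the fd.
	fmt.Println("finalizer ran; fd closed")
}
```

The design trade-off this models is exactly the one the issue describes: correctness for concurrent readers is bought by making disk reclamation dependent on the GC actually finalizing the dropped references.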

One broker, for unknown reasons, was unable to reclaim dangling *os.File references, and never escaped the 100% disk-full condition. Before forcibly killing it, I was able to verify 1) GC was still running regularly, 2) goroutine traces showed that refreshes of the fragment index from the bucket -- the mechanism by which *os.File references are dropped -- were proceeding normally, and 3) there were no other wedged goroutines that could explain a very large number of dangling references. Beyond that, I'm currently scratching my head.
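For what it's worth, the dangling-reference symptom is directly observable on Linux: an unlinked file held open by a leaked *os.File shows up under /proc/&lt;pid&gt;/fd with a " (deleted)" suffix. A minimal, Linux-only sketch (the function name is mine, not gazette's) that counts such descriptors in the current process:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// countDeletedFDs reports how many of this process's open descriptors
// point at unlinked files -- the signature of dangling *os.File
// references still pinning disk space. Linux-specific (/proc).
func countDeletedFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	var n int
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join("/proc/self/fd", e.Name()))
		if err != nil {
			continue // The fd may have been closed concurrently.
		}
		if strings.HasSuffix(target, " (deleted)") {
			n++
		}
	}
	return n, nil
}

func main() {
	// Simulate a leaked fragment: create, unlink, keep the fd open.
	f, err := os.CreateTemp("", "fragment-")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	os.Remove(f.Name())

	n, err := countDeletedFDs()
	if err != nil {
		panic(err)
	}
	fmt.Println("deleted-but-open fds:", n)
}
```

Run against the wedged broker's pid (rather than self), this would distinguish "the fds really are leaked" from "disk is held by something else entirely".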
