Very tangentially related to #41, we've also occasionally seen a different type of error across a couple of versions of macOS 13 (most recently 13.5, which is what we're currently running):
```
2023/12/19 10:16:07.759431 Error while running module [GrowRootAPFSVolume] (type: command, group: 3) with message: and err: ec2macosinit: error executing command [[/bin/zsh -c ec2-macos-utils grow --id root]] with stdout [] and stderr [time="19 Dec 23 10:15 GMT" level=info msg="Configuring diskutil for product" product="macOS Ventura 13.5.0"
time="19 Dec 23 10:15 GMT" level=info msg="Attempting to grow container..." device_id=disk5s2s1
time="19 Dec 23 10:15 GMT" level=info msg="Checking if device can be APFS resized..." device_id=disk5s2s1
time="19 Dec 23 10:15 GMT" level=info msg="Device can be resized"
time="19 Dec 23 10:15 GMT" level=info msg="Repairing the parent disk..."
time="19 Dec 23 10:15 GMT" level=info msg="Repairing parent disk..." parent_id=disk4
time="19 Dec 23 10:15 GMT" level=info msg="Successfully repaired the parent disk"
time="19 Dec 23 10:15 GMT" level=info msg="Fetching amount of free space on device..." device_id=disk5
time="19 Dec 23 10:15 GMT" level=info msg="Resizing container to maximum size..." device_id=disk5 free_space="268 GB"
Error: diskutil: failed to run diskutil command to resize the container, stderr [Error: -69716: Storage system verify or repair failed
]: error waiting for specified command to exit: exit status 1]: exit status 1
```
In this case, it was a 250GB AMI launched with a 500GB volume that it wanted to grow into.
The impact seemed to be that init is allowed to continue through the remaining steps (since the module is configured as best effort), but we end up with a host in production with a smaller volume than expected. While we have some automated cleanup in place, the size difference was large enough that we couldn't reclaim enough free space to compensate.
If I run `sudo ec2-macos-utils grow --id root` later, the volume grows properly, so I'm guessing it's just unlucky timing. We don't see this widely, but it's come up in our logs 3-4 times over the past few months.
Just wondering whether there might be some more graceful way to handle this type of error, or at least a retry. The timestamps (10:15:51 -> 10:16:07) span only 16 seconds, so I'd be OK with it retrying this step a couple more times if there's a good chance a retry would succeed. Is there any mechanism supported in ec2-macos-init where we could add retries as a generic thing? (I see this, but it looks like it's specific to that one type of task.)
Also, if #41 were addressed so that it doesn't even try to grow the volume when it can't grow any further, that would probably also reduce total init runtime for the cases where the volume size already matches.
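For that skip check, I imagine it could be as simple as comparing the container's current size to the disk's, with some slack for partitioning overhead. A sketch under assumptions: the function name, the byte values, and the 1 GiB slack threshold are all made up here; the real sizes would come from diskutil/ec2-macos-utils.

```go
package main

import "fmt"

// shouldGrow reports whether growing is worthwhile: true only when the
// disk is larger than the container by at least a slack margin, so the
// grow step can be skipped entirely when the sizes already match.
// (Hypothetical check; threshold chosen arbitrarily for illustration.)
func shouldGrow(containerBytes, diskBytes uint64) bool {
	const slack = 1 << 30 // ignore differences under 1 GiB
	return diskBytes > containerBytes && diskBytes-containerBytes >= slack
}

func main() {
	// Our case: a 250 GiB container on a 500 GiB disk should grow...
	fmt.Println(shouldGrow(250<<30, 500<<30))
	// ...but an already-full container should be skipped.
	fmt.Println(shouldGrow(500<<30, 500<<30))
}
```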