Sync on production can crash ungracefully #1

@amoeba

We noticed an out-of-sync state between the production CN and urn:node:ARCTIC the other day and found that the CN reported being completely in sync when it wasn't. In this particular case, the CN had failed to pick up tens of System Metadata updates from urn:node:ARCTIC that we were expecting to see, and it may have missed many more. I messaged @taojing2002 for help and we found that sync had crashed after running out of memory (OOM). Our fix was to set the last harvest timestamp back a day and let processing run again. My immediate thoughts are:

  • Sync shouldn't go OOM and crash
  • If sync does crash, it shouldn't update the last sync (last harvest?) timestamp, because this causes an out-of-sync state that's very hard to detect
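
To make the second point concrete, here is a minimal sketch of the ordering I have in mind: the last harvest timestamp is only written after every change in the batch has been processed, so a crash partway through leaves it untouched and the next run retries the same window. This is not the actual sync code. `HarvestCheckpoint`, `harvest_once`, `list_changes`, and `process` are hypothetical names made up for illustration, and a real JVM OOM would likely kill the process outright rather than raise something catchable, which is all the more reason to write the timestamp last.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable


class HarvestCheckpoint:
    """Hypothetical in-memory stand-in for wherever the last harvest timestamp lives."""

    def __init__(self, last_harvest: datetime) -> None:
        self.last_harvest = last_harvest


def harvest_once(
    checkpoint: HarvestCheckpoint,
    list_changes: Callable[[datetime], Iterable[tuple[str, datetime]]],
    process: Callable[[str], None],
) -> int:
    """Process every change newer than the checkpoint, then advance the checkpoint.

    If process() raises (or the process dies) partway through, the checkpoint
    is never written, so the next run starts from the same timestamp.
    """
    newest = checkpoint.last_harvest
    processed = 0
    for pid, modified in list_changes(checkpoint.last_harvest):
        process(pid)  # may raise; nothing has been persisted yet
        newest = max(newest, modified)
        processed += 1
    # Only reached when the whole batch succeeded.
    checkpoint.last_harvest = newest
    return processed
```

The trade-off is that a failed batch gets reprocessed in full, so processing needs to be safe to repeat for the same identifier.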

We talked about possible next steps on our dev call this week and came up with:

  1. Bump max heap (Xmx) on the process. This might not be possible due to limited resources on cn-ucsb-1.
  2. Move sync (and processing?) over to another host with more resources
  3. We might consider making MNs responsible for auditing (Note: Bryce thinks this is not quite the route to go, but it's an idea that came up nonetheless)
  4. In the meantime, before a fix lands, we could consider auditing sync on some of our more active member nodes (ARCTIC, ESS-DIVE, RW)
  5. Set up monitoring on our logs to detect crashes like this
  6. Work on figuring out the bugs at the top of this post
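
For the auditing idea in item 4, a rough sketch of the comparison might look like the following. It assumes we can already get, for each side, a mapping from identifier to the System Metadata modified timestamp. How those mappings are fetched from the MN and the CN is left out, and `audit_sync` is a hypothetical helper, not something that exists in our code today.

```python
from datetime import datetime, timezone


def audit_sync(
    mn_modified: dict[str, datetime],
    cn_modified: dict[str, datetime],
) -> dict[str, list[str]]:
    """Compare per-identifier modified timestamps between an MN and the CN.

    Returns identifiers the CN has never seen ("missing") and identifiers
    where the CN's copy is older than the MN's ("stale").
    """
    missing = sorted(pid for pid in mn_modified if pid not in cn_modified)
    stale = sorted(
        pid
        for pid, mn_ts in mn_modified.items()
        if pid in cn_modified and cn_modified[pid] < mn_ts
    )
    return {"missing": missing, "stale": stale}
```

A non-empty "stale" list for identifiers modified before the CN's last harvest timestamp would be the signature of the failure described in this issue.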

For now, @taojing2002 is going to look into this and coordinate with @datadavev and we can go from there.

[Note: This might be on the wrong repo, since I can't see our logs on cn-ucsb-1 to check what actually crashed. Feel free to move.]
