Sync on production can crash ungracefully

We noticed an out-of-sync state between the production CN and `urn:node:ARCTIC` the other day and found that the CN thought it was completely in sync when it wasn't. In this particular case, the CN had failed to pick up tens of System Metadata updates from `urn:node:ARCTIC` that we were expecting to see, and it may have missed many more. I messaged @taojing2002 for help and we found that sync had crashed after running out of memory (OOM). Our fix was to set the last harvest timestamp back a day and allow processing to run. My immediate thoughts are:

- Sync shouldn't go OOM and crash
- If sync does crash, it shouldn't update the last sync (last harvest?) timestamp because this causes an out-of-sync state that's very hard to detect
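
To make the second point concrete, here is a minimal sketch of the behavior I'd expect, written in Python for brevity. It is not how sync is actually implemented: `harvest_once`, `fetch_updates`, `process`, `save_checkpoint`, and the `"modified"` key are all hypothetical names. The idea is only that the last harvest timestamp gets persisted after the batch finishes, so a crash partway through leaves the old timestamp in place.

```python
from datetime import datetime
from typing import Callable, Iterable


def harvest_once(
    last_harvest: datetime,
    fetch_updates: Callable[[datetime], Iterable[dict]],
    process: Callable[[dict], None],
    save_checkpoint: Callable[[datetime], None],
) -> datetime:
    """Process updates newer than last_harvest, then persist the new timestamp.

    All callables are hypothetical stand-ins. If fetch_updates or process
    raises (including MemoryError), the exception propagates and
    save_checkpoint is never called, so the stored timestamp is unchanged.
    """
    newest = last_harvest
    for update in fetch_updates(last_harvest):
        process(update)
        # Track the most recent modification time seen in this batch.
        newest = max(newest, update["modified"])
    # Reached only when every update in the batch was processed.
    save_checkpoint(newest)
    return newest
```

With this shape, a crashed run is retried from the old timestamp, which means some updates may be processed twice, so processing would need to tolerate repeats.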

We talked about possible next steps on our dev call this week and came up with:

1. Bump max heap (Xmx) on the process. This might not be possible due to limited resources on `cn-ucsb-1`.
2. Move sync (and processing?) over to another host with more resources
3. We might consider making MNs responsible for auditing (Note: Bryce thinks this is not quite the route to go but it's an idea that came up nonetheless)
4. In the meantime before a fix, we could consider auditing sync on some of our more active member nodes (`ARCTIC`, `ESS-DIVE`, `RW`)
5. Set up monitoring on our logs to detect crashes like this
6. Work on figuring out the bugs at the top of this post
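
Items 4 and 5 could be sketched roughly as follows. This is an illustration under assumptions, not a working tool: `find_stale_pids` and `find_crash_lines` are hypothetical helpers, the two dictionaries are assumed to have been fetched from the MN and the CN beforehand (identifier mapped to System Metadata modification time), and the `OutOfMemoryError` marker is a guess at what the crash looks like in the logs, since I can't see them.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def find_stale_pids(
    mn_modified: Dict[str, datetime],
    cn_modified: Dict[str, datetime],
) -> List[str]:
    """Return identifiers the CN is missing or holds an older copy of.

    Both inputs map identifier -> System Metadata modification time and
    are assumed to be fetched elsewhere (fetching is not shown here).
    """
    stale = []
    for pid, mn_time in mn_modified.items():
        cn_time = cn_modified.get(pid)
        # Missing on the CN, or the CN copy predates the MN copy.
        if cn_time is None or cn_time < mn_time:
            stale.append(pid)
    return sorted(stale)


def find_crash_lines(
    log_lines: Iterable[str],
    marker: str = "OutOfMemoryError",  # assumed marker; adjust to the real logs
) -> List[str]:
    """Return log lines containing the crash marker, without trailing newlines."""
    return [line.rstrip("\n") for line in log_lines if marker in line]
```

A non-empty result from either function would be the signal to alert on; where that alert goes is left open.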

For now, @taojing2002 is going to look into this and coordinate with @datadavev, and we can go from there.

[Note: This might be on the wrong repo since I can't see our logs on cn-ucsb-1 to see what actually crashed. Feel free to move.]
