We noticed an out-of-sync state between the production CN and urn:node:ARCTIC the other day and found that the CN thought it was completely in sync when it wasn't. In this particular case, the CN had failed to pick up tens of System Metadata updates from urn:node:ARCTIC that we were expecting to see, and it may have missed many more. I messaged @taojing2002 for help and we found that sync had crashed after running out of memory (OOM). Our fix was to set the last harvest timestamp back a day and let processing run. My immediate thoughts are:
- Sync shouldn't go OOM and crash
- If sync does crash, it shouldn't update the last sync (last harvest?) timestamp, because this causes an out-of-sync state that's very hard to detect
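
To make the second point concrete, here is a minimal sketch of the behavior I'd expect. It is not the actual sync code: `harvest_once`, `fetch_changes`, `process`, and `save_timestamp` are hypothetical names made up for illustration. The idea is only that the last harvest timestamp is written after every item in the batch has been processed, so a crash partway through leaves the old timestamp in place and the next run retries the same window.

```python
from datetime import datetime
from typing import Callable, Iterable, Tuple


def harvest_once(
    last_harvest: datetime,
    fetch_changes: Callable[[datetime], Iterable[Tuple[str, datetime]]],
    process: Callable[[str], None],
    save_timestamp: Callable[[datetime], None],
) -> datetime:
    """Process every change since last_harvest, then persist the new timestamp.

    All four parameters are hypothetical stand-ins, not real sync APIs.
    """
    newest = last_harvest
    for pid, modified in fetch_changes(last_harvest):
        # If this raises (or the process dies), we never reach
        # save_timestamp, so the stored timestamp stays where it was.
        process(pid)
        newest = max(newest, modified)
    # Reached only when every item in the batch was processed.
    save_timestamp(newest)
    return newest
```

With this shape, a failed run costs some repeated work on the next harvest but can't silently skip updates.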
We talked about possible next steps on our dev call this week and came up with:
- Bump max heap (Xmx) on the process. This might not be possible due to limited resources on cn-ucsb-1.
- Move sync (and processing?) over to another host with more resources
- We might consider making MNs responsible for auditing (Note: Bryce thinks this is not quite the route to go, but it's an idea that came up nonetheless)
- In the meantime, before a fix, we could consider auditing sync on some of our more active member nodes (ARCTIC, ESS-DIVE, RW)
- Set up monitoring on our logs to detect crashes like this
- Work on figuring out the bugs at the top of this post
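
As a rough idea of what the interim audit could look like, here is a small sketch. It assumes we can get, for both the MN and the CN, a mapping from each PID to its System Metadata modification time; how those mappings are gathered is left out, and `find_out_of_sync` is a hypothetical helper, not an existing tool.

```python
from datetime import datetime
from typing import Dict, List


def find_out_of_sync(
    mn_modified: Dict[str, datetime],
    cn_modified: Dict[str, datetime],
) -> List[str]:
    """Return PIDs whose CN copy is missing or older than the MN copy.

    Both arguments map PID -> System Metadata modification time
    (assumed inputs; collecting them is outside this sketch).
    """
    stale = []
    for pid, mn_time in mn_modified.items():
        cn_time = cn_modified.get(pid)
        if cn_time is None or cn_time < mn_time:
            stale.append(pid)
    return sorted(stale)
```

Anything this returns would be a candidate for the kind of missed update we saw on ARCTIC.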
For now, @taojing2002 is going to look into this and coordinate with @datadavev and we can go from there.
[Note: This might be on the wrong repo since I can't see our logs on cn-ucsb-1 to see what actually crashed. Feel free to move.]