Descriptive summary
This work began in ticket #3389.
We are (were?) seeing more instances of Blazegraph entering a crash state in which it was still running and could answer requests for the few URIs in its LRU cache, but most requests would fail. This state was also not enough to make the cluster restart the pod. It happened up to several times a week, but at no regular interval that we could find.
In this crash state, most requests to Blazegraph fail and OD falls back to fetching data from external sources. These fetches are ultimately unnecessary and not rate-limited, and they usually earn us at least a temporary block from some data sources.
Example Blazegraph error:
ERROR: SPARQL-QUERY: queryStr=SELECT * WHERE { ?p ?o }
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.RuntimeException:
java.lang.RuntimeException: addr=-6465244 : cause=java.lang.RuntimeException: java.io.IOException: I/O error
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
We decided the best approach would be a Rake task that could be run before major work such as importers, from the console, or as a regular status check. A Rake task is also closer to the OD code that actually does the querying than a shell-based query would be, though we may switch to a shell-based check at a later point.
Expected behavior
Query Blazegraph for real label results from a list of known but lightly used URIs.
Log any failures and post alert messages to Slack.
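A minimal sketch of how the check described above could be structured, offered as a starting point rather than the actual implementation. Everything named here is hypothetical: the class name `BlazegraphHealthCheck`, the `fetcher` callable (standing in for whatever OD code performs the label lookup), and the `notifier` callable (standing in for whatever posts to Slack). Injecting both keeps the check testable without a live Blazegraph or Slack.

```ruby
require 'logger'
require 'stringio'

# Hypothetical health check; none of these names come from the OD codebase.
class BlazegraphHealthCheck
  # uris:     known but lightly used URIs, unlikely to be served from the cache
  # fetcher:  callable taking a URI string and returning its label (assumed interface)
  # notifier: callable taking an alert message, e.g. a Slack poster (assumed interface)
  def initialize(uris:, fetcher:, notifier:, logger: Logger.new($stdout))
    @uris = uris
    @fetcher = fetcher
    @notifier = notifier
    @logger = logger
  end

  # Checks every URI, logs each failure, sends one alert if anything failed,
  # and returns the list of failing URIs.
  def run
    failures = @uris.reject { |uri| label_found?(uri) }
    unless failures.empty?
      @notifier.call(
        "Blazegraph health check: #{failures.size} of #{@uris.size} " \
        "URIs failed: #{failures.join(', ')}"
      )
    end
    failures
  end

  private

  # A lookup counts as healthy only if it returns a non-blank label.
  # Exceptions (such as the I/O error above) are logged and count as failures.
  def label_found?(uri)
    label = @fetcher.call(uri)
    return true unless label.nil? || label.to_s.strip.empty?

    @logger.error("Blazegraph returned no label for #{uri}")
    false
  rescue StandardError => e
    @logger.error("Blazegraph query failed for #{uri}: #{e.class}: #{e.message}")
    false
  end
end
```

A Rake task could then build this object with the real lookup and Slack code and call `run`, exiting non-zero when the returned list is not empty so that importers or a scheduler can react to it.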
Related work
#3383
#3389
Accessibility Concerns
Add any information here to indicate any known or suspected accessibility issues for this ticket