What problem does your feature solve?
At the end of a run, blaster emits a JSON summary of how the endpoint performed. Included in this summary is a dict of errors -> error_stats (string -> struct{}); the struct contains the error code, a count of how many times it was seen as a response, and the first/last time it was seen. With only this information, it's difficult for an operator to tell what the errors imply about the RPC's buckling point for that endpoint.
What would you like to see?
I think the most ergonomic way to resolve this is to record the RPS at which the errors occurred. There are several options for this, but I propose keeping a list of ranges over which the error was encountered (e.g. 'rps_seen_at' : [8], [10, 20], [22, 30]). This tells the operator where in the RPS range the endpoint faltered, buckled, and eventually failed, without having to sift through the logs manually.
This is a low-overhead addition because we already have a merge range helper in this repo (it's used for the ledger window functionality).
What alternatives are there?
For less granularity, we could just keep the first/last RPS we saw the error at. For more, we could create a histogram of { RPS : [% requests at the RPS that errored] }. The histogram gives the most complete summary of what happened at each RPS, but it also makes the results section more verbose; this may be the tradeoff point where the answer is just to go read the logs if one needs that much detail.