sql: deflake TestStatusAPIContentionEvents #162596

dhartunian wants to merge 1 commit into cockroachdb:master
Conversation
Add detection for duplicate span errors to the test and ignore them, since this test only cares about the contention information captured while tracing, not the trace itself. Also added notes to the two callsites I found that were causing this error to trigger. I confirmed that the test would stop flaking if I commented out the `ImportRemoteRecording` line in distsql_running.go.

I've chosen for now not to disrupt the application logic: even if I disabled duplicate span detection in production, I'd still want it on during tests, and since I haven't observed this error in production it doesn't seem urgent.

Resolves: cockroachdb#160544
Resolves: cockroachdb#160876

Epic: none

Release note: None
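For context on the approach: the deflake amounts to treating one known-benign error as ignorable in the test's error handling, while every other error still fails the test. Below is a minimal stdlib-only sketch of that pattern; `isDuplicateSpanErr`, `runTracedQuery`, and the exact error text are illustrative assumptions, not the actual test code or CockroachDB's error message.

```go
package main

import (
	"errors"
	"strings"
	"testing"
)

// isDuplicateSpanErr reports whether err looks like the duplicate-span error
// that importing a remote recording can raise. The matched text here is an
// assumption for illustration.
func isDuplicateSpanErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "duplicate span")
}

func TestContentionEventsIgnoreDuplicateSpans(t *testing.T) {
	// runTracedQuery stands in for executing the traced statement and
	// collecting contention events in the real test.
	runTracedQuery := func() error {
		return errors.New("importing remote recording: duplicate span")
	}

	if err := runTracedQuery(); err != nil && !isDuplicateSpanErr(err) {
		// Any other error is still fatal; only the known-benign
		// duplicate-span error is skipped, since the test asserts on
		// contention events rather than on the trace itself.
		t.Fatal(err)
	}
}
```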
Benchmark results:

⚪ Sysbench [SQL, 3node, oltp_read_write]
🔴 Sysbench [KV, 3node, oltp_read_only]
⚪ Sysbench [KV, 3node, oltp_write_only]

Reproduce. benchdiff binaries (identical setup for all three benchmarks):

```
mkdir -p benchdiff/8c18043/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/8c18043009b0206bcf9df9e9ef7a7279d4950c81/bin/pkg_sql_tests benchdiff/8c18043/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/8c18043/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/f275471/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/f2754717ca1ec136f609fdfa5ca779e0fbf120e1/bin/pkg_sql_tests benchdiff/f275471/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/f275471/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
```

benchdiff commands:

```
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=f275471 --new=8c18043 ./pkg/sql/tests
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=f275471 --new=8c18043 ./pkg/sql/tests
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=f275471 --new=8c18043 ./pkg/sql/tests
```

Artifacts download:

```
mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/8c18043009b0206bcf9df9e9ef7a7279d4950c81/21756806326-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/f2754717ca1ec136f609fdfa5ca779e0fbf120e1/21756806326-1/\* old/
```

Built with commit: 8c18043009b0206bcf9df9e9ef7a7279d4950c81
yuzefovich left a comment
@yuzefovich reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl and @dhartunian).
pkg/sql/distsql_running.go line 1685 at r1 (raw file):
```go
if len(meta.TraceData) > 0 {
	if span := tracing.SpanFromContext(r.ctx); span != nil {
		// Note(davidh): I believe this import can be triggered multiple times
		// for a single trace, so duplicate span errors may surface here.
		span.ImportRemoteRecording(meta.TraceData)
	}
}
```
Yes, if the query has a distributed plan, then each remote flow might send meta.TraceData multiple times (depending on what operators we have in that flow), and then we can also have multiple remote flows. This has been the case since forever. (The gateway flows are handled in a special way as of #98831.)
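To make that multiplicity concrete for readers outside the DistSQL code, here is a schematic stdlib-only Go sketch; the `metadata` and `remoteFlow` types are invented stand-ins, not the real execinfra types.

```go
package main

import "fmt"

// metadata mimics the producer metadata that carries trace recordings
// (meta.TraceData in the real code); here it is just a string.
type metadata struct {
	traceData string
}

// remoteFlow mimics a flow running on one remote node: each operator in
// the flow can emit its own trace metadata when the flow drains.
type remoteFlow struct {
	node      int
	operators []string
}

func (f remoteFlow) drain(out chan<- metadata) {
	for _, op := range f.operators {
		out <- metadata{traceData: fmt.Sprintf("n%d/%s recording", f.node, op)}
	}
}

func main() {
	out := make(chan metadata, 16)
	// A distributed plan runs one flow per participating node...
	flows := []remoteFlow{
		{node: 2, operators: []string{"colrpc.Outbox", "colfetcher.ColBatchScan"}},
		{node: 3, operators: []string{"colrpc.Outbox", "colfetcher.ColBatchScan"}},
	}
	for _, f := range flows {
		f.drain(out)
	}
	close(out)
	// ...so the gateway sees trace metadata once per operator per flow:
	// four imports for this two-node plan.
	for meta := range out {
		fmt.Println("import:", meta.traceData)
	}
}
```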
dhartunian left a comment
@dhartunian made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl and @yuzefovich).
pkg/sql/distsql_running.go line 1685 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
> Yes, if the query has a distributed plan, then each remote flow might send meta.TraceData multiple times (depending on what operators we have in that flow), and then we can also have multiple remote flows. This has been the case since forever. (The gateway flows are handled in a special way as of #98831.)
Do you think I should remove the dupe trace error and instead log it and continue when writing the trace? It seems like it could get triggered easily, even though I haven't seen any evidence of that.
yuzefovich left a comment
@yuzefovich made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl and @dhartunian).
pkg/sql/distsql_running.go line 1685 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
> Do you think I should remove the dupe trace error and instead log it and continue when writing the trace? It seems like it could get triggered easily, even though I haven't seen any evidence of that.
Hm, I dunno. Is the problem that we're trying to import the same span multiple times?
I could definitely see that happening, but via child spans: e.g., if we have colrpc.Outbox -> colfetcher.ColBatchScan on the remote node, then the tracing span for ColBatchScan is a child of the one for Outbox, and both operators will emit TraceData (the ColBatchScan only for its own span, while the Outbox emits its own span with the ColBatchScan's span as a child). Is this setup problematic?
We could definitely refactor the code on the execution side to guarantee that only remote "root" execution spans are propagated as TraceData if you think it's worth cleaning this up.
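That parent/child overlap is exactly the shape that can produce duplicate imports. A minimal sketch of the mechanism, assuming invented `spanRecord`/`importer` types and a simple seen-set rather than CockroachDB's actual tracing machinery:

```go
package main

import "fmt"

// spanRecord is a stand-in for a remote span recording; real recordings
// carry far more state than an ID and children.
type spanRecord struct {
	spanID   int
	children []spanRecord
}

// importer mimics a receiver that imports remote recordings into a local
// trace and flags duplicate span IDs instead of silently re-importing them.
type importer struct {
	seen map[int]bool
}

func (im *importer) importRecording(rec spanRecord) error {
	if im.seen[rec.spanID] {
		return fmt.Errorf("duplicate span: %d", rec.spanID)
	}
	im.seen[rec.spanID] = true
	for _, child := range rec.children {
		if err := im.importRecording(child); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	im := &importer{seen: make(map[int]bool)}
	// The Outbox-like recording (span 1) arrives with the scan's span (2)
	// embedded as a child...
	outbox := spanRecord{spanID: 1, children: []spanRecord{{spanID: 2}}}
	// ...and the scan's own recording also arrives as a separate piece of
	// metadata.
	scan := spanRecord{spanID: 2}

	fmt.Println(im.importRecording(outbox)) // <nil>
	fmt.Println(im.importRecording(scan))   // duplicate span: 2
}
```

Here the parent's recording already contains the child, so importing the child's standalone recording afterwards trips the duplicate check, which matches the flake signature the test now tolerates.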