16 changes: 16 additions & 0 deletions .github/workflows/ci-dgraph-integration2-tests.yml
@@ -29,6 +29,15 @@ jobs:
with:
fetch-depth: 0

- name: Restore benchmark dataset cache
uses: actions/cache/restore@v4
with:
path: dgraphtest/datafiles
key: dataset-dgraphtest-v1
@mwelles-istari (Mar 16, 2026):

@shiva-istari one thing I'm concerned about here is that the version of the dataset fixtures in use is non-deterministic: there's no way to know for sure which version of which fixture was used in a particular build if the cache was hit. Which build seeded that cache? Which version of which fixture was cached when it did?

There's a hardcoded "-v1" suffix on the key, which AFAIK doesn't refer to any version of the test data being used — just the version of the key name itself.

I think instead:

  • The release tag of the test fixtures to use should be configured in an env var or project file, such that:
    • Only the versions of the fixtures at that tag can be downloaded.
    • The cache key suffix used must match the release tag (e.g. dataset-dgraphtest-<release-tag>).
    • If possible, make it controllable via a test execution runtime flag as well.

This will ensure builds get the same version of the test data they would on a fresh checkout even when a cached copy is used, so builds are idempotent and test results are reproducible. It also ensures that any version of a given test data asset is downloaded only once, then cached and reused by every subsequent build that needs it.
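A sketch of what that could look like in the workflow. The `DGRAPH_BENCH_TAG` variable and its value are hypothetical, introduced here only to illustrate deriving the cache key from a pinned fixture tag:

```yaml
# Hypothetical: pin the fixture release tag once, derive the cache key from it.
env:
  DGRAPH_BENCH_TAG: v1.0.0 # illustrative tag of dgraph-benchmarks fixtures

steps:
  - name: Restore benchmark dataset cache
    uses: actions/cache/restore@v4
    with:
      path: dgraphtest/datafiles
      key: dataset-dgraphtest-${{ env.DGRAPH_BENCH_TAG }}
```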

Additional concern: The file-exists check that guards re-downloading (fi.Size() > 0) won't catch corrupted or truncated files from a previous failed run. Combined with replacing MakeDirEmpty with os.MkdirAll (which preserves existing content), there's no mechanism to detect or recover from a bad cached file. A checksum validation or known-size check would improve reliability here.


- name: Ensure datafiles directory
run: mkdir -p dgraphtest/datafiles

- name: Set up Go
uses: actions/setup-go@v6
with:
@@ -55,3 +64,10 @@ jobs:
go clean -testcache
# sleep
sleep 5

- name: Save benchmark dataset cache
if: success()
uses: actions/cache/save@v4
with:
path: dgraphtest/datafiles
key: dataset-dgraphtest-v1
15 changes: 14 additions & 1 deletion .github/workflows/ci-dgraph-ldbc-tests.yml
@@ -28,6 +28,12 @@ jobs:
- name: Checkout Dgraph
uses: actions/checkout@v5

- name: Restore LDBC dataset cache
uses: actions/cache/restore@v4
with:
path: ${{ github.workspace }}/test-data
key: dataset-ldbc-v1

- name: Set up Go
uses: actions/setup-go@v6
with:
@@ -61,6 +67,13 @@
# move the binary
cp dgraph/dgraph ~/go/bin/dgraph
# run the ldbc tests
cd t; ./t --suite=ldbc
cd t; ./t --suite=ldbc --tmp=${{ github.workspace }}/test-data
# clean up docker containers after test execution
./t -r

- name: Save LDBC dataset cache
if: success()
uses: actions/cache/save@v4
with:
path: ${{ github.workspace }}/test-data
key: dataset-ldbc-v1
15 changes: 14 additions & 1 deletion .github/workflows/ci-dgraph-load-tests.yml
@@ -27,6 +27,12 @@
steps:
- uses: actions/checkout@v5

- name: Restore load test dataset cache
uses: actions/cache/restore@v4
with:
path: ${{ github.workspace }}/test-data
key: dataset-load-v1

- name: Set up Go
uses: actions/setup-go@v6
with:
Expand Down Expand Up @@ -60,8 +66,15 @@ jobs:
# move the binary
cp dgraph/dgraph ~/go/bin/dgraph
# run the load tests
cd t; ./t --suite=load
cd t; ./t --suite=load --tmp=${{ github.workspace }}/test-data
# clean up docker containers after test execution
./t -r
# sleep
sleep 5

- name: Save load test dataset cache
if: success()
uses: actions/cache/save@v4
with:
path: ${{ github.workspace }}/test-data
key: dataset-load-v1
30 changes: 21 additions & 9 deletions dgraphtest/load.go
@@ -20,6 +20,7 @@ import (
"runtime"
"strconv"
"strings"
"time"

"github.com/pkg/errors"

@@ -41,10 +42,10 @@ func (c *LocalCluster) HostDgraphBinaryPath() string {
}

var datafiles = map[string]string{
"1million.schema": "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.schema?raw=true",
"1million.rdf.gz": "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.rdf.gz?raw=true",
"21million.schema": "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/21million.schema?raw=true",
"21million.rdf.gz": "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/21million.rdf.gz?raw=true",
"1million.schema": "https://raw.githubusercontent.com/dgraph-io/dgraph-benchmarks/refs/heads/main/data/1million.schema",
"1million.rdf.gz": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/data/1million.rdf.gz",
"21million.schema": "https://raw.githubusercontent.com/dgraph-io/dgraph-benchmarks/refs/heads/main/data/21million.schema",
"21million.rdf.gz": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/data/21million.rdf.gz",
}

type DatasetType int
@@ -604,11 +605,22 @@ func (d *Dataset) ensureFile(filename string) string {
}

func downloadFile(fname, url string) error {
cmd := exec.Command("wget", "-O", fname, url)
cmd.Dir = datasetFilesPath

if _, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("error downloading file %s: %w", fname, err)
const maxRetries = 3
fpath := filepath.Join(datasetFilesPath, fname)
for attempt := 1; attempt <= maxRetries; attempt++ {
cmd := exec.Command("wget", "--tries=3", "--waitretry=5", "--retry-connrefused", "-O", fname, url)
cmd.Dir = datasetFilesPath

if out, err := cmd.CombinedOutput(); err != nil {
Contributor:

The Go-level retry loop (3 attempts) wraps a wget invocation that itself retries 3 times (--tries=3 --waitretry=5). This means up to 9 wget requests per file, with cumulative backoff delays that could reach several minutes before a genuine failure is reported.

Consider either:

  • Removing the wget-level retries (--tries=1) and letting the Go loop handle all retry logic, or
  • Removing the Go-level loop and relying solely on wget's built-in retry mechanism.

Having both layers is redundant and makes the failure timeline hard to reason about.

log.Printf("attempt %d/%d failed to download %s: %v\n%s", attempt, maxRetries, fname, err, string(out))
if attempt < maxRetries {
time.Sleep(time.Duration(attempt*5) * time.Second)
continue
}
_ = os.Remove(fpath)
return fmt.Errorf("error downloading file %s after %d attempts: %w", fname, maxRetries, err)
}
return nil
}
return nil
}
64 changes: 46 additions & 18 deletions t/t.go
@@ -1159,7 +1179,27 @@ var rdfFileNames = [...]string{
"workAt_0.rdf"}

var ldbcDataFiles = map[string]string{
"ldbcTypes.schema": "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/ldbc/sf0.3/ldbcTypes.schema?raw=true",
"ldbcTypes.schema": "https://media.githubusercontent.com/media/dgraph-io/dgraph-benchmarks/refs/heads/main/ldbc/sf0.3/ldbcTypes.schema",
}

func wgetWithRetry(fname, url, dir string) error {
@mwelles-istari (Mar 16, 2026):


This function is a near-exact duplicate of downloadFile in dgraphtest/load.go. Is there a reason it needs to be redefined here instead of moving it to a shared location and reusing it?

This is especially concerning here because the function contents have clearly had to evolve over time as different failure causes were found and more guards were added. Every place it's duplicated is one more place to find and fix (or forget to fix) when the next issue arises.

The duplication is compounded by the double-retry approach — both copies pass --tries=3 --waitretry=5 --retry-connrefused to wget AND wrap that in a Go-level 3-attempt loop with linear backoff, meaning up to 9 wget invocations per file with cumulative delays that could reach several minutes before a genuine failure is reported. If this retry logic needs tuning in the future (and it likely will), having it in exactly one place is essential.

const maxRetries = 3
fpath := filepath.Join(dir, fname)
for attempt := 1; attempt <= maxRetries; attempt++ {
cmd := exec.Command("wget", "--tries=3", "--waitretry=5", "--retry-connrefused", "-O", fname, url)
cmd.Dir = dir
if out, err := cmd.CombinedOutput(); err != nil {
fmt.Printf("attempt %d/%d failed to download %s: %v\n%s\n", attempt, maxRetries, fname, err, string(out))
if attempt < maxRetries {
time.Sleep(time.Duration(attempt*5) * time.Second)
continue
}
_ = os.Remove(fpath)
return fmt.Errorf("failed to download %s after %d attempts: %w", fname, maxRetries, err)
}
return nil
}
return nil
}

func downloadDataFiles() {
@@ -1168,12 +1188,13 @@ func downloadDataFiles() {
return
}
for fname, link := range datafiles {
cmd := exec.Command("wget", "-O", fname, link)
cmd.Dir = *tmp

if out, err := cmd.CombinedOutput(); err != nil {
fmt.Printf("Error %v\n", err)
panic(fmt.Sprintf("error downloading a file: %s", string(out)))
fpath := filepath.Join(*tmp, fname)
if fi, err := os.Stat(fpath); err == nil && fi.Size() > 0 {
fmt.Printf("Skipping %s (already exists)\n", fname)
continue
}
if err := wgetWithRetry(fname, link, *tmp); err != nil {
panic(fmt.Sprintf("error downloading %s: %v", fname, err))
}
}
}
@@ -1189,20 +1210,26 @@ func downloadLDBCFiles(dir string) {
}

start := time.Now()
sem := make(chan struct{}, 5)
var wg sync.WaitGroup
for fname, link := range ldbcDataFiles {
fpath := filepath.Join(dir, fname)
if fi, err := os.Stat(fpath); err == nil && fi.Size() > 0 {
fmt.Printf("Skipping %s (already exists)\n", fname)
continue
}
wg.Add(1)
go func(fname, link string, wg *sync.WaitGroup) {
go func(fname, link string) {
defer wg.Done()
start := time.Now()
cmd := exec.Command("wget", "-O", fname, link)
cmd.Dir = dir
if out, err := cmd.CombinedOutput(); err != nil {
fmt.Printf("Error %v\n", err)
panic(fmt.Sprintf("error downloading a file: %s", string(out)))
sem <- struct{}{}
defer func() { <-sem }()

dlStart := time.Now()
if err := wgetWithRetry(fname, link, dir); err != nil {
panic(fmt.Sprintf("error downloading %s: %v", fname, err))
}
fmt.Printf("Downloaded %s to %s in %s \n", fname, dir, time.Since(start))
}(fname, link, &wg)
Contributor:

Calling panic() inside a goroutine crashes the entire program without giving other goroutines a chance to finish or run their deferred cleanup. The original code had this same issue, but since this PR is improving error handling, it would be worth fixing.

Consider using errgroup.Group instead of sync.WaitGroup — it propagates errors cleanly from goroutines and would allow downloadLDBCFiles to return an error rather than panicking.

fmt.Printf("Downloaded %s to %s in %s \n", fname, dir, time.Since(dlStart))
}(fname, link)
}
wg.Wait()
fmt.Printf("Downloaded %d files in %s \n", len(ldbcDataFiles), time.Since(start))
@@ -1387,7 +1414,9 @@ func run() error {
needsData := testSuiteContainsAny("load", "ldbc", "all")
if needsData && *tmp == "" {
*tmp = filepath.Join(os.TempDir(), "dgraph-test-data")
x.Check(testutil.MakeDirEmpty([]string{*tmp}))
}
if needsData {
x.Check(os.MkdirAll(*tmp, 0755))
}
if testSuiteContainsAny("load", "all") {
downloadDataFiles()
@@ -1449,7 +1478,6 @@ func main() {
procId = rand.Intn(1000)

err := run()
_ = os.RemoveAll(*tmp)
if err != nil {
os.Exit(1)
}
@mlwelles (Contributor, Mar 16, 2026):


Removing os.RemoveAll(*tmp) is necessary for the caching strategy, but it silently changes behavior for local development — test data will now accumulate in the temp directory across runs and never be cleaned up.

Rather than removing this unconditionally, add a new flag to control it:

keepData = pflag.Bool("keep-data", false,
    "Preserve downloaded test data after run (for CI caching). Default cleans up.")

Then in main(), restore the cleanup but make it conditional:

err := run()
if !*keepData {
    _ = os.RemoveAll(*tmp)
}

The CI workflows would then pass --keep-data explicitly:

cd t; ./t --suite=load --tmp=${{ github.workspace }}/test-data --keep-data

This fits the existing flag pattern (similar to --keep for clusters, --download for resources), is explicit and discoverable via --help, and avoids coupling behavior to the environment.
