ci: optimize workflows for Blacksmith + swap to gemma4 models#184

Open: Gajesh2007 wants to merge 1 commit into master from ci-optimizations

@Gajesh2007 Gajesh2007 commented May 18, 2026

Summary

Optimize all CI workflows to leverage Blacksmith's infrastructure and migrate test models from Qwen to Gemma 4.

Workflow optimizations

| Workflow | Optimization | Expected impact |
| --- | --- | --- |
| ci.yml | Cargo artifact cache for provider tests | Save 3-8 min/run |
| ci.yml | golangci-lint-action v9.2.0 (pre-built binary, analysis cache) | Save ~30s + better annotations |
| integration.yml | Cache Homebrew, Swift .build, HuggingFace models | Save 5-10 min/run on warm cache |
| release-swift.yml | Cache HuggingFace test model | Save ~2 min/release |
| threat-model-review.yml | pip cache | Save ~15s |

All caches use standard actions/cache — Blacksmith's transparent proxy intercepts them automatically at ~400 MB/s (vs GitHub's ~90 MB/s) with 25 GB free storage per repo.
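
As a rough back-of-the-envelope check of the throughput figures above, the sketch below estimates how long restoring the 5.2 GB test model would take at each quoted rate. It assumes 1 GB = 1000 MB and sustained throughput with no per-request overhead; `transferSeconds` is a hypothetical helper, not part of the repository.

```go
package main

import "fmt"

// transferSeconds returns the time in seconds needed to move sizeMB
// megabytes at a sustained rate of rateMBps megabytes per second.
// Hypothetical helper for illustration only.
func transferSeconds(sizeMB, rateMBps float64) float64 {
	return sizeMB / rateMBps
}

func main() {
	// 5.2 GB model from the PR description, assuming 1 GB = 1000 MB.
	const modelMB = 5200.0

	// Rates are the approximate figures quoted in the PR description.
	fmt.Printf("Blacksmith proxy (~400 MB/s): %.0f s\n", transferSeconds(modelMB, 400))
	fmt.Printf("GitHub cache (~90 MB/s): %.0f s\n", transferSeconds(modelMB, 90))
}
```

Under those assumptions, a warm-cache restore of the model is roughly 13 s through the proxy versus roughly 58 s at the GitHub rate; real runs will vary with compression and network conditions.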

Sticky disks (useblacksmith/stickydisk) were evaluated but are Linux-only (ext4 + block devices). All heavy CI jobs run on macOS Apple Silicon runners, so actions/cache with the transparent proxy is the best available option.

Model migration

  • Primary test model: mlx-community/Qwen3.5-0.8B-MLX-4bit → mlx-community/gemma-4-e4b-4bit (5.2 GB)
  • Removed: mlx-community/gemma-3-270m-4bit secondary model
  • Release tests: mlx-community/Qwen3-0.6B-8bit → mlx-community/gemma-4-e4b-4bit
  • Updated testbed defaults, benchmark configs, fullstack integration test constants, and KnownModelSizes
  • Updated RUNNER_DESC to reflect actual Blacksmith runner (macos-26 (M4 Blacksmith))

Files changed (9)

  • .github/workflows/ci.yml — Cargo cache + golangci-lint action
  • .github/workflows/integration.yml — Homebrew/Swift/HF caches + model swap
  • .github/workflows/release-swift.yml — HF cache + model swap
  • .github/workflows/threat-model-review.yml — pip cache
  • e2e/testbed/config.go — model defaults + known sizes
  • e2e/testbed/suite.go — fallback model ID
  • e2e/testbed/testbed_test.go — assertion updates
  • e2e/benchmark_test.go — benchmark model configs
  • coordinator/api/fullstack_integration_test.go — testModel constant


Workflow optimizations:
- ci.yml: Add Cargo artifact cache for provider tests (saves 3-8 min/run)
- ci.yml: Replace manual go install with golangci-lint-action v9.2.0
  (pre-built binary download, built-in analysis cache, GitHub annotations)
- integration.yml: Cache Homebrew packages, Swift build artifacts, and
  HuggingFace models via actions/cache (Blacksmith transparent proxy at
  400 MB/s, 25 GB free per repo)
- release-swift.yml: Cache HuggingFace test model
- threat-model-review.yml: Cache pip packages

All caches leverage Blacksmith's transparent cache proxy — zero
Blacksmith-specific actions needed, standard actions/cache is
automatically intercepted and served from co-located storage.

Model migration (Qwen -> Gemma 4):
- Replace mlx-community/Qwen3.5-0.8B-MLX-4bit with gemma-4-e4b-4bit
  as the primary test model across CI, testbed, and benchmarks
- Remove gemma-3-270m-4bit secondary model
- Update KnownModelSizes, testbed defaults, benchmark configs, and
  fullstack integration test constants
- Update RUNNER_DESC to reflect actual Blacksmith runner (M4 macos-26)

Sticky disks (useblacksmith/stickydisk) were evaluated but are
Linux-only (ext4 + block devices); all heavy CI jobs run on macOS
Apple Silicon runners.

vercel Bot commented May 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| d-inference | Ready | Preview | May 18, 2026 10:55pm |
| d-inference-console-ui-dev | Ready | Preview | May 18, 2026 10:55pm |
| d-inference-landing | Ready | Preview | May 18, 2026 10:55pm |

@github-actions

No security-relevant production code changed in this PR; all modifications are CI workflow definitions, test infrastructure, and the threat-model-review automation itself.


Trust boundaries touched

None of the standard TB-xxx boundaries are touched. The changed files are GitHub Actions workflows and test/benchmark helpers that never run in the production coordinator, provider, or console-ui processes.

Threat coverage

No T-xxx threats are affected. The diff contains no changes to authentication, encryption, attestation, billing, or any other security-sensitive path.


New attack surface not covered by an existing threat

Two areas are worth a brief note for the threat model backlog — neither is an immediate finding, but both represent implicit trust decisions:

1. CI workflow permissions and secret exposure (.github/workflows/)
The threat model explicitly marks the release supply chain out of scope, so this is noted purely for completeness. release-swift.yml presumably has access to signing credentials, and the coordinator/api/fullstack_integration_test.go and e2e/ tests run against real coordinator code. If any workflow's GITHUB_TOKEN permission or externally supplied secret (e.g. the release API key, A-005) is broader than needed, it could be abused in a compromised-runner scenario. It is worth auditing each workflow's permissions: block against least privilege, although that audit is outside the scope of this threat model.

2. e2e/testbed/ — testbed config reachability
e2e/testbed/config.go and suite.go configure coordinator endpoints and likely inject test API keys or mock attestation material. If any testbed fixture (fake attestation blob, mock MDM callback, permissive CORS origin) were accidentally reachable in a staging environment that shares production secrets, it could touch TB-005 (unauthenticated MDM webhook) or TB-009 (attestation chain). This is low-risk as written — the files live under e2e/ and are not compiled into production binaries — but the threat model's existing note that api.dev.darkbloom.xyz is out of scope should be confirmed to cover the testbed target URL if one is hardcoded in config.go.

SEC-* findings resolved

None.


🔐 Threat model: docs/threat-model.yaml · Updates on each push to this PR

Comment thread e2e/testbed/config.go
var KnownModelSizes = map[string]string{
"mlx-community/Qwen3.5-0.8B-MLX-4bit": "0.5 GB",
"mlx-community/gemma-3-270m-4bit": "0.2 GB",
"mlx-community/gemma-4-e4b-4bit": "5.2 GB",
Contributor

should we extend the list to also support https://huggingface.co/openai/gpt-oss-20b?

Member Author

heavy to start -- might need 20-30 minutes to download


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a18374fac8

Comment thread e2e/testbed/config.go
  func DefaultModelConfig() ModelConfig {
      return ModelConfig{
-         ModelID: "mlx-community/gemma-3-270m",
+         ModelID: "mlx-community/gemma-4-e4b-4bit",
P1: Keep CI defaults on a Swift-loadable model

When TESTBED_MODEL_ID is not set, this default is what the e2e/profile jobs pass to provider-swift. The Swift provider loads the cache entry with LLMModelFactory.shared.loadContainer, but the pinned libs/mlx-swift-lm submodule at 8d1cbcd does not register the Gemma 4 gemma4 model type, so a fresh CI run that downloads mlx-community/gemma-4-e4b-4bit will fail while loading the model rather than exercising inference. Keep the default on a supported text model, or update the Swift MLX submodule/model loader in the same change.
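
The override-then-default behaviour described in this comment can be sketched as follows. This is a minimal illustration, not the testbed's actual code: `resolveModelID` and `defaultModelID` are hypothetical names, and the sketch assumes an empty `TESTBED_MODEL_ID` is treated the same as an unset one.

```go
package main

import (
	"fmt"
	"os"
)

// defaultModelID mirrors the new default shown in the diff above.
const defaultModelID = "mlx-community/gemma-4-e4b-4bit"

// resolveModelID returns the value of TESTBED_MODEL_ID when it is
// non-empty and falls back to defaultModelID otherwise.
// Hypothetical helper: the real resolution logic in e2e/testbed is not
// shown in this thread. getenv is injected so the function can be
// exercised without mutating the process environment.
func resolveModelID(getenv func(string) string) string {
	if id := getenv("TESTBED_MODEL_ID"); id != "" {
		return id
	}
	return defaultModelID
}

func main() {
	// In CI, the model passed to the provider would be whatever this resolves to.
	fmt.Println(resolveModelID(os.Getenv))
}
```

With a fallback of this shape, any job that leaves `TESTBED_MODEL_ID` unset exercises the default directly, which is why the default needs to be a model the provider can load.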

set -euo pipefail
/tmp/mlxvenv/bin/pip install --quiet --no-cache-dir 'huggingface_hub[cli]'
- /tmp/mlxvenv/bin/hf download mlx-community/Qwen3-0.6B-8bit \
+ /tmp/mlxvenv/bin/hf download mlx-community/gemma-4-e4b-4bit \
P1: Download the model the release tests actually load

The release job still runs provider-swift live tests with DARKBLOOM_LIVE_MLX_TESTS=1, and those fixtures load LiveInferenceFixtures.tinyModelID (mlx-community/Qwen3-0.6B-8bit) rather than this Gemma model. On a fresh runner/cache this step no longer seeds the Qwen snapshot, so the live tests record the model as missing (or only pass because of stale cache state) instead of validating the release binary. Keep downloading the tiny Qwen model, or change the test fixture and supported loader together.
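
One way to catch this kind of mismatch early is a precondition check that compares the models a job seeds against the models its tests load. The sketch below is illustrative only: `missingModels` is a hypothetical helper, and the two lists are filled in from the model IDs named in this review rather than read from the workflow or the Swift fixtures.

```go
package main

import "fmt"

// missingModels returns every ID in required that does not appear in
// seeded, preserving the order of required.
// Hypothetical helper for illustration only.
func missingModels(required, seeded []string) []string {
	have := make(map[string]bool, len(seeded))
	for _, id := range seeded {
		have[id] = true
	}
	var missing []string
	for _, id := range required {
		if !have[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Model the live fixtures load, according to the review comment.
	required := []string{"mlx-community/Qwen3-0.6B-8bit"}
	// Model the release step downloads after this PR.
	seeded := []string{"mlx-community/gemma-4-e4b-4bit"}

	fmt.Println(missingModels(required, seeded))
	// prints [mlx-community/Qwen3-0.6B-8bit]
}
```

A non-empty result would indicate that the download step and the test fixture have drifted apart, which is the situation this comment describes.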

blacksmith-sh Bot commented May 18, 2026

Found 21 test failures on Blacksmith runners:

Failures

  • github.com/eigeninference/d-inference/e2e/TestBenchmark_HeavyLoad_100Concurrent_10KB
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_HighConcurrency
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_ManyUsers
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_MultiModelMultiProvider
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_QueueSaturation
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_SingleModelScaling
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_SingleModelScaling/5-providers
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_SingleProviderNonStreaming
  • github.com/eigeninference/d-inference/e2e/TestBenchmark_SingleProviderStreaming
  • github.com/eigeninference/d-inference/e2e/TestIntegration_AttestationHeaders
  • github.com/eigeninference/d-inference/e2e/TestIntegration_BillingBalanceDeduction
  • github.com/eigeninference/d-inference/e2e/TestIntegration_ConcurrentRequests
  • github.com/eigeninference/d-inference/e2e/TestIntegration_E2EEncryptionCorrectness
  • github.com/eigeninference/d-inference/e2e/TestIntegration_MultipleRequestsAccounting
  • github.com/eigeninference/d-inference/e2e/TestIntegration_NonStreamingInference
  • github.com/eigeninference/d-inference/e2e/TestIntegration_ProviderPayoutSplit
  • github.com/eigeninference/d-inference/e2e/TestIntegration_ReferralRewardDistribution
  • github.com/eigeninference/d-inference/e2e/TestIntegration_StreamingContentValidation
  • github.com/eigeninference/d-inference/e2e/TestIntegration_StreamingInference
  • github.com/eigeninference/d-inference/e2e/TestIntegration_SwiftProviderRealRoutingGates
  • github.com/eigeninference/d-inference/e2e/TestProfile_SingleProviderNonStreaming
