Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,24 @@ All notable changes to this project will be documented here.

## [Unreleased]

### Changed

- **Triage persistOutput is atomic**: Cluster and classification inserts for triage results are now wrapped in a single database transaction. Previously, inserts were issued one at a time — a partial failure left the triage record in an inconsistent state. The new `PersistOutput` method on `TriageStore` commits atomically or rolls back entirely.

- **SMTP, Telegram, and GitHub sends retry on transient failures**: SMTP email sends, Telegram messages, and GitHub commit status posts now retry up to 3 times with exponential backoff on transient errors (5xx, 429 rate limits, connection timeouts). Telegram and GitHub retries respect the `Retry-After` header. Client errors (auth failures, invalid recipients, 4xx) are not retried. The shared `internal/smtptransient` package provides consistent transient-error classification for SMTP across `mail` and `mailer`.

- **Webhook dispatch uses bounded worker pool**: Outbound webhook dispatches are now bounded by a semaphore (default concurrency: 10). Under burst traffic, excess dispatches are dropped with a warning instead of spawning unbounded goroutines.

- **Triage jobs respect server shutdown**: Triage jobs now accept a parent context from the server. When the server shuts down gracefully, in-flight triage jobs are cancelled instead of running for up to 5 minutes after shutdown.

- **LLM retry only on transient errors**: The LLM CLI provider now retries only on transient errors (context deadlines, process signals). Non-transient errors (syntax errors, invalid model names, exit code 1) fail immediately instead of being retried.

- **Invitation emails have HTML alternative part**: Invitation emails are now sent as `multipart/alternative` with both plaintext and HTML parts. The HTML part includes a styled invitation card with an "Accept Invitation" button. Plaintext-only emails frequently landed in spam; the HTML alternative part improves deliverability. The MIME boundary uses `crypto/rand` for uniqueness.

- **Telegram RunURL HTML-escaped**: The `CI_RUN_URL` field in Telegram notifications is now HTML-escaped before insertion into `<a href>` attributes. Previously, a malicious RunURL containing `"` or `>` could break out of the attribute and inject HTML/JavaScript in the Telegram message (XSS). Other fields (repo, branch, commit) were already escaped.

- **invited_by FK allows SET NULL on user deletion**: The `invited_by` foreign key on the `invitations` table now uses `ON DELETE SET NULL` instead of the default restrict. Previously, deleting a user who had invited others would fail with a foreign key violation. Migration 000024 drops the unnamed FK, makes `invited_by` nullable, and adds a named constraint with `ON DELETE SET NULL`. Note: the down migration for 000024 is intentionally irreversible (see the migration SQL for details).

### Added

- **Frontend error boundaries**: A root-level `ErrorBoundary` wraps the entire app in `main.tsx`, a TanStack Router `errorComponent` on the root route catches routing errors, and `ErrorBoundary` wrappers around all Recharts chart sections in `dashboard.tsx` and `analytics.tsx` prevent chart crashes from unmounting the app. The error UI includes both "Try Again" and "Reload" buttons.
Expand Down
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,9 +85,9 @@ export ST_RECONCILE_INTERVAL=60s # How often to check for orphans
export ST_RECONCILE_ORPHAN_TIMEOUT=5m # Grace period before declaring an execution orphaned
```

When `ST_SMTP_HOST` is not set the mailer runs in no-op mode — all outbound email is silently discarded. Set it to enable email notifications.
When `ST_SMTP_HOST` is not set the mailer runs in no-op mode — all outbound email is silently discarded. Set it to enable email notifications. SMTP sends retry up to 3 times with exponential backoff on transient errors (5xx, connection timeout).

When `ST_GITHUB_TOKEN` is not set, GitHub commit status posting is disabled. When set, passing `github_owner`, `github_repo`, and `github_sha` query parameters to `POST /api/v1/reports` will post a `scaledtest/e2e` commit status back to GitHub after the report is ingested.
When `ST_GITHUB_TOKEN` is not set, GitHub commit status posting is disabled. When set, passing `github_owner`, `github_repo`, and `github_sha` query parameters to `POST /api/v1/reports` will post a `scaledtest/e2e` commit status back to GitHub after the report is ingested. GitHub status posts retry up to 3 times with exponential backoff on 429 and 5xx responses, respecting the `Retry-After` header.

When `ST_DISABLE_RATE_LIMIT=true` is set, all rate-limit middleware is bypassed and a warning is logged at startup. Use this only in controlled test environments (e.g. CI running E2E suites with many per-test user registrations). **Never set this in production** — it removes brute-force protection on auth endpoints.

Expand Down Expand Up @@ -515,6 +515,8 @@ internal/
github/ # GitHub commit status client
llm/ # LLM provider abstraction (Anthropic, OpenAI, mock)
mail/ # Email sender interface and SMTP implementation
mailer/ # Invitation email composer (SMTP, multipart HTML)
smtptransient/ # Shared SMTP transient-error classification (4xx/5xx/timeout)
webhook/ # Outbound webhook dispatch
ws/ # WebSocket hub for real-time updates
k8s/ # Kubernetes job management
Expand Down
6 changes: 5 additions & 1 deletion docs/ci-integration/telegram-notifications.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,11 @@ Branch: main Commit: def5678
View run
```

All external fields (`CI_REPO`, `CI_BRANCH`, `CI_COMMIT_MSG`) are HTML-escaped before interpolation to prevent invalid HTML from breaking the Telegram API call.
All external fields (`CI_REPO`, `CI_BRANCH`, `CI_COMMIT_MSG`) and `CI_RUN_URL` are HTML-escaped before interpolation to prevent invalid or malicious content from breaking the Telegram API call or injecting HTML/JavaScript.

## Retry behavior

The Telegram client retries on transient failures (429 rate-limited and 5xx server errors) with exponential backoff (1s, 2s, 4s, up to 3 retries). When the Telegram API returns a 429 response with a `retry_after` parameter, the client respects that value as the backoff duration instead of the default exponential backoff. Client errors (4xx except 429) are not retried.

## Graceful degradation

Expand Down
4 changes: 2 additions & 2 deletions internal/db/db_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,8 @@ func TestMigrationsEmbedded(t *testing.T) {
t.Errorf("migration up/down mismatch: %d up, %d down", ups, downs)
}

if ups != 23 {
t.Errorf("expected 23 migration pairs, got %d", ups)
if ups != 24 {
t.Errorf("expected 24 migration pairs, got %d", ups)
}

// Verify each up has a matching down
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
-- Revert: make invited_by NOT NULL again and remove the FK with SET NULL.
-- Any rows where invited_by was set to NULL by ON DELETE SET NULL must be
-- updated to a sentinel UUID before we can re-add NOT NULL, otherwise the
-- constraint would fail. Since there is no meaningful sentinel user, this
-- migration is intentionally NOT reversible — once deployed, invitations
-- may have invited_by = NULL, and rolling back would silently corrupt data.
-- Do not use this down migration; deploy a new migration instead if reversibility
-- is required.
SELECT 1;
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
-- Make invited_by nullable and set ON DELETE SET NULL so that deleting
-- a user who invited others does not violate the foreign key constraint.
-- Must drop the existing unnamed FK created by migration 000017 before
-- adding the new one with ON DELETE SET NULL; PostgreSQL does not allow
-- two FK constraints on the same column.
ALTER TABLE invitations
DROP CONSTRAINT invitations_invited_by_fkey,
ALTER COLUMN invited_by DROP NOT NULL,
ADD CONSTRAINT fk_invitations_invited_by
FOREIGN KEY (invited_by) REFERENCES users(id) ON DELETE SET NULL;
111 changes: 107 additions & 4 deletions internal/github/github.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,19 @@ import (
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"regexp"
"strconv"
"time"

"github.com/rs/zerolog/log"
)

const defaultMaxRetries = 3

// StatusPoster posts a GitHub commit status.
type StatusPoster interface {
PostStatus(ctx context.Context, owner, repo, sha, state, description, statusContext, targetURL string) error
Expand All @@ -21,6 +27,7 @@ type Client struct {
token string
HTTPClient *http.Client
APIURL string
maxRetries int
}

// New creates a GitHub Client with the given token.
Expand All @@ -33,9 +40,16 @@ func New(token string) *Client {
token: token,
HTTPClient: &http.Client{Timeout: 10 * time.Second},
APIURL: "https://api.github.com",
maxRetries: defaultMaxRetries,
}
}

// WithMaxRetries sets the number of retry attempts for transient errors.
func (c *Client) WithMaxRetries(n int) *Client {
c.maxRetries = n
return c
}

var (
validOwnerRepo = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`)
validSHA = regexp.MustCompile(`^[0-9a-fA-F]{7,40}$`)
Expand All @@ -49,7 +63,9 @@ type statusPayload struct {
}

// PostStatus posts a commit status to the GitHub Statuses API.
// state must be one of "success", "failure", "pending", "error".
// It retries on 429 (rate limited) and 5xx (server error) responses with
// exponential backoff, respecting the Retry-After header on 429 responses.
// Client errors (4xx except 429) are not retried.
func (c *Client) PostStatus(ctx context.Context, owner, repo, sha, state, description, statusContext, targetURL string) error {
if !validOwnerRepo.MatchString(owner) {
return fmt.Errorf("invalid github owner: %q", owner)
Expand All @@ -72,6 +88,45 @@ func (c *Client) PostStatus(ctx context.Context, owner, repo, sha, state, descri
}

url := fmt.Sprintf("%s/repos/%s/%s/statuses/%s", c.APIURL, owner, repo, sha)

var lastErr error
for attempt := 0; attempt <= c.maxRetries; attempt++ {
if ctx.Err() != nil {
return ctx.Err()
}

lastErr = c.doPost(ctx, url, body)
if lastErr == nil {
return nil
}

if !isRetriableError(lastErr) {
return lastErr
}

if attempt < c.maxRetries {
backoff := retryAfterDuration(lastErr)
if backoff == 0 {
backoff = time.Duration(1<<uint(attempt)) * time.Second
}
log.Warn().Err(lastErr).
Int("attempt", attempt+1).
Str("sha", sha).
Dur("backoff", backoff).
Msg("github: retrying PostStatus")

select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(backoff):
}
}
}
return lastErr
}

// doPost executes a single HTTP POST to the GitHub statuses API.
func (c *Client) doPost(ctx context.Context, url string, body []byte) error {
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
if err != nil {
return fmt.Errorf("create request: %w", err)
Expand All @@ -89,8 +144,56 @@ func (c *Client) PostStatus(ctx context.Context, owner, repo, sha, state, descri
defer resp.Body.Close()
io.Copy(io.Discard, resp.Body) //nolint:errcheck

if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return fmt.Errorf("github status API returned %d", resp.StatusCode)
if resp.StatusCode >= 200 && resp.StatusCode < 300 {
return nil
}

if resp.StatusCode == http.StatusTooManyRequests {
retryAfter := resp.Header.Get("Retry-After")
return &retriableError{
statusCode: resp.StatusCode,
retryAfter: retryAfter,
}
}

if resp.StatusCode >= 500 {
return &retriableError{statusCode: resp.StatusCode}
}

return fmt.Errorf("github status API returned %d", resp.StatusCode)
}

// retriableError represents a transient HTTP error that should be retried.
type retriableError struct {
statusCode int
retryAfter string
}

func (e *retriableError) Error() string {
if e.retryAfter != "" {
return fmt.Sprintf("github status API returned %d (Retry-After: %s)", e.statusCode, e.retryAfter)
}
return fmt.Sprintf("github status API returned %d", e.statusCode)
}

// isRetriableError returns true for 429 and 5xx errors.
func isRetriableError(err error) bool {
var re *retriableError
return errors.As(err, &re)
}

// retryAfterDuration extracts the Retry-After duration from a retriableError.
// Returns 0 if not available or parseable.
func retryAfterDuration(err error) time.Duration {
var re *retriableError
if !errors.As(err, &re) {
return 0
}
if re.retryAfter == "" {
return 0
}
if secs, e := strconv.Atoi(re.retryAfter); e == nil {
return time.Duration(secs) * time.Second
}
return nil
return 0
}
Loading
Loading