Merged
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -23,3 +23,4 @@ Thumbs.db
*.swp
*.swo
*~
docs/
2 changes: 1 addition & 1 deletion package.json
@@ -35,7 +35,7 @@
"@mastra/observability": "^1.0.0",
"@mastra/pg": "^1.0.0",
"@mastra/s3": "^0.2.1",
"@nixopus/api-client": "^0.0.15",
"@nixopus/api-client": "0.0.16",
"@openrouter/ai-sdk-provider": "^2.3.0",
"@toon-format/toon": "^2.1.0",
"ai": "^6.0.97",
32 changes: 32 additions & 0 deletions skills/deploy-delegation/SKILL.md
@@ -0,0 +1,32 @@
---
name: deploy-delegation
description: Sub-agent routing table — which agent handles diagnostics, machine health, infrastructure, GitHub, billing, and notifications. Load when the current task is not a direct deployment.
metadata:
version: "1.1"
---

# Delegation

Use the `delegate` tool to route non-deploy tasks to specialized agents. Pass the agent name and a task description with all relevant context.

## Routing Table

| Agent | Handles | Example tasks |
|-------|---------|---------------|
| `diagnostics` | Build errors, crashes, runtime issues | "Investigate why deployment X failed" |
| `machine` | Server health, CPU/RAM, Docker daemon, DNS, backups | "Check server memory usage" |
| `infrastructure` | Domain listing/creation/deletion, containers, healthchecks, server management | "List all domains and their status" |
| `github` | Branches, PRs, file operations | "Create a fix branch and PR for the Dockerfile" |
| `preDeploy` | First-time validation, monorepo assessment | "Run pre-deploy checks on this repository" |
| `notification` | Deploy alerts, channel config | "Send a deploy success notification to Slack" |
| `billing` | Credits, plans, invoices | "Check credit balance" |

## Usage

```
delegate({ agent: "diagnostics", task: "Investigate deployment failure for applicationId=abc-123. Check logs and container state." })
```

Always include relevant identifiers in the task: applicationId, owner, repo, branch. The delegate tool automatically injects context formatting for agents that need it.

Delegation is synchronous — process the result in the same response. If delegation returns an error, try using direct tools instead.
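
As a rough sketch, the routing table can be expressed as a keyword lookup; `routeTask` and its patterns are illustrative assumptions only, not part of the real delegate tool:

```typescript
// Hypothetical routing sketch: agent names mirror the table above,
// the regex keyword lists are invented for illustration.
type Agent =
  | "diagnostics" | "machine" | "infrastructure"
  | "github" | "preDeploy" | "notification" | "billing";

const routes: Record<Agent, RegExp> = {
  diagnostics: /build error|crash|deployment .*fail/i,
  machine: /cpu|ram|memory|docker daemon|dns|backup/i,
  infrastructure: /domain|container|healthcheck/i,
  github: /branch|pull request|pr\b|file/i,
  preDeploy: /pre-deploy|monorepo|validation/i,
  notification: /notify|alert|slack/i,
  billing: /credit|plan|invoice/i,
};

function routeTask(task: string): Agent | null {
  for (const [agent, pattern] of Object.entries(routes) as [Agent, RegExp][]) {
    if (pattern.test(task)) return agent;
  }
  return null; // no match: keep the task, use direct tools instead
}
```

A `null` result corresponds to the direct-deployment case, which this skill is not loaded for.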
41 changes: 41 additions & 0 deletions skills/deploy-flow/SKILL.md
@@ -0,0 +1,41 @@
---
name: deploy-flow
description: Full deploy pipeline — source detection, codebase analysis, project creation, deployment monitoring, and live URL delivery. Load when the user wants to deploy an application.
metadata:
version: "1.0"
---

# Deploy Flow

## Source Detection — FIRST STEP
Check the user message for a [context] block before calling any tool.

source=s3: The context block contains syncTarget (the S3 storage ID for files). It may also contain applicationId if the app already exists.
- Call load_local_workspace with the syncTarget value to load the codebase.
- If applicationId is present: the app exists. Use it for deploy_project.
- If applicationId is absent: this is a first deploy. Analyze the codebase, then createProject (source and repository are set automatically for S3). Deploy the newly created app.
- GitHub connector tools are blocked for S3 sources — do not attempt them.

No context block: Standard GitHub flow — get_github_connectors → get_github_repositories → analyze_repository → explore → createProject → deploy_project.
If get_github_connectors returns empty or has no valid connectors, the user has not connected GitHub yet. Do NOT continue the deploy flow. Instead: read_skill("github-onboarding") and follow the guide.

## Deploy Steps
1. Load codebase.
2. Analyze: read_file, list_directory, grep to find ecosystem, port, Dockerfile, compose, env vars. Base conclusions on actual file contents.
3. If monorepo detected (workspaces, turbo.json, nx.json, apps/ with multiple services): read_skill("monorepo-strategy") for service discovery, dependency ordering, and build context strategy.
4. If no Dockerfile: load read_skill("dockerfile-generation") and the matching ecosystem skill (e.g. read_skill("node-deploy")). Also read_skill("dockerignore-generation") if no .dockerignore exists. For static sites needing Caddy config, read_skill("caddyfile-generation"). Generate and save with write_workspace_files. For s3 sources, write_workspace_files is enough — files sync automatically. For GitHub sources, push via branch + PR when required by policy.
5. If the app has database migrations (Prisma, TypeORM, Django, Alembic, etc.): read_skill("database-migration") to determine how to run migrations during deployment.
6. Generate domain BEFORE createProject. Call generate_random_subdomain to get a subdomain. Pass it in the `domains` array when calling createProject in the next step — this attaches the domain at creation time and avoids extra tool calls later. For custom domains, read_skill("domain-attachment").
7. createProject (if app doesn't exist) — pass `domains: ["<generated-subdomain>"]` to attach the domain at creation. Then call deploy_project. For compose: pass compose_services and compose_domains.
8. Monitor deployment — mandatory but lean. Call getApplicationDeployments(limit=1) once to get the deployment ID. Then poll getDeploymentById only — do NOT call getDeploymentLogs unless the status is failed/error or the user asks. One poll is enough if the build is fast; for slow builds, poll at most 2-3 times. Do NOT call getApplication after deploy — you already have the app details from createProject.
9. Verify: read_skill("post-deploy-verification") and run the verification checklist.
10. Share the live URL clearly and explicitly once the app is verified reachable.
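
The monitoring budget in step 8 can be sketched as a couple of pure helpers; the status names and helper functions are assumptions, not the real deployment API:

```typescript
// Illustrative sketch of the lean polling rules from step 8.
type DeployStatus = "queued" | "building" | "running" | "failed" | "error";

const MAX_POLLS = 3; // poll getDeploymentById at most 2-3 times

// Stop polling once the deployment reaches a terminal state.
function isTerminal(status: DeployStatus): boolean {
  return status === "running" || status === "failed" || status === "error";
}

// Only fetch logs on failure, or when the user explicitly asks.
function shouldFetchLogs(status: DeployStatus, userAsked = false): boolean {
  return userAsked || status === "failed" || status === "error";
}
```

The point of the sketch is the budget: one deployments lookup, a bounded number of status polls, and no log fetches on the happy path.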

## Rules
- Use createProject to create apps. create_application does not exist.
- Check getApplications before creating to avoid duplicates.
- Never hardcode secrets. Use updateApplication for env vars.
- If the user asks to deploy, do not stop at planning, explanation, or diagnosis. Execute the deployment flow unless blocked by missing credentials, missing access, or required secrets.
- "Would you like me to fix this?" is a failure mode when the fix is obvious. Fix it.
- If an operation is async, keep polling and keep the user updated until there is a terminal outcome. Do not abandon the flow with promises of future follow-up.
- If you create a PR, include the URL in your reply. If it failed, say what failed.
21 changes: 21 additions & 0 deletions skills/diagnostic-workflow/SKILL.md
@@ -0,0 +1,21 @@
---
name: diagnostic-workflow
description: Layer-by-layer diagnostic workflow for application and container issues — deployment logs, container state, HTTP probes. Load when investigating a deployment failure or runtime issue.
metadata:
version: "1.0"
---

# Diagnostic Workflow

## Diagnostic Layers (IN ORDER, stop on root cause)
1. get_application_deployments — check deployment history and status
2. get_deployment_logs — read build and deploy logs for errors
3. list_containers → search_tools("container logs") → load needed tools
4. get_container_logs — check container runtime output
5. search_tools("http probe") → http_probe public URL

If the issue appears application-level, work through the log layers in order. For container-level resource issues, or anything server-level (CPU, RAM, disk, Docker daemon, DNS, proxy, or domain/TLS), defer to the Machine Agent, which has host_exec.

Match log output against the pattern tables in the failure-diagnosis skill before hypothesizing. Tool 404 → skip layer. Root cause: bold summary, evidence in code block, fix in 1-2 sentences.
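
The layer ordering can be sketched as a stop-on-first-finding loop; the `Finding` shape and checker signatures are illustrative assumptions, not the real tool interfaces:

```typescript
// Hypothetical sketch: each checker stands in for one diagnostic layer above.
type Finding = { layer: string; rootCause: string } | null;

function runLayers(checks: Array<() => Finding>): Finding {
  for (const check of checks) {
    let finding: Finding;
    try {
      finding = check();
    } catch {
      continue; // Tool 404 → skip layer
    }
    if (finding) return finding; // stop on root cause
  }
  return null; // no app-level cause found: defer to the Machine Agent
}
```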
29 changes: 13 additions & 16 deletions skills/domain-attachment/SKILL.md
@@ -1,32 +1,29 @@
---
name: domain-attachment
description: Attach a domain to an application after deployment. Covers auto-generated subdomains, existing domain selection, and custom domain setup with DNS configuration and verification.
description: Domain setup for applications. Preferred path is passing domains at creation time via createProject. Falls back to add_application_domain for post-creation attachment or custom domains.
metadata:
version: "1.0"
version: "1.1"
---

# Domain Attachment

After a successful deployment, the app needs a domain to be reachable. Follow this flow.
## Preferred: Pass Domain at Creation Time

## Step 1: Gather Options
The fastest path — zero extra tool calls after project creation:

- Call `get_domains` to list available domains.
- Call `generate_random_subdomain` to get a subdomain suggestion.
1. Call `generate_random_subdomain` to get a subdomain.
2. Pass `domains: ["<subdomain>"]` in the `createProject` call.
3. Done — the domain is attached at creation, wildcard DNS and TLS are automatic.

## Step 2: Present Options
Use this path for all standard deploys. Only fall back to the post-creation flow below when adding domains to an existing app or when the user wants a custom domain.

- If only one domain exists, attach it automatically with `add_application_domain`.
- If multiple domains exist, present the list and ask the user which to use.
- Always offer the generated subdomain as a quick option (works immediately, no DNS setup needed).
- Ask if the user wants to use a custom domain instead.
## Post-Creation: Auto-Generated Subdomain

## Step 3: Auto-Generated Subdomain (fast path)
If the app already exists and has no domain:

If the user picks the generated subdomain:

1. Call `add_application_domain` with the subdomain.
2. Done — wildcard DNS and TLS are handled automatically.
1. Call `generate_random_subdomain`.
2. Call `add_application_domain` with id (app UUID) and the subdomain.
3. Done — wildcard DNS and TLS are handled automatically.

## Step 4: Custom Domain Setup

30 changes: 30 additions & 0 deletions skills/github-workflow/SKILL.md
@@ -0,0 +1,30 @@
---
name: github-workflow
description: Fix-via-PR workflow, file operations, connector resolution, and GitHub safety rules. Load when performing GitHub operations like creating branches, PRs, or file changes.
metadata:
version: "1.0"
---

# GitHub Workflow

## Connector Resolution
When a connectorId is provided in the delegation message, use that connector_id when calling get_github_repositories to list repos from the correct GitHub account. If no connectorId is provided and there are multiple connectors, use get_github_connectors to list them and pick the first one with valid credentials, then use its ID for get_github_repositories.

## File Write Capabilities
- github_create_or_update_file: Create or update a single file. To update, first read the file with github_get_repo_file to get its current sha, then pass that sha. To create a new file, omit sha.
- github_create_branch: Create a new branch from a commit SHA. Use get_github_repository_branches to find the source branch HEAD SHA.
- github_create_pull_request: Open a PR from a head branch into a base branch.

## Fix-via-PR Flow
When asked to fix a file in a repo:
1. Call github_get_branch with the default branch name (e.g. "main") to get its HEAD commit SHA.
2. Create a fix branch with github_create_branch using that SHA (e.g. branch name "nixopus/fix-dockerfile").
3. Read the file to fix with github_get_repo_file on the default branch to get its content and blob sha.
4. Write the fixed file with github_create_or_update_file targeting the fix branch, passing the blob sha from step 3.
5. Create a PR with github_create_pull_request from the fix branch into the default branch.
Return the PR URL, PR number, and fix branch name to the parent agent in your final message. Never say work is "underway" or that you will send the link later.
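
The five steps can be sketched as one sequence; the `GitHubOps` interface below is a hypothetical stand-in for the github_* tools, and its parameter shapes are assumptions:

```typescript
// Hypothetical interface: stands in for github_get_branch, github_create_branch,
// github_get_repo_file, github_create_or_update_file, github_create_pull_request.
interface GitHubOps {
  getBranch(name: string): { sha: string };
  createBranch(name: string, fromSha: string): void;
  getRepoFile(path: string, ref: string): { content: string; sha: string };
  createOrUpdateFile(path: string, content: string, branch: string, sha?: string): void;
  createPullRequest(head: string, base: string): { url: string; number: number };
}

function fixViaPr(gh: GitHubOps, path: string, fixedContent: string, base = "main") {
  const head = gh.getBranch(base);                                // 1. HEAD SHA of default branch
  const fixBranch = "nixopus/fix-dockerfile";                     // 2. fix branch from that SHA
  gh.createBranch(fixBranch, head.sha);
  const file = gh.getRepoFile(path, base);                        // 3. current content + blob sha
  gh.createOrUpdateFile(path, fixedContent, fixBranch, file.sha); // 4. write fix to the fix branch
  const pr = gh.createPullRequest(fixBranch, base);               // 5. open PR into default branch
  return { url: pr.url, number: pr.number, branch: fixBranch };   // report back, never "underway"
}
```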

## GitHub Safety
- Never commit/push to main. Always branch → PR.
- Never merge PRs unless user explicitly requests. Return PR URL.
- No destructive ops (force push, branch delete, PR close) without user approval.
14 changes: 14 additions & 0 deletions skills/incident-response/SKILL.md
@@ -131,3 +131,17 @@ If the fix PR is merged and a new deployment triggers:
- **`failure-diagnosis`** — Pattern tables for identifying root causes
- **`rollback-strategy`** — When to rollback vs fix forward
- **`post-deploy-verification`** — Verify fix worked after merge

## Event Context

Your prompt contains the full incident context formatted by the event pipeline. This includes the event type, source details, error information, and any relevant identifiers (application, deployment, repository, etc.). Use all provided context to drive your investigation.

## Safety Rules

- Never merge PRs. Always return the PR URL for user approval.
- Never push to main/master. Always create a fix branch.
- If you cannot determine the root cause, notify the user with what you found and stop.
- Do not retry the same fix more than once; attempt at most 3 distinct auto-fixes per incident before escalating.
- Include all relevant context identifiers when delegating to diagnostics or github.
- After delegation returns, immediately process the result. Never say work is "underway".
- Every response must end with concrete information or a completed action.
43 changes: 43 additions & 0 deletions skills/machine-ops/SKILL.md
@@ -0,0 +1,43 @@
---
name: machine-ops
description: Machine-level diagnostic layers, lifecycle management (restart/pause/resume), metrics analysis, and backup operations. Load when investigating server health or managing machine state.
metadata:
version: "1.0"
---

# Machine Operations

## Lifecycle Management
You can check and control the machine instance state:
- get_machine_lifecycle_status → current state (Running, Paused, Stopped), PID, uptime
- restart_machine → restart the instance (requires user approval)
- pause_machine → pause the instance (requires user approval)
- resume_machine → resume a paused instance (requires user approval)

Always check get_machine_lifecycle_status before performing restart/pause/resume.

## Metrics & Events
- get_machine_metrics → historical time-series metrics (CPU, memory, disk, network)
- get_machine_metrics_summary → summarized averages, peaks, and trends
- get_machine_events → lifecycle events (restarts, failures, state changes)

Use metrics for trend analysis and incident correlation. Use get_machine_stats for a point-in-time snapshot.

## Backups
- get_backup_schedule → current backup schedule configuration
- update_backup_schedule → modify backup frequency, retention, timing
- list_machine_backups → list available backups with timestamps and status
- trigger_machine_backup → create an immediate backup (requires approval)

## Diagnostic Layers (IN ORDER, stop on root cause)
1. get_servers_ssh_status → reachable?
2. get_machine_stats → CPU, RAM, disk, load, uptime
3. Anomalies:
   - mem > 90% → host_exec "ps aux --sort=-%mem | head -20"
   - disk > 85% → host_exec "du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10"
   - cpu > 80% → host_exec "ps aux --sort=-%cpu | head -20"
   - load > 2x cores → overloaded
4. Docker → host_exec "systemctl status docker --no-pager", "docker info 2>&1 | head -30"
5. System logs → host_exec "dmesg | tail -30", "journalctl -u docker --since '30 min ago' --no-pager | tail -50"
6. Proxy/domain: follow domain-tls-routing skill. Caddy status/logs/validate via host_exec. For domain CRUD or reachability checks, defer to Infrastructure Agent.
7. Network → host_exec "ss -tlnp"
8. Cleanup → host_exec "docker system df"

Root cause: bold summary, evidence in code block, fix in 1-2 sentences.
No anomalies: report healthy with key metrics.
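
The layer-3 anomaly thresholds can be sketched as a pure mapping from stats to follow-up commands; the `Stats` field names are assumptions, not the real get_machine_stats schema:

```typescript
// Hypothetical sketch: thresholds and commands mirror diagnostic layer 3 above.
interface Stats {
  memPct: number;  // memory usage %
  diskPct: number; // disk usage %
  cpuPct: number;  // CPU usage %
  load: number;    // 1-min load average
  cores: number;   // CPU core count
}

function anomalyCommands(s: Stats): string[] {
  const cmds: string[] = [];
  if (s.memPct > 90) cmds.push("ps aux --sort=-%mem | head -20");
  if (s.diskPct > 85) cmds.push("du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10");
  if (s.cpuPct > 80) cmds.push("ps aux --sort=-%cpu | head -20");
  return cmds;
}

// load > 2x cores has no single follow-up command: the machine is overloaded.
const isOverloaded = (s: Stats): boolean => s.load > 2 * s.cores;
```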
17 changes: 17 additions & 0 deletions skills/mcp-integrations/SKILL.md
@@ -0,0 +1,17 @@
---
name: mcp-integrations
description: MCP server discovery, tool invocation, and provider catalog integration. Load when a task involves external services, third-party tools, or when the user asks about MCP servers.
metadata:
version: "1.0"
---

# MCP Integrations

When a task involves external services, third-party tools, or capabilities beyond core Nixopus (e.g. databases, monitoring, CI/CD, analytics, logging, storage, auth providers), proactively check whether an MCP integration can help:

1. Use search_tools with "mcp" to load MCP tools.
2. Call discover_mcp_tools to list tools from all enabled MCP servers. Each tool entry includes server_id, tool name, description, and inputSchema.
3. Call call_mcp_tool to invoke a specific tool: pass server_id (UUID from discover_mcp_tools), tool_name (exact name string), and arguments (a JSON object matching the tool's inputSchema — use proper types: strings, numbers, booleans, not everything as strings).
4. If no relevant integration exists, call list_mcp_provider_catalog to show what integrations the user can enable.

Also use these tools when the user explicitly asks about MCP servers — list, add, update, delete, or test connections.
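
A call_mcp_tool invocation with properly typed arguments might look like this; the server_id UUID, tool name, and argument fields are made-up examples, not a real catalog entry:

```typescript
// Illustrative request shape only; all values are invented for the example.
const request = {
  server_id: "3f2b1c9e-8a7d-4e5f-9c2b-1a2b3c4d5e6f", // UUID from discover_mcp_tools
  tool_name: "query_database",                        // exact tool name string
  arguments: {
    query: "SELECT 1", // string stays a string
    limit: 10,         // number, not "10"
    explain: false,    // boolean, not "false"
  },
};
```

The key point from step 3: arguments must match the tool's inputSchema types, so numbers and booleans are passed as such rather than quoted.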
7 changes: 7 additions & 0 deletions skills/pre-deploy-checklist/SKILL.md
@@ -79,3 +79,10 @@ Report as a table:
| Migrations | PASS/WARN/N/A | Migration tool and command |

Only block deployment (report FAIL) for checks 1-4. Checks 5-8 are warnings that should be reported but don't block.

## Summary Format
Report the checklist table, then:
**Ready**: what looks good
**Warnings**: non-critical issues
**Blockers**: must fix before deploy
**Recommendations**: specific fixes with code blocks
16 changes: 16 additions & 0 deletions skills/self-heal/SKILL.md
@@ -0,0 +1,16 @@
---
name: self-heal
description: Self-healing loop for failed deployments — diagnose, fix, redeploy up to 3 attempts, then escalate or rollback. Load when a deployment fails or build errors occur.
metadata:
version: "1.0"
---

# Self-Heal

## Flow (max 3 attempts)
On build_failed: get_deployment_logs → diagnose → write fix → push via branch+PR if needed → redeploy → resume monitoring.
- Do not stop to ask the user unless the fix is ambiguous or requires credentials you do not have.
- After each failed attempt, tell the user what broke and what you are trying next.
- Maximum 3 self-heal attempts.
- After 3 failures, read_skill("rollback-strategy") to decide whether to rollback or escalate.
- If escalation is required, tell the user plainly that you could not complete the deployment automatically and that a Nixopus team member will reach out shortly to help finish it.
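
The attempt budget above can be sketched as a small loop; `attemptFix` is a stand-in for the diagnose → fix → redeploy cycle, not a real tool:

```typescript
// Hypothetical sketch of the 3-attempt self-heal budget.
function selfHeal(attemptFix: () => boolean): "healed" | "escalate" {
  const MAX_ATTEMPTS = 3;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    if (attemptFix()) return "healed";
    // after each failure: tell the user what broke and what is tried next
  }
  return "escalate"; // read_skill("rollback-strategy"), then rollback or escalate
}
```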
2 changes: 1 addition & 1 deletion src/config.ts
@@ -11,7 +11,7 @@ export const config = {

redisUrl: process.env.REDIS_URL || '',

logLevel: (process.env.LOG_LEVEL || 'info') as 'debug' | 'info' | 'warn' | 'error',
logLevel: (process.env.LOG_LEVEL || (process.env.NODE_ENV === 'production' ? 'info' : 'debug')) as 'debug' | 'info' | 'warn' | 'error',
logName: process.env.LOG_NAME || 'Agent',

observabilityEnabled: process.env.OBSERVABILITY_ENABLED !== 'false',