@wiseflat
Owner

The main improvements added to this PR are:

  • Add a playbook to handle NVIDIA GPUs
  • Test vLLM and deploy an LLM endpoint
  • Add new exporters (vLLM, NVIDIA)
  • Further role changes, detailed below

The Promtail role now always executes its setup tasks and the config template
conditionally includes a Loki `remote_write` client when `loki_remote_write`
is defined, enabling log pushing to a remote Loki instance. Additionally,
the `force` parameter in the download task was changed from the string
"no" to the boolean `false`, so the module receives a native boolean.
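
The two Promtail changes could look roughly like the sketch below. It assumes `loki_remote_write` holds the push URL of the remote Loki instance; the task name, `promtail_download_url`, and the destination path are hypothetical and not taken from the role.

```yaml
# Download task: `force` is a native boolean instead of the string "no".
- name: Download Promtail release archive   # hypothetical task name
  ansible.builtin.get_url:
    url: "{{ promtail_download_url }}"      # hypothetical variable
    dest: /tmp/promtail.zip                 # hypothetical path
    mode: "0644"
    force: false

# Inside the Promtail config template, the remote client would sit behind a
# Jinja2 guard along these lines (assumed shape, not the actual template):
#   clients:
#   {% if loki_remote_write is defined %}
#     - url: {{ loki_remote_write }}
#   {% endif %}
```

With the guard in place, hosts that do not define `loki_remote_write` render a config without the remote client.
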

Switch the coredns role to use official CoreDNS releases instead of building
from source, create a dedicated system user and group, and update the
Corefile template to obtain the Nomad management token from the primary
Nomad master node. Documentation is extended with a Nomad cluster mode
section and the golang role is removed from the playbook.

Add a dedicated entrypoint for Traefik's metrics at port 8081 and update
the Prometheus scrape configuration to rewrite the target address.
This separates metric traffic from the main HTTPS entrypoint.
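
One way to express this split is sketched below. The entrypoint name `metrics`, the job name, and the example target are assumptions; only port 8081 and the address rewrite come from the description above.

```yaml
# Traefik static configuration: a separate entrypoint serving only metrics.
entryPoints:
  metrics:                      # hypothetical entrypoint name
    address: ":8081"

metrics:
  prometheus:
    entryPoint: metrics
---
# Prometheus scrape configuration: rewrite the target address to port 8081.
scrape_configs:
  - job_name: traefik           # hypothetical job name
    static_configs:
      - targets: ["traefik.example.internal:443"]   # hypothetical target
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'     # capture the host, drop any port
        target_label: __address__
        replacement: '${1}:8081'
```

The relabel rule keeps the discovered host and swaps only the port, so the target list does not need to know about the metrics entrypoint.
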

Replace the static Nomad address with a Jinja2 expression that pulls the address from Ansible hostvars, defaulting to 127.0.0.1. This makes the Traefik configuration dynamic and adaptable to different environments.
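
A templated Nomad provider block could look like this sketch. The hostvars key `nomad_address`, the lookup on `inventory_hostname`, the `http` scheme, and port 4646 (Nomad's default HTTP port) are assumptions; the actual expression in the role may differ.

```yaml
providers:
  nomad:
    endpoint:
      # Falls back to 127.0.0.1 when the host defines no `nomad_address`.
      address: "http://{{ hostvars[inventory_hostname]['nomad_address'] | default('127.0.0.1') }}:4646"
```
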

Introduce a new Ansible role to install and manage the NVIDIA GPU Exporter. The role includes defaults (disabled by default), handlers to restart the service, build tasks that download and install the binary, systemd service template, and upstream variable handling for version detection. This enables optional deployment of the exporter on hosts with NVIDIA GPUs.
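
The opt-in default and the restart handler might be laid out as follows. The variable name `nvidia_gpu_exporter_enabled` and the unit name `nvidia_gpu_exporter` are hypothetical, chosen only for illustration.

```yaml
# defaults/main.yml: the exporter is off unless a host opts in.
nvidia_gpu_exporter_enabled: false      # hypothetical variable name
---
# handlers/main.yml: restart the service after a binary or unit change.
- name: Restart nvidia_gpu_exporter
  ansible.builtin.systemd:
    name: nvidia_gpu_exporter           # hypothetical unit name
    state: restarted
    daemon_reload: true
```

Hosts with NVIDIA GPUs would then set the flag to `true` in their group or host variables.
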

- Convert quoted string booleans to native boolean values across defaults and templates.
- Add dynamic TLS SAN IP range generation and expose it via `nomad_tls_ip_range`.
- Enable Docker private registry support and simplify Docker TLS handling.
- Restructure certificate copy tasks to use loops for server and client nodes.
- Comment out the S3 storage plugin job templates and the associated handler flush.
- Disable CNI installation task and update related conditionals.
- Update various template files to render booleans in lowercase (`true`/`false`).
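
The boolean conversion and the loop-based certificate copy can be sketched as below. All variable names, file names, and paths here are hypothetical; only the pattern (native booleans, lowercase rendering, `loop` over certificate files) reflects the list above.

```yaml
# defaults/main.yml: native boolean instead of a quoted string.
# Before: nomad_tls_enabled: "yes"
nomad_tls_enabled: true                 # hypothetical variable name
# In a template, `{{ nomad_tls_enabled | lower }}` renders as `true`,
# whereas the bare variable would render as `True`.
---
# tasks: one looped task instead of one copy task per certificate file.
- name: Copy TLS certificates to server nodes
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: "/etc/nomad.d/tls/{{ item | basename }}"   # hypothetical path
    mode: "0640"
  loop:
    - certs/nomad-agent-ca.pem          # hypothetical file names
    - certs/server.pem
    - certs/server-key.pem
  when: nomad_node_role == "server"     # hypothetical variable
```

A matching task for client nodes would loop over the client certificate files in the same way.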

BREAKING CHANGE: S3 storage plugin is disabled by default and boolean handling has changed; existing playbooks or roles that relied on the previous string representations or S3 plugin may need adjustment.

@wiseflat merged commit 32aa4af into main on Oct 21, 2025
3 checks passed
@wiseflat deleted the dev/mgarcia/nomad-autoscaler branch on October 29, 2025 22:42