Kubernetes AI/ML Introspector for vLLM Deployments
One command. See which GPUs are leaking money and why.
🔒 Read-only — no agents, no sidecars, nothing installed permanently. Runs as a Job, prints results, exits.
Features • Quick Start • Commands • Output Formats • Installation
PIQC (Production Inference Quality Control) is a read-only Kubernetes-native tool that discovers AI/ML inference deployments, measures their efficiency, and surfaces the dollar cost of idle and unallocated GPUs — in a single command.
Nothing is installed permanently. PIQC runs as a Kubernetes Job using a scoped read-only service account. It collects data, prints the report, and exits. No agents, no sidecars, no cluster modifications.
piqc scan
No flags needed. PIQC connects to your current kubectl context, scans all namespaces, and immediately prints a cost report showing which models are running, at what efficiency (MFU), what they cost per 1K tokens, and how much GPU spend is being wasted today.
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ 🔍 PIQC Scan Flow │
│ │
│ ┌─────────┐ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ K8s │────▶│ Discovery & │────▶│ Collect │────▶│ Generate │ │
│ │ Cluster │ │ Detection │ │ Metrics │ │ ModelSpec │ │
│ └─────────┘ └──────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ • Scans all namespaces • GPU metrics via nvidia-smi │
│ • Detects vLLM workloads • Runtime metrics via vLLM API │
│ • Weighted confidence scoring • KV cache, latency, throughput │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
- Auto-Detection: Automatically discovers vLLM inference deployments across all namespaces
- Weighted Confidence Scoring: Uses multiple signals (images, env vars, CLI args, labels) with weighted scoring
- Framework Detection: Identifies vLLM with high accuracy using pattern matching and heuristics
- GPU Metrics: Real-time GPU utilization, memory, temperature, and power via nvidia-smi
- Runtime Metrics: Collects vLLM API metrics including:
- Request latency (P50, P95, P99)
- Token throughput (prefill & decode)
- KV cache utilization
- Queue depth and active requests
- Health status
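The weighted confidence scoring described above can be sketched in a few lines. This is an illustrative sketch only — the signal names, weights, and `detection_confidence` helper are assumptions, not PIQC's actual values:

```python
# Illustrative weighted confidence scoring (weights are hypothetical,
# not PIQC's actual values).
SIGNAL_WEIGHTS = {
    "image": 0.40,     # container image name contains "vllm"
    "cli_args": 0.30,  # command line carries vLLM-specific flags
    "env_vars": 0.20,  # VLLM_* environment variables present
    "labels": 0.10,    # pod labels mention the framework
}

def detection_confidence(signals: dict[str, bool]) -> float:
    """Sum the weights of all matched signals, capped at 1.0."""
    score = sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
    return round(min(score, 1.0), 2)

# A pod matching on image and CLI args, but nothing else:
print(detection_confidence({"image": True, "cli_args": True}))  # 0.7
```

A multi-signal scheme like this degrades gracefully: a missing label or env var lowers confidence instead of causing a hard miss.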
- GPU underutilization — Deployments with utilization below the 60% threshold, with dollar waste shown per day and annualized
- Dark capacity — GPU nodes with no pods scheduled (paying for nodes sitting empty)
- Tier misplacement — Models running on a GPU tier beyond what their parameter count requires (e.g. a 7B model on an H100), with estimated cost delta per day
- Fragmentation — Nodes with free GPU slots too small to fit any running model — slots that remain stranded until the cluster is rebalanced
- Pending GPU pods — Active workloads blocked from scheduling due to insufficient contiguous GPU slots, shown with how long they have been waiting
- Cost Summary panel — Total spend rate, all three leak categories, and total estimated leak per day / per year
- MFU (Model FLOPS Utilization) — Observed compute vs. theoretical GPU peak per deployment
- Cost per 1K tokens — Translate GPU spend into a business metric comparable to API pricing
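The arithmetic behind these cost figures is simple. A minimal sketch of the formulas (function names are hypothetical; the idle-waste formula assumes waste scales linearly with unused utilization, which may differ from PIQC's exact accounting):

```python
def idle_waste_per_day(rate_per_hr: float, gpu_util: float) -> float:
    """Dollars/day spent while GPUs sit below full utilization."""
    return rate_per_hr * 24 * (1 - gpu_util)

def mfu(observed_tflops: float, peak_tflops: float) -> float:
    """Model FLOPS Utilization: achieved compute vs. theoretical peak."""
    return observed_tflops / peak_tflops

def cost_per_1k_tokens(rate_per_hr: float, tokens_per_sec: float) -> float:
    """GPU $/hr translated into $/1K generated tokens."""
    tokens_per_hr = tokens_per_sec * 3600
    return rate_per_hr / tokens_per_hr * 1000

# An 8xH100 deployment at $68/hr running at 4% utilization:
print(f"${idle_waste_per_day(68.00, 0.04):,.2f}/day idle")  # $1,566.72/day idle
```

Cost per 1K tokens is the bridge metric: it lets a self-hosted deployment's GPU bill be compared directly against per-token API pricing.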
| Format | Description |
|---|---|
| Table | Cost report with MFU, $/1K tokens, idle waste (default) |
| YAML | Kubernetes-style ModelSpec files |
| JSON | Machine-readable JSON output |
| PIQC Facts | Standardized facts bundle for quality assessment |
- Parallel Processing: Multi-threaded scanning with configurable workers
- RBAC Support: Pre-configured ClusterRole and ServiceAccount manifests
- Flexible Modes: Auto-detect, remote (kubeconfig), or in-cluster execution
- Timeout Controls: Configurable operation timeouts
- Docker Image: Pre-built multi-platform image (linux/amd64 + linux/arm64) available on GitHub Container Registry — no install required
- 🔴 AMD GPU Support — Support for AMD Instinct and Radeon GPUs
- 🌐 LLM-D (LLM-Distributed) — Discovery and documentation for distributed LLM inference
The simplest way — runs inside your cluster with no Docker auth or kubeconfig wrangling:
Step 1 — Apply RBAC permissions (one-time setup):
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml

Step 2 — Run the scan:

kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/scan-job.yaml

Step 3 — View the output:

kubectl logs -f job/piqc-scan -n kube-system

Clean up when done:

kubectl delete job piqc-scan -n kube-system

The job auto-deletes itself after 10 minutes (ttlSecondsAfterFinished: 600).
For laptops and CI pipelines. Requires exporting a static kubeconfig first (avoids cloud auth plugin issues):
# Export a static kubeconfig with embedded credentials
kubectl config view --raw --flatten > /tmp/piqc-kubeconfig.yaml
# Run the scan
docker run --rm \
-v /tmp/piqc-kubeconfig.yaml:/root/.kube/config \
ghcr.io/paralleliq/piqc:latest \
  scan --format table

The image supports both linux/amd64 and linux/arm64.
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install
poetry run piqc scan --format table

# Verify cluster connectivity and permissions
piqc test-connection

# Scan entire cluster with console table output
piqc scan --format table
# Scan and generate YAML ModelSpec files
piqc scan --format yaml -o ./output
# Scan with runtime metrics from vLLM API
piqc scan --collect-runtime --format json

ModelSpec Introspector v1.0.0
========================================
[INFO] Connecting to cluster...
Context: my-k8s-context
Cluster: my-cluster
[INFO] Scanning namespaces...
Discovered: 12 namespace(s)
[INFO] Detecting inference workloads...
Pods analyzed: 47
Inference deployments found: 3
Framework Distribution:
┃ Framework ┃ Count ┃
├───────────┼───────┤
│ vllm │ 3 │
[INFO] Scan completed in 8.2s
Scan Kubernetes cluster for vLLM model deployments and generate ModelSpec documentation.
piqc scan [OPTIONS]

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
| `-n, --namespace TEXT` | all | Specific namespace to scan |
| `--format [yaml\|json\|table]` | `yaml` | Output format |
| `-o, --output PATH` | `./output` | Output directory for generated files |
| Option | Default | Description |
|---|---|---|
| `--collect-runtime` | false | Collect runtime metrics via vLLM API |
| `--no-exec` | false | Disable pod exec (skip GPU metrics) |
| `--no-logs` | false | Disable log reading |
| `--aggregate/--no-aggregate` | aggregate | Aggregate metrics across pod replicas |
| `--contribute-benchmarks` | false | Contribute anonymized GPU/model performance data to the ParallelIQ benchmark dataset |
| Option | Default | Description |
|---|---|---|
| `--combined` | false | Generate single combined output file |
| `--output-piqc` | false | Generate piqc-facts.json (PIQC v0.1 schema) |
| Option | Default | Description |
|---|---|---|
| `--timeout INT` | 30 | Operation timeout in seconds |
| `--workers INT` | 10 | Number of parallel workers |
| `--mode [auto\|remote\|incluster\|dry-run]` | auto | Execution mode |
| `-v, --verbose` | false | Enable verbose output |
| `--debug` | false | Enable debug mode with detailed trace |
# Basic scan - discover all vLLM deployments
piqc scan
# Scan specific namespace with JSON output
piqc scan -n production --format json
# Quick scan without GPU metrics (faster)
piqc scan --no-exec
# Collect runtime metrics from vLLM API
piqc scan --collect-runtime
# Generate PIQC facts bundle for quality assessment
piqc scan --output-piqc -o ./facts
# Combined output file instead of per-deployment files
piqc scan --combined -o ./output
# Table output to console (human-readable)
piqc scan --format table
# Custom kubeconfig and context
piqc scan --kubeconfig /path/to/config --context my-cluster
# Disable metric aggregation across replicas
piqc scan --no-aggregate
# Full verbose debug mode
piqc scan -v --debug
# Contribute anonymized GPU/model benchmarks to ParallelIQ dataset
piqc scan --contribute-benchmarks
# Preview what benchmark data would be sent (no identifying info)
piqc scan --contribute-benchmarks --verbose

Test connection to Kubernetes cluster and verify required permissions.

piqc test-connection [OPTIONS]

| Option | Default | Description |
|---|---|---|
| `--kubeconfig PATH` | `~/.kube/config` | Path to kubeconfig file |
| `--context TEXT` | current | Kubernetes context to use |
ModelSpec Introspector v1.0.0
========================================
[INFO] Testing cluster connection...
Connection successful
Context: my-context
Cluster: my-cluster
[INFO] Testing namespace access...
Accessible namespaces: 15
All checks passed
Display version information.
piqc version
# Output: ModelSpec Introspector v1.0.0

Generates individual Kubernetes-style YAML files for each deployment:
apiVersion: modelspec/v1
kind: ModelSpec
metadata:
name: vllm-llama-7b
namespace: inference
collectionTimestamp: "2024-01-07T12:00:00Z"
collectorVersion: "1.0.0"
model:
name: meta-llama/Llama-2-7b-hf
architecture: llama
parameters: "7B"
identificationConfidence: 0.95
engine:
name: vllm
version: "0.4.0"
detectionConfidence: 0.95
inference:
precision: float16
tensorParallelSize: 4
maxModelLen: 4096
gpuMemoryUtilization: 0.90
resources:
replicas: 2
gpuCount: 4
gpus:
- type: A100-SXM4-80GB
memoryTotal: "80GB"
utilization: 87
memoryUsed: 72000
runtimeState:
vllm:
healthStatus: healthy
kvCacheUsagePercent: 45.2
avgPromptThroughput: 1250.5
avgGenerationThroughput: 85.3
dataCompleteness:
staticConfig: true
gpuMetrics: true
  runtimeMetrics: true

Same structure as YAML but in JSON format, ideal for programmatic processing.
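A downstream script might consume these generated files and flag deployments whose detection confidence is low. A minimal sketch using the field names from the example above (`low_confidence_sections` is a hypothetical helper, not part of PIQC):

```python
import json

def low_confidence_sections(spec: dict, threshold: float = 0.9) -> list[str]:
    """List the spec sections whose reported confidence falls below threshold."""
    checks = {
        "model": spec.get("model", {}).get("identificationConfidence", 0.0),
        "engine": spec.get("engine", {}).get("detectionConfidence", 0.0),
    }
    return [name for name, conf in checks.items() if conf < threshold]

# JSON output has the same shape as the YAML ModelSpec:
raw = """{"model": {"identificationConfidence": 0.95},
          "engine": {"detectionConfidence": 0.6}}"""
print(low_confidence_sections(json.loads(raw)))  # ['engine']
```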
Default output of piqc scan — no flags required:
Discovered Inference Deployments
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Deployment ┃ Engine ┃ GPU ┃ Replicas ┃ GPU Util ┃ MFU ┃ $/1K tokens ┃ $/hr ┃ Idle $/day ┃ Tier Fit ┃ Namespace ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ meta-llama/Llama-3-70B-Inst │ vllm │ 8xH100-SXM4-80GB │ 2 │ 4% │ 3.1% │ $0.0842 │ $68.00 │ $1,566.72 │ ⚠ >A100-80GB │ production │
│ mistral-7b-instruct │ vllm │ 1xA100-SXM4-40GB │ 1 │ 11% │ 8.4% │ $0.0073 │ $2.50 │ $53.40 │ ⚠ >T4 │ production │
│ codellama-34b-staging │ vllm │ 4xH100-SXM4-80GB │ 1 │ 0% │ N/A │ N/A │ $17.00 │ $408.00 │ ⚠ >A100-40GB │ staging │
│ embedding-bge-large │ vllm │ 1xT4 │ 3 │ 82% │ N/A │ $0.0002 │ $1.35 │ $5.83 │ ✓ │ shared-services │
│ unknown-runtime-7f3a2 │ unknown │ 2xA100-SXM4-80GB │ 1 │ N/A │ N/A │ N/A │ $7.00 │ util unknown │ ? │ ml-platform │
└─────────────────────────────┴─────────┴──────────────────┴──────────┴──────────┴──────┴─────────────┴────────┴──────────────┴──────────────┴─────────────────┘
╭──────────────────────────────────── Cost Summary ──────────────────────────────────────╮
│ Total GPU spend rate : $95.85/hr │
│ │
│ Leased & idle (util <60%) : $2,033.95/day (pods running, GPUs underused) │
│ Unallocated nodes : $1,152.00/day (12 GPU(s) with no pods scheduled) │
│ Tier misplacement : $721.20/day (3 model(s) on oversized GPU tier) │
│ │
│ Total estimated leak : $3,907.15/day ($1,426,110/yr) │
│ │
│ Avg MFU (active deployments) : 15.7% (healthy range: 30–60%) │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Tier Fit column:

| Symbol | Meaning |
|---|---|
| ✓ | Model is on an appropriate GPU tier for its size |
| ⚠ >T4 | Model is over-provisioned — minimum sufficient tier shown |
| ? | Parameter count not parseable from model name |
With --output-piqc, generates a standardized facts bundle for quality assessment systems:
{
"schemaVersion": "piqc-scan.v0.1",
"generatedAt": "2024-01-07T12:00:00Z",
"tool": {
"name": "piqc",
"version": "1.0.0"
},
"cluster": {
"context": "my-context",
"name": "my-cluster"
},
"objects": [
{
"workloadId": "ns/inference/deployment/vllm-llama-7b",
"facts": {
"runtime.engineType": {"value": "vllm", "dataConfidence": "high"},
"runtime.engineVersion": {"value": "0.4.0", "dataConfidence": "medium"},
"hardware.gpuType": {"value": "A100-SXM4-80GB", "dataConfidence": "high"},
"hardware.gpuCount": {"value": 4, "dataConfidence": "high"},
"hardware.gpuMemoryTotal": {"value": 80, "unit": "GB", "dataConfidence": "high"},
"observed.gpuUtilization": {"value": 87, "unit": "%", "dataConfidence": "high"},
"vllm.tensorParallelSize": {"value": 4, "dataConfidence": "high"},
"vllm.maxModelLen": {"value": 4096, "dataConfidence": "high"},
"observed.kvCacheUsage": {"value": 45.2, "unit": "%", "dataConfidence": "high"}
}
}
]
}

- Python: 3.11 or higher
- Kubernetes Access: Valid kubeconfig with cluster access
- Poetry: For development installation
# Clone the repository
git clone https://github.com/paralleliq/piqc.git
cd piqc
# Install with Poetry
poetry install
# Verify installation
poetry run piqc --version

# Clone and install with dev dependencies
git clone https://github.com/paralleliq/piqc.git
cd piqc
poetry install --with dev
# Run tests
poetry run pytest tests/unit -v
# Run with coverage
poetry run pytest tests/unit --cov=src/piqc

PIQC is read-only. It never creates, modifies, or deletes any resource in your cluster. The only non-read permission is pods/exec (used to run nvidia-smi inside pods for GPU metrics), and it can be disabled with --no-exec.
Apply the provided RBAC manifests:
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml

| Resource | Verbs | Purpose |
|---|---|---|
| pods | get, list | Discover inference workloads |
| pods/exec | create | Run nvidia-smi for GPU metrics |
| pods/log | get | Enhanced framework detection |
| namespaces | get, list | Scan multiple namespaces |
| deployments | get, list | Identify Deployment metadata |
| statefulsets | get, list | Identify StatefulSet workloads |
| services | get, list | Endpoint detection |
rbac/
├── serviceaccount.yaml # ServiceAccount for PIQC
├── clusterrole.yaml # ClusterRole with required permissions
└── clusterrolebinding.yaml # Binds role to service account
| Mode | Description |
|---|---|
| auto | Automatically detect if running in-cluster or remotely |
| remote | Force remote mode (uses kubeconfig) |
| incluster | Force in-cluster mode (uses ServiceAccount) |
| dry-run | Simulate scan without cluster access |
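Auto-detection of the execution mode typically keys off the ServiceAccount token that Kubernetes mounts into every pod by default. A sketch of how auto mode could resolve (`resolve_mode` is a hypothetical helper, not PIQC's actual code):

```python
import os

# Standard path where Kubernetes mounts the ServiceAccount token in a pod.
SA_TOKEN = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def resolve_mode(requested: str = "auto") -> str:
    """Resolve "auto" to "incluster" or "remote"; pass other modes through."""
    if requested != "auto":
        return requested
    # The token file only exists inside a pod, so its presence is a
    # reliable in-cluster signal.
    return "incluster" if os.path.exists(SA_TOKEN) else "remote"

print(resolve_mode("dry-run"))  # dry-run
```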
If you see gke-gcloud-auth-plugin not found or similar errors when using Docker, use the
in-cluster Job approach (Option 1 above) — it runs inside the cluster and needs no auth plugins.
Alternatively, export a static kubeconfig:
kubectl config view --raw --flatten > /tmp/piqc-kubeconfig.yaml
docker run --rm -v /tmp/piqc-kubeconfig.yaml:/root/.kube/config ghcr.io/paralleliq/piqc:latest scan

# Verify kubeconfig is valid
kubectl cluster-info
# Test with specific context
piqc test-connection --context my-context
# Enable debug mode for detailed errors
piqc scan --debug

# Check current permissions
kubectl auth can-i list pods --all-namespaces
kubectl auth can-i create pods/exec -n <namespace>
# Apply RBAC manifests
kubectl apply -f https://raw.githubusercontent.com/paralleliq/piqc/main/deploy/rbac.yaml

If nvidia-smi is not available in containers, use --no-exec:
piqc scan --no-exec

Ensure the vLLM service is accessible. Use --collect-runtime and check:
# Verify vLLM health endpoint
kubectl port-forward svc/<vllm-service> 8000:8000
curl http://localhost:8000/health

# Run all unit tests
poetry run pytest tests/unit -v
# Run with coverage
poetry run pytest tests/unit --cov=src/piqc
# Run integration tests (requires cluster)
poetry run pytest tests/integration -v

# Format code with Black
poetry run black src/ tests/
# Lint code with Ruff
poetry run ruff check src/ tests/
# Type checking with MyPy
poetry run mypy src/piqc/
├── src/piqc/
│ ├── cli/ # CLI commands (scan, test-connection, version)
│ ├── collectors/ # Data collectors (vLLM config, GPU metrics)
│ ├── core/ # Core logic (orchestrator, discovery, k8s client)
│ ├── generators/ # Output generators (YAML, JSON, Table, PIQC)
│ ├── models/ # Pydantic data models (ModelSpec, PIQC schema)
│ ├── parsers/ # Configuration parsers (vLLM)
│ └── utils/ # Utilities (logging, exceptions)
├── tests/
│ ├── unit/ # Unit tests
│ └── integration/ # Integration tests (with mock containers)
├── rbac/ # Kubernetes RBAC manifests
├── docs/ # Documentation (LaTeX guides)
└── examples/ # Example ModelSpec files
Apache License 2.0 - see LICENSE for details.
Built with ❤️ by ParallelIQ
🚀 Model-aware GPU Control Plane