CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Priority #1: CORRECTNESS

Correctness is our number one priority. No hacks, no workarounds, no shortcuts.

Code must be maximally correct - if something feels wrong, it probably is
Match established semantics (Kubernetes, SQL/PostgreSQL, Rust idioms)
Special cases are acceptable ONLY when they reflect actual underlying semantics (e.g., ConfigMaps have data not spec because that's how K8s works)
NEVER assume or decide to take shortcuts without consulting me first
Any deviations from standards, best practices, or established patterns must be discussed before implementation
Anything that would trigger the "spidey senses" of a very senior engineer/architect should make you pause and ask
When in doubt, ask - do not proceed with assumptions

Priority #2: Performance

Performance is our second priority - still very important.

Minimize API calls to Kubernetes (push filters to API when possible)
Parallel queries across clusters
Efficient data structures and algorithms
But never sacrifice correctness for performance

Git Commits

Do NOT add Claude attributions to commit messages (no "Generated with Claude Code", no "Co-Authored-By: Claude")
Write clear, conventional commit messages focusing on what changed and why
Always run cargo fmt before committing
Always verify with cargo clippy for lint issues before committing

Code Navigation

Use the Serena MCP with rust-analyzer for code navigation and symbol lookup
Prefer symbolic tools (find_symbol, find_referencing_symbols, get_symbols_overview) over grep for Rust code

Project Overview

k8sql exposes the Kubernetes API as a SQL-compatible database using Apache DataFusion as the query engine. kubectl contexts are treated as databases (switchable with USE cluster1;) and Kubernetes resources are exposed as tables. The _cluster column is part of every table's primary key, enabling cross-cluster queries.

k8sql supports both SQL and PRQL (Pipelined Relational Query Language) with automatic detection - just write either language and it works.

Build Commands

cargo build              # Debug build
cargo build --release    # Release build
cargo test               # Run all tests

Usage Examples

# Interactive REPL (default)
k8sql

# Batch mode - SQL queries
k8sql -q "SELECT * FROM pods"
k8sql -q "SELECT name, status->>'phase' as phase FROM pods WHERE namespace = 'kube-system'"

# Batch mode - PRQL queries (auto-detected)
k8sql -q "from pods | take 5"
k8sql -q "from pods | filter namespace == 'kube-system' | select {name, namespace} | take 10"

# Multi-cluster queries
k8sql -q "SELECT * FROM pods WHERE _cluster = 'prod'"
k8sql -q "SELECT * FROM pods WHERE _cluster = '*'"  # All clusters

# Batch mode with context and output format
k8sql -c prod-cluster -q "SELECT * FROM deployments" -o json

# Daemon mode (PostgreSQL wire protocol)
k8sql daemon --port 15432

Architecture

src/
├── main.rs                    # Entry point, CLI argument parsing, mode dispatch
├── cli/
│   ├── mod.rs
│   ├── args.rs                # Clap argument definitions
│   └── repl.rs                # Rustyline REPL with completion/highlighting
├── datafusion_integration/
│   ├── mod.rs
│   ├── context.rs             # K8sSessionContext - DataFusion session setup
│   ├── provider.rs            # K8sTableProvider - TableProvider implementation
│   ├── convert.rs             # JSON to Arrow RecordBatch conversion
│   ├── preprocess.rs          # Query preprocessing (PRQL compilation, JSON paths, ->> precedence)
│   ├── prql.rs                # PRQL detection and compilation to SQL
│   ├── json_path.rs           # JSON path syntax preprocessor (.field, [n], [])
│   └── hooks.rs               # Query hooks for filter extraction
├── kubernetes/
│   ├── mod.rs                 # ApiFilters type for filter pushdown
│   ├── cache.rs               # Resource discovery cache persistence
│   ├── client.rs              # K8sClientPool - multi-cluster connection caching
│   ├── discovery.rs           # Dynamic resource discovery and schema generation
│   └── field_selectors.rs     # Field selector registry and types for K8s API pushdown
├── output/
│   ├── mod.rs                 # QueryResult type and format dispatch
│   ├── table.rs               # Pretty table output
│   ├── json.rs                # JSON output
│   ├── csv.rs                 # CSV output
│   └── yaml.rs                # YAML output
└── daemon/
    ├── mod.rs
    └── pgwire_server.rs       # PostgreSQL wire protocol server

Key Components

DataFusion Integration (`src/datafusion_integration/`)

k8sql uses Apache DataFusion as its SQL query engine:

K8sSessionContext (context.rs): Wraps DataFusion's SessionContext, registers all K8s resources as tables, provides JSON functions for querying nested fields.
K8sTableProvider (provider.rs): Implements DataFusion's TableProvider trait. Fetches data from Kubernetes API on scan, extracts and pushes filters (namespace, cluster, label selectors) to the API.
JSON to Arrow conversion (convert.rs): Converts K8s JSON resources to Arrow RecordBatches. Metadata fields use native Arrow types (Timestamp for dates, Int64 for integers); labels/annotations/spec/status are stored as JSON strings.
Query Preprocessing (preprocess.rs): Three-stage pipeline: (1) PRQL-to-SQL compilation if needed, (2) JSON path syntax conversion, (3) arrow operator precedence fix. Ensures both SQL and PRQL users get intuitive JSON access.
JSON Path Preprocessor (json_path.rs): Converts intuitive dot notation (status.phase, spec.containers[0].image) to PostgreSQL arrow operators. Uses DataFusion's SQL tokenizer to avoid matching inside strings. Supports array indexing [n] and array expansion [] via UNNEST.
PRQL Support (prql.rs): Detects PRQL queries (starting with from, let, or prql) and compiles them to SQL using the prqlc compiler. Auto-detection means users can write either language without mode switching.

Resource Discovery (`src/kubernetes/discovery.rs`)

Dynamic resource discovery at startup:

Discovers all available resources including CRDs
Generates consistent schema: _cluster, api_version, kind, metadata fields, spec/status as JSON
Special schemas for ConfigMaps, Secrets, Events, and metrics resources
Core resources (v1, apps, batch, etc.) take priority over conflicting names from other API groups

ColumnDataType Enum

Schema definitions use a type-safe enum for Arrow data types:

pub enum ColumnDataType {
    Text,       // Arrow Utf8 - strings, JSON blobs
    Timestamp,  // Arrow Timestamp(ms) - creation/deletion timestamps
    Integer,    // Arrow Int64 - generation, count fields
}

Helper constructors simplify schema definition:

ColumnDef::text("name", "metadata.name")
ColumnDef::timestamp("created", "metadata.creationTimestamp")
ColumnDef::integer("generation", "metadata.generation")
ColumnDef::text_raw("_cluster")  // No JSON path (virtual column)

K8sClientPool (`src/kubernetes/client.rs`)

Manages connections to multiple Kubernetes clusters:

Caches clients by context name
Caches resource registries with TTL
Supports parallel queries across clusters

Discovery Cache (30-minute TTL):

Resource discovery (CRDs, schemas, available tables) is cached for 30 minutes
USE commands use cached discovery for instant context switching
Cache reduces API load and improves performance with large kubeconfigs
Only discovery metadata is cached - actual resource queries always hit the API
Force fresh discovery: restart k8sql or use --refresh-crds flag

Retry Logic:

Connection and discovery failures retry up to 5 times with exponential backoff (100ms, 200ms, 400ms, 800ms, 1600ms base delays)
±25% jitter added to prevent thundering herd when multiple clients retry simultaneously
Handles intermittent network issues and proxy problems
All requested contexts must succeed - no partial failures (ensures predictability for scripting)

Key Dependencies

datafusion: SQL query engine (Apache Arrow-based)
datafusion-functions-json: JSON querying functions (json_get_str, json_get_int, etc.)
datafusion-postgres: PostgreSQL wire protocol for daemon mode
prqlc: PRQL compiler (compiles PRQL to SQL)
kube: Kubernetes client (kube-rs)
k8s-openapi: Kubernetes API type definitions (v1.32)
rustyline: REPL with readline support
clap: CLI argument parsing
fastrand: Fast random number generation (for retry jitter)

Table Schema

All resources share a consistent schema:

Column	Arrow Type	Description
`_cluster`	Utf8	Kubernetes context name
`api_version`	Utf8	API version (v1, apps/v1, etc.)
`kind`	Utf8	Resource kind (Pod, Deployment, etc.)
`name`	Utf8	Resource name
`namespace`	Utf8	Namespace (null for cluster-scoped)
`uid`	Utf8	Unique identifier
`created`	Timestamp(ms)	Creation timestamp (native, supports date comparisons)
`generation`	Int64	Spec change counter (native, supports numeric comparisons)
`labels`	Utf8 (JSON)	JSON object, access with `labels->>'key'`
`annotations`	Utf8 (JSON)	JSON object, access with `annotations->>'key'`
`spec`	Utf8 (JSON)	Resource specification
`status`	Utf8 (JSON)	Resource status

Special cases:

ConfigMaps/Secrets: data column instead of spec/status
Events: type, reason, message, count, etc.
PodMetrics/NodeMetrics: timestamp, window, containers/usage
CustomResourceDefinitions: group, scope, resource_kind, plural, singular, short_names, categories (queryable as crds or customresourcedefinitions)

Query Optimization

Server-side (pushed to K8s API):

SQL Condition	K8s API Optimization
`namespace = 'x'`	Uses namespaced API endpoint
`labels->>'app' = 'nginx'`	K8s label selector
`labels->>'app' = 'x' AND labels->>'env' = 'y'`	Combined label selector: `app=x,env=y`
`status->>'phase' = 'Running'`	K8s field selector (pods only)
`spec->>'nodeName' = 'node-1'`	K8s field selector (pods only)
`type = 'Opaque'`	K8s field selector (secrets only)
`name = 'pod-123'`	K8s field selector (`metadata.name`)
`_cluster = 'prod'`	Only queries that cluster
`_cluster = '*'`	Queries all clusters in parallel

Field Selectors (src/kubernetes/field_selectors.rs):

Limited to = and != operators (K8s restriction)
Resource-specific support (e.g., status.phase for pods, type for secrets)
Hardcoded registry based on K8s API documentation
Namespace field selector intentionally not used (namespaced API endpoints are more efficient)

Supported resources and fields:

All resources: metadata.name
Pods: spec.nodeName, spec.restartPolicy, spec.schedulerName, spec.serviceAccountName, spec.hostNetwork, status.phase, status.podIP, status.nominatedNodeName
Events: involvedObject.kind, involvedObject.namespace, involvedObject.name, involvedObject.uid, involvedObject.apiVersion, involvedObject.resourceVersion, involvedObject.fieldPath, reason, reportingComponent, source, type
Secrets: type
Namespaces: status.phase
ReplicaSets: status.replicas
Jobs: status.successful
Nodes: spec.unschedulable
CertificateSigningRequests: spec.signerName

Client-side (DataFusion):

Unsupported field selectors (fields not in registry)
LIKE patterns
Complex expressions
ORDER BY, LIMIT, GROUP BY
Operators not supported by K8s field selectors (IN, NOT IN, etc.)

JSON Array Access

Use json_get_array() with UNNEST for array expansion:

SELECT json_get_str(UNNEST(json_get_array(spec, 'containers')), 'image') FROM pods

Performance Tips

Projection Pushdown (automatic optimization):

k8sql automatically skips converting unrequested JSON columns
SELECT name FROM pods only processes ~50 bytes per pod (vs ~2KB for full resource)
SELECT name, namespace, labels FROM pods processes ~200 bytes per pod
Impact: 10-40x reduction in CPU/memory for selective queries
No user action needed - optimization happens automatically

Query Performance Best Practices:

Select only needed columns: SELECT name, namespace FROM pods is much faster than SELECT * FROM pods
Use namespace filters: WHERE namespace = 'kube-system' uses efficient namespaced API endpoint
Use label selectors: WHERE labels->>'app' = 'nginx' pushes to K8s API (reduces network transfer)
Use field selectors: WHERE status->>'phase' = 'Running' for pods pushes to K8s API
Cluster filters: WHERE _cluster = 'prod' avoids querying unnecessary clusters
Combine filters: Multiple AND conditions on labels/fields push to K8s as combined selectors

What to Avoid:

SELECT * FROM pods when you only need a few columns (wastes CPU converting large JSON fields)
Queries without namespace/cluster filters on large clusters (fetches all resources)
Client-side filtering (LIKE, complex expressions) when API-pushable filters exist

REPL Controls

Tab: Auto-complete keywords/tables/columns
↑/↓: Navigate history
Ctrl+D: Quit
\dt: SHOW TABLES (displays: table_name, aliases, group, version, kind, scope, resource_type)
\l: SHOW DATABASES
\d table: DESCRIBE table
\x: Toggle expanded display
\q: Quit

PRQL Support

PRQL (Pipelined Relational Query Language) is supported as an alternative to SQL. Queries are auto-detected based on syntax - no mode switching required.

Detection Logic

PRQL: Queries starting with from, let, or prql
SQL: Queries starting with SELECT, WITH, SHOW, DESCRIBE, EXPLAIN, USE, etc.

PRQL Examples

# Basic query
from pods | take 5

# Filter and select
from pods
filter namespace == "kube-system"
select {name, namespace, created}
take 10

# Sort and limit
from deployments
filter namespace == "default"
sort name
take 20

# Aggregation
from pods
group namespace (aggregate {count = count this})
sort {-count}

PRQL vs SQL Comparison

PRQL	SQL
`from pods \| take 5`	`SELECT * FROM pods LIMIT 5`
`from pods \| filter namespace == "x"`	`SELECT * FROM pods WHERE namespace = 'x'`
`from pods \| select {name, namespace}`	`SELECT name, namespace FROM pods`
`from pods \| sort created`	`SELECT * FROM pods ORDER BY created`
`from pods \| sort {-created}`	`SELECT * FROM pods ORDER BY created DESC`

JSON Path Syntax

k8sql provides intuitive dot notation for accessing nested JSON fields. This works in both SQL and PRQL:

Syntax	Meaning	Example
`.field`	Object field access	`status.phase`
`[n]`	Array index (0-based)	`spec.containers[0]`
`[]`	Array expansion (UNNEST)	`spec.containers[]`

Conversion Examples:

Input	Converted To
`status.phase`	`status->>'phase'`
`spec.selector.app`	`spec->'selector'->>'app'`
`p.status.phase`	`p.status->>'phase'`
`spec.containers[0].image`	`spec->'containers'->0->>'image'`
`spec.containers[].image`	`(UNNEST(json_get_array(spec->'containers', NULL)))->>'image'`

SQL Examples:

-- Simple field access
SELECT name, status.phase FROM pods WHERE status.phase = 'Running'

-- Array indexing
SELECT name, spec.containers[0].image FROM pods

-- Array expansion (one row per container)
SELECT name, spec.containers[].image AS container_image FROM pods

-- With table alias
SELECT p.name, p.spec.containers[0].image FROM pods p

PRQL Examples:

# Simple field access (no s-strings needed!)
from pods
filter status.phase == "Running"
select {name, phase = status.phase}

# Array indexing
from pods
select {name, image = spec.containers[0].image}

# Array expansion
from pods
select {name, container_image = spec.containers[].image}

Known JSON Columns (auto-detected):

labels, annotations, spec, status, data, owner_references
CRD fields detected from OpenAPI schema (object/array types)

Non-JSON columns (preserved as-is):

pods.name → stays pods.name (table.column reference)
public.pods → stays public.pods (schema.table reference)

Advanced: Raw Arrow Operators

For complex cases, use PostgreSQL arrow operators directly:

-- Already-converted syntax is not double-processed
SELECT spec->>'replicas' FROM deployments

-- Complex JSON functions
SELECT json_get_str(UNNEST(json_get_array(spec, 'containers')), 'image') FROM pods

In PRQL, use s-strings for raw SQL when needed:

from pods
select {name, containers_json = s"spec->'containers'"}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Priority #1: CORRECTNESS

Priority #2: Performance

Git Commits

Code Navigation

Project Overview

Build Commands

Usage Examples

Architecture

Key Components

DataFusion Integration (`src/datafusion_integration/`)

Resource Discovery (`src/kubernetes/discovery.rs`)

ColumnDataType Enum

K8sClientPool (`src/kubernetes/client.rs`)

Key Dependencies

Table Schema

Query Optimization

Server-side (pushed to K8s API):

Client-side (DataFusion):

JSON Array Access

Performance Tips

REPL Controls

PRQL Support

Detection Logic

PRQL Examples

PRQL vs SQL Comparison

JSON Path Syntax

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Priority #1: CORRECTNESS

Priority #2: Performance

Git Commits

Code Navigation

Project Overview

Build Commands

Usage Examples

Architecture

Key Components

DataFusion Integration (src/datafusion_integration/)

Resource Discovery (src/kubernetes/discovery.rs)

ColumnDataType Enum

K8sClientPool (src/kubernetes/client.rs)

Key Dependencies

Table Schema

Query Optimization

Server-side (pushed to K8s API):

Client-side (DataFusion):

JSON Array Access

Performance Tips

REPL Controls

PRQL Support

Detection Logic

PRQL Examples

PRQL vs SQL Comparison

JSON Path Syntax

DataFusion Integration (`src/datafusion_integration/`)

Resource Discovery (`src/kubernetes/discovery.rs`)

K8sClientPool (`src/kubernetes/client.rs`)