136 changes: 53 additions & 83 deletions .claude/skills/architecture/SKILL.md
@@ -7,105 +7,75 @@ user-invocable: true

## gProfiler Architecture Overview

### High-Level Architecture
gProfiler has two main execution paths:

```
┌─────────────────────────────────────────────────────────────┐
│ gprofiler/main.py │
│ (Orchestration Layer) │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────┐│
│ │ perf │ │ Java │ │ Python │ │ Ruby │ │ .NET ││
│ │profiler │ │profiler │ │profiler │ │profiler │ │profiler││
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └───┬────┘│
│ └──────────┴──────────┴──────────┴───────────┘ │
│ ▼ │
│ gprofiler/merge.py │
│ (Profile Data Aggregation) │
├─────────────────────────────────────────────────────────────┤
│ Output Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Flamegraph │ │ Upload │ │ Local Output │ │
│ │ (HTML) │ │ (Studio) │ │ (collapsed) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
1. **Classic profiling path**: discover processes -> select profilers -> collect samples -> merge -> output/upload
2. **Dynamic profiling control path**: heartbeat with backend -> receive commands -> queue/prioritize -> run continuous or ad-hoc profiling -> report completion

### Key Components
When answering architecture questions, identify which path the request touches before reading or editing code.

#### 1. Profiler Registry (`gprofiler/profilers/registry.py`)
- Decorator-based profiler registration
- Runtime discovery of available profilers
- Configuration-based profiler selection
### Main architecture areas

#### 2. Profiler Base (`gprofiler/profilers/profiler_base.py`)
- Abstract base class for all profilers
- Lifecycle: `start()` → `snapshot()` → `stop()`
- Common utilities for process discovery
| Area | Primary files | What belongs here |
|------|---------------|-------------------|
| CLI + orchestration | `gprofiler/main.py` (~1546 lines) | Argument parsing, runtime wiring, top-level orchestration. Treat as a hotspot; avoid broad edits if a narrower seam exists. |
| Profiler registration + lifecycle | `gprofiler/profilers/registry.py`, `gprofiler/profilers/profiler_base.py` | Registration, mode selection, `start()` / `snapshot()` / `stop()` lifecycle. |
| Runtime profilers | `gprofiler/profilers/*.py` | Runtime/tool-specific logic: process discovery, sampling, stack parsing, cleanup. |
| Merge/output | `gprofiler/merge.py` (~330 lines) | Stack aggregation, symbol handling, output shaping. |
| Metadata enrichment | `gprofiler/metadata/` | Application identifiers and host/system metadata. |
| Dynamic profiling command control | `gprofiler/dynamic_profiling_management/heartbeat.py` (~354 lines), `command_control.py` (~233 lines), `continuous.py` (~68 lines), `ad_hoc.py` (~76 lines) | Heartbeat polling, command parsing, queue priority, pause/resume, execution routing. |
| Shared test infrastructure | `tests/conftest.py` (~708 lines) | Docker fixtures, runtime builders, cleanup. Shared and regression-sensitive. |

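The `start()` → `snapshot()` → `stop()` lifecycle from the table can be illustrated with a minimal sketch. `SketchProfilerBase`, `FakeProfiler`, and `run_session` are hypothetical names, and the assumption that `snapshot()` returns collapsed stacks mapped to sample counts is for illustration only; the real `profiler_base.py` may use different types. The point is the call ordering and that `stop()` runs even if sampling fails.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchProfilerBase(ABC):
    """Hypothetical stand-in for the base class in profiler_base.py.

    Only the start() -> snapshot() -> stop() ordering comes from the
    architecture notes; names and return types are illustrative.
    """

    def __init__(self) -> None:
        self._running = False

    def start(self) -> None:
        # Acquire resources here (spawn the backend tool, attach, ...).
        self._running = True

    @abstractmethod
    def snapshot(self) -> Dict[str, int]:
        """Return collapsed stacks mapped to sample counts (assumed shape)."""

    def stop(self) -> None:
        # Release resources; calling it twice is harmless.
        self._running = False


class FakeProfiler(SketchProfilerBase):
    def snapshot(self) -> Dict[str, int]:
        if not self._running:
            raise RuntimeError("snapshot() called before start()")
        return {"main;work": 3}


def run_session(profiler: SketchProfilerBase, rounds: int) -> List[Dict[str, int]]:
    """Drive one profiler through its lifecycle; stop() always runs."""
    profiler.start()
    try:
        return [profiler.snapshot() for _ in range(rounds)]
    finally:
        profiler.stop()
```

Putting `stop()` in `finally` reflects the cleanup responsibility the table assigns to runtime profilers.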
#### 3. Individual Profilers (`gprofiler/profilers/*.py`)
### Runtime profilers

| Profiler | Backend Tool | Key Features |
|----------|--------------|--------------|
| `perf.py` | Linux perf | System-wide, kernel stacks |
| `java.py` | async-profiler | JVM attach, allocation profiling |
| `python.py` | py-spy | No instrumentation needed |
| `python_ebpf.py` | PyPerf | eBPF-based, lower overhead |
| Profiler | Backend tool | Notes |
|----------|--------------|-------|
| `perf.py` | Linux perf | System-wide profiling, kernel/user stacks |
| `java.py` | async-profiler | JVM attach, allocation profiling, large hotspot (~1555 lines) |
| `python.py` | py-spy | Python sampling without instrumentation |
| `python_ebpf.py` | PyPerf | eBPF-based Python profiling |
| `ruby.py` | rbspy | Ruby VM sampling |
| `php.py` | phpspy | PHP process profiling |
| `dotnet.py` | dotnet-trace | .NET Core/5+ support |
| `node.py` | perf | V8 JavaScript profiling |

#### 4. Merge Layer (`gprofiler/merge.py`)
- Combines samples from multiple profilers
- Handles symbol resolution
- Produces unified stack traces

#### 5. Metadata Collection (`gprofiler/metadata/`)
- `application_identifiers.py` - Extracts app names from processes
- `system_metadata.py` - Collects host information
- Enriches profiles with context
| `dotnet.py` | dotnet-trace | .NET support |
| `node.py` | perf | Node/V8 profiling |

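Merging can be pictured as summing sample counts for identical stacks across profilers. This sketch assumes the common collapsed-stack text format (`frame;frame;frame count`, one stack per line); `parse_collapsed`, `merge_collapsed`, and `to_collapsed` are hypothetical helpers written for this example. The real `gprofiler/merge.py` also handles symbol resolution and output shaping, which this ignores.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical helpers for illustration; not taken from gprofiler/merge.py.


def parse_collapsed(text: str) -> Counter:
    """Parse 'frame;frame;frame count' lines into a Counter of stack -> samples."""
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def merge_collapsed(outputs: Iterable[str]) -> Dict[str, int]:
    """Sum sample counts for identical stacks across several profilers."""
    merged: Counter = Counter()
    for text in outputs:
        merged.update(parse_collapsed(text))  # Counter.update adds counts
    return dict(merged)


def to_collapsed(counts: Dict[str, int]) -> str:
    """Render merged counts back to collapsed text, sorted by stack."""
    return "\n".join(f"{stack} {n}" for stack, n in sorted(counts.items()))


perf_out = "nginx;epoll_wait 5\npython3;main;foo 2"
py_out = "python3;main;foo 3\npython3;main;bar 1"
print(to_collapsed(merge_collapsed([perf_out, py_out])))
# nginx;epoll_wait 5
# python3;main;bar 1
# python3;main;foo 5
```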
### Data Flow
### Classic profiling data flow

```text
1. Discover target processes
2. Select profilers / modes
3. Each profiler samples independently
4. merge.py aggregates results
5. Output collapsed stacks, flamegraph data, or upload results
```
1. Process Discovery
└── Scan /proc for target processes

2. Profiler Selection
└── Match processes to appropriate profilers
### Dynamic profiling control flow

3. Sampling
└── Each profiler collects stacks independently

4. Aggregation
└── merge.py combines all samples

5. Output
└── Generate flamegraph or upload to Studio
```text
1. Backend receives profiling request
2. Agent heartbeat polls for work
3. heartbeat.py parses command payload
4. command_control.py enqueues by priority
5. continuous.py or ad_hoc.py executes the command
6. Agent reports command completion
```
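Steps 2 to 6 of the control flow amount to a polling loop. The sketch below is hypothetical: the `poll`, `execute`, and `report_done` callables and the `command_id` field stand in for whatever `heartbeat.py` actually uses, and the backend is replaced by plain functions so the loop runs without a network. `interval_s` plays the role of `--heartbeat-interval`.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical sketch of a heartbeat loop; heartbeat.py's real structure,
# payload fields, and error handling are not reproduced here.
Command = Dict[str, Any]


def heartbeat_loop(
    poll: Callable[[], Optional[Command]],
    execute: Callable[[Command], None],
    report_done: Callable[[str], None],
    interval_s: float,
    max_beats: int,
) -> None:
    """Poll for work, run any command received, then report completion."""
    for _ in range(max_beats):
        command = poll()  # step 2: agent heartbeat polls for work
        if command is not None:
            execute(command)  # steps 3-5: parse, enqueue, execute
            report_done(command["command_id"])  # step 6: report completion
        time.sleep(interval_s)  # analogous to --heartbeat-interval
```

Injecting the backend calls as plain callables keeps the loop testable without a running backend, in line with adding targeted heartbeat tests before broader regression runs.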

### Key Files to Understand

| File | Lines | Purpose |
|------|-------|---------|
| `main.py` | ~1500 | Entry point, CLI, orchestration |
| `profilers/perf.py` | ~500 | Core perf integration |
| `profilers/java.py` | ~1800 | Complex JVM profiling |
| `merge.py` | ~400 | Profile aggregation |
| `utils/perf_process.py` | ~200 | perf subprocess management |
Priority is `stop > adhoc > continuous`. Do not describe or implement a parallel control path unless the existing heartbeat/queue system is truly insufficient.

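The `stop > adhoc > continuous` ordering maps naturally onto a priority heap, and the idempotency requirement (the same command must never execute twice) onto a set of already-accepted command IDs. This is a sketch, not the `command_control.py` API: the `CommandQueue` class, the `command_type` strings, and the FIFO tie-break within one priority level are assumptions.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple

# Hypothetical sketch, not the command_control.py API.
# Lower number = served first, mirroring stop > adhoc > continuous.
PRIORITY = {"stop": 0, "adhoc": 1, "continuous": 2}


class CommandQueue:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._seq = itertools.count()  # FIFO tie-break within one priority level
        self._seen: Set[str] = set()  # every command ID ever accepted

    def enqueue(self, command_id: str, command_type: str) -> bool:
        """Queue a command; return False for a duplicate ID (idempotency)."""
        if command_id in self._seen:
            return False
        self._seen.add(command_id)
        entry = (PRIORITY[command_type], next(self._seq), command_id, command_type)
        heapq.heappush(self._heap, entry)
        return True

    def pop(self) -> Optional[Tuple[str, str]]:
        """Return the highest-priority (command_id, command_type), or None."""
        if not self._heap:
            return None
        _, _, command_id, command_type = heapq.heappop(self._heap)
        return command_id, command_type
```

Keeping every seen ID is the simplest way to guarantee no re-execution; a long-running agent would likely need to bound or persist that set.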
### Extension Points
### Where to make changes

1. **Add new profiler**: Implement `ProfilerBase`, use `@register_profiler`
2. **Add metadata**: Extend `application_identifiers.py`
3. **New output format**: Modify `main.py` output handling
4. **New deployment**: Add to `deploy/` directory
- **Add a new profiler**: new file under `gprofiler/profilers/`, register via `@register_profiler`, add targeted tests.
- **Change profiler behavior**: edit the specific profiler first; avoid `main.py` unless CLI/wiring must change.
- **Change merge or output**: start in `gprofiler/merge.py`.
- **Change heartbeat / dynamic profiling**: start in `gprofiler/dynamic_profiling_management/`.
- **Change shared test behavior**: touch `tests/conftest.py` only when multiple tests need new shared infra.

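Decorator-based registration of new profilers can be sketched as a module-level dictionary filled at import time. `register_profiler_sketch`, `ExampleRuntimeProfiler`, and `available_profilers` are hypothetical; the real `@register_profiler` in `gprofiler/profilers/registry.py` may take different arguments and record more metadata (for example, supported modes).

```python
from typing import Callable, Dict, Type

# Hypothetical registry for illustration; the real decorator lives in
# gprofiler/profilers/registry.py and may differ in signature and metadata.
_REGISTRY: Dict[str, Type] = {}


def register_profiler_sketch(name: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under `name`."""

    def decorator(cls: Type) -> Type:
        if name in _REGISTRY:
            raise ValueError(f"profiler {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


@register_profiler_sketch("ExampleRuntime")
class ExampleRuntimeProfiler:
    """Placeholder profiler; registration happens when this module is imported."""


def available_profilers() -> Dict[str, Type]:
    """Runtime discovery: everything imported so far has registered itself."""
    return dict(_REGISTRY)
```

In a design like this, registration is a side effect of import, so a new profiler module must be imported somewhere before it becomes available.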
### Instructions
### Guidance when answering users

When user asks about architecture:
1. Start with high-level overview above
2. Dive into specific component if asked
3. Reference actual code files with line numbers
4. Explain data flow through the system
1. Start with the smallest relevant architecture slice, not the whole repo.
2. If the request is about command-driven profiling, include the heartbeat + queue modules, not just `main.py`.
3. Call out hotspot files when relevant:
   - `gprofiler/main.py` (~1546 lines)
   - `gprofiler/profilers/java.py` (~1555 lines)
   - `tests/conftest.py` (~708 lines)
4. Prefer concrete file references over generic descriptions.
66 changes: 51 additions & 15 deletions .claude/skills/heartbeat/SKILL.md
@@ -38,19 +38,34 @@ export GPROFILER_SERVICE="your-service-name"
export GPROFILER_SERVER="http://localhost:8080"

/opt/gprofiler/gprofiler \
--enable-heartbeat-server \
-u \
--token=$GPROFILER_TOKEN \
--service-name=$GPROFILER_SERVICE \
--server-host $GPROFILER_SERVER \
--api-server $GPROFILER_SERVER \
--dont-send-logs \
--server-upload-timeout 10 \
-c \
--disable-metrics-collection \
--java-safemode= \
--heartbeat-interval 30 \
-d 60 \
--java-no-version-check
```

`--server-host` still exists as a deprecated alias, but prefer `--api-server`.

### Required flags

Current `main.py` validation requires heartbeat mode to include:

- `--enable-heartbeat-server`
- `--upload-results`
- `--token`
- `--service-name`

When explaining or debugging this mode, refer only to the current flags listed above.

### Command Flow

```
@@ -81,6 +96,8 @@ export GPROFILER_SERVER="http://localhost:8080"

Priority: `stop > adhoc > continuous`

The current implementation lives under `gprofiler/dynamic_profiling_management/`. Do not refer users to `gprofiler/command_control.py`; that path is stale.

### API Endpoints

**Submit Profiling Request:**
@@ -133,11 +150,35 @@ Requirements:
### Key Files

```
gprofiler/main.py # Heartbeat integration
gprofiler/command_control.py # CommandManager class
docs/HEARTBEAT_SYSTEM_README.md # Full documentation
gprofiler/main.py # CLI + heartbeat flag validation
gprofiler/dynamic_profiling_management/heartbeat.py # Polling and command handling
gprofiler/dynamic_profiling_management/command_control.py # Queue logic and priority
gprofiler/dynamic_profiling_management/continuous.py # Continuous slot
gprofiler/dynamic_profiling_management/ad_hoc.py # Ad-hoc slot
tests/test_heartbeat_system.py # Heartbeat flow validation
docs/HEARTBEAT_SYSTEM_README.md # Full documentation
```

### Testing heartbeat changes

Use the smallest useful validation first:

```bash
# Focused heartbeat test
sudo python3 -m pytest -v tests/test_heartbeat_system.py

# Lightweight broader regression
sudo ./tests/test.sh --executable
```

For local end-to-end testing against a backend, the repo docs describe this sequence:

1. Start the Performance Studio backend.
2. Run `python tests/run_heartbeat_agent.py`.
3. Submit commands with `python tests/test_heartbeat_system.py --live`.

Prefer the existing docs/test scripts over inventing custom heartbeat harnesses.

### Troubleshooting

**Agent not receiving commands:**
@@ -161,22 +202,17 @@ docs/HEARTBEAT_SYSTEM_README.md # Full documentation
--enable-heartbeat-server # Enable heartbeat mode
--heartbeat-interval 30 # Heartbeat frequency (seconds)
--api-server URL # Backend server URL
--server-host URL # Deprecated alias for --api-server
--upload-results # Required for heartbeat mode
--token TOKEN # Authentication token
--service-name NAME # Service identifier
--enable-hw-metrics-collection # Enable PerfSpect
--perfspect-path PATH # PerfSpect binary path
```

---

## TODO: Skill Content to Add
### Review points for heartbeat work

- [ ] **Add complete API reference** - All heartbeat API endpoints with examples
- [ ] **Add command_control.py documentation** - CommandManager class details
- [ ] **Add authentication flow** - Token validation and refresh
- [ ] **Add error response codes** - All possible error responses
- [ ] **Add deployment examples** - K8s, Docker Compose, systemd configs
- [ ] **Add PerfSpect output examples** - Sample hardware metrics output
- [ ] **Add monitoring integration** - How to monitor heartbeat health
- [ ] **Add scaling guidance** - Multi-agent deployment patterns
- Preserve queue semantics: `stop > adhoc > continuous`
- Preserve idempotency; do not allow the same command to execute twice
- Avoid moving heartbeat logic into `main.py` if `dynamic_profiling_management/` is sufficient
- Add targeted heartbeat tests before broader regression runs