scxprof #3282
Draft
kkdwvd wants to merge 21 commits into sched-ext:main from
Add basic CLI structure for the SCX workload profiler tool. This provides the foundation for adding profiling subcommands. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add the record subcommand that shells out to perf mem record with the --all-cgroups and --data-page-size flags. Output is written to the specified output directory. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add BPF-based tracing for task hint map updates. The HintsRecorder uses fentry on bpf_map_update_elem to capture changes to task storage maps, emitting events through a ring buffer. Events are written to hints.jsonl. Implement poll-based event loop multiplexing shutdown signal, perf process exit, and ring buffer readiness. HintsRecorder properly cleans up in Drop by detaching the BPF program first, then draining remaining events. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
After recording completes, create a tar.gz archive of the output directory. This can be disabled with the --disable-archive flag. Check if the output directory exists before recording and bail with an error. Clean up the output directory if recording fails. Archiving is done after successful recording, so archive failures don't trigger cleanup. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add --hints-map-ring-sz option to configure the hints ring buffer size in MB (default: 1). Track dropped events when the ring buffer is full and warn the user at the end of recording with the count and suggestion to increase the buffer size. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add Drop implementation for SpawnedProcess that sends SIGINT and waits for the child to exit. This ensures spawned processes are properly cleaned up when the program exits, including on Ctrl-C. Handle EINTR from poll() by treating it as a shutdown signal. Accept SIGINT termination of perf as success since we send it intentionally. Clean up the output directory when recording is interrupted via Ctrl-C instead of leaving partial data behind. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add -t/--timeout option to specify how long the profiling session should run, defaulting to 30 seconds. When the timeout elapses, the perf process is signaled to stop and recording completes normally. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add -l/--ldlat option to control the load latency threshold for perf mem record. Default is 10 cycles (perf's default is 30). Lower values capture more memory access events. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add the process subcommand that takes a profile file or directory via -f/--file option. If given a tar.gz archive, it extracts it to a directory with the same name. It then checks what files are present (perf.data and/or hints.jsonl) and prints a summary. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add perf.script generation and parsing capabilities. The record command gains a --perf-script flag that generates perf.script during recording (disabled by default; symbolization is better when enabled). Also implement the process command so that it creates a <dir>.out output directory containing perf.jsonl, generated by parsing the perf script output with the fields: comm, tid, pid, ip, addr, phys_addr, data_page_size, dso, sym. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Save the output of 'perf version' to perf.version file in the output directory at the start of recording. This helps identify which perf version was used to create the perf.data file. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Allow overriding the perf binary path by setting the SCXPROF_PERF environment variable. This is useful when testing with different perf versions or when perf is not in the default PATH. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
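A minimal sketch of how such an override could be resolved, assuming the fallback is the bare name `perf` looked up through PATH. The function names are hypothetical, and treating an empty value as unset is an assumption, not something the commit states.

```rust
use std::ffi::OsString;

/// Pick the perf binary to run: the override when it is set and non-empty,
/// otherwise the bare name "perf" (resolved through PATH at spawn time).
/// Treating an empty override as "unset" is an assumption of this sketch.
fn resolve_perf_binary(override_path: Option<OsString>) -> OsString {
    match override_path {
        Some(p) if !p.is_empty() => p,
        _ => OsString::from("perf"),
    }
}

/// Convenience wrapper that reads the SCXPROF_PERF environment variable.
#[allow(dead_code)]
fn perf_binary_from_env() -> OsString {
    resolve_perf_binary(std::env::var_os("SCXPROF_PERF"))
}
```

Keeping the environment lookup separate from the decision logic makes the fallback easy to test without mutating the process environment.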
When verbose is enabled, print unparseable lines to stderr as they are encountered during perf.script parsing. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Samples without a valid physical address are not useful for memory profiling analysis. Skip these entries and report the count in the summary output. In verbose mode, print each skipped line. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Skip samples from "perf" and "swapper" processes during processing as they are not relevant for workload memory profiling. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
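The two filtering rules above (no valid physical address, and samples from "perf" or "swapper") could be sketched as follows. The `Sample` and `FilterStats` types are hypothetical, a physical address of 0 is assumed to mean "invalid", and comm names are assumed to match exactly; the actual parser may differ on all three points.

```rust
/// Hypothetical subset of a parsed perf.script sample.
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    comm: String,
    phys_addr: u64,
}

/// Counts of skipped samples, kept separately so a summary can report them.
#[derive(Debug, Default, PartialEq)]
struct FilterStats {
    no_phys_addr: usize,
    ignored_comm: usize,
}

/// Comms that are not relevant for workload memory profiling.
const IGNORED_COMMS: [&str; 2] = ["perf", "swapper"];

/// Drop samples without a physical address (assumed to be reported as 0)
/// and samples whose comm is in IGNORED_COMMS; return the rest plus counts.
fn filter_samples(samples: Vec<Sample>) -> (Vec<Sample>, FilterStats) {
    let mut stats = FilterStats::default();
    let mut kept = Vec::new();
    for s in samples {
        if s.phys_addr == 0 {
            stats.no_phys_addr += 1;
        } else if IGNORED_COMMS.contains(&s.comm.as_str()) {
            stats.ignored_comm += 1;
        } else {
            kept.push(s);
        }
    }
    (kept, stats)
}
```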
perf script generation can consume significant time and resources. Since it's only useful when also analyzing collected symbol info, flip the default to disabled and require explicit opt-in. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add a new extract command that reads perf.jsonl and outputs comm frequencies in decreasing order. This serves as scaffolding for future extraction and analysis features. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
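As an illustration of the comm frequency output, here is a small sketch that counts comm names and sorts them by decreasing count. The function name is hypothetical, and breaking ties alphabetically is an assumption added only to make the ordering deterministic.

```rust
use std::collections::HashMap;

/// Count how often each comm appears and return (comm, count) pairs in
/// decreasing order of count. Ties are broken by comm name (an assumption
/// of this sketch, so that the output order is stable).
fn comm_frequencies<'a, I>(comms: I) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for comm in comms {
        *counts.entry(comm).or_insert(0) += 1;
    }
    let mut out: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(comm, n)| (comm.to_string(), n))
        .collect();
    out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out
}
```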
Separate samples into groups based on cgroup: (1) "other": samples not in workload.slice; (2) "workload.slice": samples in workload.slice but not in an allotment; (3) workload-tw-<UUID>.<id>.allotment.slice: individual workload allotments. Each group shows its sample count, percentage of total, and comm frequency breakdown. The default workload and allotment regexes can be overridden by the user with the available flags. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
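The three-way grouping can be sketched as a classifier over cgroup paths. This sketch swaps the user-overridable regexes for plain string matching to stay dependency-free, and the `Group` enum, the function name, and the example path layout are all hypothetical.

```rust
/// Hypothetical grouping of a sample by its cgroup path.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Group {
    /// Not under workload.slice.
    Other,
    /// Under workload.slice but not inside an allotment.
    Workload,
    /// Inside a specific workload-tw-<UUID>.<id>.allotment.slice.
    Allotment(String),
}

/// Classify a cgroup path using plain string matching. The real tool uses
/// configurable regexes; the fixed patterns here are a simplification.
fn classify_cgroup(path: &str) -> Group {
    let mut in_workload = false;
    for comp in path.split('/') {
        if comp == "workload.slice" {
            in_workload = true;
            continue;
        }
        if in_workload
            && comp.starts_with("workload-tw-")
            && comp.ends_with(".allotment.slice")
        {
            return Group::Allotment(comp.to_string());
        }
    }
    if in_workload {
        Group::Workload
    } else {
        Group::Other
    }
}
```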
Add --verbose flag to show detailed stats. By default, output a JSON config containing an allotment cell with a template regex, including CommPrefix matches for comms with >5% of samples; a workload.slice cell for the remaining workload samples; and a rest cell for everything else. More sophisticated clustering logic will follow in subsequent commits. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
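The ">5% of samples" selection could look roughly like the sketch below. It covers only the threshold step, not the JSON config itself, whose schema is not shown in this conversation. The function name and the strict greater-than comparison are assumptions.

```rust
/// Return the comms whose share of all samples strictly exceeds
/// `threshold_pct` (for example 5.0 for the >5% rule), preserving the
/// input order. `freqs` holds (comm, sample count) pairs.
fn significant_comms(freqs: &[(String, usize)], threshold_pct: f64) -> Vec<String> {
    let total: usize = freqs.iter().map(|(_, n)| *n).sum();
    if total == 0 {
        return Vec::new();
    }
    freqs
        .iter()
        .filter(|(_, n)| (*n as f64) * 100.0 / (total as f64) > threshold_pct)
        .map(|(comm, _)| comm.clone())
        .collect()
}
```

A comm selected this way would then become a CommPrefix match in its group's cell; groups with no selected comms have nothing to emit at this level.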
Make a few changes in preparation for the more sophisticated clustering in subsequent commits. First, change GroupStats to GroupData, which stores the actual samples in time order, sharded by group. Pass sample slices to compute_clusters for future memory clustering. Generalize the generate_config loop to handle all group types uniformly, since we may need to cluster tasks in those groups as well and create subcells. Finally, add verbosity levels: -v for a summary of entries above 1%, -vv for detailed output that includes everything. While at it, also skip subcells when there are no significant comms (leaf cells), to reduce unnecessary overhead in the scheduler consuming the cell config. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Contributor
I wonder
Contributor
Could the outputs of this be structured such that writing files out is an option, not a requirement? I.e., could file writing be part of main.rs, so that someone could import lib.rs and skip writing/reading JSON to disk?
Introduce a workload profiler tool that extracts clusters of threads that should be placed together in a soft partition, to maximize the chance that memory access locality is preserved and thereby reduce destructive interference caused by CPU scheduling decisions.
Examples