
scxprof #3282

Draft
kkdwvd wants to merge 21 commits into sched-ext:main from kkdwvd:scxprof


@kkdwvd kkdwvd commented Feb 2, 2026

Introduce a workload profiler tool that extracts clusters of threads which should be placed together in a soft partition. Co-locating them improves the chance that memory access locality is preserved and reduces destructive interference caused by CPU scheduling decisions.

Examples

# 1. Record memory access profile (30 seconds by default)                            
scxprof record -o myprofile          

# 2. Process the recorded profile        
scxprof process -f myprofile.tar.gz  

# 3. Extract cell configuration          
scxprof extract -f myprofile.out/perf.jsonl                                      

Commands                                 

# Record                                   

scxprof record [OPTIONS]                 

Options:                                 
  -o, --output <DIR>          Output directory [default: scxprof.out]                
  -t, --timeout <SECONDS>     Recording duration [default: 30]                       
  -l, --ldlat <CYCLES>        Load latency threshold [default: 10]                   
      --enable-perf-script    Generate perf.script during recording                  
      --disable-archive       Keep directory instead of creating tar.gz              

# Process                                  

scxprof process [OPTIONS]      

Options:                                 
  -f, --file <PATH>    Path to profile (tar.gz or directory)                         
  -v, --verbose        Print parsing details                                         

# Extract                                  

scxprof extract [OPTIONS]                

Options:                                 
  -f, --file <PATH>                                Path to perf.jsonl file
      --workload-cgroup-regex <REGEX>              Workload cgroup pattern [default: workload.slice]
      --workload-allotment-cgroup-regex <REGEX>    Allotment cgroup pattern
  -v, --verbose                                    Verbosity level (-v summary, -vv detailed)

Example Workflow                         

# Record on production host              
$ scxprof record -t 60 -o prod-profile   

# Transfer and process                   
$ scxprof process -f prod-profile.tar.gz 
Output directory: prod-profile.out       
Copying perf.script...                   
Parsing perf.script to generate perf.jsonl...                                        
Parsed 48231 of 52104 records (3412 skipped, 461 unparseable)                        

# Extract with verbose stats             
$ scxprof extract -f prod-profile.out/perf.jsonl -v                                                                                                        
Total samples: 48231                     

workload.allotment.slice: 28419 samples (58.92%)                              
  web: 18234 (64.16%)                   
  memcached: 5821 (20.48%)                
  mcrouter: 2104 (7.40%)                 
  ... 12 more below 1%                   

workload.slice: 15102 samples (31.31%)   
  systemd: 4521 (29.94%)                 
  ... 8 more below 1%                    

rest: 4710 samples (9.77%)               
  kworker: 1203 (25.54%)                 
  ... 15 more below 1%                   
  
[                                                                                                                                                                            
  {                                                                                                                                                                          
    "name": "allotment",                                                                                                                                                     
    "match": {                                                                                                                                                               
      "CgroupRegex": "workload\\.allotment\\.slice"                                                                                                                          
    },                                                                                                                                                                       
    "subcells": [                                                                                                                                                            
      {                                                                                                                                                                      
        "name": "web",                                                                                                                                                       
        "matches": [[{"CommPrefix": "web"}]]                                                                                                                                 
      },                                                                                                                                                                     
      {                                                                                                                                                                      
        "name": "memcached",                                                                                                                                                 
        "matches": [[{"CommPrefix": "memcached"}]]                                   
      },                                                                             
      {                                  
        "name": "mcrouter",              
        "matches": [[{"CommPrefix": "mcrouter"}]]                                    
      },                                 
      {                                  
        "name": "rest",                  
        "matches": [[]]                  
      }                                  
    ]                                    
  },                                     
  {                                                                                  
    "name": "workload.slice",                                                        
    "matches": [[{"CgroupContains": "workload.slice"}]],                             
    "subcells": [                        
      {                                                                              
        "name": "rest",                  
        "matches": [[]]                  
      }                                                                              
    ]                                    
  },                                     
  {                                      
    "name": "rest",                      
    "matches": [[]],                     
    "subcells": [                        
      {                                  
        "name": "rest",                  
        "matches": [[]]                  
      }                                  
    ]                                    
  }                                      
]

kkdwvd added 20 commits February 2, 2026 05:01
Add basic CLI structure for the SCX workload profiler tool. This
provides the foundation for adding profiling subcommands.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add the record subcommand that shells out to perf mem record with the
--all-cgroups and --data-page-size flags. Output is written to the
specified output directory.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add BPF-based tracing for task hint map updates. The HintsRecorder uses
fentry on bpf_map_update_elem to capture changes to task storage maps,
emitting events through a ring buffer. Events are written to hints.jsonl.

Implement a poll-based event loop that multiplexes the shutdown signal,
perf process exit, and ring buffer readiness. HintsRecorder cleans up in
Drop by detaching the BPF program first, then draining remaining events.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
After recording completes, create a tar.gz archive of the output
directory. This can be disabled with the --disable-archive flag.

Check whether the output directory exists before recording and bail with
an error if it does. Clean up the output directory if recording fails.
Archiving is done after successful recording, so archive failures don't
trigger cleanup.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
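
The directory handling described in the commit above could look roughly like the following. This is a minimal sketch using only the standard library; `record_into` and its closure parameter are hypothetical names, not scxprof's actual API.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Runs `record` against a freshly created `out_dir`.
///
/// Hypothetical helper, not scxprof's actual code: it refuses to reuse an
/// existing directory and removes the directory again if `record` fails.
pub fn record_into<F>(out_dir: &Path, record: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    if out_dir.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            format!("output directory {} already exists", out_dir.display()),
        ));
    }
    fs::create_dir_all(out_dir)?;

    if let Err(err) = record(out_dir) {
        // Best-effort cleanup so a failed run leaves no partial data behind.
        let _ = fs::remove_dir_all(out_dir);
        return Err(err);
    }
    // Archiving would run here, after recording succeeded, so an archive
    // failure never reaches the cleanup path above.
    Ok(())
}
```

Passing the recording step as a closure keeps the check, create, and cleanup ordering in one place, which is the property the commit message describes.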
Add --hints-map-ring-sz option to configure the hints ring buffer size
in MB (default: 1). Track dropped events when the ring buffer is full
and warn the user at the end of recording with the count and a
suggestion to increase the buffer size.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add Drop implementation for SpawnedProcess that sends SIGINT and waits
for the child to exit. This ensures spawned processes are properly
cleaned up when the program exits, including on Ctrl-C.

Handle EINTR from poll() by treating it as a shutdown signal. Accept
SIGINT termination of perf as success since we send it intentionally.

Clean up the output directory when recording is interrupted via Ctrl-C
instead of leaving partial data behind.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add -t/--timeout option to specify how long the profiling session should
run, defaulting to 30 seconds. When the timeout elapses, the perf process
is signaled to stop and recording completes normally.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add -l/--ldlat option to control the load latency threshold for perf mem
record. Default is 10 cycles (perf's default is 30). Lower values capture
more memory access events.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add the process subcommand that takes a profile file or directory via
-f/--file option. If given a tar.gz archive, it extracts it to a
directory with the same name. It then checks what files are present
(perf.data and/or hints.jsonl) and prints a summary.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add perf.script generation and parsing capabilities. Add a
--perf-script flag to the record command that generates perf.script
during recording (disabled by default; symbolization is better when
enabled). Also implement the process command so that it creates a
<dir>.out output directory containing perf.jsonl generated from
perf.script. The perf script output is parsed with the fields:
comm, tid, pid, ip, addr, phys_addr, data_page_size, dso, sym.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
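
The naming convention from the two process commits (a tar.gz unpacks to a directory of the same name, and results go to `<dir>.out`) can be sketched as below. The function names are hypothetical, and normalizing a trailing path separator is an assumption of this sketch rather than documented behavior.

```rust
use std::path::{Path, PathBuf};

/// Directory an input unpacks to: `foo.tar.gz` -> `foo`.
/// A directory input is used as-is (minus any trailing separator).
pub fn profile_dir(input: &Path) -> PathBuf {
    let name = input.to_string_lossy();
    let dir: PathBuf = match name.strip_suffix(".tar.gz") {
        Some(stem) => PathBuf::from(stem),
        // Re-collecting the components drops a trailing `/`.
        None => input.components().collect(),
    };
    dir
}

/// Output directory for processed data: `<dir>.out`.
pub fn output_dir(input: &Path) -> PathBuf {
    let mut out = profile_dir(input).into_os_string();
    out.push(".out");
    PathBuf::from(out)
}
```

With these rules, `prod-profile.tar.gz` maps to `prod-profile.out`, matching the "Output directory" line in the example workflow.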
Save the output of 'perf version' to perf.version file in the output
directory at the start of recording. This helps identify which perf
version was used to create the perf.data file.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Allow overriding the perf binary path by setting the SCXPROF_PERF
environment variable. This is useful when testing with different
perf versions or when perf is not in the default PATH.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
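
A minimal sketch of the override lookup, assuming the fallback is the bare name `perf` resolved through PATH. Only the `SCXPROF_PERF` variable name comes from the commit; `resolve_perf` and `perf_binary` are hypothetical, and treating an empty value as unset is an assumption.

```rust
use std::env;
use std::ffi::OsString;

/// Name of the override variable, as given in the commit message.
const PERF_ENV: &str = "SCXPROF_PERF";

/// Picks the perf binary: the override if set and non-empty, else "perf".
/// Treating an empty value as unset is an assumption of this sketch.
pub fn resolve_perf(override_value: Option<OsString>) -> OsString {
    match override_value {
        Some(path) if !path.is_empty() => path,
        _ => OsString::from("perf"),
    }
}

/// Reads the environment and resolves the binary to run.
pub fn perf_binary() -> OsString {
    resolve_perf(env::var_os(PERF_ENV))
}
```

Keeping the decision in `resolve_perf` makes it testable without mutating the process environment.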
When verbose is enabled, print unparseable lines to stderr as they
are encountered during perf.script parsing.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Samples without a valid physical address are not useful for memory
profiling analysis. Skip these entries and report the count in the
summary output. In verbose mode, print each skipped line.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Skip samples from "perf" and "swapper" processes during processing
as they are not relevant for workload memory profiling.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
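
The two skip rules above (no valid physical address, and samples from "perf" or "swapper") plus the parsed/skipped/unparseable summary might be implemented along these lines. The `Sample` and `ParseStats` types are hypothetical stand-ins, and both "valid means non-zero" and exact comm matching are assumptions of this sketch.

```rust
/// Minimal stand-in for one parsed perf.script record (hypothetical type;
/// the real records also carry tid, pid, ip, addr, dso, sym, ...).
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub comm: String,
    pub phys_addr: u64,
}

/// Running totals for the summary line.
#[derive(Debug, Default, PartialEq)]
pub struct ParseStats {
    pub parsed: usize,
    pub skipped: usize,
    pub unparseable: usize,
}

/// Assumption: a physical address of 0 means "not resolved".
fn should_skip(sample: &Sample) -> bool {
    sample.phys_addr == 0 || sample.comm == "perf" || sample.comm == "swapper"
}

/// `records` holds one entry per input line; `None` marks a line that
/// failed to parse.
pub fn filter_samples(records: Vec<Option<Sample>>) -> (Vec<Sample>, ParseStats) {
    let mut stats = ParseStats::default();
    let mut kept = Vec::new();
    for record in records {
        match record {
            None => stats.unparseable += 1,
            Some(s) if should_skip(&s) => stats.skipped += 1,
            Some(s) => {
                stats.parsed += 1;
                kept.push(s);
            }
        }
    }
    (kept, stats)
}
```

The three counters always add up to the number of input records, which is the invariant the "Parsed N of M records" line in the example workflow reflects.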
perf script generation can consume significant time and resources.
Since it's only useful when also analyzing collected symbol info,
flip the default to disabled and require explicit opt-in.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Add a new extract command that reads perf.jsonl and outputs comm
frequencies in decreasing order. This serves as scaffolding for
future extraction and analysis features.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
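
Counting comm frequencies and listing them in decreasing order can be sketched as follows. `comm_frequencies` is a hypothetical name, and breaking ties by comm name is a choice made here for deterministic output, not something the commit specifies.

```rust
use std::collections::HashMap;

/// Counts samples per comm and returns (comm, count) pairs, most frequent
/// first. Ties are broken by name so the output is deterministic.
pub fn comm_frequencies<'a, I>(comms: I) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for comm in comms {
        *counts.entry(comm).or_insert(0) += 1;
    }
    let mut sorted: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(comm, n)| (comm.to_string(), n))
        .collect();
    // Descending by count, then ascending by name.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted
}
```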
Separate samples into groups based on cgroup:
1. "other": samples not in workload.slice
2. "workload.slice": samples in workload.slice but not in an allotment
3. workload-tw-<UUID>.<id>.allotment.slice: individual workload allotments

Each group shows its sample count, percentage of total, and comm
frequency breakdown. The default workload and allotment regexes can be
overridden by the user with the available flags.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
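
The three-way grouping above can be illustrated with a small classifier. Note that this sketch swaps the configurable regexes for plain substring and suffix checks to stay within the standard library, and the `/`-separated cgroup path layout it assumes is hypothetical.

```rust
/// Which bucket a sample's cgroup falls into.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Group {
    /// Not under workload.slice.
    Other,
    /// Under workload.slice but not in an allotment.
    Workload,
    /// An individual allotment, keyed by its slice name.
    Allotment(String),
}

/// Classifies a cgroup path. Substring matching stands in for the
/// user-overridable regexes; this is an assumption of the sketch.
pub fn classify(cgroup: &str) -> Group {
    if !cgroup.contains("workload.slice") {
        return Group::Other;
    }
    match cgroup
        .split('/')
        .find(|component| component.ends_with(".allotment.slice"))
    {
        Some(allotment) => Group::Allotment(allotment.to_string()),
        None => Group::Workload,
    }
}
```

Returning the allotment slice name in the enum lets a caller shard samples per allotment with an ordinary map keyed by `Group`.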
Add --verbose flag to show detailed stats. By default, output a JSON
config containing an allotment cell with a template regex, including
CommPrefix matches for comms with >5% of samples; a workload.slice cell
for the remaining workload samples; and a rest cell for everything else.
More sophisticated clustering logic will follow in subsequent commits.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
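
The >5% rule for emitting CommPrefix subcells might be expressed like this. `significant_comms` is a hypothetical helper, and the threshold is passed in as a fraction (0.05) rather than hard-coded.

```rust
/// Returns the comms whose share of `total` is strictly above `threshold`
/// (0.05 for the 5% rule), preserving input order.
pub fn significant_comms(counts: &[(&str, u64)], total: u64, threshold: f64) -> Vec<String> {
    if total == 0 {
        return Vec::new();
    }
    counts
        .iter()
        .filter(|(_, n)| (*n as f64) / (total as f64) > threshold)
        .map(|(comm, _)| comm.to_string())
        .collect()
}
```

Applied to the allotment counts from the example workflow (web 18234, memcached 5821, mcrouter 2104 out of 28419 samples), all three comms clear the threshold, which matches the three named subcells in the example JSON.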
Make a few changes in preparation for the more sophisticated clustering
in subsequent commits. First, change GroupStats to GroupData, storing
the actual samples in time order, sharded by group. Pass sample slices
to compute_clusters for future memory clustering. Generalize the
generate_config loop to handle all group types uniformly, since we may
need to cluster tasks in those groups as well and create subcells.
Finally, add verbosity levels: -v for a >1% summary, -vv for detailed
output that includes everything. While at it, also skip subcells when
there are no significant comms (leaf cells), to reduce unnecessary
overhead in the scheduler consuming the cell config.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
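
The two verbosity levels from the last commit can be sketched as a formatter for the per-comm lines of one group. `comm_lines` is a hypothetical name; computing percentages against the sum of the listed counts, and treating exactly 1% as visible, are assumptions of this sketch.

```rust
/// Formats the per-comm lines for one group.
/// `detailed == false` mimics `-v`: comms under 1% are folded into one line.
/// `detailed == true` mimics `-vv`: every comm is listed.
pub fn comm_lines(comms: &[(&str, u64)], detailed: bool) -> Vec<String> {
    let total: u64 = comms.iter().map(|(_, n)| *n).sum();
    let mut lines = Vec::new();
    let mut hidden = 0usize;
    for (comm, n) in comms {
        let pct = if total == 0 {
            0.0
        } else {
            *n as f64 * 100.0 / total as f64
        };
        if detailed || pct >= 1.0 {
            lines.push(format!("  {}: {} ({:.2}%)", comm, n, pct));
        } else {
            hidden += 1;
        }
    }
    if hidden > 0 {
        lines.push(format!("  ... {} more below 1%", hidden));
    }
    lines
}
```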

htejun commented Feb 2, 2026

I wonder if scxprof may be too generic a name.

@likewhatevs

Could the outputs of this be structured so that writing files out is optional rather than required?

i.e., could file writing live in main.rs, so that someone could import the lib.rs it uses and skip writing/reading JSON to disk?


3 participants