radiant-systems-lab/ProvScope

ProvScope

ProvScope finds points of divergence and convergence between two executions of the same program under different inputs. Given two function-call traces captured by Intel Pin, it aligns them hierarchically against the program's CFG and reports exactly where control flow first diverges and where it reconverges.

Published at HiPC 2022.


How It Works

Binary + Intel Pin                   Binary + Intel Pin
(input A)                            (input B)
     │                                    │
     ▼                                    ▼
Raw function trace                  Raw function trace
(->funcName / <-funcName lines)     (->funcName / <-funcName lines)
     │                                    │
     └──────────────┬─────────────────────┘
                    ▼
        tools/extractFunccalls.py
        (filter to syscall-relevant calls,
         detect non-returning functions)
                    │
                    ▼
          .ftr  (filtered function trace)
          .nr   (non-returning function list)
                    │
                    ▼
              src/provScope -c
        (align traces against CFG,
         compute divergence/convergence)
                    │
                    ▼
          Divergence/convergence report
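
To make the terms "divergence" and "convergence" concrete, here is a minimal Python sketch that works on two flat call sequences. It is not ProvScope's algorithm, which aligns traces hierarchically against the CFG; it only uses a common-prefix scan plus the standard library's `difflib.SequenceMatcher` to show what the two reported points mean. The function name and the sample call names in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def first_divergence_and_reconvergence(trace_a, trace_b):
    """Locate the first divergence and the next reconvergence in two flat traces.

    Returns (divergence, reconvergence), each an (index_in_a, index_in_b) pair.
    Both are None when the traces are identical; reconvergence is None when
    the traces never share another call after diverging.
    """
    # Length of the common prefix: the traces agree up to this point.
    n = 0
    while n < len(trace_a) and n < len(trace_b) and trace_a[n] == trace_b[n]:
        n += 1
    if n == len(trace_a) and n == len(trace_b):
        return None, None

    # Look for the earliest common run in the two remaining suffixes.
    matcher = SequenceMatcher(None, trace_a[n:], trace_b[n:], autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size > 0:
            return (n, n), (n + block.a, n + block.b)
    return (n, n), None
```

For example, with `a = ["main", "getopt_long", "open", "read", "write", "close"]` and `b = ["main", "getopt_long", "open", "fstat", "mmap", "write", "close"]`, the call returns `((3, 3), (4, 5))`: the traces split after `open` and meet again at `write`.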

Dependencies

| Dependency | Purpose |
| --- | --- |
| Intel Pin | Dynamic binary instrumentation to capture function traces |
| g++ with C++14 | Building provScope |
| OpenSSL (libssl-dev, libcrypto) | Hash computation in the alignment engine |
| jsoncpp (libjsoncpp-dev) | JSON output |
| Python 3 | Preprocessing scripts in tools/ |
| CFG files from LLVManalyses | Per-function control flow graphs (the coreutil_parsed/ data in this repo was generated there) |

Install on Ubuntu:

sudo apt-get install g++ libssl-dev libjsoncpp-dev
# Intel Pin: download from https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-binary-instrumentation-tool-downloads.html

Repository Layout

ProvScope/
├── src/                        # Main C++ analysis tool (provScope)
│   ├── provScope.cpp           # Entry point and mode dispatch
│   ├── funcTrace.cpp/h         # Hierarchical function trace construction
│   ├── cfg.cpp/h               # CFG loading and traversal
│   ├── Comparison.cpp/h        # Divergence/convergence alignment algorithm
│   ├── Matrix.cpp/h            # Edit-distance matrix
│   ├── subgraph.cpp/h          # Subgraph matching
│   ├── readFiles.cpp/h         # File I/O for traces and CFGs
│   ├── Args.cpp/h              # CLI argument parsing
│   ├── ec.cpp/h                # Error codes
│   ├── tools.cpp/h             # Utilities
│   ├── automate.py             # Batch runner for experiments
│   ├── inputs/                 # Prepared input files for experiments
│   └── Makefile
├── tools/                      # Preprocessing and visualization scripts
│   ├── extractFunccalls.py     # Parse raw Pin trace → .ftr + .nr files
│   ├── FuncAnalysis/           # Intel PinTool that captures function traces
│   ├── funcList.py             # glibc function list helper
│   ├── convertDot.py           # Convert CFG .txt → Graphviz .dot
│   ├── convertDotDirectory.py  # Batch convertDot.py over a directory
│   ├── countDiff.py            # Count differences in `diff` output
│   ├── cntNumLine.py           # Count trace lines from `main` entry
│   ├── reduceLines.py          # Reduce trace size
│   ├── CONSTANTS.py            # Shared constants and arg parsing
│   └── Makefile
├── coreFuncTraceInput/         # Pre-processed .ftr trace files (coreutils benchmarks)
├── coreutil_parsed/            # Pre-generated CFG files (from LLVManalyses)
├── noRetFuncs/                 # Pre-generated .nr files (non-returning functions)
├── funcList.txt                # glibc function list (used to filter syscall-relevant calls)
├── glibc.txt                   # glibc symbol reference
└── noRetFuncList.txt           # Aggregate non-returning function list

Building

cd src
make
# Output: src/provScope

Usage

Step 0 — Capture Function Traces with Intel Pin

First, build the PinTool:

# Copy the FuncAnalysis tool into your Pin installation:
cp -r tools/FuncAnalysis $PIN_ROOT/source/tools/
cd $PIN_ROOT/source/tools/FuncAnalysis
make

Run your binary under Pin to capture the raw trace:

$PIN_ROOT/pin -t $PIN_ROOT/source/tools/FuncAnalysis/obj-intel64/FuncAnalysis.so \
    -o trace_inputA.txt -- ./your_binary [inputA]

$PIN_ROOT/pin -t $PIN_ROOT/source/tools/FuncAnalysis/obj-intel64/FuncAnalysis.so \
    -o trace_inputB.txt -- ./your_binary [inputB]

The raw trace uses ->funcName for function entry and <-funcName for return.

Step 1 — Preprocess Traces

Filter each trace down to syscall-relevant calls and detect non-returning functions:

cd tools
python3 extractFunccalls.py trace_inputA.txt funcList.txt
# Outputs: trace_inputA.ftr  trace_inputA.nr

python3 extractFunccalls.py trace_inputB.txt funcList.txt
# Outputs: trace_inputB.ftr  trace_inputB.nr

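funcList.txt lists glibc functions; calls that only reach libc without touching a syscall are pruned. The file lives at the repo root, so when running from tools/ as above you may need to pass its path explicitly (for example ../funcList.txt).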

Step 2 — Generate CFGs (via LLVManalyses)

CFG files are produced by the companion LLVManalyses repo. The output is a directory of per-function .txt files. Pre-generated examples for coreutils are in coreutil_parsed/.

Step 3 — Find Divergence/Convergence

./src/provScope -c funcList.txt trace_inputA.nr coreutil_parsed/your_parsed \
    trace_inputA.ftr trace_inputB.ftr

Arguments:

| Position | Argument | Description |
| --- | --- | --- |
| 1 | -c | Compare mode |
| 2 | funcList.txt | glibc function list (filters non-syscall calls) |
| 3 | *.nr | Non-returning functions file (one file covers both traces since they share the same binary) |
| 4 | coreutil_parsed/<prog>_parsed | Directory of per-function CFG .txt files |
| 5 | trace1.ftr | First preprocessed function trace |
| 6 | trace2.ftr | Second preprocessed function trace |

Concrete example (using the repo's pre-generated data):

./src/provScope -c funcList.txt noRetFuncs/uniq_all.nr coreutil_parsed/uniq_parsed \
    coreFuncTraceInput/uniq/uniqc.ftr coreFuncTraceInput/uniq/uniqd.ftr

Batch Mode

Prepare a text file where each line holds the arguments for one comparison:

funcList.txt noRetFuncs/uniq_all.nr coreutil_parsed/uniq_parsed coreFuncTraceInput/uniq/uniqc.ftr coreFuncTraceInput/uniq/uniqd.ftr

Then run:

./src/provScope -f input.txt
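
When many traces share one binary, the batch file can be generated rather than written by hand. The sketch below assumes the line layout shown above (five whitespace-separated fields, no flag, no spaces inside paths) and assumes every trace in the list uses the same .nr file and CFG directory. The helper names `batch_lines` and `write_batch_file` are hypothetical and not part of the repo.

```python
from itertools import combinations


def batch_lines(func_list, nr_file, cfg_dir, ftr_files):
    """Build one batch-file line per unordered pair of .ftr traces."""
    lines = []
    for ftr1, ftr2 in combinations(ftr_files, 2):
        # Field order mirrors the -c arguments: funcList, .nr, CFG dir, two traces.
        lines.append(" ".join([func_list, nr_file, cfg_dir, ftr1, ftr2]))
    return lines


def write_batch_file(path, lines):
    """Write the generated lines to a file suitable for `provScope -f`."""
    with open(path, "w") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Calling `batch_lines("funcList.txt", "noRetFuncs/uniq_all.nr", "coreutil_parsed/uniq_parsed", ["coreFuncTraceInput/uniq/uniqc.ftr", "coreFuncTraceInput/uniq/uniqd.ftr"])` reproduces the single example line shown above.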

All Modes

| Flag | Argc | Arguments | Description |
| --- | --- | --- | --- |
| -c | 7 | funcList noRetFile parsedCFGDir ftr1 ftr2 | Compare two traces (main use case) |
| -p | 6 | funcList noRetFile parsedCFGDir ftr1 | Find all paths in a single trace |
| -t | 6 | funcList noRetFile parsedCFGDir ftr1 | Print trace in hierarchical format |
| -s | 4 | parsedCFGDir outfile | Compute program specification from CFGs |
| -f | 3 | inputFile | Batch mode: read arguments from file |
| -h | 2 | (none) | Print help |
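
When driving provScope from a script, it can help to check the argument count before launching the binary. The sketch below assumes that the Argc column counts the program name and the flag (which matches `-h` having an Argc of 2); the helper `build_argv` and its default binary path are hypothetical.

```python
# Expected argc per mode, copied from the table above.
EXPECTED_ARGC = {"-c": 7, "-p": 6, "-t": 6, "-s": 4, "-f": 3, "-h": 2}


def build_argv(flag, *args, binary="./src/provScope"):
    """Return an argv list for provScope, or raise if the count is wrong."""
    if flag not in EXPECTED_ARGC:
        raise ValueError(f"unknown mode flag: {flag}")
    argv = [binary, flag, *args]
    if len(argv) != EXPECTED_ARGC[flag]:
        raise ValueError(
            f"{flag} expects argc {EXPECTED_ARGC[flag]}, got {len(argv)}"
        )
    return argv
```

The returned list can be passed directly to `subprocess.run(argv, check=True)`.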

File Formats

Raw Pin trace (input to extractFunccalls.py):

->main
->set_program_name
<-set_program_name
->getopt_long
<-getopt_long

.ftr — filtered function trace (input to provScope):

main
set_program_name
/set_program_name
getopt_long
/getopt_long

Uses /funcName for returns. Only calls along syscall-reaching paths are kept.
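
The marker rewrite between the two formats can be sketched as follows. This covers only the notation change from `->name` / `<-name` to `name` / `/name`; it does not reproduce the syscall-relevance filtering that extractFunccalls.py performs, and the function name `raw_to_ftr_lines` is hypothetical.

```python
def raw_to_ftr_lines(raw_lines):
    """Rewrite raw Pin trace markers into .ftr-style lines (no filtering)."""
    out = []
    for line in raw_lines:
        line = line.strip()
        if line.startswith("->"):
            # Function entry: keep the bare name.
            out.append(line[2:])
        elif line.startswith("<-"):
            # Function return: prefix the name with a slash.
            out.append("/" + line[2:])
        # Any other line (blank or unrecognized) is skipped in this sketch.
    return out
```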

.nr — non-returning functions (one name per line):

strrchr
__ofl_unlock
__stdio_close

These are functions that exit via a jump rather than ret. The list is needed to reconstruct the call hierarchy correctly.
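
One plausible way to detect such functions from a raw trace is a stack scan: when a return arrives for a function that is not on top of the call stack, every frame above it was entered but never returned. The sketch below illustrates that idea under this assumption; it is not the code in extractFunccalls.py, and `find_non_returning` is a hypothetical name.

```python
def find_non_returning(raw_lines):
    """Return functions that were entered but skipped over by a later return."""
    stack = []
    non_returning = []  # discovery order, without duplicates
    seen = set()
    for line in raw_lines:
        line = line.strip()
        if line.startswith("->"):
            stack.append(line[2:])
        elif line.startswith("<-"):
            name = line[2:]
            if name not in stack:
                continue  # unmatched return: ignored in this sketch
            # Frames above the nearest matching entry never saw their own return.
            while stack[-1] != name:
                skipped = stack.pop()
                if skipped not in seen:
                    seen.add(skipped)
                    non_returning.append(skipped)
            stack.pop()
    return non_returning
```

Frames still on the stack when the trace ends (for example after a call to `exit`) are not reported here; whether to treat them as non-returning is a separate design choice.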

Parsed CFG .txt (one file per function, in coreutil_parsed/):

0x24198c0,epoint,0,0,0,na,na,na

Comma-delimited node records. Generated by LLVManalyses.
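
A record like the one above can be split into fields with a few lines of Python. The README does not document what each column means, so this sketch keeps the fields positional; reading the first field as a hexadecimal identifier is an assumption based on its `0x` prefix, and `parse_cfg_record` is a hypothetical helper.

```python
def parse_cfg_record(line):
    """Split one comma-delimited CFG node record into (node_id, fields).

    node_id is the first field parsed as hexadecimal (an assumption);
    fields is the full list of raw string fields, left uninterpreted.
    """
    fields = line.strip().split(",")
    node_id = int(fields[0], 16)  # int() with base 16 accepts the "0x" prefix
    return node_id, fields
```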


Experiments (HiPC 2022)

All benchmark data is pre-included in the repo:

| Experiment | Data |
| --- | --- |
| Differential locations | coreFuncTraceInput/ + noRetFuncs/ + coreutil_parsed/ (run with -c mode) |
| Tracing overhead | tools/FuncAnalysis/ PinTool |
| Reduction in PIN traces | tools/extractFunccalls.py (lines before/after filtering) |
| CFG specification size | LLVManalyses repo |

Benchmarks: cat, chown, date, sort, uniq, b2sum, bzip2, bwa, mcf, minimap2.


Known Limitations

  • Requires Intel Pin for trace capture (proprietary, must be downloaded separately)
  • CFG files must be pre-generated by the LLVManalyses repo
  • Non-returning function detection is a best-effort stack scan; edge cases may require manual .nr correction
  • Collective calls / inlined functions may require special handling in extractFunccalls.py
