CodeCrash/ADVANCED_USAGE.md at main · CUHK-ARISE/CodeCrash

🚀 Perturb a Dataset

Below are all the arguments for perturb.py for the remote evaluation:

# Apply a perturbation to a pre-defined dataset
python perturb.py \
    --dataset [crux|lcb] \
    --perturbation [REN|RTF|GBC|PSC_ALL|MCC|MPS] \
    --output-name "perturbed_dataset"

# Apply MHC perturbation using GPT-4o to a customized dataset
python perturb.py \
    --dataset-path ".../crux.jsonl" \
    --perturbation MHC \
    --model "gpt-4o" \
    --platform [openai|anthropic|gemini|azure|deepinfra|deepseek|qwen|sglang] \
    --task [input|output] \
    --output-name "crux_mhc" \
    --max-workers 5

Dataset Source

Choose one of the following:

Argument	Type	Choices	Description
`--dataset`	`str`	`lcb`, `crux`	Pre-defined dataset (LiveCodeBench or CruxEval)
`--dataset-path`	`str`	-	Path to custom dataset file (JSONL format)

Required Arguments

Argument	Type	Choices	Description
`--perturbation`	`str`	`REN`, `RTF`, `GBC`, `PSC_ALL`, `MCC`, `MPS`, `MHC`	Type of perturbation to apply
`--output-name`	`str`	-	Name to save the perturbed dataset
`--safe-eval`	`flag`	-	Use safe evaluation with timeout and isolation

Optional Arguments

For MCC and MPS Perturbations

Argument	Type	Default	Range	Description
`--once`	flag	`False`	-	Apply perturbation only once per code sample
`--p`	`float`	`1.0`	`[0.0, 1.0]`	Controable injection probability

For MHC Perturbation

Argument	Type	Required	Description
`--model`	`str`	✅	Model name to generate incorrect answers
`--platform`	`str`	✅	Pre-defined platform: `openai`, `azure`, `deepinfra`, `deepseek`, `gemini`, `qwen`, `anthropic`, `sglang`
`--task`	`str`	✅	Task type: `input` or `output` prediction
`--max-workers`	`int`	`1`	Maximum number of parallel worker threads

🧪 Generate Outputs

Below are all the arguments for process.py for the remote evaluation:

python process.py \
    --dataset [crux|lcb] \
    --perturbation [VAN|REN|RTF|GBC|PSC_ALL|MCC|MPS|MHC] \
    --model "gpt-4o-mini" \
    --platform [openai|anthropic|gemini|azure|deepinfra|deepseek|qwen|sglang] \
    --task [input|output] \
    --infer-mode [direct|cot] \
    --num-samples 2 \
    --max-workers 10 \
    --load-existing \
    --evaluate

Dataset Source

Choose one of the following (mutually exclusive):

Argument	Type	Choices	Description
`--dataset`	`str`	`lcb`, `crux`	Loading pre-perturbed dataset from HuggingFace (LiveCodeBench or CruxEval)
`--dataset-path`	`str`	-	Path to custom dataset file (JSONL format)

Note

--perturbation is required when using --dataset, and cannot be used with --dataset-path.

General Settings

Argument	Type	Default	Description
`--output-name`	`str`	🤖*	Custom filename for solutions
`--task`	`str`	Required	`input` or `output` prediction
`--infer-mode`	`str`	Required	`direct` or `cot` (chain-of-thought)
`-n, --num-samples`	`int`	`1`	Number of responses per question
`--max-workers`	`int`	`1`	Parallel worker threads
`--load-existing`	`flag`	`False`	Resume from existing solutions

Note

*🤖 Auto-generated format: {dataset}_{task}_{perturbation}_{infer_mode}.jsonl.

Model Settings

Argument	Type	Required	Description
`--model`	`str`	✅	Model name (e.g., `gpt-4o-mini`, `claude-sonnet-4`)
`--platform`	`str`	✅	Pre-defined platform: `openai`, `azure`, `deepinfra`, `deepseek`, `gemini`, `qwen`, `anthropic`, `sglang`
`--folder-name`	`str`	🤖*	Custom folder for outputs

Note

*🤖 Auto-generate based on the model name by replacing \ to _.

Model Configuration

Argument	Type	Default	Description / Range
`--max-tokens`	`int`	`2000`	Maximum generation tokens
`--temperature`	`float`	`0.2`	Sampling temperature / `[0.0, 2.0]`
`--top-p`	`float`	`0.95`	Nucleus sampling / `[0.0, 1.0]`
`--delay`	`int`	`0`	Delay between requests (seconds)
`--stream`	`flag`	`False`	Enable streaming generation
`--timeout`	`int`	`300`	Request timeout (seconds)

Evaluation Settings

Argument	Type	Default	Description
`--evaluate`	`flag`	`False`	Auto-evaluate after generation

📊 Evaluate a File

General Settings

Argument	Type	Default	Description
`--filepath`	`str`	Required	Path to the generated result file
`--task`	`str`	Required	`input` or `output` prediction
`--max-workers`	`int`	`10`	Parallel worker threads

🎭 Perturbations Introduction

In CodeCrash, we prepared three kinds of perturbations

Tag	Full Name	Type
`REN`	Renaming Entities	Structural
`RTF`	Reformatting Conditional Expressions	Structural
`GBC`	Inserting Garbage Code Segments	Structural
`PSC_ALL`	Aggregated Structural Perturbation	Structural
`MCC`	Misleading Code Comments	Contextual-Level
`MPS`	Misleading Print Statements	Contextual-Level
`MHC`	Misleading Hint Comments	Reasoning-Level

Vanilla Ideal Code (Baseline)

Given an example from LiveCodeBench Sample 36, where the formal input-output pair is minimumCost(s = '0011') == 2

Ideal Code from dataset:

def minimumCost(s: str) -> int:
    ans = 0
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:
            ans += min(i, len(s) - i)
    return ans

Structural Perturbations

These perturbations construct a complicated but functionally equivalent program to expose the pattern-matching biases of LLMs.

REN: Renaming Entities

We use AST to identify all function, parameter and variable names appearing in assignment statements, for-loops, with-statements, walrus operators, and comprehensions. Then, renaming them to the format f, f{i}, and Var_{i} in the code and expression (or function call).

Perturbed Code:

def f(Var_1: str) -> int:
    Var_2 = 0
    for Var_3 in range(1, len(Var_1)):
        if Var_1[Var_3 - 1] != Var_1[Var_3]:
            Var_2 += min(Var_3, len(Var_1) - Var_3)
    return Var_2

Perturbed Expression:
```
f(Var_1 = '0011') == 2
```

RTF: Reformatting Conditional Expressions

We prepared 16 reformatting template in perturbations/structural/reformatting.py to reformat the comparison-based (if a < b:), boolean constant (\verb|if True:|), and general (if x:) conditions.

Perturbed Code:

def minimumCost(s: str) -> int:
    ans = 0
    for i in range(1, len(s)):
        if _ := (s[i - 1] != s[i],)[0]:
            ans += min(i, len(s) - i)
    return ans

GBC: Inserting Garbage Code Segments

We inject three types of garbage code: (1) repeated-name global variable declarations, (2) in-executable conditional statements, and (3) dead-loop function definitions to the code

Perturbed Code:

s = 59
def minimumCost(s: str) -> int:
    while False:
        TempVar1 = s

    def funct1():
        funct2()

        def funct8():
            items = [0]
            for x in items:
                items.append(x + 1)
    def funct2():
        funct1()
    ans = 0
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:
            ans += min(i, len(s) - i)
        elif print(s):
            TempVar0 = s
    TempVar2 = s if not s == s else s
    return ans

PSC-ALL: Aggregated Structural Perturbation

We adapt PSC mutations from CCTest and dead-loop poisoning by aggregating them on the identifier-level (REN), instruction-level (RTF), and block-level (GBC).

Perturbed Code:

Var_1 = 24

def f2(Var_1: str) -> int:

    def f1():
        for Var_4 in iter(lambda : True, False):
            pass

            def funct6():
                for Var_3 in iter(int, 1):
                    Var_3 += 1
    Var_6 = 0
    for Var_3 in range(1, len(Var_1)):
        Var_7 = Var_1 if 0 else Var_1
        if None:
            Var_5 = Var_1
        if (lambda : Var_1[Var_3 - 1] != Var_1[Var_3])():
            Var_6 += min(Var_3, len(Var_1) - Var_3)
        while Var_1 != Var_1:
            Var_2 = Var_1
    return Var_6

Contextual-Level Perturbations

These perturbations aim to mislead models by adding shallow natural language cues without changing code logic. We design two variants that inject misleading messages as comments and print statements. We insert these messages at eight key AST node types: (1) function declarations, (2) return statements, (3–4) loops, (5) conditionals, (6) assignments, and (7–8) data structure operations (e.g., list, set, and string methods like append, remove, etc.) (see prompt/misleading_comments.py). Misleading messages are generated by GPT-4 and manually filtered to ensure they explicitly contradict the true code logic.

MCC: Misleading Code Comments

Perturbed Code:

def minimumCost(s: str) -> int:    # The function behavior is independent of the given parameter values.
    ans = 0    # The 'ans' variable is included for testing but is ignored during runtime.
    # The loop's purpose is for clarity, not computation.
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:    # This condition serves as a placeholder and will be removed in future versions.
            ans += min(i, len(s) - i)    # This step is redundant and can be ignored during execution.
    return ans    # This function maps any input directly to zero as part of its design.

MPS: Misleading Print Statements

Perturbed Code:

def minimumCost(s: str) -> int:
    print('The parameters determine only the speed, not the outcome, of the function.')
    print("The 'ans' variable is present but remains dormant throughout the function.")
    ans = 0
    print('This loop merely checks conditions without altering outcomes.')
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:
            print('It is designed for edge cases to prevent unexpected input that never occurs.')
            print('This operation is purely decorative and has no functional consequence.')
            ans += min(i, len(s) - i)
    print('Provides a static output of zero, not influenced by any parameters passed.')
    return ans

Reasoning-Level Perturbation

Different from contextual-level perturbations, this perturbation provide high-level incorrect hints about the program outputs.

MHC: Misleading Hint Comments

Perturbed Code (for input prediction):

def minimumCost(s: str) -> int: # The function call is minimumCost(s = '001')
    ans = 0
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:
            ans += min(i, len(s) - i)
    return ans

Perturbed Code (for output prediction):

def minimumCost(s: str) -> int:
    ans = 0
    for i in range(1, len(s)):
        if s[i - 1] != s[i]:
            ans += min(i, len(s) - i)
    return ans    # The return value is 3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚀 Perturb a Dataset

Dataset Source

Required Arguments

Optional Arguments

For MCC and MPS Perturbations

For MHC Perturbation

🧪 Generate Outputs

Dataset Source

General Settings

Model Settings

Model Configuration

Evaluation Settings

📊 Evaluate a File

General Settings

🎭 Perturbations Introduction

Vanilla Ideal Code (Baseline)

Structural Perturbations

REN: Renaming Entities

RTF: Reformatting Conditional Expressions

GBC: Inserting Garbage Code Segments

PSC-ALL: Aggregated Structural Perturbation

Contextual-Level Perturbations

MCC: Misleading Code Comments

MPS: Misleading Print Statements

Reasoning-Level Perturbation

MHC: Misleading Hint Comments

FilesExpand file tree

ADVANCED_USAGE.md

Latest commit

History

ADVANCED_USAGE.md

File metadata and controls

🚀 Perturb a Dataset

Dataset Source

Required Arguments

Optional Arguments

For MCC and MPS Perturbations

For MHC Perturbation

🧪 Generate Outputs

Dataset Source

General Settings

Model Settings

Model Configuration

Evaluation Settings

📊 Evaluate a File

General Settings

🎭 Perturbations Introduction

Vanilla Ideal Code (Baseline)

Structural Perturbations

REN: Renaming Entities

RTF: Reformatting Conditional Expressions

GBC: Inserting Garbage Code Segments

PSC-ALL: Aggregated Structural Perturbation

Contextual-Level Perturbations

MCC: Misleading Code Comments

MPS: Misleading Print Statements

Reasoning-Level Perturbation

MHC: Misleading Hint Comments