Run the following command to set up the environment and install all the dependencies listed in /code/requirements.txt.
./code/setup_env.sh
Set the environment variable with the OpenAI API key.
export OPENAI_API_KEY="<ENTER OPENAI API KEY>"
All the scripts to extract relevant context can be found under the /code/extract_context/ directory.
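These extraction scripts query the OpenAI API via the model utilities. As a rough orientation only (this is a minimal sketch, not the repository's code; the model name and prompt are placeholders), a call made with the official openai Python package (v1+ interface) looks like this:

```python
import os
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment variable exported above.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; use whichever model the scripts are configured for
    messages=[
        {"role": "system", "content": "You are an expert on WebAssembly validation."},
        {"role": "user", "content": "Summarize the validation rules for the br_if instruction."},
    ],
)
print(response.choices[0].message.content)
```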
Extract the list of instructions from the specifications document and generate a mapping between instructions and human-written test files from the WASM spec repo.
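For illustration only, the mapping has roughly the following shape (hypothetical keys and paths; the real mapping is produced by the command below):

```python
# Hypothetical shape of the instruction-to-test mapping; the real keys and file
# lists are whatever the script extracts from the spec document and test suite.
instruction_test_map = {
    "br_if":  ["test/core/br_if.wast"],
    "if":     ["test/core/if.wast"],
    "return": ["test/core/return.wast"],
}
```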
python3 main_extract_instruction_test_map.py
Extract all the relevant constraints for each instruction from the WASM specifications document.
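As an illustrative example (paraphrased from the spec, not the script's exact output), the constraints extracted for br_if would cover points such as the following; the actual extraction is run with the command below:

```python
# Illustrative paraphrase of the validation constraints for br_if;
# the script's actual extracted text will differ.
br_if_constraints = [
    "The label l must be defined in the current context.",
    "The operand stack must provide the branch target's result types followed by an i32 condition.",
    "The instruction has type [t* i32] -> [t*].",
]
```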
python3 main_extract_constraints.py
For a given implementation, extract the relevant code snippet from the source code files for each instruction. Then generate a list of differences (described in natural language) between the extracted code snippets.
python3 main_extract_code_diffs.py
Each human-written test file contains several test cases for a given instruction. We preprocess these files to keep only the test cases relevant to validation and separate them by creating a new file for each test case.
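A simplified sketch of how such per-test-case splitting could be done is shown below; it is illustrative only and not necessarily how main_separate_human_tests.py is implemented (the file names are placeholders):

```python
def split_top_level_forms(wast_text: str):
    """Yield each top-level s-expression, e.g. (module ...) or (assert_invalid ...).

    Simplified sketch: it ignores parentheses that appear inside strings or comments.
    """
    depth, start = 0, None
    for i, ch in enumerate(wast_text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                yield wast_text[start : i + 1]

# Keep only the validation-related assertions and write one file per test case.
with open("br_if.wast") as f:                       # placeholder input file
    forms = list(split_top_level_forms(f.read()))
validation_cases = [form for form in forms if form.startswith("(assert_invalid")]
for n, case in enumerate(validation_cases):
    with open(f"br_if_case_{n}.wast", "w") as out:  # placeholder output naming
        out.write(case + "\n")
```

The actual preprocessing is run with: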
python3 main_separate_human_tests.py
The test generation framework includes two main steps. The scripts can be found in /code/test_generation/
First, we use all the extracted information as context to prompt the LLM to generate descriptions of what each test should do.
python3 main_test_decription_generation.py
We then use the test descriptions, along with a few randomly sampled human-written tests as few-shot examples and a list of human-written guidelines, to generate the tests themselves.
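For orientation, a few-shot prompt of this kind can be assembled roughly as follows (illustrative only; the actual prompts are defined in /code/utils/prompt_utils.py, and the paths, wording, and example content here are placeholders):

```python
# Hypothetical prompt assembly for the test-generation step.
guidelines = open("guidelines.txt").read()              # human-written guidelines (placeholder path)
few_shot_tests = ["(assert_invalid (module ...) ...)"]  # randomly sampled human-written tests
test_description = "Check that br_if rejects a condition operand that is not of type i32."

messages = [{"role": "system", "content": guidelines}]
for example in few_shot_tests:
    messages.append({"role": "user", "content": "Write a WebAssembly validation test in .wast format."})
    messages.append({"role": "assistant", "content": example})
messages.append({"role": "user", "content": f"Write a validation test for: {test_description}"})
# `messages` would then be passed to a chat completion call like the one shown earlier.
```

The tests are then generated with: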
python3 main_test_generation.py
The data loaders and other data-specific utilities are in /code/utils/data_utils.py
The model-specific utilities are in /code/utils/model_utils.py
All the prompts are in /code/utils/prompt_utils.py
All the evaluation scripts are under /code/evaluation/
We set up the four different WASM implementations by following the guidelines in the respective repositories.
After setup, the following commands should run successfully:
# Should run the Wasm Spec implementation (may need to add an alias)
wasm-spec -v
# Should run the Wizard VM CLI
wizeng --version
# Should run the wasmtime CLI
wasmtime --version
# Should run the V8 CLI
d8 -version
We remove the assertion strings generated by the LLM as the expected output, since the error messages are not consistent between implementations. Removing them reduces the differential tests to a Pass/Fail comparison instead of a comparison of error messages.
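For illustration (standard spec-test syntax; the exact rewriting performed by the script may differ), in a clause such as

```wast
(assert_invalid
  (module (func (result i32) (i64.const 0)))
  "type mismatch")
```

the quoted "type mismatch" string is the expected-error text that gets stripped, since each engine words this error differently.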
# Should place new `.wast` files, stripped of assertion strings, in `data/all_generated_tests/no_assert_str_wast_tests`
pushd code/evaluation/ && python3 remove_assert_str.py && popd
We convert the generated .wast test files into a format suitable for execution on each of the implementations.
The wizard-engine, wasmtime and WASM spec use the .bin.wast format.
To translate the .wast files to .bin.wast, run the following script:
# Should place generated `.bin.wast` files in `data/all_generated_tests/BIN_no_assert_str_wast_tests`
pushd ./code/evaluation/format_translation && ./to_binary.sh && popd
The V8 engine uses the .js format.
To translate the .wast files to .js, run the following script:
# Should place generated `.js` files in `data/all_generated_tests/JS_no_assert_str_wast_tests`
pushd ./code/evaluation/format_translation && ./to_js.sh && popd
NOTE: THIS WILL OVERWRITE LOGS FROM THE MOST RECENT RUN. TO PROTECT THEM, MOVE THE CONTENTS OF code/evaluation/output/* TO ANOTHER LOCATION.
We then run the tests on all the different implementations under test and compare them against the WASM spec implementation.
For the wizard-engine:
pushd ./code/evaluation/test_runners && ./spec_vs_wizard.sh 2>&1 | tee ../output/logs_wizard_eng__wasm_spec.out && popd
For wasmtime:
pushd ./code/evaluation/test_runners && ./spec_vs_wasmtime.sh 2>&1 | tee ../output/logs_wasmtime__wasm_spec.out && popd
For v8:
pushd ./code/evaluation/test_runners && ./spec_vs_v8.sh 2>&1 | tee ../output/logs_v8__wasm_spec.out && popd
We take the logs from running the tests for each implementation, compare the execution output against the WASM spec implementation, and extract all the tests where the outcomes of the two implementations differ. This produces a .json file comprising all the information relevant to the differentiating tests, including the test file metadata, the test content, and the execution outcomes of the WASM spec implementation and the implementation under test, aiding analysis of tests that highlight differentiating behaviour.
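For orientation, each entry in that file contains roughly the following kind of information (the field names below are illustrative, not the script's exact schema):

```python
# Hypothetical shape of one record in the differential-test .json file;
# the actual fields are defined by get_diff_tests.py.
diff_test_entry = {
    "test_file": "br_if_case_3.wast",
    "instruction": "br_if",
    "test_content": "(assert_invalid (module ...) ...)",
    "wasm_spec_outcome": "invalid",
    "implementation_under_test": "wizard_eng",
    "implementation_outcome": "valid",
}
```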
To get the differential tests for a given implementation, make sure the right one is targeted at the top of the script via the test_implementation variable, as shown below:
test_implementation = 'wizard_eng'
#test_implementation = 'v8'
#test_implementation = 'wasmtime'
Once the right one is targeted (wizard_eng, v8, or wasmtime), run the following to extract the differential tests:
pushd ./code/evaluation/ && python3 get_diff_tests.py && popd
This will output a file summarizing the found differential tests in the same directory:
ls -al ./code/evaluation/diff_tests__wasm_spec__*
We have some scripts that have proven useful when manually looking through the differentiating tests for Wasm.
First, wasmtime behaves differently when imported items are not available for linking at verification time: the wasmtime engine treats this as a failure to verify, whereas the other implementations ignore the missing import and verify the code that is available. We provide a script that generates Wasm modules with stubbed imports to work around this issue and to check whether the tests still differentiate with the workaround in place.
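Conceptually, the stubbing works like the following wast snippet (illustrative only; the modules generated by the script will differ): a stub module exports the missing item and is registered under the imported module name, so the test module can link.

```wast
;; Illustrative stub providing the item the test imports.
(module (func (export "f")))
(register "M")

;; The original test module's import now resolves at verification/link time.
(module (import "M" "f" (func)))
```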
To generate the stubs:
pushd ./code/evaluation/evaluation_helpers && ./create_import_mod.sh && popd
You now have tests with stubbed imports in ../../data/all_generated_tests/WITH_IMPORTS_no_assert_str_wast_tests; note that these are .wast files.
You'll need to pick up from Sections 4.3 and 4.4 to run the tests. Remember to change the input directory name in the script as appropriate!
- /code/context/ has all the relevant context, including the source code files, extracted constraints, bug classes and their descriptions, extracted code context (which includes the relevant code snippets for each instruction and the list of differences between the implementations), and the generated test descriptions.
- /code/human_written_tests/WebAssembly/processed_control_flow_validation/ has both the relevant human-written test files and the processed test files (see Section 1.4).