This script was initially developed to support the arguments made in an article. Feel free to use the script and read the article for more context.
To run this project, I recommend creating a Python virtual environment by running the following command in the terminal.
python -m venv dir_name
You can replace dir_name with any other directory name.
Then activate the venv by running the following command in the terminal (Windows):
dir_name\Scripts\activate
On Linux or macOS, run source dir_name/bin/activate instead.
Install the required libraries.
pip install -r requirements.txt
This project requires Spark and Hadoop to be set in your environment variables; I recommend following the PySpark installation manual.
Set your configs on benchmark.conf.
Example:
query = select * from file where account2 = '80010EA80'
iterations = 15
operations = query, size, all
formats = csv, json
Run the benchmark.
python main.py
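The actual parsing logic lives in main.py; as a rough sketch, assuming benchmark.conf uses the plain key = value lines shown in the example above (no section header), the settings could be read like this. The parse_conf function and its behavior are illustrative, not the project's real API:

```python
def parse_conf(text):
    """Parse simple 'key = value' lines into a dict.

    Comma-separated values (operations, formats) become lists,
    iterations becomes an int, and the query stays a single string.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # partition() splits only at the FIRST '=', so an '=' inside
        # the SQL query text is preserved in the value
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key in ("operations", "formats"):
            conf[key] = [v.strip() for v in value.split(",")]
        elif key == "iterations":
            conf[key] = int(value)
        else:
            conf[key] = value
    return conf

sample = """\
query = select * from file where account2 = '80010EA80'
iterations = 15
operations = query, size, all
formats = csv, json
"""
settings = parse_conf(sample)
```

Splitting only at the first `=` matters here, since the query value itself contains `=`.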
The benchmark.conf file defines the settings the project needs to run.
- operations: Five operations are measured; one is mandatory and the others are optional. They are: query, which measures the time it takes to run the configured query; size, which measures the disk space the file occupies; read, which measures the time it takes to open the file; write, which always runs and measures the time it takes to write the file and save it to disk; and all, which measures the combined time of read, query, and write. Default value: query, write, size, read, all.
- query: Accepts any query compatible with Spark SQL that produces a result; if it is not set, the query operation does not run. Default value: None.
- iterations: Defines how many times the benchmark is run. Resource-usage spikes on the machine or cluster running the benchmark are normal and can skew individual runs, so a larger number of runs, with logging, makes the results more reliable. Default value: 1.
- formats: Defines the file formats to benchmark; the options are csv, json, parquet, orc, and avro. You can choose any subset of these formats; selecting fewer formats makes the benchmark finish faster. Default value: json, csv, avro, parquet, orc.
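To illustrate how the per-operation timings described above could be captured: a minimal sketch of a timing helper based on time.perf_counter, which each operation (read, query, write) could be wrapped with. The timed helper is hypothetical, not the project's actual code:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a trivial function; in the benchmark, fn could be
# a DataFrame read, a spark.sql(...) call, or a write-to-disk step.
result, seconds = timed(sum, range(1000))
```

perf_counter is monotonic and high-resolution, which makes it better suited to benchmarking than time.time.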
results.csv is a file automatically generated by the script; it stores the benchmark data with the following schema:
| column | type |
|---|---|
| operation | text |
| format | text |
| measure | text |
| value | float |
| file | text |
The operation column is the name of the operation performed, format is the file format, measure is the unit of measurement (megabytes or seconds), value is the measured value, and file is the name of the input file, which lets you run multiple benchmarks with different files.
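Since each iteration appends rows to results.csv, averaging the value column per operation/format pair is a natural way to read the output. A minimal stdlib sketch, assuming only the column names from the table above:

```python
import csv
from collections import defaultdict
from statistics import mean

def average_results(path):
    """Average the 'value' column of results.csv per (operation, format)."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # group each measured value by its operation/format pair
            groups[(row["operation"], row["format"])].append(float(row["value"]))
    return {key: mean(values) for key, values in groups.items()}
```

Each key in the returned dict is an (operation, format) pair, and the value is the mean across all iterations for that pair.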