This script was initially developed to support the arguments made in an article. Feel free to use the script and read the article for more context.
To run this project, I recommend creating a Python virtual environment by running the following command in the terminal.
python -m venv dir_name
You can replace dir_name with any other directory name.
Then activate the venv by running the following command in the terminal (Windows):
dir_name\Scripts\activate
On Linux or macOS, run source dir_name/bin/activate instead.
Install the required libraries.
pip install -r requirements.txt
This project requires Spark and Hadoop to be set in your environment variables; I recommend following the PySpark installation manual.
Set your configs on benchmark.conf.
Example:
query = select * from file where account2 = '80010EA80'
iterations = 15
operations = query, size, all
formats = csv, json
Run the benchmark.
python main.py
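The actual parsing logic lives in main.py; as a rough sketch, assuming benchmark.conf uses the plain key = value lines shown in the example above (no section header), the settings could be read like this. The parse_conf function and its behavior are illustrative, not the project's real API:

```python
def parse_conf(text):
    """Parse simple 'key = value' lines into a dict.

    Comma-separated values (operations, formats) become lists,
    iterations becomes an int, and the query stays a single string.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # partition() splits only at the FIRST '=', so an '=' inside
        # the SQL query text is preserved in the value
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key in ("operations", "formats"):
            conf[key] = [v.strip() for v in value.split(",")]
        elif key == "iterations":
            conf[key] = int(value)
        else:
            conf[key] = value
    return conf

sample = """\
query = select * from file where account2 = '80010EA80'
iterations = 15
operations = query, size, all
formats = csv, json
"""
settings = parse_conf(sample)
```

Splitting only at the first `=` matters here, since the query value itself contains `=`.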
The benchmark.conf file defines the settings the project needs to run.
- operations: Five operations are measured; one is mandatory and the others are optional. They are: query, which measures the time it takes to run the configured query; size, which measures the disk space the file occupies; read, which measures the time it takes to open the file; write, which always runs and measures the time it takes to write the file and save it to disk; and all, which measures the combined time of read, query, and write. Default value: query, write, size, read, all.
- query: Accepts any query compatible with Spark SQL that produces a result; if it is not set, the query operation does not run. Default value: None.
- iterations: Defines how many times the benchmark is run. Resource-usage spikes on the machine or cluster running the benchmark are normal and can skew individual runs, so a larger number of runs, with logging, makes the results more reliable. Default value: 1.
- formats: Defines the file formats to benchmark; the options are csv, json, parquet, orc, and avro. You can choose any subset of these formats; selecting fewer formats makes the benchmark finish faster. Default value: json, csv, avro, parquet, orc.
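To illustrate how the per-operation timings described above could be captured: a minimal sketch of a timing helper based on time.perf_counter, which each operation (read, query, write) could be wrapped with. The timed helper is hypothetical, not the project's actual code:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a trivial function; in the benchmark, fn could be
# a DataFrame read, a spark.sql(...) call, or a write-to-disk step.
result, seconds = timed(sum, range(1000))
```

perf_counter is monotonic and high-resolution, which makes it better suited to benchmarking than time.time.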
results.csv is a file automatically generated by the script; it stores the benchmark data with the following schema:
| column | type |
|---|---|
| operation | text |
| format | text |
| measure | text |
| value | float |
| file | text |
The operation column is the name of the operation performed, format is the file format, measure is the unit of measurement (megabytes or seconds), value is the measured value, and file is the name of the input file, which lets you run multiple benchmarks with different files.
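Since each iteration appends rows to results.csv, averaging the value column per operation/format pair is a natural way to read the output. A minimal stdlib sketch, assuming only the column names from the table above:

```python
import csv
from collections import defaultdict
from statistics import mean

def average_results(path):
    """Average the 'value' column of results.csv per (operation, format)."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # group each measured value by its operation/format pair
            groups[(row["operation"], row["format"])].append(float(row["value"]))
    return {key: mean(values) for key, values in groups.items()}
```

Each key in the returned dict is an (operation, format) pair, and the value is the mean across all iterations for that pair.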