From 4168a5690718e06597759aa17eed0d6f3662749e Mon Sep 17 00:00:00 2001
From: tobe
Date: Thu, 13 May 2021 15:02:11 +0800
Subject: [PATCH 1/4] Add the rfc about design of fetool

---
 sparkfe/design_of_fetool.md | 82 +++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100644 sparkfe/design_of_fetool.md

diff --git a/sparkfe/design_of_fetool.md b/sparkfe/design_of_fetool.md
new file mode 100644
index 0000000..fe1e691
--- /dev/null
+++ b/sparkfe/design_of_fetool.md
@@ -0,0 +1,82 @@
+- Start Date: 2021-05-13
+- Target Major Version: 1.0
+- Reference Issues: https://github.com/4paradigm/SparkFE/issues/69
+- Implementation PR:
+
+# Summary
+
+We should provide some tools with an easy-to-use API, such as a Python API, that help users meet feature extraction requirements like creating FEDB DDL.
+
+
+# Basic example
+
+Users can use fetool command to create FEDB DDL of creating tables.
+
+```
+fetool gen_ddl sql.yaml
+```
+
+Users can use fetool command to finish the following tasks without developing.
+
+```
+fetool csv_to_parquet /csv_files /parquet_files
+fetool sample_parquet /parquet_files
+fetool inspect_parquet /parquet_files
+fetool check_skew /parquet_files
+fetool benchmark 'spark-submit --master local /pyspark_app.py'
+```
+
+# Motivation
+
+Users can now use SparkFE for feature extraction through its SQL API. Development is still required because SparkFE is a distributed computing library and does not solve problems without task-specific SQL.
+
+However, some common tools are useful across general feature extraction scenarios. For example, we may sample the input dataset and check whether its distribution is balanced. Such general-purpose tools can reduce development cost when using SparkFE for AI. If we inspect the distribution of the dataset in advance, the window skew optimization may use that distribution to achieve better performance.
+
+Therefore, providing common tools for feature extraction is useful for developers. The easiest interface is the command line, and we should also provide Python and Java APIs for different developers.
+
+# Detailed design
+
+We want to use Python to wrap the feature extraction tools since Python is easy to use and integrates well with other programming languages.
+
+Users can install this tool with `pip`, and all the functions can be called from both the command-line tool and Python functions. fetool should be a standard Python package and command-line tool. There are several kinds of scenarios that fetool should support.
+
+* For data processing jobs like converting and sampling datasets, we can use the PySpark API, which can submit the jobs from Python.
+* For the benchmark tool, which requires running multiple jobs with different Spark distributions, we may use the `subprocess` API to invoke shell commands.
+* For utilities written in Java/Scala, like generating FEDB DDL, we may use `py4j` to invoke Java functions from the Python API and the command line.
+
+Python package is able to meet the above requirements and easy to maintain. The codebase would be like these.
+
+```
+- python
+  setup.py
+  requirements.txt
+  - fetool
+    __init__.py
+    fetool.py
+    csv_to_parquet.py
+    sample_parquet.py
+    inspect_parquet.py
+    check_skew.py
+    gen_fdb_ddl.py
+    ......
+```
+
+# Drawbacks
+
+Since it is the extension of SparkFE, there is no drawback for the existing core project.
+
+The implementation and maintenance cost is small if the tools are used by most developers, because they no longer need to maintain such tools themselves.
+
+# Alternatives
+
+What other designs have been considered? What is the impact of not doing this?
+
+The command-line tool could be implemented in Java, C++ or other programming languages. However, those may require compilation before use, which is not as easy as Python.
+
+# Adoption strategy
+
+Users may use the source Python scripts or install the tool with `pip install fetool`.
+
+# Unresolved questions
+
+None.

From 4d22563c1e2c05c7402f8521ec3b0794a860c0e3 Mon Sep 17 00:00:00 2001
From: tobe
Date: Thu, 13 May 2021 15:27:37 +0800
Subject: [PATCH 2/4] Attach one more issue for fetool rfc

---
 sparkfe/design_of_fetool.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sparkfe/design_of_fetool.md b/sparkfe/design_of_fetool.md
index fe1e691..b5522d3 100644
--- a/sparkfe/design_of_fetool.md
+++ b/sparkfe/design_of_fetool.md
@@ -1,6 +1,6 @@
 - Start Date: 2021-05-13
 - Target Major Version: 1.0
-- Reference Issues: https://github.com/4paradigm/SparkFE/issues/69
+- Reference Issues: https://github.com/4paradigm/SparkFE/issues/58 and https://github.com/4paradigm/SparkFE/issues/69
 - Implementation PR:
 
 # Summary
@@ -16,7 +16,7 @@ Users can use fetool command to create FEDB DDL of creating tables.
 fetool gen_ddl sql.yaml
 ```
 
-Users can use fetool command to finish the following tasks without developing.
+Users can use fetool command to finish the following tasks without development.
 
 ```
 fetool csv_to_parquet /csv_files /parquet_files

From 689cbc652ed7cf39427274d2331615feb85a5725 Mon Sep 17 00:00:00 2001
From: tobe
Date: Thu, 13 May 2021 15:33:37 +0800
Subject: [PATCH 3/4] Add the install and usage of fetool command for fetool rfc

---
 sparkfe/design_of_fetool.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/sparkfe/design_of_fetool.md b/sparkfe/design_of_fetool.md
index b5522d3..f6e3a07 100644
--- a/sparkfe/design_of_fetool.md
+++ b/sparkfe/design_of_fetool.md
@@ -61,6 +61,31 @@ Python package is able to meet the above requirements and easy to maintain. The
     ......
 ```
 
+The command to install fetool should be `pip install fetool` or `pip3 install fetool`.
+
+The command-line should look like this.
+
+```
+$ fetool -h
+usage: fetool [-h]
+              {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark}
+              ...
+
+positional arguments:
+  {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark}
+    version             Print the version of fetool
+    csv_to_parquet      csv_to_parquet $input_csv_path $output_parquet_path
+    sample_parquet      sample_parquet $parquet_path
+    inspect_parquet     inspect_parquet $parquet_path
+    check_skew          check_skew $parquet_path
+    benchmark           benchmark $command
+
+optional arguments:
+  -h, --help            show this help message and exit
+```
+
+Developers can extend the functionality by adding new Python script to add sub-command for fetool.
+
 # Drawbacks
 
 Since it is the extension of SparkFE, there is no drawback for the existing core project.

From 5d46252b7fe955036f8d7cfbaceacacddfeaa8e4 Mon Sep 17 00:00:00 2001
From: tobe
Date: Thu, 13 May 2021 15:35:03 +0800
Subject: [PATCH 4/4] Fix typo in fetool rfc

---
 sparkfe/design_of_fetool.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sparkfe/design_of_fetool.md b/sparkfe/design_of_fetool.md
index f6e3a07..eb094d1 100644
--- a/sparkfe/design_of_fetool.md
+++ b/sparkfe/design_of_fetool.md
@@ -84,7 +84,7 @@ optional arguments:
   -h, --help            show this help message and exit
 ```
 
-Developers can extend the functionality by adding new Python script to add sub-command for fetool.
+Developers can extend the functionality by adding new Python scripts and sub-command for fetool.
 
 # Drawbacks