RFC: Design of fetool #5
Open: tobegit3hub wants to merge 4 commits into 4paradigm:main from tobegit3hub:add_rfc_design_of_fetool
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| - Start Date: 2021-05-13 | ||
| - Target Major Version: 1.0 | ||
| - Reference Issues: https://github.com/4paradigm/SparkFE/issues/58 and https://github.com/4paradigm/SparkFE/issues/69 | ||
| - Implementation PR: | ||
|
|
||
| # Summary | ||
|
|
||
| We should provide tools with easy-to-use APIs, such as Python, so that users can meet feature extraction requirements like creating FEDB DDL. | ||
|
|
||
|
|
||
| # Basic example | ||
|
|
||
| Users can use the `fetool` command to generate the FEDB DDL for creating tables. | ||
|
|
||
| ``` | ||
| fetool gen_ddl sql.yaml | ||
| ``` | ||
|
|
||
| Users can also use the `fetool` command to complete the following tasks without writing any code. | ||
|
|
||
| ``` | ||
| fetool csv_to_parquet /csv_files /parquet_files | ||
| fetool sample_parquet /parquet_files | ||
| fetool inspect_parquet /parquet_files | ||
| fetool check_skew /parquet_files | ||
| fetool benchmark 'spark-submit --master local /pyspark_app.py' | ||
| ``` | ||
|
|
||
| # Motivation | ||
|
|
||
| Users can currently use SparkFE for feature extraction through its SQL API. Development is still required, because SparkFE is a distributed computing library and does not solve any problem until specific SQL is written for it. | ||
|
|
||
| However, some common tools apply to general feature extraction scenarios. For example, we may sample the input dataset and check whether its distribution is balanced. Such tools are general-purpose and can reduce the development cost of using SparkFE for AI. If we inspect the distribution of the dataset in advance, the window skew optimization may use that distribution to achieve better performance. | ||
|
|
||
| Therefore, providing common tools for feature extraction is useful for developers. The command line is the easiest way to use them, and we should also provide Python and Java APIs for different developers. | ||
|
|
||
| # Detailed design | ||
|
|
||
| We want to use Python to wrap the feature extraction tools, since Python is easy to use and integrates well with other programming languages. | ||
|
|
||
| Users can install this tool with `pip`, and every function can be called both from the command line and as a Python function. fetool should be a standard Python package and command-line tool. There are several kinds of scenarios that fetool should support. | ||
|
|
||
| * For data processing jobs such as converting and sampling datasets, we can use the PySpark API, which submits the jobs from Python. | ||
| * For the benchmark tool, which needs to run multiple jobs with different Spark distributions, we may use the `subprocess` API to invoke shell commands. | ||
|
||
| * For utilities written in Java/Scala, such as generating FEDB DDL, we may use `py4j` to invoke the Java functions from the Python API and the command line. | ||
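
As an illustration of the `subprocess` approach for the benchmark scenario, here is a minimal sketch. The `benchmark` function name, its `runs` parameter, and the choice to report wall-clock seconds per run are assumptions made for this example, not part of the proposed design.

```python
import shlex
import subprocess
import sys
import time


def benchmark(command, runs=3):
    """Run a shell command `runs` times and return wall-clock seconds per run.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    args = shlex.split(command)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(args, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # A trivial stand-in command; a real run would pass a spark-submit command line.
    stand_in = f"{shlex.quote(sys.executable)} -c pass"
    for index, seconds in enumerate(benchmark(stand_in), start=1):
        print(f"run {index}: {seconds:.3f}s")
```

Comparing Spark distributions would then amount to calling such a helper once per distribution and comparing the returned timings.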
|
|
||
| A Python package can meet the above requirements and is easy to maintain. The codebase would look like this. | ||
|
|
||
| ``` | ||
| - python | ||
| setup.py | ||
| requirements.txt | ||
| - fetool | ||
| __init__.py | ||
| fetool.py | ||
| csv_to_parquet.py | ||
| sample_parquet.py | ||
| inspect_parquet.py | ||
| check_skew.py | ||
| gen_fdb_ddl.py | ||
| ...... | ||
| ``` | ||
|
|
||
| The command to install fetool should be `pip install fetool` or `pip3 install fetool`. | ||
|
|
||
| The command-line interface should look like this. | ||
|
|
||
| ``` | ||
| $ fetool -h | ||
| usage: fetool [-h] | ||
| {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark} | ||
| ... | ||
|
|
||
| positional arguments: | ||
| {version,csv_to_parquet,sample_parquet,inspect_parquet,check_skew,benchmark} | ||
| version Print the version of fetool | ||
| csv_to_parquet csv_to_parquet $input_csv_path $output_parquet_path | ||
| sample_parquet sample_parquet $parquet_path | ||
| inspect_parquet inspect_parquet $parquet_path | ||
| check_skew check_skew $parquet_path | ||
| benchmark benchmark $command | ||
|
|
||
| optional arguments: | ||
| -h, --help show this help message and exit | ||
| ``` | ||
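
One way to produce a help screen of this shape is with the standard `argparse` module. The sketch below is only an assumption about how `fetool.py` might be organized; `build_parser`, the `subcommand` destination name, and the positional argument names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser whose sub-commands mirror the help text above (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="fetool")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    subparsers.add_parser("version", help="Print the version of fetool")

    convert = subparsers.add_parser(
        "csv_to_parquet",
        help="csv_to_parquet $input_csv_path $output_parquet_path",
    )
    convert.add_argument("input_csv_path")
    convert.add_argument("output_parquet_path")

    # These three sub-commands each take a single Parquet path.
    for name in ("sample_parquet", "inspect_parquet", "check_skew"):
        sub = subparsers.add_parser(name, help=f"{name} $parquet_path")
        sub.add_argument("parquet_path")

    bench = subparsers.add_parser("benchmark", help="benchmark $command")
    bench.add_argument("command")
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the sketch runs without real input.
    parsed = build_parser().parse_args(["check_skew", "/parquet_files"])
    print(parsed.subcommand, parsed.parquet_path)
```

Dispatching from the parsed `subcommand` value to the matching module (for example `check_skew.py`) is left out of the sketch.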
|
|
||
| Developers can extend the functionality by adding new Python scripts and sub-commands to fetool. | ||
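
A possible extension mechanism, sketched under assumptions: each new script registers a handler in a shared table, and the entry point builds its sub-commands from that table. The `SUBCOMMANDS` registry, the `subcommand` decorator, and the `count_lines` example command are all hypothetical and do not appear in the RFC.

```python
import argparse

# Hypothetical registry: sub-command name -> (handler, positional argument names, help text).
SUBCOMMANDS = {}


def subcommand(name, arg_names, help_text):
    """Decorator that registers a function as a fetool sub-command."""
    def register(func):
        SUBCOMMANDS[name] = (func, arg_names, help_text)
        return func
    return register


# An illustrative new sub-command; a real one would live in its own script.
@subcommand("count_lines", ["path"], "count_lines $path")
def count_lines(path):
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def main(argv=None):
    """Build the parser from the registry, parse argv, and call the matching handler."""
    parser = argparse.ArgumentParser(prog="fetool")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)
    for name, (_, arg_names, help_text) in SUBCOMMANDS.items():
        sub = subparsers.add_parser(name, help=help_text)
        for arg in arg_names:
            sub.add_argument(arg)
    args = parser.parse_args(argv)
    func, arg_names, _ = SUBCOMMANDS[args.subcommand]
    return func(*(getattr(args, arg) for arg in arg_names))
```

With this layout, adding a sub-command only requires a new decorated function; the entry point itself does not change.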
|
|
||
| # Drawbacks | ||
|
|
||
| Since it is an extension of SparkFE, there is no drawback for the existing core project. | ||
|
|
||
| The implementation and maintenance cost is small if the tools are used by most developers, because those developers no longer need to maintain such tools themselves. | ||
|
|
||
| # Alternatives | ||
|
|
||
| What other designs have been considered? What is the impact of not doing this? | ||
|
|
||
| The command-line tool could be implemented in Java, C++ or other programming languages, but these may require compilation before use, which is not as easy as Python. | ||
|
|
||
| # Adoption strategy | ||
|
|
||
| Users may run the Python scripts from source or install the tool with `pip install fetool`. | ||
|
|
||
| # Unresolved questions | ||
|
|
||
| None. | ||