Develop SharefileToSnowflakeDag class (GF) #76
Open
gfitzgerald-ea wants to merge 48 commits into main from feature/sharefile_to_snowflake_dag_builder_gf
Commits (48)
- 6fe9c36: test first class method
- mberrien-fitzsimons 8ce7eb9: updated variable inputs to init and individual methods
- mberrien-fitzsimons 40cac6f: added docstrings to all methods
- mberrien-fitzsimons d550d0c: updated local path organization
- mberrien-fitzsimons 36d9651: updated way that dag is initialized
- mberrien-fitzsimons 7c4734f: added global to dag
- mberrien-fitzsimons 5d590b9: updated dag call structure
- mberrien-fitzsimons efb510d: updated way params is called within Dag instantiation
- mberrien-fitzsimons db969c8: small update
- mberrien-fitzsimons b335d66: updated file_sources to include .keys
- mberrien-fitzsimons 2fa3399: updated dag to use eacustomdag
- mberrien-fitzsimons 58a0332: updated param type to array
- mberrien-fitzsimons 5ec70e4: updated sharefile to snowflake dag builder script to include global e…
- mberrien-fitzsimons 929d9ad: removed global function because it did not work for exposing dag id
- mberrien-fitzsimons f7178ac: updated s3.py in attempt to fix metadata_column bug
- mberrien-fitzsimons 9ef191f: updated class so that it would construct required folder structure
- mberrien-fitzsimons 111dc3b: updated input variable name
- mberrien-fitzsimons 248f4d6: updated way that local path is structured
- mberrien-fitzsimons 89a340b: updated where date and timestamp are instantiated
- mberrien-fitzsimons 85f91eb: updated timestamp code to use airflow runtime instead of datetime now…
- mberrien-fitzsimons 9882e64: updated timestamp to go back to original form
- mberrien-fitzsimons 9644caf: updated way filepath is created
- mberrien-fitzsimons 74e8615: save final changes to sharefile to snowflake dag before complete refa…
- mberrien-fitzsimons 01e8438: updated sharefile to snowflake dag builder class to remove unnecessary…
- mberrien-fitzsimons 9a90112: updated sharefile sources
- mberrien-fitzsimons bd8408e: re-ordered init arguments
- mberrien-fitzsimons bd50225: update to readme documentation
- mberrien-fitzsimons 6eff93a: updated readme with yaml file showing file_sources
- mberrien-fitzsimons df7cc14: added additional information to yaml section of readme
- mberrien-fitzsimons 32b37ff: updated s3 to snowflake task
- mberrien-fitzsimons e791ca2: updated sharefile to snowflake dag builder
- mberrien-fitzsimons 0f4ef7c: added print statement for folder ID to see what is happening
- mberrien-fitzsimons 6f57bf6: updated transfer_to_s3
- mberrien-fitzsimons 344ada0: removed partial method
- mberrien-fitzsimons 950ca35: added control flow for choosing transfer from s3 to snowflake step
- mberrien-fitzsimons 3dab30f: updated sharefile to working state
- mberrien-fitzsimons f76a18a: updated file to handle both single file and folder. testing in TX dev…
- mberrien-fitzsimons 47b0166: This was done in order to access changes to the sharefiletodiskoperato…
- mberrien-fitzsimons 5baa75b: fixed mistake in imports
- mberrien-fitzsimons 2e53992: Merge branch 'main' into feature/sharefile_to_snowflake_dag_builder
- 7ed9fa0: Clean up dag and add to init.
- 1c588db: Update SharefileToSnowflakeDag.
- 5d3eea4: Make local file path more configurable.
- 4de45e3: Refactor sharefile_to_snowflake_dag and create ea_csv helper functions.
- 623b7d7: Make move_to_processed optional and allow numbered txt columns.
- ee79a5f: Update conditional move_to_processed task handling.
- 908307c: Update SharefileToSnowflakeDag and ea_csv docs.
- f3fe13e: Add csv_encoding argument to translate_csv_to_jsonl.
New file (@@ -0,0 +1,128 @@):

```python
import os

import pandas as pd


def txt_to_csv(
    file_in,
    file_out=None,
    delimiter=',',
    has_header=True,
    column_names=None,
    delete_txt=False
):
    """Convert a txt file to a csv.

    Args:
        file_in (str): A path to a txt file.
        file_out (str): A path to a csv file. If None, the input file path
            is used (with a .csv extension).
        delimiter (str): The txt file delimiter.
        has_header (bool): If True, use the first row of the txt file as the
            column header. If False, insert a column header using the
            column_names arg. Default is True.
        column_names (list[str]): An ordered list of column names to use in
            the output csv. If None and has_header is False, insert an
            ordered, integer column header (e.g. 1, 2, ..., n where n is the
            number of columns).
        delete_txt (bool): If True, delete the input txt file.

    Returns:
        file_out (str): A csv file path.
    """
    if has_header:
        # Force str dtype, otherwise pandas will do things like cast ints to
        # floats.
        df = pd.read_csv(file_in, delimiter=delimiter, dtype=str)
    elif column_names is not None:
        df = pd.read_csv(file_in, delimiter=delimiter, dtype=str, header=None)
        df.columns = column_names
    else:
        df = pd.read_csv(file_in, delimiter=delimiter, dtype=str, header=None)
        # 1-indexed column labels can simplify downstream processing.
        df.columns = df.columns + 1

    if file_out is None:
        file_out = file_in[:-4] + '.csv'

    df.to_csv(file_out, index=False)

    if delete_txt:
        os.remove(file_in)

    return file_out


def txt_files_to_csv(
    path_in,
    path_out=None,
    delimiter=',',
    has_header=False,
    column_names=None,
    delete_txt=False,
    include_subdirs=False
):
    """Convert all txt files in a directory to csv files. Also works with a
    single txt file path and can optionally be configured to process files in
    all subdirectories.

    Args:
        path_in (str): A file or directory path containing zero or more txt
            files.
        path_out (str): A file or directory path to write csv file(s) to. If
            None, the input path is used. Note that each file retains its
            original name except with a .csv extension.
        delimiter (str): The txt file delimiter. Note that this function
            assumes all txt files in an input directory use the same
            delimiter.
        has_header (bool): If True, use the first row of each txt file as the
            column header. If False, insert a column header using the
            column_names arg. Default is False.
        column_names (list[str]): An ordered list of column names to use in
            the output csv(s). If None and has_header is False, insert an
            ordered, integer column header (e.g. 1, 2, ..., n where n is the
            number of columns).
        delete_txt (bool): If True, delete all of the input txt files.
        include_subdirs (bool): If True, process files in all subdirectories.
            If False, only process files in the top level of the specified
            directory. Default is False.

    Returns:
        path_out (str): A file or directory path containing the output csv
            file(s).
    """
    for root, _, files in os.walk(path_in):
        for file in files:
            # Only process txt files.
            if not file.endswith('.txt'):
                continue

            filepath_in = os.path.join(root, file)
            dir_out = root if path_out is None else path_out

            filename_out = file[:-4] + '.csv'
            filepath_out = os.path.join(dir_out, filename_out)

            txt_to_csv(
                file_in=filepath_in,
                file_out=filepath_out,
                delimiter=delimiter,
                has_header=has_header,
                column_names=column_names,
                delete_txt=delete_txt
            )

        # os.walk yields the top-level directory first, so stopping after one
        # iteration skips all subdirectories.
        if not include_subdirs:
            break

    if path_out is None:
        path_out = path_in

    return path_out
```
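The headerless branch of the conversion above can be demonstrated end to end. The sketch below uses the same pandas calls the PR relies on (read with `dtype=str` so values like "007" keep their leading zeros, add a 1-indexed integer header, write a csv); the file names and sample rows are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical input: a headerless, pipe-delimited txt file.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "students.txt")
with open(txt_path, "w") as f:
    f.write("007|Ada\n042|Grace\n")

# Read without a header and force str dtype so pandas does not cast values.
df = pd.read_csv(txt_path, delimiter="|", dtype=str, header=None)
df.columns = df.columns + 1  # columns become 1, 2, ..., n

# Swap the .txt extension for .csv, mirroring file_in[:-4] + '.csv'.
csv_path = txt_path[:-4] + ".csv"
df.to_csv(csv_path, index=False)

print(open(csv_path).read())
# 1,2
# 007,Ada
# 042,Grace
```

Forcing `dtype=str` matters here: without it, pandas would parse "007" as the integer 7 and silently change the data on the round trip.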
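The `include_subdirs` flag in `txt_files_to_csv` hinges on one `os.walk` property: the top-level directory is always yielded first, so breaking after the first iteration makes the walk non-recursive. A minimal, self-contained sketch of that idiom (directory layout and the `list_txt_files` helper are invented for illustration):

```python
import os
import tempfile

# Hypothetical layout: root_dir/{a.txt, b.csv, sub/c.txt}
root_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(root_dir, "sub"))
for name in ("a.txt", "b.csv"):
    open(os.path.join(root_dir, name), "w").close()
open(os.path.join(root_dir, "sub", "c.txt"), "w").close()


def list_txt_files(path_in, include_subdirs=False):
    """Collect .txt paths, optionally recursing, like txt_files_to_csv."""
    found = []
    for root, _, files in os.walk(path_in):
        found += [os.path.join(root, f) for f in files if f.endswith(".txt")]
        if not include_subdirs:
            break  # os.walk yields the top level first, so stop here
    return found


print(len(list_txt_files(root_dir)))                        # 1 (a.txt only)
print(len(list_txt_files(root_dir, include_subdirs=True)))  # 2 (a.txt, sub/c.txt)
```

The same loop shape also explains why `txt_files_to_csv` filters on the `.txt` extension inside the inner loop: `os.walk` reports every file in each directory, csv outputs included.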