Subrahmanyajoshi/Preprocessing-with-Dataflow-Pipelines

Google-Cloud-Dataflow-Pipelines

A repository showing how to use Google Cloud Dataflow pipelines for data preprocessing with Apache Beam in Python.

  • This repo contains two different pipelines built with Apache Beam that run on Dataflow.
  • Jobs can be submitted to Dataflow either directly from a notebook or via a Python script.
  • The same Apache Beam pipelines can also be run locally by toggling a single pipeline parameter.
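The runner toggle boils down to switching the Beam pipeline options between DirectRunner (local) and DataflowRunner (cloud). A minimal sketch of that idea, with hypothetical function and argument names rather than the repo's actual code:

```python
def build_pipeline_args(project, bucket, use_dataflow):
    """Assemble Apache Beam pipeline option strings.

    DataflowRunner additionally needs GCS staging/temp locations and a
    region; DirectRunner executes the same pipeline on the local machine.
    (Illustrative sketch; region and path layout are assumptions.)
    """
    args = [f"--project={project}"]
    if use_dataflow:
        args += [
            "--runner=DataflowRunner",
            f"--staging_location=gs://{bucket}/staging",
            f"--temp_location=gs://{bucket}/temp",
            "--region=us-central1",
        ]
    else:
        args.append("--runner=DirectRunner")
    return args
```

The resulting list would typically be passed to `apache_beam.options.pipeline_options.PipelineOptions` when constructing the pipeline.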

Important:

  • The notebooks mentioned above are Google Cloud Vertex AI Workbench notebooks, run in Google Cloud with an Apache Beam environment.
  • All Python scripts are likewise run in the Vertex AI notebook environment.
  • When specifying paths of files/directories in Google Cloud Storage buckets, make sure to give the full path starting with gs://.

Install Dependencies

pip install -r requirements.txt

CSV Reader

  • Reads CSV data stored in BigQuery, groups it by a chosen field, and writes the details of each resulting DataFrame to the specified Google Cloud Storage output directory.
  • The notebook can also read data from CSV files and push it to a BigQuery table.
  • Execute the following commands after navigating to csv_reader/src/.

Running locally

python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --direct-runner

Running on Dataflow

python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --dataflow-runner
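Conceptually, the grouping step keys each BigQuery row by the chosen field and collects the rows per key, which is what Beam's GroupByKey does after a keying transform. A plain-Python sketch of that logic (function name and row shape are hypothetical):

```python
from collections import defaultdict

def group_rows_by_field(rows, field):
    """Group dict-like rows by the value of `field`.

    Mirrors the key/group pattern a Beam pipeline expresses as
    `beam.Map(lambda row: (row[field], row)) | beam.GroupByKey()`.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return dict(groups)
```

Each group can then be written out as its own file under the output directory, one per key.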

Text Parser

  • Reads a huge input text file and applies multiple preprocessing techniques to it.
  • The output of this pipeline is a CSV file containing cleaned text data and labels.
  • Execute the following commands after navigating to text_parsing/src/.

Running locally

python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input>  --output-dir=<output-dir> --direct-runner

Running on Dataflow

python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input>  --output-dir=<output-dir> --dataflow-runner
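Typical per-line cleaning steps such a pipeline might apply (lowercasing, stripping punctuation, collapsing whitespace) can be sketched as a plain function; the repo's actual transforms may differ:

```python
import re

def clean_text(text):
    """Normalize one line of raw text.

    Lowercase, replace non-alphanumeric characters with spaces, then
    collapse runs of whitespace. In a Beam pipeline this would run
    element-wise, e.g. via `beam.Map(clean_text)`.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Because the function is stateless, it parallelizes cleanly across Dataflow workers.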
