Data Engineering with Google Cloud Professional Certificate

  • Course / Module 1: Google Cloud Platform Big Data and Machine Learning Fundamentals
  • Course / Module 2: Modernizing Data Lakes and Data Warehouses with GCP
  • Course / Module 3: Building Batch Data Pipelines on GCP
  • Course / Module 4: Building Resilient Streaming Analytics Systems on GCP
  • Course / Module 5: Smart Analytics, Machine Learning, and AI on GCP
  • Course / Module 6: Preparing for the Google Cloud Professional Data Engineer Exam

Google Cloud Platform Big Data and Machine Learning Fundamentals

Intro

According to McKinsey research, by 2020 we'll have 50 billion devices connected in the Internet of Things. ... only about one percent of the data generated today is actually analyzed ...

  • Big Data challenges:
    • migrating existing data workloads (e.g. Hadoop, Spark jobs)
    • analyzing large datasets at scale
    • building streaming data pipelines
    • applying machine learning to your data

Intro to GCP

  • GCP was initially built to power Google's own apps
  • GCP infrastructure (building blocks):
    • compute
    • storage
    • network
    • security
  • Big Data and ML products are built on top of the GCP infrastructure and abstract away the bare metal.

Cloud computing differs from desktop computing: e.g. compute and storage are independent of each other.

Compute power for ML workloads

  • ML models are used for image processing and video stabilization in Google's image products, YouTube, ...
  • pre-trained models / AI building blocks on offer:
    • sight: Cloud Vision, Cloud Video Intelligence, AutoML Vision
    • language: Cloud Translation, AutoML Translation, ...
    • conversation: Cloud Text-to-Speech, ...
  • Google-designed hardware for ML:
    • TPU: Tensor Processing Unit
    • ASIC: application-specific integrated circuit
    • a TPU is an ASIC built as an AI accelerator

Example: create VM and Storage bucket

  • create a VM instance
  • create a bucket with a globally unique name
  • open shell in VM
  • list files in bucket: gsutil ls gs://...
  • copy file to bucket: gsutil cp file gs://...
  • create public link to bucket files
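
The bucket steps above can be sketched in Python as a small helper that only builds the `gsutil` command lines and the public link, without calling Google Cloud. The bucket and file names are hypothetical, the name check covers only a simplified subset of the bucket naming rules, and the public URL works only if the object has actually been made publicly readable.

```python
import re

# Hypothetical names for illustration; they do not come from the notes.
BUCKET = "my-example-bucket-12345"
FILE = "report.csv"

# Simplified subset of the bucket naming rules (assumption: 3-63 characters,
# lowercase letters, digits, dashes, underscores and dots, starting and
# ending with a letter or digit). The real rules have more restrictions.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified check above."""
    return _BUCKET_RE.fullmatch(name) is not None


def gs_uri(bucket: str, obj: str = "") -> str:
    """Build a gs:// URI for a bucket or for an object inside it."""
    return f"gs://{bucket}/{obj}" if obj else f"gs://{bucket}"


def list_command(bucket: str) -> list:
    """Arguments for listing the files in a bucket."""
    return ["gsutil", "ls", gs_uri(bucket)]


def copy_command(local_file: str, bucket: str) -> list:
    """Arguments for copying a local file into a bucket under the same name."""
    return ["gsutil", "cp", local_file, gs_uri(bucket, local_file)]


def public_url(bucket: str, obj: str) -> str:
    """Public HTTPS link to an object (valid only if the object is public)."""
    return f"https://storage.googleapis.com/{bucket}/{obj}"


if __name__ == "__main__":
    print(is_valid_bucket_name(BUCKET))
    print(" ".join(list_command(BUCKET)))
    print(" ".join(copy_command(FILE, BUCKET)))
    print(public_url(BUCKET, FILE))
```

On a VM with the Cloud SDK installed, the argument lists could be passed to `subprocess.run`; here they are only printed so the sketch runs anywhere.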

Data pipelines

  • build data pipelines before building ML models from that data
  • data pipeline: bring the data to your system

GCP hierarchy

  • resources:
    • BigQuery dataset
    • Cloud storage bucket
    • Compute engine instance
  • projects:
    • dev
    • test project
    • production
  • folders - collections of projects:
    • team a
      • product 1, product 2
    • team b
  • organization - root node of the entire GCP hierarchy
    • optional
    • apply policies (IAM, user access)

Zones and regions organize resources physically. Projects organize resources logically.
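
To make the hierarchy concrete, here is a minimal sketch that models it as a tree in which access granted on a parent is inherited by everything below it. The `Node` class, the member address and the role labels (`viewer`, `editor`) are hypothetical simplifications, not the real IAM API, and treating "product 1" as a folder inside "team a" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node of the hierarchy: organization, folder, project or resource."""
    name: str
    kind: str
    parent: Optional["Node"] = None
    # Roles granted directly on this node: member -> set of role labels.
    bindings: dict = field(default_factory=dict)

    def child(self, name: str, kind: str) -> "Node":
        """Create a node directly below this one."""
        return Node(name=name, kind=kind, parent=self)

    def grant(self, member: str, role: str) -> None:
        """Grant a role to a member on this node."""
        self.bindings.setdefault(member, set()).add(role)

    def path(self) -> str:
        """Path from the root node down to this node."""
        parts = []
        node = self
        while node is not None:
            parts.append(f"{node.kind}:{node.name}")
            node = node.parent
        return "/".join(reversed(parts))

    def effective_roles(self, member: str) -> set:
        """Union of the roles granted on this node and on all its ancestors."""
        roles = set()
        node = self
        while node is not None:
            roles |= node.bindings.get(member, set())
            node = node.parent
        return roles


# Hypothetical hierarchy loosely following the lists above.
org = Node("example-org", "organization")
team_a = org.child("team a", "folder")
product_1 = team_a.child("product 1", "folder")
dev = product_1.child("dev", "project")
bucket = dev.child("example-bucket", "bucket")

org.grant("alice@example.com", "viewer")  # granted at the root
dev.grant("alice@example.com", "editor")  # granted on one project only

if __name__ == "__main__":
    print(bucket.path())
    print(sorted(bucket.effective_roles("alice@example.com")))
    print(sorted(team_a.effective_roles("alice@example.com")))
```

The bucket ends up with both roles because it sits under the organization and under the `dev` project, while the "team a" folder only sees the role granted at the organization level.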

Networking

  • private network
  • Google laid fiber-optic cables that cross oceans
  • the data centers are interconnected. cable diameter ~10cm
  • 1 Petabit/sec bandwidth

Google's Jupiter network can deliver enough bandwidth to allow 100,000 machines to communicate with each other. Google's network interconnects with the public Internet at more than 90 internet exchanges and more than 100 points of presence worldwide. Google responds to a user's request from the edge network location that provides the lowest latency.
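
As a rough back-of-the-envelope check, the two figures above (1 Petabit/sec and 100,000 machines) can be combined into a per-machine number. The sketch assumes the total bandwidth is shared evenly among all machines, which is a simplification made here and not a statement from the notes.

```python
PETABIT = 10**15  # bits
GIGABIT = 10**9   # bits

total_bandwidth_bps = 1 * PETABIT  # 1 Petabit/sec, from the list above
machines = 100_000                 # from the paragraph above

# Assumption: the bandwidth is divided evenly among all machines.
per_machine_bps = total_bandwidth_bps // machines
per_machine_gbps = per_machine_bps / GIGABIT

print(f"{per_machine_gbps:.0f} Gbit/s per machine")  # prints "10 Gbit/s per machine"
```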

Security

  • Responsibility management
    • On-premise: you manage all the responsibilities
      • hardware, network, OS, identity, web app security, development, usage, access policy, ...
    • IaaS: identity, web app security, development, usage, access policy, ...
    • PaaS: web app security, development, usage, access policy, ...
    • Managed services: usage, access policy
  • stored data is encrypted, e.g. in BigQuery
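
The responsibility list above can be written down as a small lookup table. This is only a sketch of the notes: the original lists end in "...", so the entries are not exhaustive, and `provider_managed` is a hypothetical helper name.

```python
# Responsibilities named in the notes, ordered from the bottom of the stack up.
# The original lists end in "...", so this is not an exhaustive list.
ALL_RESPONSIBILITIES = [
    "hardware", "network", "OS", "identity",
    "web app security", "development", "usage", "access policy",
]

# What the customer still manages under each model, as listed above.
CUSTOMER_MANAGED = {
    "On-premise": list(ALL_RESPONSIBILITIES),
    "IaaS": ["identity", "web app security", "development", "usage", "access policy"],
    "PaaS": ["web app security", "development", "usage", "access policy"],
    "Managed services": ["usage", "access policy"],
}


def provider_managed(model: str) -> list:
    """Responsibilities that move to the cloud provider under a given model."""
    customer = CUSTOMER_MANAGED[model]
    return [r for r in ALL_RESPONSIBILITIES if r not in customer]


if __name__ == "__main__":
    for model in CUSTOMER_MANAGED:
        print(f"{model}: provider manages {provider_managed(model)}")
```

Reading the table from top to bottom shows the customer's share shrinking as the service becomes more managed.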

Evolution of data processing frameworks

  • know how the frameworks have evolved
  • as data volumes grew, Google's need to handle data grew too. Innovations by Google:
    • 2002: GFS (Google File System) invented; foundation for storage and BigQuery
    • 2004: MapReduce paper introduced large-scale data processing
    • Hadoop was created, inspired by the GFS and MapReduce papers
    • 2006: Bigtable, inspiration for Apache HBase, MongoDB
    • 2008: Dremel, a new approach to data processing
    • 2010 - 2018: Colossus, Flume, Megastore, Spanner, Pub/Sub (for messaging), TensorFlow (for ML), TPU
  • these innovations are now provided as services in GCP:
    • 2002: Cloud Storage
    • 2004: Dataproc
    • 2006: Bigtable
    • 2008: BigQuery
    • 2010: Dataflow
    • 2011: Datastore
    • 2014: Pub/Sub
    • 2015: ML Engine
    • 2016: Cloud Spanner
    • 2018: AutoML