Project: Cleaning and Exploring Big Data using PySpark

  • Task 1 - Install Spark on Google Colab and load datasets in PySpark
  • Task 2 - Change column data types, trim whitespace and drop duplicates
  • Task 3 - Remove columns whose share of null values exceeds a threshold
  • Task 4 - Group, aggregate and create pivot tables
  • Task 5 - Rename categories and impute missing numeric values
  • Task 6 - Create visualizations to gather insights