This capsule runs a PySpark job on EMR Serverless that analyzes data from the NOAA Global Surface Summary of Day dataset from the Registry of Open Data on AWS.
extreme_weather.py analyzes data from the current year and finds the locations with the most extreme rain, wind, snow, and temperature.
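At its core, this kind of analysis is a set of per-column maximum lookups over GSOD weather records. The sketch below illustrates the idea in plain Python rather than Spark, using made-up sample rows; the column names (TEMP, WDSP, PRCP, SNDP) follow the NOAA GSOD schema, but the station data is invented for illustration and the real job reads the full dataset from S3.

```python
# Simplified, non-Spark sketch of the extreme-value lookup performed by
# extreme_weather.py. The station rows below are hypothetical sample data;
# column names follow the NOAA GSOD schema.
SAMPLE_ROWS = [
    {"NAME": "DEATH VALLEY, CA US", "TEMP": 101.2, "WDSP": 8.1, "PRCP": 0.0, "SNDP": 0.0},
    {"NAME": "JUNEAU, AK US", "TEMP": 44.5, "WDSP": 12.3, "PRCP": 1.9, "SNDP": 6.7},
    {"NAME": "MT WASHINGTON, NH US", "TEMP": 28.0, "WDSP": 46.9, "PRCP": 0.4, "SNDP": 22.0},
]

def most_extreme(rows, column):
    """Return (station name, value) for the row with the highest value in `column`."""
    row = max(rows, key=lambda r: r[column])
    return row["NAME"], row[column]

for column, label in [("TEMP", "temperature"), ("WDSP", "wind speed"),
                      ("PRCP", "precipitation"), ("SNDP", "snow depth")]:
    name, value = most_extreme(SAMPLE_ROWS, column)
    print(f"Most extreme {label}: {value} at {name}")
```

In the actual capsule, the same aggregation would be expressed with Spark DataFrame operations so it scales across the full open-data bucket.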
By running the Spark job from a Code Ocean Capsule, we preserve the provenance of the analysis, which would otherwise be lost. The following screenshot shows the lineage of a result generated by this capsule, which captures all the components involved in producing the result and guarantees its reproducibility.
This capsule requires two secrets attached in the environment editor.
- AWS Cloud Credentials or an Assumable Role: AWS credentials granting read and write access to the S3 bucket specified in the App Panel.
- Custom Key: the ARN of a job role with EMR access (at minimum, EMR Serverless permissions).
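The job role ARN is passed to EMR Serverless when the job run is started. As a hedged sketch of how such a request could be assembled: the application ID, bucket name, and script path below are placeholder assumptions, not values from this capsule, and the actual boto3 call is left commented out because it requires live AWS credentials.

```python
# Sketch of assembling an EMR Serverless start_job_run request.
# The application ID, bucket, and script key are hypothetical placeholders;
# in the capsule the real values come from the App Panel and attached secrets.

def build_job_run_request(application_id, job_role_arn, bucket, script_key, job_name):
    """Build the keyword arguments for an emr-serverless start_job_run call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,  # the Custom Key secret: job role ARN
        "name": job_name,
        "jobDriver": {
            "sparkSubmit": {"entryPoint": f"s3://{bucket}/{script_key}"}
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": f"s3://{bucket}/logs/"}
            }
        },
    }

request = build_job_run_request(
    "00example-app-id",
    "arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    "my-spark-bucket",
    "scripts/extreme_weather.py",
    "extreme-weather",
)

# With the AWS credentials secret in place, the job could be started like:
# import boto3
# client = boto3.client("emr-serverless")
# response = client.start_job_run(**request)
```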
This capsule is based on the example provided here: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark
Weather data from the NOAA Global Surface Summary of Day open data bucket (s3://noaa-gsod-pds/)
output_weather_report.txt: text file containing extreme weather stats from the current year, such as the highest temperature, the most precipitation, and the corresponding locations.
output: Console output from the run.
S3 Bucket
- This will be the entry point for the Spark job, storage for intermediate files, and storage for Spark logs.
Name of Spark Job
- The name of the EMR Serverless Spark Job
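The bucket named in the App Panel serves three roles (entry point, intermediate files, and Spark logs). A small helper can make that mapping explicit; the prefixes below are illustrative assumptions, not the capsule's actual paths.

```python
# Sketch of the S3 layout implied by the App Panel parameters: one bucket
# holds the job's entry-point script, intermediate files, and Spark logs.
# The prefixes here are hypothetical, for illustration only.

def s3_layout(bucket, job_name):
    """Map the App Panel bucket and job name to the URIs the job would use."""
    return {
        "entry_point": f"s3://{bucket}/scripts/extreme_weather.py",
        "intermediate": f"s3://{bucket}/{job_name}/tmp/",
        "logs": f"s3://{bucket}/{job_name}/logs/",
    }

layout = s3_layout("my-spark-bucket", "extreme-weather")
for role, uri in layout.items():
    print(f"{role}: {uri}")
```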
Code Ocean is a cloud-based computational platform that aims to make it easy for researchers to share, discover, and run code.


