
Practices - Data engineering

TP (practical work) - Data processing with Apache Spark

To process large amounts of data partitioned across a data lake, you can use a data processing framework such as Apache Spark:

  1. Read: https://spark.apache.org/docs/latest/sql-programming-guide.html

Some questions:

  • What is the Spark RDD API?
  • What is the Spark Dataset API?
  • Which languages can you use Spark with?
  • Which data sources and data sinks can Spark work with?
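
To make the first two questions concrete, here is a minimal sketch contrasting the RDD and Dataset APIs; the data is made up and the app runs in local mode, so treat it as an illustration rather than part of the exercise code:

```scala
import org.apache.spark.sql.SparkSession

object ApiTour {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-tour")
      .master("local[*]") // local mode for experimenting; a cluster URL would go here
      .getOrCreate()
    import spark.implicits._

    // RDD API: low-level, untyped transformations over a distributed collection
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    println(rdd.map(_ * 2).reduce(_ + _)) // prints 12

    // Dataset API: typed and declarative, optimized by the Catalyst engine
    val ds = Seq("France 2", "TF1", "M6").toDS()
    ds.filter(_.startsWith("France")).show()

    spark.stop()
  }
}
```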

Analyze data with Apache Spark and Scala

An engineering team at your company has created TV news data for you, stored as JSON inside the folder data-news-json/.

Your goal is to analyze it with your savoir-faire, enrich it with metadata, and store it in a column-oriented format.
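
As an illustration only, a pipeline matching that goal could look like the sketch below; the title column and the output path are assumptions, since the real schema lives in data-news-json/:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

object NewsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tv-news")
      .master("local[*]")
      .getOrCreate()

    // Read the JSON files; Spark infers the schema from the data
    val news = spark.read.json("data-news-json/")

    // Hypothetical enrichment: add a metadata column derived from the title
    val enriched = news.withColumn("titleLength", length(col("title")))

    // Parquet is a column-oriented format and Spark's default output format
    enriched.write.mode("overwrite").parquet("data-news-parquet/")

    spark.stop()
  }
}
```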

  1. Look at src/main/scala/com/github/polomarcus/main/Main.scala and update the code

Important note: as you work for a top-notch software company that follows world-class practices, and you care about your project's quality, you'll write a test for every function you write.

You can see tests inside src/test/scala/ and run them with sbt test
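
For instance, a test for a hypothetical pure helper (here, a titleLength function like the one sketched above) could look like this, assuming ScalaTest's AnyFunSuite style; the project's actual test style may differ:

```scala
import org.scalatest.funsuite.AnyFunSuite

class TitleLengthTest extends AnyFunSuite {
  // Hypothetical function under test; yours will live in src/main/scala
  def titleLength(title: String): Int = title.length

  test("titleLength counts the characters of a title") {
    assert(titleLength("JT de 20h") == 9)
    assert(titleLength("") == 0)
  }
}
```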

How can you deploy your app to a cluster of machines?
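
A common answer is to package the job as a jar and submit it to a cluster manager (standalone, YARN, or Kubernetes) with spark-submit; the master URL, Scala version, and jar name below are placeholders:

```
sbt package
spark-submit \
  --class com.github.polomarcus.main.Main \
  --master spark://<cluster-host>:7077 \
  target/scala-<version>/<your-artifact>.jar
```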

Business Intelligence (BI)

How could we use Spark to display data on a BI tool such as Metabase?

Tips: https://github.com/polomarcus/television-news-analyser#spin-up-1-postgres-metabase-nginxand-load-data-to-pg
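
One way to get there is to write the enriched data into Postgres over JDBC, which Metabase can then query. A minimal sketch, assuming a Postgres instance at localhost:5432 with hypothetical credentials, and the org.postgresql:postgresql driver on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object LoadToPostgres {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bi-load")
      .master("local[*]")
      .getOrCreate()

    // Re-read the column-oriented output of the previous step
    val enriched = spark.read.parquet("data-news-parquet/")

    // Write to a Postgres table that Metabase can be pointed at
    enriched.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/metabase")
      .option("dbtable", "news")
      .option("user", "postgres")
      .option("password", "postgres")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```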

Continuous build and test

Pro tip: https://www.scala-sbt.org/1.x/docs/Running.html#Continuous+build+and+test

Make a command run whenever one or more source files change by prefixing it with ~. For example, in the sbt shell, try:

sbt
> ~ testQuick
