To process a large amount of data partitioned on a data lake, you can use a data processing framework such as Apache Spark.
Some questions:
- What is the Spark RDD API?
- What is the Spark Dataset API?
- Which languages can you use Spark with?
- Which data sources and data sinks can Spark work with?
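To make the first two questions concrete, here is a minimal, hedged sketch contrasting the low-level RDD API with the typed Dataset API. The `News` case class and the local `SparkSession` setup are illustrative assumptions, not part of this repository:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataset {
  // Hypothetical record type, for illustration only
  case class News(title: String, channel: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD API: low-level, functional transformations on raw JVM objects,
    // no schema awareness and no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(News("Climate report", "TF1")))
    val titlesRdd = rdd.map(_.title)

    // Dataset API: typed and schema-aware, optimized by Catalyst/Tungsten
    val ds = Seq(News("Climate report", "TF1")).toDS()
    val titlesDs = ds.map(_.title)

    println(titlesRdd.collect().toList)
    println(titlesDs.collect().toList)
    spark.stop()
  }
}
```

Both APIs express the same transformation here; the Dataset version additionally lets Spark optimize the query plan.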
An engineering team at your company has created TV news data for you, stored as JSON inside the folder `data-news-json/`.
Your goal is to analyze it with your savoir-faire, enrich it with metadata, and store it in a column-oriented format.
- Look at `src/main/scala/com/github/polomarcus/main/Main.scala` and update the code
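As a rough sketch of the overall goal (read the JSON, enrich it with metadata, write it in a column-oriented format), assuming Parquet as the columnar format and a `source_file` column as one illustrative piece of metadata — the exact enrichment expected in `Main.scala` may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object EnrichNews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("enrich-news")
      .master("local[*]")
      .getOrCreate()

    // Read the partitioned JSON news data; Spark infers the schema
    val news = spark.read.json("data-news-json/")

    // Enrich with metadata: here, the file each record came from
    val enriched = news.withColumn("source_file", input_file_name())

    // Store in a column-oriented format (Parquet)
    enriched.write.mode("overwrite").parquet("data-news-parquet/")
    spark.stop()
  }
}
```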
Important note: as you work for a top-notch software company following world-class practices, and you care about your project's quality, you'll write a test for every function you write.
You can see the tests inside `src/test/scala/` and run them with `sbt test`.
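As a minimal sketch of what such a test could look like, assuming the project uses ScalaTest's `AnyFunSuite` style (the function under test, `toUpperTitle`, is hypothetical and only serves to show the shape of a test):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical function under test, for illustration only
object NewsFunctions {
  def toUpperTitle(title: String): String = title.toUpperCase
}

class NewsFunctionsTest extends AnyFunSuite {
  test("toUpperTitle should uppercase a news title") {
    assert(NewsFunctions.toUpperTitle("breaking news") == "BREAKING NEWS")
  }
}
```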
How could we use Spark to display data in a BI tool such as Metabase?
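One common pattern is to have Spark write its output to a relational database that Metabase can connect to, for example PostgreSQL over JDBC. A hedged sketch, where the Parquet input path, table name, and connection details are all placeholders to adapt to your environment:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object ExportToPostgres {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("export-to-postgres")
      .master("local[*]")
      .getOrCreate()

    // Read back the column-oriented output (placeholder path)
    val enriched = spark.read.parquet("data-news-parquet/")

    // Placeholder credentials: adapt to your environment
    val props = new Properties()
    props.setProperty("user", "metabase")
    props.setProperty("password", "changeme")
    props.setProperty("driver", "org.postgresql.Driver")

    // Write to a table that Metabase can then be pointed at as a data source
    enriched.write
      .mode("overwrite")
      .jdbc("jdbc:postgresql://localhost:5432/news", "tv_news", props)
    spark.stop()
  }
}
```

Requires the PostgreSQL JDBC driver on the Spark classpath; in Metabase you would then add the database as a data source and build dashboards on the `tv_news` table.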
Pro tip: https://www.scala-sbt.org/1.x/docs/Running.html#Continuous+build+and+test
Make a command run whenever one or more source files change by prefixing it with `~`. For example, in the sbt shell try:

```
sbt
> ~ testQuick
```