# Kotlin Spark API

---

Your next API to work with [Spark](https://spark.apache.org/).

One day this should become part of the https://github.com/apache/spark repository; until then, consider this beta-quality software.

## Goal

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Spark](https://spark.apache.org/).

Despite Kotlin being fully compatible with the existing API, Kotlin developers may want to use familiar features such as data
classes and lambda expressions written as simple curly-brace blocks or method references.

## Non-goals

There is no goal to replace any currently supported language or to add any Kotlin-support functionality to them.

## Installation

Currently, there are no kotlin-spark-api artifacts in Maven Central, but you can obtain a copy via JitPack:
[JitPack](https://jitpack.io/#JetBrains/kotlin-spark-api)

There is support for `Maven`, `Gradle`, `SBT`, and `Leiningen` on JitPack.

This project does not force you to use any particular version of Spark, but we've only tested it with Spark `3.0.0-preview2`.
We believe it should also work fine with version `2.4.5`.

If you're using Maven, you'll have to add the following to your `pom.xml`:

```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
    <artifactId>core</artifactId>
    <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.5</version>
</dependency>
```

`core` is compiled against Scala `2.12`, which means you have to use a `2.12` build of Spark if you want to
try out this project.
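
If you're on Gradle, a minimal sketch of the equivalent setup in a `build.gradle.kts` could look like the snippet below. The coordinates simply mirror the Maven snippet above and the version is a placeholder; double-check both on the JitPack page.

```kotlin
// build.gradle.kts: a hedged sketch mirroring the Maven snippet above.
// The version below is a placeholder; pick an actual release or commit hash listed on JitPack.
repositories {
    mavenCentral()
    maven { url = uri("https://jitpack.io") }
}

dependencies {
    implementation("com.github.JetBrains.kotlin-spark-api:core:<version>")
    implementation("org.apache.spark:spark-sql_2.12:2.4.5")
}
```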

## Usage

The first (and hopefully last) thing you need to do is add the following import to your Kotlin file:

```kotlin
import org.jetbrains.spark.api.*
```

Then you can create the `SparkSession` we all remember and love:

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application").orCreate
```

To create a `Dataset` you can call the `toDS` method like this:

```kotlin
spark.toDS("a" to 1, "b" to 2)
```

Indeed, this produces a `Dataset<Pair<String, Int>>`. There are a couple more `toDS` methods which accept different arguments.
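
For instance, assuming one of those overloads accepts a `List` (a hedged sketch; check the API for the exact signatures), you could build a `Dataset` of a data class:

```kotlin
// A sketch assuming a toDS overload that accepts a List; the Person class is purely illustrative.
data class Person(val name: String, val age: Int)

val people = spark.toDS(listOf(Person("Alice", 29), Person("Bob", 31)))
people.show()
```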

Also, there are several interesting aliases in the API, like `leftJoin`, `rightJoin`, etc.
An interesting fact about them is that they're null-safe by design. For example, `leftJoin` is aware of nullability and returns
`Dataset<Pair<LEFT, RIGHT?>>`.
Note that we're forcing `RIGHT` to be nullable so that you, as a developer, can handle this situation.
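
A minimal sketch of what that looks like in practice, assuming `leftJoin` takes the right-hand `Dataset` plus a join `Column`, and reusing the illustrative data-class and list-based `toDS` assumptions from the sketch above:

```kotlin
// Employee/Department and the list-based toDS are illustrative assumptions, as above.
data class Employee(val name: String, val deptId: Int)
data class Department(val id: Int, val title: String)

val employees = spark.toDS(listOf(Employee("Alice", 1), Employee("Bob", 99)))
val departments = spark.toDS(listOf(Department(1, "Engineering")))

// The right side of each resulting pair is nullable: Bob has no matching department.
employees
    .leftJoin(departments, employees.col("deptId").equalTo(departments.col("id")))
    .map { (employee, department) -> employee.name to (department?.title ?: "none") }
    .show()
```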

We know that `NullPointerException`s are hard to debug in Spark, and we're trying hard to make them as rare as possible.

## Useful helper methods

### `withSpark`

We provide you with the useful function `withSpark`, which accepts everything that may be needed to run Spark: properties,
app name, master location, and so on. It also accepts a block to be executed inside the Spark context.

After the block ends, `spark.stop()` is called automatically.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```

`dsOf` is just one more way to create a `Dataset` (here, a `Dataset<Int>`) from varargs.
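
If you need to tune the session, `withSpark` can also be called with explicit arguments. A hedged sketch follows; the parameter names used here (`props`, `master`, `appName`) are assumptions, so check the actual function signature for the exact names and defaults.

```kotlin
// The parameter names below (props, master, appName) are assumptions;
// check the actual withSpark signature for the exact names and defaults.
withSpark(
    props = mapOf("spark.sql.shuffle.partitions" to "4"),
    master = "local[2]",
    appName = "Simple Application"
) {
    dsOf(1, 2, 3)
        .map { it to it * it }
        .show()
}
```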

### `withCached`

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache`
method. However, it then becomes hard to control when we're working with a cached `Dataset` and when we're not.
It's also easy to forget to unpersist cached data, which can break things unexpectedly or simply take more memory
than intended.

To solve these problems we introduce the `withCached` function:

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()

            filter { it.first % 2 == 0 }.showDS()
        }
        .map { c(it.first, it.second, (it.first + it.second) * 2) }
        .show()
}
```

Here we show the cached `Dataset` for debugging purposes and then filter it. The `filter` method returns the filtered `Dataset`,
and afterwards the cached `Dataset` is unpersisted, so we have more memory available to call the `map` method and collect the resulting `Dataset`.

## Examples

You can find more examples in the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.

## Issues and feedback

Issues and any feedback are very welcome in the `Issues` section here.

If you find that we missed some important feature, please report it and we'll consider adding it.