# Kotlin Spark API

---

Your next API to work with [Spark](https://spark.apache.org/).

One day this should become part of the https://github.com/apache/spark repository; until then, consider it beta-quality software.

## Goal

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Spark](https://spark.apache.org/).

Even though Kotlin is fully compatible with the existing API, Kotlin developers may still want to use familiar features such as data
classes and lambda expressions written as simple blocks in curly braces or method references.
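
As a small taste of what this means in practice, here is an illustrative sketch that reuses constructs shown later in this README (`withSpark`, `dsOf`, `map`, `show`); the `User` class and the `greeting` function are purely hypothetical:

```kotlin
// Illustrative only: a data class plus a trailing lambda and a method reference over a Dataset.
data class User(val name: String, val age: Int)

fun greeting(user: User) = "Hello, ${user.name}!"

withSpark {
    dsOf(User("Alice", 30), User("Bob", 12))
        .map { it.copy(age = it.age + 1) }  // lambda as a simple expression in curly braces
        .map(::greeting)                    // method reference
        .show()
}
```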

## Non-goals

This project does not aim to replace any of the currently supported languages, nor to provide them with additional
functionality for working with Kotlin.

## Installation

Currently, there are no kotlin-spark-api artifacts in Maven Central, but you can obtain a copy via JitPack:
[![](https://jitpack.io/v/JetBrains/kotlin-spark-api.svg)](https://jitpack.io/#JetBrains/kotlin-spark-api)

JitPack provides support for `Maven`, `Gradle`, `SBT`, and `Leiningen`.

This project does not force you to use any particular version of Spark, but we have only tested it with Spark `3.0.0-preview2`.
We believe it should also work fine with version `2.4.5`.

If you're using Maven, you'll have to add the following to your `pom.xml`:

```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
    <artifactId>core</artifactId>
    <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.5</version>
</dependency>
```

The `core` artifact is compiled against Scala `2.12`, which means you have to use a `2.12` build of Spark if you want to
try out this project.
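
For Gradle users, a roughly equivalent setup might look like the sketch below. The coordinates mirror the Maven example above and the version is a placeholder, so check the JitPack badge for the exact artifact name and the latest release.

```kotlin
// build.gradle.kts: a sketch based on the Maven coordinates above, not an official snippet.
repositories {
    mavenCentral()
    maven { url = uri("https://jitpack.io") }
}

dependencies {
    // Substitute a concrete release from JitPack for the placeholder property below.
    implementation("com.github.JetBrains.kotlin-spark-api:core:${property("kotlinSparkApiVersion")}")
    implementation("org.apache.spark:spark-sql_2.12:2.4.5")
}
```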

## Usage

The first (and hopefully last) thing you need to do is add the following import to your Kotlin file:

```kotlin
import org.jetbrains.spark.api.*
```

Then you can create the `SparkSession` we all know and love:

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application")
    .orCreate
```

To create a `Dataset` you can call the `toDS` method:

```kotlin
spark.toDS("a" to 1, "b" to 2)
```

Indeed, this produces a `Dataset<Pair<String, Int>>`. There are a couple more `toDS` methods which accept different arguments.
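
Since data class support is one of the stated goals, the same vararg-style call should also work with your own data classes; a small sketch (the `Person` class is hypothetical):

```kotlin
// Hypothetical example: assumes the vararg toDS overload shown above also accepts data class instances.
data class Person(val name: String, val age: Int)

val people = spark.toDS(Person("Alice", 30), Person("Bob", 25))
people.show()
```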

The API also provides several handy aliases, such as `leftJoin`, `rightJoin`, etc.
An interesting fact about them is that they are null-safe by design. For example, `leftJoin` is aware of nullability and returns
`Dataset<Pair<LEFT, RIGHT?>>`.
Note that we force `RIGHT` to be nullable so that you, as a developer, can handle this situation explicitly.
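
As a rough sketch of what that shape looks like in practice (the data classes and the join-condition syntax here are assumptions rather than documented API, so treat this as illustrative):

```kotlin
// Hypothetical sketch demonstrating the Pair<LEFT, RIGHT?> result described above.
data class Customer(val id: Int, val name: String)
data class Order(val customerId: Int, val amount: Double)

withSpark {
    val customers = dsOf(Customer(1, "Alice"), Customer(2, "Bob"))
    val orders = dsOf(Order(1, 9.99))

    // Every Customer is kept; the Order side is nullable for customers without orders,
    // so each element is a Pair<Customer, Order?>.
    customers.leftJoin(orders, customers.col("id").equalTo(orders.col("customerId")))
        .map { (customer, order) -> customer.name to (order?.amount ?: 0.0) }
        .show()
}
```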

We know that `NullPointerException`s are hard to debug in Spark, so we try hard to make them as rare as possible.

## Useful helper methods

### `withSpark`

We provide the helper function `withSpark`, which accepts everything that may be needed to run Spark: properties,
app name, master location, and so on. It also accepts a block to be executed inside the Spark session.

After the block finishes, `spark.stop()` is called automatically.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```

`dsOf` is just one more way to create a `Dataset` (here a `Dataset<Int>`) from varargs.
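
The defaults can also be overridden when calling `withSpark`. Here is a minimal sketch of a parameterized call; the parameter names are guesses based on the description above, so check the actual function signature before relying on them:

```kotlin
// Hypothetical parameter names (props, master, appName): verify against the real withSpark signature.
withSpark(props = mapOf("spark.sql.shuffle.partitions" to 10), master = "local[2]", appName = "Simple Application") {
    dsOf(1, 2, 3)
        .map { it to it * it }
        .show()
}
```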

### `withCached`

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache`
method. However, it is then hard to keep track of when we are working with the cached `Dataset` and when we are not.
It is also easy to forget to unpersist cached data, which can make things break unexpectedly or simply take more memory
than intended.
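
For comparison, the manual pattern that `withCached` is meant to replace looks roughly like this (a sketch using plain Spark `cache()`/`unpersist()` together with the helpers shown earlier):

```kotlin
withSpark {
    // Manual caching: it is easy to forget the unpersist() call, or to call it too early.
    val cached = dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .cache()

    cached.show()
    cached.filter { it.first % 2 == 0 }.show()

    cached.unpersist()  // has to be remembered by hand
}
```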

To solve these problems, we introduce the `withCached` function:

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()

            filter { it.first % 2 == 0 }.showDS()
        }
        .map { c(it.first, it.second, (it.first + it.second) * 2) }
        .show()
}
```

Here we show the cached `Dataset` for debugging purposes and then filter it. The `filter` method returns the filtered `Dataset`,
and the cached `Dataset` is then unpersisted, so we have more memory available to call the `map` method and collect the resulting `Dataset`.

## Examples

You can find more examples in the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.

## Issues and feedback

Issues and any other feedback are very welcome in the `Issues` section of this repository.

If you find that we have missed an important feature, please report it and we will consider adding it.
