11[//]: # (title: Read)
22<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Read-->
33
4- The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, Apache Arrow input formats.
4+ The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.
55
6- `read` method automatically detects input format based on file extension and content
6+ The `.read()` function automatically detects the input format based on file extension and content:
77
88```kotlin
99DataFrame.read("input.csv")
1010```
1111
12- Input string can be a file path or URL.
12+ The input string can be a file path or URL.
1313
14- ## Reading CSV
14+ ## Read from CSV
1515
16- All these calls are valid:
16+ To read a CSV file, use the `.readCSV()` function.
17+
18+ To read a CSV file from a file:
1719
1820```kotlin
1921import java.io.File
20- import java.net.URL
2122
2223DataFrame.readCSV("input.csv")
24+ // Alternatively
2325DataFrame.readCSV(File("input.csv"))
26+ ```
27+
28+ To read a CSV file from a URL:
29+
30+ ```kotlin
31+ import java.net.URL
32+
2433DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
2534```
2635
27- All `readCSV` overloads support different options.
28- For example, you can specify custom delimiter if it differs from `,`, charset
29- and column names if your CSV is missing them
36+ ### Specify delimiter
37+
38+ By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
3039
3140<!---FUN readCsvCustom-->
3241
@@ -41,7 +50,9 @@ val df = DataFrame.readCSV(
4150
4251<!---END-->
4352
44- Column types will be inferred from the actual CSV data. Suppose that CSV from the previous
53+ ### Column type inference from CSV
54+
55+ Column types are inferred from the CSV data. Suppose that the CSV from the previous
4556example had the following content:
4657
4758<table>
@@ -51,7 +62,7 @@ example had the following content:
5162<tr><td>89</td><td>abc</td><td>7.1</td><td>false</td></tr>
5263</table>
5364
54- [`DataFrame`](DataFrame.md) schema we get is:
65+ Then the [`DataFrame`](DataFrame.md) schema we get is:
5566
5667```text
5768A: Int
@@ -60,7 +71,7 @@ C: Double
6071D: Boolean?
6172```
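
As a rough illustration of what this kind of inference involves, here is a minimal sketch in plain Kotlin. It is not the library's implementation: `inferColumnType` is a hypothetical helper, only the four types from the schema above are considered, and the sample values are made up for the example.

```kotlin
// Hypothetical helper for illustration only; not part of the DataFrame API.
// Picks the narrowest type that fits every non-missing value in a column.
fun inferColumnType(values: List<String?>): String {
    val present = values.filterNotNull().filter { it.isNotEmpty() }
    // A column is nullable if at least one value is missing or empty.
    val nullable = present.size < values.size
    val base = when {
        present.isEmpty() -> "String"
        present.all { it.toIntOrNull() != null } -> "Int"
        present.all { it.toDoubleOrNull() != null } -> "Double"
        present.all {
            it.equals("true", ignoreCase = true) || it.equals("false", ignoreCase = true)
        } -> "Boolean"
        else -> "String"
    }
    return if (nullable) "$base?" else base
}

fun main() {
    // Made-up sample columns.
    println(inferColumnType(listOf("12", "41", "89")))       // prints Int
    println(inferColumnType(listOf("abc", "def")))           // prints String
    println(inferColumnType(listOf("7.1", "8.0")))           // prints Double
    println(inferColumnType(listOf("true", null, "false")))  // prints Boolean?
}
```

The order of the checks matters: `Int` is tried before `Double` because every integer literal also parses as a `Double`.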
6273
63- [`DataFrame`](DataFrame.md) will try to parse columns as JSON, so when reading following table with JSON object in column D:
74+ [`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:
6475
6576<table>
6677<tr><th>A</th><th>D</th></tr>
7788 C: Int
7889```
7990
80- For column where values are lists of JSON values:
91+ For a column where values are lists of JSON values:
8192<table>
8293<tr><th>A</th><th>G</th></tr>
8394<tr><td>12</td><td>[{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}]</td></tr>
92103 D: Int
93104```
94105
95- ### Dealing with locale specific numbers
106+ ### Work with locale-specific numbers
96107
97108Sometimes columns in your CSV can be interpreted differently depending on your system locale.
98109
@@ -102,8 +113,8 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
102113<tr><td>41,111</td></tr>
103114</table>
104115
105- Here comma can be decimal or thousands separator, thus different values.
106- You can deal with it in two ways
116+ Here a comma can be decimal or thousands separator, thus different values.
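
The ambiguity can be reproduced outside the library with the JDK's `java.text.NumberFormat`. This is only a sketch of the underlying problem, not of how the DataFrame parser works; `parseWithLocale` is a hypothetical helper introduced for the example.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: parses the same text under a given locale's number rules.
fun parseWithLocale(text: String, locale: Locale): Number =
    NumberFormat.getInstance(locale).parse(text)

fun main() {
    // In the US locale the comma is a thousands separator.
    println(parseWithLocale("41,111", Locale.US))      // prints 41111
    // In the German locale the comma is a decimal separator.
    println(parseWithLocale("41,111", Locale.GERMANY)) // prints 41.111
}
```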
117+ You can deal with it in two ways:
107118
1081191) Provide locale as a parser option
109120
@@ -132,20 +143,34 @@ val df = DataFrame.readCSV(
132143<!---END-->
133144
134145
135- ## Reading JSON
146+ ## Read from JSON
147+
148+ To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
149+
150+ Note that after reading a JSON with a complex structure, you can get hierarchical
151+ [`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
152+
153+ To read a JSON file from a file:
154+
155+ <!---FUN readJson-->
156+
157+ ```kotlin
158+ val df = DataFrame.readJson(file)
159+ ```
160+
161+ <!---END-->
136162
137- Basics for reading JSONs are the same: you can read from file or from remote URL.
163+ To read a JSON file from a URL:
138164
139165```kotlin
140166DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
141167```
142168
143- Note that after reading a JSON with a complex structure, you can get hierarchical
144- [`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
169+ ### Column type inference from JSON
145170
146- Also note that type inferring process for JSON is much simpler than for CSV.
147- JSON string literals are always supposed to have String type, number literals
148- take different `Number` kinds, boolean literals are converted to `Boolean`.
171+ Type inference for JSON is much simpler than for CSV.
172+ JSON string literals are always supposed to have String type. Number literals
173+ take different `Number` kinds. Boolean literals are converted to `Boolean`.
149174
150175Let's take a look at the following JSON:
151176
@@ -178,17 +203,13 @@ Let's take a look at the following JSON:
178203]
179204```
180205
181- We can read it from file
182-
183- <!---FUN readJson-->
206+ We can read it from file:
184207
185208```kotlin
186209val df = DataFrame.readJson(file)
187210```
188211
189- <!---END-->
190-
191- Corresponding [`DataFrame`](DataFrame.md) schema will be
212+ The corresponding [`DataFrame`](DataFrame.md) schema is:
192213
193214```text
194215A: String
@@ -200,7 +221,9 @@ D: Boolean?
200221Column A has `String` type because all values are string literals, no implicit conversion is performed. Column C
201222has `Number` type because it's the least common type for `Int` and `Double`.
202223
203- ### JSON Reading Options: Type Clash Tactic
224+ ### JSON parsing options
225+
226+ #### Manage type clashes
204227
205228By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and
206229any number of object properties:
@@ -251,9 +274,9 @@ For this case, you can set `typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS`
251274
252275This option is also possible to set in the Gradle- and KSP plugin by providing `jsonOptions`.
253276
254- ### JSON Reading Options: Key/Value Paths
277+ #### Specify Key/Value Paths
255278
256- If you have some JSON looking like
279+ If you have a JSON like:
257280
258281```json
259282{
@@ -280,10 +303,10 @@ If you have some JSON looking like
280303}
281304```
282305
283- you will get a column for each dog, which becomes an issue when you have a lot of dogs.
284- This issue is especially noticeable when generating data schemas from the JSON, as you might even run out of memory
285- when doing that due to the sheer number of generated interfaces.\
286- Instead, you can use `keyValuePaths` to specify paths to the objects that should be read as key value frame columns.
306+ You will get a column for each dog, which becomes an issue when you have a lot of dogs.
307+ This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory
308+ when doing that due to the sheer number of generated interfaces. Instead, you can use `keyValuePaths` to specify paths
309+ to the objects that should be read as key value frame columns.
287310
288311This can be the difference between:
289312
@@ -342,22 +365,35 @@ Only the bracket notation of json path is supported, as well as just double quot
342365
343366For more examples, see the "examples/json" module.
344367
345- ## Reading Excel
368+ ## Read from Excel
346369
347- Add dependency:
370+ Before you can read data from Excel, add the following dependency:
348371
349372```kotlin
350373implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
351374```
352375
353- Right now [`DataFrame`](DataFrame.md) supports reading Excel spreadsheet formats: xls, xlsx.
376+ To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
377+ Excel spreadsheet formats are: xls, xlsx.
378+
379+ To read an Excel spreadsheet from a file:
380+
381+ ```kotlin
382+ val df = DataFrame.readExcel(file)
383+ ```
354384
355- You can read from file or URL.
385+ To read an Excel spreadsheet from a URL:
386+
387+ ```kotlin
388+ DataFrame.readExcel("https://example.com/data.xlsx")
389+ ```
390+
391+ ### Cell type inference from Excel
356392
357393Cells representing dates will be read as `kotlinx.datetime.LocalDateTime`.
358- Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`
394+ Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`.
359395
360- Sometimes cells can have wrong format in Excel file, for example you expect to read column of String:
396+ Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of `String`:
361397
362398```text
363399IDS
367403C100
368404```
369405
370- You will get column of Serializable instead (common parent for Double & String)
406+ You will get column of `Serializable` instead (common parent for `Double` and `String`).
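
The shared supertype can be checked in plain Kotlin on the JVM, without the library. The snippet below is a small sketch with made-up cell values; `allSerializable` is a hypothetical helper, not a DataFrame function.

```kotlin
import java.io.Serializable

// Hypothetical helper: true if every value implements java.io.Serializable.
fun allSerializable(values: List<Any>): Boolean = values.all { it is Serializable }

fun main() {
    // Made-up mix of a numeric cell and text cells.
    val ids: List<Any> = listOf(100.0, "A100", "B100", "C100")
    println(ids.map { it.javaClass.simpleName }) // prints [Double, String, String, String]
    println(allSerializable(ids))                // prints true
}
```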
371407
372- You can fix it using convert:
408+ You can fix it using the `.convert()` function:
373409
374410<!---FUN fixMixedColumn-->
375411
@@ -387,25 +423,28 @@ df1["IDS"].type() shouldBe typeOf<String>()
387423
388424<!---END-->
389425
390- ## Reading Apache Arrow formats
426+ ## Read Apache Arrow formats
391427
392- Add dependency:
428+ Before you can read data from Apache Arrow format, add the following dependency:
393429
394430```kotlin
395431implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
396432```
397433
398- <warning>
399- Make sure to follow [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide when using Java 9+
400- </warning>
434+ To read Apache Arrow formats, use the `.readArrowFeather()` function:
401435
402- [`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
403- and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
404- from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
405436<!---FUN readArrowFeather-->
406437
407438```kotlin
408439val df = DataFrame.readArrowFeather(file)
409440```
410441
411442<!---END-->
443+
444+ [`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
445+ and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
446+ from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
447+
448+ > If you use Java 9+, follow the [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide.
449+ >
450+ {style="note"}