Commit 732880c

update docs and add missing.md
1 parent 2108ae0 commit 732880c

6 files changed: 145 additions and 6 deletions

docs/make.jl

Lines changed: 3 additions & 2 deletions
@@ -11,16 +11,17 @@ makedocs(
1111
# modules = [InMemoryDatasets],
1212
doctest = false,
1313
clean = false,
14-
sitename = "In Memory Datasets",
14+
sitename = "InMemoryDatasets",
1515
# format = Documenter.HTML(
1616
# canonical = "https://sl-solution.github.io/InMemoryDataset.jl/stable/",
1717
# edit_link = "main"
1818
# ),
1919
pages = Any[
2020
"Introduction" => "index.md",
2121
"First Steps" => "man/basics.md",
22+
"Tutorial" => "man/tutorial.md",
2223
"User Guide" => Any[
23-
"Tutorial" => "man/tutorial.md",
24+
"Missing Values" => "man/missing.md",
2425
"Formats" => "man/formats.md",
2526
"Call functions on each observation" => "man/map.md",
2627
"Row-wise operations" => "man/byrow.md",

docs/src/index.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ If you are new to InMemoryDatasets.jl, probably **First steps with Datasets** or
1414
```@contents
1515
Pages = ["man/basics.md",
1616
"man/tutorial.md",
17+
"man/missing.md",
1718
"man/formats.md",
1819
"man/map.md",
1920
"man/byrow.md",

docs/src/man/grouping.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
22

33
## Introduction
44

5-
InMemoryDatasets uses two approaches to group observations: sorting, and hashing. In sorting approach, it sorts the data set based on given columns and finds the starts and ends of each group based on the sorted values. In hashing approach, it uses a customised algorithm to group observations. Each of these approaches has some advantages over the other one and for any particular problem one of them might be more suitable than the other one.
5+
InMemoryDatasets uses two approaches to group observations: sorting, and hashing. In sorting approach, it sorts the data set based on given columns and finds the starts and ends of each group based on the sorted values. In hashing approach, it uses a hybrid algorithm to group observations. Each of these approaches has some advantages over the other one and for any particular problem one of them might be more suitable than the other one.
66

77
## `groupby!` and `groupby`
88

@@ -258,7 +258,7 @@ The `ungroup!` function can be used in scenarios that one needs to modify a data
258258

259259
## `gatherby`
260260

261-
The `gatherby` function uses the hashing approach to group observations based on a set of columns. InMemoryDatasets uses a customised algorithm to gather observations which sometimes does this without using the `hash` function. The `gatherby` function doesn't sort the data set, instead, it uses the in-house developed algorithm to group observations. `gatherby` can be particularly useful when sorting is computationally expensive. Another benefit of `gatherby` is that, by default, it keeps the order of observations in each group the same as their appearance in the original data set.
261+
The `gatherby` function uses the hashing approach to group observations based on a set of columns. InMemoryDatasets uses a hybrid algorithm to gather observations which sometimes does this without using the `hash` function. The `gatherby` function doesn't sort the data set, instead, it uses the hybrid algorithm to group observations. `gatherby` can be particularly useful when sorting is computationally expensive. Another benefit of `gatherby` is that, by default, it keeps the order of observations in each group the same as their appearance in the original data set.
262262

263263
The `gatherby` function uses the formatted values for gathering the observations into groups, however, using `mapformats = false` changes this behaviour.
264264

docs/src/man/joins.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ In general (for some special cases InMemoryDatasets may use "hash-join" techniqu
2222

2323
For `leftjoin` and `innerjoin` the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the `stable` keyword argument to `true` to guarantee a stable sort. For `outerjoin` the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table, there is no specific order.
2424

25-
By default, the join functions use a modified `Heap Sort` algorithm to sort the observations in the right data set, however, setting `alg = QuickSort` change the default algorithm to the Quick Sort one.
25+
By default, the join functions use a hybrid `Heap Sort` algorithm to sort the observations in the right data set; however, setting `alg = QuickSort` changes the default algorithm to a hybrid Quick Sort one.
2626

2727
For very large data sets, if the sorting of the first key is expensive, setting the `accelerate` keyword argument to `true` may improve the overall performance. By setting `accelerate = true`, InMemoryDatasets first divides all observations in the right data set into multiple parts (up to 1024 parts) based on the first passed key, and then for each observations in the left data set finds the corresponding part in the right data set and searches for the matching observations only within that part.
2828

docs/src/man/missing.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
# How InMemoryDatasets treats missing values
2+
3+
## Every column supports `missing`
4+
5+
The `Dataset()` constructor automatically converts each column of a data set to allow `missing` when it constructs a data set. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the `missing` type.
6+
7+
## Functions which skip missing values
8+
9+
When InMemoryDatasets is loaded into a Julia session, the behaviour of the following functions changes so that they remove missing values if an `AbstractVector{Union{T, Missing}}` is passed as their argument. It is the user's responsibility to handle situations where this is not desired.
10+
11+
The following list summarises the details of how InMemoryDatasets removes/skips/ignores missing values (for the rest of this section `INTEGERS` refers to `{U/Int8, U/Int16, U/Int32, U/Int64}` and `FLOATS` refers to `{Float16, Float32, Float64}`):
12+
13+
* `argmax` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `missing`.
14+
* `argmin` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `missing`.
15+
* `cummax` : For `INTEGERS`, `FLOATS`, and `TimeType` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
16+
* `cummax!`: For `INTEGERS`, `FLOATS`, and `TimeType` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
17+
* `cummin` : For `INTEGERS`, `FLOATS`, and `TimeType` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
18+
* `cummin!`: For `INTEGERS`, `FLOATS`, and `TimeType` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
19+
* `cumprod` : For `INTEGERS` and `FLOATS` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
20+
* `cumprod!`: For `INTEGERS` and `FLOATS` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
21+
* `cumsum` : For `INTEGERS` and `FLOATS` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
22+
* `cumsum!` : For `INTEGERS` and `FLOATS` ignore missing values, however, by passing `missings = :skip` it jumps over missing values. When all values are `missing`, it returns the input.
23+
* `extrema` : For `INTEGERS`, `FLOATS`, and `TimeType` skip missing values. When all values are `missing`, it returns `(missing, missing)`.
24+
* `findmax` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `(missing, missing)`.
25+
* `findmin` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `(missing, missing)`.
26+
* `maximum` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `missing`.
27+
* `mean` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
28+
* `median` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
29+
* `median!` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
30+
* `minimum` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString` skip missing values. When all values are `missing`, it returns `missing`.
31+
* `std` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
32+
* `sum` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
33+
* `var` : For `INTEGERS` and `FLOATS` skip missing values. When all values are `missing`, it returns `missing`.
34+
35+
```jldoctest
36+
julia> x = [1,1,missing]
37+
3-element Vector{Union{Missing, Int64}}:
38+
1
39+
1
40+
missing
41+
42+
julia> sum(x)
43+
2
44+
45+
julia> mean(x)
46+
1.0
47+
48+
julia> maximum(x)
49+
1
50+
51+
julia> minimum(x)
52+
1
53+
54+
julia> findmax(x)
55+
(1, 1)
56+
57+
julia> findmin(x)
58+
(1, 1)
59+
60+
julia> cumsum(x)
61+
3-element Vector{Union{Missing, Int64}}:
62+
1
63+
2
64+
2
65+
66+
julia> cumsum(x, missings = :skip)
67+
3-element Vector{Union{Missing, Int64}}:
68+
1
69+
2
70+
missing
71+
72+
julia> cumprod(x, missings = :skip)
73+
3-element Vector{Union{Missing, Int64}}:
74+
1
75+
1
76+
missing
77+
78+
julia> median(x)
79+
1.0
80+
```
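
The difference between the default handling of `missing` in the cumulative functions and `missings = :skip` can be mimicked in plain Julia. The following is only an illustrative sketch that reproduces the `cumsum` outputs shown in the doctest; `cumsum_ignore` and `cumsum_skip` are hypothetical helper names, not part of InMemoryDatasets, and the package's own implementation may well differ. Keeping leading missing values as `missing` in the default mode is an assumption made here.

```julia
# Hypothetical helpers (not part of InMemoryDatasets) that mimic the two
# documented ways of handling `missing` in a cumulative sum.

# Default mode: a missing value adds nothing, so the running total is
# carried forward into its position. Positions before the first
# non-missing value stay `missing` (an assumption of this sketch).
function cumsum_ignore(x::AbstractVector)
    T = nonmissingtype(eltype(x))
    out = Vector{Union{Missing, T}}(missing, length(x))
    total = zero(T)
    seen = false                      # true once a non-missing value was met
    for (i, v) in enumerate(x)
        if !ismissing(v)
            total += v
            seen = true
        end
        seen && (out[i] = total)
    end
    return out
end

# `missings = :skip` mode: missing positions are jumped over and stay
# `missing` in the output, while the running total continues after them.
function cumsum_skip(x::AbstractVector)
    T = nonmissingtype(eltype(x))
    out = Vector{Union{Missing, T}}(missing, length(x))
    total = zero(T)
    for (i, v) in enumerate(x)
        ismissing(v) && continue      # leave `missing` in place
        total += v
        out[i] = total
    end
    return out
end

x = [1, 1, missing]
ignored = cumsum_ignore(x)   # [1, 2, 2]
skipped = cumsum_skip(x)     # [1, 2, missing]
```

The same pattern would apply to `cumprod`, `cummax`, and `cummin` by swapping the accumulation step.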
81+
82+
### Some remarks
83+
84+
`var` and `std` will return `missing` when `dof = true` and an `AbstractVector{Union{T, Missing}}` of length one is passed as their argument. This is different from the behaviour of these functions defined in the `Statistics` package.
85+
86+
```jldoctest
87+
julia> var(Union{Missing, Int}[1])
88+
missing
89+
90+
julia> std(Union{Missing, Int}[1])
91+
missing
92+
93+
julia> var([1]) # fallback to Statistics.var
94+
NaN
95+
96+
julia> std([1]) # fallback to Statistics.std
97+
NaN
98+
```
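
The remark above can be reproduced without loading the package. This is a minimal sketch using only the `Statistics` standard library; `var_or_missing` is a hypothetical helper written for illustration (it assumes the default `dof = true` case) and is not how InMemoryDatasets implements `var`.

```julia
using Statistics

# Hypothetical helper (not the package's implementation): variance over the
# non-missing values, returning `missing` when fewer than two values remain.
function var_or_missing(x::AbstractVector)
    vals = collect(skipmissing(x))
    length(vals) < 2 && return missing
    return var(vals)
end

single  = var_or_missing(Union{Missing, Int}[1])   # missing
plain   = var([1])                                  # NaN, the Statistics default
several = var_or_missing([1, 2, missing, 3])        # 1.0
```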
99+
100+
## Multithreaded functions
101+
102+
The `sum`, `minimum`, and `maximum` functions also support the `threads` keyword argument. When it is set to `true`, they exploit all cores for calculation.
103+
104+
## `topk`, `IMD.n`, and `IMD.nmissing`
105+
106+
The following function is also exported by InMemoryDatasets:
107+
108+
* `topk` : Return the top (or bottom) k values of a vector. It ignores `missing` values, unless all values are `missing`, in which case it returns `[missing]`.
109+
110+
and the following functions are not exported but are available via `dot` notation:
111+
112+
* `InMemoryDatasets.n` or `IMD.n` : Return number of non-missing elements
113+
* `InMemoryDatasets.nmissing` or `IMD.nmissing` : Return number of `missing` elements
114+
115+
```jldoctest
116+
julia> x = [13, 1, missing, 10]
117+
4-element Vector{Union{Missing, Int64}}:
118+
13
119+
1
120+
missing
121+
10
122+
123+
julia> topk(x, 2)
124+
2-element Vector{Int64}:
125+
13
126+
10
127+
128+
julia> topk(x, 2, rev = true)
129+
2-element Vector{Int64}:
130+
1
131+
10
132+
julia> IMD.n(x)
133+
3
134+
135+
julia> IMD.nmissing(x)
136+
1
137+
```
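
For comparison, the results in this doctest can be approximated with Base Julia alone. The sketch below is not the package's algorithm; it only shows equivalent results for this particular vector using `skipmissing`, `partialsort`, and `count`, and the variable names are chosen here for illustration.

```julia
x = [13, 1, missing, 10]

# Drop missing values first, then take the two largest / two smallest.
vals    = collect(skipmissing(x))
top2    = collect(partialsort(vals, 1:2, rev = true))   # like topk(x, 2)
bottom2 = collect(partialsort(vals, 1:2))               # like topk(x, 2, rev = true)

# Counts of non-missing and missing elements.
n_nonmissing = count(!ismissing, x)   # like IMD.n(x)
n_missing    = count(ismissing, x)    # like IMD.nmissing(x)
```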

docs/src/man/sorting.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Sorting is one of the key tasks for Datasets. Actually, when we group a data set
88
99
## `sort!/sort`
1010

11-
The `sort!` function accepts a Dataset and a set of columns and sorts the given Dataset based on provided columns. By default the `sort!` function does the sorting based on the formatted values, however, using `mapformats = false` forces the sorting be done based on the actual values. `sort!` doesn't create a new dataset, it only replaces the original one with the sorted one. If the original data set needed to be untouched the `sort` function must be used. By default, both `sort!` and `sort` functions do a stable sort using a `Heap` sort algorithm. If the stability of the sort is not needed, using the keyword option `stable = false` can improve the performance. User can also change the default sorting algorithm to `QuickSort` by using the `alg = QuickSort` option. By default the ascending sorting is used for the sorting task, and using `rev = true` changes it to descending ordering, and for multiple columns a vector of `true`, `false` can be supplied for this option, i.e. each column can be sorted in ascending or descending order independently. Note that:
11+
The `sort!` function accepts a Dataset and a set of columns and sorts the given Dataset based on provided columns. By default the `sort!` function does the sorting based on the formatted values, however, using `mapformats = false` forces the sorting be done based on the actual values. `sort!` doesn't create a new dataset, it only replaces the original one with the sorted one. If the original data set needed to be untouched the `sort` function must be used. By default, both `sort!` and `sort` functions do a stable sort using a hybrid `Heap` sort algorithm. If the stability of the sort is not needed, using the keyword option `stable = false` can improve the performance. User can also change the default sorting algorithm to hybrid `QuickSort` by using the `alg = QuickSort` option. By default the ascending sorting is used for the sorting task, and using `rev = true` changes it to descending ordering, and for multiple columns a vector of `true`, `false` can be supplied for this option, i.e. each column can be sorted in ascending or descending order independently. Note that:
1212

1313
* Datasets uses `isless` for checking the order of values.
1414
