update docs and add missing.md

sl-solution · sl-solution · commit 732880ca425d · 2021-12-13T22:22:06.000+13:00
diff --git a/docs/make.jl b/docs/make.jl
@@ -11,16 +11,17 @@ makedocs(
     # modules = [InMemoryDatasets],
     doctest = false,
     clean = false,
-    sitename = "In Memory Datasets",
+    sitename = "InMemoryDatasets",
     # format = Documenter.HTML(
     #     canonical = "https://sl-solution.github.io/InMemoryDataset.jl/stable/",
     #     edit_link = "main"
     # ),
     pages = Any[
         "Introduction" => "index.md",
         "First Steps" => "man/basics.md",
+        "Tutorial" => "man/tutorial.md",
         "User Guide" => Any[
-            "Tutorial" => "man/tutorial.md",
+            "Missing Values" => "man/missing.md",
             "Formats" => "man/formats.md",
             "Call functions on each observation" => "man/map.md",
             "Row-wise operations" => "man/byrow.md",
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -14,6 +14,7 @@ If you are new to InMemoryDatasets.jl, probably **First steps with Datasets** or
 ```@contents
 Pages = ["man/basics.md",
          "man/tutorial.md",
+         "man/missing.md",
          "man/formats.md",
          "man/map.md",
          "man/byrow.md",
diff --git a/docs/src/man/grouping.md b/docs/src/man/grouping.md
@@ -2,7 +2,7 @@
 
 ## Introduction
 
-InMemoryDatasets uses two approaches to group observations: sorting, and hashing. In sorting approach, it sorts the data set based on given columns and finds the starts and ends of each group based on the sorted values. In hashing approach, it uses a customised algorithm to group observations. Each of these approaches has some advantages over the other one and for any particular problem one of them might be more suitable than the other one.
+InMemoryDatasets uses two approaches to group observations: sorting and hashing. In the sorting approach, it sorts the data set based on the given columns and finds the start and end of each group from the sorted values. In the hashing approach, it uses a hybrid algorithm to group observations. Each approach has some advantages over the other, and for any particular problem one of them might be more suitable.
 
 ## `groupby!` and `groupby`
 
@@ -258,7 +258,7 @@ The `ungroup!` function can be used in scenarios that one needs to modify a data
 
 ## `gatherby`
 
-The `gatherby` function uses the hashing approach to group observations based on a set of columns. InMemoryDatasets uses a customised algorithm to gather observations which sometimes does this without using the `hash` function. The `gatherby` function doesn't sort the data set, instead, it uses the in-house developed algorithm to group observations. `gatherby` can be particularly useful when sorting is computationally expensive. Another benefit of `gatherby` is that, by default, it keeps the order of observations in each group the same as their appearance in the original data set.
+The `gatherby` function uses the hashing approach to group observations based on a set of columns. InMemoryDatasets uses a hybrid algorithm to gather observations, which sometimes does this without using the `hash` function. The `gatherby` function doesn't sort the data set; instead, it uses the hybrid algorithm to group observations. `gatherby` can be particularly useful when sorting is computationally expensive. Another benefit of `gatherby` is that, by default, it keeps the order of observations in each group the same as their order of appearance in the original data set.
 
 The `gatherby` function uses the formatted values for gathering the observations into groups, however, using `mapformats = false` changes this behaviour.
 
diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md
@@ -22,7 +22,7 @@ In general (for some special cases InMemoryDatasets may use "hash-join" techniqu
 
 For `leftjoin` and `innerjoin` the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the `stable` keyword argument to `true` to guarantee a stable sort. For `outerjoin` the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table, there is no specific order.
 
-By default, the join functions use a modified `Heap Sort` algorithm to sort the observations in the right data set, however, setting `alg = QuickSort` change the default algorithm to the Quick Sort one.
+By default, the join functions use a hybrid `Heap Sort` algorithm to sort the observations in the right data set; however, setting `alg = QuickSort` changes the default algorithm to a hybrid Quick Sort one.
 
 For very large data sets, if the sorting of the first key is expensive, setting the `accelerate` keyword argument to `true` may improve the overall performance. By setting `accelerate = true`, InMemoryDatasets first divides all observations in the right data set into multiple parts (up to 1024 parts) based on the first passed key, and then for each observations in the left data set finds the corresponding part in the right data set and searches for the matching observations only within that part.
 
diff --git a/docs/src/man/missing.md b/docs/src/man/missing.md
@@ -0,0 +1,137 @@
+# How InMemoryDatasets treats missing values
+
+## Every column supports `missing`
+
+The `Dataset()` constructor automatically converts each column of a data set to allow `missing` when it constructs the data set. All algorithms in InMemoryDatasets are optimised to minimise the overhead of supporting the `missing` type.
+
+## Functions which skip missing values
+
+When InMemoryDatasets is loaded into a Julia session, the behaviour of the following functions changes so that they remove missing values if an `AbstractVector{Union{T, Missing}}` is passed as their argument. It is the user's responsibility to handle situations where this is not desired.
+
+The following list summarises the details of how InMemoryDatasets removes/skips/ignores missing values (for the rest of this section `INTEGERS` refers to `{U/Int8, U/Int16, U/Int32, U/Int64}` and `FLOATS` refers to `{Float16, Float32, Float64}`):
+
+* `argmax` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `missing`.
+* `argmin` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `missing`.
+* `cummax` : For `INTEGERS`, `FLOATS`, and `TimeType`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cummax!` : For `INTEGERS`, `FLOATS`, and `TimeType`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cummin` : For `INTEGERS`, `FLOATS`, and `TimeType`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cummin!` : For `INTEGERS`, `FLOATS`, and `TimeType`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cumprod` : For `INTEGERS` and `FLOATS`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cumprod!` : For `INTEGERS` and `FLOATS`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cumsum` : For `INTEGERS` and `FLOATS`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `cumsum!` : For `INTEGERS` and `FLOATS`, ignores missing values by default; passing `missings = :skip` makes it jump over missing values instead. When all values are `missing`, it returns the input.
+* `extrema` : For `INTEGERS`, `FLOATS`, and `TimeType`, skips missing values. When all values are `missing`, it returns `(missing, missing)`.
+* `findmax` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `(missing, missing)`.
+* `findmin` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `(missing, missing)`.
+* `maximum` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `missing`.
+* `mean` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+* `median` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+* `median!` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+* `minimum` : For `INTEGERS`, `FLOATS`, `TimeType`, and `AbstractString`, skips missing values. When all values are `missing`, it returns `missing`.
+* `std` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+* `sum` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+* `var` : For `INTEGERS` and `FLOATS`, skips missing values. When all values are `missing`, it returns `missing`.
+
+```jldoctest
+julia> x = [1,1,missing]
+3-element Vector{Union{Missing, Int64}}:
+ 1
+ 1
+  missing
+
+julia> sum(x)
+2
+
+julia> mean(x)
+1.0
+
+julia> maximum(x)
+1
+
+julia> minimum(x)
+1
+
+julia> findmax(x)
+(1, 1)
+
+julia> findmin(x)
+(1, 1)
+
+julia> cumsum(x)
+3-element Vector{Union{Missing, Int64}}:
+ 1
+ 2
+ 2
+
+julia> cumsum(x, missings = :skip)
+3-element Vector{Union{Missing, Int64}}:
+ 1
+ 2
+  missing
+
+julia> cumprod(x, missings = :skip)
+3-element Vector{Union{Missing, Int64}}:
+ 1
+ 1
+  missing
+
+julia> median(x)
+1.0
+```
+
+### Some remarks
+
+`var` and `std` will return `missing` when `dof = true` and an `AbstractVector{Union{T, Missing}}` of length one is passed as their argument. This is different from the behaviour of these functions defined in the `Statistics` package.
+
+```jldoctest
+julia> var(Union{Missing, Int}[1])
+missing
+
+julia> std(Union{Missing, Int}[1])
+missing
+
+julia> var([1]) # fallback to Statistics.var
+NaN
+
+julia> std([1]) # fallback to Statistics.std
+NaN
+```
+
+## Multithreaded functions
+
+The `sum`, `minimum`, and `maximum` functions also support the `threads` keyword argument. When it is set to `true`, they exploit all cores for calculation.
+
+## `topk`, `IMD.n`, and `IMD.nmissing`
+
+The following function is also exported by InMemoryDatasets:
+
+* `topk` : Returns the top (or bottom) k values of a vector. It ignores `missing` values unless all values are `missing`, in which case it returns `[missing]`.
+
+and the following functions are not exported but are available via `dot` notation:
+
+* `InMemoryDatasets.n` or `IMD.n` : Returns the number of non-missing elements.
+* `InMemoryDatasets.nmissing` or `IMD.nmissing` : Returns the number of `missing` elements.
+
+```jldoctest
+julia> x = [13, 1, missing, 10]
+4-element Vector{Union{Missing, Int64}}:
+ 13
+  1
+   missing
+ 10
+
+julia> topk(x, 2)
+2-element Vector{Int64}:
+ 13
+ 10
+
+julia> topk(x, 2, rev = true)
+2-element Vector{Int64}:
+  1
+ 10
+julia> IMD.n(x)
+3
+
+julia> IMD.nmissing(x)
+1
+```
diff --git a/docs/src/man/sorting.md b/docs/src/man/sorting.md
@@ -8,7 +8,7 @@ Sorting is one of the key tasks for Datasets. Actually, when we group a data set
 
 ## `sort!/sort`
 
-The `sort!` function accepts a Dataset and a set of columns and sorts the given Dataset based on provided columns. By default the `sort!` function does the sorting based on the formatted values, however, using `mapformats = false` forces the sorting be done based on the actual values. `sort!` doesn't create a new dataset, it only replaces the original one with the sorted one. If the original data set needed to be untouched the `sort` function must be used. By default, both `sort!` and `sort` functions do a stable sort using a `Heap` sort algorithm. If the stability of the sort is not needed, using the keyword option `stable = false` can improve the performance. User can also change the default sorting algorithm to `QuickSort` by using the `alg = QuickSort` option. By default the ascending sorting is used for the sorting task, and using `rev = true` changes it to descending ordering, and for multiple columns a vector of  `true`, `false` can be supplied for this option, i.e. each column can be sorted in ascending or descending order independently. Note that:
+The `sort!` function accepts a Dataset and a set of columns and sorts the given Dataset based on the provided columns. By default, the `sort!` function sorts based on the formatted values; however, using `mapformats = false` forces the sorting to be done based on the actual values. `sort!` doesn't create a new data set; it only replaces the original one with the sorted one. If the original data set needs to remain untouched, the `sort` function must be used. By default, both the `sort!` and `sort` functions do a stable sort using a hybrid `Heap` sort algorithm. If the stability of the sort is not needed, using the keyword option `stable = false` can improve the performance. The user can also change the default sorting algorithm to hybrid `QuickSort` by using the `alg = QuickSort` option. By default, values are sorted in ascending order, and using `rev = true` changes this to descending order. For multiple columns, a vector of `true` and `false` values can be supplied for this option, i.e. each column can be sorted in ascending or descending order independently. Note that:
 
 * Datasets uses `isless` for checking the order of values.