Commit c392dc0

update code and add doc for filtering
1 parent 7c4335e commit c392dc0

File tree

2 files changed

+345
-10
lines changed

docs/src/man/filter.md

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
1+
# Filter observations
2+
3+
## Introduction
4+
5+
This section discusses the Datasets' APIs for filtering observations. We provide information about
6+
three main ways to filter observations based on some conditions: 1) using the `byrow` function, 2) using the `mask` function, and 3) using Julia broadcasting.
7+
8+
## `byrow`
9+
10+
`byrow` has been discussed previously in detail. However, in this section we are going to use it for
11+
filtering observations. To use `byrow(ds, fun, cols, ...)` for filtering observations, the `fun` argument should
12+
be set to `all` or `any`, and the conditions should be supplied via the `by` keyword argument. The supplied `by` is checked for each observation in all selected columns. The function returns a boolean vector whose `j`th element is equivalent to the result of `all(by, [col1[j], col2[j], ...])` or `any(by, [col1[j], col2[j], ...])` when `all` or `any` is set as the `fun` argument, respectively.
13+
14+
The main feature of `byrow(ds, fun, cols, by = ...)` when `fun` is `all` or `any` is that the `by` keyword argument can be a vector of functions. Thus, when multiple columns are supplied as `cols`, each column can have its own `by`.
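
The row-wise semantics described above can be sketched in plain Julia, without the package. This is only an illustration of the stated equivalence under the assumption that columns are ordinary vectors; it is not the package's implementation, and `rowwise_all` and `rowwise_any` are hypothetical helper names.

```julia
# Hypothetical helpers that mimic the documented result of
# `byrow(ds, all, cols, by = ...)` and `byrow(ds, any, cols, by = ...)`
# on plain vectors. Not the package's implementation.

# One `by` per column: element j is true when every column k satisfies by[k].
function rowwise_all(cols::Vector{<:AbstractVector}, by::Vector)
    n = length(first(cols))
    return [all(by[k](cols[k][j]) for k in eachindex(cols)) for j in 1:n]
end

# One `by` per column: element j is true when at least one column k satisfies by[k].
function rowwise_any(cols::Vector{<:AbstractVector}, by::Vector)
    n = length(first(cols))
    return [any(by[k](cols[k][j]) for k in eachindex(cols)) for j in 1:n]
end

# A single function is reused for every column.
rowwise_all(cols::Vector{<:AbstractVector}, by) = rowwise_all(cols, fill(by, length(cols)))
rowwise_any(cols::Vector{<:AbstractVector}, by) = rowwise_any(cols, fill(by, length(cols)))

x1 = fill(1, 10)
x2 = collect(1:10)
x3 = repeat(1:2, 5)

flags_all = rowwise_all([x2, x3], [>(5), isodd])  # x2 > 5 and x3 odd
flags_any = rowwise_any([x1, x2, x3], >(5))       # any column greater than 5
```

With these inputs, `flags_all` is true only in rows 7 and 9, and `flags_any` is true in rows 6 to 10.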
15+
16+
### Examples
17+
18+
The first expression creates a data set, and the second uses `byrow` to flag the rows in which the values of all columns are equal to 1.
19+
20+
```jldoctest
21+
julia> ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
22+
10×3 Dataset
23+
Row │ x1 x2 x3
24+
│ identity identity identity
25+
│ Int64? Int64? Int64?
26+
─────┼──────────────────────────────
27+
1 │ 1 1 1
28+
2 │ 1 2 2
29+
3 │ 1 3 1
30+
4 │ 1 4 2
31+
5 │ 1 5 1
32+
6 │ 1 6 2
33+
7 │ 1 7 1
34+
8 │ 1 8 2
35+
9 │ 1 9 1
36+
10 │ 1 10 2
37+
38+
julia> byrow(ds, all, :, by = isequal(1))
39+
10-element Vector{Bool}:
40+
1
41+
0
42+
0
43+
0
44+
0
45+
0
46+
0
47+
0
48+
0
49+
0
50+
```
51+
52+
Note that only the first row meets the condition. As another example, consider the code that
53+
filters all rows in which the numbers in all columns are odd.
54+
55+
```jldoctest
56+
julia> _tmp = byrow(ds, all, :, by = isodd)
57+
10-element Vector{Bool}:
58+
1
59+
0
60+
1
61+
0
62+
1
63+
0
64+
1
65+
0
66+
1
67+
0
68+
69+
julia> ds[_tmp, :]
70+
5×3 Dataset
71+
Row │ x1 x2 x3
72+
│ identity identity identity
73+
│ Int64? Int64? Int64?
74+
─────┼──────────────────────────────
75+
1 │ 1 1 1
76+
2 │ 1 3 1
77+
3 │ 1 5 1
78+
4 │ 1 7 1
79+
5 │ 1 9 1
80+
```
81+
82+
In the next example we filter all rows in which the value of any column is greater than 5.
83+
84+
```jldoctest
85+
julia> byrow(ds, any, :, by = >(5))
86+
10-element Vector{Bool}:
87+
0
88+
0
89+
0
90+
0
91+
0
92+
1
93+
1
94+
1
95+
1
96+
1
97+
```
98+
99+
The next example shows how a vector of functions can be supplied:
100+
101+
```jldoctest
102+
julia> byrow(ds, all, 2:3, by = [>(5), isodd])
103+
10-element Vector{Bool}:
104+
0
105+
0
106+
0
107+
0
108+
0
109+
0
110+
1
111+
0
112+
1
113+
0
114+
```
115+
116+
We can combine `modify!`/`modify` with `byrow` to filter observations based on all values in a column. For example, the following code filters all rows in which `:x2` and `:x3` are larger than their respective column means:
117+
118+
```jldoctest
119+
julia> modify!(ds, 2:3 .=> (x -> x .> mean(x)) .=> [:_tmp1, :_tmp2])
120+
10×5 Dataset
121+
Row │ x1 x2 x3 _tmp1 _tmp2
122+
│ identity identity identity identity identity
123+
│ Int64? Int64? Int64? Bool? Bool?
124+
─────┼──────────────────────────────────────────────────
125+
1 │ 1 1 1 false false
126+
2 │ 1 2 2 false true
127+
3 │ 1 3 1 false false
128+
4 │ 1 4 2 false true
129+
5 │ 1 5 1 false false
130+
6 │ 1 6 2 true true
131+
7 │ 1 7 1 true false
132+
8 │ 1 8 2 true true
133+
9 │ 1 9 1 true false
134+
10 │ 1 10 2 true true
135+
136+
julia> _tmp = byrow(ds, all, r"_tm")
137+
10-element Vector{Bool}:
138+
0
139+
0
140+
0
141+
0
142+
0
143+
1
144+
0
145+
1
146+
0
147+
1
148+
149+
julia> ds[_tmp, :]
150+
3×5 Dataset
151+
Row │ x1 x2 x3 _tmp1 _tmp2
152+
│ identity identity identity identity identity
153+
│ Int64? Int64? Int64? Bool? Bool?
154+
────┼──────────────────────────────────────────────────
155+
1 │ 1 6 2 true true
156+
2 │ 1 8 2 true true
157+
3 │ 1 10 2 true true
158+
```
159+
160+
> Note that to drop the temporary columns we can use the `select!` function.
161+
162+
## `mask`
163+
164+
`mask` is a function that calls a function (or a vector of functions) on all observations of a set of selected columns. The syntax of `mask` is very similar to that of the `map` function:
165+
166+
> `mask(ds, funs, cols, [mapformats = true, missings = false, threads = true])`
167+
168+
However, unlike `map`, `mask` doesn't return the whole modified data set. It returns a boolean data set with the same number of rows as `ds` and as many columns as are selected by `cols`, where each value is the result of calling `fun` on the corresponding observation. The return value of `fun` must be `true`, `false`, or `missing`. The combination of `mask` and `byrow` can be used to filter observations.
169+
170+
Compared to `byrow`, the `mask` function has some useful features which are handy in some scenarios:
171+
172+
* `mask` returns a boolean data set which shows exactly which observations satisfy `fun`, i.e. which observations would be selected.
173+
* By default, the `mask` function filters observations based on their formatted values. However, this can be changed by setting `mapformats = false`.
174+
* By default, the `mask` function treats missing values as `false`. This behaviour can be changed with the `missings` keyword argument, which can be set to `true`, `false` (the default), or `missing`.
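
The effect of `mapformats` and `missings` listed above can be sketched for a single column in plain Julia. This is a simplified illustration under stated assumptions, not the package's implementation: `mask_column` is a hypothetical helper, the column is an ordinary vector, and the `format` keyword stands in for the column's format.

```julia
# Hypothetical single-column sketch of the behaviour described above.
# `format` stands in for the column's format; `missings` is the value
# stored whenever `f` returns `missing`.
function mask_column(f, col::AbstractVector; format = identity,
                     mapformats::Bool = true, missings = false)
    out = Vector{Union{Missing, Bool}}(undef, length(col))
    for i in eachindex(col)
        # Apply `f` to the formatted value or to the stored value.
        value = mapformats ? format(col[i]) : col[i]
        # Any result other than true, false, or missing is an error.
        result = f(value)::Union{Bool, Missing}
        out[i] = ismissing(result) ? missings : result
    end
    return out
end

x2 = collect(1:4)
x3 = [missing, 2, missing, 2]

formatted = mask_column(isequal(1), x2; format = isodd)                      # uses formatted values
stored    = mask_column(isequal(1), x2; format = isodd, mapformats = false)  # uses stored values
dropped   = mask_column(==(2), x3)                                           # missing becomes false
kept      = mask_column(==(2), x3; missings = missing)                       # missing is kept
```

Here `formatted` flags every odd value of `x2` because `isequal(1)` is applied to `true`/`false`, while `stored` flags only the value `1`.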
175+
176+
### Examples
177+
178+
```jldoctest
179+
julia> ds = Dataset(x1 = repeat(1:2, 5), x2 = 1:10, x3 = repeat([missing, 2], 5))
180+
10×3 Dataset
181+
Row │ x1 x2 x3
182+
│ identity identity identity
183+
│ Int64? Int64? Int64?
184+
─────┼──────────────────────────────
185+
1 │ 1 1 missing
186+
2 │ 2 2 2
187+
3 │ 1 3 missing
188+
4 │ 2 4 2
189+
5 │ 1 5 missing
190+
6 │ 2 6 2
191+
7 │ 1 7 missing
192+
8 │ 2 8 2
193+
9 │ 1 9 missing
194+
10 │ 2 10 2
195+
196+
julia> setformat!(ds, 2 => isodd)
197+
10×3 Dataset
198+
Row │ x1 x2 x3
199+
│ identity isodd identity
200+
│ Int64? Int64? Int64?
201+
────┼────────────────────────────
202+
1 │ 1 true missing
203+
2 │ 2 false 2
204+
3 │ 1 true missing
205+
4 │ 2 false 2
206+
5 │ 1 true missing
207+
6 │ 2 false 2
208+
7 │ 1 true missing
209+
8 │ 2 false 2
210+
9 │ 1 true missing
211+
10 │ 2 false 2
212+
213+
julia> mask(ds, isequal(1), :) # simple use case
214+
10×3 Dataset
215+
Row │ x1 x2 x3
216+
│ identity identity identity
217+
│ Bool? Bool? Bool?
218+
─────┼──────────────────────────────
219+
1 │ true true false
220+
2 │ false false false
221+
3 │ true true false
222+
4 │ false false false
223+
5 │ true true false
224+
6 │ false false false
225+
7 │ true true false
226+
8 │ false false false
227+
9 │ true true false
228+
10 │ false false false
229+
230+
julia> _tmp = mask(ds, isequal(1), :, mapformats = false) # use the actual values instead of formatted values
231+
10×3 Dataset
232+
Row │ x1 x2 x3
233+
│ identity identity identity
234+
│ Bool? Bool? Bool?
235+
────┼──────────────────────────────
236+
1 │ true true false
237+
2 │ false false false
238+
3 │ true false false
239+
4 │ false false false
240+
5 │ true false false
241+
6 │ false false false
242+
7 │ true false false
243+
8 │ false false false
244+
9 │ true false false
245+
10 │ false false false
246+
247+
julia> ds[byrow(_tmp, any, :), :] # use the result of previous run
248+
5×3 Dataset
249+
Row │ x1 x2 x3
250+
│ identity isodd identity
251+
│ Int64? Int64? Int64?
252+
─────┼────────────────────────────
253+
1 │ 1 true missing
254+
2 │ 1 true missing
255+
3 │ 1 true missing
256+
4 │ 1 true missing
257+
5 │ 1 true missing
258+
259+
julia> mask(ds, [isodd, ==(2)], 2:3, missings = missing) # using a vector of functions and setting missings option
260+
10×2 Dataset
261+
Row │ x2 x3
262+
│ identity identity
263+
│ Bool? Bool?
264+
─────┼────────────────────
265+
1 │ true missing
266+
2 │ false true
267+
3 │ true missing
268+
4 │ false true
269+
5 │ true missing
270+
6 │ false true
271+
7 │ true missing
272+
8 │ false true
273+
9 │ true missing
274+
10 │ false true
275+
```
276+
277+
## Julia broadcasting
278+
279+
For simple use cases (e.g. when working on a single column) we can use broadcasting directly. For example, if we are interested in the rows in which the first column is greater than 5, we can directly use (assuming the data set is called `ds`):
280+
281+
> `ds[ds[!, 1] .> 5, :]`
282+
283+
or use the column names.
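
The same idea can be tried on plain vectors, which may make the role of the dot clearer. This is a generic Julia sketch that assumes ordinary `Vector`s rather than data set columns; `x1` and `x2` are stand-ins for two columns.

```julia
x1 = repeat(1:2, 5)
x2 = collect(1:10)

# `.==` and `.>` compare element by element and give one Bool per row;
# `.&` combines the two conditions row by row.
keep = (x1 .== 1) .& (x2 .> 5)

# Indexing with the boolean vector keeps the rows where `keep` is true.
selected = x2[keep]

# `findall` gives the positions of the selected rows.
rows = findall(keep)
```

Without the dot, an expression such as `x2 > 5` compares a whole vector with a number and raises a `MethodError` for plain vectors.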
284+
285+
### Examples
286+
287+
In the following examples we use `.` for broadcasting, and it's important to include it in your code when you use this approach for filtering observations.
288+
289+
```jldoctest
290+
julia> ds = Dataset(x1 = repeat(1:2, 5), x2 = 1:10, x3 = repeat([missing, 2], 5))
291+
10×3 Dataset
292+
Row │ x1 x2 x3
293+
│ identity identity identity
294+
│ Int64? Int64? Int64?
295+
─────┼──────────────────────────────
296+
1 │ 1 1 missing
297+
2 │ 2 2 2
298+
3 │ 1 3 missing
299+
4 │ 2 4 2
300+
5 │ 1 5 missing
301+
6 │ 2 6 2
302+
7 │ 1 7 missing
303+
8 │ 2 8 2
304+
9 │ 1 9 missing
305+
10 │ 2 10 2
306+
307+
julia> ds[ds.x1 .== 2, :]
308+
5×3 Dataset
309+
Row │ x1 x2 x3
310+
│ identity identity identity
311+
│ Int64? Int64? Int64?
312+
────┼──────────────────────────────
313+
1 │ 2 2 2
314+
2 │ 2 4 2
315+
3 │ 2 6 2
316+
4 │ 2 8 2
317+
5 │ 2 10 2
318+
319+
julia> ds[(ds.x1 .== 1) .& (ds.x2 .> 5), :]
320+
2×3 Dataset
321+
Row │ x1 x2 x3
322+
│ identity identity identity
323+
│ Int64? Int64? Int64?
324+
────┼──────────────────────────────
325+
1 │ 1 7 missing
326+
2 │ 1 9 missing
327+
```

src/dataset/other.jl

Lines changed: 18 additions & 10 deletions
@@ -545,15 +545,15 @@ julia> mask(ds, isodd, 1:3)
545545
3 │ true true true
546546
```
547547
"""
548-
mask(ds::AbstractDataset, f::Function, col::ColumnIndex; mapformats = false, threads = true) = mask(ds, f, [col]; mapformats = mapformats, threads = threads)
549-
function mask(ds::AbstractDataset, f::Function, cols::MultiColumnIndex; mapformats = false, threads = true)
548+
mask(ds::AbstractDataset, f::Function, col::ColumnIndex; mapformats = true, threads = true, missings = false) = mask(ds, f, [col]; mapformats = mapformats, threads = threads, missings = missings)
549+
function mask(ds::AbstractDataset, f::Function, cols::MultiColumnIndex; mapformats = true, threads = true, missings = false)
550550
colsidx = index(ds)[cols]
551551
v_f = Vector{Function}(undef, length(colsidx))
552552
fill!(v_f, f)
553-
mask(ds, v_f, cols; mapformats = mapformats, threads = threads)
553+
mask(ds, v_f, cols; mapformats = mapformats, threads = threads, missings = missings)
554554
end
555555

556-
function mask(ds::AbstractDataset, f::Vector{<:Function}, cols::MultiColumnIndex; mapformats = false, threads = true)
556+
function mask(ds::AbstractDataset, f::Vector{<:Function}, cols::MultiColumnIndex; mapformats = true, threads = true, missings = false)
557557
# Create Dataset
558558
ncol(ds) == 0 && return ds # skip if no columns
559559
colsidx = index(ds)[cols]
@@ -562,37 +562,45 @@ function mask(ds::AbstractDataset, f::Vector{<:Function}, cols::MultiColumnIndex
562562
for j in 1:length(colsidx)
563563
v = _columns(ds)[colsidx[j]]
564564
_col_f = getformat(ds, colsidx[j])
565-
fv = Vector{Bool}(undef, nrow(ds))
565+
fv = Vector{Union{Missing, Bool}}(undef, nrow(ds))
566566
if mapformats
567-
_fill_mask!(fv, v, _col_f, f[j], threads)
567+
_fill_mask!(fv, v, _col_f, f[j], threads, missings)
568568
else
569-
_fill_mask!(fv, v, f[j], threads)
569+
_fill_mask!(fv, v, f[j], threads, missings)
570570
end
571571
push!(vs, fv)
572572
end
573573
Dataset(vs, _names(ds)[colsidx], copycols=false)
574574
end
575575

576-
function _fill_mask!(fv, v, format, fj, threads)
576+
function _fill_mask!(fv, v, format, fj, threads, missings)
577577
if threads
578578
Threads.@threads for i in 1:length(fv)
579579
fv[i] = _bool_mask(fj)(format(v[i]))
580580
end
581+
Threads.@threads for i in 1:length(fv)
582+
ismissing(fv[i]) ? fv[i] = missings : nothing
583+
end
581584
else
582585
map!(_bool_mask(fj∘format), fv, v)
586+
map!(x->ismissing(x) ? x = missings : x, fv, fv)
583587
end
584588
end
585589
# not using formats
586-
function _fill_mask!(fv, v, fj, threads)
590+
function _fill_mask!(fv, v, fj, threads, missings)
587591
if threads
588592
Threads.@threads for i in 1:length(fv)
589593
fv[i] = _bool_mask(fj)(v[i])
590594
end
595+
Threads.@threads for i in 1:length(fv)
596+
ismissing(fv[i]) ? fv[i] = missings : nothing
597+
end
591598
else
592599
map!(_bool_mask(fj), fv, v)
600+
map!(x->ismissing(x) ? x = missings : x, fv, fv)
593601
end
594602
end
595-
_bool_mask(f) = x->f(x)::Bool
603+
_bool_mask(f) = x->f(x)::Union{Bool, Missing}
596604

597605

598606
# Unique cases
