---
title: "Few functions"
output: rmarkdown::github_document
---
These are a few functions I used at some point for something; for some of them, there are probably much better packages.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
average_excluding <- function(G, n) {
  ######## AVERAGE EXCLUDING ###########
  # Returns the row-wise mean of the variables in G
  # for cases with no more than n missing values, and NA otherwise.
  # G is a data frame of the desired variables.
  apply(G, 1,
        function(x) {
          # x contains NAs in this branch, so mean(x) evaluates to NA
          if (sum(is.na(x)) > n) mean(x)
          else mean(x, na.rm = TRUE)
        }
  )
}
imenuj <- function(x) {
  # comma() comes from the scales package, which is loaded further down
  ifelse(x > 999999,
         paste(comma(x * 0.000001), "M"),
         ifelse(x > 999,
                paste(x * 0.001, "K"),
                paste(x)
         )
  )
}
```
# Mean on at least (average-excluding.R)
More details here: https://mdjeric.github.io/f-mean-on-at-least.html
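Works like SPSS `mean.#`, with one difference in how the threshold is given: here `n` is the maximum number of missing values allowed, whereas SPSS takes the minimum number of valid values. It calculates the mean of a case if no more than `n` values are NA, and otherwise returns NA. Handy for building indexes from several variables, when you want the index calculated only if e.g. more than 75% or 90% of values are present in each case.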
## Sample
A simple example: six variables, with each case having a different number of missing values.
```{r}
DF <- data.frame(var_1 = c(10, 20, 30, 40, 50),
                 var_2 = c(11, 21, 31, 41, NA),
                 var_3 = c(12, 22, 32, NA, NA),
                 var_4 = c(13, 23, NA, NA, NA),
                 var_5 = c(14, NA, NA, NA, NA),
                 var_6 = c(NA, NA, NA, NA, NA)
                 )
DF$av_miss_2 <- average_excluding(DF[, 1:6], 2)
DF$av_miss_5 <- average_excluding(DF[, 1:6], 5)
DF
```
And this is how it comes out in the end: `av_miss_2` holds means for cases with no more than two missing values, and `av_miss_5` for cases with no more than five.
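The description mentions thresholds such as 75% or 90% of values present. A minimal sketch of that variant could look like the following. `mean_if_share_present` and `min_share` are hypothetical names, not part of `average-excluding.R`, and the threshold is treated as "at least", which is an assumption.

```r
# Hypothetical helper for illustration; not part of average-excluding.R.
# Row mean is returned only when at least `min_share` of the values are present.
mean_if_share_present <- function(G, min_share = 0.75) {
  share_present <- rowMeans(!is.na(G))          # share of non-missing values per row
  means <- rowMeans(G, na.rm = TRUE)            # mean over the non-missing values
  means[share_present < min_share] <- NA_real_  # blank out rows below the threshold
  unname(means)
}

toy <- data.frame(var_1 = c(10, 20, 30, 40),
                  var_2 = c(11, 21, 31, NA),
                  var_3 = c(12, 22, NA, NA),
                  var_4 = c(13, NA, NA, NA))

# Rows have 100%, 75%, 50% and 25% of values present
mean_if_share_present(toy, min_share = 0.75)
```

With four variables, `min_share = 0.75` keeps the rows with at most one missing value, so it should agree with `average_excluding(toy, 1)`.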
# Scaling thousands and millions (scaling-axis-ggplot2.R)
This one probably has an even more unique/rare application. We usually don't want to mix different "units", and an axis that runs from the 100s to the 1,000,000s is not very helpful.
But it solved a problem in faceting data for Belgrade Airport, where the number of passengers, plane operations, cargo, and passengers per airplane had to be presented in one plot with numbers formatted in a clean way.
```{r}
library(ggplot2)
library(scales)
imenuj <- function(x) {
  # Label values in millions ("M"), thousands ("K"), or as they are
  ifelse(x > 999999,
         paste(comma(x * 0.000001), "M"),
         ifelse(x > 999,
                paste(x * 0.001, "K"),
                paste(x)
         )
  )
}
```
Once it is included in the ggplot labels option,
```{r, results='hide'}
scale_y_continuous(labels = imenuj, position = "right")
```
the first graph becomes the second one. Much better.
<img src="images/f-scale-1.png" alt="Faceted graph with units in numbers" style="width:50%; border:0px solid; " align="left">
<img src="images/f-scale-2.png" alt="Faceted graph with scaled units - M and K included where appropriate" style="width:50%; border:0px solid; ">
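To get a feel for the labels without drawing a plot, here is a minimal base-R sketch of the same idea. `label_km` is a hypothetical name and it uses `format()` instead of `scales::comma()`, so it is not the function above and its formatting may differ from what `imenuj()` prints.

```r
# Hypothetical base-R stand-in for imenuj(); name and formatting are assumptions.
label_km <- function(x) {
  # Pick the divisor and suffix for each value: millions, thousands, or as is.
  divisor <- ifelse(x > 999999, 1e6, ifelse(x > 999, 1e3, 1))
  suffix  <- ifelse(x > 999999, " M", ifelse(x > 999, " K", ""))
  # Format each value on its own so one long number does not pad the others.
  number <- vapply(x / divisor,
                   function(v) trimws(format(v, big.mark = ",")),
                   character(1))
  paste0(number, suffix)
}

label_km(c(500, 1500, 2500000))
# "500" "1.5 K" "2.5 M"
```

A function like this can be passed to `labels =` in `scale_y_continuous()` in the same way as `imenuj`.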
# Summaries (various-summaries.R)
*Samples coming soon*
These are 6 different functions that summarize data.
Probably the most useful are `info.detail`, `out.tbls.wn`, and `out.stat`.
The first returns min, max, mean, SD, and number of NAs, along with the type of variable and the number of levels if it is a factor, in a data frame.
The second returns frequency, relative percent, and cumulative percent for a variable's values, along with min, max, mean, and SD (it prints a notification if the variable is a factor).
The third provides just min, max, mean, and SD for a data frame.
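Until the samples are added, the kind of output described for the third function can be sketched roughly as below. `basic_stats` is a hypothetical stand-in written for this README, not the code of `out.stat`, and its argument and column names are assumptions.

```r
# Hypothetical stand-in for illustration; this is not the code of out.stat().
basic_stats <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only
  data.frame(
    variable = names(num),
    min  = vapply(num, min,  numeric(1), na.rm = TRUE),
    max  = vapply(num, max,  numeric(1), na.rm = TRUE),
    mean = vapply(num, mean, numeric(1), na.rm = TRUE),
    sd   = vapply(num, sd,   numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

toy_stats <- data.frame(a = c(1, 2, 3, NA),
                        b = c(10, 20, 30, 40),
                        g = c("x", "y", "x", "y"))

# One row per numeric variable; the character column g is skipped
basic_stats(toy_stats)
```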