-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME.Rmd
More file actions
623 lines (440 loc) · 20.6 KB
/
README.Rmd
File metadata and controls
623 lines (440 loc) · 20.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
---
title: "Exploring the BRFSS data"
output:
html_document:
fig_height: 4
highlight: pygments
theme: spacelab
toc: yes
---
## Setup
### Load packages
```{r load-packages, message = FALSE}
library(ggplot2)
library(ggmosaic)
library(dplyr)
```
### Load data
```{r load-data}
load("brfss2013.RData")
```
* * *
## Part 1: Data
In this work we will deal with the Behavioral Risk Factor Surveillance System (BRFSS) dataset, produced in 2013. It was collected from the non-institutionalized adult population (18 years of age and older) residing in the US and participating US territories, namely, all 50 states, the District of Columbia, Puerto Rico, Guam, American Samoa, Federated States of Micronesia, Palau. The sample was built by surveing the random landline or cellular telephone, it's size is about 490K observations and research results can be generalized to the population described above (due to random nature of collecting data). As the BRFSS is the result of observational study, we can not make causal conclusions. Although sample was built based on random selection, there might be a bias related to the method of collecting data - telephone calls. People working at night, travelers, households without telephone, people who do not speak English or Spanish (the languages used for surveys) - that's the incomplete list of thouse who may not be represented in BRFSS dataset.
Although we can not establish causal connections between variables of interest, we can research association between them, so let's try to do it by answering on the next three questions.
* * *
## Part 2: Research questions
**Research quesion 1:**
Is the level of education of working people related to higher income?
**Research quesion 2:**
Does somehow people's weight related to vegetables consumption?
**Research quesion 3:**
Is self employed mans who sleeps less than it is necessary, drink more alcohol than employed who sleeps enough?
* * *
## Part 3: Exploratory data analysis
### Research question 1: Is the level of education of working people related to higher income?
Let's dive into Codebook in order to select variables that can help us answer this question.
As for employment status, it's 'employ1' variable. Take a look at it quickly.
```{r}
str(brfss2013$employ1)
```
```{r}
levels(brfss2013$employ1)
```
```{r}
brfss2013 %>%
group_by(employ1) %>%
summarize(count=n())
```
It has enough data to get subset of working people, so we can use it in research.
For education level the Codebook offers two variables: **educa** and calculated variable **\_educag**. While **educa** has six levels
```{r education-levels}
levels(brfss2013$educa)
```
**\_educag** has four (in it's internals R replaces **\_educag** with **X_educag**, so we will refer the latter)
```{r x-education-levels}
levels(brfss2013$X_educag)
```
This four levels of educations is enough to understand the problem, so we will use **\_educag** in our research.
Similary, let's check income variable. The Codebook offers two variables: **income2** and calculated variable**\_incomg**.
```{r income-levels}
levels(brfss2013$income2)
```
```{r x-income-levels}
levels(brfss2013$X_incomg)
```
**X_incomeg** seems detailed enough so we will use it.
Let's take a closer look at variables.
First of all, let's extract variables that we are interested into separate dataframe in order to work with less data volume to speedup calculations and do not mutate original dataframe if we will need. Also filter records to select data only for working peoples.
```{r create-education-income-data}
education.income <- brfss2013 %>%
filter(brfss2013$employ1 %in% c('Employed for wages', 'Self-employed')) %>%
select(income=X_incomg, education=X_educag)
head(education.income)
```
Let's check variables on the subject of NA (and related garbage) values.
```{r check-education-na}
education.income %>%
group_by(education) %>%
summarize(count=n())
```
```{r}
education.income %>%
group_by(income) %>%
summarize(count=n())
```
As we can see, there is NA values. They are useless in the context of current research, so let's get rid of them.
```{r}
education.income <- na.omit(education.income)
education.income %>%
group_by(education) %>%
summarize(count=n())
```
```{r}
education.income %>%
group_by(income) %>%
summarize(count=n())
```
Now, when we have clean dataset, it's time to take a look at distribution of each variable.
```{r}
education.income %>%
group_by(education) %>%
summarize(freq=sprintf("%.2f %%", n()/nrow(education.income) * 100))
```
```{r}
ggplot(education.income, aes(x=education)) + geom_bar() + coord_flip()
```
From the frequency table and bar plot above we can learn that the fraction of those who has no formal education is really small and is near 5%. The next two groups are nearly equal, the fraction of those who graduated high school and those who attended college or technical school are about 24% and 27% respectively. If we combine those groups we will see that the number of people who graduated high school, but didn't graduate college is dominant and equal 52%.
```{r}
education.income %>%
group_by(income) %>%
summarize(freq=sprintf("%.2f %%", n()/nrow(education.income) * 100))
```
```{r}
ggplot(education.income, aes(x=income)) + geom_bar() + coord_flip()
```
We can see, that the number of working Americans who earn realy well is big and is about 58% of all working peoples. On other side the ratio of working people who earn less than 15K/year is small and is about 5%.
```{r}
table.education.margin <- table(education.income$education, education.income$income) %>% prop.table(1)
colnames(table.education.margin) <- c('< $15K', '$15K-$25K', '$25K-$35K', '$35K-$50K', '$50K >')
table.education.margin
```
Above the contingency table with row proportions. Because the income rates vary between the levels of graduation, we can conclude that the *income* and *education* variables are associated.
```{r}
table.income.margin <- table(education.income$education, education.income$income) %>% prop.table(2)
colnames(table.income.margin) <- c('< $15K', '$15K-$25K', '$25K-$35K', '$35K-$50K', '$50K >')
table.income.margin
```
Above the contingency table with column proportions. Because the education rates vary between the levels of income, we can confirm that the *income* and *education* variables are associated.
```{r}
ggplot(education.income, aes(x=education, fill=income)) + geom_bar() + scale_x_discrete(labels = abbreviate)
```
```{r}
ggplot(education.income, aes(x=education, fill=income)) + geom_bar(position='fill') + scale_x_discrete(labels = abbreviate)
```
```{r}
ggplot(education.income) + geom_mosaic(aes(x = product(education), fill=income)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
By looking at the three diagrams above we can see that there is a strong tendency for income level to grow when education level grow. But the number of income and education levels seems too big to intuitively understand the results. As a suggestion, it will be more simpler, if we will make 3 levels of income (poor, average, good) and 3 for education (didn't graduate high school, graduated high school and graduated from college or technical school). The results of such transformations below.
```{r echo=TRUE}
education.income$education <- recode(education.income$education, "Attended college or technical school" = "Graduated high school")
```
```{r}
education.income %>%
group_by(education) %>%
summarize(freq=sprintf("%.2f %%", n()/nrow(education.income) * 100))
```
More than 50% of working people are graduated highschool and 44% graduated from college. Awesome! Only 5% have no formal education.
```{r}
ggplot(education.income, aes(x=education)) + geom_bar() + coord_flip()
```
```{r echo=TRUE}
education.income$income <- recode(education.income$income, "$15,000 to less than $25,000" = "$15,000 to less than $50,000",
"$25,000 to less than $35,000" = "$15,000 to less than $50,000",
"$35,000 to less than $50,000" = "$15,000 to less than $50,000")
```
```{r}
education.income %>%
group_by(income) %>%
summarize(freq=sprintf("%.2f %%", n()/nrow(education.income) * 100))
```
```{r}
ggplot(education.income, aes(x=income)) + geom_bar() + coord_flip()
```
```{r}
ggplot(education.income, aes(x=education, fill=income)) + geom_bar(position='fill') + scale_x_discrete(labels = abbreviate)
```
```{r}
table.education.margin <- table(education.income$education, education.income$income) %>% prop.table(1)
colnames(table.education.margin) <- c('< $15K', '$15K-$50K', '$50K >')
table.education.margin
```
```{r}
ggplot(education.income) + geom_mosaic(aes(x = product(education), fill=income)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
The final thoughts on the question "Is the level of education of working people related to higher income?": yes, it is. But by looking at the diagrams above we can make a couple of other conclusions. The ratio of those who earn less than 15k is small, around 5%. Even among working people without formal education there is a 15% ratio of those who earn greater than 50K. And even among people who graduated from college 1,4% earn less than 15k. While there is such unmotivated results, the overall trending is - learn, and there will be a good chance for you to earn enough money.
### Research question 2: Does somehow people's weight related to vegetables consumption?
In order to answer this question let's look at the variables that the codebook can provide. As for weight, there is a reference to **weight2** reported weight in pounds (without shoes). But this variable has mixed pounds/kilograms values (kilograms are used for values bigger than 9000), so we should preprocess it to achieve monotony. For our happiness dataset compilers already did this job and provide us with **wtkg3** - computed weight in kilograms. Let's explore it. Note, as codebook states, there is 2 implied decimal places, so the value 9754 kg we should perceive as 97,54 kg.
First of all, let's extract variables that we are interested into separate dataframe in order to work with less data volume to speedup calculations and do not mutate original dataframe if we will need. Also let's remove any NA from new data frame.
```{r create-weight-vegetables-data}
weight.vegetables <- data.frame(weight=brfss2013$wtkg3, vegetables=brfss2013$X_vegesum) %>% na.omit()
head(weight.vegetables)
```
```{r}
anyNA(weight.vegetables)
```
At first, let's explore 'weight' variable.
```{r}
weight <- weight.vegetables$weight
```
```{r}
str(weight)
```
It's a descrete numeric variable.
```{r}
range(weight)
```
With range of values from 1 (person with weight of 0,01 kilogramm???) to 290,3 kilograms.
Let's say a couple of words about central tendency.
```{r}
mean(weight)
```
Mean roughly equal to 80 kg.
```{r}
median(weight)
```
And median roughly equal to 77 kg.
As the mean greater than median, we can suggest it's the right skewed distribution.
As for dispersion:
```{r}
paste(
'Variance: ', round(var(weight)), '\n',
'Standard deviation: ', round(sd(weight)), '\n',
'IQR: ', round(IQR(weight)),
sep = ''
) %>% cat
```
Very dispersed dataset with dense IQR.
Let's take a look at how the distribution is looks like.
```{r}
ggplot(data.frame(weight), aes(x=weight)) + geom_histogram()
```
```{r}
ggplot(data.frame(weight), aes(y=weight)) + stat_boxplot(geom = 'errorbar', width = 0.1) + geom_boxplot() + coord_flip()
```
By looking at graphs we can confirm the suggestion that the distribution skewed to the right. But also we can note that while the IQR is condensed, there is a lot of outliers. What to do with them? By looking at box plot we can suggest there is two types of outliers - weights less than ~40 kg and greater than ~125 kg. By the intuition weights of, for example, 200 kg, or 30 kg, is unusual weights and we can get rid of it if there is just a few records of such kinds of weights. First of all, let's find quartiles Q1 and Q3 to define borders of IQR.
```{r}
quantile(weight)
```
```{r}
q1 <- 6577
q3 <- 9072
iqr <- IQR(weight)
```
A commonly used rule says that low outliers are below Q1-1.5\*IQR and high outliers are above Q3+1.5\*IQR, so lets find those borders.
```{r}
outliers.border.low <- q1 - 1.5 * iqr
outliers.border.high <- q3 + 1.5 * iqr
```
And check how many outliers we have.
```{r}
length(weight[weight < outliers.border.low | weight > outliers.border.high])
```
```{r}
10471/length(weight)
```
Not so much, just about 2%. Let's go futher without them and remove outliers from variable 'weight' and from dataset 'weight.vegetables'.
```{r}
weight <- weight[weight >= outliers.border.low & weight <= outliers.border.high]
```
```{r}
weight.vegetables <- subset(weight.vegetables, weight >= outliers.border.low & weight <= outliers.border.high)
```
Let's check how distribution whithout outliers is looks like.
```{r}
ggplot(data.frame(weight), aes(x=weight)) + geom_histogram(binwidth = 1000)
```
Unimodal, slightly skewed to the right.
```{r}
ggplot(data.frame(weight), aes(y=weight)) + stat_boxplot(geom = 'errorbar', width = 0.1) + geom_boxplot() + coord_flip()
```
At second, let's explore 'vegetables' variable.
```{r}
vegetables <- weight.vegetables$vegetables
```
```{r}
str(vegetables)
```
It's a descrete numeric variable.
```{r}
range(vegetables)
```
With range of values from 0 (no vegetables consumption) to 19827 (???).
Let's say a couple of words about central tendency.
```{r}
mean(vegetables)
```
```{r}
median(vegetables)
```
As the mean greater than median, we can suggest it's the right skewed distribution.
As for dispersion:
```{r}
paste(
'Variance: ', round(var(vegetables)), '\n',
'Standard deviation: ', round(sd(vegetables)), '\n',
'IQR: ', round(IQR(vegetables)),
sep = ''
) %>% cat
```
Very dispersed dataset with dense IQR.
Let's take a look at how the distribution is looks like.
```{r}
ggplot(data.frame(vegetables), aes(x=vegetables)) + geom_histogram()
```
```{r}
ggplot(data.frame(vegetables), aes(y=vegetables)) + stat_boxplot(geom = 'errorbar', width = 0.1) + geom_boxplot() + coord_flip()
```
By looking at graphs we can confirm the suggestion that the distribution skewed to the right. But also we can note that while the IQR is condensed, there is a lot of outliers. What to do with them? Let's explore if we can get rid of them. First of all, let's find quartiles Q1 and Q3 to define borders of IQR.
```{r}
quantile(vegetables)
```
```{r}
q1 <- 104
q3 <- 243
iqr <- IQR(vegetables)
```
A commonly used rule says that low outliers are below Q1-1.5\*IQR and high outliers are above Q3+1.5\*IQR, so lets find those borders.
```{r}
outliers.border.low <- q1 - 1.5 * iqr
outliers.border.high <- q3 + 1.5 * iqr
```
And check how many outliers we have.
```{r}
length(vegetables[vegetables < outliers.border.low | vegetables > outliers.border.high])
```
```{r}
10471/length(vegetables)
```
Not so much, just about 2%. Let's go futher without them and remove outliers from variable 'vegetablest' and from dataset 'weight.vegetables'.
```{r}
vegetables <- vegetables[vegetables >= outliers.border.low & vegetables <= outliers.border.high]
```
```{r}
weight.vegetables <- subset(weight.vegetables, vegetables >= outliers.border.low & vegetables <= outliers.border.high)
```
Let's check how distribution whithout outliers is looks like.
```{r}
ggplot(data.frame(vegetables), aes(x=vegetables)) + geom_histogram()
```
Unimodal, slightly skewed to the right.
```{r}
ggplot(data.frame(vegetables), aes(y=vegetables)) + stat_boxplot(geom = 'errorbar', width = 0.1) + geom_boxplot() + coord_flip()
```
Now, let's try to answer to the research question, is there any relation between weight and vegetables consumption?
```{r}
ggplot(weight.vegetables, aes(x=vegetables, y=weight)) + geom_point()
```
Wow. The scatterplot is so dense, so it's absolutely impossible to find any kind of relation here. We can make things a bit clearer by reducing sample size and make point more transparent to reflect the density.
```{r}
ggplot(sample_n(weight.vegetables, 10000), aes(x=vegetables, y=weight)) + geom_point(shape=1, alpha = 0.5)
```
While there is recognized some 'cloud' with high density, still nothing. Let's add regression line to the plot.
```{r}
ggplot(weight.vegetables, aes(x=vegetables, y=weight)) + geom_point(shape=1, alpha = 0.5) + geom_smooth(method='lm', se = FALSE)
```
Now we can see that there is small negative relation between weight and vegetables consumption. So there is a small chance the lighter people are, the more they eat vegetables. But for confidence, let's check the correlation coefficient between weight and vegetables.
```{r}
cor(weight.vegetables)
```
This table confirms the suggestion about small negative relation.
**Research quesion 3: Is self employed mans who sleeps less than it is necessary, drink more alcohol than employed mans with normal sleep?**
The main idea behind this investigation is the next: self employed who sleeps less are working hard and so under pressure and they need to releive stress, while employed with normal sleep don't.
Let's find in codebook variables we can use to make this research. We will look for something about sex, employment status, sleep hours and alcohol consumption.
With the sex all is simple, it's 'sex' variable.
```{r}
str(brfss2013$sex)
```
```{r}
brfss2013 %>%
group_by(sex) %>%
summarize(count=n())
```
As for employment status, it's 'employ1' variable. Take a look at it quickly.
```{r}
str(brfss2013$employ1)
```
```{r}
levels(brfss2013$employ1)
```
```{r}
brfss2013 %>%
group_by(employ1) %>%
summarize(count=n())
```
It's have enough data to split employed for wages and self employed people, so we can use it in research.
For filtering peoples by sleep adequacy, we should use 'sleptim1' variable, because it's the only one about sleep hours. In questionary, the question about sleep is looks the next: 'On average, how many hours of sleep do you get in a 24-hour period?', so we should decide ourselves how to split adequate/inadequate hours, but now let's just take a look at what the variable is.
```{r}
str(brfss2013$sleptim1)
```
```{r}
brfss2013 %>%
group_by(sleptim1) %>%
summarize(count=n())
```
Seems good enough to use in research. Go further.
As for alcohol consumption, the variable 'alcday5', Days In Past 30 Had Alcoholic Beverage, looks the most reasonable one, because other alcohol related variables have too much missing values.
```{r}
str(brfss2013$alcday5)
```
```{r}
brfss2013 %>%
group_by(alcday5) %>%
summarize(count=n())
```
[101 - 199] Days per week, [201 - 299] Days in the past 30 days - so we will need to transform them to one view.
Now, when all variables has been chosen, we will extract them into new data frame and will get rid of rows with any NA at the same time.
```{r}
employment.sleep.alcohol <- data.frame(
sex=brfss2013$sex,
employment=brfss2013$employ1,
sleep=brfss2013$sleptim1,
alcohol=brfss2013$alcday5
) %>% na.omit()
head(employment.sleep.alcohol)
```
```{r}
anyNA(employment.sleep.alcohol)
```
Filter new dataset and drop sex column.
```{r}
employment.sleep.alcohol <- employment.sleep.alcohol %>% filter(sex == 'Male' & employment %in% c('Self-employed', 'Employed for wages')) %>% select(-c(sex))
head(employment.sleep.alcohol)
```
That's a moot point, what sleep we can consider as inadequate, so, based on the opionin the adult person should sleeps 7-9 hours per day, let's take 6 hours as an upper bound of the inadequate sleep.
```{r}
employment.sleep.alcohol <- employment.sleep.alcohol %>% mutate(sleep = if_else(sleep > 6, 'Adequate', 'Inadequate'))
head(employment.sleep.alcohol)
```
Calculate variable we are interested in and drop not needed.
```{r}
employment.sleep.alcohol <- employment.sleep.alcohol %>%
filter((employment == 'Employed for wages' & sleep == 'Adequate') | (employment == 'Self-employed' & sleep == 'Inadequate'))
head(employment.sleep.alcohol)
```
Make 'alcohol' variable monotonous, transform it to drinking days per month, i.e, if value is 0, return it, if value is greater than or equal to 200, substract 200, if less than or equal to 107, subsctract 100, divide by 7 and multiply to 30.
```{r}
employment.sleep.alcohol <- employment.sleep.alcohol %>%
mutate(alcohol = case_when(
alcohol == 0 ~ 0,
(alcohol >= 100 & alcohol <= 107) ~ round((alcohol - 100)/7 * 30),
alcohol >= 200 ~ alcohol - 200
)
)
head(employment.sleep.alcohol)
```
And now it's time to give an answer to the main question.
```{r}
ggplot(data.frame(employment.sleep.alcohol), aes(x=employment, y=alcohol)) + stat_boxplot(geom = 'errorbar', width = 0.1) + geom_boxplot()
```
Wow, not the expected result. It seems there is no defference in alcohol consumption between self employed who sleeps little, and employed for wages who sleeps enough.