The following code reads in the data, formats the date column, creates a weekday column and presents summary statistics:
data <- read.csv(file = "activity.csv", stringsAsFactors = FALSE)
data$date <- as.Date(data$date, format = "%Y-%m-%d")
data$week <- weekdays(data$date)
str(data)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ week : chr "Monday" "Monday" "Monday" "Monday" ...
library(pastecs)
## Loading required package: boot
options(scipen = 100)
options(digits = 2)
summ <- stat.desc(data[, -c(2, 4)])
summ
##                  steps    interval
## nbr.val       15264.00    17568.00
## nbr.null      11014.00       61.00
## nbr.na         2304.00        0.00
## min               0.00        0.00
## max             806.00     2355.00
## range           806.00     2355.00
## sum          570608.00 20686320.00
## median            0.00     1177.50
## mean             37.38     1177.50
## SE.mean           0.91        5.22
## CI.mean.0.95      1.78       10.24
## var           12543.00   479491.88
## std.dev         112.00      692.45
## coef.var          3.00        0.59
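The three count rows at the top of the table can be cross-checked with base R. The sketch below uses a small made-up vector `x` (not the activity data) and assumes, consistently with the counts shown above, that `nbr.val` counts non-missing values, `nbr.null` counts zeros and `nbr.na` counts missing values.

```r
# Hypothetical example vector, not taken from activity.csv
x <- c(0, 0, 12, NA, 7, NA, 0)

# Base R counterparts of the three count rows (assumed meaning, see above)
counts <- c(
  nbr.val  = sum(!is.na(x)),            # non-missing values
  nbr.null = sum(x == 0, na.rm = TRUE), # values equal to zero
  nbr.na   = sum(is.na(x))              # missing values
)
print(counts)
```

Under this reading, `nbr.val` and `nbr.na` always add up to the length of the vector, as they do for `steps` in the table (15264 + 2304 = 17568).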
We first compute the total steps for each day, aggregating the data for the different intervals within each day.
totalstepsday <- aggregate(data$steps, by = list(data$date),
                           function(x) sum(x, na.rm = TRUE))
Then we present the boxplot and the histogram of the total steps for each day.
nf <- layout(mat = matrix(c(1, 2), 2, 1, byrow = TRUE), height = c(1, 1.5))
par(mar = c(3, 3, 0.2, 0.2))
boxplot(totalstepsday$x, horizontal = TRUE, outline = TRUE, ylim = c(0, 26000),
        col = "lightblue", type = 3)
hist(totalstepsday$x, nclass = 20, xlab = "", ylab = "Frequency", col = "lightblue",
     main = "", xlim = c(0, 26000))
The mean of the total number of steps per day is 9354.23 and the median is 10395.
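The code that produced the mean and median quoted above is not shown in the report. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical) shows one way to obtain them from the aggregated totals. It also illustrates that with `sum(x, na.rm = TRUE)` a day whose intervals are all missing gets a total of 0 rather than being dropped, whereas the formula interface of `aggregate` omits such a day.

```r
# Hypothetical data: day 1 has only missing values
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03")),
  steps = c(NA, NA, 100, 50, 0, 30)
)

# Same pattern as the report: all-NA days sum to 0
daily <- aggregate(toy$steps, by = list(toy$date),
                   function(x) sum(x, na.rm = TRUE))
mean(daily$x)
median(daily$x)

# Formula interface: rows with NA are removed first, so day 1 disappears
daily_obs <- aggregate(steps ~ date, data = toy, FUN = sum)
mean(daily_obs$steps)
```

Whether this distinction affects the figures reported here depends on whether activity.csv contains days with no recorded steps at all, which the report does not state.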
We compute the average number of steps for each interval, across days.
totalinterval <- aggregate(data$steps, by = list(data$interval),
                           function(x) mean(x, na.rm = TRUE))
The plot presents a time series of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
plot(y = totalinterval$x, x = totalinterval$Group.1, xlab = "Interval",
     ylab = "Average number of steps", type = "l")
activeinterval <- totalinterval[which(totalinterval$x == max(totalinterval$x)), 1]
On average across all the days in the dataset, the 5-minute interval that contains the maximum number of steps is interval 835.
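The interval labels start at 0, 5, 10 and reach a maximum of 2355, which suggests (an assumption, not stated in the report) that they encode clock time as HHMM. Under that reading, a small helper can turn a label into a readable time; `interval_to_clock` is a hypothetical name introduced only for this sketch.

```r
# Assumption: interval labels encode clock time as HHMM (e.g. 835 means 08:35)
interval_to_clock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))  # left-pad with zeros to 4 digits
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_clock(c(0, 835, 2355))
```

If the assumption holds, the most active interval, 835, corresponds to the five minutes starting at 08:35.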
miss <- length(which(is.na(data$steps)))
The dataset has 17568 observations, of which 2304 have a missing value for the number of steps.
The missing information for the intervals will be replaced with the median number of steps for the interval across all days.
imputat <- aggregate(data$steps, by = list(data$interval),
                     function(x) median(x, na.rm = TRUE))
datacomplete <- data[!is.na(data$steps), ]
datamissing <- data[is.na(data$steps), ]
dataimputed <- datamissing
dataimputed$steps <- imputat[match(datamissing$interval, imputat[, 1]), 2]
datacompleteimputed <- rbind(datacomplete, dataimputed)
We then compute the total steps for each day, aggregating the data for the different intervals within each day, now considering the dataset including the imputed values.
totalstepsday <- aggregate(datacompleteimputed$steps,
                           by = list(datacompleteimputed$date),
                           function(x) sum(x))
We again present the boxplot and the histogram of the total steps for each day.
nf <- layout(mat = matrix(c(1, 2), 2, 1, byrow = TRUE), height = c(1, 1.5))
par(mar = c(3, 3, 0.2, 0.2))
boxplot(totalstepsday$x, horizontal = TRUE, outline = TRUE, ylim = c(0, 26000),
        col = "lightblue", type = 3)
hist(totalstepsday$x, nclass = 20, xlab = "", ylab = "Frequency", col = "lightblue",
     main = "", xlim = c(0, 26000))
The mean is 9503.87 and the median is 10395. The mean is now larger than when the missing values were excluded, but the median is unchanged.
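The imputation used for this dataset splits the data into complete and missing rows and recombines them with `rbind`, which places the imputed rows at the end of the data frame. As an illustration only, the same per-interval median fill can be sketched in place with `ave()`, which keeps the original row order; the `toy` data frame below is made up.

```r
# Hypothetical data: two intervals observed on three days, with two gaps
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(NA, 10, 2, NA, 4, 20)
)

# Median of the observed steps for each interval, repeated on every row
interval_median <- ave(toy$steps, toy$interval,
                       FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing entries; observed values are left untouched
toy$steps_imputed <- ifelse(is.na(toy$steps), interval_median, toy$steps)
toy
```

Both approaches should give the same daily totals, since aggregation by date does not depend on row order.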
We create a variable indicating "Weekend" or "Weekday":
datacompleteimputed$weekend <- ifelse(datacompleteimputed$week %in% c("Saturday", "Sunday"),
                                      "Weekend", "Weekday")
We compute the average number of steps for each interval, across weekend days.
totalintervalweekend <- aggregate(
  datacompleteimputed$steps[datacompleteimputed$weekend == "Weekend"],
  by = list(datacompleteimputed$interval[datacompleteimputed$weekend == "Weekend"]),
  function(x) mean(x))
We compute the average number of steps for each interval, across weekdays.
totalintervalweekday <- aggregate(
  datacompleteimputed$steps[datacompleteimputed$weekend == "Weekday"],
  by = list(datacompleteimputed$interval[datacompleteimputed$weekend == "Weekday"]),
  function(x) mean(x))
The plot presents the time series of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis), separately for weekends and weekdays.
tmp <- max(c(totalintervalweekend$x, totalintervalweekday$x)) + 1
nf <- layout(mat = matrix(c(1, 2), 2, 1, byrow = TRUE))
par(mar = c(3, 3, 3, 3))
plot(y = totalintervalweekend$x, x = totalintervalweekend$Group.1, xlab = "Interval",
     ylab = "Average number of steps", type = "l", main = "Weekends",
     ylim = c(0, tmp))
plot(y = totalintervalweekday$x, x = totalintervalweekday$Group.1, xlab = "Interval",
     ylab = "Average number of steps", type = "l", main = "Weekdays",
     ylim = c(0, tmp))
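The weekend indicator used for these plots compares the output of `weekdays()` with the English names "Saturday" and "Sunday". `weekdays()` returns day names in the current locale, so in a non-English locale that comparison would label every day as a weekday. A locale-independent sketch, using the numeric day of the week from `as.POSIXlt` (the dates below are examples only):

```r
# Example dates: Monday, Friday, Saturday, Sunday
dates <- as.Date(c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07"))

# wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday <- as.POSIXlt(dates)$wday
day_type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
day_type
```

In an English locale this yields the same classification as the name-based comparison.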


