RepData_PeerAssessment1/PA1_template.Rmd at master · tavin/RepData_PeerAssessment1 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---

## Loading and preprocessing the data

```{r}
library(stringi)
library(lubridate)
library(ggplot2)
if (!file.exists("activity.csv")) unzip("activity.zip")
activity <- read.csv("activity.csv", colClasses = c("integer", "Date", "character"))
activity$time <- parse_date_time(stri_pad_left(activity$interval, 4, 0), "hm")
str(activity)
```

## What is mean total number of steps taken per day?

```{r}
total_steps_per_day <- aggregate(steps ~ date, activity, sum)
hist(total_steps_per_day$steps)
```

### Calculating the mean and median

```{r}
mean(total_steps_per_day$steps)
median(total_steps_per_day$steps)
```

## What is the average daily activity pattern?

```{r}
mean_steps_per_interval <- aggregate(steps ~ time, activity, mean)
```

### Calculating which 5-minute interval has the maximum of the mean steps

```{r}
max_steps <- mean_steps_per_interval[which.max(mean_steps_per_interval$steps),]
max_label <- paste(round(max_steps$steps, 1),
                   "steps maximum at", format(max_steps$time, "%H:%M"))
max_label
```

### Plotting the mean steps for each 5-minute interval

```{r}
with(mean_steps_per_interval, {
  plot(time, steps, type = "l")
  text(max_steps, max_label, pos = 4)
})
```

## Imputing missing values

### Calculating the total number of rows with missing values

```{r}
steps <- activity$steps
sum(is.na(steps))
```

### Filling in missing values with the mean for the 5-minute interval

```{r}
for (i in which(is.na(steps))) {
  steps[i] <- subset(mean_steps_per_interval, time == activity$time[i])$steps
}
filled_in_total_steps_per_day <- aggregate(steps ~ date,
                                           cbind(steps, subset(activity, select = 'date')), sum)
hist(filled_in_total_steps_per_day$steps)
```

### Mean and median with the missing values filled in

```{r}
mean(filled_in_total_steps_per_day$steps)
median(filled_in_total_steps_per_day$steps)
```

Using the means of the 5-minute intervals to fill in missing values did not affect the mean total steps per day, but it increased the median so that it became equal to the mean.

## Are there differences in activity patterns between weekdays and weekends?

### Assigning a factor to each observation identifying weekdays and weekends

```{r}
weekpart <- rep("weekday", length(activity$date))
weekpart[which(weekdays(activity$date) %in% c("Saturday", "Sunday"))] <- "weekend"
activity$weekpart <- as.factor(weekpart)
```

### Plotting the average daily activity for weekdays and weekends

```{r}
qplot(time, steps, geom = "line", facets = weekpart ~ .,
      data = aggregate(steps ~ time + weekpart, activity, mean))
```

The weekday and weekend activity patterns appear similar, but more steps were taken in the morning on weekdays, whereas more steps were taken later in the day on weekends.