hsma6_intro_to_python_book/extending_joining_dataframes.qmd at main · hsma-programme/hsma6_intro_to_python_book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
filters:
  - pyodide
---

# Extending and Joining Dataframes

:::{.callout-tip}

To allow all the exercises in this section to work, please run this code cell first!

This will import pandas and load the dataframe we'll be working with.

```{pyodide-python}
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/hsma-programme/h6_1f_python_part_2/main/1f_python_programming_part_2/lecture_examples/input_data.csv",
    index_col="Patient ID"
    )
```

:::

## Adding data to a dataframe

{{< video https://youtu.be/9ngAR9sMu0I >}}

We can add additional rows (and columns) to a DataFrame.  To add a row, we must specify the values for all of the columns for this new row.  Similarly, to add a column, we must specify the value for this column for all of the rows.

### Adding rows

To add a row, we specify the index value for the new row (that doesn’t yet exist in our DataFrame) and assign a list of values for the other columns :

```{pyodide-python}
row_to_add = ["Dan", "Yes",42,"Cornwall"]
df.loc[12345] = row_to_add

df
```

:::{.callout-warning}
Be careful - you can overwrite an existing row this way if you provide an index that already exists.
:::

### Adding columns

We can add a new column as follows.

However - note that you’ll need a value for every single row in your dataset.

The code below makes a new dataframe that is just the first 5 rows of the existing dataframe, then adds the new column

```{pyodide-python}

first_five_rows = df.head()

first_five_rows["Admitted"] = ["Yes", "No", "No", "Yes", "No"]

first_five_rows
```


## Concatenation

{{< video https://youtu.be/EPYJ3n2TnuA >}}

Another way of adding to our DataFrame is by joining two DataFrames together to create a new one.

We can do this using the Pandas concat() method.

:::{.callout-tip}
axis = 0 joins by adding rows

axis = 1 joins by adding columns
:::

First, let's make two new dataframes.

This shows you another way you can create a dataframe - by passing a **list** of **dictionaries**, where every dictionary in the list represents one row.

The **keys** of the dictionary will be the column names, and the **values** will be the values in the row.


```{pyodide-python}
df_a = pd.DataFrame(
    [{"Name": "Dan", "Trainer?": "Yes"},
     {"Name": "Sammi", "Trainer?": "Yes"}]
)

df_b = pd.DataFrame(
    [{"Name": "Mike", "Trainer?": "No"},
     {"Name": "Tom", "Trainer?": "No"}]
)
```


```{pyodide-python}
df_a
```

```{pyodide-python}
df_b
```

We can then join these with the concat method.

```{pyodide-python}
df_c = pd.concat([df_a, df_b], axis=0)

df_c
```

Note : in this example, we have entries with duplicate index values.  We’d likely want to do something about this, but in practice, the DataFrames you’ll be concatenating will normally have their own unique index that will hold when both are joined.

If you do need to reset the index, you can use the `.reset_index()` method.

```{pyodide-python}
df_c.reset_index()
```

Note that the old index becomes a new column!
We can avoid this by passing `drop=True`.

```{pyodide-python}
df_c.reset_index(drop=True)
```


Let's have a quick look at concatenating on the other axis.

```{pyodide-python}
df_d = pd.DataFrame(
    [{"Favourite Decade": "80s", "Favourite Game": "Monkey Island 2"},
     {"Favourite Decade": "90s", "Favourite Game": "Outer Wilds"}]
)

df_d
```

```{pyodide-python}
df_e = pd.concat([df_a, df_d], axis=1)

df_e
```


You may notice that concatenating in this direction feels a bit risky!

You may prefer to use the **merge** method of pandas.
The kind of joins available with that method will feel very familiar if you have used SQL.

Using merge, you can specify columns or an index that must match, ensuring that the correct rows are matched up.

```{pyodide-python}
df_f = pd.DataFrame(
    [{"Name": "Dan", "Favourite Decade": "80s", "Favourite Game": "Monkey Island 2"},
     {"Name": "Sammi", "Favourite Decade": "90s", "Favourite Game": "Outer Wilds"},
     {"Name": "Hugo", "Favourite Decade": "70s", "Favourite Game": "Pong"}]
)

df_f
```

Let's join this to `df_a`. Run the cell below to remind yourself of what was in that dataframe.

```{pyodide-python}
df_a
```

```{pyodide-python}
pd.merge(df_a, df_f, how="right", on="Name")
```

:::{.callout-tip}
Note that because we haven't got a value in dataframe a for Hugo, we don't have a record of whether he is a trainer, so in the merged dataframe, we get `NaN`. This stands for 'not a number' and effectively means the cell in the dataframe is blank.
:::

You can read more about the types of joins available in `merge` here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html