2 changes: 1 addition & 1 deletion content/pandas cookbook/chapter1.md
@@ -26,7 +26,7 @@ keywords:
import pandas as pd
import matplotlib.pyplot as plt

-pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
+plt.style.use('default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
```
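
(`plt.style.use` selects one of matplotlib's built-in style sheets. A minimal sketch of poking at the alternatives, assuming a recent matplotlib:)

```python
import matplotlib.pyplot as plt

print(plt.style.available)   # built-in style sheets, e.g. 'ggplot', 'bmh', ...
plt.style.use('ggplot')      # any of these can stand in for 'default'
```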

6 changes: 3 additions & 3 deletions content/pandas cookbook/chapter2.md
@@ -26,11 +26,11 @@ import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
-pd.set_option('display.mpl_style', 'default')
+plt.style.use('default')

-# This is necessary to show lots of columns in pandas 0.12.
+# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
-pd.set_option('display.width', 5000)
+pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)

plt.rcParams['figure.figsize'] = (15, 5)
6 changes: 3 additions & 3 deletions content/pandas cookbook/chapter3.md
@@ -27,13 +27,13 @@ import matplotlib.pyplot as plt
import numpy as np

# Make the graphs a bit prettier, and bigger
-pd.set_option('display.mpl_style', 'default')
+plt.style.use('default')
plt.rcParams['figure.figsize'] = (15, 5)


-# This is necessary to show lots of columns in pandas 0.12.
+# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
-pd.set_option('display.width', 5000)
+pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
```

14 changes: 7 additions & 7 deletions content/pandas cookbook/chapter4.md
@@ -24,13 +24,13 @@ keywords:
import pandas as pd
import matplotlib.pyplot as plt

-pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
+plt.style.use('default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['font.family'] = 'sans-serif'

-# This is necessary to show lots of columns in pandas 0.12.
+# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
-pd.set_option('display.width', 5000)
+pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
```

@@ -232,7 +232,7 @@ Output:
This turns out to be really easy!
Dataframes have a [.groupby()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method that is similar to [SQL groupby](https://docs.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql), if you're familiar with that. I'm not going to explain more about it right now -- if you want to know more, [the documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) is really good.

-In this case, `berri_bikes.groupby('weekday').aggregate(sum)` means
+In this case, `berri_bikes.groupby('weekday').aggregate(sum)` means

> "Group the rows by weekday and then add up all the values with the same weekday."

@@ -360,9 +360,9 @@ Let's put all that together, to prove how easy it is. 6 lines of magical pandas!
If you want to play around, try changing sum to max, numpy.median, or any other function you like.

```python
-bikes = pd.read_csv('../data/bikes.csv',
-                    sep=';', encoding='latin1',
-                    parse_dates=['Date'], dayfirst=True,
+bikes = pd.read_csv('../data/bikes.csv',
+                    sep=';', encoding='latin1',
+                    parse_dates=['Date'], dayfirst=True,
                     index_col='Date')
# Add the weekday column
berri_bikes = bikes[['Berri 1']].copy()
35 changes: 10 additions & 25 deletions content/pandas cookbook/chapter5.md
@@ -24,7 +24,7 @@ import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

-pd.set_option('display.mpl_style', 'default')
+plt.style.use('default')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'
```
@@ -64,7 +64,7 @@ To get the data for March 2012, we need to format it with month=3, year=2012.

```python
url = url_template.format(month=3, year=2012)
-weather_mar2012 = pd.read_csv(url, index_col='Date/Time', parse_dates=True)
+weather_mar2012 = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True)
```

This is super great! We can just use the same read_csv function as before, and just give it a URL as a filename. Awesome.
@@ -1604,7 +1604,7 @@ Output:
Let's plot it!

```python
weather_mar2012[u"Temp (\xc2\xb0C)"].plot(figsize=(15, 5))
weather_mar2012[u"Temp (°C)"].plot(figsize=(15, 5))
```

Output:
@@ -1617,18 +1617,6 @@ Notice how it goes up to 25° C in the middle there? That was a big deal. It was

And I was out of town and I missed it. Still sad, humans.

-I had to write '\xb0' for that degree character °. Let's fix up the columns. We're going to just print them out, copy, and fix them up by hand.
-
-```python
-weather_mar2012.columns = [
-    u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)',
-    u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag',
-    u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag',
-    u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
-    u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill',
-    u'Wind Chill Flag', u'Weather']
-```

You'll notice in the summary above that there are a few columns which are either entirely empty or only have a few values in them. Let's get rid of all of those with dropna.

The argument `axis=1` to `dropna` means "drop columns, not rows", and `how='any'` means "drop the column if any value is null".
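
Here's a minimal sketch of those two knobs on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2],
                   'b': [np.nan, np.nan],   # entirely empty
                   'c': [3, np.nan]})       # partially empty
df.dropna(axis=1, how='all')   # drops only 'b'
df.dropna(axis=1, how='any')   # drops 'b' and 'c'
```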
@@ -1758,12 +1746,12 @@ Output:
</div>
</div>

-The Year/Month/Day/Time columns are redundant, though, and the Data Quality column doesn't look too useful. Let's get rid of those.
+The Year/Month/Day/Time columns are redundant, though. Let's get rid of those.

The `axis=1` argument means "Drop columns", like before. The default for operations like `dropna` and `drop` is always to operate on rows.

```python
-weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
+weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
weather_mar2012[:5]
```

@@ -1857,7 +1845,7 @@ Awesome! We now only have the relevant columns, and it's much more manageable.
This one's just for fun -- we've already done this before, using groupby and aggregate! We will learn whether or not it gets colder at night. Well, obviously. But let's do it anyway.

```python
-temperatures = weather_mar2012[[u'Temp (C)']].copy()
+temperatures = weather_mar2012[[u'Temp (°C)']].copy()
print(temperatures.head())
temperatures.loc[:,'Hour'] = weather_mar2012.index.hour
temperatures.groupby('Hour').aggregate(np.median).plot()
@@ -1866,7 +1854,7 @@
Output:

```bash
-Date/Time
+Date/Time
2012-03-01 00:00:00 -5.5
2012-03-01 01:00:00 -5.7
2012-03-01 02:00:00 -5.4
@@ -1948,13 +1936,10 @@ I noticed that there's an irritating bug where when I ask for January, it gives

```python
def download_weather_month(year, month):
-    if month == 1:
-        year += 1
    url = url_template.format(year=year, month=month)
-    weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, header=True)
+    weather_data = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True)
    weather_data = weather_data.dropna(axis=1)
-    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
-    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
+    weather_data = weather_data.drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
    return weather_data
```

@@ -2050,7 +2035,7 @@ Output:
Now we can get all the months at once. This will take a little while to run.

```python
-data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]
+data_by_month = [download_weather_month(2012, i) for i in range(1, 12)]
```

Once we have this, it's easy to concatenate all the dataframes together into one big dataframe using [pd.concat](http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.concat.html). And now we have the whole year's data!
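
A minimal sketch of that concatenation, with two made-up monthly frames (not the real weather data):

```python
import pandas as pd

jan = pd.DataFrame({'Temp (°C)': [-5.5]}, index=pd.to_datetime(['2012-01-01']))
feb = pd.DataFrame({'Temp (°C)': [-2.0]}, index=pd.to_datetime(['2012-02-01']))
weather_2012 = pd.concat([jan, feb])   # stacks rows, keeps the datetime index
```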
14 changes: 7 additions & 7 deletions content/pandas cookbook/chapter6.md
@@ -9,7 +9,7 @@ prev: /pandas-cookbook/chapter5
title: Chapter 6 - String Operations
weight: 35
url: /pandas-cookbook/chapter6
-description: String Operations in pandas. Using resampling and plotting temperature.
+description: String Operations in pandas. Using resampling and plotting temperature.
keywords:
- pandas
- string
@@ -23,7 +23,7 @@ import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

-pd.set_option('display.mpl_style', 'default')
+plt.style.use('default')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'
```
@@ -165,7 +165,7 @@ Output:
If we wanted the median temperature each month, we could use the [resample()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) method like this ('ME' is the month-end frequency alias in pandas 2.2 and later; older versions spell it 'M'):

```python
-weather_2012['Temp (C)'].resample('M').apply(np.median).plot(kind='bar')
+weather_2012['Temp (°C)'].resample('ME').apply('median').plot(kind='bar')
```

Output:
@@ -202,7 +202,7 @@ Name: Weather, dtype: float64
and then use resample to find the percentage of time it was snowing each month. Since the values are 0s and 1s, the mean is exactly the fraction of the time it was snowing:

```python
-is_snowing.astype(float).resample('M').apply(np.mean)
+is_snowing.astype(float).resample('ME').apply('mean')
```

Output:
@@ -225,7 +225,7 @@ Freq: M, Name: Weather, dtype: float64
```

```python
-is_snowing.astype(float).resample('M').apply(np.mean).plot(kind='bar')
+is_snowing.astype(float).resample('ME').apply('mean').plot(kind='bar')
```

Output:
@@ -242,9 +242,9 @@ So now we know! In 2012, December was the snowiest month. Also, this graph sugge
We can also combine these two statistics (temperature and snowiness) into one dataframe and plot them together:

```python
-temperature = weather_2012['Temp (C)'].resample('M').apply(np.median)
+temperature = weather_2012['Temp (°C)'].resample('ME').apply('median')
is_snowing = weather_2012['Weather'].str.contains('Snow')
-snowiness = is_snowing.astype(float).resample('M').apply(np.mean)
+snowiness = is_snowing.astype(float).resample('ME').apply('mean')

# Name the columns
temperature.name = "Temperature"
28 changes: 13 additions & 15 deletions content/pandas cookbook/chapter7.md
@@ -21,12 +21,12 @@ keywords:
import pandas as pd

# Make the graphs a bit prettier, and bigger
-pd.set_option('display.mpl_style', 'default')
+plt.style.use('default')
figsize(15, 5)

# Always display all the columns
-pd.set_option('display.line_width', 5000)
-pd.set_option('display.max_columns', 60)
+pd.set_option('display.width', 5000)
+pd.set_option('display.max_columns', 60)
```

One of the main problems with messy data is: how do you know if it's messy or not?
@@ -62,7 +62,7 @@ Some of the problems:
- There are nans
- Some of the zip codes are 29616-0759 or 83
- There are some N/A values that pandas didn't recognize, like 'N/A' and 'NO CLUE'

What we can do:

- Normalize 'N/A' and 'NO CLUE' into regular nan values
@@ -767,15 +767,13 @@ Output:
This looks bad to me. Let's set these to nan.

```python
-zero_zips = requests['Incident Zip'] == '00000'
-requests['Incident Zip'][zero_zips] = np.nan
+requests.loc[requests['Incident Zip'] == '00000', 'Incident Zip'] = np.nan
```
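
Why the `.loc` form? Chained indexing like `requests['Incident Zip'][zero_zips] = np.nan` assigns through a temporary object, which pandas may warn about or silently ignore under copy-on-write; `.loc` with a row mask and a column label does the assignment in one step. A toy sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'zip': ['00000', '11201']})
df.loc[df['zip'] == '00000', 'zip'] = np.nan   # one-step assignment, no chaining
```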

Great. Let's see where we are now:

```python
unique_zips = requests['Incident Zip'].unique()
-unique_zips.sort()
unique_zips
```

@@ -829,7 +827,7 @@ zips = requests['Incident Zip']
is_close = zips.str.startswith('0') | zips.str.startswith('1')
# There are a bunch of NaNs, but we're not interested in them right now, so we'll say they're True
is_far = ~(is_close.fillna(True).astype(bool))
-zips[is_far]
+zips.loc[is_far].dropna()
```

Output:
@@ -955,7 +953,7 @@ Output:
Okay, there really are requests coming from LA and Houston! Good to know. Filtering by zip code is probably a bad way to handle this -- we should really be looking at the city instead.

```python
-requests['City'].str.upper().value_counts()
+requests[is_far][['Incident Zip', 'Descriptor', 'City']].dropna().sort_values('Incident Zip')
```

Output:
@@ -1003,20 +1001,20 @@ Here's what we ended up doing to clean up our zip codes, all together:

```python
na_values = ['NO CLUE', 'N/A', '0']
-requests = pd.read_csv('311-service-requests.csv',
-                       na_values=na_values,
+requests = pd.read_csv('311-service-requests.csv',
+                       na_values=na_values,
                        dtype={'Incident Zip': str})

def fix_zip_codes(zips):
-    # Truncate everything to length 5
+    # Truncate everything to length 5
    zips = zips.str.slice(0, 5)

    # Set 00000 zip codes to nan
    zero_zips = zips == '00000'
    zips[zero_zips] = np.nan

    return zips

requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
requests['Incident Zip'].unique()
```