Important:
dfis the dataframe that you are using (i.e., the table of data that you are analyzing)new_dfis the target variable that stores the result of your actions.
new_df = df.loc[:, ['COL_1_NAME', 'COL_2_NAME', 'COL_N_NAME']]
new_df = df[df['COL_NAME']=='VALUE']- filtering on string valuesnew_df = df[df['COL_NAME']==VALUE]- filtering on numeric valuesnew_df = df_1[df_1['COL_NAME'].isin(df_2['COL_NAME'])]- filtering on the values of a column in another table
df.head()shows the start of the dataframedf.tail()shows the end of the dataframedf.sample(n)shows a random sample ofnrows
df.describe()shows descriptive statistics of the whole dataframedf['COL_NAME'].describe()shows descriptive statistics of one column of the dataframedf[['COL_1_NAME','COL_2_NAME']].describe()shows descriptive statistics of one column of the dataframe
Alternatively, you can ask for specific statistics:
df['COL_NAME'].mean()shows the mean of a column- You can replace
meanwithmin,max,unique, etc.
Sometimes you want to adjust the type of the values in a column. The following command changes the column type to int.
df['COL_NAME'] = df['COL_NAME'].astype(int)
A very quick look at the data often reveals important insights. The following command displays a scatterplot.
plt.scatter(df['COL_X_NAME'], df['COL_Y_NAME'])
chart = alt.Chart(data).mark_bar(stroke='transparent').encode(
x=alt.X('<COL_NAME_1>', axis=alt.Axis(title='<Title>')),
y=alt.Y('<COL_NAME_2>', scale=alt.Scale(domain=(<min>, <max>)), axis=alt.Axis(title='<Title>')),
column='<COL_NAME_3>'
)
chart
The column command allows you to group the bar chart.
Combining dataframes requires a shared column in both dataframes. The following command left-joins df1 and df2.
new_df = pd.merge(df1, df2, on='SHARED_COL_NAME', how='left')
You can generate descriptive statistics on group data. The following command groups the data on the values of COL_NAME and executes function on the groups:
df.groupby('COL_NAME')['COL_NAME'].function()
The following example shows the group-specific means for COL_NAME:
df.groupby('COL_NAME')['COL_NAME'].mean()
The following addendum sorts grouped data:
[..].sort_values('SORTING_COL', ascending=[True/False])
The following command allows you to apply multiple functions to multiple columns:
new_df = df.groupby(['COL_NAME_1','COL_NAME_2']).agg({'COL_NAME_1':'function','COL_NAME_2':'function'})
You can replace function with mean with min, max, unique, etc.
The following command allows you to count the number of occurences in a group.
new_df = df.groupby('COL_NAME_1')['COL_NAME_2'].value_counts().reset_index(name='count')
This tells you how often COL_NAME_2 appears in each group of COL_NAME_1.
Dummy variables help to understand the role of categorial variable. We can store the dummies back into the a dataframe.
new_df = df.join(df['COL_NAME'].str.get_dummies())
df['COL_NAME'] = df['COL_NAME'].map({VALUE_1: NEW_VALUE_1, VALUE_2: NEW_VALUE_2})
Use '' to denote string variables (e.g., 'NEW_VALUE_1')
new_df = df.pivot_table(index=['COL_NAME_1','COL_NAME_2'], columns='KEY', values='AGGREGATE', aggfunc='sum').reset_index()
indexmeans the column(s) that become the pivot index.columnsmeans the column(s) that become the keyed columns.valuesmeans the column(s) that contain the values of the pivot table.
Average between two subsequent rows
df['mean'] = (df['value'] + df['value'].shift(1))/2
Growth-rate between two subsequent rows
df['rate'] = df['value'].pct_change(1)
Remove the ,
df['value'] = df['value'].str.replace(',','')
Change the type of the column
df['value'] = df['value'].astype(float)