The goal of this project was to use pandas to examine sales and demographic data of an online gaming company. The dataset, which is in the Resources folder, contains data about the customers of the company and the items purchased. Included in this repo is this README contining a short analysis of the findings, the code used, and multiple data frames created from the data, and an ipynb file containg the code that can be run.
What the code does:
-
Loads in the csv and converts it to a data frame
-
Finds the total number of unique players, and makes a data frame of this information
-
Finds the total number if unique items purchased, the average price, number of purchases, and total revenue of the dataset. This info is converted into a readable data frame
-
Creates a data frame breaking down the players by gender and another data frame showing sales info by gender
-
Creates a data frame showing the player base by age, and another data frame with purchase statistics by age
-
Creates a data frame showing the top 5 spenders
-
Creates a data frame showing the most popular items and a data frame showing the most profitable items
Note: This project was originally part of a repo with multiple projects, it has been moved here to viewed on its own. Original location can be found here: https://github.com/mjr390/Pandas_In_Jupyter/tree/master/Heroes_of_Pymoli
1. While the vast majority of players are male, their average purchase price per item ($3.02), is lower than average purchase price of all players ($3.05).
2. The most popular item ("Oathbreaker, Last Hope of the Breaking Storm") is also the most profitable. The second ("Nirvana") and third ("Fiery Glass Crusader") most profitable items are also in the top 4 most popular items.
3. Lisosia93 has spent the most money on items ($18.96). Followed by Idastidru52 at ($15.45) and Chamjask73 at($13.83).
import pandas as pd
import numpy as npfile_to_load = "Resources/purchase_data.csv"
df = pd.read_csv(file_to_load)
df.head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase ID | SN | Age | Gender | Item ID | Item Name | Price | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Lisim78 | 20 | Male | 108 | Extraction, Quickblade Of Trembling Hands | 3.53 |
| 1 | 1 | Lisovynya38 | 40 | Male | 143 | Frenzied Scimitar | 1.56 |
| 2 | 2 | Ithergue48 | 24 | Male | 92 | Final Critic | 4.88 |
| 3 | 3 | Chamassasya86 | 24 | Male | 100 | Blindscythe | 3.27 |
| 4 | 4 | Iskosia90 | 23 | Male | 131 | Fury | 1.44 |
#Calcualte the total number of players
#Use len() on the unque values in "SN" to find the number of unique players, put that value into a dataframe and display it
totPlayers = len(df["SN"].unique())
totPlayersVar = [{"Total Players":totPlayers}]
totPlayersDF = pd.DataFrame(totPlayersVar)
totPlayersDF.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Total Players | |
|---|---|
| 0 | 576 |
#Find the total number of of Unique Items, Number of Purchases, Total Revenue, and the Average Purchase Price
#use len() on the unique values in "Item Name" and set that value to a variable
uniItems = len(df["Item Name"].unique())
#use .mean() on the "Price" column to find the average cost per item and set to a varibale
#use round() to round the number to 2 decimal places beacuse the value is a float
#and then create a new variable for the value as a string
avgPrice = df["Price"].mean()
avgPrice = round(avgPrice,2)
avgPriceDis = ("$" + str(avgPrice))
#use .sum() on "Price" column to find the sum of all it's values, use round() bacuse it is a float
#and create a new variable with the value as a string to display
totRev = df["Price"].sum()
totRevDis = ("$" + str(totRev))
totNumOfPur = df["Item ID"].count()
#Create a DataFrame to display the results and order them
Purchasing_Analysis = pd.DataFrame([{"Number of Unique Items":uniItems, "Average Price": avgPriceDis, "Number of Purchases": totNumOfPur
, "Total Revenue":totRevDis}])
Purchasing_Analysis_Ordered = Purchasing_Analysis[["Number of Unique Items", "Average Price", "Number of Purchases", "Total Revenue"]]
Purchasing_Analysis_Ordered.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Number of Unique Items | Average Price | Number of Purchases | Total Revenue | |
|---|---|---|---|---|
| 0 | 179 | $3.05 | 780 | $2379.77 |
# Find the Percentage and Count of Male Players, Percentage and Count of Female Players, and Percentage and Count of Other / Non-Disclosed
#create variables for the total count of each gender to use to find the % they are of the total player base
malePlayerCount = len(df.loc[df["Gender"] == "Male"])
femalePlayerCount = len(df.loc[df["Gender"] == "Female"])
otherPlayerCount = len(df.loc[df["Gender"] == "Other / Non-Disclosed"])
#Use the variables from the lasst step to create variables for each gender's % of the total
malePlayPer = round(malePlayerCount/totPlayers*100,2)
fePlayPer = round(femalePlayerCount/totPlayers*100,2)
othPlayPer = round(otherPlayerCount/totPlayers*100,2)
#use value_counts() to find how many of each gender there are and set to a variable
GenderCounts = df["Gender"].value_counts()
#create a DataFrame for the found values and labels
GenderCounts_df = pd.DataFrame(GenderCounts)
#Add in a new column with the % values found earlier and create a new DataFrame with the columns in a different order for display
GenderCounts_df["Percentage of Players"]=[malePlayPer, fePlayPer, othPlayPer]
GenderCounts_df_ordered = GenderCounts_df[["Percentage of Players", "Gender"]]
GenderCounts_df_ordered.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Percentage of Players | Gender | |
|---|---|---|
| Male | 113.19 | 652 |
| Female | 19.62 | 113 |
| Other / Non-Disclosed | 2.60 | 15 |
# Find the purchase count, Average Purchase Price, Total Purchase Value, and Normalized Totals by gender
#Run basic calculations to find the needed values
mPurCount = len(df.loc[df["Gender"] == "Male"])
fPurCount = len(df.loc[df["Gender"] == "Female"])
oPurCount = len(df.loc[df["Gender"] == "Other / Non-Disclosed"])
#For each gender's average price, locate the values for each gender and take the mean
#use round() to reduce each to 2 decimal places. create a new variavle to display the value in a DataFrame formatted as a string
mAvgPrice = df["Price"].loc[df["Gender"] == "Male"].mean()
mAvgPrice = round(mAvgPrice,2)
mAvgPrice_display = ("$" + str(mAvgPrice))
fAvgPrice = df["Price"].loc[df["Gender"] == "Female"].mean()
#The Female average ends in a '0' so a different method of formatting must be used
fAvgPrice = "%.2f" % round(fAvgPrice, 2)
fAvgPrice_diaplay = ("$" + str(fAvgPrice))
oAvgPrice = df["Price"].loc[df["Gender"] == "Other / Non-Disclosed"].mean()
oAvgPrice = round(oAvgPrice,2)
oAvgPrice_display = ("$" + str(oAvgPrice))
#Use sum() instead of mean() to find the total Purchase value for each gender and assign to a variable
#Create a new variable to use in the DataFrame
mTotPrice = df["Price"].loc[df["Gender"] == "Male"].sum()
mTotPrice_display = ("$" + str(mTotPrice))
fTotPrice = df["Price"].loc[df["Gender"] == "Female"].sum()
fTotPrice_display = ("$" + str(fTotPrice))
oTotPrice = df["Price"].loc[df["Gender"] == "Other / Non-Disclosed"].sum()
oTotPrice_display = ("$" + str(oTotPrice))
#Find the Normalized totals for each gender by dividing the total price variable by the purchase count
#Round each to 2 decimal places and then create a new variable to display in the DataFrame
mNorTot = mTotPrice/mPurCount
mNorTot = round(mNorTot,2)
mNorTot_display = ("$" + str(mNorTot))
fNorTot = fTotPrice/fPurCount
fNorTot = "%.2f" % round(fNorTot, 2)
fNorTot_display = ("$" + str(fNorTot))
oNorTot = oTotPrice/oPurCount
oNorTot = round(oNorTot,2)
oNorTot_display = ("$" + str(oNorTot))
#Create the DataFrame for the values
PA_Gender_df = pd.DataFrame({
"Gender" : ["Male", "Female", "Other / Non-Disclosed"],
"Purchase Count" : [mPurCount, fPurCount, oPurCount],
"Average Purchase Price" : [mAvgPrice_display, fAvgPrice_diaplay, oAvgPrice_display],
"Total Purchase Value" : [mTotPrice_display, fTotPrice_display, oTotPrice_display],
"Normalized Totals" : [mNorTot_display, fNorTot_display, oNorTot_display]
})
#Reorder the DataFrame, Set the index column to the genders, set the index to appear in alphabetical order and display the DataFrame
PA_Gender_df = PA_Gender_df[["Gender", "Purchase Count", "Average Purchase Price", "Total Purchase Value", "Normalized Totals"]]
PA_Gender_df = PA_Gender_df.set_index("Gender")
PA_Gender_df = PA_Gender_df.sort_index(ascending=True)
PA_Gender_df.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Totals | |
|---|---|---|---|---|
| Gender | ||||
| Female | 113 | $3.20 | $361.94 | $3.20 |
| Male | 652 | $3.02 | $1967.64 | $3.02 |
| Other / Non-Disclosed | 15 | $3.35 | $50.19 | $3.35 |
#create bins and corresponding labels
age_bins = [0, 9.90, 14.90, 19.90, 24.90, 29.90, 34.90, 39.90, 99999]
group_names = ["<10", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40+"]
#create a data frame off of the "Age" column in the data set
new_age_df = pd.DataFrame(df["Age"])
#create a new row with the ages grouped using .cut and the bins and labels
new_age_df["Age Group"] = pd.cut(new_age_df["Age"], age_bins, labels=group_names)
#group the data frame by "Age Group"
new_age_df = new_age_df.groupby("Age Group")
#find how many times an entry appears in a bucket
new_age_df.count()
#turn the found count into a data frame
new_age_df = pd.DataFrame(new_age_df.count())
#rename the columns
new_age_df = new_age_df.rename(columns={"Age":"Total Count"})
#add a new column for the "Percent of PLayers" by dividing the count in each bucket by the total number of players and multiply by 100
#round the result to 2 decimal places
new_age_df["Percent of Players"] = round(new_age_df["Total Count"]/totPlayers*100, 2)
#rearrange the columns
new_age_df = new_age_df[["Percent of Players", "Total Count"]]
#delete the index title
del new_age_df.index.name
#display the results
new_age_df.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Percent of Players | Total Count | |
|---|---|---|
| <10 | 3.99 | 23 |
| 10-14 | 4.86 | 28 |
| 15-19 | 23.61 | 136 |
| 20-24 | 63.37 | 365 |
| 25-29 | 17.53 | 101 |
| 30-34 | 12.67 | 73 |
| 35-39 | 7.12 | 41 |
| 40+ | 2.26 | 13 |
#create a new variable holding the "Age" and "Price" columns from the data set
paAge_df = df[["Age", "Price"]]
paAge_df.head()
#create bins and corresponding labels
age_bins = [0, 9.90, 14.90, 19.90, 24.90, 29.90, 34.90, 39.90, 99999]
group_names = ["<10", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40+"]
#add age groups to the data frame useing .cut on the "Age" column and the age_groups and group_names
paAge_df["Age Groups"] = pd.cut(paAge_df["Age"], age_bins, labels=group_names)
#groupby the new column and save to a variable to use later to find the total count of each bin
grouped_paAge = paAge_df.groupby(["Age Groups"])
#use .count() on the created variable to find how many values are in each bin
#save to a variable as a data frame to create a row to be used later
findPurCount = pd.DataFrame(grouped_paAge.count())
#group the data frame by the new column
paAge_df = paAge_df.groupby("Age Groups")
#take the sum of each bin and save to a data frame
paAge_df.sum()
paAge_df2 = pd.DataFrame(paAge_df.sum())
#add a column to the data frame which is the column from the data frame that .count() was used on
paAge_df2["Purchase Count"] = findPurCount["Age"]
#remove the unneeded columns
paAge_df2 = paAge_df2[["Price", "Purchase Count"]]
#rename columns
paAge_df2 = paAge_df2.rename(columns={"Price":"Total Purchase Value"})
#add columns to the data frame for "Average Purchase Price" and "Normalized Totals" doing the calcs for each
paAge_df2["Average Purchase Price"] = round(paAge_df2["Total Purchase Value"]/paAge_df2["Purchase Count"],2)
paAge_df2["Normalized Totals"] = round(paAge_df2["Total Purchase Value"]/paAge_df2["Purchase Count"],2)
#reorganize the columns to make more sense
paAge_df2 = paAge_df2[["Purchase Count", "Average Purchase Price", "Total Purchase Value", "Normalized Totals"]]
#check the data types of the columns for formatting
paAge_df2.dtypes
#Changethe format on the columns that are floats to represent money
pd.options.display.float_format = '${:,.2f}'.format
#Delete the index name
del paAge_df2.index.name
#display the data frame
paAge_df2C:\Users\Mike\Anaconda3\envs\PythonData\lib\site-packages\ipykernel_launcher.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
# This is added back by InteractiveShellApp.init_path()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Totals | |
|---|---|---|---|---|
| <10 | 23 | $3.35 | $77.13 | $3.35 |
| 10-14 | 28 | $2.96 | $82.78 | $2.96 |
| 15-19 | 136 | $3.04 | $412.89 | $3.04 |
| 20-24 | 365 | $3.05 | $1,114.06 | $3.05 |
| 25-29 | 101 | $2.90 | $293.00 | $2.90 |
| 30-34 | 73 | $2.93 | $214.00 | $2.93 |
| 35-39 | 41 | $3.60 | $147.67 | $3.60 |
| 40+ | 13 | $2.94 | $38.24 | $2.94 |
#use vlaue counts on the "SN" column to find how many items each player bought and save to a variable
purCount = df["SN"].value_counts()
purCount.head()
#use groupby to seperate the data into fields by SN
findTotPurValue = df.groupby(["SN"])
findTotPurValue.count().head()
#find the sum of price for each SN and save to a variable
totSpendPerValue = findTotPurValue["Price"].sum()
totSpendPerValue.head(10)
#create a new DataFrame with the new variables
Top_Spenders = pd.DataFrame({
"Purchase Count":purCount,
"Total Purchase Value": totSpendPerValue
})
Top_Spenders.head(10)
#Add ac column for average price by dividing the values in "Total Purchase Value" by those in "Purchase Count"
#round the findings to 2 decimal places
Top_Spenders["Average Purchase Price"] = round((Top_Spenders["Total Purchase Value"]/Top_Spenders["Purchase Count"]),2)
Top_Spenders.head(10)
#format the columns with floats data into currency format
pd.options.display.float_format = '${:,.2f}'.format
Top_Spenders.head(10)
#Name the index in the DataFrame
Top_Spenders.index.name = "SN"
Top_Spenders.head(10)
#Sort the DataFrame by highest "Total Purchase Value" on top
Top_Spenders_Sorted = Top_Spenders.sort_values("Total Purchase Value", ascending=False)
Top_Spenders_Sorted.head()
#Rearrange the columns
Top_Spenders_Sorted = Top_Spenders_Sorted[["Purchase Count", "Average Purchase Price", "Total Purchase Value"]]
#display the finished data frame
Top_Spenders_Sorted.head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase Count | Average Purchase Price | Total Purchase Value | |
|---|---|---|---|
| SN | |||
| Lisosia93 | 5 | $3.79 | $18.96 |
| Idastidru52 | 4 | $3.86 | $15.45 |
| Chamjask73 | 3 | $4.61 | $13.83 |
| Iral74 | 4 | $3.40 | $13.62 |
| Iskadarya95 | 3 | $4.37 | $13.10 |
#create a new data frame by taking the "Item Id", "Item Name", and ""Price columns from the main data frame
popItems_df = df[["Item ID", "Item Name", "Price"]]
popItems_df.head()
#find how many of each Item ID exists and set to a variable
itemPurCount = df["Item ID"].value_counts()
itemPurCount.head()
#use the last variable to create a new dataframe
itemNumPurchases = pd.DataFrame(itemPurCount)
#Change the column names
itemNumPurchases = itemNumPurchases.rename(columns={"Item ID": "Purchase Count"})
#name the index
itemNumPurchases.index.name = "Item ID"
itemNumPurchases.head()
#group the data frame popItems by "Item ID" and "Index"
grouped_Items = popItems_df.groupby(["Item ID", "Item Name"])
#create a data frame for grouped_Items with the sum of it's "Price" column
grouped_Items_df = pd.DataFrame(grouped_Items["Price"].sum())
grouped_Items_df.head()
#merge the grouped_Items_df with the itemNumPurchases by their indexes
mergedPopItems = pd.merge(grouped_Items_df, itemNumPurchases, left_index=True, right_index=True)
mergedPopItems.head()
#create a new variable holding the "Price" and correct purchase count columns from the merged data frame
correctMergedPopItems = mergedPopItems[["Price", "Purchase Count"]]
correctMergedPopItems.head()
#create a new varable with the value of the last data frame, but change the column titles
correctMergedPopItTitles = correctMergedPopItems.rename(columns = {"Price":"Total Purchase Value", "Purchase Count_y":"Purchase Count"})
correctMergedPopItTitles
#add a new column called "Item Price" to shw the price of each colum by dividing the item's total count
correctMergedPopItTitles["Item Price"] = (correctMergedPopItTitles["Total Purchase Value"]/correctMergedPopItTitles["Purchase Count"])
correctMergedPopItTitles.head()
#sort by the "Purchase Count" column and display in decending order, set to a new variable
mergedPopIt_sorted = correctMergedPopItTitles.sort_values("Purchase Count", ascending=False)
mergedPopIt_sorted.head()
#reorder the columns
mergedPopIt_sorted = mergedPopIt_sorted[["Purchase Count", "Item Price", "Total Purchase Value"]]
mergedPopIt_sorted.head()
#Change the columns displaying money to a currency format by changing the format of all columns that are floats
pd.options.display.float_format = '${:,.2f}'.format
mergedPopIt_sorted.head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase Count | Item Price | Total Purchase Value | ||
|---|---|---|---|---|
| Item ID | Item Name | |||
| 178 | Oathbreaker, Last Hope of the Breaking Storm | 12 | $4.23 | $50.76 |
| 145 | Fiery Glass Crusader | 9 | $4.58 | $41.22 |
| 108 | Extraction, Quickblade Of Trembling Hands | 9 | $3.53 | $31.77 |
| 82 | Nirvana | 9 | $4.90 | $44.10 |
| 19 | Pursuit, Cudgel of Necromancy | 8 | $1.02 | $8.16 |
#sort the most popular items data frame by the "Total Purchase Value" and show in decending order to find the most profitable item
#set to a new variable and display the variable
highProfItems = mergedPopIt_sorted.sort_values("Total Purchase Value", ascending=False)
highProfItems.head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| Purchase Count | Item Price | Total Purchase Value | ||
|---|---|---|---|---|
| Item ID | Item Name | |||
| 178 | Oathbreaker, Last Hope of the Breaking Storm | 12 | $4.23 | $50.76 |
| 82 | Nirvana | 9 | $4.90 | $44.10 |
| 145 | Fiery Glass Crusader | 9 | $4.58 | $41.22 |
| 92 | Final Critic | 8 | $4.88 | $39.04 |
| 103 | Singed Scalpel | 8 | $4.35 | $34.80 |