Cleansing_Projects/project2.qmd
@@ -24,4 +24,4 @@ execute:
  warning: false

---
For some reason, this instance of SQLite is not recognizing the 'collegeplaying' table from lahmansbaseballdb. Contact the author for the code.
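One way to narrow down a "table not recognized" problem is to ask SQLite which tables it can actually see. The following is a minimal sketch, not the author's code: the helper names `list_tables` and `find_table` are hypothetical, and an in-memory database with illustrative columns stands in for `lahmansbaseballdb.sqlite` so the example runs anywhere.

```python
import sqlite3


def list_tables(con):
    """Return the names of all tables in the connected SQLite database."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def find_table(con, wanted):
    """Return the stored table name matching `wanted` case-insensitively, or None."""
    for name in list_tables(con):
        if name.lower() == wanted.lower():
            return name
    return None


# Assumption: a stand-in database with made-up columns. For the report,
# connect to 'lahmansbaseballdb.sqlite' instead of ':memory:'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE collegeplaying (playerID TEXT, schoolID TEXT, yearID INTEGER)")
con.execute("CREATE TABLE salaries (playerID TEXT, yearID INTEGER, teamID TEXT, salary REAL)")

print(list_tables(con))
print(find_table(con, "CollegePlaying"))
print(find_table(con, "schools"))
```

If `list_tables` returns an empty list for the real file, the path is the likely culprit: `sqlite3.connect` silently creates a new, empty database when the file does not exist at the given location.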
"title: \"Client Report - [Finding Relationships in Baseball]\"\n",
"subtitle: \"Course DS 250\"\n",
"author: \"[MIGUEL SMITH]\"\n",
"format:\n",
"  html:\n",
"    self-contained: true\n",
"    page-layout: full\n",
"    title-block-banner: true\n",
"    toc: true\n",
"    toc-depth: 3\n",
"    toc-location: body\n",
"    number-sections: false\n",
"    html-math-method: katex\n",
"    code-fold: true\n",
"    code-summary: \"Show the code\"\n",
"    code-overflow: wrap\n",
"    code-copy: hover\n",
"    code-tools:\n",
"        source: false\n",
"        toggle: true\n",
"        caption: See code\n",
"execute: \n",
"  warning: false\n",
"    \n",
"---"
],
"id": "e5c165ab"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sqlite3\n",
"from lets_plot import *\n",
"\n",
"LetsPlot.setup_html(isolated_frame=True)"
],
"id": "b6928eb3",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html\n",
"\n",
"# Connect to the Lahman baseball database\n",
"sqlite_file = 'lahmansbaseballdb.sqlite'\n",
"# this file must be in the same location as your .qmd or .py file\n",
"con = sqlite3.connect(sqlite_file)"
],
"id": "8ec5a052",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QUESTION|TASK 1\n",
"\n",
"__Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.__ \n"
],
"id": "c0174561"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Salaries of players who attended BYU-Idaho (schoolID 'idbyuid'), highest first.\n",
"# DISTINCT removes the repeats caused by one collegeplaying row per year attended;\n",
"# there is no GROUP BY, so every salary record for each player is kept.\n",
"q = \"\"\"SELECT DISTINCT p.playerID AS 'Player',\n",
"       sc.schoolID AS 'School',\n",
"       s.yearID AS 'Year',\n",
"       s.salary AS 'Salary',\n",
"       s.teamID AS 'Team'\n",
"    FROM collegeplaying sc\n",
"    JOIN people p ON p.playerID = sc.playerID\n",
"    JOIN salaries s ON p.playerID = s.playerID\n",
"    WHERE sc.schoolID = 'idbyuid'\n",
"    ORDER BY s.salary DESC\n",
"    \"\"\"\n",
"results = pd.read_sql_query(q, con)\n",
"results"
],
"id": "96139a90",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QUESTION|TASK 2\n",
"\n",
"__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)__ \n",
" a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report. \n",
" b. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results. \n",
" c. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results. \n",
"\n",
"Changing the minimum number of at-bats changes the results significantly. With a low threshold, the top of the table is dominated by 'perfect' averages from players with very few at-bats, \n",
"so requiring more at-bats gives more useful numbers.\n"
],
"id": "ed79d08f"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Part a: batting average per player and year, at least 1 at-bat that year.\n",
"# Filter rows with WHERE; grouping by playerID would collapse each player's seasons into one row.\n",
"q = \"\"\" SELECT playerID AS 'Player',\n",
"       yearID AS 'Year',\n",
"       AB AS 'At Bat',\n",
"       H AS 'Hits',\n",
"       CAST(H AS FLOAT) / AB AS 'Batting Average'\n",
"    FROM batting\n",
"    WHERE AB >= 1\n",
"    ORDER BY CAST(H AS FLOAT) / AB DESC, playerID ASC\n",
"    LIMIT 5\n",
"\"\"\"\n",
"results = pd.read_sql_query(q, con)\n",
"results"
],
"id": "71850978",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Part b: same query, but at least 10 at-bats that year\n",
"q = \"\"\" SELECT playerID AS 'Player',\n",
"       yearID AS 'Year',\n",
"       AB AS 'At Bat',\n",
"       H AS 'Hits',\n",
"       CAST(H AS FLOAT) / AB AS 'Batting Average'\n",
"    FROM batting\n",
"    WHERE AB >= 10\n",
"    ORDER BY CAST(H AS FLOAT) / AB DESC, playerID ASC\n",
"    LIMIT 5\n",
"\"\"\"\n",
"results = pd.read_sql_query(q, con)\n",
"results"
],
"id": "19442b31",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Part c: career batting average, at least 100 career at-bats\n",
"q = \"\"\" SELECT\n",
"       playerID AS 'Player',\n",
"       SUM(AB) AS 'Career At-Bats',\n",
"       SUM(H) AS 'Career Hits',\n",
"       CAST(SUM(H) AS FLOAT) / SUM(AB) AS 'Batting Average'\n",
"    FROM batting\n",
"    GROUP BY playerID\n",
"    HAVING SUM(AB) >= 100\n",
"    ORDER BY CAST(SUM(H) AS FLOAT) / SUM(AB) DESC, playerID ASC\n",
"    LIMIT 5\n",
"\"\"\"\n",
"results = pd.read_sql_query(q, con)\n",
"results"
],
"id": "a42cb4ef",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QUESTION|TASK 3\n",
"\n",
"__Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?__\n",
"\n",
"I gathered Outs Pitched data for two teams on opposite coasts of the United States: the Boston Red Sox and \n",
"the Los Angeles Dodgers. I wanted to see how they differ from year to year. To my surprise, the two series are similar. Note the years 1918, 1981, and 1994, where both teams show similar dips. This raises the question of what happened in MLB in those years.\n"
],
"id": "bd36fbb6"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Include and execute your code here\n",
"q =\"\"\" SELECT tf.franchName AS 'Team Name',\n",
"       t.yearID AS 'Year',\n",
"       t.W AS 'Wins',\n",
"       t.G AS 'Games',\n",
"       CAST(t.W AS FLOAT) / t.G AS 'Win%',\n",
"       t.IPOuts AS 'Outs Pitched'\n",
"    FROM teams t\n",
"    JOIN teamsfranchises tf ON tf.franchID = t.franchID\n",
"    WHERE tf.franchName IN ('Los Angeles Dodgers', 'Boston Red Sox')\n",
"    + ggtitle('Red Sox vs Dodgers Performance Over Time')\n",
"    + xlab('Year')\n",
"    + ylab('Outs Pitched')\n",
"    + theme_classic()\n",
"    + scale_x_continuous(format=\"d\")\n",
")\n",
"\n",
"plot.show()\n"
],
"id": "492593fb",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## STRETCH QUESTION|TASK 1\n",
"\n",
"__Advanced Salary Distribution by Position (with Case Statement):__ \n",
"\n",
" * Write an SQL query that provides a summary table showing the average salary for players in each position (e.g., pitcher, catcher, outfielder) across all years. Include the following columns:\n",
"\n",
"   * position\n",
"   * average_salary\n",
"   * total_players\n",
"   * highest_salary \n",
"\n",
" * The highest_salary column should display the highest salary ever earned by a player in that position. If no player in that position has a recorded salary, display “N/A” for the highest salary. \n",
"\n",
" * Additionally, create a new column called salary_category using a case statement: \n",
"\n",
"   * If the average salary is above $1 million, categorize it as “High Salary.” \n",
"   * If the average salary is between $500,000 and $1 million, categorize it as “Medium Salary.” \n",
"   * Otherwise, categorize it as “Low Salary.” \n",
"\n",
" * Order the table by average salary in descending order.\n",
" * Print the top 10 rows of this summary table. \n",
"\n",
"_type your results and analysis here_\n"
],
"id": "e2adc75a"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Include and execute your code here\n"
],
"id": "92cc0520",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## STRETCH QUESTION|TASK 2\n",
"\n",
"__Advanced Career Longevity and Performance (with Subqueries):__\n",
"\n",
" * Calculate the average career length (in years) for players who have played at least one game. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their: \n",
"\n",
"   * playerID\n",
"   * first_name\n",
"   * last_name\n",
"   * career_length\n",
"\n",
" * The career_length should be calculated as the difference between the maximum and minimum yearID for each player. \n",
"\n",
"MAX and MIN functions work very well in this context. Nick Altrock sounds like he had a lot of endurance.\n"
],
"id": "6ff899d3"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Include and execute your code here\n",
"q = \"\"\"\n",
"SELECT p.playerID,\n",
"       p.nameFirst AS 'first_name',\n",
"       p.nameLast AS 'last_name',\n",
"       MAX(CAST(a.yearID AS INTEGER)) - MIN(CAST(a.yearID AS INTEGER)) AS 'career_length'\n",
"FROM people p\n",
"JOIN appearances a ON p.playerID = a.playerID\n",