diff --git a/02-SQL_Basics/exercise_solutions.md b/02-SQL_Basics/exercise_solutions.md index 464498e..ef39230 100644 --- a/02-SQL_Basics/exercise_solutions.md +++ b/02-SQL_Basics/exercise_solutions.md @@ -1,8 +1,114 @@ ### Problem 1 +##### How many survey responses are in the table? ```SQL -SELECT query +SELECT count(*) FROM public.av_survey WHERE status = 'COMPLETE'; ``` + ### Problem 2 +##### How many survey responses were gotten online vs in-person? + +```SQL +SELECT source_type, count(source_type) FROM public.av_survey GROUP BY source_type; +``` + + +### Problem 3 +##### How many survey responses were collected on each day in the duration of the survey? + +```SQL +SELECT distinct(date(start_date)), count(*) From public.av_survey GROUP BY date(start_date); +``` + + +### Problem 4 +##### What is the average time taken to complete the survey? + +If we take a simple average of completed surveys we can use the following: + +```SQL +SELECT avg(end_date - start_date) FROM public.av_survey WHERE status = 'COMPLETE'; +``` + +If we order by the time taken to complete the survey we see that there are several values that are outliers. For example, the longest time it took to complete the survey was 11 days. These outliers are skewing our average time calculation. + +```SQL +SELECT end_date, start_date, (end_date - start_date) as difference FROM public.av_survey WHERE status = 'COMPLETE' ORDER BY difference DESC; +``` + +Let's select a reasonable time difference like anything less than an hour and recalculate the average again. + +```SQL +SELECT avg(end_date - start_date) FROM public.av_survey WHERE status = 'COMPLETE' AND (end_date - start_date) < '1:00'; +``` + +This returns an average time of 6 minutes and 29 seconds. + + +### Problem 5 +##### How many distinct zip codes were included in the survey? + +```SQL +SELECT count(distinct(zipcode)) FROM public.av_survey WHERE zipcode <> 0; +``` + + +### Problem 6 +##### How many people in total have interacted with an Autonomous Vehicle? +```SQL +SELECT count(*) FROM public.av_survey WHERE interactbicycle = 'Yes' OR interactpedestrian = 'Yes'; +``` + +### Problem 7 +##### For each type of response to `FeelingsProvingGround`, count how many people gave that response? + +```SQL +SELECT feelingsprovingground, count(feelingsprovingground) +FROM public.av_survey +WHERE feelingsprovingground IS NOT NULL +GROUP BY feelingsprovingground; +``` + + +### Problem 8 +##### Same question as Q7, but only from people who have interacted with an Autonomous Vehicle. + +```SQL +SELECT feelingsprovingground, count(feelingsprovingground) FROM public.av_survey WHERE interactbicycle = 'Yes' OR interactpedestrian = 'Yes' GROUP BY feelingsprovingground; +``` + + +### Problem 9 +##### For each type of response to `AVSafetyPotential`, count how many people gave that response? + +```SQL +SELECT avsafetypotential, count(avsafetypotential) FROM public.av_survey WHERE avsafetypotential IS NOT NULL GROUP BY avsafetypotential ORDER BY count(avsafetypotential) DESC; +``` + + +### Problem 10 +##### Same question as Q9, but only from people who are **Mostly familiar** or **Extremely familiar** with the AV technology (`FamiliarityTechnology` column) + +```SQL +SELECT avsafetypotential, count(avsafetypotential) FROM public.av_survey WHERE familiaritytechnology = 'Extremely familiar' OR familiaritytechnology = 'Mostly familiar' GROUP BY avsafetypotential; +``` + + +### Problem 11 +##### How many people are in favor of any type of regulation on Autonomous Vehicles? + +```SQL +SELECT COUNT(*) FROM public.av_survey WHERE 'Yes' IN (regulationsharedata, regulationschoolzone, regulationspeed, regulationtesting); +``` + + +### Problem 12 +##### How many people who are following AVs in the news (`PayingAttentionAV` response is **To a moderate extent** or **To a large extent**) are not familiar with the Technology (`FamiliarityTechnology` is _lesser_ than **Somewhat familiar** and _below_)? + +```SQL +SELECT COUNT(payingattentionav) FROM public.av_survey +WHERE (payingattentionav = 'To a moderate extent' OR payingattentionav = 'To a large extent') +AND (familiaritytechnology = 'Mostly Unfamiliar' OR familiaritytechnology = 'Not familiar at all'); +``` diff --git a/03-SQL_Advanced/exercise_solutions.md b/03-SQL_Advanced/exercise_solutions.md new file mode 100644 index 0000000..aa8257e --- /dev/null +++ b/03-SQL_Advanced/exercise_solutions.md @@ -0,0 +1,246 @@ +# Problems + +## A: Drinking water availability + +1. Write a query which outputs the **total** drinking water availability for the most recent year in data, for each region. Sort the output by the value of availability in descending order. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) AS safe_total +FROM public.mytable +GROUP BY region_name +HAVING sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL +ORDER BY safe_total DESC; +``` + +2. **Rural/Urban ratio**: For regions where both rural and urban data are available, for most recent year, find the ratio of rural/urban drinking water availability. Sort the table by this ratio, in ascending order. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_rural::float/safe_drinking_water_urban ELSE NULL END) AS ru_ratio +FROM public.mytable +WHERE safe_drinking_water_rural IS NOT NULL AND safe_drinking_water_urban IS NOT NULL +GROUP BY region_name +ORDER BY ru_ratio; +``` + +3. **Growth**: For regions where **total** drinking water availability value is available, find the growth in drinking water availability. Growth is defined as the ratio of latest value to the oldest value. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS old_year_2005, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) AS latest_year_2015, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END)/sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS growth_ratio +FROM public.mytable +GROUP BY region_name +HAVING sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL +AND sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL +ORDER BY growth_ratio; +``` + +## B: GDP + +1. Write a query which outputs the GDP per capita for the most recent year in data, for each region. Sort the output by the value of GDP per capita in descending order. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END) AS gdp_per_cap_2016 +FROM public.gdp +GROUP BY region_name +HAVING sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END) IS NOT NULL +ORDER BY gdp_per_cap_2016 DESC; +``` + +2. **Comparison with average growth**: The _average growth_ of GDP per capita can be calculated by comparing the oldest and the most recent value for the "Total, all countries" region. We want to know which countries have grown faster than average, and which have grown slower than average. Write a query that outputs two rows - one row contains the list of all regions where growth is lower than average, and another which contains the regions where growth is higher than average. + +```SQL +SELECT + CASE WHEN avg_growth_per_region < 3.6518 THEN 'Low Growth' ELSE 'High Growth' END AS growth_category, + string_agg(region_name,', ') AS countries +FROM + (SELECT + region_name, + sum(CASE WHEN year = 1985 THEN gdp_per_capita ELSE NULL END) AS old, + sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END) AS recent, + sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 1985 THEN gdp_per_capita ELSE NULL END) AS avg_growth_per_region + FROM gdp + WHERE region_name != 'Total, all countries or areas' + GROUP BY region_name + HAVING sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 1985 THEN gdp_per_capita ELSE NULL END) IS NOT NULL) AS growth_avg +GROUP BY growth_category; +``` + + + +# C: Internet penetration + +1. Write a query which outputs the internet penetration for the most recent year in data, for each region. Sort the output by the value of availability in descending order. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END) AS net_pen_2016 +FROM public.internet_penetration +GROUP BY region_name +HAVING sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END) IS NOT NULL +ORDER BY net_pen_2016 DESC; +``` + +2. **Growth**: Find the growth in internet penetration for each region. Growth is defined as the ratio of latest value to the oldest value. + +```SQL +SELECT region_name, + sum(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END) AS old_net_pen, + sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END) AS recent_net_pen, + sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END)::float AS growth_net_pen +FROM public.internet_penetration +GROUP BY region_name +HAVING sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END)::float IS NOT NULL +AND sum(CASE WHEN year = 2000 THEN percent_internet_penetration ELSE NULL END) != 0 +ORDER BY growth_net_pen DESC; +``` + + +## Problems + +1. Is there any region where internet connectivity is higher than total drinking availability? + +```SQL +SELECT i.region_name, i.year, i.percent_internet_penetration, w.safe_drinking_water_total +FROM public.internet_penetration as i +JOIN public.mytable as w +ON i.year = w.year AND i.region_name = w.region_name +WHERE w.safe_drinking_water_total IS NOT NULL +AND percent_internet_penetration > safe_drinking_water_total; +``` + + +2. List all countries where growth in internet connectivity is lower than growth in drinkable water availability. + +```SQL +SELECT water.region_name, growth_net_pen_ratio, growth_water_ratio +FROM + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) AS old_net_pen_2005, + sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END) AS recent_net_pen_2015, + sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float AS growth_net_pen_ratio + FROM public.internet_penetration + GROUP BY region_name + HAVING sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float IS NOT NULL + AND sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) != 0 + ORDER BY growth_net_pen_ratio DESC) AS internet, + + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS old_year_2005, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) AS latest_year_2015, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END)/sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS growth_water_ratio + FROM public.mytable + GROUP BY region_name + HAVING sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + AND sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + ORDER BY growth_water_ratio) AS water +WHERE water.region_name = internet.region_name +AND growth_net_pen_ratio < growth_water_ratio +ORDER BY water.region_name; +``` + +3. Find the correlation between GDP per capita and GDP in current prices. Does the value look strange? What explains the value? + +```SQL +SELECT corr(gdp_per_capita,gdp_in_current_prices_millions) +FROM public.gdp; +``` +The correlation coefficient is .078. This value doesn't look strange because countries have vastly different population sizes in relation to their total GDP. + +For example, you could have a relatively small country, with a small population but have a high gdp per capita, but with a small population, e.g. Monaco. + +On the other hand, you could have a country with a large amount of people, but have a low gdp per capita, but due to the sheer size of the population, the total gdp of the country is quite large e.g.China in the 1990s + +Two countries could have similar GDP's, but their GDP per capita may be very different, which further explains no correlation between the two. + +Let's look at two countries with a similar gdp: South Africa and Denmark. According to the gdp table, in 2014, both these countries had a gdp of roughly 350,000 million. However, their gdp per capita was 6,433 for South Africa and 62,323 for Denmark, a difference by a factor of 10. + +What explains this difference? A quick google search of these countries' populations reveals that South Africa has a population of 56 million people, roughly 10 times that of Denmark with a population of 5.7 million. + +Put simply, the correlation coefficient between `gdp_per_capita` and `gdp_in_current_prices_millions` is close to zero because there is virtually no correlation between population size and GDP. + + + +4. Find the correlation between growth of internet connectivity and total drinkable water availability. + +```SQL +SELECT corr(internet.growth_net_pen_ratio, water.growth_water_ratio) +FROM + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) AS old_net_pen_2005, + sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END) AS recent_net_pen_2015, + sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float AS growth_net_pen_ratio + FROM public.internet_penetration + GROUP BY region_name + HAVING sum(CASE WHEN year = 2015 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float IS NOT NULL + AND sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) != 0 + ORDER BY growth_net_pen_ratio DESC) AS internet, + + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS old_year_2005, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) AS latest_year_2015, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END)/sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS growth_water_ratio + FROM public.mytable + GROUP BY region_name + HAVING sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + AND sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + ORDER BY growth_water_ratio) AS water +WHERE internet.region_name = water.region_name; +``` + +5. Find the correlation between growth of internet connectivity and growth of GDP per capita + +```SQL +SELECT corr(internet.growth_net_pen_ratio, gpc.growth_gdp_per_cap) +FROM + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) AS old_net_pen_2005, + sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END) AS recent_net_pen_2016, + sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float AS growth_net_pen_ratio + FROM public.internet_penetration + GROUP BY region_name + HAVING sum(CASE WHEN year = 2016 THEN percent_internet_penetration ELSE NULL END)/sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END)::float IS NOT NULL + AND sum(CASE WHEN year = 2005 THEN percent_internet_penetration ELSE NULL END) != 0 + ORDER BY growth_net_pen_ratio DESC) AS internet, + + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) AS old_gdp_per_cap_2005, + sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END) AS new_gdp_per_cap_2016, + sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) AS growth_gdp_per_cap + FROM public.gdp + GROUP BY region_name + HAVING sum(CASE WHEN year = 2016 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) IS NOT NULL + ORDER BY growth_gdp_per_cap DESC) AS gpc +WHERE internet.region_name = gpc.region_name; + +``` + +6. Find the correlation between growth of GDP per capita and drinking water availability. + +```SQL +SELECT corr(gpc.growth_gdp_per_cap, water.growth_water_ratio) +FROM + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) AS old_gdp_per_cap_2005, + sum(CASE WHEN year = 2015 THEN gdp_per_capita ELSE NULL END) AS new_gdp_per_cap_2016, + sum(CASE WHEN year = 2015 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) AS growth_gdp_per_cap + FROM public.gdp + GROUP BY region_name + HAVING sum(CASE WHEN year = 2015 THEN gdp_per_capita ELSE NULL END)::float/sum(CASE WHEN year = 2005 THEN gdp_per_capita ELSE NULL END) IS NOT NULL + ORDER BY growth_gdp_per_cap DESC) AS gpc, + + (SELECT region_name, + sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS old_year_2005, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) AS latest_year_2015, + sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END)/sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) AS growth_water_ratio + FROM public.mytable + GROUP BY region_name + HAVING sum(CASE WHEN year = 2005 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + AND sum(CASE WHEN year = 2015 THEN safe_drinking_water_total ELSE NULL END) IS NOT NULL + ORDER BY growth_water_ratio DESC) AS water +WHERE water.region_name = gpc.region_name; +```