Explaining my solutions for interview questions
This repository contains a solution to the "Active User Retention" problem from DataLemur.
As a data analyst at Facebook, I need to identify Monthly Active Users (MAUs) for July 2022. A "retained user" is defined as a user who was active in both the current month (July 2022) and the previous month (June 2022).
- Calculate the total number of retained users for July 2022.
- Output the month and the count of monthly active users.
To solve this problem efficiently, I focused on data modularity and readability using Common Table Expressions (CTEs).
- Data Isolation (CTEs):
- Created two separate CTEs (
juneandjuly) to isolate user activity for each respective month. - Used
TO_CHARandEXTRACTfunctions to filter records specifically for June 2022 and July 2022.
- Created two separate CTEs (
- Identifying Retained Users:
- Performed an INNER JOIN between the two CTEs on
user_id. - This approach ensures that only users present in both datasets are included in the final count, perfectly matching the business definition of a "retained user."
- Performed an INNER JOIN between the two CTEs on
- Aggregation:
- Counted unique
user_ids to produce the final MAU metric.
- Counted unique
- CTE vs Subqueries: Used CTEs to make the query more readable and maintainable, which is a best practice in professional data environments.
- Join Logic: Leveraged an Inner Join as a filtering mechanism to identify overlapping user segments across different time periods.
- Date Manipulation: Practiced extracting specific time intervals from
datetimeobjects to align with monthly reporting requirements.
DataLemur - Active User Retention
This repository contains a solution to the "Sending vs. Opening Snaps" problem from DataLemur.
As a data analyst at Snapchat, I need to analyze how users in different age groups spend their time on the platform. Specifically, I need to break down the proportion of time spent sending snaps compared to opening snaps.
- Calculate the percentage of time spent sending snaps and opening snaps for each
age_bucket. - Output the age bucket, send percentage, and open percentage, rounded to two decimal places.
To solve this problem efficiently, I focused on data modularity and conditional aggregation using Common Table Expressions (CTEs) and Window Functions.
- Data Isolation & Aggregation (CTEs):
- Created a CTE to join the
activitiesandage_breakdowntables. - Filtered out irrelevant data early by selecting only 'send' and 'open' activities.
- Used Window Functions (
SUM(...) OVER(PARTITION BY ...)) to simultaneously calculate the total time spent per activity type and the total time spent per age bucket without relying on complex nested subqueries.
- Created a CTE to join the
- Conditional Aggregation (Pivoting):
- Utilized the
MAX(CASE WHEN ... THEN ... END)pattern combined with aGROUP BYclause. - This approach cleanly pivots the row-level activity data into readable, separate columns (
send_percentageandopen_percentage).
- Utilized the
- Percentage Calculation:
- Divided the activity-specific time by the total age bucket time, multiplied by 100.0 to ensure accurate floating-point division, and applied the
ROUND(..., 2)function to format the final output.
- Divided the activity-specific time by the total age bucket time, multiplied by 100.0 to ensure accurate floating-point division, and applied the
- Window Functions: Leveraged
PARTITION BYto calculate multiple levels of aggregation (by age bucket and by activity type) simultaneously within a single dataset. - Data Pivoting: Mastered the
MAX(CASE WHEN ...)technique to effectively transform normalized row data into a structured, columnar report format. - Query Optimization: Improved query processing efficiency by filtering out unnecessary data (like 'chat' activities) at the CTE level before performing heavy aggregations.
DataLemur - Sending vs. Opening Snaps
This repository contains a solution to the "Consecutive Filing Years" problem from DataLemur.
As a data analyst, I need to identify individuals who have filed their taxes using any version of TurboTax for three or more consecutive years. Each user is allowed to file taxes once a year using a specific product.
- Identify the user IDs of individuals with 3 or more consecutive years of TurboTax filings.
- Display the output in ascending order of user IDs.
To solve this problem efficiently, I focused on window functions and mathematical logic using Common Table Expressions (CTEs).
- Data Isolation (CTE):
- Created a CTE (
turbotax_years) to filter for 'TurboTax' products using theLIKEoperator. - Used the
EXTRACTfunction to retrieve the filing year as an integer.
- Created a CTE (
- Identifying Historical Filings:
- Leveraged the
LAG()window function with an offset of 2 (LAG(filing_year, 2)). - Partitioned by
user_idand ordered byfiling_yearto directly retrieve the year from two records prior.
- Leveraged the
- Mathematical Filtering:
- Calculated the difference between the current
filing_yearand theyear_2steps_back. - A difference of exactly
2mathematically confirms 3 consecutive years of filing activity.
- Calculated the difference between the current
- Advanced Window Functions (
LAGOffset): Discovered thatLAG(column, offset)accepts a second argument. By usingLAG(filing_year, 2), I could cleanly look two rows back in a single step, making the query smarter and eliminating the need for nested CTEs. EXTRACTvsTO_CHAR: Initially usedTO_CHARto extract the year, which caused a type error during the final filtering stage because it converts the output into a string. Switching toEXTRACT(YEAR FROM filing_date)kept the value as an integer, allowing the arithmetic operation (filing_year - year_2steps_back = 2) to work flawlessly.- SQL Naming Conventions: Encountered a syntax error when trying to use an alias starting with a number (
as 2steps_back_year). Learned the strict rule that SQL aliases must begin with a letter (e.g.,year_2steps_back). - Regex in PostgreSQL: While trying to match the string "TurboTax", I learned that PostgreSQL 14 does not support the standard
REGEXPkeyword. Instead, it relies on specific operators like~(case-sensitive) or~*(case-insensitive). For this problem, using the standardLIKEoperator was the most straightforward approach.