SQL window functions: Introduction
SQL WINDOW functions are a family of SQL functions that come up frequently in data scientist and data engineer job interviews.
Companies such as Meta and Amazon ask about them often.
However, writing a bug-free WINDOW function-based SQL query can be quite challenging for any job candidate, especially those who have just gotten started with SQL.
In this article, I will share some typical WINDOW function-specific interview questions, patterns, and step-by-step solutions.
You can also practice SQL window functions by visiting our SQL playground.
If you are in a hurry, watch the following video to get a quick glimpse of what window functions do.
SQL Window Functions Tutorial: Outline
I am going to break down this article into 4 sections:
- In the first section, I will go through a few WINDOW functions based on regular aggregate functions, such as AVG, MIN/MAX, COUNT, and SUM.
- In section 2, I will focus on rank-related functions, such as ROW_NUMBER, RANK, and DENSE_RANK. These functions are handy for generating ranking indexes, and you need to be fluent in them before entering a data scientist SQL interview.
- In the third section, I will talk about generating statistics (e.g., percentiles, quartiles, median, etc.) with the NTILE function, a common task for a data scientist.
- In the last section, let’s focus on LAG and LEAD, two functions that are super important if you are interviewing for a role that requires dealing with time-series data.
1. Window functions for aggregation (running total/sum): AVG, MIN/MAX, SUM, COUNT
Window functions are functions that perform calculations across a set of rows related to the current row.
It is comparable to the type of calculation done with an aggregate function, but unlike regular aggregate functions, window functions do not group several rows into a single output row — the rows retain their own identities.
Behind the scenes, a window function can access more rows than just the current row of the query result.
All examples in this article are based on movie DVD rental business data. In this first example, our goal is to compare each movie DVD's replacement cost to the average cost of movies sharing the same MPAA rating.
SELECT
title,
rating,
replacement_cost,
AVG(replacement_cost) OVER(PARTITION BY rating) AS avg_cost
FROM film;
For those of you not based in the United States, an MPAA rating is a film rating system that decides a film’s suitability for specific audiences, based on a film’s content. For example, G means it’s appropriate for all ages, while PG-13 contains materials that could be inappropriate for children under 13.
There is no GROUP BY clause for the AVG function, so how does the SQL engine know which rows to use to compute the average? The answer is the PARTITION BY clause inside OVER(): we are calculating the average within each unique value of rating.
In the final output, every row carries the average cost for its rating. You can then perform other analyses, such as dividing a movie's cost by avg_cost to find out how expensive it is relative to similar movies.
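As a quick illustration, that relative cost can be computed in the same query. This is just a sketch against the playground's film table, and cost_ratio is a column name I made up:

```sql
SELECT
    title,
    rating,
    replacement_cost,
    -- ratio of this movie's cost to the average cost within its rating group
    replacement_cost * 1.0 / AVG(replacement_cost) OVER(PARTITION BY rating) AS cost_ratio
FROM film
ORDER BY cost_ratio DESC;
```

A cost_ratio above 1 means the title is more expensive to replace than the average movie with the same rating.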
All tables in this article are available on SQLPad’s online SQL playground. If you want to follow along and submit queries against those tables, please feel free to go to sqlpad.io/playground and have some fun.
Let’s take a look at another example. In this example, I want to compare every movie’s length (in minutes) to the maximum length of movies from the same category.
SELECT
title,
name,
length,
MAX(length) OVER(PARTITION BY name) AS max_length
FROM (
SELECT F.title, C.name, F.length
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X;
It’s very similar to the first example, but this time I combined a MAX function with OVER and PARTITION BY to create a window function that returns the maximum movie length within the same movie category.
Take the first row: the movie's length is 163 minutes, and the maximum length of an action movie (the same category) is 185. If I compare each movie's length to that maximum, I can sense how long this specific movie is compared to others from the same category. That can be helpful to know, as films from different categories can have quite different lengths.
SELECT
film_id,
title,
length,
SUM(length) OVER(ORDER BY film_id) AS running_total,
SUM(length) OVER() AS overall,
SUM(length) OVER(ORDER BY film_id) * 100.0 / SUM(length) OVER() AS running_percentage
FROM film
ORDER BY film_id;
The query above is a more complicated example, in which we calculate a running sum with a window function.
Assume it's the holiday season and I want to binge-watch all 1000 movies, starting from movie id=1. After finishing each film, I want to know my overall progress. I can use SUM with OVER to calculate a running total of minutes watched.
Notice that there is no PARTITION BY clause, because I am not grouping the movies into any sub-categories. I want to compute my overall progress, not progress within subgroups or categories.
Another thing to notice: if I don't put anything inside OVER(), I get the total number of minutes of the entire movie catalog. As you can see in the second-to-last column (overall), every row has the same value, 115267. Once I add the ORDER BY clause, I instead get the running total of minutes up to that specific row (the running_total column).
Again, please feel free to go to sqlpad’s playground and play with this film table until you become comfortable with the syntax.
If you are interested in practicing a few more WINDOW functions that we just covered, here are 4 exercises for you to reinforce your learning.
Time to complete: ~30–45 mins.
- #58: Percentage of revenue per movie
- #59: Relative percentage of a movie’s revenue per category
- #60: Compare each film's rentals with the average rentals per category
- #61: Customer spend vs. average customer spend in the same store
2: Window functions for ranking: ROW_NUMBER, RANK, PARTITION BY
Let’s go through some of the essential WINDOW functions: ROW_NUMBER and RANK.
SELECT
F.film_id,
F.title,
F.length,
ROW_NUMBER() OVER(ORDER BY length DESC) AS row_num
FROM film F
ORDER BY row_num;
In this example, our goal is to create a ranking index based on the movie's length for the entire movie catalog.
As you can see, the ROW_NUMBER function generates a sequence of integers, starting from 1, for each row.
But movies with the same length were given DIFFERENT row numbers: the database arbitrarily assigns a unique number when there is a tie.
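If you need the assignment to be deterministic, a common trick is to add a unique tie-breaking column to the ORDER BY inside OVER. A sketch, using film_id as the tie-breaker:

```sql
SELECT
    F.film_id,
    F.title,
    F.length,
    -- film_id breaks ties between movies of equal length,
    -- so the same query always returns the same row numbers
    ROW_NUMBER() OVER(ORDER BY F.length DESC, F.film_id) AS row_num
FROM film F
ORDER BY row_num;
```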
SELECT
F.film_id,
F.title,
F.length,
C.name AS category,
ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length DESC) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, row_num;
Let’s take a look at another example. Instead of comparing a movie’s length to all other films from the entire catalog, we can rank them within each movie category using PARTITION BY.
ROW_NUMBER with OVER and PARTITION BY is a regular pattern that is frequently used in advanced SQL. Mastering this pattern will make your life much easier.
For example, imagine you are working at an e-commerce company with a global business. Your boss asks you to send her a list of best sellers for each country. You can use ROW_NUMBER with PARTITION BY to generate this list quickly.
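A sketch of that query, assuming a hypothetical sales table with country, product_id, and units_sold columns (none of these exist in the playground):

```sql
SELECT country, product_id, units_sold
FROM (
    SELECT
        country,
        product_id,
        units_sold,
        -- rank products within each country by units sold
        ROW_NUMBER() OVER(PARTITION BY country ORDER BY units_sold DESC) AS rn
    FROM sales
) ranked
WHERE rn = 1;  -- keep only the top seller per country
```

Changing rn = 1 to rn <= 3 would return the top 3 sellers per country instead.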
SELECT
F.film_id,
F.title,
F.length,
RANK() OVER(ORDER BY length DESC) AS ranking
FROM film F
ORDER BY ranking;
Let’s take a look at the RANK function, which is very similar to ROW_NUMBER. The difference is that RANK assigns the same value to tied rows, and the next distinct rank then skips ahead to the total number of rows counted so far. Notice how it jumps from 1 to 11 after ten movies tie for first place.
SELECT
F.film_id,
F.title,
F.length,
C.name AS category,
RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) ranking
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, ranking;
Similarly, we can also generate rankings within a subgroup with the help of PARTITION BY.
SELECT
F.film_id,
F.title,
F.length,
DENSE_RANK() OVER(ORDER BY length DESC) AS ranking
FROM film F
ORDER BY ranking;
The last function I want to show you is DENSE_RANK. It is very similar to RANK but differs in how it handles ties: after a tie, the next rank is the immediately following consecutive value rather than leaving a gap.
As you can see here, the first two movies both have a rank of 1. Instead of continuing at 3, the next dense_rank value is 2.
SELECT
F.film_id,
F.title,
F.length,
C.name AS category,
DENSE_RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) ranking
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, ranking;
Similarly, if we combine PARTITION BY with DENSE_RANK, we get consecutive rankings inside each category. The largest possible dense_rank value is the number of unique values inside a partition.
In summary, ROW_NUMBER, RANK, and DENSE_RANK are three functions that are very helpful for generating rankings. I have used ROW_NUMBER quite often at work as a data scientist, and RANK occasionally when dealing with ties (which is rare).
Time for some exercises. I have prepared 3 exercises to help with your understanding.
Time to complete: ~30–45 mins.
- #62: Shortest film by category
- #63: Top 5 customers by store
- #64: Top 2 films by category
3: Window functions for statistical distributions: NTILE
In this section, I am going to show you how to create statistics using NTILE.
NTILE is a handy function, especially for data analytics professionals. For example, as a data scientist, you probably need to create robust statistics such as quartile, quintile, median, etc., in your daily job, and NTILE makes it very easy to generate those numbers.
NTILE takes the number of buckets as its argument and then divides the rows into that many buckets, as equally as possible, based on how the rows are partitioned and ordered inside the OVER clause.
SELECT
film_id,
title,
length,
NTILE(100) OVER(ORDER BY length DESC) AS percentile
FROM film
ORDER BY percentile;
In this example, we create 100 buckets and order all of the movies by length in descending order. Therefore, the longest movies are assigned to bucket #1 and the shortest ones to bucket #100.
For the second example, we created a few more statistics, such as DECILES (10 buckets) and QUARTILES (4 buckets). We also partitioned them by MPAA ratings, so the statistics are relative to each unique MPAA rating.
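The original query is not reproduced above, but it would look something like this sketch against the film table (decile and quartile are column names I chose):

```sql
SELECT
    film_id,
    title,
    rating,
    length,
    -- 10 buckets per rating: bucket 1 holds the longest ~10% of movies
    NTILE(10) OVER(PARTITION BY rating ORDER BY length DESC) AS decile,
    -- 4 buckets per rating: bucket 1 holds the longest ~25% of movies
    NTILE(4)  OVER(PARTITION BY rating ORDER BY length DESC) AS quartile
FROM film
ORDER BY rating, length DESC;
```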
NTILE is a very straightforward window function that can be very useful for your daily job as a data scientist. Let’s do some exercises to help you remember its syntax and reinforce your learning in this lecture.
Some exercises for fun.
Time to complete: ~30–45 mins.
- #65: Movie percentiles by revenue
- #66: Movie percentiles by revenue by category
- #67: Quartile buckets by number of rentals
4: Window functions for time series data: LAG, LEAD
In the last section, I will walk you through two WINDOW functions: LAG and LEAD, which are extremely useful for dealing with time-related data.
The main difference between LAG and LEAD is that LAG fetches data from preceding rows, while LEAD fetches data from the following rows.
We can use either one of the two functions to compare month-over-month growth, for example. As a data analytics professional, you are very likely to work on time-related data. If you can use LAG or LEAD efficiently, you will be a very productive data scientist.
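For instance, month-over-month growth can be sketched as follows, reusing the payment table from the daily examples and PostgreSQL's DATE_TRUNC function (syntax will vary on other databases):

```sql
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', payment_ts) AS month,
        SUM(amount) AS revenue
    FROM payment
    GROUP BY DATE_TRUNC('month', payment_ts)
)
SELECT
    month,
    revenue,
    -- current month's revenue relative to the previous month's
    revenue * 1.0 / LAG(revenue, 1) OVER (ORDER BY month) AS mom_growth
FROM monthly_revenue
ORDER BY month;
```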
Their syntax is very similar to other window functions. Instead of focusing on the format of the syntax, let me show you a couple of examples.
WITH daily_revenue AS (
SELECT
DATE(payment_ts) date,
SUM(amount) revenue
FROM payment
WHERE DATE(payment_ts) >= '2020-05-24'
AND DATE(payment_ts) <= '2020-05-31'
GROUP BY DATE(payment_ts)
)
SELECT
date,
revenue,
LAG(revenue, 1) OVER (ORDER BY date) prev_day_sales,
revenue * 1.0 / LAG(revenue, 1) OVER (ORDER BY date) AS dod
FROM daily_revenue
ORDER BY date;
- In the first step, we created daily movie rental revenue with a CTE (common table expression).
- And in the second step, we appended the previous day’s revenue to the current day using the LAG function.
- Notice that the first row of the last 2 columns is empty. That's simply because there is no previous day: May 24th is the first row available.
- We also specified the offset, which is 1, so we fetch the immediately preceding row. If you change this number to 2, you compare the current day’s revenue to the day before the previous day.
- Finally, we divided the current day’s revenue by the previous day to create our daily revenue growth.
WITH daily_revenue AS (
SELECT
DATE(payment_ts) date,
SUM(amount) revenue
FROM payment
WHERE DATE(payment_ts) >= '2020-05-24'
AND DATE(payment_ts) <= '2020-05-31'
GROUP BY DATE(payment_ts)
)
SELECT
date,
revenue,
LEAD(revenue, 1) OVER (ORDER BY date) next_day_sales,
LEAD(revenue, 1) OVER (ORDER BY date) * 1.0 / revenue AS dod
FROM daily_revenue
ORDER BY date;
Let’s take a look at another example. It’s very similar to the previous one, but instead of appending the previous day’s revenue, we used the LEAD function with an offset of 1 to get the next day’s movie rental revenue.
We then divided the next day’s revenue by the current day’s revenue to get the day-over-day growth.
For this lecture, you can try the following 2 exercises to help you get familiar with the syntax.
Time to complete: ~45 mins – 1 hour.
Summary
Great job. If you have followed through all the examples, you have seen most of the common WINDOW functions/patterns. Congratulations!
WINDOW functions are a family of SQL utilities that are often asked during a data scientist job interview. Writing a bug-free WINDOW function query could be quite challenging. It takes time and practice to become a master, and that’s why I created sqlpad.io, where you can practice a list of 230 SQL interview questions. Please feel free to give it a try.
If you want more help on WINDOW functions, please check out my cracking the SQL interview for the data scientist course, and if you want some extra help for your job search, you can also hire me as your job search mentor.