Forum

Posted by Andrew, Aug. 31, 2022, 12:58 a.m.

Question #59 Why does LIMIT work, and not a WHERE condition?

Hi, working through this question, and basically have everything right except for the LIMIT portion of the solution. My query is:

WITH film_revenue AS(
  SELECT 
      f.film_id, 
      c.name, 
      SUM(p.amount) as total_revenue
  FROM film AS f
  LEFT JOIN inventory AS inv
  ON f.film_id = inv.film_id
  LEFT JOIN rental AS r
  ON inv.inventory_id = r.inventory_id
  LEFT JOIN payment AS p
  ON r.rental_id = p.rental_id
  INNER JOIN film_category AS fc
  ON inv.film_id = fc.film_id
  INNER JOIN category AS c
  ON fc.category_id = c.category_id
  GROUP BY f.film_id, c.name
  )

SELECT 
    fr.film_id, 
    fr.name,
    fr.total_revenue * 100.0 / SUM(fr.total_revenue) OVER(PARTITION BY fr.name) AS revenue_percentage
FROM film_revenue AS fr
WHERE fr.film_id <= 10
ORDER BY fr.film_id;

Results in:

film_id name revenue_percentage
1 Documentary 49.2565304755525787
2 Horror 16.5716969317470257
3 Documentary 50.7434695244474213
4 Horror 28.7319974953036944
5 Family 100.0000000000000000
6 Foreign 100.0000000000000000
7 Comedy 100.0000000000000000
8 Horror 32.1916092673763306
9 Horror 22.5046963055729493
10 Sports 100.0000000000000000

So, the percentage returned is the film revenue's percentage, partitioned by the category, but only for the 10 rows. So if there is only 1 category for that film id, it is 100% of the category's revenue.

You change the WHERE clause into a LIMIT 10, and you get the correct result. But why is that? Are there certain rules in the PARTITION BY clause that that don't apply to the regular window function without the partition? In #58, the WHERE condition worked fine without the PARTITION BY clause so I assume that's where something is going wrong here.

Side note: I used LEFT JOINS from the film table, in the chance that one of the film ids from 1-10 did not have any revenue, so LIMIT 10 would still work (I think - please let me know if I'm wrong here!), but I'm more interested in why the WHERE condition does not work here.

 

EDIT: Same issue comes up in problem #60 and in #61. In #61, sub query is used specifically to bypass this issue, so I'm guessing that the WHERE condition is being run before the window function. In a situation where you cannot use LIMIT and you need to use a WHERE condition, you'd have to run a sub query for the columns you want to return that contains the partition, then use the WHERE clause outside of the sub query. Is this the optimal way to go about this problem?

Answers

Without testing your specific query, I believe you've come across the answer on your own. Much like math & programming languages, there is an order of operations that occurs within SQL. In this case, the WHERE clause(s) effectively create the pool of data for the window functions to calculate against. LIMIT simply returns a maximum number of rows after everything has been calculated / processed. 

This order of operations is why you can't do certain tasks within certain clauses (ie, why WHERE clauses don't work for filtering aggregate values and instead need the HAVING clause - you can't calculate / filter on an aggregate until you have all the data). Learning and understanding where these apply are some of the things that separate beginner and experienced SQL developers. Taking it a step deeper to learning how the underlying SQL engine behaves is an area where you can learn to optimize queries / optimize the system it's running on (ie, how many CPU cores, RAM, type and speed of storage) - this is more of an area of data engineering / DBA vs pure analytics, but it's definitely helpful to know & understand.

testing

Leon, Aug. 31, 2022, 9:30 a.m.

Yep, I assumed that was the case - thanks Mike for clarifying it a bit further. I'm very much a beginner still, and this definitely helps and forces me to understand WHY certain things work, rather than just get the correct solution. Question #61, with the sub query, really got my brain thinking back on this problem, and helped me reach the conclusion myself. I really enjoyed the sequence of problems from #59-61.

Andrew, Aug. 31, 2022, 9:14 p.m.
SQLPad user avatar

Mike (228)

Aug. 31, 2022, 7:25 a.m.