Forum

Posted by Sylvia, April 26, 2022, 1:58 p.m.

Question 102

I have two questions given the assumption below: 

  • A user can have multiple sessions with different duration and can be counted multiple times.
    • Does it mean we only count the average OR the maximum duration for any session? It does not seem to make a difference whether I use "AVG(duration)" or "MAX(duration)" because it is on a user_id and session_id level, but I could be wrong. 
    • If a user has multiple sessions, should I use count(distinct session_id) OR count(distinct user_id)? I think it should be session_id based on the fact that a user can have multiple sessions. 

Thanks in advance for your detailed explanations! I found the business sense aspect of the SQL questions to be more challenging than the technical aspect. :)

My Answer: 

WITH session_user_duration AS
(
SELECT a.user_id, a.session_id, 
        AVG(b.duration) as duration 
FROM session_web a
INNER JOIN session_web_duration b
ON a.session_id = b.session_id 
GROUP BY a.user_id, a.session_id
), 

session_user_label AS (
SELECT user_id, 
         session_id, 
         CASE WHEN duration < 10 THEN '< 10'
            WHEN duration >= 10 AND duration < 20 THEN '10, 20'
            WHEN duration >= 20 AND duration < 45 THEN '20, 45'
            WHEN duration >= 45 AND duration < 60 THEN '45, 60'
            WHEN duration >= 60 AND duration < 200 THEN '60, 200'
            WHEN duration >= 200 AND duration < 600 THEN '200, 600'
            WHEN duration >= 600 THEN '>= 600'
            ELSE NULL
            END AS duration_category    
FROM session_user_duration
)

SELECT duration_category, count(distinct session_id) as count
FROM session_user_label 
GROUP by duration_category

Solution: 

WITH session_user_duration AS (
    SELECT 
        W.session_id,
        W.user_id, 
        MAX(D.duration) AS duration
    FROM session_web W
    INNER JOIN session_web_duration D
    ON D.session_id = W.session_id
    GROUP BY W.session_id, W.user_id
),
session_user_label AS (
    SELECT session_id,
           CASE WHEN duration < 10 THEN '< 10'
                WHEN duration < 20 THEN '10, 20'
                WHEN duration < 45 THEN '20, 45'
                WHEN duration < 60 THEN '45, 60'
                WHEN duration < 200 THEN '60, 200'
                WHEN duration < 600 THEN '200, 600'
                WHEN duration >= 600 THEN '>= 600'
               ELSE NULL END AS duration_category,
           user_id
    FROM session_user_duration
)
SELECT duration_category, COUNT(DISTINCT user_id)
FROM session_user_label
GROUP BY duration_category
ORDER BY duration_category;

Answers

Hi there,

     The AVG(b.duration) as duration  is not correct. Because the following part is used to remove duplicate records .  For each session, there will be a record for each event.   So we should remove the duplicates at first.  

WITH session_user_duration AS (
    SELECT 
        W.session_id,
        W.user_id, 
        MAX(D.duration) AS duration
    FROM session_web W
    INNER JOIN session_web_duration D
    ON D.session_id = W.session_id
    GROUP BY W.session_id, W.user_id
)

Actually, I think there is an error  on offical solution.  Because the question is avg_session_per_user, not the number of users. 

Here is my answer, I hope it can help you. :) 


WITH C AS 
(
SELECT 
     user_id,
     X.session_id,
     CASE WHEN duration < 10 THEN '<10'
          WHEN duration >= 10 AND duration < 20 THEN '10,20'
          WHEN duration >= 20 AND duration < 45 THEN '20,45'
          WHEN duration >= 45 AND duration < 60 THEN '45,60'
          WHEN duration >= 60 AND duration < 200 THEN '60,200'
          WHEN duration >= 200 AND duration < 600 THEN '200,600'
          ELSE '>=600' 
          END  AS date    
FROM 
  (
 SELECT user_id,session_id
  FROM session_web
  GROUP BY user_id,session_id
 )
   X

inner join session_web_duration S
ON X.session_id = S.session_id            
)

SELECT date, count(session_id)/COUNT(DISTINCT user_id) AS avg_session_per_user
FROM C
GROUP BY date

 

 

Thank you very much for walking me through your answer! :) 

Sylvia, April 29, 2022, 11:13 p.m.
SQLPad user avatar

Li (10)

April 27, 2022, 2:27 p.m.

Hi Sylvia,

It's a good question, we can probably add some more instruction to the table definition.

1. For session_web_duration table

The primary key for this table is session_id, and each session has only one duration.

SELECT 

    COUNT(session_id),  

    COUNT(DISTINCT session_id)

FROM session_web_duration;

-- 1000, 10000

For the session_web table, in the CTE, we want to know each user, by each session id and the session duration, since the session_web table has an event column, the session_id column has duplicates, but each session_id has only one unique value of duration.

So any aggregation function will generate the same results: AVG, MAX, MIN will all have the same results for the duration, as one session can only have one duration value.

With the CTE, we get a unique session_id, its duration, and the distinct user_id, the next job is simply to count them.

 

2. We are not counting the number of sessions but unique user count.

 

Hope it helps,

Leon

Thank you for your detailed answer! 

Since one user can have multiple sessions with different durations, can I understand it as - 

  • If one user has one session with a duration < 5, and another session with a duration > 30, this user essentially will be counted twice in these two duration categories because he has two unique sessions? 

Sylvia, April 29, 2022, 11:12 p.m.

It looks like the sample results has the columns misnamed:

Sample results

 date    |   avg_session_per_user
-----------+--------------
< 10       | 20000
10, 20     | 12000
20, 45     | 5000

When the solution has this format:
duration_category count
< 10 39
200, 600 82
60, 200 68

Steven S, May 8, 2022, 7:39 p.m.
SQLPad user avatar

Leon (949)

April 28, 2022, 4:35 p.m.