********************************#192
- ***********************************************************************************************************************************2001
- ******************************************************************************************************************************************************************************************************************
- ****************************************************************************************************
Table 1: tweets
col_name | col_type ----------------+-------------------------- user_id | bigint tweet_id | bigint reply_tweet_id | bigint created_at | timestamp content | varchar(280) -- max of 280 chars
Table 2: twitter_experiment_users
Users that have been assigned into a treatment group or not for each ab testing. A user can only belong to either treatment or control group by a single experiment.
col_name | col_type --------------+-------------------------- exp_id | bigint user_id | bigint is_treatment | boolean
Sample results
diff ------------------------ 0.33333333333333333333
Solution postgres
WITH user_tweet_replies AS (
SELECT user_id, COUNT(DISTINCT tweet_id) AS num_tweets, SUM(num_replies) AS total_replies
FROM (
SELECT T1.user_id, T1.tweet_id, COUNT(T2.tweet_id) AS num_replies
FROM (
SELECT * FROM tweets
WHERE reply_tweet_id IS NULL
) T1
LEFT JOIN tweets T2
ON T1.tweet_id = T2.reply_tweet_id
GROUP BY 1,2
) X
GROUP BY user_id
),
exp_2001 AS (
SELECT user_id, is_treatment
FROM twitter_experiment_users
WHERE exp_id = 2001
)
SELECT
SUM(CASE WHEN E.is_treatment is true then total_replies ELSE 0 END) *1.0 / SUM(CASE WHEN E.is_treatment is true then num_tweets ELSE 0 END) -
SUM(CASE WHEN E.is_treatment is false then total_replies ELSE 0 END) *1.0 / SUM(CASE WHEN E.is_treatment is false then num_tweets ELSE 0 END) AS diff
FROM user_tweet_replies R
INNER JOIN exp_2001 E
ON E.user_id = R.user_id
;
Explanation
This query is calculating the difference in average number of replies per tweet between a treatment group and a control group in a Twitter experiment.
The first part of the query creates a temporary table called "user_tweet_replies" that calculates the total number of tweets and replies for each user in the dataset. It does this by joining the "tweets" table with itself on tweet_id and reply_tweet_id, and grouping the result by user_id and tweet_id.
The second part of the query creates another temporary table called "exp_2001" that selects the user_id and is_treatment columns from a separate table called "twitter_experiment_users", where the experiment ID is 2001.
Finally, the query joins the two temporary tables and calculates the difference in average number of replies per tweet between the treatment and control groups using a CASE statement. It does this by summing the total number of replies and tweets for each group and dividing the result. The final output is a single value representing the difference in averages between the two groups.
Expected results