目录
- 1 题目
- 2 建表语句
- 3 题解
题目来源:滴滴
1 题目
已知有用户登录记录表包含登录日期和登录用户ID,请查询出截止到当前日期累积登录用户数及登陆用户列表。
样例数据
+-------------+----------+
| log_date | user_id |
+-------------+----------+
| 2024-01-01 | a |
| 2024-01-02 | a |
| 2024-01-02 | b |
| 2024-01-03 | b |
| 2024-01-04 | c |
| 2024-01-05 | b |
| 2024-01-05 | c |
| 2024-01-05 | d |
| 2024-01-05 | e |
+-------------+----------+
期望结果
+-------------+-----------+------------------------+
| log_date | user_cnt | user_list |
+-------------+-----------+------------------------+
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
+-------------+-----------+------------------------+
2 建表语句
-- 建表语句
CREATE TABLE t_user_login
(
log_date STRING,
user_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 数据插入语句
INSERT INTO t_user_login
(log_date, user_id) VALUES
('2024-01-01','a'),
('2024-01-02','a'),
('2024-01-02','b'),
('2024-01-03','b'),
('2024-01-04','c'),
('2024-01-05','b'),
('2024-01-05','c'),
('2024-01-05','d'),
('2024-01-05','e');
3 题解
(1)数据去重聚合
select log_date,user_id,count(distinct user_id) over (order by log_date asc) as user_cnt,collect_set(user_id) over (order by log_date asc) as user_list
from t_user_login
执行结果
+-------------+----------+-----------+------------------------+
| log_date | user_id | user_cnt | user_list |
+-------------+----------+-----------+------------------------+
| 2024-01-01 | a | 1 | ["a"] |
| 2024-01-02 | b | 2 | ["a","b"] |
| 2024-01-02 | a | 2 | ["a","b"] |
| 2024-01-03 | b | 2 | ["a","b"] |
| 2024-01-04 | c | 3 | ["a","b","c"] |
| 2024-01-05 | e | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | d | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | c | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | b | 5 | ["a","b","c","e","d"] |
+-------------+----------+-----------+------------------------+
(2)分组得到结果
select log_date,user_cnt,user_list
from (select log_date,user_id,count(distinct user_id) over (order by log_date asc) as user_cnt,sort_array(collect_set(user_id) over (order by log_date asc)) as user_listfrom t_user_login) t
group by log_date, user_cnt, user_list
执行结果
+-------------+-----------+------------------------+
| log_date | user_cnt | user_list |
+-------------+-----------+------------------------+
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
+-------------+-----------+------------------------+