Eddie昌的博客 Eddie昌的博客
首页
  • 数据理论

    • 《数据仓库工具箱》
    • 《阿里巴巴大数据之路》
    • 《DAMA数据治理》
  • 数据实践

    • TypeScript
  • 数据分析1
  • 数据分析2
  • Hadoop生态
  • Linux
  • Git
  • 爱SQL
  • 学习
  • 面试
  • 心情杂货
  • 实用技巧
  • 分类
  • 标签
  • 归档

Eddie昌

数据领域小学生
首页
  • 数据理论

    • 《数据仓库工具箱》
    • 《阿里巴巴大数据之路》
    • 《DAMA数据治理》
  • 数据实践

    • TypeScript
  • 数据分析1
  • 数据分析2
  • Hadoop生态
  • Linux
  • Git
  • 爱SQL
  • 学习
  • 面试
  • 心情杂货
  • 实用技巧
  • 分类
  • 标签
  • 归档
  • Hadoop生态

  • Linux

  • Git

  • 爱SQL

    • 连续登录问题
      • 1. 题目
      • 2. 解题思路
      • 3. 变化1:求用户最大连续登录天数
      • 4. 变化2:求用户连续登录的日期段
    • 求股票波峰波谷
  • 技术
  • 爱SQL
Eddie昌
2024-04-07
目录

连续登录问题原创

# 连续登录问题

# 1. 题目

有一个用户登录表dayly_sql.user_login_log,记录用户登录ID及登录日期,求最近7天内连续登录3天以上的客户清单,表结构及示例数据如下:

User_id Login_Date
000001 2024-03-29
000001 2024-03-31
000001 2024-04-01
000001 2024-04-02
000001 2024-04-02
000001 2024-04-04
000001 2024-04-05
000002 2024-04-04
000002 2024-04-05
000002 2024-04-06
000002 2024-04-07
000002 2024-04-01
000002 2024-04-02
000002 2024-04-04

# 2. 解题思路

  1. 对表按用户及登录去重并限制最近7天(假设今天是2024-04-07);
  2. 使用row_number按User_id分组,按Login_Date降序,得到对应序号RN;
  3. 将Login_Date减去RN,得到DATE2:dateidff(Login_Date,rn);
  4. 按User_id、DATE2分组,取count(1) >=3的即为最近7天连续登录3天以上用户。

第一步:限制日期及去重程:

SELECT  user_id,login_date
FROM    dayly_sql.user_login_log
WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date;
1
2
3
4

得到以下数据集:

User_id Login_Date
000001 2024-03-31
000001 2024-04-01
000001 2024-04-02
000001 2024-04-04
000001 2024-04-05
000002 2024-04-01
000002 2024-04-02
000002 2024-04-04
000002 2024-04-05
000002 2024-04-06
000002 2024-04-07

第二步:日期排序及算出date2重点是date_sub(t2.login_date,t2.rn) as date2

SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
    (
       SELECT  t1.user_id, t1.login_date
                ,row_number() over (partition by user_id order by t1.login_date ) as rn
        FROM
            (
                SELECT  user_id,login_date
                FROM    dayly_sql.user_login_log
                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                GROUP BY user_id,login_date
            )t1
    )t2
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14

得到以下数据集:

user_id login_date date2 rn
000001 2024-03-31 2024-03-30 1
000001 2024-04-01 2024-03-30 2
000001 2024-04-02 2024-03-30 3
000001 2024-04-04 2024-03-31 4
000001 2024-04-05 2024-03-31 5
000002 2024-04-01 2024-03-31 1
000002 2024-04-02 2024-03-31 2
000002 2024-04-04 2024-04-01 3
000002 2024-04-05 2024-04-01 4
000002 2024-04-06 2024-04-01 5
000002 2024-04-07 2024-04-01 6

第三步:按user_id,date2分组,count(*) >= 3 的即为连续登录天数大于等于3的用户跟对应的登录前1天:

SELECT t3.user_id,t3.date2,count(*) as dm
FROM
    (
        SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
        FROM
            (
               SELECT  t1.user_id, t1.login_date
                        ,row_number() over (partition by user_id order by t1.login_date ) as rn
                FROM
                    (
                        SELECT  user_id,login_date
                        FROM    dayly_sql.user_login_log
                        WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                        GROUP BY user_id,login_date
                    )t1
            )t2
    )t3
GROUP BY t3.user_id, t3.date2
having count(*) >= 3
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

得到结果,其中date2是连续登录的前1天:

user_id date2 dm
000001 2024-03-30 3
000002 2024-04-01 4

# 3. 变化1:求用户最大连续登录天数

求某用户最近7天最大连续登录天数。

思路:先限制用户ID,然后去重。后面计算跟求所有用户连续登录天数一样,再取max(dm)即可

SELECT t4.user_id,max(t4.dm) as dm
FROM
    (
        SELECT t3.user_id,t3.date2,count(*) as dm
        FROM
            (
                SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
                FROM
                    (
                       SELECT  t1.user_id, t1.login_date
                                ,row_number() over (partition by user_id order by t1.login_date ) as rn
                        FROM
                            (
                                SELECT  user_id,login_date
                                FROM    dayly_sql.user_login_log
                                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                                AND     user_id = '000001'
                                GROUP BY user_id,login_date
                            )t1
                    )t2
            )t3
        GROUP BY t3.user_id, t3.date2
    )t4
GROUP BY t4.user_id
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

得到连续登录天数:

user_id dm
000001 3

# 4. 变化2:求用户连续登录的日期段

第一步:按上面步骤,日期排序及算出date2

SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
    (
       SELECT  t1.user_id, t1.login_date
                ,row_number() over (partition by user_id order by t1.login_date ) as rn
        FROM
            (
                SELECT  user_id,login_date
                FROM    dayly_sql.user_login_log
                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                GROUP BY user_id,login_date
            )t1
    )t2
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14

得到以下数据集:

user_id login_date date2 rn
000001 2024-03-31 2024-03-30 1
000001 2024-04-01 2024-03-30 2
000001 2024-04-02 2024-03-30 3
000001 2024-04-04 2024-03-31 4
000001 2024-04-05 2024-03-31 5
000002 2024-04-01 2024-03-31 1
000002 2024-04-02 2024-03-31 2
000002 2024-04-04 2024-04-01 3
000002 2024-04-05 2024-04-01 4
000002 2024-04-06 2024-04-01 5
000002 2024-04-07 2024-04-01 6

第二步:算出用户连续登录天数

with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
    (
       SELECT  t1.user_id, t1.login_date
                ,row_number() over (partition by user_id order by t1.login_date ) as rn
        FROM
            (
                SELECT  user_id,login_date
                FROM    dayly_sql.user_login_log
                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                GROUP BY user_id,login_date
            )t1
    )t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
    tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT * FROM tmp_dm;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

得到以下数据集,其中date2为连续登录的前1天,dm为连续登录天数:

user_id date2 dm
000001 2024-03-31 2
000001 2024-03-30 3
000002 2024-03-31 2
000002 2024-04-01 4

第三步:第一步的结果tmp_rn按user_id,date2分组,取max(rn) as day_add_num,得出需要加的天数,然后关联tmp_dm:

with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
    (
       SELECT  t1.user_id, t1.login_date
                ,row_number() over (partition by user_id order by t1.login_date ) as rn
        FROM
            (
                SELECT  user_id,login_date
                FROM    dayly_sql.user_login_log
                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                GROUP BY user_id,login_date
            )t1
    )t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
    tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT *
FROM
    (
        SELECT user_id,date2,max(rn) as day_add_num FROM tmp_rn GROUP BY user_id,date2
    )t1
INNER JOIN
    (
       SELECT * FROM tmp_dm
    )t2
ON t1.user_id = t2.user_id AND t1.date2 = t2.date2
ORDER BY t1.user_id,t1.day_add_num
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

得出以下结果,其中date2为连续登录开始前1天,date_add(date2,day_add_num)为连续登录结束日期:

t1.user_id t1.date2 t1.day_add_num t2.user_id t2.date2 dm
000001 2024-03-30 3 000001 2024-03-30 3
000001 2024-03-31 5 000001 2024-03-31 2
000002 2024-03-31 2 000002 2024-03-31 2
000002 2024-04-01 6 000002 2024-04-01 4

第四步:算出连续登录开始日期及结束日期

with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
    (
       SELECT  t1.user_id, t1.login_date
                ,row_number() over (partition by user_id order by t1.login_date ) as rn
        FROM
            (
                SELECT  user_id,login_date
                FROM    dayly_sql.user_login_log
                WHERE   login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
                GROUP BY user_id,login_date
            )t1
    )t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
    tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT   t1.user_id
        ,date_add(t1.date2,cast(1 AS INT)) AS start_date
        ,date_add(t1.date2,cast(t1.day_add_num AS INT)) AS end_date
FROM
    (
        SELECT user_id,date2,max(rn) as day_add_num FROM tmp_rn GROUP BY user_id,date2
    )t1
INNER JOIN
    (
       SELECT * FROM tmp_dm
    )t2
ON t1.user_id = t2.user_id AND t1.date2 = t2.date2
ORDER BY t1.user_id,date_add(t1.date2,cast(1 AS INT))
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

结果:

user_id start_date end_date
000001 2024-03-31 2024-04-02
000001 2024-04-01 2024-04-05
000002 2024-04-01 2024-04-02
000002 2024-04-02 2024-04-07
#SQL#Hive
git设置用户邮箱并关联github账号
求股票波峰波谷

← git设置用户邮箱并关联github账号 求股票波峰波谷→

最近更新
01
06.搭建一套源生hadoop、Spark、Flink集群3-Hive安装 原创
04-26
02
搭建一套源生Hadoop、Spark、Flink集群1-环境准备 原创
04-25
03
搭建一套源生hadoop、Spark、Flink集群2-Hadoop安装 原创
04-25
更多文章>
Theme by Vdoing | Copyright © 2019-2024 Evan Xu | MIT License | 粤ICP备2023070487号-1 | 粤公网安备44200102445447
  • 跟随系统
  • 浅色模式
  • 深色模式
  • 阅读模式