Hive/Spark Window Functions

Window Functions

Hive documentation link
Spark documentation link

1. Functions supported with OVER

  • Sequential numbering
    Syntax: ROW_NUMBER
  • Ranking
    Syntax: RANK | DENSE_RANK | PERCENT_RANK
  • Bucketing within a partition, returning the number of the bucket each row falls in
    Syntax: NTILE(n)
  • Analytic functions
    Syntax: CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE
  • Aggregate functions
    Syntax: MAX | MIN | COUNT | SUM | AVG | …

1.1. Setup

Create a test table and insert the test data.

CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);
INSERT INTO employees VALUES
("Lisa", "Sales", 10000, 35)
,("Evan", "Sales", 32000, 38)
,("Fred", "Engineering", 21000, 28)
,("Alex", "Sales", 30000, 33)
,("Tom", "Engineering", 23000, 33)
,("Jane", "Marketing", 29000, 28)
,("Jeff", "Marketing", 35000, 38)
,("Paul", "Engineering", 29000, 23)
,("Chloe", "Engineering", 23000, 25);

1.2. row_number()

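row_number() over(): row_number is probably the most frequently used window function. It numbers the rows within each partition sequentially in ORDER BY order, producing values such as 1 2 3 4 5. Rows that tie on the ORDER BY key still get distinct numbers.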

select *,row_number() over(partition by dept order by salary) as rn from employees;
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 3   |
| Paul            | Engineering     | 29000             | 23             | 4   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

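1.3. Rank functions

rank assigns each row a rank within its partition according to the ORDER BY result.
rank() over(): ties share a rank and leave gaps after them, e.g. 1 2 2 4 5
dense_rank() over(): "dense" means ties share a rank with no gaps, e.g. 1 2 2 3 4
percent_rank() over(): the relative rank, computed as (rank - 1) / (rows in the partition - 1), where rank is the rank() value and starts from 1

Note: rows whose ORDER BY values are equal receive the same result from these functions. Equal sort keys mean equal sort positions, so the rank numbers are equal too.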
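The three ranking rules can be sketched in plain Python. This is only an illustration of the semantics, not Hive or Spark code; the function name rank_functions is hypothetical, and the input is assumed to be the ORDER BY keys of a single partition.

```python
def rank_functions(values):
    """Return (value, rank, dense_rank, percent_rank) for one partition.

    `values` are the ORDER BY keys of a single partition (assumption);
    they are sorted in ascending order here.
    """
    ordered = sorted(values)
    n = len(ordered)
    rows = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real key
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # rank jumps to the row position, leaving gaps after ties
            dense += 1       # dense_rank advances by one per distinct key
            previous = value
        # (rank - 1) / (rows in partition - 1); a single-row partition gets 0.0
        percent = (rank - 1) / (n - 1) if n > 1 else 0.0
        rows.append((value, rank, dense, percent))
    return rows


# Salaries of the Engineering department from the test data
for row in rank_functions([21000, 23000, 23000, 29000]):
    print(row)
```

The two 23000 rows share rank 2 under both rank and dense_rank, while the 29000 row gets rank 4 and dense_rank 3.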

select *,rank() over(partition by dept order by salary) as rn from employees;
-- In the result, Engineering rows with equal salary share a rank, and Paul gets 4
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 4   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,dense_rank() over(partition by dept order by salary) as rn from employees;
-- In the result, Engineering rows with equal salary share a rank, and Paul gets 3
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 3   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,percent_rank() over(partition by dept order by salary) as rn from employees;
-- rn is each row's relative rank within its partition; rows with equal ORDER BY values get the same result
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |         rn          |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.0                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Fred            | Engineering     | 21000             | 28             | 0.0                 |
| Tom             | Engineering     | 23000             | 33             | 0.3333333333333333  |
| Chloe           | Engineering     | 23000             | 25             | 0.3333333333333333  |
| Paul            | Engineering     | 29000             | 23             | 1.0                 |
| Lisa            | Sales           | 10000             | 35             | 0.0                 |
| Alex            | Sales           | 30000             | 33             | 0.5                 |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

-- When every ORDER BY value in a partition is equal (order by dept)
select *,rank() over(partition by dept order by dept) as rn from employees;
select *,dense_rank() over(partition by dept order by dept) as rn from employees;
-- Both queries return the same result
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Paul            | Engineering     | 29000             | 23             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 1   |
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 1   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Evan            | Sales           | 32000             | 38             | 1   |
| Alex            | Sales           | 30000             | 33             | 1   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,percent_rank() over(partition by dept order by dept) as rn from employees;
+-----------------+-----------------+-------------------+----------------+------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  rn  |
+-----------------+-----------------+-------------------+----------------+------+
| Fred            | Engineering     | 21000             | 28             | 0.0  |
| Tom             | Engineering     | 23000             | 33             | 0.0  |
| Paul            | Engineering     | 29000             | 23             | 0.0  |
| Chloe           | Engineering     | 23000             | 25             | 0.0  |
| Jane            | Marketing       | 29000             | 28             | 0.0  |
| Jeff            | Marketing       | 35000             | 38             | 0.0  |
| Lisa            | Sales           | 10000             | 35             | 0.0  |
| Evan            | Sales           | 32000             | 38             | 0.0  |
| Alex            | Sales           | 30000             | 33             | 0.0  |
+-----------------+-----------------+-------------------+----------------+------+

1.4. ntile(n)

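ntile(n) over(): "tile" literally means a tile. The function splits the rows of a partition into n buckets and returns the bucket number of each row.

Rows are dealt into the buckets as evenly as possible in ORDER BY order. If the row count is not divisible by the number of buckets, the leftover rows go one each to the lowest-numbered buckets, so with 4 rows and 3 buckets the first bucket holds 2 rows.

Use case: to get the top third of earners in each department, sort by salary in descending order, split into 3 buckets, and keep the rows in the first bucket.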
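The bucket assignment can be sketched in plain Python. This is a sketch of the distribution rule only; the function name ntile_buckets is hypothetical, and it takes just the row count of one partition, since the bucket number depends only on a row's position in ORDER BY order.

```python
def ntile_buckets(row_count, buckets):
    """Return the bucket number for each row position of one partition.

    Every bucket gets row_count // buckets rows; the remainder is handed
    out one row at a time to the lowest-numbered buckets.
    """
    base, extra = divmod(row_count, buckets)
    result = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result


print(ntile_buckets(4, 2))  # Engineering with NTILE(2)
print(ntile_buckets(4, 3))  # Engineering with NTILE(3)
print(ntile_buckets(2, 3))  # Marketing with NTILE(3): fewer rows than buckets
```

When a partition has fewer rows than buckets, the higher-numbered buckets simply stay empty, which is why Marketing only shows buckets 1 and 2 under NTILE(3).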

select *,NTILE(2) over(partition by dept order by salary) as rn from employees;
-- ntile(2) splits each partition into 2 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 2   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 1   |
| Evan            | Sales           | 32000             | 38             | 2   |
+-----------------+-----------------+-------------------+----------------+-----+
select *,NTILE(3) over(partition by dept order by salary) as rn from employees;
-- split into 3 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 3   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

1.5. cume_dist()

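cume_dist stands for cumulative distribution.
It returns, for each row, the fraction of rows in the partition whose ORDER BY value is less than or equal to the current row's value.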
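As a rough illustration, the calculation can be written in plain Python. The function name cume_dist_values is hypothetical and the input is assumed to be the ORDER BY keys of one partition; this is a sketch of the semantics, not engine code.

```python
def cume_dist_values(values):
    """Return cume_dist for each ORDER BY key of one partition, in sorted order.

    cume_dist = (rows with key <= current key) / (rows in the partition)
    """
    ordered = sorted(values)
    n = len(ordered)
    return [sum(1 for other in ordered if other <= value) / n for value in ordered]


print(cume_dist_values([21000, 23000, 23000, 29000]))  # Engineering salaries
print(cume_dist_values([10000, 30000, 32000]))         # Sales salaries
```

Because tied rows count each other, both 23000 rows in Engineering get 3/4 rather than 2/4 and 3/4.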

select *,cume_dist() over(partition by dept order by salary) from employees;
-- In the Engineering partition, Tom and Chloe have the same salary, so their cume_dist values are equal
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.5                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Fred            | Engineering     | 21000             | 28             | 0.25                |
| Tom             | Engineering     | 23000             | 33             | 0.75                |
| Chloe           | Engineering     | 23000             | 25             | 0.75                |
| Paul            | Engineering     | 29000             | 23             | 1.0                 |
| Lisa            | Sales           | 10000             | 35             | 0.3333333333333333  |
| Alex            | Sales           | 30000             | 33             | 0.6666666666666666  |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

select *,cume_dist() over(partition by dept order by age) from employees;
-- Ordered by age, Tom and Chloe have different ages, so their cume_dist values differ
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.5                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Paul            | Engineering     | 29000             | 23             | 0.25                |
| Chloe           | Engineering     | 23000             | 25             | 0.5                 |
| Fred            | Engineering     | 21000             | 28             | 0.75                |
| Tom             | Engineering     | 23000             | 33             | 1.0                 |
| Alex            | Sales           | 30000             | 33             | 0.3333333333333333  |
| Lisa            | Sales           | 10000             | 35             | 0.6666666666666666  |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

1.6. lag和lead

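lag means to fall behind: it reads a row that comes before the current row in ORDER BY order.
lead means to be ahead: it reads a row that comes after the current row in ORDER BY order.

Without a self-join, they return the value of a given column from the row num rows before/after the current row in ORDER BY order.

Parameters
LAG/LEAD(col, num, default_value): col is the column to read; num is the number of rows to shift by, 1 by default; default_value is returned when the shifted position falls before the first row or after the last row of the partition, NULL by default.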
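The parameter semantics can be sketched in plain Python. The helper names lag and lead below are hypothetical stand-ins that operate on a list already sorted in ORDER BY order for one partition; offset is assumed to be non-negative, and None plays the role of NULL.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` before the first row."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` past the last row."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# Engineering salaries, already in ORDER BY salary order
salaries = [21000, 23000, 23000, 29000]
print(lag(salaries))        # like LAG(salary)
print(lead(salaries))       # like LEAD(salary)
print(lag(salaries, 2, 0))  # like LAG(salary, 2, 0)
```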

SELECT *,LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS lead FROM employees;
-- In ORDER BY order, lag returns the value from the row before the current row and lead the value from the row after it; for the first or last row the result is NULL by default
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | NULL   | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | NULL   |
| Fred            | Engineering     | 21000             | 28             | NULL   | 23000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 23000  | 29000  |
| Paul            | Engineering     | 29000             | 23             | 23000  | NULL   |
| Lisa            | Sales           | 10000             | 35             | NULL   | 30000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 32000  |
| Evan            | Sales           | 32000             | 38             | 30000  | NULL   |
+-----------------+-----------------+-------------------+----------------+--------+--------+

SELECT *,LAG(salary,2,0) OVER (PARTITION BY dept ORDER BY salary) AS lag,LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS lead FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 0      | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 0      | 0      |
| Fred            | Engineering     | 21000             | 28             | 0      | 23000  |
| Tom             | Engineering     | 23000             | 33             | 0      | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 29000  |
| Paul            | Engineering     | 29000             | 23             | 23000  | 0      |
| Lisa            | Sales           | 10000             | 35             | 0      | 30000  |
| Alex            | Sales           | 30000             | 33             | 0      | 32000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 0      |
+-----------------+-----------------+-------------------+----------------+--------+--------+

-- When the ORDER BY column differs from the column being read
SELECT *,LAG(salary) OVER (PARTITION BY dept ORDER BY age) AS lag,LEAD(salary) OVER (PARTITION BY dept ORDER BY age) AS lead FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | NULL   | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | NULL   |
| Paul            | Engineering     | 29000             | 23             | NULL   | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 29000  | 21000  |
| Fred            | Engineering     | 21000             | 28             | 23000  | 23000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | NULL   |
| Alex            | Sales           | 30000             | 33             | NULL   | 10000  |
| Lisa            | Sales           | 10000             | 35             | 30000  | 32000  |
| Evan            | Sales           | 32000             | 38             | 10000  | NULL   |
+-----------------+-----------------+-------------------+----------------+--------+--------+

1.7. first_value和last_value

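Within a partition, these return the first or last value of the given column in ORDER BY order.
Note: by default last_value returns the last value in the frame that runs from the first row to the current row (including rows that tie with the current row on the ORDER BY key), not the last value of the whole sorted partition.

SELECT *,first_value(salary) OVER (PARTITION BY dept ORDER BY age) AS first,last_value(salary) OVER (PARTITION BY dept ORDER BY age) AS last FROM employees;
-- Note the values in the last column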
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Paul            | Engineering     | 29000             | 23             | 29000  | 29000  |
| Chloe           | Engineering     | 23000             | 25             | 29000  | 23000  |
| Fred            | Engineering     | 21000             | 28             | 29000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 29000  | 23000  |
| Alex            | Sales           | 30000             | 33             | 30000  | 30000  |
| Lisa            | Sales           | 10000             | 35             | 30000  | 10000  |
| Evan            | Sales           | 32000             | 38             | 30000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+

SELECT *,first_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS first,last_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS last FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Fred            | Engineering     | 21000             | 28             | 21000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 23000  |
| Paul            | Engineering     | 29000             | 23             | 21000  | 29000  |
| Lisa            | Sales           | 10000             | 35             | 10000  | 10000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 30000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
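The default behaviour of last_value noted in this section can be sketched in plain Python. The function name last_value and its whole_partition flag are hypothetical; rows are assumed to be (order_key, value) pairs of a single partition, and the default frame is assumed to be RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

```python
def last_value(rows, whole_partition=False):
    """Return last_value for each row of one partition, in ORDER BY order.

    whole_partition=False mimics the default frame (first row through the
    current row and its ties); whole_partition=True mimics a frame that
    ends at UNBOUNDED FOLLOWING.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    results = []
    for key, _ in ordered:
        # The frame ends at the last row whose key is <= the current key.
        # With tied keys, which tied row comes last is engine-dependent;
        # this sketch uses Python's stable sort order.
        frame = ordered if whole_partition else [row for row in ordered if row[0] <= key]
        results.append(frame[-1][1])
    return results


# Engineering as (age, salary) pairs, i.e. ORDER BY age, reading salary
engineering = [(28, 21000), (33, 23000), (23, 29000), (25, 23000)]
print(last_value(engineering))                        # default frame
print(last_value(engineering, whole_partition=True))  # whole partition
```

With the default frame every row simply sees its own salary, matching the ORDER BY age result table earlier in this section; only the whole-partition frame returns the salary of the oldest employee on every row.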

1.8. nth_value

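nth means "the n-th".

It returns the value of the given column from the N-th row, in ORDER BY order, of the partition's window frame.

Note: Hive does not provide this function.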

SELECT *,nth_value(salary,2) OVER (PARTITION BY dept ORDER BY salary) AS nth FROM employees;
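The query above has no result table, so here is a plain-Python sketch of what nth_value is expected to do. It assumes the default frame that applies when ORDER BY is present (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and that the ORDER BY column is also the column being read; the function name nth_value is a hypothetical stand-in, and None plays the role of NULL.

```python
def nth_value(values, n):
    """Return the n-th value of each row's frame for one partition.

    The frame is assumed to run from the first row through the current row
    and its ties; rows whose frame has fewer than n rows get None.
    """
    ordered = sorted(values)
    results = []
    for value in ordered:
        frame = [other for other in ordered if other <= value]
        results.append(frame[n - 1] if len(frame) >= n else None)
    return results


print(nth_value([21000, 23000, 23000, 29000], 2))  # Engineering
print(nth_value([29000, 35000], 2))                # Marketing
```

Under that assumption, the first row of each partition gets NULL, because its frame does not yet contain a second row.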

2. The OVER clause

  • In Hive, the OVER clause may contain only PARTITION BY, without ORDER BY. When ORDER BY is missing, the columns named in PARTITION BY are used as the sort order by default.
  • In Spark, omitting ORDER BY fails with: Error in query: Window function row_number() requires window to be ordered, please add ORDER BY clause

select *, row_number() over(partition by name) as rn from employees;
-- Result below (Hive): each name forms its own partition, so rn is 1 on every row
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Alex            | Sales           | 30000             | 33             | 1   |
| Evan            | Sales           | 32000             | 38             | 1   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 1   |
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 1   |
| Paul            | Engineering     | 29000             | 23             | 1   |
+-----------------+-----------------+-------------------+----------------+-----+

2.1. window specification

The OVER clause can carry a window specification (frame) in the following form:

Format:
(ROWS | RANGE) BETWEEN xxx AND xxx
Specifically, the following 3 forms are supported:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
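  • ROWS: within a partition, the frame is counted in physical rows in ORDER BY order, row by row.
  • RANGE: within a partition, the frame is defined by ORDER BY values. In the sorted sequence 1,2,3,3,4,4,5, the two 3s tie and the two 4s tie, so each pair is treated as a unit and every row in a pair gets the same window function result.
  • UNBOUNDED PRECEDING: no lower bound; the frame starts at the first row of the partition.
  • num PRECEDING: the frame starts num rows before the current row. With sum, each row's result is the total from num rows back through the current row.
  • CURRENT ROW: the row currently being processed (the cursor position during evaluation).
  • UNBOUNDED FOLLOWING: no upper bound; the frame ends at the last row of the partition.
  • num FOLLOWING: the frame ends num rows after the current row. With sum, each row's total extends through the next num rows.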
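The ROWS frame boundaries listed above can be sketched as a plain-Python sum over a sliding slice. The function name rows_frame_sum is hypothetical; values are assumed to be one partition already in ORDER BY order, and None stands for UNBOUNDED.

```python
def rows_frame_sum(values, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    preceding=None means UNBOUNDED PRECEDING, following=None means
    UNBOUNDED FOLLOWING, and 0 means CURRENT ROW.
    """
    n = len(values)
    sums = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        sums.append(sum(values[start:end]))
    return sums


salaries = [21000, 23000, 23000, 29000]       # Engineering, ordered by salary
print(rows_frame_sum(salaries, 1, 0))         # 1 PRECEDING AND CURRENT ROW
print(rows_frame_sum(salaries, 1, 1))         # 1 PRECEDING AND 1 FOLLOWING
print(rows_frame_sum(salaries, None, 0))      # UNBOUNDED PRECEDING AND CURRENT ROW
print(rows_frame_sum(salaries, None, None))   # whole partition
```

The frame is clipped at the partition edges, so the first row of a 1 PRECEDING frame only sums itself.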

Note: in Hive, the following functions do not support a window specification:
rank, dense_rank, percent_rank
cume_dist, ntile
lead, lag

2.1.1. Default with an explicit ORDER BY

When ORDER BY is given but no window specification, the default window specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

That is, rows are processed by ORDER BY value range, from the first row of the partition through the current row and its ties.

select *,sum(salary) over(partition by dept order by salary) as sum_salary from employees;
select *,sum(salary) over(partition by dept order by salary range between unbounded preceding and current row) as sum_salary from employees;
-- Both queries return the same result
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 29000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 21000       |
| Tom             | Engineering     | 23000             | 33             | 67000       |
| Chloe           | Engineering     | 23000             | 25             | 67000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 10000       |
| Alex            | Sales           | 30000             | 33             | 40000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

Note: Tom and Chloe in Engineering have the same sum_salary. Within the Engineering partition both have the same order by salary value, so under RANGE they are treated as one unit worth 46000. Both therefore get a running total of 67000 (21000 + 46000) rather than 44000 and 67000.

To accumulate row by row within a partition, specify the following frame explicitly in the SQL:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Tom and Chloe in Engineering then get sum_salary values of 44000 and 67000 respectively.

select *,sum(salary) over(partition by dept order by salary rows between unbounded preceding and current row) as sum_salary from employees;
-- Result below
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 29000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 21000       |
| Tom             | Engineering     | 23000             | 33             | 44000       |
| Chloe           | Engineering     | 23000             | 25             | 67000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 10000       |
| Alex            | Sales           | 30000             | 33             | 40000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+
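The difference between the two running totals can be sketched in plain Python. Both function names are hypothetical, and the input is assumed to be the ORDER BY keys of one partition, which are also the values being summed.

```python
def running_sum_rows(values):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: add one row at a time."""
    totals = []
    total = 0
    for value in sorted(values):
        total += value
        totals.append(total)
    return totals


def running_sum_range(values):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: include all ties."""
    ordered = sorted(values)
    # Every row whose key is <= the current key is in the frame,
    # so rows with equal keys always get the same total.
    return [sum(other for other in ordered if other <= value) for value in ordered]


engineering = [21000, 23000, 23000, 29000]
print(running_sum_rows(engineering))
print(running_sum_range(engineering))
```

The two versions differ only on the tied 23000 rows: 44000 and 67000 under ROWS, 67000 twice under RANGE.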

To sum all values in the partition on every row, specify the following frame explicitly in the SQL:
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

select *,sum(salary) over(partition by dept order by salary range between unbounded preceding and unbounded following) as sum_salary from employees;
-- Result below
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 64000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 96000       |
| Tom             | Engineering     | 23000             | 33             | 96000       |
| Chloe           | Engineering     | 23000             | 25             | 96000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 72000       |
| Alex            | Sales           | 30000             | 33             | 72000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

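Conclusions:

  • Case 1: when both partition by and order by are present, for functions such as MAX | MIN | COUNT | SUM | AVG,
    computing over the whole partition requires an explicit RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  • Case 2: to process row by row within the partition, specify ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW explicitly
  • Case 3: by default (no explicit window specification), rows with equal order by values get the same result.

2.1.2. Default without ORDER BY

When both order by and the window specification are missing from the OVER clause, the default window specification is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Conclusion

  • For functions such as MAX | MIN | COUNT | SUM | AVG, when order by is missing or is the same as partition by, the effect matches Case 1 above

select *,sum(salary) over(partition by dept) as sum_salary from employees;
-- Result below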
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Fred            | Engineering     | 21000             | 28             | 96000       |
| Tom             | Engineering     | 23000             | 33             | 96000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Chloe           | Engineering     | 23000             | 25             | 96000       |
| Jane            | Marketing       | 29000             | 28             | 64000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Lisa            | Sales           | 10000             | 35             | 72000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
| Alex            | Sales           | 30000             | 33             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

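In the result, employees of the same dept are not processed row by row; all rows of the dept are aggregated together. UNBOUNDED FOLLOWING means the frame does not stop at the current row but extends to the last row of the partition, so the calculation covers the whole partition.

Open question: what role does ROWS play in ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING?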

3. Notes on last_value

The earlier section on last_value stressed that its result is not the last value of the partition. With the window specification covered above, look at its result again.

SELECT *,first_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS first,last_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS last FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Fred            | Engineering     | 21000             | 28             | 21000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 23000  |
| Paul            | Engineering     | 29000             | 23             | 21000  | 29000  |
| Lisa            | Sales           | 10000             | 35             | 10000  | 10000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 30000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+

In the result, last_value on every row is the current row's own value, not the last value of the partition in ascending salary order. The cause is the default window specification (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
To get the last value of the whole partition in ascending salary order, which is the maximum salary, explicitly set the window specification to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

select *
,first_value(salary) over(partition by dept order by salary) as first_salary
,last_value(salary) over(partition by dept order by salary range between unbounded preceding and unbounded following) as last_salary from employees;
-- Result below
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first_salary  | last_salary  |
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
| Jane            | Marketing       | 29000             | 28             | 29000         | 35000        |
| Jeff            | Marketing       | 35000             | 38             | 29000         | 35000        |
| Fred            | Engineering     | 21000             | 28             | 21000         | 29000        |
| Tom             | Engineering     | 23000             | 33             | 21000         | 29000        |
| Chloe           | Engineering     | 23000             | 25             | 21000         | 29000        |
| Paul            | Engineering     | 29000             | 23             | 21000         | 29000        |
| Lisa            | Sales           | 10000             | 35             | 10000         | 32000        |
| Alex            | Sales           | 30000             | 33             | 10000         | 32000        |
| Evan            | Sales           | 32000             | 38             | 10000         | 32000        |
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
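
The same per-department maximum can be reached in more than one way. The query below is a minimal sketch against the employees table from section 1.1; the column aliases last_by_rows, first_by_desc and max_salary are names chosen here for illustration and do not appear in the original. It assumes salary contains no NULLs, as in the sample data, since NULL ordering would otherwise affect the first_value/last_value variants.

```sql
-- Three ways to put each department's highest salary on every row.
SELECT name, dept, salary,
       -- 1. last_value with a frame that spans the whole partition (ROWS instead of RANGE)
       last_value(salary) OVER (PARTITION BY dept ORDER BY salary
                                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_by_rows,
       -- 2. reverse the sort order: the default frame always starts at the first row of the partition
       first_value(salary) OVER (PARTITION BY dept ORDER BY salary DESC) AS first_by_desc,
       -- 3. a plain aggregate over the partition; without ORDER BY the frame is the whole partition
       max(salary) OVER (PARTITION BY dept) AS max_salary
FROM employees;
```

On the sample data all three columns should match the last_salary column above (35000 for Marketing, 29000 for Engineering, 32000 for Sales). When only the maximum is needed, the third form is probably the easiest to read, because it does not depend on any frame clause.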
