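Writing top-N in HQL and in Spark

In HQL, top-N queries are written with the window functions rank(), row_number(), and dense_rank().

1. rank(): ranking with gaps. If the first and second rows tie, the third row gets rank 3.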

select
  *
from (
  select
    id,
    cn,
    score,
    rank() over (partition by id order by score desc) as ranks
  from topN
) A
where ranks < 5;


2. row_number(): consecutive numbering. Tied rows still get distinct numbers (1, 2, 3, ...).

select
  *
from (
  select
    id,
    cn,
    score,
    row_number() over (partition by id order by score desc) as ranks
  from topN
) A
where ranks < 5;


3. dense_rank(): ranking without gaps. If the first and second rows tie, the third row gets rank 2.

select
  *
from (
  select
    id,
    cn,
    score,
    dense_rank() over (partition by id order by score desc) as ranks
  from topN
) A
where ranks < 5;
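
To see the three functions side by side, here is a minimal sketch in plain Scala (no Hive or Spark needed) that reproduces their numbering on one partition. The object name RankingDemo and its helper functions are made up for illustration, and the input is assumed to be already sorted by score in descending order, as the order by clause would do.

```scala
object RankingDemo {
  // Input: the scores of one partition, already sorted descending
  // (the equivalent of "order by score desc").

  // row_number(): consecutive numbers, ties still get distinct values.
  def rowNumber(sorted: Seq[Int]): Seq[Int] =
    sorted.indices.map(_ + 1)

  // rank(): tied rows share a rank, and the next rank skips ahead.
  def rank(sorted: Seq[Int]): Seq[Int] =
    sorted.map(s => sorted.count(_ > s) + 1)

  // dense_rank(): tied rows share a rank, and no gap follows.
  def denseRank(sorted: Seq[Int]): Seq[Int] =
    sorted.map(s => sorted.filter(_ > s).distinct.size + 1)

  def main(args: Array[String]): Unit = {
    val scores = Seq(100, 99, 99, 80) // illustrative scores with one tie
    println("row_number: " + rowNumber(scores).mkString(", ")) // 1, 2, 3, 4
    println("rank:       " + rank(scores).mkString(", "))      // 1, 2, 2, 4
    println("dense_rank: " + denseRank(scores).mkString(", ")) // 1, 2, 2, 3
  }
}
```

With a filter like ranks < 5, row_number() always keeps at most 4 rows per id, while rank() and dense_rank() may keep more when scores tie.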


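Writing top-N in Spark

(1) Aggregate the data by key (groupByKey).
(2) Convert each key's values to a collection and sort it with Scala's sortBy or sortWith (mapValues). If the data volume is too large, this can cause an OOM.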

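Sample input (tab-separated: id, subject, score):

1	Math	100
1	Chinese	99
1	English	80
1	Physics	99
2	Math	99
2	Chinese	80
2	English	10
2	Physics	99
3	Math	100
3	Chinese	79
3	English	79
3	Physics	80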

package spark01

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object topN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("topN").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val linesRDD: RDD[String] = sc.textFile("D:\\薛艳春\\桌面\\大数据\\Spark\\topn.txt")

    // Parse each tab-separated line into (id, (subject, score)).
    val lineRDD: RDD[(Int, (String, Int))] = linesRDD.map(lines => {
      val strings: Array[String] = lines.split("\t")
      Tuple2(strings(0).toInt, Tuple2(strings(1), strings(2).toInt))
    })

    // Without forcing a single partition here, the printed output gets interleaved.
    val groupByKeyRDD: RDD[(Int, Iterable[(String, Int)])] = lineRDD.groupByKey(1)

    // Sort each key's values by score in descending order and keep the top 3.
    val reduceRDD: RDD[(Int, List[(String, Int)])] = groupByKeyRDD.map(css => {
      val key: Int = css._1
      val value: Iterable[(String, Int)] = css._2
      // Note: the comparison must be on Int. I used String at first and spent a long time finding the error.
      val list: List[(String, Int)] = value.toList.sortWith(_._2 > _._2).take(3)
      (key, list)
    })

    reduceRDD.foreach(v => {
      print(v._1 + ":")
      v._2.foreach(println)
    })

    sc.stop()
  }
}
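
One possible way around the OOM risk of groupByKey is to never hold more than N values per key. The sketch below is not the method used above. It shows, in plain Scala without Spark, two hypothetical functions seqOp and combOp that keep a bounded top-N buffer. The names BoundedTopN, Entry, and N are made up for illustration, and the two lists stand in for two partitions that hold records of the same key.

```scala
object BoundedTopN {
  type Entry = (String, Int) // (subject, score)
  val N = 3                  // how many entries to keep per key (assumed)

  // Fold one record into a buffer, keeping at most N entries.
  def seqOp(acc: List[Entry], e: Entry): List[Entry] =
    (e :: acc).sortBy(-_._2).take(N)

  // Merge two buffers, again keeping at most N entries.
  def combOp(a: List[Entry], b: List[Entry]): List[Entry] =
    (a ++ b).sortBy(-_._2).take(N)

  def main(args: Array[String]): Unit = {
    // Two hypothetical partitions holding records for the same key.
    val part1 = List(("Math", 100), ("Chinese", 99))
    val part2 = List(("English", 80), ("Physics", 99))

    val top = combOp(
      part1.foldLeft(List.empty[Entry])(seqOp),
      part2.foldLeft(List.empty[Entry])(seqOp))

    top.foreach(println) // the three highest-scoring entries
  }
}
```

In Spark, functions of this shape could be passed to aggregateByKey, for example lineRDD.aggregateByKey(List.empty[(String, Int)])(BoundedTopN.seqOp, BoundedTopN.combOp), assuming lineRDD has the type RDD[(Int, (String, Int))] as in the program above. Each partition then keeps at most N entries per key instead of the full value list.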