hive UDF函数取最新分区
1.pom文件
<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.6</version>
    <exclusions>
      <!-- Exclude log4j bindings to avoid classpath conflicts with the cluster's logging. -->
      <exclusion>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.10</version>
    <scope>test</scope>
  </dependency>
</dependencies>
<build>
  <!-- Plugins are declared directly under build/plugins. Declaring them only inside
       <pluginManagement> (as the original did) merely pins versions/configuration;
       executions are NOT bound, so the shade goal would never run during `mvn package`. -->
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.0</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.2</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <!-- Strip jar signature files so the shaded jar is not rejected with
                       "Invalid signature file digest". The original pattern
                       META-INF/*/RSA was a typo and matched nothing; it must be
                       META-INF/*.RSA. -->
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
</project>
2.代码
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;/**
* 该UDF函数取表最新分区,通过文件检索以降低能耗
* 参数 tableName :schema.table_name
* 返回 latesttPatition:最新分区名称
*/public class latest_partition extends UDF {public String evaluate(String tableName) {StringBuffer sb = new StringBuffer();String latesttPatition = null;// 获取shemaString split1 = tableName.split("\\.")[0];// 获取table_nameString split2 = tableName.split("\\.")[1];// 拼接路径String fileName = sb.append("/user/hive/warehouse/").append(split1).append(".db/").append(split2).toString();try{// 调用方法获取最新分区latesttPatition = getFileList(fileName);}catch (Exception e){System.out.println("获取结果异常" +e.getMessage());}return latesttPatition;}// 获取最新分区public static String getFileList(String path) throws Exception{String res = null;Configuration conf=new Configuration(false);conf.set("fs.default.name", "hdfs://hacluster/");FileSystem hdfs = FileSystem.get(URI.create(path),conf);FileStatus[] fs = hdfs.listStatus(new Path(path));Path[] listPath = FileUtil.stat2Paths(fs);List<String> list = new ArrayList();for(Path p : listPath){String s = p.toString();// hdfs上有可能有非分区文件,只处理分区文件if(s.contains("=")) {String partition = s.split("=")[1];list.add(partition);}}if(list.size() != 0) {res = Collections.max(list).toString();}return res;}}
大表直接查询最新分区(如 MAX(dt))往往需要全表扫描,可能耗时数小时;使用该函数通过检索 HDFS 文件目录,可实现秒级返回,性能可大幅提升。
-- 优化前sql查询语句(耗时特别久,全表扫描)
SELECT MAX(dt) as latest_dt FROM table_name;
-- 优化后(通过文件系统查询,数秒返回结果;假设已通过 CREATE FUNCTION 将上述 UDF 注册为 LST_DT)
SELECT LST_DT('schema.table_name');