大数据技术之Hudi

第1章 Hudi概述

1.1 Hudi简介

Apache Hudi（Hadoop Upserts Delete and Incremental）是下一代流数据湖平台。Apache Hudi将核心仓库和数据库功能直接引入数据湖。Hudi提供了表、事务、高效的upserts/delete、高级索引、流摄取服务、数据集群/压缩优化和并发，同时保持数据的开源文件格式。

Apache Hudi不仅非常适合于流工作负载，而且还允许创建高效的增量批处理管道。

Apache Hudi可以轻松地在任何云存储平台上使用。Hudi的高级性能优化，使分析工作负载更快的任何流行的查询引擎，包括Apache Spark、Flink、Presto、Trino、Hive等。

1.2 发展历史

2015 年：发表了增量处理的核心思想/原则（O'reilly 文章）。

2016 年：由 Uber 创建并为所有数据库/关键业务提供支持。

2017 年：由 Uber 开源，并支撑 100PB 数据湖。

2018 年：吸引大量使用者，并因云计算普及。

2019 年：成为 ASF 孵化项目，并增加更多平台组件。

2020 年：毕业成为 Apache 顶级项目，社区、下载量、采用率增长超过 10 倍。

2021 年：支持 Uber 500PB 数据湖，SQL DML、Flink 集成、索引、元服务器、缓存。

1.3 Hudi特性

可插拔索引机制支持快速Upsert/Delete。
支持增量拉取表变更以进行处理。
支持事务提交及回滚，并发控制。
支持Spark、Presto、Trino、Hive、Flink等引擎的SQL读写。
自动管理小文件，数据聚簇，压缩，清理。
流式摄入，内置CDC源和工具。
内置可扩展存储访问的元数据跟踪。
向后兼容的方式实现表结构变更的支持。

1.4 使用场景

1）近实时写入

减少碎片化工具的使用。
CDC 增量导入 RDBMS 数据。
限制小文件的大小和数量。

2）近实时分析

相对于秒级存储（Druid, OpenTSDB），节省资源。
提供分钟级别时效性，支撑更高效的查询。
Hudi作为lib，非常轻量。

3）增量 pipeline

区分arrivetime和event time处理延迟数据。
更短的调度interval减少端到端延迟（小时 -> 分钟） => Incremental Processing。

4）增量导出

替代部分Kafka的场景，数据导出到在线服务存储 e.g. ES。

第2章编译安装

2.1 编译环境准备

本次的相关组件版本如下：

Hadoop	3.1.3
Hive	3.1.2
Flink	1.13.6，scala-2.12
Spark	3.2.2，scala-2.12

1）安装Maven

（1）上传apache-maven-3.6.1-bin.tar.gz到/opt/soft目录，并解压更名

[hadoop@hadoop1 hudi]$ tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/mod/
mv apache-maven-3.6.1 maven-3.6.1

（2）添加环境变量到/etc/profile.d/my_env.sh中

[hadoop@hadoop1 mod]$ sudo vi /etc/profile.d/my_env.sh
#MAVEN_HOME
export MAVEN_HOME=/opt/mod/maven-3.6.1
export PATH=$PATH:$MAVEN_HOME/bin

（3）测试安装结果

[hadoop@hadoop1 mod]$ source /etc/profile.d/my_env.sh 
[hadoop@hadoop1 mod]$ mvn -v

2）修改为阿里镜像

（1）修改setting.xml，指定为阿里仓库地址

[hadoop@hadoop1 mod]$ vi /opt/mod/maven-3.6.1/conf/settings.xml<!-- 添加阿里云镜像-->
<mirror><id>nexus-aliyun</id><mirrorOf>central</mirrorOf><name>Nexus aliyun</name><url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

2.2 编译Hudi

2.2.1 上传源码包

将hudi-0.12.0.src.tgz上传到/opt/soft，并解压

[hadoop@hadoop1 mod]$ tar -zxvf /opt/soft/hudi/hudi-0.12.0.src.tgz -C /opt/soft

2.2.2 修改pom文件

[hadoop@hadoop1 mod]$ vi /opt/soft/hudi-0.12.0/pom.xml

1）新增repository加速依赖下载

<repository><id>nexus-aliyun</id><name>nexus-aliyun</name><url>http://maven.aliyun.com/nexus/content/groups/public/</url><releases><enabled>true</enabled></releases><snapshots><enabled>false</enabled></snapshots>
</repository>

2）修改依赖的组件版本

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>

2.2.3 修改源码兼容hadoop3

Hudi默认依赖的hadoop2，要兼容hadoop3，除了修改版本，还需要修改如下代码：

[hadoop@hadoop1 mod]$ vi /opt/soft/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

修改第110行，原先只有一个参数，添加第二个参数null：

2.2.4 手动安装Kafka依赖

有几个kafka的依赖需要手动安装，否则编译报错如下：

1）下载jar包

通过网址下载：http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip

解压后找到以下jar包，上传服务器hadoop1

common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar

2）install到maven本地仓库

[hadoop@hadoop1 hudi]$ pwd
/opt/soft/hudi[hadoop@hadoop1 hudi]$ mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar[hadoop@hadoop1 hudi]$ mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar[hadoop@hadoop1 hudi]$ mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar[hadoop@hadoop1 hudi]$ mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar

2.2.5 解决spark模块依赖冲突

修改了Hive版本为3.1.2，其携带的jetty是0.9.3，hudi本身用的0.9.4，存在依赖冲突。

1）修改hudi-spark-bundle的pom文件，排除低版本jetty，添加hudi指定版本的jetty:

vi /opt/soft/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml

在382行的位置，修改如下：

<!-- Hive --><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-service</artifactId><version>${hive.version}</version><scope>${spark.bundle.hive.scope}</scope><exclusions><exclusion><artifactId>guava</artifactId><groupId>com.google.guava</groupId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.pentaho</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-service-rpc</artifactId><version>${hive.version}</version><scope>${spark.bundle.hive.scope}</scope></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version><scope>${spark.bundle.hive.scope}</scope><exclusions><exclusion><groupId>javax.servlet</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>javax.servlet.jsp</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-metastore</artifactId><version>${hive.version}</version><scope>${spark.bundle.hive.scope}</scope><exclusions><exclusion><groupId>javax.servlet</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.datanucleus</groupId><artifactId>datanucleus-core</artifactId></exclusion><exclusion><groupId>javax.servlet.jsp</groupId><artifactId>*</artifactId></exclusion><exclusion><artifactId>guava</artifactId><groupId>com.google.guava</groupId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-common</artifactId><version>${hive.version}</version><scope>${spark.bundle.hive.scope}</scope><exclusions><exclusion><groupId>org.eclipse.jetty.orbit</groupId><artifactId>javax.servlet</artifactId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions>
</dependency><!-- 增加hudi配置版本的jetty --><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-server</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-util</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-webapp</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-http</artifactId><version>${jetty.version}</version></dependency>

否则在使用spark向hudi表插入数据时，会报错如下：

java.lang.NoSuchMethodError: org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V

2）修改hudi-utilities-bundle的pom文件，排除低版本jetty，添加hudi指定版本的jetty:

[hadoop@hadoop1 hudi]$ vi /opt/soft/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml在405行的位置，修改如下 ：
<!-- Hoodie --><dependency><groupId>org.apache.hudi</groupId><artifactId>hudi-common</artifactId><version>${project.version}</version><exclusions><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><dependency><groupId>org.apache.hudi</groupId><artifactId>hudi-client-common</artifactId><version>${project.version}</version><exclusions><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><!-- Hive --><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-service</artifactId><version>${hive.version}</version><scope>${utilities.bundle.hive.scope}</scope><exclusions><exclusion><artifactId>servlet-api</artifactId><groupId>javax.servlet</groupId></exclusion><exclusion><artifactId>guava</artifactId><groupId>com.google.guava</groupId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.pentaho</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-service-rpc</artifactId><version>${hive.version}</version><scope>${utilities.bundle.hive.scope}</scope></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version><scope>${utilities.bundle.hive.scope}</scope><exclusions><exclusion><groupId>javax.servlet</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>javax.servlet.jsp</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-metastore</artifactId><version>${hive.version}</version><scope>${utilities.bundle.hive.scope}</scope><exclusions><exclusion><groupId>javax.servlet</groupId><artifactId>*</artifactId></exclusion><exclusion><groupId>org.datanucleus</groupId><artifactId>datanucleus-core</artifactId></exclusion><exclusion><groupId>javax.servlet.jsp</groupId><artifactId>*</artifactId></exclusion><exclusion><artifactId>guava</artifactId><groupId>com.google.guava</groupId></exclusion></exclusions></dependency><dependency><groupId>${hive.groupid}</groupId><artifactId>hive-common</artifactId><version>${hive.version}</version><scope>${utilities.bundle.hive.scope}</scope><exclusions><exclusion><groupId>org.eclipse.jetty.orbit</groupId><artifactId>javax.servlet</artifactId></exclusion><exclusion><groupId>org.eclipse.jetty</groupId><artifactId>*</artifactId></exclusion></exclusions>
</dependency><!-- 增加hudi配置版本的jetty --><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-server</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-util</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-webapp</artifactId><version>${jetty.version}</version></dependency><dependency><groupId>org.eclipse.jetty</groupId><artifactId>jetty-http</artifactId><version>${jetty.version}</version></dependency>

否则在使用DeltaStreamer工具向hudi表插入数据时，也会报Jetty的错误。

2.2.6 执行编译命令

mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3

2.2.7 编译成功

编译成功后，进入hudi-cli说明成功：

[hadoop@hadoop1 hudi-0.12.0]$ pwd
/opt/soft/hudi-0.12.0
[hadoop@hadoop1 hudi-0.12.0]$ hudi-cli/hudi-cli.sh

编译完成后，相关的包在packaging目录的各个模块中：

[hadoop@hadoop1 packaging]$ pwd
/opt/soft/hudi-0.12.0/packaging[hadoop@hadoop1 packaging]$ ls
hudi-aws-bundle           hudi-flink-bundle  hudi-hadoop-mr-bundle  hudi-integ-test-bundle     hudi-presto-bundle  hudi-timeline-server-bundle  hudi-utilities-bundle       README.md
hudi-datahub-sync-bundle  hudi-gcp-bundle    hudi-hive-sync-bundle  hudi-kafka-connect-bundle  hudi-spark-bundle   hudi-trino-bundle            hudi-utilities-slim-bundle

比如，flink与hudi的包

[hadoop@hadoop1 target]$ pwd
/opt/soft/hudi-0.12.0/packaging/hudi-flink-bundle/target[hadoop@hadoop1 target]$ ls
classes generated-sources hudi-flink1.13-bundle-0.12.0-sources.jar maven-shared-archive-resources original-hudi-flink1.13-bundle-0.12.0.jar rat.txt dependency-reduced-pom.xml  hudi-flink1.13-bundle-0.12.0.jar  maven-archiver maven-status                    original-hudi-flink1.13-bundle-0.12.0-sources.jar  test-classes