接上文:一文说清flink从编码到部署上线
网上关于flink sink drois的例子较多,大部分不太全面,故本文详细说明,且提供完整代码。
flink doris版本对照表
1.添加依赖
<!--doris cdc--><!-- 参考:"https://doris.apache.org/zh-CN/docs/ecosystem/flink-doris-connector"版本对照表。到"https://repo.maven.apache.org/maven2/org/apache/doris/"下载对应版本的jar--><!--mvn install:install-file -Dfile=D:/maven/flink-doris-connector-1.14_2.11-1.1.1.jar -DgroupId=com.flink.doris -DartifactId=flink-doris-connector-1.14_2.11 -Dversion=1.1.1 -Dpackaging=jar--><dependency><groupId>com.flink.doris</groupId><artifactId>flink-doris-connector-1.14_2.11</artifactId><version>1.1.1</version></dependency>
2.运行环境工具类
EnvUtil具体实现如下:
package com.zl.utils;import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;import java.time.Duration;
import java.time.ZoneOffset;
import java.util.concurrent.TimeUnit;/*** EnvUtil* @description:*/
public class EnvUtil {/*** 设置flink执行环境* @param parallelism 并行度*/public static StreamExecutionEnvironment setFlinkEnv(int parallelism) {// System.setProperty("HADOOP_USER_NAME", "用户名") 对应的是 hdfs文件系统目录下的路径:/user/用户名的文件夹名,本文为rootSystem.setProperty("HADOOP_USER_NAME", "root");Configuration conf = new Configuration();conf.setInteger("rest.port", 1000);StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);if (parallelism >0 ){//设置并行度env.setParallelism(parallelism);} else {env.setParallelism(1);// 默认1}// 添加重启机制
// env.setRestartStrategy(RestartStrategies.fixedDelayRestart(50, Time.minutes(6)));// 没有这个配置,会导致“Flink 任务没报错,但是无法同步数据到doris”。// 启动checkpoint,设置模式为精确一次 (这是默认值),10*60*1000=60000env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);//rocksdb状态后端,启用增量checkpointenv.setStateBackend(new EmbeddedRocksDBStateBackend(true));//设置checkpoint路径CheckpointConfig checkpointConfig = env.getCheckpointConfig();// 同一时间只允许一个 checkpoint 进行(默认)checkpointConfig.setMaxConcurrentCheckpoints(1);//最小间隔,10*60*1000=60000checkpointConfig.setMinPauseBetweenCheckpoints(60000);// 取消任务后,checkpoint仍然保存checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);//checkpoint容忍失败的次数checkpointConfig.setTolerableCheckpointFailureNumber(5);//checkpoint超时时间 默认10分钟checkpointConfig.setCheckpointTimeout(TimeUnit.MINUTES.toMillis(10));//禁用operator chain(方便排查反压)env.disableOperatorChaining();return env;}
}
3.CDC实现
相关sql详见文末代码:“resources/doris”。
3.1 创建doris数据库脚本
CREATE DATABASE IF NOT EXISTS `flink_test`;
USE `flink_test`;DROP TABLE IF EXISTS `rv_table`;
CREATE TABLE `rv_table` (`dt` date NULL COMMENT '分区时间',`uuid` varchar(30) NULL COMMENT '',`report_time` bigint(20) NULL COMMENT '过车时间'
) ENGINE=OLAP
DUPLICATE KEY(`dt`, `uuid`)
COMMENT 'RV 数据'
PARTITION BY RANGE(`dt`)
()
DISTRIBUTED BY HASH(`uuid`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"is_being_synced" = "false",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-2147483648",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p_rv_table",
"dynamic_partition.replication_allocation" = "tag.location.default: 1",
"dynamic_partition.buckets" = "32",
"dynamic_partition.create_history_partition" = "false",
"dynamic_partition.history_partition_num" = "-1",
"dynamic_partition.hot_partition_num" = "0",
"dynamic_partition.reserved_history_periods" = "NULL",
"dynamic_partition.storage_policy" = "",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false"
);
3.2 创建mysql数据库脚本
CREATE DATABASE IF NOT EXISTS `flink_test1`;
USE `flink_test1`;SET FOREIGN_KEY_CHECKS=0;-- ----------------------------
-- Table structure for rv_table
-- ----------------------------
DROP TABLE IF EXISTS `rv_table`;
CREATE TABLE `rv_table` (`dt` varchar(10) NOT NULL,`uuid` varchar(30) DEFAULT NULL,`report_time` bigint(20) DEFAULT NULL,PRIMARY KEY (`dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
注意:“dt”是varchar类型,开始尝试使用“date”类型,运行正常。但是本来是“2024-12-20”结果到doris是“2020-07-09”,由于不在动态分区,保存失败。后续继续定位……
3.3 核心代码实现
package com.zl.doris;import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.JsonDebeziumSchemaSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;public class DorisExampleCDC {public static void main(String[] args) throws Exception {// 配置运行环境,并行度1StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);// 程序间隔离,每个程序单独设置env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleCDC");env.setRuntimeMode(RuntimeExecutionMode.STREAMING);// 跟DataSteam、RowData不一样。DorisOptions.Builder dorisOptions = DorisOptions.builder();dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT.setTableIdentifier("flink_test.rv_table")// db.table.setUsername("root")// root.setPassword("pwd");// passwordProperties properties = new Properties();// 上游是 json 写入时,需要开启配置properties.setProperty("format", "json");properties.setProperty("read_json_by_line", "true");DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();executionBuilder.setLabelPrefix("doris_example_cdc3") //streamload label prefix.setDeletable(true)// 此处配置与DataSteam、RowData不同.setStreamLoadProp(properties);DorisSink.Builder<String> builder = DorisSink.builder();builder.setDorisReadOptions(DorisReadOptions.builder().build()).setDorisExecutionOptions(executionBuilder.build()).setDorisOptions(dorisOptions.build()).setSerializer(JsonDebeziumSchemaSerializer.builder().setDorisOptions(dorisOptions.build()).build());/*// 这种写法,整个运行过程都正常,但是最后就是没把数据存储到doris,当然doris官方示例,也没这么写。具体原因后面再找……JSONObject rvJsonObject = new JSONObject();rvJsonObject.put("dt","2024-12-21");rvJsonObject.put("uuid","cdc-2");rvJsonObject.put("report_time",1733881971621L);env.fromElements(JSONObject.toJSONString(rvJsonObject)).sinkTo(builder.build()).name("doris_sink").uid("doris_sink");*//// mysql sourceList<String> SYNC_TABLES = Arrays.asList("flink_test.rv_table");MySqlSource<String> mySqlSource = MySqlSource.<String>builder().hostname("10.86.37.169").port(3306).databaseList("flink_test").tableList(String.join(",", SYNC_TABLES)).username("root").password("pwd")// 记得修改为实际密码.startupOptions(StartupOptions.initial()).deserializer(new JsonDebeziumDeserializationSchema()).build();env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source").sinkTo(builder.build()).name("doris_sink").uid("doris_sink");env.execute("dorisSinkJob");}// main}
3.4 运行效果
控制台输出,如下图所示:
flink web UI:
cdc成功后mysql与doris数据一致如下图所示:
4.DataSteam实现
注意:相关数据库脚本,参考CDC部分说明。
package com.zl.doris;import com.alibaba.fastjson.JSONObject;
import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.SimpleStringSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;import java.time.LocalDate;
import java.util.Properties;public class DorisExampleDataSteam {public static void main(String[] args) throws Exception {// 配置运行环境,并行度1StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);// 程序间隔离,每个程序单独设置env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleDataSteam");env.setRuntimeMode(RuntimeExecutionMode.BATCH);// 没有这个配置,会导致“Flink 任务没报错,但是无法同步数据到doris”。DorisOptions.Builder dorisOptions = DorisOptions.builder();dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT.setTableIdentifier("flink_test.rv_table")// db.table.setUsername("root")// root.setPassword("pwd");// passwordProperties properties = new Properties();// 上游是 json 写入时,需要开启配置properties.setProperty("format", "json");properties.setProperty("read_json_by_line", "true");DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();executionBuilder.setLabelPrefix("doris_example_datasteam") //streamload label prefix.setDeletable(false).setStreamLoadProp(properties);DorisSink.Builder<String> builder = DorisSink.builder();builder.setDorisReadOptions(DorisReadOptions.builder().build()).setDorisExecutionOptions(executionBuilder.build()).setSerializer(new SimpleStringSerializer()) //serialize according to string.setDorisOptions(dorisOptions.build());JSONObject rvJsonObject = new JSONObject();rvJsonObject.put("dt","2024-12-20");// 日期取当天rvJsonObject.put("uuid","data-stream-1");rvJsonObject.put("report_time",1733881971621L);env.fromElements(JSONObject.toJSONString(rvJsonObject)).sinkTo(builder.build()).name("doris_sink").uid("doris_sink");env.execute("dorisSinkJob");}// main}
5.RowData实现
注意:相关数据库脚本,参考CDC部分说明。
package com.zl.doris;import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.RowDataSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.types.DataType;import java.time.LocalDate;
import java.util.Properties;public class DorisExampleRowData {public static void main(String[] args) throws Exception {// 配置运行环境,并行度1StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);// 程序间隔离,每个程序单独设置env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleRowData");env.setRuntimeMode(RuntimeExecutionMode.BATCH);// 没有这个配置,会导致“Flink 任务没报错,但是无法同步数据到doris”。DorisOptions.Builder dorisOptions = DorisOptions.builder();dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT.setTableIdentifier("flink_test.rv_table")// db.table.setUsername("root")// root.setPassword("pwd");// passwordProperties properties = new Properties();// 上游是 json 写入时,需要开启配置properties.setProperty("format", "json");properties.setProperty("read_json_by_line", "true");DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();executionBuilder.setLabelPrefix("doris_example_rowdata") //streamload label prefix.setDeletable(false).setStreamLoadProp(properties);String[] fields = {"dt", "uuid", "report_time"};DataType[] types = {DataTypes.DATE(),DataTypes.VARCHAR(30), DataTypes.BIGINT() };DorisSink.Builder<RowData> builder = DorisSink.builder();builder.setDorisReadOptions(DorisReadOptions.builder().build()).setDorisExecutionOptions(executionBuilder.build()).setSerializer(RowDataSerializer.builder() //serialize according to rowdata.setFieldNames(fields).setType("json") //json format.setFieldType(types).build()).setDorisOptions(dorisOptions.build());DataStream<RowData> source = env.fromElements("").map(new MapFunction<String, RowData>() {@Overridepublic RowData map(String value) throws Exception {GenericRowData genericRowData = new GenericRowData(3);Long myDate= LocalDate.of(2024,12,21).toEpochDay();// 日期取当天genericRowData.setField(0, myDate.intValue());// 计算指定日期与1970-01-01相差的天数genericRowData.setField(1, StringData.fromString("row-data-1"));genericRowData.setField(2, 1733881971621L);return genericRowData;}});source.sinkTo(builder.build()).setParallelism(1).name("doris_sink").uid("doris_sink").name("doris_sink").uid("doris_sink");env.execute("dorisSinkJob");}// main}
6.完整代码
完整代码见:完整代码
官方文档:flink-doris-connector
首先还是要参考官方文档,其次是照着官方文档多尝试。不同产品、不同章节,编写文档水平、代码风格还是有些差异,多体会其中核心思想。
7.遇到问题
7.1 问题1
遇到问题:整个执行过程都很正常,但是最后表里面没有数据。
问题原因:插入的数据不属于任何分区。
解决方法:注意本实例中是以日期(dt)作为分区字段的,所以dt要是具体的日期,比如“2024-12-20”。
7.2 问题2
遇到问题:整个执行过程都很正常,但是最后表里面没有数据。
官方文档:Connector1.1.0 版本以前,是攒批写入的,写入均是由数据驱动,需要判断上游是否有数据写入。1.1.0 之后,依赖 Checkpoint,必须开启 Checkpoint 才能写入。
其实还要:env.setRuntimeMode(RuntimeExecutionMode.BATCH)。
7.3 问题3
问题:DorisRuntimeException: Failed to get backend via
解决:doris连接信息(ip、端口、账号、密码)错误导致。
7.4 问题4
问题:DorisRuntimeException: stream load error: [INTERNAL_ERROR]too many filtered rows
原因:所在动态分区不存在。