Note: this article is based on Flink CDC version 3.0.
As mentioned in earlier articles, Flink CDC 3.0 is organized into four layers: the API layer, the Connector layer, the Composer layer (which builds synchronization jobs), and the Runtime layer. This article explores the API layer, specifically the flink-cdc-cli module, to see how a YAML configuration file is turned into a job entity and how the job is launched.
Overview
Directory structure of the flink-cdc-cli module
As you can see, the module contains six classes and one interface. From the previous article exploring the flink-cdc.sh script, we know the entry class is CliFrontend, so we will start from that class and work through the module step by step.
Entry class: CliFrontend
The main method
public static void main(String[] args) throws Exception {
    Options cliOptions = CliFrontendOptions.initializeOptions();
    CommandLineParser parser = new DefaultParser();
    CommandLine commandLine = parser.parse(cliOptions, args);

    // Help message
    if (args.length == 0 || commandLine.hasOption(CliFrontendOptions.HELP)) {
        HelpFormatter formatter = new HelpFormatter();
        formatter.setLeftPadding(4);
        formatter.setWidth(80);
        formatter.printHelp(" ", cliOptions);
        return;
    }

    // Create executor and execute the pipeline
    PipelineExecution.ExecutionInfo result = createExecutor(commandLine).run();

    // Print execution result
    printExecutionInfo(result);
}
It first initializes the options. This uses Apache Commons CLI, a utility library that makes handling command line arguments straightforward.
Using it involves roughly three stages:
1. Definition: define the options to parse. Each option is an Option instance, and Options is a collection of Option objects.
2. Parsing: pass the args array of the main method to CommandLineParser's parse method to obtain a CommandLine object.
3. Querying: use the parsed result, e.g. hasOption to check whether an option is present and getOptionValue to read an option's value.
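The three stages can be sketched with a minimal, self-contained example (the CliDemo class and its parseFlinkHome helper are hypothetical, and it assumes commons-cli is on the classpath; the option mirrors the CLI's --flink-home):

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class CliDemo {
    // Parses args and returns the value of --flink-home, or null when absent.
    public static String parseFlinkHome(String[] args) {
        // 1. Definition: each option is an Option; Options is a collection of them
        Options options = new Options();
        options.addOption(
                Option.builder()
                        .longOpt("flink-home")
                        .hasArg()
                        .desc("Path of Flink home directory")
                        .build());
        try {
            // 2. Parsing: turn the raw args into a CommandLine
            CommandLineParser parser = new DefaultParser();
            CommandLine commandLine = parser.parse(options, args);
            // 3. Querying: hasOption / getOptionValue on the parsed result
            return commandLine.hasOption("flink-home")
                    ? commandLine.getOptionValue("flink-home")
                    : null;
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Calling parseFlinkHome(new String[] {"--flink-home", "/opt/flink"}) returns "/opt/flink", while an empty args array returns null.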
For more detail, see my other article series, which covers this tool thoroughly:
Apache Commons CLI: a powerful command line parsing tool
Apache Commons CLI: a module-by-module walkthrough
After parsing, it checks whether args is empty or contains the -h or --help option, and if so prints the help message.
It then creates an executor from the CommandLine object and runs the job.
Finally, it prints the execution result.
The most important line in this class is the one that creates the executor and runs the pipeline:
// Create executor and execute the pipeline
PipelineExecution.ExecutionInfo result = createExecutor(commandLine).run();

static CliExecutor createExecutor(CommandLine commandLine) throws Exception {
    // The pipeline definition file would remain unparsed
    List<String> unparsedArgs = commandLine.getArgList();
    if (unparsedArgs.isEmpty()) {
        throw new IllegalArgumentException(
                "Missing pipeline definition file path in arguments. ");
    }

    // Take the first unparsed argument as the pipeline definition file
    Path pipelineDefPath = Paths.get(unparsedArgs.get(0));
    if (!Files.exists(pipelineDefPath)) {
        throw new FileNotFoundException(
                String.format("Cannot find pipeline definition file \"%s\"", pipelineDefPath));
    }

    // Global pipeline configuration
    Configuration globalPipelineConfig = getGlobalConfig(commandLine);

    // Load Flink environment
    Path flinkHome = getFlinkHome(commandLine);
    Configuration flinkConfig = FlinkEnvironmentUtils.loadFlinkConfiguration(flinkHome);

    // Additional JARs
    List<Path> additionalJars =
            Arrays.stream(
                            Optional.ofNullable(
                                            commandLine.getOptionValues(CliFrontendOptions.JAR))
                                    .orElse(new String[0]))
                    .map(Paths::get)
                    .collect(Collectors.toList());

    // Build executor
    return new CliExecutor(
            pipelineDefPath,
            flinkConfig,
            globalPipelineConfig,
            commandLine.hasOption(CliFrontendOptions.USE_MINI_CLUSTER),
            additionalJars);
}
As you can see, it ultimately builds a CliExecutor and calls its run method.
Options class: CliFrontendOptions
This class defines the command line options, again using the Apache Commons CLI library. The code is simple.
Let's take a closer look at what each option does.
package org.apache.flink.cdc.cli;

import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

/** Command line argument options for {@link CliFrontend}. */
public class CliFrontendOptions {
    public static final Option FLINK_HOME =
            Option.builder()
                    .longOpt("flink-home")
                    .hasArg()
                    .desc("Path of Flink home directory")
                    .build();

    public static final Option HELP =
            Option.builder("h").longOpt("help").desc("Display help message").build();

    public static final Option GLOBAL_CONFIG =
            Option.builder()
                    .longOpt("global-config")
                    .hasArg()
                    .desc("Path of the global configuration file for Flink CDC pipelines")
                    .build();

    public static final Option JAR =
            Option.builder()
                    .longOpt("jar")
                    .hasArgs()
                    .desc("JARs to be submitted together with the pipeline")
                    .build();

    public static final Option USE_MINI_CLUSTER =
            Option.builder()
                    .longOpt("use-mini-cluster")
                    .hasArg(false)
                    .desc("Use Flink MiniCluster to run the pipeline")
                    .build();

    public static Options initializeOptions() {
        return new Options()
                .addOption(HELP)
                .addOption(JAR)
                .addOption(FLINK_HOME)
                .addOption(GLOBAL_CONFIG)
                .addOption(USE_MINI_CLUSTER);
    }
}
--flink-home
Specifies the Flink home directory. With this option you can point to a specific Flink installation instead of using the FLINK_HOME from the system environment.
--global-config
The global configuration file for Flink CDC pipelines, i.e. the flink-cdc.yaml file under the conf directory. It contains very few settings; I only saw a parallelism setting there. Readers who are interested can take a closer look at this file.
--jar
Dependency JARs to be submitted together with the job.
--use-mini-cluster
Starts the job with a MiniCluster, which is essentially local mode: multiple threads simulate components such as the JobManager, TaskManager, ResourceManager, and Dispatcher. It is generally used for testing.
-h or --help
Prints the help message.
Executor class: CliExecutor
package com.ververica.cdc.cli;

import com.ververica.cdc.cli.parser.PipelineDefinitionParser;
import com.ververica.cdc.cli.parser.YamlPipelineDefinitionParser;
import com.ververica.cdc.cli.utils.FlinkEnvironmentUtils;
import com.ververica.cdc.common.annotation.VisibleForTesting;
import com.ververica.cdc.common.configuration.Configuration;
import com.ververica.cdc.composer.PipelineComposer;
import com.ververica.cdc.composer.PipelineExecution;
import com.ververica.cdc.composer.definition.PipelineDef;

import java.nio.file.Path;
import java.util.List;

/** Executor for doing the composing and submitting logic for {@link CliFrontend}. */
public class CliExecutor {

    private final Path pipelineDefPath;
    private final Configuration flinkConfig;
    private final Configuration globalPipelineConfig;
    private final boolean useMiniCluster;
    private final List<Path> additionalJars;

    private PipelineComposer composer = null;

    public CliExecutor(
            Path pipelineDefPath,
            Configuration flinkConfig,
            Configuration globalPipelineConfig,
            boolean useMiniCluster,
            List<Path> additionalJars) {
        this.pipelineDefPath = pipelineDefPath;
        this.flinkConfig = flinkConfig;
        this.globalPipelineConfig = globalPipelineConfig;
        this.useMiniCluster = useMiniCluster;
        this.additionalJars = additionalJars;
    }

    public PipelineExecution.ExecutionInfo run() throws Exception {
        // Parse pipeline definition file
        PipelineDefinitionParser pipelineDefinitionParser = new YamlPipelineDefinitionParser();
        PipelineDef pipelineDef =
                pipelineDefinitionParser.parse(pipelineDefPath, globalPipelineConfig);

        // Create composer
        PipelineComposer composer = getComposer(flinkConfig);

        // Compose pipeline
        PipelineExecution execution = composer.compose(pipelineDef);

        // Execute the pipeline
        return execution.execute();
    }

    private PipelineComposer getComposer(Configuration flinkConfig) {
        if (composer == null) {
            return FlinkEnvironmentUtils.createComposer(
                    useMiniCluster, flinkConfig, additionalJars);
        }
        return composer;
    }

    @VisibleForTesting
    void setComposer(PipelineComposer composer) {
        this.composer = composer;
    }

    @VisibleForTesting
    public Configuration getFlinkConfig() {
        return flinkConfig;
    }

    @VisibleForTesting
    public Configuration getGlobalPipelineConfig() {
        return globalPipelineConfig;
    }

    @VisibleForTesting
    public List<Path> getAdditionalJars() {
        return additionalJars;
    }
}
The core of this class is the run method.
It first builds a YAML parser for the pipeline definition file,
then calls its parse method to obtain a PipelineDef, which is essentially the YAML configuration converted into a configuration entity bean for easier handling later.
Next it obtains a PipelineComposer object and calls its compose method with the PipelineDef entity, which produces a PipelineExecution object.
Finally it calls execute to launch the job (under the hood this calls Flink's StreamExecutionEnvironment.executeAsync method).
Configuration file parser: YamlPipelineDefinitionParser
Before diving into this class, look at the sample job definition YAML from the official documentation.
################################################################################
# Description: Sync MySQL all tables to Doris
################################################################################
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 12345678
  tables: doris_sync.\.*
  server-id: 5400-5404
  server-time-zone: Asia/Shanghai

sink:
  type: doris
  fenodes: 127.0.0.1:8031
  username: root
  password: ""
  table.create.properties.light_schema_change: true
  table.create.properties.replication_num: 1

route:
  - source-table: doris_sync.a_\.*
    sink-table: ods.ods_a
  - source-table: doris_sync.abc
    sink-table: ods.ods_abc
  - source-table: doris_sync.table_\.*
    sink-table: ods.ods_table

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
The main goal of this class is to parse this YAML file into a PipelineDef entity for later use.
The code explanation is written directly in the comments.
package com.ververica.cdc.cli.parser;

import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.type.TypeReference;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonNode;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

import com.ververica.cdc.common.configuration.Configuration;
import com.ververica.cdc.composer.definition.PipelineDef;
import com.ververica.cdc.composer.definition.RouteDef;
import com.ververica.cdc.composer.definition.SinkDef;
import com.ververica.cdc.composer.definition.SourceDef;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import static com.ververica.cdc.common.utils.Preconditions.checkNotNull;

/** Parser for converting YAML formatted pipeline definition to {@link PipelineDef}. */
public class YamlPipelineDefinitionParser implements PipelineDefinitionParser {

    // Parent node keys
    private static final String SOURCE_KEY = "source";
    private static final String SINK_KEY = "sink";
    private static final String ROUTE_KEY = "route";
    private static final String PIPELINE_KEY = "pipeline";

    // Source / sink keys
    private static final String TYPE_KEY = "type";
    private static final String NAME_KEY = "name";

    // Route keys
    private static final String ROUTE_SOURCE_TABLE_KEY = "source-table";
    private static final String ROUTE_SINK_TABLE_KEY = "sink-table";
    private static final String ROUTE_DESCRIPTION_KEY = "description";

    // The core parsing tool: reads values from the YAML file and converts them
    // into Java objects
    private final ObjectMapper mapper = new ObjectMapper(new YAMLFactory());

    /** Parse the specified pipeline definition file. */
    @Override
    public PipelineDef parse(Path pipelineDefPath, Configuration globalPipelineConfig)
            throws Exception {
        // First convert pipelineDefPath (the mysql-to-doris.yaml file passed in)
        // into a JsonNode via readTree for easier traversal
        JsonNode root = mapper.readTree(pipelineDefPath.toFile());

        // Source is required
        SourceDef sourceDef =
                toSourceDef(
                        checkNotNull(
                                root.get(SOURCE_KEY), // fetch the "source" node
                                "Missing required field \"%s\" in pipeline definition",
                                SOURCE_KEY));

        // Sink is required (same handling as source)
        SinkDef sinkDef =
                toSinkDef(
                        checkNotNull(
                                root.get(SINK_KEY), // fetch the "sink" node
                                "Missing required field \"%s\" in pipeline definition",
                                SINK_KEY));

        // Routes are optional. The route config is an array, so the result of
        // root.get(ROUTE_KEY) is elegantly wrapped in an Optional; ifPresent runs
        // only when the node exists, iterating the array and adding each entry to
        // routeDefs
        List<RouteDef> routeDefs = new ArrayList<>();
        Optional.ofNullable(root.get(ROUTE_KEY))
                .ifPresent(node -> node.forEach(route -> routeDefs.add(toRouteDef(route))));

        // Pipeline configs are optional; when not specified, the Flink CDC global
        // configuration is used
        Configuration userPipelineConfig = toPipelineConfig(root.get(PIPELINE_KEY));

        // Merge user config into global config. Note that the global config is
        // added first and the user config second; addAll is essentially HashMap's
        // putAll, where new values overwrite old ones, so the user config takes
        // priority over the global config
        Configuration pipelineConfig = new Configuration();
        pipelineConfig.addAll(globalPipelineConfig);
        pipelineConfig.addAll(userPipelineConfig);

        // Return the job definition entity
        return new PipelineDef(sourceDef, sinkDef, routeDefs, null, pipelineConfig);
    }

    private SourceDef toSourceDef(JsonNode sourceNode) {
        // Convert sourceNode into a Map
        Map<String, String> sourceMap =
                mapper.convertValue(sourceNode, new TypeReference<Map<String, String>>() {});

        // "type" field is required
        String type =
                checkNotNull(
                        sourceMap.remove(TYPE_KEY), // remove and take the "type" field
                        "Missing required field \"%s\" in source configuration",
                        TYPE_KEY);

        // "name" field is optional
        String name = sourceMap.remove(NAME_KEY); // remove and take the "name" field

        // Build the SourceDef object
        return new SourceDef(type, name, Configuration.fromMap(sourceMap));
    }

    private SinkDef toSinkDef(JsonNode sinkNode) {
        Map<String, String> sinkMap =
                mapper.convertValue(sinkNode, new TypeReference<Map<String, String>>() {});

        // "type" field is required
        String type =
                checkNotNull(
                        sinkMap.remove(TYPE_KEY),
                        "Missing required field \"%s\" in sink configuration",
                        TYPE_KEY);

        // "name" field is optional
        String name = sinkMap.remove(NAME_KEY);

        return new SinkDef(type, name, Configuration.fromMap(sinkMap));
    }

    private RouteDef toRouteDef(JsonNode routeNode) {
        String sourceTable =
                checkNotNull(
                                routeNode.get(ROUTE_SOURCE_TABLE_KEY),
                                "Missing required field \"%s\" in route configuration",
                                ROUTE_SOURCE_TABLE_KEY)
                        .asText(); // read the JsonNode value as a String
        String sinkTable =
                checkNotNull(
                                routeNode.get(ROUTE_SINK_TABLE_KEY),
                                "Missing required field \"%s\" in route configuration",
                                ROUTE_SINK_TABLE_KEY)
                        .asText();
        String description =
                Optional.ofNullable(routeNode.get(ROUTE_DESCRIPTION_KEY))
                        .map(JsonNode::asText)
                        .orElse(null);
        return new RouteDef(sourceTable, sinkTable, description);
    }

    private Configuration toPipelineConfig(JsonNode pipelineConfigNode) {
        if (pipelineConfigNode == null || pipelineConfigNode.isNull()) {
            return new Configuration();
        }
        Map<String, String> pipelineConfigMap =
                mapper.convertValue(
                        pipelineConfigNode, new TypeReference<Map<String, String>>() {});
        return Configuration.fromMap(pipelineConfigMap);
    }
}
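The precedence rule in the merge step (user config overriding global config) comes down to plain HashMap putAll semantics. A minimal sketch, with a hypothetical merge helper standing in for Configuration#addAll:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigMergeDemo {
    // Mimics how the parser merges configs: global first, user second;
    // putAll overwrites existing keys, so user values win on conflict.
    public static Map<String, String> merge(
            Map<String, String> global, Map<String, String> user) {
        Map<String, String> merged = new HashMap<>();
        merged.putAll(global);
        merged.putAll(user); // user entries overwrite global ones with the same key
        return merged;
    }
}
```

For example, a global parallelism of 4 merged with a user parallelism of 2 yields 2, while keys only present in one map are kept unchanged.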
Configuration utility class: ConfigurationUtils
package com.ververica.cdc.cli.utils;

import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.type.TypeReference;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

import com.ververica.cdc.common.configuration.Configuration;

import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Utilities for handling {@link Configuration}. */
public class ConfigurationUtils {
    public static Configuration loadMapFormattedConfig(Path configPath) throws Exception {
        if (!Files.exists(configPath)) {
            throw new FileNotFoundException(
                    String.format("Cannot find configuration file at \"%s\"", configPath));
        }
        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
        try {
            Map<String, String> configMap =
                    mapper.readValue(
                            configPath.toFile(), new TypeReference<Map<String, String>>() {});
            return Configuration.fromMap(configMap);
        } catch (Exception e) {
            throw new IllegalStateException(
                    String.format(
                            "Failed to load config file \"%s\" to key-value pairs", configPath),
                    e);
        }
    }
}
This class has a single method whose job is to convert a configuration file into a Configuration object.
Let's look at the implementation details.
First, Files.exists checks whether the file exists; if not, an exception is thrown immediately.
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
This line uses two core classes from the Jackson library: ObjectMapper and YAMLFactory.
ObjectMapper is Jackson's central class for serialization (converting objects into byte streams or other formats) and deserialization (the reverse). It provides a variety of methods for handling JSON, YAML, and other formats.
YAMLFactory is Jackson's factory class for the YAML format; new YAMLFactory() creates a factory instance for processing YAML data.
new ObjectMapper(new YAMLFactory()) creates an ObjectMapper configured with that YAMLFactory, so the ObjectMapper can handle YAML-formatted data.
Map<String, String> configMap =
        mapper.readValue(configPath.toFile(), new TypeReference<Map<String, String>>() {});
This line reads the YAML configuration file and converts it into a Map whose keys and values are both Strings.
The class is mainly used to parse the global config, i.e. the flink-cdc.yaml under the conf directory, which contains only flat key-value entries; that is why converting it to a Map is sufficient.
flink-cdc.yaml
# Parallelism of the pipeline
parallelism: 4

# Behavior for handling schema change events from source
schema.change.behavior: EVOLVE
Now a quick look at the mapper's readValue method.
Jackson ObjectMapper's readValue converts a configuration file into a Java object; three overloads are worth knowing:
public <T> T readValue(File src, Class<T> valueType);            // convert into a plain entity class
public <T> T readValue(File src, TypeReference<T> valueTypeRef); // for Map, List, and array types
public <T> T readValue(File src, JavaType valueType);            // rarely used
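The TypeReference overload is the one used here. A minimal sketch (the YamlMapDemo class is hypothetical; it uses the plain jackson-dataformat-yaml coordinates rather than the flink-shaded ones the source imports, and parses from a String instead of a File for simplicity):

```java
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

public class YamlMapDemo {
    // Reads flat key-value YAML (like flink-cdc.yaml) into a Map<String, String>.
    public static Map<String, String> toMap(String yaml) {
        try {
            ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
            return mapper.readValue(yaml, new TypeReference<Map<String, String>>() {});
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse YAML", e);
        }
    }
}
```

Note that scalar values such as the integer 4 in "parallelism: 4" are coerced into Strings by the Map<String, String> target type.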
The last line converts the Map into a Configuration object:
return Configuration.fromMap(configMap);
Configuration here is just a thin wrapper around HashMap that makes the values easier to work with.
Flink environment utility class: FlinkEnvironmentUtils
package com.ververica.cdc.cli.utils;

import com.ververica.cdc.common.configuration.Configuration;
import com.ververica.cdc.composer.flink.FlinkPipelineComposer;

import java.nio.file.Path;
import java.util.List;

/** Utilities for handling Flink configuration and environment. */
public class FlinkEnvironmentUtils {

    private static final String FLINK_CONF_DIR = "conf";
    private static final String FLINK_CONF_FILENAME = "flink-conf.yaml";

    public static Configuration loadFlinkConfiguration(Path flinkHome) throws Exception {
        Path flinkConfPath = flinkHome.resolve(FLINK_CONF_DIR).resolve(FLINK_CONF_FILENAME);
        return ConfigurationUtils.loadMapFormattedConfig(flinkConfPath);
    }

    public static FlinkPipelineComposer createComposer(
            boolean useMiniCluster, Configuration flinkConfig, List<Path> additionalJars) {
        if (useMiniCluster) {
            return FlinkPipelineComposer.ofMiniCluster();
        }
        return FlinkPipelineComposer.ofRemoteCluster(
                org.apache.flink.configuration.Configuration.fromMap(flinkConfig.toMap()),
                additionalJars);
    }
}
public static Configuration loadFlinkConfiguration(Path flinkHome) throws Exception {
    Path flinkConfPath = flinkHome.resolve(FLINK_CONF_DIR).resolve(FLINK_CONF_FILENAME);
    return ConfigurationUtils.loadMapFormattedConfig(flinkConfPath);
}
The purpose of this method is to locate the FLINK_HOME/conf/flink-conf.yaml file and convert it into a Configuration object, using the conversion method introduced in the previous section.
It also uses Path's resolve method, which joins two paths into a new Path.
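The resolve chaining can be sketched as follows (the ResolveDemo class and flinkConfPath helper are hypothetical, mirroring what loadFlinkConfiguration does):

```java
import java.nio.file.Path;

public class ResolveDemo {
    // Joins FLINK_HOME with "conf" and "flink-conf.yaml", producing a new Path
    // each time; the original flinkHome Path is left untouched.
    public static Path flinkConfPath(Path flinkHome) {
        return flinkHome.resolve("conf").resolve("flink-conf.yaml");
    }
}
```

For example, passing /opt/flink yields /opt/flink/conf/flink-conf.yaml.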
public static FlinkPipelineComposer createComposer(
        boolean useMiniCluster, Configuration flinkConfig, List<Path> additionalJars) {
    if (useMiniCluster) {
        return FlinkPipelineComposer.ofMiniCluster();
    }
    return FlinkPipelineComposer.ofRemoteCluster(
            org.apache.flink.configuration.Configuration.fromMap(flinkConfig.toMap()),
            additionalJars);
}
This method initializes the Composer from a few parameters; the Composer is the core class that translates the user's job definition into a pipeline job.
It first checks whether to use a MiniCluster, which can be enabled at startup with --use-mini-cluster, as described earlier.
Otherwise a remote cluster is used; I won't go into that here, as it will be covered in a later article.
Summary
That was a lot of code across several classes, so here is a brief summary of the module.
Main responsibilities of the flink-cdc-cli module:
1. Parse the job definition YAML file into a PipelineDef job entity.
2. Load the Flink configuration via FLINK_HOME and build a PipelineComposer object.
3. Call the composer's compose method with the job entity to obtain a PipelineExecution object, then launch the job.
In one sentence: parse the configuration file into a job entity, then launch the job.
What reading this module's source code taught me:
1. How to use Apache Commons CLI; I can reach for it when writing my own command line tools.
2. How to parse YAML files with Jackson.
3. A deeper appreciation of the Optional class for null checks, a more elegant way to handle null values.
4. A complete picture of the flink-cdc-cli module, though some details will require diving into other modules.
All in all, reading code written by experts is very rewarding!
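The Optional idioms mentioned in point 3 appear twice in the parser: ifPresent for the optional route list and map/orElse for the optional description. A minimal sketch of both (the OptionalDemo class and its helpers are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class OptionalDemo {
    // Collects entries only when the (possibly null) input list exists,
    // mirroring how the parser handles the optional "route" node.
    public static List<String> collect(List<String> maybeRoutes) {
        List<String> routes = new ArrayList<>();
        Optional.ofNullable(maybeRoutes).ifPresent(routes::addAll);
        return routes;
    }

    // Reads an optional value with a null fallback, mirroring the
    // "description" handling in toRouteDef.
    public static String describe(String maybeDescription) {
        return Optional.ofNullable(maybeDescription).map(String::trim).orElse(null);
    }
}
```

Both helpers avoid explicit if (x != null) checks while behaving identically for null input.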
References
[MiniCluster introduction]: https://www.cnblogs.com/wangwei0721/p/14052016.html
[Jackson ObjectMapper#readValue usage]: https://www.cnblogs.com/del88/p/13098678.html