视频地址:尚硅谷大数据项目《在线教育之采集系统》_哔哩哔哩_bilibili
目录
P047
P048
P049
P050
P051
P052
P053
P054
P055
P056
P047
/opt/module/datax/job/base_province.json
[atguigu@node001 ~]$ hadoop fs -mkdir /base_province/2022-02-22
[atguigu@node001 ~]$ cd /opt/module/datax/
[atguigu@node001 datax]$ python bin/datax.py -p"-Ddt=2022-02-22" job/base_province.json
P048
{"job": {"content": [{"reader": {"name": "hdfsreader","parameter": {"defaultFS": "hdfs://node001:8020","path": "/base_province","column": ["*"],"fileType": "text","compress": "gzip","encoding": "UTF-8","nullFormat": "\\N","fieldDelimiter": "\t",}},"writer": {"name": "mysqlwriter","parameter": {"username": "root","password": "123456","connection": [{"table": ["test_province"],"jdbcUrl": "jdbc:mysql://node001:3306/edu?useUnicode=true&characterEncoding=utf-8"}],"column": ["id","name","region_id","area_code","iso_code","iso_3166_2"],"writeMode": "replace"}}}],"setting": {"speed": {"channel": 1}}}
}
DROP TABLE IF EXISTS `test_province`;CREATE TABLE `test_province` (`id` BIGINT(20) NOT NULL,`name` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,`region_id` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,`area_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,`iso_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,`iso_3166_2` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,PRIMARY KEY (`id`)
) ENGINE = INNODB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = DYNAMIC;
P049
MysqlReader插件文档:https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
并行度 task数量
2 11
3 16
4 21
n n*5+1
P050
HFDS Writer并未提供nullFormat参数:也就是用户并不能自定义null值写到HFDS文件中的存储格式。默认情况下,HFDS Writer会将null值存储为空字符串(''),而Hive默认的null值存储格式为\N。所以后期将DataX同步的文件导入Hive表就会出现问题。
解决该问题的方案有两个:
- 一是修改DataX HDFS Writer的源码,增加自定义null值存储格式的逻辑,可参考记Datax3.0解决MySQL抽数到HDFSNULL变为空字符的问题_datax nullformat_谭正强的博客-CSDN博客。
- 二是在Hive中建表时指定null值存储格式为空字符串(''),例如:
DROP TABLE IF EXISTS base_province;CREATE EXTERNAL TABLE base_province
(`id` STRING COMMENT '编号',`name` STRING COMMENT '省份名称',`region_id` STRING COMMENT '地区ID',`area_code` STRING COMMENT '地区编码',`iso_code` STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',`iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用'
) COMMENT '省份表'ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'NULL DEFINED AS ''LOCATION '/base_province/';
P051
第5章 DataX优化
P052
Maxwell 是由美国Zendesk公司开源,用Java编写的MySQL变更数据抓取软件。它会实时监控Mysql数据库的数据变更操作(包括insert、update、delete),并将变更数据以 JSON 格式发送给 Kafka、Kinesi等流数据处理平台。官网地址:Maxwell's Daemon
P053
P054
[mysqld]#数据库id
server-id = 1##启动binlog,该参数的值会作为binlog的文件名
log-bin=mysql-bin#binlog类型,maxwell要求为row类型
binlog_format=row#启用binlog的数据库,需根据实际情况作出修改
binlog-do-db=edu
P055
[atguigu@node001 ~]$ mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.29 MySQL Community Server (GPL)Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.mysql> show master status;
Empty set (0.00 sec)mysql> ^DBye
[atguigu@node001 ~]$ mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.29-log MySQL Community Server (GPL)Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 | 154 | edu | | |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)mysql> CREATE DATABASE maxwell;
Query OK, 1 row affected (0.01 sec)mysql> set global validate_password_policy=0;
ERROR 1193 (HY000): Unknown system variable 'validate_password_policy'
mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
Query OK, 0 rows affected (0.02 sec)mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.01 sec)mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)mysql> quit
Bye
[atguigu@node001 ~]$
P056
- node001:启动zookeeper、kafka、maxwell。
- node002:[atguigu@node002 ~]$ kafka-console-consumer.sh --bootstrap-server node001:9092 --topic maxwell
[atguigu@node001 maxwell]$ cd /opt/module/maxwell/
[atguigu@node001 maxwell]$ ll
总用量 4
drwxrwxr-x 4 atguigu atguigu 4096 8月 9 16:00 maxwell-1.29.2
[atguigu@node001 maxwell]$ vim /etc/my.cnf
[atguigu@node001 maxwell]$
[atguigu@node001 maxwell]$ sudo vim /etc/my.cnf
[atguigu@node001 maxwell]$ sudo systemctl restart mysqld
[atguigu@node001 maxwell]$
[atguigu@node001 maxwell]$ cd /opt/module/maxwell/maxwell-1.29.2/
[atguigu@node001 maxwell-1.29.2]$ cp config.properties.example config.properties
[atguigu@node001 maxwell-1.29.2]$ bin/maxwell --config config.properties --daemon
Redirecting STDOUT to /opt/module/maxwell/maxwell-1.29.2/bin/../logs/MaxwellDaemon.out
Using kafka version: 1.0.0
[atguigu@node001 maxwell-1.29.2]$ jps
5600 Maxwell
5631 Jps
[atguigu@node001 maxwell-1.29.2]$ zk.sh start
---------- zookeeper node001 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
---------- zookeeper node002 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
---------- zookeeper node003 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[atguigu@node001 maxwell-1.29.2]$ kf.sh start
--------------- node001 Kafka 启动 ---------------
--------------- node002 Kafka 启动 ---------------
--------------- node003 Kafka 启动 ---------------
[atguigu@node001 maxwell-1.29.2]$ myhadoop.sh start================ 启动 hadoop集群 ================---------------- 启动 hdfs ----------------
Starting namenodes on [node001]
Starting datanodes
Starting secondary namenodes [node003]--------------- 启动 yarn ---------------
Starting resourcemanager
Starting nodemanagers--------------- 启动 historyserver ---------------
[atguigu@node001 maxwell-1.29.2]$ jpsall
================ node001 ================
5600 Maxwell
7314 Jps
7059 NodeManager
6483 NameNode
6647 DataNode
7276 JobHistoryServer
5742 QuorumPeerMain
================ node002 ================
4583 NodeManager
4921 Jps
4461 ResourceManager
4254 DataNode
3727 QuorumPeerMain
================ node003 ================
4240 DataNode
3703 QuorumPeerMain
4344 SecondaryNameNode
4474 NodeManager
4090 Kafka
4606 Jps
[atguigu@node001 maxwell-1.29.2]$ kf.sh stop
--------------- node001 Kafka 停止 ---------------
No kafka server to stop
--------------- node002 Kafka 停止 ---------------
No kafka server to stop
--------------- node003 Kafka 停止 ---------------
[atguigu@node001 maxwell-1.29.2]$ kf.sh start
--------------- node001 Kafka 启动 ---------------
--------------- node002 Kafka 启动 ---------------
--------------- node003 Kafka 启动 ---------------
[atguigu@node001 maxwell-1.29.2]$ jpsall
================ node001 ================
5600 Maxwell
7937 Kafka
7059 NodeManager
6483 NameNode
8004 Jps
6647 DataNode
7276 JobHistoryServer
5742 QuorumPeerMain
================ node002 ================
5457 Jps
4583 NodeManager
5402 Kafka
4461 ResourceManager
4254 DataNode
3727 QuorumPeerMain
================ node003 ================
4240 DataNode
3703 QuorumPeerMain
4344 SecondaryNameNode
4474 NodeManager
5195 Jps
5102 Kafka
[atguigu@node001 maxwell-1.29.2]$ mock.sh
[atguigu@node001 maxwell-1.29.2]$
[atguigu@node002 ~]$ kafka-console-consumer.sh --bootstrap-server node001:9092 --topic maxwell