Orchestrator源码解读2-故障失败发现

目录

目录

前言

核心流程函数调用路径

GetReplicationAnalysis

故障类型和对应的处理函数

拓扑结构警告类型

与MHA相比


前言

Orchestrator另外一个重要的功能是监控集群,发现故障。根据从复制拓扑本身获得的信息,它可以识别各种故障场景。Orchestrator介绍四-失败/故障检测_orchestrator 心跳-CSDN博客

核心流程函数调用路径

ContinuousDiscovery
--> CheckAndRecover // 检查恢复的入口函数
--> GetReplicationAnalysis  // 查询SQL,根据实例的状态确定故障或者警告类型。检查复制问题  (dead master; unreachable master; 等)
--> executeCheckAndRecoverFunction  // 根据分析结果选择正确的检查和恢复函数。然后会同步执行该函数。
--> getCheckAndRecoverFunction // 根据分析结果选择正确的检查和恢复函数。然后会同步执行该函数。
--> runEmergentOperations // 
--> checkAndExecuteFailureDetectionProcesses //   尝试往数据库中插入这个故障发现记录,然后执行故障发现阶段所需的操作,执行 OnFailureDetectionProcesses (故障发现阶段的)钩子脚本
--> AttemptFailureDetectionRegistration //  尝试往数据库中插入故障发现记录 ,如果失败 意味着这个问题可能已经被发现了,
--> checkAndRecoverDeadMaster  // 根据故障情况 执行恢复,这里选择DeadMaster 故障类型举例 
--> AttemptRecoveryRegistration // 尝试往数据库中插入一条恢复记录;如果这一尝试失败,那么意味着恢复已经在进行中。
--> recoverDeadMaster //  用于恢复DeadMaster故障类型的函数,其中包含了完整的逻辑。会执行PreFailoverProcesses 钩子脚本

GetReplicationAnalysis

该函数将检查复制问题 (dead master; unreachable master; 等),根据实例的状态确定故障或者警告类型。

该函数会运行一个复杂SQL ,该SQL会每秒执行一次 ,执行间隔是由定时器 recoveryTick决定 ,SQL如下

SELECTmaster_instance.hostname,master_instance.port,master_instance.read_only AS read_only,MIN(master_instance.data_center) AS data_center,MIN(master_instance.region) AS region,MIN(master_instance.physical_environment) AS physical_environment,MIN(master_instance.master_host) AS master_host,MIN(master_instance.master_port) AS master_port,MIN(master_instance.cluster_name) AS cluster_name,MIN(master_instance.binary_log_file) AS binary_log_file,MIN(master_instance.binary_log_pos) AS binary_log_pos,MIN(IFNULL(master_instance.binary_log_file = database_instance_stale_binlog_coordinates.binary_log_fileAND master_instance.binary_log_pos = database_instance_stale_binlog_coordinates.binary_log_posAND database_instance_stale_binlog_coordinates.first_seen < NOW() - interval 10 second,0)) AS is_stale_binlog_coordinates,MIN(IFNULL(cluster_alias.alias,master_instance.cluster_name)) AS cluster_alias,MIN(IFNULL(cluster_domain_name.domain_name,master_instance.cluster_name)) AS cluster_domain,MIN(master_instance.last_checked <= master_instance.last_seenand master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second) = 1 AS is_last_check_valid,/* To be considered a master, traditional async replication must not be present/valid AND the host should either *//* not be a replication group member OR be the primary of the replication group */MIN(master_instance.last_check_partial_success) as last_check_partial_success,MIN((master_instance.master_host IN ('', '_')OR master_instance.master_port = 0OR substr(master_instance.master_host, 1, 2) = '//')AND (master_instance.replication_group_name = ''OR master_instance.replication_group_member_role = 'PRIMARY')) AS is_master,MIN(master_instance.replication_group_name != ''AND master_instance.replication_group_member_state != 'OFFLINE') AS is_replication_group_member,MIN(master_instance.is_co_master) AS is_co_master,MIN(CONCAT(master_instance.hostname,':',master_instance.port) = master_instance.cluster_name) AS is_cluster_master,MIN(master_instance.gtid_mode) AS gtid_mode,COUNT(replica_instance.server_id) AS count_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen),0) AS count_valid_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.slave_io_running != 0AND replica_instance.slave_sql_running != 0),0) AS count_valid_replicating_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.slave_io_running = 0AND replica_instance.last_io_error like '%error %connecting to master%'AND replica_instance.slave_sql_running = 1),0) AS count_replicas_failing_to_connect_to_master,MIN(master_instance.replication_depth) AS replication_depth,GROUP_CONCAT(concat(replica_instance.Hostname,':',replica_instance.Port)) as slave_hosts,MIN(master_instance.slave_sql_running = 1AND master_instance.slave_io_running = 0AND master_instance.last_io_error like '%error %connecting to master%') AS is_failing_to_connect_to_master,MIN(master_downtime.downtime_active is not nulland ifnull(master_downtime.end_timestamp, now()) > now()) AS is_downtimed,MIN(IFNULL(master_downtime.end_timestamp, '')) AS downtime_end_timestamp,MIN(IFNULL(unix_timestamp() - unix_timestamp(master_downtime.end_timestamp),0)) AS downtime_remaining_seconds,MIN(master_instance.binlog_server) AS is_binlog_server,MIN(master_instance.pseudo_gtid) AS is_pseudo_gtid,MIN(master_instance.supports_oracle_gtid) AS supports_oracle_gtid,MIN(master_instance.semi_sync_master_enabled) AS semi_sync_master_enabled,MIN(master_instance.semi_sync_master_wait_for_slave_count) AS semi_sync_master_wait_for_slave_count,MIN(master_instance.semi_sync_master_clients) AS semi_sync_master_clients,MIN(master_instance.semi_sync_master_status) AS semi_sync_master_status,SUM(replica_instance.is_co_master) AS count_co_master_replicas,SUM(replica_instance.oracle_gtid) AS count_oracle_gtid_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.oracle_gtid != 0),0) AS count_valid_oracle_gtid_replicas,SUM(replica_instance.binlog_server) AS count_binlog_server_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.binlog_server != 0),0) AS count_valid_binlog_server_replicas,SUM(replica_instance.semi_sync_replica_enabled) AS count_semi_sync_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.semi_sync_replica_enabled != 0),0) AS count_valid_semi_sync_replicas,MIN(master_instance.mariadb_gtid) AS is_mariadb_gtid,SUM(replica_instance.mariadb_gtid) AS count_mariadb_gtid_replicas,IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.mariadb_gtid != 0),0) AS count_valid_mariadb_gtid_replicas,IFNULL(SUM(replica_instance.log_binAND replica_instance.log_slave_updates),0) AS count_logging_replicas,IFNULL(SUM(replica_instance.log_binAND replica_instance.log_slave_updatesAND replica_instance.binlog_format = 'STATEMENT'),0) AS count_statement_based_logging_replicas,IFNULL(SUM(replica_instance.log_binAND replica_instance.log_slave_updatesAND replica_instance.binlog_format = 'MIXED'),0) AS count_mixed_based_logging_replicas,IFNULL(SUM(replica_instance.log_binAND replica_instance.log_slave_updatesAND replica_instance.binlog_format = 'ROW'),0) AS count_row_based_logging_replicas,IFNULL(SUM(replica_instance.sql_delay > 0),0) AS count_delayed_replicas,IFNULL(SUM(replica_instance.slave_lag_seconds > 10),0) AS count_lagging_replicas,IFNULL(MIN(replica_instance.gtid_mode), '') AS min_replica_gtid_mode,IFNULL(MAX(replica_instance.gtid_mode), '') AS max_replica_gtid_mode,IFNULL(MAX(case when replica_downtime.downtime_active is not nulland ifnull(replica_downtime.end_timestamp, now()) > now() then '' else replica_instance.gtid_errant end),'') AS max_replica_gtid_errant,IFNULL(SUM(replica_downtime.downtime_active is not nulland ifnull(replica_downtime.end_timestamp, now()) > now()),0) AS count_downtimed_replicas,COUNT(DISTINCT case when replica_instance.log_binAND replica_instance.log_slave_updates then replica_instance.major_version else NULL end) AS count_distinct_logging_major_versionsFROMdatabase_instance master_instanceLEFT JOIN hostname_resolve ON (master_instance.hostname = hostname_resolve.hostname)LEFT JOIN database_instance replica_instance ON (COALESCE(hostname_resolve.resolved_hostname,master_instance.hostname) = replica_instance.master_hostAND master_instance.port = replica_instance.master_port)LEFT JOIN database_instance_maintenance ON (master_instance.hostname = database_instance_maintenance.hostnameAND master_instance.port = database_instance_maintenance.portAND database_instance_maintenance.maintenance_active = 1)LEFT JOIN database_instance_stale_binlog_coordinates ON (master_instance.hostname = database_instance_stale_binlog_coordinates.hostnameAND master_instance.port = database_instance_stale_binlog_coordinates.port)LEFT JOIN database_instance_downtime as master_downtime ON (master_instance.hostname = master_downtime.hostnameAND master_instance.port = master_downtime.portAND master_downtime.downtime_active = 1)LEFT JOIN database_instance_downtime as replica_downtime ON (replica_instance.hostname = replica_downtime.hostnameAND replica_instance.port = replica_downtime.portAND replica_downtime.downtime_active = 1)LEFT JOIN cluster_alias ON (cluster_alias.cluster_name = master_instance.cluster_name)LEFT JOIN cluster_domain_name ON (cluster_domain_name.cluster_name = master_instance.cluster_name)WHEREdatabase_instance_maintenance.database_instance_maintenance_id IS NULLAND '' IN ('', master_instance.cluster_name)GROUP BYmaster_instance.hostname,master_instance.portHAVING(MIN(master_instance.last_checked <= master_instance.last_seenand master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second) = 1/* AS is_last_check_valid */) = 0OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.slave_io_running = 0AND replica_instance.last_io_error like '%error %connecting to master%'AND replica_instance.slave_sql_running = 1),0)/* AS count_replicas_failing_to_connect_to_master */> 0)OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen),0)/* AS count_valid_replicas */< COUNT(replica_instance.server_id)/* AS count_replicas */)OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seenAND replica_instance.slave_io_running != 0AND replica_instance.slave_sql_running != 0),0)/* AS count_valid_replicating_replicas */< COUNT(replica_instance.server_id)/* AS count_replicas */)OR (MIN(master_instance.slave_sql_running = 1AND master_instance.slave_io_running = 0AND master_instance.last_io_error like '%error %connecting to master%')/* AS is_failing_to_connect_to_master */)OR (COUNT(replica_instance.server_id)/* AS count_replicas */> 0)ORDER BYis_master DESC,is_cluster_master DESC,count_replicas DESC\G

故障类型和对应的处理函数

故障类型处理函数
NoProblem没有
DeadMasterWithoutReplicas没有
DeadMastercheckAndRecoverDeadMaster
DeadMasterAndReplicascheckAndRecoverGenericProblem
DeadMasterAndSomeReplicascheckAndRecoverDeadMaster
UnreachableMasterWithLaggingReplicascheckAndRecoverGenericProblem
UnreachableMastercheckAndRecoverGenericProblem
MasterSingleReplicaNotReplicating
MasterSingleReplicaDead
AllMasterReplicasNotReplicatingcheckAndRecoverGenericProblem
AllMasterReplicasNotReplicatingOrDeadcheckAndRecoverGenericProblem
LockedSemiSyncMasterHypothesis
LockedSemiSyncMastercheckAndRecoverLockedSemiSyncMaster
MasterWithTooManySemiSyncReplicascheckAndRecoverMasterWithTooManySemiSyncReplicas
MasterWithoutReplicas
DeadCoMastercheckAndRecoverDeadCoMaster
DeadCoMasterAndSomeReplicascheckAndRecoverDeadCoMaster
UnreachableCoMaster
AllCoMasterReplicasNotReplicating
DeadIntermediateMastercheckAndRecoverDeadIntermediateMaster
DeadIntermediateMasterWithSingleReplicacheckAndRecoverDeadIntermediateMaster
DeadIntermediateMasterWithSingleReplicaFailingToConnectcheckAndRecoverDeadIntermediateMaster
DeadIntermediateMasterAndSomeReplicascheckAndRecoverDeadIntermediateMaster
DeadIntermediateMasterAndReplicascheckAndRecoverGenericProblem
UnreachableIntermediateMasterWithLaggingReplicascheckAndRecoverGenericProblem
UnreachableIntermediateMaster
AllIntermediateMasterReplicasFailingToConnectOrDeadcheckAndRecoverDeadIntermediateMaster
AllIntermediateMasterReplicasNotReplicating
FirstTierReplicaFailingToConnectToMaster
BinlogServerFailingToConnectToMaster
// Group replication problems
DeadReplicationGroupMemberWithReplicascheckAndRecoverDeadGroupMemberWithReplicas

拓扑结构警告类型

  • StatementAndMixedLoggingReplicasStructureWarning
  • StatementAndRowLoggingReplicasStructureWarning
  • MixedAndRowLoggingReplicasStructureWarning
  • MultipleMajorVersionsLoggingReplicasStructureWarning
  • NoLoggingReplicasStructureWarning
  • DifferentGTIDModesStructureWarning
  • ErrantGTIDStructureWarning
  • NoFailoverSupportStructureWarning
  • NoWriteableMasterStructureWarning
  • NotEnoughValidSemiSyncReplicasStructureWarning

与MHA相比

MHA探活

核心功能主要有 MHA::HealthCheck::wait_until_unreachable 实现:

  1. 该函数通过一个死循环,检测 4 次,每次 sleep ping_interval 秒(这个值在配置文件指定,参数是 ping_interval),持续四次失败,就认为数据已经宕机;

  2. 如果有二路检测脚本,需要二路检测脚本检测主库宕机,才是真正的宕机,否则只是推出死循环,结束检测,不切换

  3. 通过添加锁来保护数据库的访问,防止脚本多次启动;

  4. 该函数可调用三种检测方法:ping_select、ping_insert、ping_connect

MHA探活流程图

MHAOC
故障类型

故障类型单一,

主要就是主库是否存活

故障类型丰富

有多种故障与警告类型

见上面的总结

探活节点

支持多个节点探活

即manager服务器与二次检查脚本中定义的服务器

多个OC节点 以及 配合该实例的从库复制状态是否正常进行检测
探活方式

ping_select、

ping_insert、

ping_connect

查询被管理的数据库,将数据写入到OC的后台管理数据库中,进行时间戳的比较或状态值比较等方式
探活间隔默认3秒默认每1秒
探活次数4次1次
故障探测速度慢,需要多次探活。
探活理念多次多个节点避免网络抖动等配合从副本的主从状态以及OC节点的探活

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/608752.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

Apollo基础 - Frenet坐标系

Frenet与笛卡尔坐标系的转换详细推导见&#xff1a;b站老王 自动驾驶决策规划学习记录&#xff08;四&#xff09; Apollo相关代码&#xff1a; modules/common/math/cartesian_frenet_conversion.h #pragma once #include <array> #include "modules/common/mat…

怎么一边讲PPT一边录视频 如何一边录制PPT一边录制人像 录屏软件免费录屏 PPT录制怎么录制

随着新媒体技术的发展&#xff0c;短视频和直播越来越火。越来越多的小伙伴加入了视频制作的大军&#xff0c;那么你想知道怎么一边讲PPT一边录视频&#xff0c;如何一边录制PPT一边录制人像吗&#xff1f; 一、怎么一边讲PPT一边录视频 我们可以借助PPT本身自带的屏幕录制功能…

Linux的发展历程:从诞生到全球应用

一、前言 Linux作为一个开源操作系统&#xff0c;经历了令人瞩目的发展历程。从最初的创意到如今在全球范围内得到广泛应用&#xff0c;Linux不仅是技术的杰出代表&#xff0c;更是开源精神的典范。本文将追溯Linux的发展历程&#xff0c;深入了解它是如何从一个个人项目演变为…

【docker笔记】Docker容器数据卷

Docker容器数据卷 卷就是目录或者文件&#xff0c;存在于一个或多个容器中&#xff0c;由docker挂载到容器&#xff0c;但不属于联合文件系统&#xff0c;因此能够绕过Union File System提供一些用于持续存储或共享数据的特性 卷的设计目的就是数据的持久化&#xff0c;完全独…

VMware Workstation安装以及配置模板机

文章目录 一、VMware Workstation软件下载安装1、下载2、安装 二、CentOS7模板机安装1、创建虚拟机2、安装系统 三、网络配置 一、VMware Workstation软件下载安装 1、下载 下载地址&#xff1a;https://download3.vmware.com/software/wkst/file/VMware-workstation-full-15…

css中的变量和辅助函数

变量 --name 两个破折号加变量名称&#xff08;可以在当前的选择器内定义&#xff09;var(--*) 命名规则 body {--深蓝: #369;background-color: var(--深蓝); } 变量值只能做用属性值&#xff0c;不能用做属性名。变量命名不能包含 $,[,^,(,% 等字符 普通字符局限在只要是数…

软件测试|MySQL逻辑运算符使用详解

简介 在MySQL中&#xff0c;逻辑运算符用于处理布尔类型的数据&#xff0c;进行逻辑判断和组合条件。逻辑运算符主要包括AND、OR、NOT三种&#xff0c;它们可以帮助我们在查询和条件语句中进行复杂的逻辑操作。本文将详细介绍MySQL中逻辑运算符的使用方法和示例。 AND运算符 …

邮政快递单号查询入口,对快递单号进行提前签收分析

一款优秀的快递单号筛选软件能够给你的工作和生活带来极大的便利。通过合理选择和使用该软件&#xff0c;你将能够轻松管理、高效筛选快递单号&#xff0c;提升工作效率和生活品质。不妨试试我们的【快递批量查询高手】&#xff0c;让你的物流管理更加智能、便捷&#xff01; …

Pytest成魔之路 —— fixture 之大解剖!

1. 简介 fixture是pytest的一个闪光点&#xff0c;pytest要精通怎么能不学习fixture呢&#xff1f;跟着我一起深入学习fixture吧。其实unittest和nose都支持fixture&#xff0c;但是pytest做得更炫。 fixture是pytest特有的功能&#xff0c;它用pytest.fixture标识&#xff0c…

【算法Hot100系列】搜索插入位置

💝💝💝欢迎来到我的博客,很高兴能够在这里和您见面!希望您在这里可以感受到一份轻松愉快的氛围,不仅可以获得有趣的内容和知识,也可以畅所欲言、分享您的想法和见解。 推荐:kwan 的首页,持续学习,不断总结,共同进步,活到老学到老导航 檀越剑指大厂系列:全面总结 jav…

运用AI翻译漫画(二)

构建代码 构建这个PC桌面应用&#xff0c;我们需要几个步骤&#xff1a; 在得到第一次的显示结果后&#xff0c;经过测试&#xff0c;有很大可能会根据结果再对界面进行调整&#xff0c;实际上也是一个局部的软件工程中的迭代开发。 界面设计 启动Visual Studio 2017, 创建…

并发程序设计--D10线程池及gdb调试多线程

线程池 概念&#xff1a; 通俗的讲就是一个线程的池子&#xff0c;可以循环的完成任务的一组线程集合 必要性&#xff1a; 我们平时创建一个线程&#xff0c;完成某一个任务&#xff0c;等待线程的退出。但当需要创建大量的线程时&#xff0c;假设T1为创建线程时间&#xf…

贯穿设计模式-中介模式+模版模式

样例代码 涉及到的项目样例代码均可以从https://github.com/WeiXiao-Hyy/Design-Patterns.git获取 需求 购买商品时会存在着朋友代付的场景&#xff0c;可以抽象为购买者&#xff0c;支付者和中介者之间的关系 -> 中介者模式下单&#xff0c;支付&#xff0c;发货&#xff0…

什么是软件测试

一、软件测试的定义 软件测试的经典定义是在规定条件下对程序进行操作&#xff0c;以发现错误&#xff0c;对软件质量进行评估。因为软件是由文档、数据以及程序组成的&#xff0c;所以软件测试的对象也就不仅仅是程序本身&#xff0c;而是包括软件形成过程的文档、数据以及程…

什么是博若莱新酒节?

在红酒圈儿里混&#xff0c;一定不能不知道博若莱新酒节&#xff0c;这是法国举世闻名的以酒为主题的重要节日之一。现已成为世界范围内庆祝当年葡萄收获和酿制的节日&#xff0c;被称为一年一度的酒迷盛会。 云仓酒庄的品牌雷盛红酒LEESON分享博若莱位于法国勃艮第南部&#x…

Spark Core------算子介绍

RDD基本介绍 什么是RDD RDD:英文全称Resilient Distributed Dataset&#xff0c;叫做弹性分布式数据集&#xff0c;是Spark中最基本的数据抽象&#xff0c;代表一个不可变、可分区、里面的元素可并行计算的集合。 Resilient弹性&#xff1a;RDD的数据可以存储在内存或者磁盘…

C# OpenCvSharp DNN FreeYOLO 目标检测

目录 效果 模型信息 项目 代码 下载 C# OpenCvSharp DNN FreeYOLO 目标检测 效果 模型信息 Inputs ------------------------- name&#xff1a;input tensor&#xff1a;Float[1, 3, 192, 320] --------------------------------------------------------------- Outp…

Eureka注册中心Eureka提供者与消费者,Eureka原理分析,创建EurekaServer和注册user-service

提示&#xff1a;文章写完后&#xff0c;目录可以自动生成&#xff0c;如何生成可参考右边的帮助文档 文章目录 前言一、Eureka提供者与消费者二、Eureka原理分析eurekaeureka的作用eureka总结 三、创建EurekaServer和注册user-service创建EurekaServer总结 服务的拉取总结-Eur…

docker拉取镜像提示 remote trust data does not exist for xxxxxx

1、How can I be sure that I am pulling a trusted image from docker 2、docker: you are not authorized to perform this operation: server returned 401. 以上两个问题可以试试以下解决办法 DOCKER_CONTENT_TRUSTfalse 本人是使用jenkins部署自己的项目到docker容器出现…

Spring MVC参数接收、参数传递

Springmvc中&#xff0c;接收页面提交的数据是通过方法形参来接收&#xff1a; 处理器适配器调用springmvc使用反射将前端提交的参数传递给controller方法的形参 springmvc接收的参数都是String类型&#xff0c;所以spirngmvc提供了很多converter&#xff08;转换器&#xff0…