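Detailed Analysis of an Abnormal Oracle RAC Node Restart in a Customer Management System

I. Fault Overview

        At about 10:58 on the day of the incident, all database instances on node 1 of the customer management system restarted abnormally; service returned to normal after the restart. The analysis below traces the restart on node 1 to the eviction of its ASM instance by node 2 after an inter-instance communication timeout.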

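II. Root Cause Analysis

1. Database log analysis

        The database alert log on node 1 shows that server processes started to be aborted at 10:58:49, and that PMON finally terminated the instance due to error 481 (ORA-00481: LMON process terminated with error). In a RAC environment this error usually points to a network problem.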

alert_reportdb1.log:

***********************************************************************

Sat Dec 07 10:58:49 XXXX

***********************************************************************

Fatal NI connect error 12537, connecting to:

 (LOCAL=NO)

Fatal NI connect error 12537, connecting to:

 (LOCAL=NO)

Fatal NI connect error 12537, connecting to:

 (LOCAL=NO)

  VERSION INFORMATION:

TNS for Linux: Version 11.2.0.4.0 - Production

Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

TNS-12537: TNS:connection closed

    ns main err code: 12537

    ns secondary err code: 12560

TNS-12537: TNS:connection closed  

    nt main err code: 0

    ns secondary err code: 12560

    nt secondary err code: 0

    nt main err code: 0

TNS-12537: TNS:connection closed

    nt OS err code: 0

    nt secondary err code: 0

    ns secondary err code: 12560

    nt OS err code: 0

    nt main err code: 0

    nt secondary err code: 0

    nt OS err code: 0

opiodr aborting process unknown ospid (36742) as a result of ORA-609

opiodr aborting process unknown ospid (36722) as a result of ORA-609

opiodr aborting process unknown ospid (36738) as a result of ORA-609

Sat Dec 07 10:58:49 2023

***********************************************************************

Fatal NI connect error 12537, connecting to:

 (LOCAL=NO)

  VERSION INFORMATION:

TNS for Linux: Version 11.2.0.4.0 - Production

Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

  Time: 07-DEC-XXXX 10:58:49

  Tracing not turned on.

  Tns error struct:

    ns main err code: 12537

TNS-12537: TNS:connection closed

    ns secondary err code: 12560

    nt main err code: 0

    nt secondary err code: 0

Sat Dec 07 10:58:49 2023

***********************************************************************

    nt OS err code: 0

Fatal NI connect error 12537, connecting to:

 (LOCAL=NO)

  VERSION INFORMATION:

TNS for Linux: Version 11.2.0.4.0 - Production

Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

  Time: 07-DEC-XXXX 10:58:49

  Tracing not turned on.

  Tns error struct:

    ns main err code: 12537

TNS-12537: TNS:connection closed

opiodr aborting process unknown ospid (36751) as a result of ORA-609

    ns secondary err code: 12560

    nt main err code: 0

    nt secondary err code: 0

    nt OS err code: 0

opiodr aborting process unknown ospid (36761) as a result of ORA-609

Sat Dec 07 10:58:49 2023

......

opiodr aborting process unknown ospid (36746) as a result of ORA-609

opiodr aborting process unknown ospid (36777) as a result of ORA-609opiodr aborting process unknown ospid (36807) as a result of ORA-609

opiodr aborting process unknown ospid (36819) as a result of ORA-609

Sat Dec 07 10:58:49 2023

PMON (ospid: 48234): terminating the instance due to error 481
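
When an alert log is this noisy, it can help to reduce it to two facts: how many server processes were aborted per ORA code, and which process terminated the instance. The following is a minimal sketch, assuming the message formats seen in the excerpt above; the function name `summarize_alert_log` and the `SAMPLE` string are illustrative and not part of any Oracle tool.

```python
import re
from collections import Counter

# Patterns follow the message formats seen in the excerpt above.
ABORT_RE = re.compile(
    r"opiodr aborting process \S+ ospid \((\d+)\) as a result of ORA-(\d+)"
)
TERMINATE_RE = re.compile(
    r"(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)


def summarize_alert_log(text):
    """Count aborted server processes per ORA code and find the terminating process."""
    # findall scans the whole text, so two messages fused on one line are both counted.
    aborts = Counter(code for _ospid, code in ABORT_RE.findall(text))
    match = TERMINATE_RE.search(text)
    termination = None
    if match:
        termination = {
            "process": match.group(1),
            "ospid": int(match.group(2)),
            "error": int(match.group(3)),
        }
    return {"aborts_by_ora_code": dict(aborts), "termination": termination}


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
opiodr aborting process unknown ospid (36777) as a result of ORA-609opiodr aborting process unknown ospid (36807) as a result of ORA-609
opiodr aborting process unknown ospid (36819) as a result of ORA-609
PMON (ospid: 48234): terminating the instance due to error 481
"""

summary = summarize_alert_log(SAMPLE)
print(summary)
```

On a real system the same function could be fed the full text of alert_reportdb1.log; the ORA-609 aborts are then visible as a symptom, while the termination entry identifies the process and error that actually brought the instance down.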

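2. CRS alert log analysis

        The CRS alert log shows that from about 10:58:49 the checks of all database resources failed. This coincides with the time the database instances aborted, so it is most likely a consequence of the abort.

------------------------------ Node 1 CRS alert trace file ----------------------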

xxxx- 12-07 10:58:49.068 [CRSD(46493)]CRS-5825: Agent '/u01/app/grid/12.1.0.2/bin/oraagent_grid' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd.trc.

xxxx- 12-07 10:58:49.094 [ORAAGENT(47263)]CRS-5832: Agent '/u01/app/grid/12.1.0.2/bin/oraagent_grid' was unable to process commands. Details at (:CRSAGF00128:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc.

xxxx- 12-07 10:58:49.094 [ORAAGENT(47263)]CRS-5818: Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc.

xxxx- 12-07 10:58:50.173 [ORAAGENT(47494)]CRS-5011: Check of resource "reportdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:50.298 [ORAAGENT(47494)]CRS-5011: Check of resource "managedb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.029 [ORAAGENT(47494)]CRS-5011: Check of resource "hwxddb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.222 [ORAAGENT(47494)]CRS-5011: Check of resource "hwwlxtdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.284 [ORAAGENT(47494)]CRS-5011: Check of resource "hwyyxtdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.285 [ORAAGENT(47494)]CRS-5011: Check of resource "yxgldb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.297 [ORAAGENT(47494)]CRS-5011: Check of resource "mhlwyxdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:51.298 [ORAAGENT(47494)]CRS-5011: Check of resource "boarddb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:52.273 [ORAAGENT(47494)]CRS-5011: Check of resource "tyjgdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:52.285 [ORAAGENT(47494)]CRS-5011: Check of resource "obsadb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

xxxx- 12-07 10:58:52.969 [ORAAGENT(36712)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 36712

xxxx- 12-07 10:58:54.741 [ORAAGENT(41064)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/ohasd_oraagent_grid.trc"

xxxx- 12-07 10:58:55.406 [ORAAGENT(36712)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc"

xxxx- 12-07 10:58:55.424 [ORAAGENT(36712)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc"

xxxx- 12-07 10:58:55.455 [ORAAGENT(41064)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/ohasd_oraagent_grid.trc"

xxxx- 12-07 10:58:55.527 [ORAAGENT(36712)]CRS-5017: The resource action "ora.RECOC1.dg start" encountered the following error:

xxxx- 12-07 10:58:55.527+ORA-01092: ORACLE instance terminated. Disconnection forced
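
To confirm that every resource failed its check within the same few seconds, the CRS-5011 messages can be reduced to one first-failure time per resource. This is a rough sketch that assumes the line layout shown in the excerpt above; `failed_checks` and `SAMPLE` are hypothetical names introduced here for illustration.

```python
import re

# Matches: <time> [AGENT(pid)]CRS-5011: Check of resource "<name>" failed
CHECK_FAILED_RE = re.compile(
    r'(\d{2}:\d{2}:\d{2}\.\d{3}) \[\w+\(\d+\)\]CRS-5011: '
    r'Check of resource "([^"]+)" failed'
)


def failed_checks(text):
    """Map each resource name to the time of its first failed check."""
    first_seen = {}
    for timestamp, resource in CHECK_FAILED_RE.findall(text):
        # Keep only the earliest occurrence; the log is already in time order.
        first_seen.setdefault(resource, timestamp)
    return first_seen


# Three lines copied from the excerpt above, used as sample input.
SAMPLE = '''\
xxxx- 12-07 10:58:50.173 [ORAAGENT(47494)]CRS-5011: Check of resource "reportdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"
xxxx- 12-07 10:58:54.741 [ORAAGENT(41064)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/ohasd_oraagent_grid.trc"
xxxx- 12-07 10:58:55.406 [ORAAGENT(36712)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc"
'''

for resource, timestamp in failed_checks(SAMPLE).items():
    print(f"{timestamp}  {resource}")
```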

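3. ASM trace log analysis

        The alert log of ASM instance 1 shows that at 10:58:48 instance 2 initiated an abort of ASM instance 1, so the alert log and LMON trace of instance 2 need to be examined. The log also contains "IPC Send timeout" messages, which usually indicate a timeout on the interconnect (heartbeat) network.

The alert log of ASM instance 2 shows that as early as 10:56:35 its core background processes hit timeouts when sending to LMD0 (44844) on instance 1; instance 2 then judged ASM instance 1 to be hung and initiated a kill.

         The LMON trace of ASM instance 2 shows that it started attempting a reconfiguration at 10:56:35, with a 100 s vote timeout: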

......

        Then, at 10:58:28, ASM instance 1 was evicted by vote:
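
The timestamps quoted in this section can be lined up to check that the sequence is plausible. The sketch below only does clock arithmetic on the three times mentioned in the text; the event descriptions are paraphrases, and the helper names are introduced here for illustration.

```python
from datetime import datetime

# Timestamps quoted in this section (same day, node-local time).
EVENTS = [
    ("10:56:35", "instance 2 reports send timeouts to LMD0 on instance 1; LMON starts reconfiguration"),
    ("10:58:28", "ASM instance 1 is evicted by vote"),
    ("10:58:48", "ASM instance 1 logs the abort initiated by instance 2"),
]


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Elapsed whole seconds between two same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def offsets_from_first(events):
    """Return (offset_seconds, description) pairs relative to the first event."""
    first = events[0][0]
    return [(seconds_between(first, ts), what) for ts, what in events]


for offset, what in offsets_from_first(EVENTS):
    print(f"+{offset:>3d}s  {what}")
```

With these times the eviction falls 113 s after the first timeout, which is consistent with the 100 s vote timeout having expired, and the abort is logged on instance 1 about 20 s later.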

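4. Node 1 diag log analysis

        The diag file written when ASM instance 1 crashed, +ASM1_diag_44836_20231207105849.trc, shows the state of LMD0 (44844), the process that receives messages from ASM instance 2:

        Before the fault all of its waits were "ges remote message", and the last wait in its history lasted 7 min 29 s, which is a typical IPC network wait:

        Checking the system's packet-loss counters shows that both nodes report a large number of "packet reassembles failed" drops: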

[root@mpc01dbadm01 trace]# netstat -s

Ip:

36764567053 total packets received

70116 with invalid addresses

0 forwarded

0 incoming packets discarded

24572526733 incoming packets delivered

21770066525 requests sent out

692241 outgoing packets dropped

30980 fragments dropped after timeout

15457160506 reassemblies required

3265291587 packets reassembled ok

226816 packet reassembles failed

1796293625 fragments received ok

664 fragments failed

7885036302 fragments created

[root@mpc01dbadm02 trace]# netstat -s

Ip:

30349664623 total packets received

79036 with invalid addresses

0 forwarded

0 incoming packets discarded

23893920057 incoming packets delivered

23820631106 requests sent out

295480 outgoing packets dropped

186 dropped because of missing route

28255 fragments dropped after timeout

8368295089 reassemblies required

1912747085 packets reassembled ok

202513 packet reassembles failed

3389250826 fragments received ok

3337 fragments failed

16013866546 fragments created
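
The counters printed by `netstat -s` are cumulative since boot, so a single snapshot shows that reassembly failures have happened but not when. A common approach is to take two snapshots some time apart and compare them. The sketch below assumes the text layout shown above; `parse_ip_counters`, `counter_delta` and `SAMPLE` are illustrative names, not part of any existing tool.

```python
import re

COUNTER_RE = re.compile(r"(\d+)\s+(.+)")


def parse_ip_counters(netstat_output):
    """Return {description: value} for the 'Ip:' section of `netstat -s` output."""
    counters = {}
    in_ip_section = False
    for raw_line in netstat_output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Section headers such as "Ip:" or "Tcp:" end with a colon and carry no count.
        if line.endswith(":") and not line[0].isdigit():
            in_ip_section = line == "Ip:"
            continue
        match = COUNTER_RE.match(line)
        if in_ip_section and match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def counter_delta(before, after, key="packet reassembles failed"):
    """Increase of one counter between two parsed snapshots."""
    return after.get(key, 0) - before.get(key, 0)


# Part of the node 1 output shown above, used as sample input.
SAMPLE = """\
Ip:
30980 fragments dropped after timeout
15457160506 reassemblies required
3265291587 packets reassembled ok
226816 packet reassembles failed
Icmp:
0 ICMP messages received
"""

snapshot = parse_ip_counters(SAMPLE)
print(snapshot["packet reassembles failed"])
```

Parsing the output twice, for example a few minutes apart during peak load, and passing both results to `counter_delta` shows whether the failure counter is still growing.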

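5. OS kernel settings

        The compute nodes of the database system currently run RHEL 6.8 and the storage nodes run RHEL 7.2; the ipfrag parameters are at their default values:

        A related My Oracle Support (MOS) note, "RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure" (Doc ID 2008933.1), describes symptoms that match this fault. The workaround is to increase the ipfrag-related parameters:

          According to Red Hat's official article, this behavior occurs under the following conditions:

  1. RHEL 6.6/6.7; in our experience, similar faults occur on RHEL 6 and 7 in general;
  2. A large number of CPUs (56 on this host);
  3. An Oracle RAC environment.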

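III. Conclusion

  1. The fault was caused by an ASM inter-process communication timeout, which led instance 2 to evict ASM instance 1;
  2. Both nodes show a large number of "packet reassembles failed" drops on the network. According to the MOS note "RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure" (Doc ID 2008933.1), on RHEL 6/7 hosts with many CPUs this happens when IP fragment reassembly exceeds the fragment buffer. The fix is to use jumbo frames or to tune the ipfrag-related system settings.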

IV. Recommendations

        1. Adjust the kernel parameters on all nodes to the following best-practice values:

        net.ipv4.ipfrag_high_thresh = 41943040

        net.ipv4.ipfrag_low_thresh = 40894464

        net.ipv4.ipfrag_time = 120

        net.ipv4.ipfrag_secret_interval = 600

        net.ipv4.ipfrag_max_dist = 1024

        2. Deploy OSWatcher (OSW) on all nodes to make future fault analysis easier.
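
After the change in item 1 has been rolled out, each node can be checked against the recommended values. The following is a minimal sketch, assuming a Linux host where sysctl values are exposed under /proc/sys; the `RECOMMENDED` table simply repeats the values listed above, and the function names are hypothetical. A parameter that does not exist on a given kernel is reported as None rather than raising an error.

```python
from pathlib import Path

# Values copied from the recommendation above.
RECOMMENDED = {
    "net.ipv4.ipfrag_high_thresh": 41943040,  # 40 MiB
    "net.ipv4.ipfrag_low_thresh": 40894464,   # 39 MiB
    "net.ipv4.ipfrag_time": 120,
    "net.ipv4.ipfrag_secret_interval": 600,
    "net.ipv4.ipfrag_max_dist": 1024,
}


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl value; return None if it is missing or not numeric."""
    path = Path(root, *name.split("."))
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def find_mismatches(recommended, reader=read_sysctl):
    """Return {name: (current, recommended)} for every value that differs."""
    mismatches = {}
    for name, wanted in recommended.items():
        current = reader(name)
        if current != wanted:
            mismatches[name] = (current, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (current, wanted) in find_mismatches(RECOMMENDED).items():
        print(f"{name}: current={current} recommended={wanted}")
```

The `reader` argument exists so the comparison can be exercised without touching the real system, for example with a dictionary lookup. To survive a reboot, the values are normally also written to /etc/sysctl.conf and loaded with `sysctl -p`.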
