Troubleshooting and Fixing Memory Errors on Linux

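Over the past few days the server kept dropping offline, so I set out to find the cause. A sensible first step is to look at the hardware. This machine is a server pieced together from parts bought on a second-hand marketplace: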

sudo lshw -short
H/W path           Device          Class          Description
=============================================================
                                   system         NF5270M3 (To be filled by O.E.M.)
/0                                 bus            NF5270M3
/0/0                               memory         64KiB BIOS
/0/4                               processor      Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
/0/4/5                             memory         384KiB L1 cache
/0/4/6                             memory         1536KiB L2 cache
/0/4/7                             memory         15MiB L3 cache
/0/6                               processor      Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
/0/6/9                             memory         384KiB L1 cache
/0/6/a                             memory         1536KiB L2 cache
/0/6/b                             memory         15MiB L3 cache
/0/2c                              memory         24GiB System Memory
/0/2c/0                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/1                            memory         DIMM Synchronous [empty]
/0/2c/2                            memory         DIMM Synchronous [empty]
/0/2c/3                            memory         DIMM Synchronous [empty]
/0/2c/4                            memory         DIMM Synchronous [empty]
/0/2c/5                            memory         DIMM Synchronous [empty]
/0/2c/6                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/7                            memory         DIMM Synchronous [empty]
/0/2c/8                            memory         DIMM Synchronous [empty]
/0/2c/9                            memory         DIMM Synchronous [empty]
/0/2c/a                            memory         DIMM Synchronous [empty]
/0/2c/b                            memory         DIMM Synchronous [empty]
/0/2c/c                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/d                            memory         DIMM Synchronous [empty]
/0/2c/e                            memory         DIMM Synchronous [empty]
/0/2c/f                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/10                           memory         DIMM Synchronous [empty]
/0/2c/11                           memory         DIMM Synchronous [empty]
/0/2c/12                           memory         DIMM Synchronous [empty]
/0/2c/13                           memory         DIMM Synchronous [empty]
/0/100/3/0         /dev/nvme0      storage        LITEON CA3-8D128-HP
/0/100/3/0/0       hwmon0          disk           NVMe disk
/0/100/3/0/2       /dev/ng0n1      disk           NVMe disk
/0/100/3/0/1       /dev/nvme0n1    disk           128GB NVMe disk
/0/100/3/0/1/1                     volume         1074MiB Windows FAT volume
/0/100/3/0/1/2     /dev/nvme0n1p2  volume         2GiB EXT4 volume
/0/100/3/0/1/3     /dev/nvme0n1p3  volume         116GiB EFI partition
/0/100/1f.2/0      /dev/sda        disk           500GB WDC WD5000AAKX-0
/0/100/1f.2/0/1    /dev/sda1       volume         465GiB EXT4 volume
/0/100/1f.2/1      /dev/sdb        disk           500GB WDC WD5000AAKX-2
/0/100/1f.2/1/1    /dev/sdb1       volume         465GiB EXT4 volume

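After the network dropped, I connected a monitor over HDMI to look at the console and saw messages mentioning "Memory", which suggested that a faulty memory module might be involved.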

After rebooting, check the system log:

tail -200 /var/log/syslog
2025-04-04T18:23:54.720029+08:00 talos kernel: Memory failure: 0x46fab5: unhandlable page.
2025-04-04T18:23:55.230128+08:00 talos kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
2025-04-04T18:23:55.230140+08:00 talos kernel: {3}[Hardware Error]: It has been corrected by h/w and requires no further action
2025-04-04T18:23:55.230141+08:00 talos kernel: {3}[Hardware Error]: event severity: corrected
2025-04-04T18:23:55.230143+08:00 talos kernel: {3}[Hardware Error]: Error 0, type: corrected
2025-04-04T18:23:55.230144+08:00 talos kernel: {3}[Hardware Error]: fru_text: CorrectedErr
2025-04-04T18:23:55.230145+08:00 talos kernel: {3}[Hardware Error]: section_type: memory error
2025-04-04T18:23:55.230146+08:00 talos kernel: {3}[Hardware Error]: node:0 device:0 
2025-04-04T18:23:55.230147+08:00 talos kernel: {3}[Hardware Error]: error_type: 2, single-bit ECC
2025-04-04T18:24:01.695052+08:00 talos kernel: RAS: Soft-offlining pfn: 0x104e5c
2025-04-04T18:24:01.695076+08:00 talos kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
2025-04-04T18:24:01.695080+08:00 talos kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00050000010092
2025-04-04T18:24:01.695082+08:00 talos kernel: EDAC sbridge MC0: TSC 0 
2025-04-04T18:24:01.695084+08:00 talos kernel: EDAC sbridge MC0: ADDR 104e5c8c0 
2025-04-04T18:24:01.695086+08:00 talos kernel: EDAC sbridge MC0: MISC 40584e86 
2025-04-04T18:24:01.695088+08:00 talos kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1743762241 SOCKET 0 APIC 0
2025-04-04T18:24:01.695090+08:00 talos kernel: EDAC MC0: 20 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x104e5c offset:0x8c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
2025-04-04T18:24:01.695094+08:00 talos kernel: Memory failure: 0x104e5c: unhandlable page.

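The system log shows a serious memory problem at the hardware level. Each individual event is a single-bit ECC error that the hardware corrected, but the errors keep recurring, and the kernel's attempts to soft-offline the affected pages end with "unhandlable page".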
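
The EDAC line and the MCE lines describe the same location in two notations: a page frame number plus an offset, and a full physical address. As a quick sanity check, assuming 4 KiB pages (12 offset bits, the usual x86-64 default), the two can be reconciled with a little shell arithmetic. The helper name `pfn_to_addr` is made up for this illustration.

```shell
# Hypothetical helper: combine an EDAC page frame number and offset into a
# physical address, assuming 4 KiB pages (12 offset bits).
pfn_to_addr() {
    printf '0x%x\n' $(( ($1 << 12) | $2 ))
}

# Values taken from the EDAC line in the log (page:0x104e5c offset:0x8c0).
pfn_to_addr 0x104e5c 0x8c0
```

The result, 0x104e5c8c0, matches the `ADDR 104e5c8c0` field of the machine check event, so both messages point at the same physical location.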

Check the detailed error log:

sudo dmesg | grep -i error
[19108.267949] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[19108.267972] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[19108.267976] {2}[Hardware Error]: event severity: corrected
[19108.267985] {2}[Hardware Error]:  Error 0, type: corrected
[19108.267992] {2}[Hardware Error]:  fru_text: CorrectedErr
[19108.267997] {2}[Hardware Error]:   section_type: memory error
[19108.268003] {2}[Hardware Error]:   node:0 device:0 
[19108.268005] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[19114.873932] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19114.874122] EDAC MC0: 16385 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f934 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19118.239275] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19118.239533] EDAC MC0: 25284 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fb35 offset:0x5c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19128.825566] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19128.825743] EDAC MC0: 16708 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab5 offset:0x5c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19133.700096] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19133.700127] EDAC MC0: 32750 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f834 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19135.870233] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19135.870309] EDAC MC0: 16687 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fa34 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19138.224432] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19138.224502] EDAC MC0: 15745 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46fcb4 offset:0xd80 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19140.213293] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19140.213328] EDAC MC0: 15575 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x10aac5 offset:0x1c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19141.210137] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19141.210164] EDAC MC0: 19211 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab4 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19141.906759] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19141.906780] EDAC MC0: 16437 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f9b4 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19143.127824] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19143.127876] EDAC MC0: 24609 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46f835 offset:0x680 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19145.175716] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19145.175754] EDAC MC0: 5555 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7f3ec offset:0x280 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19148.183616] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19148.183654] EDAC MC0: 4858 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1021ad offset:0x180 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19149.143580] mce: [Hardware Error]: Machine check events logged
[19149.143583] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19149.143619] EDAC MC0: 4223 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x7f3ee offset:0xec0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19149.143629] mce: [Hardware Error]: Machine check events logged
[19151.167012] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19151.167036] EDAC MC0: 4119 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7f3ec offset:0x280 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19152.151462] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19152.151502] EDAC MC0: 3976 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46f835 offset:0x680 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19153.175444] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19153.175485] EDAC MC0: 24245 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fc34 offset:0x9c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19169.174851] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19169.174898] EDAC MC0: 48 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab5 offset:0x3c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )

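The hardware error log from dmesg tells the same story: the system is seeing a large volume of ECC memory errors, concentrated on Channel 2, DIMM 0 and Channel 0, DIMM 0 (both on socket 0), with Channel 2 accounting for most of the events.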
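
Reading dozens of EDAC lines by eye is error-prone, so a small tally per DIMM label can help. The sketch below is one possible way to do it, assuming the number printed before `CE` is the error count reported for that event and that the DIMM label is the word following `on`. The function name `sum_ce_by_dimm` is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and sum the corrected
# error (CE) counts per DIMM label, largest total first.
sum_ce_by_dimm() {
    awk '
    / CE memory / {
        count = ""; label = ""
        for (i = 2; i < NF; i++) {
            if ($i == "CE") count = $(i - 1)
            if ($i == "on") { label = $(i + 1); break }
        }
        if (count != "" && label != "") total[label] += count
    }
    END { for (l in total) print total[l], l }
    ' | sort -rn
}
```

It could be used as `sudo dmesg | sum_ce_by_dimm`, or fed from `/var/log/syslog`, since the field search does not depend on the timestamp format.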

Memory slot and CPU information

sudo dmidecode -t memory | grep -A10 "Memory Device$" | grep -E "Locator|Bank Locator|Size"
	Size: 8 GB
	Locator: Node0_Dimm0
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm1
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm2
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm3
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm4
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm5
	Bank Locator: Node0_Bank0
	Size: 8 GB
	Locator: Node0_Dimm6
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm7
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm8
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm9
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm10
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm11
	Bank Locator: Node0_Bank0
	Size: 8 GB
	Locator: Node1_Dimm0
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm1
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm2
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm3
	Bank Locator: Node1_Bank0
	Size: 8 GB
	Locator: Node1_Dimm4
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm5
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm6
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm7
	Bank Locator: Node1_Bank0
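
With twenty slots, most of them empty, it can be easier to print only the populated ones. The following is a rough sketch, assuming `dmidecode` prints `Size:` before `Locator:` within each memory device block (as in the output above). The function name `populated_slots` is invented for this example.

```shell
# Hypothetical helper: read `dmidecode -t memory` output on stdin and print
# one "Locator: Size" line for every slot that has a module installed.
populated_slots() {
    awk -F': ' '
    { sub(/^[ \t]+/, "") }            # strip the leading indentation
    $1 == "Size"    { size = $2 }     # remember the size of the current slot
    $1 == "Locator" { if (size != "No Module Installed") print $2 ": " size }
    '
}
```

A typical invocation would be `sudo dmidecode -t memory | populated_slots`. Lines such as `Bank Locator:` are ignored because the field name must match `Locator` exactly.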

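From the output above, the faulty module appears to be in the first slot of CPU0. After I removed the memory module from that slot, the server returned to normal.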
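
To confirm that the errors have really stopped, the per-DIMM corrected-error counters that the EDAC subsystem exposes in sysfs can be checked from time to time. The sketch below assumes the kernel provides `dimm_ce_count` and `dimm_label` files under `/sys/devices/system/edac/mc/mc*/dimm*/`, which requires an EDAC driver to be loaded (older kernels may only offer a csrow-based layout). The function name `edac_ce_report` and its optional directory argument are hypothetical.

```shell
# Hypothetical helper: print "<mc/dimm> <label> <corrected error count>" for
# each DIMM known to EDAC. An alternative base directory can be passed as $1.
edac_ce_report() {
    base="${1:-/sys/devices/system/edac/mc}"
    for d in "$base"/mc*/dimm*; do
        # Skip entries that do not expose a readable counter.
        [ -r "$d/dimm_ce_count" ] || continue
        printf '%s\t%s\t%s\n' "${d#$base/}" \
            "$(cat "$d/dimm_label" 2>/dev/null)" \
            "$(cat "$d/dimm_ce_count")"
    done
}

edac_ce_report
```

If the counters stay flat after the module is removed, that supports the diagnosis. If they keep climbing, another module or the slot itself may be at fault.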
