K8S集群中PLEG问题排查

一、背景

k8s集群排障真的很麻烦

今天集群有同事找我,节点报 PLEG is not healthy 集群中有的节点出现了NotReady,这是什么原因呢?

二、kubernetes源码分析

PLEG is not healthy 也是一个经常出现的问题

POD 生命周期事件生成器

先说下PLEG 这部分代码在kubelet 里,我们看一下在kubelet中的注释:

// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {// The period for relisting.relistPeriod time.Duration// The container runtime.runtime kubecontainer.Runtime// The channel from which the subscriber listens events.eventChannel chan *PodLifecycleEvent// The internal cache for pod/container information.podRecords podRecords// Time of the last relisting.relistTime atomic.Value// Cache for storing the runtime states required for syncing pods.cache kubecontainer.Cache// For testability.clock clock.Clock// Pods that failed to have their status retrieved during a relist. These pods will be// retried during the next relisting.podsToReinspect map[types.UID]*kubecontainer.Pod
}

也就是说kubelet 会定时把 拉取pod 的列表,然后记录下结果。

运行代码后会执行一个定时任务,定时调用relist函数

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

relist函数里关键代码:

	// Get all the pods.podList, err := g.runtime.GetPods(true)if err != nil {klog.ErrorS(err, "GenericPLEG: Unable to retrieve pods")return}g.updateRelistTime(timestamp)

我们可以看到kubelet 定期调用 docker.sock 或者containerd.sock 去调用CRI 去拉取pod列表,然后更新下relist时间。

我们在看Health 函数,是被定时调用的健康检查处理函数:

// Healthy check if PLEG work properly.
// relistThreshold is the maximum interval between two relist.
func (g *GenericPLEG) Healthy() (bool, error) {relistTime := g.getRelistTime()if relistTime.IsZero() {return false, fmt.Errorf("pleg has yet to be successful")}// Expose as metric so you can alert on `time()-pleg_last_seen_seconds > nn`metrics.PLEGLastSeen.Set(float64(relistTime.Unix()))elapsed := g.clock.Since(relistTime)if elapsed > relistThreshold {return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)}return true, nil
}

他是用当前时间 减去 relist更新时间,得到的时间如果超过relistThreshold就代表可能不健康

	// The threshold needs to be greater than the relisting period + the// relisting time, which can vary significantly. Set a conservative// threshold to avoid flipping between healthy and unhealthy.relistThreshold = 3 * time.Minute

进一步思考这个问题,我们就把问题锁定在了CRI 容器运行时的地方

三、锁定错误

这个问题出错的根源是在容器运行时超时,意味着dockerd 或者 contaienrd 出现故障,我们到那台机器上看到kubelet 的日志发现很多CRI 超时的 不可用的日志

Nov 02 13:41:43 app04 kubelet[8411]: E1102 13:41:43.111882    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.036729    8411 kubelet.go:2396] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady messag
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.112993    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113027    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113041    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114281    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114319    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114335    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.344912    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345214    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345501    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.630715    8411 kubelet.go:2040] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was las
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115226    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115265    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115280    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116608    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116647    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116667    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081612    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081611    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082134    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082201    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082378    8411 remote_runtime.go:6

想办法重启运行时 或者去排查containerd

Nov 02 12:58:45 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:46 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:47 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:48 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:49 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:50 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:51 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:52 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:53 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:54 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:55 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:56 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:57 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:58 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:59 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:00 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:01 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:02 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:03 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:04 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s

发现是CRI 服务端接受太多套接字,导致accept 失败了,可以适当调大ulimit

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/180468.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

for for in for of 的区别

for是JavaScript中最基本的循环语句,它可以用于遍历数组和对象中的元素,语法如下: for (初始化; 判断条件; 增量) {// 循环体 }其中,初始化是循环开始前执行的语句,判断条件是判断循环是否可以继续的条件,…

Pybullet -------[ 1 ]

1. p.addUserDebugText() 这个函数允许在仿真环境中动态添加文本,用于调试和可视化。你可以指定文本的内容、位置、颜色、大小等属性。 p.addUserDebugText(text, textPosition, textColorRGB[1,1,1], textSize1, lifeTime0, parentObjectUniqueId0,parentLinkInde…

机器视觉双目测宽仪具体有什么优势?

双目测宽仪是机器视觉原来制造而成的智能宽度检测设备,广泛应用于板材类产品的宽度检测。通过测宽仪的使用,实时了解产品宽度品质,进行超差提示,减少废品的生产。 双目测宽仪优势 测量软件界面显示:产品规格、标称宽…

MATLAB算法实战应用案例精讲-【图像处理】FPGA

目录 前言 算法原理 FPGA 是什么 FPGA的定义以及和GPU的类比 FPGA 有什么用 1)通信领域

Android控件全解手册 - 任意View缩放平移工具-源码

Unity3D特效百例案例项目实战源码Android-Unity实战问题汇总游戏脚本-辅助自动化Android控件全解手册再战Android系列Scratch编程案例软考全系列Unity3D学习专栏蓝桥系列ChatGPT和AIGC 👉关于作者 专注于Android/Unity和各种游戏开发技巧,以及各种资源分…

详细介绍如何使用 PaddleOCR 进行光学字符识别-(含源码及讲解)

阅读巨大的文档可能会非常累并且非常耗时。您一定见过许多软件或应用程序,只需单击图片即可从文档中获取关键信息。这是通过一种称为光学字符识别 (OCR) 的技术来完成的。光学字符识别是近年来人工智能领域的重点研究之一。光学字符识别是通过理解和分析图像的基本模式来识别图…

竞赛选题 题目:基于机器视觉的图像矫正 (以车牌识别为例) - 图像畸变校正

文章目录 0 简介1 思路简介1.1 车牌定位1.2 畸变校正 2 代码实现2.1 车牌定位2.1.1 通过颜色特征选定可疑区域2.1.2 寻找车牌外围轮廓2.1.3 车牌区域定位 2.2 畸变校正2.2.1 畸变后车牌顶点定位2.2.2 校正 7 最后 0 简介 🔥 优质竞赛项目系列,今天要分享…

yolov8-pose姿势估计,站立识别

系列文章目录 基于yolov8-pose的姿势估计模式,实现站姿,坐姿,伏案睡姿识别,姿态动作识别接口逻辑作参考。本文以学习交流,分享,欢迎留言讨论优化。 yoloPose-姿势动作识别 系列文章目录前言一、环境安装二、使用yolov8-pose1.导入模型,预测图像三.姿势动作识别之站立总…

unity实时保存对象的位姿,重新运行程序时用最后保存的数据给物体赋值

using UnityEngine; using System.IO; // using System.Xml.Serialization; public class SaveCoordinates : MonoBehaviour {public GameObject MainObject;//读取坐标private float x;private float y;private float z;private Quaternion quaternion;private void Start(){/…

如何使用torchrun启动单机多卡DDP并行训练

如何使用torchrun启动单机多卡DDP并行训练 这是一个最近项目中需要使用的方式,新近的数据集大概在40w的规模,而且载入的原始特征都比较大(5~7M),所以准备尝试DistributedDataParallel; 主要目…

Qt 自定义标题栏

在Qt中,如果你想要自定义窗口的标题栏,你可以通过覆盖窗口的windowTitleChanged信号来实现。然而,直接修改Qt的标题栏可能会带来一些问题,因为Qt的设计是尽量使窗口系统的行为标准化。 以下是一个基本的示例,如何在Qt…

Java中的集合

Java中的集合 java.util 包中的集合 Java 集合框架提供了各种集合类,用于存储和管理对象。以下是 Java 集合框架中常见的集合类: List 接口表示一个有序的集合,其中的元素可以重复。List 接口有以下实现类: ArrayList&#xff1…

人工智能_机器学习053_支持向量机SVM目标函数推导_SVM条件_公式推导过程---人工智能工作笔记0093

然后我们再来看一下支持向量机SVM的公式推导情况 来看一下支持向量机是如何把现实问题转换成数学问题的. 首先我们来看这里的方程比如说,中间的黑线我们叫做l2 那么上边界线我们叫l1 下边界线叫做l3 如果我们假设l2的方程是上面这个方程WT.x+b = 0 那么这里 我们只要确定w和…

<Linux> 文件理解与操作

目录 前言: 一、关于文件的预备知识 二、C语言文件操作 1. fope 2. fclose 3. 文件写入 3.1 fprintf 3.2 snprintf 三、系统文件操作 1. open 2. close 3. write 4. read 四、C文件接口与系统文件IO的关系 五、文件描述符 1. 理解文件描述符 2. 文…

时延抖动和通信的本质

先从网络时延抖动的根源说起。 信息能否过去取决于信道容量,而信道利用率则取决于编码。这是香农定律决定的。 考虑到主机处理非常快,忽略处理时延,端到端时延就是信息传播时延,但现实中通信信道利用率非常不均匀,统…

一则 MongoDB 副本集迁移实操案例

文中详细阐述了通过全量 增量 Oplog 的迁移方式,完成一套副本集 MongoDB 迁移的全过程。 作者:张然,DBA 数据库技术爱好者~ 爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。 本文约 900…

python炒股自动化(1),量化交易接口区别

要实现股票量化程序化自动化,就需要券商提供的API接口,重点是个人账户小散户可以申请开通,上手要简单,接口要足够全面,功能完善,首先,第一步就是要找对渠道和方法,这里我们不讨论量化…

linux 内核等待队列

等待队列在Linux内核中用来阻塞或唤醒一个进程,也可以用来同步对系统资源的访问,还可以实现延迟功能 在软件开发中任务经常由于某种条件没有得到满足而不得不进入睡眠状态,然后等待条件得到满足的时候再继续运行,进入运行状态。这…

网络安全--基于Kali的网络扫描基础技术

文章目录 1. 标准ICMP扫描1.1使用Ping命令1.1.1格式1.1.2实战 1.2使用Nmap工具1.2.1格式1.2.2实战1.2.2.1主机在线1.2.2.2主机不在线 1.3使用Fping命令1.3.1格式1.3.2实战 2. 时间戳查询扫描2.1格式2.2实战 3. 地址掩码查询扫描3.1格式3.2实战 2. TCP扫描2.1TCP工作机制2.2TCP …

MySQL 索引类型

什么是索引? 索引是一种用于提高数据库查询性能的数据结构。它是在表中一个或多个列上创建的,可以加快对这些列的数据检索速度。 索引的作用是通过创建一个额外的数据结构,使得数据库可以更快地定位和访问数据。当执行查询语句时&#xff0c…