yarn 怎么查看有多个job在跑_flink on yarn 模式下提示yarn资源不足问题分析

背景

在实时计算平台上通过YarnClient向yarn上提交flink任务时一直卡在那里,并在client端一直输出如下日志:

(YarnClusterDescriptor.java:1036)- Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

看到这个的第一反应是yarn上的资源分配问题,于是来到yarn控制台,发现Cluster Metrics中Apps Pending项为1。what?新提交的job为什么会处于pending状态了?

1. 先确定cpu和内存情况如下:

d0735c0410bd152a46147b478eef7706.png

可以看出cpu和内存资源充足,没有发现问题。

2. 查看调度器的使用情况

集群中使用的调度器的类型如下图:

be57efb63b0b7551e81957b991c3a727.png

可以看到,集群中使用的是Capacity Scheduler调度器,也就是所谓的容量调度,这种方案更适合多租户安全地共享大型集群,以便在分配的容量限制下及时分配资源。采用队列的概念,任务提交到队列,队列可以设置资源的占比,并且支持层级队列、访问控制、用户限制、预定等等配置。但是,对于资源的分配占比调优需要更多的经验处理。但它不会出现在使用FIFO Scheduler时会出现的有大任务独占资源,会导致其他任务一直处于 pending 状态的问题。

3. 查看任务队列的情况

e4e1ffaabf59e9fa9d0adb6fcc6663c4.png

从上图中可以看出Configured Minimum User Limit Percent的配置为100%,由于集群目前相对较小,用户队列没有做租户划分,用的都是default队列,从图中可以看出使用的容量也只有38.2%,队列中最多可存放10000个application,而实际的远远少于10000,貎似这里也看不出来什么问题。

4. 查看resourceManager的日志

日志内容如下:

2020-11-26 19:33:46,669 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 3172020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 317 submitted by user root2020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1593338489799_03172020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    IP=x.x.x.x    OPERATION=Submit Application Request    TARGET=ClientRMService    RESULT=SUCCESS    APPID=application_1593338489799_03172020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from NEW to NEW_SAVING on event=START2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1593338489799_03172020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from NEW_SAVING to SUBMITTED on event=APP_NEW_SAVED2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application added - appId: application_1593338489799_0317 user: root leaf-queue of parent: root #applications: 162020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Accepted application application_1593338489799_0317 from user: root, in queue: default2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from SUBMITTED to ACCEPTED on event=APP_ACCEPTED2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1593338489799_0317_0000012020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1593338489799_0317_000001 State change from NEW to SUBMITTED2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit2020-11-26 19:33:48,877 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1593338489799_0317 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@6c0d5b4d, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 15 #queue-pending-applications: 1 #queue-active-applications: 15

从日志中可以看到一个Application在yarn上进行资源分配的完整流程,只是这个任务因为一些原因进入了pending队列而已,与我们要查找的问题相关的日志主要是如下几行:

2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit2020-11-26 19:33:48,877 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1593338489799_0317 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@6c0d5b4d, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 15 #queue-pending-applications: 1 #queue-active-applications: 15

没错,问题就出来not starting application as amIfStarted exceeds amLimit,那么这个是什么原因引起的呢,我们看下stackoverflow[1]上的解释:

ba4409ea3f232384e0da6acbcd022b7c.png

那么yarn.scheduler.capacity.maximum-am-resource-percent参数的真正含义是什么呢?国语意思就是集群中可用于运行application master的资源比例上限,这通常用于限制并发运行的应用程序数目,它的默认值为0.1。

查看了下集群上目前的任务总数有15个左右,每个任务分配有一个约1G左右的jobmanager(jobmanager为Application master类型的application),占15G左右,而集群上的总内存为144G,那么15>144 * 0.1 ,从而导致jobmanager的创建处于pending状态。

5. 解决验证

修改capacity-scheduler.xml的yarn.scheduler.capacity.maximum-am-resource-percent配置为如下:

    yarn.scheduler.capacity.maximum-am-resource-percent    0.5  

除了动态减少队列数目外,capacity-scheduler.xml的其他配置的修改是可以动态更新的,更新命令为:

yarn rmadmin -refreshQueues

执行命令后,在resourceManager的日志中可以看到如下输出:

2020-11-27 09:37:56,340 INFO org.apache.hadoop.conf.Configuration: found resource capacity-scheduler.xml at file:/work/hadoop-2.7.4/etc/hadoop/capacity-scheduler.xml2020-11-27 09:37:56,356 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Re-initializing queues...---------------------------------------------------------------------------2020-11-27 09:37:56,371 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Initialized queue mappings, override: false2020-11-27 09:37:56,372 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    IP=x.x.x.x    OPERATION=refreshQueues    TARGET=AdminServicRESULT=SUCCESS

仔细查看日志可以看到更新已经成功,在平台上重新发布任务显示成功,问题解决。

yarn Queue的配置

Resource Allocation

PropertyDescription
yarn.scheduler.capacity..capacityQueue capacity in percentage (%) as a float (e.g. 12.5) OR as absolute resource queue minimum capacity. The sum of capacities for all queues, at each level, must be equal to 100. However if absolute resource is configured, sum of absolute resources of child queues could be less than it’s parent absolute resource capacity. Applications in the queue may consume more resources than the queue’s capacity if there are free resources, providing elasticity.
yarn.scheduler.capacity..maximum-capacityMaximum queue capacity in percentage (%) as a float OR as absolute resource queue maximum capacity. This limits the elasticity for applications in the queue. 1) Value is between 0 and 100. 2) Admin needs to make sure absolute maximum capacity >= absolute capacity for each queue. Also, setting this value to -1 sets maximum capacity to 100%.
yarn.scheduler.capacity..minimum-user-limit-percentEach queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value. The former (the minimum value) is set to this property value and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer.
yarn.scheduler.capacity..user-limit-factorThe multiple of the queue capacity which can be configured to allow a single user to acquire more resources. By default this is set to 1 which ensures that a single user can never take more than the queue’s configured capacity irrespective of how idle the cluster is. Value is specified as a float.
yarn.scheduler.capacity..maximum-allocation-mbThe per queue maximum limit of memory to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration yarn.scheduler.maximum-allocation-mb. This value must be smaller than or equal to the cluster maximum.
yarn.scheduler.capacity..maximum-allocation-vcoresThe per queue maximum limit of virtual cores to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration yarn.scheduler.maximum-allocation-vcores. This value must be smaller than or equal to the cluster maximum.
yarn.scheduler.capacity..user-settings..weightThis floating point value is used when calculating the user limit resource values for users in a queue. This value will weight each user more or less than the other users in the queue. For example, if user A should receive 50% more resources in a queue than users B and C, this property will be set to 1.5 for user A. Users B and C will default to 1.0.

Resource Allocation using Absolute Resources configuration

CapacityScheduler supports configuration of absolute resources instead of providing Queue capacity in percentage. As mentioned in above configuration section for yarn.scheduler.capacity..capacity and yarn.scheduler.capacity..max-capacity, administrator could specify an absolute resource value like [memory=10240,vcores=12]. This is a valid configuration which indicates 10GB Memory and 12 VCores.

Running and Pending Application Limits

The CapacityScheduler supports the following parameters to control the running and pending applications:

PropertyDescription
yarn.scheduler.capacity.maximum-applications / yarn.scheduler.capacity..maximum-applicationsMaximum number of applications in the system which can be concurrently active both running and pending. Limits on each queue are directly proportional to their queue capacities and user limits. This is a hard limit and any applications submitted when this limit is reached will be rejected. Default is 10000. This can be set for all queues with yarn.scheduler.capacity.maximum-applications and can also be overridden on a per queue basis by setting yarn.scheduler.capacity..maximum-applications. Integer value expected.
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity..maximum-am-resource-percentMaximum percent of resources in the cluster which can be used to run application masters - controls number of concurrent active applications. Limits on each queue are directly proportional to their queue capacities and user limits. Specified as a float - ie 0.5 = 50%. Default is 10%. This can be set for all queues with yarn.scheduler.capacity.maximum-am-resource-percent and can also be overridden on a per queue basis by setting yarn.scheduler.capacity..maximum-am-resource-percent

更多配置

参考:https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

如果想要不重启集群来动态刷新hadoop配置可尝试如下方法:

1、刷新hdfs配置

在两个(以三节点的集群为例)namenode节点上执行:

hdfs dfsadmin -fs hdfs://node1:9000 -refreshSuperUserGroupsConfigurationhdfs dfsadmin -fs hdfs://node2:9000 -refreshSuperUserGroupsConfiguration

2、刷新yarn配置

在两个(以三节点的集群为例)namenode节点上执行:

yarn rmadmin -fs hdfs://node1:9000 -refreshSuperUserGroupsConfigurationyarn rmadmin -fs hdfs://node2:9000 -refreshSuperUserGroupsConfiguration

参考

•https://stackoverflow.com/questions/33465300/why-does-yarn-job-not-transition-to-running-state•https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html•https://stackoverflow.com/questions/29917540/capacity-scheduler•https://cloud.tencent.com/developer/article/1357111•https://cloud.tencent.com/developer/article/1194501

References

[1] stackoverflow: https://stackoverflow.com/questions/33465300/why-does-yarn-job-not-transition-to-running-state

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/455325.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

MPEG-2TS码流编辑的原理及其应用(转载

[作者:辽宁电视台 赵季伟] 在当今数字媒体不断发展、新媒体业务不断涌现 的前提下,实践证明襁褓中的新媒体只有两种经营方略可供选择:或是购买并集成整套节目,或是低成本深加工新节目,再不可能去按照传统生产模式…

FLV文件格式(Z)(转载)

刚才在看一些关于demux的东西,在处理flv格式的文件的时候,由于自己对flv文件的格式不了解,所以就比较云头转向,正好看到了一篇讲述flv文件格式的文章,写的比较明白,所以就转过来了。O(∩_∩)O~flv头文件比较…

mysql-5.7中的innodb_buffer_pool_prefetching(read-ahead)详解

一、innodb的read-ahead是什么: 所谓的read-ahead就是innodb根据你现在访问的数据,推测出你接下来可能要访问的数据,并把它们(可能要访问的数据)读入 内存。 二、read-ahead是怎么做到的: 1、总的来说read-ahead利用的是程序的局部…

mp4文件格式解析(一)

原文地址:mp4文件格式解析(一)作者:可下人间目前MP4的概念被炒得很火,也很乱。最开始MP4指的是音频(MP3的升级版),即MPEG-2 AAC标准。随后MP4概念被转移到视频上,对应的是…

shiro身份验证测试

2019独角兽企业重金招聘Python工程师标准>>> 一、登录验证 1、首先在shiro.ini里准备一些用户身份/凭据,后面这里会使用数据库代替,如: [users] [main] #realm jdbcRealmcom.learnging.system.shiro.ShiroRealm securityManager…

shell if多个条件判断_萌新关于Excel VBA中IF条件判断语句的一点心得体会

作者:金人瑞 《Excel VBA175例无理论纯实战教程》学员最近正在学习郑广学老师的VBA 175例教程,这是一篇新手向的文章,也是一个新手的总结,高手可以批评文章中的不足之处,也可以无视,VBA中的IF判断, 判断一般起到控制作…

编程语言难度排名_谷歌排名第一的编程语言,小学生拿来做答题,分分钟钟搞定高难度算法!...

点击上方蓝色文字关注我们吧谷歌排名第一的编程语言时什么?毫无疑问:肯定是 Python。 也难怪,作为大数据时代和人工智能时代的必备语言,Python 的优点太多了,语言简洁、易学、开发效率高、可移植性强...... 另外&#…

【转载】fullpage.js学习

参考网址:http://www.dowebok.com/77.html 上面有详细介绍及案例展示,很不错哦,可以先去看看demo 一、简介 fullPage.js 是一个基于jQuery的插件,它能够很方便、很轻松的制作出全屏网站,主要功能有: 1.支持…

webpack打包测试_webpack入门笔记(一)

webpack 是一个现代 JavaScript 应用程序的静态模块打包器(module bundler)。当 webpack 处理应用程序时,它会递归地构建一个依赖关系图(dependency graph),其中包含应用程序需要的每个模块,然后将所有这些模块打包成一个或多个 bundle。webp…

mysql中的内置函数

mysql内置函数列表可以从mysql官方文档查询,这里仅分类简单介绍一些可能会用到的函数。 1 数学函数 abs(x) pi() mod(x,y) sqrt(x) ceil(x)或者ceiling(x) rand(),rand(N):返回0-1间的浮点数,使用不同的seed N可以获得不同的随机数 round(x, D)&#xff…

使用 sitemesh/decorator装饰器装饰jsp页面(原理及详细配置)

摘要:首先这个Decorator解释一下这个单词:“装饰器”,我觉得其实可以这样理解,他就像我们用到的Frame,他把每个页面共有的东西提炼了出来,也可能我们也会用各种各样的include标签,将我们的常用页…

安卓开发 新浪微博share接口实现发带本地图片的微博

1.微博share接口 在开始之前,我们先看一下要用到的这个接口: 我们这次是要上传本地图片,可以很明确的知道,除了要用POST方式提交请求,还要采用multipart/form-data编码方式。 那么这个multipart/form-data编码方式是什…

VirtualBox安装Centos6.8出现——E_INVALIDARG (0x80070057)

VirtualBox使用已有的虚拟硬盘出错: 问题描述:UUID已经存在 Cannot register the hard disk E:\system_iso\centos6.8.vdi {05f096aa-67fc-4191-983d-1ed00fc6cce9} because a hard disk E:\system_iso\centos68_02\centos6.8.vdi with UUID {05f096aa-6…

非线性动力学_非线性动力学特辑 低维到高维的联通者

序言: 本文将以维度为主线, 带量大家进入非线性动力学的世界。 文章数学部分不需要全部理解, 理解思维方法为主非线性动力学,是物理学的思维进入传统方法所不能解决的问题的一座丰碑。它可以帮助我们理解不同复杂度和时间空间尺度…

成本预算的四个步骤_全网推广步骤有哪些?

全网推广的步骤是什么?一般来说,搜索引擎优化是大多数中小企业常用的推广方法。主要是通过对一些搜索引擎的排名来提高网站的曝光率,从而更好的提高自己网站的流量,从而更好的实现互联网层面的销售。接下来,让我们学习…

python生成requirements.txt的两种方法

python项目如何在另一个环境上重新构建项目所需要的运行环境依赖包? 使用的时候边记载是个很麻烦的事情,总会出现遗漏的包的问题,这个时候手动安装也很麻烦,不能确定代码报错的需要安装的包是什么版本。这些问题,requi…

node.js 安装使用http-server

node.js npm全局安装了http-server后我该怎么使用它?我在它的安装目录下创建了inde.html,浏览器localhost:8080可以访问,那我的项目需要放在它的安装目录下?还是需要在我的项目下配置什么或者使用什么指令启动它?我在我…

您的apple id 暂时不符合使用此应用程序_Mac相机不工作时该怎么办

苹果公司的许多台式机和笔记本电脑都包含一个内置网络摄像头,该公司愉快地将其称为FaceTime相机。但是,如果您的Mac网络摄像头无法正常工作,并且在尝试访问它时显示为断开连接或不可用,则您可能不会感到高兴。您可以尝试以下操作来…

基于DirectShow的流媒体解码和回放

一、 前言  流媒体的定义很广泛,大多数时候指的是把连续的影像和声音信息经过压缩处理后放上网站服务器,让用户一边下载一边观看、收听,而不需要等整个压缩文件下载到自己机器就可以观看的视频/音频传输、压缩技术。流媒体也指代由这种技术…

汕头市队赛 SRM16 T2

描述 猫和老鼠,看过吧?猫来了,老鼠要躲进洞里。在一条数轴上,一共有n个洞,位置分别在xi,能容纳vi只老鼠。一共有m只老鼠位置分别在Xi,要躲进洞里,问所有老鼠跑进洞里的距离总和最小是…