Some of our production jobs were failing with an OOM error.

Since it was the map tasks that reported the error, we first suspected that the number of maps was too small, so each map processed too much data and ran out of memory. We tuned the parameters to increase the number of maps (for example by lowering the maximum split size, as sketched below), but the problem persisted, so the map count was not the cause.
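For reference, here is a hedged sketch of one way to raise the map count by capping the input split size; the jar, class, paths, and size value are placeholders, and the -D option only takes effect if the job parses generic options via ToolRunner/GenericOptionsParser:

# Cap each input split at 128 MB so large input files are split into more map tasks
hadoop jar my-job.jar com.example.MyJob \
    -Dmapreduce.input.fileinputformat.split.maxsize=134217728 \
    /input/path /output/path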
Using the job id, we looked up the corresponding logs on the JobHistory server, located the failing task id and its host, and from the task log found the id of the problematic container.
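One possible way to do this lookup from the command line is shown below, assuming log aggregation is enabled; the ids are the ones that appear in the logs later in this post, and the output file name is arbitrary:

# Job status and failure info via the JobHistory server
mapred job -status job_1399203487215_21532
# Pull the aggregated container logs for the whole application, then search for the failed map attempt
yarn logs -applicationId application_1399203487215_21532 > app_21532.log
grep -n "attempt_1399203487215_21532_m" app_21532.log | head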
Since containers are allocated by the ResourceManager (RM), the RM log shows how each container was assigned. For example:
2014-05-06 16:00:00,632 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_1399267192386_43455_01_000037 of capacity <memory:1536, vCores:1> on host xxxx:44614, which currently has 4 containers, <memory:6144, vCores:4> used and <memory:79872, vCores:42> available
This line shows the container id, the host it was assigned to, and the container's memory and vcore allocation.
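If you start from a container id, a simple grep over the RM log finds this allocation line; the log path below is only a typical default and depends on the deployment:

# On the ResourceManager host: find where and with what resources the container was assigned
grep "Assigned container container_1399267192386_43455_01_000037" \
    /var/log/hadoop-yarn/yarn-*-resourcemanager-*.log*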
Next, check the NodeManager (NM) log for that container:
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1399203487215_21532_01_000035 by user hdfs
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111       OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1399203487215_21532   CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1399203487215_21532_01_000035 to application application_1399203487215_21532
2014-05-05 10:14:47,055 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from NEW to LOCALIZING
2014-05-05 10:14:47,058 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1399203487215_21532_01_000035
2014-05-05 10:14:47,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens. Credentials list:
2014-05-05 10:14:47,412 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZING to LOCALIZED
2014-05-05 10:14:47,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZED to RUNNING
2014-05-05 10:14:47,493 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/vipshop/hard_disk/6/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035/default_container_executor.sh]
2014-05-05 10:14:48,827 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens to /home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035.tokens
2014-05-05 10:14:49,169 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1399203487215_21532_01_000035
2014-05-05 10:14:49,305 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 66.7 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:53,063 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.1 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:56,379 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.5 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
.......
2014-05-05 10:19:26,823 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 1.1 GB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111       OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1399203487215_21532   CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from RUNNING to KILLING
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1399203487215_21532_01_000035
2014-05-05 10:19:27,800 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
As the log shows, although the container was allocated 1.5 GB, the task was killed when it was using only about 1.1 GB ("1.1 GB of 1.5 GB physical memory used"), with more than 400 MB still free. So the problem was not that the task's overall memory allocation was too small; it looked more like a PermGen issue (MaxPermSize defaults to 64 MB).
We updated the mapred settings as follows, raising MaxPermSize to 128 MB while keeping the 1280 MB heap well inside the 1536 MB container:
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
  <final>true</final>
</property>
We re-ran the job and it completed successfully.
In general, the best way to handle a Java OOM is to print GC information and take a heap dump, then analyze the dump with a tool such as MAT.
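As an illustrative sketch (the jar, class, paths, and file names are placeholders, passing -D options assumes ToolRunner, and the per-job override only works if these opts are not marked final on the cluster), GC logging and an automatic heap dump on OOM can be enabled for one job like this:

# Write a GC log and dump the heap to /tmp on the NodeManager host when a map task hits OOM
hadoop jar my-job.jar com.example.MyJob \
    -Dmapreduce.map.java.opts="-Xmx1280m -XX:MaxPermSize=128m -verbose:gc -XX:+PrintGCDetails -Xloggc:/tmp/task_gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp" \
    /input/path /output/path

The resulting .hprof file can then be copied off the node and opened in MAT to see which objects filled the heap.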