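1. Reproducing the Problem
- Test environment
1. CDH 6.1
2. Redhat 7.4
3. Kerberos is not enabled on the cluster
1. The NodeManager service on one server in the cluster failed to start. The log shows the following error: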
Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:282)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:343)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:838)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:911)
2. Restarting this NodeManager service several times produced the same error every time.
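Before changing anything, it can help to confirm which file LevelDB reports as missing and whether it is really absent from disk. The following is a minimal sketch, not part of the original procedure: the function names missing_sst_files and report_missing are hypothetical helpers, and the log file location is an assumption that must be adjusted to the actual NodeManager log on the node.

```shell
#!/usr/bin/env bash
# Sketch: pull the file names that LevelDB reports as missing out of a
# NodeManager log and check whether each one exists on disk.

# missing_sst_files LOGFILE
# Prints each distinct path that follows "missing files; e.g.: " in LOGFILE.
missing_sst_files() {
  grep -o 'missing files; e\.g\.: [^ ]*\.sst' "$1" | sed 's/.*e\.g\.: //' | sort -u
}

# report_missing LOGFILE
# Prints "MISSING <path>" or "PRESENT <path>" for every path found in LOGFILE.
report_missing() {
  local path
  missing_sst_files "$1" | while read -r path; do
    if [ -e "$path" ]; then
      echo "PRESENT $path"
    else
      echo "MISSING $path"
    fi
  done
}

# Hypothetical invocation; replace the argument with the real log file:
# report_missing /path/to/nodemanager.log
```

A MISSING line for a file under yarn-nm-state is consistent with the corruption message above; a PRESENT line would suggest looking at permissions or at the contents of the state store instead.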
2. Resolving the Problem
1. Back up the /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state directory on the affected NodeManager node:
[root@cdh03 hadoop-yarn]# tar cvzf nmstate.tar.gz /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/*
2. Delete the /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state directory on the same NodeManager node:
[root@cdh03 hadoop-yarn]# rm -rf /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state
3. Restart the NodeManager service again.
This time it starts successfully and the problem is resolved.
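The backup and delete steps above can also be combined so that the directory is removed only after the archive has been written and can be read back. This is a sketch under stated assumptions rather than the procedure used in this article: backup_and_remove is a hypothetical helper name, and the archive location in the example is only illustrative.

```shell
#!/usr/bin/env bash
# backup_and_remove DIR ARCHIVE
# Archives DIR into ARCHIVE, checks that the archive can be listed, and only
# then removes DIR. Returns non-zero without deleting anything if a step fails.
backup_and_remove() {
  local dir="$1" archive="$2"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  # -C stores paths relative to the parent directory instead of absolute paths.
  tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
  # Listing the archive is a cheap check that it is readable.
  tar -tzf "$archive" >/dev/null || return 1
  rm -rf "$dir"
}

# Example with the directory from this article (illustrative archive path):
# backup_and_remove /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state \
#   /var/lib/hadoop-yarn/nmstate.tar.gz
```

Because the archive stores the directory by its base name, it can later be unpacked with tar -xzf into any location for inspection.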
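3. Summary
1. With NodeManager recovery enabled, the NodeManager persists container state to the local file system while it runs, so that after an unexpected shutdown and a successful restart it can recover the state of the containers that were running on the node. Two parameters control this behavior: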
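- yarn.nodemanager.recovery.enabled: whether NodeManager recovery is enabled. In CDH the default is true, so the feature is on by default.
- yarn.nodemanager.recovery.dir (NodeManager Recovery Directory): the local file system directory in which the NodeManager stores state when recovery is enabled. In CDH the default is /var/lib/hadoop-yarn/yarn-nm-recovery.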
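To see which values a given yarn-site.xml actually carries for these two parameters, a small text-level lookup is often enough. The sketch below is illustrative only: yarn_site_value is a hypothetical helper, it assumes GNU sed (as shipped with Redhat) and that <value> follows <name> inside the same <property> element, and the location of the NodeManager's effective yarn-site.xml on a CM-managed node is something to confirm on the node itself.

```shell
#!/usr/bin/env bash
# yarn_site_value FILE NAME
# Prints the <value> of property NAME in a Hadoop-style XML config FILE.
# Prints nothing if the property is not present.
yarn_site_value() {
  # Join the file into one line, then split it again after each </property>
  # so that every property sits on its own line regardless of indentation.
  tr -d '\n' < "$1" \
    | sed 's#</property>#</property>\n#g' \
    | grep -F "<name>$2</name>" \
    | sed -n 's#.*<value>\(.*\)</value>.*#\1#p' \
    | head -n 1
}

# Hypothetical invocation; replace the path with the yarn-site.xml to inspect:
# yarn_site_value /path/to/yarn-site.xml yarn.nodemanager.recovery.enabled
# yarn_site_value /path/to/yarn-site.xml yarn.nodemanager.recovery.dir
```

An empty result means the file does not set the property explicitly, in which case the default applies.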
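2. As for the exception described in this article, where the files the NodeManager uses to save container state are corrupted or missing, the root cause has yet to be confirmed. Reportedly, rebooting the server that hosts a NodeManager while jobs are running on YARN can lead to this problem.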
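3. This article fixes the problem by deleting the NodeManager state directory under the recovery directory. Alternatively, you can disable the recovery feature in CM:
a). In CM, go to the YARN service;
b). Select "Configuration" and search for yarn-site;
c). In YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml, add the following: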
<property><name>yarn.nodemanager.recovery.enabled</name><value>false</value></property>
d). Follow the prompts to deploy the configuration, then restart the service.