Network failure causes RM failover to fail

While testing RM HA I ran into a JIRA: YARN-2578.

I had previously tested rebooting the machine hosting the RM and killing the RM process directly with kill -9; failover worked fine in both cases.

While googling I came across this JIRA: a network failure can cause RM failover to fail.

Reproducing the bug on virtual machines:

I deployed a 3-node cluster (Ubuntu 10.04, JDK 7u75):
hadoop1: ZK, JN, standby NN, DN, NM
hadoop2: ZK, JN, active NN, active RM, DN, NM
hadoop3: ZK, JN, standby RM, DN, NM
The RM HA settings this layout implies are sketched right below.
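As a point of reference, here is a minimal sketch of the RM HA configuration such a layout implies, written against the Java Configuration API rather than yarn-site.xml; the hostnames and cluster id are just this test cluster's values, and I am assuming the stock ZooKeeper-backed state store:

import org.apache.hadoop.conf.Configuration;

// Minimal sketch of RM HA settings for the cluster above:
// rm1 on hadoop2, rm2 on hadoop3, RM state recovered from ZooKeeper.
// In a real deployment these keys live in yarn-site.xml.
public class RmHaConfSketch {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
    conf.set("yarn.resourcemanager.cluster-id", "test-cluster");
    conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
    conf.set("yarn.resourcemanager.hostname.rm1", "hadoop2");
    conf.set("yarn.resourcemanager.hostname.rm2", "hadoop3");
    // Recover RM state after a failover, using the ZK-based store.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address",
        "hadoop1:2181,hadoop2:2181,hadoop3:2181");
    return conf;
  }
}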

Then I ran sudo ifconfig eth0 down on hadoop2 to simulate a network failure, effectively isolating hadoop2 from all the other nodes.

RM log on hadoop2:

2015-05-04 14:24:24,726 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26668ms for sessionid 0x14d1d76951c0001, closing socket connection and attempting reconnect
2015-05-04 14:24:24,830 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:888)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:885)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1004)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1023)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:885)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$500(ZKRMStateStore.java:82)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:976)
2015-05-04 14:24:24,831 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
2015-05-04 14:24:25,080 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop3/192.168.177.140:2181. Will not attempt to authenticate using SASL (unknown error)
2015-05-04 14:24:25,080 ERROR org.apache.zookeeper.ClientCnxnSocketNIO: Unable to open socket to hadoop3/192.168.177.140:2181
2015-05-04 14:24:25,081 WARN org.apache.zookeeper.ClientCnxn: Session 0x14d1d76951c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:465)
at sun.nio.ch.Net.connect(Net.java:457)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
at org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:277)
at org.apache.zookeeper.ClientCnxnSocketNIO.connect(ClientCnxnSocketNIO.java:287)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:967)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1003)

You can see the ZK client failing and repeatedly trying to reconnect.

At this point the NN failover completes normally. hadoop3 does become the active RM, but no NM registers with hadoop3, which is exactly the symptom described in the JIRA.
Why does NN failover still work? Because NN HA and RM HA use different mechanisms: DNs send heartbeats to both the active and the standby NN, whereas NMs only heartbeat to the active RM (see my other post on this).
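To make that difference concrete, here is a deliberately simplified, hypothetical sketch of the client-side pattern the NM depends on; the real logic lives in RetryInvocationHandler and ConfiguredRMFailoverProxyProvider (both visible in the NM log below), and the HeartbeatTarget interface is invented purely for illustration. The NM talks to exactly one RM and only moves to the other one after a call actually fails:

import java.io.IOException;
import java.util.List;

public class NmFailoverSketch {

  // Hypothetical stand-in for the ResourceTracker heartbeat call.
  interface HeartbeatTarget {
    void nodeHeartbeat() throws IOException;
  }

  private final List<HeartbeatTarget> rms; // proxies for rm1, rm2, ...
  private int current = 0;                 // the RM we currently believe is active

  NmFailoverSketch(List<HeartbeatTarget> rms) {
    this.rms = rms;
  }

  void heartbeatOnce() throws IOException {
    for (int attempt = 0; attempt < rms.size(); attempt++) {
      try {
        // If this call has no timeout and the active RM's host silently
        // vanishes (e.g. its NIC is downed), we block here and never reach
        // the failover branch below -- which is exactly this bug.
        rms.get(current).nodeHeartbeat();
        return;
      } catch (IOException e) {
        current = (current + 1) % rms.size(); // fail over to the other RM
      }
    }
    throw new IOException("no RM reachable");
  }
}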

The root cause is that the NM's heartbeat RPC has no timeout, so the NM gets stuck in that heartbeat call.

After roughly 15 minutes the NM finally times out and re-registers with hadoop3. Log:

2015-05-04 14:39:48,161 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm1. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: No route to host; Host Details : local host is: "hadoop1/192.168.177.139"; destination host is: "hadoop2":8031;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy27.nodeHeartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:512)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No route to host
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:512)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1055)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:950)
2015-05-04 14:39:48,203 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2015-05-04 14:39:48,211 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node is out of sync with ResourceManager, hence resyncing.
2015-05-04 14:39:48,211 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Node not found resyncing hadoop1:60618
2015-05-04 14:39:48,214 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Notifying ContainerManager to block new container-requests

During those 15 minutes you can still submit jobs to the RM, but none of them will actually run.

Why 15 minutes? Because every RPC request is backed by a basic connection-timeout mechanism in the org.apache.hadoop.ipc.Client class, controlled by two parameters:
ipc.client.connect.timeout, default 20 seconds
ipc.client.connect.max.retries.on.timeouts, default 45 retries
So the total comes to (45 - 1) × 20 = 880 seconds, roughly 15 minutes.
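As a quick sanity check on that arithmetic, here is a small snippet (assuming only the stock core-default.xml defaults) that reads the two keys and prints the worst-case window; shrinking them is the blunt way to make the NM give up sooner, at the cost of more eager failovers on a flaky network:

import org.apache.hadoop.conf.Configuration;

public class IpcConnectTimeoutMath {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Defaults from core-default.xml: 20000 ms per connect attempt and
    // 45 retries on socket timeouts.
    int connectTimeoutMs =
        conf.getInt("ipc.client.connect.timeout", 20000);
    int retriesOnTimeouts =
        conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45);

    // Worst case before the IPC client gives up on a dead address:
    // on the order of 15 minutes with the defaults.
    System.out.printf("~%d s before giving up%n",
        (long) retriesOnTimeouts * connectTimeoutMs / 1000);

    // Lowering the retry count (e.g. in core-site.xml) shortens the hang.
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
  }
}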

The patch on the JIRA is also very simple: switch to a getProxy variant that takes a timeout parameter.
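I have not copied the actual patch here; the NM's ResourceTracker proxy is created through YARN's own RPC wrappers and the details differ. As a rough illustration of the idea only, Hadoop's generic org.apache.hadoop.ipc.RPC class has getProxy overloads that accept an rpcTimeout in milliseconds, and switching to that kind of overload is what bounds each call. The PingProtocol interface below is purely hypothetical, just to make the sketch compile:

import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.VersionedProtocol;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

public class RpcTimeoutSketch {

  // Purely hypothetical protocol, only here so the sketch compiles.
  public interface PingProtocol extends VersionedProtocol {
    long versionID = 1L;
    String ping() throws IOException;
  }

  public static PingProtocol connect(Configuration conf) throws IOException {
    InetSocketAddress addr = new InetSocketAddress("hadoop2", 8031);

    // Overload without an rpcTimeout: a call can hang for as long as the
    // connect/retry settings allow if the peer silently disappears.
    PingProtocol noTimeout =
        RPC.getProxy(PingProtocol.class, PingProtocol.versionID, addr, conf);
    RPC.stopProxy(noTimeout);

    // Overload with an explicit rpcTimeout (milliseconds): each call is
    // bounded, so a dead RM host is noticed quickly and the failover proxy
    // can move on to the other RM.
    return RPC.getProxy(
        PingProtocol.class, PingProtocol.versionID, addr,
        UserGroupInformation.getCurrentUser(), conf,
        NetUtils.getDefaultSocketFactory(conf),
        15 * 1000);
  }
}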

Hadoop's RPC mechanism is still fairly involved.