A DNS failure kills the RM and NM processes

Around 4 a.m. one day, a DNS failure killed the RM process on our test cluster.
This is a known bug: https://issues.apache.org/jira/browse/YARN-713

We later found that all the NM processes had died as well, so it was worth digging into why.

The exception log:

2015-02-10 04:21:33,454 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Status Updater,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From inspur129.photo.163.org/10.160.128.197 to inspur116.photo.163.org:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:490)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Call From inspur129.photo.163.org/10.160.128.197 to inspur116.photo.163.org:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor68.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy23.nodeHeartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at $Proxy24.nodeHeartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:441)
... 1 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
at org.apache.hadoop.ipc.Client.call(Client.java:1318)
... 12 more
// some log output omitted
2015-02-10 04:21:35,599 INFO org.apache.hadoop.ipc.Server: Stopping server on 58350
2015-02-10 04:21:35,599 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-02-10 04:21:35,599 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2015-02-10 04:21:35,599 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 58350
2015-02-10 04:21:35,600 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2015-02-10 04:21:35,631 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2015-02-10 04:21:35,632 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-02-10 04:21:35,632 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2015-02-10 04:21:35,632 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2015-02-10 04:21:42,605 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2015-02-10 04:21:42,606 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: ganglia thread interrupted.
2015-02-10 04:21:42,606 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2015-02-10 04:21:42,606 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2015-02-10 04:21:42,608 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at inspur129.photo.163.org/10.160.128.197
************************************************************/

After digging through the code: if the NodeManager cannot connect to the RM, the process does indeed kill itself, only after a delay of roughly 20 minutes.

NodeStatusUpdaterImpl.java
protected void startStatusUpdater() {
  // This thread periodically sends heartbeats to the RM
  statusUpdaterRunnable = new Runnable() {
    @Override
    @SuppressWarnings("unchecked")
    public void run() {
      int lastHeartBeatID = 0;
      while (!isStopped) {
        // Send heartbeat
        try {
          NodeHeartbeatResponse response = null;
          NodeStatus nodeStatus = getNodeStatusAndUpdateContainersInContext();
          nodeStatus.setResponseId(lastHeartBeatID);
          NodeHeartbeatRequest request = recordFactory
              .newRecordInstance(NodeHeartbeatRequest.class);
          request.setNodeStatus(nodeStatus);
          request
              .setLastKnownContainerTokenMasterKey(NodeStatusUpdaterImpl.this.context
                  .getContainerTokenSecretManager().getCurrentKey());
          request
              .setLastKnownNMTokenMasterKey(NodeStatusUpdaterImpl.this.context
                  .getNMTokenSecretManager().getCurrentKey());
          response = resourceTracker.nodeHeartbeat(request);
          // get next heartbeat interval from response
          nextHeartBeatInterval = response.getNextHeartBeatInterval();
          updateMasterKeys(response);
          // some unrelated code omitted
        } catch (ConnectException e) {
          // Connection failures to the RM go through a retry mechanism; only after the
          // retries are exhausted (30 by default) does the ConnectException land here.
          // A SHUTDOWN event is then dispatched and handled asynchronously by the NM.
          //catch and throw the exception if tried MAX wait time to connect RM
          dispatcher.getEventHandler().handle(
              new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
          throw new YarnRuntimeException(e);
        } catch (Throwable e) {
          // TODO Better error handling. Thread can die with the rest of the
          // NM still running.
          LOG.error("Caught exception in status-updater", e);
        } finally {
          synchronized (heartbeatMonitor) {
            nextHeartBeatInterval = nextHeartBeatInterval <= 0 ?
                YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS :
                nextHeartBeatInterval;
            try {
              heartbeatMonitor.wait(nextHeartBeatInterval);
            } catch (InterruptedException e) {
              // Do Nothing
            }
          }
        }
      }
    }
  };
  statusUpdater =
      new Thread(statusUpdaterRunnable, "Node Status Updater");
  statusUpdater.start();
}
NodeManager.java
// How the NM side handles the event
public void handle(NodeManagerEvent event) {
  switch (event.getType()) {
  case SHUTDOWN:
    shutDown();
    break;
  case RESYNC:
    resyncWithRM();
    break;
  default:
    LOG.warn("Invalid shutdown event " + event.getType() + ". Ignoring.");
  }
}
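
For reference, the event type consumed by this handler is a plain two-value enum; the sketch below is only what the switch above implies (the real enum lives alongside NodeManager in the nodemanager package):

// Sketch of the event type implied by the switch above.
public enum NodeManagerEventType {
    SHUTDOWN,
    RESYNC
}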

For the retry mechanism against the RM, see the RMProxy class:

RMProxy.java
public static RetryPolicy createRetryPolicy(Configuration conf) {
  // Maximum total wait time, 900 seconds by default
  long rmConnectWaitMS =
      conf.getInt(
          YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS,
          YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS);
  // Interval between attempts, 30 seconds by default
  long rmConnectionRetryIntervalMS =
      conf.getLong(
          YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS,
          YarnConfiguration
              .DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS);
  // some code omitted
  // This boils down to a fixed number of attempts: 900 / 30 = 30 by default,
  // with 30 seconds between attempts
  RetryPolicy retryPolicy =
      RetryPolicies.retryUpToMaximumTimeWithFixedSleep(rmConnectWaitMS,
          rmConnectionRetryIntervalMS,
          TimeUnit.MILLISECONDS);
  // omitted
}
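
The size of this window is controlled by the two keys read above. If the default 15 minutes is too short to ride out an RM outage, they can be raised; below is a minimal sketch, assuming these values would normally be set in yarn-site.xml rather than in code, and the one-hour figure and class name are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: widen the NM-to-RM retry window so the NM survives a longer RM outage.
// In a real cluster these values would normally live in yarn-site.xml, not in code.
public class RmRetryTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new YarnConfiguration();
        // keep retrying for 1 hour in total instead of the default 900 seconds
        conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 60L * 60 * 1000);
        // keep the default 30 seconds between attempts
        conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS, 30L * 1000);
        System.out.println("RM connect max wait ms = "
            + conf.getLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1));
    }
}

Raising the window only postpones the suicide; if the RM never comes back, the NM still goes down eventually.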

But each of those attempts takes some time of its own. Every RPC call in Hadoop carries a lower-level retry mechanism as well; see the org.apache.hadoop.ipc.Client class:

Client.java
static ConnectionId getConnectionId(InetSocketAddress addr,
    Class<?> protocol, UserGroupInformation ticket, int rpcTimeout,
    RetryPolicy connectionRetryPolicy, Configuration conf) throws IOException {
  // 10 retries by default, 1 second apart
  if (connectionRetryPolicy == null) {
    final int max = conf.getInt(
        CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
        CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_DEFAULT);
    connectionRetryPolicy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        max, 1, TimeUnit.SECONDS);
  }
  // omitted
}
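
So before the outer RMProxy policy counts even one failure, the IPC client has already spent roughly nine to ten seconds on its own connection attempts. A minimal sketch of that inner policy in isolation, using only the defaults shown above (the class name is mine, for illustration):

import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Illustration only: the connection-level policy the IPC client builds by default.
// CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY is the knob that
// controls how many connect attempts each RPC call makes before it gives up.
public class IpcConnectRetrySketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int max = conf.getInt(
            CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
            CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_DEFAULT); // 10
        RetryPolicy policy =
            RetryPolicies.retryUpToMaximumCountWithFixedSleep(max, 1, TimeUnit.SECONDS);
        // With 10 attempts spaced 1 second apart, one failed heartbeat burns roughly 9-10 s
        System.out.println("connect retries = " + max + ", policy = " + policy);
    }
}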

During the retry phase the log fills with entries like these:

2015-02-12 10:34:15,959 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:16,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:17,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:18,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:19,961 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:20,961 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2015-02-12 10:34:21,961 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: inspur116.photo.163.org/10.160.128.191:8031. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

From the analysis above: once the RM dies, the NM notices on its next heartbeat and starts retrying (the heartbeat interval is not fixed; the RM dictates it in each response).
By default there are 30 attempts, 30 seconds apart, and each attempt itself costs about 9 seconds, so the total comes to roughly 29*30 + 30*9 = 1140 seconds (this arithmetic may not be exact; I have not measured it). After that the NM gives up retrying and kills itself.
In other words, the NM also shuts itself down about 20 minutes after the RM goes away.
We had never noticed this before because the RM was always restarted promptly whenever it died.
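
The same estimate written out as code, purely as an illustration of the arithmetic; the 9-second per-attempt figure is an assumption derived from the inner IPC policy, not a measurement, and the class is hypothetical:

import java.util.concurrent.TimeUnit;

// Back-of-the-envelope only, not YARN code. The ~9 s per-attempt cost is an assumption
// taken from the 10-retries-at-1-second IPC policy above, not a measurement.
public class NmShutdownDelayEstimate {
    public static void main(String[] args) {
        long maxWaitMs = 900000L;      // DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS
        long retryIntervalMs = 30000L; // DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS
        long perAttemptMs = 9000L;     // assumed time one failed heartbeat spends in IPC retries

        long attempts = maxWaitMs / retryIntervalMs; // 30 attempts
        // 29 sleeps between 30 attempts, plus the time each attempt itself burns
        long totalMs = (attempts - 1) * retryIntervalMs + attempts * perAttemptMs;
        System.out.printf("attempts=%d, estimated survival=%d s (~%d min)%n",
            attempts,
            TimeUnit.MILLISECONDS.toSeconds(totalMs),
            TimeUnit.MILLISECONDS.toMinutes(totalMs));
        // prints: attempts=30, estimated survival=1140 s (~19 min)
    }
}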

P.S. I recently came across a line in the HBase documentation:

RegionServer suicides are ‘normal’, as this is what they do when something goes wrong.

I think Hadoop is just the same... Since we upgraded to 2.2.0 we have seen the NN, RM, and NM all commit suicide... only the DN is left...