HA failover and ZooKeeper timeouts // foolbear的冥想盆

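A question occurred to me recently: in an HA cluster, how long after the Active goes down does failover actually happen?
Put differently, how long does it take to notice that the Active is dead?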

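For HDFS HA, a separate process, zkfc, monitors whether the NN process is alive. The two processes communicate over RPC (HAServiceProtocol). zkfc periodically calls HAServiceProtocol.monitorHealth(). If the call throws HealthCheckFailedException, the monitor enters SERVICE_UNHEALTHY, meaning the NN is alive but has a problem; if it throws any other exception, the monitor enters SERVICE_NOT_RESPONDING, meaning the RPC service is unreachable and the NN process has most likely died. See the o.a.h.ha.HealthMonitor class for the logic.
Every state change in HealthMonitor fires a callback (the HealthCallbacks.enteredState method). If the callback sees that the current state is not SERVICE_HEALTHY, it triggers failover immediately.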
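
The classification described above can be sketched in plain Java. This is a simplified model and not Hadoop's actual HealthMonitor: the class HealthMonitorSketch, the HealthCheck interface, the nested HealthCheckFailedException stand-in, and the shouldFailover helper are all hypothetical names introduced for illustration.

```java
import java.io.IOException;

/** Simplified, hypothetical model of the health check; not Hadoop's HealthMonitor. */
final class HealthMonitorSketch {

    enum State { SERVICE_HEALTHY, SERVICE_UNHEALTHY, SERVICE_NOT_RESPONDING }

    /** Stand-in for Hadoop's HealthCheckFailedException (assumption: a checked exception). */
    static class HealthCheckFailedException extends Exception {
        HealthCheckFailedException(String message) { super(message); }
    }

    /** Stand-in for the monitorHealth() RPC call. */
    interface HealthCheck {
        void monitorHealth() throws Exception;
    }

    /** Maps the outcome of one health check to a state, as the text describes. */
    static State classify(HealthCheck check) {
        try {
            check.monitorHealth();
            return State.SERVICE_HEALTHY;
        } catch (HealthCheckFailedException e) {
            // The service answered, but reported a problem.
            return State.SERVICE_UNHEALTHY;
        } catch (Exception e) {
            // Any other failure: the RPC service could not be reached.
            return State.SERVICE_NOT_RESPONDING;
        }
    }

    /** The callback reacts to every state other than SERVICE_HEALTHY. */
    static boolean shouldFailover(State state) {
        return state != State.SERVICE_HEALTHY;
    }

    public static void main(String[] args) {
        HealthCheck[] checks = {
            () -> { },
            () -> { throw new HealthCheckFailedException("out of resources"); },
            () -> { throw new IOException("connection refused"); },
        };
        for (HealthCheck check : checks) {
            State state = classify(check);
            System.out.println(state + " -> failover? " + shouldFailover(state));
        }
    }
}
```

The order of the catch clauses matters: the specific exception has to be tested before the generic one, otherwise an unhealthy but reachable service would be reported as not responding.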

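HealthMonitor's check interval is ha.health-monitor.check-interval.ms, 1 second by default; the RPC timeout is ha.health-monitor.rpc-timeout.ms, 45 seconds by default; and the RPC retry count is set to 1 (see the HAServiceTarget.getProxy method).
So after the Active NN dies, if the health-check RPC has to wait for its timeout, detection takes at least 45 seconds and at most 45 + 1 = 46 seconds.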
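
The arithmetic behind these two bounds, as a small worked sketch. It assumes the failure is only noticed once a health-check RPC runs into its timeout, and that the failure can fall anywhere within the pause between two checks. DetectionWindowSketch and detectionDelayMs are hypothetical names; the two constants are the defaults quoted above.

```java
/** Worked arithmetic for the detection window; a sketch, not Hadoop code. */
final class DetectionWindowSketch {

    static final long CHECK_INTERVAL_MS = 1_000;  // ha.health-monitor.check-interval.ms
    static final long RPC_TIMEOUT_MS = 45_000;    // ha.health-monitor.rpc-timeout.ms

    /**
     * @param waitBeforeNextCheckMs time between the failure and the start of the
     *                              next health check, between 0 and the check interval
     * @param rpcTimeoutMs          how long that health check blocks before giving up
     */
    static long detectionDelayMs(long waitBeforeNextCheckMs, long rpcTimeoutMs) {
        return waitBeforeNextCheckMs + rpcTimeoutMs;
    }

    public static void main(String[] args) {
        // Best case: the next check starts right as the NN fails.
        long best = detectionDelayMs(0, RPC_TIMEOUT_MS);
        // Worst case: a full check interval passes before the next check starts.
        long worst = detectionDelayMs(CHECK_INTERVAL_MS, RPC_TIMEOUT_MS);
        System.out.println("best case:  " + best + " ms");
        System.out.println("worst case: " + worst + " ms");
    }
}
```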

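But what if the whole machine goes down? Then zkfc, which runs on the same host, gets no chance to start a new election, and failover can only wait for ZooKeeper's own timeout mechanism (though I have not verified this in the code).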

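The HDFS HA mechanism is actually fairly complex. Most other HA implementations simply rely on ZooKeeper's ephemeral nodes: when a client dies, its node is deleted automatically, the other participants learn of the change through a Watcher, and they start a new election. RM HA in 2.5.2 works this way.
It is a comparatively simple way to implement HA.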

How long after a client dies is its ephemeral node deleted?

The ZooKeeper Programmer’s Guide actually spells this out quite clearly.

Ticks
When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.

The Getting Started Guide mentions it as well:

tickTime
the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats and the minimum session timeout will be twice the tickTime.

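In plain terms, tickTime is the heartbeat unit and the minimum session timeout is twice the tickTime. If tickTime is 2 seconds and the session uses that minimum, ZooKeeper considers a client dead once it has gone more than 4 seconds without a heartbeat from it, and deletes the ephemeral nodes that client created.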

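A client can also request its own session timeout, but within limits: at least 2 ticks (minSessionTimeout) and at most 20 ticks (maxSessionTimeout). The minimum and maximum can themselves be overridden in zoo.cfg; see the Administrator’s Guide.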
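
The clamping rule can be written down in a few lines. This is a sketch of the rule as stated in the guides, not ZooKeeper's own code; SessionTimeoutSketch and negotiate are hypothetical names, and all values are in milliseconds.

```java
/** Sketch of how a requested session timeout is clamped; not ZooKeeper's own code. */
final class SessionTimeoutSketch {

    /** Default bounds: at least 2 ticks, at most 20 ticks. */
    static int negotiate(int requestedMs, int tickTimeMs) {
        return negotiate(requestedMs, 2 * tickTimeMs, 20 * tickTimeMs);
    }

    /** Explicit bounds, as when minSessionTimeout and maxSessionTimeout are set in zoo.cfg. */
    static int negotiate(int requestedMs, int minSessionTimeoutMs, int maxSessionTimeoutMs) {
        if (minSessionTimeoutMs > maxSessionTimeoutMs) {
            throw new IllegalArgumentException("minSessionTimeout exceeds maxSessionTimeout");
        }
        return Math.max(minSessionTimeoutMs, Math.min(maxSessionTimeoutMs, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2_000;
        // Too small: raised to the 2-tick minimum.
        System.out.println(negotiate(1_000, tickTimeMs));
        // Within bounds: granted as requested.
        System.out.println(negotiate(10_000, tickTimeMs));
        // Too large: lowered to the 20-tick maximum.
        System.out.println(negotiate(100_000, tickTimeMs));
        // A very large tickTime (10 minutes) raises the floor for every client.
        System.out.println(negotiate(10_000, 600_000));
    }
}
```

The last call shows why a large tickTime is risky: with a 10-minute tick, a client asking for a 10-second session still ends up with a 20-minute timeout.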

// Constructor of the ZooKeeper client; the caller can pass in its own session timeout
public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher)
        throws IOException {
    this(connectString, sessionTimeout, watcher, false);
}

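If HA is built on this zk mechanism with the session timeout at its 2-tick minimum, a dead client is detected after a little more than 1 tick at best and after 2 ticks at worst, depending on when exactly the client died.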

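This explains an RM HA bug I ran into recently.
While testing RM HA in 2.5.2, I found that stopping the Active RM with yarn-daemon.sh triggers failover right away, but killing the RM process directly with kill -9 means failover only happens after a very long time.
The reason is that the RM registers a ShutdownHook at startup: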

ResourceManager.java

public static void main(String argv[]) {
  Thread.setDefaultUncaughtExceptionHandler(new YarnUncaughtExceptionHandler());
  StringUtils.startupShutdownMessage(ResourceManager.class, argv, LOG);
  try {
    Configuration conf = new YarnConfiguration();
    ResourceManager resourceManager = new ResourceManager();
    // A plain kill triggers this hook (graceful stop); yarn-daemon.sh is essentially a kill too
    // kill -9, however, cannot trigger it
    ShutdownHookManager.get().addShutdownHook(
      new CompositeServiceShutdownHook(resourceManager),
      SHUTDOWN_HOOK_PRIORITY);
    resourceManager.init(conf);
    resourceManager.start();
  } catch (Throwable t) {
    LOG.fatal("Error starting ResourceManager", t);
    System.exit(-1);
  }
}
// Once the hook fires, serviceStop is executed
protected void serviceStop() throws Exception {
  if (webApp != null) {
    webApp.stop();
  }
  if (fetcher != null) {
    fetcher.stop();
  }
  if (configurationProvider != null) {
    configurationProvider.close();
  }
  super.serviceStop();
  // On a normal kill, the state in zk is changed first, which triggers failover immediately
  // kill -9 never reaches this code, so failover has to wait for the zk timeout
  transitionToStandby(false);
  rmContext.setHAServiceState(HAServiceState.STOPPING);
}
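
The difference between kill and kill -9 can be reproduced with a plain JVM shutdown hook. The sketch below uses Runtime.addShutdownHook directly rather than Hadoop's ShutdownHookManager; ShutdownHookSketch and registerHook are hypothetical names, and the printed message merely stands in for the transitionToStandby call.

```java
/** Minimal shutdown-hook demo; a sketch using the plain JVM API, not Hadoop's ShutdownHookManager. */
final class ShutdownHookSketch {

    /** Registers the cleanup as a JVM shutdown hook and returns the hook thread. */
    static Thread registerHook(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional first argument: how long to stay alive, in milliseconds (default 0).
        long waitMs = args.length > 0 ? Long.parseLong(args[0]) : 0L;

        registerHook(() -> System.out.println("hook ran: giving up the active role before exit"));

        System.out.println("pid " + ProcessHandle.current().pid() + " running for " + waitMs + " ms");
        Thread.sleep(waitMs);
    }
}
```

Started with a long wait, for example `java ShutdownHookSketch.java 600000`, the process prints the hook message when it receives a plain `kill <pid>` (SIGTERM). After `kill -9 <pid>` (SIGKILL) the JVM is terminated without running any hooks, so the message never appears, which mirrors why serviceStop is skipped.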

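After kill -9, the Standby RM is still alive, but judging by the state in zk it keeps believing it is the Standby and cannot serve requests. Only after roughly 18 minutes did it become Active.
The cause is that in our zk configuration tickTime is 10 minutes (because the network used to be poor), so it can take up to 20 minutes before the session times out and failover takes effect.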

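The problem is not limited to the RM. During those 18 minutes, every NM kept failing over (the client-side failover mechanism), trying to find a live RM to send heartbeats to. This kept failing, and in the end all NM processes killed themselves. The NM's self-termination mechanism is covered in another post.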

So ZooKeeper parameters should be set with some care.