Linux-HA Open-Source Software Heartbeat (Testing)

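How do you know whether an HA cluster is working properly? Testing in a simulated environment is a good approach. Before putting a Heartbeat high-availability cluster into production, run the following five tests to confirm that HA behaves as expected.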

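1. Gracefully stopping and restarting heartbeat on the primary node

First, run service heartbeat stop on the primary node node1 to shut down the Heartbeat process gracefully. Then check the primary node's network interfaces with ifconfig. Under normal conditions the primary has released the cluster service IP address and unmounted the shared disk partition. Next, check the standby node: it has taken over the cluster service IP and automatically mounted the shared disk partition. While this is happening, ping the cluster service IP. It stays reachable throughout, with no visible delay or stall. In other words, when the primary node is shut down gracefully, the switchover between primary and standby is seamless and the service provided by HA keeps running without interruption.

Next, start heartbeat on the primary node again. Once it is up, the standby automatically releases the cluster service IP and unmounts the shared disk partition, and the primary takes over the service IP and mounts the shared partition again. The standby releases the resources and the primary binds them in step with each other, so this switch back is seamless as well.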
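The "no delay or stall" claim is easier to judge with numbers than by watching ping output scroll past. The following is a minimal sketch, not part of the original test procedure: it assumes a Linux host with the iputils ping command, and the function names probe, watch, and longest_outage are made up for illustration. It samples the service IP at a fixed interval and reports the longest stretch during which no reply came back.

```python
import subprocess
import time
from typing import List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, reply received?)


def probe(ip: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request with iputils ping; True if a reply arrived."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(ip: str, duration_s: float = 60.0, interval_s: float = 0.2) -> List[Sample]:
    """Probe the address repeatedly for duration_s seconds and collect samples."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(ip)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest span from a first failed probe to the next successful one.

    If the series ends while the address is still unreachable, the outage is
    counted up to the last sample.
    """
    longest = 0.0
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None
    if down_since is not None and samples:
        longest = max(longest, samples[-1][0] - down_since)
    return longest
```

To use it, start something like longest_outage(watch("192.168.60.200", duration_s=60)) on a third machine, then run service heartbeat stop on node1 while it is sampling. The address 192.168.60.200 is the service IP that appears in this article's logs; substitute your own.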

2. Unplugging the network cable on the primary node

After the cable connecting the primary node to the public network is unplugged, the heartbeat plugin ipfail detects the loss of connectivity through its ping test, and the primary then releases its resources automatically. In the log below, roughly ten seconds pass between the link being reported dead and the node asking to go standby. At the same time, ipfail on the standby node also detects the network failure on the primary. Once the primary has finished releasing its resources, the standby takes over the cluster resources, so the network service keeps running. Likewise, when the primary's network connection is restored, the cluster resources automatically switch back from the standby to the primary, because the auto_failback on option is set.

The log on the primary node after the cable is unplugged looks like this:

Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link node2:eth0 dead.
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 dead.
Nov 26 09:04:09 node1 ipfail: [3712]: info: Status update: Node 192.168.60.1 now has status dead
Nov 26 09:04:09 node1 harc[4279]: info: Running /etc/ha.d/rc.d/status status
Nov 26 09:04:10 node1 ipfail: [3712]: info: NS: We are dead. :
Nov 26 09:04:10 node1 ipfail: [3712]: info: Link Status update: Link node2/eth0 now has status dead
(middle portion omitted)
Nov 26 09:04:20 node1 heartbeat: [3689]: info: node1 wants to go standby [all]
Nov 26 09:04:20 node1 heartbeat: [3689]: info: standby: node2 can take our all resources
Nov 26 09:04:20 node1 heartbeat: [4295]: info: give up all HA resources (standby).
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Releasing resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 stop
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Running stop for /dev/sdb5 on /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Trying to unmount /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: unmounted /webdata successfully
Nov 26 09:04:21 node1 Filesystem[4340]: INFO: Success
Nov 26 09:04:22 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 stop
Nov 26 09:04:22 node1 IPaddr[4428]: INFO: /sbin/ifconfig eth0:0 192.168.60.200 down
Nov 26 09:04:22 node1 avahi-daemon[1854]: Withdrawing address record for 192.168.60.200 on eth0.
Nov 26 09:04:22 node1 IPaddr[4407]: INFO: Success

The log on the standby node while it takes over the primary's resources looks like this:

Nov 26 09:02:58 node2 heartbeat: [2110]: info: Link node1:eth0 dead.
Nov 26 09:02:58 node2 ipfail: [2134]: info: Link Status update: Link node1/eth0 now has status dead
Nov 26 09:02:59 node2 ipfail: [2134]: info: Asking other side for ping node count.
Nov 26 09:02:59 node2 ipfail: [2134]: info: Checking remote count of ping nodes.
Nov 26 09:03:02 node2 ipfail: [2134]: info: Telling other node that we have more visible ping nodes.
Nov 26 09:03:09 node2 heartbeat: [2110]: info: node1 wants to go standby [all]
Nov 26 09:03:10 node2 heartbeat: [2110]: info: standby: acquire [all] resources from node1
Nov 26 09:03:10 node2 heartbeat: [2281]: info: acquire all HA resources (standby).
Nov 26 09:03:10 node2 ResourceManager[2291]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:03:10 node2 IPaddr[2315]: INFO: Resource is stopped
Nov 26 09:03:11 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
Nov 26 09:03:11 node2 IPaddr[2393]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
Nov 26 09:03:12 node2 avahi-daemon[1844]: Registering new address record for 192.168.60.200 on eth0.
Nov 26 09:03:12 node2 IPaddr[2393]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
Nov 26 09:03:12 node2 IPaddr[2372]: INFO: Success
Nov 26 09:03:12 node2 Filesystem[2482]: INFO: Resource is stopped
Nov 26 09:03:12 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
Nov 26 09:03:13 node2 Filesystem[2523]: INFO: Running start for /dev/sdb5 on /webdata
Nov 26 09:03:13 node2 kernel: kjournald starting. Commit interval 5 seconds
Nov 26 09:03:13 node2 kernel: EXT3 FS on sdb5, internal journal
Nov 26 09:03:13 node2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Nov 26 09:03:13 node2 Filesystem[2520]: INFO: Success
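Logs like the two above get long, and the part that matters for a failover test is which resource scripts ran, on which node, and in what order. As a rough sketch (the ResourceAction type and the resource_actions function are illustrative names, and the pattern is written only against the line format shown in this article, so other Heartbeat versions or syslog layouts may need adjustments), the ResourceManager entries can be pulled out like this:

```python
import re
from typing import Iterable, List, NamedTuple


class ResourceAction(NamedTuple):
    timestamp: str  # e.g. "Nov 26 09:04:21"
    node: str       # host that logged the line
    script: str     # resource script under /etc/ha.d/resource.d/
    args: str       # arguments passed to the script
    action: str     # "start" or "stop"


# Matches only ResourceManager "Running ... start|stop" lines as they appear
# in the logs quoted in this article.
_RUNNING = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<node>\S+) "
    r"ResourceManager\[\d+\]: info: Running /etc/ha\.d/resource\.d/"
    r"(?P<script>\S+) (?P<args>.*) (?P<action>start|stop)$"
)


def resource_actions(lines: Iterable[str]) -> List[ResourceAction]:
    """Return the start/stop actions ResourceManager ran, in log order."""
    actions: List[ResourceAction] = []
    for line in lines:
        match = _RUNNING.match(line.strip())
        if match:
            actions.append(
                ResourceAction(
                    match["ts"],
                    match["node"],
                    match["script"],
                    match["args"],
                    match["action"],
                )
            )
    return actions
```

Run over the logs above, this makes the ordering visible: node1 stops Filesystem before IPaddr, while node2 starts IPaddr before Filesystem.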

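3. Unplugging the power cable on the primary node

After power is cut on the primary node, the heartbeat process on the standby node immediately receives a message that the primary has shut down. If a STONITH device is configured for the cluster, the standby uses it to power off or reset the primary. Only after the STONITH device has finished all of its operations does the standby obtain ownership of the primary's resources and take them over.

After the primary loses power, the standby logs output similar to the following:

Nov 26 09:24:54 node2 heartbeat: [2110]: info: Received shutdown notice from 'node1'.
Nov 26 09:24:54 node2 heartbeat: [2110]: info: Resources being acquired from node1.
Nov 26 09:24:54 node2 heartbeat: [2712]: info: acquire local HA resources (standby).
Nov 26 09:24:55 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:24:57 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start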

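4. Cutting all network connections on the primary node

After the heartbeat link on the primary node is disconnected, both nodes log an eth1 dead message, but no resource switchover takes place. If the cable connecting the primary to the public network is then unplugged as well, a switchover does occur and the resources move from the primary to the standby. Now reconnect the primary's heartbeat link and watch the system log: the heartbeat process on the standby restarts and then takes control of the cluster resources again. Finally, reconnect the primary's public network cable, and the cluster resources move back from the standby to the primary. That is the complete switchover sequence.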
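This test has several steps, and at each one it helps to confirm which node actually holds the service IP. The sketch below is one hedged way to do that, assuming you have captured the text output of ifconfig (or ip addr) from each node by whatever means you prefer. The function names has_service_ip and holders are hypothetical helpers, not part of Heartbeat.

```python
import re
from typing import Dict, List


def has_service_ip(interface_output: str, ip: str) -> bool:
    """True if ip appears as a whole address in ifconfig / ip addr output.

    The lookarounds keep 192.168.60.20 from matching inside 192.168.60.200,
    and keep the address from matching inside a longer dotted number.
    """
    pattern = r"(?<![\d.])" + re.escape(ip) + r"(?!\d|\.\d)"
    return re.search(pattern, interface_output) is not None


def holders(outputs_by_node: Dict[str, str], ip: str) -> List[str]:
    """Names of all nodes whose interface output shows the service IP."""
    return sorted(
        node for node, text in outputs_by_node.items() if has_service_ip(text, ip)
    )
```

Exactly one name in the result is the healthy case. An empty list means nobody is serving the address, and two names mean both nodes have bound it at once, which is the contention scenario that test 5 produces on purpose.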

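5. Abnormally terminating the heartbeat daemon on the primary node

Kill the heartbeat process on the primary node with killall -9 heartbeat. Because the process is killed rather than shut down cleanly, the resources it controls are not released. After a short period without any response from the primary, the standby concludes that the primary has failed and takes over its resources. The result is resource contention: both nodes hold the same resources, which causes data conflicts.

The kernel monitoring module watchdog that Linux provides can address this. With watchdog integrated into heartbeat, if heartbeat terminates abnormally or the system itself fails, watchdog reboots the system automatically, which releases the cluster resources and prevents the data conflict. In this walkthrough watchdog was not configured for the cluster. If it had been, running killall -9 heartbeat would produce the following message in /var/log/messages:

Softdog: WDT device closed unexpectedly. WDT will not stop!

This message indicates that the system has detected a problem and is about to reboot.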
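Before running this test it is worth checking whether watchdog is actually enabled, since the outcome differs so much with and without it. Heartbeat's ha.cf accepts a watchdog directive that names the device, commonly /dev/watchdog. The sketch below is a small illustrative check rather than anything shipped with Heartbeat: the function name watchdog_device is made up, and it assumes the file contents have already been read into a string (the usual location is /etc/ha.d/ha.cf, matching the /etc/ha.d/ paths in the logs above).

```python
from typing import Optional


def watchdog_device(ha_cf_text: str) -> Optional[str]:
    """Return the device named by an active watchdog directive, else None.

    Text after a '#' is treated as a comment, so a commented-out directive
    does not count as enabled.
    """
    for raw_line in ha_cf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "watchdog":
            return parts[1]
    return None
```

A result of None corresponds to the situation described in this article, where killing heartbeat leaves both nodes holding the resources. A device path means the node should reboot itself instead, provided the watchdog kernel module (softdog, in the message quoted above) is loaded.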
