Red Hat Cluster Suite: Configuring and Managing a Cluster
Appendix E. Supplementary Software Information
Common Causes: Serial power switch disconnected from the controlling member; network power switch disconnected from the network.
Expected Behavior: The cluster is unable to shut down or restart members controlled by the power switch. In this case, if a member hangs, services do not fail over from any member controlled by the switch in question.
Verification: Run clustat to verify that services are still marked as running on the member, even though it is inactive according to membership.
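The clustat check above can be scripted. This is a minimal sketch; the sample output below is illustrative only, as the exact clustat column layout may differ between releases, and the service names are hypothetical.

```shell
# Stand-in for real clustat output; real formatting may vary by release.
sample_clustat() {
cat <<'EOF'
Service        Status   Owner
-------        ------   -----
nfs_svc        running  member1
db_svc         running  member2
EOF
}

# Print the status column for one named service.
service_status() {
  sample_clustat | awk -v svc="$1" '$1 == svc { print $2 }'
}

service_status nfs_svc   # prints: running
```

In practice, the `sample_clustat` function would be replaced by a real `clustat` invocation on an active member.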
Common Causes: A majority of cluster members (for example, 3 of 5 members) go offline.
Test Case: In a 3 member cluster, stop the cluster software on two members.
Expected Behavior: All members which do not have controlling power switches reboot immediately. All services stop immediately and their states are not updated on the shared media (when running clustat, the service status blocks may still display that the service is running). Service managers exit. Cluster locks are lost and become unavailable.
Verification: Run clustat on one of the remaining active members.
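The arithmetic behind this test case can be sketched as follows, assuming a simple-majority quorum rule (more than half of the members must be active); the exact vote calculation in a given release may differ.

```shell
# Assumed rule: quorum requires at least floor(total/2) + 1 active members.
has_quorum() {
  total=$1; active=$2
  [ "$active" -ge $(( total / 2 + 1 )) ]
}

# Stopping 2 of 3 members leaves 1 active: quorum is lost.
has_quorum 3 1 && echo "quorum" || echo "no quorum"   # prints: no quorum
has_quorum 3 2 && echo "quorum" || echo "no quorum"   # prints: quorum
```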
Common Causes: Total loss of connectivity to other members.
Test Case: Disconnect all network cables from a cluster member.
Expected Behavior: If the member has no controlling power switches, it reboots immediately. Otherwise, it attempts to stop services as quickly as possible. If a quorum exists, the set of members comprising the cluster quorum will fence the member.
Test Case: killall -KILL clumembd
Expected Behavior: System reboot.
Test Case: killall -STOP clumembd
Expected Behavior: System reboot may occur if clumembd hangs for a time period greater than (failover_time - 1) seconds. Triggered externally by watchdog timer.
Test Case: killall -STOP clumembd
Expected Behavior: System reboot may occur if clumembd hangs for a time period greater than (failover_time) seconds. Triggered internally by clumembd.
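The two STOP test cases above work because a SIGSTOP-ed process is frozen and cannot service its watchdog or internal timers. The signal mechanics can be rehearsed safely on a throwaway process instead of clumembd:

```shell
# Demonstrate what `killall -STOP` does to a process: it enters the
# stopped state (T) until a CONT signal is delivered.
sleep 60 &
pid=$!

kill -STOP "$pid"
ps -o stat= -p "$pid"    # state string starts with T (stopped)

kill -CONT "$pid"
ps -o stat= -p "$pid"    # state string starts with S (sleeping) again

kill "$pid"              # clean up the helper process
```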
Test Case: killall -KILL cluquorumd
Expected Behavior: System reboot.
Test Case: killall -KILL clusvcmgrd
Expected Behavior: cluquorumd re-spawns clusvcmgrd, which runs the stop phase of all services. Services that were stopped are then started.
Verification: Consult system logs for a warning message from cluquorumd.
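The log check can be sketched as a grep over the system log. The sample line below is a stand-in; the exact wording and format of the cluquorumd warning may differ.

```shell
# Stand-in log content; the real message text may differ.
sample_log() {
cat <<'EOF'
Jan 10 12:00:01 member1 cluquorumd[812]: <warning> restarting clusvcmgrd
EOF
}

# In practice: grep -i cluquorumd /var/log/messages
sample_log | grep -c 'cluquorumd'   # prints 1 when the warning is present
```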
Test Case: killall -KILL clulockd
Expected Behavior: cluquorumd re-spawns clulockd. Locks may be unavailable (preventing service transitions) for a short period of time.
Verification: Consult system logs for a warning message from cluquorumd.
Common Causes: Any noted scenario which causes a system reboot.
Test Case: Run reboot -fn, or press the reset switch.
Expected Behavior: If a power switch controls the rebooting member in question, the system will also be fenced (generally, power-cycled) if a cluster quorum exists.
Test Case: Stop cluster services (service clumanager stop) on all members.
Expected Behavior: Any remaining services are stopped uncleanly.
Verification: Consult the system logs for warning messages.
Expected Behavior: Services on member which was fenced are started elsewhere in the cluster, if possible.
Verification: Verify that services are, in fact, started after the member is fenced. This should only take a few seconds.
Common Causes: Power switch returned error status or is not reachable.
Test Case: Disconnect power switch controlling a member and run reboot -fn on the member.
Expected Behavior: Services on a member which fails to be fenced are not started elsewhere in the cluster. If the member recovers, services on the cluster are restarted. Because there is no way to accurately determine the member's state, it is assumed to still be running even though heartbeats have stopped. Thus, all services should be reported as running on the down member.
Verification: Run clustat to verify that services are still marked as running on the member, even though it is inactive according to membership. Messages will be logged stating that the member is now in the PANIC state.
Test Case: Run dd to write zeros to the shared partition, then read it back:
dd if=/dev/zero of=/dev/raw/raw1 bs=512 count=1
shutil -p /cluster/header
Expected Behavior: The event is logged. The data from the good shared partition is copied to the partition which returned errors.
Verification: A second read pass of the same data should not produce a second error message.
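Before running this test against a real raw device, the dd invocation can be rehearsed against a scratch file, which writes the same zeroed 512-byte block without touching shared storage:

```shell
# Safe rehearsal of the corruption test on a temporary file instead of
# the real /dev/raw/raw1 device.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=512 count=1 2>/dev/null

# The written block is exactly 512 bytes and contains only zero bytes.
wc -c < "$scratch"                       # prints 512
tr -d '\0' < "$scratch" | wc -c          # prints 0

rm -f "$scratch"
```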
Common Causes: Shared media is either unreachable or both partitions have corruption.
Test Case: Unplug SCSI or Fibre Channel cable from a member.
Expected Behavior: The event is logged. The configured action is taken to address the loss of access to shared storage (reboot/halt/stop/ignore). The default action is to reboot.