How to Find Why CRS/CSSD Rebooted a Server in Oracle RAC – Complete Checklist
Oracle RAC node eviction is one of the most critical issues in clustered database environments. When a node is evicted, Oracle Clusterware (CRS/CSSD) removes the node from the cluster to prevent split-brain conditions and ensure data consistency. In many cases, this eviction results in an automatic server reboot.
This complete Oracle RAC node eviction troubleshooting guide explains how to identify whether CRS rebooted the server, how to locate the exact cause, and how to fix it using a step-by-step checklist. This guide includes the best commands, logs, and root cause analysis techniques used by RAC administrators.
What is Oracle RAC Node Eviction?
Oracle RAC node eviction occurs when Oracle Clusterware detects a node is no longer stable or reliable for cluster participation. The node is removed forcibly, usually due to:
- Interconnect heartbeat loss
- Voting disk inaccessibility
- Storage latency or multipath failure
- OS CPU hang or kernel lockup
- Time drift between nodes (NTP/Chrony failure)
- Hardware issues (NIC, memory, CPU)
Eviction is triggered mainly by CSSD (Cluster Synchronization Services Daemon).
Does CRS Really Reboot the Server?
This is a common question. CRS does not reboot the server arbitrarily.
Instead, Oracle Clusterware can:
- Evict the node
- Trigger OS watchdog/hangcheck
- Stop CRS stack
- Force restart of cluster services
- Initiate node reboot in extreme conditions
The OS reboot can happen because:
- watchdog mechanism triggers reboot
- hangcheck timer triggers reboot
- node becomes unresponsive and system resets
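On 11gR2 and later, the reboot is normally performed by the CSSD agent processes (cssdagent and cssdmonitor, which replaced the older oprocd and hangcheck-timer mechanisms), not by CSSD itself. A quick sketch to confirm these daemons are running:

```shell
# List the CSSD daemon and its monitor agents; on 11gR2+ the
# cssdagent/cssdmonitor processes are what actually reset a hung node.
# "|| true" keeps the pipeline from failing on a non-RAC host.
ps -ef | grep -E 'ocssd|cssdagent|cssdmonitor' | grep -v grep || true
```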
Step 1: Confirm If the Server Rebooted (OS Level Proof)
The first step in Oracle RAC reboot troubleshooting is confirming whether the operating system restarted.
1.1 Check current uptime
uptime
1.2 Check last reboot time
who -b
1.3 Check reboot/shutdown history
last -x | head -50
What to look for:
- reboot system boot
- shutdown system down
- crash
If the reboot time matches the eviction time, an OS-level reboot occurred.
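Filtering the `last -x` output down to boot/shutdown events can be sketched as below. The sample lines are illustrative only (not from a real system); the format mirrors typical `last -x` output:

```shell
# Illustrative `last -x` lines (sample data, not from a real system)
sample='reboot   system boot  4.18.0-425  Mon Jan  8 04:12
shutdown system down 4.18.0-425  Mon Jan  8 04:10
oracle   pts/0        10.0.0.5    Mon Jan  8 03:50'

# Keep only boot/shutdown/crash events; a boot without a preceding
# clean shutdown (or a "crash" pseudo-entry) points at a forced reset
echo "$sample" | grep -E '^(reboot|shutdown|crash)'
```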
Step 2: Confirm CRS Stack Restart (CRS Restart vs OS Restart)
Sometimes the OS stays up but CRS restarts.
2.1 Check CRS stack status
crsctl check crs
Expected output includes:
- OHASD online
- CSSD online
- CRSD online
- EVMD online
2.2 Check cluster status on all nodes
crsctl check cluster -all
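A healthy stack reports four "online" messages. The sample output below reflects the CRS-46xx messages commonly seen on 11gR2/12c; a quick sanity check is simply counting them:

```shell
# Sample healthy `crsctl check crs` output (message numbers as seen on 11gR2/12c)
out='CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online'

# All four daemons must report online; fewer indicates a partial stack
echo "$out" | grep -c 'is online'
```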
Step 3: Identify the Exact Node Eviction Reason (Most Important Log)
The #1 log file for Oracle RAC node eviction analysis is:
ocssd.log
Location (11gR2 layout; from Grid Infrastructure 12.1.0.2 onward, Clusterware logging moved into the ADR, e.g. $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ocssd.trc):
$GRID_HOME/log/<hostname>/cssd/ocssd.log
3.1 View recent eviction details
tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log
3.2 Search for eviction messages
grep -i "evict" $GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -50
3.3 Search for common root cause patterns
grep -i "misscount\|voting\|heartbeat\|reconfig\|panic\|interface" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -100
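Because the CSSD log location differs by Grid Infrastructure version, a small helper that probes the candidate paths avoids hardcoding one layout. Both paths below are assumptions about typical 11gR2 and 12c+ installs:

```shell
# Return the first readable log from a list of candidate paths.
find_cssd_log() {
  for f in "$@"; do
    if [ -r "$f" ]; then echo "$f"; return 0; fi
  done
  return 1
}

# Pre-12.1.0.2 path first, then the ADR trace location (both assumed)
find_cssd_log \
  "$GRID_HOME/log/$(hostname -s)/cssd/ocssd.log" \
  "$ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd.trc" || true
```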
Step 4: Check Grid Infrastructure Alert Log (CRS Alert Log)
The CRS alert log gives high-level cluster failure information.
Location:
$GRID_HOME/log/<hostname>/alert<hostname>.log
4.1 View recent log section
tail -400 $GRID_HOME/log/$(hostname)/alert$(hostname).log
4.2 Search for reboot and eviction keywords
grep -i "evict\|reboot\|cssd\|shutdown\|fatal\|restart" \
$GRID_HOME/log/$(hostname)/alert$(hostname).log | tail -100
Step 5: Check OHASD Log (Why CRS Restarted)
OHASD controls the entire cluster stack startup.
Location:
$GRID_HOME/log/<hostname>/ohasd/ohasd.log
5.1 View OHASD log
tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log
5.2 Search for failure patterns
grep -i "restart\|terminate\|shutdown\|fail\|kill" \
$GRID_HOME/log/$(hostname)/ohasd/ohasd.log | tail -100
Step 6: Check CRSD Log (Resource Failure Analysis)
CRSD manages cluster resources (DB, VIP, listener, ASM).
Location:
$GRID_HOME/log/<hostname>/crsd/crsd.log
6.1 View CRSD log
tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log
6.2 Search for eviction triggers
grep -i "evict\|fatal\|terminate\|restart\|offline" \
$GRID_HOME/log/$(hostname)/crsd/crsd.log | tail -100
Step 7: Check Voting Disk and OCR Health
Voting disk loss is the second most common cause of node eviction.
7.1 Check voting disk status
crsctl query css votedisk
7.2 Check OCR health
ocrcheck
If the node cannot access a majority of the voting disks, CSSD will evict it; OCR corruption mainly affects CRSD and resource management.
Step 8: Check ASM Diskgroup and Storage Status
If the voting disks are stored in ASM, ASM disk problems can evict the node.
8.1 Check ASM diskgroup status
asmcmd lsdg
8.2 Check ASM disks
asmcmd lsdsk
8.3 Check ASM alert log
tail -200 $ORACLE_BASE/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log
Search for storage errors:
grep -i "error\|I/O\|offline\|fail\|timeout" \
$ORACLE_BASE/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log | tail -50
Step 9: Check Multipath / SAN Issues (Path Flapping)
Storage instability causes intermittent voting disk access loss.
9.1 Check multipath status
multipath -ll
9.2 Check SAN timeout errors
dmesg -T | egrep -i "scsi|timeout|abort|reset|I/O error"
9.3 Check OS messages log
egrep -i "path.*down|path.*up|scsi|timeout|abort|reset" \
/var/log/messages | tail -200
Step 10: Check RAC Interconnect (Private Network Failure)
Interconnect issues are the most frequent root cause of eviction.
10.1 Identify interconnect interface
oifcfg getif
10.2 Check interface health and packet drops
ip -s link show <interface>
10.3 Check bonding (if configured)
cat /proc/net/bonding/bond0
10.4 Ping test between RAC nodes (private IP)
ping -I <private_interface> <other_node_private_ip>
Packet loss indicates a network issue.
Step 11: MTU Mismatch Check (Very Common Hidden Issue)
MTU mismatch causes intermittent cluster heartbeat failures.
11.1 Check MTU
ip link show | grep mtu
Ensure all nodes and switches use the same MTU.
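Checking for a mismatch can be scripted by extracting the distinct MTU values from `ip link show` output. The lines below are an illustrative sample, not real interface data:

```shell
# Sample `ip link show` lines (illustrative values)
links='2: eth0: <BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state UP
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state UP'

# More than one distinct MTU across interconnect NICs means a mismatch
echo "$links" | grep -o 'mtu [0-9]*' | sort -u
```

To verify the path end to end (including switches), a non-fragmenting ping works well, e.g. `ping -M do -s 8972 <peer_private_ip>` for a 9000-byte MTU (8972 = 9000 minus 28 bytes of IP/ICMP headers).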
Step 12: Check CPU Hang / OS Lockup / Soft Lockup Errors
If the CPU is overloaded or the kernel locks up, CSSD heartbeats are delayed.
12.1 Check kernel logs
dmesg -T | tail -200
12.2 Search for hung tasks
dmesg -T | egrep -i "soft lockup|hard lockup|blocked for more than"
12.3 Check for OOM killer
dmesg -T | grep -i oom
Step 13: Check for Watchdog or Hangcheck Timer Reboot
Some RAC setups reboot the node automatically if it hangs. The hangcheck-timer module applies to releases before Grid Infrastructure 11gR2; later releases rely on cssdagent/cssdmonitor instead.
13.1 Check hangcheck module
lsmod | grep -i hangcheck
13.2 Check watchdog errors
grep -i watchdog /var/log/messages | tail -100
If watchdog triggers appear in the logs, the OS reboot was forced.
Step 14: Check Time Drift / NTP / Chrony Issues (Critical)
Time drift can break RAC heartbeat synchronization.
14.1 Check chrony status
timedatectl status
chronyc tracking
chronyc sources -v
14.2 Compare node times
date
ssh <othernode> date
If the time difference is greater than 1 second, fix NTP/chrony.
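The drift check can be scripted against the `chronyc tracking` output. The sample line below assumes the usual "System time" field layout from chrony; adjust the field number if your version formats it differently:

```shell
# Illustrative `chronyc tracking` line (field layout assumed from chrony docs)
line='System time     : 1.532012 seconds fast of NTP time'

# Field 4 is the offset in seconds ("System" "time" ":" "<offset>")
off=$(echo "$line" | awk '{print $4}')

# Flag offsets above 1 second (numeric comparison done in awk)
awk -v o="$off" 'BEGIN { exit !(o > 1) }' && echo "drift exceeds 1s"
```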
14.3 Check CTSSD log
tail -200 $GRID_HOME/log/$(hostname)/ctssd/octssd.log
Search:
grep -i "drift\|clock\|time" \
$GRID_HOME/log/$(hostname)/ctssd/octssd.log | tail -50
Step 15: Check Hardware Errors (ECC/MCE)
Hardware faults can cause sudden eviction and reboot.
dmesg -T | egrep -i "mce|ecc|hardware error|fatal"
journalctl -k | egrep -i "mce|ecc|hardware error"
If errors exist, escalate to hardware vendor.
Step 16: Check OS Journal Logs for Shutdown Events
If system uses systemd:
16.1 List boots
journalctl --list-boots
16.2 Check previous boot logs
journalctl -b -1 | tail -200
Search:
journalctl -b -1 | grep -i "reboot\|shutdown\|panic\|watchdog\|hangcheck"
Step 17: Confirm If CRS Triggered the Reboot (Direct Proof)
The best confirmation is finding these in ocssd.log:
- evicting node
- initiating eviction
- misscount exceeded
- voting file inaccessible
- reconfiguration started
- clssnmvDHBValidateNCopy
Search all logs:
grep -Rni "evict" $GRID_HOME/log/$(hostname) | tail -100
grep -Rni "reboot" $GRID_HOME/log/$(hostname) | tail -100
If eviction exists in ocssd.log and reboot exists in OS logs, CRS eviction caused reboot.
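Correlating the two timestamps can be sketched as a small window check. The epoch values below are illustrative; in practice you would convert the `last -x` boot time and the ocssd.log eviction timestamp to epoch seconds first:

```shell
# True when two epoch timestamps are within N seconds of each other
within_window() {
  d=$(( $1 - $2 ))
  [ "${d#-}" -le "$3" ]
}

reboot_ts=1704686400   # illustrative: boot time from `last -x`, as epoch
evict_ts=1704686395    # illustrative: eviction time from ocssd.log, as epoch
if within_window "$reboot_ts" "$evict_ts" 60; then
  echo "reboot correlates with eviction"
fi
```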
Step 18: One Script to Collect All Evidence
Run immediately after node comes back:
echo "===== BASIC INFO ====="
date
hostname
uptime
who -b
last -x | head -30
echo "===== CRS STATUS ====="
crsctl check crs
crsctl check cluster -all
crsctl stat res -t
crsctl query css votedisk
ocrcheck
echo "===== GI ALERT LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/alert$(hostname).log
echo "===== CSSD LOG ====="
tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log
echo "===== OHASD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log
echo "===== CRSD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log
echo "===== CTSSD LOG ====="
tail -200 $GRID_HOME/log/$(hostname)/ctssd/octssd.log
echo "===== OS LOGS ====="
dmesg -T | tail -200
tail -200 /var/log/messages
echo "===== TIME SYNC ====="
timedatectl status
chronyc tracking
chronyc sources -v
echo "===== NETWORK ====="
oifcfg getif
ip link show
ip -s link
echo "===== MULTIPATH ====="
multipath -ll
Step 19: Root Cause Mapping (Fast Troubleshooting Table)
If ocssd.log shows misscount exceeded
Root cause: Interconnect latency OR CPU starvation
Fix: Check NIC drops, switch issues, MTU mismatch, CPU load
If ocssd.log shows voting file not accessible
Root cause: Storage/SAN/multipath issue
Fix: Check multipath, SAN latency, ASM disk offline events
If OS log shows watchdog reset
Root cause: OS hang/hardware freeze
Fix: Kernel tuning, firmware update, CPU/memory check
If CTSSD shows time drift
Root cause: NTP/Chrony problem
Fix: Fix chrony config and synchronize time across nodes
If /var/log/messages shows MCE/ECC
Root cause: Hardware failure
Fix: Replace memory/CPU/NIC and check ILO/IDRAC logs
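The mapping above can be sketched as a small triage helper that buckets a log or OS message into its likely root cause (the input string is just an example message):

```shell
# Map a log/OS message to the likely root-cause bucket from the table above
classify_eviction() {
  case "$1" in
    *misscount*)      echo "interconnect latency or CPU starvation" ;;
    *voting*)         echo "storage/SAN/multipath issue" ;;
    *watchdog*)       echo "OS hang or hardware freeze" ;;
    *drift*|*clock*)  echo "NTP/chrony time drift" ;;
    *MCE*|*ECC*)      echo "hardware failure" ;;
    *)                echo "unknown - review full logs" ;;
  esac
}

classify_eviction "node rac2 exceeded misscount"   # example message
```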
Step 20: Best Practices to Prevent Oracle RAC Node Eviction
To prevent high-frequency evictions:
Recommended RAC Hardening Checklist
- Use redundant private interconnect
- Use bonding active-backup for private network
- Keep MTU consistent across all nodes and switches
- Ensure voting disks have redundant storage paths
- Configure multipath correctly with proper timeouts
- Ensure chrony/ntp is stable across all nodes
- Keep GI and DB patched with latest RU
- Monitor network latency and packet loss
- Monitor SAN response times and path flapping
How to Prove CRS Rebooted the Node
CRS triggered the reboot if ALL of the following are true:
- OS reboot timestamp matches the incident time (last -x)
- ocssd.log contains eviction messages (evicting node)
- Alert log shows CSSD heartbeat/voting failure
- OS log shows a watchdog/hangcheck reboot or forced restart
If ocssd.log has eviction but OS reboot is absent → CRS evicted but OS stayed up (only stack restart).
If you capture the output of this command, the exact root cause can usually be identified quickly:
grep -i "evict\|misscount\|voting\|heartbeat\|reconfig" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -80