How to Find Why CRS/CSSD Rebooted a Server in Oracle RAC – Complete Checklist 

Oracle RAC node eviction is one of the most critical issues in clustered database environments. When a node is evicted, Oracle Clusterware (CRS/CSSD) removes the node from the cluster to prevent split-brain conditions and ensure data consistency. In many cases, this eviction results in an automatic server reboot.

This complete Oracle RAC node eviction troubleshooting guide explains how to identify whether CRS rebooted the server, how to locate the exact cause, and how to fix it using a step-by-step checklist. It covers the key commands, log files, and root cause analysis techniques used by RAC administrators.


What is Oracle RAC Node Eviction?

Oracle RAC node eviction occurs when Oracle Clusterware detects that a node is no longer stable or reliable enough for cluster participation. The node is forcibly removed, usually due to:

  • Interconnect heartbeat loss
  • Voting disk inaccessibility
  • Storage latency or multipath failure
  • OS CPU hang or kernel lockup
  • Time drift between nodes (NTP/Chrony failure)
  • Hardware issues (NIC, memory, CPU)

Eviction is triggered mainly by CSSD (Cluster Synchronization Services Daemon).


Does CRS Really Reboot the Server?

This is a common question.

CRS does not reboot the server randomly.

Instead, Oracle Clusterware can:

  • Evict the node
  • Trigger OS watchdog/hangcheck
  • Stop CRS stack
  • Force restart of cluster services
  • Initiate node reboot in extreme conditions

The OS reboot can happen because:

  • the watchdog mechanism triggers a reboot
  • the hangcheck timer triggers a reboot
  • the node becomes unresponsive and the system resets

Note: since Grid Infrastructure 11.2.0.2, CSSD first attempts a "rebootless restart" (restarting the clusterware stack without rebooting the OS) and only reboots the node if that graceful stop fails.

Step 1: Confirm If the Server Rebooted (OS Level Proof)

The first step in Oracle RAC reboot troubleshooting is confirming whether the operating system restarted.

1.1 Check current uptime

uptime

1.2 Check last reboot time

who -b

1.3 Check reboot/shutdown history

last -x | head -50

What to look for:

  • reboot system boot
  • shutdown system down
  • crash

If the boot time matches the eviction time, an OS reboot occurred.
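
For reference, a reboot typically appears in last -x output like this (an illustrative, hypothetical example; kernel versions and timestamps will differ):

reboot   system boot  4.18.0-425.el8   Tue Jan 14 03:22   still running
shutdown system down  4.18.0-425.el8   Tue Jan 14 03:20 - 03:22  (00:02)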


Step 2: Confirm CRS Stack Restart (CRS Restart vs OS Restart)

Sometimes the OS stays up but CRS restarts.

2.1 Check CRS stack status

crsctl check crs

Expected output includes:

  • OHASD online
  • CSSD online
  • CRSD online
  • EVMD online

2.2 Check cluster status on all nodes

crsctl check cluster -all
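
On a healthy node, crsctl check crs prints output along these lines (illustrative; the wording comes from the CRS message catalog and may vary slightly by version):

CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

If any daemon reports offline, start with that daemon's log in the steps below.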

Step 3: Identify the Exact Node Eviction Reason (Most Important Log)

The #1 log file for Oracle RAC node eviction analysis is:

ocssd.log

Location (Grid Infrastructure 12.1.0.1 and earlier):

$GRID_HOME/log/<hostname>/cssd/ocssd.log

Note: from GI 12.1.0.2 onward, clusterware logging moved to the ADR and the equivalent file is $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ocssd.trc. If you are on a newer release, adjust the $GRID_HOME/log/... paths throughout this guide accordingly.

3.1 View recent eviction details

tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log

3.2 Search for eviction messages

grep -i "evict" $GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -50

3.3 Search for common root cause patterns

grep -i "misscount\|voting\|heartbeat\|reconfig\|panic\|interface" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -100
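
A network-heartbeat eviction usually shows up as escalating warnings like these (simplified, illustrative lines; node names, percentages, and timings will differ):

clssnmPollingThread: node racnode2 (2) at 50% heartbeat fatal, removal in 14.520 seconds
clssnmPollingThread: node racnode2 (2) at 75% heartbeat fatal, removal in 7.260 seconds
clssnmPollingThread: node racnode2 (2) at 90% heartbeat fatal, removal in 2.840 seconds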

Step 4: Check Grid Infrastructure Alert Log (CRS Alert Log)

The CRS alert log gives high-level cluster failure information.

Location:

$GRID_HOME/log/<hostname>/alert<hostname>.log

4.1 View recent log section

tail -400 $GRID_HOME/log/$(hostname)/alert$(hostname).log

4.2 Search for reboot and eviction keywords

grep -i "evict\|reboot\|cssd\|shutdown\|fatal\|restart" \
$GRID_HOME/log/$(hostname)/alert$(hostname).log | tail -100


Step 5: Check OHASD Log (Why CRS Restarted)

OHASD controls the entire cluster stack startup.

Location:

$GRID_HOME/log/<hostname>/ohasd/ohasd.log

5.1 View OHASD log

tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log

5.2 Search for failure patterns

grep -i "restart\|terminate\|shutdown\|fail\|kill" \
$GRID_HOME/log/$(hostname)/ohasd/ohasd.log | tail -100

Step 6: Check CRSD Log (Resource Failure Analysis)

CRSD manages cluster resources (DB, VIP, listener, ASM).

Location:

$GRID_HOME/log/<hostname>/crsd/crsd.log

6.1 View CRSD log

tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log

6.2 Search for eviction triggers

grep -i "evict\|fatal\|terminate\|restart\|offline" \
$GRID_HOME/log/$(hostname)/crsd/crsd.log | tail -100

Step 7: Check Voting Disk and OCR Health

Voting disk loss is the second most common cause of node eviction (after interconnect failure).

7.1 Check voting disk status

crsctl query css votedisk

7.2 Check OCR health

ocrcheck

If OCR is unhealthy, node eviction can occur.
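
A healthy voting disk configuration looks roughly like this (illustrative output; the file universal IDs, paths, and diskgroup names will differ):

##  STATE    File Universal Id                 File Name  Disk group
--  -----    -----------------                 ---------  ----------
 1. ONLINE   6a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d  (/dev/mapper/voting01) [OCRVOTE]
Located 1 voting disk(s).

Any state other than ONLINE needs immediate attention.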


Step 8: Check ASM Diskgroup and Storage Status

If the voting disks are stored in ASM, ASM disk problems can evict the node.

8.1 Check ASM diskgroup status

asmcmd lsdg

8.2 Check ASM disks

asmcmd lsdsk

8.3 Check ASM alert log (note: the ASM alert log lives under the Grid user's ORACLE_BASE, not GRID_HOME)

tail -200 $ORACLE_BASE/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log

Search for storage errors:

grep -i "error\|I/O\|offline\|fail\|timeout" \
$GRID_HOME/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log | tail -50

Step 9: Check Multipath / SAN Issues (Path Flapping)

Storage instability causes intermittent voting disk access loss.

9.1 Check multipath status

multipath -ll

9.2 Check SAN timeout errors

dmesg -T | egrep -i "scsi|timeout|abort|reset|I/O error"

9.3 Check OS messages log

egrep -i "path.*down|path.*up|scsi|timeout|abort|reset" \
/var/log/messages | tail -200
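
In multipath -ll output, a dead or flapping path is marked failed/faulty instead of active ready (illustrative example; the WWID, device names, and vendor strings are placeholders):

mpatha (36000000000000000000000000000000a) dm-2 VENDOR,MODEL
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:0:1 sdb 8:16 active ready  running
  `- 2:0:0:1 sdf 8:80 failed faulty running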

Step 10: Check RAC Interconnect (Private Network Failure)

Interconnect issues are the most frequent root cause of eviction.

10.1 Identify interconnect interface

oifcfg getif

10.2 Check interface health and packet drops

ip -s link show <interface>

10.3 Check bonding (if configured)

cat /proc/net/bonding/bond0

10.4 Ping test between RAC nodes (private IP)

ping -I <private_interface> <other_node_private_ip>

Any packet loss indicates a network issue.
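
For a quantified check, send a burst of pings and read the loss summary (a minimal sketch; the interface and IP are placeholders for your private-network values):

ping -c 100 -i 0.2 -I <private_interface> <other_node_private_ip> | tail -2

Anything above 0% packet loss on the interconnect warrants investigation.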


Step 11: MTU Mismatch Check (Very Common Hidden Issue)

MTU mismatch causes intermittent cluster heartbeat failures.

11.1 Check MTU

ip link show | grep mtu

Ensure all nodes and switch ports use the same MTU on the interconnect path.
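
To verify the MTU end to end (including switch ports), send a maximum-size packet with the don't-fragment bit set. A sketch assuming jumbo frames with MTU 9000 (payload = MTU minus 28 bytes of IP/ICMP headers; use -s 1472 for MTU 1500):

ping -M do -s 8972 -c 5 -I <private_interface> <other_node_private_ip>

If this fails while a normal ping succeeds, a device in the path has a smaller MTU.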


Step 12: Check CPU Hang / OS Lockup / Soft Lockup Errors

If the CPU is overloaded or the kernel locks up, heartbeats are delayed.

12.1 Check kernel logs

dmesg -T | tail -200

12.2 Search for hung tasks

dmesg -T | egrep -i "soft lockup|hard lockup|blocked for more than"

12.3 Check for OOM killer

dmesg -T | grep -i oom

Step 13: Check for Watchdog or Hangcheck Timer Reboot

Some RAC setups reboot the node automatically if it hangs.

13.1 Check hangcheck module

lsmod | grep -i hangcheck

13.2 Check watchdog errors

grep -i watchdog /var/log/messages | tail -100

If you see watchdog triggers, OS reboot was forced.


Step 14: Check Time Drift / NTP / Chrony Issues (Critical)

Time drift can break RAC heartbeat synchronization.

14.1 Check chrony status

timedatectl status
chronyc tracking
chronyc sources -v

14.2 Compare node times

date
ssh <othernode> date

If the time difference exceeds one second, fix NTP/chrony before investigating further.
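
To compare clocks on all nodes in one pass, a small loop helps (a sketch; the node names are placeholders and passwordless SSH between nodes is assumed):

for n in racnode1 racnode2; do echo -n "$n: "; ssh $n date +%s.%N; done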

14.3 Check CTSSD log

tail -200 $GRID_HOME/log/$(hostname)/ctssd/octssd.log

Search:

grep -i "drift\|clock\|time" \
$GRID_HOME/log/$(hostname)/ctssd/ctssd.log | tail -50

Step 15: Check Hardware Errors (ECC/MCE)

Hardware faults can cause sudden eviction and reboot.

dmesg -T | egrep -i "mce|ecc|hardware error|fatal"
journalctl -k | egrep -i "mce|ecc|hardware error"

If errors exist, escalate to hardware vendor.


Step 16: Check OS Journal Logs for Shutdown Events

If system uses systemd:

16.1 List boots

journalctl --list-boots

16.2 Check previous boot logs

journalctl -b -1 | tail -200

Search:

journalctl -b -1 | grep -i "reboot\|shutdown\|panic\|watchdog\|hangcheck"

Step 17: Confirm If CRS Triggered the Reboot (Direct Proof)

The best confirmation is finding these in ocssd.log:

  • evicting node
  • initiating eviction
  • misscount exceeded
  • voting file inaccessible
  • reconfiguration started
  • clssnmvDHBValidateNCopy

Search all logs:

grep -Rni "evict" $GRID_HOME/log/$(hostname) | tail -100
grep -Rni "reboot" $GRID_HOME/log/$(hostname) | tail -100

If eviction exists in ocssd.log and reboot exists in OS logs, CRS eviction caused reboot.


Step 18: Single Command Script to Collect All Evidence

Save the following as collect_rac_evidence.sh and run it immediately after the node comes back:

echo "===== BASIC INFO ====="
date
hostname
uptime
who -b
last -x | head -30

echo "===== CRS STATUS ====="
crsctl check crs
crsctl check cluster -all
crsctl stat res -t
crsctl query css votedisk
ocrcheck

echo "===== GI ALERT LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/alert$(hostname).log

echo "===== CSSD LOG ====="
tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log

echo "===== OHASD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log

echo "===== CRSD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log

echo "===== CTSSD LOG ====="
tail -200 $GRID_HOME/log/$(hostname)/ctssd/ctssd.log

echo "===== OS LOGS ====="
dmesg -T | tail -200
tail -200 /var/log/messages

echo "===== TIME SYNC ====="
timedatectl status
chronyc tracking
chronyc sources -v

echo "===== NETWORK ====="
oifcfg getif
ip link show
ip -s link

echo "===== MULTIPATH ====="
multipath -ll
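
A suggested way to run it and preserve the evidence (the output filename is just a convention):

chmod +x collect_rac_evidence.sh
./collect_rac_evidence.sh > /tmp/rac_evidence_$(hostname)_$(date +%Y%m%d).log 2>&1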

Step 19: Root Cause Mapping (Fast Troubleshooting Table)

If ocssd.log shows misscount exceeded

Root cause: Interconnect latency OR CPU starvation
Fix: Check NIC drops, switch issues, MTU mismatch, CPU load

If ocssd.log shows voting file not accessible

Root cause: Storage/SAN/multipath issue
Fix: Check multipath, SAN latency, ASM disk offline events

If OS log shows watchdog reset

Root cause: OS hang/hardware freeze
Fix: Kernel tuning, firmware update, CPU/memory check

If CTSSD shows time drift

Root cause: NTP/Chrony problem
Fix: Fix chrony config and synchronize time across nodes

If /var/log/messages shows MCE/ECC

Root cause: Hardware failure
Fix: Replace memory/CPU/NIC and check ILO/IDRAC logs


Step 20: Best Practices to Prevent Oracle RAC Node Eviction

To reduce the risk of repeated evictions:

Recommended RAC Hardening Checklist

  • Use redundant private interconnect
  • Use bonding active-backup for private network
  • Keep MTU consistent across all nodes and switches
  • Ensure voting disks have redundant storage paths
  • Configure multipath correctly with proper timeouts
  • Ensure chrony/ntp is stable across all nodes (see the config sketch after this list)
  • Keep GI and DB patched with latest RU
  • Monitor network latency and packet loss
  • Monitor SAN response times and path flapping
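
For the chrony item above, a minimal /etc/chrony.conf sketch (the server names are placeholders; point every node at the same NTP sources):

server ntp1.example.com iburst
server ntp2.example.com iburst
makestep 1.0 3
rtcsync

Restart chronyd after any change and re-verify with chronyc tracking.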

How to Prove CRS Rebooted the Node

CRS triggered the reboot if ALL of the following are true:

  • The OS reboot timestamp matches the incident time (last -x)
  • ocssd.log contains eviction messages (evicting node)
  • The GI alert log shows a CSSD heartbeat/voting failure
  • The OS log shows a watchdog/hangcheck reboot or forced restart

If ocssd.log shows an eviction but the OS logs show no reboot, CRS evicted the node while the OS stayed up (only the clusterware stack restarted).


If you paste the output of this command (for example, in the comments), the exact root cause can usually be identified quickly:

grep -i "evict\|misscount\|voting\|heartbeat\|reconfig" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -80

Please like and subscribe to my YouTube channel: https://www.youtube.com/@foalabs. If you like this post, please follow, share, and comment.