How to Find Why CRS/CSSD Rebooted a Server in Oracle RAC – Complete Checklist
Oracle RAC node eviction is one of the most critical issues in clustered database environments. When a node is evicted, Oracle Clusterware (CRS/CSSD) removes the node from the cluster to prevent split-brain conditions and ensure data consistency. In many cases, this eviction results in an automatic server reboot.
This complete Oracle RAC node eviction troubleshooting guide explains how to identify whether CRS rebooted the server, how to locate the exact cause, and how to fix it using a step-by-step checklist. This guide includes the best commands, logs, and root cause analysis techniques used by RAC administrators.
What is Oracle RAC Node Eviction?
Oracle RAC node eviction occurs when Oracle Clusterware detects a node is no longer stable or reliable for cluster participation. The node is removed forcibly, usually due to:
- Interconnect heartbeat loss
- Voting disk inaccessibility
- Storage latency or multipath failure
- OS CPU hang or kernel lockup
- Time drift between nodes (NTP/Chrony failure)
- Hardware issues (NIC, memory, CPU)
Eviction is triggered mainly by CSSD (Cluster Synchronization Services Daemon).
Does CRS Really Reboot the Server?
This is a common question. CRS does not reboot the server arbitrarily.
Instead, Oracle Clusterware can:
- Evict the node
- Trigger OS watchdog/hangcheck
- Stop CRS stack
- Force restart of cluster services
- Initiate node reboot in extreme conditions
The OS reboot can happen because:
- watchdog mechanism triggers reboot
- hangcheck timer triggers reboot
- node becomes unresponsive and system resets
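On 11gR2 and later, the reboot is normally performed by the CSSD agent processes (cssdagent and cssdmonitor, which replaced the older oprocd and hangcheck-timer mechanisms), not by CSSD itself. A quick sketch to confirm these daemons are running:

```shell
# List the CSSD daemon and its monitor agents; on 11gR2+ the
# cssdagent/cssdmonitor processes are what actually reset a hung node.
# "|| true" keeps the pipeline from failing on a non-RAC host.
ps -ef | grep -E 'ocssd|cssdagent|cssdmonitor' | grep -v grep || true
```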
Step 1: Confirm If the Server Rebooted (OS Level Proof)
The first step in Oracle RAC reboot troubleshooting is confirming whether the operating system restarted.
1.1 Check current uptime
uptime
1.2 Check last reboot time
who -b
1.3 Check reboot/shutdown history
last -x | head -50
What to look for:
- reboot system boot
- shutdown system down
- crash
If the reboot time matches the eviction time, an OS-level reboot occurred.
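Filtering the `last -x` output down to boot/shutdown events can be sketched as below. The sample lines are illustrative only (not from a real system); the format mirrors typical `last -x` output:

```shell
# Illustrative `last -x` lines (sample data, not from a real system)
sample='reboot   system boot  4.18.0-425  Mon Jan  8 04:12
shutdown system down 4.18.0-425  Mon Jan  8 04:10
oracle   pts/0        10.0.0.5    Mon Jan  8 03:50'

# Keep only boot/shutdown/crash events; a boot without a preceding
# clean shutdown (or a "crash" pseudo-entry) points at a forced reset
echo "$sample" | grep -E '^(reboot|shutdown|crash)'
```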
Step 2: Confirm CRS Stack Restart (CRS Restart vs OS Restart)
Sometimes the OS stays up but CRS restarts.
2.1 Check CRS stack status
crsctl check crs
Expected output includes:
- OHASD online
- CSSD online
- CRSD online
- EVMD online
2.2 Check cluster status on all nodes
crsctl check cluster -all
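A healthy stack reports four "online" messages. The sample output below reflects the CRS-46xx messages commonly seen on 11gR2/12c; a quick sanity check is simply counting them:

```shell
# Sample healthy `crsctl check crs` output (message numbers as seen on 11gR2/12c)
out='CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online'

# All four daemons must report online; fewer indicates a partial stack
echo "$out" | grep -c 'is online'
```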
Step 3: Identify the Exact Node Eviction Reason (Most Important Log)
The #1 log file for Oracle RAC node eviction analysis is:
ocssd.log
Location (11gR2 layout; from Grid Infrastructure 12.1.0.2 onward, Clusterware logging moved into the ADR, e.g. $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ocssd.trc):
$GRID_HOME/log/<hostname>/cssd/ocssd.log
3.1 View recent eviction details
tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log
3.2 Search for eviction messages
grep -i "evict" $GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -50
3.3 Search for common root cause patterns
grep -i "misscount\|voting\|heartbeat\|reconfig\|panic\|interface" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -100
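Because the CSSD log location differs by Grid Infrastructure version, a small helper that probes the candidate paths avoids hardcoding one layout. Both paths below are assumptions about typical 11gR2 and 12c+ installs:

```shell
# Return the first readable log from a list of candidate paths.
find_cssd_log() {
  for f in "$@"; do
    if [ -r "$f" ]; then echo "$f"; return 0; fi
  done
  return 1
}

# Pre-12.1.0.2 path first, then the ADR trace location (both assumed)
find_cssd_log \
  "$GRID_HOME/log/$(hostname -s)/cssd/ocssd.log" \
  "$ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd.trc" || true
```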
Step 4: Check Grid Infrastructure Alert Log (CRS Alert Log)
The CRS alert log gives high-level cluster failure information.
Location:
$GRID_HOME/log/<hostname>/alert<hostname>.log
4.1 View recent log section
tail -400 $GRID_HOME/log/$(hostname)/alert$(hostname).log
4.2 Search for reboot and eviction keywords
grep -i "evict\|reboot\|cssd\|shutdown\|fatal\|restart" \
$GRID_HOME/log/$(hostname)/alert$(hostname).log | tail -100
Step 5: Check OHASD Log (Why CRS Restarted)
OHASD controls the entire cluster stack startup.
Location:
$GRID_HOME/log/<hostname>/ohasd/ohasd.log
5.1 View OHASD log
tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log
5.2 Search for failure patterns
grep -i "restart\|terminate\|shutdown\|fail\|kill" \
$GRID_HOME/log/$(hostname)/ohasd/ohasd.log | tail -100
Step 6: Check CRSD Log (Resource Failure Analysis)
CRSD manages cluster resources (DB, VIP, listener, ASM).
Location:
$GRID_HOME/log/<hostname>/crsd/crsd.log
6.1 View CRSD log
tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log
6.2 Search for eviction triggers
grep -i "evict\|fatal\|terminate\|restart\|offline" \
$GRID_HOME/log/$(hostname)/crsd/crsd.log | tail -100
Step 7: Check Voting Disk and OCR Health
Voting disk loss is the second most common cause of node eviction.
7.1 Check voting disk status
crsctl query css votedisk
7.2 Check OCR health
ocrcheck
If the node cannot access a majority of the voting disks, CSSD will evict it; OCR corruption mainly affects CRSD and resource management.
Step 8: Check ASM Diskgroup and Storage Status
If the voting disks are stored in ASM, ASM disk problems can evict the node.
8.1 Check ASM diskgroup status
asmcmd lsdg
8.2 Check ASM disks
asmcmd lsdsk
8.3 Check ASM alert log
tail -200 $ORACLE_BASE/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log
Search for storage errors:
grep -i "error\|I/O\|offline\|fail\|timeout" \
$ORACLE_BASE/diag/asm/+asm/+ASM*/trace/alert_+ASM*.log | tail -50
Step 9: Check Multipath / SAN Issues (Path Flapping)
Storage instability causes intermittent voting disk access loss.
9.1 Check multipath status
multipath -ll
9.2 Check SAN timeout errors
dmesg -T | egrep -i "scsi|timeout|abort|reset|I/O error"
9.3 Check OS messages log
egrep -i "path.*down|path.*up|scsi|timeout|abort|reset" \
/var/log/messages | tail -200
Step 10: Check RAC Interconnect (Private Network Failure)
Interconnect issues are the most frequent root cause of eviction.
10.1 Identify interconnect interface
oifcfg getif
10.2 Check interface health and packet drops
ip -s link show <interface>
10.3 Check bonding (if configured)
cat /proc/net/bonding/bond0
10.4 Ping test between RAC nodes (private IP)
ping -I <private_interface> <other_node_private_ip>
Packet loss indicates a network issue.
Step 11: MTU Mismatch Check (Very Common Hidden Issue)
MTU mismatch causes intermittent cluster heartbeat failures.
11.1 Check MTU
ip link show | grep mtu
Ensure all nodes and switches use the same MTU.
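Checking for a mismatch can be scripted by extracting the distinct MTU values from `ip link show` output. The lines below are an illustrative sample, not real interface data:

```shell
# Sample `ip link show` lines (illustrative values)
links='2: eth0: <BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state UP
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state UP'

# More than one distinct MTU across interconnect NICs means a mismatch
echo "$links" | grep -o 'mtu [0-9]*' | sort -u
```

To verify the path end to end (including switches), a non-fragmenting ping works well, e.g. `ping -M do -s 8972 <peer_private_ip>` for a 9000-byte MTU (8972 = 9000 minus 28 bytes of IP/ICMP headers).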
Step 12: Check CPU Hang / OS Lockup / Soft Lockup Errors
If the CPU is overloaded or the kernel locks up, CSSD heartbeats are delayed.
12.1 Check kernel logs
dmesg -T | tail -200
12.2 Search for hung tasks
dmesg -T | egrep -i "soft lockup|hard lockup|blocked for more than"
12.3 Check for OOM killer
dmesg -T | grep -i oom
Step 13: Check for Watchdog or Hangcheck Timer Reboot
Some RAC setups reboot the node automatically if it hangs. The hangcheck-timer module applies to releases before Grid Infrastructure 11gR2; later releases rely on cssdagent/cssdmonitor instead.
13.1 Check hangcheck module
lsmod | grep -i hangcheck
13.2 Check watchdog errors
grep -i watchdog /var/log/messages | tail -100
If watchdog triggers appear in the logs, the OS reboot was forced.
Step 14: Check Time Drift / NTP / Chrony Issues (Critical)
Time drift can break RAC heartbeat synchronization.
14.1 Check chrony status
timedatectl status
chronyc tracking
chronyc sources -v
14.2 Compare node times
date
ssh <othernode> date
If the time difference is greater than 1 second, fix NTP/chrony.
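The drift check can be scripted against the `chronyc tracking` output. The sample line below assumes the usual "System time" field layout from chrony; adjust the field number if your version formats it differently:

```shell
# Illustrative `chronyc tracking` line (field layout assumed from chrony docs)
line='System time     : 1.532012 seconds fast of NTP time'

# Field 4 is the offset in seconds ("System" "time" ":" "<offset>")
off=$(echo "$line" | awk '{print $4}')

# Flag offsets above 1 second (numeric comparison done in awk)
awk -v o="$off" 'BEGIN { exit !(o > 1) }' && echo "drift exceeds 1s"
```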
14.3 Check CTSSD log
tail -200 $GRID_HOME/log/$(hostname)/ctssd/octssd.log
Search:
grep -i "drift\|clock\|time" \
$GRID_HOME/log/$(hostname)/ctssd/octssd.log | tail -50
Step 15: Check Hardware Errors (ECC/MCE)
Hardware faults can cause sudden eviction and reboot.
dmesg -T | egrep -i "mce|ecc|hardware error|fatal"
journalctl -k | egrep -i "mce|ecc|hardware error"
If errors exist, escalate to hardware vendor.
Step 16: Check OS Journal Logs for Shutdown Events
If system uses systemd:
16.1 List boots
journalctl --list-boots
16.2 Check previous boot logs
journalctl -b -1 | tail -200
Search:
journalctl -b -1 | grep -i "reboot\|shutdown\|panic\|watchdog\|hangcheck"
Step 17: Confirm If CRS Triggered the Reboot (Direct Proof)
The best confirmation is finding these in ocssd.log:
- evicting node
- initiating eviction
- misscount exceeded
- voting file inaccessible
- reconfiguration started
- clssnmvDHBValidateNCopy
Search all logs:
grep -Rni "evict" $GRID_HOME/log/$(hostname) | tail -100
grep -Rni "reboot" $GRID_HOME/log/$(hostname) | tail -100
If eviction exists in ocssd.log and reboot exists in OS logs, CRS eviction caused reboot.
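Correlating the two timestamps can be sketched as a small window check. The epoch values below are illustrative; in practice you would convert the `last -x` boot time and the ocssd.log eviction timestamp to epoch seconds first:

```shell
# True when two epoch timestamps are within N seconds of each other
within_window() {
  d=$(( $1 - $2 ))
  [ "${d#-}" -le "$3" ]
}

reboot_ts=1704686400   # illustrative: boot time from `last -x`, as epoch
evict_ts=1704686395    # illustrative: eviction time from ocssd.log, as epoch
if within_window "$reboot_ts" "$evict_ts" 60; then
  echo "reboot correlates with eviction"
fi
```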
Step 18: One Script to Collect All Evidence
Run immediately after node comes back:
echo "===== BASIC INFO ====="
date
hostname
uptime
who -b
last -x | head -30
echo "===== CRS STATUS ====="
crsctl check crs
crsctl check cluster -all
crsctl stat res -t
crsctl query css votedisk
ocrcheck
echo "===== GI ALERT LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/alert$(hostname).log
echo "===== CSSD LOG ====="
tail -500 $GRID_HOME/log/$(hostname)/cssd/ocssd.log
echo "===== OHASD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/ohasd/ohasd.log
echo "===== CRSD LOG ====="
tail -300 $GRID_HOME/log/$(hostname)/crsd/crsd.log
echo "===== CTSSD LOG ====="
tail -200 $GRID_HOME/log/$(hostname)/ctssd/octssd.log
echo "===== OS LOGS ====="
dmesg -T | tail -200
tail -200 /var/log/messages
echo "===== TIME SYNC ====="
timedatectl status
chronyc tracking
chronyc sources -v
echo "===== NETWORK ====="
oifcfg getif
ip link show
ip -s link
echo "===== MULTIPATH ====="
multipath -ll
Step 19: Root Cause Mapping (Fast Troubleshooting Table)
If ocssd.log shows misscount exceeded
Root cause: Interconnect latency OR CPU starvation
Fix: Check NIC drops, switch issues, MTU mismatch, CPU load
If ocssd.log shows voting file not accessible
Root cause: Storage/SAN/multipath issue
Fix: Check multipath, SAN latency, ASM disk offline events
If OS log shows watchdog reset
Root cause: OS hang/hardware freeze
Fix: Kernel tuning, firmware update, CPU/memory check
If CTSSD shows time drift
Root cause: NTP/Chrony problem
Fix: Fix chrony config and synchronize time across nodes
If /var/log/messages shows MCE/ECC
Root cause: Hardware failure
Fix: Replace memory/CPU/NIC and check ILO/IDRAC logs
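The mapping above can be sketched as a small triage helper that buckets a log or OS message into its likely root cause (the input string is just an example message):

```shell
# Map a log/OS message to the likely root-cause bucket from the table above
classify_eviction() {
  case "$1" in
    *misscount*)      echo "interconnect latency or CPU starvation" ;;
    *voting*)         echo "storage/SAN/multipath issue" ;;
    *watchdog*)       echo "OS hang or hardware freeze" ;;
    *drift*|*clock*)  echo "NTP/chrony time drift" ;;
    *MCE*|*ECC*)      echo "hardware failure" ;;
    *)                echo "unknown - review full logs" ;;
  esac
}

classify_eviction "node rac2 exceeded misscount"   # example message
```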
Step 20: Best Practices to Prevent Oracle RAC Node Eviction
To prevent high-frequency evictions:
Recommended RAC Hardening Checklist
- Use redundant private interconnect
- Use bonding active-backup for private network
- Keep MTU consistent across all nodes and switches
- Ensure voting disks have redundant storage paths
- Configure multipath correctly with proper timeouts
- Ensure chrony/ntp is stable across all nodes
- Keep GI and DB patched with latest RU
- Monitor network latency and packet loss
- Monitor SAN response times and path flapping
How to Prove CRS Rebooted the Node
CRS triggered the reboot if ALL of the following are true:
- OS reboot timestamp matches the incident time (last -x)
- ocssd.log contains eviction messages (evicting node)
- Alert log shows CSSD heartbeat/voting failure
- OS log shows a watchdog/hangcheck reboot or forced restart
If ocssd.log has eviction but OS reboot is absent → CRS evicted but OS stayed up (only stack restart).
If you capture the output of this command, the exact root cause can usually be identified quickly:
grep -i "evict\|misscount\|voting\|heartbeat\|reconfig" \
$GRID_HOME/log/$(hostname)/cssd/ocssd.log | tail -80