Check Logs for Why a Linux System Rebooted or Shut Down
Quick Triage — Always Run First
Before diving into individual causes, establish a timeline.
# 1. When did the last boot happen?
who -b
# 2. List all reboots with timestamps
last reboot | head -20
# 3. Last shutdown/reboot event with adjacent runlevel change
last -x | grep -C1 'shutdown\|reboot' | head -30
# 4. Check previous boot logs (most important on systemd systems; requires a persistent journal under /var/log/journal)
journalctl -b -1 --no-pager | tail -200
# 5. Check kernel ring buffer from previous boot
journalctl -b -1 -k --no-pager | tail -100
# 6. Did auditd see a clean shutdown or a surprise boot?
ausearch -i -m system_boot,system_shutdown | tail -6
# If two SYSTEM_BOOT lines appear in a row with no SYSTEM_SHUTDOWN between them,
# the system did NOT shut down gracefully: something crashed or lost power.
User / Admin Initiated Reboot (Signal 15 / SIGTERM)
What it is
Signal 15 is SIGTERM — the graceful termination signal sent to all processes during a normal shutdown, reboot, or init 6. The last line syslog emits before going down is:
exiting on signal 15
This is not a crash. It means a user or program directed the shutdown.
Log patterns
# /var/log/messages or journalctl:
shutdown[PID]: shutting down for system reboot
init: Switching to runlevel: 6 # SysV init systems (RHEL 6)
systemd-logind[PID]: System is rebooting. # systemd systems (RHEL 7+)
syslogd: exiting on signal 15
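As a rough sketch, the markers listed above can be folded into a single check. The function name `looks_graceful` is hypothetical, and treating runlevel 0 (halt) the same as runlevel 6 is an assumption added here; the pattern list is illustrative rather than exhaustive.

```shell
# looks_graceful: reads log text on stdin.
# Exits 0 if any orderly-shutdown marker is present, 1 otherwise.
looks_graceful() {
    grep -qE -e 'exiting on signal 15|shutting down for system|Switching to runlevel: [06]|System is (rebooting|powering down)'
}

# Example (needs a persistent journal):
#   journalctl -b -1 --no-pager | looks_graceful && echo "previous boot ended with an orderly shutdown"
```

A match only shows that the shutdown sequence started; the investigation commands are still needed to find out who or what requested it.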
Investigation commands
# Who was logged in just before the reboot?
last | head -30
# Check /var/log/secure (RHEL) or /var/log/auth.log (Debian/Ubuntu) for who ran sudo
grep -i 'shutdown\|reboot\|halt\|init 6\|systemctl' /var/log/secure | tail -30
# Check bash history of root and other admin users (not reliable but a starting point)
cat /root/.bash_history | grep -iE 'reboot|shutdown|halt|init 6|systemctl'
# Check audit log for who ran the shutdown command
ausearch -c shutdown --start yesterday --end now -i
ausearch -c reboot --start yesterday --end now -i
ausearch -c systemctl --start yesterday --end now -i | grep -i reboot
# Check systemd journal for the initiating session
journalctl -b -1 --no-pager | grep -iE 'reboot|shutdown|signal 15|runlevel'
# Check if any automated/scheduled task triggered it (cron, at, systemd timers)
cat /var/spool/cron/root
systemctl list-timers --all | grep -i reboot
atq
ACPI / Power Button / Thermal Shutdown
What it is
An ACPI (Advanced Configuration and Power Interface) event triggers a shutdown when:
- The physical power button is pressed
- The hypervisor/cloud platform sends a power-off signal
- A thermal threshold is exceeded, and the firmware initiates shutdown
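For the thermal case, the kernel exposes each zone's temperature in sysfs in millidegrees Celsius. A minimal sketch that prints every zone in degrees is shown here; the function name `print_thermal_zones` and its optional base-directory argument are hypothetical and exist only to make the example easy to try against a copy of the tree.

```shell
# print_thermal_zones [BASE]: print "zone type temperature_C" for each thermal zone.
# BASE defaults to /sys/class/thermal.
print_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue          # skip zones without a reading
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # sysfs reports millidegrees C, so divide by 1000
        awk -v z="${zone##*/}" -v t="$ztype" \
            '{ printf "%s %s %.1f\n", z, t, $1 / 1000 }' "$zone/temp"
    done
}
```

These are post-boot readings, so they show the trend after the event rather than the temperature at the moment of shutdown.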
Log patterns
# journalctl / /var/log/messages:
kernel: ACPI: Power Button [PWRF/PWRB]
systemd-logind[PID]: Power key pressed.
systemd-logind[PID]: System is powering down.
kernel: thermal thermal_zone0: critical temperature reached (NNN C), shutting down
Investigation commands
# Look for ACPI power button events
journalctl -b -1 -k --no-pager | grep -i 'acpi\|power button\|thermal\|critical temp'
# Check current thermal zone readings (post-boot, for trend)
cat /sys/class/thermal/thermal_zone*/temp # values are in millidegrees C (divide by 1000)
# Check IPMI System Event Log for thermal events (requires ipmitool)
ipmitool sel list | grep -i 'thermal\|temp\|power\|button'
# ACPI events via dmesg (current boot only)
dmesg | grep -i 'acpi\|thermal\|power button'
# Check systemd-logind config for power button action
grep -i 'HandlePowerKey\|HandleLidSwitch' /etc/systemd/logind.conf
OOM Killer (Out of Memory)
What it is
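When the kernel cannot satisfy a memory allocation request and cannot reclaim enough memory (swap is exhausted or absent, within the limits set by the vm.overcommit_memory policy), the OOM killer selects and kills a process. An OOM kill by itself does not reboot the system. A reboot follows when vm.panic_on_oom is non-zero, so the kernel panics instead of killing a process (and kernel.panic is set to reboot after a panic), or when the loss of a critical process leads to a follow-on failure such as a watchdog timeout.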
Log patterns
kernel: Out of memory: Kill process PID (process_name) score NNN or sacrifice child
kernel: Killed process PID (process_name) total-vm:NNNkB, anon-rss:NNNkB
kernel: oom_kill_process: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null)...
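To see at a glance which processes were killed, the "Killed process" lines can be reduced to a short table. This is a sketch based on the line format shown above; the function name `oom_victims` is hypothetical, and kernels that word the message differently would need the pattern adjusted.

```shell
# oom_victims: reads kernel log text on stdin and prints "PID NAME TOTAL_VM_KB"
# for every "Killed process" line.
oom_victims() {
    sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*total-vm:([0-9]+)kB.*/\1 \2 \3/p'
}

# Example:
#   journalctl -b -1 -k --no-pager | oom_victims
```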
Investigation commands
# Search for OOM events in previous boot
journalctl -b -1 -k --no-pager | grep -i 'out of memory\|oom\|killed process'
# Search all available journal history
journalctl -k --no-pager | grep -i 'out of memory\|oom killer'
# Search /var/log/messages directly (RHEL)
grep -i 'out of memory\|oom\|kill process' /var/log/messages
# Check current OOM panic setting
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom
# 0 = OOM killer runs (no panic); 1 = panic on OOM, except when the OOM is confined to a cpuset, mempolicy, or memory cgroup; 2 = always panic
# Check current overcommit policy
sysctl vm.overcommit_memory
# Review memory usage around reboot time (if SAR data available)
sar -r -f /var/log/sa/saDD # DD = day of the month of the reboot (saYYYYMMDD if sysstat keeps dated files)
Kernel Panic
What it is
A kernel panic occurs when the kernel encounters an unrecoverable error: a NULL pointer dereference, a BUG() assertion failure, a fatal hardware error, a corrupted stack, or a driver fault. If kernel.panic sysctl is set to a non-zero value, the system reboots automatically after that many seconds.
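The meaning of the kernel.panic value can be spelled out with a small helper. This is a sketch only: the function name `describe_kernel_panic` is hypothetical, and the negative-value case (immediate reboot) comes from the kernel sysctl documentation rather than from the text above.

```shell
# describe_kernel_panic [VALUE]: explain what a kernel.panic setting does.
# VALUE defaults to the live value in /proc/sys/kernel/panic.
describe_kernel_panic() {
    value=${1:-$(cat /proc/sys/kernel/panic 2>/dev/null)}
    case $value in
        '') echo "kernel.panic unavailable" ;;
        0)  echo "hang on panic (no automatic reboot)" ;;
        -*) echo "reboot immediately on panic" ;;
        *)  echo "reboot ${value}s after panic" ;;
    esac
}
```

A host that hangs on panic stays down until someone or something (a BMC watchdog, a fencing agent) resets it, which changes where the evidence of the reboot will be found.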
Log patterns
kernel: Kernel panic - not syncing: <reason string>
kernel: BUG: unable to handle kernel NULL pointer dereference at 0000...
kernel: general protection fault: 0000 [#1] SMP
kernel: Oops: 0002 [#1] SMP PREEMPT
Investigation commands
# Check previous boot kernel messages for panic
journalctl -b -1 -k --no-pager | grep -iE 'panic|oops|bug:|general protection|call trace'
# Check auto-reboot on panic setting
sysctl kernel.panic
cat /proc/sys/kernel/panic
# 0 = hang on panic (no reboot); >0 = reboot after N seconds
# Check if panic-on-oops is enabled
sysctl kernel.panic_on_oops
# Check dmesg for oops/panic backtraces (current boot only, so this catches oopses the system survived)
dmesg | grep -A 20 -i 'kernel panic\|BUG:\|Oops:'
# Check if kdump captured a vmcore (see Cause 9 for full kdump section)
ls -lh /var/crash/
less /var/crash/127.0.0.1-2026-06-17-18:10:05/vmcore-dmesg.txt
CPU Soft Lockup / Hard Lockup (Watchdog)
What it is
The kernel watchdog detects two types of CPU lockups:
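- Soft lockup: A task monopolizes a CPU for longer than twice kernel.watchdog_thresh seconds (20s with the default watchdog_thresh of 10) without yielding. The kernel prints a warning. If kernel.softlockup_panic=1, the kernel panics, and it reboots if kernel.panic is non-zero.
- Hard lockup (NMI watchdog): A CPU stops servicing interrupts for longer than kernel.watchdog_thresh seconds, which the NMI watchdog detects; this usually indicates a true hardware or low-level kernel hang. If kernel.hardlockup_panic=1, the kernel panics, and it reboots if kernel.panic is non-zero.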
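Both thresholds derive from a single sysctl: per the kernel documentation, the hard lockup threshold is kernel.watchdog_thresh seconds and the soft lockup threshold is twice that. A minimal sketch of the arithmetic follows, with the hypothetical function name `lockup_thresholds`.

```shell
# lockup_thresholds [THRESH]: print the effective lockup thresholds in seconds.
# THRESH defaults to the live value of kernel.watchdog_thresh.
lockup_thresholds() {
    thresh=${1:-$(cat /proc/sys/kernel/watchdog_thresh 2>/dev/null)}
    [ -n "$thresh" ] || { echo "kernel.watchdog_thresh unavailable" >&2; return 1; }
    # hard lockup = watchdog_thresh, soft lockup = 2 * watchdog_thresh
    echo "hard lockup after ${thresh}s, soft lockup after $((thresh * 2))s"
}
```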
Log patterns
kernel: watchdog: BUG: soft lockup - CPU#N stuck for NNs! [task_name:PID]
kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu N
Investigation commands
# Search journal for soft/hard lockup events
journalctl -b -1 -k --no-pager | grep -iE 'soft lockup|hard lockup|nmi watchdog|hung task'
# Check current soft lockup panic setting
sysctl kernel.softlockup_panic
sysctl kernel.hardlockup_panic
sysctl kernel.watchdog_thresh # default 10 seconds (soft lockup threshold is twice this value)
# Check hung task panic setting
sysctl kernel.hung_task_panic
sysctl kernel.hung_task_timeout_secs # default 120 seconds
# Check if watchdog is enabled
sysctl kernel.watchdog
Pacemaker / Cluster Fencing (STONITH)
What it is
High-availability clusters use STONITH (Shoot The Other Node In The Head) fencing to recover from split-brain scenarios. The cluster reboots (or power-cycles) a node that it considers unhealthy to protect shared resources. This is an intentional cluster-initiated reboot, not a crash.
Common fencing agents: fence_ipmilan, fence_idrac, fence_apc, fence_vmware_soap.
Log patterns
# /var/log/messages or journalctl:
pacemaker-fenced[PID]: notice: Requesting peer fencing (reboot) targeting <nodename>
pacemaker-controld[PID]: notice: Requesting fencing (reboot) of node <nodename>
fence_ipmilan: Succeeded in operation reboot for <node>
corosync[PID]: [TOTEM] A new membership was created (node left cluster)
Investigation commands
# Pacemaker cluster and fencing logs
journalctl -b -1 --no-pager -u pacemaker | grep -iE 'fenc|stonith|reboot|shot'
journalctl -b -1 --no-pager -u corosync | tail -50
# Dedicated fencing log (if configured)
grep -iE 'fenc|stonith|reboot' /var/log/pacemaker/pacemaker.log | tail -50
# Check cluster history for fencing events (pcs)
pcs status
pcs stonith history
# Check Corosync ring/quorum at time of incident
grep -i 'quorum\|ring\|lost\|partition' /var/log/cluster/corosync.log | tail -30
# crm_report captures everything for a time window
crm_report -f "YYYY-MM-DD HH:MM:SS" -t "YYYY-MM-DD HH:MM:SS" /tmp/crm-report
IPMI / BMC Hardware Watchdog
What it is
The BMC (Baseboard Management Controller) has a hardware watchdog timer independent of the OS. If the OS fails to periodically reset (kick) the watchdog before it counts down to zero, the BMC performs a hardware reset (power cycle or reboot). This can happen if:
- The ipmi_watchdog kernel module or watchdog daemon stops running
- The system hangs at a level below where software can reset the timer
Log patterns
# In IPMI System Event Log (SEL):
# "OS Watchdog Timer | OS Watchdog Timer Expired"
# In /var/log/messages before the reset (if the OS was still logging):
kernel: IPMI Watchdog: Starting countdown in kernel.
watchdog[PID]: keepalive failed
Investigation commands
# Check IPMI SEL for watchdog timer expiry events
ipmitool sel list | grep -i 'watchdog\|timer\|reset\|power'
# Get full IPMI event log with decoded descriptions
ipmitool sel elist
# Check if ipmi_watchdog module is loaded and its settings
lsmod | grep ipmi_watchdog
cat /sys/module/ipmi_watchdog/parameters/action # reset, power_cycle, power_off, none
cat /sys/module/ipmi_watchdog/parameters/timeout # countdown in seconds
cat /sys/module/ipmi_watchdog/parameters/pretimeout # pre-NMI seconds
# Check watchdog daemon status (if using watchdog package)
systemctl status watchdog
# Review BMC System Event Log entries around reboot time
ipmitool sel time get # confirm BMC clock vs system clock
ipmitool sdr type 'Watchdog'
SysRq Triggered Crash / Manual Panic
What it is
The Magic SysRq key mechanism allows a privileged user or script to force an immediate kernel crash (useful for testing kdump). Writing c to /proc/sysrq-trigger calls panic() directly. This is also the mechanism used by some monitoring tools to force a vmcore capture on a hung system.
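The kernel.sysrq value is a bitmask that controls which functions the keyboard combination may invoke; according to the kernel documentation it does not restrict root writes to /proc/sysrq-trigger. The sketch below decodes a mask into the documented groups. The function name `decode_sysrq` and the short group labels are hypothetical; the bit values are taken from the kernel's sysrq documentation.

```shell
# decode_sysrq [MASK]: list the SysRq function groups enabled by a kernel.sysrq value.
# MASK defaults to the live value in /proc/sys/kernel/sysrq.
decode_sysrq() {
    mask=${1:-$(cat /proc/sys/kernel/sysrq 2>/dev/null)}
    case $mask in
        '') echo "kernel.sysrq unavailable"; return 1 ;;
        0)  echo "disabled"; return 0 ;;
        1)  echo "all functions enabled"; return 0 ;;
    esac
    for entry in 2:console-loglevel 4:keyboard-control 8:debug-dumps 16:sync \
                 32:remount-ro 64:signal-processes 128:reboot-poweroff 256:nice-rt-tasks; do
        bit=${entry%%:*}                      # numeric bit before the colon
        [ $((mask & bit)) -ne 0 ] && echo "${entry#*:}"
    done
    return 0
}
```

For example, `decode_sysrq 176` lists sync, remount-ro and reboot-poweroff, since 176 = 128 + 32 + 16.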
Log patterns
kernel: SysRq : Trigger a crash
kernel: Kernel panic - not syncing: sysrq triggered crash
Investigation commands
# Check if SysRq crash was triggered
journalctl -b -1 -k --no-pager | grep -i 'sysrq\|trigger a crash'
grep -i sysrq /var/log/messages
# Check current SysRq enabled bitmap
sysctl kernel.sysrq
cat /proc/sys/kernel/sysrq
# 0 = disabled, 1 = all functions, otherwise a bitmask: 16 = sync only (the usual RHEL default), 176 = sync + remount read-only + reboot (the Ubuntu default)
# Who could have done it? Check audit log
ausearch -f /proc/sysrq-trigger --start yesterday --end now -i
# Check /etc/sysctl.conf and /etc/sysctl.d/ for intentional configuration
grep -r sysrq /etc/sysctl.conf /etc/sysctl.d/
Kernel Crash Dump (kdump)
What it is
kdump is the kernel crash dump mechanism included in RHEL 6 through RHEL 10. When any of the above panic-inducing conditions fires, if kdump is properly configured, the running kernel hands off to a small capture kernel (kexec) which saves a memory image (vmcore) to disk before rebooting. kdump is the single best tool for post-incident RCA of unexpected reboots.
If kdump was not configured before the reboot, it may not be possible to determine the root cause. — Red Hat KCS
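Because the capture kernel needs memory reserved at boot, a quick readiness check is to look for a crashkernel= argument on the kernel command line. A minimal sketch, with the hypothetical function name `crashkernel_arg`:

```shell
# crashkernel_arg [CMDLINE]: print the value of crashkernel= from a kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline. Prints nothing if the argument is absent.
crashkernel_arg() {
    cmdline=${1:-$(cat /proc/cmdline 2>/dev/null)}
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Example:
#   [ -n "$(crashkernel_arg)" ] || echo "no crash kernel memory reserved; kdump cannot capture a vmcore"
```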
Verify kdump is installed and running
# Is kexec-tools installed?
rpm -q kexec-tools
# Is kdump service enabled and active?
systemctl status kdump
# How much crash memory is reserved?
cat /proc/cmdline | grep -o 'crashkernel=[^ ]*'
# or
grep crashkernel /etc/default/grub
# Where will vmcore be saved?
grep -v '^#' /etc/kdump.conf | grep -v '^$'
After a crash: analyze the vmcore
# List crash dumps captured
ls -lh /var/crash/
# Identify which vmcore belongs to which crash time
ls -lh /var/crash/*/vmcore
# Install crash and kernel debug symbols (RHEL)
dnf install crash
dnf install kernel-debuginfo-$(uname -r)
# Open the vmcore with crash utility
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
# Inside crash utility, the most useful commands:
# bt  - backtrace of the crashing CPU at time of panic
# log - kernel message buffer (dmesg at crash time)
# ps  - process list at crash time
# vm  - virtual memory info
# sys - system info (uptime, kernel version, panic string)
# q   - quit
Enable kdump if not already active
# Enable and start kdump
systemctl enable --now kdump
# Verify crash kernel is reserved
grep -i crashkernel /proc/cmdline
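# If not set, add crashkernel= to the kernel command line and reboot
# RHEL 8 (grubby handles both BIOS and UEFI systems):
grubby --update-kernel=ALL --args="crashkernel=auto"
# RHEL 9 and later no longer accept crashkernel=auto; let kdumpctl apply the default reservation:
kdumpctl reset-crashkernel --kernel=ALL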
# Test kdump is functional (THIS WILL CRASH THE SYSTEM — test env only)
echo c > /proc/sysrq-trigger
Hardware Errors (MCE / ECC Memory / PCIe)
What it is
Machine Check Exceptions (MCE) are hardware-level errors reported by the CPU: uncorrectable ECC memory errors, CPU internal errors, PCIe bus errors. An uncorrectable MCE causes an immediate kernel panic. Correctable errors (single-bit ECC) are logged as warnings but do not cause reboots by themselves.
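The correctable versus uncorrectable distinction is visible in the EDAC counters under sysfs. The following sketch totals them across all memory controllers; the function name `edac_totals` and its optional base-directory argument are hypothetical, and the counters exist only when an EDAC driver is loaded.

```shell
# edac_totals [BASE]: sum correctable (ce_count) and uncorrectable (ue_count)
# errors over all memory controllers. BASE defaults to /sys/devices/system/edac/mc.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$base"/mc*; do
        [ -r "$mc/ce_count" ] && ce=$((ce + $(cat "$mc/ce_count")))
        [ -r "$mc/ue_count" ] && ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "correctable=$ce uncorrectable=$ue"
}
```

The counters reset at boot, so a non-zero uncorrectable total after a reboot points at errors since the last start, while a steadily growing correctable total suggests a DIMM worth watching.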
Log patterns
kernel: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU N: Machine Check: 0 Bank N: <error code>
kernel: EDAC MC0: N CE error(s) on DIMM <location>
kernel: NFIT: nfit_handle_mce: uncorrectable error
Investigation commands
# Check MCE log via mcelog (older, RHEL 6/7)
mcelog --client # if mcelog daemon is running
cat /var/log/mcelog
# RHEL 8/9+: use rasdaemon instead of mcelog
systemctl status rasdaemon
ras-mc-ctl --summary
ras-mc-ctl --errors
# Check kernel MCE messages
journalctl -b -1 -k --no-pager | grep -iE 'mce|machine check|edac|ecc|uncorrect'
dmesg | grep -iE 'mce|machine check|edac|ecc'
# Check IPMI SEL for memory/hardware errors
ipmitool sel elist | grep -iE 'mem|ecc|correctable|uncorrectable|dimm'
# Check EDAC (Error Detection And Correction) subsystem
ls /sys/devices/system/edac/mc/
cat /sys/devices/system/edac/mc/mc*/ce_count # correctable errors
cat /sys/devices/system/edac/mc/mc*/ue_count # uncorrectable errors (critical)
Hung Task / D-state Process (Uninterruptible Sleep)
What it is
A process stuck in D-state (uninterruptible sleep) is usually waiting on I/O that never completes — typically an NFS server that went away, a failed disk, or a storage path issue. If kernel.hung_task_panic=1 is set and the task stays stuck beyond kernel.hung_task_timeout_secs, the kernel panics and reboots.
Log patterns
kernel: INFO: task <taskname>:PID blocked for more than NNN seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:kworker/0:1H state:D stack:NNNNk ...
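To summarize which tasks were blocked and for how long, the "blocked for more than" lines can be reduced to three columns. This sketch assumes the message format shown above; the function name `hung_tasks` is hypothetical.

```shell
# hung_tasks: reads kernel log text on stdin and prints "TASK PID SECONDS"
# for every hung-task warning. Task names may themselves contain colons
# (for example kworker/0:1H), so the PID is taken from the last ":digits" group.
hung_tasks() {
    sed -nE 's/.*INFO: task (.+):([0-9]+) blocked for more than ([0-9]+) seconds.*/\1 \2 \3/p'
}

# Example:
#   journalctl -b -1 -k --no-pager | hung_tasks | sort | uniq -c
```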
Investigation commands
# Find D-state processes right now (if system is still responsive)
ps aux | awk 'NR == 1 || $8 ~ /^D/' # STAT may carry suffixes such as D+ or Ds, so match the prefix
# Check for hung task kernel messages in previous boot
journalctl -b -1 -k --no-pager | grep -iE 'blocked for more than|hung task|state:D'
grep 'blocked for more than' /var/log/messages
# Check hung_task settings
sysctl kernel.hung_task_timeout_secs # 0 = disabled
sysctl kernel.hung_task_panic # 1 = panic when hung task detected
# Check NFS mounts for stale/hanging connections
mount | grep nfs
nfsstat -m
cat /proc/mounts | grep nfs
# Check storage/multipath for path failures
multipath -ll
dmsetup status
UPS / Power Loss
What it is
If a UPS management daemon (e.g., apcupsd, NUT) detects a power failure and the battery charge falls too low, it issues a controlled shutdown. An uncontrolled power cut produces no shutdown log at all: the journal simply ends abruptly.
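One way to spot an abrupt stop is to count boots that were not preceded by a recorded shutdown in the audit log. The sketch below assumes that `ausearch -i` prints each record on a line beginning with `type=SYSTEM_BOOT` or `type=SYSTEM_SHUTDOWN`; the function name `count_unclean_boots` is hypothetical.

```shell
# count_unclean_boots: reads `ausearch -i -m system_boot,system_shutdown` output
# on stdin and prints how many SYSTEM_BOOT records directly follow another
# SYSTEM_BOOT with no SYSTEM_SHUTDOWN in between.
count_unclean_boots() {
    awk '
        /type=SYSTEM_SHUTDOWN/ { prev = "shutdown"; next }
        /type=SYSTEM_BOOT/     { if (prev == "boot") n++; prev = "boot" }
        END                    { print n + 0 }
    '
}

# Example (feed the full output rather than a tail, so no pair is cut in half):
#   ausearch -i -m system_boot,system_shutdown | count_unclean_boots
```

A non-zero count says only that the shutdown was not graceful; it does not distinguish a power cut from a panic or a hardware reset.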
How to identify
# If the journal simply ends with no shutdown sequence → power cut
journalctl -b -1 --no-pager | tail -30
# A normal shutdown ends with lines like "Reached target Power-Off"
# An abrupt power loss: the journal ends mid-stream with no shutdown messages
# Check auditd for two consecutive boots with no shutdown between them
ausearch -i -m system_boot,system_shutdown | tail -8
# Check apcupsd logs (if installed)
cat /var/log/apcupsd.events
# Check NUT (Network UPS Tools) logs
journalctl -b -1 --no-pager -u nut-monitor
journalctl -b -1 --no-pager -u upsmon
# Check IPMI SEL for power loss events
ipmitool sel elist | grep -iE 'power|ac lost|battery'
Summary
Reboot detected (these are the most common causes; there may be others)
│
├─ journalctl -b -1 shows clean shutdown messages?
│   ├─ YES → Cause 1 (Signal 15 / user initiated) or Cause 2 (ACPI/power button)
│   │        → Check /var/log/secure, ausearch, last
│   │
│   └─ NO → Journal ends abruptly
│       ├─ ausearch shows two SYSTEM_BOOT with no SYSTEM_SHUTDOWN?
│       │   ├─ YES → NOT a graceful shutdown. Check:
│       │   │        ├─ kdump vmcore in /var/crash/ → Cause 4 (kernel panic)
│       │   │        ├─ OOM messages in journalctl → Cause 3 (OOM)
│       │   │        ├─ MCE / EDAC messages → Cause 10 (hardware)
│       │   │        ├─ soft lockup messages → Cause 5 (watchdog)
│       │   │        ├─ hung task messages → Cause 11 (D-state)
│       │   │        ├─ IPMI SEL watchdog timer → Cause 7 (BMC watchdog)
│       │   │        └─ No logs at all → Cause 12 (power loss)
│       │   │
│       │   └─ Cluster node? Check pcs / corosync → Cause 6 (STONITH)
│       │
│       └─ sysrq trigger in logs? → Cause 8 (SysRq)
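The log-pattern branches of the tree can be approximated with a small classifier that scans the previous boot's log for the signatures quoted in each section. This is a sketch rather than a complete diagnosis: the function name `classify_reboot`, the labels, and the exact regular expressions are assumptions chosen for illustration, and more than one label can match (a SysRq crash also logs a kernel panic).

```shell
# classify_reboot: reads log text on stdin and prints one label per matching cause.
# Each table row below is "label=extended regex", matched case-insensitively.
classify_reboot() {
    reboot_log=$(cat)
    matched=0
    while IFS='=' read -r label pattern; do
        if printf '%s\n' "$reboot_log" | grep -qiE -e "$pattern"; then
            echo "$label"
            matched=1
        fi
    done <<'EOF'
graceful shutdown=exiting on signal 15|system is rebooting|system is powering down
OOM killer=out of memory|oom-kill|killed process
kernel panic=kernel panic - not syncing
CPU lockup=soft lockup|hard lockup
hung task=blocked for more than [0-9]+ seconds
hardware error=machine check|\[hardware error\]|edac mc
SysRq crash=sysrq ?: trigger a crash
cluster fencing=stonith|requesting (peer )?fencing
EOF
    [ "$matched" -eq 1 ] || echo "no known pattern: suspect power loss or a hardware reset"
}

# Example:
#   journalctl -b -1 --no-pager | classify_reboot
```

When nothing matches, the IPMI SEL and the auditd boot records are the next places to look, since neither depends on the OS having had time to write its logs.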
Document compiled from: Red Hat KCS (solutions/6038, solutions/31411, solutions/737033, articles/why-did-my-rhel-system-reboot), kernel.org documentation, ipmitool man pages, pacemaker documentation.
Reference: https://access.redhat.com/articles/why-did-my-rhel-system-reboot
Do not hesitate to connect with Rackspace support for any assistance.