Check Logs to Find Out Why a Linux System Rebooted or Shut Down


Quick Triage — Always Run First

Before diving into individual causes, establish a timeline.

# 1. When did the last boot happen?
who -b

# 2. List all reboots with timestamps
last reboot | head -20

# 3. Last shutdown/reboot event with adjacent runlevel change
last -x | grep -C1 'shutdown\|reboot' | head -30

# 4. Check previous boot logs (most important on systemd systems)
journalctl -b -1 --no-pager | tail -200

# 5. Check kernel ring buffer from previous boot
journalctl -b -1 -k --no-pager | tail -100

# 6. Did auditd see a clean shutdown or a surprise boot?
ausearch -i -m system_boot,system_shutdown | tail -6
# If two SYSTEM_BOOT lines appear in a row with no SYSTEM_SHUTDOWN between them,
# the system did NOT shut down gracefully — something crashed or lost power.
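
The "two SYSTEM_BOOT lines in a row" rule above can be checked mechanically. The following is a minimal sketch, not a standard tool: the function name count_unclean_boots is hypothetical, and it assumes only that each audit record contains the literal token SYSTEM_BOOT or SYSTEM_SHUTDOWN, as ausearch prints them.

```shell
# count_unclean_boots (hypothetical helper): reads ausearch output on stdin
# and prints how many SYSTEM_BOOT records directly follow another SYSTEM_BOOT
# with no SYSTEM_SHUTDOWN record in between.
count_unclean_boots() {
    awk '
        /SYSTEM_SHUTDOWN/ { prev = "shutdown"; next }
        /SYSTEM_BOOT/     { if (prev == "boot") unclean++; prev = "boot" }
        END               { print unclean + 0 }
    '
}
```

Usage: ausearch -i -m system_boot,system_shutdown | count_unclean_boots. A result above 0 means at least one boot was not preceded by a graceful shutdown.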

User / Admin Initiated Reboot (Signal 15 / SIGTERM)

What it is

Signal 15 is SIGTERM — the graceful termination signal sent to all processes during a normal shutdown, reboot, or init 6. The last line syslog emits before going down is:

exiting on signal 15

This is not a crash. It means a user or program directed the shutdown.

Log patterns

# /var/log/messages or journalctl:
shutdown[PID]: shutting down for system reboot
init: Switching to runlevel: 6          # SysV init systems (RHEL 6)
systemd-logind[PID]: System is rebooting.  # systemd systems (RHEL 7+)
syslogd: exiting on signal 15

Investigation commands

# Who was logged in just before the reboot?
last | head -30

# Check /var/log/secure (RHEL) or /var/log/auth.log (Debian/Ubuntu) for who ran sudo
grep -i 'shutdown\|reboot\|halt\|init 6\|systemctl' /var/log/secure | tail -30

# Check bash history of root and other admin users (not reliable but a starting point)
cat /root/.bash_history | grep -iE 'reboot|shutdown|halt|init 6|systemctl'

# Check audit log for who ran the shutdown command
ausearch -c shutdown --start yesterday --end now -i
ausearch -c reboot   --start yesterday --end now -i
ausearch -c systemctl --start yesterday --end now -i | grep -i reboot

# Check systemd journal for the initiating session
journalctl -b -1 --no-pager | grep -iE 'reboot|shutdown|signal 15|runlevel'

# Check if any automated/scheduled task triggered it (cron, at, systemd timers)
cat /var/spool/cron/root
systemctl list-timers --all | grep -i reboot
atq
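
Checking only root's crontab can miss a reboot scheduled by another user. As a hedged sketch, the helper below scans every file under a cron spool directory. The name find_scheduled_reboots and the keyword list are illustrative assumptions, and the default path /var/spool/cron is the RHEL location (Debian/Ubuntu use /var/spool/cron/crontabs).

```shell
# find_scheduled_reboots [DIR] (hypothetical helper): lists non-comment
# crontab lines under DIR that mention a reboot-type command.
# Output format is file:line-number:crontab-entry.
find_scheduled_reboots() {
    dir=${1:-/var/spool/cron}
    grep -rHnE 'reboot|shutdown|poweroff|halt|init 6' "$dir" 2>/dev/null |
        grep -vE '^[^:]*:[0-9]+:[[:space:]]*#'
}
```

Usage: find_scheduled_reboots (as root), or find_scheduled_reboots /etc/cron.d to scan system cron fragments as well.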

ACPI / Power Button / Thermal Shutdown

What it is

An ACPI (Advanced Configuration and Power Interface) event triggers a shutdown when:

  • The physical power button is pressed
  • The hypervisor/cloud platform sends a power-off signal
  • A thermal threshold is exceeded, and the firmware initiates shutdown

Log patterns

# journalctl / /var/log/messages:
kernel: ACPI: Power Button [PWRF/PWRB]
systemd-logind[PID]: Power key pressed.
systemd-logind[PID]: System is powering down.
kernel: thermal thermal_zone0: critical temperature reached (NNN C), shutting down

Investigation commands

# Look for ACPI power button events
journalctl -b -1 -k --no-pager | grep -i 'acpi\|power button\|thermal\|critical temp'

# Check current thermal zone readings (post-boot, for trend)
cat /sys/class/thermal/thermal_zone*/temp   # values are in millidegrees C (divide by 1000)

# Check IPMI System Event Log for thermal events (requires ipmitool)
ipmitool sel list | grep -i 'thermal\|temp\|power\|button'

# ACPI events via dmesg
dmesg | grep -i 'acpi\|thermal\|power button'

# Check systemd-logind config for power button action
grep -i 'HandlePowerKey\|HandleLidSwitch' /etc/systemd/logind.conf
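
The thermal zone files report millidegrees Celsius, which is easy to misread. The sketch below prints each zone in whole degrees. The function name print_thermal_zones is hypothetical, and the optional directory argument exists only so the helper can be pointed at a copy of the sysfs tree.

```shell
# print_thermal_zones [BASE] (hypothetical helper): prints one line per
# thermal zone with the temperature converted from millidegrees C to
# whole degrees C (integer division, so 45999 prints as 45).
print_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for f in "$base"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        milli=$(cat "$f" 2>/dev/null) || continue
        [ -n "$milli" ] || continue
        zone=$(basename "$(dirname "$f")")
        echo "$zone $((milli / 1000)) C"
    done
}
```

Usage: print_thermal_zones. These are post-boot readings, so they show the current trend, not the temperature at the time of the shutdown.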

OOM Killer (Out of Memory)

What it is

When the kernel cannot satisfy a memory allocation and cannot reclaim enough memory (RAM and any swap are exhausted, subject to the vm.overcommit_memory policy), the OOM killer selects and kills a process. An OOM kill by itself does not reboot the system. A reboot follows only if vm.panic_on_oom is non-zero (the kernel panics instead of killing a process, then reboots when kernel.panic is non-zero) or if the killed process is critical to the system.

Log patterns

kernel: Out of memory: Kill process PID (process_name) score NNN or sacrifice child
kernel: Killed process PID (process_name) total-vm:NNNkB, anon-rss:NNNkB
kernel: oom_kill_process: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null)...

Investigation commands

# Search for OOM events in previous boot
journalctl -b -1 -k --no-pager | grep -i 'out of memory\|oom\|killed process'

# Search all available journal history
journalctl -k --no-pager | grep -i 'out of memory\|oom killer'

# Search /var/log/messages directly (RHEL)
grep -i 'out of memory\|oom\|kill process' /var/log/messages

# Check current OOM panic setting
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom
# 0 = OOM killer kills a process (no panic); 1 = panic on system-wide OOM (OOM limited by cpuset/mempolicy still kills a process); 2 = always panic

# Check current overcommit policy
sysctl vm.overcommit_memory

# Review memory usage around reboot time (if SAR data available)
sar -r -f /var/log/sa/saDD   # DD = day of month of the reboot (saYYYYMMDD if sysstat keeps long history)
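
When a log contains many OOM events, a per-process tally shows which workload was killed most often. This is a rough sketch that relies on the "Killed process PID (process_name)" pattern listed above. The name summarize_oom_kills is hypothetical.

```shell
# summarize_oom_kills (hypothetical helper): reads kernel log lines on stdin
# and prints a count of OOM kills per process name, most frequent first.
summarize_oom_kills() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\2/p' |
        sort | uniq -c | sort -rn
}
```

Usage: journalctl -b -1 -k --no-pager | summarize_oom_kills, or summarize_oom_kills < /var/log/messages.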

Kernel Panic

What it is

A kernel panic occurs when the kernel encounters an unrecoverable error: a NULL pointer dereference, a BUG() assertion failure, a fatal hardware error, a corrupted stack, or a driver fault. If the kernel.panic sysctl is set to a non-zero value, the system reboots automatically after that many seconds.

Log patterns

kernel: Kernel panic - not syncing: <reason string>
kernel: BUG: unable to handle kernel NULL pointer dereference at 0000...
kernel: general protection fault: 0000 [#1] SMP
kernel: Oops: 0002 [#1] SMP PREEMPT

Investigation commands

# Check previous boot kernel messages for panic
journalctl -b -1 -k --no-pager | grep -iE 'panic|oops|bug:|general protection|call trace'

# Check auto-reboot on panic setting
sysctl kernel.panic
cat /proc/sys/kernel/panic
# 0 = hang on panic (no reboot); >0 = reboot after N seconds

# Check if panic-on-oops is enabled
sysctl kernel.panic_on_oops

# Check dmesg for oops/panic backtraces (current boot only: this catches a non-fatal oops, not the panic that caused the reboot)
dmesg | grep -A 20 -i 'kernel panic\|BUG:\|Oops:'

# Check if kdump captured a vmcore (see the Kernel Crash Dump (kdump) section for details)
ls -lh /var/crash/
less /var/crash/<timestamp>/vmcore-dmesg.txt   # e.g. /var/crash/127.0.0.1-2026-06-17-18:10:05/
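
Two small helpers can speed up this step: one pulls the reason string out of a "Kernel panic - not syncing" line, the other explains what a given kernel.panic value means. Both names (extract_panic_reason, describe_panic_timeout) are hypothetical. This is a sketch, and the negative-value case reflects the kernel sysctl documentation rather than anything stated above.

```shell
# extract_panic_reason (hypothetical helper): reads kernel log lines on stdin
# and prints only the reason text that follows "Kernel panic - not syncing:".
extract_panic_reason() {
    sed -n 's/.*Kernel panic - not syncing: //p'
}

# describe_panic_timeout [N] (hypothetical helper): explains a kernel.panic
# value; with no argument it reads the live setting from /proc.
describe_panic_timeout() {
    n=${1:-$(cat /proc/sys/kernel/panic)}
    if [ "$n" -eq 0 ]; then
        echo "hang on panic (no automatic reboot)"
    elif [ "$n" -gt 0 ]; then
        echo "reboot $n seconds after panic"
    else
        echo "reboot immediately on panic"
    fi
}
```

Usage: journalctl -b -1 -k --no-pager | extract_panic_reason, then describe_panic_timeout to see whether the panic would have led to an automatic reboot.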

CPU Soft Lockup / Hard Lockup (Watchdog)

What it is

The kernel watchdog detects two types of CPU lockups:

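  • Soft lockup: A task runs in kernel mode on a CPU for longer than 2 × kernel.watchdog_thresh seconds (20s with the default watchdog_thresh of 10) without yielding. The kernel prints a warning. If kernel.softlockup_panic=1, the kernel panics, and it reboots when kernel.panic is non-zero.
  • Hard lockup (NMI watchdog): A CPU stops servicing normal interrupts for longer than kernel.watchdog_thresh seconds, typically because it is spinning with interrupts disabled or the hardware has hung. The NMI watchdog detects this because non-maskable interrupts still arrive. If kernel.hardlockup_panic=1, the kernel panics, and it reboots when kernel.panic is non-zero.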

Log patterns

kernel: watchdog: BUG: soft lockup - CPU#N stuck for NNs! [task_name:PID]
kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu N

Investigation commands

# Search journal for soft/hard lockup events
journalctl -b -1 -k --no-pager | grep -iE 'soft lockup|hard lockup|nmi watchdog|hung task'

# Check current soft lockup panic setting
sysctl kernel.softlockup_panic
sysctl kernel.hardlockup_panic
sysctl kernel.watchdog_thresh    # default 10 seconds; the soft lockup threshold is 2x this value (20 seconds)

# Check hung task panic setting
sysctl kernel.hung_task_panic
sysctl kernel.hung_task_timeout_secs   # default 120 seconds

# Check if watchdog is enabled
sysctl kernel.watchdog
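
Soft lockup messages name the CPU, the stall duration, and the offending task, and it helps to see these side by side when many are logged. The sketch below parses the "soft lockup - CPU#N stuck for NNs! [task_name:PID]" pattern shown above. The name summarize_soft_lockups is hypothetical.

```shell
# summarize_soft_lockups (hypothetical helper): reads kernel log lines on
# stdin and prints "cpu=N secs=NN task=name:PID" for each soft lockup line.
summarize_soft_lockups() {
    sed -n 's/.*soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\([^]]*\)].*/cpu=\1 secs=\2 task=\3/p'
}
```

Usage: journalctl -b -1 -k --no-pager | summarize_soft_lockups. Repeated hits on the same task or CPU point at a specific driver or workload.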

Pacemaker / Cluster Fencing (STONITH)

What it is

High-availability clusters use STONITH (Shoot The Other Node In The Head) fencing to recover from split-brain scenarios. The cluster reboots (or power-cycles) a node that it considers unhealthy to protect shared resources. This is an intentional cluster-initiated reboot, not a crash.

Common fencing agents: fence_ipmilan, fence_idrac, fence_apc, fence_vmware_soap.

Log patterns

# /var/log/messages or journalctl:
pacemaker-fenced[PID]: notice: Requesting peer fencing (reboot) targeting <nodename>
pacemaker-controld[PID]: notice: Requesting fencing (reboot) of node <nodename>
fence_ipmilan: Succeeded in operation reboot for <node>
corosync[PID]: [TOTEM] A new membership was created (node left cluster)

Investigation commands

# Pacemaker cluster and fencing logs
journalctl -b -1 --no-pager -u pacemaker | grep -iE 'fenc|stonith|reboot|shot'
journalctl -b -1 --no-pager -u corosync  | tail -50

# Dedicated fencing log (if configured)
grep -iE 'fenc|stonith|reboot' /var/log/pacemaker/pacemaker.log | tail -50

# Check cluster history for fencing events (pcs)
pcs status
pcs stonith history

# Check Corosync ring/quorum at time of incident
grep -i 'quorum\|ring\|lost\|partition' /var/log/cluster/corosync.log | tail -30

# crm_report captures everything for a time window
crm_report -f "YYYY-MM-DD HH:MM:SS" -t "YYYY-MM-DD HH:MM:SS" /tmp/crm-report
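
To see at a glance which nodes were targeted for fencing, the two "Requesting ... fencing" patterns listed above can be tallied per node. This is a sketch under the assumption that the log wording matches those patterns exactly. The name list_fenced_nodes is hypothetical.

```shell
# list_fenced_nodes (hypothetical helper): reads pacemaker log lines on stdin
# and prints how many fencing requests targeted each node.
list_fenced_nodes() {
    sed -n \
        -e 's/.*Requesting peer fencing ([a-z]*) targeting \([^ ]*\).*/\1/p' \
        -e 's/.*Requesting fencing ([a-z]*) of node \([^ ]*\).*/\1/p' |
        sort | uniq -c
}
```

Usage: journalctl -b -1 --no-pager -u pacemaker | list_fenced_nodes. Run it on a surviving peer as well, because the fenced node often has no record of its own fencing.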

IPMI / BMC Hardware Watchdog

What it is

The BMC (Baseboard Management Controller) has a hardware watchdog timer independent of the OS. If the OS fails to periodically reset (kick) the watchdog before it counts down to zero, the BMC performs a hardware reset (power cycle or reboot). This can happen if:

  • The ipmi_watchdog kernel module or watchdog daemon stops running
  • The system hangs at a level below where software can reset the timer

Log patterns

# In IPMI System Event Log (SEL):
#   "OS Watchdog Timer | OS Watchdog Timer Expired"
# In /var/log/messages before the reset (if the OS was still logging):
kernel: IPMI Watchdog: Starting countdown in kernel.
watchdog[PID]: keepalive failed

Investigation commands

# Check IPMI SEL for watchdog timer expiry events
ipmitool sel list | grep -i 'watchdog\|timer\|reset\|power'

# Get full IPMI event log with decoded descriptions
ipmitool sel elist

# Check if ipmi_watchdog module is loaded and its settings
lsmod | grep ipmi_watchdog
cat /sys/module/ipmi_watchdog/parameters/action       # reset, power_cycle, power_off, none
cat /sys/module/ipmi_watchdog/parameters/timeout      # countdown in seconds
cat /sys/module/ipmi_watchdog/parameters/pretimeout   # pre-NMI seconds

# Check watchdog daemon status (if using watchdog package)
systemctl status watchdog

# Review BMC System Event Log entries around reboot time
ipmitool sel time get   # confirm BMC clock vs system clock
ipmitool sdr type 'Watchdog2'   # run "ipmitool sdr type list" to see the sensor type names your ipmitool accepts
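
The SEL can hold months of entries, so narrowing it to the day of the reboot helps. The sketch below filters on the date column. It assumes the pipe-separated "id | MM/DD/YYYY | HH:MM:SS | sensor | event | state" layout that ipmitool sel elist commonly prints, which may differ on some BMCs. The name sel_on_date is hypothetical.

```shell
# sel_on_date DATE (hypothetical helper): reads "ipmitool sel elist" output
# on stdin and prints only entries whose second column equals DATE.
# Assumes pipe-separated columns with the date in MM/DD/YYYY form.
sel_on_date() {
    awk -F'|' -v d="$1" '{
        f = $2
        gsub(/^ +| +$/, "", f)   # trim padding around the date column
        if (f == d) print
    }'
}
```

Usage: ipmitool sel elist | sel_on_date 06/17/2026. Compare against ipmitool sel time get first, since the BMC clock may not match the OS clock.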

SysRq Triggered Crash / Manual Panic

What it is

The Magic SysRq key mechanism allows a privileged user or script to force an immediate kernel crash (useful for testing kdump). Writing c to /proc/sysrq-trigger calls panic() directly. This is also the mechanism used by some monitoring tools to force a vmcore capture on a hung system.

Log patterns

kernel: SysRq : Trigger a crash
kernel: Kernel panic - not syncing: sysrq triggered crash

Investigation commands

# Check if SysRq crash was triggered
journalctl -b -1 -k --no-pager | grep -i 'sysrq\|trigger a crash'
grep -i sysrq /var/log/messages

# Check current SysRq enabled bitmap
sysctl kernel.sysrq
cat /proc/sys/kernel/sysrq
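# 0 = disabled, 1 = all functions, any other value = bitmask of allowed functions
# e.g. 16 = sync only (the usual RHEL default), 176 = sync + remount read-only + reboot/poweroff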

# Who could have done it? Check audit log
ausearch -f /proc/sysrq-trigger --start yesterday --end now -i

# Check /etc/sysctl.conf and /etc/sysctl.d/ for intentional configuration
grep -r sysrq /etc/sysctl.conf /etc/sysctl.d/
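
Values of kernel.sysrq other than 0 and 1 are a bitmask, which is hard to read by eye. The sketch below decodes it using the bit meanings from the kernel's sysrq documentation. The function name decode_sysrq and the short labels are illustrative, not official names.

```shell
# decode_sysrq MASK (hypothetical helper): prints the SysRq function groups
# enabled by a kernel.sysrq value, one per line.
decode_sysrq() {
    mask=$1
    case $mask in
        0) echo "disabled"; return 0 ;;
        1) echo "all functions enabled"; return 0 ;;
    esac
    for pair in "2:console-loglevel" "4:keyboard-control" "8:process-dumps" \
                "16:sync" "32:remount-ro" "64:signal-processes" \
                "128:reboot-poweroff" "256:nice-rt-tasks"; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $((mask & bit)) -ne 0 ]; then
            echo "$name"
        fi
    done
    return 0
}
```

Usage: decode_sysrq "$(cat /proc/sys/kernel/sysrq)". For example, 176 is 16 + 32 + 128, so it prints sync, remount-ro, and reboot-poweroff.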

Kernel Crash Dump (kdump)

What it is

kdump is the kernel crash dump mechanism included in RHEL 6 through RHEL 10. When any of the above panic-inducing conditions fires, if kdump is properly configured, the running kernel hands off to a small capture kernel (kexec) which saves a memory image (vmcore) to disk before rebooting. kdump is the single best tool for post-incident RCA of unexpected reboots.

If kdump was not configured before the reboot, it may not be possible to determine the root cause. — Red Hat KCS

Verify kdump is installed and running

# Is kexec-tools installed?
rpm -q kexec-tools

# Is kdump service enabled and active?
systemctl status kdump

# How much crash memory is reserved?
cat /proc/cmdline | grep -o 'crashkernel=[^ ]*'
# or
grep crashkernel /etc/default/grub

# Where will vmcore be saved?
grep -v '^#' /etc/kdump.conf | grep -v '^$'
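
The crashkernel check above can be turned into a pass/fail test for use in a health-check script. This is a minimal sketch: the name crashkernel_setting is hypothetical, and the optional file argument exists only so the helper can be run against a saved copy of the kernel command line.

```shell
# crashkernel_setting [FILE] (hypothetical helper): reports the crashkernel=
# value from a kernel command line file (default /proc/cmdline).
# Returns 1 when no crashkernel= argument is present.
crashkernel_setting() {
    cmdline_file=${1:-/proc/cmdline}
    value=$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p' | tail -n 1)
    if [ -n "$value" ]; then
        echo "crashkernel reserved: $value"
    else
        echo "no crashkernel= argument; kdump cannot capture a vmcore"
        return 1
    fi
}
```

Usage: crashkernel_setting || echo "fix kdump before the next incident". A present argument does not prove the reservation succeeded, so systemctl status kdump is still needed.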

After a crash — analyze the vmcore

# List crash dumps captured
ls -lh /var/crash/

# Identify which vmcore belongs to which crash time
ls -lh /var/crash/*/vmcore

# Install crash and kernel debug symbols (RHEL)
dnf install crash
dnf install kernel-debuginfo-$(uname -r)

# Open the vmcore with crash utility
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore

# Inside crash utility — most useful commands:
# bt         — backtrace of the crashing CPU at time of panic
# log        — kernel message buffer (dmesg at crash time)
# ps         — process list at crash time
# vm         — virtual memory info
# sys        — system info (uptime, kernel version, panic string)
# q          — quit

Enable kdump if not already active

# Enable and start kdump
systemctl enable --now kdump

# Verify crash kernel is reserved
grep -i crashkernel /proc/cmdline

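# If not set, add crashkernel= to the kernel command line and reboot.
# grubby updates both BIOS and UEFI boot entries, so one command covers both.
# RHEL 8:
grubby --update-kernel=ALL --args="crashkernel=auto"
# RHEL 9 and later (crashkernel=auto is no longer supported there):
kdumpctl reset-crashkernel --kernel=ALL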

# Test kdump is functional (THIS WILL CRASH THE SYSTEM — test env only)
echo c > /proc/sysrq-trigger

Hardware Errors (MCE / ECC Memory / PCIe)

What it is

Machine Check Exceptions (MCE) are hardware-level errors reported by the CPU: uncorrectable ECC memory errors, CPU internal errors, PCIe bus errors. An uncorrectable MCE causes an immediate kernel panic. Correctable errors (single-bit ECC) are logged as warnings but do not cause reboots by themselves.

Log patterns

kernel: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU N: Machine Check: 0 Bank N: <error code>
kernel: EDAC MC0: N CE error(s) on DIMM <location>
kernel: NFIT: nfit_handle_mce: uncorrectable error

Investigation commands

# Check MCE log via mcelog (older, RHEL 6/7)
mcelog --client   # if mcelog daemon is running
cat /var/log/mcelog

# RHEL 8/9+: use rasdaemon instead of mcelog
systemctl status rasdaemon
ras-mc-ctl --summary
ras-mc-ctl --errors

# Check kernel MCE messages
journalctl -b -1 -k --no-pager | grep -iE 'mce|machine check|edac|ecc|uncorrect'
dmesg | grep -iE 'mce|machine check|edac|ecc'

# Check IPMI SEL for memory/hardware errors
ipmitool sel elist | grep -iE 'mem|ecc|correctable|uncorrectable|dimm'

# Check EDAC (Error Detection And Correction) subsystem
ls /sys/devices/system/edac/mc/
cat /sys/devices/system/edac/mc/mc*/ce_count    # correctable errors
cat /sys/devices/system/edac/mc/mc*/ue_count    # uncorrectable errors (critical)
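
On multi-socket systems there are several memory controllers, and a single total is easier to track over time. The sketch below sums the EDAC counters across all controllers. The name edac_totals is hypothetical, and the optional directory argument is there only for running against a copied sysfs tree.

```shell
# edac_totals [BASE] (hypothetical helper): sums correctable (ce) and
# uncorrectable (ue) error counts across all EDAC memory controllers.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$base"/mc*; do
        [ -d "$mc" ] || continue
        if [ -r "$mc/ce_count" ]; then
            ce=$((ce + $(cat "$mc/ce_count")))
        fi
        if [ -r "$mc/ue_count" ]; then
            ue=$((ue + $(cat "$mc/ue_count")))
        fi
    done
    echo "ce=$ce ue=$ue"
}
```

Usage: edac_totals. Any non-zero ue value deserves immediate attention. These counters reset at boot, so a clean result after a reboot does not rule out a memory fault before it.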

Hung Task / D-state Process (Uninterruptible Sleep)

What it is

A process stuck in D-state (uninterruptible sleep) is usually waiting on I/O that never completes — typically an NFS server that went away, a failed disk, or a storage path issue. If kernel.hung_task_panic=1 is set and the task stays stuck beyond kernel.hung_task_timeout_secs, the kernel panics and reboots.

Log patterns

kernel: INFO: task <taskname>:PID blocked for more than NNN seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:kworker/0:1H  state:D stack:NNNNk  ...

Investigation commands

# Find D-state processes right now (if system is still responsive)
ps aux | awk '$8 ~ /^D/'   # STAT can carry suffixes (D+, Ds, D<), so match the prefix

# Check for hung task kernel messages in previous boot
journalctl -b -1 -k --no-pager | grep -iE 'blocked for more than|hung task|state:D'
grep 'blocked for more than' /var/log/messages

# Check hung_task settings
sysctl kernel.hung_task_timeout_secs    # 0 = disabled
sysctl kernel.hung_task_panic           # 1 = panic when hung task detected

# Check NFS mounts for stale/hanging connections
mount | grep nfs
nfsstat -m
cat /proc/mounts | grep nfs

# Check storage/multipath for path failures
multipath -ll
dmsetup status
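
The "blocked for more than NNN seconds" lines identify which tasks were stuck and for how long. The sketch below extracts those fields from the log pattern shown above. The name list_hung_tasks is hypothetical.

```shell
# list_hung_tasks (hypothetical helper): reads kernel log lines on stdin and
# prints "task=NAME pid=PID secs=NNN" for each hung task report.
# Task names may contain colons (e.g. kworker/0:1H); the PID is taken as the
# digits after the last colon before " blocked".
list_hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/task=\1 pid=\2 secs=\3/p'
}
```

Usage: journalctl -b -1 -k --no-pager | list_hung_tasks. Tasks such as nfsd, jbd2, or multipath-related kworkers hint at whether NFS or local storage stalled.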

UPS / Power Loss

What it is

If a UPS management daemon (e.g., apcupsd, NUT) detects a power failure and battery level too low, it issues a controlled shutdown. An uncontrolled power cut will produce no shutdown log at all — the journal simply ends abruptly.

How to identify

# If the journal simply ends with no shutdown sequence → power cut
journalctl -b -1 --no-pager | tail -30
# A normal shutdown ends with lines like "Reached target Power-Off"
# An abrupt power loss: the journal ends mid-stream with no shutdown messages

# Check auditd for two consecutive boots with no shutdown between them
ausearch -i -m system_boot,system_shutdown | tail -8

# Check apcupsd logs (if installed)
cat /var/log/apcupsd.events

# Check NUT (Network UPS Tools) logs
journalctl -b -1 --no-pager -u nut-monitor
journalctl -b -1 --no-pager -u upsmon

# Check IPMI SEL for power loss events
ipmitool sel elist | grep -iE 'power|ac lost|battery'
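
The clean-versus-abrupt check can be scripted by looking for shutdown markers near the end of the previous boot's journal. The sketch below uses only the markers quoted in this article, so treat the list as a starting point to extend for your distribution. The name journal_ended_cleanly is hypothetical.

```shell
# journal_ended_cleanly (hypothetical helper): reads a journal on stdin and
# returns 0 if any known shutdown marker appears in the last 200 lines,
# 1 otherwise (which suggests a crash or power loss).
journal_ended_cleanly() {
    tail -n 200 | grep -qE 'Reached target Power-Off|System is rebooting|System is powering down|exiting on signal 15|shutting down for system reboot'
}
```

Usage: journalctl -b -1 --no-pager | journal_ended_cleanly && echo "clean shutdown" || echo "abrupt end: check kdump, SEL, and power".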

Summary

Reboot detected (the branches below cover the common causes described above; others are possible)
│
├─ journalctl -b -1 shows clean shutdown messages?
│   ├─ YES → User / Admin Initiated Reboot (Signal 15) or ACPI / Power Button
│   │         → Check /var/log/secure, ausearch, last
│   │
│   └─ NO → Journal ends abruptly
│           ├─ ausearch shows two SYSTEM_BOOT with no SYSTEM_SHUTDOWN?
│           │   ├─ YES → NOT a graceful shutdown. Check:
│           │   │         ├─ kdump vmcore in /var/crash/  → Kernel Panic
│           │   │         ├─ OOM messages in journalctl   → OOM Killer
│           │   │         ├─ MCE / EDAC messages          → Hardware Errors
│           │   │         ├─ soft lockup messages         → CPU Soft / Hard Lockup
│           │   │         ├─ hung task messages           → Hung Task / D-state
│           │   │         ├─ IPMI SEL watchdog timer      → IPMI / BMC Hardware Watchdog
│           │   │         └─ No logs at all               → UPS / Power Loss
│           │   │
│           │   └─ Cluster node? Check pcs / corosync     → Pacemaker / Cluster Fencing (STONITH)
│           │
│           └─ sysrq trigger in logs?                     → SysRq Triggered Crash

Document compiled from: Red Hat KCS (solutions/6038, solutions/31411, solutions/737033, articles/why-did-my-rhel-system-reboot), kernel.org documentation, ipmitool man pages, pacemaker documentation.

Reference: https://access.redhat.com/articles/why-did-my-rhel-system-reboot


Do not hesitate to connect with Rackspace support for any assistance.