Debugging system hangs sucks because, by definition, we have no live access to the state of the system.
There are some things you can do (in advance, sadly) that can help to narrow this down.
Add these lines to
and reboot. The latter can be toggled while running with mdb, but the former will not take effect without a reboot.
To be safe, in case your hardware is newer and you are using the
(Credit: Mentioned by Robert Mustacchi on mailing lists)
The first setting enables the "deadman timer", which will panic the system if (by default) 50 seconds elapse without a kernel clock tick.
The second one causes the system to crash dump (without entering the debugger) if an NMI (Non-Maskable Interrupt) is received, e.g. via a hardware button or BIOS watchdog timer on a supporting motherboard, from remote consoles, from the hypervisor for VMs, and so on.
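As a rough sketch of the two settings just described, the snippet below prints candidate `/etc/system` lines for review. The tunable name `snooping` and the `pcplusmp:` module prefix are assumptions not confirmed by this page (only `apic_panic_on_nmi` is named here), and `print_hang_settings` is a hypothetical helper, so check both against your release before using the output.

```shell
#!/bin/sh
# Hypothetical helper: print the two /etc/system lines for review.
# Assumptions: 'snooping' is the tunable that enables the deadman timer, and
# apic_panic_on_nmi lives in the pcplusmp module (systems using the apix
# module may need the 'apix:' prefix instead).
print_hang_settings() {
    cat <<'EOF'
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
EOF
}

print_hang_settings
```

If the lines look right for your system, append them to `/etc/system` as root and reboot. The NMI setting can probably also be flipped on a live system with something like `echo 'apic_panic_on_nmi/W 1' | mdb -kw`, whereas the deadman setting needs the reboot.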
On SPARC systems, apic_panic_on_nmi of course does not exist; in that case the easiest thing to do is to break to the PROM console and force a panic with
More on the deadman timer in Solaris
When enabled, the deadman timer will cause a level 15 interrupt to fire on each CPU every second.
If the deadman timer detects that clock() hasn't run on that CPU for a period of time, it will induce a panic, which will cause a core file to be written to /var/crash (or the location you configured with
If you would like the deadman to wait more (or less) than the default timeout prior to inducing a panic, you can set the snoop_interval variable to the desired number of seconds * 1000000, since the value is expressed in microseconds (the following example line in the /etc/system file will induce a panic if the clock hasn't ticked after 90 seconds):
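The arithmetic is easy to get wrong by a factor of ten, so here is a minimal sketch that derives the line from a timeout in seconds. It assumes snoop_interval is expressed in microseconds (so the 50-second default corresponds to 50000000); `snoop_interval_line` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn a timeout in seconds into an /etc/system line.
# Assumption: snoop_interval is in microseconds (1 second = 1000000).
snoop_interval_line() {
    seconds=$1
    echo "set snoop_interval=$((seconds * 1000000))"
}

snoop_interval_line 90   # prints: set snoop_interval=90000000
```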
This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)
If your system is hung, and you ideally have the above two settings in place, consider waiting to see if the deadman timer will trigger. If it does not, remote management on your system may allow you to inject an NMI and force a dump that way.
Sending NMI from IPMI remote management console
Sourced from Ben Rockwood's blog:
...The best way to poke NMI once you're ready for it is via IPMI (tested on Sun and Dell):
If you have a moderately recent version of ILOM, you can poke the /SP/diag/generate_host_nmi value like so:
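A minimal sketch of both approaches, not taken from the blog: the helper below only prints the `ipmitool` command so it can be reviewed before running it against a real BMC. It assumes `ipmitool` with the `lanplus` interface and its `chassis power diag` subcommand (which pulses a diagnostic interrupt); the host name, user name and `ipmi_nmi_command` itself are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (rather than run) the ipmitool command that asks
# the BMC to pulse a diagnostic interrupt (NMI) to the host.
ipmi_nmi_command() {
    bmc_host=$1
    bmc_user=$2
    echo "ipmitool -I lanplus -H $bmc_host -U $bmc_user chassis power diag"
}

ipmi_nmi_command bmc.example.com root

# On an ILOM command line (not a Unix shell), the property named above would
# presumably be set with something like:
#   set /SP/diag generate_host_nmi=true
```

Run the printed command only once the NMI setting is in place, since on an unprepared system the NMI may simply be ignored.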
Sending NMI to a VirtualBox VM
Sourced from Darren Moffat's blog:
- http://blogs.oracle.com/darren/entry/sending_a_break_to_opensolaris – "Sending a Break to OpenSolaris hosted in Virtualbox"
I've recently started using VirtualBox instead of physical machines for some of my basic functional testing. When doing some types of kernel development it is often necessary to force the system into
The F1-A keystroke does this on OpenSolaris x86 systems by default, however that isn't going to work with VirtualBox because that keystroke will be grabbed by some very low level kernel routines in the (OpenSolaris-based) host and never reaches the guest.
So we need an alternate way of getting a break to the guest OpenSolaris from the host one.
I was sure someone else must have worked this out before. I didn't get the full answer from a quick google search but I did find all the parts.
The CLI for VirtualBox can send an NMI (Non-Maskable Interrupt) to any running guest. OpenSolaris can be configured to drop into kmdb or force a panic when receiving an NMI.
In the guest put this into /etc/system and reboot:
Or to set it interactively do:
Then with the VirtualBox CLI we can send an NMI to our guest:
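The following is a hedged sketch of these three steps, not a copy of the blog's commands. The guest-side tunable `apic_kmdb_on_nmi`, its `pcplusmp:` prefix, and the `debugvm ... injectnmi` subcommand of current VirtualBox releases are assumptions to verify on your versions (older releases may have exposed the NMI injection under a different subcommand); `vbox_nmi_command` and the VM name are hypothetical.

```shell
#!/bin/sh
# Guest side (assumed; shown as comments because they need a live guest):
#   /etc/system:    set pcplusmp:apic_kmdb_on_nmi=1
#   interactively:  echo 'apic_kmdb_on_nmi/W 1' | mdb -kw
#
# Host side: hypothetical helper that prints the VBoxManage command used to
# inject an NMI into a running guest, so it can be reviewed before running.
vbox_nmi_command() {
    vm_name=$1
    echo "VBoxManage debugvm \"$vm_name\" injectnmi"
}

vbox_nmi_command "opensolaris-test"
```

To panic and dump instead of entering the debugger, the apic_panic_on_nmi setting described earlier would take the place of the kmdb one.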
Nice easy solution.
Though I do now wonder why we don't have some default action for when an NMI is received - but then, not everyone cares about getting a dump or getting into
Keyboard breakout into the kernel debugger
If neither is applicable, boot your system with the -k flag on the kernel command line, and while hung press F1-A (that is, press a while holding the F1 key, as if it were a shift); may be STOP-A on Sun keyboards with Sun boxes. This will in theory enter the kernel debugger (kmdb) on the console (if X is running, you will not be able to see the console, and should then type $<systemdump and press return – kmdb may be listening, even though you can't see it.)
Unfortunately, the keyboard break to the debugger rarely works.
For SPARC systems an equivalent option may be to break out to PROM with a STOP-A keypress (or get access to PROM remotely via Serial LOM or RSC/ALOM/ILOM) and send a break command to suspend the OS and enter the kernel debugger, if loaded. You can resume the OS (if not hung) by the
- http://prefetch.net/blog/index.php/2007/02/11/recovering-from-solaris-hangs-with-the-deadman-timer/ – "Recovering from Solaris hangs with the deadman timer" on Blog O' Matty, general description of deadman timer (adaptation of the text is copypasted above)
- http://www.cuddletech.com/blog/pivot/entry.php?id=1044 – "Crashing Solaris for Fun and Profit", a very detailed post, also includes info on NMI and triggering it via IPMI consoles
- http://realsysadmin.com/www/?p=24 – "A healthy dose of mdb", includes examples on system tracing after the failure to uncover the hang causes
- Serial Console in VirtualBox – setup of serial console access to a VirtualBox VM in our Wiki; useful if your test system is running inside a VM
- http://developers.sun.com/solaris/articles/manage_core_dump.html – core dump management and analysis