
Debugging system hangs sucks because, by definition, we have no live access to the state of the system.

There are some things you can do (in advance, sadly) that can help to narrow this down.

/etc/system settings

Add two settings to the /etc/system file, snooping (the deadman timer) and apic_panic_on_nmi (in the pcplusmp module), and reboot. The latter can be toggled while running, with mdb, but the former will not take effect without a reboot.
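As a sketch, the two /etc/system lines described above would normally look like the following. The variable names are the usual illumos ones and are an assumption here, so check them against your release before rebooting.

```
* Enable the deadman timer (takes effect only after a reboot)
set snooping=1
* Panic and dump on NMI instead of ignoring it (pcplusmp PSM module)
set pcplusmp:apic_panic_on_nmi=1
```

Lines starting with an asterisk are comments in /etc/system.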

To be safe, in case your hardware is newer and you are using the apix PSM driver, you should also set apic_panic_on_nmi in the apix module.
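A minimal sketch of the extra line for the apix case, assuming the same variable name is exported by the apix module:

```
* Same NMI behaviour when the apix PSM module is in use
set apix:apic_panic_on_nmi=1
```

Setting both the pcplusmp and apix variants should be harmless, since only the module that is actually loaded matters.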

(Credit: Mentioned by Robert Mustacchi on mailing lists)

The first setting, snooping, enables the "deadman timer", which will panic the system if (by default) 50 seconds elapse without a kernel clock tick.

The second one, apic_panic_on_nmi, causes the system to crash dump (without entering the debugger) if an NMI (Non-Maskable Interrupt) is received, for example via a hardware button or a BIOS watchdog timer on a supporting motherboard, from a remote console, or from the hypervisor for VMs.

On SPARC systems, apic_panic_on_nmi of course does not exist. In that case the easiest thing to do is to break to the PROM console and force a panic with sync.

More on the deadman timer in Solaris

When enabled, the deadman timer will cause a level 15 interrupt to fire on each CPU every second.

If the deadman timer detects that clock() hasn't run on that CPU for a period of time, it will induce a panic, which will cause a core file to be written to /var/crash (or the location you configured with dumpadm).

If you would like the deadman to wait more (or less) than the default timeout prior to inducing a panic, you can set the snoop_interval variable in /etc/system to the desired timeout in microseconds (the number of seconds * 1000000). For example, a value of 90000000 will induce a panic if the clock hasn't ticked after 90 seconds.
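The arithmetic for a custom timeout can be sketched in shell. This assumes snoop_interval is counted in microseconds (so the 50-second default corresponds to 50000000); verify that against your kernel source before relying on it. TIMEOUT_SECONDS and SNOOP_INTERVAL are illustrative names, and the script only prints the line rather than editing /etc/system.

```shell
# Desired deadman timeout in seconds (90 matches the example in the text).
TIMEOUT_SECONDS=90

# Assumption: snoop_interval is expressed in microseconds.
SNOOP_INTERVAL=$((TIMEOUT_SECONDS * 1000000))

# Print the line to add to /etc/system by hand; nothing is written here.
echo "set snoop_interval=$SNOOP_INTERVAL"
# prints: set snoop_interval=90000000
```

As with snooping itself, a change made in /etc/system needs a reboot to take effect.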

This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)

While hung

If your system is hung and you (ideally) have the above two settings in place, consider waiting to see if the deadman timer will trigger. If it does not, remote management on your system may allow you to inject an NMI and force a dump that way.

Sending NMI from IPMI remote management console

Sourced from Ben Rockwood's blog:

...The best way to poke NMI once you're ready for it is via IPMI (tested on Sun and Dell):
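A sketch of what such an IPMI invocation typically looks like, using ipmitool's "chassis power diag" subcommand, which asks the BMC to pulse a diagnostic interrupt (NMI) to the host. The helper name ipmi_nmi_command, the BMC address and the user are placeholders, not values from the quoted post. The helper only prints the command, since actually sending the NMI will panic a host configured as above.

```shell
# Hypothetical helper: build the ipmitool command that requests an NMI.
# $1 = BMC address, $2 = BMC user (both placeholders).
# It prints the command instead of running it, so nothing is sent until
# you run the printed line yourself.
ipmi_nmi_command() {
    printf 'ipmitool -I lanplus -H %s -U %s chassis power diag\n' "$1" "$2"
}

ipmi_nmi_command bmc.example.com root
```

Without a -P option ipmitool prompts for the password, which keeps it out of the shell history.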

If you have a moderately recent version of ILOM, you can poke the /SP/diag/generate_host_nmi value by setting it to true from the ILOM command line.

Sending NMI to a VirtualBox VM

Sourced from Darren Moffat's blog:

I've recently started using VirtualBox instead of physical machines for some of my basic functional testing. When doing some types of kernel development it is often necessary to force the system into kmdb.

The F1-A keystroke does this on OpenSolaris x86 systems by default, however that isn't going to work with VirtualBox because that keystroke will be grabbed by some very low level kernel routines in the (OpenSolaris-based) host and never reaches the guest.
So we need an alternate way of getting a break to the guest OpenSolaris from the host one.
I was sure someone else must have worked this out before. I didn't get the full answer from a quick google search but I did find all the parts.
The CLI for VirtualBox can send an NMI (Non-Maskable Interrupt) to any running guest. OpenSolaris can be configured to drop into kmdb or force a panic when receiving an NMI.

In the guest put this into /etc/system and reboot:

/etc/system addon
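A plausible form of that /etc/system line is sketched here, assuming apic_kmdb_on_nmi in the pcplusmp module is the variable intended for dropping into kmdb. Use apic_panic_on_nmi instead if you want a panic and dump rather than the debugger.

```
* Enter kmdb when an NMI arrives (kmdb must be loaded, e.g. boot with -k)
set pcplusmp:apic_kmdb_on_nmi=1
```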

Or to set it interactively do:
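One way to do this on a live system is to pipe a write expression into mdb -kw. The sketch below builds that expression; mdb_write_expr is a hypothetical helper name, and apic_kmdb_on_nmi is assumed to be the variable meant here. The helper only prints the expression, so nothing is modified until its output is piped to mdb as root.

```shell
# Hypothetical helper: print an mdb expression that writes a 32-bit value
# ($2) into a kernel variable ($1) using mdb's /W format.
mdb_write_expr() {
    printf '%s/W %s\n' "$1" "$2"
}

# Show the expression for enabling kmdb-on-NMI.
mdb_write_expr apic_kmdb_on_nmi 1
# prints: apic_kmdb_on_nmi/W 1

# To apply it to the running kernel (as root):
#   mdb_write_expr apic_kmdb_on_nmi 1 | mdb -kw
```

A change made this way lasts only until the next reboot, which is why the /etc/system entry is still worth adding.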

Then with the VirtualBox CLI we can send an NMI to our guest:
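A sketch using the VBoxManage debugvm subcommand, which has an injectnmi action. The helper name vbox_nmi_command and the guest name are placeholders. It prints the command rather than running it, because injecting the NMI stops a guest configured as described above.

```shell
# Hypothetical helper: build the VBoxManage command that injects an NMI
# into a running guest. $1 = VM name or UUID (placeholder).
vbox_nmi_command() {
    printf 'VBoxManage debugvm "%s" injectnmi\n' "$1"
}

vbox_nmi_command my-guest
# prints: VBoxManage debugvm "my-guest" injectnmi
```

Run the printed command on the host as the user that owns the VM.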

Nice easy solution.

Though I do now wonder why we don't have some default action for when an NMI is received - but then, not everyone cares about getting a dump or getting into kmdb!

Keyboard breakout into the kernel debugger

If neither is applicable, boot your system with the -k flag on the kernel command line, and while hung press F1-A (that is, press A while holding the F1 key, as if it were a shift key); on Sun boxes with Sun keyboards this may be STOP-A. In theory this enters the kernel debugger (kmdb) on the console. If X is running, you will not be able to see the console, so type $<systemdump and press return anyway: kmdb may be listening, even though you can't see it.

Unfortunately, the keyboard break to the debugger rarely works.

For SPARC systems, an equivalent option may be to break out to the PROM with a STOP-A keypress (or get access to the PROM remotely via Serial LOM or RSC/ALOM/ILOM) and send a break command to suspend the OS and enter the kernel debugger, if loaded. You can resume the OS (if not hung) with the go command.

More info

See blogs:
