...
There are some things you can do (in advance, sadly) that can help narrow this down:
/etc/system settings
Add these lines to /etc/system file:
...
On SPARC systems apic_panic_on_nmi does not exist; there, the easiest thing to do is to break to the PROM console and force a panic with sync.
More on the deadman timer in Solaris
When enabled, the deadman timer causes a level 15 interrupt to fire on each CPU every second; its handler checks whether the kernel lbolt variable, which clock() advances on every tick, has changed since the previous check.
If the deadman timer detects that lbolt hasn't changed (that is, clock() hasn't run) for a period of time, it will induce a panic, which will cause a crash dump to be written to /var/crash (or the location you configured with dumpadm).
If you would like the deadman to wait more (or less) than the default timeout before inducing a panic, set the snoop_interval variable to the desired timeout in microseconds, that is, the number of seconds * 1000000. The following example line in the /etc/system file will induce a panic if the clock hasn't ticked (lbolt hasn't been updated) for 90 seconds:
| Code Block |
|---|
set snoop_interval=90000000 |
This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)
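Before rebooting, it can be worth confirming that the line really made it into /etc/system. The helper below is only a minimal sketch: get_tunable is a hypothetical name, not a Solaris utility, and it assumes the usual /etc/system syntax of set [module:]name=value with comment lines starting with an asterisk.

```shell
#!/bin/sh
# get_tunable FILE NAME
# Hypothetical helper (not part of Solaris): print the value from the last
# "set [module:]NAME=value" line found in an /etc/system style file.
# Note: on Solaris use nawk or /usr/xpg4/bin/awk, the legacy awk lacks -v.
get_tunable() {
    awk -v name="$2" '
        /^[ \t]*\*/ { next }              # skip "*" comment lines
        $1 == "set" {
            line = $0
            sub(/^[ \t]*set[ \t]+/, "", line)
            n = index(line, "=")
            if (n == 0) next              # not an assignment
            key = substr(line, 1, n - 1)
            val = substr(line, n + 1)
            gsub(/[ \t]/, "", key)
            gsub(/[ \t]/, "", val)
            sub(/^.*:/, "", key)          # drop an optional "module:" prefix
            if (key == name) found = val  # remember the last matching line
        }
        END { if (found != "") print found }
    ' "$1"
}
```

Calling get_tunable /etc/system snoop_interval prints the configured value, or nothing at all if no such line is present.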
While hung
If your system is hung and, ideally, the two settings above are in place, consider waiting to see whether the deadman timer triggers. If it does not, the remote management on your system may allow you to inject an NMI and force a dump that way.
Sending NMI from IPMI remote management console
Sourced from:
...
| Code Block |
|---|
-> cd /SP/diag
/SP/diag
-> show
 /SP/diag
    Targets:
        snapshot
    Properties:
        generate_host_nmi = (Cannot show property)
        state = disabled
    Commands:
        cd
        set
        show
-> set generate_host_nmi=true
Set 'generate_host_nmi' to 'true'
|
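On service processors that speak plain IPMI rather than the ILOM CLI shown above, the same effect can usually be had with ipmitool, whose chassis power diag subcommand pulses a diagnostic interrupt (NMI) to the host. The sketch below assumes that ipmitool is installed on the admin workstation and that the BMC accepts lanplus sessions; the function name send_nmi and its arguments are placeholders, not anything from the sources above.

```shell
#!/bin/sh
# send_nmi BMC_HOST BMC_USER
# Hypothetical wrapper: ask the BMC to pulse a diagnostic interrupt (NMI)
# to the host. The -a flag makes ipmitool prompt for the BMC password.
send_nmi() {
    if [ $# -ne 2 ]; then
        echo "usage: send_nmi BMC_HOST BMC_USER" >&2
        return 2
    fi
    ipmitool -I lanplus -H "$1" -U "$2" -a chassis power diag
}
```

For example, send_nmi bmc.example.com root would target a BMC at that (made-up) address. Whether the NMI then produces a panic or a drop into kmdb depends on the /etc/system settings of the hung host.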
Sending NMI to a VirtualBox VM
Sourced from:
- http://blogs.sun.com/darren/entry/sending_a_break_to_opensolaris – "Sending a Break to OpenSolaris hosted in Virtualbox"
| Wiki Markup |
|---|
I've recently started using VirtualBox instead of physical machines for some of my basic functional testing. When doing some types of kernel development it is often necessary to force the system into {{kmdb}}.
The F1-A keystroke does this on OpenSolaris x86 systems by default, however that isn't going to work with VirtualBox because that keystroke will be grabbed by some very low level kernel routines in the (OpenSolaris-based) host and never reaches the guest.
So we need an alternate way of getting a break to the guest OpenSolaris from the host one.
I was sure someone else must have worked this out before. I didn't get the full answer from a quick google search but I did find all the parts.
The CLI for VirtualBox can send an NMI (Non Maskable Interrupt) to any running guest. OpenSolaris can be configured to drop into {{kmdb}} or force a panic when receiving an NMI.
In the guest put this into {{/etc/system}} and reboot:
{code:title=/etc/system addon}
set pcplusmp:apic_kmdb_on_nmi=1
{code}
Or to set it interactively do:
{code}
# echo apic_kmdb_on_nmi/W1 | mdb -kw
# mdb -K
{code}
Then with the VirtualBox CLI we can send an NMI to our guest:
{code}
$ VBoxManage controlvm _ZFS_Crypto_Test_ injectnmi
{code}
Nice easy solution.
Though I do now wonder why we don't have some default action for when an NMI is received - but then, not everyone cares about getting a dump or getting into {{kmdb}}! |
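The command line above dates from an early VirtualBox release. In later releases, to the best of our knowledge, the injectnmi subcommand is offered under debugvm instead of controlvm, so a script that has to work across versions can simply try both. This is a sketch under that assumption; the function name inject_nmi is hypothetical, and the VM name is whatever VBoxManage list runningvms reports for your guest.

```shell
#!/bin/sh
# inject_nmi VM_NAME_OR_UUID
# Hypothetical helper: try the newer "debugvm" form first, then fall back
# to the older "controlvm" form used in the blog post above.
inject_nmi() {
    VBoxManage debugvm "$1" injectnmi 2>/dev/null ||
        VBoxManage controlvm "$1" injectnmi
}
```

The function returns the exit status of whichever form ran last, so a caller can still detect the case where neither is accepted.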
Keyboard breakout into the kernel debugger
If neither is applicable, boot your system with the -k flag on the kernel command line and, while it is hung, press F1-A (that is, press A while holding the F1 key, as if it were a shift key); on Sun boxes with Sun keyboards this may be STOP-A instead. In theory this enters the kernel debugger (kmdb) on the console. If X is running you will not be able to see the console; in that case type $<systemdump anyway and press Return, since kmdb may be listening even though you can't see it.
...
For SPARC systems an equivalent option may be to break out to the PROM with a STOP-A keypress (or to get access to the PROM remotely via Serial LOM or RSC/ALOM/ILOM) and send a break command, which suspends the OS and enters the kernel debugger, if loaded. If the OS is not hung, you can resume it with the go command.
More info
See blogs:
- http://prefetch.net/blog/index.php/2007/02/11/recovering-from-solaris-hangs-with-the-deadman-timer/ – "Recovering from Solaris hangs with the deadman timer", a general description of the deadman timer (an adaptation of its text is included above)
- http://www.cuddletech.com/blog/pivot/entry.php?id=1044 – "Crashing Solaris for Fun and Profit", a very detailed post, also includes info on NMI and triggering it via IPMI consoles
- http://realsysadmin.com/www/?p=24 – "A healthy dose of mdb", includes examples on system tracing after the failure to uncover the hang causes
- Serial Console in VirtualBox – setup of serial console access to a VirtualBox VM in our Wiki; useful if your test system is running inside a VM
- http://developers.sun.com/solaris/articles/manage_core_dump.html – core dump management and analysis
...