Solaris Crash Recovery and Fault Analysis


by G Calkin

    27/04/2011 04:36:22 p.m.



Source: http://www.sun.co.nz/patches/abstract.html

Introduction

This is a short paper to give you some feel for how to deal with the Solaris operating system when you are faced with a fault, and need to restore the system to healthy operation quickly without resorting to pulling your CV from the bottom drawer and calling the local employment agencies.

Parts of what I discuss will be relevant to other Unix operating systems, and I will try to be Unix generic wherever possible.

This paper is primarily pitched at faults on a server. In New Zealand, most Suns act as servers - in fact, I believe most Unix machines act as servers. Much of this paper also relates to workstations, but the basic assumption I make is that you are working on a server. Few workstations are mission critical.

Firstly, a couple of notes. The most powerful Unix debugging tool is the observant system administrator. This is the same for any Unix, or in fact, any machine. Check your message files, check the console messages, check disk space, talk to your users, get training, etc. The more you understand and observe your machine, the more likely it is that you will spot problems before they occur, and be better equipped to resolve problems after they occur.

Second only to the observant administrator is the manual. All sorts of neat stuff is carefully written down in the manual. The manual is your friend. It can make you look good, sound good and feel good. The ability to read the manual is an excellent skill that few people have. It is worth taking the time to familiarize yourself with how manual entries are laid out and the purpose of each section, especially the "See Also" and "Known Bugs" sections.

I will break the paper up into four main areas. The first is failed system analysis - the machine is down, what can I do? The second is system recovery - how do I get the machine back up and running? Thirdly, a brief flirt with some of the failure analysis resources, including tools to look at what has happened, and what is happening now. The fourth section covers preventative maintenance - things to do to either avoid crashes or survive them with minimal grief.

I will attempt to point out what are generally termed "religious issues" before I wade into them. What I mean by religious issues are those gray areas where there are a number of options, each with different strengths and weaknesses, and where many people hold very strong opinions on what the correct answer is. I do not propose that I know the correct answer - in fact, in almost all of these cases, there is no correct answer, just a series of compromises that one lives with.

One final note before I launch into this. This paper barely scratches the surface of the available debugging tools and utilities within the Solaris operating system. I have tried to cover enough detail to help you navigate through the pitfalls of a severe software failure, throw in a few pointers to the odd useful piece of information that is not generally known and give a few insights that may help you realise some of the trade-offs involved when building a machine.

The system is down - where do I begin?

1.0 - Failed System Analysis.

Start by verifying that the machine is in fact down. If panic messages are streaming by, smoke is pouring out of the machine, all the power is off, etc, you can be confident that the machine has faulted.

However, things are not always so obvious. The machine could be hung, the network down, some services could be broken, or the machine could even be fine - your users' expectations could be wrong, so they perceive that the machine is down.

1.1 Test Network Connectivity

Firstly, if possible, ping the machine. If the machine responds, you know that at least the core of the OS is functional, and the network is up. If not, you know you need to do more investigation.

Failure to respond to a ping can be caused by broken routing, bad netmasks, interface problems or any of a range of network failures. This is usually a good time to head to the computer room and start investigating.

If you can ping the machine, can you telnet or rlogin into the machine? If you can, the machine is probably generally healthy, and it is time to find out what the user observed, and what is causing this.

There are several possibilities as to why you may not be able to log in to the machine via the network. If the machine is in single user mode, you usually get a message telling you that you cannot log in at this time. Similarly, running out of pseudo terminals will normally give a message indicating that the machine cannot create another device.
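The checks in this section can be strung together in a few lines of shell. What follows is a minimal sketch under stated assumptions, not a finished tool: the function name triage_host and the variables PING_CMD and LOGIN_CMD are made up for illustration, and the probe commands are whatever works at your site. LOGIN_CMD has no sensible default, so it is set to false (always "no-login") until you supply a non-interactive probe of your own.

```sh
#!/bin/sh
# Sketch of the checks in this section. All names are illustrative.
# PING_CMD and LOGIN_CMD are assumptions: each is run with the host name
# as its last argument and must exit 0 on success. Ping options differ
# between Unix flavours, so set PING_CMD to suit (for example "ping -c 1").
PING_CMD=${PING_CMD:-ping}
LOGIN_CMD=${LOGIN_CMD:-false}

triage_host() {
    host=$1
    if ! $PING_CMD "$host" >/dev/null 2>&1; then
        # Routing, netmask or interface trouble, or the machine is down.
        echo "no-ping"
        return 2
    fi
    if ! $LOGIN_CMD "$host" >/dev/null 2>&1; then
        # Core OS and network are up, but logins fail: single user mode,
        # no pseudo terminals left, or a broken service.
        echo "no-login"
        return 1
    fi
    # Probably healthy: go and find out what the user actually saw.
    echo "ok"
    return 0
}
```

The function only classifies what it sees. A "no-ping" result still means a walk to the computer room.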

1.2 Checking the console

Next, check the console. Make sure it has power. If it is not on, you can halt the machine when you turn it on. If the machine has a key and you can put the machine into lock mode, do so before turning on the power to the terminal.

Read the messages. Carefully. They are trying to tell you something. If there are error messages, note them down. They may not survive a reboot. They may be the only clue to the failure, and losing them could well mean the difference between "Please do this to fix the fault" and "Please phone us next time the problem reoccurs."

One of my favourite debugging tools is the extremely rare console terminal printer. So many failures would be captured in all their glory the first time with this tool. A 30 line panic message cannot scroll over the top of a printer, whereas it can on a terminal. Naturally, with a console printer, you do not want to do much work on the console, but this is true anyway - the console is for system administration, and should only be used as such.

1.3 Checking for System Hang

OK - no messages on the console, the network appears to be down. Press the return key on the console then wait a few seconds. Did anything happen? Usually, if the console responds, even if the console just moved down a line, the machine is probably still alive. If the machine has a graphic head, press the caps lock key. If the caps lock light responds, the machine is at least not hard hung.

If the machine does not respond, you are rapidly running out of options. The next action might be a surprise - go and observe the machine. Listen to it - is the disk hammering away? Are the lights flashing on the machine? Are the disk lights flashing? Are the network lights flashing? Is the machine exhibiting any signs of life?

By this time, you should be able to make a reasonable guess as to whether the machine is still running at all, or whether it is hung.

1.4 Recover or Reboot?

If the machine still appears to be running, but denying service, you need to do an environment check. Start at the network. There are a number of network services that can hang the machine. Make sure the network connection is firmly plugged in. Make sure the hub has power. See if the link light is on at the hub. If the link light is on, is the active light on, flashing or doing nothing? Possibly try a different port in the hub. Unplug the machine from the hub and see if the console gets any messages. Doing this investigation may give vital clues as to what the failure is. These can be vital if you find that rebooting the machine does not fix the failure, and you are searching for some external agent impacting the machine.

If this seems like a lot of things to look at and relatively few direct fixes, I apologise, but the operating system is over 5.5 million lines of code at last count, and given that size and the accompanying flexibility of the system, the range of failures that can impact the machine is simply immeasurable. This is the same for any Unix, and all modern operating systems. Quick sales spiel - the sheer range of possible failures is a good reason to have a maintenance contract so that you have people specializing in this stuff backing you up when you really need them.

1.5 Capturing Failure Analysis Data

If the machine is hung or so severely broken that you cannot proceed, it is probably a good time to either call your support supplier or try to dump the state of the machine for later analysis. If you really are on the ball, you can possibly break into the prom and rummage around in the machine to see what is broken. I won't bother to elaborate on that option - if you can't do it already, nothing I can say now will help you. We do have a training course that will give you a reasonable shot at doing this stuff, and more. I will give some details at the end if anyone is interested.

We want to get to the prom prompt to tell the machine to dump the memory image into swap space. Hopefully you have adequate swap space to cope with the dump, or we may simply not be able to capture all of the dump, and the dump will be worthless. Even then, you will need to have savecore turned on, and must not have striped your swap space, to avoid going into single user mode and manually capturing the savecore. More on this later.

First, if you are on a graphic head, hit the Stop and A keys simultaneously. If you are on a terminal, generate a break. Since this is terminal dependent, this exercise is left to the student. Also, if you are on a machine with a key control on the front, for example, an Enterprise server or a SPARCcenter, make sure the key switch is not in the locked position, as this disables the console break.

All going well, you will be rewarded with an ok prompt. This means the machine was "soft" hung, and there is a good chance that we will be able to get a core for subsequent analysis. If you do not have an ok prompt, it is time to get a bit more brutal. If you are on a graphic head, unplug the keyboard, then plug it back in. If you are using a terminal, powercycle it. Please note - any messages on the screen will be lost - make sure you have noted them down beforehand.

If you still have no ok prompt, either the machine is hard hung or someone has seriously disabled console breaks. At this time, your only option is to powercycle the machine. The only available debugging will be your observations, the system logs if they are intact and possibly sar. More on that later.

If you have got an ok prompt, type sync and hit return. The machine should then dump the memory image to swap. Usually, the only reason this will fail is gross SCSI chain failure. You will usually know this because either the machine will hang on the sync or you will get major SCSI errors on the write. This means the core is not likely to be captured, but we do have more clues as to the nature of the failure.

The machine is down. How do we get it back online?

2.0 - System Recovery.

The immediate choice you need to make at this point is whether to bring the machine up to multiuser first or into single user and look around, possibly doing some repairs first. Going to single user will take more time, but is generally safer, and if you need to recover the savecore and have some issues as to why you may not be able to do this in a normal boot, you may need to bring the machine up in single user. A third option exists for really sick machines - booting off the network or cdrom and running off an in-memory, minimal image of Unix. You will need to boot to an in-memory image of the operating system to recover root or user from backup, repair the device tree or do some serious debugging of why the machine cannot boot to single user mode.

2.1 - Pre-recovery planning

Before we begin system recovery, take a couple of minutes and think about what you are going to do. This is especially true if you are using disk striping on swap, and is critical if your root and user file systems are mirrored. What is a good option for system resilience to hardware failure also makes system recovery much more complex. Now is a good time to take stock of what is required. If you have got mirrored root and user file systems, you want to endeavour to recover the operating system from single user mode or multi-user mode - why? Because when you boot off the cdrom, you cannot deal with the operating system via the mirroring software. You must mount one half of the mirror and repair it. Even if you repair both sides of the mirror, it is very unlikely that they will both be repaired to the same state. When you boot the machine up and the mirroring software comes online, it has no idea which side of the mirror is correct, and it will scramble the data on both sides of the mirror until they are equally trashed. You can safely assume that this is very bad, and you will be restoring your machine from backup shortly afterwards.

2.2 - Power cycle the whole machine

Generally, in the event of a major failure of the system, if you are at the ok prompt, power cycle the system. This includes all disks, tape drives, etc. External devices can get seriously confused and stop the machine booting by jamming a SCSI channel. This may even be what caused the failure initially. The only fix is a powercycle. Turn on all external devices and, in the case of devices like the storage arrays, wait until all the disks are online. Then power on the cpu.

If you want to boot to a runlevel other than the default, hit Stop-A or break on the console. This should get you to an ok prompt. If you get a greater than (>), type n then hit return to get to the ok prompt.

If you are on a machine with a key switch, and you are not sure that the hardware is behaving correctly, you can always set the switch into diagnostic mode. Diagnostic mode will print all of the power on self test information as the machine boots, giving you a chance to view what is happening within the core of the machine at boot time, and hopefully highlighting any component failures on the machine. Almost all component failures will be caught by either the power on self test or seen by the operating system and logged in the system messages files.

You can also turn the diag-switch option on in the NVRAM, but unless you change the diag-device entry, the machine will attempt to boot off the net. Break to the ok prompt at this point and turn the diag-switch to false to complete booting.

2.3 - NVRAM options and issues

If the machine hangs at boot or seems to be trying to do something odd, e.g. boot off an unexpected device, it is conceivable that the machine has scrambled the NVRAM that the prom uses to boot from. There are 2 ways to fix this - if the machine is hanging on boot, powercycle the machine and hit the Stop-N key combination repeatedly. This will reset the NVRAM and hopefully get you to the ok prompt. If you are already at the ok prompt, enter

    set-defaults

and the NVRAM will be reset. This will clear out any NVRAM programming you have done. Make sure you have noted down what you have done and restore it if required. This may also set the diagnostic switch on. You will know this is on when the machine tries to boot from the network. Type

    setenv diag-switch? false

at the ok prompt to turn this off.
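Since resetting the NVRAM throws away your own settings, it pays to keep a copy of them while the machine is healthy. The following is a minimal sketch, assuming you are root on the running system and that your grep accepts the POSIX -F, -x and -f options. The function names and file locations are illustrative only. The eeprom command, run with no arguments, prints every parameter as a name=value line.

```sh
#!/bin/sh
# Sketch: keep a copy of the NVRAM settings so they can be restored by
# hand after a reset. Function names and file paths are illustrative.

# Save the current settings; eeprom with no arguments prints name=value lines.
nvram_snapshot() {
    eeprom > "$1"
}

# Print the lines of the saved file that are not present, character for
# character, in the current file: the settings that were lost or changed.
nvram_lost() {
    saved=$1
    current=$2
    grep -v -x -F -f "$current" "$saved"
}
```

Typical use would be nvram_snapshot /var/tmp/nvram.saved on a good day, then after a reset nvram_snapshot /var/tmp/nvram.now followed by nvram_lost /var/tmp/nvram.saved /var/tmp/nvram.now, restoring each listed value by hand.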

The NVRAM and boot prom are reasonably sophisticated on the SPARC hardware. They are capable of a range of testing, system diagnosis and system configuration. You can view the system device tree to check for items you are expecting to see, configure device aliases, load firmware, configure boot options, etc. This is all documented in the Solaris System Administrator AnswerBook.

Particular things of note are the ability to test various options including the SCSI devices via probe-scsi and probe-scsi-all, and check the on-board network connection to see if it sees any packets on the wire via watch-net. You can traverse the device tree with cd and ls. There are also a range of switches for the boot command, most notably

    -v - verbose option - report the device testing at boot time
    -r - reconfiguration boot - rebuild the device tree on boot
    -s - boot to single user mode
    -w - do not start up any windowing, normally used for in-memory boots
    -a - check with the user which of the boot files to use on boot.

2.4 Booting to an in-memory image

I will cover the cdrom or network boot of the machine first, as this is the most drastic and low level boot, and has the most powerful recovery options. If you have a network boot server for the local machine configured,

    boot net -sw

will boot you up to a single user shell, no password required. If you do not have this previously configured, pull out the manual on jumpstart and start reading. If the machine has a Sun cdrom drive, load the operating system cdrom into the drive and enter

    boot cdrom -sw

at the ok prompt.

If you are worried about the security implications of the above, either lock the machine in a secure room (in real terms, there is no security for a physically accessible machine, ever) or set the prom password on. Do not forget it. If you do forget it, the only way to unset it without knowing the password is via the eeprom command when the machine is running and you are logged in as root. If you can't get the machine up, you are history.

Other things to note - when you boot up to a memory image, the device tree is not built with historic information. This means that the device tree may differ radically from that on a machine that has been reconfigured several times, as the machine when running will attempt to retain device numbers over time. For example, you always want your database to live on c5t2d0s3, even if you remove controller four from the system.

2.5 Cleaning up after the crash

Normally, the first thing to do here is to fsck the root and user file systems, unless you are running a mirrored root or user file system and do not intend to make any changes on it. This is the only time when you can safely fsck the raw root and user file systems. This can fix seriously corrupt file systems.

However, fsck is not a panacea. I have seen a number of examples of fsck incorrectly recovering a corrupt root and the only fix has been to recover the file system from backup. Backups are important, do them. The fsck will show what mount point the file system was last mounted on, assuming you are fscking a file system. Check to see that you are indeed fscking the file system you intend to fsck.

You can then mount the root file system on the mount point /a and the user file system as /a/usr. Do an ls to verify that they are in fact the correct file systems. If user is wrong, at least you can look in /a/etc/vfstab and see what it should be. At this point, we can chroot to /a and perform maintenance on the disk based operating system root filesystem image while it is quiescent. Note, however, that you have not mounted /var or other file systems. If you need them, fsck them then mount them before you do the chroot. You will want /var if you intend to use the vi editor on the root file system. Also note that you will need to set the TERM type when in single user mode.
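The order of the steps above matters, so it can help to write them down before touching the disks. The following sketch only prints the commands and runs nothing. The function name recovery_plan is made up, the slice names you pass in are placeholders for your own, and vt100 is merely an assumed terminal type. It does not apply to a mirrored root or user file system, where you should not be fscking one half at all.

```sh
#!/bin/sh
# Sketch: print (do not run) the commands for checking and mounting the
# on-disk root and user file systems from the in-memory image.
# The slice names are supplied by the caller; the ones shown in the
# comments are hypothetical examples, not a recommendation.
recovery_plan() {
    root_slice=$1   # e.g. c0t0d0s0 (hypothetical)
    usr_slice=$2    # e.g. c0t0d0s6 (hypothetical)
    echo "fsck /dev/rdsk/$root_slice"
    echo "fsck /dev/rdsk/$usr_slice"
    echo "mount /dev/dsk/$root_slice /a"
    echo "mount /dev/dsk/$usr_slice /a/usr"
    echo "ls /a /a/usr"
    echo "TERM=vt100; export TERM"   # vt100 is an assumption; use your terminal type
    echo "chroot /a /bin/sh"
}
```

If you need /var, add its fsck and mount lines before the chroot in the same way.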

If you have to recover the root and user file systems, this is the point to do it. Assuming that you have a ufsdump of the file systems, newfs the raw partition, mount it as /a and ufsrestore to it. You will also need to run the installboot command to tell the boot loader where to find the boot block. Otherwise, the machine will complain that the boot file does not appear to be executable when you try and bring the machine up.

If you are dealing with a mirrored root or user file system, you immediately need to edit /a/etc/vfstab and comment out the mirrored device entries, replace both the raw and cooked mirrored file system entries with the raw and cooked disk device entries, then edit /a/etc/system and comment out the metadevice entries if they exist. When the machine is back up, you will need to rebuild the devices to get the machine fully operational. Generally, if you are in this situation, call your support provider before attempting the recovery and get help. The penalties for failure are high enough that you want to verify that you are taking the correct measures. If you are not sure what you are doing, you are in over your head.

2.6 Rebuilding the device tree and the /dev structures

Running on the in-memory Unix image is the best time to rebuild the operating system device tree, especially in the event of corruption of various nodes under /dev. /dev under Solaris is merely a series of symbolic links into /devices. The entries in /devices are complex device pointers into the in-memory device tree.

To rebuild the device tree and the /dev structures, you need to know if the operating system is carrying device history. I will cover how to rebuild the device tree without history first. Firstly, chroot onto the root file system by

    chroot /a /bin/sh

cd to /dev and remove the rmt, dsk and rdsk directories. Remove any other suspect entries in /dev, but do not remove the following entries: /dev/null, /dev/zero, /dev/mem, /dev/kmem and /dev/ksyms. cd to /devices and remove all the directories except pseudo. /devices/pseudo has the pointers for the previously mentioned devices. Next, rebuild the /devices entries, then the symlinks for the various required devices via the following commands:

    drvconfig
    devlinks
    disks

This will rebuild the devices required for rebooting the operating system. Touch /reconfigure to get the operating system to complete the device tree rebuild on the next boot.
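Removing the wrong entry from /dev at this point is an easy way to make a bad day worse, so a guard is worth having. The following is a minimal sketch with made-up function names. It deletes nothing. It only lists the directories the procedure above removes and tells you whether a given entry is on the keep list.

```sh
#!/bin/sh
# Sketch: helpers for the /dev clean-out described above.
# Nothing here removes anything; the names are illustrative.

# List which of rmt, dsk and rdsk exist as directories under the given
# /dev-style directory (normally /dev inside the chroot).
dev_removal_candidates() {
    dir=$1
    for name in rmt dsk rdsk; do
        [ -d "$dir/$name" ] && echo "$name"
    done
    return 0
}

# Exit 0 if the named entry is one the text says must never be removed.
is_protected_dev() {
    case $1 in
        null|zero|mem|kmem|ksyms) return 0 ;;
        *) return 1 ;;
    esac
}
```

Check any other suspect entry with is_protected_dev before you rm it.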

If you are trying to rebuild the device tree on a machine with some historic device entries, which is generally most important for disk controller numbers, you will need to retain the first entry in /dev/rdsk and /dev/dsk for that disk chain, e.g. c5t0d0s0 in the above example, before you rebuild the device tree. If you wish to change a disk controller number, you use a similar mechanism. For example, to change a controller from c4 to c5 where there is no c5 currently, cd to /dev/rdsk, move c4t0d0s0 to c5t0d0s0, remove all the other c4 entries, do the same in /dev/dsk and then run the command disks.

You can do some manipulations of the device tree on a multiuser running system, but this is a form of Russian roulette that I would not advise unless you are fully aware of what the system is doing.

2.7 Preparing the machine to boot

You can also turn on savecore at this time. I will cover this in a later section, as this is certainly not the best time to turn on savecore. You may also want to turn on various boot up debugging at this time. The most useful tool I find for debugging boot failures is a quick hack to the /etc/rc scripts. The main script of interest is /etc/rc2. If you have a quick look through this script, you will find a small section of code that runs the S scripts in the directory /etc/rc2.d with the start option. I suggest that you place an echo of the script name just before the case statement in which this is done, so that you can see which script is hanging or failing. You can also run the scripts with the -x option to debug more gnarly failures, or set the -x option within a problem script. Turning the -x option on for all scripts will overload your screen with information, and is generally not a good idea.
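To show where the echo goes, here is a sketch of an rc2-style start loop with the extra line added. It is an illustration only, not the vendor's script. The real /etc/rc2 on your release may differ in detail, and the function name run_start_scripts is made up so the loop can be tried against a scratch directory.

```sh
#!/bin/sh
# Sketch of an rc2-style start loop with the suggested echo added.
# Illustration only; the real /etc/rc2 on your release may differ.
run_start_scripts() {
    rcdir=$1
    for f in "$rcdir"/S*; do
        [ -s "$f" ] || continue        # skip empty or missing files
        echo "Starting $f"             # the added line: names the script about to run
        case $f in
            *.sh) . "$f" ;;            # .sh scripts are sourced into this shell
            *)    /bin/sh "$f" start ;;
        esac
    done
}
```

The last script name printed before the machine stops responding is your suspect. To trace just that one, run it by hand as /bin/sh -x script start.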

When you have finished with the root file system, exit out of chroot, cd to / and unmount the usr, var and root file systems before rebooting the machine.

2.8 Booting up to single user mode

To boot the machine to single user mode, get to the ok prompt and type

    boot -sw

This should bring the machine up to the point where you are asked for the root password to proceed with system maintenance. If the machine does not accept the root password at this point, you may well find that the shadow or password file has been damaged or destroyed by the system failure. The only recovery option you have is to boot to an in-memory version of the operating system, and repair these files. Remember the issues involved if you are using mirroring.

2.9 Cleaning up when forced into single user mode

Another situation that would normally cause the machine to enter single user mode is the failure of a file system to fsck cleanly when booting. The machine will usually have the root and user file systems mounted read only when this occurs, and the mnttab file will usually indicate that all file systems are mounted. When this occurs, you first need to remount the root and user file systems as writeable, then clean up the mnttab file before proceeding with other work. Start by fscking the root and user file systems, or just run fsck to fsck all the filesystems with fsck entries in the vfstab. Then remount the root and user file systems as writeable via the following commands

    mount -o remount,rw /

and

    mount -o remount,rw /usr

Follow this by a umountall command, which should clean up the mnttab file.

2.10 Manually Dumping the savecore image

With the machine in single user mode, you can dump the savecore image to a file before swap is mounted and possibly trashes the image. Since the savecore image can be up to the size of physical memory, you will want to dump this on a partition that has enough space. By default, the operating system attempts to dump this in the /var/crash directory. You will need to mount the partition you wish to dump the image in, if it is not already mounted. Create a directory for the image if it does not already exist, and use the command

    savecore image-directory

to dump the image.
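Before running savecore it is worth a quick check that the target partition really has the room. The following is a minimal sketch, assuming a df that accepts the -k and -P options (on some releases that means the XPG4 version rather than the default). The function names are made up, and working out the size of physical memory in kilobytes is left to you.

```sh
#!/bin/sh
# Sketch: check that the file system holding a directory has room for a
# crash image. need_kb would normally be physical memory in kilobytes.
# Assumes a df that accepts -k and -P (POSIX output format).

# Print the free space, in kilobytes, of the file system holding $1.
free_kb() {
    df -kP "$1" | awk 'NR == 2 { print $4 }'
}

# Exit 0 if the file system holding $1 has at least $2 kilobytes free.
has_room_for_dump() {
    dir=$1
    need_kb=$2
    avail=$(free_kb "$dir")
    [ -n "$avail" ] && [ "$avail" -ge "$need_kb" ]
}
```

For example, has_room_for_dump /var/crash 524288 asks whether a 512 megabyte image would fit.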

2.11 Booting to multiuser mode

Once you have finished whatever single user administration work you wish to do, you can exit single user mode to immediately boot to multiuser mode. There are two places that this process will normally hang.

2.12 NFS and the automounter

The first and most common is when there is an NFS mount in the vfstab file, and the remote server is not running. The simple workaround to this is to add either the bg or soft mount option in the vfstab, or use the automounter. When you are on a server, the automounter needs to be used with great care if you are using direct maps. Personally, I am religiously opposed to direct maps, but they are a common choice. The problem arises in the fact that the automounter takes control of a mount point in a direct map.

Say, for example, you wish to mount /usr/local from the main server as /usr/local on all of your client machines. If you are using nis or nisplus maps to manage the automounter, when the server runs the automounter, the automounter tries to take control of /usr/local, which already has a file system mounted under it, and in use. This has caused the mount point to hang when I have run into this problem. I prefer to use the automounter in combination with symbolic links to get around this, so that at work, my /usr/local is a symbolic link to gtn/apps/local. This map system also allows me to do some neat things with lan and wan links with one consistent set of maps.
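For the vfstab side of this, a quick scan will show which NFS entries could hang a boot. The following is a minimal sketch with a made-up function name. It assumes the usual seven-field vfstab layout and only reports entries that are mounted at boot and carry neither bg nor soft.

```sh
#!/bin/sh
# Sketch: list NFS mount points in a vfstab-format file that are mounted
# at boot with neither the bg nor the soft option, i.e. the entries that
# can hang a boot when the server is down. Field order assumed:
# device, fsck device, mount point, FS type, fsck pass, mount at boot, options.
nfs_mounts_that_can_hang() {
    awk '
        $1 ~ /^#/ || NF < 7 { next }          # skip comments and short lines
        $4 == "nfs" && $6 == "yes" {
            risky = 1
            n = split($7, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "bg" || opt[i] == "soft") risky = 0
            if (risky) print $3
        }
    ' "$1"
}
```

Run it as nfs_mounts_that_can_hang /etc/vfstab. Whether bg or soft is the right fix for each entry it prints is one of those religious issues.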

2.13 Problems in the device tree

Another fault that can cause a hang at boot is a problem in the device tree. One common bug is that the ps command will hang due to a corruption in the device tree caused by the buttons and dials package. If you have the device /dev/bd.off, remove it, then get rid of the buttons and dials software via

    pkgrm SUNWdialh SUNWdial

unless you have a buttons and dials board on your machine. I am not aware of any of these in New Zealand.

The other problem is normally related to the permissions of /dev/null or /dev/zero. You can check the permissions of these via the following command

    ls -lL /dev/null /dev/zero

which should return something like

    crw-rw-rw- 1 root sys 13, 2 Apr 19 21:04 /dev/null
    crw-rw-rw- 1 root sys 13, 12 May 4 1996 /dev/zero

These need to be readable and writable to all, otherwise processes will block when trying to access them.
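The same check is easy to script so it can be run routinely. The following is a minimal sketch with a made-up function name. It looks only at the nine permission characters that ls prints and complains about any file that is not readable and writable by everyone.

```sh
#!/bin/sh
# Sketch: confirm that each named file is readable and writable by
# everyone, as /dev/null and /dev/zero must be. Uses -L to follow
# symlinks, because /dev entries are links into /devices.
world_rw() {
    status=0
    for f in "$@"; do
        perms=$(ls -lLd "$f" | cut -c2-10)
        case $perms in
            rw?rw?rw?) ;;
            *) echo "$f: mode is $perms, expected rw-rw-rw-"; status=1 ;;
        esac
    done
    return $status
}
```

Run it as world_rw /dev/null /dev/zero. Silence and a zero exit status mean both are fine.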

Now that we are up, why did we fail?

3.0 Failure Analysis Tools

Now that you have the machine up and running again, it is time for a quick look around to see what possibly caused the failure.

3.1 The system log files

The first place to look is the system message files. The current messages can be pulled from the machine via the dmesg command. If the machine has been rebooted recently, the boot details of the machine, including the hardware configuration at that time, can be seen. These messages are also saved in the messages files in /var/adm, the older files having a numeric postfix on them. Hardware problems seen by the operating system will also be seen in these files. It is worthwhile monitoring these files and being aware of what the messages in them mean on your machine.

    These files are updated via the syslog daemon. How the daemon works
    and what it logs can be modified via the /etc/syslog.conf file. If,
    for example, you find that the file is continuously being filled with
    sendmail messages, you may elect to put the sendmail messages in a
    separate file and reserve the system messages file for more critical
    messages. As a side note, if the last message from the previous boot
    is from syslogd indicating it was shut down with a signal 15, this
    tells you that someone halted the machine manually. Use the last
    command to give you clues as to who the culprit may be.
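A quick way to look for the signal 15 message across the current and older messages files is sketched below. The function name is made up, and the exact wording syslogd uses is an assumption here, so adjust the pattern to match what your own files contain.

```shell
#!/bin/sh
# Hypothetical helper: print syslogd shutdown lines from every messages
# file in a directory (default /var/adm). The pattern is an assumption
# about how the message is worded.
find_syslogd_shutdowns() {
    logdir=${1:-/var/adm}
    for f in "$logdir"/messages*; do
        [ -f "$f" ] || continue
        # /dev/null as a second file makes grep print the file name.
        grep 'syslogd.*signal 15' "$f" /dev/null || :
    done
    return 0
}

find_syslogd_shutdowns /var/adm
```

The timestamps on any lines found can then be compared with the output of the last command.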

    2 System Accounting

    One very useful tool to consider using, if you are not already, is sar,
    the system activity reporter. As the adm user, you can use cron to run
    the command

    /usr/lib/sa/sa1

    periodically to checkpoint various operating system parameters,
    including cpu usage, system memory and swap usage, networking and disk
    utilization, etc. You can then use the sar command to interrogate the
    files captured to get information and trends from these files. The
    trends are particularly useful for system performance tuning and system
    debugging. This is often the only way to spot a slow memory leak in
    the kernel.

    There are two things to be aware of with sar. First, it can chew
    through space in the /var partition, and it is not particularly good at
    cleaning up after itself. Second, it often does not report correctly
    after a reboot. The data is being saved, but sar does not report it.
    I believe you can use the time options on sar to get around this.
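To keep an eye on the space problem, old data files can be listed before deciding what to remove. This is a sketch only: the function name is made up, and the directory /var/adm/sa and the sa* file naming are assumptions about where sa1 writes on your machine. Nothing is deleted.

```shell
#!/bin/sh
# Hypothetical helper: print sar data files older than a given number of
# days. Directory and file naming are assumptions; this only lists files.
list_old_sa_files() {
    sadir=${1:-/var/adm/sa}
    days=${2:-30}
    [ -d "$sadir" ] || return 0
    find "$sadir" -type f -name 'sa*' -mtime +"$days" -print
}

list_old_sa_files /var/adm/sa 30
```

Once you are happy with the list, the same find expression can be given to a clean-up job run from cron.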

    3 Crash analysis

    You have captured the system crash and now want to look at it. There
    are two tools that ship with the operating system to do this, crash and
    adb. Neither is pretty, but crash is the more user friendly, which is
    not saying much. Crash at least has a help command. Once again,
    grovelling around in the kernel is for the trained expert, and not much
    I can say will help here. There are some internal tools being developed
    to do light weight crash dump analysis, but I am not sure when or if
    these will be available outside Sun. Generally, call your service
    provider and they should be able to analyse the core file and figure
    out why the machine failed. This can be a very time consuming exercise,
    so please be patient. Digging through hundreds of threads to find which
    has the critical mutex can be very painful.

    4 Streams Error reporting

    Another useful tool that you can turn on is streams error reporting.
    To turn this on, create a directory /var/adm/streams and run the
    command strerr. You can start it in the background at boot to get the
    machine to continually record the messages. While primarily useful
    for network problems, streams errors can highlight other system
    problems and can pick up the odd intermittent fault that has otherwise
    been missed. Run it, and have a look at what it produces.

    5 SunSolve

    SunSolve is a database of various system issues, including bug reports,
    patch information, symptoms and resolutions and other useful
    documentation that Sun packages up for contract customers. There are
    two main distribution mechanisms for SunSolve: a cdrom, which is
    distributed once every 6 weeks, and the internet. Both mechanisms
    have a fairly powerful search engine with them. If you have an error
    message you wish to check up on, SunSolve is a very good place to do
    the initial search to see what you can do about it, and whether you
    should be concerned.

    6 Patches

    Both SunSolve mechanisms also have full access to the Sun patch
    database, enabling you to install any patches indicated by SunSolve, or
    the current recommended patch cluster to bring your machine up to date
    with the current recommended revision of your operating system.
    Patching the machine to the latest recommended revisions is required
    before SunService can escalate a problem over to Engineering for
    analysis and repair. If your machine does become unstable, I would
    generally recommend bringing the patches up to date in case the problem
    is known and a fix has already been released for it, which is often
    the case.

    The recommended and security patches are also available for anonymous
    download to all Sun customers from sunsolve1.sun.com. However, for
    legal reasons, access to the non-recommended patches is limited to
    machines which are covered under a maintenance contract.

    A patch is a fix to a recognised problem or problems within the
    operating system or its attendant software. Patches are normally
    named with a 6 digit identifier code, followed by a 2 digit revision
    code. If you cd into the top level of a patch directory, you will see
    a README file detailing what issues the patch addresses, what special
    instructions need to be performed when installing the patch and any
    other issues you will need to be aware of.
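The naming scheme makes it easy to compare patch revisions in a script. The sketch below assumes the two parts are joined by a hyphen (for example 123456-07, a made-up patch name); all the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for patch names of the form NNNNNN-RR
# (six digit identifier, hyphen, two digit revision).
is_patch_id() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

patch_base() { echo "${1%-*}"; }   # identifier part, e.g. 123456
patch_rev()  { echo "${1#*-}"; }   # revision part, e.g. 07

# Succeeds if patch $1 has the same identifier as $2 and a higher revision.
is_newer_revision() {
    is_patch_id "$1" && is_patch_id "$2" || return 1
    [ "${1%-*}" = "${2%-*}" ] || return 1
    rev1=${1#*-}
    rev2=${2#*-}
    # Drop one leading zero so the revisions compare as plain integers.
    [ "${rev1#0}" -gt "${rev2#0}" ]
}
```

For example, is_newer_revision 123456-12 123456-07 succeeds, while comparing two different identifiers always fails, since their revisions are unrelated.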

    There will also be an installpatch script and a backoutpatch script.
    Note that the backout is only possible where the patch has been
    installed without the -d option. The -d option stops installpatch
    backing up the files it is patching, so it cannot restore those files
    if you wish to back out the patch. The -d option is chosen when you
    install the recommended patches if you elect not to save the original
    versions of the software.

    To check which patches are on the machine, you can use the command

    showrev -p

    on Solaris machines. On SunOS machines, patches are installed
    manually, so you need to keep accurate records as to what has been
    installed. Patching is managed by a wrapper over the package system
    at the present moment, although this is likely to change with Solaris
    2.6.
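The patch listing is long, so it can help to reduce it to one patch name per line before comparing it with a recommended list. The sketch below assumes each record in the listing begins with the word "Patch:" followed by the patch name; check this against the output on your own release. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read a patch listing on stdin and print one patch
# name per line. Assumes each record starts with "Patch: NNNNNN-RR".
list_patch_ids() {
    awk '$1 == "Patch:" { print $2 }' | sort -u
}

# Only run the real command where it exists.
if command -v showrev >/dev/null 2>&1; then
    showrev -p | list_patch_ids
fi
```

Saving this output after each patching session gives you the kind of record that otherwise has to be kept by hand.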

    7 Package information

    The system package database is stored in /var/sadm. Removing, moving
    or otherwise fiddling with this directory can be fatal to the long term
    health of the system, and at minimum will require you to do a full
    install rather than an upgrade when you next come to upgrade the
    operating system. If showrev core dumps on you, you will normally find
    that a core file has been dropped in the /var/sadm directories
    somewhere. It is safe to clean this up.

    Under Solaris, the whole operating system is managed by packages. You
    can see what packages are installed via

    pkginfo

    and you can get detailed information on a package via

    pkginfo -l packagename

    Sun packages almost always start with SUNW, which is our stock trading
    name. I have no idea why they used this, in case anyone wants to know.
    When you install patches, these will install as packages with a numeric
    suffix to the main packages that they update. You should be able to
    see this in the output of a pkginfo command.

    What can we do to improve reliability?

    0 Preventative Maintenance

    Much of this section borders on what are religious issues for most
    system administrators. I will cover the least contentious issue first.

    1 Turning on savecore

    The savecore command is commented out in the default configuration of
    the operating system, primarily because of the mayhem that dumping a
    large file in /var can cause when the system administrator is not
    prepared for it. In the script /etc/init.d/sysetup, you will find the
    following block commented out

    ## Default is to not do a savecore
    #if [ ! -d /var/crash/`uname -n` ]
    #then mkdir -m 0700 -p /var/crash/`uname -n`
    #fi
    #echo 'checking for crash dump...\c '
    #savecore /var/crash/`uname -n`
    #echo ''

    This dumps the savecore file in the /var/crash directory after creating
    a directory for the image. If you point savecore at a different
    file system, remember that this process is run before the mount of
    non-core file systems, so you will probably need to mount the file
    system before running savecore. Also, if you mount the file system
    at this point, turn off the automatic mount in vfstab.
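Once uncommented, the block can be restated as a small function. The version below is a sketch rather than the shipped script: CRASH_ROOT is a made-up override so the directory handling can be tried somewhere other than /var/crash, and savecore is only called where the command exists.

```shell
#!/bin/sh
# Hypothetical restatement of the savecore block. CRASH_ROOT is an
# invented override; the default location matches the shipped script.
ensure_crash_dir() {
    crashdir=${CRASH_ROOT:-/var/crash}/$(uname -n)
    if [ ! -d "$crashdir" ]; then
        # Mode 0700 keeps the kernel image private to root.
        mkdir -m 0700 -p "$crashdir" || return 1
    fi
    echo "$crashdir"
}

save_crash_dump() {
    dir=$(ensure_crash_dir) || return 1
    echo "checking for crash dump in $dir"
    if command -v savecore >/dev/null 2>&1; then
        savecore "$dir"
    fi
}
```

If the dump is to go to another file system, that file system has to be mounted before save_crash_dump is called, for the reason given above.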

    2 file system Layout

    Two major schools of thought come into play when discussing the
    operating system file system layout: those who believe disk space is a
    problem and want to put everything in the root, and those who like to
    compartmentalise the operating system, putting everything into its own
    partition. There have been several major debates about this within
    Sun, and the general consensus is that the single partition is good for
    work stations, but not appropriate for servers.

    I prefer compartmentalisation, and I feel that any reasonably
    experienced system administrator should be able to allocate space where
    required in the first place. Also, disk is cheap these days. However,
    it really makes very little difference except for a select few file
    systems.

    The most important file systems to the operating system are the root
    and usr file systems, whether as distinct partitions or combined.
    These two file systems should generally be relatively static. On the
    other hand there is the /var file system, which is where the operating
    system does a lot of scribbling and creating files. In the event of a
    crash, the /var partition will almost always need an fsck to clean it
    up before remounting. Since fsck can also be a liability, you do not
    want to run it on the root or usr file systems if possible. Therefore,
    it is strongly advised that you keep /var, and any volatile file
    systems, as distinct file systems from the root and usr partitions.
    This also makes the machine much quicker to recover in the event of
    major corruption where you need to recover those two partitions.
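Whether /var really is a distinct file system can be checked from a script. The sketch below assumes a df that accepts the POSIX -P option and mount points without spaces in their names; the function names are made up.

```shell
#!/bin/sh
# Hypothetical helper: print the mount point holding a path, taken from
# the last column of POSIX-format df output (assumes df supports -P).
mount_point_of() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $NF }'
}

# Succeeds only if both paths resolve and live on different file systems.
on_separate_fs() {
    a=$(mount_point_of "$1")
    b=$(mount_point_of "$2")
    [ -n "$a" ] && [ -n "$b" ] && [ "$a" != "$b" ]
}

if on_separate_fs / /var; then
    echo "/var is a distinct file system"
else
    echo "/var shares a file system with / (or could not be checked)"
fi
```

The same test can be repeated for any other volatile directory you want kept away from root and usr.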

    The other separate partition you should have is swap. Although you can
    run the machine without a swap partition, or you could swap onto a file
    within a file system, you are then not going to catch a dump in the
    event of failure. This means that you may not find out why the machine
    is crashing, so you may not get a fix.

    3 Backup Strategies

    Backups are obviously important. If you do not know this lesson
    already, you will learn it in the same painful way that every other
    unix administrator has learned it.

    What is almost as important as doing backups is what type of backups
    you use. tar and cpio are not adequate for backing up the root and
    usr file systems. You need a backup that will guarantee to recover
    the file system exactly as it was, including holey files, device
    entries, directory sizes, etc. tar and cpio cannot do this. For the
    root and usr file systems, assuming that they are ufs file systems,
    use ufsdump periodically.

    Even if you have the Solstice Backup product, aka Networker, you want
    to separately back up these partitions via ufsdump, and include the
    /opt partition. You need Networker installed to recover Networker
    backups, and the operating system cd does not have this option, so you
    would have to build the operating system, then install Networker, then
    license and configure it, before you could begin to recover the
    machine. With the ufsdump backups, you recover the root, usr and opt
    file systems from an in-memory version of unix, run installboot, then
    boot the machine and recover the rest of the system.
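A dump plan for the three file systems can be printed and reviewed before anything is written to tape. This is a sketch: the function name is made up, the tape device /dev/rmt/0n is an assumption (the no-rewind device, so that successive dumps follow one another on the same tape), and the commands are only echoed, not run.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a level 0 ufsdump command for each
# file system named. 0 is the dump level, u records the dump in
# /etc/dumpdates, f names the output device that follows.
print_dump_plan() {
    tape=$1
    shift
    for fs in "$@"; do
        echo "ufsdump 0uf $tape $fs"
    done
}

print_dump_plan /dev/rmt/0n / /usr /opt
```

Keeping root, usr and opt in a fixed order on the tape makes the restore from the in-memory version of unix much less error prone.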

    4 Metadevice Myths

    A surprising number of people are confused by what mirroring does, and
    why you use it. One particular confusion relates to whether you need a
    backup of the root file system if you have it mirrored.

    What mirroring does is duplicate a partition, including all changes
    made to it. What mirroring is good for is the continued service of the
    machine in the event of a hardware failure of half of a mirror, or the
    ability to back up a partition while the partition is still in use.
    What it does not protect against is user error or software failure. If
    someone deletes /etc/passwd on the mirror, it is guaranteed to be
    deleted from both sides of the mirror, making both sides of the mirror
    equally useless.

    Due to various problems with maintenance of a mirrored root file
    system, I would state that if your machine is mission critical, in that
    it must provide service between certain hours, mirroring root could
    well be useful to you. Otherwise, it is often a hindrance, as you are
    much more likely to suffer a software failure, and mirroring could make
    that failure an order of magnitude more difficult to fix.

    This comes back to the old rule, which applies to computers as much as
    to anything else: keep it simple.

    Conclusion

    There is a lot of complexity within the Solaris operating system, as
    there is in virtually all modern operating systems. Knowledge and
    planning are the keys to managing this complexity, especially in times
    of crisis. If you take the time to learn about your machine, talk to
    your users, observe your machine and do a little exploring, reading
    and thinking, you will have a less stressful work life. Of course,
    how you make time for this while you are running around dealing with
    the current disasters is the big question.

    Thanks for your time, and I hope you get some value from this course,
    and that you feel a bit more in control next time everything turns to
    jello and 400 users are ringing to find out why they can't surf the
    web ...
