Solaris™ Administration Best Practices

Peter Baer Galvin

Over the past few months in Sys Admin, both online and in print, a discourse has been taking place about the best practices of Solaris administrators. This month in the Solaris Corner, I take the best of the old, add the best of the new, and create a consensus "best practices" document. In this document, you'll see some repeats from past columns, but it seemed logical to put forth a complete version of the document this once. I hope this will be useful to both experienced and novice administrators, and that it will continue to evolve and grow as more administrators contribute their wisdom.

"Consensus" and "systems administration" are words that are rarely used near each other. Systems administration is a journeyman's trade, with knowledge earned through experience and hard work. Every sys admin performs his or her work with slight and not-so-slight variation from colleagues. Some of that variation is caused by personal choice, some by superstition, and some by a differing knowledge set. The more this hard-won knowledge is spread, the more systems will be run alike, and the more stable, usable, and manageable they will be. Hence the need for a Best Practices document. Note that most of the information here applies to other operating systems, and sometimes to the real world as well.

Solaris Administration Best Practices, Version 1.0

This document is the result of input from many of the top administrators in the Solaris community. Please help your fellow sys admins (as they have helped you here) by contributing your best practices to this document. Email them to me at: [email protected].

Keep an Eye Peeled and a Wall at Your Back

The best way to prepare to debug problems is to know how your systems run when there are no problems. Then, when a problem occurs, it can easily be discerned. For example, if you aren't familiar with the normal values of system statistics (CPU load, number of interrupts, memory scan rates, and so on), determining that one value is unusual will be impossible. Devise trip wires and reports to help detect when something unusual is happening. swatch and netsaint are good programs for this purpose.
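
As a rough illustration of a trip wire, the Python sketch below compares a fresh reading against a baseline collected while the system was healthy. It is only one possible approach, and everything in it is an assumption made for the example: the function name is_unusual, the default tolerance of three standard deviations, and the idea that you already gather the samples (load averages, say) by some other means.

```python
from statistics import mean, stdev


def is_unusual(history, current, tolerance=3.0):
    """Return True if `current` lies more than `tolerance` standard
    deviations away from the mean of `history`.

    `history` is a list of readings taken while the system was known
    to be healthy; the default tolerance is an arbitrary starting point.
    """
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any different value is unusual.
        return current != centre
    return abs(current - centre) > tolerance * spread
```

With a baseline of load averages 1.0, 1.2, 0.9, and 1.1, a reading of 5.0 would be flagged and a reading of 1.05 would not. A report built on a check like this is only as good as the baseline behind it, which is the point of the rule.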

Also, pay attention to error messages and clean up the errors behind them. You ignore them at your peril -- the problems will snowball and devour you at the worst possible time, or the messages will mount until they bury a really important problem that you miss in all the noise.

Communicate with Users

Veteran sys admins realize that they are there to make users' lives easier. Those who are going to enjoy long sys admin careers take this to heart, and make decisions and recommendations based on usability, stability, and security. The antithesis of this rule is the joke I used to tell as a university sys admin: "This university job would be great if it wasn't for the dang students messing up my systems". Good sys admins realize that this is a joke, not gospel!

Also remember the example set by the World Champion New England Patriots -- teamwork can overcome strong foes. Talk with your fellow admins, bounce ideas around, share experiences, and learn from each other. It will make you, them, and your systems better.

Help Users Fix It Themselves

Helping users help themselves means that you can spend your time on the hard and interesting problems (not to mention fewer calls on your off-hours, and happier users). For example, if you have reports that tell you about quota abuse (overly large mail folders and the like), enable the users to solve the problem, rather than repeatedly complaining to them about their abuse. The counter to this rule is that a little knowledge can be dangerous. Users may think they understand the problem when they don't, or might think they are solving the problem when they are making it worse.

Know When to Use Strategy, and When to Use Tactics

Sys admins must learn the difference between strategy and tactics and learn the place for both. Being good at this requires experience. Strategy means arranging the battlefield so that your chances are maximized, possibly allowing you to win easily or even without a fight. Tactics mean hand-to-hand combat. You win by being good at both. An ounce of strategy can be worth a couple of tons of tactics. But don't be overly clever with strategy where tactics will do.

Another way to think of this rule is: "Do it the hard way and get it over with." Too often admins try to do things the easy way, or take a shortcut, only to have to redo everything anyway and do it the "hard" way. (Note that the "hard way" may not be the vendor-documented way of doing things.)

All Projects Take Twice as Long as They Should

Project planning, even when performed by experienced planners, misses small but important steps that, at a minimum, delay the project, and at a maximum, can destroy the plan (or the resulting systems). The most experienced planners pad the time they think each step will take, doubling the length of a given plan. Then, when the plan proves to be accurate, the planner is considered to be a genius. In reality, the planner thought "this step should take four days, so I'll put down eight".

Knowledge of this rule can also be used to your advantage: announce big changes for a date far earlier than you really need them done. If you need to power off a data center or replace a critical server by October 31, announce it for September. People will be much more forthcoming about possible problems as the "deadline" approaches. You can adjust to "push back" very diplomatically and generously because your real deadline is not imperiled. (Of course, it's also very important to be honest with your users to establish trust, so be careful with this "Scotty" rule.)

It's Not Done Until It's Tested

Many sys admins like their work because of the technical challenges, minute details, and creative processes involved in running systems. The type of personality drawn to those types of challenges typically is not the type that is good at thoroughly testing a solution, and then retesting after changes to variables. Unfortunately for them, testing is required for system assurance, and for good systems administration. I shudder to think how much system (and administrator) time has been wasted by less-than-thorough testing.

Note that this rule can translate into the need for a test environment of some sort. If the production environment is important, there should be a test environment for learning and experimentation.

It's Not Done Until It's Documented

Documentation is never a favorite task. It's hard to do and even harder to keep up to date, but it pays great dividends when done properly. Note that documentation does not need to be in the form of a novel. It can be the result of a script run to capture system configuration files and status command output, in part. Another strategy -- document your systems administration using basic HTML. (Netscape Composer is sufficient for this task.) Documents can be stored remotely, and links can be included to point to useful stuff. You can even burn a CD of the contents to archive with the backups.

If you keep a system history this way, searching the documents can help solve recurring problems. Frequently, admins waste time working on problems that have previously been solved, but not properly documented.
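
The "script run to capture system configuration files and status command output" mentioned above could be as small as the following Python sketch. Treat it as an assumption-laden example rather than a finished tool: the capture function, the COMMANDS table, and the one-directory-per-run layout are all invented for illustration, and the two commands listed are merely plausible candidates.

```python
import subprocess
from datetime import datetime
from pathlib import Path

# Hypothetical selection of status commands; adjust for your site.
COMMANDS = {
    "uname": ["uname", "-a"],
    "df": ["df", "-k"],
}


def capture(commands, dest_root, stamp=None):
    """Run each command and save its output under dest_root/<stamp>/.

    Returns the directory that was written. A command that cannot be
    run is recorded in its output file rather than aborting the run.
    """
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / stamp
    dest.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=60
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            text = f"could not run {argv!r}: {exc}\n"
        (dest / f"{name}.txt").write_text(text)
    return dest
```

Calling capture(COMMANDS, "/some/docs/directory") on a schedule would leave a dated trail that can be compared from one run to the next when something changes unexpectedly.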

Never Change Anything on Fridays

Whatever the cause (in a hurry to get home, goblins, gamma rays), changes made before weekends (or vacations) frequently turn into disaster. Do not tempt the fates -- just wait until there are a couple of days in a row to make the change and monitor it. Some admins even avoid making a change late in the day, preferring to do so at the start of a day to allow more monitoring and debugging time.

Use Defaults Whenever Possible

I recall a conversation with a client in which the client was trying to go outside of the box, save some time and money, and produce a complex but theoretically workable solution. My response was "there's such a thing as being too clever". He continued being clever (too much so, in my opinion; just right, in his opinion). The solution had problems, was difficult to debug, was difficult for vendors to support, and was quite a bother for a while.

In another example, some admins make changes for convenience that end up making the system different from others. For example, some still put files into /usr/local/bin, even though Sun has discouraged that (and encouraged /opt/local) for many years. It may make the users' lives easier, because that's where they expect to find files, but other admins may be unpleasantly surprised when they use standard methods and find they conflict with the current system configuration.

This rule (as with the others) can be violated with good cause. For example, where security is concerned, security can be increased by not using defaults.

Furthermore, standardize as much as possible. Try to run the same release of Solaris on all machines, with the same patch set, with the same revisions of applications, with the same hardware configurations -- this is easier said than done, as with most of these rules. It is important to set goals, and drive toward them, when possible. This is a good goal, even if it is not 100% attainable.

With this in mind, isolate site- and system-specific changes. Try to keep all nonstandard things in one place so it is easy to manage them, move them, and to know that something is non-standard.

Always Be Able to Undo What You Are About to Do

Never move forward with something unless you are fully prepared to return the server to the original starting point (e.g., make images, back stuff up and test, make sure you have the original CDs). Back up the entire system if making systemic changes, such as system upgrades or major application upgrades. Back up individual files if making minor changes. Rather than deleting something (a directory or application), try renaming or moving it first. Everyone who has administered systems for any amount of time has seen the result of ignoring this rule.
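
For the minor-change case, the habit can be wrapped in a couple of helper functions. The Python sketch below is one way to do it, not a prescribed method; the names backup_file and set_aside and the suffixes they append are assumptions chosen for the example.

```python
import shutil
from datetime import datetime
from pathlib import Path


def _stamp():
    """Timestamp used to keep successive copies apart."""
    return datetime.now().strftime("%Y%m%d-%H%M%S")


def backup_file(path, stamp=None):
    """Copy `path` to `<name>.<stamp>.bak` beside the original.

    Returns the path of the copy; refuses to overwrite an existing one.
    """
    src = Path(path)
    dst = src.with_name(f"{src.name}.{stamp or _stamp()}.bak")
    if dst.exists():
        raise FileExistsError(dst)
    shutil.copy2(src, dst)  # copy2 also keeps timestamps and permission bits
    return dst


def set_aside(path, stamp=None):
    """Rename `path` instead of deleting it, and return the new path."""
    src = Path(path)
    dst = src.with_name(f"{src.name}.removed-{stamp or _stamp()}")
    src.rename(dst)
    return dst
```

Undoing either step is then a single copy or rename back to the original name, which is exactly the property this rule asks for.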

Avoid Poor Decisions from Above

This certainly falls under the category of easier said than done, unfortunately. Management can have bad data and make bad decisions, or can even have good data and make bad decisions. You as the admin usually have to live with (and suffer with) these decisions, so, with reason and data, encourage the correct decision. Even if you lose the battle, you can always say "I told you so" and feel good about yourself (while looking for a new job).

If You Haven't Seen It Work, It Probably Doesn't

Also known as "the discount of marketing". Products are announced, purchased, and installed with the expectation that they will perform as advertised, and a small percentage of the time they actually do. Most of the time, they are overpromised and underdelivered.

If You're Fighting Fires, Find the Sources

I would posit that thousands of admin-lives have been wasted by fighting computer system "fires", instead of the causes of those fires. To avoid this problem, you must automate what can be automated, especially monitoring and log checking. This can free up enough time to allow you to make progress on projects, rather than spending time on tedious work. Those projects in turn can stabilize systems, improve manageability, increase performance, and in general make the systems happier.
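
Log checking is often the easiest place to start automating. The following Python sketch shows the general shape of such a check under some loud assumptions: the ALERT pattern is a guess at what "worth a look" means, the suspicious_lines function is hypothetical, and a real site would tune both the alert pattern and the ignore list to its own logs.

```python
import re

# Assumed starting pattern for lines that deserve attention.
ALERT = re.compile(r"error|fail|panic|warning", re.IGNORECASE)


def suspicious_lines(lines, ignore=()):
    """Return the lines that match ALERT and none of the `ignore` patterns.

    `lines` is any iterable of strings (an open log file works);
    `ignore` holds regular expressions for messages already known to
    be harmless, so they do not drown out new problems.
    """
    ignore_res = [re.compile(pattern) for pattern in ignore]
    hits = []
    for line in lines:
        if ALERT.search(line) and not any(r.search(line) for r in ignore_res):
            hits.append(line.rstrip("\n"))
    return hits
```

Run from a scheduler and mailed only when the result is non-empty, a filter like this turns a daily chore into an occasional message. The ignore list should stay short, though; a long one usually means there are errors that ought to be fixed rather than hidden.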

If You Don't Understand It, Don't Play with It on Production Systems

In my university days, we had a student programmer look at the key switch on our main production Sun server, wonder what it does, and turn it to find out. It would have been much better if he had asked someone about it first, or at least tried it on a test system. That turns out to be the case with just about everything admin-related. (Of course, he's now a .com millionaire, so who got the last laugh?)

If It Can Be Accidentally Used, and Can Produce Bad Consequences, Protect It

For example, if there is a big red power button at shoulder height on a wall that is just begging to be leaned against, protect it. Otherwise, the power could be lost in the entire workstation lab and lots of students could lose their work (not that such a thing happened on the systems I was managing...). This rule should be extrapolated to just about everything users have access to -- hardware or software. For instance, if it's a dangerous command or procedure, wrap it in a protective script.
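
A protective script can be as simple as making the operator prove they know which machine they are on. The Python sketch below illustrates the idea; it is an assumption-based example, with run_guarded and confirmed being hypothetical names and the "type the host name" prompt being just one of many possible safeguards.

```python
import socket
import subprocess


def confirmed(expected, reply):
    """True only if the operator typed exactly the expected text."""
    return reply.strip() == expected


def run_guarded(argv, ask=input, host=None):
    """Run `argv` only after the operator types this machine's host name.

    Returns the command's exit status, or None if the confirmation
    failed and nothing was run. `ask` and `host` can be overridden,
    which mainly helps when testing the wrapper itself.
    """
    host = host or socket.gethostname()
    prompt = (
        f"About to run {' '.join(argv)} on {host}. "
        "Type the host name to continue: "
    )
    if not confirmed(host, ask(prompt)):
        print("Aborted; nothing was run.")
        return None
    return subprocess.run(argv).returncode
```

Installing a wrapper like this under the name people would normally type for the dangerous command means the accidental lean against the big red button gets a second chance.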

Ockham's Razor Is Very Sharp Indeed

William of Ockham (or Occam) advanced humanity immensely when he devised his "razor", or the "law of parsimony", which says that the simplest of two or more competing theories is the preferable one.

Checking the simple and obvious stuff first can save a lot of time and aggravation. A good example is the user who calls and says, "I can't log in anymore." Rather than reset the password or, worse yet, tear into the NIS server, just ask if he has the "Caps Lock" key on. Note that to properly execute Ockham's Razor, you must start with no preconceived notions. Assume nothing, gather data, test hypotheses, and then decide on the problem and the solution.

Another Ockham's Razor corollary: Never attribute to malice what can be explained as the result of sheer idiot stupidity.

Frequently, implementation and management complexity is unnecessary and results from "too clever" systems administration. This subtle rule has wide-ranging ramifications. Many times, a sys admin is in an untenable position (debugging a problem, transitioning, upgrading, and so on) because of too much cleverness in the past (their own or someone else's).

When in Doubt, Reboot

As silly and hackneyed as it sounds, this is probably the most important rule. It's also the most controversial, with much argument on both sides. Some argue that it's amateurish to reboot; others talk of all the time saved and problems solved by rebooting.

Of course, sometimes a sys admin doesn't have the luxury to reboot systems every time there is a problem. For every time that a corrupted jumbo patch was the culprit and a reboot solved the problem, there is a time that rebooting ad nauseam proved nothing and some real detective work was needed to resolve a problem. As with many of these rules, experience is a guide. Knowing when to reboot -- and when not to -- is one characteristic of a good, experienced admin.

If It Ain't Broke, Don't Fix It

It's amazing that clichés are often true. When I recall all the time I've wasted making just one last change, or one small improvement, only to cause irreparable harm resulting in prolonged debugging (or an operating system reinstallation), I wish this rule were tattooed somewhere obvious. The beauty of this rule is that it works in real life, not just for systems administration.

Save Early and Often

If you added up all the hours wasted because of data lost due to program and system crashes, it would be a very large number. Saving early and often reduces the amount of loss when a problem occurs.

I've heard a story that Bill Joy was adding a lot of features to his program, the vi editor. It had multiple windows and a bunch of good stuff. The system crashed and he lost the changes. Frustrated, he didn't repeat the changes, leaving us with a much less useful vi. Don't let this happen to you!

Dedicate a System Disk

So much depends on the system disk that it is worth keeping the disk only for system use. Swapping it then becomes possible without affecting users. Performing upgrades, cloning, and mirroring are all easier as well. Performance also improves if the system disk is not used for other purposes.

Have a Plan

Develop a written task list for important tasks. Develop it during initial testing and use and refine it as the changes are applied to increasingly critical servers. By the time you get to production, you typically have a short maintenance window and avoiding mistakes is critical. A finely tuned plan can make all the difference. It also means that the next time (and the time after that) you do something similar, you already have a template for a new task list.

Don't Panic and Have Fun

Rash decisions usually turn out to be bad ones. Logic, reason, testing, deduction, and repeatability -- these techniques usually result in solved problems and smooth-running systems. And in spite of the complexity of the job, and the pressure that can result from the inherent responsibilities, try to enjoy your work and your co-workers.

Acknowledgements

Thanks to the following folks for contributing to this document: Stewart Dean, Ken Stone, Art Kufeldt, Juan Pablo Sanchez Beltrán, Pete Tamas, Christopher Jones, Leslie Walker, Dave Powell, Mike Zinni, Peggy Fenner, John Kelly, Lola Brown, Chris Gait, Timothy R. Geier, Michael McConnell, David J. DeWolfe, Christopher Corayer, and Tarjei Jensen.

Peter Baer Galvin (http://www.petergalvin.org) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.
