Solaris(TM)
Administration Best Practices
Peter Baer Galvin
The content of this "Best Practices" column has evolved over time.
Thanks to reader feedback and some new real-world experience, I've
updated it to be more complete. I hope this information will be
useful to both experienced and novice administrators, and that this
document will continue to evolve and grow as more administrators
contribute their wisdom.
"Consensus" and "systems administration" are words rarely used
near each other. Systems administration is a journeyman's trade,
with knowledge earned through hard work and hard-earned experience.
Every sys admin performs his or her work slightly (or not so
slightly) differently from colleagues. Some of that variation is
caused by personal choice, some by superstition, and some by differing
knowledge sets. And unfortunately, the saying that used to appear
on Sys Admin t-shirts -- "It's a dirty job, but somebody said I had
to do it" -- is too often true. However, the more these best
practices are applied, the more similarly systems will be managed,
and the more stable, usable, and manageable they will become. Thus
arises the need for a Best Practices document. Note that most of
the information here applies to all operating systems, and
sometimes to the real world as well.
Solaris Administration Best Practices, Version 2.0
This document is the result of input from many of the top administrators
in the Solaris community. Please help your fellow sys admins (as
they have helped you) by contributing your experience to this document.
Email me at: bestpractice@petergalvin.org.
Keep an Eye Peeled and a Wall at Your Back
The best way to start to debug problems is to know how your systems
run when there are no problems. Then, when a problem occurs, it
can be easily discerned. For example, if you aren't familiar with
the normal values of system statistics (CPU load, number of interrupts,
memory scan rates, and so on), determining that one value is unusual
will be impossible. Devise trip wires and reports to help detect
when something unusual is happening. Swatch and netsaint are good
programs for this purpose.
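As a minimal sketch of building that baseline, a cron-driven shell
script along the following lines (the paths and intervals are only
illustrative) appends a few routine snapshots to a dated file; after
a week or two of output, the "normal" numbers are easy to recognize.

  #!/bin/sh
  # baseline.sh -- illustrative only: record routine performance
  # numbers so that "normal" is known before anything breaks.
  # Example crontab entry:  0 * * * * /opt/local/adm/baseline.sh

  DIR=/var/adm/baseline          # assumed local convention for output
  OUT=$DIR/stats.`date +%Y%m%d`

  mkdir -p $DIR

  {
    echo "==== `date` ===="
    uptime                       # load averages
    vmstat 5 3                   # CPU, memory scan rate, paging
    mpstat 5 3                   # per-processor detail, interrupts
  } >> $OUT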
Also, pay attention to error messages and clean up the errors
behind them. You ignore them at your peril -- the problem will snowball
and bury you at the worst possible time, or extraneous error messages
will mount until they hide a really important problem.
Communicate with Users
Veteran sys admins realize that they are there to make users'
lives easier. Those who are going to enjoy long sys admin careers
take this to heart, and make decisions and recommendations based
on usability, stability, and security. The antithesis of this rule
is the joke I used to tell as a university sys admin: "This university
job would be great if it wasn't for the dang students messing up
my systems". Good sys admins realize that this is a joke!
Be sure that the communication is truthful as well. If you are
stumped on a problem, let users know what you're doing to resolve
the situation and keep them updated as you make progress. Users
respect people who solve problems, even if the answer wasn't immediately
known. They don't respect the wrong answer, or obfuscation, or lack
of response. When users are left hanging, they tend to panic. Remember
this quote if you need to admit a lack of a solution. It's from
Thomas Alva Edison, who, when pressured for a solution, said: "I have
not failed. I've just found 10,000 ways that won't work."
Also remember that teamwork can overcome strong foes. Talk with
your fellow admins, bounce ideas around, share experiences, and
learn from each other. It will make you, them, and your systems
better.
Help Users Fix It Themselves
Helping users help themselves means that you can spend your time
on the hard and interesting problems. It also means fewer calls
on your off-hours and generally happier users. For example, if you
have reports that tell you about quota abuse (overly large mail
folders and the like), show users how to solve the problem, rather
than complaining to them about their abuse. Note that the counter
to this rule is that a little knowledge can be dangerous. Users
may think they understand the problem when they don't, or might
think they are solving the problem when they are actually making
it worse. Communication is again the key.
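For instance, a small report script along these lines (the spool
path and size limit are assumptions for illustration) can tell users
which mailboxes are oversized and point them at a cleanup how-to,
instead of the admin nagging each one by hand.

  #!/bin/sh
  # mail-usage.sh -- illustrative sketch: flag oversized mailboxes so
  # users can clean them up themselves.

  SPOOL=/var/mail               # assumed mbox-style spool directory
  LIMITK=51200                  # flag anything over ~50 MB (in KB)

  for f in $SPOOL/*
  do
    [ -f "$f" ] || continue
    size=`du -sk "$f" | awk '{print $1}'`
    if [ "$size" -gt "$LIMITK" ]; then
      echo "`basename $f`: mailbox is ${size} KB -- see the cleanup page"
    fi
  done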
Use Available Information
Remember to read the manuals, readme information, update documents,
and check for user group, news group, or vendor news information
on the Web. In general, research the problem and use the available
information before implementing new hardware or software. For example,
SunFire servers can be implemented in many ways (e.g., partitioned,
domained). Before making a decision you'll have to live with, do
your research to make sure it is the right one.
Also consider talking with the vendor and calling their technical
support. There is no harm in opening a service call to be sure that
the vendor agrees with your approach to a problem. Building a network
of contacts within a vendor is another excellent way to determine
best practices, to sanity-check decisions, and to get help when
you need it.
Know When to Use Strategy and When to Use Tactics
Sys admins must learn the difference between strategy and tactics
and learn the place for both. Being good at this requires experience.
Strategy means arranging the battlefield so that your chances of
winning are maximized -- possibly even without a fight. Tactics
mean hand-to-hand combat. You win by being good at both. An ounce
of strategy can be worth a couple of tons of tactics. But don't
be overly clever with strategy where tactics will do.
Another way to think of this rule is: "Do it the hard way and
get it over with." Too often admins try to do things the easy way,
or take a shortcut, only to have to redo everything anyway and do
it the "hard" way. (Note that the "hard way" may not be the vendor-documented
way.)
For example, you could just rush through a new install, manually
edit files, manually run configuration scripts, and so on. Alternatively,
you could use cfengine to automate a custom installation, such that
it can be repeated identically. Of course, there are times when
one or the other makes sense, but the best-practice-implementing
sys admin knows to make a conscious decision rather than letting
schedules or just plain laziness drive it.
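Whether the tool is cfengine or a plain shell script, the point is
that the customization is captured once and replayed identically on
every install. A minimal shell sketch, with assumed host and file
names, might look like this:

  #!/bin/sh
  # postinstall.sh -- illustrative sketch of a repeatable post-install
  # step: the same edits every time, instead of ad hoc manual changes.

  set -e                                 # stop on the first failure

  MASTER=/net/adminhost/export/config    # assumed central copy of standards

  for f in resolv.conf nsswitch.conf syslog.conf
  do
    if [ -f $MASTER/$f ]; then
      cp -p $MASTER/$f /etc/$f           # push the standard version
    fi
  done

  # Record what was done so the change is auditable later.
  echo "`date` postinstall.sh applied standard /etc files" >> /var/adm/local-changes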
All Projects Take Twice as Long as They Should
Project planning, even when done by experienced planners, misses
small but important steps that, at the least, delay projects and,
at the worst, can destroy the plan (or the resulting systems). The
most experienced planners pad the times they think each step will
take, doubling the length of a given plan. Then, when the plan proves
to be accurate, the planner is considered a genius. In reality,
the planner thought, "this step should take four days, so I'll put
down eight".
Another way that knowledge of this rule can be used to your advantage
is to announce big changes far earlier than you really need them
done. If you need to power off a data center or replace a critical
server by October 31, announce it for September. People will be
much more forthcoming about possible problems as the "deadline"
approaches. You can adjust to "push back" very diplomatically and
generously because your real deadline is not imperiled. (Of course,
it's also very important to be honest with your users to establish
trust, so be careful with this "Scotty" rule.)
It's Not Done Until It's Tested
Many sys admins like their work because of the technical challenges,
minute details, and creative processes involved in running systems.
The personality drawn to those challenges is typically not one that
is good at thoroughly testing a solution, and then retesting after
each change to a variable. Unfortunately, testing
is required for system assurance and good systems administration.
I shudder to think how much system (and administrator) time has
been wasted by incomplete testing. Note that this rule can translate
into the need for some sort of test environment. If the production
environment is important, there should be a test environment for
proof of concept work, practice of production procedures, and learning
and experimentation.
It's Not Done Until It's Documented
Documentation is never a favorite task. It's hard to do and even
harder to keep up to date, but it pays great dividends when done
properly. Note that documentation does not need to be in the form
of a novel. It can be, in part, the output of a script that captures
system configuration files, or a 'script' session that records status
command output. Another strategy -- document your systems administration
using basic HTML. (Netscape Composer is sufficient for this task.)
Documents can be stored remotely, and links can be included to point
to useful information. You can even burn a CD of the contents to
archive with the backups.
If you keep a system history this way, searching the documents
can help solve recurring problems. Frequently, admins waste time
working on problems that have previously been solved but not properly
documented.
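A hedged sketch of that kind of capture script follows; the
documentation tree, file list, and commands are examples rather than
a prescription.

  #!/bin/sh
  # sysdoc.sh -- illustrative sketch: snapshot configuration files and
  # status command output into a dated directory that can be linked
  # from the admin HTML pages or burned to CD with the backups.

  SNAP=/export/doc/`uname -n`/`date +%Y%m%d`   # assumed doc tree
  mkdir -p $SNAP

  # Copy the files most often consulted during debugging.
  for f in /etc/system /etc/vfstab /etc/nsswitch.conf /etc/resolv.conf
  do
    [ -f $f ] && cp -p $f $SNAP
  done

  # Capture status command output as plain text.
  uname -a   > $SNAP/uname.txt
  df -k      > $SNAP/df.txt
  prtconf    > $SNAP/prtconf.txt  2>&1
  showrev -p > $SNAP/patches.txt  2>&1
  pkginfo    > $SNAP/pkginfo.txt  2>&1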
Never Change Anything on Fridays
Whatever the cause (in a hurry to get home, goblins, gamma rays),
changes made before weekends (or vacations) frequently turn into
disaster. Do not tempt the fates -- just wait until there are a
couple of days in a row to make the change and monitor it. Some
admins even avoid making a change late in the day, preferring to
do so at the start of a day to allow more monitoring and debugging
time. Further, some don't like making changes on Mondays either,
as it tends to be a hectic, system-dependent kind of day.
Audit Before Edit
Before making any major changes to a system (e.g., hardware upgrades
or patches), review the system logs to make sure the system is
operating normally. If there were no problems with the system before
the change was made, any resulting errors are (probably) caused
by the change and nothing else. Consider how annoying it is to make
a change, check the logs, find an error, and debug the problem,
only to find that the error was from a pre-existing condition.
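The audit need not be elaborate. Commands along these lines, run by
hand before the maintenance window, will usually show whether the
system is already complaining (the patterns are just examples):

  # Look for recent complaints in the system log, the kernel message
  # buffer, and the per-device error counters.
  egrep -i "error|warn|panic|fail" /var/adm/messages | tail -50
  dmesg | tail -50
  iostat -En | grep -i "errors"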
Use Defaults Whenever Possible
I recall a conversation in which a client was trying to go outside
of the box, save some time and money, and produce a complex but
theoretically workable solution. My response was "there's such a
thing as being too clever". He continued to be clever, and the resulting
solution had problems, was difficult to debug, was difficult for
vendors to support, and was quite a bother.
In another example, some admins make changes for convenience that
end up making the system different from others. For example, some
still put files into /usr/local/bin, even though Sun has discouraged
that (and encouraged /opt/local) for many years. It may make the
users' lives easier, if that's where they expect to find files,
but other admins may be unpleasantly surprised when they encounter
nonstandard system configuration. (Note that the use of /opt versus
/usr/local is still a subject of debate. This is just an example.)
This rule (as with the others) can be violated with good cause.
For example, security sometimes can be increased by not using defaults.
Furthermore, make your own defaults -- standardize as much as
possible. Try to run the same release of Solaris on all machines,
with the same patch set, the same revisions of applications, and
the same hardware configurations. This is easier said than done,
as with most things, but it's a good goal, even if it is not 100%
attainable. In another example, you might decide not to use /opt/local,
but you might standardize on /usr/local for read-only local changes,
and /var/local for modifiable local changes. Documenting these per-site
defaults makes them even more useful and easier to manage.
With this in mind, isolate site- and system-specific changes.
Try to keep all nonstandard things in one place so it is easy to
manage them, move them, and to know that something is out of the
ordinary.
Always Be Able to Undo What You Are About to Do
Never move forward with something unless you are fully prepared
to return the server to the original starting point (e.g., make
images, back up and test, make sure you have the original CDs).
Back up the entire system when making systemic changes, such as
system upgrades or major application upgrades. Back up individual
files when making minor changes. Rather than deleting something
(a directory or application), try renaming or moving it first.
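In practice that can be as simple as the habits sketched below (the
file names are examples); for systemic changes, take and verify a
full backup first, for instance with ufsdump and ufsrestore, before
touching anything.

  # Copy a file before editing it, with a date stamp in the name.
  cp -p /etc/system /etc/system.`date +%Y%m%d`

  # Rename or move a directory instead of deleting it outright; only
  # after everything has been stable for a while, remove the copy.
  mv /opt/oldapp /opt/oldapp.retired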
Do Not Spoil Management
This falls under the category of easier said than done, unfortunately.
Management can have bad data and make bad decisions, or can even
have good data and make bad decisions. You as the admin usually
have to live with (and suffer with) these decisions, so, with reason
and data, encourage the correct decision. Even if you lose the battle,
you can always say "I told you so" and feel good about yourself
(while looking for a new job).
Good points from a reader:
Sometimes management doesn't want to spend money on correct solutions,
and sys admins respond by developing complicated, hacked-together,
cheap solutions just to get by. (Backup systems come to mind.) In
some cases, they will use their own personal equipment when the
company won't buy something. Unfortunately, by spoiling management
in this way, the sys admin prevents them from understanding the
true costs of their decisions. From management's point of view,
things are now working fine. Then the sys admin moves on, taking
his or her equipment, and the new guy has to face the attitude of
"the last guy didn't need all that stuff". Once a decision is made,
live with it. Don't sabotage it, but don't go out and buy your own
stuff or slap together something unsupportable. If it was the wrong
decision, everyone will find out eventually, and it will be easier
to get things done right. Just do the best job you can with what
you have to work with, and be sure to document.
If You Haven't Seen It Work, It Probably Doesn't
Also known as "the discount of marketing". Products are announced,
purchased, and installed with the expectation that they will perform
as advertised, and sometimes they actually do. Most of the time,
they are over-promised and under-delivered.
If You're Fighting Fires, Find the Sources
I would posit that thousands of admin-lives have been wasted fighting
computer system "fires" instead of fighting the causes of those
fires. To avoid this trap, you must automate what can be automated,
especially monitoring and log checking. Automation can free up time
to allow you to make progress on other projects, rather than spending
time on tedious, repetitive work.
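Dedicated tools such as swatch do this far more thoroughly, but even
a small cron-driven sketch like the following (the paths, patterns,
and mail alias are assumptions) mails only the new, interesting log
lines so that nobody has to read raw logs by hand:

  #!/bin/sh
  # logwatch.sh -- illustrative sketch of an automated log check:
  # mail only new, interesting lines from the system log.

  LOG=/var/adm/messages
  STATE=/var/adm/logwatch.offset     # how many lines were already seen
  ADMIN=root                         # assumed admin mail alias
  TMP=/tmp/logwatch.$$

  LAST=`cat $STATE 2>/dev/null`
  [ -z "$LAST" ] && LAST=0
  NOW=`wc -l < $LOG | awk '{print $1}'`
  [ "$NOW" -lt "$LAST" ] && LAST=0   # log was rotated; start over

  tail +`expr $LAST + 1` $LOG | egrep -i "error|warn|panic|fail" > $TMP

  if [ -s $TMP ]; then
    mailx -s "`uname -n`: log exceptions" $ADMIN < $TMP
  fi

  echo $NOW > $STATE
  rm -f $TMP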
If You Don't Understand It, Don't Play with It on Production
Systems
In my university days, we had a student programmer look at the
key switch on our main production Sun server, wonder what it did,
and turn it to find out. It would have been better had he asked
someone first, or at least tried the switch on a test system -- and
the same goes for just about everything admin-related.
If It Can Be Accidentally Used, and Can Produce Bad Consequences,
Protect It
For example, if there is a big red power button at shoulder height
on a wall that is just begging to be leaned against, protect it.
Otherwise, the power could be lost in the entire workstation lab
and lots of students could lose their work (not that such a thing
happened on the systems I was managing...). This rule can be extrapolated
to just about everything users have access to -- hardware or software.
For instance, if it's a dangerous command or procedure, wrap it
in a protective script.
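A hedged sketch of such a wrapper follows; the confirmation style,
log file, and shutdown arguments are examples and would vary by site.

  #!/bin/sh
  # safe-shutdown.sh -- illustrative wrapper around a dangerous
  # command: force a pause and an explicit confirmation first.

  HOST=`uname -n`

  echo "You are about to shut down $HOST."
  echo "Type the hostname to confirm, anything else to abort:"
  read answer

  if [ "$answer" = "$HOST" ]; then
    echo "`date` shutdown requested by `logname`" >> /var/adm/local-changes
    /usr/sbin/shutdown -y -g 60 -i 5 "Going down in 60 seconds"
  else
    echo "Aborted."
    exit 1
  fi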
Ockham's Razor Is Very Sharp Indeed
William of Ockham (or Occam) advanced humanity immensely when
he devised his "razor", or the "law of parsimony", which says that
the simplest of two or more competing theories is the preferable
one.
Checking the simple and obvious stuff first can save a lot of
time and aggravation. An example is the user who calls and says,
"I can't log in anymore." Rather than reset the password or, worse
yet, tear into the NIS server, just ask whether the user has the
"Caps Lock" key on. Note that to properly execute Ockham's Razor,
you must start with no preconceived notions. Assume nothing, gather
data, test your hypothesis, and then decide on the problem and the
solution.
An Ockham's Razor corollary: Never attribute to malice what can
be explained by sheer idiot stupidity. Frequently, implementation
and management complexity is unnecessary and results from "too clever"
systems administration. This subtle rule has wide-ranging ramifications.
Many times, a sys admin is in an untenable position (debugging a
problem, transitioning, upgrading, and so on) because of too much
cleverness in the past (their own or someone else's).
The Last Change Is the Most Suspicious
When debugging a problem, concentrate on the last change made
to the system or the environment, and work backwards through other
recent changes. For example, if a system blew a power supply, consider
that the new UPS might have caused it. Or perhaps the system is
failing to boot because of the changes to the startup scripts that
were made previously. Or perhaps the system performance is now poor
because the root disk was improperly swapped when it went bad, and
the mirroring software is having difficulty recovering. There are
many cases when the last change, no matter how innocuous or seemingly
unrelated, caused the problem that was the target of debugging efforts.
Begin debugging by looking at the latest change(s) with a jaded
eye, before looking at other possible causes.
When in Doubt, Reboot
As silly and hackneyed as it sounds, this is an important rule.
It's also the most controversial, with much argument on both sides.
Some argue that it's amateur to reboot; others talk of all the time
saved and problems solved by rebooting.
Of course, sometimes a sys admin doesn't have the luxury to reboot
systems every time there is a problem. For every time a corrupted
jumbo patch was the culprit and a reboot solved the problem, there
is a time that rebooting over and over proved nothing and some real
detective work was needed to resolve a problem.
Here's an example from a reader:
As a support technician, I get called when the sys admins haven't
solved the problem themselves and, every time I can, I start and
end my job with a reboot. It's wonderful how many problems you find
when you reboot a poorly administered machine.
As a counterexample to this rule, consider this from a reader:
I don't like the "When in Doubt, Reboot" rule. It fosters sloppy
systems management and gives a misperception to others about where
the real causes lie. Recently, management wanted me to look at a
system they claimed had operating system problems, in that they had
to reboot it 2-3 times a week, often in prime time. The perception
was that it was an OS vendor issue, and although adding 4 GB ($$$) of memory
helped, it still had problems. The base cause was determined to
be poor programming of a middleware product that allocated IPC memory
but didn't free it. A second issue was that, when the product hung,
analysts wrote a script to use "kill -9" (no cleanup) to remove
the processes from memory and thus not run the cleanup routines.
This left IPC resources in a sad state. Once the system ran out
of shared memory, the applications stopped working. But management
was ready to buy even more memory, which would have done little.
With the real problems identified, and patched applications in place,
the system now has 189 days on it without a reboot.
So, as with many of these rules, experience is a guide. Knowing
when to reboot -- and when not to -- is one characteristic of a
good, experienced admin.
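The reader's example is also a reminder of the kind of detective
work that avoids a reboot: on Solaris, leaked System V IPC resources
can be inspected with ipcs and, once their owning processes are
confirmed gone, removed with ipcrm. The identifiers below are made
up for illustration.

  # Show shared memory segments, semaphores, and message queues,
  # with their owners and creators (output varies by release).
  ipcs -a

  # Remove orphaned resources by identifier -- double-check the IDs
  # first, since removal is immediate and unconditional.
  ipcrm -m 1203     # shared memory segment ID (example)
  ipcrm -s 17       # semaphore set ID (example)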
If It Ain't Broke, Don't Fix It
It's amazing that clichés are often true. When I recall
all the time I've wasted making just one last change, or one small
improvement, only to cause irreparable harm resulting in prolonged
debugging (or an operating system reinstallation), I wish this rule
were tattooed somewhere obvious. The beauty of this rule is that
it works in real life, not just for systems administration.
Save Early and Often
If you added up all the hours wasted because of data lost due
to program and system crashes, it would be a very large number.
Saving early and often reduces the amount of loss when a problem
occurs. I've heard a story that Bill Joy was adding a lot of features
to his program, the vi editor. It had multiple windows and a bunch
of good stuff. The disk crashed and he lost the changes. Frustrated,
he didn't repeat the changes, leaving us with a much less useful
vi. Don't let this happen to you!
Dedicate a System Disk
So much depends on the system disk that it is worth keeping the
disk only for system use. Swapping it then becomes possible without
affecting users. Performing upgrades, cloning, and mirroring are
all easier as well. Performance improves if the system disk is not
used for other purposes.
Have a Plan
Develop a written task list for important tasks. Develop it during
initial testing and use/refine it as the changes are applied to
increasingly critical servers. By the time you get to production,
you typically have a short maintenance window and avoiding mistakes
is critical. A finely tuned plan can make all the difference, not
to mention that the next time (and the time after that) you will
perform a similar task, you will already have a template for a new
task list.
Cables and Connectors Can Go Bad
Although there are no expiration dates on cables and connectors,
they can go bad over time. Frequently, this happens when they are
components in a larger system change. For example, when new I/O
boards are added to a system and cables are moved to these new connections,
the previously working devices seem to stop working. In some cases,
a cable or connector is the problem, not the new I/O board or the
old device. Problems like these are especially common when cables
are pulled too tightly, wound up, strapped down, and otherwise made
to look "neat". Also, take a look under the raised floor and see
what life is like down there.
Mind the Power
Be sure to match power supplied with power drawn and not to weaken
a power infrastructure with low-end power solutions. Check the
system's power specifications (frequently these are maximum numbers,
assuming fully populated systems) and make sure to provide at least
that. Also,
understand the kind of power connections that system needs. This
involves not only the connectors and phases, but also the kinds
of sources to use for multiple-power-source systems. Some systems
allow N+1 power with multiple connections, but the connections must
be in sync (from the same side of the UPS, for example). Others
allow or require power to come from multiple sources (before
and after the UPS). Grounding is also important, especially when
multiple systems are attached to each other or to shared storage.
All of these aspects can make or break the reliability of a computing
facility. In one example, a site had a flaky cluster with multiple
hardware failures, and they found that improper grounding had caused
many weeks of problems. Remember, management only recalls that the
server failed; they don't care about the details.
Try Before You Buy
Every compute site is different, every computer user is different,
and every system is at least slightly different. That makes choosing
correct and appropriate hardware, software, and services exceptionally
challenging. The old saw of "try before you buy" is therefore more
often true in computing than in other aspects of life. Always check
reference accounts, talk to other users and sys admins, and if you're
still in doubt about the ability of the tool to meet your needs,
try before you buy! When evaluating solutions, consider functionality
for the users, cost, reliability, scalability, maintainability,
backup, recovery, and security.
Don't Panic and Have Fun
Rash decisions usually turn out to be bad ones. Logic, reason,
testing, deduction, and repeatability -- these techniques usually
result in solved problems and smooth-running systems. And in spite
of the complexity of the job, and the pressure that can result from
the inherent responsibilities, try to enjoy your work and your co-workers.
Final Pearls of Wisdom
- Keep your propagation constant less than 1. (This comes from
nuclear reactor physics. A reactor with a propagation constant
less than 1 is a generator. More than 1 is a warhead. Basically,
don't let things get out of control.)
- Everything works in front of the salesman.
- Don't cross the streams (Ghostbusters reference -- heed safety
tips).
- If at first you don't succeed, blame the compiler.
- If you finish a project early, the scope will change to render
your work meaningless before the due date.
- If someone is trying to save your life, cooperate.
- Never beam down to the planet while wearing a red shirt (Star
Trek reference -- don't go looking for trouble).
- Learning from your mistakes is good. Learning from someone
else's mistakes is better.
- The fact that something should have worked does not change
the fact that it didn't.
- The customer isn't always right, but he pays the bills.
- Flattery is flattery, but chocolate gets results.
- When dealing with an enigmatic symptom, whether it's an obscure
application or database error, or a system "hanging": the hardware
is always guilty until proven innocent.
- Use only standard, cross-platform file formats to share documentation
(e.g., ASCII files, HTML, or PDF).
- Keep a log file on every computer to record every change you make.
- Share your knowledge and keep no secrets.
- Don't reinvent the wheel, but be creative.
- If you can't live without it, print a hardcopy.
- Always know where your software licenses are.
- Always know where your installation CDs/DVDs/tapes are.
- The question you ask as a sys admin is not "Are you paranoid?";
it's "Are you paranoid enough?"
Acknowledgements
Many thanks to the following folks for contributing to this document:
Stewart Dean, Ken Stone, Art Kufeldt, Juan Pablo Sanchez Beltrán,
Pete Tamas, Christopher Jones, Leslie Walker, Dave Powell, Mike
Zinni, Peggy Fenner, John Kelly, Lola Brown, Chris Gait, Timothy
R. Geier, Michael McConnell, David J. DeWolfe, Christopher Corayer,
Tarjei Jensen, John Petrella, Daniel Hyatt, Skeezics Boondoggle,
Dave Anderson, Jaime Cardoso, Joel Andrews, Dan Wendlick, Christopher
Vera, Jim Mathieu, Bruce Kirkland, Bob Barton, David Meissner, Gary
L. Smith, Francisco Mancardi, Keith Dowsett, and Sue Spoddig.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of the Operating
System Concepts and Applied Operating System Concepts
textbooks. As a consultant and trainer, Peter has taught tutorials
and given talks on security and systems administration worldwide.