Solaris(TM)
Administration Best Practices
Peter Baer Galvin
The content of this "Best Practices" column has evolved over time.
Thanks to reader feedback and some new real-world experience, I've
updated it to be more complete. I hope this information will be
useful to both experienced and novice administrators, and that this
document will continue to evolve and grow as more administrators
contribute their wisdom.
"Consensus" and "systems administration" are words rarely used
near each other. Systems administration is a journeyman's trade,
with knowledge earned through hard work and hard-earned experience.
Every sys admin performs his or her work slightly (or not so
slightly) differently from colleagues. Some of that variation is
caused by personal choice, some by superstition, and some by differing
knowledge sets. And unfortunately, the saying that used to appear
on Sys Admin t-shirts -- "It's a dirty job, but somebody said I had
to do it" -- is too often true. However, the more these best
practices are applied, the more similarly systems will be managed,
and the more stable, usable, and manageable they will become. Thus
arises the need for a Best Practices document. Note that most of
the information here applies to all operating systems, and
sometimes to the real world as well.
Solaris Administration Best Practices, Version 2.0
This document is the result of input from many of the top administrators
in the Solaris community. Please help your fellow sys admins (as
they have helped you) by contributing your experience to this document.
Email me at: bestpractice@petergalvin.org.
Keep an Eye Peeled and a Wall at Your Back
The best way to start to debug problems is to know how your systems
run when there are no problems. Then, when a problem occurs, it
can be easily discerned. For example, if you aren't familiar with
the normal values of system statistics (CPU load, number of interrupts,
memory scan rates, and so on), determining that one value is unusual
will be impossible. Devise trip wires and reports to help detect
when something unusual is happening. Swatch and netsaint are good
programs for this purpose.
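As a minimal sketch of building that baseline, a cron-driven shell
script along the following lines (the paths and intervals are only
illustrative) appends a few routine snapshots to a dated file; after
a week or two of output, the "normal" numbers are easy to recognize.

  #!/bin/sh
  # baseline.sh -- illustrative only: record routine performance
  # numbers so that "normal" is known before anything breaks.
  # Example crontab entry:  0 * * * * /opt/local/adm/baseline.sh

  DIR=/var/adm/baseline          # assumed local convention for output
  OUT=$DIR/stats.`date +%Y%m%d`

  mkdir -p $DIR

  {
    echo "==== `date` ===="
    uptime                       # load averages
    vmstat 5 3                   # CPU, memory scan rate, paging
    mpstat 5 3                   # per-processor detail, interrupts
  } >> $OUT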
Also, pay attention to error messages and clean up the errors
behind them. You ignore them at your peril -- the problem will snowball
and bury you at the worst possible time, or extraneous error messages
will mount until they hide a really important problem.
Communicate with Users
Veteran sys admins realize that they are there to make users'
lives easier. Those who are going to enjoy long sys admin careers
take this to heart, and make decisions and recommendations based
on usability, stability, and security. The antithesis of this rule
is the joke I used to tell as a university sys admin: "This university
job would be great if it wasn't for the dang students messing up
my systems". Good sys admins realize that this is a joke!
Be sure that the communication is truthful as well. If you are
stumped on a problem, let users know what you're doing to resolve
the situation and keep them updated as you make progress. Users
respect people who solve problems, even if the answer wasn't immediately
known. They don't respect the wrong answer, or obfuscation, or lack
of response. When users are left hanging, they tend to panic. Remember
this quote if you need to admit a lack of a solution. It's from
Thomas Alva Edison, who, when pressured for a solution, said: "I have
not failed. I've just found 10,000 ways that won't work."
Also remember that teamwork can overcome strong foes. Talk with
your fellow admins, bounce ideas around, share experiences, and
learn from each other. It will make you, them, and your systems
better.
Help Users Fix It Themselves
Helping users help themselves means that you can spend your time
on the hard and interesting problems. It also means fewer calls
on your off-hours and generally happier users. For example, if you
have reports that tell you about quota abuse (overly large mail
folders and the like), show users how to solve the problem, rather
than complaining to them about their abuse. Note that the counter
to this rule is that a little knowledge can be dangerous. Users
may think they understand the problem when they don't, or might
think they are solving the problem when they are actually making
it worse. Communication is again the key.
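For instance, a small report script along these lines (the spool
path and size limit are assumptions for illustration) can tell users
which mailboxes are oversized and point them at a cleanup how-to,
instead of the admin nagging each one by hand.

  #!/bin/sh
  # mail-usage.sh -- illustrative sketch: flag oversized mailboxes so
  # users can clean them up themselves.

  SPOOL=/var/mail               # assumed mbox-style spool directory
  LIMITK=51200                  # flag anything over ~50 MB (in KB)

  for f in $SPOOL/*
  do
    [ -f "$f" ] || continue
    size=`du -sk "$f" | awk '{print $1}'`
    if [ "$size" -gt "$LIMITK" ]; then
      echo "`basename $f`: mailbox is ${size} KB -- see the cleanup page"
    fi
  done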
Use Available Information
Remember to read the manuals, readme information, update documents,
and check for user group, news group, or vendor news information
on the Web. In general, research the problem and use the available
information before implementing new hardware or software. For example,
SunFire servers can be implemented in many ways (e.g., partitioned,
domained). Before making a decision you'll have to live with, do
your research to make sure it is the right one.
Also consider talking with the vendor and calling their technical
support. There is no harm in opening a service call to be sure that
the vendor agrees with your approach to a problem. Building a network
of contacts within a vendor is another excellent way to determine
best practices, to sanity-check decisions, and to get help when
you need it.
Know When to Use Strategy and When to Use Tactics
Sys admins must learn the difference between strategy and tactics
and learn the place for both. Being good at this requires experience.
Strategy means arranging the battlefield so that your chances of
winning are maximized -- possibly even without a fight. Tactics
mean hand-to-hand combat. You win by being good at both. An ounce
of strategy can be worth a couple of tons of tactics. But don't
be overly clever with strategy where tactics will do.
Another way to think of this rule is: "Do it the hard way and
get it over with." Too often admins try to do things the easy way,
or take a shortcut, only to have to redo everything anyway and do
it the "hard" way. (Note that the "hard way" may not be the vendor-documented
way.)
For example, you could just rush through a new install, manually
edit files, manually run configuration scripts, and so on. Alternatively,
you could use cfengine to automate a custom installation, such that
it can be repeated identically. Of course, there are times when
one or the other makes sense, but the best-practice-implementing
sys admin knows to make a conscious decision rather than letting
schedules or just plain laziness drive it.
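Whether the tool is cfengine or a plain shell script, the point is
that the customization is captured once and replayed identically on
every install. A minimal shell sketch, with assumed host and file
names, might look like this:

  #!/bin/sh
  # postinstall.sh -- illustrative sketch of a repeatable post-install
  # step: the same edits every time, instead of ad hoc manual changes.

  set -e                                 # stop on the first failure

  MASTER=/net/adminhost/export/config    # assumed central copy of standards

  for f in resolv.conf nsswitch.conf syslog.conf
  do
    if [ -f $MASTER/$f ]; then
      cp -p $MASTER/$f /etc/$f           # push the standard version
    fi
  done

  # Record what was done so the change is auditable later.
  echo "`date` postinstall.sh applied standard /etc files" >> /var/adm/local-changes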
All Projects Take Twice as Long as They Should
Project planning, even when done by experienced planners, misses
small but important steps that, at the least, delay projects and,
at the worst, can destroy the plan (or the resulting systems). The
most experienced planners pad the times they think each step will
take, doubling the length of a given plan. Then, when the plan proves
to be accurate, the planner is considered a genius. In reality,
the planner thought, "this step should take four days, so I'll put
down eight".
Another way that knowledge of this rule can be used to your advantage
is to announce big changes far earlier than you really need them
done. If you need to power off a data center or replace a critical
server by October 31, announce it for September. People will be
much more forthcoming about possible problems as the "deadline"
approaches. You can adjust to "push back" very diplomatically and
generously because your real deadline is not imperiled. (Of course,
it's also very important to be honest with your users to establish
trust, so be careful with this "Scotty" rule.)
It's Not Done Until It's Tested
Many sys admins like their work because of the technical challenges,
minute details, and creative processes involved in running systems.
The personality drawn to those challenges is typically not one that
is good at thoroughly testing a solution, and then retesting after
each change to a variable. Unfortunately, testing
is required for system assurance and good systems administration.
I shudder to think how much system (and administrator) time has
been wasted by incomplete testing. Note that this rule can translate
into the need for some sort of test environment. If the production
environment is important, there should be a test environment for
proof of concept work, practice of production procedures, and learning
and experimentation.
It's Not Done Until It's Documented
Documentation is never a favorite task. It's hard to do and even
harder to keep up to date, but it pays great dividends when done
properly. Note that documentation does not need to be in the form
of a novel. It can be, in part, the output of a script that captures
system configuration files, or a 'script' session that records status
command output. Another strategy -- document your systems administration
using basic HTML. (Netscape Composer is sufficient for this task.)
Documents can be stored remotely, and links can be included to point
to useful information. You can even burn a CD of the contents to
archive with the backups.
If you keep a system history this way, searching the documents
can help solve recurring problems. Frequently, admins waste time
working on problems that have previously been solved but not properly
documented.
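A hedged sketch of that kind of capture script follows; the
documentation tree, file list, and commands are examples rather than
a prescription.

  #!/bin/sh
  # sysdoc.sh -- illustrative sketch: snapshot configuration files and
  # status command output into a dated directory that can be linked
  # from the admin HTML pages or burned to CD with the backups.

  SNAP=/export/doc/`uname -n`/`date +%Y%m%d`   # assumed doc tree
  mkdir -p $SNAP

  # Copy the files most often consulted during debugging.
  for f in /etc/system /etc/vfstab /etc/nsswitch.conf /etc/resolv.conf
  do
    [ -f $f ] && cp -p $f $SNAP
  done

  # Capture status command output as plain text.
  uname -a   > $SNAP/uname.txt
  df -k      > $SNAP/df.txt
  prtconf    > $SNAP/prtconf.txt  2>&1
  showrev -p > $SNAP/patches.txt  2>&1
  pkginfo    > $SNAP/pkginfo.txt  2>&1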
Never Change Anything on Fridays
Whatever the cause (in a hurry to get home, goblins, gamma rays),
changes made before weekends (or vacations) frequently turn into
disaster. Do not tempt the fates -- just wait until there are a
couple of days in a row to make the change and monitor it. Some
admins even avoid making a change late in the day, preferring to
do so at the start of a day to allow more monitoring and debugging
time. Further, some don't like making changes on Mondays either,
as it tends to be a hectic, system-dependent kind of day.
Audit Before Edit
Before making any major changes to a system (e.g., hardware upgrades
or patches), review the system logs to make sure the system is
operating normally. If there were no problems with the system before
the change was made, any resulting errors are (probably) caused
by the change and nothing else. Consider how annoying it is to make
a change, check the logs, find an error, and debug the problem,
only to find that the error was from a pre-existing condition.
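The audit need not be elaborate. Commands along these lines, run by
hand before the maintenance window, will usually show whether the
system is already complaining (the patterns are just examples):

  # Look for recent complaints in the system log, the kernel message
  # buffer, and the per-device error counters.
  egrep -i "error|warn|panic|fail" /var/adm/messages | tail -50
  dmesg | tail -50
  iostat -En | grep -i "errors"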
Use Defaults Whenever Possible
I recall a conversation in which a client was trying to go outside
of the box, save some time and money, and produce a complex but
theoretically workable solution. My response was "there's such a
thing as being too clever". He continued to be clever, and the resulting
solution had problems, was difficult to debug, was difficult for
vendors to support, and was quite a bother.
In another example, some admins make changes for convenience that
end up making the system different from others. For example, some
still put files into /usr/local/bin, even though Sun has discouraged
that (and encouraged /opt/local) for many years. It may make the
users' lives easier, if that's where they expect to find files,
but other admins may be unpleasantly surprised when they encounter
nonstandard system configuration. (Note that the use of /opt versus
/usr/local is still a subject of debate. This is just an example.)
This rule (as with the others) can be violated with good cause.
For example, security sometimes can be increased by not using defaults.
Furthermore, make your own defaults -- standardize as much as
possible. Try to run the same release of Solaris on all machines,
with the same patch set, the same revisions of applications, and
the same hardware configurations. This is easier said than done,
as with most things, but it's a good goal, even if it is not 100%
attainable. In another example, you might decide not to use /opt/local,
but you might standardize on /usr/local for read-only local changes,
and /var/local for modifiable local changes. Documenting these per-site
defaults makes them even more useful and easier to manage.
With this in mind, isolate site- and system-specific changes.
Try to keep all nonstandard things in one place so it is easy to
manage them, move them, and to know that something is out of the
ordinary.
Always Be Able to Undo What You Are About to Do
Never move forward with something unless you are fully prepared
to return the server to the original starting point (e.g., make
images, back up and test, make sure you have the original CDs).
Back up the entire system when making systemic changes, such as
system upgrades or major application upgrades. Back up individual
files when making minor changes. Rather than deleting something
(a directory or application), try renaming or moving it first.
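In practice that can be as simple as the habits sketched below (the
file names are examples); for systemic changes, take and verify a
full backup first, for instance with ufsdump and ufsrestore, before
touching anything.

  # Copy a file before editing it, with a date stamp in the name.
  cp -p /etc/system /etc/system.`date +%Y%m%d`

  # Rename or move a directory instead of deleting it outright; only
  # after everything has been stable for a while, remove the copy.
  mv /opt/oldapp /opt/oldapp.retired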
Do Not Spoil Management
This falls under the category of easier said than done, unfortunately.
Management can have bad data and make bad decisions, or can even
have good data and make bad decisions. You as the admin usually
have to live with (and suffer with) these decisions, so, with reason
and data, encourage the correct decision. Even if you lose the battle,
you can always say "I told you so" and feel good about yourself
(while looking for a new job).
Good points from a reader:
Sometimes management doesn't want to spend money on correct solutions,
and sys admins respond by developing complicated, hacked-together,
cheap solutions just to get by. (Backup systems come to mind.) In
some cases, they will use their own personal equipment when the
company won't buy something. Unfortunately, by spoiling management
in this way, the sys admin prevents them from understanding the
true costs of their decisions. From management's point of view,
things are now working fine. Then the sys admin moves on, taking
his or her equipment, and the new guy has to face the attitude of
"the last guy didn't need all that stuff". Once a decision is made,
live with it. Don't sabotage it, but don't go out and buy your own
stuff or slap together something unsupportable. If it was the wrong
decision, everyone will find out eventually, and it will be easier
to get things done right. Just do the best job you can with what
you have to work with, and be sure to document.
If You Haven't Seen It Work, It Probably Doesn't
Also known as "the discount of marketing". Products are announced,
purchased, and installed with the expectation that they will perform
as advertised, and sometimes they actually do. Most of the time,
they are over-promised and under-delivered.
If You're Fighting Fires, Find the Sources
I would posit that thousands of admin-lives have been wasted fighting
computer system "fires" instead of fighting the causes of those
fires. To avoid this trap, you must automate what can be automated,
especially monitoring and log checking. Automation can free up time
to allow you to make progress on other projects, rather than spending
time on tedious, repetitive work.
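Dedicated tools such as swatch do this far more thoroughly, but even
a small cron-driven sketch like the following (the paths, patterns,
and mail alias are assumptions) mails only the new, interesting log
lines so that nobody has to read raw logs by hand:

  #!/bin/sh
  # logwatch.sh -- illustrative sketch of an automated log check:
  # mail only new, interesting lines from the system log.

  LOG=/var/adm/messages
  STATE=/var/adm/logwatch.offset     # how many lines were already seen
  ADMIN=root                         # assumed admin mail alias
  TMP=/tmp/logwatch.$$

  LAST=`cat $STATE 2>/dev/null`
  [ -z "$LAST" ] && LAST=0
  NOW=`wc -l < $LOG | awk '{print $1}'`
  [ "$NOW" -lt "$LAST" ] && LAST=0   # log was rotated; start over

  tail +`expr $LAST + 1` $LOG | egrep -i "error|warn|panic|fail" > $TMP

  if [ -s $TMP ]; then
    mailx -s "`uname -n`: log exceptions" $ADMIN < $TMP
  fi

  echo $NOW > $STATE
  rm -f $TMP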
If You Don't Understand It, Don't Play with It on Production
Systems
In my university days, we had a student programmer look at the
key switch on our main production Sun server, wonder what it did,
and turn it to find out. It would have been better had he asked
someone first, or at least tried the switch on a test system -- and
the same goes for just about everything admin-related.
If It Can Be Accidentally Used, and Can Produce Bad Consequences,
Protect It
For example, if there is a big red power button at shoulder height
on a wall that is just begging to be leaned against, protect it.
Otherwise, the power could be lost in the entire workstation lab
and lots of students could lose their work (not that such a thing
happened on the systems I was managing...). This rule can be extrapolated
to just about everything users have access to -- hardware or software.
For instance, if it's a dangerous command or procedure, wrap it
in a protective script.
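A hedged sketch of such a wrapper follows; the confirmation style,
log file, and shutdown arguments are examples and would vary by site.

  #!/bin/sh
  # safe-shutdown.sh -- illustrative wrapper around a dangerous
  # command: force a pause and an explicit confirmation first.

  HOST=`uname -n`

  echo "You are about to shut down $HOST."
  echo "Type the hostname to confirm, anything else to abort:"
  read answer

  if [ "$answer" = "$HOST" ]; then
    echo "`date` shutdown requested by `logname`" >> /var/adm/local-changes
    /usr/sbin/shutdown -y -g 60 -i 5 "Going down in 60 seconds"
  else
    echo "Aborted."
    exit 1
  fi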
Ockham's Razor Is Very Sharp Indeed
William of Ockham (or Occam) advanced humanity immensely when
he devised his "razor", or the "law of parsimony", which says that
the simplest of two or more competing theories is the preferable
one.
Checking the simple and obvious stuff first can save a lot of
time and aggravation. An example is the user who calls and says,
"I can't log in anymore." Rather than reset the password or, worse
yet, tear into the NIS server, just ask whether the user has the
"Caps Lock" key on. Note that to properly execute Ockham's Razor,
you must start with no preconceived notions. Assume nothing, gather
data, test your hypothesis, and then decide on the problem and the
solution.
An Ockham's Razor corollary: Never attribute to malice what can
be explained by sheer idiot stupidity. Frequently, implementation
and management complexity is unnecessary and results from "too clever"
systems administration. This subtle rule has wide-ranging ramifications.
Many times, a sys admin is in an untenable position (debugging a
problem, transitioning, upgrading, and so on) because of too much
cleverness in the past (their own or someone else's).
The Last Change Is the Most Suspicious
When debugging a problem, concentrate on the last change made
to the system or the environment, and work backwards through other
recent changes. For example, if a system blew a power supply, consider
that the new UPS might have caused it. Or perhaps the system is
failing to boot because of the changes to the startup scripts that
were made previously. Or perhaps the system performance is now poor
because the root disk was improperly swapped when it went bad, and
the mirroring software is having difficulty recovering. There are
many cases when the last change, no matter how innocuous or seemingly
unrelated, caused the problem that was the target of debugging efforts.
Begin debugging by looking at the latest change(s) with a jaded
eye, before looking at other possible causes.
When in Doubt, Reboot
As silly and hackneyed as it sounds, this is an important rule.
It's also the most controversial, with much argument on both sides.
Some argue that it's amateur to reboot; others talk of all the time
saved and problems solved by rebooting.
Of course, sometimes a sys admin doesn't have the luxury to reboot
systems every time there is a problem. For every time a corrupted
jumbo patch was the culprit and a reboot solved the problem, there
is a time that rebooting over and over proved nothing and some real
detective work was needed to resolve a problem.
Here's an example from a reader:
As a support technician, I get called when the sys admins haven't
solved the problem themselves and, every time I can, I start and
end my job with a reboot. It's wonderful how many problems you find
when you reboot a poorly administered machine.
As a counterexample to this rule, consider this from a reader:
I don't like the "When in Doubt, Reboot" rule. It fosters sloppy
systems management and gives a misperception to others about where
the real causes lie. Recently, management wanted me to look at a
system they claimed had operating system problems, in that they had
to reboot it 2-3 times a week, often in prime time. The perception
was that it was an OS vendor issue, and although adding 4 GB ($$$) of memory
helped, it still had problems. The base cause was determined to
be poor programming of a middleware product that allocated IPC memory
but didn't free it. A second issue was that, when the product hung,
analysts wrote a script to use "kill -9" (no cleanup) to remove
the processes from memory and thus not run the cleanup routines.
This left IPC resources in a sad state. Once the system ran out
of shared memory, the applications stopped working. But management
was ready to buy even more memory, which would have done little.
With the real problems identified, and patched applications in place,
the system now has 189 days on it without a reboot.
So, as with many of these rules, experience is a guide. Knowing
when to reboot -- and when not to -- is one characteristic of a
good, experienced admin.
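The reader's example is also a reminder of the kind of detective
work that avoids a reboot: on Solaris, leaked System V IPC resources
can be inspected with ipcs and, once their owning processes are
confirmed gone, removed with ipcrm. The identifiers below are made
up for illustration.

  # Show shared memory segments, semaphores, and message queues,
  # with their owners and creators (output varies by release).
  ipcs -a

  # Remove orphaned resources by identifier -- double-check the IDs
  # first, since removal is immediate and unconditional.
  ipcrm -m 1203     # shared memory segment ID (example)
  ipcrm -s 17       # semaphore set ID (example)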
If It Ain't Broke, Don't Fix It
It's amazing that clichés are often true. When I recall
all the time I've wasted making just one last change, or one small
improvement, only to cause irreparable harm resulting in prolonged
debugging (or an operating system reinstallation), I wish this rule
were tattooed somewhere obvious. The beauty of this rule is that
it works in real life, not just for systems administration.
Save Early and Often
If you added up all the hours wasted because of data lost due
to program and system crashes, it would be a very large number.
Saving early and often reduces the amount of loss when a problem
occurs. I've heard a story that Bill Joy was adding a lot of features
to his program, the vi editor. It had multiple windows and a bunch
of good stuff. The disk crashed and he lost the changes. Frustrated,
he didn't repeat the changes, leaving us with a much less useful
vi. Don't let this happen to you!
Dedicate a System Disk
So much depends on the system disk that it is worth keeping the
disk only for system use. Swapping it then becomes possible without
affecting users. Performing upgrades, cloning, and mirroring are
all easier as well. Performance improves if the system disk is not
used for other purposes.
Have a Plan
Develop a written task list for important tasks. Develop it during
initial testing and use/refine it as the changes are applied to
increasingly critical servers. By the time you get to production,
you typically have a short maintenance window and avoiding mistakes
is critical. A finely tuned plan can make all the difference, not
to mention that the next time (and the time after that) you will
perform a similar task, you will already have a template for a new
task list.
Cables and Connectors Can Go Bad
Although there are no expiration dates on cables and connectors,
they can go bad over time. Frequently, this happens when they are
components in a larger system change. For example, when new I/O
boards are added to a system and cables are moved to these new connections,
the previously working devices seem to stop working. In some cases,
a cable or connector is the problem, not the new I/O board or the
old device. Problems like these are especially common when cables
are pulled too tightly, wound up, strapped down, and otherwise made
to look "neat". Also, take a look under the raised floor and see
what life is like down there.
Mind the Power
Be sure to match power supplied with power drawn and not to weaken
a power infrastructure with low-end power solutions. Check the
system's power specifications (frequently these are maximum numbers,
assuming fully populated systems) and make sure to provide at least
that. Also,
understand the kind of power connections that system needs. This
involves not only the connectors and phases, but also the kinds
of sources to use for multiple-power-source systems. Some systems
allow N+1 power with multiple connections, but the connections must
be in sync (from the same side of the UPS, for example). Others
allow or require power to come from multiple sources (before
and after the UPS). Grounding is also important, especially when
multiple systems are attached to each other or to shared storage.
All of these aspects can make or break the reliability of a computing
facility. In one example, a site had a flaky cluster with multiple
hardware failures, and they found that improper grounding had caused
many weeks of problems. Remember, management only recalls that the
server failed; they don't care about the details.
Try Before You Buy
Every compute site is different, every computer user is different,
and every system is at least slightly different. That makes choosing
correct and appropriate hardware, software, and services exceptionally
challenging. The old saw of "try before you buy" is therefore more
often true in computing than in other aspects of life. Always check
reference accounts, talk to other users and sys admins, and if you're
still in doubt about the ability of the tool to meet your needs,
try before you buy! When evaluating solutions, consider functionality
for the users, cost, reliability, scalability, maintainability,
backup, recovery, and security.
Don't Panic and Have Fun
Rash decisions usually turn out to be bad ones. Logic, reason,
testing, deduction, and repeatability -- these techniques usually
result in solved problems and smooth-running systems. And in spite
of the complexity of the job, and the pressure that can result from
the inherent responsibilities, try to enjoy your work and your co-workers.
Final Pearls of Wisdom
- Keep your propagation constant less than 1. (This comes from
nuclear reactor physics. A reactor with a propagation constant
less than 1 is a generator. More than 1 is a warhead. Basically,
don't let things get out of control.)
- Everything works in front of the salesman.
- Don't cross the streams (Ghostbusters reference -- heed safety
tips).
- If at first you don't succeed, blame the compiler.
- If you finish a project early, the scope will change to render
your work meaningless before the due date.
- If someone is trying to save your life, cooperate.
- Never beam down to the planet while wearing a red shirt (Star
Trek reference -- don't go looking for trouble).
- Learning from your mistakes is good. Learning from someone
else's mistakes is better.
- The fact that something should have worked does not change
the fact that it didn't.
- The customer isn't always right, but he pays the bills.
- Flattery is flattery, but chocolate gets results.
- When dealing with an enigmatic symptom, whether it's an obscure
application or database error, or a system "hanging": the hardware
is always guilty until proven innocent.
- Use only standard, cross-platform file formats to share documentation
(e.g., ASCII files, HTML, or PDF).
- Keep a log file on every computer to record every change you make.
- Share your knowledge and keep no secrets.
- Don't reinvent the wheel, but be creative.
- If you can't live without it, print a hardcopy.
- Always know where your software licenses are.
- Always know where your installation CDs/DVDs/tapes are.
- The question you ask as a sys admin is not "Are you paranoid?";
it's "Are you paranoid enough?"
Acknowledgements
Many thanks to the following folks for contributing to this document:
Stewart Dean, Ken Stone, Art Kufeldt, Juan Pablo Sanchez Beltrán,
Pete Tamas, Christopher Jones, Leslie Walker, Dave Powell, Mike
Zinni, Peggy Fenner, John Kelly, Lola Brown, Chris Gait, Timothy
R. Geier, Michael McConnell, David J. DeWolfe, Christopher Corayer,
Tarjei Jensen, John Petrella, Daniel Hyatt, Skeezics Boondoggle,
Dave Anderson, Jaime Cardoso, Joel Andrews, Dan Wendlick, Christopher
Vera, Jim Mathieu, Bruce Kirkland, Bob Barton, David Meissner, Gary
L. Smith, Francisco Mancardi, Keith Dowsett, and Sue Spoddig.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of the Operating
System Concepts and Applied Operating System Concepts
textbooks. As a consultant and trainer, Peter has taught tutorials
and given talks on security and systems administration worldwide.