              Solaris™ Administration Best Practices
 Peter Baer Galvin
              Over the past few months in Sys Admin, both online and 
              in print, a discourse has been taking place about the best practices 
              of Solaris administrators. This month in the Solaris Corner, I take 
              the best of the old, add the best of the new, and create a consensus 
              "best practices" document. In this document, you'll 
              see some repeats from past columns, but it seemed logical to put 
              forth a complete version of the document this once. I hope this 
              will be useful to both experienced and novice administrators, and 
              that it will continue to evolve and grow as more administrators 
              contribute their wisdom.
              "Consensus" and "systems administration" are 
              words that are rarely used near each other. Systems administration 
              is a journeyman's trade, with knowledge earned through experience 
              and hard work. Every sys admin performs his or her work with slight 
              and not-so-slight variation from colleagues. Some of that variation 
              is caused by personal choice, some by superstition, and some by 
              a differing knowledge set. The more this hard-won knowledge is spread, 
              the more systems will be run alike, and the more stable, usable, 
              and manageable they will be, thus the need for a Best Practices 
              document. Note that most of the information here applies to other 
              operating systems, and sometimes to the real world as well.
              Solaris Administration Best Practices, Version 1.0
              This document is the result of input from many of the top administrators 
              in the Solaris community. Please help your fellow sys admins (as 
              they have helped you here) by contributing your best practices to 
              this document. Email them to me at: [email protected].
              Keep an Eye Peeled and a Wall at Your Back
              The best way to prepare to debug problems is to know how your 
              systems run when there are no problems. Then, when a problem occurs, 
              it can easily be discerned. For example, if you aren't familiar 
              with the normal values of system statistics (CPU load, number of 
              interrupts, memory scan rates, and so on), determining that one 
              value is unusual will be impossible. Devise trip wires and reports 
              to help detect when something unusual is happening. swatch 
              and netsaint are good programs for this purpose.
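              As one illustration of a trip wire, here is a minimal Python 
              sketch that flags a reading that sits far from a recorded baseline. 
              The function name, the three-standard-deviation threshold, and 
              the sample load averages are all assumptions made for the example, 
              not values taken from any particular tool.

```python
import statistics


def is_unusual(history, current, threshold=3.0):
    """Return True when `current` lies more than `threshold` standard
    deviations from the mean of `history` (samples taken during normal operation)."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all counts as unusual.
        return current != mean
    return abs(current - mean) / spread > threshold


# Hypothetical one-minute load averages gathered while the system was healthy.
baseline_load = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.0]

print(is_unusual(baseline_load, 1.1))   # a reading inside the normal range
print(is_unusual(baseline_load, 6.5))   # a reading far outside it
```

              The same check works for any statistic you collect regularly, 
              as long as the baseline was recorded while the system was known 
              to be healthy.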
              Also, pay attention to error messages and clean up the errors 
              behind them. You ignore them at your peril -- the problems will 
              snowball and devour you at the worst possible time, or they'll 
              mount until they hide a really important problem that you miss in 
              all the noise. 
              Communicate with Users
              Veteran sys admins realize that they are there to make users' 
              lives easier. Those who are going to enjoy long sys admin careers 
              take this to heart, and make decisions and recommendations based 
              on usability, stability, and security. The antithesis of this rule 
              is the joke I used to tell as a university sys admin: "This 
              university job would be great if it wasn't for the dang students 
              messing up my systems". Good sys admins realize that this is 
              a joke, not gospel!
              Also remember the example set by the World Champion New England 
              Patriots -- teamwork can overcome strong foes. Talk with your 
              fellow admins, bounce ideas around, share experiences, and learn 
              from each other. It will make you, them, and your systems better.
              Help Users Fix It Themselves
              Helping users help themselves means that you can spend your time 
              on the hard and interesting problems (not to mention fewer calls 
              on your off-hours, and happier users). For example, if you have 
              reports that tell you about quota abuse (overly large mail folders 
              and the like), enable the users to solve the problem, rather than 
              repeatedly complaining to them about their abuse. The counter to 
              this rule is that a little knowledge can be dangerous. Users may 
              think they understand the problem when they don't, or might 
              think they are solving the problem when they are making it worse.
              Know When to Use Strategy, and When to Use Tactics
              Sys admins must learn the difference between strategy and tactics 
              and learn the place for both. Being good at this requires experience. 
              Strategy means arranging the battlefield so that your chances are 
              maximized, possibly allowing you to win easily or even without a 
              fight. Tactics mean hand-to-hand combat. You win by being good at 
              both. An ounce of strategy can be worth a couple of tons of tactics. 
              But don't be overly clever with strategy where tactics will 
              do.
              Another way to think of this rule is: "Do it the hard way 
              and get it over with." Too often admins try to do things the 
              easy way, or take a shortcut, only to have to redo everything anyway 
              and do it the "hard" way. (Note that the "hard way" 
              may not be the vendor-documented way of doing things.) 
              All Projects Take Twice as Long as They Should
              Project planning, even when performed by experienced planners, 
              misses small but important steps that, at a minimum, delay the project, 
              and at a maximum, can destroy the plan (or the resulting systems). 
              The most experienced planners pad the time they think each step 
              will take, doubling the length of a given plan. Then, when the plan 
              proves to be accurate, the planner is considered a genius. In 
              reality, the planner thought "this step should take four days, 
              so I'll put down eight".
              Knowledge of this rule can be used to your advantage in another 
              way: announce big changes far earlier than you really need them 
              done. If you need to power off a data center or replace a critical 
              server by October 31, announce it for September. People will be 
              much more forthcoming about possible problems as the "deadline" 
              approaches. You can adjust to "push back" very diplomatically 
              and generously because your real deadline is not imperiled. (Of 
              course, it's also very important to be honest with your users 
              to establish trust, so be careful with this "Scotty" rule.) 
              It's Not Done Until It's Tested
              Many sys admins like their work because of the technical challenges, 
              minute details, and creative processes involved in running systems. 
              The type of personality drawn to those types of challenges typically 
              is not the type that is good at thoroughly testing a solution, and 
              then retesting after changes to variables. Unfortunately for them, 
              testing is required for system assurance, and for good systems administration. 
              I shudder to think how much system (and administrator) time has 
              been wasted by less-than-thorough testing.
              Note that this rule can translate into the need for a test environment 
              of some sort. If the production environment is important, there 
              should be a test environment for learning and experimentation.
              It's Not Done Until It's Documented
              Documentation is never a favorite task. It's hard to do and 
              even harder to keep up to date, but it pays great dividends when 
              done properly. Note that documentation does not need to be in the 
              form of a novel. Part of it can simply be the output of a script 
              that captures system configuration files and status command output. 
              Another strategy is to document your systems administration using basic 
              HTML. (Netscape Composer is sufficient for this task.) Documents 
              can be stored remotely, and links can be included to point to useful 
              stuff. You can even burn a CD of the contents to archive with the 
              backups.
              If you keep a system history this way, searching the documents 
              can help solve recurring problems. Frequently, admins waste time 
              working on problems that have previously been solved, but not properly 
              documented.
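              The script-driven capture mentioned above might look something 
              like the following Python sketch. It copies a list of configuration 
              files and saves the output of a few status commands into a directory 
              named for the current date. The function name, the flattened file 
              naming, and the choice of files and commands are assumptions for 
              illustration only.

```python
import shutil
import subprocess
from datetime import date
from pathlib import Path


def snapshot(dest_root, files, commands):
    """Copy each file in `files` and save the output of each command in
    `commands` under a directory named for today's date; return that directory."""
    dest = Path(dest_root) / date.today().isoformat()
    dest.mkdir(parents=True, exist_ok=True)
    for name in files:
        src = Path(name)
        if src.is_file():
            # Flatten the path so that /etc/system is stored as etc_system.
            flat = "_".join(part for part in src.parts if part != src.anchor)
            shutil.copy2(src, dest / flat)
    for label, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
            output = result.stdout + result.stderr
        except OSError as exc:
            # Record a missing or unrunnable command rather than aborting.
            output = f"could not run {argv}: {exc}\n"
        (dest / f"{label}.txt").write_text(output)
    return dest
```

              A call such as snapshot("/var/tmp/sysdoc", ["/etc/system", "/etc/vfstab"], 
              {"df": ["df", "-k"], "uname": ["uname", "-a"]}) would leave one 
              dated directory per run (the destination path here is only an 
              example), which can then be searched or archived with the backups.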
              Never Change Anything on Fridays
              Whatever the cause (in a hurry to get home, goblins, gamma rays), 
              changes made before weekends (or vacations) frequently turn into 
              disaster. Do not tempt the fates -- just wait until there are 
              a couple of days in a row to make the change and monitor it. Some 
              admins even avoid making a change late in the day, preferring to 
              do so at the start of a day to allow more monitoring and debugging 
              time.
              Use Defaults Whenever Possible
              I recall a conversation with a client in which the client was trying 
              to go outside of the box, save some time and money, and produce 
              a complex but theoretically workable solution. My response was "there's 
              such a thing as being too clever". He continued being clever 
              (too much so, in my opinion; just right, in his opinion). The solution 
              had problems, was difficult to debug, was difficult for vendors 
              to support, and was quite a bother for a while.
              In another example, some admins make changes for convenience that 
              end up making the system different from others. For example, some 
              still put files into /usr/local/bin, even though Sun has 
              discouraged that (and encouraged /opt/local) for many years. 
              It may make the users' lives easier, because that's where 
              they expect to find files, but other admins may be unpleasantly 
              surprised when they use standard methods and find they conflict 
              with the current system configuration.
              This rule (as with the others) can be violated with good cause. 
              For example, where security is concerned, security can be increased 
              by not using defaults.
              Furthermore, standardize as much as possible. Try to run the same 
              release of Solaris on all machines, with the same patch set, with 
              the same revisions of applications, with the same hardware configurations 
              -- this is easier said than done, as with most of these rules. 
              It is important to set goals, and drive toward them, when possible. 
              This is a good goal, even if it is not 100% attainable.
              With this in mind, isolate site- and system-specific changes. 
              Try to keep all nonstandard things in one place so it is easy to 
              manage them, move them, and know that something is nonstandard.
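              One way to see how far a site has drifted from its standard 
              is to compare each machine's recorded configuration against a 
              baseline. The Python sketch below assumes you already collect 
              such facts into dictionaries; the keys, host names, and values 
              shown are hypothetical.

```python
def find_drift(baseline, hosts):
    """Return {host: {key: (expected, actual)}} for every value that
    differs from the baseline; hosts that match completely are omitted."""
    drift = {}
    for host, facts in hosts.items():
        diffs = {
            key: (expected, facts.get(key))
            for key, expected in baseline.items()
            if facts.get(key) != expected
        }
        if diffs:
            drift[host] = diffs
    return drift


# Hypothetical inventory data, for illustration only.
baseline = {"os_release": "8", "patch_set": "2002-Q1"}
hosts = {
    "alpha": {"os_release": "8", "patch_set": "2002-Q1"},
    "beta": {"os_release": "7", "patch_set": "2002-Q1"},
    "gamma": {"os_release": "8"},
}

for host, diffs in find_drift(baseline, hosts).items():
    print(host, diffs)
```

              A missing fact is reported with an actual value of None, so 
              machines that were never inventoried stand out as well.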
              Always Be Able to Undo What You Are About to Do
              Never move forward with something unless you are fully prepared 
              to return the server to the original starting point (e.g., make 
              images, back stuff up and test, make sure you have the original 
              CDs). Back up the entire system if making systemic changes, such 
              as system upgrades or major application upgrades. Back up individual 
              files if making minor changes. Rather than deleting something (a 
              directory or application), try renaming or moving it first. Everyone 
              who has administered systems for any amount of time has seen the 
              result of ignoring this rule.
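              For minor changes, the habits described above (copy a file 
              before editing it, rename rather than delete) are easy to put into 
              a small helper. This is a minimal Python sketch; the function names 
              and the ".bak" and ".retired" suffixes are assumptions chosen for 
              the example.

```python
import shutil
import time
from pathlib import Path

RETIRED_SUFFIX = ".retired"


def backup_file(path):
    """Copy `path` to a timestamped sibling before editing it; return the copy."""
    src = Path(path)
    copy = src.with_name(f"{src.name}.{time.strftime('%Y%m%d-%H%M%S')}.bak")
    shutil.copy2(src, copy)  # copy2 also preserves timestamps and mode bits
    return copy


def retire(path):
    """Rename a file or directory out of the way instead of deleting it."""
    src = Path(path)
    parked = src.with_name(src.name + RETIRED_SUFFIX)
    if parked.exists():
        raise FileExistsError(f"{parked} already exists; clean it up first")
    src.rename(parked)
    return parked


def restore(parked):
    """Undo `retire` by renaming the item back to its original name."""
    parked = Path(parked)
    if not parked.name.endswith(RETIRED_SUFFIX):
        raise ValueError(f"{parked} was not created by retire()")
    original = parked.with_name(parked.name[: -len(RETIRED_SUFFIX)])
    parked.rename(original)
    return original
```

              Once the change has proven itself, the retired item and old 
              backup copies can be removed deliberately. These helpers do not 
              replace a full system backup before systemic changes.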
              Avoid Poor Decisions from Above
              This certainly falls under the category of easier said than done, 
              unfortunately. Management can have bad data and make bad decisions, 
              or can even have good data and make bad decisions. You as the admin 
              usually have to live with (and suffer with) these decisions, so, 
              with reason and data, encourage the correct decision. Even if you 
              lose the battle, you can always say "I told you so" and 
              feel good about yourself (while looking for a new job).
              If You Haven't Seen It Work, It Probably Doesn't
              Also known as "the discount of marketing". Products 
              are announced, purchased, and installed with the expectation that 
              they will perform as advertised, and in some small percentage of 
              time they actually do. Most of the time, they are overpromised 
              and underdelivered.
              If You're Fighting Fires, Find the Sources
              I would posit that thousands of admin-lives have been wasted by 
              fighting computer system "fires", instead of the causes 
              of those fires. To avoid this problem, you must automate what can 
              be automated, especially monitoring and log checking. This can free 
              up enough time to allow you to make progress on projects, rather 
              than spending time on tedious work. Those projects in turn can stabilize 
              systems, improve manageability, increase performance, and in general 
              make the systems happier.
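              Log checking is one of the easier tasks to automate. The Python 
              sketch below scans log lines for a handful of patterns and counts 
              the matches. The pattern names and regular expressions are 
              hypothetical starting points, not a vetted rule set, and would 
              need tuning against the messages your own systems produce.

```python
import re
from collections import Counter

# Hypothetical patterns; tune them to the messages your own systems produce.
PATTERNS = {
    "disk": re.compile(r"\b(scsi|disk)\b.*\b(error|retry|timeout)\b", re.IGNORECASE),
    "auth": re.compile(r"\b(failed|refused)\b.*\b(login|su|password)\b", re.IGNORECASE),
    "memory": re.compile(r"out of memory|no swap space", re.IGNORECASE),
}


def scan(lines, patterns=PATTERNS):
    """Return (counts, hits): how many lines matched each pattern name,
    and the list of (name, line) pairs that matched."""
    counts = Counter()
    hits = []
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
                hits.append((name, line.rstrip("\n")))
                break  # report each line once, under the first pattern it matches
    return counts, hits
```

              Because scan accepts any iterable of lines, it can be fed an 
              open log file (on Solaris, typically /var/adm/messages) from a 
              scheduled job that mails the hits only when the counts are nonzero.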
              If You Don't Understand It, Don't Play with It on 
              Production Systems
              In my university days, we had a student programmer look at the 
              key switch on our main production Sun server, wonder what it did, 
              and turn it to find out. It would have been much better if he had asked 
              someone about it first, or at least tried it on a test system. That 
              turns out to be the case with just about everything admin-related. 
              (Of course, he's now a .com millionaire, so who got the last 
              laugh?)
              If It Can Be Accidentally Used, and Can Produce Bad Consequences, 
              Protect It
              For example, if there is a big red power button at shoulder height 
              on a wall that is just begging to be leaned against, protect it. 
              Otherwise, the power could be lost in the entire workstation lab 
              and lots of students could lose their work (not that such a thing 
              happened on the systems I was managing...). This rule should be 
              extrapolated to just about everything users have access to -- 
              hardware or software. For instance, if it's a dangerous command 
              or procedure, wrap it in a protective script.
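              A protective wrapper can be as simple as forcing the operator 
              to type a confirmation phrase before the dangerous command runs. 
              The Python sketch below is one possible shape for such a wrapper; 
              the function name and its parameters are assumptions made for this 
              example.

```python
import subprocess


def run_guarded(argv, expected_phrase, ask=input, runner=subprocess.run):
    """Run `argv` only if the operator types `expected_phrase` exactly.
    Returns the runner's result, or None when the confirmation fails."""
    print(f"About to run: {' '.join(argv)}")
    typed = ask(f"Type '{expected_phrase}' to continue: ")
    if typed.strip() != expected_phrase:
        print("Confirmation did not match; nothing was run.")
        return None
    return runner(argv)
```

              Using the machine's host name as the phrase is a common choice, 
              because it makes the operator confirm which system is about to 
              be affected. The ask and runner parameters exist only so the 
              wrapper can be tested without a terminal or a real command.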
              Ockham's Razor Is Very Sharp Indeed
              William of Ockham (or Occam) advanced humanity immensely when 
              he devised his "razor", or the "law of parsimony", 
              which says that the simplest of two or more competing theories is 
              the preferable one.
              Checking the simple and obvious stuff first can save a lot of 
              time and aggravation. A good example is the user who calls and says, 
              "I can't log in anymore." Rather than reset the password 
              or worse yet, tear into the NIS server, just ask if he has the "Caps 
              Lock" key on. Note that to properly execute Ockham's Razor, 
              you must start with no preconceived notions. Assume nothing, gather 
              data, test hypotheses, and then decide on the problem and the solution.
              Another Ockham's Razor corollary: Never attribute to malice 
              what can be explained as the result of sheer idiot stupidity.
              Frequently, implementation and management complexity is unnecessary 
              and results from "too clever" systems administration. 
              This subtle rule has wide-ranging ramifications. Many times, a sys 
              admin is in an untenable position (debugging a problem, transitioning, 
              upgrading, and so on) because of too much cleverness in the past 
              (their own or someone else's).
              When in Doubt, Reboot
              As silly and hackneyed as it sounds, this is probably the most 
              important rule. It's also the most controversial, with much 
              argument on both sides. Some argue that it's amateur to reboot; 
              others talk of all the time saved and problems solved by rebooting.
              Of course, sometimes a sys admin doesn't have the luxury 
              to reboot systems every time there is a problem. For every time 
              that a corrupted jumbo patch was the culprit and a reboot solved 
              the problem, there is a time that rebooting ad nauseam proved nothing 
              and some real detective work was needed to resolve a problem. As 
              with many of these rules, experience is a guide. Knowing when to 
              reboot -- and when not to -- is one characteristic of a 
              good, experienced admin.
              If It Ain't Broke, Don't Fix It
              It's amazing that clichés are often true. When 
              I recall all the time I've wasted making just one last change, 
              or one small improvement, only to cause irreparable harm resulting 
              in prolonged debugging (or an operating system reinstallation), 
              I wish this rule were tattooed somewhere obvious. The beauty of 
              this rule is that it works in real life, not just for systems administration.
              Save Early and Often
              If you added up all the hours wasted because of data lost due 
              to program and system crashes, it would be a very large number. 
              Saving early and often reduces the amount of loss when a problem 
              occurs.
              I've heard a story that Bill Joy was adding a lot of features 
              to his program, the vi editor. It had multiple windows and 
              a bunch of good stuff. The system crashed and he lost the changes. 
              Frustrated, he didn't repeat the changes, leaving us with a 
              much less useful vi. Don't let this happen to you!
              Dedicate a System Disk
              So much depends on the system disk that it is worth keeping the 
              disk only for system use. Swapping it then becomes possible without 
              affecting users. Performing upgrades, cloning, and mirroring are 
              all easier as well. Performance also improves if the system disk 
              is not used for other purposes.
              Have a Plan
              Develop a written task list for important tasks. Develop it during 
              initial testing and use/refine it as the changes are applied to 
              increasingly critical servers. By the time you get to production, 
              you typically have a short maintenance window and avoiding mistakes 
              is critical. A finely tuned plan can make all the difference, not 
              to mention that the next time (and the time after that) that you 
              are going to be doing something similar, you already have a template 
              for a new task list.
              Don't Panic and Have Fun
              Rash decisions usually turn out to be bad ones. Logic, reason, 
              testing, deduction, and repeatability -- these techniques usually 
              result in solved problems and smooth-running systems. And in spite 
              of the complexity of the job, and the pressure that can result from 
              the inherent responsibilities, try to enjoy your work and your co-workers.
              Acknowledgements
              Thanks to the following folks for contributing to this document: 
              Stewart Dean, Ken Stone, Art Kufeldt, Juan Pablo Sanchez Beltrán, 
              Pete Tamas, Christopher Jones, Leslie Walker, Dave Powell, Mike 
              Zinni, Peggy Fenner, John Kelly, Lola Brown, Chris Gait, Timothy 
              R. Geier, Michael McConnell, David J. DeWolfe, Christopher Corayer, 
              and Tarjei Jensen.
              Peter Baer Galvin (http://www.petergalvin.org) is the 
              Chief Technologist for Corporate Technologies (www.cptech.com), 
              a premier systems integrator and VAR. Before that, Peter was the 
              systems manager for Brown University's Computer Science Department. 
              He has written articles for Byte and other magazines, and 
              previously wrote Pete's Wicked World, the security column, 
              and Pete's Super Systems, the systems management column for 
              Unix Insider (http://www.unixinsider.com). Peter is 
              coauthor of the Operating System Concepts and Applied 
              Operating System Concepts textbooks. As a consultant and trainer, 
              Peter has taught tutorials and given talks on security and systems 
              administration worldwide.