AIX
Alternate Disk Installation
Jeff Marsh
In this article, I will describe some tools within AIX (some new,
some old) that can help you reduce the off-hours time spent by your
administration staff during maintenance upgrades. I will also show
you some uses for these same toolsets that can help you reduce recovery
times due to rootvg corruption.
Alternate Disk Installation
What is it? According to the IBM AIX Installation Guide:
"Alternate disk installation, available in AIX Version 4.3,
allows installing the system while it is up and running, allowing
installation or upgrade down time to be decreased considerably."
Thus, with another set of bootable drives within a server, you
can install maintenance (e.g., upgrade your system from AIX 4.3.3.04
to AIX 4.3.3.06) during the day without interrupting or otherwise
affecting the running applications. However, you will still need a
reboot to activate the new level.
The support model prior to Alternate Disk Installation required
all work to be done off-hours during an application maintenance
window that generally took two to four hours. Now you can reduce
that off-hour time from two to four hours per server to just the
time to reboot. I'll also show you how you can complete multiple
upgrades in that same reboot window using Network Installation Manager
(NIM).
Requirements
To enable Alternate Disk Installation, you need to install the
following base-level filesets and upgrade to at least these corresponding
fileset levels. These filesets do not require a reboot to install:
Base-level fileset                  Fileset level
bos.alt_disk_install.rte            26
bos.alt_disk_install.boot_images    27
You will also need another free, bootable drive within your server.
In our case, we configure new servers with four internal drives
for systems administration purposes: two drives for the mirrored
primary rootvg, and two for alt_disk_install. You could get by with
just one additional drive, but we prefer to have two.
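The two prerequisites above can be checked from the command line before you start. The following is a minimal sketch, not a supported tool: the helper names missing_filesets and free_disks are hypothetical, and the sketch assumes that lslpp -l prints the fileset name in the first column and that lspv prints "None" in the third column for unassigned disks. It cannot tell whether a free disk is actually bootable.

```shell
#!/bin/sh
# Hypothetical helper: reads `lslpp -l`-style output on stdin and prints
# each required alt_disk_install fileset that does not appear in it.
missing_filesets() {
    awk '
        BEGIN {
            need["bos.alt_disk_install.rte"] = 1
            need["bos.alt_disk_install.boot_images"] = 1
        }
        # Assumption: the fileset name is the first field of a detail line.
        ($1 in need) { delete need[$1] }
        END { for (f in need) print f }
    ' | sort
}

# Hypothetical helper: reads `lspv` output on stdin and prints the disks
# that belong to no volume group (third column "None").
free_disks() {
    awk '$3 == "None" { print $1 }'
}

# On an AIX host (not executed here):
#   lslpp -l "bos.alt_disk_install.*" | missing_filesets
#   lspv | free_disks
```

An empty result from missing_filesets means both filesets are present; you still need to compare the installed levels against the table above by eye.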
How It Works
Alternate Disk Installation works by cloning your primary rootvg
running on hdisk0 and hdisk1, for example, to a second set of drives,
hdisk2 and hdisk3. After the system completes those copies using
basic find, backup, and restfile utilities,
it will install the latest maintenance level you designate.
This process is shown in Figure 1. First, you clone hdisk0/1 to
hdisk2/3, and then you apply maintenance to the newly cloned hdisk2/3
while the applications continue to run against hdisk0/1.
To complete this task from SMIT, issue the following fast path.
You should expect to see the following panels:
smitty alt_clone
Clone the rootvg to an Alternate Disk:
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Target Disk(s) to install [hdisk2 hdisk3]
Phase to execute all +
image.data file [] /
Exclude list [] /
Bundle to install [update_all] +
-OR-
Fileset(s) to install []
Fix bundle to install []
-OR-
Fixes to install []
Directory or Device with images [/mnt]
(required if filesets, bundles or fixes used)
installp Flags
COMMIT software updates? yes +
SAVE replaced files? no +
AUTOMATICALLY install requisite software? yes +
EXTEND file systems if space needed? yes +
OVERWRITE same or newer versions? no +
VERIFY install and check file sizes? no +
Customization script [] /
Set bootlist to boot from this disk
on next reboot? yes
Reboot when complete? no +
Verbose output? no +
Debug output? no +
[BOTTOM]
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
From the example above, note the following:
- You are cloning to hdisk2 and hdisk3.
- You are running an update_all operation of maintenance
mounted in the /mnt mount point. (In this case, this is
a CD-ROM with the AIX 4.3.3.06 maintenance filesets.)
- You are specifying that this operation should change the bootlist
to hdisk2 and hdisk3 after completion.
- You are not asking the process to complete an immediate reboot
upon completion of the upgrade because this is something you want
to schedule in an appropriate maintenance window.
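If you prefer the command line to SMIT, the same clone-and-update request can be sketched as below. This is an illustration under assumptions rather than the exact command SMIT generates: the -b (bundle) and -l (image location) flags are used as I understand them from the alt_disk_install documentation, so confirm them against the man page at your maintenance level. The run and clone_and_update helpers are hypothetical, and by default the sketch only prints the command.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper: clone rootvg to the given disks and apply update_all
# from the given image location.
# Usage: clone_and_update /mnt hdisk2 hdisk3
clone_and_update() {
    images=$1
    shift
    # -C clones the running rootvg (flag used elsewhere in this article).
    # -b and -l are assumptions: bundle to install and image location.
    run alt_disk_install -C -b update_all -l "$images" "$@"
}

clone_and_update /mnt hdisk2 hdisk3
```

The sketch passes neither -B nor a reboot flag, on the assumption that the command's defaults match the panel above (set the bootlist, do not reboot).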
When the operation completes, the bootlist -m normal -o command
will show the boot list set to hdisk2 hdisk3, and the lspv command
will show the following:
root@aknimp1:/> lspv
hdisk0 000f261d90bf6ea0 rootvg
hdisk1 000f261dae86d104 rootvg
hdisk2 000f261db52d4d95 altinst_rootvg
hdisk3 000f261db52d4ca6 altinst_rootvg
hdisk4 000f018d07d4f412 None
hdisk5 000f261dbde71c66 None
hdisk6 000f261dbd8eea89 nimresvg
At this point, you have cloned and installed the latest AIX maintenance
level during the day. You are now ready to activate that latest maintenance
with a reboot operation at whatever time is appropriate for the outage
to your application users. You can save significant off-hours time
for maintenance upgrades; our off-hours time has been reduced to the
time needed for a simple reboot.
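Before scheduling the reboot, it is worth confirming that the boot list really points at the cloned disks. The following sketch compares the two outputs; the function name clone_is_next_boot is hypothetical, and it assumes that bootlist -m normal -o prints one device per line with the disk name in the first field.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the boot list contains exactly the
# disks that lspv reports as altinst_rootvg.
#   $1 = output of `lspv`
#   $2 = output of `bootlist -m normal -o` (assumed: one device per line)
clone_is_next_boot() {
    want=$(printf '%s\n' "$1" | awk '$3 == "altinst_rootvg" { print $1 }' | sort | tr '\n' ' ')
    have=$(printf '%s\n' "$2" | awk 'NF { print $1 }' | sort | tr '\n' ' ')
    [ -n "$want" ] && [ "$want" = "$have" ]
}

# Sample data copied from the lspv listing in this section.
lspv_out='hdisk0 000f261d90bf6ea0 rootvg
hdisk1 000f261dae86d104 rootvg
hdisk2 000f261db52d4d95 altinst_rootvg
hdisk3 000f261db52d4ca6 altinst_rootvg'
boot_out='hdisk2
hdisk3'

if clone_is_next_boot "$lspv_out" "$boot_out"; then
    echo "next boot: clone"
else
    echo "next boot: not the clone"
fi
```

On an AIX host, you would call it as clone_is_next_boot "$(lspv)" "$(bootlist -m normal -o)".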
Alternate Disk Installation -- After the Reboot
After the reboot, issue the oslevel command or complete
the appropriate verifications to ensure your maintenance upgrade
occurred as expected. If you issue the lspv command, you
will notice the following:
root@aknimp1:/> lspv
hdisk0 000f261d90bf6ea0 old_rootvg
hdisk1 000f261dae86d104 old_rootvg
hdisk2 000f261db52d4d95 rootvg
hdisk3 000f261db52d4ca6 rootvg
hdisk4 000f018d07d4f412 None
hdisk5 000f261dbde71c66 None
hdisk6 000f261dbd8eea89 nimresvg
Both hdisk2 and hdisk3, from which you have booted, now show a volume
group name of rootvg. The original hdisk0 and hdisk1 now show a volume
group of old_rootvg and are varied off.
Now, you have several options. My preference is to leave hdisk0
and hdisk1 alone with the old maintenance levels in case you need
to fall back on them.
Let's assume that after the reboot your applications aren't
working well with the latest maintenance. Under the previous support
model, you would retrieve the mksysb backup taken before the
upgrade and begin a restore, which could take two hours or more and
depended on the tape image being good. The new
support model with Alternate Disk Installation says to change your
bootlist back to hdisk0 and hdisk1 and to reboot the server. At
some future point, when you decide the maintenance is good and you
don't need to fall back, you can clone the latest maintenance
residing on hdisk2/3 back to hdisk0/1.
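The fallback described above amounts to two steps: reset the boot list and reboot. Here is a minimal sketch, assuming the original rootvg still lives on hdisk0 and hdisk1. The run and fall_back helpers are hypothetical, and the script only prints the commands unless RUN=1 is set, because the last step reboots the server.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper: point the normal boot list back at the given disks,
# display it for confirmation, then reboot.
fall_back() {
    run bootlist -m normal "$@" &&
    run bootlist -m normal -o &&
    run shutdown -Fr
}

fall_back hdisk0 hdisk1
```

After the reboot, oslevel and lspv should show the old maintenance level running from hdisk0/1 again.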
Cloning Back to hdisk0/1
To complete the cloning of hdisk2/3 back to hdisk0/1, you must
issue the following commands:
- alt_disk_install -W hdisk0 hdisk1 -- Wakes up the
old_rootvg
- alt_disk_install -S -- Puts the old_rootvg
back to sleep
- alt_disk_install -X old_rootvg -- Removes the old_rootvg
volume group definition associated with hdisk0/1 from the ODM, so
that lspv reports the disks as "None" and the next clone can
proceed cleanly.
- smitty alt_clone -- Reclone back to hdisk0/1 using
the previous example.
I will discuss these commands further in the next section. For now,
note that to reclone onto drives that were previously booted from,
you must run the commands exactly as shown so that the old_rootvg
volume group name is removed from the ODM.
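Because the order of these commands matters, you may want to keep them in a small script. The following sketch strings them together and stops at the first failure. The run and reclone_back helpers are hypothetical; the final step uses alt_disk_install -C from the command line in place of the SMIT panel, with no update options, which is an assumption about what you want the reclone to do.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper: clean up old_rootvg on the given disks, then clone
# the running rootvg back onto them. Each step runs only if the previous
# one succeeded.
reclone_back() {
    run alt_disk_install -W "$@" &&        # wake up old_rootvg
    run alt_disk_install -S &&             # put it back to sleep
    run alt_disk_install -X old_rootvg &&  # remove its definition from the ODM
    run alt_disk_install -C "$@"           # clone rootvg onto the freed disks
}

reclone_back hdisk0 hdisk1
```

Run it first without RUN=1 to review the command sequence, then again with RUN=1 once you are sure you no longer need the fallback copy.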
Other Uses for alt_disk_install
The alt_disk_install command can also help with the following:
- Nightly backup of your system -- Using alt_disk_install,
you can back up your system nightly (or at whatever frequency is
appropriate) without having to manage mksysb tapes. If
you suffer some type of rootvg corruption, either major
or minor, you can restore using the data on the cloned drives.
- mksysb images -- The alt_disk_install command
can be used to install mksysb images (AIX 4.3 or later) on systems
running AIX 4.1 and later versions.
- You can also use alt_disk_install for recovery of corrupted
files in rootvg and to reduce the size of logical volumes
in rootvg, as described in the following sections.
Recovery of Corrupted Files in rootvg
If you suffer major corruption (for example, an hdisk failure) and
the server crashes, and you have cloned rootvg to another bootable
drive, you can use SMS, for example, to change your bootlist to the
cloned drives and quickly recover the server.
If you suffer minor corruption within the rootvg where
a file or a few files are corrupted or inadvertently deleted, you
can wake up the cloned copy of the rootvg and copy those
deleted or corrupted files back to the primary rootvg while
the server is up and running.
In this example, you are booted against hdisk0/1 and have recently
cloned the system to hdisk2/3. To access the cloned copy of the
rootvg while the server is up and running, complete the following:
1. alt_disk_install -W hdisk2 hdisk3 -- Wakes up the
cloned copy:
root@aknimp1:/> alt_disk_install -W hdisk2 hdisk3
Waking up altinst_rootvg volume group ...
Replaying log for /dev/alt_hd4.
2. From a df -k command, you will notice that the wake-up command
has mounted the alternate rootvg file systems under the /alt_inst
mount point:
root@aknimp1:/> df -k
Filesystem       1024-blocks     Free %Used  Iused %Iused Mounted on
/dev/hd4               49152     5608   89%   1226     5% /
/dev/hd2              753664     5056  100%  19966    11% /usr
/dev/hd9var            16384    14340   13%    222     6% /var
/dev/hd3               32768    30376    8%     98     2% /tmp
/dev/lvexport         131072   126772    4%     41     1% /export
/dev/lv01            4980736    94468   99%   4546     1% /export/lpp_source
/dev/lv02             917504   448868   52%  29468    13% /export/spot
/dev/lvmksysb       15204352  3381328   78%     31     1% /export/mksysb
/dev/lvadmin          131072   126868    4%     25     1% /admin
/dev/hd1               16384    15820    4%     20     1% /home
/dev/lvadsm            16384       56  100%     21     1% /var/adsm
/dev/alt_hd4           49152     5704   89%   1192     5% /alt_inst
/dev/alt_lvadmin      131072   126868    4%     25     1% /alt_inst/admin
/dev/alt_hd1           16384    15820    4%     20     1% /alt_inst/home
/dev/alt_hd3           32768    30376    8%     98     2% /alt_inst/tmp
/dev/alt_hd2          753664     5056  100%  19966    11% /alt_inst/usr
/dev/alt_hd9var        16384    14380   13%    219     6% /alt_inst/var
/dev/alt_lvadsm        16384     1848   89%     20     1% /alt_inst/var/adsm
3. Copy the corrupted files from the appropriate alt_inst logical
volume/filesystem. In this case, I corrupted my /etc/hosts
file, so I will issue the following command to restore it from my
latest cloned backup:
cp /alt_inst/etc/hosts /etc/hosts
4. When you have restored the required files, put the altinst_rootvg
back to sleep, which unmounts the /alt_inst filesystems, by issuing:
alt_disk_install -S
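When more than one file needs to come back from the clone, a small helper keeps the copy step consistent. The following is a sketch only: alt_path and restore_from_clone are hypothetical names, and ALT_ROOT is a hypothetical override that defaults to /alt_inst, the mount point shown in the df -k output above. The helper copies a file only when the clone's copy differs from the live one.

```shell
#!/bin/sh
# Hypothetical helper: maps a path on the running system to its copy under
# the woken-up clone. ALT_ROOT is an assumption; it defaults to /alt_inst.
alt_path() {
    echo "${ALT_ROOT:-/alt_inst}$1"
}

# Hypothetical helper: copies one file back from the clone, preserving
# mode and timestamps, but only when the two copies differ.
restore_from_clone() {
    src=$(alt_path "$1")
    if [ ! -f "$src" ]; then
        echo "not in clone: $1" >&2
        return 1
    fi
    if cmp -s "$src" "$1"; then
        echo "unchanged: $1"
    else
        cp -p "$src" "$1" && echo "restored: $1"
    fi
}

# On an AIX host (not executed here):
#   alt_disk_install -W hdisk2 hdisk3
#   restore_from_clone /etc/hosts
#   alt_disk_install -S
```

Remember that the clone is only as current as your last cloning run, so a restored file may be older than the one you lost.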
Reducing Logical Volume Size Within the rootvg
Remember the pain associated with the need to reduce the size
of a logical volume within the rootvg? It took a tape restore
of the system to complete. Now, you can complete that reduction
within a simple cloning process. The steps to complete that process
are as follows:
1. Issue a mkszfile command to create the /image.data
file.
2. Edit the /image.data file and specify SHRINK=yes
in the logical_volume_policy stanza:
image_data:
        IMAGE_TYPE= bff
        DATE_TIME= Tue Oct 3 10:29:55 CDT 2000
        UNAME_INFO= AIX aknimp1 3 4 000F261D4C00
        PRODUCT_TAPE= no
        USERVG_LIST= nimresvg
        OSLEVEL= 4.3.3.10
logical_volume_policy:
        SHRINK= yes
        EXACT_FIT= no
ils_data:
        LANG= en_US
3. Clone the rootvg to hdisk2 and hdisk3, specifying your customized
/image.data file by issuing one of the following commands:
smitty alt_clone (remember to specify the location of
your image.data file on the image.data file prompt)
or
alt_disk_install -i/image.data -B -C hdisk2 hdisk3 (from
the command line)
4. After the completion of the cloning operation, wake up the
altinst_rootvg by issuing:
alt_disk_install -W hdisk2 hdisk3
5. Review your df -k output and compare the sizes of the primary
file systems with those of their /alt_inst counterparts.
6. If you are satisfied with the sizing reduction, change your
bootlist (bootlist -m normal hdisk2 hdisk3) and reboot.
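If you repeat this procedure on several servers, the edit in step 2 can be scripted instead of done by hand. The sketch below is one way to do it, assuming the stanza layout shown above (a stanza name followed by a colon on its own line, then NAME= value lines). The function name set_shrink_yes and the output path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads image.data text on stdin and writes it to
# stdout with SHRINK set to "yes" inside the logical_volume_policy stanza
# only. Lines in every other stanza pass through untouched.
set_shrink_yes() {
    awk '
        # A stanza header is a single word ending in a colon.
        NF == 1 && $1 ~ /^[A-Za-z_0-9]+:$/ { stanza = $1 }
        stanza == "logical_volume_policy:" && $1 ~ /^SHRINK=/ {
            sub(/SHRINK=.*/, "SHRINK= yes")   # keeps the leading indentation
        }
        { print }
    '
}

# On an AIX host (not executed here; /tmp/image.data.shrink is an example):
#   mkszfile
#   set_shrink_yes < /image.data > /tmp/image.data.shrink
#   alt_disk_install -i /tmp/image.data.shrink -B -C hdisk2 hdisk3
```

Writing the result to a separate file leaves the original /image.data intact, so you can compare the two before cloning.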
Network Installation Management (NIM)
I want to briefly discuss NIM and show how well it interfaces
with alternate disk installation. It can easily help you to manage
upgrades on a group of servers, thus saving you even more time.
What Is NIM?
Paraphrasing from the AIX Network Installation Management Guide
and Reference, "NIM is a base component of AIX and permits
and aids in the installation and maintenance of AIX, its basic
operating system, and additional software and fixes that may be
supplied over the network. NIM provides for the customization of
machines both during and after installation. As a result, NIM has
eliminated the reliance of the systems administration staff on tapes
and CD-ROMs for software installation and maintenance."
In our case, we use NIM to manage a group of standalone machines
(NIM clients) from a centrally located, network-attached NIM master.
From the NIM master, you can manage operating
system installations, maintenance upgrades, mksysb images
for backup and recovery, installation of new servers (cloning),
and the re-installation of existing servers in case of a disaster.
There's a great deal of functionality provided by NIM. I
recommend reviewing the usage guide to see what NIM features could
benefit your environment. I also recommend a good Redbook from IBM,
NIM: From A to Z in AIX 4.3 (SG24-5524-00), which was published
in February 2000.
I won't cover the specifics of setting up the NIM master
and the corresponding NIM client configurations; it is not an overly
complicated process. However, it will require someone with NIM-specific
knowledge to lay out the functional NIM environment. If you support
SP complexes, you have already had a fair amount of exposure to
NIM even though it is buried one layer below PSSP.
One key feature of NIM that helps you manage a group of servers
concurrently is the machine group definition. Within NIM, you can
operate on a group of machines as easily as on a single machine.
For instance, we have defined several machine groups within our
NIM master environment. These definitions allow us to operate on
a group of like servers concurrently.
How Does It Integrate with Alternate Disk Installation?
NIM knows how to fully exploit Alternate Disk Installation. For
example, look at the initial clone and update_all operation.
Let's say you want to use NIM to extend the model (instead
of upgrading the maintenance level on a single server) and you want
to complete this operation on ten Lotus Notes servers that are similarly
configured and are defined in a Notes machine group within NIM.
From SMIT on the NIM master, issue the following fast path and you
will see this panel:
smitty nim_alt_clone
Clone the rootvg to an Alternate Disk
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Target Machine / Group to Install [NOTES] +
* Target Disk(s) to install [hdisk2 hdisk3]
Phase to execute all +
IMAGE_DATA resource [] +/
EXCLUDE_FILES resource [] +/
(leave blank to include all files in backup)
BUNDLE to install [] +
-OR-
Fileset(s) to install []
FIX_BUNDLE to install [] +
-OR-
FIXES to install [update_all]
LPP_SOURCE [aix433_lppsource] +
(required if filesets, bundles or fixes used)
installp Flags
COMMIT software updates? yes +
SAVE replaced files? no +
AUTOMATICALLY install requisite software? yes +
EXTEND filesystems if space needed? yes +
OVERWRITE same or newer versions? no +
VERIFY install and check file sizes? no +
Customization SCRIPT resource [] +/
Set bootlist to boot from this disk
on next reboot yes +
Reboot when complete? no +
Verbose output? no +
Debug output? no +
Group controls (only valid for group targets):
Number of concurrent operations [] #
Time limit (hours) [] #
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
In this example, you would cause every server defined in the Notes
Machine group to begin a process to clone itself from hdisk0/1 to
hdisk2/3. At the completion of the cloning operation, NIM would then
NFS-mount the aix433_lppsource resource (in this case, it's
the AIX 4.3.3 lppsource filesystem, which includes the 4.3.3.06
maintenance) and apply it to the newly cloned hdisk2/3 on each of
these servers. This also instructs NIM to change the bootlist on each
of these servers as a part of the operation but does not cause an
immediate reboot. I recommend, however, using NIM to schedule a reboot
of all these servers during the maintenance window.
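The same request can also be issued from the NIM master's command line. The following is a sketch under assumptions, not the exact command SMIT builds: it uses the nim -o alt_disk_install operation with the source, disk, fixes, and lpp_source attributes as I understand them from the NIM documentation, so verify the attribute names on your NIM master before relying on it. The run and nim_clone_group helpers are hypothetical, and the script only prints the command by default.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless RUN=1 is set in the environment.
# Note: the printed form does not show shell quoting.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper: ask NIM to clone rootvg on every member of a machine
# group and apply update_all from an lpp_source resource.
# Usage: nim_clone_group GROUP LPP_SOURCE DISK...
nim_clone_group() {
    group=$1
    lpp=$2
    shift 2
    # Attribute names are assumptions taken from the NIM documentation.
    # All target disks are passed as one space-separated value.
    run nim -o alt_disk_install \
        -a source=rootvg \
        -a disk="$*" \
        -a fixes=update_all \
        -a lpp_source="$lpp" \
        "$group"
}

nim_clone_group NOTES aix433_lppsource hdisk2 hdisk3
```

The sketch relies on the operation's defaults for the bootlist and reboot settings; if your level of NIM behaves differently from the panel above, set them explicitly.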
All of this work, including the cloning and upgrading of the maintenance
level, can be completed during the day without affecting the running
application (e.g., Notes). For the previous support model, this
same upgrade would have taken about 2 hours per server plus reboot
time to complete during an application maintenance window, generally
in the middle of the night. If a single person worked to complete
this process, this could have taken about 25 hours spread across
multiple weekends to complete. With NIM and Alternate Disk Installation,
this upgrade outage can be reduced to the time to reboot these 10
servers concurrently (or about 30 minutes, in our case). Note that
your time may vary depending on speed of network, number of filesets
being updated, time to reboot, and problems encountered.
Figure 2 shows the process using NIM/Machine Groups and Alternate
Disk Installation. First, you instruct the NIM master to have each
of the servers in the defined machine group clone hdisk0/1 to hdisk2/3
(depicted in red). Then, NIM will NFS-mount the appropriate LPPSOURCE
filesystem containing the AIX 4.3.3.06 maintenance level and apply
that maintenance to the newly cloned drives (operation in green).
Again, this process happens concurrently on all servers in the defined
NIM machine group without affecting the running applications.
Conclusion
My team is in the process of rolling out this methodology change.
I think we can significantly reduce the amount of time spent in
support of our current AIX standalone infrastructure. I also think
Alternate Disk Installation and NIM can help you better manage
your infrastructure and provide some consistency to your installation,
upgrade, maintenance, and build procedures. In conclusion, I hope
the above discussion will help you significantly reduce the amount
of off-hours time associated with maintenance or fileset upgrades
within AIX.
Jeff Marsh is the Systems Advisor to the UNIX Server Team working
at American Century Investments, a premier investment manager serving
nearly two million individual and institutional investors. Jeff
can be contacted at: jeffrey_marsh@americancentury.com.