| Serial Storage Architecture Management 
              Miles Purdy 
              Serial Storage Architecture, or SSA, is one of the least understood 
              technologies I manage. SSA is a fast, high-capacity hard disk storage 
              solution available for most UNIX-based and PC-based computers, although 
              it is probably most common on RS/6000s (a.k.a. pSeries). Many 
              people think that SSA is an IBM proprietary standard; in fact, it 
              is an open standard. I will explain the technology and show examples 
              of how to manage SSA on RS/6000s. I will also describe some 
              benefits of SSA and provide some performance and tuning tips and 
              management scripts. 
              The Technology Overview 
              SSA technology is all about loops and devices talking to each 
              other. (See Figure 1.) Every SSA device is called a node. The nodes 
              in SSA adapters are called initiators because they issue commands, 
              and the nodes in SSA disks are called targets because they respond 
              to commands. To create an SSA loop, you start with an SSA adapter 
              -- a PCI card. Each adapter has two loops (A and B) and two 
              physical ports per loop (A1, A2, and B1, B2). Copper cables are 
              commonly used to connect the adapter to an I/O drawer (a.k.a. enclosure). 
              Enclosures contain a maximum of 16 hard drives. You can start the 
              loop by cabling from port A1 on the SSA adapter to port 1 on the 
              I/O drawer. If the I/O drawer is full of disks, you can complete 
              the loop by cabling from port 16 on the I/O drawer back to port 
              A2 on the adapter, thereby forming a complete loop. The automatic 
              bypass cards will close if they detect a disk on either side, thus 
              closing the loop. 
              The SSA adapters are amazing pieces of technology. The new Advanced 
              Serial RAIDPlus adapter (feature code 6230 from IBM) supports RAID 
              0, 1, 5, and 1+0. It has both read and write cache. The fast-write 
              cache, as it is called, buffers many smaller writes into blocks 
              and sends full stripes to the array, greatly improving performance, 
              especially for RAID 5. If the power fails, the adapter has a battery 
              to maintain the fast-write cache, which will commit the write operations 
              once the power returns. If two hosts, each with its own adapter, 
              share common disks and use the fast-write cache (an arrangement 
              called two-way fast-write cache), each host can maintain a copy 
              of the other's fast-write cache. If one system fails, the adapter 
              in the other host can then commit the write operations. The new 
              adapter supports transfer rates of 40 MB/s per port, for a total 
              bandwidth of 160 MB/s across its four ports. 
              Copper cables are used to connect SSA devices, although fiber 
              optic extenders are available to travel up to 10 km. The cables 
              may connect an adapter to an I/O drawer, an adapter to an adapter, 
              or an I/O drawer to an I/O drawer. 
              I/O drawers are used for large configurations, although a standalone 
              desk-side unit is also available. The I/O drawers are rack mounted, 
              fitting into a standard 19" rack. Each rack can hold up to 
              six I/O drawers. As I mentioned, each I/O drawer can hold up to 
              16 disks, and each adapter loop can support 48 disks -- 96 disks 
              in total, per adapter. However, for performance, 32 disks per loop 
              is probably best. The disks in the I/O drawer are grouped into four 
              groups of four. You can add disks one at a time, but you should be 
              adding disks in multiples of four at this level of storage. In fact, 
              when buying a new drawer, always try to get 16 disks. If you are 
              using more than 4 disks (e.g., 8 or 16 in a single loop), there is 
              no need to cable between ports 4 and 5, for example. The I/O drawer 
              will automatically detect the presence of disks in slots 4 and 5 
              and close the loop with the automatic bypass card. Early models 
              required a 6" cable between the ports. The I/O drawers feature 
              redundant power and cooling. 
              The model D40 I/O drawer supports disk capacities of 9.1, 18.2, 
              and 36.4 GB, in speeds of 7200 and 10200 rpm. Any combination of 
              disk capacities and speeds is possible, but not recommended. If 
              you're using RAID, all the disks must be the same or performance 
              will suffer and capacity will be lost. With 96 disks per adapter 
              at 36.4 GB each, a single adapter can access, at most, 3494 GB. 
              The largest standalone RS/6000s (i.e., model S80) can have 
              a maximum of 26 SSA adapters per system, for a maximum capacity 
              of 90844 GB per system (although I've never heard of this being 
              done). 
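              The capacity figures above are simple products, and a short 
              Python sketch makes the arithmetic explicit. The constant and 
              function names are illustrative only and are not part of any IBM 
              tool. Note that the unrounded per-adapter figure is 3494.4 GB, so 
              the unrounded system total is about 90854 GB; the 90844 GB quoted 
              above results from multiplying the rounded 3494 GB by 26. 

```python
# Figures quoted in the text; the names are illustrative.
DISKS_PER_LOOP = 48
LOOPS_PER_ADAPTER = 2
LARGEST_DISK_GB = 36.4
MAX_ADAPTERS_S80 = 26


def adapter_capacity_gb(disk_gb=LARGEST_DISK_GB):
    """Raw capacity behind one adapter with both loops fully populated."""
    return DISKS_PER_LOOP * LOOPS_PER_ADAPTER * disk_gb


def system_capacity_gb(adapters=MAX_ADAPTERS_S80, disk_gb=LARGEST_DISK_GB):
    """Raw capacity behind every adapter in one system."""
    return adapters * adapter_capacity_gb(disk_gb)


print(round(adapter_capacity_gb(), 1))   # 3494.4
print(round(system_capacity_gb(), 1))    # 90854.4
```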
              Data travels between devices on the loop, not just between disks 
              and adapters. Thus, there can be many data transfers occurring at 
              the same time, which is called spatial reuse. Data travels in either 
              direction on the loop, thus preventing a single break from isolating 
              any disks. This is one of SSA's biggest benefits. Besides providing 
              fault tolerance, this allows the systems administrator to dynamically 
              add more disks to the system without taking anything offline. By 
              breaking the loop in only one place, you can quickly and safely 
              add another I/O drawer into a processing system. The loops are also 
              full duplex. 
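              To see why a single break cannot isolate a disk, it can help to 
              model the loop as a ring and check which nodes remain reachable 
              from the adapter. The following is a minimal sketch of that idea 
              only; the node names are made up, and the real SSA protocol does 
              far more than this reachability test. 

```python
from collections import deque


def build_loop(nodes):
    """Cable each node to the next; the last node is cabled back to the first."""
    count = len(nodes)
    return {frozenset((nodes[i], nodes[(i + 1) % count])) for i in range(count)}


def reachable(start, links):
    """Breadth-first search over the links that are still intact."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for link in links:
            if node in link:
                (neighbor,) = link - {node}
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return seen


# One adapter and a full 16-disk drawer on a single loop (hypothetical names).
nodes = ["adapter"] + [f"disk{i}" for i in range(1, 17)]
links = build_loop(nodes)

# A single break: every disk is still reachable from the adapter.
one_break = links - {frozenset(("disk4", "disk5"))}
all_reachable = reachable("adapter", one_break) == set(nodes)
print(all_reachable)   # True

# A second break isolates the disks that sit between the two breaks.
two_breaks = one_break - {frozenset(("disk9", "disk10"))}
isolated = sorted(set(nodes) - reachable("adapter", two_breaks))
print(isolated)        # ['disk5', 'disk6', 'disk7', 'disk8', 'disk9']
```

              This is also why a drawer can be added by opening the loop in one 
              place: while the loop is open, the adapter still reaches every 
              disk through one port or the other. 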
              Managing SSA 
              There are many AIX commands to display and manage storage technology; 
              some of these commands with SSA examples are shown in Tables 1-4. 
              For those not familiar with AIX, AIX has two devices for interfacing 
              with SSA disks: hdisks and pdisks. AIX defines a pdisk device for 
              every physical SSA disk. If you're using just a bunch of disks 
              (JBOD), there will be one logical disk device (hdisk) for every 
              physical disk device; if the disks are configured into a RAID 
              array, there will be one logical disk device for every RAID array. AIX 
              then defines a container, called a volume group, to manage groups 
              of logical disks. Most operations on storage, such as allocating 
              space, work with logical disks or volume groups. 
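              The pdisk-to-hdisk relationship can be pictured as a simple 
              mapping. The sketch below is a toy model only: the function name 
              is hypothetical, and the hdisk numbers it produces are invented, 
              since AIX assigns the real device names itself. 

```python
def logical_disks(pdisks, raid_arrays):
    """Group pdisks into logical disks: one per RAID array, one per JBOD disk.

    raid_arrays is a list of lists of pdisk names. Any pdisk that is not a
    member of an array is treated as a JBOD disk. The hdisk numbering here
    is illustrative; AIX chooses the real names.
    """
    in_array = {p for array in raid_arrays for p in array}
    groups = [list(array) for array in raid_arrays]
    groups += [[p] for p in pdisks if p not in in_array]
    return {f"hdisk{i}": members for i, members in enumerate(groups)}


pdisks = [f"pdisk{i}" for i in range(6)]
arrays = [["pdisk0", "pdisk1"], ["pdisk2", "pdisk3", "pdisk4"]]
mapping = logical_disks(pdisks, arrays)
for hdisk, members in mapping.items():
    print(hdisk, members)
```

              Six physical disks therefore appear as three logical disks: two 
              arrays plus one JBOD disk. 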
              Our Environment 
              Our environment consists solely of IBM RS/6000s: an SP2 
              frame with Silver nodes, two S80s, an F50 for the control workstation, 
              a Magstar L32 tape library, and an SP Switch. One S80 is the production 
              Sybase database server (hostname UNXP); the other is the data warehouse 
              (hostname UNXM) using Sybase IQ, Sybase ASE, and Cognos's Power 
              Mart. One of the Silver nodes is our development environment (hostname 
              UNXD), one is the Tivoli Storage Manager (TSM) server (hostname 
              UNXR), and two are for system test. All of the nodes have some SSA 
              disk attached to them, and we use only SSA for external disks. 
              The two production servers share four model D40s with 16 
              x 9.1-GB disks each, and one model 020 with 16 x 9.1-GB disks. Each 
              server has two Advanced Serial RAIDPlus SSA adapters (feature code 
              6230) with the 32-MB fast-write cache option, 128-MB cache option, 
              and micro code level A400. I recommend the fast-write cache, especially 
              if you're using RAID 5. The 128-MB cache option is recommended 
              for two-way fast-write cache operations. 
              The development environment contains one recently purchased model 
              D40 enclosure with 16 x 18.2-GB disks, which consolidated three 
              model 020 enclosures that were then moved to system test. The system 
              test environment has 36 x 4.5-GB and 4 x 9.1-GB disks. The TSM server 
              has 8 x 9.1-GB disks and 8 x 18.2-GB disks. The F50 is one of the 
              few RS/6000s that support internal SSA disks, so it does not require 
              an SSA I/O drawer. It has 10 x 9.1-GB SSA disks. 
              Currently, all of the external disk drives are server attached, 
              with no storage area network (SAN) yet. Managing a homogeneous 
              environment has not created the problems that a SAN would solve. 
              Although SSA disks can be used to boot the system, I prefer internal 
              SCSI disks for the operating system (i.e., rootvg). All of 
              the RAID levels that are used on all of the machines are implemented 
              in the adapter hardware. 
              The Production Database Server 
              The production database server's disks are configured to 
              solve two competing conditions: performance and availability. The 
              disks are configured into two RAID 1+0 arrays and one RAID 1 array. 
              The RAID 1 array has two disks, by definition, with one disk mirroring 
              the other. RAID 1+0 has both performance (striping) and availability 
              (mirroring). To further increase availability, hot spares (or redundant 
              disks) are used. RAID 1+0 first makes a copy (mirror) of every disk, 
              and when data is written to the disks, it is striped across all 
              disks (in the primary pool). The RAID adapter then automatically 
              makes a copy to the secondary pool. The primary and secondary pools 
              currently have seven disks each, plus one hot spare disk each, for 
              a total of 32 disks. The hot spares are configured into preferred 
              pools, meaning that each of the four groups of seven disks has a 
              hot spare specifically assigned to it. This was done to prevent 
              a hot spare from another enclosure from taking over for a failing 
              disk. 
              The first RAID 1+0 array is physically in the P001 I/O drawer. 
              Disks 1 to 7 are the primary pool, and disks 10 to 16 are the secondary 
              pool. The other RAID 1+0 array is in the P002 I/O drawer. Disks 
              8 and 9 in each I/O drawer are the hot spares. The RAID 1 array 
              is disks 1 and 16 in the CLR1 I/O drawer. (See Figures 2 and 3.) 
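              As a rough illustration of how striping and mirroring combine in 
              one of these drawers, the sketch below places stripe units on the 
              P001 layout just described. Two things here are assumptions and 
              are not documented behavior of the adapter: stripe units are 
              handed out round-robin, and slot k in the primary pool is 
              mirrored by slot k + 9 in the secondary pool. 

```python
# Slot layout of one drawer as described in the text.
PRIMARY = list(range(1, 8))       # slots 1-7
SECONDARY = list(range(10, 17))   # slots 10-16
HOT_SPARES = [8, 9]


def placement(stripe_unit):
    """Return (primary_slot, mirror_slot) for a logical stripe unit.

    Round-robin striping and the slot pairing are assumptions made for
    illustration only.
    """
    index = stripe_unit % len(PRIMARY)
    return PRIMARY[index], SECONDARY[index]


for unit in (0, 1, 6, 7, 9):
    print(unit, placement(unit))
```

              Consecutive stripe units land on different primary disks, so a 
              large transfer is spread over all seven spindles, and every unit 
              has a second copy in the other pool. 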
              Ideally, I would configure the primary pool of the first RAID 
              1+0 array and the primary pool of the second RAID 1+0 array to be 
              in one I/O drawer (i.e., P001), and the secondary pool of the first 
              RAID 1+0 array and the secondary pool of the second RAID 1+0 to 
              be in the other I/O drawer (i.e., P002). (See Figure 5.) One I/O 
              drawer should be located away from the server, perhaps on another 
              floor, or in another building. If one enclosure fails, I can still 
              have half of each array. Note that RAID arrays cannot cross loops. 
              Therefore, although the primary and secondary pools are in different 
              enclosures, they must be in the same loop. (See Figure 4 for the 
              wiring diagram, and then compare Figure 2 to Figure 4, and Figure 
              3 to Figure 5.) 
              There are currently two volume groups -- ssavg16Prod1 
              and ssavg16Prod2. ssavg16Prod1 has one RAID 1+0 array 
              and the RAID 1 array. ssavg16Prod2 has the other RAID 1+0 
              array. 
              After running with only two RAID 1+0 arrays for a couple of months, 
              I noticed a performance problem on one array. It turned out the 
              production Sybase server's tempdb was generating enough 
              reads and writes to saturate the array. I added a RAID 1 array on 
              a different SSA Adapter loop just to hold tempdb. The array 
              is disks 1 and 16 in the CLR1 I/O drawer. IBM recommends that half 
              of the disks in a RAID array be closest to one port and the other 
              half closest to the other port in the loop. Implementing this has 
              solved the performance problem. 
              The Data Warehouse 
              The data warehouse's disks are configured differently from 
              the production database server's. The data warehouse contains mostly 
              read-only data that is regenerated every couple of days. The data 
              is copied from the production database server, so there is little 
              critical permanent data residing there. The data warehouse also 
              has larger storage requirements than the production database 
              server. To this end, some of the storage is configured as RAID 1+0, 
              and some of it is RAID 5. 
              The data warehouse has 46 x 9.1-GB disks. The 16 disks in the 
              MIS1 enclosure comprise one RAID 1+0 array. Disks 2-14 in the CLR1 
              enclosure comprise another RAID 1+0 array. Finally, disks 1-8 and 
              9-16 comprise two RAID 5 arrays in the MIS enclosure. (See Figure 
              2.) 
              Managing storage on the data warehouse is much easier since I 
              reduced the number of volume groups. I previously had four volume 
              groups -- one for each RAID array. I have since migrated all 
              the arrays into one volume group. Once all the arrays were in one 
              volume group, it became very easy to move data around between the 
              different types of RAID arrays. For example, if we see poor performance 
              on one of the RAID 5 arrays for a particular logical volume (LV), 
              we can now migrate that LV to a RAID 1+0 array while the data is 
              being accessed. (This is a feature of the logical volume manager 
              and not SSA directly.) The reverse is also true -- lesser used 
              LVs have been moved to the RAID 5 arrays. 
              
            Characteristics of Production Disk Arrangement 
              Multiple Hosts 
              In the production system, there are two hosts -- UNXM and 
              UNXP. Any applications that need to be shared are installed on the 
              internal SCSI disks of both hosts. The data to be shared is on the 
              external SSA disks. SSA simplifies sharing data because the disks 
              can be cabled to as many as eight hosts. IBM's High Availability/Cluster 
              Multi-Processing (HACMP) software is installed as a cluster on these 
              two nodes. HACMP monitors the physical resources; in the event 
              of a complete failure of one system, the other system takes over. 
              There are some special considerations for attaching SSA disks 
              to more than one host: 
              
              
             
               All the disks and adapters still must form loops on one pair 
                of ports. Loops cannot cross ports. 
               In most cases, only one host may have a volume group online 
                at a time. 
               There are special rules for the number of adapters in a loop 
                based on the adapter type, whether you are using RAID, and whether 
                the fast-write cache is used. See the adapter's user manual. 
               Volume groups should not be configured to automatically varyon 
                at system boot, especially if you're using HACMP. 
               If you are using the fast-write cache in a multi-initiator, 
                multi-host loop, the overall throughput of each adapter will be 
                less, because some adapter cycles are required to synchronize 
                with the fast-write cache of the other adapter. 
               Put the disks closest to the host that will normally be using 
                them. 
               Currently, you cannot use RAID 0 arrays in a multi-initiator 
                loop. 
              Each SSA loop has two paths (ports) to each group of disks in 
              the loop, and data can travel bi-directionally around the loops. 
              Thus, a single break in the loop, such as a disk or host failing, 
              will not affect the system. 
              Redundancy 
              RAID 1+0 automatically allows for N/2 disks to fail, as long as 
              two copies of the same disk don't fail. With our hot spares, 
              N/2 + 2 disks can fail in each array, as long as two copies of the 
              same disk don't fail before the hot spare disk takes over and 
              the array is rebuilt. RAID 5 allows for one disk to fail, two with 
              a hot spare; in a RAID 1 array, one disk can fail. 
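              The RAID 1+0 rule above (up to N/2 failures, provided no disk and 
              its mirror fail together) can be expressed as a small check. This 
              is a sketch only; the mirror pairing of slot k with slot k + 9 is 
              an assumption made for the example, and hot-spare rebuilds are 
              not modeled. 

```python
# Assumed mirror pairs for a 14-disk RAID 1+0 array: slot k with slot k + 9.
MIRROR_PAIRS = [(slot, slot + 9) for slot in range(1, 8)]


def raid10_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """True if no mirrored pair has lost both of its disks."""
    failed = set(failed)
    return not any(a in failed and b in failed for a, b in mirror_pairs)


print(raid10_survives({1, 2, 3, 4, 5, 6, 7}))   # True: N/2 disks, all primary
print(raid10_survives({1, 11, 3, 13}))          # True: no pair lost entirely
print(raid10_survives({1, 10}))                 # False: disk 1 and its mirror
```

              Seven failures can be harmless while two can be fatal, which is 
              why the hot spares matter: they shorten the window during which 
              a second failure in the same pair would lose data. 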
              TSM Server 
              The disks on the Tivoli Storage Manager (TSM) server host are 
              configured mostly for speed. TSM is our backup software. There is 
              no redundancy except for the TSM database and log. I use one RAID 
              1 array for the TSM database and log. RAID 0 is used for the TSM 
              disk storage pool and a filesystem that the Sybase database backups 
              reside on. This filesystem is NFS exported and mounted on all the 
              other hosts. This allows all the hosts to easily back up and load 
              our databases. For example, the production database backs up to 
              this directory, then we load the database dump into the development 
              environment from this directory. 
              Development Server 
              The development environment is configured to maximize the amount 
              of the disk storage while providing some redundancy. To accomplish 
              this, I use RAID 5. Write operations on RAID 5 take at least twice 
              as long as on RAID 1+0 (or non-RAID) disks. We 
              really notice this when we load our production database into the 
              test environment. 
              How to Configure Volume Groups for Large RAID Arrays 
              The first problem I encountered when we started using many large 
              SSA RAID arrays was that they would not fit into our existing volume 
              groups. Because SSA is a high-capacity storage system, you will 
              probably need to modify your volume groups or create new ones to 
              contain the large disks. 
              We started out using many small disks, and thus the physical partition 
              (PP) size of our volume groups (VGs) was 4 MB. A physical partition 
              in AIX is the smallest unit of disk space that can be allocated 
              under the logical volume manager (LVM). Don't get this confused 
              with file system space -- filesystems are built on top of logical 
              volumes. In versions of AIX prior to 4.3, there was a limitation 
              of 1016 PPs per physical volume (PV), or logical disk. There was 
              also a limit of 32 physical volumes per volume group. 
              In AIX 4.3.1 and above, the only limitation is 32512 physical 
              partitions per volume group. The physical partition size must be 
              a power of two between 1 MB and 1024 MB, that is, 2^n MB where 
              0 <= n <= 10 (see sidebar "Creating Large Volume Groups"). 
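              The limits above determine the smallest physical partition size 
              that can hold a given logical disk. The Python sketch below works 
              that out under the per-disk limit of 1016 partitions; the 
              function names are hypothetical, and the disk sizes are rough 
              decimal conversions (9.1 GB taken as 9100 MB), so treat the 
              results as estimates rather than exact LVM behavior. 

```python
MAX_PPS_PER_PV = 1016    # per-physical-volume limit quoted in the text
MAX_PPS_PER_VG = 32512   # per-volume-group limit quoted in the text
MAX_PP_SIZE_MB = 1024    # largest allowed physical partition size


def min_pp_size_mb(disk_mb, max_pps=MAX_PPS_PER_PV):
    """Smallest power-of-two PP size (MB) that covers disk_mb in max_pps PPs."""
    size = 1
    while size * max_pps < disk_mb:
        size *= 2
    if size > MAX_PP_SIZE_MB:
        raise ValueError("disk is too large for any valid PP size")
    return size


def vg_capacity_mb(pp_size_mb):
    """Largest volume group (MB) that fits under the per-VG partition limit."""
    return MAX_PPS_PER_VG * pp_size_mb


print(min_pp_size_mb(4064))     # 4: the largest disk a 4-MB PP size can hold
print(min_pp_size_mb(9100))     # 16: one 9.1-GB disk
print(min_pp_size_mb(63700))    # 64: seven 9.1-GB disks striped together
print(vg_capacity_mb(4))        # 130048
```

              With a 4-MB partition size, a logical disk larger than about 
              4 GB does not fit under the 1016-partition limit, which is why 
              RAID arrays of many 9.1-GB disks would not fit into volume groups 
              created for small disks. 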
              Performance and Tuning 
              The first thing you need when trying to increase computer speed 
              is a yardstick -- a measurable event. For us, it is our backups 
              because they make heavy use of the SSA disks. We store all of our 
              performance and tuning data in a Sybase database and use our data 
              warehousing tools to analyze the data. 
              One of the most important events that we monitor (partly for speed, 
              mostly to ensure they work) is the production database backups. 
              The production Sybase database server backs up to an NFS-mounted 
              filesystem over the SP Switch. 
              For the backups, the data must be read on the client side, sent 
              to the NFS mount over the SP switch, and written to the hard drives 
              on the NFS server. This involves tuning the client hard drives, 
              the client's SSA adapters, the client, NFS, the NFS server 
              (UNXR), NFS server's SSA adapters, and NFS server's hard 
              drives -- an almost impossible task that I'm sure I haven't 
              gotten right yet. The client-side disks are RAID arrays -- RAID 
              1, 5, or 1+0. The NFS server's disks are RAID 0 arrays. 
              I use a script that I wrote called iostat_logging (see 
              Listing 3) to monitor disk performance. This dumps the output of 
              the iostat command into a database table every 15 minutes. 
              This is very useful to gauge the disks' performance and view 
              trends over time. 
              The production server can send data to the filesystems at about 
              6-8 MB/s. Also, this is how we get our database backups to tape; 
              TSM will back up the database dumps as regular files. 
              It took about 135 minutes (about 4 MB/s) to load the production 
              database into the test environment, and it took only 54 minutes 
              (about 11 MB/s) to load the exact same dump into the data warehouse. 
              The database is on two RAID 1+0 arrays in the data warehouse. See 
              the sidebar "Simple Rules for Tuning SSA Devices" for 
              SSA tuning recommendations. 
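              The two load times can be cross-checked with a little 
              arithmetic. The dump size is not stated above, so the sizes 
              computed here are only what the rounded rates imply; the function 
              names are illustrative. 

```python
def implied_size_mb(minutes, mb_per_s):
    """Data volume implied by a transfer time and an average rate."""
    return minutes * 60 * mb_per_s


def speedup(slow_minutes, fast_minutes):
    """How many times faster the second load was."""
    return slow_minutes / fast_minutes


print(implied_size_mb(135, 4))    # 32400 MB from the test-environment load
print(implied_size_mb(54, 11))    # 35640 MB from the data warehouse load
print(speedup(135, 54))           # 2.5
```

              Both estimates point to a dump somewhere in the range of 32 to 
              36 GB, and the RAID 1+0 load ran about 2.5 times faster than the 
              load into the test environment. 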
              Simple Recommendation for RDBMS 
              Had we not been using SSA disks, maintaining and expanding our 
              production database server would have been more difficult. In the 
              past five years, we have upgraded the hardware three times. Each 
              time, all that was required to migrate the database was to unplug 
              the SSA disks on one host and plug them into the new system. The 
              production database server grows at a rate of about 10 GB per year, 
              and SSA has allowed us to easily expand with it. The fast-write 
              cache allows database transactions to commit faster, thereby increasing 
              the speed and throughput of the database server. Finally, having 
              the databases attached to multiple hosts allows for almost 100% 
              uptime. 
              Systems administrators often wonder how they should configure 
              their hard drives to work best with databases. I have been using 
              the following recommendations: 
              
              
             
                Commonly joined tables, and tables and their indexes, should 
                 reside on different disks or arrays and, if possible, different 
                 adapters. 
                Use many smaller drives before fewer larger drives. 
                Use the correct RAID level. Different objects should be on 
                 different types of RAID. Don't put your log on a RAID 5 array, 
                 for example. I use only RAID 1+0 and RAID 1 for our production 
                 database. 
              Tuning the Host 
              To get the most out of an SSA disk subsystem, the performance 
              and tuning exercise must include tuning the host and, in particular, 
              the virtual memory manager (VMM). The VMM in AIX manages virtual 
              memory, which includes real memory and swap space. The command to 
              check and change the VMM's settings is vmtune. It is 
              found in /usr/samples/kernel/vmtune (if you have installed 
              bos.adt.samples). See the sidebar "Parameters" 
              for parameters that I have changed. 
              Tuning AIO Servers 
              AIX uses asynchronous input/output (AIO) servers to perform asynchronous 
              I/O to the disk subsystem. Using AIO can greatly improve SSA performance 
              because the server doesn't have to wait for the I/O to complete. 
              For example, to check or change your settings: 
              
             
root@unxm:/>smitty aio
Select Change / Show Characteristics of Asynchronous I/O
              The minimum number of servers is the number of AIO servers that get 
              started at system boot; the maximum number of servers is the maximum 
              number of servers that will run on that machine. As a starting point, 
              set the maximum to the number of disks performing AIO times 10, and 
              set the minimum to the maximum divided by 2. 
              To find out how many AIO servers you currently have running, use: 
              
             
pstat -a | grep -i aios | wc -l
              If this number equals your minimum value, you may be able to reduce 
              the number of AIO servers. If it equals your maximum, you may need 
              to increase the number of AIO servers. If it is between your minimum 
              and maximum, no changes should be required. 
              Other Tuning Considerations 
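              The AIO server rules given earlier (maximum = disks performing 
              AIO times 10, minimum = maximum divided by 2, then compare the 
              running count with both) lend themselves to a small helper. This 
              is a sketch of those rules of thumb only; the function names and 
              the advice strings are made up, and the running count would come 
              from the pstat command shown earlier. 

```python
def starting_aio_limits(disks_doing_aio):
    """Starting point from the text: max = disks x 10, min = max / 2."""
    maximum = disks_doing_aio * 10
    minimum = maximum // 2
    return minimum, maximum


def aio_advice(running, minimum, maximum):
    """Apply the rule of thumb to the number of AIO servers now running."""
    if running >= maximum:
        return "increase"     # at the ceiling: may need more servers
    if running <= minimum:
        return "reduce"       # never grew past the floor: may need fewer
    return "no change"


minimum, maximum = starting_aio_limits(14)
print(minimum, maximum)                      # 70 140
print(aio_advice(70, minimum, maximum))      # reduce
print(aio_advice(140, minimum, maximum))     # increase
print(aio_advice(100, minimum, maximum))     # no change
```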
              Depending on your own situation, tuning the following may improve 
              SSA or overall system performance: 
              
              
             
               Sync daemon -- syncd 
               I/O pacing -- high and low water marks 
               schedtune -- /usr/samples/kernel/schedtune 
              Production Database Server Stats 
              Whenever our production database is backed up, we automatically 
              record statistics from the dump. The graph in Figure 6 shows the 
              aggregate throughput of data from the production database server's 
              hard drives to the NFS mount. The most noticeable feature in Figure 
              6 is the huge drop in performance from July to September of 2000. 
              Unfortunately, I haven't been able to pinpoint what changed 
              and completely reverse it. If we had not tracked the statistics, 
              we never would have noticed the decrease in performance. 
              TSM Server Stats 
              This is an example of a graph that PowerPlay produces from the 
              iostat_logging script from UNXR. It shows the Kbytes/s of 
              the hdisks every 15 minutes for the month of February. The values 
              are the average for every day in the month -- either the average 
              for the 15-minute period or the average for the hour. For examples, 
              see Figures 7 and 8. 
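              The averaging behind such a graph can be sketched in a few lines 
              of Python. This is not the PowerPlay logic or the iostat_logging 
              script, only an illustration of the idea: group the samples by 
              time of day and average each group across all days. The sample 
              values below are made up. 

```python
from collections import defaultdict
from datetime import datetime


def average_by_slot(samples, slot_minutes=15):
    """Average Kbytes/s per time-of-day slot across all days.

    samples is an iterable of (datetime, kbytes_per_s) pairs. Use
    slot_minutes=60 to get one average per hour instead.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for stamp, kbps in samples:
        minute = (stamp.minute // slot_minutes) * slot_minutes
        bucket = totals[(stamp.hour, minute)]
        bucket[0] += kbps
        bucket[1] += 1
    return {slot: total / count
            for slot, (total, count) in sorted(totals.items())}


# Made-up samples for two days.
samples = [
    (datetime(2001, 2, 1, 19, 0), 20000),
    (datetime(2001, 2, 2, 19, 0), 24000),
    (datetime(2001, 2, 1, 19, 15), 10000),
]
print(average_by_slot(samples))        # {(19, 0): 22000.0, (19, 15): 10000.0}
print(average_by_slot(samples, 60))    # {(19, 0): 18000.0}
```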
              In the figures (which use ADSM, the former name of TSM), the ADSM 
              database/log peaks at 0700 and 0000 are when the ADSM database is 
              backed up to tape. The NFS mount is busy 
              from 0100-0230 when the production Sybase database does its backups 
              to the NFS mounted filesystem. Then the filesystem is backed up 
              to tape between 0300 and 0500. The large peak at 1900 is when UNXM's 
              and UNXD's Sybase databases do their backups. The sustained 
              use of the NFS mount in the morning is the production database being 
              loaded into various test environments. The array is fast enough 
              for the clients, sustaining about 10 MB/s. The bottleneck is probably 
              NFS. Notice the peak transfer rate is about 22 MB/s, and the sustained 
              rate for the 19th hour was 14 MB/s. The ADSM storage pool is busy 
              at various times during the day when it is backed up to tape, or 
              when the nodes' filesystems are backed up or restored. 
              Scripts 
              I use several Korn shell and Perl scripts to manage storage and 
              have included some of them here: 
              
              List Disks (lst_disks) -- lst_disks gives an 
              hdisk-by-hdisk (which may be a RAID array) report of physical volumes, 
              volume groups, and logical volumes. With it, you can easily find 
              out which disk an LV is on, how big it is, its intra-disk distribution, 
              and the mount point if it is a filesystem. It will show how much 
              space is left on a logical disk and the physical partition size 
              of the volume group. (See Listing 1.) 
              Enclosure Report (encl_to_vg) -- With the new SSA Drawers, 
              model D40, you can now identify the drawers with a four-character 
              string of your choosing. This has made physically identifying individual 
              disks much easier than with previous SSA drawers. I wrote this script 
              to give a report of each physical disk in a given drawer. It is 
              very handy also for showing the pdisk-to-hdisk relationship, and 
              easily shows when an hdisk is a RAID array. In a roundabout way, 
              it also shows what disks may be hot spares. (If you want to get 
              detailed information on RAID arrays, read the file /usr/ssa/ssaraid/ssaraid.README.) 
              Also included is the micro-code level of each disk and the type 
              of each disk. This is great for making sure that all disks in a 
              RAID array are the same. (See Listing 2.) 
              Log iostat information to a Sybase database (run_iostat_logging) 
              -- Because we already had the data warehouse environment set 
              up, I thought it would be neat to store the disk I/O statistics. 
              Thus, I wrote a script that stores iostat information in a Sybase 
              database. (See Listing 3.) 
              Miles df report (mi_df) -- Because I was fed 
              up with trying to read the standard df command's output, 
              I wrote a little Perl script to fancy up the output. (See Listing 
              4.) 
              Final Thoughts 
              Imagine that your production server is out of disk space, and 
              your 24x7 environment doesn't allow for down time. By breaking 
              your current SSA loop in only one place, you can add another I/O 
              drawer and 16 x 18.2-GB new disks. You configure the disks and bring 
              a new RAID array online. If you're using a logical volume manager, 
              you can even spread the load between the two arrays, all the while 
              the server and disks are online and the data can be read and written 
              to. You have just saved the company and your hide. 
              References 
              What is SCSI -- http://whatis.techtarget.com/WhatIs_Definition_Page/0,4152,214242,00.html 
              SSA Support sites -- http://www.hursley.ibm.com/ssa 
              and http://www.ibm.com 
              "Storage Capacity Planning". Zwieback, Dave D. 1999. 
              Sys Admin. August 1999. 
              Monitoring and Managing IBM SSA Disk Subsystem -- http://www.redbooks.ibm.com 
              Advanced SerialRAID Plus Adapter Planning Guide. 2nd Ed., 
              October 2000. IBM. 
              Waters, F. 1996. AIX Performance and Tuning. New York: 
              Prentice Hall. 
              "Serial Storage Architecture." Judd, Murfet and Palmer 
              1996 -- http://www.research.ibm.com/journal/rd/judd/judd.html 
              PCI Adapter Placement Guide. 10th Ed., February 2000. IBM. 
              Miles Purdy has a computer science degree from the University 
              of Manitoba and works for the Canadian Federal Government's 
              department of Agriculture (Agriculture and Agri-Food Canada, Farm 
              Income Programs Directorate division). He is a system manager, managing 
              a mid-sized RS/6000 environment. His department provides farm financial 
              programs to approximately 250000 Canadian farmers and provides income 
              stabilization and disaster assistance. He can be contacted at: [email protected]. 
           |