Cover V12, i11

More Truth about Tapes, Backups, and Restores

Peter Baer Galvin

In the July issue (http://www.samag.com/documents/s=8284/sam0307h/0307h.htm), I discussed backup solutions and the reasons that sites stay with antiquated tape and software technologies. This month, I'll cover a variety of standard and unorthodox backup and restore solutions, and reveal more truths.

The truth about tape: a SCSI bus is still a better choice than a Fibre Channel bus in some situations.

In spite of the vendor hype, new technology is not always the best solution. A case in point is Fibre vs. SCSI tape and library attachment. While Fibre has some benefits, especially the increased distance allowed between the connection points, SCSI is still the best solution in some scenarios. SCSI solutions tend to be more complex, but sometimes increased complexity is paid for by increased function or reliability. Consider a Fibre HBA in a host, a SCSI-based tape library, and a bridge between the host and the library. Experienced systems administrators might frown on this solution due to the need for multiple cables, connections, and devices. And indeed it could be more prone to failure, more overhead to manage, and more complex to debug, but consider that the bridge includes buffering that can help stream data to the tape drives (making them more efficient and decreasing backup time). The Fibre to the host allows the library to be distant, and the SCSI drives and robot can be more reliable than their Fibre equivalents.

I've seen instances when a Fibre-based robot control had intermittent errors, while SCSI-based robot controls were rock-solid. Given that Fibre is relatively new on the backup scene, and that SCSI-based drives and robots have been in use for years, this should not be surprising. Also consider that some drives are not yet available with Fibre-attach, so this solution could allow for the best combination of drive and robot technology while maximizing throughput, reliability, and distances between the devices.

The truth about tape: SANs are not just for disks.

If your site has several data-centric hosts, a backup SAN could be a good next-generation backup solution. Many sites have moved to this architecture simply because their backup networks could not move data fast enough to meet backup and restore window requirements. A sample backup SAN is shown in Figure 1. Of course there are many options and variations on this architecture. In this example, database servers are directly attached to a storage array, and other hosts with storage are connected on a network. The backup server is connected to the robot and to the network. Backup data can stream from the database servers, through Fibre, through the Fibre-SCSI bridges, to the SCSI tape drives in the tape library. Likewise, data can flow across the network, through the backup server, to the library via Fibre to Fibre-attached tape drives. This type of facility has several advantages (and disadvantages) compared to direct-attach solutions:

  • It allows for a central library to be used by many hosts.
  • Throughput between hosts and tape can be scaled as required, simply by increasing connections between the host, the SAN switches, and the tape library.
  • Backup management can be simplified by backup software that can coordinate all hosts, storage, and library use.
  • Adding hosts (and accompanying storage) is less likely to require new infrastructure. In the usual case, the existing library can accommodate the new data, or might require additional tape drives within the library. With network-based backups, adding hosts might cause a backup network to be inadequate. There the solution might be to change the entire network infrastructure between hosts and the backup server.
  • Depending on the storage technology in use, backups can bypass the application server and require only the backup server to drive backup throughput. In Figure 1, the backup server is also connected to the main storage. If the storage array supports snapshots or other splitting/reconnecting technology, a copy of the data to be backed up can be created and exported to the backup server. (See also near-line storage below.)
  • Restores can be much faster than in direct-attach tape scenarios, as all tape drives in the central library could be used for a restore, moving more megabytes per second than if a smaller library or single tape drive were attached to the target host.

A downside to a centralized backup solution is that backup software needs to be more advanced to manage the backup and recovery (B&R). My favorite solution is Veritas NetBackup (NBU). From one central management console, NBU can orchestrate the allocation of tape drives, use of those drives for backup or restore, and release of the drives for use in some other backup or restore. Another downside is the cost of implementing a backup SAN compared with a standard network-based solution.
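The allocate/use/release cycle described above can be sketched as a toy drive pool. This is only an illustration of the idea, not NetBackup's interface; the class, method, host, and drive names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DrivePool:
    """Toy model of a shared library's tape drives (hypothetical, not an NBU API)."""
    free: set = field(default_factory=set)
    in_use: dict = field(default_factory=dict)  # drive name -> (host, job type)

    def allocate(self, host, job, count=1):
        """Reserve up to `count` free drives for a backup or restore job."""
        if job not in ("backup", "restore"):
            raise ValueError("job must be 'backup' or 'restore'")
        granted = sorted(self.free)[:count]
        for drive in granted:
            self.free.remove(drive)
            self.in_use[drive] = (host, job)
        return granted

    def release(self, drives):
        """Return drives to the pool so another job can use them."""
        for drive in drives:
            if self.in_use.pop(drive, None) is not None:
                self.free.add(drive)


pool = DrivePool(free={"drive0", "drive1", "drive2", "drive3"})
nightly = pool.allocate("db1", "backup", count=2)
urgent = pool.allocate("web1", "restore", count=4)  # only two drives remain
pool.release(nightly)                               # backup done, drives free again
```

In this sketch a restore that asks for more drives than are free simply gets what is left; a real scheduler would presumably queue or prioritize requests instead.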

The truth about tape: unlike almost all other technology, the larger the tape robot the lower the price per capacity.

Several small libraries are usually more costly than one large library with the same capacity. Add to the equation the maintenance cost and management overhead involved when more devices are in the mix, and a central tape library makes sense, in most circumstances. Sun has a nice library comparison available (although pricing information will have to be obtained elsewhere): http://www.sun.com/storage/tape/tape_lib_comparison.html.

The truth about tape: like operating systems are best at backing up and restoring their own kind.

A few years ago, I was consulting at a large site containing a mix of Sun and Windows machines. The backup software they were using was running on Windows. The backups appeared to work fine. Unfortunately, restores of the Solaris machines, especially of their root disks, did not go so well. Generally, the "media server" -- the machine driving the data to the tape drives -- should be of the same operating system type as the machines the data are coming from. Sometimes mixing does work, but because it is more prone to failure, why take an unnecessary risk?

The truth about tape: changing backup software is difficult.

Although I've talked with many sys admins who were unhappy with their current backup software, only rarely have they felt enough pain to switch to another solution. Some folks even lost data when a disk problem was followed by a restore failure, yet continued with that infrastructure rather than replacing it. The reasons vary by site, but mostly they involve the cost of buying the new software. Also consider that the old software usually must be kept to enable restores from the old backup tapes. Because changing is so difficult, the choice of backup software is likely to be a long-term one, so choose wisely.

The truth about tape: the B&R technologies are rapidly evolving.

There are several new technologies affecting the B&R problem. Backup SANs have already been described. Another is new tape drive technologies that increase speed and density. Some tape drives provide write-once-read-many (WORM) functionality. The most interesting action is in the appliance space. For example, Decru has a box that encrypts data as it flows from the backup server to the tape drives (and decrypts it during restore). This solution can assure that all the hard work of securing your environment is not undone by clear text tapes being taken off site.

Another area of appliance solutions is instant restore. These new devices, from companies such as Revivio, record all writes performed to the main storage. They can then provide a view of the storage as it was at any point in time, such as just before a database corruption or a data loss. This type of solution augments standard backups rather than replacing them. The key benefit is that tape is unlikely ever to be needed for restores -- the restores can be done directly from the appliance.
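The record-every-write idea can be illustrated with a small journal that replays writes up to a chosen moment. This is a conceptual sketch only, and makes no claim about how any vendor's appliance is built; the class name, block numbers, and timestamps are invented for the example.

```python
class WriteJournal:
    """Toy model: log every block write with a timestamp, then rebuild past views."""

    def __init__(self):
        self._log = []  # (timestamp, block number, data), in arrival order

    def record(self, timestamp, block, data):
        """Append one write; writes must arrive in time order."""
        if self._log and timestamp < self._log[-1][0]:
            raise ValueError("writes must be recorded in time order")
        self._log.append((timestamp, block, data))

    def view_at(self, timestamp):
        """Return {block: data} as the storage looked at `timestamp`."""
        view = {}
        for when, block, data in self._log:
            if when > timestamp:
                break
            view[block] = data  # later writes overwrite earlier ones
        return view


journal = WriteJournal()
journal.record(100, 7, b"good row")
journal.record(200, 8, b"index")
journal.record(300, 7, b"corrupted")  # the bad write we want to step back from

before_corruption = journal.view_at(299)
```

Here `before_corruption` still holds the original contents of block 7, which is the "just before the corruption" view the text describes.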

The truth about tape: applications add another layer of complexity.

If it weren't for those dang applications, B&R would be easy! Consider a database server, and the complexities of assuring that the data to be backed up is consistent. Usually this involves the backup software communicating with the database software to coordinate access to the data. A simpler solution that is appropriate with smaller databases is for the database administrator to automate extraction of data (while it is quiesced) to a locally accessible disk, and for the backup to copy that extract to tape.
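As a rough sketch of the extract-then-back-up approach, the following uses Python's built-in sqlite3 module as a stand-in for a real database server. The file and table names are hypothetical, and a production database would be quiesced or exported with its own vendor tools rather than with this call.

```python
import os
import sqlite3
import tempfile


def extract_database(source_path, extract_path):
    """Copy a live SQLite database to a consistent extract file on local disk."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(extract_path)
    try:
        # Connection.backup copies all pages in one step by default, so the
        # extract reflects a single point in time.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return extract_path


workdir = tempfile.mkdtemp()
live_db = os.path.join(workdir, "orders.db")          # hypothetical database
extract = os.path.join(workdir, "orders-extract.db")  # what the tape backup picks up

conn = sqlite3.connect(live_db)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("tape",), ("drive",)])
conn.commit()
conn.close()

extract_database(live_db, extract)
```

The backup software then only needs to copy the extract file to tape as an ordinary file, with no knowledge of the database itself.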

The truth about tape: near-line is a new name for an old idea (which doesn't make it bad).

One big B&R industry buzz is about "near-line storage". This is not a new idea. Amanda, the free backup software package, uses disk as the first stop for data being backed up. It then moves the data from that buffer to tape. Old ideas can be good ones, and near-line storage does have a place in many environments. Typically it should be used when a tape solution cannot meet the required backup or restore window. The data are copied to lower cost disk (near-line storage) during the backup window, and then to tape during a larger window. Restores can be done from the near-line storage as well, with the cost of near-line storage being the only limiting factor in determining the number of backups kept before deletion.

Note that no one is making the argument that near-line storage is as cost effective as tape. Rather, it is lower cost than the main storage, and faster than tape, and therefore can be of use in some environments. Finally, note that the data could be moved to near-line via backup software, in which case it is in backup-tape format and would need to be restored to be used. Alternatively, the data could be copied to near-line, and would then be in its native file system format and be available for testing or disaster recovery use.
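Because the amount of near-line disk is what limits how many backups are kept, the retention decision can be sketched as simple arithmetic. The capacities, backup sizes, and function names below are made up for illustration.

```python
def backups_retained(capacity_gb, backup_size_gb):
    """How many equally sized backups fit on the near-line disk."""
    if backup_size_gb <= 0:
        raise ValueError("backup size must be positive")
    return int(capacity_gb // backup_size_gb)


def expire_oldest(backups, capacity_gb):
    """Drop the oldest backups until the rest fit.

    `backups` is a list of (label, size_gb) tuples, oldest first.
    """
    kept = list(backups)
    while kept and sum(size for _, size in kept) > capacity_gb:
        kept.pop(0)  # the oldest copy is assumed to be on tape already
    return kept


# Hypothetical site: 2000 GB of near-line disk, 600 GB per nightly backup.
nightly = [("mon", 600), ("tue", 600), ("wed", 600), ("thu", 600)]
on_disk = expire_oldest(nightly, capacity_gb=2000)
fits = backups_retained(2000, 600)
```

With these assumed numbers, three nightly backups stay on disk for fast restores and Monday's copy would be available only from tape.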

The truth about tape: network backups work.

With all of the above said about more complex backup environments, consider that network-based backup infrastructure has been around for a long time, and works for a lot of sites. The more complex architectures certainly have their place, and are required in some circumstances, but they should not be used when a simpler solution would do. If the performance (or lack thereof) of an existing network facility is the only impetus to move forward, consider using parallel paths to stretch throughput. For example, IP Multipathing could allow easy use of multiple backup networks. A 1-Gb network, when driven by a Sun server, can move approximately 450 Mb per second. Putting in a second or third gigabit interface and using IP Multipathing can double and triple that performance. Do not forget that sufficient CPU would be needed to drive that throughput. A good rule of thumb is that one fast CPU can drive one high-throughput interface (be it a Fibre Channel HBA or a gigabit network interface).
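The rule of thumb above lends itself to a quick back-of-the-envelope estimate. The sketch below assumes "Mb" means megabits, that throughput scales linearly with interfaces as the text suggests, and that gigabytes are decimal; the function names and the 500-GB data set are hypothetical.

```python
PER_INTERFACE_MBIT = 450  # approximate figure quoted above for one gigabit interface


def aggregate_mbit_per_sec(interfaces, cpus, per_interface=PER_INTERFACE_MBIT):
    """Rule of thumb: each fast CPU can drive one high-throughput interface."""
    driven = min(interfaces, cpus)
    return driven * per_interface


def hours_to_move(data_gigabytes, mbit_per_sec):
    """Time to move the data at a sustained rate, ignoring tape and disk limits."""
    megabits = data_gigabytes * 1000 * 8  # decimal gigabytes to megabits
    return megabits / mbit_per_sec / 3600


one_path = hours_to_move(500, aggregate_mbit_per_sec(interfaces=1, cpus=1))
three_paths = hours_to_move(500, aggregate_mbit_per_sec(interfaces=3, cpus=3))
cpu_bound = aggregate_mbit_per_sec(interfaces=3, cpus=2)  # third interface goes unused
```

Under these assumptions, 500 GB takes roughly two and a half hours over one interface and under an hour over three, provided there are three CPUs to drive them.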

The Truth

It is important that any B&R architecture focus on the primary problem. That problem is not backups and their performance; rather, it is restores and their performance and reliability.

Conclusions

There are quite a few obvious and subtle complexities involved in designing or redesigning a backup and restore solution. Mistakes can come from overlooking or underestimating these complexities. Specific areas to consider are server-to-tape attachment, throughput of backup and restore, robot control, and application interactions. Hopefully the truths revealed here can help assure that your B&R solution meets your requirements both immediately and well into the future.

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.