Using NetWorker and UFS Snapshots to Back Up Oracle Databases
Christopher T. Beers
In early 2001, I was asked to find a solution that would allow
us to back up an Oracle 8.0.5 database after the machine was
upgraded to Solaris 8. At that time, the machine was running
Solaris 2.6 and a turnkey application built on top of that
outdated release of Oracle. We backed up the system using BMC
SQL-BackTrack to a Legato NetWorker server. This setup provided
our customers with point-in-time recoveries using Oracle's
archive logging, warm online backups during the week, and cold offline
backups weekly on Sunday mornings. Our customers, of course, expected
similar service levels from the new solution -- specifically,
downtime in which to take a cold backup of the database was at a premium.
We wanted to move away from the BMC SQL-BackTrack solution because
BMC no longer supports its tool for backing up that outdated Oracle release.
This left two options -- back up the database using the vendor-provided
methods, or create something that we would support in house. The
vendor supported nightly cold offline backups of the database to
locally attached tape drives using ufsdump, or UFS snapshots of
database mount points to locally attached tape drives using ufsdump.
Neither of these solutions worked well in our environment because
the machine did not have a locally attached tape drive, and I did
not want to manage tapes for this particular host. (I spend too
much time managing tapes as it is.)
I needed to develop a solution that would be somewhat supported
by the application vendor but workable in our enterprise environment.
This led to the creation of snapshot.sh (Listing 1), which, when
used in conjunction with Legato NetWorker, provides our ultimate
network-based backup solution.
What makes this solution unique is that it backs up a restorable
Oracle database without requiring the purchase of the very expensive
Legato Oracle Business module. It also requires
very little downtime to get nightly cold backups of the database,
which are generally better and more reliable than warm backups.
Overview
Legato NetWorker
Legato NetWorker is a network-based backup tool with a client/server
architecture that allows cross-platform backups to a central server.
The NetWorker administrator configures clients within the server
software and then groups the clients together. Each client resource
in the NetWorker server has a list of filesystems that will be backed
up. Groups are lists of clients that start their backup at the same
specified time.
In its vanilla state, the NetWorker server begins a group at the
specified start time and instructs each client to initiate its backup
with the server. The server tells the client which filesystems to
back up, at what levels (full, incremental, 0-9, etc.) to perform
the backup, and where to send its data. The client then uses the
NetWorker save command to send the filesystem data over the
network to the tape device.
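For reference, a manual invocation of the save command from the
client looks roughly like this (the server name is just a placeholder,
reused from the recovery example later in this article):
# save -s xxxx.syr.edu /export/home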
NetWorker allows you to specify a different backup command to
override the default save command. When overriding the default
save command, you are responsible for sending the data to
the NetWorker server using the save command yourself. This
allows administrators to write scripts around the save command
to perform any action, as well as to back up the requested information.
Legato also provides an alternate save command, savepnpc.
savepnpc performs some pre-processing activities before the
backup is actually started, uses the save command to send
data to the server, and then performs some post-processing activities
after the backup is complete. This is the command that we will use
to integrate our solution with NetWorker.
Solaris UFS Snapshots
Sun introduced the ability to take snapshots of UFS filesystems
in their Solaris 8 01/01 release using a utility called fssnap.
This command allows you to take a snapshot of a currently mounted
filesystem that can then be used to back up the filesystem. The
snapshot is a read-only image. Changes to the live filesystem are
tracked block by block in a backing-store file, on which you can
place a maximum size.
Snapshot.sh uses fssnap to create read-only snapshots
of filesystems that contain Oracle or related application data.
The script then mounts these filesystems onto mount points so that
they can be backed up. The mounted snapshot is a virtual device
and not a copy of the filesystem at the time the command is run,
as this would take way too long to complete. A snapshot of a 120-GB
filesystem takes seconds to complete and mount. During the backup,
writes to the real filesystem are tracked into the backing-store
file and "virtually subtracted" from the read-only image.
To create a filesystem snapshot with fssnap, you execute
the following command and fssnap will return the device name
created. This command places a 500-MB limit on the backing-store
file /scratch/bstore-file:
# /usr/sbin/fssnap -F ufs -o maxsize=500m,bs=/scratch/bstore-file /export/home
/dev/fssnap/0
You can then mount this device onto a mount point as shown:
# mount -F ufs -o ro /dev/fssnap/0 /mnt
You can get a listing of snapshots that currently exist using the
-i switch to fssnap. If you specify the original mount
point after the -i, it gives you more verbose information
about that particular snapshot:
# fssnap -i
0 /export/home
# fssnap -i /export/home
Snapshot number : 0
Block Device : /dev/fssnap/0
Raw Device : /dev/rfssnap/0
Mount point : /export/home
Device state : idle
Backing store path : /scratch/bstore-file
Backing store size : 480 KB
Maximum backing store size : 500 MB
Snapshot create time : Tue Jan 07 09:57:07 2003
Copy-on-write granularity : 32 KB
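When a snapshot is no longer needed, unmount it, delete it with the
-d switch, and remove the backing-store file. This is essentially the
cleanup that snapshot.sh performs in its post-processing phase:
# umount /mnt
# fssnap -d /export/home
Deleted snapshot 0.
# rm /scratch/bstore-file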
Oracle Script Modifications
To shut down Oracle unattended, I usually make some slight modifications
to the Oracle-provided shutdown scripts. The dbshut command,
usually found in $ORACLE_HOME/bin, is executed to stop the database.
The dbshut script uses the $ORACLE_HOME/bin/svrmgrl
command to shut down the database. The following patch file is what
I usually apply to the Oracle 8.1.7 dbshut script after initial
setup and configuration of the database:
*** dbshut.original Thu Jan 24 09:51:57 2002
--- dbshut Thu May 23 10:13:28 2002
***************
*** 69,79 ****
fi
case $VERSION in
6) sqldba command=shutdown ;;
*) $SQLDBA <<EOF
connect internal
! shutdown
EOF
;;
esac
if test $? -eq 0 ; then
--- 69,80 ----
fi
case $VERSION in
6) sqldba command=shutdown ;;
*) $SQLDBA <<EOF
connect internal
! alter system switch logfile;
! shutdown immediate
EOF
;;
esac
if test $? -eq 0 ; then
The default dbshut script uses the shutdown command,
which waits for all users connected to the database to disconnect
their sessions (and at 2:30 a.m. this usually does not happen)
before shutting down and dismounting the database.
To make sure the database closes properly and dismounts, I use the
shutdown immediate command instead. This command
cleanly shuts down the database and forces users' sessions to disconnect.
Just before executing shutdown immediate, I force a redo log
switch using alter system switch logfile so that I can be reasonably
assured that the current redo log does not hold any transactions.
Locating Critical Oracle Filesystems
For the snapshot.sh script to function correctly, some
environment variables must be set, including the filesystems
that need to be snapshot with fssnap. To get a list of critical Oracle filesystems,
I wrote the following shell script that I install on Oracle database
servers upon initial creation and installation:
#!/bin/sh
# Report the locations of this instance's data files, control files, and redo logs
ORACLE_SID=my_SID
export ORACLE_SID
/opt/oracle/product/8.1.7/bin/svrmgrl << EOF
connect internal/oracle
select * from v\$datafile;
select * from v\$controlfile;
select * from v\$logfile;
EOF
After changing the ORACLE_SID environment variable and executing this
script as the Oracle user on your database machine, you will get
the locations of all data files, control files, and redo log
files. It is very important that all filesystems containing
Oracle data be snapped because you want a consistent database when
your backup software dumps the information to tape.
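As a quick sanity check, you can map each reported file back to its
mount point with df. Here is a rough sketch; /tmp/oracle_files.txt is
my own example name for a file holding the paths reported above, one
per line:
#!/bin/sh
# Print the unique mount points holding the listed Oracle files --
# these become the MOUNT_POINTS value used by snapshot.sh
while read file; do
    df -k "$file" | awk 'NR==2 {print $NF}'
done < /tmp/oracle_files.txt | sort -u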
Snapshot.sh
As I have mentioned before, snapshot.sh is the script (Listing
1) that brings all of this together. It is called as the pre-processing
command when using the NetWorker savepnpc backup command.
It then stops Oracle and other applications using their respective
init scripts, snaps the filesystems with important data, and finally
restarts Oracle and the other applications. This whole process takes
minutes, which means that Oracle will be unavailable to the users
for a very short time every night. In my experience using Oracle
8.0.5 with a 66-GB database, this script takes roughly five minutes
every night. NetWorker then saves the snapshot-mounted filesystems
to tape. Once all filesystems have been backed up, snapshot.sh
runs as the post-processing script and cleans up by unmounting the
snapshots, deleting them from the system, and removing the backing-store
files.
To allow the script to work in your environment, you must change
the variables located at the top of the script. The variables to
be changed are listed below, and these are the values I will use
to demonstrate the rest of the procedure:
# Init scripts in the order they need to be executed on startup
INIT_SCRIPTS_STARTUP="oracle my_application"
# Init scripts in the order they need to be executed on shutdown
INIT_SCRIPTS_SHUTDOWN="my_application oracle"
# Application keywords passed to ps to check whether processes are shut down
# I use the usernames the processes are started with.
# Write the line as you would pass it to egrep (i.e., using an OR
# symbol between the keywords)
PROCS_KEYWORDS="oracle|my_application"
# Mount points that contain Oracle data (data files/tables,
# control files, redo log files, archive log files, etc.)
# and need to be snapshot
MOUNT_POINTS="/oracle /my_app"
# Directory to store backing-store file to track block changes in
# the snapshot filesystems
SCRATCH_SPACE="/var/tmp"
# Extension to add to $MOUNT_POINT when mounting the read-only
# image with fssnap
MOUNT_POINT_EXT="snapshot"
# Backing-store file extension
BSTORE_EXT="bstore"
# Location of fssnap
FSSNAP="/usr/sbin/fssnap"
# Set if you wish to control the size of the backing-store file
# Read the fssnap_ufs man page, but it suffices to say that
# maxsize=n[k,m,g] where n is a number
# Be aware that if this file fills up, your snapshot will be
# invalid as it can no longer track changes
# I chose not to set it for this reason and because my
# $SCRATCH_SPACE is 5 times the size of the DB.
BSTORE_MAXSIZE=""
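To give a sense of how snapshot.sh uses these variables, here is a
simplified sketch of its pre-processing ("start") phase. It is not the
full Listing 1 -- the error checking, logging, and process checks are
omitted, and the init script and mount point names are the example
values from above:
#!/bin/sh
# Simplified sketch only -- see Listing 1 for the complete script
MOUNT_POINTS="/oracle /my_app"
SCRATCH_SPACE="/var/tmp"
MOUNT_POINT_EXT="snapshot"
BSTORE_EXT="bstore"
FSSNAP="/usr/sbin/fssnap"

# Stop the application and the database via their init scripts
for script in my_application oracle; do
    /etc/init.d/$script stop
done

# Snapshot each filesystem and mount the read-only image
for mp in $MOUNT_POINTS; do
    bstore="$SCRATCH_SPACE/`basename $mp`.$BSTORE_EXT"
    snapdev=`$FSSNAP -F ufs -o bs=$bstore $mp`
    mkdir -p "$mp$MOUNT_POINT_EXT"
    mount -F ufs -o ro $snapdev "$mp$MOUNT_POINT_EXT"
done

# Restart the database and the application
for script in oracle my_application; do
    /etc/init.d/$script start
done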
Putting It All Together
Installation Process and Configuration
To begin, create a client resource on the Legato NetWorker server.
Create the client by selecting the Client Setup menu choice under
the Clients menu in the NetWorker GUI. Click on "create"
and fill in the appropriate fields. Make sure you specify the appropriate
group membership (note the group name, as it will be needed later), specify
the snapshot mount points instead of the actual ones (i.e., /oracle
becomes /oraclesnapshot, /my_app becomes /my_appsnapshot, etc.),
and change the "Backup Command" to savepnpc (as
shown in Figure 1).
Since retention times for operating system data and database data
differ (as well as how often I have to do full backups), I create
two clients with the same name. The first client contains OS-related
filesystems (/, /usr, /var, ...) and the second client contains
snapshot mount points. Creating two clients allows me to back up
the database and operating system at different times, on different
tapes, with different retention periods.
Next, I copy the snapshot.sh script to /usr/local/sbin on the
client. I change the ownership to root, group other, and permission
of 700. Make sure to edit the environment variables at the top of
the script.
Although the snapshot.sh script will do this automatically, I
want to create the mount points for the snapshots. Assuming the
default script values (i.e., /oracle and /my_app are the filesystems),
I would make a directory called /oraclesnapshot and /my_appsnapshot
(the format is $MOUNT_POINT$MOUNT_POINT_EXT as defined within snapshot.sh).
I also want to make sure the directory for the backing-store information
exists and create it if it does not.
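Assuming the default values, the setup on the client looks roughly
like this:
# cp snapshot.sh /usr/local/sbin/snapshot.sh
# chown root /usr/local/sbin/snapshot.sh
# chgrp other /usr/local/sbin/snapshot.sh
# chmod 700 /usr/local/sbin/snapshot.sh
# mkdir /oraclesnapshot /my_appsnapshot
# ls -ld /var/tmp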
Next I go to the client and create a file called /nsr/res/<groupname>.res
(use the group name noted in the first step). This file specifies
the pre-processing command, post-processing command, and the time
when post-processing is run even if the backup is not yet complete.
The file looks like this, and the format is documented within the
NetWorker Administrator's Manual:
type: savepnpc;
precmd: "/usr/local/sbin/snapshot.sh start > /nsr/logs/precmd.log 2>&1";
pstcmd: "/usr/local/sbin/snapshot.sh stop > /nsr/logs/pstcmd.log 2>&1";
timeout: "09:00am";
The output from snapshot.sh will be put into /nsr/logs/[pre|pst]cmd.log
and overwritten every night. If you choose not to redirect the output
from the snapshot.sh script, it will be placed in the savegroup notification
by default (making your savegroup notification very cluttered). This
will allow you to track information about the process, specifically
how large the backing-store files are getting before they are deleted.
Common Problems
I ran into a couple of minor problems during the testing and initial
implementation of this script and will describe those here.
The first problem I encountered came up during the testing stage
(described later in this article). When I originally began testing
this solution, I limited the size of the backing-store file. During
testing, I made a lot of changes to the tablespace files and exceeded
this limit, and the snapshot was automatically deleted. This is
why I now track the file size by capturing it in the pstcmd.log
file, instead of placing a limit on the file size growth.
Another problem is that when NetWorker exits abnormally, it does
not always clean up after itself. When NetWorker is performing the
backup on your client, it creates a file named /nsr/tmp/<group-name>.tmp.
If this file exists when the next backup is run, the pre- and post-processing
scripts are not run, but the machine will still back up the filesystems.
Although this was a problem early in my testing, it has not happened
during the six months that this procedure has been used. In fact,
I now allow the application administrators to cancel the backup
by creating the /nsr/tmp/<group-name>.tmp file before the
backup is run. I then run a cron job at noon every day to check
for its existence, and mail the appropriate people if it's
there. It has been noted on the NetWorker listserv that Legato thinks
this file staying around is a bug, and they may "fix"
it eventually by removing it just before the backup starts.
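The noon check itself is trivial. A rough sketch, assuming the group
is named oracle_group and using a made-up mail alias (substitute your
own), looks like this:
#!/bin/sh
# Mail the admins if an application administrator has canceled
# tonight's backup by creating the flag file
FLAG=/nsr/tmp/oracle_group.tmp
if [ -f "$FLAG" ]; then
    echo "NetWorker cancel flag $FLAG exists on `hostname`" | \
        mailx -s "NetWorker backup canceled" unix-admins@example.edu
fi
I run it from root's crontab at noon (0 12 * * * /usr/local/sbin/check_backup_flag.sh,
where the script name is also just an example).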
Testing Scenarios
This process has been tested extensively in our environment because
we regularly recover data onto different machines for disaster recovery
and testing purposes. The original tests were completed on an Oracle
8.0.5 database. I started the backup on the test machine as it normally
would have been done by NetWorker. During the backup, I concatenated
/dev/null into five data files for minutes at a time (this
is where I exceeded my maxsize on the backing-store file). Once
the backup was complete, the Oracle administrator identified one
of the affected files. I then recovered the data files and the DBA
recovered them into the database without incident.
Recoveries
To recover snapshot filesystems to the original mount points,
you must relocate the data within NetWorker's recovery tools.
Use the nwrecover or recover command to select files
from the snapshot mount point. Instruct the tool to place those
files on the original mount points and let the tool do the rest.
Using the command-line recover tool, you would specify the following
to relocate your data:
# /usr/bin/recover
recover: Using nsrhost (xxxx.syr.edu) as server for xxxx.syr.edu
recover: Current working directory is /
recover> add /oraclesnapshot
recover> relocate /oracle
2442 file(s) marked for recovery
recover> recover
Using the nwrecover GUI tool, you would select the "Relocate"
choice in the Option menu and specify the correct path.
I have found it much easier to delete all the files that are associated
with the Oracle database, except the archive logs. This includes
control files, redo logs, and all tablespace data files. I then
recover all these files from the snapshot backup and have the Oracle
DBAs roll the database forward using the archive logs stored
on disk. I prefer this method because Oracle expects all the files
to have the same SCN (system change number) at startup and mount
time (Oracle writes this unique number at shutdown and dismount
time). If you recover
only the file that is damaged, the Oracle DBA must recover the database
using Oracle tools (and I will leave that in-depth discussion to
you and your DBA).
Conclusion
As usual, your mileage may vary. In our environment, cost cutting
is always on our minds. Using supported tools from various vendors
to back up databases with little or no downtime at minimal cost
is our ultimate goal. There are certainly trade-offs to this solution,
but it works reliably, is supported by the key players involved
(savepnpc is supported by Legato and fssnap by Sun), and is
restorable (which is the most important part).
Christopher T. Beers is a UNIX Systems Engineer at Syracuse
University's Computing and Media Services. His primary responsibilities
include Solaris and Linux administration for machines used in academic
computing, implementation of new services, and support of existing
services. While he is away from work, he enjoys riding his Harley
with his fiancée (when it's not snowing) and watching
Syracuse Orangemen college basketball.