Solstice DiskSuite Soft Partitions -- Not Hard at All
Matthew Cheek
The Sun Solaris Operating Environment (OE) is a mature implementation
of UNIX that runs both on Sun SPARC and Intel x86 architecture hardware.
The native file system of Solaris is the venerable UNIX File System
(UFS) and, except for minor changes (e.g., UFS logging) over the
years, it has not really kept pace with the explosive growth of
storage technology, in terms of both capacity and performance. Fortunately,
Sun has always offered DiskSuite [1], an add-on logical volume manager
for the Solaris OE that provides storage management features such as
concatenation, striping, mirroring, RAID5, and logging. However,
both UFS and DiskSuite have always been constrained by the physical
disk partition table, effectively limiting the number of file systems
per disk to eight. This was not much of a problem when disks were
relatively small (i.e., less than two gigabytes). As individual
disk capacities continued to increase, the eight-slice limit became
more of an issue, especially from a space management standpoint.
This article will introduce DiskSuite Soft Partitions, a new feature
that shatters the eight-slice barrier and greatly extends the capabilities
of the DiskSuite storage management product.
DiskSuite Overview
DiskSuite is a layered product that enables administrators to
manage a system's disks through the concept of virtual disk
devices called metadevices. From the point of view of a user or
application, a DiskSuite metadevice is seen and accessed exactly like
a physical disk device through the use of the DiskSuite metadisk
device driver, which coordinates I/O to and from physical disk devices
and metadevices.
While the most basic metadevice is simply a single disk partition,
such a metadevice is not very useful. The value of DiskSuite is
only realized when complex metadevices are used to increase data
availability or storage capacity.
DiskSuite permits the creation of metadevices that consist of
multiple physical disk partitions. This allows the administrator
to have file systems that are larger than any single disk device
by distributing the data across multiple physical disk slices. Such
a complex metadevice can either be a concatenation, which distributes
the data "end-to-end", or a stripe, which alternates chunks
of data across disk slices. A concatenation is typically used to
"grow" an existing metadevice by attaching additional
slice(s) to the end. A striped metadevice provides better performance
by causing I/O operations to be spread across multiple physical
disks. Here is the command to create a concatenated metadevice:
# metainit d10 3 1 c1t0d0s2 1 c1t1d0s2 1 c1t2d0s2
d10: Concat/Stripe is setup
The concatenated metadevice created in this example consists of three
"stripes" (the number 3), each made up of a single slice
(the number 1 preceding each slice). In this example, each slice (s2)
actually represents an entire disk. DiskSuite then displays a confirmation
message that the metadevice was successfully set up.
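Should the concatenation later need more space, an additional slice can
be attached to the end with metattach. As a sketch (the fourth disk at
c1t3d0 is hypothetical), a command such as the following would grow d10:
# metattach d10 c1t3d0s2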
Similarly, a stripe metadevice is created with this command:
# metainit d20 1 3 c1t0d0s2 c1t1d0s2 c1t2d0s2
d20: Concat/Stripe is setup
This striped metadevice, d20, is made up of a single stripe (the number
1) spread across three slices. Since an interlace value was not specified
with the "-i" switch, the default size of each data segment
per slice is 16 Kbytes. DiskSuite concatenations and stripes are also
known as RAID level 0 [2].
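If the default interlace is not suitable, a different value can be
requested at creation time with the "-i" switch. As a sketch (the
32-Kbyte value here is purely illustrative), the stripe above could
instead have been created with:
# metainit d20 1 3 c1t0d0s2 c1t1d0s2 c1t2d0s2 -i 32k
d20: Concat/Stripe is setup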
Mirroring
Be aware that neither a concatenation nor a stripe by itself provides
any level of fault tolerance. The failure of a single physical device
in a concatenated or striped metadevice results in the loss of the
entire metadevice. DiskSuite's answer to this is the mirror
metadevice. DiskSuite mirroring provides data redundancy by writing
identical data to two or more disk devices. A DiskSuite mirror metadevice
is made up of one or more other metadevices referred to as submirrors.
For example, the following commands will create a two-way mirror
metadevice (d1), which comprises two simple submirror metadevices
(d2 and d3), each corresponding to a physical disk partition:
# metainit d2 1 1 c0t1d0s2
d2: Concat/Stripe is setup
# metainit d3 1 1 c0t2d0s2
d3: Concat/Stripe is setup
# metainit d1 -m d2
d1: Mirror is setup
# metattach d1 d3
d1: Submirror d3 is attached
The resulting d1 mirror metadevice can then be used just as if it
were a physical disk slice. However, every write request to this mirror
metadevice will actually be written to each of the two submirror metadevices
(d2 and d3) and, hence, to each of the two physical disk partitions.
This mirroring (also known as RAID level 1) provides redundant copies
of your data. Loss of either underlying physical disk partition will
not impact the operation of the mirror metadevice, and the repair
or replacement of the failed device can be scheduled at a convenient
opportunity.
Typically, the desired attributes of performance and capacity
along with availability are achieved through the use of mirror metadevices
composed of striped or concatenated submirror metadevices. This
results in logical volumes that are immune to any single disk failure
and, in some cases, to multiple simultaneous disk failures, as long
as at least one intact copy of every stripe segment remains. DiskSuite striped
mirrors are also known as RAID level 1+0, or simply 10.
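As a sketch of this approach (the metadevice names d30 through d32 and
the slices on controllers c4 and c5 are hypothetical), a two-way mirror
of two-slice striped submirrors could be built like this:
# metainit d31 1 2 c4t0d0s2 c4t1d0s2
d31: Concat/Stripe is setup
# metainit d32 1 2 c5t0d0s2 c5t1d0s2
d32: Concat/Stripe is setup
# metainit d30 -m d31
d30: Mirror is setup
# metattach d30 d32
d30: Submirror d32 is attached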
RAID5
While mirrored metadevices are the most fault tolerant, the physical
disk requirements are exactly double that of a simple metadevice.
If you are unwilling or unable to dedicate twice as many disks to
meet your space requirements but still wish to have data redundancy,
DiskSuite provides the RAID5 metadevice. As the name implies, a
DiskSuite RAID5 metadevice implements RAID level 5: striping with
parity, with both data and parity distributed across all the physical
disk partitions in the metadevice.
If a physical disk fails, the data on the failed disk can be rebuilt
from the distributed data and parity information stored on the remaining
disks. The parity information is stored in one slice's worth
of space, but is actually distributed across all the slices in the
RAID5 metadevice. This means that the actual data capacity of a
RAID5 metadevice is n-1 times the size of each slice, where n equals
the number of slices in the metadevice. For instance, a RAID5 metadevice
composed of five 18-gigabyte slices results in approximately 72
gigabytes of usable space (4 x 18). An example command to create
a DiskSuite RAID5 metadevice is:
# metainit d100 -r c2t0d0s2 c2t1d0s2 c2t2d0s2 c2t3d0s2 c2t4d0s2
d100: RAID is setup
The RAID5 metadevice d100 is created from five slices. DiskSuite confirms
that the RAID5 metadevice is set up and begins initializing the metadevice.
Initialization is the process of "zeroing out" all the disk
blocks in all the slices. The time to complete this initialization
depends on the size of the RAID5 metadevice and the speed of the disks.
Until the initialization process is complete, the RAID5 metadevice
is unavailable for use.
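The progress of the initialization can be checked at any time with the
metastat command; while the process is still running, the state reported
for the metadevice will be something other than "Okay", so it is easy
to tell when the device becomes available:
# metastat d100
<metastat output suppressed for brevity>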
A DiskSuite RAID5 metadevice can suffer the loss of only a single
member slice and continue operating, albeit in a degraded mode,
as the missing data is reconstructed from the parity information
on the remaining slices. It is important to replace a failed disk
in a RAID5 metadevice as soon as possible because the loss of a
second disk while the metadevice is still degraded will result in
the loss of the entire metadevice.
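Replacing a failed slice is handled with the metareplace command. As a
sketch (the failed slice c2t3d0s2 and the replacement slice c2t5d0s2 are
hypothetical), the first form below re-enables a slice after the disk has
been repaired or replaced in the same location, while the second
substitutes a slice on a different disk:
# metareplace -e d100 c2t3d0s2
# metareplace d100 c2t3d0s2 c2t5d0s2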
Creating a File System
In most cases, the next step after configuring DiskSuite metadevices
(concatenations, stripes, mirrors, or RAID5) is to create a UFS
file system on the metadevice. Once this is complete, the file system
can be mounted and made available for use. For instance, the following
commands create and mount a file system on the example d100 RAID5
metadevice from above:
# newfs /dev/md/dsk/d100
newfs: construct a new file system /dev/md/rdsk/d100: (y/n)? y
/dev/md/rdsk/d100: 141449528 sectors in 30019 cylinders of 19 tracks, 248 sectors
69067.2MB in 1365 cyl groups (22 c/g, 50.62MB/g, 6208 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 103952, 207872, 311792, 415712, 519632, 623552, 727472, 831392, 935312, 1039232, 1143152, 1247072, 1350992,
1454912, 1558832, 1662752, 1766672, 1870592, 1974512, 2078432, 2182352, 2286272, 2390192, 2494112, 2598032,
.
.
.
# mount /dev/md/dsk/d100 /mnt
# df -k /mnt
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d100 70174288 562468 77557168 1% /mnt
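To have the new file system mounted automatically at boot, an entry
referencing the metadevice paths can be added to /etc/vfstab; the /data
mount point below is just an example:
/dev/md/dsk/d100   /dev/md/rdsk/d100   /data   ufs   2   yes   -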
Please note that this section is meant only as a brief conceptual
overview of DiskSuite's concatenation, stripe, mirror, and RAID5
metadevices and how they are used to increase performance, capacity,
and availability. The details of configuring and maintaining
the DiskSuite metadevice state database replicas, DiskSuite Trans
metadevices for logging, and DiskSuite hot spares are not covered
here. Refer to the DiskSuite Installation and Reference Guides for
details on implementing DiskSuite.
DiskSuite Soft Partitions
Introduced in DiskSuite v4.2.1 for Solaris 8 (and subsequently
backported to DiskSuite v4.2 for Solaris 7 and 2.6), DiskSuite
Soft Partitions are a new feature that provides much greater
flexibility in managing disk resources. Simply put, with Soft Partitions,
administrators can logically subdivide a physical disk or a DiskSuite
concatenation, stripe, mirror, or RAID5 metadevice into more than
eight partitions. This is especially important as disks become larger
and disk arrays present even larger logical devices to the system.
A simple example will demonstrate the value of Soft Partitions.
Let's revisit our d100 RAID5 metadevice from the previous
section. After creating and initializing the metadevice, the next
step is to create a single UFS file system. However, what if a single
72-gigabyte file system is just too large? Before the introduction
of Soft Partitions, there was no other option. Now, rather than
a single huge file system, we can create one or more smaller Soft
Partitions on the RAID5 metadevice and use them independently. For
example, the following commands create two soft partitions on top of that
72-gigabyte d100 RAID5 metadevice:
# metainit d101 -p d100 1g
d101: Soft Partition is setup
# metainit d102 -p d100 2g
d102: Soft Partition is setup
These two commands create two soft partitions on top of d100:
one that is 1 gigabyte in size and another that is 2 gigabytes in size. (The
last parameter of these metainit commands is the desired size of the
soft partition and can be specified in K or k for kilobytes, M or
m for megabytes, G or g for gigabytes, T or t for terabytes (one terabyte
is the current maximum soft partition size), or B or b for blocks.)
Now we can create file systems on these soft partition metadevices
and mount them for use:
# newfs /dev/md/rdsk/d101
<newfs output suppressed for brevity>
# mount /dev/md/dsk/d101 /mnt1
# newfs /dev/md/rdsk/d102
<newfs output suppressed for brevity>
# mount /dev/md/dsk/d102 /mnt2
# df -k /mnt1 /mnt2
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d101 983599 9 924575 1% /mnt1
/dev/md/dsk/d102 2030887 9 1847086 1% /mnt2
Each of these two newly created soft partition metadevices uses
a portion of the d100 metadevice, and the remaining 69 gigabytes are
available for creating other soft partitions and/or increasing the
size of existing soft partitions. For example, if it becomes necessary
to enlarge the 1-gigabyte file system to 5 gigabytes, the following
commands will accomplish that, all without unmounting the file system:
# metattach d101 4g
d101: Soft Partition has been grown
# growfs -M /mnt1 /dev/md/rdsk/d101
<growfs output suppressed for brevity>
# df -k /mnt1
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d101 4921773 9 4862749 1% /mnt1
Besides creating soft partitions on pre-existing metadevices, DiskSuite
permits the creation of soft partitions directly on physical disk
partitions or even entire disks. This last capability is useful when
a disk array presents logical volumes to the system that are actually
multiple disk logical units (LUNs). For instance, the Sun StorEdge
RAID Manager is used to group a set of physical drives in a disk array
into a LUN that is either RAID level 0, 1, 3, or 5. This LUN is then
seen by Solaris as one drive. Typically, administrators desire to
group the physical drives in an array into large, manageable sets.
As a result, these LUNs can be tens or hundreds (or more) of gigabytes
in size and, prior to the availability of DiskSuite soft partitions,
administrators were limited to no more than eight file systems per LUN. The
following example demonstrates the creation of multiple soft partitions
directly on a physical drive (c3t0d0):
# metainit d110 -p -e c3t0d0 10g
d110: Soft Partition is setup
# metainit d111 -p c3t0d0s0 5g
d111: Soft Partition is setup
# metainit d112 -p c3t0d0s0 15g
d112: Soft Partition is setup
The "-e" switch in the first metainit command indicates
that the entire disk (as specified by "c*t*d*") should be
repartitioned and reserved for soft partitions. This option can only
be used the first time a soft partition is placed on an entire disk.
Keep in mind that the specified disk will be repartitioned such that
slice 7 is reserved for metadevice state database replica(s) and slice
0 contains the remaining space. Slice 7 will be a minimum of two megabytes
in size, but could be larger depending on the particular characteristics
of the disk. Be aware that the previous partition layout will be overwritten.
The initial soft partition (d110 in the example above) will be placed
on slice 0. Subsequent soft partitions (e.g., d111, d112) are also
placed on slice 0 by omitting the "-e" switch and specifying
the disk component as c*t*d*s0. DiskSuite manages the layout of soft
partitions on slice 0.
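The repartitioned layout can be confirmed with prtvtoc, and, assuming
state database replicas already exist elsewhere on the system, a replica
can be placed on the reserved slice 7 with metadb. A sketch for the
example disk above:
# prtvtoc /dev/rdsk/c3t0d0s0
<prtvtoc output suppressed for brevity>
# metadb -a c3t0d0s7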
The metastat command is used to display the status of all
metadevices. Here is the output showing the previous three soft
partitions built on a single drive:
# metastat
d110: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 20971520 blocks
        Extent        Start Block        Block count
             0                  1           20971520

d111: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 10485760 blocks
        Extent        Start Block        Block count
             0           20971522           10485760

d112: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 31457280 blocks
        Extent        Start Block        Block count
             0           12582915           31457280
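Soft partitions that are no longer needed can be removed with metaclear,
which returns their extents to the pool of free space on the underlying
device (any file system on the soft partition should be unmounted first).
For example, to remove d112:
# metaclear d112
d112: Soft Partition is cleared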
Upgrading DiskSuite
Soft Partitions are only available on DiskSuite v4.2.1 on Solaris
8 and DiskSuite v4.2 on Solaris 7 and 2.6. Additionally, the Soft
Partitions feature was added to the base DiskSuite product via a
product patch. In other words, although you may be running DiskSuite
v4.2.1 or v4.2, you may or may not have soft partitions. The simplest
way to determine whether your DiskSuite product is soft partition-aware
is to run the metainit command with no options. If the usage message
shows the "-p" option, you have soft partitions:
# metainit
usage: metainit [-s setname] [-n] [-f] concat/stripe numstripes
width component... [-i interlace]
[width component... [-i interlace]] [-h hotspare_pool]
metainit [-s setname] [-n] [-f] mirror -m submirror...
[read_options] [write_options] [pass_num]
metainit [-s setname] [-n] [-f] RAID -r component...
[-i interlace] [-h hotspare_pool]
[-k] [-o original_column_count]
metainit [-s setname] [-n] [-f] trans -t master [log]
metainit [-s setname] [-n] [-f] hotspare_pool [hotspare...]
metainit [-s setname] [-n] [-f] softpart -p [-e] device size
metainit [-s setname] [-n] [-f] md.tab_entry
metainit [-s setname] [-n] [-f] -a
metainit -r
If you do not have soft partitions, but are running one of the specified
versions of DiskSuite on a supported version of Solaris, all that
is necessary to add soft partition support is to download and apply
the latest DiskSuite Product Patch from SunSolve Online (http://sunsolve.sun.com)
and reboot. The minimum required patch versions are shown in Table
1. While these are the minimum DiskSuite Product Patch versions to
add soft partitions, I strongly recommend that the very latest version
of these important patches be applied. This Product Patch includes
not only bug fixes but also enhancements to the DiskSuite product.
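One way to check whether a given DiskSuite Product Patch is already
installed is with showrev, and the patch can be applied with patchadd;
substitute the actual patch ID from Table 1 for the placeholders below:
# showrev -p | grep <patch-id>
# patchadd <path-to-unpacked-patch-directory>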
Solaris Volume Manager
Starting with Solaris 9, DiskSuite is renamed the "Solaris
Volume Manager" or "SVM" and has been completely
integrated into the Solaris 9 OE. SVM is now the standard logical
volume manager and the name change is simply a reflection of that
new emphasis. There are no significant core differences between
Solstice DiskSuite v4.2.1 on Solaris 8 and the Solaris Volume Manager
on Solaris 9. For all intents and purposes, SVM IS Solstice DiskSuite
v4.2.1 and, as such, includes the Soft Partition feature.
Summary
Soft Partitions free the Solaris administrator from having to
think of storage only in terms of disk slices. Simply configure
physical disk devices into large DiskSuite metadevice storage "pools"
and use DiskSuite Soft Partitions to carve the pools up as needed.
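As a final sketch of this pooled approach (all metadevice and disk names
here are hypothetical), a large mirror can be built once and then carved
into soft partitions as file systems are needed:
# metainit d41 1 3 c6t0d0s2 c6t1d0s2 c6t2d0s2
# metainit d42 1 3 c7t0d0s2 c7t1d0s2 c7t2d0s2
# metainit d40 -m d41
# metattach d40 d42
# metainit d200 -p d40 20g
# metainit d201 -p d40 50g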
I hope that this introduction to DiskSuite Soft Partitions will
prompt those Solaris systems administrators already using DiskSuite
in their environments to revisit their configurations and possibly
implement soft partitions. In addition, I encourage Solaris systems
administrators not currently using DiskSuite to reconsider its place
in their environments.
Matthew Cheek is a senior UNIX systems administrator with experience
in the telecommunications, healthcare, and manufacturing industries
and has installed, configured, managed, and written about UNIX systems
since 1988. He is the lead author of Tru64 UNIX System Administrator's
Guide from Digital Press. Matt is currently at Medical Archival
Systems, Inc. and can be contacted at: cheek@mars-systems.com.
[1] DiskSuite was originally named "Online:
DiskSuite" and was compatible with both Solaris 1 (i.e., SunOS
4.1.x) and Solaris 2. Beginning with Solaris 2.4, it was renamed
"Solstice DiskSuite" and it continued to be called that
until, with the release of Solaris 9, it became the "Solaris
Volume Manager". In this article, I adopt the common shorthand
of calling it simply "DiskSuite".
[2] RAID is an acronym for Redundant Array
of Inexpensive (or Independent) Disks. There are seven RAID levels,
0-6, each referring to a strategy for distributing data across disks,
most of which also provide data redundancy. DiskSuite supports RAID level 0 (concatenations
and stripes), RAID level 1 (mirrors), RAID level 5 (striping with
parity information for redundancy), and RAID level 1+0 (striped
mirrors). Use your favorite search engine to find resources describing
the various RAID levels.