Review of Current File Systems and Volume Managers
Henry Newman
During my past few articles, I have reviewed volume management
and file system concepts and issues. If you have not read those
articles, it might be best to review them because a number of background
concepts were covered that provide understanding of how everything
works internally.
This month, I will cover a few different file systems, volume
managers, and operating systems and review some of the tunable parameters
and strengths and weaknesses. This is not an in-depth comparative
analysis of each volume manager and file system but an overview
of some of the key characteristics and considerations. Much of the
information on these file systems is taken from man pages, Web pages,
and online documentation.
Volume Manager and File System Considerations
Before reviewing each of the volume managers and file systems,
it is critical to know the administrative requirements, the user
requirements for performance and reliability, and the hardware environment.
For example, if you need 10 terabytes of space for the user application(s)
and the file system limit is 2 terabytes, you must have at least
five file systems. This increases the administrative overhead. Knowing
your requirements will allow you to determine the best volume manager
and file system combination and perhaps allow you to determine the
best operating system and host platform. Also note that, in some
cases, volume managers are used to support host bus adapter (HBA)
failover.
Before I begin, I should explain my biases. My background is high
performance computing and I/O, and I am of the opinion that designing
for performance should take into account the whole data path. Some
vendors tout architectural features as requirements. For example,
a journaling/logging file system is considered a requirement. I
believe the real requirement is fast recovery after a crash and/or
system failure, and journaling/logging is an implementation of that
requirement. In some application environments, such as real-time
data streams, journaling/logging can seriously degrade performance
because the file system metadata must be written two times -- once
to the journal/log, and then committed to the metadata. Be sure
you understand the real requirements of what you are trying to do
and don't just look at the implementation and the vendor-touted
features.
IBM (Linux and AIX)
The open source bug has bitten IBM's volume manager and file system
groups. LVM (Logical Volume Manager), JFS (Journaled File System),
and GPFS (General Parallel File System) are all now supported under
Linux as well as AIX.
LVM
This volume manager provides standard volume management functions,
such as mirroring, striping, and concatenation. On Linux, LVM can
work with other file systems besides JFS and GPFS.
JFS
IBM's local file system provides a number of tuning options, including the following (a sample creation command follows the list):
- Separating the journal from the data
- Defining the size of the log
- Defining the block size of the file system, with a maximum of 4096 bytes
- Setting the size of the allocation group, which determines the number of bytes per inode
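As a hedged illustration on Linux (the device names are placeholders, and flags may vary by jfsutils version), the journal can be placed on a separate device and the log size set at creation time:
mkfs.jfs -j /dev/sdb1 -s 32 /dev/sda5
Here, -j names an external journal device and -s sets the log size in megabytes.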
JFS does not support large allocations, which limits high performance for streaming data; it is generally used for local file systems on AIX.
The benchmark data that I have seen show it behind XFS in terms
of performance on Linux.
GPFS
The original intent behind the file system was to support parallel
I/O from applications for IBM SP clusters. GPFS has been ported
to Linux and supports most of the features that are available on
AIX systems. As you would expect with a shared file system, the
larger the request, the better the performance. GPFS has, in my opinion, significant limitations on allocation sizes. The largest allocation, to my knowledge, is 512 KB, which is less than a full stripe on an 8+1 RAID-5 configuration even with a small 128-KB stripe element (8 x 128 KB = 1 MB). GPFS does not support round-robin allocation; it stripes all of the data. GPFS supports hierarchical storage management (HSM) with HPSS. HSM is critical for large file systems because backup and restoration are virtually impossible given the speed of tape drives compared with the density increases in disk drives.
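As a rough sketch (the mount point, device name, and disk descriptor file are placeholders, and the exact syntax varies by GPFS release), the block size that bounds the allocation sizes discussed above is fixed when the file system is created:
mmcrfs /gpfs1 gpfs1 -F disks.desc -B 512K
The -B option sets the file system block size and, to my knowledge, cannot be changed afterward.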
ADIC StorNext File System
StorNext is a multi-platform file system that supports Windows,
Linux, SGI, and Sun. This file system has its roots in the high-performance
world. The file system supports round-robin and striping, large
allocations, bandwidth allocation, multipathing, metadata server failover, and integration with ADIC's HSM, StorNext Storage Manager.
ADIC believes it has solved some of the metadata scalability problems, given that it has the oldest commercial shared multi-platform file system. Many of the features are designed for data streaming, as
the original market implementation was in the video-editing market
space.
Linux
Within Linux, a number of different volume managers and file systems
are supported, and they have very similar characteristics and tunables.
Volume Managers
LVM
The Logical Volume Manager (LVM) is distributed with the 2.4 Linux
kernel, but LVM releases are asynchronous from kernel releases.
Although a number of volume management options exist at creation time (such as shrinking or extending a volume, and migration), the only performance tunables are the number of stripes and the stripe size:
lvcreate -i2 -I4 -l100 -nanothertestlv testvg
Here, -i 2 sets the number of stripes in the volume, and -I 4 sets the stripe size in KB.
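For context, a minimal hedged sequence for building the volume group used in that example might look like the following (device names are placeholders):
pvcreate /dev/sda1 /dev/sdb1
vgcreate testvg /dev/sda1 /dev/sdb1
lvcreate -i2 -I4 -l100 -nanothertestlv testvg
The physical volumes are initialized, grouped into testvg, and then the striped logical volume is created across them.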
Enterprise Volume Management System (EVMS)
EVMS is a project from the Linux Kernel Foundry. Much information,
along with a HOWTO and FAQ, can be found at the Web site:
http://evms.sourceforge.net/
A comparison of EVMS with other volume managers is provided at:
http://evms.sourceforge.net/comparison.pdf
This volume manager provides software RAID support and also supports
bad block allocation/reallocation for IDE drives. EVMS should be considered
if Linux is going to be used.
File Systems
File systems developed for Linux were originally designed for
small transactions. Most do not support large allocation sizes (more
than 1 MB) nor (until recently) large file system sizes and large
I/O transfer sizes. Most do not support the concept of data and
metadata separation. So in general, file systems on Linux cannot
scale as well as file systems on Solaris or AIX.
ReiserFS
This file system was one of the first journaling file systems
under Linux. It supports the concepts of journal separation and
journal size. The largest allocation is only 4096 bytes, so files
could become fragmented and cause performance issues for large files.
For mount options, ReiserFS supports various hashing options for directories. ReiserFS uses a balanced tree (B-tree) approach, which allows allocations to be balanced across the devices. From the benchmark data I have seen, ReiserFS does not seem to scale well for file systems over 1 TB.
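As a hedged example (the device and mount point are placeholders), the journal can be separated at creation time and the directory hash selected at mount time:
mkreiserfs -j /dev/sdb1 /dev/sda5
mount -t reiserfs -o hash=r5 /dev/sda5 /mnt/data
The hash= option selects the directory hashing algorithm (r5, tea, or rupasov).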
EXT3
Basically, EXT3 is the EXT2 file system with journaling put on
top of it. It is much more difficult to add a significant feature
or function to a file system after the structure has been created.
Although EXT3 might be great for EXT2 users who want to upgrade
to a journaling file system (the fsck performance for EXT2 was poor),
it does not compare with the performance of the other file systems
discussed. EXT3 does not support large allocations and does not
scale well compared with XFS.
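Because EXT3 is EXT2 plus a journal, an existing EXT2 file system can be converted in place; a hedged example (the device and mount point are placeholders):
tune2fs -j /dev/sda5
mount -t ext3 /dev/sda5 /mnt/data
tune2fs -j adds the journal, after which the file system can be mounted as ext3.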
XFS
XFS was ported from the SGI IRIX operating system. It likely has
the highest performance of the local Linux file systems. XFS allows
allocations up to 64 KB and supports direct I/O and file pre-allocation.
Features such as asynchronous journaling are also supported, which improves performance, though the tradeoff is some reliability.
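As a hedged example (device names are placeholders), the block size and an external log can be set at creation time, and log buffering tuned at mount time:
mkfs.xfs -b size=4096 -l logdev=/dev/sdb1,size=32m /dev/sda5
mount -t xfs -o logdev=/dev/sdb1,logbufs=8 /dev/sda5 /mnt/data
Separating the log and increasing logbufs are common ways to reduce the journaling overhead discussed above.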
Solaris
On Solaris, two volume managers are available, along with a file
system that manages its own volumes.
Volume Managers
Veritas VxVM
This is the original volume manager on Solaris and it is used
on other platforms, such as Linux and HP-UX. Veritas supports all
of the usual functions, including software RAID and tunables such
as request size. The VxVM is well integrated with UFS, VxFS, Veritas's
Dynamic Multipathing for HBA failover, and Veritas Cluster Server
for machine failover. Veritas has released VxVM under Linux.
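As a hedged illustration (the disk group and volume names are placeholders, and attribute names may vary by VxVM release), a striped volume can be created with vxassist:
vxassist -g datadg make datavol 10g layout=stripe ncol=4 stripeunit=64k
Here, ncol sets the number of columns (stripes) and stripeunit sets the stripe unit size.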
Solaris Volume Manager
This is the product that was called Solstice DiskSuite (or SDS)
until recently. It has features similar to VxVM and provides integration
with Sun's HBA failover and Sun Cluster Manager for machine failover.
File Systems
Three file systems are available under Solaris. UFS has been around
in theory for more than 35 years. Understanding performance requirements
is critical to determining which file system will meet your needs.
QFS
This file system is unique in a number of ways:
1. It manages its own volumes and does not need a volume manager.
2. It does not journal/log metadata but uses another method.
3. It supports large allocations (~64 MB) and separately settable direct I/O bypass options for reads and writes.
4. It has built-in HSM, since backup of very large file systems is next to impossible.
The QFS file system has its roots in high-performance environments,
where users made large requests (more than 1 MB), needed large file
systems (more than 10 TB), required a large number of files per
file system (more than 20 million), and needed gigabytes per second of I/O
performance. QFS supports pre-allocation of files -- both striping
files across all of the devices, and allocation of a file on a single
different device for each open system call. It has evolved to support
a homogeneous (Solaris only) shared file system. QFS is excellent
for large file systems, high-speed I/O, and large I/O requests.
Performance for other areas requires a very good understanding of
the file system.
QFS as of the 4.0 release natively supports homogeneous file sharing
between Sun systems. All of the features available for the native file system are supported on the client systems, along with additional allocation features. Heterogeneous
support is provided via a product from IBM/Tivoli called SANergy,
which allows support for AIX, Linux, Tru64, SGI, and some versions
of Windows. The features for remote allocation and caching are limited compared with using native QFS on Sun clients.
UFS
This is one of the oldest file systems available; some of the
original proposals for UFS came from papers written in 1965. Most
of the original device layouts have been maintained over the years,
which has limited its performance. Sun recently added journaling
to UFS because the metadata layout caused fsck performance to be
horrible. Adding journaling to UFS as an afterthought is questionable,
in my opinion. The largest allocation is 8192 bytes, and the largest
contiguous allocation can be a maximum of 1 MB. The allocation size
is a real problem for some RAID devices because the default stripe
size for a RAID group can be 4 MB for a RAID-5 8+1 configuration.
Direct I/O was also added recently; it can be enabled via a mount option or based on the request size from the application.
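As a hedged example on Solaris (the device path and mount point are placeholders), both logging and direct I/O are enabled as mount options:
mount -F ufs -o logging,forcedirectio /dev/dsk/c1t0d0s6 /data
The forcedirectio option bypasses the page cache for all I/O to the file system, and logging enables the UFS journal.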
SANPoint Foundation Suite
For years, the only alternative to UFS on Solaris was VxFS. VxFS
supported journaling/logging and was well integrated with VxVM.
Today, VxFS supports automatic discovery of direct I/O based on
the application request size and pre-allocation, and is well integrated
with Oracle and other database applications.
Veritas supports a shared file system -- SANPoint Control -- where
the metadata and data are separated. The file system uses VxVM and
other features to provide high metadata reliability and path failover.
SANPoint Control currently supports Solaris, Windows, Linux, AIX, and HP-UX.
Conclusions
Choosing the right file system for your environment includes understanding
not only the application requirements of file size and file system
size, but also the administrative requirements and hardware platforms
supported by a file system. Some file systems support only some
RAIDs and switch hardware. Don't ask me why a file system needs
to certify a RAID, because it does not make sense to me.
The bottom line is that almost all of the file systems above could
be "the best" given specific application requirements, available
or planned hardware, and administrative requirements. Running an
operationally sound benchmark (a benchmark that mimics the real
operation of the system) is critical to determining which will be
"the best" for you.
What's Next
During the past 10 months, I have reviewed the data path to/from:
User Application <--> Operating System <--> Volume Manager and File System <--> Device(s)
Now that I've covered the basics, in the next few columns, I will
describe how to determine requirements, define the rules, and run
a benchmark for file systems and storage. Issues like HBAs, RAIDs,
tapes, and backup will be reviewed. Issues such as I/O performance
under load, data fragmentation, file system and file size limitations,
and scaling should all come out in a benchmark or file system test.
If you have suggestions or specific issues, please feel free to
email me.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.