Review of Current File Systems and Volume Managers
Henry Newman
During my past few articles, I have reviewed volume management
and file system concepts and issues. If you have not read those
articles, it might be best to review them because a number of background
concepts were covered that provide understanding of how everything
works internally.
This month, I will cover a few different file systems, volume
managers, and operating systems and review some of the tunable parameters
and strengths and weaknesses. This is not an in-depth comparative
analysis of each volume manager and file system but an overview
of some of the key characteristics and considerations. Much of the
information on these file systems is taken from man pages, Web pages,
and online documentation.
Volume Manager and File System Considerations
Before reviewing each of the volume managers and file systems,
it is critical to know the administrative requirements, the user
requirements for performance and reliability, and the hardware environment.
For example, if you need 10 terabytes of space for the user application(s)
and the file system limit is 2 terabytes, you must have at least
five file systems. This increases the administrative overhead. Knowing
your requirements will allow you to determine the best volume manager
and file system combination and perhaps allow you to determine the
best operating system and host platform. Also note that, in some
cases, volume managers are used to support host bus adapter (HBA)
failover.
Before I begin, I should explain my biases. My background is high
performance computing and I/O, and I am of the opinion that designing
for performance should take into account the whole data path. Some
vendors tout architectural features as requirements. For example,
a journaling/logging file system is considered a requirement. I
believe the real requirement is fast recovery after a crash and/or
system failure, and journaling/logging is an implementation of that
requirement. In some application environments, such as real-time
data streams, journaling/logging can seriously degrade performance
because the file system metadata must be written two times -- once
to the journal/log, and then committed to the metadata. Be sure
you understand the real requirements of what you are trying to do
and don't just look at the implementation and the vendor-touted
features.
IBM (Linux and AIX)
The open source bug has bitten IBM's volume manager and file system
groups. LVM (Logical Volume Manager), JFS (Journaled File System),
and GPFS (General Parallel File System) are all now supported under
Linux as well as AIX.
LVM
This volume manager provides standard volume management functions,
such as mirroring, striping, and concatenation. On Linux, LVM can
work with other file systems besides JFS and GPFS.
JFS
IBM's local file system provides a number of tuning options, including the following (a sample creation command follows the list):
- Separating the journal from the data
- Defining the size of the log
- Defining the block size of the file system, with a maximum of 4096 bytes
- Setting the size of the allocation group, which determines the number of bytes per inode
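As a hedged illustration on Linux (the device names are placeholders, and flags may vary by jfsutils version), the journal can be placed on a separate device and the log size set at creation time:
mkfs.jfs -j /dev/sdb1 -s 32 /dev/sda5
Here, -j names an external journal device and -s sets the log size in megabytes.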
JFS does not support large allocations, which limits high performance for streaming data; it is generally used for local file systems on AIX.
The benchmark data that I have seen show it behind XFS in terms
of performance on Linux.
GPFS
The original intent behind the file system was to support parallel
I/O from applications for IBM SP clusters. GPFS has been ported
to Linux and supports most of the features that are available on
AIX systems. As you would expect with a shared file system, the
larger the request, the better the performance. GPFS has, in my opinion, significant limitations on allocation sizes. The largest allocation, to my knowledge, is 512 KB, which is less than a full stripe on an 8+1 RAID-5 configuration even with a small 128-KB stripe element (8 x 128 KB = 1 MB). GPFS does not support round-robin allocation; it stripes all of the data. GPFS supports hierarchical storage management (HSM) with HPSS. HSM is critical for large file systems because backup and restoration are virtually impossible given the speed of tape drives compared with the density increases in disk drives.
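As a rough sketch (the mount point, device name, and disk descriptor file are placeholders, and the exact syntax varies by GPFS release), the block size that bounds the allocation sizes discussed above is fixed when the file system is created:
mmcrfs /gpfs1 gpfs1 -F disks.desc -B 512K
The -B option sets the file system block size and, to my knowledge, cannot be changed afterward.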
ADIC StorNext File System
StorNext is a multi-platform file system that supports Windows,
Linux, SGI, and Sun. This file system has its roots in the high-performance
world. The file system supports round-robin and striping, large
allocations, bandwidth allocation, multipathing, metadata server failover, and integration with ADIC's HSM, StorNext Storage Manager.
ADIC believes it has solved some of the metadata scalability problems, given that it has the oldest commercial shared multi-platform file system. Many of the features are designed for data streaming, as
the original market implementation was in the video-editing market
space.
Linux
Within Linux, a number of different volume managers and file systems
are supported, and they have very similar characteristics and tunables.
Volume Managers
LVM
The Logical Volume Manager (LVM) is distributed with the 2.4 Linux
kernel, but LVM releases are asynchronous from kernel releases.
Although a number of volume management options exist at creation time (such as shrinking or extending a volume, and migration), the only performance tunables are the number of stripes and the stripe size:
lvcreate -i2 -I4 -l100 -nanothertestlv testvg
Here, -i 2 sets the number of stripes in the volume, and -I 4 sets the stripe size in KB.
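For context, a minimal hedged sequence for building the volume group used in that example might look like the following (device names are placeholders):
pvcreate /dev/sda1 /dev/sdb1
vgcreate testvg /dev/sda1 /dev/sdb1
lvcreate -i2 -I4 -l100 -nanothertestlv testvg
The physical volumes are initialized, grouped into testvg, and then the striped logical volume is created across them.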
Enterprise Volume Management System (EVMS)
EVMS is a project from the Linux Kernel Foundry. Much information,
along with a HOWTO and FAQ, can be found at the Web site:
http://evms.sourceforge.net/
A comparison of EVMS with other volume managers is provided at:
http://evms.sourceforge.net/comparison.pdf
This volume manager provides software RAID support and also supports
bad block allocation/reallocation for IDE drives. EVMS should be considered
if Linux is going to be used.
File Systems
File systems developed for Linux were originally designed for
small transactions. Most do not support large allocation sizes (more
than 1 MB) nor (until recently) large file system sizes and large
I/O transfer sizes. Most do not support the concept of data and
metadata separation. So in general, file systems on Linux cannot
scale as well as file systems on Solaris or AIX.
ReiserFS
This file system was one of the first journaling file systems
under Linux. It supports the concepts of journal separation and
journal size. The largest allocation is only 4096 bytes, so files
could become fragmented and cause performance issues for large files.
For mount options, ReiserFS supports various hashing options for directories. ReiserFS uses a balanced tree (B-tree) approach, which allows allocations to be balanced across the devices. From the benchmark data I have seen, ReiserFS does not seem to scale well for file systems over 1 TB.
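As a hedged example (the device and mount point are placeholders), the journal can be separated at creation time and the directory hash selected at mount time:
mkreiserfs -j /dev/sdb1 /dev/sda5
mount -t reiserfs -o hash=r5 /dev/sda5 /mnt/data
The hash= option selects the directory hashing algorithm (r5, tea, or rupasov).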
EXT3
Basically, EXT3 is the EXT2 file system with journaling put on
top of it. It is much more difficult to add a significant feature
or function to a file system after the structure has been created.
Although EXT3 might be great for EXT2 users who want to upgrade
to a journaling file system (the fsck performance for EXT2 was poor),
it does not compare with the performance of the other file systems
discussed. EXT3 does not support large allocations and does not
scale well compared with XFS.
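Because EXT3 is EXT2 plus a journal, an existing EXT2 file system can be converted in place; a hedged example (the device and mount point are placeholders):
tune2fs -j /dev/sda5
mount -t ext3 /dev/sda5 /mnt/data
tune2fs -j adds the journal, after which the file system can be mounted as ext3.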
XFS
XFS was ported from the SGI IRIX operating system. It likely has
the highest performance of the local Linux file systems. XFS allows
allocations up to 64 KB and supports direct I/O and file pre-allocation.
Features such as asynchronous journaling are also supported, which improves performance, though the tradeoff is some reliability.
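As a hedged example (device names are placeholders), the block size and an external log can be set at creation time, and log buffering tuned at mount time:
mkfs.xfs -b size=4096 -l logdev=/dev/sdb1,size=32m /dev/sda5
mount -t xfs -o logdev=/dev/sdb1,logbufs=8 /dev/sda5 /mnt/data
Separating the log and increasing logbufs are common ways to reduce the journaling overhead discussed above.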
Solaris
On Solaris, two volume managers are available, along with a file
system that manages its own volumes.
Volume Managers
Veritas VxVM
This is the original volume manager on Solaris and it is used
on other platforms, such as Linux and HP-UX. Veritas supports all
of the usual functions, including software RAID and tunables such
as request size. The VxVM is well integrated with UFS, VxFS, Veritas's
Dynamic Multipathing for HBA failover, and Veritas Cluster Server
for machine failover. Veritas has released VxVM under Linux.
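As a hedged illustration (the disk group and volume names are placeholders, and attribute names may vary by VxVM release), a striped volume can be created with vxassist:
vxassist -g datadg make datavol 10g layout=stripe ncol=4 stripeunit=64k
Here, ncol sets the number of columns (stripes) and stripeunit sets the stripe unit size.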
Solaris Volume Manager
This is the product that was called Solstice DiskSuite (or SDS)
until recently. It has features similar to VxVM and provides integration
with Sun's HBA failover and Sun Cluster Manager for machine failover.
File Systems
Three file systems are available under Solaris. UFS has been around
in theory for more than 35 years. Understanding performance requirements
is critical to determining which file system will meet your needs.
QFS
This file system is unique in a number of ways:
1. It manages its own volumes and does not need a volume manager.
2. It does not journal/log metadata but uses another method.
3. It supports large allocations (~64 MB) and separately settable direct I/O bypass options for reads and writes.
4. It has built-in HSM, since backup of very large file systems is next to impossible.
The QFS file system has its roots in high-performance environments,
where users made large requests (more than 1 MB), needed large file
systems (more than 10 TB), required a large number of files per
file system (more than 20 million), and needed gigabytes per second of I/O
performance. QFS supports pre-allocation of files -- both striping
files across all of the devices, and allocation of a file on a single
different device for each open system call. It has evolved to support
a homogeneous (Solaris only) shared file system. QFS is excellent
for large file systems, high-speed I/O, and large I/O requests.
Performance for other areas requires a very good understanding of
the file system.
QFS as of the 4.0 release natively supports homogeneous file sharing
between Sun systems. All of the features available for the native file system are supported on the client systems, along with additional allocation features. Heterogeneous
support is provided via a product from IBM/Tivoli called SANergy,
which allows support for AIX, Linux, Tru64, SGI, and some versions
of Windows. The features for remote allocation and caching are limited compared with using native QFS on Sun clients.
UFS
This is one of the oldest file systems available; some of the
original proposals for UFS came from papers written in 1965. Most
of the original device layouts have been maintained over the years,
which has limited its performance. Sun recently added journaling
to UFS because the metadata layout caused fsck performance to be
horrible. Adding journaling to UFS as an afterthought is questionable,
in my opinion. The largest allocation is 8192 bytes, and the largest
contiguous allocation can be a maximum of 1 MB. The allocation size
is a real problem for some RAID devices because the default stripe
size for a RAID group can be 4 MB for a RAID-5 8+1 configuration.
Direct I/O was also added recently; it can be enabled via a mount option or based on the request size from the application.
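As a hedged example on Solaris (the device path and mount point are placeholders), both logging and direct I/O are enabled as mount options:
mount -F ufs -o logging,forcedirectio /dev/dsk/c1t0d0s6 /data
The forcedirectio option bypasses the page cache for all I/O to the file system, and logging enables the UFS journal.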
SANPoint Foundation Suite
For years, the only alternative to UFS on Solaris was VxFS. VxFS
supported journaling/logging and was well integrated with VxVM.
Today, VxFS supports automatic discovery of direct I/O based on
the application request size and pre-allocation, and is well integrated
with Oracle and other database applications.
Veritas supports a shared file system -- SANPoint Control -- where
the metadata and data are separated. The file system uses VxVM and
other features to provide high metadata reliability and path failover.
SANPoint Control currently supports Solaris, Windows, Linux, AIX, and HP-UX.
Conclusions
Choosing the right file system for your environment includes understanding
not only the application requirements of file size and file system
size, but also the administrative requirements and hardware platforms
supported by a file system. Some file systems support only some
RAIDs and switch hardware. Don't ask me why a file system needs
to certify a RAID, because it does not make sense to me.
The bottom line is that almost all of the file systems above could
be "the best" given specific application requirements, available
or planned hardware, and administrative requirements. Running an
operationally sound benchmark (a benchmark that mimics the real
operation of the system) is critical to determining which will be
"the best" for you.
What's Next
During the past 10 months, I have reviewed the data path to/from:
User Application <--> Operating System <--> Volume Manager and File System <--> Device(s)
Now that I've covered the basics, in the next few columns, I will
describe how to determine requirements, define the rules, and run
a benchmark for file systems and storage. Issues like HBAs, RAIDs,
tapes, and backup will be reviewed. Issues such as I/O performance
under load, data fragmentation, file system and file size limitations,
and scaling should all come out in a benchmark or file system test.
If you have suggestions or specific issues, please feel free to
email me.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.