Cover V11, I12

Article

dec2002.tar

File Systems -- Understanding the Internals

Henry Newman

This month, I will cover the internal workings and implementations of file systems. File systems are the next stage in the data path as part of my series on storage. In my last column, I covered volume management -- how volumes are created and managed, and performance issues related to striping volumes. For file systems that require a volume manager to control multiple devices, understanding the integration between the volume manager and file system is critical.

File System Services

A file system provides a management framework for the storage. It allows you to see and manage the data via the framework that we call files. File systems do the following:

  • Manage the available space -- Allocation maps, where the files reside on the storage, removal of files
  • Provide access control -- Standard UNIX/POSIX permissions, Access Control Lists (ACL)
  • Support standard UNIX/POSIX interfaces -- read/write, fread/fwrite, aioread/aiowrite
  • Support other feature functions -- Hierarchical storage management, special I/O for databases

Some file systems have additional features and functions for data sharing across servers, security, file locking, and hierarchical storage management (HSM).

File System Functions

Historically, file systems needed to be checked after a system crash for consistency. As file systems got larger and disk performance did not scale with the size, the time to check the file system after a crash became longer and longer. I remember back in 1992 waiting for 11 hours to fsck(1M) a Cray file system that I had crashed (the customer was not happy). Since that time, a number of technologies have been developed to improve this functionality.

Logging

Log-based file systems were first discussed in a USENIX paper, "The Design and Implementation of a Log Structured File System", by Rosenbaum and Ousterhout in 1990. In this paper, the authors analyzed I/O usage and concluded that:

1. I/Os are short.
2. Most I/Os are random.
3. All data is fragmented on current file system technology (BSD and other file systems).
4. Disk access speeds have not changed much.

Since that time, file system vendors such as Compaq/HP (ADvFS), Veritas (VxFS), SGI (XFS), IBM (JFS), Sun (UFS Logging), and others, as well as Linux file systems (EXT3, ReiserFS) have taken the original concept and modified it to log metadata operations. The goal is to ensure that the file system metadata is synchronized when the log area becomes full. If the system crashes, the expectation is that only the metadata log will have to be checked for consistency after the boot -- not all of the metadata. This check is commonly called fsck(1M). This methodology was developed based on the requirement to boot quickly after a crash.

Log Placement

Most file system and volume managers allow the placement of the log on another device different from the data. This is done to reduce contention and increase the performance of the file system metadata operations. Each time the files are written or opened, the inode is kept in the log and it is periodically written to the metadata area within the actual file system metadata area(s).

When doing a large number of metadata logging operations, logging device performance can become an issue. With logging, the file system metadata is copied two times.

1. The file system metadata is written to the log device.
2. The file system metadata is moved from the log device to the file system metadata area after the log becomes full.

This double copy can become a performance bottleneck if, for example, a large number of metadata operations fill the log and the file system is busy with data operations, or if the log device is slower than the number of log operations that are needed.

Some people (including me) have philosophical issues with logs and logging, and you either fall into the logging camp or non-logging camp, and there are far more loggers than non-loggers. These file system wars are best left to bar room discussions (which occasionally turn into brawls), but in a nutshell, the original reason for logging was to cache data and metadata. Logging is currently used as a method for fast file system recovery. Fast file system recovery is the requirement, not the logging, and I believe other methods can meet the requirement.

Inode and Data Location

Two types of inode formats are used by file systems to describe the location of the data associated with the file being represented by the inode. They are:

1. Indirect blocks
2. Extents

Indirect Blocks

If a file system is using indirect blocks, the inode points to the data block but all subsequent data blocks are pointed to by the last data block. So, the first inode points to the first data allocation, and the last data within that first allocation points to the location of the next data block. UFS is a good example of a file system that uses indirect block allocation to represent block address. Using indirect blocks becomes a huge issue when doing random reads, because often the whole file must be read to determine the location of the data that is needed. Here are some good Web resources for further information:

http://ou800doc.caldera.com/FS_admin/_ufs_Inodes.html#_The_ufs_Inodes_Disk_Block_Addre
http://www.cs.wisc.edu/~bart/537/lecturenotes/s25.html
http://www.stanford.edu/class/cs140/projects/filesys/review.html
The advantage of this indirect block method was that adding to the end of a file was simple, because the data block was in memory (remember that 8K of memory was huge in 1965). It was a great idea given the available technology at the time.

Extents

With extent-based inode usage, the inode contains the address list of where data resides in the file system. For many file systems, each inode contains space for 15 addresses and a pointer to the location of the next inode, which can also contain 15 addresses. The data blocks contain any information as to the location of other data -- unlike the indirect block representation. See the following for more information:

http://www.cs.utexas.edu/users/mootaz/cs372/Lectures/L-17.ppt
http://www.cs.uwyo.edu/~seker/courses/4740/lect12.ppt 
Extent representation is used in almost all file systems today. Indirect block allocation was proposed with the original UFS design circa 1965, and might have met the hardware and software requirements of that day. However, extent-based allocation is far more optimal given today's hardware and software structure. I believe, given current technology trends, that extents will become outdated as we move to the future.

Data Representation/Allocation

Data representation and allocation involves how and where file systems place data on the managed devices (note that I do not say disks, because some file system support hierarchical storage management (HSM) so that data can reside on tape or disk or both). Data allocation algorithms and data placement becomes a big issue for some file systems. As mentioned in my previous article, most file systems require a volume manager when using multiple devices.

Given what I have seen for most user and scratch file systems, a 90/10 rule applies. 90% of the files use 10% of the space, and 10% of the files use 90% of the space. Sometimes the distribution is 95/5 or even a bi-modal distribution, but you are likely to have a very skewed distribution of sizes and counts for files -- not a statistically normal distribution. Understanding how allocation is accomplished in a file system provides a better understanding of how the file system will scale and perform in your environment given the file sizes, the allocation sizes, and how the data is accessed. There are generally two types of representation of free space.

1. Btrees -- used by Veritas, XFS Reiser, and other file systems
2. Bitmaps -- used by QFS, NFTS, HFS (Mac/Linux)

A B-tree is a sorted list of all the free allocations and used space within the file system. It is important to understand how space is allocated and used from within the tree. Some good reading on B-tree allocation, which is used in most file systems, can be found at:

http://cs-people.bu.edu/dbuzan/cs112/lab7/lab92.html
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
Bitmap allocation is less commonly used. This method is used where each bit in the map represents a single allocation unit such as 512 bytes (a single hardware sector on a disk) or up to 65536 KB (the largest allocation in the Sun QFS file system). There are at least two methods for free space searching. Most file systems use first-fit algorithms, however, at least two other file systems use best-fit -- Cray NC1FS and Fujitsu FPFS.

http://www.cs.utexas.edu/users/dahlin/Classes/UGOS/quiz/soln6.html
The first-fit method tries to find the first space within the file system that matches the allocation size. In some file systems, the first-fit method is used to find the space closest to the last allocation of the file being extended. The best-fit method tries to find the best place in the file system for the allocation of the data. This method is used to try to reduce total file system fragmentation. The best-fit method always takes more CPU cycles than first-fit, because the whole file system must be searched for the best allocation.

The best-fit method also works better at reducing fragmentation, especially when files can be pre-allocated (for file systems that support this method) or for large allocations (e.g., multiple MBs). Most vendors do not support this method, because most users do not pre-allocate and most allocations in file systems are not large. Also, because volume managers provide the file system with a single address space, the search time would be huge in a large file system. Even for file systems that allow round-robin allocation, the best-fit algorithm requires searching the whole round-robin device (partition).

In general, B-trees can do a better job for smaller allocations, because the allocation time is faster. As the file system grows, and data is added to the end of files, the files become fragmented. Bitmaps are better for larger allocations, but the time for allocation (searching for free space) can be much greater, especially if the best-fit algorithm is used. Both bitmaps and B-trees can be optimized for your operational requirements with tunable parameters in the file system, volume manager, RAID configuration, and allocations.

Conclusions

I've provided an overview of how file systems work internally and some common file system activities. This sets the foundation for understanding file system tuning parameters and their means. In the next column, I will cover some of the specific tunables, creating volumes and file systems on Solaris (UFS, VxFS/VxVM, QFS), AIX (JFS), and Linux (ResierFS, EXT3). Using these file systems and volume managers will provide a basis for understanding other file systems and volume managers.

Henry Newman has worked in the IT industry for more than 20 years. Originally at Cray Research and now with a consulting organization, he has provided expertise in systems architecture and performance analysis to customers in government, scientific research, and industry around the world. His focus is high-performance computing, storage and networking for UNIX systems, and he previously authored a monthly column about storage for Server/Workstation Expert magazine. He may be reached at: hsn@hsnewman.com.