File Systems -- Understanding the Internals
Henry Newman
This month, I will cover the internal workings and implementations
of file systems, the next stage in the data path in my series on
storage. In my last column, I covered volume management -- how
volumes are created and managed, and the performance issues related
to striping volumes. For file systems that require a volume manager
to control multiple devices, understanding the integration between
the volume manager and the file system is critical.
File System Services
A file system provides a management framework for the storage.
It allows you to see and manage the data through the abstraction
we call files. File systems do the following:
- Manage the available space -- Allocation maps, tracking where
files reside on the storage, and removing files
- Provide access control -- Standard UNIX/POSIX permissions,
Access Control Lists (ACLs)
- Support standard UNIX/POSIX interfaces -- read/write, fread/fwrite,
aioread/aiowrite (see the sketch at the end of this section)
- Support other feature functions -- Hierarchical storage
management, special I/O for databases
Some file systems have additional features and functions for data
sharing across servers, security, file locking, and hierarchical
storage management (HSM).
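To make these interfaces concrete, here is a minimal sketch of the
first two families (error handling is trimmed for brevity, and the
asynchronous calls -- aioread/aiowrite on Solaris -- are omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char    buf[8192];
        int     fd;
        ssize_t n;
        FILE   *fp;
        size_t  m;

        /* Unbuffered POSIX calls: each read(2)/write(2) is a system call. */
        fd = open("/etc/hosts", O_RDONLY);
        n  = read(fd, buf, sizeof(buf));
        write(STDOUT_FILENO, buf, (size_t)n);
        close(fd);

        /* Buffered stdio calls: the C library batches small requests. */
        fp = fopen("/etc/hosts", "r");
        m  = fread(buf, 1, sizeof(buf), fp);
        fwrite(buf, 1, m, stdout);
        fclose(fp);
        return 0;
    }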
File System Functions
Historically, file systems needed to be checked for consistency
after a system crash. As file systems got larger and disk performance
did not scale with the size, the time to check a file system after
a crash became longer and longer. I remember back in 1992 waiting
11 hours to fsck(1M) a Cray file system that I had crashed
(the customer was not happy). Since that time, a number of
technologies have been developed to shorten this recovery.
Logging
Log-based file systems were first discussed in the paper
"The Design and Implementation of a Log-Structured File System",
by Rosenblum and Ousterhout (published in 1991). In this paper,
the authors analyzed I/O usage and concluded that:
1. Most I/Os are small.
2. Most I/Os are random.
3. Data becomes fragmented under the file system technology of
the day (BSD and other file systems).
4. Disk access speeds have not improved much.
Since that time, file system vendors such as Compaq/HP (AdvFS),
Veritas (VxFS), SGI (XFS), IBM (JFS), Sun (UFS Logging), and others,
as well as Linux file systems (EXT3, ReiserFS), have taken the original
concept and modified it to log metadata operations. The goal is
to ensure that the file system metadata is synchronized when the
log area becomes full. If the system crashes, the expectation is
that only the metadata log will have to be checked for consistency
after the boot -- not all of the metadata. This check
is commonly called fsck(1M). This methodology was developed
based on the requirement to boot quickly after a crash.
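To illustrate why this recovery is fast, here is a much-simplified
sketch of a log replay at mount time. Only the log is scanned --
not every inode in the file system -- so recovery time depends on
the log size, not the file system size. The record layout and
function names here are invented for illustration and do not match
any vendor's actual format:

    #include <stdint.h>

    /* One logged metadata update.  Invented layout. */
    struct log_record {
        uint64_t txid;         /* transaction this record belongs to   */
        uint64_t meta_block;   /* metadata block the update applies to */
        char     payload[512]; /* new contents of that metadata block  */
        int      committed;    /* nonzero once the transaction is done */
    };

    /* Provided by lower layers in this sketch. */
    extern int  read_log_record(unsigned idx, struct log_record *rec);
    extern void write_metadata_block(uint64_t block, const char *data);

    void replay_log(unsigned nrecords)
    {
        struct log_record rec;
        unsigned i;

        for (i = 0; i < nrecords; i++) {
            if (read_log_record(i, &rec) != 0)
                break;          /* end of the valid log */
            if (!rec.committed)
                continue;       /* incomplete transaction: discard it */
            /* Re-apply the committed update.  Applying it twice is
               safe because the record holds the whole new block. */
            write_metadata_block(rec.meta_block, rec.payload);
        }
    }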
Log Placement
Most file systems and volume managers allow the log to be placed
on a device separate from the data. This is done to reduce contention
and increase the performance of file system metadata operations.
Each time a file is written or opened, the updated inode is recorded
in the log, and the log entries are periodically written back to
the metadata area(s) within the actual file system.
When doing a large number of metadata logging operations, logging
device performance can become an issue. With logging, the file system
metadata is copied twice:
1. The file system metadata is written to the log device.
2. The file system metadata is moved from the log device to the
file system metadata area after the log becomes full.
This double copy can become a performance bottleneck if, for example,
a large number of metadata operations fill the log while the file
system is busy with data operations, or if the log device cannot
keep up with the rate of log operations.
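The two copies are easy to see in code form. This sketch is
illustrative only -- the function names are invented, and real
implementations batch and order these writes carefully:

    #include <stdint.h>

    /* Provided by lower layers in this sketch. */
    extern void append_to_log_device(uint64_t block, const char *data);
    extern void write_metadata_area(uint64_t block, const char *data);
    extern int  log_is_full(void);
    extern void for_each_logged_update(void (*fn)(uint64_t, const char *));

    /* Copy 2: move every logged update to its home metadata location. */
    static void checkpoint_log(void)
    {
        for_each_logged_update(write_metadata_area);
    }

    void update_metadata(uint64_t block, const char *data)
    {
        /* Copy 1: a fast sequential append to the log device. */
        append_to_log_device(block, data);

        /* When the log fills, the second copy happens -- and if the
           log device is slow, or data I/O is competing for the same
           spindles, this is where the bottleneck appears. */
        if (log_is_full())
            checkpoint_log();
    }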
Some people (including me) have philosophical issues with logs
and logging, and you either fall into the logging camp or non-logging
camp, and there are far more loggers than non-loggers. These file
system wars are best left to bar room discussions (which occasionally
turn into brawls), but in a nutshell, the original reason for logging
was to cache data and metadata. Logging is currently used as a method
for fast file system recovery. Fast file system recovery is the
requirement, not the logging, and I believe other methods can meet
the requirement.
Inode and Data Location
Two types of inode formats are used by file systems to describe
the location of the data associated with the file being represented
by the inode. They are:
1. Indirect blocks
2. Extents
Indirect Blocks
If a file system uses indirect blocks, the inode itself holds the
addresses of only the first few data blocks. Once a file grows past
those direct pointers, the inode points to an indirect block -- a
block containing nothing but the addresses of further data blocks --
and, for larger files, to double and triple indirect blocks in turn.
UFS is a good example of a file system that uses indirect blocks
to represent block addresses. Indirect blocks become a huge issue
when doing random reads of large files, because one or more levels
of indirect blocks must often be read just to determine the location
of the data that is needed. Here are some good Web resources for
further information:
http://ou800doc.caldera.com/FS_admin/_ufs_Inodes.html#_The_ufs_Inodes_Disk_Block_Addre
http://www.cs.wisc.edu/~bart/537/lecturenotes/s25.html
http://www.stanford.edu/class/cs140/projects/filesys/review.html
The advantage of the indirect block method was that inodes stayed
small and fixed in size, and appending to the end of a file was
simple, because usually only the inode or a single indirect block --
already in memory -- had to be updated (remember that 8 KB of memory
was huge in the early 1970s, when the original UNIX file system was
designed). It was a great idea given the available technology at
the time.
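A sketch of the lookup path shows where the extra I/O comes from.
The constants and names are illustrative (classic UFS uses 12 direct
pointers; block sizes vary):

    #include <stdint.h>

    #define NDIRECT 12            /* direct pointers held in the inode */
    #define BLKSIZE 8192          /* logical block size                */
    #define PTRS_PER_BLK (BLKSIZE / sizeof(uint64_t))

    struct inode {
        uint64_t direct[NDIRECT]; /* blocks of the first 12 data blocks */
        uint64_t single_indirect; /* block holding 1024 more pointers   */
    };

    /* Reads a file system block; provided by lower layers in this sketch. */
    extern void read_block(uint64_t blkno, void *buf);

    /* Translate a file offset to a disk block number. */
    uint64_t offset_to_block(const struct inode *ip, uint64_t offset)
    {
        uint64_t lbn = offset / BLKSIZE; /* logical block within the file */
        uint64_t ptrs[PTRS_PER_BLK];

        if (lbn < NDIRECT)
            return ip->direct[lbn];      /* found in the inode: no extra I/O */

        /* Past the direct pointers: one additional disk read is needed
           just to learn where the data lives.  Double and triple
           indirect blocks (not shown) add further reads for very
           large files. */
        read_block(ip->single_indirect, ptrs);
        return ptrs[lbn - NDIRECT];
    }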
Extents
With extent-based inodes, the inode contains a list of extents --
each extent recording the starting address and contiguous length
of a run of data blocks in the file system. In many file systems,
each inode holds space for a fixed number of extents (15 is a common
figure) plus a pointer to an extension area that can hold more. The
data blocks contain no information about the location of other
data -- unlike the indirect block representation. See the following
for more information:
http://www.cs.utexas.edu/users/mootaz/cs372/Lectures/L-17.ppt
http://www.cs.uwyo.edu/~seker/courses/4740/lect12.ppt
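A sketch of the same lookup with extents shows the difference: the
map is small and lives in the inode, so no extra disk reads are
needed in the common case. Names and sizes are illustrative:

    #include <stdint.h>

    /* One extent: a starting block and a contiguous run length. */
    struct extent {
        uint64_t file_block;  /* first logical block covered   */
        uint64_t disk_block;  /* where that run starts on disk */
        uint64_t length;      /* run length in blocks          */
    };

    #define NEXTENTS 15       /* extent slots in this sketch's inode */

    struct inode {
        struct extent ext[NEXTENTS];
        unsigned      nextents;          /* slots in use */
    };

    /* Translate a logical file block to a disk block: a scan of a
       small in-core table -- no extra disk reads. */
    uint64_t extent_lookup(const struct inode *ip, uint64_t lbn)
    {
        unsigned i;

        for (i = 0; i < ip->nextents; i++) {
            const struct extent *e = &ip->ext[i];
            if (lbn >= e->file_block && lbn < e->file_block + e->length)
                return e->disk_block + (lbn - e->file_block);
        }
        return 0;   /* hole: no block allocated at this offset */
    }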
Extent representation is used in almost all file systems today.
Indirect block allocation dates from the original UNIX file system
design of the early 1970s, and it met the hardware and software
constraints of that day. However, extent-based allocation is far
better suited to today's hardware and software. I believe, given
current technology trends, that even extents will become outdated
as we move into the future.
Data Representation/Allocation
Data representation and allocation involve how and where file
systems place data on the managed devices (note that I do not say
disks, because some file systems support hierarchical storage
management (HSM), so data can reside on tape, disk, or both). Data
allocation algorithms and data placement become a big issue for
some file systems. As mentioned in my previous article, most file
systems require a volume manager when using multiple devices.
Given what I have seen for most user and scratch file systems,
a 90/10 rule applies. 90% of the files use 10% of the space, and
10% of the files use 90% of the space. Sometimes the distribution
is 95/5 or even a bi-modal distribution, but you are likely to have
a very skewed distribution of sizes and counts for files --
not a statistically normal distribution. Understanding how allocation
is accomplished in a file system provides a better understanding
of how the file system will scale and perform in your environment
given the file sizes, the allocation sizes, and how the data is
accessed. There are generally two types of representation of free
space.
1. B-trees -- used by Veritas (VxFS), XFS, ReiserFS, and other file systems
2. Bitmaps -- used by QFS, NTFS, and HFS (Mac/Linux)
A B-tree is a sorted, tree-structured index of all the free
allocations and used space within the file system. It is important
to understand how
space is allocated and used from within the tree. Some good reading
on B-tree allocation, which is used in most file systems, can be
found at:
http://cs-people.bu.edu/dbuzan/cs112/lab7/lab92.html
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
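A real B-tree implementation is beyond the scope of a column, but
the idea it provides -- free extents kept sorted so a fit can be
found without scanning everything -- can be sketched with a simple
sorted table standing in for the tree (names are illustrative):

    #include <stdint.h>

    /* One free extent: a starting block and a run length. */
    struct free_extent { uint64_t start, length; };

    /* The table plays the role of a B-tree sorted by extent size
       (XFS, for example, keeps one of its free-space trees sorted
       this way), so the first entry big enough is also a good fit. */
    uint64_t alloc_extent(struct free_extent *tab, unsigned n,
                          uint64_t want)
    {
        unsigned i;

        for (i = 0; i < n; i++) {
            if (tab[i].length >= want) {
                uint64_t blk = tab[i].start;
                tab[i].start  += want;    /* shrink the free extent */
                tab[i].length -= want;
                return blk;
            }
        }
        return 0;   /* no free extent is large enough */
    }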
Bitmap allocation is less commonly used. In this method, each bit
in the map represents a single allocation unit, which can be as
small as 512 bytes (a single hardware sector on a disk) or as large
as 65536 KB (the largest allocation in the Sun QFS file system).
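A sketch of the map itself (sizes and names are illustrative; real
maps live on disk and are cached in pieces):

    #include <stdint.h>

    #define UNITS (1u << 20)      /* allocation units in this sketch */

    uint8_t bitmap[UNITS / 8];    /* one bit per unit; 1 = in use    */

    int  is_used(uint64_t u)   { return bitmap[u / 8] & (1u << (u % 8)); }
    void mark_used(uint64_t u) { bitmap[u / 8] |=  (1u << (u % 8)); }
    void mark_free(uint64_t u) { bitmap[u / 8] &= ~(1u << (u % 8)); }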
There are at least two methods for searching the free space. Most
file systems use a first-fit algorithm; however, at least two other
file systems use best-fit -- Cray NC1FS and Fujitsu FPFS. See:
http://www.cs.utexas.edu/users/dahlin/Classes/UGOS/quiz/soln6.html
The first-fit method tries to find the first space within the file
system that matches the allocation size. In some file systems, the
first-fit method is used to find the space closest to the last allocation
of the file being extended. The best-fit method tries to find the
best place in the file system for the allocation of the data. This
method is used to try to reduce total file system fragmentation. The
best-fit method always takes more CPU cycles than first-fit, because
the whole file system must be searched for the best allocation.
The best-fit method also works better at reducing fragmentation,
especially when files can be pre-allocated (for file systems that
support this method) or for large allocations (e.g., multiple MBs).
Most vendors do not support this method, because most users do not
pre-allocate and most allocations in file systems are not large.
Also, because volume managers provide the file system with a single
address space, the search time would be huge in a large file system.
Even for file systems that allow round-robin allocation, the best-fit
algorithm requires searching the whole round-robin device (partition).
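The difference between the two methods is easy to see as code over
the bitmap from the earlier sketch (illustrative only; real
implementations are far more clever about skipping ahead):

    #include <stdint.h>

    #define UNITS (1u << 20)         /* must match the bitmap sketch */
    extern int is_used(uint64_t u);  /* from the bitmap sketch       */

    /* Length of the free run starting at unit u. */
    static uint64_t run_len(uint64_t u)
    {
        uint64_t n = 0;
        while (u + n < UNITS && !is_used(u + n))
            n++;
        return n;
    }

    /* First-fit: stop at the first free run that is long enough. */
    uint64_t first_fit(uint64_t want)
    {
        uint64_t u;
        for (u = 0; u < UNITS; u++)
            if (!is_used(u) && run_len(u) >= want)
                return u;
        return (uint64_t)-1;   /* no fit found */
    }

    /* Best-fit: examine every free run and keep the smallest one
       that still fits -- less fragmentation, more CPU. */
    uint64_t best_fit(uint64_t want)
    {
        uint64_t u = 0, best = (uint64_t)-1, best_len = (uint64_t)-1;

        while (u < UNITS) {
            uint64_t len;
            if (is_used(u)) { u++; continue; }
            len = run_len(u);
            if (len >= want && len < best_len) {
                best = u;
                best_len = len;
            }
            u += len;          /* skip past this whole free run */
        }
        return best;
    }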
In general, B-trees can do a better job for smaller allocations,
because the allocation time is faster. As the file system grows,
and data is added to the end of files, the files become fragmented.
Bitmaps are better for larger allocations, but the time for allocation
(searching for free space) can be much greater, especially if the
best-fit algorithm is used. Both bitmaps and B-trees can be optimized
for your operational requirements with tunable parameters in the
file system, volume manager, RAID configuration, and allocations.
Conclusions
I've provided an overview of how file systems work internally
and some common file system activities. This sets the foundation
for understanding file system tuning parameters and their meanings.
In the next column, I will cover some of the specific tunables,
creating volumes and file systems on Solaris (UFS, VxFS/VxVM, QFS),
AIX (JFS), and Linux (ReiserFS, EXT3). Using these file systems
and volume managers will provide a basis for understanding other
file systems and volume managers.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.