Solaris™
Resource Management -- The Fair Share Scheduler
Peter Baer Galvin
The previous Solaris™ Companion began coverage of the new
Solaris 9 resource management features. This month continues the
analysis by exploring the core of S9RM, the fair share scheduler.
This scheduler provides precise control of CPU use, allowing optimum
system resource use without conflict between applications and workloads.
The fair share scheduler (FSS) was available as an unbundled product
before Solaris 9, and is included in the full Solaris 9 release
at no charge. The Solaris 9 Resource Management package includes
resource limits (discussed here previously), the FSS, resource pools,
and the Internet protocol quality of service (IPQoS) facility (to
be covered in the future). This suite is one of the key features
of Solaris 9, along with the inclusion of SunScreen, new device
support, and performance and reliability improvements. Solaris 9
is in the early adoption phase at most sites, and I hope this column
will help Sun sites determine if and when to move to Solaris 9,
and which features to adopt as part of the move or later in the
Solaris 9 lifecycle.
Overview
Fair share scheduling dates from before 1988, when
research into the concept resulted in a paper by J.
Kay and P. Lauder being published in the Communications of the
ACM. That paper and other information are available from http://citeseer.nj.nec.com/kay88fair.html.
Publishing a research paper is a far cry from use in commercial
operating systems and, in fact, few fair share schedulers have been
implemented. They are not a general replacement for time-sharing
schedulers, which are demand-based. Time sharing schedulers try
to maximize the use of system resources, and tend to allocate CPU
slices to anyone who wants them. They figure that if a process wants to
be a "hog", so be it; at least the system is being used
fully!
With the advent of utility computing and Sun's drive toward
the N1 architecture supporting on-demand service creation and destruction,
more refined control of CPU use is required. If a facility runs
N database servers for M different projects, it must have fine-grained
control of resource use available. Enter the FSS, which is another
loadable scheduler class within Solaris. In fact, it can coexist
with the time-share scheduler, real-time scheduler, interactive
scheduler, and fixed-priority scheduler classes already available
on Solaris. Details of mixing use of these classes are beyond the
scope of this column, but are available in the Sun documentation.
Functionality
In general, the FSS allows projects, tasks, and processes to be
assigned "shares" of the system CPUs. There is no fixed total
of shares to divide up. Rather, shares are relative. If one project
has a share of 6 and the only other project has a share of 2, the
first project will receive 6/8 of the CPU cycles and the second
will receive 2/8 when both are competing for the CPU.
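The arithmetic behind relative shares can be sketched in a few lines of Python. This is only an illustration of the ratio described above, not a model of the scheduler itself; the function name cpu_fractions and the project names "first" and "second" are made up for the example.

```python
from fractions import Fraction


def cpu_fractions(shares):
    """Map each project name to its fraction of the CPU, given relative shares."""
    total = sum(shares.values())
    if total == 0:
        raise ValueError("at least one project needs a nonzero share")
    return {name: Fraction(value, total) for name, value in shares.items()}


# Hypothetical project names; the share values 6 and 2 come from the text.
result = cpu_fractions({"first": 6, "second": 2})
for name, part in result.items():
    print(f"{name}: {part} of the CPU ({float(part):.0%})")
```

With shares of 6 and 2, the fractions reduce to 3/4 and 1/4. Doubling both share values to 12 and 4 would give exactly the same split, which is the sense in which shares are relative.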
The reality of the FSS is that projects (which contain tasks,
which contain processes) can be determined by the systems manager
to deserve some relative amount of the CPU cycles of a system, and
that resource availability is guaranteed. If a system is funded
by multiple groups, for instance, those groups can receive the amount
of CPU corresponding to their contribution.
Of course, there are other ways to limit the amount of CPU use
by processes. Processor sets are an absolute method. A process is
assigned to a set of CPUs, and cannot run on other CPUs. Nor can
other processes run on those CPUs. (As an aside, resource pools
are a Solaris 9 augmentation of processor sets. If you are using
sets, check out the pools.)
FSS is a little more flexible than that. If CPU resources are
unused, anyone can use them. However, if a certain project needs
resources, and deserves those resources, then the FSS will schedule
that project to use those CPU resources and disallow use by anything
else while that project is using them. Thus the FSS combines the best
of both approaches: it maximizes system resource use, yet assures that
CPU resources are available in a predetermined ratio when resource
demand is high.
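One way to picture this behavior is to compute each project's entitlement over only the projects that currently want CPU. The sketch below is a simplified model under stated assumptions, not the actual FSS algorithm: the function name entitlements, the project names "db" and "web", and the even split among zero-share projects when nothing else is runnable are all assumptions made for illustration.

```python
def entitlements(shares, busy):
    """Return each project's CPU fraction, counting only projects with runnable work.

    shares: project name -> configured share value
    busy:   set of project names that currently want CPU
    """
    active = {name: shares[name] for name in busy}
    total = sum(active.values())
    result = {name: 0.0 for name in shares}
    if total == 0:
        # Only zero-share projects (or none at all) are busy. Nothing with
        # shares competes, so this model splits the CPU evenly among them.
        # The even split is an assumption of the sketch.
        for name in active:
            result[name] = 1.0 / len(active)
        return result
    for name, value in active.items():
        result[name] = value / total
    return result


# Hypothetical projects: "db" holds 3 shares and "web" holds 1.
shares = {"db": 3, "web": 1}
print(entitlements(shares, {"web"}))        # web is alone, so it may use everything
print(entitlements(shares, {"db", "web"}))  # both busy, so the 3:1 ratio applies
```

The point of the model is that the configured ratio only matters under contention. An idle project's entitlement is not reserved the way a processor set's CPUs are.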
Domains, processor sets, and the FSS can be used in combination
to slice and dice a system to exactly match the site's needs.
Domains are separately booted environments within a Sun server and
do not affect each other. Processor sets exist within one operating
system image and create a wall preventing use of CPUs by disparate
processes. FSS could be used within a domain or a processor set
to further refine which processes gain priority over which others.
This combination provides a very complete set of CPU-use management
tools.
Utility
First, the FSS scheduling class needs to be set as the default
scheduler for the system:
# dispadmin -d FSS
This change is permanent (until another class is declared to be the
default), and system processes will be placed into the FSS scheduler
from now on after each reboot. At this point none of the existing
processes are in the new class. Rather, they are still in their old
classes or the Solaris default class of TS. To move all of the existing
TS processes to FSS, use:
# priocntl -s -c FSS -i class TS
And now check the scheduling classes that are in use:
# ps -ef -o pset,class | grep -v CLS | sort | uniq
- IA
- TS
- FSS
- SYS
The use of the FSS is based on a few commands and configuration files.
The project.cpu-shares property in the /etc/project
file adds share information to the project information typically contained
there. Projects not given shares here are assigned a default share
value of 1. Note that any projects with shares of 0 are starved of
CPU use while any projects with shares greater than 0 are running.
In the following example of /etc/project, any tasks or
processes in testproject are assigned a share of 10:
testproject:100::::project.cpu-shares=(privileged,10,none)
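For readers who want to audit share settings from a script, the entry above can be picked apart with a short Python sketch. It assumes the six colon-separated fields of /etc/project (name, ID, comment, user list, group list, attributes) and handles only the single-value form of project.cpu-shares shown here. The function names parse_project_line and cpu_shares are hypothetical, and the fallback of 1 mirrors the default share value described above.

```python
import re


def parse_project_line(line):
    """Split one /etc/project entry into its six colon-separated fields."""
    name, projid, comment, users, groups, attrs = line.rstrip("\n").split(":", 5)
    return {
        "name": name,
        "projid": int(projid),
        "comment": comment,
        "users": [u for u in users.split(",") if u],
        "groups": [g for g in groups.split(",") if g],
        "attrs": attrs,
    }


def cpu_shares(entry, default=1):
    """Return the project.cpu-shares value, or the default if the attribute is unset."""
    match = re.search(r"project\.cpu-shares=\(\s*\w+\s*,\s*(\d+)\s*,", entry["attrs"])
    return int(match.group(1)) if match else default


entry = parse_project_line("testproject:100::::project.cpu-shares=(privileged,10,none)")
print(entry["name"], entry["projid"], cpu_shares(entry))
```

This reads only what is written in the file. As with the file itself, it says nothing about values changed at run time on a live system.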
Note that only pertinent processes started after this value is in
place would receive that share value. The /etc/project file is
only read when a project is instantiated or a process joins an existing
project. To change the share of all processes in testproject to 3
while they execute, use prctl:
# prctl -r -n project.cpu-shares -v 3 -i project testproject
This command only has a temporary effect, as it does not change /etc/project.
The value specified by the -v option can be from zero to 65535.
All system processes run in project 0, which is given the maximum
value of shares.
How does the FSS perform in the real world? Consider a system
with three new projects defined:
# more /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
lowprioproject:11:For testing:pbg::project.cpu-shares=(privileged,0,none)
medprioproject:12:For testing:pbg::project.cpu-shares=(privileged,1,none)
highprioproject:13:For testing:pbg::project.cpu-shares=(privileged,10,none)
Before any user processes are started, prstat shows an unused
system with only the standard projects in use:
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
1 4 13M 11M 2.3% 0:00:01 0.2% user.root
3 33 178M 98M 20% 0:00:07 0.1% default
0 42 98M 56M 11% 7:40:22 0.0% system
Now we start a lowprioproject task:
$ newtask -p lowprioproject /usr/tmp/cpuhog &
And view the prstat project information:
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
11 2 1984K 1192K 0.2% 0:02:10 100% lowprioproject
1 4 13M 11M 2.3% 0:00:01 0.1% user.root
3 33 178M 98M 20% 0:00:07 0.1% default
0 42 98M 56M 11% 7:40:22 0.0% system
Total: 82 processes, 148 lwps, load averages: 0.91, 1.07, 1.91
Now we can start a medium-priority project process, and it should
get as much CPU as it desires, because the only competing project
has zero shares:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
14236 pbg 928K 416K run 1 0 0:02:27 99% cpuhog/1
14230 pbg 928K 416K run 0 0 0:04:18 0.0% cpuhog/1
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
12 1 928K 416K 0.1% 0:02:27 99% medprioproject
1 5 13M 11M 2.3% 0:00:01 0.2% user.root
0 42 98M 56M 11% 7:40:22 0.1% system
3 33 178M 98M 20% 0:00:07 0.0% default
11 2 1984K 1192K 0.2% 0:04:18 0.0% lowprioproject
Total: 84 processes, 150 lwps, load averages: 1.95, 1.44, 1.84
Note that the load average is approaching 2, because two threads are
runnable. The FSS gives no time slices to the cpuhog running with
a share of 0, however.
Next, we can start the high-priority project (with 10 shares).
To begin, we confirm that it has 10 shares:
$ newtask -p highprioproject
$ prctl -n project.cpu-shares $$
14265: sh
project.cpu-shares [ no-basic no-local-action ]
10 privileged none
65535 system deny [ max ]
Then, we start another "cpuhog" and check the results. As
expected, the high-priority project gets approximately 90% of the CPU, the
medium-priority project gets about 10%, and the low-priority project is starved:
$ /usr/tmp/cpuhog &
$ prstat -J -p 14267,14236,14230
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
14267 root 928K 416K run 2 0 0:05:11 90% cpuhog/1
14236 pbg 928K 416K run 1 0 0:05:22 9.3% cpuhog/1
14230 pbg 928K 416K run 0 0 0:04:18 0.0% cpuhog/1
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
13 1 928K 416K 0.1% 0:05:11 90% highprioproject
12 1 928K 416K 0.1% 0:05:22 9.3% medprioproject
11 1 928K 416K 0.1% 0:04:18 0.0% lowprioproject
Total: 3 processes, 3 lwps, load averages: 3.02, 2.59, 2.23
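As a sanity check, the prstat figures can be compared with what the share values alone would predict. The short Python sketch below uses the shares from the /etc/project listing and the CPU percentages from the final prstat output. The helper names expected_percent and report are made up for this example, and the calculation is plain ratio arithmetic, not a description of how the scheduler works internally.

```python
# Share values from the /etc/project listing and CPU percentages from the
# final prstat output in the text.
SHARES = {"highprioproject": 10, "medprioproject": 1, "lowprioproject": 0}
OBSERVED = {"highprioproject": 90.0, "medprioproject": 9.3, "lowprioproject": 0.0}


def expected_percent(shares):
    """Percentage of CPU each project would get from its shares alone."""
    total = sum(shares.values())
    return {name: 100.0 * value / total for name, value in shares.items()}


def report(shares, observed):
    """Return (project, expected %, observed %, difference) rows."""
    rows = []
    for name, expect in expected_percent(shares).items():
        rows.append((name, expect, observed[name], observed[name] - expect))
    return rows


for name, expect, seen, delta in report(SHARES, OBSERVED):
    print(f"{name:16} expected {expect:5.1f}%  observed {seen:5.1f}%  delta {delta:+5.1f}")
```

With 11 shares in play, the prediction is 10/11 (about 90.9%) for the high-priority project and 1/11 (about 9.1%) for the medium-priority one, so the observed 90% and 9.3% are within a percentage point of the configured ratio.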
Acknowledgement
Special thanks to Andrei Dorofeev, Member of Technical Staff at
Sun, for giving input to this column. He also showed me an interesting
command for listing how many runnable threads are on a system, including
any normally hidden kernel threads:
# mdb -k
Loading modules: [ unix krtld genunix ip usba s1394 ufs_log random nfs ptm cpc
lofs ]
> ::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 00001400000  1b    3    0  57   no    no t-1    300051d2fc0 mdb
                  |    |
       RUNNING <--+    +-->  PRI THREAD      PROC
         READY                14 30005176560 cpuhog
        EXISTS                 1 30005176fe0 cpuhog
        ENABLE                 0 300051d3260 cpuhog
> ^D
Summary
The Solaris Fair Share Scheduler is part of the
Solaris 9 Resource Manager suite. It creates a new scheduler class,
and can be used to provide strict control over how much CPU each
project uses compared to other projects in the FSS class.
Together with processor sets, resource pools, and domains, fine-grained
management of CPU use is now possible. Along with the other resource
manager features, the FSS allows control of the use of the Solaris
environment that has never before been possible.
You can find out all about the FSS in Sun's documentation:
http://docs.sun.com/db/doc/806-4076/6jd6amqqo?a=view
Experimentation with the new FSS is low risk, but should be done on
non-production environments. Enabling and using the FSS does not even
require a reboot, so excuses are limited for not giving it a try.
Peter Baer Galvin (http://www.petergalvin.org) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column,
and Pete's Super Systems, the systems management column for
Unix Insider (http://www.unixinsider.com). Peter is
coauthor of the Operating System Concepts and Applied
Operating System Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide.