Exploring the Solaris™ Device Tree
Shaun Clowes
New Solaris administrators often struggle trying to find commands
to query the hardware attached to their machines. This is hardly
surprising given that the number of machines most administrators
manage has increased during the past few years, making it more difficult
to remember the configuration of them all. Fortunately, there are
some very useful system tools such as cfgadm, prtconf,
sysdef, and format that can help. You may have used
these tools and perhaps wondered how they determine the hardware
configuration. The main source of this information is the device
tree -- data kept by the kernel describing the hardware attached
to the machine.
In this article, I will examine the internals of the Solaris device
tree. I will also show how you can use the device tree to create
your own special-purpose system tools. Note that I'll describe
the tree and its features using terminology from Solaris 8 documentation.
Solaris 7 and earlier releases used different terminology but the
concepts and the code implementation (with caveats as described)
are the same.
The Device Tree
The device tree is kept in kernel memory and contains the kernel's
current information about the hardware configuration of the machine.
Each node in the tree is either a bus nexus node or a leaf (device)
node. A bus nexus node describes a device or machine component that
can logically or physically contain other devices. For example,
on most machines, there would be a bus nexus node for a PCI bus.
Most of the nodes in the device tree are not bus nexus nodes; instead,
they describe devices, such as network cards. These nodes are referred
to as leaf nodes because they are always attached to a bus nexus
node and cannot contain other devices. For example, the network
card leaf node could be owned by a PCI node but could not have further
leaf nodes attached to it. The tree is rooted at a bus nexus node
for the hardware platform of the machine that is linked to some
of the machine's buses, which eventually link to all of the
device leaf nodes.
The basic structure of the device tree is shown in Figure 1. The
dotted lines in the diagram indicate conceptual links, with bus
nexus nodes having links to all of the devices on the bus in a parent/child
relationship. As it turns out, the nexus node really only has a
link to its first child, which then links to all of its siblings
(the black lines represent these real links). The real links will
be discussed in a later section.
It is important to note that a device represented in the tree
need not be an accessible piece of hardware. Nodes may exist for
devices that do not have drivers available as well as for pseudo
devices. Pseudo devices are devices that are provided by the kernel
(or drivers in the kernel) and do not exist physically, only logically
(e.g., the kmem device (/dev/kmem) that provides access to kernel
memory). Pseudo device nodes are attached to a special pseudo bus
nexus node (since logical devices clearly can't be attached
to a physical bus).
Inside a Node
Each device node contains a variety of information about the device.
The node format is very flexible and can accommodate both general
information needed by the system and specific information needed
by drivers or userland applications. This flexibility means that
a node is quite a complex data structure, so I will cover only
a subset of the most useful parts of the structure.
The basic information in a node is a set of pointers to the node's
parent, sibling, and first child. The parent pointer contains the
address of the bus nexus node representing the bus to which this
device is attached. The sibling pointer contains the address of
the node for the next device attached to the same parent bus, if
any exists. The child pointer is only used in bus nexus nodes with
attached devices and contains the address of the first node describing
a device attached to the bus. These fields form the tree structure.
To traverse a bus, the first child node is found using the child
pointer from the bus nexus node. Then, all of the other children
can be found by following the sibling pointers in each node attached
to the bus.
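The linkage just described can be sketched in portable C. This is a toy model, not the kernel's dev_info structure: the type toy_node, its field names, and both helper functions are hypothetical and exist only to illustrate the first-child and next-sibling links.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a device node; NOT the kernel's dev_info layout. */
struct toy_node {
    const char *name;
    struct toy_node *parent;   /* bus nexus this node is attached to */
    struct toy_node *child;    /* first attached device (nexus nodes only) */
    struct toy_node *sibling;  /* next device on the same parent bus */
};

/* Attach dev to bus at the end of the bus's sibling chain. */
void toy_attach(struct toy_node *bus, struct toy_node *dev)
{
    assert(bus != NULL && dev != NULL);
    dev->parent = bus;
    dev->sibling = NULL;
    if (bus->child == NULL) {
        bus->child = dev;
        return;
    }
    struct toy_node *last = bus->child;
    while (last->sibling != NULL)
        last = last->sibling;
    last->sibling = dev;
}

/* Traverse one bus: start at the first child, then follow siblings.
 * Stores up to max child pointers in out and returns the total count. */
size_t toy_bus_children(const struct toy_node *bus,
                        const struct toy_node **out, size_t max)
{
    assert(bus != NULL && out != NULL);
    size_t n = 0;
    for (const struct toy_node *c = bus->child; c != NULL; c = c->sibling) {
        if (n < max)
            out[n] = c;
        n++;
    }
    return n;
}
```

Attaching a scsi node and then a network node to a pci node, and then calling toy_bus_children() on pci, yields the two devices in attachment order. Calling it on a leaf node yields zero.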
Each node provides a variety of general information; in particular,
each node must provide a name and binding name. Other information
that may be provided includes a bus address, driver name, device
state, and device instance number. The node name is the most descriptive
of the general fields to the observer and can (depending on the
device and driver) be a generic name or a meaningful description
of the device. The binding name is used by the kernel to find an
appropriate driver to handle the device. The bus address describes
the device's location on the bus to which it is attached, often
in hardware terms. The driver name describes the driver that is
managing this device, if one exists and is bound to the device.
The device state is normally 0 if the device is functioning correctly,
but can indicate whether the device or its parent bus is down. The
instance number can be set by the system to differentiate between
devices attached to the same driver. As an example of these fields,
a typical SCSI disk might have the following field values: name
"sd", binding name "sd", bus address "1,0",
driver "sd", state "0", instance "1".
The flexibility of the structure is provided in its ability to
contain "properties" for each node. Each node has three
separate lists of properties (driver, system, and hardware) that
contain information related to each aspect of the device. Each list
can hold any number of properties, each having a name, a type, and
data. The name describes the property being provided. The type describes
the format of the data contained in the property. The property can
contain a boolean, an array of integers, an array of strings, or
an array of bytes. The data held by the property is then interpreted
according to the type specified. Obviously, this mechanism is very
powerful because it allows virtually any sort of information about
a device to be provided to requestors who understand the property
names and data contained within.
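One way to picture a property list is as a linked list of tagged values. The sketch below is an illustration under assumptions: toy_prop, the TOY_* type tags, and the lookup helpers are hypothetical names, not the kernel's or libdevinfo's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The four property data formats described in the text. */
enum toy_prop_type { TOY_BOOLEAN, TOY_INTS, TOY_STRINGS, TOY_BYTES };

/* Hypothetical property record: a name, a type, and typed data. */
struct toy_prop {
    const char *name;
    enum toy_prop_type type;
    size_t count;                     /* element count; 0 for a boolean */
    union {
        const int *ints;
        const char *const *strings;
        const unsigned char *bytes;
    } data;
    struct toy_prop *next;            /* next property in the same list */
};

/* Find a property by name in one list; NULL if it is absent. */
const struct toy_prop *toy_prop_find(const struct toy_prop *list,
                                     const char *name)
{
    assert(name != NULL);
    for (; list != NULL; list = list->next)
        if (strcmp(list->name, name) == 0)
            return list;
    return NULL;
}

/* Interpret the data according to its type: fetch integer idx.
 * Returns 0 on success, -1 if the property is missing, is not an
 * integer array, or idx is out of range. */
int toy_prop_int(const struct toy_prop *p, size_t idx, int *out)
{
    assert(out != NULL);
    if (p == NULL || p->type != TOY_INTS || idx >= p->count)
        return -1;
    *out = p->data.ints[idx];
    return 0;
}
```

A requestor that knows a "lun" property holds integers can call toy_prop_find() and then toy_prop_int(). Asking for an integer from a string property fails rather than misreading the data.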
Some devices can be directly addressed from userspace through
the use of device special files. Obvious examples include the disk
block and character files in /dev (e.g., /dev/dsk/c0t0d0s0).
These devices have a list of minor nodes attached to their node
in the device tree; each minor node describes a device special file
that can be used to address the device. Information contained in
each minor node includes a name, device type, special file type,
and dev_t. The "name" field provides an identifying
name for the minor device. The "device type" field identifies
the type of device that this file can be used to address (e.g.,
it might identify the device as a disk). The "special file
type" indicates whether this special device file is a character
or block file. The dev_t is the combination of a major and
minor pair used to create the special file with mknod. For
example, a SCSI disk might have a minor node with special file type
"character", name "a,raw", type "ddi_block:channel",
and dev_t "32,0" to refer to the character special file
for slice 0 (a = 0) of the disk. An administrator could take this
information and use "mknod disks0 c 32 0" to create a
special file that could be used to refer to this device.
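The step from a minor node to a mknod command line can be sketched as follows. The toy_minor structure and toy_mknod_command() are hypothetical; they only format the command shown above and do not create any file.

```c
#include <assert.h>
#include <stdio.h>

enum toy_spectype { TOY_CHR, TOY_BLK };

/* Hypothetical record holding the minor node fields from the text. */
struct toy_minor {
    const char *name;        /* e.g. "a,raw" */
    const char *node_type;   /* e.g. "ddi_block:channel" */
    enum toy_spectype spectype;
    unsigned major;          /* major half of the dev_t */
    unsigned minor;          /* minor half of the dev_t */
};

/* Format "mknod <file> <c|b> <major> <minor>" into buf.
 * Returns the string length, or -1 if buf is too small. */
int toy_mknod_command(const struct toy_minor *m, const char *file,
                      char *buf, size_t size)
{
    assert(m != NULL && file != NULL && buf != NULL);
    int n = snprintf(buf, size, "mknod %s %c %u %u", file,
                     m->spectype == TOY_CHR ? 'c' : 'b',
                     m->major, m->minor);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

With the example values from the text (character type, major 32, minor 0) and the file name disks0, the function produces the same mknod command line.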
The /devices Directory
This may sound familiar to many readers, even those who have not
heard of the device tree before, because a version of the device
tree is exported to userland in the form of the /devices directory.
The main difference between the in-kernel device tree and the /devices
hierarchy is that only devices that are bound to a driver, and thus
usable, are present in the /devices tree; the in-kernel tree
contains all devices attached to the system, bound or not.
A heavily abbreviated listing of the /devices tree on a
fairly typical machine looks like the following:
/devices
/devices/pseudo
/devices/pseudo/icmp@0:icmp
/devices/pci@1f,4000:devctl
/devices/pci@1f,4000
/devices/pci@1f,4000/scsi@3:devctl
/devices/pci@1f,4000/scsi@3:scsi
/devices/pci@1f,4000/ebus@1
/devices/pci@1f,4000/ebus@1/se@14,400000:0,hdlc
/devices/pci@1f,4000/ebus@1/se@14,400000:a,cu
/devices/pci@1f,4000/scsi@3
/devices/pci@1f,4000/scsi@3/sd@0,0:a
/devices/pci@1f,4000/scsi@3/sd@0,0:a,raw
/devices/pci@1f,4000/scsi@3/sd@6,0:a
/devices/pci@1f,4000/scsi@3/sd@6,0:a,raw
It's fairly easy to see the device tree structure in this listing.
Each device tree node has a representative name of the form <node
name>@<bus address>. For each device node, one file is
present for each of the minor nodes attached to the node with a filename
of the form <representative name>:<minor node name>
(with the correct major and minor for the device). For bus nexus nodes
that have child nodes, there is also a directory with name <representative
name>, which contains the files representing the child nodes.
This listing also highlights the descriptive power of the device tree
-- it shows that there is a disk or CD-ROM (managed by the sd
driver) attached to the first SCSI bus on the first PCI controller.
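The naming rule can be made concrete by building a /devices path from parent links. This is a sketch over a hypothetical toy_node type (name, bus address, parent), not the real node structure. It assumes the root node contributes only the /devices prefix, which is consistent with the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: just enough fields to build a path. */
struct toy_node {
    const char *name;
    const char *bus_addr;            /* NULL when there is no bus address */
    const struct toy_node *parent;   /* NULL for the root node */
};

/* Append s to buf at offset *used; -1 if it does not fit. */
static int toy_append(char *buf, size_t size, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (len >= size - *used)
        return -1;                   /* no room for s plus the NUL */
    memcpy(buf + *used, s, len + 1);
    *used += len;
    return 0;
}

/* Emit the parent's path first, then "/<node name>[@<bus address>]". */
static int toy_node_path(const struct toy_node *node, char *buf,
                         size_t size, size_t *used)
{
    if (node->parent == NULL)
        return toy_append(buf, size, used, "/devices");
    if (toy_node_path(node->parent, buf, size, used) != 0 ||
        toy_append(buf, size, used, "/") != 0 ||
        toy_append(buf, size, used, node->name) != 0)
        return -1;
    if (node->bus_addr != NULL &&
        (toy_append(buf, size, used, "@") != 0 ||
         toy_append(buf, size, used, node->bus_addr) != 0))
        return -1;
    return 0;
}

/* Full name: node path plus optional ":<minor node name>". */
int toy_devfs_path(const struct toy_node *node, const char *minor_name,
                   char *buf, size_t size)
{
    assert(node != NULL && buf != NULL && size > 0);
    size_t used = 0;
    buf[0] = '\0';
    if (toy_node_path(node, buf, size, &used) != 0)
        return -1;
    if (minor_name != NULL &&
        (toy_append(buf, size, &used, ":") != 0 ||
         toy_append(buf, size, &used, minor_name) != 0))
        return -1;
    return 0;
}
```

For a chain of root, pci at address 1f,4000, scsi at address 3, and sd at address 0,0, with the minor name a,raw, this yields /devices/pci@1f,4000/scsi@3/sd@0,0:a,raw, which matches an entry in the listing.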
The kernel always refers to devices through the device tree, and
thus through names like those shown in the listing. (Administrative
names like c0t0d0s0 have no meaning to the kernel.) In fact, the
normal administrative /dev nodes are all symbolic links to
the related device in the /devices tree. They are constructed
by devfsadmd (or devlinks and associated utilities
in older versions of Solaris) merely as a convenience to the administrator.
Thus, it's possible for a device to be perfectly functional
and usable via its /devices file without having an entry
in the /dev tree.
System Tools
As previously mentioned, tools such as prtconf and cfgadm
are dependent on the device tree and use it to retrieve most of
the information they provide. The following is a heavily edited
portion of the output from prtconf -v:
pci, instance #0
    Driver properties:
        name <interrupt-priorities> length <24>
            value <0x0000000e0000000e0000000e0000000e0000000e0000000e>.
    scsi, instance #0
        sd, instance #0
            System properties:
                name <lun> length <4>
                    value <0x00000000>.
            Driver properties:
                name <inquiry-product-id> length <17>
                    value 'MAG3091L SUN9.0G'
It is easy to see the relationship to the device tree in this output;
in fact, this is basically just a dump of the tree contents. prtconf
shows the name and instance number for each node in the device tree,
along with any properties in the system, hardware, and driver properties
lists. It then lists all the child nodes for bus nexus nodes. Tools
like prtconf are very useful, because administrators can use
them to query the kernel's current view of the hardware and driver
configuration of the machine.
Unfortunately, prtconf isn't particularly useful for
querying particular aspects of the machine configuration. It can
take some scripting magic to use its output to provide something
as simple as a list of all CD-ROM drives in the machine. In Solaris
8, this information can be easily obtained from cfgadm -a;
however, it is still useful to write custom tools to obtain information
directly from the device tree (a process illustrated in the next
sections).
Reading the Device Tree
The first question is how to get at the tree. The tree is stored
in kernel memory with the first node at the symbol top_devinfo,
which can be found in the sys/ddi_impldefs.h header file.
Each node is represented by a dev_info structure, which is
also described in ddi_impldefs.h. This means that the tree can be
read by user applications (with root privileges) by reading kernel
memory through the /dev/kmem character special device file (or
indirectly through libkvm routines). As Alexander Golomshtok and Yefim Nodelman described
in "Managing Solaris with Kstat" (Sys Admin magazine,
October 2001: http://www.samag.com/documents/s=1323/sam0110a/0110a.htm)
a seek() on this file positions the file pointer at a location
in kernel memory and read() copies the kernel memory to the
buffer in user space. Listing 1 shows how the root device node could
be read from kernel memory using this technique. (All Sys Admin
magazine listings are available at: http://www.sysadminmag.com/code/.)
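The seek-then-read pattern itself is short. The sketch below is not Listing 1: read_at() and read_path_at() are hypothetical helpers that work on any seekable file. Reading real kernel memory would additionally require root privileges and the kernel address of the symbol, neither of which is shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Position fd at addr, then copy exactly len bytes into buf.
 * Returns 0 on success, -1 on a seek error, a read error, or a
 * short read. */
int read_at(int fd, off_t addr, void *buf, size_t len)
{
    assert(buf != NULL);
    if (lseek(fd, addr, SEEK_SET) == (off_t)-1)
        return -1;
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;          /* error, or end of file before len bytes */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Convenience wrapper: open path read-only, read, and close. */
int read_path_at(const char *path, off_t addr, void *buf, size_t len)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = read_at(fd, addr, buf, len);
    close(fd);
    return rc;
}
```

Following a pointer stored in one structure means calling read_at() a second time with the pointer value as the address, which is part of what makes this approach clumsy in practice.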
There are a number of disadvantages to reading the device tree
directly from kernel memory. One major issue is that there is no
mechanism available for the reader to ensure the tree isn't
modified while it is being read. This can lead to inconsistent or
incomplete data being returned. Another problem is that the program
then becomes tied to the word size of the kernel. If the kernel
is 64-bit, the program using libkvm to access kernel memory
must also be 64-bit, even though 32-bit applications can run under
64-bit kernels. Finally, accessing data this way is simply clunky,
particularly reading pointers and then the data they point to. What
is the alternative?
Introducing the devinfo Library
The devinfo library (libdevinfo) provides a suite
of routines that can be used to query the device tree. It is used
by most of the system tools mentioned throughout this article to
access the information in the tree. There are two versions of libdevinfo.
The version distributed with Solaris 2.5.1 and 2.6 is entirely undocumented;
it was rewritten, documented, and then distributed with Solaris 7 and
8. This means that for users of Solaris 7 and 8, it's quite
simple to retrieve information from the tree (as I'll demonstrate),
but what about users of 2.5.1 and 2.6?
The obvious answer is to use the kmem interface as described
in the previous section. It's reasonably straightforward (but
less convenient) to perform the same functions that are possible
under the later version of libdevinfo using this method.
The undocumented libdevinfo on these systems is provided
exclusively for the use of the system tools, and the only documentation
I have found for it is an old newsgroup posting (http://groups.google.com/groups?q=libdevinfo&start=30&hl=en&rnum=31&selm=5rr0id%245qt%241%40nnrp.cs.ubc.ca).
I have used only the kmem technique on Solaris 2.5 and 2.6,
because the undocumented libdevinfo offers very little (except perhaps
being a little more convenient). It uses kmem and therefore inherits
all the problems mentioned previously. Furthermore, it doesn't
offer the same amount of useful functionality as its successor,
which I will discuss shortly.
It may come as no surprise to learn that the more recent libdevinfo
does not access kernel memory through kmem. Instead, Solaris
7 and 8 include a special devinfo driver, which exports a
devinfo device that is addressed through an exported device
special file called /devices/pseudo/devinfo@0:devinfo,ro.
Routines in libdevinfo can be used to retrieve a complete
and consistent snapshot of the device tree through this file and
root privileges aren't even required under Solaris 8. The other
library routines can then be used to interrogate the snapshot and
read information from the tree.
Using libdevinfo
Listing 2 shows a fairly simple program that demonstrates many
of the features and functions available through libdevinfo,
which I'll use to discuss how the library can be used. I'll
also present a simpler program that uses libdevinfo to solve
a particular problem. The listing can be compiled using the GNU
C compiler (gcc) with the following command (assuming the source
is saved as devinfo.c):
gcc -g -o devinfo devinfo.c -ldevinfo
The output of the program is not pretty, but it is information-packed
and helps illustrate what is in the device tree. Some heavily edited
output from the program looks like this:
SUNW,Ultra-60 bind(SUNW,Ultra-60) bus() drv(rootnex) inst(-1) state() \
drvprop(pm-hardware-state) sysprop(relative-addressing)
packages bind(packages) bus() drv() inst(-1) state(drvdet) \
terminal-emulator bind(terminal-emulator) bus() drv() inst(-1) \
state(drvdet)
pci bind(pci108e,8000) bus(1f,4000) drv(pcipsy) inst(0) state() \
drvprop(interrupt-priorities) cminor(devctl ddi_ctl:devctl)
network bind(SUNW,hme) bus(1,1) drv(hme) inst(0) state() \
hwprop(latency-timer) hwprop(cache-line-size) cminor(hme ddi_network)
scsi bind(glm) bus(3) drv(glm) inst(0) state() drvprop(target6-TQ) \
drvprop(target6-wide) hwprop(latency-timer) cminor(devctl \
ddi_ctl:devctl:scsi) cminor(scsi ddi_ctl:attachment_point:scsi)
sd bind(sd) bus(0,0) drv(sd) inst(0) state() \
drvprop(inquiry-vendor-id) sysprop(lun) sysprop(target) \
bminor(a ddi_block:channel) cminor(a,raw ddi_block:channel)
The program initially calls di_init() on line 92 to get a snapshot
of the current device tree from the base of the tree (/), including
all of the subtree with all property and minor node data (DINFOCPYALL).
At this point, the library opens the devinfo device and uses
ioctls to cause the devinfo driver to lock the device
tree against modifications, then create a copy of the device tree
and copy it to user space. The function returns the first node in
the copied tree in the form of a di_node_t. The library provides
a suite of routines for reading information from the node so it isn't
normally necessary to know anything about the contents of the di_node_t.
The root node is then given to the examinenode() function
on line 98, which dumps information from the node to stdout.
The general information from the node is dumped on line 59. The
name, binding name, bus address, driver name, instance number, and
state are shown; all but the state are retrieved easily through the libdevinfo
routines di_node_name(), di_binding_name(), di_bus_addr(),
di_driver_name(), and di_instance(). Note the conditional
expressions used for the bus address and driver name -- remember
that the device tree nodes can describe pseudo devices that have
no bus address, and devices that have no driver in the kernel and,
thus, no driver name.
The final piece of general information, the state, is retrieved
through di_state() on line 66 and then passed to the showstate()
function for display because the state is a set of individually
meaningful bits represented in an unsigned integer. showstate()
simply decodes the bits and outputs meaningful states.
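Decoding a bit set into labels might look like the following. This is not the showstate() from Listing 2: the TOY_* bit values are placeholders (the real flag values come from the libdevinfo headers), and every label except "drvdet", which appears in the sample output above, is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values for illustration only. */
#define TOY_DEVICE_OFFLINE  0x0001u
#define TOY_DEVICE_DOWN     0x0002u
#define TOY_BUS_QUIESCED    0x0100u
#define TOY_BUS_DOWN        0x0200u
#define TOY_DRIVER_DETACHED 0x8000u

static const struct { unsigned bit; const char *label; } toy_state_names[] = {
    { TOY_DEVICE_OFFLINE,  "offline" },
    { TOY_DEVICE_DOWN,     "down" },
    { TOY_BUS_QUIESCED,    "busquiesced" },
    { TOY_BUS_DOWN,        "busdown" },
    { TOY_DRIVER_DETACHED, "drvdet" },
};

/* Write a space-separated label for every bit set in state.
 * A state of 0 yields an empty string. Returns 0 on success,
 * -1 if buf is too small. */
int toy_showstate(unsigned state, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0;
         i < sizeof toy_state_names / sizeof toy_state_names[0]; i++) {
        if ((state & toy_state_names[i].bit) == 0)
            continue;
        size_t len = strlen(toy_state_names[i].label);
        size_t sep = used > 0 ? 1 : 0;
        if (used + sep + len >= size)
            return -1;
        if (sep)
            buf[used++] = ' ';
        memcpy(buf + used, toy_state_names[i].label, len + 1);
        used += len;
    }
    return 0;
}
```

A table-driven decoder keeps the bit-to-label mapping in one place, so supporting another state bit is a one-line change.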
Following the basic data about the node, the program prints all
of the available properties for the node. This is achieved through
the "while" loop on line 68. When di_prop_next()
is called with tProp equal to DI_PROP_NIL, it returns
the first property and each successive call returns the next property
until none remain, at which point DI_PROP_NIL is returned.
For each property, showprop() is called. showprop()
first looks into the property structures to determine which property
list they are from -- hardware, system, or driver. This information
is not directly provided by libdevinfo and is determined
by calculating the property's offset from the first offset for each
of the three lists in the snapshot of the device tree (lines 28-35).
The routine then prints a prefix indicating the type of property
along with the property name from the di_prop_name() routine
on line 37. Note that determining the property type this way isn't
necessarily a good idea, because it introduces a dependency on internal
libdevinfo structures that may change in the future. However,
I've shown it as a demonstration of how data above and beyond
that provided by the libdevinfo routines can be retrieved
directly from the structures.
Back in examinenode(), the program then prints the minor node information
if there are any minor nodes attached to the current device tree
node. The code loops through all of the minor nodes on lines 74-76
using a loop exactly like that used to examine the properties in
lines 69-71, except that di_minor_next() is used. For each
minor node, showminor() is called to dump the minor node
information.
showminor() is quite simple. It initially uses di_minor_spectype()
to determine whether this device special file is a character or
block device on lines 41-44 so it can prefix the output with "c"
or "b". Then, it prints the name of the minor (using di_minor_name())
and the device type (using di_minor_nodetype()) on lines
46-47.
At this point, most of the information in the node has been successfully
dumped. The routine then recursively loops through the rest of the
tree by calling examinenode() on each of the child nodes.
On line 81, it calls di_child_node() to get the first child
node (if any exists). It then goes into a "while" loop
in which it calls examinenode() on the current child node
(line 83). This recursion will cause the child node and all of its
children to be dumped, since this call to examinenode() will
in turn call examinenode() for all its children. di_sibling_node()
is called on line 84 to get the sibling of the current child node,
and if no further children remain, DI_NODE_NIL will be returned.
Thus, the loop will terminate on line 81. Otherwise, the loop continues
until all children (and their children) have been dumped.
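The shape of that recursion can be shown without libdevinfo. The sketch below uses a hypothetical toy_node with child and sibling links in place of di_child_node() and di_sibling_node(), with NULL playing the role of DI_NODE_NIL. It is not the examinenode() from Listing 2.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical node with the two links used by the traversal. */
struct toy_node {
    const char *name;
    const struct toy_node *child;    /* first child, or NULL */
    const struct toy_node *sibling;  /* next sibling, or NULL */
};

/* Print node, then recurse into each child in turn. Siblings of
 * the starting node are not visited; only siblings of children are.
 * Returns the number of nodes visited. */
size_t toy_examine(const struct toy_node *node, int depth, FILE *out)
{
    assert(node != NULL && out != NULL);
    fprintf(out, "%*s%s\n", depth * 2, "", node->name);
    size_t visited = 1;
    const struct toy_node *child = node->child;
    while (child != NULL) {
        visited += toy_examine(child, depth + 1, out);
        child = child->sibling;      /* NULL ends the loop */
    }
    return visited;
}
```

Called on the root with a depth of 0 and stdout as the stream, it prints one indented line per node, with children before later siblings.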
Although the output is not particularly clean, it is quite informative.
For example, it's reasonably easy to see the block minor exported
(sd@0,0:a) from the first SCSI disk. In the next section,
I will refine this information retrieval process to develop a short
program that lists all of the CD-ROM devices attached to the system.
A Useful Sample Tool
Listing 3 uses the device tree to list all the CD-ROM devices
attached to the system by finding CD-ROM minor nodes. As before,
the program first gets a snapshot of the tree on line 41. Once the
snapshot has been generated, the libdevinfo utility function
di_walk_node() is used on line 47 to walk the tree. It takes
a root node and loops through the entire subtree, calling a specified
function with each node it finds. In this case, the parameters specify
that the traversal should start from the root of the tree, traverse
with children before siblings, and call the checknode function
for each node.
checknode is a relatively simple function that loops through
all the minor nodes of the device node using di_minor_next()
on line 21, as in the previous sample program. For each minor node,
it determines the node type with di_minor_nodetype() and
checks whether it is DDI_NT_CD or DDI_NT_CD_CHAN on
lines 22-24. These two types are used to define CD-ROM minor nodes.
If it finds a minor of those types, it knows this device must be
a CD-ROM and needs to be displayed. This code could be easily changed
to list hard disks by instead checking for DDI_NT_BLOCK or
DDI_NT_BLOCK_CHAN.
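The node type test boils down to a string comparison. In the sketch below, the TOY_NT_* strings are assumed to mirror the DDI_NT_* definitions in the Solaris headers. Only "ddi_block:channel" is confirmed by the sample output earlier in this article, so the rest should be checked against sys/sunddi.h before being relied on. The function names are hypothetical.

```c
#include <string.h>

/* Assumed values of the corresponding DDI_NT_* macros; verify
 * against <sys/sunddi.h> on a real system. */
#define TOY_NT_BLOCK      "ddi_block"
#define TOY_NT_BLOCK_CHAN "ddi_block:channel"
#define TOY_NT_CD         "ddi_block:cdrom"
#define TOY_NT_CD_CHAN    "ddi_block:cdrom:channel"

/* Nonzero if the minor node type names a CD-ROM. */
int toy_is_cdrom(const char *nodetype)
{
    if (nodetype == NULL)
        return 0;
    return strcmp(nodetype, TOY_NT_CD) == 0 ||
           strcmp(nodetype, TOY_NT_CD_CHAN) == 0;
}

/* Nonzero if the minor node type names a hard disk. */
int toy_is_disk(const char *nodetype)
{
    if (nodetype == NULL)
        return 0;
    return strcmp(nodetype, TOY_NT_BLOCK) == 0 ||
           strcmp(nodetype, TOY_NT_BLOCK_CHAN) == 0;
}
```

An exact match matters here: a prefix test for "ddi_block" would classify CD-ROMs as hard disks.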
At this stage, the routine tries to find the administrative name
for the CD-ROM (e.g., c0t6d0s0) since the /devices filename
is less intuitive for a user. To find the administrative name, checknode()
calls ftw() on line 26. ftw() is a standard function
in the C library that walks a file tree calling a specified function
for each file found. In this case, the ftw() call traverses
the /dev tree calling showpath() for each file found.
showpath() simply compares the dev_t (or the major
and minor) for the file to the dev_t specified in the device
tree minor node on line 9. If they are the same, then this file
is a block special file for the CD-ROM in question and is thus the
administrative device special file and name for the CD-ROM. In this
case, the function prints the name of the file and returns 1, which
will cause the ftw() call to end. If the dev_ts do
not match, the function simply exits with 0, thus causing the ftw()
file tree walk to continue.
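The ftw() search can be sketched with POSIX calls only. This is not Listing 3: toy_find_special() and toy_showpath() are hypothetical, and the dev_t to look for is passed in by the caller rather than taken from a libdevinfo minor node. Because ftw() gives its callback no user argument, the target and the result live in file-scope variables, so the helper is not thread-safe.

```c
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static dev_t toy_wanted_rdev;      /* dev_t being searched for */
static int toy_want_block;         /* nonzero: block file, zero: character */
static char toy_found_path[1024];  /* first matching path, if any */

/* ftw() callback: returning nonzero stops the walk. */
static int toy_showpath(const char *path, const struct stat *sb, int flag)
{
    if (flag != FTW_F)
        return 0;                  /* skip directories and unreadable entries */
    int type_ok = toy_want_block ? S_ISBLK(sb->st_mode)
                                 : S_ISCHR(sb->st_mode);
    if (type_ok && sb->st_rdev == toy_wanted_rdev) {
        snprintf(toy_found_path, sizeof toy_found_path, "%s", path);
        return 1;
    }
    return 0;
}

/* Walk the file tree under root looking for a special file whose
 * dev_t equals rdev. Returns the path (in static storage), or NULL
 * if nothing matched or the walk failed. */
const char *toy_find_special(const char *root, dev_t rdev, int want_block)
{
    assert(root != NULL);
    toy_wanted_rdev = rdev;
    toy_want_block = want_block;
    toy_found_path[0] = '\0';
    return ftw(root, toy_showpath, 16) == 1 ? toy_found_path : NULL;
}
```

Calling toy_find_special("/dev", devt, 1) corresponds to the search described above. A NULL result is the cue to fall back to the /devices path.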
Back in checknode(), if the file tree walk did not find
a matching file (i.e., it returned 0), there must not be an administrative
node for the CD-ROM in /dev. As discussed previously, a fully
functional device does not need to have an entry in /dev
but will always exist in the /devices filesystem. In this
case, the function calls di_devfs_path() on line 27 to get
the /devices path to the CD-ROM and prints that instead of
the administrative name. Having displayed the name, di_devfs_path_free()
is called on line 29 to free the storage allocated by di_devfs_path()
to store the path.
Conclusion
The Solaris kernel readily exports an amazing array of information
about the device configuration of the machine on which it is running
in the form of the device tree. Sun has gone to great lengths to
make that information accessible with the devinfo library.
This article has demonstrated an array of available functionality,
but has not covered the depth of functionality provided. One thing
is for sure: administrators and systems programmers who use
libdevinfo to mine the device tree will find that it's
an amazing source of hard-to-find information about devices,
from disks to network cards and pseudo devices.
Shaun Clowes has worked in UNIX systems administration for
a number of years and is currently working as a systems programmer
on a variety of UNIX platforms, including Solaris, his favorite.
He can be contacted at: shaun@yap.com.au.