USENET ELM: A Case Study in Portability between UNIX Systems
Sydney S. Weinstein
The diversity of UNIX systems requires "Universal
UNIX Applications"
to be as portable as possible. The attempt to keep one
such application
-- USENET Elm -- portable as both UNIX and C have evolved
has
required constant effort and provides a useful case
study of UNIX
portability issues.
Dave Taylor wrote Elm in the mid-1980s while he was
working at Hewlett-Packard,
then in 1987 released it, with HP's blessings, to the
USENET community.
Like much freely distributable UNIX software, Elm is
released as source
code compiled by the user or system administrator. Thus
portability
of the system at the source code level is mandatory.
Elm, in the UNIX vernacular, is a Mail User Agent (MUA).
It displays
the contents of a mailbox or folder (sequential text
file containing
mail messages), allows display of individual mail messages
from the
mailbox, accepts replies to those messages, and allows
for generation
of new messages for the Mail Transport Agent (MTA) to
deliver. Elm
does not deliver the messages; instead, it passes them
to the MTA,
which handles the routing and delivery.
Early UNIX MUAs were line-oriented, as the standard
terminal in use
was a hard-copy printing terminal. With the switch to
CRT-based terminals,
UNIX applications moved from a line orientation to a screen orientation.
As one
of the early screen-oriented MUAs, Elm incorporated
the best features
of the line-oriented MUAs available in the mid-1980s
and extended
the concept to a full-screen, menu-driven system. Designed
to be simple
to use and "intuitive," yet not so restrictive
as to frustrate
sophisticated users, Elm is currently used by approximately
250,000
individual users, on over 20,000 systems.
Original Elm Environment
Elm was initially developed with HP-UX, a port of the
AT&T System
V.2 version of UNIX. These systems used a K&R-style
C compiler (ANSI
C was not yet a glint in someone's eye). Elm was coded
in the "loose"
style common to software not intended to be ported between
very diverse
systems.
AT&T System V.2 Dependencies
Hewlett-Packard based HP-UX on the Motorola MC680x0
family of processors.
Processors in this family share certain characteristics:
- 32-bit word length
- 32-bit integer length (int type)
- 32-bit argument passing (all arguments less than 32 bits long are
converted to 32-bit values when placed on the stack as arguments to functions)
- 32-bit pointer length
- Large linear addressing space with no segmentation
The common length for the pointer data type, argument
passing, and the int data type allowed for some very
loose
programming practices, the most common being to intermix
the int
and pointer data types freely, on the assumption that
an int
can always hold a pointer value. The common length also
means that
an integer/character argument passed to a function could
always be
considered as an int. Casting arguments to convert the
types
explicitly was not necessary.
The large linear addressing space allowed large buffers
to be placed
on the stack and used to hold data values without concern
for overflow.
If overflow appeared likely, the size of the buffer
could be increased
-- there was plenty of room.
Because AT&T UNIX System V.2 limited filenames to
fourteen characters,
the individual elements of a full path name (the filenames)
were short
and the space reserved to hold path names was also very
small. In
addition, Elm used the C library provided with this
UNIX since, at
the time, no other version of UNIX had a different C
library.
HP Function Keys
The original Elm was developed on and hard-coded to
support Hewlett-Packard
terminals. These terminals used their own keyboard layout
with their
own set of function keys. They also allowed for labeling
the function
keys on the screen directly above the keys themselves.
Since the HP
method is not an industry standard, the decision to
hard-code support
for these terminals rather than use the termcap function key
fields
has created even greater portability problems.
Dave's Own Curses
A common library package called curses generally performs
screen
updating in UNIX programs. Dave Taylor, Elm's creator,
implemented
his own, simpler, version of the curses package. He
handled
only the low-level terminal control routines, such as
cursor move,
up-line, down-line, and clear screen and left all the
actual screen
intelligence to his display routines. Its limited interaction
with
the curses package makes Elm very portable to other
systems. At the
same time, however, the code's low-level nature makes
it very difficult
to modify the screen code or add features. Instead of
hiding the screen
intelligence in the curses routines, Dave distributed
it throughout
many modules.
Dave's curses package did make use of UNIX's underlying
terminal capability
database. He used the calls from the older termcap system
instead
of the newer System V.2 terminfo system. The termcap/terminfo
database tells applications programs how to perform
a common set of
functions on many different types of terminals. It allows
UNIX tasks
to be portable between terminal types.
In general, if you are writing a "universal UNIX
application,"
you can best achieve portability by using the system
configuration
libraries, such as termcap/terminfo. Use of these facilities
makes your program immediately portable to all systems
and equipment
to which anyone has ported those facilities. In the
case of termcap/terminfo,
your screen-oriented program can immediately function
on whatever
types of CRT terminal are in use.
Porting to BSD-Type Systems
Elm's first major port was from the HP-UX version of
AT&T UNIX System
V.2 to the other major variant of UNIX, the Berkeley
Software Distribution
(BSD). This was (and still is) a logical first major
port, especially
since one of the major UNIX minicomputers in the mid-1980s
was the
DEC VAX. The University of California at Berkeley had
ported an earlier
version of UNIX to the VAX and added support for demand-paged
virtual
memory and extended networking. This version became
known as BSD UNIX.
The DEC VAX is very similar to the MC680x0 family. Both
share the
32-bit features and large linear addressing space listed
earlier,
but the DEC VAX orders its bytes in the reverse order
of the MC680x0.
Since each processor is internally consistent, this
difference becomes
significant only if a memory region is addressed as
two different
data types. In the case, say, of a memory area addressed
both as a
text string and as an integer value, the integer value
0x41424344
(1,094,861,636) would be ABCD on the MC680x0 family
and
DCBA on the VAX family.
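The effect can be observed directly with a short sketch that copies the storage of an integer into a character buffer. The function names are invented for this illustration, and the code assumes an ASCII character set and a C library that provides <stdint.h>.

```c
#include <stdint.h>
#include <string.h>

/* Copy the in-memory bytes of a 32-bit integer, in address order,
 * into a 5-byte buffer so the same storage can be read as text. */
void u32_as_text(uint32_t value, char out[5])
{
    memcpy(out, &value, 4);     /* same bytes, now viewed as characters */
    out[4] = '\0';
}

/* Return 1 on a big-endian host (MC680x0 style),
 * 0 on a little-endian host (VAX style). */
int host_is_big_endian(void)
{
    char text[5];

    u32_as_text(0x41424344UL, text);
    return text[0] == 'A';      /* 0x41 is 'A' in ASCII */
}
```

On a big-endian host the buffer reads ABCD; on a little-endian host it reads DCBA.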
For purposes of portability, it is necessary to make
sure no data
structure refers to the same area of memory with two
different fundamental
types. All strings must be passed as string pointers.
The short cut
of placing a couple of characters into an int and passing
the
int will no longer work: the characters would come out
backwards
on the VAX family. In addition, code must examine union
data
structures to see which fundamental type is being used.
Further, if
the union is used to overlay two fundamental data types,
the
code must take into account the byte ordering of the
system on which
it is running.
Failure to implement these subtle coding changes will
not cause compiler
errors or link problems; instead, the result will be
strange behavior
at execution time. The program could crash with an invalid
pointer,
for example, or it could get a cursor movement string
out of sequence
and scramble the display. These types of problems are
very difficult
to track down.
BSD 4.2/4.3 vs. AT&T System V.2
For the application programmer, the major differences
between the
BSD UNIX family and the AT&T UNIX family reside
in the #include
files and C runtime libraries. Each team developed its
own runtime
library, with the result that similar routines have
different names.
Also, identical data structures ended up in different
#include
files. The differences show up most notably in the string
and memory
manipulation functions (see Table 1). In particular
memory block arguments
to the memcpy/memcmp routines are backwards from the
same arguments to the bcopy/bcmp routines.
Not only are the string routines defined differently,
but the header
files that declare them have subtly different names.
The AT&T UNIX
name is <string.h>, while under BSD UNIX, it's
<strings.h>.
As a further complication, some routines exist in only
one of the
systems. Note that memset is generic, and the 0 used
to initialize the block of memory is passed as an argument.
bzero,
on the other hand, can only set a block of memory to
zero. Several
other of the string functions included with System V.2
do not exist
on early, or "pure," BSD systems. These include
most of the
library routines that start with the prefix str, as
documented
on the string(3) manual page. These routines, at least,
will
show up as missing header files at compile time or undefined
externals
at link time, making these types of problems much easier
to track
down.
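One way to contain the argument-order difference is to hide it behind a pair of small wrappers, so the rest of the program uses a single calling convention. The sketch below is illustrative only: HAS_BCOPY, copy_block, and zero_block are hypothetical names, not Elm's, and the prototypes use ANSI C syntax for clarity.

```c
#include <string.h>
#ifdef HAS_BCOPY                /* hypothetical symbol: set only for pure BSD builds */
# include <strings.h>
#endif

/* Copy n bytes from src to dst; the wrapper's argument order follows memcpy. */
void copy_block(void *dst, const void *src, size_t n)
{
#ifdef HAS_BCOPY
    bcopy(src, dst, n);         /* BSD order: source first, then destination */
#else
    memcpy(dst, src, n);        /* System V order: destination first */
#endif
}

/* Set n bytes at dst to zero. */
void zero_block(void *dst, size_t n)
{
#ifdef HAS_BCOPY
    bzero(dst, n);              /* bzero can only write zeros */
#else
    memset(dst, 0, n);          /* memset takes the fill value as an argument */
#endif
}
```

Callers then write copy_block(dst, src, n) everywhere, and only the wrapper needs to know which library is present.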
Only rarely are functions with the same name in both
versions used
for different purposes. However, many similar commands
take different
arguments in the two versions, affecting shell scripts
and spawned
commands.
Long vs. Short Filenames
One of the more annoying differences between the older
AT&T UNIX versions
and the BSD versions is the AT&T 14-character filename
limit. This
difference normally creates problems when porting from
BSD to AT&T
(if filenames are longer than 14 characters), but can
also cause difficulties
when porting in the opposite direction. Usually, in
this case, the
problem deals with buffer lengths. Most programs written
for systems
without the flex-file names (the name for the longer
file names used
in BSD systems) leave relatively short buffers for path
names. With
the longer filenames these buffers often overflow, causing
name truncation
or, worse, other data items on the stack to be overwritten.
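A defensive way to build path names is to check the required length before writing into the buffer. The following is a minimal sketch, not Elm's code; join_path is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/* Join dir and name into out, which has room for out_size bytes.
 * Returns 0 on success, -1 if the result would not fit. */
int join_path(char *out, size_t out_size, const char *dir, const char *name)
{
    /* room for dir, the '/' separator, name, and the terminating '\0' */
    size_t needed = strlen(dir) + 1 + strlen(name) + 1;

    if (needed > out_size)
        return -1;              /* refuse rather than truncate or overflow */
    sprintf(out, "%s/%s", dir, name);
    return 0;
}
```

Because the length is checked first, a long BSD filename produces an error return instead of silently overwriting the stack.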
Since the filenames are of different lengths, it follows
that the
directory structures must also differ. For this reason,
the directory
access functions differ in the data types of their arguments.
This
difference can also result in programs that compile
correctly but
do not produce the expected results. Symptoms include
directory listings
within the program that appear to be missing files or
that show garbage
filenames or the inability of the program to find files
in the directory.
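Programs that go through the directory library routines, rather than reading the directory file themselves, avoid depending on the on-disk entry layout. The sketch below uses the POSIX <dirent.h> interface; older BSD systems declared a similar interface in a different header, which is the kind of difference an ifdef would have to cover. The function name is hypothetical.

```c
#include <dirent.h>
#include <string.h>

/* Return 1 if directory dir contains an entry called name,
 * 0 if it does not, and -1 if the directory cannot be opened. */
int dir_has_entry(const char *dir, const char *name)
{
    DIR *dp = opendir(dir);
    struct dirent *entry;
    int found = 0;

    if (dp == NULL)
        return -1;
    while ((entry = readdir(dp)) != NULL) {
        /* d_name is a NUL-terminated string whatever the filename limit is */
        if (strcmp(entry->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(dp);
    return found;
}
```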
Mailbox Locking
Another component of UNIX that was not yet standard
when the AT&T
and BSD split occurred was file locking, and both versions
developed
their own method of handling interlocks to prevent two
processes from
writing to the same file. The original mail systems
created a semaphore
file in the mail spool directory to indicate their locking
of the
spool file. This scheme worked well for local systems,
but required
that the mail user agent and the mail transport agent
have permission
to create files in the spool directory. The steps in
locking of this
type are:
1. Attempt to create a file of the name LCK..name in the spool
directory.
2. If the create succeeds, you have locked the file.
3. If the create fails, then someone else has locked the file already.
If the iteration limit has not been exceeded, sleep for a short
duration, then return to step one to try again.
4. If the iteration limit has been exceeded, report the error to the
user and, optionally, just ignore the lockfile.
Later revisions of this method placed the process id
(PID) of the owning process in the lock file. When the
create
failed, the file could be opened for read and a system
call would
determine if the lock was stale (the process that owned
it no longer
existed). If the lock was stale, it would be removed
and the locking
process would be repeated.
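The lock-file protocol, including the PID revision, might be sketched as follows. This is an illustration of the technique rather than Elm's actual code: the function names are hypothetical, the exclusive create relies on the O_EXCL flag of open, and the stale check uses kill with signal 0, which only tests whether the process exists. Removing a stale lock is itself subject to a race between two waiting processes, which the sketch does not try to close.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if the lock file names a process that no longer exists. */
static int lock_is_stale(const char *lockpath)
{
    FILE *fp = fopen(lockpath, "r");
    long pid = 0;
    int got;

    if (fp == NULL)
        return 0;                   /* cannot read it: treat it as live */
    got = fscanf(fp, "%ld", &pid);
    fclose(fp);
    if (got != 1 || pid <= 0)
        return 0;
    /* signal 0 sends nothing; ESRCH means there is no such process */
    return kill((pid_t)pid, 0) == -1 && errno == ESRCH;
}

/* Try to take the lock. Returns 0 on success, -1 on error or
 * once max_tries attempts have failed. */
int acquire_lock(const char *lockpath, int max_tries, unsigned sleep_secs)
{
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        /* O_EXCL makes the create fail if the file already exists */
        int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0444);

        if (fd >= 0) {
            char buf[32];
            ssize_t written;

            sprintf(buf, "%ld\n", (long)getpid());
            written = write(fd, buf, strlen(buf));
            close(fd);
            if (written < 0) {
                unlink(lockpath);   /* could not record the owner */
                return -1;
            }
            return 0;
        }
        if (errno != EEXIST)
            return -1;              /* e.g. no permission in the spool directory */
        if (lock_is_stale(lockpath))
            unlink(lockpath);       /* owner is gone: remove it and retry */
        else
            sleep(sleep_secs);
    }
    return -1;
}

/* Give up the lock by removing the lock file. */
void release_lock(const char *lockpath)
{
    unlink(lockpath);
}
```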
AT&T System V.2 used this revised method for mailbox
locking. BSD
systems started with this locking protocol, but due
to atomic file
creation problems with NFS (Network File System), switched
to locking
the file using only the kernel file-locking system call.
Newer UNIX
System V.4 systems use a system call for locking the
file that is
different from that used by the older BSD systems.
Using the wrong locking technique for the system results
in a window
of time where two tasks can write to the mailbox. This
can cause garbled
messages, lost messages, or truncated mailboxes. If
your program opens
a file for writing, you must consider how file locking
is performed
on all systems to which your application will be ported.
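A program that must support more than one kernel locking call can isolate the choice in one function. The sketch below assumes flock for the BSD style and fcntl record locking for the System V style; USE_FLOCK and the function names are hypothetical, chosen only for this example.

```c
#include <fcntl.h>
#include <unistd.h>
#ifdef USE_FLOCK                /* hypothetical symbol for BSD-style systems */
# include <sys/file.h>
#endif

/* Take an exclusive lock on an open file without blocking.
 * Returns 0 on success, -1 if the lock is held elsewhere or on error. */
int lock_mailbox_fd(int fd)
{
#ifdef USE_FLOCK
    return flock(fd, LOCK_EX | LOCK_NB);
#else
    struct flock fl;

    fl.l_type = F_WRLCK;        /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* zero length means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
#endif
}

/* Release a lock taken by lock_mailbox_fd. */
int unlock_mailbox_fd(int fd)
{
#ifdef USE_FLOCK
    return flock(fd, LOCK_UN);
#else
    struct flock fl;

    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
#endif
}
```

The two mechanisms do not necessarily see each other's locks, so every program that touches the mailbox on a given system has to agree on one of them.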
Changes in the Port
No method exists for writing a single set of code that
can handle
both the System V.2 and the BSD versions of UNIX. However,
the #ifdef
command of the C preprocessor makes it possible to integrate
both
versions into the same source files. Elm used this method
to provide
a single version of source code for both systems. The
initial #ifdef
symbol was BSD and was passed to the C compiler via
the Makefile.
ifdefs then handled the code required for the different
serial
communications systems calls (setting up the serial
line communications
modes), different string routines, and different header
files. In
addition, this port revealed some of the weaknesses
regarding buffer
sizes mentioned earlier. During the port, all the buffer
sizes were
adjusted to fit the needs of the larger of the two systems.
Elm did not run into any problems with byte ordering
at this stage
of the port. However, byte order did become a problem
once it became
possible to share the Elm alias database between NFS-linked
systems.
A surprise arose in the different implementations
of the
<ctype.h> macros for character manipulation. The
standard System
V.2 macros toupper and tolower, which convert a character's
case, would change only lower- or upper-case characters,
respectively.
If the character passed to the macro was not the appropriate
case,
no change was made. For example, in the statement
c = tolower('a');
under System V.2, c would contain the lower case
letter `a'. Under BSD, the macro is implemented as
#define tolower(c) ((c) - 'A' + 'a')
This macro turns the lower case a (0x61)
into an SOH code with the eighth bit set (0x81). The
two macros
had to be redefined as follows (shown here for tolower) to make the code
compatible for both
System V.2 and BSD:
#define tolower(c) (isupper(c) ? ((c) - 'A' + 'a') : (c))
The isupper macro now protects the code, preventing
translation of all but upper-case letters. However,
this redefinition
is still not fully portable. It assumes that lower-
and upper-case
letters are always the same distance apart in the character
set as
the upper and lower case 'a'. This is true for ASCII,
but not
for all character sets.
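One way to avoid the fixed-distance assumption is to guard the call and leave the actual mapping to the C library. In the sketch below, NAIVE_TOLOWER reproduces the unguarded arithmetic under a different name so its effect can be seen, and safe_tolower is a hypothetical wrapper. On ANSI C libraries the guard is redundant, since tolower already leaves other characters unchanged, but it is harmless there.

```c
#include <ctype.h>

/* The unguarded arithmetic, renamed so it does not clash with <ctype.h>.
 * In ASCII, NAIVE_TOLOWER('a') is 0x61 - 0x41 + 0x61 = 0x81. */
#define NAIVE_TOLOWER(c) ((c) - 'A' + 'a')

/* Guarded conversion: only upper-case letters are translated, and the
 * library decides what the lower-case equivalent is. */
int safe_tolower(int c)
{
    unsigned char uc = (unsigned char)c;

    return isupper(uc) ? tolower(uc) : c;
}
```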
Heterogeneous System File Sharing
The next big portability hurdle for Elm came when systems
were linked
together via NFS into one common disk cluster. NFS allowed
many different
types of systems -- even non-UNIX systems -- to share
disk partitions,
and many sites mounted the users' home directories via
NFS. Elm, which
uses a file for global aliases, then needed to
access the private
alias data across the NFS file system as well. Since
the system where
the file resided and the system running Elm were not
necessarily of
the same type, byte order immediately became an issue.
Big vs. Little Endian
The battle over the order by which to number a word's
bits and bytes
has often been compared to the wars waged by the Lilliputians
of Gulliver's
Travels over such issues as which end of the egg should
be eaten
first, the little or the big end. Networking forced
UNIX to rise above
this war and declare a truce, or at least a translator.
Since all networks need multibyte addresses to identify
all of the
hosts and circuits, these addresses must share a common
byte order.
Communication becomes impossible if a single machine
is known as node
0x1234 on one system and node 0x3412 on others. The
solution is to
pass bytes over the network in network byte order. For
TCP/IP
networks, specifications issued by the Network Information
Center
document this order. Several macros (see Table 2) assist
the C programmer
in placing the bytes in that order (each routine converts
one item
into the proper byte ordering). Elm was adapted to store
its alias
tables using these routines, with the result that the
table appears
the same whether the machine accessing it was a "little-endian"
or a "big-endian." Users whose home directory
is cross-mounted
via NFS can access their private alias table regardless
of which type
of system they are on. In addition, the global or master
alias table
can also be shared across systems.
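Assuming the routines in question are the standard htonl/ntohl family, storing a value in a byte-order-independent form might look like the sketch below. The helper names put_u32 and get_u32 are hypothetical and do not reflect Elm's actual alias file layout.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value into buf in network byte order
 * (most significant byte first). */
void put_u32(unsigned char *buf, uint32_t host_value)
{
    uint32_t net = htonl(host_value);   /* host order to network order */

    memcpy(buf, &net, sizeof net);
}

/* Read back a value written by put_u32, on any host. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;

    memcpy(&net, buf, sizeof net);
    return ntohl(net);                  /* network order to host order */
}
```

The bytes written for 0x12345678 are 0x12, 0x34, 0x56, 0x78 on every host, so a file written on one type of machine reads back correctly on the other.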
NFS Locking
NFS added a degree of portability to Elm, but it also
brought problems.
File locking, already discussed in the section on mailbox
locking,
was late to be standardized under UNIX. The multiple
locking methods
require portable C programs to adapt their locking methods
to each
system's standard. NFS makes that situation a bit worse.
Since NFS
is stateless, cross-system locking cannot be defined
using the standard
method (lockf or flock) for NFS-mounted file systems.
To work around the problem where remote programs access
files via
NFS, some systems use a special daemon, rpc.lockd, to
perform
the locks locally on the system where the files actually
reside. This
requires the portable C program to have yet another
method of locking
files. At present (versions 2.3 and 2.4), Elm does not use the
lock daemon.
Coping with System Differences
As the prior sections demonstrate, many of the modifications
required
for portability between UNIX versions, or for that matter,
between
UNIX and other operating systems, require changes to
the code for
each system type. Yet, to maintain several versions
of the same file,
one for each different standard, would be impractical
and would lead
to problems such as inconsistent code, wasted space,
and a complicated
makefile procedure.
Fortunately, C provides a construct to handle these
differences with
a single source file.
The C preprocessor has three commands -- #if, #ifdef,
and #ifndef -- that do much of the work in creating
portable
programs.
#if tells the preprocessor to emit the lines following
the
command until it reaches an #else or #endif only if
the expression on the command line is true. Each symbol
in the expression
is evaluated based on its value at that point in the
file. These are
symbols, not variables, so each must be set to a value
using a #define
statement or the -Dsymbol=value argument to the command
line.
#ifdef tells the preprocessor to emit the lines following
the
command until it reaches an #else or #endif if the symbol
on the command line has been defined. It does not matter
what value
the symbol has. The symbol can be defined by a #define
statement,
by the -Dsymbol argument to the compiler command line
with
or without a value, or could have been predefined within
the preprocessor
itself. System manufacturers generally predefine a symbol
within their
C preprocessor to identify the system. This symbol is
intended to
delimit code that must differ for their system.
#ifndef tells the preprocessor to emit the lines following
the command until it reaches an #else or #endif if the
symbol has not been defined. The symbol can either never
have been
defined or have been cleared by an #undef command.
In all three cases, the C preprocessor will emit the
lines following
the command if the condition is met, causing the compiler
to compile
the lines on later passes. If the condition is not met,
the C preprocessor
just outputs a blank line for each line being skipped.
When the #else
command is reached, if there is one, the action is reversed.
In any
case, the if condition ends at the #endif command, which
is required.
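The three commands can be seen side by side in a small sketch. The symbol names MAIL_VERSION, HAS_NFS, and TRACE_LOCKS are invented for the example; in practice they would usually arrive from the compiler command line rather than from #define lines in the source.

```c
#define MAIL_VERSION 2      /* same effect as -DMAIL_VERSION=2 */
#define HAS_NFS             /* no value needed: #ifdef only asks whether it exists */

/* #if evaluates an expression built from symbol values. */
const char *version_note(void)
{
#if MAIL_VERSION >= 2
    return "new";
#else
    return "old";
#endif
}

/* #ifdef tests only whether the symbol has been defined. */
int nfs_supported(void)
{
#ifdef HAS_NFS
    return 1;
#else
    return 0;
#endif
}

/* #ifndef is the reverse test; TRACE_LOCKS is never defined here. */
int tracing_disabled(void)
{
#ifndef TRACE_LOCKS
    return 1;
#else
    return 0;
#endif
}
```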
The conditions can be nested in such a way that a check
for one symbol
is conditional on the preceding check for another. However,
portability
requires that you nest statements in a way that all
C compilers will
understand. For ease of readability, it is often useful
to indent
nested ifdefs as
#ifdef CONDITION1
  #ifdef CONDITION2
  #endif CONDITION2
#else CONDITION1
  #ifndef CONDITION3
  #endif !CONDITION3
#endif CONDITION1
Two aspects of this construct can create problems for
some compilers. First, many C preprocessors require
that the #
character be in the first column of the line. And, second,
many do
not allow symbols on the #else and #endif lines. To
ensure portability, type the lines as follows
#ifdef CONDITION1
# ifdef CONDITION2
# endif /* CONDITION2 */
#else /* CONDITION1 */
# ifndef CONDITION3
# endif /* !CONDITION3 */
#endif /* CONDITION1 */
Since ifdefs are often nested to many levels and
the #else or #endif might not be close to the command
which it affects, placing the condition name as a comment
on the #else
and #endif lines helps to clarify the structure.
Elm has always based its system portability changes
on ifdefs,
and as the number grew, the comments were added to make
the range
of each ifdef more apparent. However, this proliferation
of
ifdefs leads to the next problem: what is the proper
condition
to use?
How to Use #ifdef
When Elm was first ported, all of the changes required
for the BSD
version were grouped under the symbol BSD. This led
to code
fragments like
#ifdef BSD
# define strchr index
# define strrchr rindex
# include <sys/pwd.h>
# undef tolower
# undef toupper
#else
# include <pwd.h>
#endif
Such constructs allow compiling the BSD version with
just the symbol -DBSD added to the CFLAGS= line of the
makefile. Problems arose, however, as Elm was ported
to systems that
were hybrids of the pure System V.2/V.3 and BSD 4.2/4.3
versions.
No longer were all of these changes required all of
the time.
A better approach is to define a symbol for each portability
change
itself, rather than for the system as a whole, and to
name each
symbol as closely as possible after the condition it tests.
If the
previous code fragment had been written as
#ifdef HAS_INDEX
# define strchr index
# define strrchr rindex
#endif
#ifdef PWDINSYS
# include <sys/pwd.h>
#else
# include <pwd.h>
#endif
#ifdef TOLOWER_MACRO
# undef tolower
# undef toupper
#endif
then, as the different operating system versions required
different combinations of changes, the CFLAGS= line
could be
changed as needed. If the CFLAGS= line in the makefile
becomes
too complicated, then in one global header file, included
first in
all modules, a code sequence similar to
#ifdef ATT_SVR2
# undef HAS_INDEX
# undef PWDINSYS
# undef TOLOWER_MACRO
#endif
#ifdef SUNOS_41
# define HAS_INDEX
# define PWDINSYS
# define TOLOWER_MACRO
#endif
#ifdef HPUX_8
# define HAS_INDEX
# undef PWDINSYS
# undef TOLOWER_MACRO
#endif
could handle each of the combinations with only a single
flag on the CFLAGS= line of the makefile.
Using this type of code sequence in the include file,
porting to a
new operating system would only require listing the
features the system
supports. Of course, any new quirks of that operating
system would
generate new names and changes to the code in the rest
of the program.
But still, the makefile would require only the name
of the version
on its CFLAGS= line.
A side effect of this change is that there are now many,
if not hundreds,
of symbols created to ensure the widest portability,
and it becomes
very difficult to determine the proper values for a
new operating
system version/port for each of these symbols. But with
proper coding
style, help is on the way later in this article in the
section on
Metaconfig.
The Merge of System V and BSD
The merger of the System V and BSD standards into the
new System V
Release 4 standard has greatly complicated the
choice of ifdef.
Besides changing the location of many #include files,
this
standard splits into separate conditions many of the
old combinations
of things that used to go together as a single ifdef.
In particular,
SVR4 supports many items in both styles, and which style
is preferable
varies from item to item.
Elm used to group most of the BSD compatibility changes
together.
Now that SVR4 has most of those items within the System
V defines,
these ifdefs had to restrict their range once again,
making
it all the more important to choose the ifdef symbol
to cover
as little as possible -- preferably just the single
change required
for the port. Then, when the underlying operating system
changes,
at worst the symbols will simply need to be defined/undefined
to adapt.
Metaconfig and Configure
Larry Wall has written many programs for C programmers
and has shared
them with the USENET community. All of the programs
run on many different
types of UNIX operating systems. To simplify porting,
Larry wrote
a shell script called Configure for his rn program
(a USENET
network news reader) that tried to determine automatically
the values
needed for the various ifdef symbols. Where the script
could
not determine the answer automatically, it would ask
for "local
preference" items. To automatically configure the
software, you
just typed Configure at the shell prompt.
The Configure script would identify the location of
needed commands
and libraries, check the contents of those libraries
to determine
which functions were available, and ask the user for
local preference
items. From these, an #include file was built and included
into each source file. The header file contained the
results of the
program and function checks as #define SYMBOL or #undef
SYMBOL lines. It also included the preference items
as #define
PREFERENCE_SYMBOL value lines.
Coding the program to take advantage of Configure's
symbols allowed
immediate configuration at the source level. However,
writing the
Configure script by hand for each new program was tedious.
Since most
of it was boilerplate, and whole sections could be used
by many different
programs, the script was a perfect candidate for automatic
generation
by a tool. Since Larry was working on a very
large program
with many portability changes, he used the program as
both the reason
to develop the tool and as a method of developing it.
The program
was Perl, and the tool he developed is Metaconfig.
Metaconfig is a large Perl script that scans a list
of files, called
a manifest, looking for all symbols used on #if type
lines in the .c, .h, and .y files, and all shell
variables used in the .SH files. These symbols form
the wanted
list. Using these symbols, Metaconfig then searches
a library of shell
script fragments, called units, for those units that
define
the symbols on the wanted list. Each of the units also
lists the other
units it requires, if any. All of these units are then
combined in
an order to satisfy the dependencies, and placed with
a common start
and end code to form the shell script Configure.
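Ordering units so that each appears after the units it requires is a topological sort. The sketch below shows the idea in C with a depth-first walk. It is not Metaconfig's own code, the structure and function names are invented for the illustration, and it assumes there are no circular dependencies.

```c
#include <string.h>

#define MAX_UNITS 16

struct unit {
    const char *name;
    int ndeps;
    int deps[4];            /* indexes of the units this one requires */
};

/* Emit unit i after everything it depends on. */
static void visit(const struct unit *units, int i,
                  int *done, int *order, int *count)
{
    int d;

    if (done[i])
        return;
    done[i] = 1;
    for (d = 0; d < units[i].ndeps; d++)
        visit(units, units[i].deps[d], done, order, count);
    order[(*count)++] = i;  /* all of its requirements are already listed */
}

/* Fill order[] with unit indexes so that dependencies come first.
 * Assumes n <= MAX_UNITS and an acyclic dependency graph.
 * Returns the number of units placed. */
int order_units(const struct unit *units, int n, int *order)
{
    int done[MAX_UNITS];
    int i, count = 0;

    memset(done, 0, sizeof done);
    for (i = 0; i < n; i++)
        visit(units, i, done, order, &count);
    return count;
}
```

Given a unit that requires libc, which in turn requires some start-up unit, the start-up unit is emitted first, then libc, then the unit itself.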
Since the units are common and reusable, a library of
units was quickly
developed that Metaconfig can use for other programs.
Each unit is
placed in a file named by combining the primary symbol
name with a
.U suffix. These units form the master library used
by Metaconfig.
Each program also has a local library of units which
are similar to
the master units, but incorporate changes to the master
library equivalent
unit. The local override units are given the same name
as the master
library unit they replace. When Metaconfig is run, it
generates a
message specifying which local units will override the
equivalent
units from the master library.
In addition to the override units, the local library
includes units
that are specific to a program and not considered useful
to other
programs. These custom unit files are also named by
combining the
primary symbol name and a .U suffix.
Metaconfig units and the symbols they define fall into
three categories:
1. Symbols that are automatically determined by the Configure script
and cannot be overridden by the user.
2. Symbols that are automatically determined by the Configure script,
but can also be overridden by the user. The automatically determined
value becomes the default value the first time the script is run. The
answer given the last time the script was run is the default value for
each subsequent time the Configure script is executed.
3. Symbols that are local preference items. No automatic value is
possible. Sometimes the unit's code specifies a suggested value for a
default value the first time Configure is run. Configure uses the
answer from the prior run as the default for each subsequent run.
An example of the first case would be to check for certain
functions in the C library. Configure automatically
determines what
C functions exist in the libraries chosen to link the
application.
This list is available via a shell function and is used
to define
symbols based on the availability of individual functions.
Listing 1 shows d_strcspn.U, a unit from Elm's local
Metaconfig
library, which checks the existence of certain C functions.
The lines
preceded by a ? are control lines for Metaconfig.
- RCS-type lines are comment lines for use by the Revision Control
  System and contain version tracking information.
- MAKE-type lines contain a list of shell symbols defined in this unit,
  followed by a colon (:), and then the list of symbols/functions this
  unit requires to be already defined. This second list is the
  dependency list. The d_strcspn.U unit defines two shell symbols,
  d_strspn and d_strcspn, and requires that the shell symbols and libc
  already be defined. The first symbol before the colon is the primary
  symbol. The unit's filename must match this symbol with a .U suffix.
- The second MAKE line defines the types of operations the dependency
  makefile requires for this unit (the definition of these types is too
  long to be included here, but is explained in the Metaconfig
  documentation).
- S-type lines are extracted to form documentation on the shell symbols
  available in the different unit files. The Metaconfig source includes
  a program that automatically extracts these lines from all of the
  units to produce a document on the available symbols.
- C-type lines function similarly to S-type lines, but for symbols
  defined for use in C code rather than in shell scripts. Once again,
  the Metaconfig source includes a program that automatically extracts
  these lines and forms a document on all of the available C
  preprocessor symbols.
- H-type lines are used by Metaconfig to automatically generate the
  configuration include file.
The remainder of the lines comprise the shell script
fragment. In
the simple example in Listing 1, the shell script uses
a fragment
of shell code that is contained in the shell variable
inlibc.
The libc unit defines this variable, thus the libc dependency
on the first MAKE-type line. The inlibc function searches
the name list from the C libraries to see whether the
symbol in the
shell variable $1 exists. If it does, the symbol in
$2
is set to define. If not, the symbol in $2 is set to
undef. The set command on the line preceding the inlibc
call initializes $1 and $2. Using the value just
set into the symbol d_strspn, the ?H-type lines will
automatically produce a #define or #undef for the symbol
STRSPN. The C code can then use the line #ifdef STRSPN
when it needs to call the strspn C library function,
and provide
alternate code following a #else line.
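In the C source, the test on STRSPN might be used as in the sketch below. The wrapper name span_of is hypothetical, the commented-out #define stands in for the generated configuration header, and the fallback assumes strchr is available, which on a pure BSD system would itself need to be mapped to index.

```c
#include <string.h>

/* #define STRSPN */        /* would normally come from the generated header */

/* Length of the leading part of s made up only of characters in accept. */
size_t span_of(const char *s, const char *accept)
{
#ifdef STRSPN
    return strspn(s, accept);   /* the library provides it */
#else
    size_t n = 0;               /* alternate code for libraries without strspn */

    while (s[n] != '\0' && strchr(accept, s[n]) != NULL)
        n++;
    return n;
#endif
}
```

Either branch gives the same answer, so the rest of the program can call span_of without caring which library it was built against.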
d_internet.U (Listing 2) provides an example of the
second
type of Metaconfig unit, one that allows the user to
input a value
to override the default. The header lines are the same,
but the shell
script fragment is a bit more complicated. The first
section uses
the case construct to set the default value for the
d_internet
symbol based on the value in the shell variable d_internet
from the prior run. If the d_internet variable is empty,
or
not one of the strings define or undef, the default
value is set based on some conditions the shell script
can check on
its own. In this case, those symbols are set by other
units or by
shell code directly in this unit. The middle section
echoes a message
that explains the meaning of the symbol the user is
about to define.
The script then asks the question, presenting the default
answer to
the user. Lastly, the result the user types is checked
to see how
to define the shell symbol d_internet.
The last type of Metaconfig unit is used to define a
user choice or
local preference. The unit for these looks almost identical
to the
unit shown in Listing 2. The only difference is in how
the default
value is set when there is no prior answer to use. While
d_internet.U
used a value determined by the Configure script as the
default, this
local preference unit uses a hard-coded default directly
in the shell
fragment. Of course, it is still preferable to remember
the answer
from the last Configure run and use that as the default
whenever possible.
Just as C files can directly include the .h file written
by
the Configure script, shell scripts and other non-C
files can use
the shell variables in the file config.sh created by
the Configure
script to adapt to the results of the Configure run.
The Configure
script executes all files ending in .SH in the manifest
to
produce the appropriate adapted file. Listing 3 shows
an extract from
the makefile prototype, Makefile.SH, in Elm's master
directory.
The .SH files are broken into three sections. The first
section,
which runs up to the echo statement, locates the config.sh
file, which contains all the answers obtained by the
Configure script.
The script then reads this file into the current shell.
The second
section uses the shell variables to modify the lines
with the results
of the config.sh just read. The last section just adds
the
remainder of the file that does not need the variables
substituted.
In the listing, the line [...] indicates that lines
were deleted
from this example. The actual makefile is much larger.
By coding your program to take advantage of the existing
library of
units, you can achieve instant portability between most
UNIX operating
systems with Metaconfig. In addition, by allowing for
local preferences,
Metaconfig provides an easy means of customizing the
distribution.
International Portability
The upcoming 2.4 version of Elm tackles a totally new
problem --
international portability. The ASCII character set,
which most UNIX
systems use, takes advantage of the English language's
26-character
alphabet to be a seven-bit code, with the eighth bit
within the eight-bit
byte used for parity. On most UNIX systems, internally,
the eighth
bit is always zero, cleared by the istrip terminal control
parameter.
Eight-Bit Clean
For languages with alphabets of more than 26 characters,
the eighth
bit is used to extend the character set to support additional
characters.
Any program destined for international consumption,
then, must be
eight-bit clean, which means that you do not alter or
clear
the eighth bit of any character value, and you do not
depend on all
character values to be positive when viewed as signed
characters.
The international standard treats all characters as
unsigned quantities.
Using the eighth bit to extend the character set also
changes the
definition of an alphabetic character. It is no longer
valid to consider
the ranges `A'-`Z' and `a'-`z' as the only
alphabetical characters. All checks for the type of
character should
use the macros defined in <ctype.h>. It is the
system's responsibility
to have the proper values in this file and its associated
modules
in the C library to support the local character set.
Because Elm has
always been eight-bit clean and has always used the
macros instead
of direct comparisons, version 2.4 required no changes
in these areas.
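A minimal sketch of such a check follows. The function name
count_alpha is an assumption made for this example and is not
taken from the Elm source. It asks <ctype.h> whether a character
is alphabetic instead of comparing against `A'-`Z', and it
converts each char to unsigned char first so that a character
with the eighth bit set is never passed to the macro as a
negative value.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Count the alphabetic characters in s according to the
 * character classes of the current locale.
 */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* The cast keeps eight-bit characters non-negative. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Nothing in the loop masks or clears the eighth bit, so the
string passes through unchanged.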
It's worth noting that some character sets are too large
even for
eight bits (the Japanese Kanji character set, for example,
uses 16-bit
characters). For purposes of international portability,
your program
should not assume an eight-bit character type.
NLS and Message Catalogs
Changes to messages, prompts, and commands from English
to the local
language represent the most significant challenge in
internationalization.
Since most programmers do not speak all the languages
needed to please
all of the potential users of their programs, how do
you solve this
problem?
The solution uses the concept of Native Language System
(NLS) support.
The X/Open standards committee, a group of computer
companies, produced
an NLS usable for UNIX that provides several components:
- LOCALE functions for setting the desired character
set and language characteristics, including bit length,
collation
sequence, and character attributes.
- System error messages in each of the locally supported
language sets.
- Message catalog support.
The LOCALE subsystem tells the C runtime library
which character set is in use. The user typically defines
the desired
character set as an environment variable. The locale
functions read
the variable and set up the appropriate structures and
collation lists.
ctype.h macros use these character attributes to determine
the class of each character. The collating sequence
allows the extended
characters to be sorted in appropriate order, rather
than be grouped
at the end due to the unused portion of the character-set
code space.
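In standard C these steps might look like the sketch below. The
function names use_environment_locale and compare_names are
assumptions made for this example. setlocale with an empty
string asks the library to take the locale from the user's
environment, and strcoll compares two strings using the
collating sequence of whatever locale is in effect.

```c
#include <locale.h>
#include <string.h>

/*
 * Select the locale named by the user's environment.
 * Returns nonzero on success, zero if the request failed
 * (the previous locale then stays in effect).
 */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/*
 * Compare two names using the current locale's collating
 * sequence rather than raw character codes.
 */
int compare_names(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

A program would call use_environment_locale once at startup and
then use compare_names wherever it sorts text for display.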
The user also sets the language for system error messages
in an environment variable. The locale functions initialize
the
syserror structure with the messages in the appropriate
language.
The most important change is support for message catalogs.
Because
most C programs, including Elm until the 2.4 release,
code their messages
directly into the source, a single compiled version
cannot output
different messages based on the language desired. Rather
than requiring
that messages for every supported language be coded
directly into
the program, the NLS solution lets the user define new
message catalogs that include the text of all of the
messages, translated by the user into the chosen language.
For example, for the command that scans a message for
calendar entries, Elm would display a message on
the screen using the C code fragment
PutLine0(LINES-3, strlen(Prompt),
"Scan message for calendar entries...");
This fragment, in English only, places the message at
the bottom of the screen. A message catalog function,
however, obtains
the message from a file based on its message number.
The file can
be translated into any language so that the program
can automatically
speak that language. Recoding the example using the
message catalog
functions yields
PutLine0(LINES-3, strlen(Prompt),
catgets(elm_msg_cat, ElmSet, ElmScanForCalendar,
"Scan message for calendar entries..."));
The function catgets reads the message catalog
and loads into memory all the messages from the set
ElmSet,
if they are not already in memory. It then returns the
text string
of the message ElmScanForCalendar. If no message catalog
is open on the descriptor elm_msg_cat, or there is no set
ElmSet or no message ElmScanForCalendar, the string
contained in the
call is returned as the default answer.
The function that opens the message catalog, catopen(),
uses
the language environment variable to select the correct
file from
the application program's set of message catalogs, each
of which contains
the application's messages in a single language. The
program that
compiles the messages into the file also produces a
C header file
that defines the set and message number symbols.
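The open-then-look-up sequence can be sketched as follows,
assuming a system that provides <nl_types.h>. The names
demo_cat, open_demo_catalog, and demo_message are hypothetical
and stand in for whatever the application uses; this is not
Elm's own code. The point of the sketch is the fallback
behavior: when the catalog cannot be opened, catgets simply
hands back the default string compiled into the program.

```c
#include <nl_types.h>

/* Catalog descriptor; (nl_catd)-1 means "no catalog open". */
static nl_catd demo_cat = (nl_catd)-1;

/*
 * Try to open the named catalog. With a second argument of
 * 0, catopen selects the file using the LANG environment
 * variable.
 */
void open_demo_catalog(const char *name)
{
    demo_cat = catopen(name, 0);
}

/*
 * Fetch one message by set and message number. If the
 * catalog, set, or message is missing, the fallback text
 * is returned instead.
 */
const char *demo_message(int set_id, int msg_id,
                         const char *fallback)
{
    return catgets(demo_cat, set_id, msg_id, fallback);
}
```

Because the English text stays in the source as the fallback,
the program still works on a system where no translated catalog
has been installed.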
Because word order rules and conventions vary among
languages, a straightforward
string replacement mechanism would produce garbled messages.
Where
an English message reads "6 messages received,"
for example,
the message in another language might read "received
6 messages."
In C, the printf function converts the numbers into
text strings
and builds simpler strings into complete messages. If
the string messages, or its translation, is in the
variable msgs, and the
string received is in the variable rcvd, then the message
could
be output with the printf statement
printf("%d %s %s\n", num_msgs, msgs, rcvd);
Since the arguments are passed in order on the stack,
the printf function just uses them in order to fulfill
its
format string. To turn that message into "received
6 messages,"
printf must access the arguments on the stack in a different
order. NLS provides for this ability with an extension
to the printf
function. If a conversion specification in the format
string contains an integer followed by a $
character, that integer is interpreted as the ordinal
of the argument
on the stack to use for that conversion. The same
string would
then be printed as
printf("%1$d %2$s %3$s\n", num_msgs, msgs, rcvd);
It then becomes easy to turn the message around to say
"received 6 messages" using
printf("%3$s %1$d %2$s\n", num_msgs, msgs, rcvd);
Once again, the different format strings for these last
two printf statements would be obtained from the message
catalog
using the catgets() function. The final printf statement
would read
printf(catgets(elm_msg_cat, ElmSet,
ElmMessagesReceived, "%d %s %s\n"),
num_msgs, msgs, rcvd);
In addition, the values for the variables msgs
and rcvd can also be obtained from the message catalog.
The English version does not need the $ notation as
the arguments
are used in their natural order. The translations in
the message catalog
would use the $ notation as needed.
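The effect of the $ notation can be tried with a small sketch
like the one below. It assumes a C library whose printf family
accepts numbered arguments (an X/Open extension, not part of
ISO C) and uses snprintf so that the result can be inspected.
The function name format_received is hypothetical; in a real
program the format string would come from catgets rather than
from the caller.

```c
#include <stdio.h>

/*
 * Build the "messages received" text in buf using the given
 * format. The argument order in the call never changes; only
 * the format string decides the order in the output. Within
 * one format, either every conversion is numbered or none is.
 */
int format_received(char *buf, size_t size, const char *fmt,
                    int num_msgs, const char *msgs,
                    const char *rcvd)
{
    return snprintf(buf, size, fmt, num_msgs, msgs, rcvd);
}
```

Passing "%1$d %2$s %3$s" gives the English word order, while
"%3$s %1$d %2$s" moves the last argument to the front without
any change to the calling code.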
The problem remains of writing for an operating system
whose vendor
doesn't support NLS. Several freely distributable programs
provide
NLS support, including new versions of the printf family
of
functions. Elm, with release 2.4, will include one such
program so
that users whose systems don't support NLS will still
be able to compile
new message catalogs for the language of their choice.
Future Portability Issues
Up to this time, Elm has supported only electronic mail
interchange
using UNIX-based messaging systems. These systems use
the RFC-822
standard to format messages. A newer, international
standard, entitled
X.400, has been approved by the CCITT (the international
standards
body). This standard allows for a hierarchical address
to any place
in the world, on any computer system. And, unlike RFC-822,
it has
a companion standard, X.500, similar to the telephone
directory white
pages. The X.500 standard allows distributed directory
services, which
means that knowing only a name, one could look up the
electronic mail
address. Elm must eventually evolve beyond its purely
UNIX mail roots
and handle X.400 messaging systems directly, instead
of behind an
RFC-822-to-X.400 gateway.
The change in the UNIX market from character-based
terminals to
bit-mapped terminals running Graphical User Interfaces
(GUIs) also
has implications for Elm's development. Both
major GUI
standards, OpenLook and OSF/Motif, use the X Window
System. Future
versions of Elm will have to support these as well as
the traditional
character-based interfaces. A complete redesign of Elm's
user interface
-- to replace menus with buttons and add support for
sliders and
multiple windows -- will be required.
These and other changes will wait for a rewrite after
2.4 is released.
Like all programs that have evolved through a long development,
Elm
at some point will need to be rewritten totally to clean
up convoluted
code and remove some of the past assumptions. Such a
rewrite provides
the best opportunity to consider the portability issues
that created
problems in the past and to design in ways of handling
them.
About the Author
Sydney S. Weinstein, CDP, CCP is a consultant, columnist,
lecturer, author,
professor, and president of Datacomp Systems, Inc.,
a consulting and contract
programming firm specializing in databases, data presentation
and windowing,
transaction processing, networking, testing and test
suites, and device
management for UNIX and MS-DOS. He can be contacted
care of Datacomp Systems,
Inc., 3837 Byron Road, Huntingdon Valley, PA 19006-2320
or via electronic
mail on the Internet/Usenet mailbox syd@DSI.COM (dsinc!syd
for those who
cannot do Internet addressing).