Parsing Interesting Things
Randal L. Schwartz
              Someone recently popped into one of the newsgroups I frequent 
              and asked how to parse an INI file. You might have seen those before, 
              with sections and keyword=value lines, like:
              
             
[login]
timeout=30
remote=yes
[password]
minlength=6
 
            I think they started in the Microsoft world, since no sane UNIX hacker 
            would have come up with something like that. No, we come up with things 
            like .Xdefaults and sendmail.cf and termcap. 
            But the request seemed simple: parse the file and gather the information 
            into a hash for quick access, two levels deep, of course.
             Now, I usually carry the banner here for "use the CPAN", and in 
              fact, there are numerous CPAN modules that parse INI files (too 
              many, I think). But let's take a different route here. Suppose we 
              were parsing a file that wasn't already CPANned to death. What tools 
              could we use?
              Well, certainly Perl's regular expressions are pretty powerful 
              in the first place, and this task really wouldn't be that difficult 
              with hand-written code, but we can go a bit further and pull out 
              a nifty tool from the CPAN: the "madman of Perl" Damian Conway's 
              Parse::RecDescent. This module permits extremely complex 
              parsers to be built by specifying a nice hierarchical description 
              of the data (as a grammar), and a series of actions to be taken 
              as each portion of the data is returned. I find it very simple to 
              use, and whipped up a parser in no time.
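              For comparison, here is a minimal sketch of the hand-written route, 
              assuming only the simple format shown above (no comments, no whitespace 
              around the equals). The helper name parse_ini_by_hand is made up 
              for this illustration:

```perl
use strict;
use warnings;

# Hand-rolled alternative: walk the text line by line with plain regexes.
# parse_ini_by_hand is a hypothetical helper name, not from any module.
sub parse_ini_by_hand {
  my ($text) = @_;
  my (%config, $section);
  for my $line (split /\n/, $text) {
    next unless $line =~ /\S/;               # ignore blank lines
    if ($line =~ /^\s*\[(.*)\]\s*$/) {       # [section] marker
      $section = $1;
      $config{$section} ||= {};
    } elsif (defined $section and $line =~ /^\s*(\w+)=(.*)$/) {
      $config{$section}{$1} = $2;            # keyword=value (last one wins)
    } else {
      return;                                # not INI-like: give up
    }
  }
  return \%config;
}

my $ini = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";
my $config = parse_ini_by_hand($ini);
print "timeout is $config->{login}{timeout}\n";
```

              That works, but the shape of the file is buried in the control flow. 
              A grammar states that shape directly.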
              The key to a useful grammar is getting the description of the data 
              right, and then deciding what to do as each piece is recognized. First, 
              let's look at a file. A file is a series of sections, so in the grammar 
              language, that's given as:
              
             
file: sections
 
            Actually, a file is a bit more than that. If we just used that, the 
            grammar would match any prefix of the input that also had sections. 
            So, we need to anchor that:
             
             
file: sections /\z/
 
            Which says, match sections, and when you're done matching sections, 
            match the end of the string. If you're not at the end of the string 
            when you are done matching sections, this isn't a file that we want.
             And now, sections is zero or more occurrences of section, which 
              we write as:
              
             
sections: section(s?)
 
            with the (s?) suffix meaning "zero or more". Very readable 
            so far. A section is a section marker (the square-bracket line) and 
            some definitions:
             
             
section: section_marker definitions
definitions: definition(s?)
 
            And we've defined the definitions as well. So far, we've managed to 
            capture the essence of an INI-like file, but we've not actually matched 
            anything (except the end of string). That's because we've been constructing 
            "non-terminals". Grammar rules can also contain "terminals" (like 
            the end-of-string token above) to define specific things to match. 
            Let's start with a section marker:
             
             
section_marker: /\[.*\]/
 
            There. A section marker is a square-bracketed thingy. And what's a 
            definition?
             
             
definition: key /=/ value
 
            Yeah, it's a key and a value, separated by an equals. But what are 
            those? Why, more terminals!
             
             
key: /\w+/
value: /.*/
 
            And already with just a few lines of code, we've defined most of the 
            grammar. But now we need to introduce a bit more knowledge about Parse::RecDescent. 
            Between each of the items of the rules, the generated parser will 
            be permitted to skip over the current skip string, which is "whitespace" 
            by default. This is fine for section markers: we don't mind any preceding 
            whitespace being tossed. But it's a pain if whitespace gets in-between 
            the key and the rest of the line. Fortunately, we can alter the skip 
            string for the remainder of a rule:
             
             
definition: key <skip: ''> /=/ value
 
            which means that the string '' (the empty string) is now the 
            skip string, meaning that the equals must be adjacent to the end of 
            the key, and the value starts immediately after the equals. Good!
             We could stick all the rules above into a string $GRAMMAR, 
              and then create a parser $PARSER using these rules as:
              
             
use Parse::RecDescent;
my $PARSER = Parse::RecDescent->new($GRAMMAR)
  or die;
 
            This $PARSER can then be used repeatedly to see whether a file 
            fits the specifications. To do that, we call the top-level rule (file) 
            as a method, passing it $INPUT, the contents of the file in 
            question:
             
             
if (defined(my $result = $PARSER->file($INPUT))) {
  print "It's a valid INI file!\n";
} else {
  print "No good.\n";
}
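
            As a sketch of how the pieces fit together, the rules collected so far 
            can be put in one grammar string and tried against a few inputs. The 
            helper check_ini and the extra rule named unanchored are assumptions 
            added only for this demonstration; unanchored is the file rule without 
            its end-of-string check, to show why the anchor matters:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Single-quoted heredoc so Perl leaves the backslashes alone.
my $GRAMMAR = <<'END_GRAMMAR';
file: sections /\z/
unanchored: sections
sections: section(s?)
section: section_marker definitions
definitions: definition(s?)
section_marker: /\[.*\]/
definition: key <skip: ''> /=/ value
key: /\w+/
value: /.*/
END_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR)
  or die "Bad grammar!\n";

# check_ini is a hypothetical helper: true if the text parses as a file.
sub check_ini {
  my ($INPUT) = @_;
  return defined($PARSER->file($INPUT)) ? 1 : 0;
}

my $good     = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";
my $trailing = "[login]\ntimeout=30\nthis line is not a definition\n";
my $spaced   = "[login]\ntimeout = 30\n";

for my $case ([good => $good], [trailing => $trailing], [spaced => $spaced]) {
  my ($label, $text) = @$case;
  printf "%-10s %s\n", $label, check_ini($text) ? "valid" : "no good";
}

# Without the /\z/ anchor, a valid prefix is enough to succeed.
print "unanchored accepts the trailing junk\n"
  if defined($PARSER->unanchored($trailing));
```

            Only the first input should be reported as valid: the second has 
            leftover text after the last section, and the third puts whitespace 
            between the key and the equals.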
            Now, if all we were doing was verifying well-formedness, that's enough. 
            But we also wanted to use the data as it was parsed. To do that, we 
            need to know that every rule is like a subroutine call, and passes 
            back the last value evaluated. By default, that's the string matched 
            by the terminal, or whatever value the last subrule returns. (For the 
            repetitions above, an arrayref is returned of all the matches, if any.) 
            However, we can include some Perl code enclosed in a block as the last 
            item of a rule, and then that will be the return value.
             For example, we really don't want the brackets included in the 
              section marker. Adding capturing parentheses is the first step:
              
             
section_marker: /\[(.*)\]/
 
            On its own, though, that changes nothing: the value of a regex terminal 
            is the entire matched text, brackets and all. To select the brackets 
            away, we return $1 explicitly:
             
             
section_marker: /\[(.*)\]/ { $1 }
            which says to perform the regex match, and if it succeeds, evaluate 
            the block. As long as the block doesn't return undef, it's 
            also considered a "match", and as the last thing in a rule, it's also 
            the overall value of the rule.
             But what about the definitions? We want to note both the key and 
              the value, so we'll use some sort of Perl block at the end of the 
              rule. And we can return an arrayref of the two items just fine, 
              but we need to access the "value" of the key and value subrules 
              through the magical %item hash. The keys to this hash are 
              the names of the subrules. (Sorry for the overloading of the key/value 
              terms here.)
              
             
definition: key <skip: ''> /=/ value
  { [$item{key}, $item{value}] }
            And now a definition is an arrayref, consisting of the found key, 
            and its found value. (If there's more than one item called "key", 
            then you must resort to positional syntax, but it's almost always 
            easier and clearer to just invent a new non-terminal name for that 
            particular slot.)
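
            Since every rule becomes a method on the parser, the action blocks can 
            be tried in isolation. The following is a small sketch, assuming a 
            cut-down grammar holding only the rules needed here, with the section 
            marker returning $1 explicitly:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Only the rules needed to exercise the two action blocks.
my $PARSER = Parse::RecDescent->new(<<'END_GRAMMAR') or die "Bad grammar!\n";
section_marker: /\[(.*)\]/ { $1 }
definition: key <skip: ''> /=/ value
  { [$item{key}, $item{value}] }
key: /\w+/
value: /.*/
END_GRAMMAR

# Any rule can serve as the starting point of a parse.
my $name = $PARSER->section_marker("[login]");
my $pair = $PARSER->definition("timeout=30");

print "section: $name\n";
print "key: $pair->[0], value: $pair->[1]\n";
```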
             Similarly, a section needs the name of the section and all of 
              the definitions of that section.
              
             
section: section_marker definitions
  { [$item{section_marker}, $item{definitions}] }
            Note that definitions will already be an arrayref of individual 
            definitions, which are themselves references to two-element arrays. 
            All this stacking is taken care of automatically by the parser built 
            by Parse::RecDescent!
             Finally, the fun part. A file wants to be all the sections. And 
              we could just punt and return that:
              
             
file: sections /\z/ { $item{sections} }
            which will then be an arrayref pointing to a list of sections, each 
            section being an arrayref pointing to a list of definitions in that 
            section, each definition being an arrayref pointing to a key/value 
            tuple. But let's convert this into a hash for quick access:
             
             
file:
  sections /\z/
  { my %return;
    my $sections = $item{sections};
    for my $section (@$sections) {
      my ($section_marker, $definitions) = @$section;
      for my $definition (@$definitions) {
        my ($key, $value) = @$definition;
        for ($return{$section_marker}{$key}) {
          if (not defined $_) {
            $_ = $value;
          } elsif (not ref $_) {
            $_ = [$_, $value];
          } else {
            push @$_, $value;
          }
        }
      }
    }
    \%return;
  }
            Wow. What was that? Well, first we define a hash to be returned (as 
            a hashref), and then walk the multiple levels of the arrayrefs of 
            arrayrefs of tuples. The interesting part starts in the middle, which 
            is merely aliasing $return{$section_marker}{$key} to $_ 
            for the rest of the inner loop. If that value isn't defined, then 
            this is the first time we've seen a keyword under a given section, 
            so we stuff the value. If it's already defined, then we've seen the 
            same keyword twice. In this case, I decided to turn the value into 
            an arrayref, so that the values are individually extractable. And 
            finally, if it's already an arrayref, then we just push the latest 
            hit onto the end.
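
            The same loop can be exercised without a parser at all. In this sketch 
            the body of the action is lifted into an ordinary subroutine 
            (sections_to_hash is a made-up name) and fed a hand-built structure 
            standing in for what the sections rule would return; the search 
            section and its repeated path key are invented sample data:

```perl
use strict;
use warnings;

# sections_to_hash is a hypothetical name for the body of the action block,
# lifted into an ordinary subroutine so it can be run without a parser.
sub sections_to_hash {
  my ($sections) = @_;
  my %return;
  for my $section (@$sections) {
    my ($section_marker, $definitions) = @$section;
    for my $definition (@$definitions) {
      my ($key, $value) = @$definition;
      for ($return{$section_marker}{$key}) {   # alias the slot to $_
        if (not defined $_) {
          $_ = $value;                         # first sighting: plain scalar
        } elsif (not ref $_) {
          $_ = [$_, $value];                   # second: promote to arrayref
        } else {
          push @$_, $value;                    # third and later: append
        }
      }
    }
  }
  return \%return;
}

# Hand-built stand-in for the arrayref of sections.
my $sections = [
  [ login  => [ [ timeout => 30 ], [ remote => 'yes' ] ] ],
  [ search => [ [ path => '/bin' ], [ path => '/usr/bin' ], [ path => '/opt/bin' ] ] ],
];

my $config = sections_to_hash($sections);
print "timeout: $config->{login}{timeout}\n";
print "paths:   @{ $config->{search}{path} }\n";
```

            One consequence of this design is that the caller has to check with 
            ref whether a given value is a plain scalar or an arrayref.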
             The return value of calling the file method is now either 
              this hashref, or undef. So to get the "timeout" parameter 
              from the example INI file above, we'd say:
              
             
my $timeout = $result->{login}{timeout};
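
            Putting everything together, a complete script might look like the 
            following sketch. It assumes the sample INI text from the start of 
            the column is held in $INPUT, and it uses the explicit { $1 } form 
            for the section marker:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $GRAMMAR = <<'END_GRAMMAR';
file: sections /\z/
  { my %return;
    for my $section (@{ $item{sections} }) {
      my ($section_marker, $definitions) = @$section;
      for my $definition (@$definitions) {
        my ($key, $value) = @$definition;
        for ($return{$section_marker}{$key}) {
          if    (not defined $_) { $_ = $value }
          elsif (not ref $_)     { $_ = [$_, $value] }
          else                   { push @$_, $value }
        }
      }
    }
    \%return;
  }
sections: section(s?)
section: section_marker definitions
  { [$item{section_marker}, $item{definitions}] }
definitions: definition(s?)
section_marker: /\[(.*)\]/ { $1 }
definition: key <skip: ''> /=/ value
  { [$item{key}, $item{value}] }
key: /\w+/
value: /.*/
END_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR)
  or die "Bad grammar!\n";

my $INPUT  = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";
my $result = $PARSER->file($INPUT);
defined $result or die "No good.\n";

my $timeout = $result->{login}{timeout};
print "timeout is $timeout\n";
```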
            Because the names are case sensitive, we might want to add actions to 
            the section_marker and key rules that force the names to lowercase, or 
            perhaps we could do that while we were building the hash.
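
            One way to do the lowercasing inside the grammar is sketched below, 
            assuming it is acceptable to fold case at the moment each name is 
            matched. Only the two affected rules are shown, and they are called 
            directly as methods to keep the example short:

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Each action returns the lowercased name instead of the raw match.
# $item[1] is the value of the first item in the rule (the regex match).
my $PARSER = Parse::RecDescent->new(<<'END_GRAMMAR') or die "Bad grammar!\n";
section_marker: /\[(.*)\]/ { lc $1 }
key: /\w+/ { lc $item[1] }
END_GRAMMAR

my $name = $PARSER->section_marker("[Login]");
my $key  = $PARSER->key("TimeOut");
print "$name / $key\n";
```

            With these two rules swapped into the full grammar, [Login] and 
            [login] would land in the same slot of the resulting hash.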
             There you have it: an INI-like file parser made with Parse::RecDescent. 
              I hope this brief intro to this powerful module will get you interested 
              enough to read the rest of the documentation and study its amazing 
              array of features. And you'll never fear parsing an odd-looking 
              file again. Until next time, enjoy!
              Randal L. Schwartz is a two-decade veteran of the software 
              industry -- skilled in software design, system administration, security, 
              technical writing, and training. He has coauthored the "must-have" 
              standards: Programming Perl, Learning Perl, Learning 
              Perl for Win32 Systems, and Effective Perl Programming, 
              as well as writing regular columns for WebTechniques and 
              Unix Review magazines. He's also a frequent contributor to 
              the Perl newsgroups, and has moderated comp.lang.perl.announce since 
              its inception. Since 1985, Randal has owned and operated Stonehenge 
              Consulting Services, Inc.