by Frans
Some years ago, I started to think about a parser (generator) that would be small, flexible and easy to use. Because I was never very charmed by generated scanners, and had written hundreds of scanning procedures myself, I decided to focus on the parser. First, I decided that the input grammar should be easy to read. That is why I chose an extended BNF like that used in Slade, which has grouping (with round brackets), alternatives (using the "|" symbol), and keywords like OPT, SEQ, LIST and CHAIN for commonly occurring constructs in grammars. I also decided that terminals for keywords and symbols would be represented by text between double quotes inside the grammar. By adding some identifiers for common terminals such as integers, doubles and strings, I could quickly write some real-life parsers.
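To give an impression of the notation, here is a small invented grammar fragment (assuming that SEQ stands for a sequence of one or more elements, LIST for a comma-separated list, CHAIN for elements chained by the given terminal string, and OPT for an optional part):

    expr : term CHAIN "+".
    block : "{" stmt SEQ OPT "}".
    call : ident "(" expr LIST OPT ")".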
When writing a compiler for a certain language, a good finger exercise is to first write an interpreter. What is true for compilers is also true for compiler-compilers. And because I wanted to experiment with back-tracking recursive descent parsers, I set out to develop an interpreting parser, working directly on the input grammar. Because I also needed a parser for the input grammar itself, I decided to use the same parser for this as well; the bootstrapping problem was easy to overcome. This proved to be a rather successful approach, until I encountered some problems with left recursion (e.g. rules like E : E "+" T.). These were resolved when I decided to use a simple internal representation for the parsing rules, in which left-recursive rules are dealt with as a separate case. This internal representation is built by applying some simple transformations to the input grammar, using a single-pass recursive algorithm. Another benefit of using an internal representation is that it simplified the code of the parsing procedures.
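To illustrate the idea of treating left recursion as a separate case (a minimal sketch with invented helpers, not IParse's actual code), a rule like E : E "+" T | T. can be parsed by first parsing a T and then repeatedly absorbing the recursive tails:

    #include <stdbool.h>
    #include <ctype.h>

    static const char *s;              /* current scanning position */

    /* A stand-in for parse_T: accept a single digit as a term. */
    static bool parse_T(void)
    {   if (!isdigit((unsigned char)*s))
            return false;
        s++;
        return true;
    }

    /* Parse E : E "+" T | T. without recursing on E itself: first
       parse the non-left-recursive alternative, then greedily absorb
       "+" T tails, back-tracking when a tail fails to match. */
    static bool parse_E(void)
    {   if (!parse_T())
            return false;
        for (;;)
        {   const char *save = s;
            if (*s == '+')
            {   s++;
                if (parse_T())
                    continue;          /* matched another E "+" T tail */
            }
            s = save;                  /* back-track; accept what we have */
            return true;
        }
    }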
My initial goal was to be able to parse preprocessed C files based on the grammar taken from the back of the K&R book. (See here for the final grammar.) I chose C because it has some expressions (such as "t *v") that are hard to parse with most parser generator systems without falling back on semantic information. I had to make some extensions to my parsing procedures and some small modifications to the grammar before I was finally able to parse preprocessed C files and build an abstract parse tree.
Because the parsing algorithm used back-tracking, it was not very fast. I attempted to speed up the parser by turning the interpreting parser into a parser generator, simply by applying partial evaluation, which turned the parsing routines into a mixture of normal C statements (for the part interpreting the input grammar) and statement-printing C statements (for the part calling the scanning routines). This parser generator produced a huge C program for the C grammar, which worked correctly but did not result in any significant performance improvement. (It may actually have become slower, because the executable was many times larger.) This made me decide to stick with the interpreting parser engine.
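The idea behind that partial evaluation, sketched with invented names (not the actual generator code): the loop over the grammar elements runs at generation time as normal C statements, while the calls to the scanning routines are printed as C source text:

    #include <stdio.h>

    typedef struct Element
    {   const char *terminal;      /* terminal to accept at this point */
        struct Element *next;
    } Element;

    /* The interpreting parser (sketch) would walk the rule at parse
       time, calling accept_term(e->terminal) for each element.
       Partially evaluated with respect to a given grammar, the same
       loop runs at generation time and prints the calls instead: */
    void gen_rule(const char *name, Element *e)
    {   printf("bool parse_%s(void)\n{\n", name);
        for (; e != NULL; e = e->next)
            printf("    if (!accept_term(\"%s\")) return false;\n",
                   e->terminal);
        printf("    return true;\n}\n");
    }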
From earlier work with back-tracking algorithms, I had learned that there are two techniques for improving the performance of such algorithms: cutting off recursion as quickly as possible, and maintaining a cache of intermediate results. When the first compilers were designed, memory was expensive. This is no longer the case. This led to the radical design decision to store all intermediate parsing results (including the abstract parse trees being built) and associate them with the position in the input file at which the parsing started. Experimentation confirmed that this indeed resulted in a huge performance improvement. This was further improved (guided by profiling) by letting some scanning routines cache the file position of the last successfully scanned symbol, and by remembering, for each non-terminal, the result at the last position where the parsing algorithm invoked it. This led to the current design of IParse.
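A minimal sketch of such a cache (the data layout and names are assumptions, not IParse's actual structures): every attempt to parse a non-terminal is stored under the input position at which it started, so that a repeated attempt can return the earlier outcome immediately instead of re-parsing.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_SIZE 10007

    typedef struct CacheItem
    {   const void *nt;          /* identity of the non-terminal          */
        long start;              /* input position where parsing started  */
        bool success;            /* outcome of the earlier attempt        */
        long end;                /* position reached when it succeeded    */
        struct CacheItem *next;  /* chaining for hash collisions          */
    } CacheItem;

    static CacheItem *cache[HASH_SIZE];

    static size_t cache_hash(const void *nt, long start)
    {   return ((size_t)(uintptr_t)nt ^ (size_t)start * 31u) % HASH_SIZE;
    }

    /* Look up whether this non-terminal was tried at this position. */
    CacheItem *cache_find(const void *nt, long start)
    {   CacheItem *item;
        for (item = cache[cache_hash(nt, start)]; item != NULL; item = item->next)
            if (item->nt == nt && item->start == start)
                return item;
        return NULL;
    }

    /* Store the outcome of an attempt (error handling omitted). */
    void cache_store(const void *nt, long start, bool success, long end)
    {   size_t h = cache_hash(nt, start);
        CacheItem *item = malloc(sizeof(CacheItem));
        item->nt = nt;
        item->start = start;
        item->success = success;
        item->end = end;
        item->next = cache[h];
        cache[h] = item;
    }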
The further development of IParse (read the detailed development log) has focused on adding semantic aspects, such as identifier scoping rules.
The grammar supports both old- and new-style function definitions. Note that it does not contain any "tricks" to make the parsing work correctly. (Except that the statement "t *v;" is always read as a declaration of the variable "v" as a pointer to the type "t", instead of a multiplication of the variables "t" and "v".)
With this it is possible to parse a preprocessed C program like this, which was made by preprocessing the scanning routines of IParse. (I removed the excessive empty lines to make it a little more interesting.) To print the parse tree of this preprocessed C program, execute the command:
IParse c.gr scan.pc -p
There are several reasons why developing a scanner for IParse is easy. The first reason is that the whole input file is stored in a single string buffer, and a char pointer to the current scanning location is available. This means that look-ahead is no problem. The second reason is that the scanning routines are called from the parser only when the particular terminal is encountered in the grammar. This means that for each terminal a scanning procedure needs to be written, which returns true if a terminal of that kind can be accepted at the current position, and false otherwise. A benefit of this approach is that it allows context-sensitive scanning (think of JavaScript embedded in HTML files), which is very hard to deal with in traditional scanner generators such as (F)lex. The third reason is that some common scanning procedures have already been included in IParse, and it is a well-known fact that most software engineers are good at adapting examples.
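For example, a scanning routine for identifiers could look like this (a sketch in the described style; the exact signatures used in IParse differ):

    #include <stdbool.h>
    #include <ctype.h>

    /* The whole input is one string and *s points at the current
       scanning position, so looking ahead is just a matter of
       reading further along the string. */
    bool accept_ident(const char **s)
    {   const char *p = *s;
        if (!isalpha((unsigned char)*p) && *p != '_')
            return false;              /* no identifier starts here */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;                       /* consume the identifier    */
        *s = p;                        /* advance on success        */
        return true;
    }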
Release 1.2 implements a mechanism for registering scanning routines with terminal symbols. When the internal representation of the parsing rules is built, function pointers to the scanning routines are stored inside it. (There is thus no additional lookup overhead incurred during the execution of the parser.)
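A sketch of what such a registration table might look like (the type and names here are invented for illustration; the actual mechanism in release 1.2 may differ):

    #include <stdbool.h>

    typedef bool (*ScanFunc)(const char **s);

    /* Scanning routines in the style of the earlier sketch: */
    bool accept_ident(const char **s);
    bool accept_int(const char **s);
    bool accept_string(const char **s);

    /* Each terminal name, as it appears in a grammar, is paired with
       the scanning routine that accepts that kind of terminal. When
       the internal representation of the rules is built, each
       terminal is resolved to its function pointer once. */
    typedef struct
    {   const char *name;
        ScanFunc scan;
    } TerminalDef;

    TerminalDef terminals[] =
    {   { "ident",  accept_ident  },
        { "int",    accept_int    },
        { "string", accept_string },
    };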
There is only one complicating factor in the design of the scanning routines, and that is the caching of intermediate results. This caching is, however, optional. The relevant code has been marked with "Memorization intro" and "Memorization prologue" in the example scanning procedures.
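As an illustration of what such caching can amount to (a sketch with assumed names; the actual marked code in IParse may differ), a scanning routine can remember the last position at which it succeeded, so that a repeated attempt at the same position returns immediately:

    #include <stdbool.h>
    #include <stddef.h>
    #include <ctype.h>

    bool accept_int(const char **s)
    {   /* Remember the last position at which an integer was
           scanned successfully. */
        static const char *last_start = NULL;
        static const char *last_end = NULL;
        const char *p = *s;
        if (p == last_start)           /* scanned here before, with success */
        {   *s = last_end;
            return true;
        }
        if (!isdigit((unsigned char)*p))
            return false;
        while (isdigit((unsigned char)*p))
            p++;
        /* Store the result before returning. */
        last_start = *s;
        last_end = p;
        *s = p;
        return true;
    }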
To indicate that an identifier is a defining occurrence, use the greater-than symbol (">") followed by an identifier specifying the identifier class. (This is to support languages that have several independent identifier classes, meaning that the same identifier can be used in different contexts without referring to the same thing.) For languages where identifiers are defined by their first use, the symbol ">+" should be used instead. To indicate that an identifier is a using occurrence, use the smaller-than symbol ("<") followed by an identifier specifying the identifier class.
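For instance, a grammar for a language with a single class of variable identifiers could mark defining and using occurrences like this (an invented fragment, not taken from any of the real grammars):

    var_decl : "var" ident > var ";".
    assign : ident < var "=" expr ";".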
Examples of how to use these are given in the example grammar for JavaScript. (No claims about the correctness of this grammar are made.) To check a JavaScript file, the following command can be used:
IParse js.gr my.js
root : nt_def SEQ eof.
nt_def : ident ":" or_rule "." [nt_def].
or_rule : rule CHAIN "|".
rule : opt_elem SEQ OPT ( "[" ident "]" ) OPT [rule].
opt_elem : list_elem "OPT" [opt] | list_elem.
list_elem : prim_elem "SEQ" [seq]
          | prim_elem "LIST" [list]
          | prim_elem "CHAIN" string [chain]
          | prim_elem.
prim_elem : string
          | ident
          | "ident" ">+" ident [identdefadd]
          | "ident" ">" ident [identdef]
          | "ident" "<" ident [identuse]
          | "ident" "!" ident [identfield]
          | "ident" [identalone]
          | "{" [opencontext]
          | "}" [closecontext]
          | "(" or_rule ")".