by Frans
Some years ago, I started to think about a parser (generator) that would be small, flexible and easy to use. Because I was never very charmed by generated scanners, and had written hundreds of scanning procedures myself, I decided to focus on the parser. First, I decided that the input grammar should be easy to read. That is why I chose an extended BNF like that used in Slade, which has grouping (with round brackets), alternatives (using the "|" symbol), and keywords like OPT, SEQ, LIST and CHAIN for commonly occurring constructs in grammars. I also decided that terminals for keywords and symbols would be represented by text between double quotes inside the grammar. By adding some identifiers for common terminals such as integers, doubles and strings, I could quickly write some real-life parsers.
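To give an impression of the notation, here is a small invented grammar fragment (assuming that SEQ stands for a sequence of one or more elements, LIST for a comma-separated list, CHAIN for elements chained by the given terminal string, and OPT for an optional part):

    expr : term CHAIN "+".
    block : "{" stmt SEQ OPT "}".
    call : ident "(" expr LIST OPT ")".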
When writing a compiler for a certain language, a good finger exercise is to first write an interpreter. What is true for compilers is also true for compiler-compilers. And because I wanted to experiment with back-tracking recursive descent parsers, I set out to develop an interpreting parser, working directly on the input grammar. Because I also needed a parser for the input grammar itself, I decided to use the same parser for this as well; the bootstrapping problem was easy to overcome. This proved to be a rather successful approach, until I encountered some problems with left recursion (e.g. rules like E : E "+" T.). These were resolved when I decided to use a simple internal representation for the parsing rules, in which left-recursive rules are dealt with as a separate case. This internal representation is built by applying some simple transformations to the input grammar, using a single-pass recursive algorithm. Another benefit of using an internal representation is that it simplified the code of the parsing procedures.
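To illustrate the idea of treating left recursion as a separate case (a minimal sketch with invented helpers, not IParse's actual code), a rule like E : E "+" T | T. can be parsed by first parsing a T and then repeatedly absorbing the recursive tails:

    #include <stdbool.h>
    #include <ctype.h>

    static const char *s;              /* current scanning position */

    /* A stand-in for parse_T: accept a single digit as a term. */
    static bool parse_T(void)
    {   if (!isdigit((unsigned char)*s))
            return false;
        s++;
        return true;
    }

    /* Parse E : E "+" T | T. without recursing on E itself: first
       parse the non-left-recursive alternative, then greedily absorb
       "+" T tails, back-tracking when a tail fails to match. */
    static bool parse_E(void)
    {   if (!parse_T())
            return false;
        for (;;)
        {   const char *save = s;
            if (*s == '+')
            {   s++;
                if (parse_T())
                    continue;          /* matched another E "+" T tail */
            }
            s = save;                  /* back-track; accept what we have */
            return true;
        }
    }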
My initial goal was to be able to parse preprocessed C files based on the grammar taken from the back of the K&R book. (See here for the final grammar.) I chose C because it has some expressions (such as "t *v") that are hard to parse with most parser generator systems without falling back on semantic information. I had to make some extensions to my parsing procedures and some small modifications to the grammar before I was finally able to parse preprocessed C files and build an abstract parse tree.
Because the parsing algorithm used back-tracking, it was not very fast. I attempted to speed up the parser by turning the interpreting parser into a parser generator, simply by applying partial evaluation, which turned the parsing routines into a mixture of normal C statements (for the part interpreting the input grammar) and statement-printing C statements (for the part calling the scanning routines). This parser generator produced a huge C program for the C grammar, which worked correctly but did not result in any significant performance improvement. (It may actually have become slower, because the executable was many times larger.) This made me decide to stick with the interpreting parser engine.
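The idea behind that partial evaluation, sketched with invented names (not the actual generator code): the loop over the grammar elements runs at generation time as normal C statements, while the calls to the scanning routines are printed as C source text:

    #include <stdio.h>

    typedef struct Element
    {   const char *terminal;      /* terminal to accept at this point */
        struct Element *next;
    } Element;

    /* The interpreting parser (sketch) would walk the rule at parse
       time, calling accept_term(e->terminal) for each element.
       Partially evaluated with respect to a given grammar, the same
       loop runs at generation time and prints the calls instead: */
    void gen_rule(const char *name, Element *e)
    {   printf("bool parse_%s(void)\n{\n", name);
        for (; e != NULL; e = e->next)
            printf("    if (!accept_term(\"%s\")) return false;\n",
                   e->terminal);
        printf("    return true;\n}\n");
    }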
From earlier work with back-tracking algorithms, I had learned that there are two techniques for improving the performance of such algorithms: cutting off recursion as quickly as possible, and maintaining a cache of intermediate results. When the first compilers were designed, memory was expensive. This is no longer the case. This led to the radical design decision to store all intermediate parsing results (including the abstract parse trees being built) and associate them with the position in the input file at which the parsing started. Experimentation confirmed that this indeed resulted in a huge performance improvement. This was further improved (guided by profiling) by letting some scanning routines cache the file position of the last successfully scanned symbol, and by remembering, for each non-terminal, the result at the last position where the parsing algorithm invoked it. This led to the current design of IParse.
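A minimal sketch of such a cache (the data layout and names are assumptions, not IParse's actual structures): every attempt to parse a non-terminal is stored under the input position at which it started, so that a repeated attempt can return the earlier outcome immediately instead of re-parsing.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_SIZE 10007

    typedef struct CacheItem
    {   const void *nt;          /* identity of the non-terminal          */
        long start;              /* input position where parsing started  */
        bool success;            /* outcome of the earlier attempt        */
        long end;                /* position reached when it succeeded    */
        struct CacheItem *next;  /* chaining for hash collisions          */
    } CacheItem;

    static CacheItem *cache[HASH_SIZE];

    static size_t cache_hash(const void *nt, long start)
    {   return ((size_t)(uintptr_t)nt ^ (size_t)start * 31u) % HASH_SIZE;
    }

    /* Look up whether this non-terminal was tried at this position. */
    CacheItem *cache_find(const void *nt, long start)
    {   CacheItem *item;
        for (item = cache[cache_hash(nt, start)]; item != NULL; item = item->next)
            if (item->nt == nt && item->start == start)
                return item;
        return NULL;
    }

    /* Store the outcome of an attempt (error handling omitted). */
    void cache_store(const void *nt, long start, bool success, long end)
    {   size_t h = cache_hash(nt, start);
        CacheItem *item = malloc(sizeof(CacheItem));
        item->nt = nt;
        item->start = start;
        item->success = success;
        item->end = end;
        item->next = cache[h];
        cache[h] = item;
    }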
The further development of IParse (read the detailed development log) has focused on adding semantic aspects, such as identifier scoping rules.
The grammar supports both old- and new-style function definitions. Note that it does not contain any "tricks" to make the parsing work correctly. (Except that the statement "t *v;" is always read as a declaration of the variable "v" as a pointer to the type "t", instead of a multiplication of the variables "t" and "v".)
With this it is possible to parse a preprocessed C program like this, which was made by preprocessing the scanning routines of IParse. (I removed the excessive empty lines to make it a little more interesting.) To print the parse tree of this preprocessed C program, execute the command:
IParse c.gr scan.pc -p
There are several reasons why developing a scanner for IParse is easy. The first reason is that the whole input file is stored in a single string buffer, and a char pointer to the current scanning location is available. This means that look-ahead is no problem. The second reason is that the scanning routines are called from the parser only when the particular terminal is encountered in the grammar. This means that for each terminal a scanning procedure needs to be written, which returns true if a terminal of that kind can be accepted at the current position, and false otherwise. A benefit of this approach is that it allows context-sensitive scanning (think of JavaScript embedded in HTML files), which is very hard to deal with in traditional scanner generators such as (F)lex. The third reason is that some common scanning procedures have already been included in IParse, and it is a well-known fact that most software engineers are good at adapting examples.
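For example, a scanning routine for identifiers could look like this (a sketch in the described style; the exact signatures used in IParse differ):

    #include <stdbool.h>
    #include <ctype.h>

    /* The whole input is one string and *s points at the current
       scanning position, so looking ahead is just a matter of
       reading further along the string. */
    bool accept_ident(const char **s)
    {   const char *p = *s;
        if (!isalpha((unsigned char)*p) && *p != '_')
            return false;              /* no identifier starts here */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;                       /* consume the identifier    */
        *s = p;                        /* advance on success        */
        return true;
    }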
Release 1.2 implements a mechanism for registering scanning routines with terminal symbols. When the internal representation of the parsing rules is built, function pointers to the scanning routines are stored inside it. (There is thus no additional lookup overhead incurred during the execution of the parser.)
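A sketch of what such a registration table might look like (the type and names here are invented for illustration; the actual mechanism in release 1.2 may differ):

    #include <stdbool.h>

    typedef bool (*ScanFunc)(const char **s);

    /* Scanning routines in the style of the earlier sketch: */
    bool accept_ident(const char **s);
    bool accept_int(const char **s);
    bool accept_string(const char **s);

    /* Each terminal name, as it appears in a grammar, is paired with
       the scanning routine that accepts that kind of terminal. When
       the internal representation of the rules is built, each
       terminal is resolved to its function pointer once. */
    typedef struct
    {   const char *name;
        ScanFunc scan;
    } TerminalDef;

    TerminalDef terminals[] =
    {   { "ident",  accept_ident  },
        { "int",    accept_int    },
        { "string", accept_string },
    };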
There is only one complicating factor in the design of the scanning routines, and that is the caching of intermediate results. This caching is, however, optional. The relevant code has been marked with "Memorization intro" and "Memorization prologue" in the example scanning procedures.
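As an illustration of what such caching can amount to (a sketch with assumed names; the actual marked code in IParse may differ), a scanning routine can remember the last position at which it succeeded, so that a repeated attempt at the same position returns immediately:

    #include <stdbool.h>
    #include <stddef.h>
    #include <ctype.h>

    bool accept_int(const char **s)
    {   /* Remember the last position at which an integer was
           scanned successfully. */
        static const char *last_start = NULL;
        static const char *last_end = NULL;
        const char *p = *s;
        if (p == last_start)           /* scanned here before, with success */
        {   *s = last_end;
            return true;
        }
        if (!isdigit((unsigned char)*p))
            return false;
        while (isdigit((unsigned char)*p))
            p++;
        /* Store the result before returning. */
        last_start = *s;
        last_end = p;
        *s = p;
        return true;
    }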
To indicate that an identifier is a defining occurrence, use the greater-than symbol (">") followed by an identifier specifying the identifier class. (This is to support languages that have several independent identifier classes, meaning that the same identifier can be used in different contexts without referring to the same thing.) For languages where identifiers are defined by their first use, the symbol ">+" should be used instead. To indicate that an identifier is a using occurrence, use the smaller-than symbol ("<") followed by an identifier specifying the identifier class.
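For instance, a grammar for a language with a single class of variable identifiers could mark defining and using occurrences like this (an invented fragment, not taken from any of the real grammars):

    var_decl : "var" ident > var ";".
    assign : ident < var "=" expr ";".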
Examples of how to use these are given in the example grammar for JavaScript. (No claims about the correctness of this grammar are made.) To check a JavaScript file, the following command can be used:
IParse js.gr my.js
root : nt_def SEQ eof.
nt_def : ident ":" or_rule "." [nt_def].
or_rule : rule CHAIN "|".
rule : opt_elem SEQ OPT ( "[" ident "]" ) OPT [rule].
opt_elem : list_elem "OPT" [opt] | list_elem.
list_elem : prim_elem "SEQ" [seq]
          | prim_elem "LIST" [list]
          | prim_elem "CHAIN" string [chain]
          | prim_elem.
prim_elem : string
          | ident
          | "ident" ">+" ident [identdefadd]
          | "ident" ">" ident [identdef]
          | "ident" "<" ident [identuse]
          | "ident" "!" ident [identfield]
          | "ident" [identalone]
          | "{" [opencontext]
          | "}" [closecontext]
          | "(" or_rule ")".