Study Bible Project

This project involved the conversion of a Dutch study bible into an online version. The study bible was initially started by In de Ruimte around 1986, with a 17-volume edition of the New Testament. From 1998, the project was continued by Centrum voor Bijbelonderzoek. The printed version was produced with QuarkXPress. Around 2001, I got involved in a project to convert the available QuarkXPress files into a structured format that could serve as a basis for developing an application for the study bible. This was realized through the online version.

Reverse engineering QuarkXPress

In order to read the QuarkXPress files, it was necessary to reverse engineer the file format. I only reverse engineered the format for the available files, and only as far as was needed. The resulting program read the available QuarkXPress files and produced a file containing all the data (making use of a memory-mapped file). A QuarkXPress document basically consists of a collection of (possibly nested) boxes (frames) which contain text (or images). The program also tries to put all these boxes in the default reading order. I made a version available for converting a single QuarkXPress file.
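
The idea of ordering nested boxes can be illustrated with a small Python sketch. This is not the original program: the `Box` fields, the `column_width` parameter, and the rule "page first, then column from left to right, then top to bottom" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """A hypothetical text box; the real QuarkXPress records hold far more."""
    name: str
    page: int
    left: float
    top: float
    text: str = ""
    children: List["Box"] = field(default_factory=list)


def reading_order(boxes, column_width=100.0):
    """Return the text-bearing boxes in an assumed reading order.

    Siblings are sorted by page, then by column (left edge divided by an
    assumed column width), then from top to bottom. Nested boxes are
    visited right after their parent.
    """
    def key(box):
        return (box.page, int(box.left // column_width), box.top)

    ordered = []
    for box in sorted(boxes, key=key):
        if box.text:
            ordered.append(box)
        ordered.extend(reading_order(box.children, column_width))
    return ordered
```

A real layout needs more care than this (overlapping boxes, linked text chains, boxes spanning columns), which is probably why such a step can only "try" to find the right order.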

Analyzing page layout

The second program I developed tried to parse the layout of the text and put it into a logical format. The study bible consisted of three kinds of volumes: volumes dealing with the text of the New Testament, volumes containing word studies, and some volumes containing introductions and background material. The volumes dealing with the text of the New Testament used a four-column layout over two pages, where the columns contained the Greek text, translations in various formats, and explanatory notes. The Greek text itself gave, for each word, a unique number, the Greek with accents, a transliteration, and a literal translation. The Greek text also had annotations with respect to textual criticism. The volumes dealing with Greek words would give a description of the word, all word forms used in the New Testament, and a list of all occurrences of the word together with a small extract. A list of names and their occurrences was also included. The other volumes were not included in the conversion.

Cross checking

To ensure correctness of the conversion, I spent a lot of time developing algorithms that cross-check the text volumes against the word study volumes. This uncovered many errors. Some were due to my lack of understanding of spelling variations in Greek and such, but others were real errors that required input from the editors of the study bible. Some of the errors were known errors, some were new, and some were caused by corruption in the input files (mostly missing sections). Because it was neither practical nor feasible to edit the original QuarkXPress documents, I designed a structured text format to store the data. Error messages were placed as comments in the structured text. I also set up some batch files and a patch program to exchange edits through email using diff and (un)zip. I also created a version of the MySample text editor with syntax colouring for the structured text format (including detailed checking of the Greek character encoding). Later, I included the syntax colouring in the normal version of MySample. It is activated for files with the extension "sbr" and it can also be activated with View->Colorer->cvb.
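
To give an impression of what such a cross check can look like, here is a minimal Python sketch. It is not the algorithm from the actual program: the data shapes (tuples of reference, word number, and word form for the text volumes; a dictionary of forms and occurrences per word number for the word studies) and the wording of the messages are assumptions for the example.

```python
from collections import defaultdict


def cross_check(text_words, word_studies):
    """Compare words from the text volumes with the word study volumes.

    text_words:   iterable of (reference, number, form) tuples.
    word_studies: dict mapping number -> {"forms": set, "occurrences": set}.
    Returns a list of error messages (all names and formats are hypothetical).
    """
    errors = []
    seen = defaultdict(set)  # word number -> references found in the text
    for reference, number, form in text_words:
        study = word_studies.get(number)
        if study is None:
            errors.append(f"{reference}: word {number} has no word study")
            continue
        seen[number].add(reference)
        if form not in study["forms"]:
            errors.append(f"{reference}: form '{form}' not listed for word {number}")
        if reference not in study["occurrences"]:
            errors.append(f"{reference}: occurrence missing from word study {number}")
    # The reverse direction: occurrences listed in a study but absent from the text.
    for number, study in word_studies.items():
        for reference in sorted(study["occurrences"] - seen[number]):
            errors.append(f"{reference}: word study {number} lists an occurrence not found in the text")
    return errors
```

Checking in both directions matters: a word missing from a word study and an occurrence listed for a verse where the word does not appear are different kinds of mistakes, and either can point to a missing section in the input.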

Output

The above program also produced structured output according to the logical structure of the data contained in the text. This output was parsed and imported into the database for the online study bible. I was not involved in the development of the online study bible, and the final corrections were made through an editor interface of the online study bible. In the meantime, work has also started on the Old Testament, which is being added to the online study bible as well. I have lost contact and don't know how that is managed.

Software

The software I developed for this project can be downloaded. The software is rather useless without the original QuarkXPress files, which are not available. To use the programs, a number of directories need to be created in the directory where the software is placed; the steps below refer to the directories tmp, dumpb, sbr, and out.

The software requires Cygwin to be installed and uses the Bourne shell sh. As preparation, the cls2cpp program needs to be compiled with the command build_cls2cpp.

The QuarkXPress files are read with the command run_scan. This creates the file studybible.dat in the tmp directory, which will be almost 300 Mbytes large. The file scan_out is also created, containing the output produced by executing the scan program.

To calculate the reading order of all the text the command run_set_ro needs to be executed. This will modify the studybible.dat file. The program will report some errors on the output.

With the run_dumpb command, the contents of biblestudy.dat are used to generate one HTML file per volume in the directory dumpb, containing a representation of the data read from the QuarkXPress files for that volume.

With the run_dumptxt command, the initial structured text files are created in the directory sbr. The structured text format uses curly brackets ('{' and '}') to delimit text. (I found this more practical than the traditional quotes.) Furthermore, it uses round and square brackets for grouping, where the opening brackets can be preceded by an identifier for tagging purposes.
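
A format like this is easy to read with a small parser. The Python sketch below is only an illustration based on the description above, not the parser used in the project. It assumes that text in curly brackets contains no nested curly brackets and no escapes, that a tag is written directly against its opening bracket, and that comments do not occur; the example input and its tag names are made up.

```python
import re

# One token: {text}, an optionally tagged opening bracket, or a closing bracket.
TOKEN = re.compile(
    r"\s*(?:\{([^{}]*)\}|([A-Za-z_][A-Za-z0-9_]*)?([(\[])|([)\]]))"
)
CLOSING = {"(": ")", "[": "]"}


def parse(source):
    """Parse structured text into nested nodes.

    Text becomes a str; a group becomes a tuple (tag, bracket, children),
    where tag is None for an untagged group.
    """
    source = source.rstrip()
    root = []
    stack = [(None, None, root)]  # innermost open group is last
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        text, tag, opener, closer = match.groups()
        if text is not None:
            stack[-1][2].append(text)
        elif opener:
            node = (tag, opener, [])
            stack[-1][2].append(node)
            stack.append(node)
        else:
            open_bracket = stack[-1][1]
            if open_bracket is None or CLOSING[open_bracket] != closer:
                raise ValueError(f"unbalanced '{closer}' at offset {pos - 1}")
            stack.pop()
    if len(stack) != 1:
        raise ValueError("missing closing bracket")
    return root
```

For example, `parse("verse({1} words[w({en} {in})])")` yields one `verse` group containing the text `1` and a `words` group with a single `w` group. Because the text is delimited by two different characters, an unbalanced delimiter is detected at a specific offset, which is harder to do reliably with quotes.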

The program for analyzing and checking the data is parse.cpp, which can be built with the command build_parser. It is the largest program, and the one in which I invested most of my effort. The program has the following command line options:

The options are followed by keywords to specify which parts of the data need to be processed. The keyword ALLES is used for everything. The keyword WSOPNIEUW causes the file biblestudy2.dat in the tmp directory, which holds information about the word studies, to be rebuilt. The keywords 2, 3, 4, 5, 6, 7A, 7B, 8, 9, 10, 11, 12, 13, 14, 15, and 16 are used to specify the volumes to be processed. When the file biblestudy2.dat does not exist or when the keyword WSOPNIEUW is used, the word study volumes are processed first. Next, the specified volumes are processed, and output in the logical structured format is produced in the out directory: one file per Bible book, one file per range of a hundred word studies, and some additional files. When the output is directed to a file, the program mergeErrors.cpp can be used to merge the reported errors with the structured text files in the sbr directory. The program striperrors.cpp can be used to strip these messages from a file again.
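
The keyword handling described above can be summarized in a few lines of Python. This is a sketch of the described behaviour only, not code from parse.cpp: the function name `plan_run` is hypothetical, and it assumes that ALLES selects all listed volumes without forcing the word studies to be rebuilt, and that unknown keywords are rejected.

```python
VOLUMES = ["2", "3", "4", "5", "6", "7A", "7B", "8", "9", "10",
           "11", "12", "13", "14", "15", "16"]


def plan_run(keywords, word_study_file_exists):
    """Decide what a run should do, given the keywords on the command line.

    Returns (rebuild_word_studies, volumes_to_process).
    """
    unknown = [k for k in keywords
               if k not in VOLUMES and k not in ("ALLES", "WSOPNIEUW")]
    if unknown:
        raise ValueError(f"unknown keywords: {unknown}")
    # The word studies are processed first when asked for or when the
    # file with word study information (biblestudy2.dat) is missing.
    rebuild = "WSOPNIEUW" in keywords or not word_study_file_exists
    if "ALLES" in keywords:
        volumes = list(VOLUMES)
    else:
        volumes = [v for v in VOLUMES if v in keywords]
    return rebuild, volumes
```

With these assumptions, `plan_run(["WSOPNIEUW", "7A", "2"], True)` returns `(True, ["2", "7A"])`: the word studies are rebuilt and volumes 2 and 7A are then processed in volume order.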

