Study Bible Project
This project involved the conversion of a Dutch study bible into an online version. The study bible was
initially started by In de Ruimte around 1986, with a
17-volume edition of the New Testament. From 1998, the project was
continued by Centrum
voor Bijbelonderzoek. The printed version was produced with
QuarkXPress.
Around 2001, I got involved in a project to convert the available
QuarkXPress files into a structured format that could serve as a
basis for developing an application for the study bible. This was
realized in the form of the online version.
Reverse engineering QuarkXPress
In order to read the QuarkXPress files, it was necessary to reverse engineer the file format. I only reverse engineered the
format for the available files and only as far as was needed. The
resulting program read the available QuarkXPress files and produced
a file containing all the data (making use of a memory-mapped
file). A QuarkXPress document basically consists
of a collection of (possibly nested) boxes (frames) which contain
text (or images). The program also tries to order all these boxes in
the default reading order. I made a version available
for converting one QuarkXPress file.
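As an illustration of ordering nested boxes, here is a minimal sketch in Python. It is not the original program. The Box class, its fields, and the ordering rule (by page, then top to bottom, then left to right, with nested boxes directly after their parent) are all assumptions made for this example; the rule actually used for the "default reading order" is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for a QuarkXPress box (frame)."""
    name: str
    page: int
    x: float   # left edge on the page
    y: float   # top edge on the page
    text: str = ""
    children: List["Box"] = field(default_factory=list)


def reading_order(boxes: List[Box]) -> List[Box]:
    """Flatten boxes depth-first; siblings sorted by page, top, then left.

    This ordering rule is an assumption, not the rule of the original program.
    """
    ordered: List[Box] = []
    for box in sorted(boxes, key=lambda b: (b.page, b.y, b.x)):
        ordered.append(box)
        ordered.extend(reading_order(box.children))
    return ordered


if __name__ == "__main__":
    document = [
        Box("note", page=1, x=300, y=50),
        Box("frame", page=1, x=20, y=50, children=[
            Box("line2", page=1, x=20, y=80),
            Box("line1", page=1, x=20, y=60),
        ]),
        Box("title", page=1, x=20, y=10),
    ]
    print([box.name for box in reading_order(document)])
```

A real layout would need more care, for example with boxes that overlap or with columns that continue on the facing page.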
Analyzing page layout
The second program I developed tried to parse the layout of the
text and put it into a logical format. The study bible consisted
of three kinds of volumes: volumes dealing with the text of the
New Testament, volumes containing word studies, and some volumes
containing introductions and background material. The volumes
dealing with the text of the New Testament used a four-column
layout over two pages, where the columns contained the Greek text,
translations in various formats, and explanatory notes. For each
word, the Greek text itself consisted of a unique number,
the Greek with accents, a transliteration, and a literal translation.
The Greek text also had annotations with respect to textual criticism. The volumes dealing with Greek words
would give a description of the word, all word forms used in
the New Testament, and a list of all occurrences of the word
together with a small extract. Also a list of names and occurrences
was included. The other volumes were not included in the
conversion.
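The logical structure described above could be modelled roughly as follows. This is only a sketch: the class and field names are invented for this example, and it assumes that the number printed with each Greek word is the same number that identifies its word study. The sample values, including the number 1234, are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GreekWord:
    """One word of the Greek text column (field names are hypothetical)."""
    number: int            # the number printed with the word
    greek: str             # Greek with accents
    transliteration: str
    literal: str           # literal translation


@dataclass
class Occurrence:
    reference: str         # where the word occurs
    extract: str           # small extract of the surrounding text


@dataclass
class WordStudy:
    """One entry of a word study volume (field names are hypothetical)."""
    number: int
    description: str
    forms: List[str] = field(default_factory=list)
    occurrences: List[Occurrence] = field(default_factory=list)


if __name__ == "__main__":
    word = GreekWord(number=1234, greek="λόγος",
                     transliteration="logos", literal="woord")
    study = WordStudy(number=1234, description="(description of the word)",
                      forms=["λόγος", "λόγου"])
    print(word.number == study.number, word.greek in study.forms)
```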
Cross checking
To assure correctness of the conversion, I spent a lot of time
developing algorithms that cross-check the
text volumes against the word study volumes. These checks discovered many
errors. Some were due to my lack of understanding of spelling
variations in Greek and the like, but others were real errors that
required input from the editors of the study bible. Some of the
errors were known errors, some were new, and some were caused
by corruption in the input files (mostly missing sections).
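The kind of cross-check described here can be sketched as follows. This is not the algorithm of the original program, which had to deal with spelling variations and much more. The function name, the data shapes, and the message texts are assumptions for this example: each word in a text volume is taken to be a (reference, number, form) triple, and each word study is taken to list its known forms and its occurrences by number.

```python
from typing import Dict, List, Set, Tuple

# (reference, word number, Greek form) as found in a text volume (assumed shape)
TextWord = Tuple[str, int, str]


def cross_check(text_words: List[TextWord],
                study_forms: Dict[int, Set[str]],
                study_refs: Dict[int, Set[str]]) -> List[str]:
    """Report mismatches between the text volumes and the word studies."""
    messages: List[str] = []
    seen: Dict[int, Set[str]] = {}
    # Direction 1: every word in the text must be known to a word study.
    for ref, number, form in text_words:
        seen.setdefault(number, set()).add(ref)
        if number not in study_forms:
            messages.append(f"{ref}: no word study for number {number}")
        elif form not in study_forms[number]:
            messages.append(f"{ref}: form {form} not listed for number {number}")
    # Direction 2: every occurrence listed in a word study must be in the text.
    for number, refs in sorted(study_refs.items()):
        for ref in sorted(refs - seen.get(number, set())):
            messages.append(f"{ref}: occurrence of number {number} not found in text")
    return messages


if __name__ == "__main__":
    text = [("A 1:1", 10, "λόγος"), ("A 1:2", 10, "λόγου"), ("A 1:3", 11, "θεός")]
    forms = {10: {"λόγος"}}
    refs = {10: {"A 1:1", "A 2:5"}}
    for message in cross_check(text, forms, refs):
        print(message)
```

Checking in both directions matters: a word missing from the text volume and a wrong entry in the word study look the same from one side only.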
Because it was not really possible or feasible to edit the
original QuarkXPress documents, I designed a structured text
format to store the data. Error messages would be placed as
comments in the structured text. I also set up some batch
files and a patch program to exchange edits through email
using diff and (un)zip. I also created a version of the MySample text editor with syntax colouring for the structured
text format (including detailed checking of the Greek character
encoding). Later, I included the syntax colouring in the normal
version of MySample. It is activated for files with the extension
"sbr" and it can also be activated with View->Colorer->cvb.
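The exchange of edits by diff and (un)zip can be illustrated roughly as follows. This sketch does not reproduce the original batch files or patch program. It uses Python's difflib and zipfile modules as stand-ins, and the function names are invented. Note that an ndiff delta carries both versions of the file, so it is larger than the output of a normal diff.

```python
import difflib
import io
import zipfile
from typing import List


def pack_edit(name: str, old: List[str], new: List[str]) -> bytes:
    """Zip an ndiff delta between two versions of a file.

    Every line in old and new must end with a newline character.
    """
    delta = "".join(difflib.ndiff(old, new))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name + ".diff", delta)
    return buffer.getvalue()


def unpack_edit(name: str, payload: bytes) -> List[str]:
    """Unzip the delta and rebuild the edited version from it."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        delta = archive.read(name + ".diff").decode("utf-8")
    # Selector 2 asks difflib.restore for the second (edited) sequence.
    return list(difflib.restore(delta.splitlines(keepends=True), 2))


if __name__ == "__main__":
    old = ["first line\n", "secnd line\n"]
    new = ["first line\n", "second line\n"]
    payload = pack_edit("example.sbr", old, new)
    print(unpack_edit("example.sbr", payload) == new)
```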
Output
The above program also produced structured output according
to the logical structure of the data contained in the text. This
output was parsed and imported into the database for the online
study bible. I was not involved in the development of the online
study bible, and the final corrections were made through an
editor interface of the online study bible. In the meantime,
they have also started working on the Old Testament, and this
has also been added to the online study bible. I have lost contact
and do not know how that is managed.
Software
The software I developed for this project can be downloaded. The software is rather useless without the original
QuarkXPress files, which are not available. To use the programs
the following directories need to be created in the directory where
the software is placed:
- Q: this directory should contain the QuarkXPress files.
- tmp: this directory holds the memory-mapped
files containing the data.
- dumpb: directory for the dumpd program.
- sbr: directory for the structured text files.
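Creating these four directories can be scripted. The following is a small sketch in Python; the function name create_dirs is made up for this example, and with Cygwin installed a plain mkdir from the shell does the same job.

```python
from pathlib import Path
from typing import List

# The directory names are taken from the list above.
REQUIRED_DIRS = ["Q", "tmp", "dumpb", "sbr"]


def create_dirs(software_dir: str) -> List[Path]:
    """Create the working directories next to the software, if missing."""
    root = Path(software_dir)
    created = []
    for name in REQUIRED_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


if __name__ == "__main__":
    for path in create_dirs("."):
        print(path)
```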
The software requires Cygwin to
be installed and uses the Bourne shell sh. As preparation, the cls2cpp program needs to be compiled with the
command build_cls2cpp.
The QuarkXPress files are read with the command
run_scan. This creates the file studybible.dat
in the tmp directory, which will be almost
300 Mbytes large. The file scan_out will also be created,
containing the output produced by executing the scan
program.
To calculate the reading order of all the text the command
run_set_ro needs to be executed. This will modify
the studybible.dat file. The program will report
some errors on the output.
With the run_dumpb command, the contents of
biblestudy.dat are used to generate one HTML file per
volume in the directory dumpb, containing a representation
of the data read from the QuarkXPress files for that
volume.
With the run_dumptxt command, the initial structured text
files are created in the directory sbr. The structured
text format uses curly brackets ('{' and '}') to delimit text.
(I found this more practical than the traditional quotes.) Furthermore,
it uses round and square brackets for grouping, where the opening
brackets can be preceded by an identifier for tagging purposes.
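To make the description of the format concrete, here is a minimal parser sketch in Python. It is based only on the description above and is not the parser from parse.cpp. Several points are assumptions: tags are taken to consist of letters, digits, and underscores; items are taken to be separated by white space; text between curly brackets cannot itself contain a '}'; and comments (such as the error messages) are not handled. The sample input is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

CLOSERS = {"(": ")", "[": "]"}


@dataclass
class Group:
    tag: str        # identifier before the opening bracket, "" if none
    bracket: str    # "(" or "["
    items: List[Union[str, "Group"]] = field(default_factory=list)


def parse(source: str) -> List[Union[str, Group]]:
    """Parse a string into a list of text items and (nested) groups."""
    items, _ = _parse_items(source, 0, None)
    return items


def _parse_items(src: str, pos: int, closer: Optional[str]) -> Tuple[list, int]:
    items: list = []
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
        elif ch == "{":                        # text delimited by { and }
            end = src.index("}", pos)          # ValueError if never closed
            items.append(src[pos + 1:end])
            pos = end + 1
        elif ch == closer:                     # end of the current group
            return items, pos + 1
        else:                                  # optional tag, then ( or [
            start = pos
            while pos < len(src) and (src[pos].isalnum() or src[pos] == "_"):
                pos += 1
            if pos >= len(src) or src[pos] not in CLOSERS:
                raise ValueError(f"unexpected character at offset {pos}")
            group = Group(tag=src[start:pos], bracket=src[pos])
            group.items, pos = _parse_items(src, pos + 1, CLOSERS[src[pos]])
            items.append(group)
    if closer is not None:
        raise ValueError(f"missing closing {closer!r}")
    return items, pos


if __name__ == "__main__":
    for item in parse("w({1234} {logos}) [ {a} n({b}) ]"):
        print(item)
```

Because the text delimiters are distinct opening and closing characters, a missing delimiter is easier to locate than with quotes, where every quote can be read as either an opening or a closing one.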
The program for analyzing and checking the data is parse.cpp,
which can be built with the command build_parser. It is
the largest program and the one in which I invested most of my effort. The
program has the following command line options:
- -wC1: Show warnings for first column text volumes.
- -wTC: Show warnings for parsing Greek texts.
- -wC2: Show warnings for second column text volumes.
- -wGW: Show warnings for matching Greek in text with Greek
from word studies.
- -dWS: Generate debug output with respect to parsing
word studies.
- -wWS: Show warnings for parsing word studies.
- -wTL: Show warnings for transliterations not matching Greek text.
- -oRG: Output Greek text with a multi-character representation.
- -oGH: Output Greek text with two hexadecimal characters per character.
- -oSC: Skip parsing of commentary.
The options are followed by keywords that specify which parts
of the data need to be processed. The keyword ALLES
is used for everything. The keyword WSOPNIEUW causes
the file biblestudy2.dat in the tmp
directory, which holds information about the word studies, to be
rebuilt. The keywords 2, 3, 4, 5,
6, 7A, 7B, 8, 9,
10, 11, 12, 13, 14,
15, and 16 are used to specify the
volumes to be processed. When the file biblestudy2.dat
does not exist or when the keyword WSOPNIEUW is
used, the word study volumes are processed first. Next, the
specified volumes are processed and output in the logical
structured format is produced in the out directory:
one file per Bible book, one file per range of one hundred
word studies, and some additional files.
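The handling of options and keywords described above can be sketched as follows. This is not the code from parse.cpp. The function names are invented, the sketch accepts options and keywords in any order, and it assumes that ALLES selects all listed volumes without forcing a rebuild of biblestudy2.dat.

```python
from pathlib import Path
from typing import List, Set, Tuple

OPTIONS = {"-wC1", "-wTC", "-wC2", "-wGW", "-dWS",
           "-wWS", "-wTL", "-oRG", "-oGH", "-oSC"}
VOLUMES = ["2", "3", "4", "5", "6", "7A", "7B", "8", "9",
           "10", "11", "12", "13", "14", "15", "16"]


def parse_args(args: List[str]) -> Tuple[Set[str], List[str], bool]:
    """Split arguments into options, volumes to process, and a rebuild flag."""
    options: Set[str] = set()
    volumes: List[str] = []
    rebuild = False
    for arg in args:
        if arg in OPTIONS:
            options.add(arg)
        elif arg == "ALLES":            # assumption: means all listed volumes
            volumes = list(VOLUMES)
        elif arg == "WSOPNIEUW":
            rebuild = True
        elif arg in VOLUMES:
            if arg not in volumes:
                volumes.append(arg)
        else:
            raise ValueError(f"unknown option or keyword: {arg}")
    return options, volumes, rebuild


def needs_word_studies(rebuild: bool, tmp_dir: str) -> bool:
    """Word studies are processed first on request or when the file is missing."""
    return rebuild or not (Path(tmp_dir) / "biblestudy2.dat").exists()


if __name__ == "__main__":
    print(parse_args(["-wGW", "-oSC", "7A", "16"]))
```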
When the output is directed to a file, the program
mergeErrors.cpp can be used to merge the
reported errors with the structured text files in
the sbr directory. The program striperrors.cpp
can be used to strip these messages from a file again.
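The idea of merging and stripping error messages can be sketched as follows. This does not reproduce mergeErrors.cpp or striperrors.cpp. The comment marker and the shape of an error report (a line number with a message) are hypothetical, since neither the comment syntax of the structured text format nor the format of the reported errors is described here.

```python
from typing import Dict, List, Tuple

# Hypothetical comment prefix; the real comment syntax is not documented here.
MARKER = "#! "


def merge_errors(lines: List[str], errors: List[Tuple[int, str]]) -> List[str]:
    """Insert each (line_number, message) as a comment line before that line."""
    by_line: Dict[int, List[str]] = {}
    for number, message in errors:
        by_line.setdefault(number, []).append(message)
    merged: List[str] = []
    for number, line in enumerate(lines, start=1):
        for message in by_line.get(number, []):
            merged.append(MARKER + message)
        merged.append(line)
    return merged


def strip_errors(lines: List[str]) -> List[str]:
    """Remove all comment lines that were added by merge_errors."""
    return [line for line in lines if not line.startswith(MARKER)]


if __name__ == "__main__":
    text = ["w({1234} {logos})", "w({1235} {theos})"]
    merged = merge_errors(text, [(2, "transliteration does not match")])
    print(merged)
    print(strip_errors(merged) == text)
```

Stripping restores the original lines only as long as no ordinary line starts with the marker, which is why the marker has to be something that cannot occur in the data.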