How to crack a Binary File Format
By Frans
Important notice
Please do not send me binary files and expect me to reverse engineer them
for you for free. Requests for cracking passwords, breaking some form of DRM
and other forms of illegal activities are deleted immediately by me. At the
moment, I am not available for reverse engineering projects.
|
Introduction
I often receive request from people to help them with cracking a particular
binary file format. It seems that many software vendors are not willing to
document the format of the binary files that they are using. Maybe there are
good reasons for this. Personally, I feel that users have the right to access
the information that is stored in the document that they manipulate with a
program that they are legally allowed to use. That is why I do not consider
reverse engineering of a binary file format as an illegal act (opposite to
what the use of the verb "cracking" is suggesting). But it might still be
illegal according to the law, depending on where you live, your contract
status and/or user agreement you have agreed with. I am not a legal expect
and not in the position to give advice on legal matters. For some more
information see: EEFs: Coders' Rights Project Reverse Engineering FAQ.
The most likely reason why people ask me for help is because I already have
reversed engineerd some binary file formats. So far, I have worked on:
Although, cracking a binary file format can be very rewarding, like a form of
mental mountain climbing, there are also some basic skills that are absolutely
neccessary, if you ever want to succeed. These are (in order of importance):
- A good reason why you want to crack the format at all.
- A lot of time.
- A lot of preseverence.
- A lot of creativity.
- Willingness to do repeative tasks for hours and hours.
- Good understanding of how numbers and strings are represented in binary
manner.
- Excellent command of the C programming language.
Cracking a file format is not easy. It can drive you crazy. Try to avoid
it if you can. And, of course, you first search the Internet for at least one
day seeing what is already known about the format, or to find out if someone
already offers some conversion service or tool to a format that you can read.
(Some starting points are: List of file formats, File-Extensions.org, FilExt, and
FileFormat.Info.)
Also, I think you should have read the stories of attempts
to crack a binary file format, and studied some programs
that read binary files. (Follow the links mentioned in the
introduction to start with.) Please do not contact me with request to
give any form of instruction and/or education.
Besides this, you also need:
- a computer (quite obvious),
- a simple text editor capable of handling long lines,
- (optional) a hex viewer or editor,
- a calculator with hexadecimal-to-decimal conversion and vice-versa
functionality,
- a C or C++ compiler,
- enough sample files about which you know what data they contain, and
- (optional) the application which uses the files.
A professional hex editor was brought under my attention with some nice
features for cracking binary files. 010 Editor allows you to write templates for describing the structure of
the binary file, and having it parse on the fly showing the results. The
languages used has some powerful features. It some what resembles my ideas for
BFF: a grammar for Binary File Formats.
Another professional hex editor that has been brought under my attention is
Hexinator, which is available under
Windows, Mac and Linux. It too has abilities for specifying grammars with a
build in grammar editor. The grammars are stored in XML. A collection of
grammars can be found here.
I personally would prefer the ability to specify a grammar in a text format.
I also do not see any possibilities to unparse the contents of a binary file
in a textual representation (as could be done with the latest version of
IParse). It would also be nice to have a feature to
generate a program that based on a grammar produces a program (in C++) for
reading files in that format.
What is the fun?
Indeed, cracking a format is like climbing a
mountain, when you reach the top it gives you
a great reward. But it is also like entering
the mind of someone else. With every step you
come further, you know something more about the
program being used to read and write the file.
And by this, you also know more about the mind
of the person (or persons) who designed that
program. However, sometimes this person becomes
your enemy, when attempts have been made to prevent
total reverse engineering of the format. Yes,
those people designing a format sometimes are
given the assignment to make reverse engineering
difficult. There are many ways to do it. Custom
made compression and encrytion are two techniques.
Luckily, both these can reduce performance of
loading and saving. But there are also small
tricks you can use, which make it a lot harder.
Programming style
Although, cracking a binary file format may tempt
you to write instant code, you should realize that
it is a long and complicated task, which often
requires the code to undergo major revisions. Code
revisions are much easier if your code is clean and
well structured. (The principles of
eXtreme Programming do also
apply to this kind of programming work.)
Some practical remarks:
- Document your code (either through comments
or, better, by giving meaningful names to
variables and procedures).
- Make regular back-ups. Although this should
be a standerd procedure, with cracking a file
format you will see yourself more often going
back to an earlier version of the program
whenever you get stuck in a certain area.
- Build into your program switches (simple
Boolean variables) which allow you to dump
any intermediate data during any stage of
the parsing. You will find that you often
need to go back one step to make a step
forward.
How to work
Where to start
The first thing you need to do is to understand the
global structure of the binary files. Look at the
sample files with a hex-viewer, or write a small
hex-dump program, which dumps each next 32 bytes
on a separate line, generating an empty line
after each 256 bytes. (Alternatively, you can also
dump 256 bytes on each line.)
You can use one of the following statements to
print a single byte:
printf("%02X", byte);
printf("%02X ", byte);
printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned
char, otherwise you may get confusing effects. (If
you don't understand the above code fragments, you clearly
do not meet the requirements, so,
don't ask me to explain it to you.)
Once the global structure has been discovered, you
can start writing a C program, which simply dumps
the file along this structure.
A lot the work, just consists of searching for
structure in the hex-dumps and editing them in long
lines representing the various records on a single line,
in order to discover the structure and determine the
meaning of the various parts.
For a primary investigation of a format, the
strings program might be useful. Sometimes
binary format contain whole chunks of text.
Block structure
Some binary file formats use a block structure
where each chunk of data is stored in a number
of equal sized blocks. A block size of 256 is
often used. Examples of these are MS Word 5.0 and
Quark Xpress. Whenever a block structure is
used, data structures are often referenced by the
block number in which the data start.
Number representation
The most common used manner to represent numbers
is to interpret a sequence of bytes as a binary
number. Numbers can also be stored as strings,
where each character is representes one digit.
(These are relatively easy to recognize in a
binary file.) Because only a limit set of characters
is needed to represent all kinds of numbers (including
fractions and exponents), two characters can be
put in a single byte. In the past there were even
processors that had instructions to perform
arithematic operations on such numbers. (The two
parts of the byte were often refered to as nibbles.)
Nowadays this format is not in use any more (but
it might still occur in some binary file format,
as these sometimes survive their original platform).
Negative integers
For simple positive numbers interpreting a sequence of
bits as a binary number is the rather obvious choice.
But for numbers that can be both positive and negative
this is not. One interpretation is to consider one of the
bits (the first usually) as the sign bit, and the rest of
the bits as a binary number. A problem with this
interpretation is that there are two representations for
zero. Another solution is to interpret a sequence of
bits as a binary number, and to subtract a large
value from it, in case the number is larger than some
number. There are two options here, with respect to
which numbers is subtracted with a certain amouth of
available bits. These are called 1's compliment and 2's compliment
number representation. Nowadays, the 2's compliment method
is most often used. This means that for a byte (8 bits)
the number 256 is subtracted, when the number is larger
than 127. That means that the unsigned number 128
represent the signed number -128, and the unsigned number
255 represent the signed number -1. (With the 1's compliment
number representation, the number 255 is subtracted.
Again, this introduces two representations for zero, as the
unsigned number 255 becomes the signed number 0.)
Note that with the 2' compliment the first bit can be
seen as the sign bit. The only disadvantage the 2' compliment
has is that there is always a negative number that does not
have a positive counterpart, e.g., for the byte there is
a representation for -128 but not for 128.
An important advantage of the 2-compliments notation is
that the arithmetic operations for the signed and unsigned
representation are exactly the same. The only time the
difference become obvious is when a number is cast to a
larger representation form. For example the hexadecimal
byte value 0xFF needs to be changed to 0xFFFF, when the
signed number -1 is cast from a byte to a word, and has
to be changed to 0x00FF, when the unsigned number 255 is
cast from a byte to a word.
In a binary file format, it can occur that a unsigned byte
value is stored as a word with the signed expansion applied.
As a result of this it can happen that 127 is stored
hexadecimal as 0x007F, where 128 is stored hexadecimal
as 0xFF80.
little and big endian
Processors can be divide into little and big endian.
This determines the order of the bytes in for short
and long integer values. little endian processors
start with the least significant byte. Big endian
processors start with the most significant byte.
The reason why little endian processors start with
the least significant by first, is because it gives
an advantage for 8 bits processors when having to
perform adding and subtraction operations. Although
this plays no role in current 32 bit and 64 bit
processors, some processors are still little endian
because of their origin. The Pentium processor is
a little endian. The PowerPC (used in Macs) is
big endian. This difference is especially important
for exchanging binary data between different processors.
One should be aware that some applications running on
a big endian processor, may use the little endian
byte order, because the application was originally
developed to run on a little endian processor.
Sometimes, long values (32 bits) are stored as two
big endian words, with the least significant word
first.
floating point formats
The possibilities for representing floating number
far exceed those of integers. Luckily, nowadays it
has been pretty much standardized. Most of the time
it matches the way common processors are using. So,
a simple cast to float or double
will work, sometimes in combination with a byte
reordering with respect to little and big endian
differences.
Strings
Usually, it is rather easy to recognize strings
in a binary format. One should realize that each
character in a string can either use one or two
bytes (UCS or UTF-16), and that little and big endian
could play a role. One should realize that there
are many string encoding methods, some based on
code pages, some based on Unicode (such as UTF-8).
There is a difference between
fixed and variable size strings. Fixed strings
take a fixed amouth of space in the file. They
are often filled up with spaces or the
null-character. Variable sized strings, either
can have the length at the start, or use a
termination character (most often the null-character)
to indicate the end of the string. Sometimes, the
next piece of data does not directly start at the
next byte following the string, but could be at
a word or long word boundary.
Packed format
To save space, some binary format may make use
of bit packing, which means that the values stored
in the binary file do not necessary start at byte/word
boundaries. (In C this can be achieved by using the ":"
followed by a number in the struct/class member
definitions.) One such format is the
AutoCAD DWG13 file format.
See the program for some suggestions.
More often it happens that a number of Boolean values
are stored in a single byte or word. Sometimes it
can be helpful to print numbers in their binary
representation.
Rubbish
Be aware that many binary files just contain
chunks of rubbish, especially those that allocate
fixed blocks for storing data in the binary file.
Also memory mapped files could contain large
sections of rubbish. Rubbish can be very misleading
and a great trouble in reverse engineering binary
files, because sometimes it just seems to contain
valid data. On the other hand, something what looks
like rubbish could actually contain some information.
Reading files
When reading a binary file, one often has to access
the file in a random manner, or at least often look
ahead from the current reading position.
Direct random access of a binary file is slow. To
avoid this, it is best (if size permits this) to
store the whole file into a single byte array.
(For an example see my CBuf class, which
I used for reading Quark Xpress
files.)
If size does not permit it, you could try the
Memory Mapped file option. For an example of
this under Windows see the class Mmfile
in the Suneido
sources that uses the functions CreateFileMapping
and MapViewOfFile. Or study my own implementation of an Object-Oriented persistent
store based on Memory Mapped files.
The difference strategy
One way of cracking a binary file format is to make
many, preferable small, files with the application
that contain small changes compared to each other.
For example, you could start to write a document with
the application and each time save a document under
a different name.
You can use a Binary File Compare
utility to compare the files and discover how certain
changes are reflected in the binary file, which gives
you information about the where the information is
stored.
Links
In some random order:
Tools
Open source:
Closed source:
(Background image was taken from http://mathcs.wcsu.ctstateu.edu/joelw/Arch/)
My life as a hacker |
Software engineering |
home and email address