How to crack a Binary File Format

By Frans

Important notice

Please do not send me binary files and expect me to reverse engineer them for you for free. Requests for cracking passwords, breaking some form of DRM and other forms of illegal activities are deleted immediately by me. At the moment, I am not available for reverse engineering projects.

Introduction

I often receive request from people to help them with cracking a particular binary file format. It seems that many software vendors are not willing to document the format of the binary files that they are using. Maybe there are good reasons for this. Personally, I feel that users have the right to access the information that is stored in the document that they manipulate with a program that they are legally allowed to use. That is why I do not consider reverse engineering of a binary file format as an illegal act (opposite to what the use of the verb "cracking" is suggesting). But it might still be illegal according to the law, depending on where you live, your contract status and/or user agreement you have agreed with. I am not a legal expect and not in the position to give advice on legal matters. For some more information see: EEFs: Coders' Rights Project Reverse Engineering FAQ.

The most likely reason why people ask me for help is because I already have reversed engineerd some binary file formats. So far, I have worked on:

Requirements

Although, cracking a binary file format can be very rewarding, like a form of mental mountain climbing, there are also some basic skills that are absolutely neccessary, if you ever want to succeed. These are (in order of importance): Cracking a file format is not easy. It can drive you crazy. Try to avoid it if you can. And, of course, you first search the Internet for at least one day seeing what is already known about the format, or to find out if someone already offers some conversion service or tool to a format that you can read. (Some starting points are: List of file formats, File-Extensions.org, FilExt, and FileFormat.Info.) Also, I think you should have read the stories of attempts to crack a binary file format, and studied some programs that read binary files. (Follow the links mentioned in the introduction to start with.) Please do not contact me with request to give any form of instruction and/or education.

Besides this, you also need:

010 Editor - Hex Editor

A professional hex editor was brought under my attention with some nice features for cracking binary files. 010 Editor allows you to write templates for describing the structure of the binary file, and having it parse on the fly showing the results. The languages used has some powerful features. It some what resembles my ideas for BFF: a grammar for Binary File Formats.

Hexinator

Another professional hex editor that has been brought under my attention is Hexinator, which is available under Windows, Mac and Linux. It too has abilities for specifying grammars with a build in grammar editor. The grammars are stored in XML. A collection of grammars can be found here. I personally would prefer the ability to specify a grammar in a text format. I also do not see any possibilities to unparse the contents of a binary file in a textual representation (as could be done with the latest version of IParse). It would also be nice to have a feature to generate a program that based on a grammar produces a program (in C++) for reading files in that format.

What is the fun?

Indeed, cracking a format is like climbing a mountain, when you reach the top it gives you a great reward. But it is also like entering the mind of someone else. With every step you come further, you know something more about the program being used to read and write the file. And by this, you also know more about the mind of the person (or persons) who designed that program. However, sometimes this person becomes your enemy, when attempts have been made to prevent total reverse engineering of the format. Yes, those people designing a format sometimes are given the assignment to make reverse engineering difficult. There are many ways to do it. Custom made compression and encrytion are two techniques. Luckily, both these can reduce performance of loading and saving. But there are also small tricks you can use, which make it a lot harder.

Programming style

Although, cracking a binary file format may tempt you to write instant code, you should realize that it is a long and complicated task, which often requires the code to undergo major revisions. Code revisions are much easier if your code is clean and well structured. (The principles of eXtreme Programming do also apply to this kind of programming work.)

Some practical remarks:

How to work

Where to start

The first thing you need to do is to understand the global structure of the binary files. Look at the sample files with a hex-viewer, or write a small hex-dump program, which dumps each next 32 bytes on a separate line, generating an empty line after each 256 bytes. (Alternatively, you can also dump 256 bytes on each line.) You can use one of the following statements to print a single byte:
   printf("%02X", byte);
   printf("%02X ", byte);
   printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
   printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned char, otherwise you may get confusing effects. (If you don't understand the above code fragments, you clearly do not meet the requirements, so, don't ask me to explain it to you.) Once the global structure has been discovered, you can start writing a C program, which simply dumps the file along this structure. A lot the work, just consists of searching for structure in the hex-dumps and editing them in long lines representing the various records on a single line, in order to discover the structure and determine the meaning of the various parts.

For a primary investigation of a format, the strings program might be useful. Sometimes binary format contain whole chunks of text.

Block structure

Some binary file formats use a block structure where each chunk of data is stored in a number of equal sized blocks. A block size of 256 is often used. Examples of these are MS Word 5.0 and Quark Xpress. Whenever a block structure is used, data structures are often referenced by the block number in which the data start.

Number representation

The most common used manner to represent numbers is to interpret a sequence of bytes as a binary number. Numbers can also be stored as strings, where each character is representes one digit. (These are relatively easy to recognize in a binary file.) Because only a limit set of characters is needed to represent all kinds of numbers (including fractions and exponents), two characters can be put in a single byte. In the past there were even processors that had instructions to perform arithematic operations on such numbers. (The two parts of the byte were often refered to as nibbles.) Nowadays this format is not in use any more (but it might still occur in some binary file format, as these sometimes survive their original platform).

Negative integers

For simple positive numbers interpreting a sequence of bits as a binary number is the rather obvious choice. But for numbers that can be both positive and negative this is not. One interpretation is to consider one of the bits (the first usually) as the sign bit, and the rest of the bits as a binary number. A problem with this interpretation is that there are two representations for zero. Another solution is to interpret a sequence of bits as a binary number, and to subtract a large value from it, in case the number is larger than some number. There are two options here, with respect to which numbers is subtracted with a certain amouth of available bits. These are called 1's compliment and 2's compliment number representation. Nowadays, the 2's compliment method is most often used. This means that for a byte (8 bits) the number 256 is subtracted, when the number is larger than 127. That means that the unsigned number 128 represent the signed number -128, and the unsigned number 255 represent the signed number -1. (With the 1's compliment number representation, the number 255 is subtracted. Again, this introduces two representations for zero, as the unsigned number 255 becomes the signed number 0.) Note that with the 2' compliment the first bit can be seen as the sign bit. The only disadvantage the 2' compliment has is that there is always a negative number that does not have a positive counterpart, e.g., for the byte there is a representation for -128 but not for 128.

An important advantage of the 2-compliments notation is that the arithmetic operations for the signed and unsigned representation are exactly the same. The only time the difference become obvious is when a number is cast to a larger representation form. For example the hexadecimal byte value 0xFF needs to be changed to 0xFFFF, when the signed number -1 is cast from a byte to a word, and has to be changed to 0x00FF, when the unsigned number 255 is cast from a byte to a word. In a binary file format, it can occur that a unsigned byte value is stored as a word with the signed expansion applied. As a result of this it can happen that 127 is stored hexadecimal as 0x007F, where 128 is stored hexadecimal as 0xFF80.

little and big endian

Processors can be divide into little and big endian. This determines the order of the bytes in for short and long integer values. little endian processors start with the least significant byte. Big endian processors start with the most significant byte. The reason why little endian processors start with the least significant by first, is because it gives an advantage for 8 bits processors when having to perform adding and subtraction operations. Although this plays no role in current 32 bit and 64 bit processors, some processors are still little endian because of their origin. The Pentium processor is a little endian. The PowerPC (used in Macs) is big endian. This difference is especially important for exchanging binary data between different processors. One should be aware that some applications running on a big endian processor, may use the little endian byte order, because the application was originally developed to run on a little endian processor. Sometimes, long values (32 bits) are stored as two big endian words, with the least significant word first.

floating point formats

The possibilities for representing floating number far exceed those of integers. Luckily, nowadays it has been pretty much standardized. Most of the time it matches the way common processors are using. So, a simple cast to float or double will work, sometimes in combination with a byte reordering with respect to little and big endian differences.

Strings

Usually, it is rather easy to recognize strings in a binary format. One should realize that each character in a string can either use one or two bytes (UCS or UTF-16), and that little and big endian could play a role. One should realize that there are many string encoding methods, some based on code pages, some based on Unicode (such as UTF-8). There is a difference between fixed and variable size strings. Fixed strings take a fixed amouth of space in the file. They are often filled up with spaces or the null-character. Variable sized strings, either can have the length at the start, or use a termination character (most often the null-character) to indicate the end of the string. Sometimes, the next piece of data does not directly start at the next byte following the string, but could be at a word or long word boundary.

Packed format

To save space, some binary format may make use of bit packing, which means that the values stored in the binary file do not necessary start at byte/word boundaries. (In C this can be achieved by using the ":" followed by a number in the struct/class member definitions.) One such format is the AutoCAD DWG13 file format. See the program for some suggestions.

More often it happens that a number of Boolean values are stored in a single byte or word. Sometimes it can be helpful to print numbers in their binary representation.

Rubbish

Be aware that many binary files just contain chunks of rubbish, especially those that allocate fixed blocks for storing data in the binary file. Also memory mapped files could contain large sections of rubbish. Rubbish can be very misleading and a great trouble in reverse engineering binary files, because sometimes it just seems to contain valid data. On the other hand, something what looks like rubbish could actually contain some information.

Reading files

When reading a binary file, one often has to access the file in a random manner, or at least often look ahead from the current reading position. Direct random access of a binary file is slow. To avoid this, it is best (if size permits this) to store the whole file into a single byte array. (For an example see my CBuf class, which I used for reading Quark Xpress files.) If size does not permit it, you could try the Memory Mapped file option. For an example of this under Windows see the class Mmfile in the Suneido sources that uses the functions CreateFileMapping and MapViewOfFile. Or study my own implementation of an Object-Oriented persistent store based on Memory Mapped files.

The difference strategy

One way of cracking a binary file format is to make many, preferable small, files with the application that contain small changes compared to each other. For example, you could start to write a document with the application and each time save a document under a different name.

You can use a Binary File Compare utility to compare the files and discover how certain changes are reflected in the binary file, which gives you information about the where the information is stored.

Links

In some random order:

Tools

Open source:

Closed source:

(Background image was taken from http://mathcs.wcsu.ctstateu.edu/joelw/Arch/)


My life as a hacker | Software engineering | home and email address