C
H A P T E RT
W O FUNDAMENTALS |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Most
of the advancements throughout the history of computers have been because
of particular problems which required solving. In this chapter, we'll take
a look at some of these problems and their solutions as they pertain to
the MPQ format. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Hashes | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Problem: You have a
very large array of strings. You have another string and need to know if
it is already in the list. You would probably begin by comparing each
string in the list with the string other, but when put into application,
you would find that this method is far too slow for practical use.
Something else must be done. But how can you know if the string exists
without comparing it to all the other strings?
The previous code function demonstrates a very simple hashing algorithm. The function sums the characters in the string, shifting the hash value left one bit before each character is added in. Using this algorithm, the string "arr\units.dat" would hash to 0x5A858026, and "unit\neutral\acritter.grp" would hash to 0x694CD020. Now, this is, admittedly, a very simple algorithm, and it isn't very useful, because it would generate a relatively predictable output, and a lot of collisions in the lower range of numbers. Collisions are what happen when more than one string hash to the same value. The MPQ format, on the other hand, uses a very complicated hash algorithm (shown below) to generate totally unpredictable hash values. In fact, the hashing algorithm is so effective that it is called a one-way hash. A one-way hash is a an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible. Using this particular algorithm, the filename "arr\units.dat" would hash to 0xF4E6C69D, and "unit\neutral\acritter.grp" would hash to 0xA26067F3.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Hash Tables | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Problem: You tried
using an index like in the previous sample, but your program absolutely
demands break-neck speeds, and indexing just isn't fast enough. About the
only thing you could do to make it faster is to not check all of the
hashes in the array. Or, even better, if you could only make one
comparison in order to be sure the string doesn't exist anywhere in the
array. Sound too good to be true? It's not.
Now, there is one glaring flaw in that explanation. What do you think happens when a collision occurs (two different strings hash to the same value)? Obviously, they can't occupy the same entry in the hash table. Normally, this is solved by each entry in the hash table being a pointer to a linked list, and the linked list would hold all the entries that hash to that same value. MPQs use a hash table of filenames to keep track of the files inside, but the format of this table is somewhat different from the way hash tables are normally done. First of all, instead of using a hash as an offset, and storing the actually filename for verification, MPQs do not store the filename at all, but rather use three different hashes: one for the hash table offset, two for verification. These two verification hashes are used in place of the actual filename. Of course, this leaves the possibility that two different filenames would hash to the same three hashes, but the chances of this happening are, on average, 1:18889465931478580854784, which should be safe enough for just about anyone. The other way that an MPQ's hash table differs from the conventional implementation is that instead of using a linked list for each entry, when a collision occurs, the entry will be shifted to the next slot, and the process repeated until a free space is found. Take a look at the following illustrational code, which is basically the way a file is located for reading in an MPQ:
However convoluted that code may look, the theory behind it isn't difficult. It basically follows this process when looking to read a file:
If
you were paying attention, you might have noticed from my explanation and
sample code is that the MPQ's hash table has to hold all the file entries
in the MPQ. But what do you think happens when every hash-table entry gets
filled? The answer might surprise you with its obviousness: you can't add
any more files. Several people have asked me why there is a limit (called
the file limit) on the number of files that can be put in an MPQ,
and if there is any way around this limit. Well, you already have the
answer to the first question. As for the second; no, you cannot get
around the file limit. For that matter, hash tables cannot even be resized
without remaking the entire MPQ from scratch. This is because the location
of each entry in the hash table may well change due to the resizing, and
we would not be able to derive the new position because the position is
the hash of the file name, and we may not know the file name. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Compression | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Problem:
You have a large program (let's say 50 megs), that you want to distribute
on the internet. But 50 megs is an awfully large download, and people
might not be interested in waiting the four and a half hours to download
this program. Solution: Compression. Compression is the art of representing a given amount of data in a smaller space. There are literally hundreds of different compression algorithms, each working in different ways. The actual algorithm that MPQs use is the Data Compression Library, licensed from PKWare (one of the leaders in applied compression), and is far too complicated to explain here. Instead, I'll try to explain a more simple compression algorithm for the purpose of illustration. THIS SECTION IS
INCOMPLETE BECAUSE THE AUTHOR IS UNAVAILABLE |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Encryption | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The need
for a system of protecting information from prying eyes is timeless.
People have been trying to privately pass information to others for
thousands of years. From handwritten letters carried on foot by the
couriers of ancient Greece, to Nazi submarine radio transmissions in World
War 2, to credit card transactions over the internet today, the ability to
assure that no unwanted people are reading your information is a
necessity. This complex art of protection is called encryption, and, while we don't know when the first algorithm was devised, we do know that the number of algorithms is too numerous to count. Everything from simple scrambling of data, to transmutation, and even algorithms in which the decryption key (sometimes called a password) is different from the encryption key (in a method called asymmetric encryption), has been done time and time again. This article certainly never claims, nor expects, to be a comprehensive authority on encryption methods, but it is necessary that you understand encryption is you intend to work with the MPQ format directly. Let us start with a simple encryption algorithm that was published in Basic Lab Notes (adapted to C by myself; some variable names changed for readability; comments removed):
Just like the demonstration hash code, this code is extremely simple, and shouldn't be used in an actual program where security is required. What it does is simple, even if the code is cryptic (no pun intended). It goes through the block to encrypt, XORing each byte with the corresponding byte of the password. It then modifies the byte in the password by adding 13 to it (13 was chosen because it is a prime number). This is done to make the code's pattern more difficult to recognize. Thus, with this algorithm, the string "encryption" (65 6E 63 72 79 70 74 69 6F 6E) encrypted with a password of "MPQ" (4D 50 51) would be an unreadable string (28 3E 32 28 24 2E 13 03 04 1A). Now, this algorithm is symmetrical. That means the key (or password) that is used to encrypt a block is the same as the key to decrypt it. In fact, because XOR is a symmetric operation, the exact same algorithm can be used to decrypt as to encrypt. Note that most symmetric encryption algorithms are not exactly symmetric, so they require the encrypt and decrypt functions to be different. Okay, this is where things get dirty. If you want to work with the MPQ format directly, you're gonna have to know the encryption algorithm, and that makes it my job to teach it to you. The MPQ encryption algorithm is an interesting hybrid of other encryption techniques. It creates an encryption table (which is also used in the hash function), and uses a file's encryption key to pick certain members out of the encryption table. It then XORs the data to be encrypted with the members from the table. Now, that's a pretty strange way to do things, so maybe some code will show you that it is overcomplicated :-p. The following code generates the cryptTable array of 0x500 longs:
Are you kinda getting the feeling that Blizzard hired a disgruntled calculus professor to write these algorithms? I know I am. Fortunately, it isn't a big problem if you don't understand this code. If you want to work directly with MPQs, you'll need these functions; you don't necessarily have to understand them. Anyway, after the cryptTable is initialized, we can decrypt MPQ data with the following function (don't expect me to explain it to you; I don't want to know how it works myself!):
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Web site and content copyright © 2000 Justin Olbrantz(Quantam) unless otherwise noted. All rights reserved.