 |
 |
 |
 |
 |
 |
|
C
H A P T E R T
W O

FUNDAMENTALS
 |
|
|
|
Most
of the advancements throughout the history of computers have been because
of particular problems which required solving. In this chapter, we'll take
a look at some of these problems and their solutions as they pertain to
the MPQ format.

|
|
|
|
Hashes |
|
|
|
|
Problem: You have a
very large array of strings. You have another string and need to know if
it is already in the list. You would probably begin by comparing each
string in the list with the string other, but when put into application,
you would find that this method is far too slow for practical use.
Something else must be done. But how can you know if the string exists
without comparing it to all the other strings?
Solution:
Hashes. Hashes are smaller data types (i.e. numbers) that represent
other, larger, data types (usually strings). In this scenario, you could
store hashes in the array with the strings. Then you could compute the
hash of the other string and compare it to the stored hashes. If a hash in
the array matches the new hash, the strings can be compared to verify the
match. This method, called indexing, could speed things up by about
100 times, depending on the size of the array and the average length of
the strings.
unsigned
long HashString(char *lpszString)
{ |
|
unsigned
long ulHash = 0xf1e2d3c4;
while (*lpszString != 0)
{ |
|
|
ulHash <<= 1;
ulHash += *lpszString++; |
|
}
return ulHash; |
} |
|
The
previous code function demonstrates a very simple hashing algorithm. The
function sums the characters in the string, shifting the hash value left
one bit before each character is added in. Using this algorithm, the
string "arr\units.dat" would hash to 0x5A858026, and
"unit\neutral\acritter.grp" would hash to 0x694CD020. Now, this
is, admittedly, a very simple algorithm, and it isn't very useful, because
it would generate a relatively predictable output, and a lot of collisions
in the lower range of numbers. Collisions are what happen when more than
one string hash to the same value.
The
MPQ format, on the other hand, uses a very complicated hash algorithm
(shown below) to generate totally unpredictable hash values. In fact, the
hashing algorithm is so effective that it is called a one-way hash.
A one-way hash is a an algorithm that is constructed in such a way that
deriving the original string (set of strings, actually) is virtually
impossible. Using this particular algorithm, the filename
"arr\units.dat" would hash to 0xF4E6C69D, and
"unit\neutral\acritter.grp" would hash to 0xA26067F3.
unsigned
long HashString(char *lpszFileName, unsigned long dwHashType)
{ |
|
unsigned
char *key = (unsigned char *)lpszFileName;
unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
int ch;
while(*key != 0)
{ |
|
|
ch =
toupper(*key++);
seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 +
seed2);
seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3; |
|
}
return seed1; |
} |
|
 |
|
|
|
Hash
Tables |
|
|
|
|
Problem: You tried
using an index like in the previous sample, but your program absolutely
demands break-neck speeds, and indexing just isn't fast enough. About the
only thing you could do to make it faster is to not check all of the
hashes in the array. Or, even better, if you could only make one
comparison in order to be sure the string doesn't exist anywhere in the
array. Sound too good to be true? It's not.
Solution:
A hash table. A hash table is a special type of array in which the
offset of the desired string is the hash of that string. What I
mean is this. Say that you make that string array use a separate array of
fixed size (let's say 1024 entries, to make it an even power of 2) for the
hash table. You want to see if the new string is in that table. To get the
string's place in the hash table, you compute the hash of that string,
then modulo (division remainder) that hash value by the size of that
table. Thus, if you used the simple hash algorithm in the previous
section, "arr\units.dat" would hash to 0x5A858026, making its
offset 0x26 (0x5A858026 divided by 0x400 is 0x16A160, with a remainder of
0x26). The string at this location (if there was one) would then be
compared to the string to add. If the string at 0x26 doesn't match or just
plain doesn't exist, then the string to add doesn't exist in the array.
The following code illustrates this:
int
GetHashTablePos(char *lpszString, SOMESTRUCTURE *lpTable, int
nTableSize)
{ |
|
int
nHash = HashString(lpszString), nHashPos = nHash % nTableSize;
if (lpTable[nHashPos].bExists &&
!strcmp(lpTable[nHashPos].pString, lpszString)) |
|
|
return nHashPos ; |
|
else |
|
|
return -1; //Error
value |
} |
|
Now,
there is one glaring flaw in that explanation. What do you think happens
when a collision occurs (two different strings hash to the same value)?
Obviously, they can't occupy the same entry in the hash table. Normally,
this is solved by each entry in the hash table being a pointer to a linked
list, and the linked list would hold all the entries that hash to that
same value.
MPQs
use a hash table of filenames to keep track of the files inside, but the
format of this table is somewhat different from the way hash tables are
normally done. First of all, instead of using a hash as an offset, and
storing the actually filename for verification, MPQs do not store the
filename at all, but rather use three different hashes: one for the hash
table offset, two for verification. These two verification hashes are used
in place of the actual filename. Of course, this leaves the possibility
that two different filenames would hash to the same three hashes, but the
chances of this happening are, on average, 1:18889465931478580854784,
which should be safe enough for just about anyone.
The
other way that an MPQ's hash table differs from the conventional
implementation is that instead of using a linked list for each entry, when
a collision occurs, the entry will be shifted to the next slot, and the
process repeated until a free space is found. Take a look at the following
illustrational code, which is basically the way a file is located for
reading in an MPQ:
int
GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int
nTableSize)
{ |
|
const
int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
int nHash = HashString(lpszString, HASH_OFFSET), nHashA =
HashString(lpszString, HASH_A), nHashB =
HashString(lpszString, HASH_B), nHashStart = nHash %
nTableSize, nHashPos = nHashStart;
while (lpTable[nHashPos].bExists)
{ |
|
|
if
(lpTable[nHashPos].nHashA == nHashA &&
lpTable[nHashPos].nHashB == nHashB) |
|
|
|
return nHashPos; |
|
|
else |
|
|
|
nHashPos = (nHashPos
+ 1) % nTableSize;
|
|
|
if
(nHashPos == nHashStart) |
|
|
|
break; |
|
}
return -1; //Error value |
} |
|
However
convoluted that code may look, the theory behind it isn't difficult.
It basically follows this process when looking to read a file:
- Compute the three hashes (offset hash
and two check hashes) and store them in variables.
- Move to the entry of the offset hash
- Is the entry unused? If so, stop the
search and return 'file not found'.
- Do the two check hashes match the check
hashes of the file we're looking for? If so, stop the search and
return the current entry.
- Move to the next entry in the list,
wrapping around to the beginning if we were on the last entry.
- Is the entry we just moved to the same
as the offset hash (did we look through the whole hash table?)? If so,
stop the search and return 'file not found'.
- Go back to step 3.
If
you were paying attention, you might have noticed from my explanation and
sample code is that the MPQ's hash table has to hold all the file entries
in the MPQ. But what do you think happens when every hash-table entry gets
filled? The answer might surprise you with its obviousness: you can't add
any more files. Several people have asked me why there is a limit (called
the file limit) on the number of files that can be put in an MPQ,
and if there is any way around this limit. Well, you already have the
answer to the first question. As for the second; no, you cannot get
around the file limit. For that matter, hash tables cannot even be resized
without remaking the entire MPQ from scratch. This is because the location
of each entry in the hash table may well change due to the resizing, and
we would not be able to derive the new position because the position is
the hash of the file name, and we may not know the file name.

|
|
|
|
Compression |
|
|
|
|
Problem:
You have a large program (let's say 50 megs), that you want to distribute
on the internet. But 50 megs is an awfully large download, and people
might not be interested in waiting the four and a half hours to download
this program.
Solution:
Compression. Compression is the art of representing a given amount
of data in a smaller space. There are literally hundreds of different
compression algorithms, each working in different ways. The actual
algorithm that MPQs use is the Data Compression Library, licensed from
PKWare (one of the leaders in applied compression), and is far too
complicated to explain here. Instead, I'll try to explain a more simple
compression algorithm for the purpose of illustration.
THIS SECTION IS
INCOMPLETE BECAUSE THE AUTHOR IS UNAVAILABLE

|
|
|
|
Encryption |
|
|
|
|
The need
for a system of protecting information from prying eyes is timeless.
People have been trying to privately pass information to others for
thousands of years. From handwritten letters carried on foot by the
couriers of ancient Greece, to Nazi submarine radio transmissions in World
War 2, to credit card transactions over the internet today, the ability to
assure that no unwanted people are reading your information is a
necessity.
This
complex art of protection is called encryption, and, while we don't
know when the first algorithm was devised, we do know that the number of
algorithms is too numerous to count. Everything from simple scrambling of
data, to transmutation, and even algorithms in which the decryption key
(sometimes called a password) is different from the encryption key
(in a method called asymmetric encryption), has been done time and
time again.
This
article certainly never claims, nor expects, to be a comprehensive
authority on encryption methods, but it is necessary that you understand
encryption is you intend to work with the MPQ format directly.
Let
us start with a simple encryption algorithm that was published in Basic
Lab Notes (adapted to C by myself; some variable names changed for
readability; comments removed):
void
EncryptBlock(void *lpvBlock, int nBlockLen, char
*lpszPassword)
{ |
|
int
nPWLen = strlen(lpszPassword), nCount = 0;
char *lpsPassBuff = (char *)_alloca(nPWLen);
memcpy(lpsPassBuff, lpszPassword, nPWLen);
for (int nChar = 0; nCount < nBlockLen; nCount++)
{ |
|
|
char cPW =
lpsPassBuff[nCount];
lpvBlock[nChar] ^= cPW;
lpsPassBuff[nCount] = cPW + 13;
nCount = (nCount + 1) % nPWLen; |
|
}
return; |
} |
|
Just
like the demonstration hash code, this code is extremely simple, and
shouldn't be used in an actual program where security is required. What it
does is simple, even if the code is cryptic (no pun intended). It goes
through the block to encrypt, XORing each byte with the corresponding byte
of the password. It then modifies the byte in the password by adding 13 to
it (13 was chosen because it is a prime number). This is done to make the
code's pattern more difficult to recognize. Thus, with this algorithm, the
string "encryption" (65 6E 63 72 79 70 74 69 6F 6E) encrypted
with a password of "MPQ" (4D 50 51) would be an unreadable
string (28 3E 32 28 24 2E 13 03 04 1A).
Now,
this algorithm is symmetrical. That means the key (or password) that is
used to encrypt a block is the same as the key to decrypt it. In fact,
because XOR is a symmetric operation, the exact same algorithm can
be used to decrypt as to encrypt. Note that most symmetric encryption
algorithms are not exactly symmetric, so they require the encrypt and
decrypt functions to be different.
Okay,
this is where things get dirty. If you want to work with the MPQ format
directly, you're gonna have to know the encryption algorithm, and that
makes it my job to teach it to you. The MPQ encryption algorithm is an
interesting hybrid of other encryption techniques. It creates an
encryption table (which is also used in the hash function), and uses a
file's encryption key to pick certain members out of the encryption table.
It then XORs the data to be encrypted with the members from the table.
Now, that's a pretty strange way to do things, so maybe some code will
show you that it is overcomplicated :-p. The following code generates the
cryptTable array of 0x500 longs:
void
prepareCryptTable()
{ |
|
unsigned
long seed = 0x00100001, index1 = 0, index2 = 0, i;
for(index1 = 0; index1 < 0x100; index1++)
{ |
|
|
for(index2
= index1, i = 0; i < 5; i++, index2 += 0x100)
{ |
|
|
|
unsigned long temp1,
temp2;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp1 = (seed & 0xFFFF) << 0x10;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp2 = (seed & 0xFFFF);
cryptTable[index2] = (temp1 | temp2); |
|
|
} |
|
} |
} |
|
Are
you kinda getting the feeling that Blizzard hired a disgruntled calculus
professor to write these algorithms? I know I am. Fortunately, it isn't a
big problem if you don't understand this code. If you want to work
directly with MPQs, you'll need these functions; you don't necessarily
have to understand them. Anyway, after the cryptTable is initialized, we
can decrypt MPQ data with the following function (don't expect me to
explain it to you; I don't want to know how it works myself!):
void
DecryptBlock(void *block, long length, unsigned long key)
{ |
|
unsigned
long seed = 0xEEEEEEEE, unsigned long ch;
unsigned long *castBlock = (unsigned long *)block;
// Round to longs
length >>= 2;
while(length-- > 0)
{ |
|
|
seed +=
stormBuffer[0x400 + (key & 0xFF)];
ch = *castBlock ^ (key + seed);
key = ((~key << 0x15) + 0x11111111) | (key >>
0x0B);
seed = ch + seed + (seed << 5) + 3;
*castBlock++ = ch; |
|
} |
} |
|
|
|
|
|
|
|
|
|