As hinted at in the previous post, the filename field in the WAD header (a 32-bit unsigned integer) is just a modified CRC32 of the actual filename (with some normalization applied).
But before we get to the actual algorithm, let's talk about two properties of file systems that will become important (we can view a WAD as a "filesystem in a file", and opening/closing files from that filesystem will involve looking up files by name).
Case Insensitive Filesystems
On Windows, the filesystem is usually case insensitive, which means
that the files hello.txt
and HELLO.TXT
are actually the same file,
if you try to open the uppercase version, and the lowercase version
exists, the lowercase version will be opened.
Filesystems can have other properties, such as case-preserving, which
means that even if they consider hello.txt
and HELLO.TXT
the same,
if you save as HeLLo.txt
, it will still preserve that case and not
normalize it to e.g. all-lowercase or all-uppercase.
On Linux and other UNIX-like operating systems, native filesystems are usually case-sensitive (but e.g. a mounted FAT32 partition is not). On macOS, it's case-insensitive by default, but it is an option when formatting a new partition (with HFS+ or APFS). So case sensitivity is a property of a filesystem rather than an operating system, but as always, defaults matter.
Path Separator
Filesystems are usually hierarchical, which means there can be
directories (also called folders), and to represent a "path" to a
certain file in the directory tree, a path separator character
is used and all intermediate directories just listed. On Windows,
this is the backslash (\
). Anywhere else (macOS, Linux, UNIX-like
operating systems) it is the slash (/
).
Normalizing File Names
One way to create a case insensitive "filesystem" is to just ignore the case when comparing filenames. But in case of WAD, we don't store the filename, just a hash of it. So the WAD filesystem is actually not case-preserving, it stores a "normalized" version of the filename (actually, it stores the hash of a "normalized" version of the filename, so not even the filename is preserved here, just a hash of it).
Compared to "real" filesystems, WAD files don't really have any kind of directory structure, each entry is in a "flat" list (this is not unusual, many file archive formats have the same property). The entry contains the full path (including path separator) of the file.
So in order to have case-insensitive matching and allow both backslash and slash to be used, the filename is normalized before hashing:
- Uppercase letters
A
toZ
are converted to lowercase (a
toz
) - Backslashes (
\
) are converted to forward slashes (/
)
For example, the filename
Data\Skins\default\skin.xml
would be normalized to
data/skins/default/skin.xml
This way, lookups can be case-insensitive and both backslash and slash can be used as path separators. Note that because only the hash of the filename is stored, we cannot know with which path separator and case the filename was stored -- this information is lost as part of the normalization before hashing.
Reversing Hashes
In general, it's not possible to calculate the plaintext (input) from a given hash. Hash algorithms map variable-length input into fixed-size hashes, so either a hash function is the best compression algorithm in the world, or there must be hash collisions (hint: it's not the best compression algorithm in the world). The smaller the hash, the more likely a collision is.
For CRC32, if you assume that you have 2^32 + 1 (4.294.967.297) different inputs, at least two of them must map to the same value (in practice you'll have hash collisions even before that).
So you can't calculate the filename from a CRC32 value alone. What you can do, however, is brute force the filenames. Which means just generate filenames based on "good guesses" and then see if they end up having a CRC32 value as the entries in a WAD file.
For this to work, you obviously need to know the hash algorithm used -- and in case of WAD in WipEout Pure and Pulse, it's simply CRC32 with a starting value of 0xFFFFFFFF.
A Known Hash Value
Let's assume that the filename from above
data/skins/default/skin.xml
has a hash value in the WAD file of:
0xf629ddbe
If you have a copy of WipEout Pure (any region, as the file should
appear in all of the releases, including the demos), you'll notice
that the little-endian representation of this value (be dd 29 f6
)
will appear quite early in both data.wad
and fe.wad
, which is
easy to check with a hex editor, or hexdump -C
).
Now let's see if we can "find" a function that maps the filename to this hash value.
CRC32
The CRC32 algorithm is widely available, implemented basically everywhere from filesystems to network protocols and other places. It is not cryptographically secure, but it's cheap to compute and small (only 4 bytes) to store.
Usually a CRC32 is used to calculate a checksum of the payload bytes ("file contents") to check if there have been any transmission errors. In the case of our WAD files, it is used to "compress" file paths to 4-byte values, which has some nice properties (easy lookup, fixed-size index table, ...).
CRC32 calculation for WAD files
By default, CRC32 starts out with an initial value of 0x00000000. This value is then bit-inverted (XOR'ed with 0xFFFFFFFF) to the starting value and each incoming byte is mixed into the hash to generate a new hash value. A the end, the result bits are again bit-inverted (XOR'ed with 0xFFFFFFFF) and the value is returned.
The algorithm for the WAD files is slightly different in that it skips the initial XOR'ing -- or in other words, the initial value is 0x00000000 AFTER the bit-invert step. This of course leads to different hashes (this is using Python 3):
>>> import zlib
>>> hex(zlib.crc32(b'data/skins/default/skin.xml'))
'0x13ac70f2'
Note that 0x13ac70f2
is obviously different from 0xf629ddbe
.
Luckily, Python's zlib.crc32()
function takes a second parameter
value
, the "starting value of the checksum". If we take into
account the bit-inversion, and we want the algorithm to start
with 0x00000000, we can just pass in 0xFFFFFFFF as the initial
value and it should do the same thing that the game does:
>>> hex(zlib.crc32(b'data/skins/default/skin.xml', 0xFFFFFFFF))
'0xf629ddbe'
And luckily, this is the same value that appears in the WAD file.
If you are using the C language, zlib
's crc32()
function also
takes the initial value as parameter.
Observing File Names
So now that we know how filenames are normalized and hashed, the next step would be to "solve" all hashes of the WAD files and get a list of filenames contained in it. Since the game looks up files in WAD files by their filename, one easy way would be to just log all filenames the game tries to look up in the WAD file.
To do this, I have taken the PPSSPP emulator and added Python scripting support to it.
The CRC32 function for hashing a filename in the UCES00001 (EUR)
release of WipEout Pure is at 0x0009cad0
relative to the default
load address (0x08804000
). The same function is also in all other
releases of the game (and its sequel), but at different offsets.
Here's a script to print each filename as it's looked up by the game:
import ppsspp
hash_filename_func = ppsspp.default_load_address + 0x0009cad0
def read_cstring_at(addr):
result = []
while True:
ch = ppsspp.read_memory_u8(addr)
if ch == 0:
break
result.append(ch)
addr += 1
return bytes(result)
def on_breakpoint_function(addr):
if addr == hash_filename_func:
a0 = ppsspp.get_reg_value(ppsspp.MIPS_REG_A0)
filename = read_cstring_at(a0)
print('Filename:', filename)
if ppsspp.get_game_id() == 'UCES00001':
ppsspp.add_breakpoint(hash_filename_func)
ppsspp.on_breakpoint(on_breakpoint_function)
You can download this script here (check the PR above for the PPSSPP feature that can load this script): uces00001_ppsspp_filename_hash_hook.py
If you run the game with this script loaded, soon you should see lots of filenames being printed that the game tries to open (not all of those files actually exist in WAD files, the game might just try to open a non-existent file to see if it exists or not).
If you don't want to build PPSSPP from source, use a Windows build
of PPSSPP (which includes the GUI Debugger) and set a breakpoint
at 0x088a0ad0
. When it is hit, look at the value of the A0 register
and point the memory viewer at this address. A C-style string should
be at that location containing the filename that is looked up.
Now we have a way of "observing" file names just by playing the game.
Brute-Forcing Missing Filenames
This still only gets us an incomplete list of filenames (because the game actually has to try to load the file for us to "see" it), but with some clever observations (e.g. if the files right before and after an "unknown" file happen to be in the same folder) we can find out missing names, too. Based on the contents of the file, we might be able to guess its filename extension, which means we only need to "brute-force" the basename and append the fixed extension.
For example, assume we have gathered a list of filenames, and now in between two known filenames (based on the order they appear in the WAD file), we have an unknown one:
Name CRC32 Size Filename
... ... ...
0x7088d603 33808 data/skins/hacker/images/screen_ja_jap.mip
0xdda15a82 33808 (UNKNOWN?)
0x48e16bef 33808 data/skins/hacker/images/screen_fr_us.mip
... ... ...
Not even looking at the contents, because of the file size we will
assume that file with the filename hash 0xdda15a82
will also have
the same type, so we further assume that the filename ends with
.mip
. And we can assume that the folder could also be
data/skins/hacker/images/
. In this case we could also assume that
the filename starts with screen_
, making the search space quite
small (with a limited alphabet for filenames, as there most likely
won't be special characters in there).
Can you figure out the filename of 0xdda15a82
?
Brute-Force Optimisations
This "has the same prefix" kind of brute forcing with the folder name can also be a great optimization opportunity: You can pre-calculate the intermediate CRC32 of the common prefix, and then just calculate the hash based on that, which can increase your hashes/second rate for brute forcing (unfortunately the suffix cannot be pre-calculated, but it's usually short anyway).
Also, if you know that e.g. data/skins/hacker/images/
has a file
named screen_fr_us.mip
, chances are that if you encounter a folder
named data/skins/123klan/images/
, it might make sense to also try
that basename. This also applies to other common file names such as
screen.xml
, stringtable.xml
and definition.xml
, which are
spread all over the WAD "folder structure".
Note that in some cases, the XML files mention files by filename,
so the contents of those files (plus strings in BOOT.BIN
) are
also good starting points for generating possible filename candidates.
Where To Go From Here
Across all releases of Pure (including the demos and DLCs), Pulse (including its demo and DLCs), I was able to reverse close to 3000 hashes to filenames, but am still missing some. The ability to give filenames to WAD contents will be useful for further exploration.
More on that in some other post.