Filename hashing in WipEout Pure/Pulse WAD files

2020-12-24

As hinted at in the previous post, the filename field in the WAD header (a 32-bit unsigned integer) is just a modified CRC32 of the actual filename (with some normalization applied).

But before we get to the actual algorithm, let's talk about two properties of file systems that will become important (we can view a WAD as a "filesystem in a file", and opening/closing files from that filesystem will involve looking up files by name).

Case Insensitive Filesystems

On Windows, the filesystem is usually case insensitive, which means that the files hello.txt and HELLO.TXT are actually the same file, if you try to open the uppercase version, and the lowercase version exists, the lowercase version will be opened.

Filesystems can have other properties, such as case-preserving, which means that even if they consider hello.txt and HELLO.TXT the same, if you save as HeLLo.txt, it will still preserve that case and not normalize it to e.g. all-lowercase or all-uppercase.

On Linux and other UNIX-like operating systems, native filesystems are usually case-sensitive (but e.g. a mounted FAT32 partition is not). On macOS, it's case-insensitive by default, but it is an option when formatting a new partition (with HFS+ or APFS). So case sensitivity is a property of a filesystem rather than an operating system, but as always, defaults matter.

Path Separator

Filesystems are usually hierarchical, which means there can be directories (also called folders), and to represent a "path" to a certain file in the directory tree, a path separator character is used and all intermediate directories just listed. On Windows, this is the backslash (\). Anywhere else (macOS, Linux, UNIX-like operating systems) it is the slash (/).

Normalizing File Names

One way to create a case insensitive "filesystem" is to just ignore the case when comparing filenames. But in case of WAD, we don't store the filename, just a hash of it. So the WAD filesystem is actually not case-preserving, it stores a "normalized" version of the filename (actually, it stores the hash of a "normalized" version of the filename, so not even the filename is preserved here, just a hash of it).

Compared to "real" filesystems, WAD files don't really have any kind of directory structure, each entry is in a "flat" list (this is not unusual, many file archive formats have the same property). The entry contains the full path (including path separator) of the file.

So in order to have case-insensitive matching and allow both backslash and slash to be used, the filename is normalized before hashing:

Uppercase letters A to Z are converted to lowercase (a to z)
Backslashes (\) are converted to forward slashes (/)

For example, the filename

Data\Skins\default\skin.xml

would be normalized to

data/skins/default/skin.xml

This way, lookups can be case-insensitive and both backslash and slash can be used as path separators. Note that because only the hash of the filename is stored, we cannot know with which path separator and case the filename was stored -- this information is lost as part of the normalization before hashing.

Reversing Hashes

In general, it's not possible to calculate the plaintext (input) from a given hash. Hash algorithms map variable-length input into fixed-size hashes, so either a hash function is the best compression algorithm in the world, or there must be hash collisions (hint: it's not the best compression algorithm in the world). The smaller the hash, the more likely a collision is.

For CRC32, if you assume that you have 2^32 + 1 (4.294.967.297) different inputs, at least two of them must map to the same value (in practice you'll have hash collisions even before that).

So you can't calculate the filename from a CRC32 value alone. What you can do, however, is brute force the filenames. Which means just generate filenames based on "good guesses" and then see if they end up having a CRC32 value as the entries in a WAD file.

For this to work, you obviously need to know the hash algorithm used -- and in case of WAD in WipEout Pure and Pulse, it's simply CRC32 with a starting value of 0xFFFFFFFF.

A Known Hash Value

Let's assume that the filename from above

data/skins/default/skin.xml

has a hash value in the WAD file of:

0xf629ddbe

If you have a copy of WipEout Pure (any region, as the file should appear in all of the releases, including the demos), you'll notice that the little-endian representation of this value (be dd 29 f6) will appear quite early in both data.wad and fe.wad, which is easy to check with a hex editor, or hexdump -C).

Now let's see if we can "find" a function that maps the filename to this hash value.

CRC32

The CRC32 algorithm is widely available, implemented basically everywhere from filesystems to network protocols and other places. It is not cryptographically secure, but it's cheap to compute and small (only 4 bytes) to store.

Usually a CRC32 is used to calculate a checksum of the payload bytes ("file contents") to check if there have been any transmission errors. In the case of our WAD files, it is used to "compress" file paths to 4-byte values, which has some nice properties (easy lookup, fixed-size index table, ...).

CRC32 calculation for WAD files

By default, CRC32 starts out with an initial value of 0x00000000. This value is then bit-inverted (XOR'ed with 0xFFFFFFFF) to the starting value and each incoming byte is mixed into the hash to generate a new hash value. A the end, the result bits are again bit-inverted (XOR'ed with 0xFFFFFFFF) and the value is returned.

The algorithm for the WAD files is slightly different in that it skips the initial XOR'ing -- or in other words, the initial value is 0x00000000 AFTER the bit-invert step. This of course leads to different hashes (this is using Python 3):

>>> import zlib
>>> hex(zlib.crc32(b'data/skins/default/skin.xml'))
'0x13ac70f2'

Note that 0x13ac70f2 is obviously different from 0xf629ddbe. Luckily, Python's zlib.crc32() function takes a second parameter value, the "starting value of the checksum". If we take into account the bit-inversion, and we want the algorithm to start with 0x00000000, we can just pass in 0xFFFFFFFF as the initial value and it should do the same thing that the game does:

>>> hex(zlib.crc32(b'data/skins/default/skin.xml', 0xFFFFFFFF))
'0xf629ddbe'

And luckily, this is the same value that appears in the WAD file.

If you are using the C language, zlib's crc32() function also takes the initial value as parameter.

Observing File Names

So now that we know how filenames are normalized and hashed, the next step would be to "solve" all hashes of the WAD files and get a list of filenames contained in it. Since the game looks up files in WAD files by their filename, one easy way would be to just log all filenames the game tries to look up in the WAD file.

To do this, I have taken the PPSSPP emulator and added Python scripting support to it.

The CRC32 function for hashing a filename in the UCES00001 (EUR) release of WipEout Pure is at 0x0009cad0 relative to the default load address (0x08804000). The same function is also in all other releases of the game (and its sequel), but at different offsets.

Here's a script to print each filename as it's looked up by the game:

import ppsspp

hash_filename_func = ppsspp.default_load_address + 0x0009cad0

def read_cstring_at(addr):
    result = []
    while True:
        ch = ppsspp.read_memory_u8(addr)
        if ch == 0:
            break
        result.append(ch)
        addr += 1
    return bytes(result)

def on_breakpoint_function(addr):
    if addr == hash_filename_func:
        a0 = ppsspp.get_reg_value(ppsspp.MIPS_REG_A0)
        filename = read_cstring_at(a0)
        print('Filename:', filename)

if ppsspp.get_game_id() == 'UCES00001':
    ppsspp.add_breakpoint(hash_filename_func)
    ppsspp.on_breakpoint(on_breakpoint_function)

You can download this script here (check the PR above for the PPSSPP feature that can load this script): uces00001_ppsspp_filename_hash_hook.py

If you run the game with this script loaded, soon you should see lots of filenames being printed that the game tries to open (not all of those files actually exist in WAD files, the game might just try to open a non-existent file to see if it exists or not).

If you don't want to build PPSSPP from source, use a Windows build of PPSSPP (which includes the GUI Debugger) and set a breakpoint at 0x088a0ad0. When it is hit, look at the value of the A0 register and point the memory viewer at this address. A C-style string should be at that location containing the filename that is looked up.

Now we have a way of "observing" file names just by playing the game.

Brute-Forcing Missing Filenames

This still only gets us an incomplete list of filenames (because the game actually has to try to load the file for us to "see" it), but with some clever observations (e.g. if the files right before and after an "unknown" file happen to be in the same folder) we can find out missing names, too. Based on the contents of the file, we might be able to guess its filename extension, which means we only need to "brute-force" the basename and append the fixed extension.

For example, assume we have gathered a list of filenames, and now in between two known filenames (based on the order they appear in the WAD file), we have an unknown one:

Name CRC32  Size   Filename
...         ...    ...
0x7088d603  33808  data/skins/hacker/images/screen_ja_jap.mip
0xdda15a82  33808  (UNKNOWN?)
0x48e16bef  33808  data/skins/hacker/images/screen_fr_us.mip
...         ...    ...

Not even looking at the contents, because of the file size we will assume that file with the filename hash 0xdda15a82 will also have the same type, so we further assume that the filename ends with .mip. And we can assume that the folder could also be data/skins/hacker/images/. In this case we could also assume that the filename starts with screen_, making the search space quite small (with a limited alphabet for filenames, as there most likely won't be special characters in there).

Can you figure out the filename of 0xdda15a82?

Brute-Force Optimisations

This "has the same prefix" kind of brute forcing with the folder name can also be a great optimization opportunity: You can pre-calculate the intermediate CRC32 of the common prefix, and then just calculate the hash based on that, which can increase your hashes/second rate for brute forcing (unfortunately the suffix cannot be pre-calculated, but it's usually short anyway).

Also, if you know that e.g. data/skins/hacker/images/ has a file named screen_fr_us.mip, chances are that if you encounter a folder named data/skins/123klan/images/, it might make sense to also try that basename. This also applies to other common file names such as screen.xml, stringtable.xml and definition.xml, which are spread all over the WAD "folder structure".

Note that in some cases, the XML files mention files by filename, so the contents of those files (plus strings in BOOT.BIN) are also good starting points for generating possible filename candidates.

Where To Go From Here

Across all releases of Pure (including the demos and DLCs), Pulse (including its demo and DLCs), I was able to reverse close to 3000 hashes to filenames, but am still missing some. The ability to give filenames to WAD contents will be useful for further exploration.