Measuring Disassembly

We recently published a paper devoted entirely to x86/x64 disassembly. Among other things, we measured how often modern compilers generate complex corner cases, and how accurately disassemblers handle them. We released our complete data set, partly because there are too many results to fit in the paper, and partly so that others can compare their own results to ours.

Since we’ve received several questions about how to implement such a comparison, this post walks through an example. Assuming that you’ve already downloaded our data set and generated the ground truth (as detailed in ~/disasm/README in the provided VM), getting results for a new disassembler requires two steps.

  1. Write a script that parses the output of the disassembler you want to evaluate, and puts it into a format useful for further processing.
  2. Compare the disassembler output to the ground truth, using another script for the specific primitive you want to evaluate.

We give examples of both steps. Though at first it may look like lots of work to fit these scripts to your own evaluation requirements, this should actually be quite straightforward, since you can reuse much of the code verbatim regardless of the specific test setup.

Parsing disassembler output

Since every disassembler is different, we need a tailored script that parses the output of the disassembler we want to test and converts it into a normalized format for further processing. To keep things simple, the example presented here is based on objdump, but the same basic idea applies to any other disassembler. Here is the bash script we used in our paper to parse the instructions output by objdump for our SPEC CPU2006 test suite (the scripts for our other tests are nearly identical).

Lines 3-4 are simply lists of all the SPEC CPU2006 C and C++ test cases, which we later iterate over to disassemble each test. On lines 40-66, we call the main disassembly function (described next) with various parameters, for each of the compiler configurations we test.

The important bit is the disasm function declared on line 8. It starts by reading its parameters into named variables and making the directories where we will output our results. Then, on line 19, we begin a loop over all test cases for the given configuration.

For each test case, we loop over all the optimization levels (line 23), and determine the name of the binary for the current test case/optimization level, skipping an iteration and yielding a warning if the file does not exist (lines 25-33). Note that we assume a particular format for the directory and binary names. For instance, we assume that all the stripped C++ test binaries as compiled with gcc 5.1/64-bit are located in a directory called truth/gcc510-64/bin/stripped/C++, and that binaries generated with Visual Studio have the .exe extension. If you are using the ground truth provided by us, these requirements are all met.

So far, the entire script has been disassembler-agnostic; you can reuse those parts for any disassembler you want to test. Lines 34-35 are the only lines that need to be tailored to the specific disassembler that is being tested. These are the lines where the actual disassembler is run, and its output parsed and dumped to file. Moreover, both these lines are identical except that line 34 disassembles a binary with symbols, while line 35 disassembles a stripped binary. For our example, in both cases we simply run objdump, grep for all the disassembled addresses, give each address a 0x prefix, and write the results to an output file for the specific test case/configuration. We store instruction addresses instead of mnemonics because the addresses are much easier to compare to our ground truth (as discussed below).
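The address extraction step described above can also be sketched in Python. This is a minimal illustration, not the script from the paper: the function name `parse_objdump_addresses` and the sample listing are hypothetical, and the regular expression assumes the usual `objdump -d` line layout (address, colon, tab, hex bytes, tab, mnemonic).

```python
import re

# Assumed objdump -d layout: "  4004d6:\t55   \tpush   %rbp".
# Requiring a mnemonic field skips the byte-only continuation lines that
# objdump prints for long instructions, as well as symbol label lines.
INSN_LINE = re.compile(r"^\s*([0-9a-f]+):\t(?:[0-9a-f]{2} )+ *\t\S")


def parse_objdump_addresses(text):
    """Return the instruction addresses in text as 0x-prefixed strings."""
    addrs = []
    for line in text.splitlines():
        match = INSN_LINE.match(line)
        if match:
            addrs.append("0x" + match.group(1))
    return addrs


# Hypothetical listing, including one continuation line (4004e2).
sample = (
    "00000000004004d6 <main>:\n"
    "  4004d6:\t55                   \tpush   %rbp\n"
    "  4004d7:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "  4004da:\tc3                   \tretq\n"
    "  4004db:\t48 b8 88 77 66 55 44 \tmovabs $0x1122334455667788,%rax\n"
    "  4004e2:\t33 22 11 \n"
)

if __name__ == "__main__":
    for addr in parse_objdump_addresses(sample):
        print(addr)
```

Writing one address per line keeps the output trivially comparable, whichever disassembler produced it.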

As you can see, the script generalizes to other disassemblers in a very straightforward way. Some disassemblers, such as IDA Pro, have a more complicated user interface that we cannot just parse with grep. In such cases, we require that the disassembler is scriptable, and can be run in an automated way. For instance, for IDA Pro we created a simple IDA Python script that dumps all the primitives we are interested in to file, and then ran the script in the above loop using IDA Pro’s “autonomous mode” (requiring no user interaction). In our objdump example, we save only instruction output, but for disassemblers which support other primitives, these can be parsed and written to file in an analogous way.

Comparing to the ground truth

So far, we have created a bash script which uses our chosen disassembler (objdump) to disassemble all our test cases and save the instruction addresses to file. Now, we want to compare these addresses to the ground truth provided in our data set. For this, we use a Python script that takes two inputs: the ground truth file for a single test case (one of the ground truth files provided in our data set), and a disassembler output file as generated by the disassembler-specific bash script described above.

For instance, here is a result you might get when calling this script from the command line for a particular test case (the files in ins/ are generated by the disassembly script we created above).

The script compares instruction addresses (as found by the disassembler) to the ground truth. To create scripts for other primitives, please refer to the README file provided in our data set. It completely describes our ground truth format, which is designed to be easily parseable by both humans and machines. The README file also describes the output format of our comparison scripts.

Let’s take a look at the main function, at line 93. It consists of three phases.

  1. Read the instruction-level ground truth into the bounds dictionary (lines 98-113), using instruction addresses as key, and mapping them to a descriptor of the instruction type (as described in the ground truth format section in the README file).
  2. Load all the instruction addresses found by the disassembler into the ins dictionary (lines 118-127).
  3. Compare the ground truth (bounds) to the disassembled instructions (ins), counting true positives, false positives and false negatives and then printing out the statistics (lines 129-160).

The certain_code and certain_data functions are used to parse a ground truth instruction descriptor, and find out if a particular address is code or data. To this end, both of these functions rely on insmap_byte, which is just a utility function that returns the type of a particular byte in the descriptor. (Each descriptor describes a single instruction, which may consist of multiple bytes.)
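The three phases can be illustrated with a simplified sketch. It is not the comparison script from the data set: it assumes the ground truth has already been reduced to a plain set of instruction start addresses, so it ignores the per-byte descriptors and the certain_code/certain_data distinction entirely. The names `load_addresses` and `compare` are hypothetical.

```python
def load_addresses(lines):
    """Parse one hexadecimal address per line (with or without 0x) into a set."""
    addrs = set()
    for line in lines:
        line = line.strip()
        if line:
            addrs.add(int(line, 16))
    return addrs


def compare(truth, found):
    """Count true positives, false positives and false negatives.

    truth: addresses that the ground truth marks as instruction starts
           (ASSUMPTION: already filtered down to certain code).
    found: addresses reported by the disassembler.
    """
    truth, found = set(truth), set(found)
    tp = len(truth & found)   # reported and really an instruction
    fp = len(found - truth)   # reported, but not in the ground truth
    fn = len(truth - found)   # in the ground truth, but missed
    return tp, fp, fn


if __name__ == "__main__":
    truth = load_addresses(["0x4004d6", "0x4004d7", "0x4004da"])
    found = load_addresses(["0x4004d6", "0x4004d8", "0x4004da"])
    tp, fp, fn = compare(truth, found)
    print("TP=%d FP=%d FN=%d" % (tp, fp, fn))
```

Using sets makes each count a single set operation; the real script needs the descriptor helpers because not every byte in the ground truth is known to be code or data.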

As an example of how to evaluate a primitive other than instructions, suppose that you instead want to measure the correctness of function information. In that case, you would fill the bounds dictionary in a similar way, but load the function-level ground truth instead of the instruction-level ground truth. That is, instead of loading the lines that start with an '@' symbol (instruction descriptors), you would load the lines that start with 'F ' (an F followed by a space), and then compare the ground truth addresses to those found by the disassembler. In this case you won’t even need the certain_code/certain_data functions, and can just compare addresses directly. To get an intuitive feeling for how to parse each kind of primitive, it is a good idea to open one of the ground truth files and skim or grep through it.
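As a rough sketch of the function-level variant, the snippet below keeps only the lines starting with 'F ' and collects one address per line. The field layout is an assumption made for illustration: it treats the first field after the 'F' as a hexadecimal start address, which may not match the actual format, so check the README before relying on it. The name `load_function_starts` and the sample lines are hypothetical.

```python
def load_function_starts(lines):
    """Collect function start addresses from lines beginning with 'F '."""
    starts = set()
    for line in lines:
        if not line.startswith("F "):
            continue  # skip '@' instruction descriptors and everything else
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, nothing to parse
        # ASSUMPTION: fields[1] is the function start address in hex.
        starts.add(int(fields[1], 16))
    return starts


# Hypothetical input; the real ground truth lines may carry other fields.
example = [
    "@0x4004d6 (placeholder instruction descriptor)",
    "F 0x4004d6 (placeholder)",
    "F 0x400520 (placeholder)",
]

if __name__ == "__main__":
    truth = load_function_starts(example)
    found = {0x4004d6, 0x400600}  # hypothetical disassembler output
    print("TP=%d FP=%d FN=%d" % (
        len(truth & found), len(found - truth), len(truth - found)))
```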

Now that we can compare ground truth and disassembler output for one test case at a time, it would be convenient to automate the process of doing this for all test cases. For this, we use one last bash script, which is similar in structure to the script used for disassembly.

In essence, the output files created by this script combine the outputs of the comparison script for all test cases given a particular compiler/architecture configuration, one test case per line. As before, we have a loop over all test cases and optimization levels. This time, we have an additional loop at line 38, which goes over an array containing all disassemblers we want to evaluate. This way, we don’t have to manually run the comparison script for each disassembler. Note that the disassembler names, as specified in the array, need to match those used in the output file names generated by our disassembly script.
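The nested loop structure can be sketched as follows. This only illustrates the iteration and the missing-file warnings; the file naming scheme (`<test>-<opt>.<disassembler>`) and the function name `pending_comparisons` are hypothetical and do not reflect the layout used in the data set.

```python
import itertools
import os
import sys


def pending_comparisons(tests, opt_levels, disassemblers, out_dir,
                        exists=os.path.exists):
    """Yield (test, opt, disassembler, path) for every output file that exists.

    Missing files produce a warning on stderr and are skipped, mirroring
    the behaviour described for the bash script.
    """
    for test, opt, dis in itertools.product(tests, opt_levels, disassemblers):
        # ASSUMPTION: hypothetical naming scheme for disassembler output.
        path = os.path.join(out_dir, "%s-%s.%s" % (test, opt, dis))
        if not exists(path):
            print("warning: missing %s" % path, file=sys.stderr)
            continue
        yield test, opt, dis, path


if __name__ == "__main__":
    # Hypothetical test names and optimization levels.
    for item in pending_comparisons(["bzip2", "mcf"], ["O0", "O3"],
                                    ["objdump"], "ins"):
        print(item)
```

Passing the existence check as a parameter makes the loop easy to try out without any files on disk.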

The script first resets all output files (lines 24-29), and then begins its main loop. The main loop calls the comparison script for each possible configuration and saves the statistics to file, printing warnings for any test cases or ground truth files which cannot be found. After the script completes, you will find a collection of combined statistics files in the ins directory, with one file per combination of compiler/architecture/language/disassembler. The file contents should look something like this (truncated for brevity).