BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network

December 12, 2020

Binary code similarity detection, which answers whether two pieces of binary code are similar, has been used in a number of applications, such as vulnerability detection and automatic patching. Existing approaches face two hurdles in their efforts to achieve high accuracy and coverage: (1) the problem of source-binary code similarity detection, where the target code to be analyzed is in the binary format while the comparing code (with ground truth) is in source code format. Meanwhile, the source code is compiled to the comparing binary code with either a random or fixed configuration (e.g., architecture, compiler family, compiler version, and optimization level), which significantly increases the difficulty of code similarity detection; and (2) the existence of different degrees of code similarity. Less similar code is known to be more, if not equally, important in various applications such as binary vulnerability study. To address these challenges, we design BugGraph, which performs source-binary code similarity detection in two steps. First, BugGraph identifies the compilation provenance of the target binary and compiles the comparing source code to a binary with the same provenance. Second, BugGraph utilizes a new graph triplet-loss network on the attributed control flow graph to produce a similarity ranking. The experiments on four real-world datasets show that BugGraph achieves 90% and 75% TPR (true positive rate) for syntax equivalent and similar code, respectively, an improvement of 16% and 24% over state-of-the-art methods. Moreover, BugGraph is able to identify 140 vulnerabilities in six commercial firmware.

Read more