Static analysis tools are useful for finding serious programming defects and security vulnerabilities in source and binary code. Most static analysis checkers work by searching the code for known patterns or conditions that cause the program to fail, or that indicate violations of coding standards. The set of defects that such tools can find is therefore limited to the problems anticipated by the tool's designers.
Newer, more advanced tools that use machine learning techniques can automatically determine new properties to check: they deduce normal usage patterns and then look for parts of the code that deviate from those patterns in significant ways, on the assumption that such anomalous code is incorrect. This approach was previously limited in scope to the body of code under analysis, but the ever-increasing volume of open source software, combined with advances in machine learning, means that it is now possible to deduce normal usage by mining very large software collections. The technique is particularly effective for finding anomalies in API usage, especially in clients of popular operating system interfaces and open source libraries.
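The core idea can be illustrated with a minimal sketch (not the paper's actual implementation; the function names, the n-gram-style representation of call sites, and the support threshold are all illustrative assumptions): mine the frequency of each API usage pattern across a corpus, then flag call sites whose pattern is rare relative to a dominant one.

```python
from collections import Counter

def mine_usage_patterns(call_sequences):
    """Count how often each API-call pattern appears across a corpus.

    Each call site is represented as a sequence of call names
    (a simplifying assumption for illustration).
    """
    return Counter(tuple(seq) for seq in call_sequences)

def find_anomalies(call_sequences, min_support=0.05):
    """Flag patterns whose relative frequency falls below min_support.

    The assumption: if nearly all call sites follow one pattern, the
    rare deviants (e.g., a missing check before a call) are likely bugs.
    """
    counts = mine_usage_patterns(call_sequences)
    total = sum(counts.values())
    return [pattern for pattern, count in counts.items()
            if count / total < min_support]

# Toy corpus: 99 call sites check the result of open() before read();
# one call site omits the check and is flagged as anomalous.
corpus = [["open", "check_null", "read"]] * 99 + [["open", "read"]]
print(find_anomalies(corpus))  # → [('open', 'read')]
```

Real tools use far richer program representations and statistical models, but the principle is the same: the dominant usage defines the norm, and significant deviations become candidate defect reports.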
This paper describes how the technique works and shows that it was able to find several previously unknown bugs in high-profile software systems, with high precision (i.e., few false positives).