Perfect is the enemy of good: Improving anti-virus software with approximations

Daniel Tkacik

Dec 21, 2017

Anti-virus software, that tiny icon on most computer screens that provides a huge service as the protective policeman of computer systems, is on its way to receiving a significant improvement.

CyLab’s Kyle Soska, a Ph.D. student in the department of Electrical and Computer Engineering, spent a summer at anti-virus company Symantec as a Symantec Research Labs Fellow, working towards improving the performance of its anti-virus software.

In a study presented at the SIGKDD International Conference on Knowledge Discovery and Data Mining in Halifax, Nova Scotia, Soska and colleagues outline an automated technique to group files with the app they comprise, which would increase the accuracy of anti-virus software.

"This all began when Symantec realized something really convenient about the mobile space: everything is packed into an application or an app," Soska said. "This is nice because you only have to make a security decision about an app instead of the individual files that make up the app."

If I tell the program to run for two milliseconds, I can find eight of the 10 closest neighbors... If I sacrifice just two of the top 10, I get huge savings in computing time.
Kyle Soska, Ph.D. student, Electrical and Computer Engineering

Currently, most virus scanners only understand how files pair with apps for the top 50 or so most popular software companies (e.g. Microsoft, Google, Adobe, etc.). But for the tens of thousands of remaining pieces of less-popular software used today, scanners are forced to check for viruses on a file-by-file basis, which is cumbersome and yields lower confidence in whether an application is malicious or not.

"Our study demonstrates an automatic approach that can hopefully get us from the top 50 to the top 5,000 applications by making it automatic and working at application-levels," Soska says.

To do this, Soska’s team needed to develop a way of comparing files to one another in order to infer relationships (e.g. is File A similar to File B?). Files that are alike can then be clustered. However, this clustering becomes very challenging as the number of files increases. For example, finding a single file’s 10 closest "neighbors" in similarity may take tens of seconds.

"If you have n files in your dataset, then to find each file’s nearest neighbors you have to do n of these comparisons," says Soska. "If you have billions of files, clustering in this way would take roughly 3,000 years."

This is where Soska’s team introduced approximation techniques to the process.

"If I tell the program to run for two milliseconds, I can find eight of the 10 closest neighbors," Soska says. "This turns out to be good enough in terms of clustering files. If I sacrifice just two of the top 10, I get huge savings in computing time."

In the future, Soska says, Symantic will likely use these findings to improve its anti-virus software.

Other authors of the study included CyLab’s Nicolas Christin, a professor in the departments of Engineering and Public Policy and the Institute for Software Research, and Symantec researchers Chris Gates and Kevin Roundy.