Skip to main content

Distributed Stream Algorithms

Researchers: Christopher Olston


Distributed Stream Algorithms to Detect Rapidly-spreading Cyber Attacks

Cyber attacks in the form of fast-spreading malcode are the number-one security threat against large networks. While many research efforts strive to develop new methods and technologies to secure networks, eradicating these attacks remains a challenging problem. We contend that network attack detection is fundamentally a data stream mining problem: large volumes of streaming data must be processed and mined in real-time to detect and suppress ongoing (and possibly zero-day) attacks. The data management community has studied the problem of real-time and distributed data stream processing extensively, and has developed many sophisticated algorithms. Computer security research to date, however, has largely ignored these results. In this work, we propose a research effort to design distributed streaming algorithms for the purpose of rapid and highly effective detection against fast-spreading attacks.

We believe that streaming processing technologies can give rise to powerful techniques to solve this long-standing security problem. There are two primary research thrusts in this work. The first one is developing powerful streaming algorithms to perform anomaly detection over a single observation point (e.g., local streams). The second research effort focuses on developing distributed algorithms to enable more rapid and more effective intrusion detection. More specifically, we propose the following tasks:

Task A: Local Stream Processing

We will develop new techniques for detecting and generating signatures for worm payloads by finding commonalities in multiple traffic streams in real time. We will investigate two classes of techniques: (1) identifying commonalities among multiple inbound traffic streams; (2) identifying commonalities between inbound and outbound traffic. Both can be used to construct signatures of previously-unknown worms.

Task B: Distributed Stream Processing

The goal of this task is distributed detection by correlating locally observed phenomena. There are clear and well-known advantages of performing detection network-wide, but the principle challenge lies in minimizing the amount of coordination and communication needed for such an approach. We have previously developed a top-k monitoring technique for distributed data streams. The goal is to continually report the top k elements of a large set (with k as a flexible parameter), where each element has an associated numerical value. The value associated with each element is expressed as a summation over n component values, with one component value for each of n monitoring points. Component values may fluctuate rapidly with time. As an example, elements may represent bit strings of length l, with frequency of occurrence at each monitoring point taking the role of component values. In this case our technique will track in real-time the top k most frequent bit strings, in terms of overall frequency aggregated across all monitoring points. Observe that simply finding the elements with the top k local values at each monitoring node, and subsequently shipping these lists to the coordinator node to be combined, is not correct because the overall top element may not occur in any local top-k set. Our algorithm produces correct results with little communication.

We observe that most global alarm-condition functions (such as detecting frequently occurring bit-strings, or an abundance of inbound/outbound payload replication) can be translated into distributed top-k monitoring tasks. The advantage of our distributed top-k monitoring approach is that measurement functions need only be invoked during polling or reporting actions, which only involve data items that dabble with entering the top-k set. Measurements of most items need never be transmitted beyond local detection points. Our performance measurements on the initial algorithm indicate that the frequency with which polling and reporting is required is low, and we will conduct extensive evaluation in the context of security applications to verify that overhead can be kept low enough for distributed monitoring. To this end we will collaborate with CyLab members Dawn Song and Chenxi Wang, both of whose research interests overlap with our own.

We will also assess the ability of our technique to resist various forms of attack, and explore ways to ensure robustness in the face of attack as well as evasion. Note that the salient feature of our distributed top-k monitoring approach is that it enables security threats to be detected even if they spread themselves thinly enough to evade detection at every individual observation point. If security detection mechanisms are not coordinated on a global scale, worm attacks will be able to hide “below the radar” of local detection mechanisms.