By: Dr. Murat Kantarcioglu, UT Dallas
As in many application domains, more and more data are being collected for cyber security. Examples include system logs, network packet traces, and account login information. Because the amount of collected data keeps growing, it has become impossible to analyze all of it manually to detect and prevent attacks. Data analytics are therefore being applied to large volumes of security monitoring data to detect cyber security incidents. For example, a report from Gartner claims that “Information security is becoming a big data analytics problem, where massive amounts of data will be correlated, analyzed and mined for meaningful patterns”. Many companies already offer data analytics solutions for this important problem. Of course, data analytics is a means to an end: the ultimate goal is to provide cyber security analysts with prioritized, actionable insights derived from big data.
Still, direct application of data analytics techniques to the cyber security domain may be misguided. Unlike most other application domains, cyber security applications often face adversaries who actively modify their strategies to launch new and unexpected attacks. The existence of such adversaries creates challenges that do not arise in other domains where data analytics tools are applied. First, attack instances are frequently modified to avoid detection, so a future dataset may no longer share the properties of current datasets. For example, attackers may alter spam e-mails by adding words that are typically associated with legitimate e-mails. Spammers can thus change the characteristics of spam e-mail significantly, and as often as they want. Second, when a previously unknown attack appears, data analytics techniques need to respond to it quickly and cheaply. For example, when a new type of ransomware appears in the wild, we may need to update existing data analytics techniques quickly to detect such attacks. Third, adversaries can be well funded and can invest heavily in camouflaging attack instances. For example, a sophisticated group of cyber attackers may use zero-day exploits (i.e., exploits of software bugs that were previously unknown) to create malware that evades all existing signature-based malware detection tools.
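The spam example can be made concrete with a toy sketch. The word weights, threshold, and messages below are hypothetical values made up for illustration; they are not taken from any real filter. The sketch assumes a simple linear scorer that adds up per-word weights and shows how padding a spam message with words typical of legitimate mail can push it under the detection threshold.

```python
# Toy illustration of evading a word-based spam filter by adding "legitimate" words.
# All weights below are hypothetical, chosen by hand for illustration only.
WEIGHTS = {
    "free": 2.0,      # positive weight: word is more typical of spam
    "winner": 2.5,
    "prize": 1.5,
    "meeting": -2.0,  # negative weight: word is more typical of legitimate mail
    "agenda": -1.5,
    "report": -1.5,
}
THRESHOLD = 0.0  # assumed decision threshold: scores above it are flagged as spam


def spam_score(message: str) -> float:
    """Sum the weights of the known words; unknown words contribute nothing."""
    return sum(WEIGHTS.get(word, 0.0) for word in message.lower().split())


def is_spam(message: str) -> bool:
    """Flag the message when its score exceeds the threshold."""
    return spam_score(message) > THRESHOLD


original = "free prize winner"
# The attacker keeps the payload and appends words associated with legitimate mail.
evasive = original + " meeting agenda report meeting"

print(is_spam(original), spam_score(original))
print(is_spam(evasive), spam_score(evasive))
```

The payload of the message is unchanged, yet the classifier's decision flips. A model trained only on yesterday's spam has no defense against this kind of shift unless the adversary's behavior is taken into account when the model is built.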
Consequently, data analytics techniques for cyber security must address these challenges. They need to be resilient to the adaptive behavior of adversaries and able to quickly detect previously unknown attack instances. Recently, various adversarial data mining techniques (including techniques we developed with Army Research Office funding) have been proposed to counter adversaries’ adaptive behavior. For example, in our earlier work, we developed a game-theoretic framework for discovering an optimal set of attributes with which to build machine learning models against active adversaries. In other work, we modified a popular machine learning tool, the support vector machine (SVM), to make it more resistant to adversarial attacks. The attack models are defined in terms of adversaries’ capabilities to modify data. Our solutions minimize the worst-case loss under these attack models, and our results show that such tailored tools could be more resistant to adversarial behavior than existing SVM alternatives. Of course, more research is needed to address the many practical issues that are emerging.
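To give a feel for what "worst-case loss" means here, the following is a minimal sketch under a simplified attack model that is an assumption of this example, not the attack models from the work described above: the adversary may shift each feature of a data point by at most `eps`. For a linear classifier with hinge loss, the worst case under this assumed model has a simple closed form, which the sketch checks against a brute-force search over the corners of the perturbation box. All function names and numbers are hypothetical.

```python
from itertools import product


def hinge(w, b, x, y):
    """Standard hinge loss of a linear classifier on one labeled point (y is +1 or -1)."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def worst_case_hinge(w, b, x, y, eps):
    """Closed-form worst-case hinge loss when each feature can move by at most eps.

    The adversary's best move reduces the margin by eps * ||w||_1.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin + eps * sum(abs(wi) for wi in w))


def brute_force_worst_case(w, b, x, y, eps):
    """Sanity check: hinge loss is convex, so its maximum over the box is at a corner."""
    return max(
        hinge(w, b, [xi + di for xi, di in zip(x, delta)], y)
        for delta in product((-eps, eps), repeat=len(x))
    )


# Hypothetical classifier and data point, chosen by hand for illustration.
w, b = [1.0, -2.0], 0.5
x, y = [3.0, 0.5], 1
eps = 1.0

print(hinge(w, b, x, y))                        # loss on the unmodified point
print(worst_case_hinge(w, b, x, y, eps))        # loss after the adversary's best move
print(brute_force_worst_case(w, b, x, y, eps))  # should agree with the closed form
```

A point that is classified with zero loss today can incur a substantial loss once the adversary moves it. Training against the worst-case quantity rather than the ordinary loss is one way to build this anticipation into the model, at the cost of being more conservative on unmodified data.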
Still, these developments show how closely cyber security and data analytics interact, and why we need novel tools and techniques tailored to cyber security. In addition, we need more cyber security data analysts who are proficient in both data analytics techniques and cyber security. At UT Dallas, we are already developing courses to teach these topics. For example, we have a seminar course on the topic and are developing a regular course that could be offered at the master's level. We believe that, as in all other domains, knowledge extracted from big data will be crucial in winning against future cyber attacks.