Dr. Latifur Khan’s Data Mining Research Group Makes Advances in Data Stream Mining

Dr. Latifur Khan, Professor of Computer Science at UT Dallas and Director of the UT Dallas Data Management and Analytics, and his research group have long conducted research in data science. One of their research focuses is data stream mining, where they have been working on a number of projects supported by grants from the National Science Foundation (NSF), the Air Force Office of Scientific Research (AFOSR), and the National Aeronautics and Space Administration (NASA). Dr. Khan’s research group also collaborates with data mining researchers at the University of Illinois Urbana-Champaign and the IBM Thomas J. Watson Research Center. As part of their collaborative effort, they have proposed a novel class detection technique that automatically detects novel classes in data streams. This technique has been granted a US patent and is widely used by the data mining community.

The research in data stream mining serves as the dissertation topic of two of Dr. Khan’s students, Swarup Chandra and Ahsanul Haque, and together the three of them have published in the proceedings of a number of leading computer science conferences. In 2016 alone, they published the following papers in this area:

Ahsanul Haque, Latifur Khan, and Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. Proceedings of The Thirtieth AAAI Conference on Artificial Intelligence (AAAI).
Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, and Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE).
Swarup Chandra, Ahsanul Haque, Latifur Khan, and Charu C. Aggarwal: An Adaptive Framework for Multistream Classification. Proceedings of The Conference on Information and Knowledge Management (CIKM).
Swarup Chandra, Ahsanul Haque, Latifur Khan, and Charu C. Aggarwal: Efficient Sampling-based Kernel Mean Matching. Proceedings of The IEEE International Conference on Data Mining (ICDM).
Ahsanul Haque, Zhuoyi Wang, Swarup Chandra, Latifur Khan, Charu C. Aggarwal, and Yupeng Gao: Sampling based Distributed Kernel Mean Matching using Spark. Proceedings of the IEEE International Conference on Big Data (IEEE Big Data).

The first paper, “SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream,” introduces an efficient semi-supervised framework that uses change detection on classifier confidence to detect concept drift and to determine chunk boundaries dynamically. It also addresses the concept evolution problem by detecting outliers that have strong cohesion among themselves. In the second paper, “Efficient handling of concept drift and concept evolution over Stream Data,” the authors present an efficient framework that is based on the same principle as SAND but exploits dynamic programming and executes the change detection module selectively. Moreover, they provide a theoretical justification of the confidence calculation and show the effect of a concept drift on subsequent confidence scores. In the third paper, “An Adaptive Framework for Multistream Classification,” the authors present a novel stream classification problem setting involving two independent non-stationary data generating processes, relaxing assumptions made in conventional data stream classification. In the fourth paper, “Efficient Sampling-based Kernel Mean Matching,” a sampling-based technique is proposed that provides a faster way to correct for data shift between two distributions. In the fifth paper, an efficient distributed kernel mean matching algorithm is proposed using Spark, a state-of-the-art distributed framework.
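
To give a feel for how classifier confidence can drive chunk boundaries, here is a minimal Python sketch. It is not the statistical change detection test used in SAND; it simply compares the mean confidence of two adjacent sliding windows, and the function name and the `window` and `drop` parameters are hypothetical choices made for illustration.

```python
from statistics import mean


def find_chunk_boundaries(confidences, window=20, drop=0.15):
    """Return stream positions where a new chunk would start.

    Illustrative rule (an assumption, not the SAND test): a boundary is
    declared when the mean confidence of the latest `window` scores is
    at least `drop` lower than the mean of the `window` scores before.
    """
    boundaries = []
    start = 0  # index where the current chunk began
    for i in range(len(confidences)):
        seen = i + 1 - start  # scores observed in the current chunk
        if seen < 2 * window:
            continue  # not enough data yet to compare two windows
        older = confidences[i + 1 - 2 * window : i + 1 - window]
        recent = confidences[i + 1 - window : i + 1]
        if mean(older) - mean(recent) >= drop:
            boundaries.append(i + 1)  # next instance opens a new chunk
            start = i + 1
    return boundaries


if __name__ == "__main__":
    # Synthetic stream: confident predictions, then a sudden drop.
    stream = [0.9] * 50 + [0.5] * 50
    print(find_chunk_boundaries(stream))
```

In this sketch the chunk size is not fixed in advance: a chunk ends only when confidence falls noticeably, which mirrors the idea of dynamic chunk boundaries described above, and a stream with stable confidence produces no boundaries at all.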

The methods proposed in the fourth and fifth papers have paved the way for the efficient data shift adaptation required for multi-stream classification. Prof. Khan’s group is now working toward a more efficient method for multi-stream classification with additional features, such as detecting new data patterns over time.
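
For readers unfamiliar with kernel mean matching (KMM), the data shift adaptation technique named in the fourth and fifth paper titles, the following Python sketch shows the basic idea on one-dimensional data: training instances receive weights so that their weighted kernel mean approaches that of the test data. This is plain KMM solved by projected gradient descent, not the sampling-based or distributed variants from the papers. It keeps only the box constraint on the weights and omits the usual constraint on their sum; the function names and the `sigma`, `upper`, and `steps` parameters are assumptions made for illustration.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalar values."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def kmm_weights(train, test, sigma=1.0, upper=10.0, steps=500):
    """Estimate importance weights for `train` relative to `test`.

    Minimizes 0.5 * b^T K b - kappa^T b subject to 0 <= b_i <= upper
    using projected gradient descent (a simplification: the sum
    constraint of standard KMM is left out of this sketch).
    """
    n_tr, n_te = len(train), len(test)
    # Kernel matrix over training points.
    K = [[gaussian_kernel(a, b, sigma) for b in train] for a in train]
    # Scaled kernel similarity of each training point to the test set.
    kappa = [
        n_tr / n_te * sum(gaussian_kernel(a, t, sigma) for t in test)
        for a in train
    ]
    beta = [1.0] * n_tr
    # Entries of K are at most 1, so its largest eigenvalue is at most
    # n_tr; a step size of 1 / n_tr keeps the descent stable.
    lr = 1.0 / n_tr
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_tr)) - kappa[i]
            for i in range(n_tr)
        ]
        # Gradient step followed by projection onto [0, upper].
        beta = [min(upper, max(0.0, b - lr * g)) for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    train = [0.0, 1.0, 2.0, 3.0, 4.0]
    test = [3.0, 3.5, 4.0, 4.0]  # test data concentrated near 3 to 4
    print(kmm_weights(train, test))
```

Training points that lie close to the test data end up with larger weights than those far from it, which is how a classifier trained on one distribution can be adapted to another. The cost of building and optimizing over the full kernel matrix grows quickly with the training set size, which is the kind of expense that sampling-based and distributed approaches aim to reduce.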

ABOUT THE UT DALLAS COMPUTER SCIENCE DEPARTMENT

The UT Dallas Computer Science program is one of the largest Computer Science departments in the United States, with over 2,100 bachelor’s-degree students, more than 1,000 master’s students, 150 PhD students, and 86 faculty members as of Fall 2016. With The University of Texas at Dallas’ unique history of starting as a graduate institution, the CS Department is built on a legacy of valuing innovative research and providing advanced training for software engineers and computer scientists.