Research in Big Data is Big at UT Dallas

At the weekly Computer Science mixer on Friday, 4/18/2014, Luis Mojica, Tahrima Rahman, David Smith, Somdeb Sarkhel, Vibhav Gogate, and Deepak Venugopal (pictured above) described their research related to Big Data. The camaraderie among the PhD students and their adviser, Dr. Vibhav Gogate, was evident on the first page of their presentation, where the adviser was described as the one who “gets all the credit and does none of the work!” Each graduate student described his or her own research, which is summarized below.

Deepak Venugopal (dxv021000@utdallas.edu)

Research Title: Advancing the state-of-the-art for inference in Probabilistic Graphical Models (PGM) and Markov Logic (MLN)

Description: This research involves three components: developing advanced inference algorithms for PGMs, leveraging symmetries in first-order structure (lifted inference) for scalable inference in MLNs, and applying Markov Logic to extract structured information from unstructured text.

Examples: This research is useful in application domains where scalability is a key requirement, such as natural language processing tasks. Work is currently being done to automatically extract biomedical events from text data using an MLN-based framework.

Advantages & Uses: One research goal is lifted inference, which is essential for scaling up inference in MLNs to the level where it is useful for large real-world domains. To help in this effort, a software system is being developed to implement a suite of state-of-the-art lifted inference algorithms.

David Smith (dbs014200@utdallas.edu)

Research Title: Search-based methods for Lifted Weighted Model Counting

Description: The current project applies a general search-based inference framework to statistical relational models. The main goal is to build a theory that allows a rigorous analysis of the complexity of exact inference on a wide class of statistical relational models. The aim is to use the theory to formulate effective search-based approximate inference algorithms that make the most of the resources at hand.

Examples: This work has many potential applications. Statistical relational models are useful in areas that require models with large numbers of variables that may be queried, many of which may be indistinguishable from one another. Natural language processing tasks are a natural fit. Another viable application is inference over social networks.

Advantages & Uses: The advantage of lifted inference is that if the models contain large numbers of indistinguishable variables, it has the potential to scale to much larger problems than traditional propositional models, and it is able to do so with a much more compact representation of the model.
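
The scaling benefit of treating indistinguishable variables as a group can be illustrated with a toy example. The sketch below is not the search-based algorithm described above. It assumes a hypothetical model of n interchangeable binary variables whose weight depends only on how many of them are true (the `weight` function and its `theta` parameter are invented for illustration). The propositional version enumerates all 2^n assignments, while the lifted version sums over only n + 1 counts.

```python
from itertools import product
from math import comb, exp


def weight(k, n, theta=0.3):
    """Hypothetical potential: depends only on the number k of true variables."""
    return exp(theta * k * (n - k) / n)


def z_propositional(n):
    """Partition function by enumerating all 2**n assignments."""
    return sum(weight(sum(x), n) for x in product((0, 1), repeat=n))


def z_lifted(n):
    """Partition function by grouping assignments with the same count k.

    There are comb(n, k) assignments with exactly k true variables, and
    all of them have the same weight, so only n + 1 terms are needed.
    """
    return sum(comb(n, k) * weight(k, n) for k in range(n + 1))


if __name__ == "__main__":
    print(z_propositional(10), z_lifted(10))  # the two values agree
    print(z_lifted(200))  # far beyond the reach of plain enumeration
```

Both functions compute the same quantity, but only the lifted one remains practical as n grows, which is the kind of saving that symmetry makes possible.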

Future Plans: One plan for the work is to define the semantics of the lifted search space in terms of a programming language. Such a language would allow a user to specify a model in more detail than current statistical relational models, in that the model specification would describe not just the model but also (on a high level) how to perform inference on the model. Thus the language would encourage users to learn and understand the inference procedure, rather than treating it as a black box.

Somdeb Sarkhel (somdeb.sarkhel@utdallas.edu)

Research Title: Making optimization problems for Markov Logic Networks (MLN) scalable

Description: This research exploits symmetries present in MLNs to develop efficient inference techniques for one of the optimization problems for MLNs, the maximum a posteriori (MAP) problem.

Examples: The MAP problem is the most widely used inference problem for PGMs. People use it to infer the most probable cause for an observation (e.g., which genes are causing cancer). This research can be applied to many bioinformatics or natural language processing tasks.
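
As a rough illustration of what a MAP query asks, the sketch below finds the most probable joint cause for an observation in a tiny hand-made model. It uses brute-force enumeration rather than the lifted techniques of this research, and every name and probability in it (`P_GENE_A`, `P_GENE_B`, `P_OBS`) is a made-up assumption.

```python
from itertools import product

# Hypothetical prior probabilities that each gene is active.
P_GENE_A = 0.1
P_GENE_B = 0.3

# Hypothetical probability of the observation given (gene_a, gene_b).
P_OBS = {(0, 0): 0.01, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}


def joint(a, b, obs):
    """Joint probability of one full assignment (a, b, obs)."""
    pa = P_GENE_A if a else 1 - P_GENE_A
    pb = P_GENE_B if b else 1 - P_GENE_B
    po = P_OBS[(a, b)] if obs else 1 - P_OBS[(a, b)]
    return pa * pb * po


def map_assignment(obs):
    """Return the (a, b) pair that maximizes the joint probability given obs."""
    return max(product((0, 1), repeat=2),
               key=lambda ab: joint(ab[0], ab[1], obs))


if __name__ == "__main__":
    print(map_assignment(1))  # most probable cause when the observation is present
```

Enumeration is exponential in the number of unobserved variables, which is why scalable MAP techniques matter for models of realistic size.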

Advantages & Uses: Lifted techniques can make inference in large problems much more scalable compared to traditional inference techniques. This helps to model complex problems more accurately.

Future Plans: Investigate other optimization problems, such as the marginal MAP or divergent MAP problems. Also, apply this research to a suitable application area.

Tahrima Rahman (tahrima.rahman@utdallas.edu)

Research Title: Learning Tractable Probabilistic Graphical Models (PGM) from Data, Finding new similarity measures in AND/OR search spaces.

Description: This research includes both learning and inference tasks for PGMs. A project on learning cutset networks from data was recently completed. Cutset networks are tractable probabilistic graphical models that combine OR search spaces with tree Bayesian networks. Work is currently being done to find new similarity measures for AND/OR search spaces for graphical models.
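
To make the structure of a cutset network concrete, the sketch below builds a very small one by hand: an OR node that branches on a variable X0, with a separate tree Bayesian network over X1 and X2 at each branch. This is only an illustrative assumption about the simplest possible case, not the learner described above, and all of the numbers and names (`cutset_net`, `tree_bn_prob`, `joint_prob`) are invented.

```python
from itertools import product

# Hand-built toy cutset network (all parameters are hypothetical).
cutset_net = {
    "p_x0": 0.4,  # OR-node weight: probability that X0 = 1
    "leaves": {
        # Each leaf is a tree Bayesian network X1 -> X2.
        0: {"p_x1": 0.2, "p_x2_given_x1": {0: 0.1, 1: 0.8}},
        1: {"p_x1": 0.7, "p_x2_given_x1": {0: 0.5, 1: 0.3}},
    },
}


def tree_bn_prob(leaf, x1, x2):
    """Probability of (x1, x2) under the leaf's tree Bayesian network."""
    p1 = leaf["p_x1"] if x1 else 1 - leaf["p_x1"]
    p2_true = leaf["p_x2_given_x1"][x1]
    p2 = p2_true if x2 else 1 - p2_true
    return p1 * p2


def joint_prob(net, x0, x1, x2):
    """Follow the OR branch for x0, then evaluate the leaf tree network."""
    branch_weight = net["p_x0"] if x0 else 1 - net["p_x0"]
    return branch_weight * tree_bn_prob(net["leaves"][x0], x1, x2)


if __name__ == "__main__":
    total = sum(joint_prob(cutset_net, *x) for x in product((0, 1), repeat=3))
    print(total)  # a valid distribution sums to 1 (up to rounding)
```

Evaluating a full assignment takes one pass down the OR branch and one pass over the leaf network, which is what makes this kind of model tractable.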

Examples: The cutset network learner was evaluated by learning several real-world benchmark models. Empirically, it performed better than other state-of-the-art tractable model learners. Finding novel similarity measures for AND/OR decision diagrams can help to speed up the task of inference by search in PGMs.

Advantages & Uses: Cutset networks can be learned very efficiently from data compared to other state-of-the-art tractable model learners. Learning these networks is preferable for application domains with high dimensionality, such as bioinformatics, social networks, and NLP. As for AND/OR decision diagrams, novel similarity measures can reduce the size of the search space during inference and provide valuable insights into the relational structure among the objects in the domain during learning.

Future Plans: Improve the cutset network learner to learn more expressive models in high-dimensional domains. Also, investigate whether (and how) similarity measures applied to AND/OR search spaces can help us learn MLN rules from data.

The Department of Computer Science at UT Dallas [www.cs.utdallas.edu] is one of the largest CS departments in the United States, with more than 750 undergraduate, 500 master's, and 125 PhD students. The department is committed to exceptional teaching and research in a culture that is as daring as it is supportive.