Research Statement 2018. (Archived)


Research Statement by Shubhra Kanti Karmaker (“Santu”)

The goal of my research is to develop intelligent systems that exploit web-scale Big Data to enhance human perception of the real world and thus facilitate informed decision making in a wide range of application areas. Broadly, my research interests lie at the intersection of Text Mining, Natural Language Processing and Information Retrieval. More specifically, I have been studying how to mine Big Text Data across different application domains to find interesting patterns that can provide domain experts with novel insights which are otherwise difficult to perceive due to the scale of the data. The primary objective of Big Data applications is to help an organization make more informed decisions by analyzing large volumes of data. Big Data powers many important application areas, including Intelligent Healthcare Systems, Automated Quality Control in Manufacturing, Recommender Systems, User Behavior Analysis, Cyber Security and Intelligence, Crime Prediction and Prevention, Acceleration of Scientific Discovery, Stock Market Prediction and so on. These intelligent systems help improve our quality of life and thus contribute to building a better society. There are multiple prominent challenges in the Big Data domain, including but not limited to: 1) filtering the small amount of relevant data from the ocean of Big Data, 2) deeper analysis/mining of the relevant data to reveal patterns/insights, and 3) preserving the privacy of users’ sensitive information. To address these challenges, I have conducted research in the following major directions: 1) Influence Mining in Stream Text Data (related to deeper analysis of relevant data), 2) Retrieval in the E-commerce domain (related to filtering relevant data), 3) General Techniques for Text Data Mining (again, related to deeper analysis of relevant data) and 4) Privacy Preserving Data Mining (related to privacy of sensitive information).

Research Accomplishments

I have published 9 (nine) research papers at premier conferences including ACM SIGIR [2], WWW [3], ACM CIKM [1, 4, 5], IEEE CEC [6] and WPES [8]. I also have 2 (two) more research papers currently under submission. As a researcher, I focus on developing general algorithms that are grounded in solid theoretical foundations and are also empirically effective. The benefit of grounding a method in theory is that the behavior of the algorithm can be systematically analyzed. Ultimately, however, the algorithms/models have to be useful in real-world products, so empirical validation of each proposed algorithm is necessary. As a result, my research outcomes generally include new models and algorithms that are immediately applicable to multiple application domains. Following this philosophy, I have developed models/algorithms for analyzing Big Text Data in which humans and machines can interactively collaborate to complete the desired task. My research so far has touched the following directions:

Influence Mining in Stream Text Data

This is a new research direction that I introduced during my PhD. The motivation is as follows: a crucial component of any intelligent system is understanding and predicting the behavior of its users. A correct model of user behavior enables the system to serve its users’ needs more effectively. While much work has been done on user behavior modeling based on historical activity data, little attention has been paid to how external factors influence user behavior, which is clearly important for improving an intelligent system. The influence of external factors on user behavior is mostly reflected in two ways: 1) through significant growth in users’ thirst for information related to the external factors (e.g., users may issue many searches related to a popular event), and 2) through user-generated content that is directly or indirectly related to the external factors (e.g., users may tweet about a particular event). To capture these two aspects of user behavior, I introduced Influence Models for Information Thirst and for Content Generation, respectively.

Influence Models for Information Thirst: I introduced a new data mining problem, namely how to mine the influence of real-world events on users’ information thirst, which is important both for social science research and for designing better search engines. I addressed this problem by first proposing computational measures that quantify the influence of an event on a query in order to identify triggered queries, and then proposing a novel extension of the Hawkes process to model how the influence of an event on search queries evolves over time. Evaluation results using news articles and search log data show that the proposed approach is effective at identifying queries triggered by events reported in news articles and at characterizing the influence trend over time. This work [3] was published at the World Wide Web Conference (WWW), 2017. However, the problem formulation in [3] was based on the strong assumption that each event exerts its influence independently. This assumption is unrealistic, as many real-world events are correlated, influence each other, and therefore exert a joint influence on user search behavior rather than independent ones. To relax this assumption, I proposed a Joint Influence Model [1] based on the Multivariate Hawkes Process, which captures the interdependence among multiple events in terms of their influence. This work was published at CIKM, 2018.
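For readers unfamiliar with Hawkes processes, the following is a minimal sketch of the standard textbook intensity function with an exponential kernel, in both its univariate and multivariate forms. It is not the extended model of [3] or the Joint Influence Model of [1]; the function names, the parameters mu, alpha and beta, and their default values are illustrative assumptions.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Univariate Hawkes intensity with an exponential kernel (textbook form).

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    mu    : baseline rate (assumed value, for illustration only)
    alpha : jump in intensity caused by each past event (assumed value)
    beta  : exponential decay rate of that excitation (assumed value)
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t  # only strictly earlier events excite the process
    )
    return mu + excitation


def multivariate_intensity(t, events_by_dim, mu, alpha, beta=1.0):
    """Multivariate Hawkes intensity: one value per dimension.

    alpha[i][j] is the (assumed) strength with which events in
    dimension j excite dimension i, so correlated event streams can
    raise each other's intensity instead of acting independently.
    """
    dims = len(mu)
    return [
        mu[i]
        + sum(
            alpha[i][j] * math.exp(-beta * (t - t_k))
            for j in range(dims)
            for t_k in events_by_dim[j]
            if t_k < t
        )
        for i in range(dims)
    ]
```

In the univariate case every event is excited only by its own history, which mirrors the independence assumption discussed above; the off-diagonal entries of alpha in the multivariate case are what allow one stream of events to influence another.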

Influence Models for Content Generation: The Influence Models for Information Thirst discussed above usually involve mining a single stream of text (e.g., a search query log) with respect to an external influential factor (e.g., a popular event). A more challenging problem is to model the influences between two or more streams of text. More specifically, I introduced another new problem, where the task is to model how content generation in one stream influences content generation in another stream. For example, research papers published at machine learning conferences like NIPS and ICML often have a direct influence on the papers published at more applied conferences like KDD and SIGIR. One particular limitation of existing text generation techniques is that most of them assume the generation process is static, i.e., it does not evolve over time. This clearly limits their applicability to text streams, which often evolve over time and show distinct patterns at different points in time. The influence of correlated text streams often contributes to these temporal patterns. To capture these influence dynamics, I proposed a deep-learning-based Dynamic Text Generation Process (DTG) [9]. I demonstrated how DTG can be learned by mining multiple streams of text and then applied to two useful text mining applications: influence inference among different research communities (e.g., between NIPS and SIGKDD) and forecasting of future content. This work is currently under submission at KDD, 2019.

E-Commerce Search

E-Commerce (E-Com) search is an important emerging application of information retrieval. Virtually all major retailers have their own product search engines, with popular engines processing millions of query requests per day. As E-shopping becomes increasingly popular, optimizing the search quality of these engines is increasingly important, since an improved E-Com search engine can potentially save all users’ time while increasing their satisfaction.

Learning to Rank for E-Com Search: Over the past decade, Learning to Rank (LETOR) methods, which apply machine learning techniques to ranking problems, have proven very successful in optimizing search engines. However, applying these methods to any search engine still faces many practical challenges, notably how to define features, how to convert search log data into effective training sets, how to obtain relevance judgments (both explicit judgments by humans and implicit judgments based on search logs), and what objective functions to optimize for specific applications. In this research [2], I discussed the practical challenges in applying learning-to-rank methods to E-Com search, including the challenges in feature representation, obtaining reliable relevance judgments, and optimally exploiting multiple user feedback signals such as click rates, add-to-cart ratios, order rates, and revenue. This study yielded multiple interesting insights and results regarding E-Com search practice. It received significant attention from the search industry, including Walmart, Flipkart and Unbxd, and started a collaboration between me and Unbxd with a grant of $40,000. The research findings were published at SIGIR 2017 and also attracted significant attention from the ResearchGate community, with more than 3,000 reads.

Evaluation for E-Com Search: In this research [10], I looked deeper into evaluation metrics for E-Com search. While nDCG is a commonly used measure for evaluating information retrieval systems, it is usually computed with a fixed cutoff k (i.e., nDCG@k) and a fixed discounting coefficient. Such a conventional query-independent way of computing nDCG does not accurately reflect the utility of search results as perceived by an individual user in E-Com search, where the user’s exploration depth varies a lot across different products. To address this limitation, I studied query-specific nDCG for evaluating E-Com search engines, in particular to see whether using a query-specific nDCG would lead to a different conclusion about the relative performance of multiple LETOR methods than using the conventional query-independent nDCG. Our experimental results show that, at the individual query level, the relative ranking of LETOR methods under query-specific nDCG can be dramatically different from that under query-independent nDCG. Although we conducted our evaluation in the context of E-Com search, the findings of this work may potentially improve the current practice of evaluating all kinds of search engines. This work is currently under review at JASIST.
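As a rough illustration of why the cutoff matters, here is a minimal sketch of nDCG@k using the common exponential-gain, log2-discount formulation, evaluated at two different cutoffs for the same query. It is not the query-specific metric of [10]; the function names, the toy relevance labels, and the choice of cutoffs are assumptions made only for this example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results.

    Uses the common gain 2^rel - 1 and discount log2(rank + 1).
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """nDCG@k, normalizing by the DCG of the ideal ordering.

    Assumes `relevances` lists the labels of all judged items for the
    query, in the order the ranker returned them.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two hypothetical rankers ordering the same four judged items.
ranker_a = [1, 0, 0, 2]
ranker_b = [0, 2, 1, 0]

# A shallow cutoff mimics a user who inspects one result; a deeper
# cutoff mimics a user who browses the whole list.
for k in (1, 4):
    print(k, round(ndcg_at_k(ranker_a, k), 3), round(ndcg_at_k(ranker_b, k), 3))
```

With these toy labels, ranker A scores higher at k = 1 while ranker B scores higher at k = 4, so the choice of cutoff alone can reverse which method looks better for a given query.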

General Techniques for Text Data Mining

I have developed multiple general-purpose Text Mining techniques that can be applied across a wide variety of application domains. The main focus of this research direction is to make Text Mining accessible to a wide variety of users and to enable non-experts to experiment with different text mining techniques without worrying about the underlying technical details. Below, I present three such general techniques:

Generative Language Model for Concept Tagging: One fundamental challenge in any Text Data Management system is to transform unstructured text data into a more organized structure by tagging text segments with relevant concepts. Although numerous works have tried to solve this problem through supervised classification, obtaining a large amount of labeled data for supervised learning is often costly or infeasible. To address this challenge, I proposed a new approach based on generative feature language models that can mine implicit concepts more effectively through unsupervised statistical learning [5], without requiring any labeled data from the users. Experiments on customer product review data showed that my method outperforms existing algorithms by a large margin. I also created and published eight new data sets to facilitate evaluation of this task in English. This research resulted in a publication at ACM CIKM, 2016.

SOFSAT: Towards a Set-like Operator based Framework for Semantic Analysis of Text: As data reported by humans about our world, text data play a very important role in all data mining applications, yet developing a general text analysis system that supports all text mining applications is a difficult challenge. To address this challenge, I introduced SOFSAT [11], a new framework that supports set-like operators for semantic analysis of natural text data with variable text representations. It includes three basic set-like operators (TextIntersect, TextUnion, and TextDifference) that are analogous to the set operators intersection, union, and difference, respectively. These operators can be applied to any representation of text data, and different representations can be combined via transformation functions that map text to and from any representation. Just as set operators can be flexibly and iteratively combined to construct arbitrary subsets or supersets of some given sets, I showed that the corresponding text analysis operators can also be combined flexibly to support a wide range of analysis tasks that may require different workflows. This enables an application developer to “program” a text mining application by using SOFSAT as an application programming language for text analysis. This position paper has been accepted for publication at SIGKDD Explorations 2018.
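To give a feel for how such operators compose, here is a toy sketch that instantiates the three operators over a simple bag-of-words representation. It is only an illustration under strong assumptions: SOFSAT itself is a framework proposal rather than an existing library, and the function names, the word-set representation and the example reviews below are all hypothetical. A semantic representation would replace the plain word sets used here.

```python
import re


def to_word_set(text):
    """Transformation function: map raw text to a set of lowercase words.

    This bag-of-words set is a stand-in for richer representations.
    """
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def from_word_set(words):
    """Inverse transformation: render a word set as a sorted string."""
    return " ".join(sorted(words))


def text_intersect(x, y):
    """Content present in both representations."""
    return x & y


def text_union(x, y):
    """Content present in either representation."""
    return x | y


def text_difference(x, y):
    """Content present in x but absent from y."""
    return x - y


review_a = to_word_set("The battery life is great")
review_b = to_word_set("Great screen but poor battery")
review_c = to_word_set("Battery died quickly")

# Operators compose like set algebra: what do reviews A and B share
# that review C does not mention?
shared_not_in_c = text_difference(text_intersect(review_a, review_b), review_c)
print(from_word_set(shared_not_in_c))  # prints: great
```

Because each operator takes and returns the same kind of representation, the output of one call can be fed directly into another, which is the property that lets a workflow be written as a small program.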

Text Based Forecasting System: Forecasting of time series [7] is highly beneficial (and necessary) for optimizing decisions, yet it is a very challenging problem, and using only the historical values of the time series is often insufficient. To address this limitation, I collaborated with two first-year PhD students to study how to construct effective additional features from related text data for time series forecasting [4]. More specifically, we proposed using contemporary text data (e.g., news articles) to construct additional features that can help predict external time series variables such as stock prices. We evaluated feature effectiveness on a data set for predicting stock price changes from news text and found that text-based features outperform time-series-based features for stock prediction. This work was published at CIKM, 2017.

Privacy Preserving Data Mining

The promise of Big Data relies on the release and aggregation of data sets. When these data sets contain sensitive information about individuals, de-identification has been a scalable and convenient way to protect their privacy. However, studies show that combining de-identified data sets with other data sets risks re-identification of some records. Yet there has been no attempt to quantify this re-identification risk, because the problem is hard: a potentially unlimited number of factors affect the risk. In this work [8], I introduced NRF, a general (naive) probabilistic re-identification framework that can be instantiated in specific contexts to estimate the probability of compromise (the re-identification risk) based on explicit assumptions. As a case study, I showed how NRF can be applied to analyze and quantify the risk of re-identification arising from releasing de-identified medical data in the context of publicly available social media data. This research was published at the renowned workshop WPES 2018, which was held in conjunction with ACM CCS 2018.

Ongoing Works and Research Agenda

My long-term research goal is to augment human intelligence with machine intelligence by analyzing patterns hidden inside web-scale Big Data. I strongly believe that we will achieve far more as a society if human cognition can efficiently collaborate with the immense computational power of machines. As part of that long-term goal, I have set the following short-term research plan:

A General-purpose Text Mining Programming Language: Currently, SOFSAT [11] is a hypothetical text mining language with no implementation. One of my primary research goals is therefore to realize the SOFSAT system and release it as a general toolkit that non-experts (e.g., business professionals, government officials) can easily use to quickly analyze a large number of documents. Once it is implemented, programmers will be able to include SOFSAT as a package and write efficient and concise text mining code without worrying about the underlying complexity of the mining algorithms. I have provided a detailed roadmap for future research in this new direction in [11], including a full specification of the SOFSAT text analysis language, implementation of the different set-like operators, incorporation of a wide variety of text representations, and optimization of the SOFSAT system.

Next Generation Conversational AI Systems: Conversational AI agents/chatbots, e.g., Siri, Alexa and Cortana, are now very popular and serve as intelligent digital assistants. The core of these conversational AI systems is the language understanding task, which consists of user intent identification, entity extraction, action selection and text response generation. There are many open challenges in this domain, including how to understand user intent given that the same intent can be expressed in many different ways, how to deal with ambiguous entities in user utterances, how to handle complex utterances containing multiple intents, and how to generate text that provides the correct response to the user. I plan to address these challenges by formulating meaningful representations for user intents, entities and actions which, in turn, conform with the representation of the utterance itself. I also plan to extend the language understanding power of Conversational AI Systems by plugging in the SOFSAT framework as a module inside these agents.

Next Generation E-Commerce Search Engine: E-commerce search is still in its infancy, with many open challenges to be addressed. Some prominent challenges in this domain include Dynamic Relevance of Items, Noisy Product Catalogs, Expertise Transfer Learning for domain-specific E-Com search and the Cold Start problem for new products. I believe my previous research experience with E-Commerce Search and Text Mining gives me a significant edge in solving these problems.

My research is highly interdisciplinary and is related to many areas in Computer Science and Information Science, particularly Data Mining, Natural Language Processing, Information Retrieval, Data Security and Privacy, Online Education Platforms and Health Informatics. I’m looking forward to collaborating with people in all these areas.


[1] Shubhra Kanti Karmaker Santu, Liangda Li, Yi Chang and Chengxiang Zhai. “JIM: Joint Influence Modeling for Collective Search Behavior”. In ACM CIKM, 2018.

[2] Shubhra Kanti Karmaker Santu, Parikshit Sondhi and ChengXiang Zhai. “On Application of Learning to Rank for E-Commerce Search”. In proceedings of ACM SIGIR, 2017: 475-484.

[3] Shubhra Kanti Karmaker Santu, Liangda Li, Dae Hoon Park, Chengxiang Zhai and Yi Chang. “Modeling the Influence of Popular Trending Events on User Search Behavior”. In proceedings of the 26th International World Wide Web Conference (WWW), 2017: 535-544.

[4] Yiren Wang, Dominic Seyler, Shubhra Kanti Karmaker Santu and Chengxiang Zhai. “A Study of Feature Construction for Text-based Forecasting of Time Series Variables”. ACM CIKM, 2017: 2347-2350 [Short Paper].

[5] Shubhra Kanti Karmaker Santu, Parikshit Sondhi and ChengXiang Zhai. “Generative Feature Language Models for Mining Implicit Features from Customer Reviews”. In proceedings of ACM CIKM, 2016: 929-938.

[6] Shubhra Kanti Karmaker Santu, Md. Mustafizur Rahman, Md. Monirul Islam and Kazuyuki Murase. “Towards better generalization in Pittsburgh Learning Classifier Systems”. In IEEE Congress on Evolutionary Computation (IEEE CEC), 2014 (pp. 1666-1673). IEEE.

[7] Md. Mustafizur Rahman, Shubhra Kanti Karmaker Santu, Md. Monirul Islam and Kazuyuki Murase. “Forecasting Time Series - A Layered Ensemble Architecture”. In International Joint Conference on Neural Networks (IJCNN), 2014 (pp. 210-217). IEEE.

[8] Shubhra Kanti Karmaker Santu, Vincent Bindschaedler, Chengxiang Zhai and Carl A. Gunter. “NRF: a Naive Probabilistic Re-identification Framework”. In WPES@ACM CCS, 2018.

[9] Shubhra Kanti Karmaker Santu, Bin Bi, Hao Ma, ChengXiang Zhai and Kuansan Wang. “Dynamic Text Generation Process”. To be submitted to WWW 2019.

[10] Shubhra Kanti Karmaker Santu, Parikshit Sondhi and ChengXiang Zhai. “Query-Specific Customization of nDCG: A Case-Study with Learning-to-Rank Methods”. Submitted to JASIST.

[11] Shubhra Kanti Karmaker Santu, Chase Geigle, Duncan Ferguson, William Cope, Mary Kalantzis, Duane Searsmith and Chengxiang Zhai. “SOFSAT: Towards a Set-like Operator based Framework for Semantic Analysis of Text”. Accepted for publication at SIGKDD Explorations 2018.