ClusTop: An unsupervised and integrated text clustering and topic extraction framework
  • Zhongtao Chen,
  • Chenghu Mi,
  • Siwei Duo,
  • Jingfei He,
  • Yatong Zhou
Siwei Duo
Tianjin Huizhi Xingyuan Information Technology Co., Ltd.

Corresponding Author: [email protected]


Abstract

Text clustering and topic extraction are two important tasks in text mining, and they are usually performed separately. To let topic extraction facilitate clustering, texts can first be projected into a topic space and then clustered. To let clustering promote topic extraction, clusters can first be obtained with a clustering algorithm and cluster-specific topics extracted afterwards. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and stand in a chicken-and-egg relationship: performing them separately prevents each task from benefiting the other, so the best overall performance is not reached. In this paper, we propose ClusTop, an unsupervised framework that integrates text clustering and topic extraction and simultaneously produces high-quality clustering results and extracts topics from each cluster. The framework has four components: enhanced language model training, dimensionality reduction, clustering, and topic extraction. The enhanced language model can be viewed as a bridge between clustering and topic extraction. On the one hand, it provides text embeddings with a strong cluster structure, which facilitates effective text clustering; on the other hand, because of its self-attention architecture, it pays close attention to topic-related words, which supports topic extraction. Moreover, the enhanced language model is trained without supervision. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations under this framework.
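
To make the data flow between the four components concrete, the following is a minimal sketch and not the authors' implementation. Every stage is a deliberately simple stand-in chosen so that the example runs with the Python standard library alone: normalised bag-of-words vectors replace the enhanced language model, document-frequency feature selection replaces dimensionality reduction, a deterministic k-means replaces the clustering model, and cluster-level TF-IDF replaces attention-based topic extraction. All function names (`embed`, `reduce_dim`, `kmeans`, `cluster_topics`, `run_pipeline`) and the toy corpus `DOCS` are hypothetical. Because the stand-in embeddings are not trained, the sketch does not reproduce the bridging role that the abstract attributes to the enhanced language model.

```python
import math
from collections import Counter

# Toy corpus (assumption, for illustration only).
DOCS = [
    "stock market prices fall",
    "market stock investors sell",
    "stock prices market rally",
    "team wins football match",
    "football team match score",
    "match football team coach",
]

def embed(docs):
    # Stand-in for the enhanced language model: L2-normalised bag-of-words.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {w: i for i, w in enumerate(vocab)}
    vecs = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w in d.lower().split():
            v[col[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vecs.append([x / norm for x in v])
    return vecs

def reduce_dim(vecs, dim):
    # Stand-in for dimensionality reduction: keep the `dim` columns that
    # are non-zero in the most documents (ties broken by column index).
    n_cols = len(vecs[0])
    df = [sum(1 for v in vecs if v[j] > 0) for j in range(n_cols)]
    keep = sorted(range(n_cols), key=lambda j: (-df[j], j))[:dim]
    return [[v[j] for j in keep] for v in vecs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialisation, then Lloyd iterations.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster empties out
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

def cluster_topics(docs, labels, k, top_n=3):
    # Stand-in for attention-based topic extraction: cluster-level TF-IDF.
    tf = [Counter() for _ in range(k)]
    for d, l in zip(docs, labels):
        tf[l].update(d.lower().split())
    topics = []
    for c in range(k):
        def score(w):
            n_clusters_with_w = sum(1 for t in tf if w in t)
            return tf[c][w] * math.log(1 + k / n_clusters_with_w)
        topics.append(sorted(tf[c], key=lambda w: (-score(w), w))[:top_n])
    return topics

def run_pipeline(docs, k, dim):
    # Embed -> reduce -> cluster -> extract topics per cluster.
    labels = kmeans(reduce_dim(embed(docs), dim), k)
    return labels, cluster_topics(docs, labels, k)

if __name__ == "__main__":
    labels, topics = run_pipeline(DOCS, k=2, dim=6)
    print(labels)
    print(topics)
```

Calling `run_pipeline(DOCS, k=2, dim=6)` returns one cluster label per document together with a short ranked word list per cluster. In this sketch the stages simply run one after another; swapping any of them for a stronger model leaves the surrounding interface unchanged, which is one way to read the abstract's remark about benchmarking different model combinations.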