loading page

Mining Multiple Longest Common Subsequence in Big Sequences
  • +3
  • Yanni Li ,
  • Bing Liu ,
  • Tihua Duan ,
  • Zhi Wang ,
  • Hui Li ,
  • Jiangtao Cui
Yanni Li
Xidian University

Corresponding Author:[email protected]

Author Profile
Tihua Duan
Author Profile
Jiangtao Cui
Author Profile

Abstract

Multiple longest common subsequence (MLCS) mining (a classical NP-hard problem) is an important task in many fields. Numerous applications in these fields can generate very long sequences (i.e., the length of the sequences ≥ 10^4), called big sequences. Such big sequences present a serious challenge to existing MLCS algorithms. Although significant efforts have been made to tackle the challenge, both existing exact and approximate MLCS algorithms fail to deal with big sequences as their problem-solving model MLCS-DAG (Directed Acyclic Graph) is too large to be calculated due to the memory explosion. To bridge the gap, this paper first proposes a new identification and deletion strategy of different classes of non-critical points, which are the points that do not contribute to the solution on the MLCS-DAG. It then proposes a new MLCS problem-solving graph model, called KP-MLCS-DAG (Key Point based MLCS-DAG). A novel parallel MLCS algorithm, called KP-MLCS (Key Point based MLCS), is also presented, which can mine and compress all MLCSs of big sequences effectively and efficiently. Extensive experiments on both synthetic and real-world biological sequences show that the proposed algorithm KP-MLCS drastically outperforms the existing state-of-the-art algorithms in terms of efficiency and effectiveness. The source code of KP-MLCS and related test datasets, etc., can be found online: https://github.com/kp-mlcs/KP-MLCS.