ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein 3D Structures
Han Guo, Mingjia Huo, Ruiyi Zhang, Pengtao Xie

Abstract

Proteins are central to many scientific disciplines, but their complex structure-function relationships remain difficult to understand. Recent large language models (LLMs) have shown that they can absorb task-specific knowledge, which suggests that specially trained ChatGPT-like systems could accelerate protein research. In this work, we introduce ProteinChat, a prototype system for learning and understanding protein 3D structures. ProteinChat lets users upload a protein, ask questions about it, and hold an interactive conversation to gain insights. The system consists of three main components: a composite encoder block, a projection layer, and an LLM. The encoder block converts the protein into a protein embedding, the projection layer maps that embedding into a form compatible with the LLM, and the LLM combines the user's question with the projected embedding to generate an informative answer. To train ProteinChat, we curated the RCSB-PDB Protein Description Dataset, which comprises 143,508 protein-description pairs from publicly available sources. With ProteinChat, researchers can potentially expedite investigations into protein function and structure, benefiting areas such as drug development and therapeutics as well as food science and nutrition. This initial step lays the foundation for further exploration and use of ChatGPT-like systems in protein research. The code is available at https://github.com/UCSD-AI4H/proteinchat and the dataset can be downloaded from https://drive.google.com/file/d/1AeJW5BY5C-d8mKJjAULTax6WA4hzWS0N/view?usp=share_link
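
The data flow described in the abstract (encoder output, then projection, then LLM input) can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, the per-residue shape of the embedding, and the choice to prepend the projected protein vectors to the question's token embeddings are all assumptions made for illustration, and the names `make_linear`, `project`, and `build_llm_input` are hypothetical.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a toy linear layer with random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(embedding, layer):
    """Map every vector in `embedding` from encoder width to LLM width."""
    weights, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weights, bias)]
        for vec in embedding
    ]

def build_llm_input(protein_vectors, question_vectors):
    """Assumed combination rule: protein vectors first, then question tokens."""
    return protein_vectors + question_vectors

# Toy sizes (assumptions, not values from the paper).
ENCODER_DIM, LLM_DIM = 8, 16
rng = random.Random(1)

# Stand-in for the encoder block's output: one vector per residue (assumed).
protein_embedding = [[rng.gauss(0, 1) for _ in range(ENCODER_DIM)]
                     for _ in range(5)]
# Stand-in for the LLM's embeddings of a 3-token user question.
question_embedding = [[rng.gauss(0, 1) for _ in range(LLM_DIM)]
                      for _ in range(3)]

layer = make_linear(ENCODER_DIM, LLM_DIM)
projected = project(protein_embedding, layer)
llm_input = build_llm_input(projected, question_embedding)

print(len(llm_input), len(llm_input[0]))  # 8 16
```

The point of the projection step is only that the protein vectors end up with the same width as the LLM's token embeddings, so both can be placed in one input sequence; in a trained system the projection weights would be learned rather than random.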